Month: February 2026

18 posts

OpenAI launches stateful AI on AWS, signaling a control plane power shift

Stateless AI, in which a model offers one-off answers without context from previous sessions, can be helpful in the short term but falls short in complex, multi-step scenarios. To overcome these limitations, OpenAI is introducing what it is calling, naturally, “stateful AI.” The company has announced that it will soon offer a stateful runtime environment in partnership with Amazon, built to simplify the process of getting AI agents into production. It will run natively on Amazon Bedrock, be tailored for agentic workflows, and be optimized for AWS infrastructure.

Interestingly, OpenAI also felt the need to make another announcement today, underscoring that nothing about other collaborations “in any way” changes the terms of its partnership with Microsoft. Azure will remain the exclusive cloud provider of stateless OpenAI APIs.

“It’s a clever structural move,” said Wyatt Mayham of Northwest AI Consulting. “Everyone can claim a win, but the subtext is clear: OpenAI is becoming a multi-cloud company, and the era of exclusive AI partnerships is ending.”

What differentiates ‘stateful’

The stateful runtime environment on Amazon Bedrock was built to execute complex steps that factor in context, OpenAI said. Models can carry forward memory and history, tool and workflow state, environment use, and identity and permission boundaries. This represents a new paradigm, according to analysts.

Notably, stateless API calls are a “blank slate,” Mayham explained. “The model doesn’t remember what it just did, what tools it called, or where it is in a multi-step workflow.” While that’s fine for a chatbot answering one-off questions, it’s “completely inadequate” for real operational work, such as processing a customer claim that moves across five different systems, requires approvals, and takes hours or days to complete, he said.

New stateful capabilities give AI agents a persistent working memory so they can carry context across steps, maintain permissions, and interact with real enterprise tools without developers having to “duct-tape stateless API calls together,” said Mayham. Further, the Bedrock foundation matters because it’s where many enterprise workloads already live, he noted. OpenAI and Amazon are meeting companies where they are, not asking them to rearchitect their security, governance, and compliance posture. This makes sophisticated AI automation accessible to mid-market companies; they will no longer need a team of engineers to “build the plumbing from scratch,” he said.

Sanchit Vir Gogia, chief analyst at Greyhound Research, called stateful runtime environments “a control plane shift.” Stateless can be “elegant” for single interactions such as summarization, code assistance, drafting, or isolated tool invocation. But stateful environments give enterprises a “managed orchestration substrate,” he noted. This supports real enterprise workflows involving chained tool calls, long-running processes, human approvals, system identity propagation, retries, exception handling, and audit trails, said Gogia, while Bedrock enforces existing identity and access management (IAM) policies, virtual private cloud (VPC) boundaries, security tooling, logging standards, and compliance frameworks.

“Most pilot failures happen because context resets across calls, permissions are misaligned, tokens expire mid-workflow, or an agent cannot resume safely after interruption,” he said. These issues can be avoided in stateful environments.
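To make the distinction concrete, here is a toy Python sketch of the two calling patterns Mayham describes. The StatefulRuntime class is a hypothetical illustration of where state lives, not OpenAI’s or Bedrock’s actual API.

```python
# Toy contrast between stateless and stateful calling patterns.
# "model_api" is any callable mapping a message list to a reply string;
# StatefulRuntime is hypothetical, not a real OpenAI or Bedrock interface.

def stateless_step(model_api, history, user_msg):
    """Stateless: the caller must re-send the full context on every call."""
    history = history + [{"role": "user", "content": user_msg}]
    reply = model_api(history)  # the model remembers nothing between calls
    return history + [{"role": "assistant", "content": reply}]

class StatefulRuntime:
    """Stateful: the runtime keeps memory, tool state, and permission
    boundaries server-side; callers pass a session handle and new input."""

    def __init__(self):
        self.sessions = {}  # session_id -> accumulated workflow state

    def step(self, session_id, user_msg, model_api):
        state = self.sessions.setdefault(
            session_id, {"history": [], "tool_state": {}, "permissions": set()}
        )
        state["history"].append({"role": "user", "content": user_msg})
        reply = model_api(state["history"])
        state["history"].append({"role": "assistant", "content": reply})
        return reply  # context and workflow state persist across steps
```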
Factors IT decision-makers should consider

However, there are second-order considerations for enterprises, Gogia emphasized. Notably, state persistence increases the attack surface. This means persistent memory must be encrypted, governed, and auditable, and tool invocation boundaries should be “tightly controlled.” Further, workflow replay mechanisms must be deterministic, and observability granular enough to satisfy regulators.

There is also a “subtle lock-in dimension,” said Gogia. Portability can decrease when orchestration moves inside a hyperscaler-native runtime. CIOs need to consider whether their future agent architecture remains cloud-portable or becomes anchored in AWS’ environment.

Ultimately, this new offering represents a market pivot, he said: The intelligence layer is being commoditized. “We are moving from a model race to a control plane race,” said Gogia. The strategic question now isn’t about which model is smartest. It is: “Which runtime stack guarantees continuity, auditability, and operational resilience at scale?”

Partnership with Microsoft still ‘strong and central’

Today’s joint announcement from Microsoft and OpenAI about their partnership echoes OpenAI’s similar reaffirmation of the collaboration in October 2025. The partnership remains “strong and central,” and the two companies went so far as to call it “one of the most consequential collaborations in technology,” focused on research, engineering, and product development. The companies emphasized that:

- Microsoft maintains an exclusive license and access to intellectual property (IP) across OpenAI models and products.
- OpenAI’s Frontier and other first-party products will continue to be hosted on Azure.
- The contractual definition of artificial general intelligence (AGI) and the “process for determining if it has been achieved” are unchanged.
- An ongoing revenue-share arrangement will stay the same; this agreement has always included revenue-sharing from partnerships between OpenAI and other cloud providers.
- OpenAI has the flexibility to commit to compute elsewhere, including through infrastructure initiatives like the Stargate project.
- Both companies can independently pursue new opportunities.

“That joint statement reads like it was drafted by three law firms simultaneously, and that’s the point,” said Mayham. The anchor of the agreement is that Azure remains the exclusive cloud provider of stateless OpenAI APIs. This allows OpenAI to establish a new category on AWS that falls outside of Microsoft’s reach, he said. OpenAI is ultimately “walking a tightrope,” because it needs to expand distribution beyond Azure to reach AWS customers, who comprise a massive portion of the enterprise market, he noted. At the same time, it has to ensure Microsoft doesn’t feel like its $135 billion investment “just got diluted in strategic value.”

Gogia called the statement “structural reassurance.” OpenAI must grow distribution across clouds because enterprise buyers are demanding multi-cloud flexibility: “They don’t want to be confined to a single cloud; they want architectural optionality.” Also, he noted, “CIOs and boards do not want vendor instability. Hyperscaler conflict risk is now a board-level concern.”

New infusion of funding (again)

Meanwhile, $110 billion in new funding from Nvidia, SoftBank, and Amazon will allow OpenAI to expand its global reach and “deepen” its infrastructure, the company says.
Importantly, the funding includes the use of 3 GW of dedicated inference capacity and 2 GW of training capacity on Nvidia’s Vera Rubin systems. This builds on the Hopper and Blackwell systems already in operation across Microsoft, Oracle Cloud Infrastructure (OCI), and CoreWeave.

Mayham called this “the headline within the headline.” “Cash doesn’t build AI products; compute does,” he said. Right now, access to next-generation Nvidia hardware is the “true bottleneck for every AI company on the planet.” OpenAI is essentially locking in a “guaranteed supply line” for the chips that power everything it does. The money from all three companies funds operations and infrastructure, but the Nvidia capacity and training systems allow OpenAI to use infrastructure at the frontier, said Mayham. “If you can’t get the processors, the cash is just sitting in a bank account.”

Inference is now one of the biggest cost drivers in AI, and Gogia noted that frontier AI systems are constrained by physical infrastructure: GPUs, high-bandwidth memory (HBM), high-speed interconnects, and other hardware, as well as grid-level power capacity, are all finite resources. The current moves embed OpenAI deeper into the infrastructure stack, but the risk is concentration. When compute control centralizes among a small cluster of hyperscalers and chip vendors, the system can become fragile. To protect themselves, Gogia advised, enterprises should monitor supply chain concentration.

“In strategic terms, however, this move strengthens OpenAI’s durability,” he said. “It secures the physical substrate required to sustain frontier model scaling and enterprise inference growth.”

Red Hat ships AI platform for hybrid cloud deployments

Red Hat has made its Red Hat AI Enterprise platform generally available, intending to simplify the development and deployment of hybrid cloud-based applications powered by AI. Availability of the platform was announced February 24.

Engineered to close the “production gap” for AI, Red Hat AI Enterprise unifies AI model and application life cycles, from model development and tuning to high-performance inference, on a standard, centralized infrastructure to accelerate delivery, increase operational efficiency, and mitigate risk through a comprehensive, all-in-one experience, Red Hat said. Users can move away from treating AI as a disjointed, bespoke effort and turn it into a scalable, repeatable factory process. Red Hat AI Enterprise is powered by the Red Hat OpenShift cloud application platform.

Red Hat cited the following business benefits of Red Hat AI Enterprise:

- Accelerated time-to-value, with a ready-to-use environment for teams to “develop once and deploy anywhere” without rewriting code.
- Increased operational efficiency, simplifying workflows from code commits to model serving.
- Mitigated risk and stronger governance, with a foundation for digital sovereignty that gives organizations control over where data and models reside.

For platform engineers, AI engineers, and application developers, Red Hat AI Enterprise provides a foundation for modern AI workloads, Red Hat said. This includes AI life-cycle management, high-performance inference at scale, agentic AI innovation, integrated observability and performance modeling, and trustworthy AI with continuous evaluation. Tools are provided for dynamic resource scaling, monitoring, and security. For zero-downtime maintenance, rolling platform updates keep the AI stack current and protected without disrupting active inference services, according to Red Hat.

‘Silent’ Google API key change exposed Gemini AI data

Google Cloud API keys, normally used as simple billing identifiers for APIs such as Maps or YouTube, could be scraped from websites to give access to private Gemini AI project data, researchers from Truffle Security recently discovered.

According to a Common Crawl scan of websites carried out by the company in November, there were 2,863 live Google API keys that left organizations exposed. This included “major financial institutions, security companies, global recruiting firms, and, notably, Google itself,” Truffle Security said.

The alarming security weakness was caused by a silent change in the status of Google Cloud Platform (GCP) API keys, which the company neglected to tell developers about. For more than a decade, Google’s developer documentation has described these keys, identified by the prefix ‘AIza’, as a mechanism used to identify a project for billing purposes. Developers generated a key and then pasted it into their client-side HTML code in full public view. However, with the appearance of the Gemini API (Generative Language API) from late 2023 onwards, it seems that these keys also started acting as authentication keys for sites embedding the Gemini AI Assistant.

No warning

Developers might build a site with basic features such as an embedded Maps function whose usage was metered using the original public GCP API key. When they later added Gemini to the same project, to, for example, offer a chatbot or other interactive feature, the same key effectively authenticated access to anything the owner had stored through the Gemini API, including datasets, documents, and cached context. Because this is AI, extracting data could be as simple as prompting Gemini to reveal it. That same access could also be exploited to consume tokens through the API, potentially generating large bills for the owners or exhausting their quotas, said Truffle Security.

All an attacker would need to do is view a site’s source code and extract the key. “Your public Maps key is now a Gemini credential. Anyone who scrapes it can access your uploaded files, cached content, and rack up your AI bill,” the researchers pointed out. “Nobody told you.”

API key exploitation is more than hypothetical. In a different context, a student who reportedly exposed a GCP API key on GitHub last June was left nursing a $55,444 bill (later waived by Google) after it was extracted and reused by others.

Truffle Security said it disclosed the key issue to Google in November, and the company eventually admitted it was a bona fide bug. After being told of the 2,863 exposed keys, the company restricted them from accessing the Gemini API. On February 19, the 90-day bug disclosure window closed, with Google apparently still working on a more comprehensive fix. “The initial triage was frustrating; the report was dismissed as ‘Intended Behavior.’ But after providing concrete evidence from Google’s own infrastructure, the GCP VDP team took the issue seriously,” said Truffle Security. “Building software at Google’s scale is extraordinarily difficult, and the Gemini API inherited a key management architecture built for a different era.”

Mitigation

The first job for concerned site admins is to check in the GCP console for keys specifically allowing the Generative Language API. In addition, look for unrestricted keys, now identified by a yellow warning icon. Check whether any of these keys are public.
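For site owners who want a quick first pass before opening the console, a small script along these lines can flag candidate keys in a page’s HTML. This is an illustrative sketch, not Truffle Security’s tooling; the ‘AIza’ prefix plus 35 URL-safe characters is the commonly cited key shape, not an official specification.

```python
# Heuristic audit sketch: scan a page you own for exposed Google API keys.
import re
import urllib.request

# Commonly cited shape for Google API keys: "AIza" + 35 URL-safe characters.
KEY_PATTERN = re.compile(r"AIza[0-9A-Za-z_\-]{35}")

def find_candidate_keys(url):
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    return set(KEY_PATTERN.findall(html))

# Example: check each hit in the GCP console to see whether it is
# unrestricted or allows the Generative Language API.
# print(find_candidate_keys("https://example.com"))
```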
Exposed keys should all be rotated or ‘regenerated,’ with a grace period that accounts for the effect this will have on downstream apps that have cached the old one.

This vulnerability underlines how small oversights in cloud evolution can have wider, unforeseen consequences. Truffle Security noted that Google now says in its roadmap that it is taking steps to remedy the API key problem: API keys created through AI Studio will default to Gemini-only access, and the company will also block leaked keys, notifying customers when it detects this has happened. “We’d love to see Google go further and retroactively audit existing impacted keys and notify project owners who may be unknowingly exposed, but honestly, that is a monumental task,” Truffle Security admitted.

This article originally appeared on CSOonline.

The reliability cost of default timeouts

In user-facing distributed systems, latency is often a stronger signal of failure than errors. When responses exceed user expectations, the distinction between “slow” and “down” becomes largely irrelevant, even if every service is technically healthy. I’ve seen this pattern across multiple systems. One incident, in particular, forced me to confront how much production behavior is shaped by defaults we never explicitly choose. What stood out was not the slowness itself, but how “infinite by default” waiting quietly drained capacity long before anything crossed a traditional failure threshold. Details are generalized to avoid sharing proprietary information.

When slowness turned into an outage

The incident started with support tickets, not alarms. Early in the morning, they began to appear: Product pages don’t load. Checkout is stuck. The site is slow today. At the same time, our dashboards drifted in subtle ways. CPU climbed, memory pressure increased, and thread pools filled while error rates stayed low. Product pages began hanging intermittently: some requests completed, others stalled long enough that users refreshed, opened new tabs, and eventually left.

I was on call that week. There had been a recent deployment, so I rolled it back early. It had no effect, which told us the issue wasn’t a specific change, but how the system behaved under sustained slowness.

Within a few hours, the impact was measurable. Product page abandonment increased sharply. Conversion dropped by double digits. Support ticket volume spiked. Users started switching to competitors. By the end of the day, the incident had resulted in a six-figure loss and, more importantly, a visible loss of user trust.

The harder question wasn’t what failed, but why user impact appeared before our pagers fired. The system crossed the user’s pain threshold long before it crossed any paging threshold. Our alerts were optimized for hard failures – errors, instance health, explicit saturation – while latency lived on dashboards rather than in paging.

The failure mode we missed

Product pages displayed prices in the user’s local currency. To do that, the Product Service called a downstream currency exchange API. That dependency did not go down. It became slow, intermittently, for long enough to trigger a cascade.

As I dug deeper during the incident, one detail stood out. The Product Service used an HTTP client with default configuration, where the request timeout was effectively infinite. On the frontend, browsers stopped waiting after roughly 30 seconds. On the backend, requests continued to wait long after the user had already given up.

That gap mattered more than I expected. The first few hung currency calls held onto Product Service worker threads and outbound connections, so new requests began queuing behind work that no longer had a user on the other end. Once the shared pools started to saturate, it stopped being “only the currency path.” Even requests that didn’t require currency conversion slowed down because they waited on the same thread pool and the same internal capacity. At that point, the dependency didn’t need to fail to take the service down. It only needed to become slow while we kept waiting without a boundary.

This wasn’t an error failure. It was a capacity failure. Blocked concurrency accumulated faster than it could drain, latency propagated outward, and throughput collapsed without a single exception being thrown.
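A toy model of that capacity failure, assuming a small fixed worker pool; the numbers are illustrative, not from the incident:

```python
# Toy model: a few slow dependency calls pin every worker, so fast
# requests queue behind work nobody is waiting for anymore.
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)  # stands in for the service's worker pool

def currency_call(delay_s):
    time.sleep(delay_s)  # a slow dependency call with no timeout
    return "ok"

# Four slow requests occupy every worker (5 s here; hours in the incident)...
hung = [pool.submit(currency_call, 5) for _ in range(4)]
time.sleep(0.1)

# ...so a request that would finish in milliseconds just sits in the queue.
fast = pool.submit(currency_call, 0.01)
time.sleep(0.5)
print(fast.done())  # False: blocked behind abandoned work
```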
Some mitigations helped only temporarily. Restarting instances or shedding traffic reduced pressure for a short time, but the relief never lasted. As long as requests were allowed to wait indefinitely, the system kept accumulating work faster than it could complete it. When we finally pinpointed the unbounded wait, the immediate fix sounded simple: set a timeout. The real lesson was deeper.

Defaults that quietly shape system behavior

At first glance, this looked like a simple misconfiguration. In reality, it reflected how common default settings influence system behavior in production. Many widely used libraries and systems default to infinite or extremely large timeouts. In Java, common HTTP clients treat a timeout of zero as “wait indefinitely” unless explicitly configured. In Python, requests will wait indefinitely unless a timeout is set explicitly. The Fetch API does not define a built-in timeout at all.

These defaults aren’t careless. They’re intentionally generic. Libraries optimize for the correctness of a single request because they can’t know what “too slow” means for your system. Survivability under partial failure is left to the application. Production systems rarely fail under ideal conditions. They fail under load, partial outages, retries, and real user behavior. In those conditions, unbounded waiting becomes dangerous. Defaults that feel harmless during development quietly make architectural decisions in production.

When we later audited our services as a team, we found that many calls either had no timeouts or had values that no longer matched real production latency. The defaults had been shaping system behavior for years, without us explicitly choosing them.

The mental model behind long timeouts

What this incident revealed wasn’t just a missing timeout. It exposed a mental model many teams rely on, including ours at the time. That model assumes:

- Dependencies are usually fast
- Slowness is rare
- Defaults are reasonable
- Waiting longer increases the chance of success

It prioritizes individual request success, often at the cost of overall system reliability. As a result, teams often don’t know their effective timeouts, different services use inconsistent values, and some calls have no timeouts at all. Even when timeouts exist, they are often far longer than what user behavior justifies. In our case, users retried within a few seconds and abandoned within about ten. Waiting beyond that didn’t improve outcomes. It only consumed capacity.

Long timeouts can also mask deeper design problems. If a request regularly times out because it returns thousands of items, the issue isn’t the timeout itself. It’s missing pagination or poor request shaping. By optimizing for individual request success, teams unintentionally trade away system-level resilience.

Timeouts as failure boundaries

Before this incident, we mostly treated timeouts as configuration knobs. After it, we started treating them as failure boundaries. A timeout defines where a failure is allowed to stop. Without timeouts, a single slow dependency can quietly consume threads, connections, and memory across the system. With well-chosen timeouts, slowness stays contained instead of spreading into a system-wide failure.
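As a minimal illustration of such a boundary, using the Python requests library mentioned above (the endpoint and the values are hypothetical, not our production configuration):

```python
# Explicit client-side timeout: fail fast and fall back instead of hanging.
import requests

try:
    resp = requests.get(
        "https://currency.internal.example/rates",  # hypothetical dependency
        timeout=(0.5, 2.0),  # (connect, read) seconds; default is to wait forever
    )
    rates = resp.json()
except requests.exceptions.Timeout:
    rates = None  # e.g., fall back to primary-currency prices
```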
We made a set of deliberate changes:

1. Enforced timeouts on the client side. The caller decides when to stop waiting. Load balancers, proxies, or servers could not reliably protect us from hanging forever, as the incident made clear.

2. Introduced explicit end-to-end deadlines for user-facing flows. Downstream calls could only use the remaining time budget; waiting beyond that point was wasted work with no chance of improving the outcome. We made those deadlines explicit and portable. In HTTP flows, we propagated an end-to-end deadline via a single X-Request-Deadline header so each service could compute the remaining time and set per-call timeouts accordingly (see the sketch after this list). We chose a deadline, not a per-hop timeout, because it composes cleanly across service boundaries and retries. For gRPC paths, built-in deadlines allowed remaining time to propagate across service boundaries. We extended that same boundary through internal request context so background work stopped when the budget did.

3. Became deliberate about how timeout values were chosen. Connection timeouts were kept short and tied to network behavior. Request timeouts were based on real production latency, not intuition. Rather than relying on averages, we focused on p99 and p99.9. When p50 was close to p99, we left room so minor slowdowns didn’t amplify into timeout spikes. This helped us understand how slow requests behaved under load and choose timeouts that protected capacity without causing unnecessary failures. For example, if 99% of requests completed in 300 milliseconds, a timeout of 350-400 milliseconds provided a better balance than tens of seconds.

What happened beyond that point became a conscious product decision. In our case, when currency conversion timed out, we fell back to showing prices in the primary currency. Users consistently preferred an imperfect answer over waiting indefinitely. We also kept retries conservative in user-facing paths. A retry that doesn’t respect an end-to-end deadline is worse than no retry: it multiplies work after the user has already moved on. That’s how “helpful” retries turn into retry storms under partial slowness. As a team, we codified these decisions into shared client defaults and a mandatory review checklist used across new and existing call paths so unbounded waiting didn’t quietly return.
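Here is a minimal sketch of the X-Request-Deadline propagation described in change 2. The header name comes from the flow above; the epoch-seconds encoding and the helper itself are assumptions for illustration.

```python
# Sketch: compute the remaining budget from an end-to-end deadline header,
# cap the per-call timeout within it, and propagate the deadline downstream.
import time
import requests

PER_CALL_CAP_S = 2.0  # illustrative per-hop ceiling

def call_downstream(url, inbound_headers, default_budget_s=10.0):
    # Assumption: the header carries the deadline as unix epoch seconds.
    deadline = inbound_headers.get(
        "X-Request-Deadline", str(time.time() + default_budget_s)
    )
    budget = float(deadline) - time.time()
    if budget <= 0:
        # The user is gone; starting the call is pure wasted work.
        raise TimeoutError("end-to-end deadline exhausted")
    return requests.get(
        url,
        headers={"X-Request-Deadline": deadline},  # next hop recomputes budget
        timeout=min(budget, PER_CALL_CAP_S),
    )
```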
Keeping timeouts honest

Timeouts should never be silent. After the incident, we focused on three things:

1. Making timeouts observable. Every timeout emitted a structured log entry with dependency context and remaining time budget. We tracked timeout rates as metrics and alerted on sustained increases rather than individual spikes. Rising timeout rates became an early warning signal instead of a surprise during incidents. Importantly, we updated paging to include user-impacting latency and “requests not finishing” signals, not just error rate.

2. Treating timeout values as living settings, not constants. Traffic grows, dependencies change, and architectures evolve, so values that were reasonable a year ago are often wrong today. We reviewed timeout configuration whenever traffic patterns shifted, new dependencies were introduced, or latency distributions changed.

3. Validating timeout behavior before real incidents forced the issue. Introducing artificial latency in non-production environments quickly exposed hanging calls, retry amplification, and missing fallbacks. It also forced us to separate two different questions: what breaks under load and what breaks under slowness. Traditional load tests answered the first. Fault-injection and latency experiments revealed the second, a form of controlled failure often described as chaos engineering. By introducing controlled delay and occasional hangs, we verified that deadlines actually stopped work, queues didn’t grow without bound, and fallbacks behaved as intended.

Lessons that carried forward

This incident permanently changed how I think about timeouts. A timeout is a decision about value. Past a certain point, waiting longer does not improve user experience. It increases the amount of wasted work a system performs after the user has already left. A timeout is also a decision about containment. Without bounded waits, partial failures turn into system-wide failures through resource exhaustion: blocked threads, saturated pools, growing queues, and cascading latency.

If there is one takeaway from this story, it is this: define timeouts deliberately and tie them to budgets. Start from user behavior. Measure latency at p99, not just averages. Make timeouts observable, and decide explicitly what happens when they fire. Isolate capacity so that a single slow dependency cannot drain the system. Unbounded waiting is not neutral. It has a real reliability cost. If you do not bound waiting deliberately, it will eventually bound your system for you.

This article is published as part of the Foundry Expert Contributor Network.

Cloud sovereignty isn’t a toggle feature

Sovereignty, locality, and “alternative cloud” strategies are often treated as simple settings in hyperscaler consoles: Pick a region, check a compliance box, and move on. IT consultancy Coinerella posted about replacing a typical US-centric startup baseline with a “Made in the EU” stack. They treat sovereignty as an architectural posture and an operating model that can save money. It still involves friction, compromise, and more responsibility than outsourcing to default ecosystems.

The Coinerella approach deliberately refuses to let the platform drift toward AWS and US-based hyperscalers, driven by practical considerations such as data residency, General Data Protection Regulation (GDPR) compliance, reducing concentration risk, and demonstrating the operational viability of European infrastructure. Leaders often talk about sovereignty until the first production incident, the first compliance review, or the first integration gap. Coinerella remains committed and is addressing the consequences.

A ‘made in the EU’ stack

Coinerella didn’t pursue sovereignty by inventing new patterns. They recreated a fairly standard modern platform using European providers and selectively self-hosted services. For core infrastructure, they moved primary compute and foundational services to Hetzner, including virtual machines, load balancing, and S3-compatible object storage.

This is where the story gets interesting: The hyperscaler narrative suggests that leaving AWS is mostly about giving up features. Coinerella found something different, at least for the basics. Compared with what many teams experience on AWS, the new stack’s performance and capabilities were solid, and the cost profile was compelling.

When Hetzner didn’t provide a managed service they needed, they filled in the gaps with Scaleway. That included transactional email, a container registry, additional object storage, observability tools, and even domain registration. In many migrations, stitching together multiple providers is where complexity balloons; here, the company intentionally used that approach, choosing the best option available in the region rather than forcing a single vendor to do everything.

At the edge, they relied on Bunny.net for the content delivery network and related capabilities, including storage, DNS, image optimization, web application firewall, and DDoS protection. That choice is a reminder that edge services are not just an add-on; they are a major part of the platform’s reliability and security posture. Their blog suggests the experience felt approachable coming from the more common Cloudflare-centric world, which is exactly what you want when you’re reducing risk in a migration.

Coinerella also addressed AI inference in a sovereignty-aware way by using European GPU capacity via Nebius rather than defaulting to US regions for inference calls. For identity, they used Hanko, a European authentication provider that supports modern approaches like passkeys and handles common login expectations such as social logins.

Finally, and importantly, they self-hosted a meaningful set of internal services on Kubernetes, using Rancher as the management layer. They ran Gitea for source control, Plausible for analytics, Twenty for CRM, Infisical for secrets management, and Bugsink for error tracking. If you’ve ever advised an enterprise to self-host “just a few things,” you know what this really means: You’re accepting a different operational contract, where savings and control come with life-cycle ownership.
Surprises and extra hurdles

Coinerella’s post is most valuable where they write about difficulties in the “boring” services that often make or break developer productivity. Email was one of the major friction points. In the US ecosystem, transactional email options are plentiful, polished, and easy to integrate, with a deep bench of community guidance for deliverability and troubleshooting. Coinerella made it work with a European alternative, but the takeaway is clear: The long tail of integrations, templates, and community answers isn’t evenly distributed across regions. It’s not that the service can’t function; it’s that you may have to serve as your own integration team more often.

Source control was another challenge. Moving away from GitHub isn’t just about changing a Git remote; it’s about leaving an ecosystem: CI/CD defaults, actions, marketplace integrations, and the operational muscle memory of every developer who has internalized the GitHub way of doing things. Gitea can be a solid foundation, but it doesn’t automatically bring the full assembly line you get “for free” on the dominant platform.

There were also cost anomalies. The author notes that some top-level domains appeared to be significantly more expensive through European registrars, sometimes dramatically so, without a satisfactory explanation why. That’s not an architectural deal-breaker, but it’s exactly the kind of real-world detail that proves a point: These journeys aren’t clean-room exercises. You’ll encounter unexpected differences in market structure, and you’ll have to decide how much they matter.

Unavoidable dependencies

If you’re looking for a purity narrative that claims “we removed every US dependency,” this isn’t it. Coinerella acknowledged that some dependencies are structural. User acquisition may require Google’s advertising ecosystem, and mobile distribution routes may have to go through Apple’s developer program. Social logins often rely on Google and Apple infrastructure, and removing them can harm conversion rates. Even AI introduces pressure: If you want access to specific frontier models, you may be forced to use US-based APIs. The smarter posture this blog implicitly recommends is to minimize what you can, isolate what you can’t, and be honest about the trade-offs. Sovereignty isn’t binary. It’s a spectrum of choices about where your core data and operational dependencies reside.

Moving to an alt cloud

Coinerella’s experience mirrors what many enterprises are learning as they move toward alt clouds, including sovereign clouds, private clouds, and other non-default platforms. The biggest lesson is that the economics of the move can be attractive precisely because you’re taking on more work. Lower infrastructure costs are real, but they come with increased integration responsibility, more platform engineering, and a higher need for operational maturity.

This is also where the “want versus need” conversation becomes unavoidable. Hyperscalers have trained teams to select managed services the way you pick items off a menu, often because it’s convenient, fast, and politically easy. Alt cloud strategies force prioritization. You may want the newest managed feature set, the deepest marketplace, and the broadest ecosystem, but you may not need them to meet your business outcomes. When you choose sovereignty or a private-cloud footing, you often end up selecting simpler technologies that meet requirements, even if they’re less glamorous or less feature-rich. This is not a retreat.
It’s a form of architectural discipline. However, none of this works without adding new practices. FinOps becomes an engineering discipline that spans heterogeneous providers, self-hosted platforms, and capacity planning decisions you can no longer punt to a hyperscaler. Observability becomes a first-class design requirement because you’re building a platform that crosses boundaries and includes components you own end to end. You need consistent metrics, logs, traces, service-level objectives, and incident response procedures that work even when tools and APIs differ across providers. Because you’re doing more of the work, you need to be more explicit about patching, security, backups, recovery testing, and operational runbooks.

The point isn’t that this is too hard to do. The point is that it’s hard in predictable ways. Coinerella’s blog makes the case that the journey is worth the trouble, but it’s not easy, and that’s the framing enterprise leaders need. If you expect sovereignty to be a product feature, you’ll be disappointed. If you treat it as a strategic posture that comes with real engineering commitments, you can get the control, cost profile, and locality benefits you’re looking for without being surprised by the work required.

Google’s Android developer verification program draws pushback

Google’s planned Android developer verification program, which requires Android apps to be registered by verified developers, is getting pushback, with opponents urging developers not to sign up for the program and to make their opposition known.

An open letter opposing the verification program was posted February 24 at Keep Android Open, a consortium that is fighting the Google verification program. Among the 41 signatories as of February 26 are the Electronic Frontier Foundation, the Free Software Foundation, the Center for Digital Progress, and the Software Freedom Conservancy. “Android, currently an open platform where anyone can develop and distribute applications freely, is to become a locked-down platform, requiring that developers everywhere register centrally with Google in order to be able to distribute their software,” said Marc Prud’hommeaux of the F-Droid Android development community in a blog post. Google could not be reached for comment on February 26.

The program was announced August 25, 2025. Starting in September, Android will require all apps to be registered by verified developers before they can be installed on certified Android devices. “To better protect users from repeat bad actors spreading malware and scams, we’re adding another layer of security to make installing apps safer for everyone: developer verification,” said Suzanne Frey, Google’s VP, Product, Trust and Growth for Android, in the blog post announcing the program. “This creates crucial accountability, making it much harder for malicious actors to quickly distribute another harmful app after we take the first one down,” she said.

Keep Android Open disagrees. In the open letter, the organization calls upon Google to:

- Immediately rescind the mandatory developer registration requirement for third-party distribution.
- Engage in transparent dialogue with civil society, developers, and regulators about Android security improvements that respect openness and competition.
- Commit to platform neutrality by ensuring that Android remains a genuinely open platform where Google’s role as platform provider does not conflict with its commercial interests.

Keep Android Open wants developers to resist by refusing to sign up for early access, refusing to perform early verification, and refusing to accept an invitation to the Android Developer Console. Instead, the group advises, developers should respond to the invitation with a list of concerns and objections. It encourages consumers to contact national regulators and express their concerns.