Resilience Engineering Lessons from James Kretchmar: Learning from the AWS 2025 Outage

When people opened their devices on October 20, 2025, the internet felt strangely still. The cloud, usually buzzing quietly in the background, had gone quiet. For hours, millions of people couldn’t log into work apps, stream a movie, or even check their smart thermostats.

In the background, Amazon Web Services (AWS), one of the most advanced digital infrastructures ever built, was buckling under the weight of a small, almost invisible flaw.

Few people understand what it means to build for, and survive, moments like this better than James Kretchmar, SVP and CTO for Akamai’s Cloud Technology Group, the team responsible for one of the world’s most distributed computing platforms.

James, drawing on 21 years of helping the company support more than 4,400 points of presence worldwide, shared his thoughts with UC Today on the lessons companies need to learn and what the future of IT resilience will look like.

What Happened: Dissecting the AWS Outage

The Domain Name System (DNS) is the digital address book that enables every website, API, and service to locate one another in milliseconds. When that system falters, everything built on top of it starts to wobble. That’s exactly what happened inside AWS on that October morning.

“According to Amazon’s public outage report, the root cause was related to DNS, one of the fundamental layers of the internet,” Kretchmar explained. “It’s critical not only for customers accessing services, but also for internal systems.”

At the center of the disruption was something deceptively simple: a race condition. That’s a software defect that occurs when two processes that are expected to run in a specific order instead overlap unpredictably.

“In this case, that timing defect led to blank DNS responses, which caused parts of the service to fail.”

That minor timing flaw set off a chain reaction. DNS requests began returning empty results. Load balancers failed to connect to healthy nodes. Internal monitoring systems (many of which also relied on AWS’s own DNS) started timing out.
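
To make that failure mode concrete, here is a deliberately simplified sketch in Python of how two steps that each assume a particular ordering can race and produce an empty answer. It is purely illustrative and not based on AWS’s actual implementation; the “planner” and “publisher” roles are hypothetical stand-ins.

```python
import threading
import time
import random

# Hypothetical illustration (not AWS's actual code): a "planner" builds a new
# set of DNS records while a "publisher" pushes whatever is currently staged.
# If the publisher wins the race, it publishes an empty record set -- the
# equivalent of a blank DNS response.

staged_records = {}      # records waiting to be published
published_records = {}   # what resolvers actually serve

def planner():
    time.sleep(random.uniform(0, 0.01))   # variable timing, as in real systems
    staged_records["service.internal"] = "10.0.0.12"

def publisher():
    time.sleep(random.uniform(0, 0.01))
    published_records.clear()
    published_records.update(staged_records)  # may copy an empty staging set

t1, t2 = threading.Thread(target=planner), threading.Thread(target=publisher)
t1.start(); t2.start(); t1.join(); t2.join()

print("DNS answer:", published_records.get("service.internal"))
# Sometimes prints '10.0.0.12', sometimes None -- the blank answer.
```

Run it a few times and the output flips between a valid address and nothing at all; a lock or an explicit ordering guarantee removes the race.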

Within minutes, thousands of companies were reporting partial or total outages. Collaboration platforms froze. Retail checkouts stalled. Cloud contact centers went silent.

Even businesses that didn’t use AWS directly were hit indirectly through partners and SaaS vendors that did. Analysts later estimated billions in lost productivity and revenue worldwide. For teams responsible for customer experience and uptime, it was a crash course in resilience engineering: a discipline focused not on preventing every fault, but on ensuring systems can bend without breaking.

“We can talk about how companies guard against that sort of thing. But fundamentally, it shows how something small and technical can ripple into a major outage because of how dependent everything is on these shared systems.”

Why It Matters: Shared Risk in the Cloud Era

Outages of this scale often start small, buried deep in automation logic or change-control processes. But as Kretchmar notes, IT resilience depends as much on how organizations respond as on how their systems are designed. Every second counts, every dependency matters, and every assumption about reliability is suddenly tested in real time.

The AWS incident forced thousands of leaders to confront an uncomfortable truth. In the age of hyperscale computing, a failure in one provider’s code can quickly become a failure for everyone.

“Even robust providers can suffer from rare, systemic failures.”

The very cloud networks that power today’s innovation have also woven a single, fragile web of dependence. Three hyperscalers now host more than 70 percent of enterprise workloads worldwide. Despite endless discussion of flexibility and redundancy, most of the world’s information, communication, and trade still run through only a handful of digital gateways.

That dependency is only growing. Companies have spent the last decade rushing toward the cloud for flexibility and speed. Still, few have invested as much in understanding how to maintain steady operations when the unthinkable happens.

“Cloud dependence has created shared risk; a single vendor issue can ripple through global operations. The challenge for every organization is to architect for failure, not just for uptime.”

That phrase “architect for failure” has become a rallying cry in the resilience engineering community. It means designing systems, processes, and teams that anticipate disruption, detect it early, and adapt in real time.
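
As one small illustration of that mindset, here is a minimal Python sketch of a lookup path that assumes failure will happen: every call has a fallback option and a last-known-good cache to fall back on. The function names below are placeholders rather than any particular provider’s API.

```python
import time

# A minimal sketch of "architecting for failure": every critical lookup has a
# fallback path and a last-known-good cache. primary_lookup and
# secondary_lookup are placeholders for whatever dependency you actually call
# -- a DNS resolver, a service-discovery API, a configuration store.

_last_known_good = {}

def resilient_lookup(key, primary_lookup, secondary_lookup, max_stale_secs=300):
    for attempt in (primary_lookup, secondary_lookup):
        try:
            value = attempt(key)
            if value:                                  # treat empty answers as failures
                _last_known_good[key] = (value, time.time())
                return value
        except Exception:
            continue                                   # fall through to the next option
    cached = _last_known_good.get(key)
    if cached and time.time() - cached[1] < max_stale_secs:
        return cached[0]                               # serve stale data rather than fail
    raise RuntimeError(f"all lookup paths failed for {key}")
```

Serving slightly stale data is a deliberate trade-off: a cached answer can keep dependent services limping along while the primary path recovers.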

It also means recognizing that IT resilience isn’t just the responsibility of infrastructure teams. The C-Suite and boards now have to treat reliability the same way they approach cybersecurity: as a measurable business risk.

Resilience Engineering Lessons for IT & Cloud Architects

For IT leaders and cloud architects, the 2025 AWS outage offered a humbling checklist of what to rethink. Kretchmar broke it down into three clear pillars: architecture, governance, and preparation.

“There are several pillars to building resiliency. The first is architectural design, making sure your systems can withstand different types of failure, including things like race conditions,” he explained.

“But architecture alone isn’t enough. You also need mechanisms that prevent one small issue from escalating into a larger outage. That might include self-healing systems that detect when something’s gone wrong and automatically mitigate it.”

This is resilience engineering at its most practical: designing for failure, not perfection. Systems must expect turbulence; it’s what separates mature infrastructure from wishful thinking. Netflix famously tests its environments by deliberately injecting failures; Akamai builds the same assumption into distributed self-healing networks that re-route traffic around trouble before users ever notice.
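
The pattern behind that kind of self-healing can be sketched in a few lines. The toy example below is not Akamai’s or Netflix’s actual tooling; it simply probes nodes and pulls unhealthy ones out of rotation so traffic keeps flowing.

```python
import random

# Toy self-healing traffic steering: nodes that fail health probes are pulled
# out of rotation automatically and re-added once they recover.

nodes = {"edge-1": True, "edge-2": True, "edge-3": True}   # node -> healthy?

def probe(node):
    """Placeholder health probe; a real one would hit an HTTP health endpoint."""
    return random.random() > 0.2          # simulate occasional failures

def health_check_pass():
    for node in nodes:
        nodes[node] = probe(node)

def route_request():
    healthy = [n for n, ok in nodes.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy nodes available")
    return random.choice(healthy)         # real systems weight by load and locality

health_check_pass()
print("request routed to", route_request())
```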

Yet the human side of these systems is just as vital as the technology itself.

“Beyond the technology, you need solid change management and governance: reviewing systems regularly and ensuring best practices are consistently applied,” said Kretchmar. “Incident management is also critical, so when something does go wrong, your team knows exactly how to respond.”

Service disruptions will always happen, but disorder doesn’t have to follow. The most prepared organizations schedule thorough change reviews, roll out updates in stages, and keep well-practiced rollback procedures ready to go.
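
A staged rollout with an automatic rollback trigger might look something like the sketch below. The phase sizes, error budget, and error_rate() telemetry hook are illustrative assumptions, not recommended values.

```python
# Minimal sketch of a staged (canary) rollout: widen the deployment only while
# the observed error rate stays inside the budget, otherwise roll back.

PHASES = [0.01, 0.05, 0.25, 1.0]   # fraction of the fleet receiving the change
ERROR_BUDGET = 0.02                # abort if the error rate exceeds 2%

def deploy_to(fraction):
    print(f"deploying change to {fraction:.0%} of the fleet")

def error_rate():
    return 0.003                   # placeholder: read this from your monitoring system

def rollback():
    print("error budget exceeded -- rolling back to the previous version")

def staged_rollout():
    for fraction in PHASES:
        deploy_to(fraction)
        if error_rate() > ERROR_BUDGET:
            rollback()
            return False
    print("rollout complete")
    return True

staged_rollout()
```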

“Scenario planning is invaluable,” Kretchmar advised. “Run ‘what if’ exercises in peacetime, simulate major failures, identify gaps, and close them before they become problems.”
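
One lightweight way to run those “what if” exercises is to inject faults into a dependency in a test environment and check whether fallbacks, alerts, and runbooks behave as expected. The decorator below is a sketch of the idea, not a production chaos-engineering tool.

```python
import functools
import random

# Wrap a dependency so it fails a configurable fraction of the time during a
# game-day exercise, then verify that retries, fallbacks, and dashboards
# respond the way the runbook says they should.

def inject_faults(failure_rate=0.3, enabled=True):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled and random.random() < failure_rate:
                raise TimeoutError(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.5)
def lookup_dns(name):
    return "10.0.0.12"   # placeholder for a real resolver call

for _ in range(3):
    try:
        print(lookup_dns("service.internal"))
    except TimeoutError as exc:
        print("handled:", exc)
```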

Lessons for Security Leaders

When the cloud falters, security often takes the hit first. Logs stop updating. Alerts fall silent. The very tools meant to detect and contain threats can vanish in the same outage that caused the crisis. The 2025 AWS incident was no exception.

Kretchmar told us, “Resiliency really matters for security systems. It’s crucial to probe your vendors on how they maintain reliability, not just their SLA numbers.”

It’s easy to assume that security products, such as firewalls, monitoring systems, and identity platforms, are immune to the same risks that bring down business applications. They aren’t. Most run on the same cloud backbones, governed by the same control planes.

That means the same race condition that knocked out DNS could just as easily silence an intrusion-detection feed or disable authentication for an entire workforce.

For Kretchmar, the difference between surviving and suffering in those moments comes down to diligence.

“Ask detailed questions: How do they roll out changes? How do they phase deployments to avoid breaking things? We’ve seen major incidents where updates were pushed too quickly and caused outages in security software.”

He added:

“So, for security leaders, it’s about due diligence. Understand your vendors’ processes deeply and make sure their reliability practices match the criticality of their role.”

Resilience Engineering Lessons for the C-Suite & Boards

For executives, the 2025 AWS outage was a pivotal moment in the boardroom. Overnight, service interruptions that began in data centers rippled into investor calls, customer support escalations, and front-page news. James Kretchmar’s advice to the C-suite is disarmingly straightforward:

“Boards can approach this the same way they already think about cybersecurity risk.”

That framing matters. Cybersecurity has long been viewed as a collective responsibility, backed by dedicated funding, regular reporting, and constant auditing. Cloud reliability and business continuity should be governed with that same seriousness, Kretchmar noted, before adding:

“Identify potential risks, understand your exposure, and ensure there’s a clear plan to mitigate those risks. You don’t need to prescribe technical solutions, just create the framework and keep the discussion active.”

In other words, executives don’t need to be cloud architects; they just need to ask the right questions:

  • Where are our single points of failure?
  • How do our vendors test their own IT resilience?
  • When was our last real-world simulation of a full-scale outage?

The answers reveal how prepared a company truly is, and the questions should be asked again, regularly.

“Regularly reviewing and reassessing resilience helps keep everyone aligned and ensures it remains a top priority.”

Reliability isn’t built in crisis; it’s shaped by culture and by leaders who value operational stability as much as innovation. Resilience engineering becomes an integral part of brand protection, safeguarding both customer trust and shareholder confidence.

Pragmatically, some executives are turning to partners who can shoulder part of that load. However, outsourcing doesn’t remove the responsibility from leaders themselves.

Lessons for Business & Strategy Leaders

Probably the biggest question for leaders to ask following the AWS outage is this: “Should business leaders be avoiding cloud concentration?”

Kretchmar believes that question sits at the heart of every modern strategy conversation.

“It’s definitely worth considering, though the right approach depends on the use case,” he said. “For workloads like virtual machines or object storage, multi-cloud makes sense. Designing with portable technologies allows you to switch clouds if one fails. The key is avoiding lock-in with proprietary features.”

That flexibility is the essence of a multi-cloud strategy: the ability to move or replicate workloads across providers without rewriting everything from scratch. It’s a major part of resilience engineering, but one that many struggle with. IDC estimates that more than 80 percent of global enterprises now use more than one cloud provider, yet only a fraction can shift production workloads seamlessly when disaster strikes.

“Technologies developed under the Cloud Native Computing Foundation (CNCF) are great examples,” Kretchmar added. “They’re open, portable, and supported across providers.”

These open frameworks enable companies to build once and run anywhere, thereby reducing their dependency on any single vendor’s quirks or outages.
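
In practice, the build-once, run-anywhere idea usually comes down to a thin, provider-neutral interface with an adapter per cloud. The sketch below uses hypothetical class names to show the shape of that abstraction; real projects often get the same effect from CNCF-style tooling such as Kubernetes or S3-compatible object storage APIs.

```python
from abc import ABC, abstractmethod

# Application code talks to a provider-neutral interface; each cloud gets its
# own adapter. Class names are hypothetical placeholders.

class ObjectStore(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class ProviderAStore(ObjectStore):
    def put(self, key, data): ...   # wrap provider A's SDK here
    def get(self, key): ...

class ProviderBStore(ObjectStore):
    def put(self, key, data): ...   # wrap provider B's SDK here
    def get(self, key): ...

def make_store(provider: str) -> ObjectStore:
    # Switching providers becomes a configuration change, not a rewrite.
    return {"a": ProviderAStore, "b": ProviderBStore}[provider]()
```

Because the application depends only on the neutral interface, swapping or doubling up providers becomes a deployment decision rather than a re-architecture.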

Still, Kretchmar cautioned against blind adoption.

“That said, there are exceptions. For example, with security solutions, having multiple overlapping systems can create more risk than resilience. Integration complexity can introduce its own failure points. So for security, it’s often better to pick one robust solution and go deep with it.”

That’s the balance leaders now face: freedom versus simplicity, flexibility versus focus. The answer lies in business priorities. A retailer might value redundancy; a healthcare provider might value regulatory clarity.

“If you depend too much on one provider’s unique services, you lose flexibility. For most workloads, using open, portable standards helps prevent that, though there are exceptions. But overall, it’s a smart way to maintain control.”

Looking Ahead: Resilience Engineering for the Future

Every major outage leaves behind two kinds of companies: those that rush to patch the problem and those that decide to change the way they think. James Kretchmar belongs firmly to the second camp. His final reflections aren’t about AWS at all; they’re about the discipline required to make resilience a habit, not a headline.

“It really comes down to consistent attention and investment. It’s too easy to ignore reliability until something breaks. Just like with cybersecurity, we have to recognize reliability as a critical, ongoing commitment,” Kretchmar said. “The growing complexity of systems is part of the challenge; complexity can be the enemy of reliability if not managed carefully.”

But as Kretchmar says, it “can” be managed. Organizations just need to focus on it every day, not just after a crisis. “At Akamai, even with our strong track record, we’ve learned hard lessons along the way. Twenty years ago, we had an incident caused by a bad change, which led us to overhaul our systems to make them far more robust. Those investments have paid off ever since.”

Sustainable IT resilience means accepting that the work is never done. Governance reviews, incident drills, and multi-region tests all form part of an ongoing cycle of improvement.

True cloud reliability, then, isn’t just about failovers and backups. It’s about culture. Teams that celebrate uptime, learn openly from mistakes, and build feedback loops into every deployment create systems that genuinely improve with time. Those who treat resilience as a compliance box tend to encounter the same failures again, albeit at a higher cost.

Resilience Engineering: A Shared Responsibility

The 2025 AWS outage reminded every CIO, CTO, and boardroom that resilience isn’t something you buy; it’s something you build.

James Kretchmar’s reflections make one thing clear: resilience is everyone’s business. From engineers writing deployment scripts to executives approving budgets, the ability to withstand disruption now defines an organization’s credibility as much as its customer experience.

For Kretchmar, it all comes back to discipline and humility:

“When the cloud provider fails, you discover how much you truly depend on it. The question isn’t if you’ll have an outage, it’s when, and how you’ll respond.”

Ultimately, engineering for resilience isn’t a guarantee of perfection. It’s a culture of readiness, tested in real-world pressure, and proven by every organization that chooses to learn before the lights go out again.

This post originally appeared on UC Today.