
When the Cloud Crashed: A Day in the Life of a Delivery-Recovery Engineer

You know those days when everything that can go wrong does? October 20th, 2025, was one of those days.

I work for one of the biggest grocery retailers in the U.S., and a sizable part of my role is helping my teams keep our delivery workflows running smoothly when an upstream system hiccups. But when Amazon Web Services – a backbone of so many of our systems – went down, we entered a kind of technical limbo, waiting for things to come back to life.

The Calm Before the Storm

It began without warning. Halfway through my morning coffee, I noticed alerts from our delivery partner integrations: several orders were failing. Honestly, I figured it was some small downstream problem – a misconfigured API, maybe a brief network hiccup. Then minutes turned into tens of minutes, and the failures kept multiplying.

Then the message came: AWS was having a massive outage. Servers, APIs, DNS resolution – all broken. One of our main delivery partner systems, which runs on AWS, was unavailable: new customer orders could not be submitted, and orders already in the queue were stuck, half-processed.

I watched as our dashboard went red. Alerts escalated. Teams scrambled.

Twelve Hours of Chaos

What’s unusual is how long it went on. In cloud-world, we sometimes see brief regional hiccups, but this one lasted around 12 hours (and longer still for some, depending on backlog processing).
AWS later confirmed the outage originated in its US-EAST-1 region, caused by internal DNS resolution issues tied to EC2’s internal network and health-monitoring subsystems (Reuters).

The ripple effects were global: Snapchat, Reddit, Signal, gaming platforms, and more – everywhere you looked, services were down (The Verge). “The incident highlights the complexity and fragility of the internet,” said Mehdi Daoudi, CEO of Catchpoint (PC Gamer).

And so, as our customers attempted to place grocery orders, many were met with errors, timeouts, or “service unavailable” messages.

My Mission: Recovery

My job in that moment was simple in theory, brutal in execution: recover every order the outage had broken.

We divided and conquered:

The support team began combing through logs, capturing which orders had failed and in what state (created, payment pending, items picked), and preparing replays.
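That triage step can be sketched roughly like this – a minimal Python example that groups failed orders by the state they were stuck in so each bucket can be replayed from the right step. The state names come from the text above; the `FailedOrder` type, order IDs, and `prepare_replays` helper are illustrative, not our actual tooling.

```python
from dataclasses import dataclass
from enum import Enum

# States mentioned in the incident notes; the enum itself is illustrative.
class OrderState(Enum):
    CREATED = "created"
    PAYMENT_PENDING = "payment_pending"
    ITEMS_PICKED = "items_picked"

@dataclass
class FailedOrder:
    order_id: str
    state: OrderState

def prepare_replays(failed_orders):
    """Bucket failed orders by stuck state, so each bucket can be
    replayed from the appropriate point in the pipeline."""
    buckets = {state: [] for state in OrderState}
    for order in failed_orders:
        buckets[order.state].append(order.order_id)
    return buckets

failed = [
    FailedOrder("A-100", OrderState.CREATED),
    FailedOrder("A-101", OrderState.PAYMENT_PENDING),
    FailedOrder("A-102", OrderState.ITEMS_PICKED),
    FailedOrder("A-103", OrderState.CREATED),
]
replays = prepare_replays(failed)
```

An order stuck at "payment pending" needs a different replay path than one stuck at "items picked," which is why grouping by state comes before any retries.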

Our engineering arm built quick feature flags and routing fallbacks. Some APIs were rerouted through alternate regions, degraded gracefully, or throttled to reduce pressure on failing endpoints.
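The flag-plus-fallback pattern can be sketched as below – a bare-bones example of a feature flag gating a reroute to an alternate region. The endpoint URLs, flag name, and in-memory flag store are all hypothetical stand-ins for whatever flag service and partner APIs a team actually runs.

```python
# Hypothetical flag store; in practice this would be a flag service.
FEATURE_FLAGS = {"use_fallback_region": True}

# Illustrative endpoints, not real partner URLs.
PRIMARY = "https://api.us-east-1.partner.example/orders"
FALLBACK = "https://api.us-west-2.partner.example/orders"

def pick_endpoint(flags):
    """Route traffic to the fallback region while the flag is on,
    and back to the primary the moment it is flipped off."""
    if flags.get("use_fallback_region"):
        return FALLBACK
    return PRIMARY
```

The appeal of the flag is reversibility: once the primary region recovers, flipping one value sends traffic back without a deploy.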

We queued recovery jobs, monitored retry logic, and held our breath.
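The retry logic we were watching is, in essence, exponential backoff with jitter – a sketch follows, assuming a generic `operation` callable; the parameter defaults are illustrative, not our production values.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky operation with exponential backoff plus random
    jitter, so queued replays don't hammer a recovering service
    in lockstep. Re-raises the last error if all attempts fail."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            sleep(delay)
```

The jitter term matters as much as the backoff: without it, thousands of retries synchronize and arrive in waves, which is exactly the pressure a half-recovered endpoint cannot absorb.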

By late afternoon, as AWS gradually restored core services, we started replaying order jobs. Slowly, the stuck queue cleared. The dashboards returned from red to yellow to green. Orders that were “lost” found their way home again.
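Draining a stuck queue looks roughly like the sketch below: replay jobs in FIFO order, and push jobs that still fail to the tail so one bad order doesn't block the rest. The `submit` callable and the pass-count safety bound are assumptions for illustration.

```python
from collections import deque

def drain_queue(stuck_jobs, submit):
    """Replay stuck jobs in FIFO order. Jobs whose submission still
    fails are re-queued at the tail. `submit` is a hypothetical
    callable returning True on success, False on failure."""
    queue = deque(stuck_jobs)
    remaining_passes = len(queue) * 2  # safety bound against endless loops
    replayed = []
    while queue and remaining_passes > 0:
        job = queue.popleft()
        if submit(job):
            replayed.append(job)
        else:
            queue.append(job)  # try again after the others
        remaining_passes -= 1
    return replayed, list(queue)
```

The bound on total passes is the pragmatic bit: during a partial recovery some orders will keep failing, and you want the drain loop to hand those back for manual review rather than spin forever.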

What Went Through My Mind

At 7 PM local, I finally allowed myself a deep breath: the system was mostly stable, new orders flowing, and as many failed ones as possible replayed.

What the World Was Saying

Lessons and Reflections

A Bad Day, But a Good Story

Sometimes, the tech world gives you a brutal stress test and all you can do is respond. October 20th was one of those days. But I came out of it with scars, lessons, and a stronger system.

At 11 PM, I closed my laptop. The world was still humming—apps were back, orders were flowing, the cloud held. But in my head, I replayed every alert, every failure, every fallback we threw.

Tomorrow, we build something more resilient.

What are your thoughts? Let me know in the comments below!
