When the Cloud Trembled: Inside the AWS Outage on 20 October 2025
Published Oct 21, 2025

On 20 October 2025, a fault inside Amazon Web Services rippled across the internet, interrupting apps, websites and devices for millions. Here is what happened, who was affected, why it matters, and how teams can build more resilient systems.

What happened

On Monday 20 October 2025, Amazon Web Services experienced a major outage that began in its US-EAST-1 region and quickly disrupted thousands of apps and sites worldwide. Popular services such as Alexa, Snapchat, Fortnite and many others were affected before recovery progressed through the day.

By the evening, Amazon reported that systems had returned to normal operations, although some businesses faced lingering delays as backlogs cleared.

The root cause in plain English

Reporting and company updates point to a problem that interfered with Domain Name System lookups for critical AWS services in US-EAST-1. In simple terms, the address book of the internet was not answering correctly for parts of the AWS network, so many apps could not find the services they needed.
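
To make that concrete, here is a minimal Python sketch of what a DNS failure looks like from an application's point of view. The hostname is hypothetical, and the cached-address fallback is just one illustrative mitigation, not a description of what happened inside AWS.

    import socket

    # Hypothetical endpoint name; real AWS endpoints vary by service and region.
    ENDPOINT = "example-service.us-east-1.amazonaws.com"

    # Last address we successfully resolved, refreshed on every good lookup.
    last_known_address = None

    def resolve(hostname):
        """Resolve a hostname, falling back to the last known address if DNS fails."""
        global last_known_address
        try:
            # This is the kind of lookup that was failing for parts of US-EAST-1.
            results = socket.getaddrinfo(hostname, 443, type=socket.SOCK_STREAM)
            last_known_address = results[0][4][0]  # first returned IP address
            return last_known_address
        except socket.gaierror:
            # The "address book" did not answer; reuse the cached address if we have one.
            if last_known_address is not None:
                return last_known_address
            raise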

Who and what was affected

  • Consumer platforms, including social and gaming apps, saw sign-in failures and timeouts.
  • UK services reported disruption, including banking and public sector sites that rely on AWS.
  • Everyday devices, such as smart doorbells and voice assistants, were unresponsive for many users.

A quick timeline

  • Early morning, US Eastern time: errors begin in US-EAST-1 and spread as dependent systems fail.
  • Within hours: services start to recover, although full stability takes longer while providers clear queues and restore normal traffic.
  • By the evening: Amazon states the incident is fully mitigated.

Why this outage matters

The incident highlights how dependent the internet has become on a small number of cloud providers. When one region stumbles, downstream services across finance, media, education and the public sector can feel the shock. Experts warned that concentration risk leaves users at the mercy of too few providers, renewing calls for stronger resilience and oversight.

Lessons for engineering teams

Outages will happen. What matters is how we design and operate systems to reduce impact. The following practices help:

  • Design for regional failure: run active-active or warm standby across multiple regions so traffic can shift quickly. Test failover regularly, not only on paper; a simple client-side version is sketched after this list.
  • Reduce hidden dependencies: map your reliance on managed services such as databases, queues and identity providers. Build sensible timeouts, retries and circuit breakers so one failing component does not take down the rest; see the second sketch below.
  • Cache wisely: keep critical configuration and content close to users so read paths survive short DNS or API disruptions; see the third sketch below.
  • Invest in observability: instrument user journeys and back-end calls so you can see which dependency failed first and communicate accurately.
  • Plan communications: prepare clear status page templates and customer messaging for outages. Honest updates keep trust intact during recovery.
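
A minimal client-side illustration of the regional-failover idea, assuming two hypothetical regional endpoints. In practice failover is usually handled at the DNS or load-balancer layer, but the sketch shows the shape of the logic.

    import urllib.error
    import urllib.request

    # Hypothetical regional endpoints for the same API; real hostnames will differ.
    REGION_ENDPOINTS = [
        "https://api.us-east-1.example.com",
        "https://api.eu-west-1.example.com",
    ]

    def fetch_from_any_region(path):
        """Try each region in order and return the first successful response body."""
        last_error = None
        for base in REGION_ENDPOINTS:
            try:
                with urllib.request.urlopen(base + path, timeout=2) as response:
                    return response.read()
            except (urllib.error.URLError, TimeoutError) as exc:
                last_error = exc  # this region looks unhealthy, try the next one
        raise RuntimeError(f"all regions failed, last error: {last_error}")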
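
For the timeouts, retries and circuit breaker point, here is a deliberately small sketch using only the standard library; the thresholds and names are illustrative, not a prescription.

    import time
    import urllib.error
    import urllib.request

    REQUEST_TIMEOUT = 2        # seconds; fail fast rather than hang on a sick dependency
    FAILURE_THRESHOLD = 3      # consecutive failures before the circuit opens
    COOL_DOWN_SECONDS = 30     # how long to refuse calls before probing again

    _failures = 0
    _opened_at = None

    def call_dependency(url):
        """Call a downstream service with a timeout, one retry and a simple circuit breaker."""
        global _failures, _opened_at

        # While the circuit is open, fail immediately so callers can degrade gracefully.
        if _opened_at is not None and time.monotonic() - _opened_at < COOL_DOWN_SECONDS:
            raise RuntimeError("circuit open: dependency recently failing")

        for attempt in range(2):  # one retry
            try:
                with urllib.request.urlopen(url, timeout=REQUEST_TIMEOUT) as response:
                    _failures, _opened_at = 0, None  # healthy again
                    return response.read()
            except (urllib.error.URLError, TimeoutError):
                _failures += 1
                if _failures >= FAILURE_THRESHOLD:
                    _opened_at = time.monotonic()
                    raise RuntimeError("circuit open: too many consecutive failures")
                time.sleep(0.2 * (attempt + 1))  # brief backoff before retrying
        raise RuntimeError("dependency unavailable after retries")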
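
And for caching, the simplest useful pattern is often serve-stale-on-error: keep the last good copy of critical configuration or content and fall back to it when a fresh read fails. The fetch function below is a placeholder for whatever store you actually use.

    import time

    CACHE_TTL_SECONDS = 300  # how long a cached value counts as fresh

    _cache = {}  # key -> (value, fetched_at)

    def get_with_stale_fallback(key, fetch):
        """Return a fresh value when possible; serve the stale copy if the source is down."""
        entry = _cache.get(key)
        if entry is not None and time.monotonic() - entry[1] < CACHE_TTL_SECONDS:
            return entry[0]  # still fresh, no remote call needed
        try:
            value = fetch(key)  # e.g. a call to a managed config or content service
            _cache[key] = (value, time.monotonic())
            return value
        except Exception:
            if entry is not None:
                return entry[0]  # better a slightly stale answer than an error page
            raise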

What businesses can do next

Review the incident notes from your monitoring tools and run a post-mortem on how your users were affected. If your risk profile is high, consider multi-region or multi-cloud for the most critical workloads. Where full duplication is not practical, target the services that would hurt most if they failed and add pragmatic safeguards.

Final thought

Monday’s outage was a sharp reminder that the cloud is just someone else’s computers stitched together with networks and naming systems. The internet is resilient, but not invincible. With thoughtful architecture and clear communication, teams can turn a bad day on the platform into a manageable incident rather than a crisis.