A single point of failure triggered the Amazon outage affecting millions


In turn, the delay in network state propagation spilled over to a network load balancer that AWS services rely on for stability. As a result, AWS customers experienced connection errors from the US-East-1 region. Affected AWS functions included creating and modifying Redshift clusters, Lambda invocations, and Fargate task launches, along with Managed Workflows for Apache Airflow, Outposts lifecycle operations, and the AWS Support Center.

For the time being, Amazon has disabled the DynamoDB DNS Planner and the DNS Enactor automation worldwide while it works to fix the race condition and add protections to prevent the application of incorrect DNS plans. Engineers are also making changes to EC2 and its network load balancer.
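One common form of protection against a race like this is to version every plan and refuse any plan that is older than the one already applied. The sketch below is illustrative only and is not Amazon's implementation: the names DnsPlan and PlanStore, the generation counter, and the rule that an empty plan is rejected are all assumptions made for the example.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsPlan:
    # Hypothetical plan record: a monotonically increasing generation
    # number plus the hostname-to-addresses mapping it wants to publish.
    generation: int
    records: dict


class PlanStore:
    """Holds the applied plan and rejects stale or empty ones (assumed rules)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._applied = None

    def apply(self, plan):
        # The check and the write happen under one lock, so a slow worker
        # holding an old plan cannot overwrite a newer one.
        with self._lock:
            if not plan.records:
                return False  # never publish a plan with no records
            if self._applied is not None and plan.generation <= self._applied.generation:
                return False  # stale or duplicate plan
            self._applied = plan
            return True

    @property
    def applied(self):
        with self._lock:
            return self._applied
```

With this guard, a delayed worker that tries to apply generation 1 after generation 2 is already live gets False back and the newer records stay in place.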

A cautionary tale

Ookla outlined a contributing factor not mentioned by Amazon: a concentration of customers who route their connectivity through the US-East-1 endpoint and an inability to route around the region. Ookla explained:

The affected US‑EAST‑1 is AWS’s oldest and most heavily used hub. Regional concentration means even global apps often anchor identity, state or metadata flows there. When a regional dependency fails as was the case in this event, impacts propagate worldwide because many “global” stacks route through Virginia at some point.

Modern apps chain together managed services like storage, queues, and serverless functions. If DNS cannot reliably resolve a critical endpoint (for example, the DynamoDB API involved here), errors cascade through upstream APIs and cause visible failures in apps users do not associate with AWS. That is precisely what Downdetector recorded across Snapchat, Roblox, Signal, Ring, HMRC, and others.
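A client can contain this kind of failure by trying more than one regional endpoint instead of depending on a single hostname. The following is a minimal sketch, assuming the application's data is already replicated to a second region; the choice of us-west-2 as the fallback and the helper name first_resolvable are assumptions for illustration.

```python
import socket

# Regional DynamoDB endpoints in order of preference (the fallback
# region here is an assumption, not something the article specifies).
ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",
    "dynamodb.us-west-2.amazonaws.com",
]


def first_resolvable(hosts, resolve=socket.getaddrinfo, port=443):
    """Return (host, failures) for the first host whose DNS lookup succeeds.

    `resolve` is injectable so the logic can be exercised without a network.
    Returns (None, failures) when every lookup fails.
    """
    failures = []
    for host in hosts:
        try:
            resolve(host, port)
            return host, failures
        except socket.gaierror as exc:
            # DNS could not resolve this endpoint; record it and move on.
            failures.append((host, str(exc)))
    return None, failures
```

Falling back only helps when the second region can actually serve the request, which is why the replication assumption matters more than the lookup logic itself.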

The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design.

“The way forward,” Ookla said, “is not zero failure but contained failure, achieved through multi-region designs, dependency diversity, and disciplined incident readiness, with regulatory oversight that moves toward treating the cloud as systemic components of national and economic resilience.”


