Amazon has published a lengthy report about the outage that knocked numerous websites, services, apps and games offline on October 20. It all started with a bug in the automation that manages DNS records for DynamoDB, the database service where AWS customers store their data, which then triggered more issues in other Amazon systems that rely on it.
As Amazon explains, DynamoDB maintains hundreds of thousands of DNS records and is supposed to be able to fix any issue automatically. But on October 20, a bug in the DynamoDB DNS management system resulted in an empty DNS record for Amazon’s data centers in Northern Virginia. The system was supposed to repair the issue on its own, but it failed to do so, and Amazon had to fix the problem manually. While the record was empty, every system that needed to connect to DynamoDB, including those of Amazon’s cloud computing customers, experienced DNS failures and couldn’t reach it. It felt like half the internet wasn’t working when that happened.
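To see why a single empty DNS record can have such wide effects, here is a minimal Python sketch. It is only an illustration, not Amazon's code: the in-memory `dns_table`, the hostname, the IP addresses and the `resolve`, `connect` and `try_connect` helpers are all hypothetical stand-ins for a real resolver and client.

```python
class ResolutionError(Exception):
    """Raised when a hostname has no usable addresses."""


# Hypothetical endpoint name; not a real AWS hostname.
ENDPOINT = "dynamodb.example-region.example.com"

# Toy DNS table: hostname -> list of IP addresses (documentation-range values).
dns_table = {ENDPOINT: ["192.0.2.10", "192.0.2.11"]}


def resolve(hostname):
    """Return the addresses for hostname, or raise if the record is missing or empty."""
    addresses = dns_table.get(hostname, [])
    if not addresses:
        raise ResolutionError(f"no addresses for {hostname}")
    return addresses


def connect(hostname):
    """Pretend to connect by picking the first resolved address."""
    return f"connected to {resolve(hostname)[0]}"


def try_connect(hostname):
    """Attempt a connection and report a DNS failure instead of crashing."""
    try:
        return connect(hostname)
    except ResolutionError as err:
        return f"failed: {err}"
```

In this toy model, setting `dns_table[ENDPOINT] = []` makes every call to `try_connect(ENDPOINT)` fail, even though nothing is wrong with the service behind the name, and the calls only succeed again once someone restores the addresses.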
The websites and services affected by the outage include Amazon itself, Amazon Alexa devices, Bank of America, Snapchat, Canva, Reddit, Apple Music, Apple TV, Lyft, Duolingo, Fortnite, Disney+, Venmo, DoorDash, Hulu, PlayStation and even Eight Sleep, whose beds connect to the internet to adjust their temperature and incline. Some of them were slow to respond, while others were completely inaccessible.
“We apologize for the impact this event caused our customers. While we have a strong track record of operating our services with the highest levels of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further,” Amazon said in a statement.
