AWS outage saw “cell management” system get flustered by big shards

The system "incorrectly determined that the healthy hosts were unhealthy and began redistributing shards..."

We love a good post-mortem at The Stack, particularly for a hyperscaler outage. Whether it’s Google Cloud in Paris trying to explain how it caught fire, got flooded, and ran out of water; Microsoft Azure admitting, sotto voce, that its encryption key infrastructure is a rotten mess and is breaking things; or AWS explaining how it lost control of data centre cooling and couldn’t execute “purge mode” as servers overheated, we’re all twitching curtains. Ditto for things like HPE deleting 77TB of data from a supercomputer with a borked update; “if it bleeds”, as they say, “it leads”.

Whilst such reports may be a guilty pleasure for those with an ingrained journalistic bloodlust, less facetiously, they are also often a fascinating insight into organisations’ architectures. The way in which they are delivered can betray something about a company’s culture too. Are they quietly buried? Do they seem sincere? Are they weirdly crammed with marketing and crummy jargon? (“Now is not the time, CrowdStrike”.)

Incident post-mortems are something of a rarity from AWS (it has published just two since late 2021). So with a fresh one quietly out, covering an incident in July 2024 and seemingly overlooked by the world at large, we thought we would dig in and share our digested learnings.
