A protracted Cloudflare outage was triggered by a total power failure at the Tier 3 data centre running its control plane – and by a string of previously unrecognised dependencies that meant a planned failover across its active-active data centres could not complete, the company has revealed.
In a detailed post-mortem published just days after the November 2 outage, Cloudflare CEO Matthew Prince admitted that the content delivery network and cybersecurity company had never tested what would happen if PDX-04, the data centre hosting most of its high availability cluster, went entirely offline.
After PDX-04 lost all power (despite the facility having multiple utility providers, generators, and batteries), Prince said, “we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04… two critical services… Kafka and ClickHouse… had services that depended on them that were running in the high availability cluster.”
“Far too many of our services depend on the availability of our core facilities,” he reflected, adding that “dependencies shouldn’t have been so tight, should have failed more gracefully, and we should have caught them.”
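That failure class – a service that looks highly available but transitively leans on something pinned to a single facility – is exactly what a dependency-graph audit is meant to surface before an outage does. A minimal sketch of the idea in Python (the service names, topology and catalogue format are hypothetical, not Cloudflare's actual systems):

```python
# Hypothetical dependency audit: flag services that are meant to be highly
# available but transitively depend on a service running in only one facility.
from collections import deque

# service -> (facilities it runs in, services it calls); illustrative data only
SERVICES = {
    "dashboard-api": ({"FAC-1", "FAC-2", "FAC-3"}, {"config-store"}),
    "config-store":  ({"FAC-1", "FAC-2", "FAC-3"}, {"analytics-db"}),
    "analytics-db":  ({"FAC-3"}, set()),  # quietly pinned to a single facility
}

def single_facility_dependencies(root: str) -> set[str]:
    """Return transitive dependencies of `root` that run in exactly one facility."""
    risky, seen, queue = set(), {root}, deque([root])
    while queue:
        _, deps = SERVICES[queue.popleft()]
        for dep in deps - seen:
            seen.add(dep)
            queue.append(dep)
            facilities, _ = SERVICES[dep]
            if len(facilities) == 1:
                risky.add(dep)
    return risky

if __name__ == "__main__":
    for svc in SERVICES:
        risks = single_facility_dependencies(svc)
        if risks:
            print(f"{svc}: hidden single-facility dependencies -> {sorted(risks)}")
```

Run continuously against a real service catalogue, a check along those lines is what turns “we should have caught them” into an alert rather than a post-mortem finding.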
Prince continued: “We must expect that entire data centers may fail… We are shifting all non-critical engineering functions to focusing on ensuring high reliability of our control plane. As part of that, we expect the following changes:
- "Remove dependencies on our core data centers for control plane configuration of all services and move them wherever possible to be powered first by our distributed network
- "Ensure that the control plane running on the network continues to function even if all our core data centers are offline
- "Require that all products and features that are designated Generally Available must rely on the high availability cluster (if they rely on any of our core data centers), without having any software dependencies on specific facilities
- "Require all products and features that are designated Generally Available have a reliable disaster recovery plan that is tested
- "Test the blast radius of system failures and minimize the number of services that are impacted by a failure
- "Implement more rigorous chaos testing of all data center functions including the full removal of each of our core data center facilities
- "Thorough auditing of all core data centers and a plan to reaudit to ensure they comply with our standards
- "Logging and analytics disaster recovery plan that ensures no logs are dropped even in the case of a failure of all our core facilities
Prince has been widely praised for the swift, detailed write-up on the Cloudflare outage (which was largely resolved at 04:25 on November 4, after a lot of manual server rebuilds and a “thundering herd” problem).
Customers who suffered the protracted outage may be somewhat less forgiving (thousands were affected, with multiple services failing).
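The “thundering herd” problem Prince mentions is a classic recovery hazard: the moment a dependency comes back, every waiting client and restarted server retries at once and knocks it straight back over. The usual mitigation is randomised (jittered) exponential backoff on reconnects; a generic sketch, not Cloudflare's code:

```python
# Generic full-jitter exponential backoff: spreads retries out in time so a
# recovering service is not hit by every client at the same instant.
import random
import time

def call_with_backoff(operation, max_attempts: int = 6,
                      base_delay: float = 0.5, max_delay: float = 30.0):
    """Retry `operation` on ConnectionError with full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Sleep a random amount up to the (capped) exponential ceiling.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

if __name__ == "__main__":
    calls = {"n": 0}
    def flaky():  # succeeds on the third attempt, simulating a recovering service
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("service still warming up")
        return "ok"
    print(call_with_backoff(flaky))  # prints "ok" after two jittered waits
```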
If anything, the incident is a reminder that building reliable systems is an iterative process, improved by deep testing and by learning from pain – like Cloudflare's self-admitted failure to ensure that new products were architected properly for resilience.
"We were also far too lax about requiring new products and their associated databases to integrate with the high availability cluster. Cloudflare allows multiple teams to innovate quickly. As such, products often take different paths toward their initial alpha" the post-mortem says.
"While, over time, our practice is to migrate the backend for these services to our best practices, we did not formally require that before products were declared generally available (GA). That was a mistake as it meant that the redundancy protections we had in place worked inconsistently depending on the product" CEO Matthew Prince wrote.
Quizzed on what the plan is for a natural disaster in the Hillsboro, Oregon area – where Cloudflare has clustered the three independent active-active data centres underpinning its core systems – Prince responded on X: “Facilities need to be close enough together to have active-active writes.
“[We] took seismic concerns into account when picking the three facilities. And, in [the] worst case [scenario we] would do full disaster recovery to Europe or Asia — which we now have more practice doing.”
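The physics behind that answer: if every write has to be synchronously acknowledged by all active facilities, write latency is bounded below by the round-trip time between them, which is why active-active sites must sit close together while a distant region can only serve as a disaster-recovery target. A back-of-the-envelope illustration (the distances are examples, not Cloudflare's actual site spacing):

```python
# Rough lower bound on synchronous write latency from inter-facility distance.
# Light in fibre covers roughly 200 km per millisecond one way; real paths add
# routing and processing overhead on top of this physical floor.
FIBRE_KM_PER_MS = 200.0

def min_rtt_ms(distance_km: float) -> float:
    """Theoretical minimum round-trip time over fibre, in milliseconds."""
    return 2 * distance_km / FIBRE_KM_PER_MS

# Illustrative distances only.
for label, km in [("metro-area facilities", 30), ("US west coast to Europe", 8000)]:
    print(f"{label}: >= {min_rtt_ms(km):.1f} ms added to every synchronous write")
```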
The full post-mortem deserves a read and is here.