Google faced a series of global outages today (Friday November 12), with Gmail down (IMAP servers not responding) for some users, and Cloud SQL and Google Cloud Console down for others. The issues appear to have begun around 8:30am BST, with support recognising it as “an issue with Google Cloud infrastructure components” and the incident apparently largely resolved by 11:38am BST.
While the Gmail outage attracted the most immediate attention from users globally, the incident also affected the following services for some hours, with Google Cloud saying at 10:57am BST:
- Cloud App Engine: Customers may see traffic drop for us-central1 and europe-west1
- Cloud Bigtable: Mitigation still in progress, ETA for resolution still unknown.
- Cloud Monitoring UI: There is a mitigation in place at the GFE infrastructure level that is rolling out and is expected to resolve this issue.
- Cloud Console: All Cloud Console paths may be unavailable.
- Cloud Spanner: Customers coming through GFE (not CFE or cloud interconnect) will experience UNAVAILABLE error and latency for both DATA and ADMIN operations
- Cloud Functions: Customers may see traffic drop for us-central1 and europe-west1.
- Cloud Run: Cloud Run users are seeing increased HTTP 500s and authentication failures when trying to access apps.
- Google Cloud Endpoints: Cloud Endpoints may be unavailable in europe-west1 and europe-west4 (most affected regions)
- Cloud SQL: Regions europe-west1, europe-west4 and europe-west5 (could be more). Workaround: Users should retry failed operations. Our engineering team continues to investigate the issue. We will provide an update by Friday, 2021-11-12 03:30 US/Pacific with current details. We apologize to all who are affected by the disruption.
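Google's suggested workaround, retrying failed operations, is usually implemented with exponential backoff so that clients do not pile extra load onto a service that is already struggling. The following is a minimal sketch of that general pattern in Python, not anything Google published: `TransientError`, `retry_with_backoff` and its parameters are hypothetical names chosen for illustration, and a real client would need to map its own library's errors (for example UNAVAILABLE or HTTP 500 responses) onto the retryable case.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. UNAVAILABLE, HTTP 500)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() and retry on TransientError with exponential backoff.

    Before retry number n (counting from 0) the wait is a random fraction of
    min(max_delay, base_delay * 2**n), so simultaneous clients spread out.
    The sleep and rng parameters exist only to make the function testable.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

A caller would wrap the failing request in a function that raises `TransientError` on a retryable failure and pass it in, for example `retry_with_backoff(run_query)`, where `run_query` is a placeholder for the application's own call. The random jitter matters during a large outage: without it, many clients retry in lockstep and can slow the service's recovery.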
Many of the issues appeared to have been resolved by ~11:48am BST, with Google App Engine (a service for developing and hosting web applications in Google-managed data centres) the last to recover.
Software updates that reach production after slipping through the checks designed to catch problems are regularly to blame for this kind of outage. Slack, AWS, Azure, Fastly and Facebook have all faced outages this year. Fastly blamed a “service configuration” issue; Azure blamed a March outage on “a service update targeting an internal validation test ring [that] was deployed, causing a crash upon startup in the Azure AD backend services. A latent code defect in the Azure AD backend service Safe Deployment Process (SDP) system caused this to deploy directly into our production environment, bypassing our normal validation process”. Meanwhile, AWS customers are still awaiting a post-incident write-up for a sustained outage in the US-EAST-1 region in September 2021.