Global Teams outage: “Broken connection” to internal storage blamed

A global Teams outage early Thursday was triggered by a “recent deployment [that] contained a broken connection to an internal storage service,” Microsoft has admitted, causing “downstream impact to multiple Microsoft 365 services with Teams integration, such as Microsoft Word, Office Online and SharePoint Online.”

Users were left with notices that the “operation failed with unexpected error” or “we ran into a problem. Reconnecting…”, with some reporting that, when signing into Microsoft portals, their primary tenant was “not even listed in my list of available tenants”. The Teams outage also affected some Windows 365 clients.

After initially directing users to its admin center, where Microsoft provides detail on outages, Redmond belatedly admitted that yes, that was affected by the incident too. (Plus ça change…)

Although the incident appears to have been global, it occurred outside of the US and European working days and most reports from affected users stemmed from Asia.

Among those affected was the Philippines Supreme Court, which lost access to its emails.

The impact of the Teams outage continued after Microsoft rolled back what appears to have been, in essence, a broken software update, with disruption lasting up to three hours for some users.

Complex dependencies can make software updates to sprawling, heterogeneous user bases challenging. Microsoft explained the safety buffers it uses to avoid this kind of incident in its Azure cloud services in 2020, after an authentication-related outage. It noted at the time that “Azure AD is designed to be a geo-distributed service deployed in an active-active configuration with multiple partitions across multiple data centers around the world, built with isolation boundaries. Normally, changes initially target a validation ring that contains no customer data, followed by an inner ring that contains Microsoft only users, and lastly our production environment. These changes are deployed in phases across five rings over several days.

“In this case, the SDP [safe deployment process] system failed to correctly target the validation test ring due to a latent defect that impacted the system’s ability to interpret deployment metadata. Consequently, all rings were targeted concurrently… Within minutes of impact, we took steps to revert the change using automated rollback systems which would normally have limited the duration and severity of impact. However, the latent defect in our SDP system had corrupted the deployment metadata, and we had to resort to manual rollback processes.”
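The failure mode Microsoft describes — a metadata-interpretation defect causing every ring to be targeted at once instead of in sequence — can be illustrated with a minimal sketch. This is not Microsoft's SDP code; the ring names, `completed_rings` field, and both functions are hypothetical, chosen only to show how a phased rollout differs from the buggy "deploy everywhere" fallback:

```python
# Illustrative sketch of a five-ring phased rollout (assumed ring names,
# not Microsoft's actual SDP implementation).

RINGS = ["validation", "internal-users", "prod-1", "prod-2", "prod-3"]

def rings_to_target(deployment_metadata: dict) -> list[str]:
    """Healthy path: target only the next ring not yet completed."""
    completed = deployment_metadata.get("completed_rings", [])
    for ring in RINGS:
        if ring not in completed:
            return [ring]      # one ring at a time, in order
    return []                  # rollout finished

def buggy_rings_to_target(deployment_metadata: dict) -> list[str]:
    """A latent defect of the kind described: the interpreter cannot
    read the metadata and falls back to targeting all rings at once."""
    return RINGS

# A fresh change should hit only the validation ring first:
print(rings_to_target({"completed_rings": []}))       # ['validation']
# The defective path targets every ring concurrently:
print(buggy_rings_to_target({"completed_rings": []}))
```

The point of the healthy path is that a bad change is caught in the validation ring, which contains no customer data, before it can reach production; the incident occurred because the selection step itself misread its inputs.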
