A DDoS attack triggered Azure outage - and Microsoft's defences finished the job

New details of global mega-outage revealed, with Microsoft blaming disastrous "usage spike" on an implementation error in its own response to a cyberattack.

Microsoft teams are rushing to fix the issue (Image: Unsplash)

A sweeping Microsoft outage brought down Azure network infrastructure across the world yesterday.

Now Microsoft has revealed that the incident's "trigger event" was a DDoS attack. Unfortunately, an implementation error in its defences then "amplified the impact of the attack rather than mitigating it."

After its defences backfired, Microsoft saw an "unexpected usage spike" hit components of Azure Front Door (AFD), which Redmond describes as its "modern" cloud Content Delivery Network (CDN), and of Azure Content Delivery Network (CDN). It observed those components "performing below acceptable thresholds, leading to intermittent errors, timeout, and latency spikes."

The incident brought down Azure App Services, Application Insights, Azure IoT Central, Azure Log Search Alerts, Azure Policy, as well as the Azure portal itself and "a subset" of Microsoft 365 and Microsoft Purview services.

Azure first announced the outage in a rolling status update that confirmed the incident's global impact across the Americas, APAC, Europe and the Middle East. That update page was a must-watch as the mega-outage swept the world.

"We are investigating reports of issues connecting to Microsoft services globally," Microsoft wrote at the beginning of the saga. "Customers may experience timeouts connecting to Azure services. We have multiple engineering teams engaged to diagnose and resolve the issue."

When did the Azure and Microsoft outage take place?

The most recent update, a mitigation statement from Azure, noted that the problems happened between 11:45 and 19:43 UTC on 30 July 2024, when "a subset of customers may have experienced issues connecting to Microsoft services globally."

It wrote: "While the initial trigger event was a Distributed Denial-of-Service (DDoS) attack, which activated our DDoS protection mechanisms, initial investigations suggest that an error in the implementation of our defenses amplified the impact of the attack rather than mitigating it."

Microsoft went on to describe its mitigation efforts and provide more details of the initial cause of the incident.

"Once the nature of the usage spike was understood, we implemented networking configuration changes to support our DDoS protection efforts, and performed failovers to alternate networking paths to provide relief. Our initial network configuration changes successfully mitigated majority of the impact by 14:10 UTC. Some customers reported less than 100% availability, which we began mitigating at around 18:00 UTC.

"We proceeded with an updated mitigation approach, first rolling this out across regions in Asia Pacific and Europe. After validating that this revised approach successfully eliminated the side effect impacts of the initial mitigation, we rolled it out to regions in the Americas. Failure rates returned to pre-incident levels by 19:43 UTC - after monitoring traffic and services to ensure that the issue was fully mitigated, we declared the incident mitigated at 20:48 UTC. Some downstream services took longer to recover, depending on how they were configured to use AFD and/or CDN."
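For readers who want to see how long each stage took, the timestamps in Microsoft's statement can be turned into elapsed times with a few lines of Python. This is only an illustrative sketch: the times are those quoted above, while the milestone labels and the helper names `utc` and `elapsed` are our own shorthand, not Microsoft's terminology.

```python
from datetime import datetime, timezone

# Timestamps as quoted in Microsoft's mitigation statement (all UTC, 30 July 2024).
INCIDENT_DAY = (2024, 7, 30)


def utc(hour, minute):
    """Build a timezone-aware UTC datetime on the day of the incident."""
    return datetime(*INCIDENT_DAY, hour, minute, tzinfo=timezone.utc)


start = utc(11, 45)

# Labels are our own paraphrases of the statement, not official milestone names.
milestones = {
    "majority of impact mitigated": utc(14, 10),
    "residual availability mitigation begins (approx.)": utc(18, 0),
    "failure rates back to pre-incident levels": utc(19, 43),
    "incident declared mitigated": utc(20, 48),
}


def elapsed(since, until):
    """Return the gap between two datetimes as (hours, minutes)."""
    total_minutes = int((until - since).total_seconds() // 60)
    return divmod(total_minutes, 60)


for label, moment in milestones.items():
    hours, minutes = elapsed(start, moment)
    print(f"{moment:%H:%M} UTC  +{hours}h{minutes:02d}m  {label}")
```

On these figures, just under eight hours passed between the first reported impact and failure rates returning to normal, and a little over nine hours before Microsoft declared the incident mitigated.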

The outage "demonstrates the ease at which DDoS actors can wreak havoc against critical business services", Donny Chong, Director at Nexusguard, told The Stack.

“Anyone can carry out an attack of this magnitude from their own bedroom if they have the right equipment,” Chong said. “While no company can guarantee the always-on availability of its cloud services, customers of these services have high expectations today, and that’s exactly what attackers are counting on.

“This latest outage should serve as a wake-up call for any company with global infrastructure to go on the offensive and take a proactive stance toward adapting its digital infrastructure to be more resistant to new forms of attack.”

MORE TO FOLLOW - THIS STORY IS BEING UPDATED
