CrowdStrike promises RCA as C++ null pointer claim contested

"We are committed to identifying any foundational or workflow improvements" says EDR firm.

Updated July 24: See also our take on the preliminary findings here, CrowdStrike's analysis here and further insight on the "null bytes" here.

CrowdStrike says it is conducting a “thorough root cause analysis” to understand how a threat detection update it pushed out to its millions of end-users on Friday crippled computers globally – grounding flights, cancelling cancer operations, halting deliveries, freezing cash machines and triggering an emergency UK government “COBRA” meeting.

It was the second CrowdStrike update in just three weeks to crash customers’ computers; most are set to update automatically. Microsoft estimates that this one hit 8.5 million Windows devices.

The update was “designed to target newly observed, malicious named pipes being used by common C2 frameworks in cyberattacks”, the endpoint detection and response (EDR) company said in an update early on Saturday – a comment that security researchers swiftly, though speculatively, linked to a significant release (4.10) of Cobalt Strike days earlier.

Multiple engineers, analysing stack dumps, initially identified the issue as a null pointer bug in the C++ code the CrowdStrike update was written in: the software appeared to have tried to access an invalid region of memory, which results in the process being immediately killed by Windows. That take looked increasingly contested, however, and CrowdStrike itself said that the incident was not due to "null bytes contained within Channel File 291 [the update that triggered the crashes] or any other Channel File."
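For readers following that debate, the failure mode the engineers were describing is straightforward to illustrate. Below is a minimal, purely illustrative C++ sketch (the struct and function names are invented; this is not CrowdStrike's code) showing how dereferencing a null pointer crashes a process, and the kind of guard that avoids it.

```cpp
#include <cstdio>

// Hypothetical parsed record from a content update. Invented for
// illustration only; not CrowdStrike's actual data structures.
struct DetectionRecord {
    const char* pattern;  // pointer that may legitimately be absent
};

// Unsafe: dereferences the pointer without checking it. If rec is
// nullptr, reading rec->pattern touches an invalid address and the OS
// kills the process (an access violation on Windows, a segfault on
// Linux). The same mistake in kernel-mode code takes the machine down.
void process_unchecked(const DetectionRecord* rec) {
    std::printf("pattern: %s\n", rec->pattern);
}

// Safer: validate the pointer (and its contents) before use.
void process_checked(const DetectionRecord* rec) {
    if (rec == nullptr || rec->pattern == nullptr) {
        std::fprintf(stderr, "malformed record, skipping\n");
        return;
    }
    std::printf("pattern: %s\n", rec->pattern);
}

int main() {
    DetectionRecord ok{"example-named-pipe"};
    process_checked(&ok);        // prints the pattern
    process_checked(nullptr);    // logged and skipped
    // process_unchecked(nullptr);  // would be terminated by the OS
    return 0;
}
```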

Restoration has been challenging for many, if not all, of those impacted. Azure, AWS, and other cloud providers running Windows instances with the CrowdStrike agent installed were among those heavily affected, triggering further knock-on consequences for cloud users.

“We are doing a thorough root cause analysis to determine how this logic flaw occurred. This effort will be ongoing. We are committed to identifying any foundational or workflow improvements that we can make to strengthen our process. We will update our findings in the root cause analysis as the investigation progresses,” CrowdStrike said on Saturday.

It has teamed up with Intel to “remediate affected hosts remotely using Intel vPro and with Active Management Technology,” it added (see here).

See also: Solarwinds and its CISO are not off the hook in court ruling. Here's what you missed

Omkhar Arasaratnam, general manager of the Open Source Security Foundation (OpenSSF), told The Stack: “Monocultural supply chains (single operating system, single EDR) are inherently fragile and susceptible to systemic faults – as we've seen. Good system engineering tells us that changes in these systems should be rolled out gradually, observing the impact in small tranches versus all at once. More diverse ecosystems can tolerate rapid change as they're resilient to systemic issues,” he added.
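The gradual rollout Arasaratnam describes is often implemented as ring or canary deployment. The sketch below is hypothetical C++ (invented names and thresholds, not any vendor's actual mechanism) showing the core idea: hosts are deterministically bucketed so only a small tranche receives a change first, with wider rings unlocked only once the earlier ones look healthy.

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Hypothetical rollout ring: the cumulative share of the fleet eligible
// for an update at each stage. Names and thresholds are invented.
struct RolloutRing {
    std::string name;
    double fraction;  // e.g. 0.01 == 1% of hosts
};

// Deterministically map a host ID to a bucket in [0, 1), so the same
// machines always land in the earliest rings.
double bucket(const std::string& host_id) {
    return (std::hash<std::string>{}(host_id) % 10000) / 10000.0;
}

// A host receives the update only once its bucket falls inside the
// current ring's fraction, so a fault surfaces on a small tranche of
// machines before the whole fleet is exposed.
bool eligible(const std::string& host_id, const RolloutRing& ring) {
    return bucket(host_id) < ring.fraction;
}

int main() {
    const std::vector<RolloutRing> rings = {
        {"canary", 0.01}, {"early", 0.10}, {"broad", 0.50}, {"full", 1.00}};
    const std::vector<std::string> hosts = {"host-a", "host-b", "host-c", "host-d"};

    // In practice, promotion from one ring to the next would be gated on
    // health signals (crash rates, telemetry) from the previous ring.
    for (const auto& ring : rings) {
        std::printf("ring %s:", ring.name.c_str());
        for (const auto& h : hosts) {
            if (eligible(h, ring)) std::printf(" %s", h.c_str());
        }
        std::printf("\n");
    }
    return 0;
}
```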

Aleksandr Yampolskiy, SecurityScorecard CEO, pointed out that research his firm conducted with McKinsey showed that 62% of the global attack surface is concentrated in the products and services of just 15 companies. (To be crystal clear, this was not, of course, an "attack surface" incident per se, but as a measure of systemic risk it is nonetheless a notable figure.)

Both the NCSC and CrowdStrike warned that phishing and other malicious attacks associated with the outage had already begun, whilst heated opining about the legitimacy and efficacy of EDRs that add “invasive kernel drivers on top of consumer operating systems” continued across social media (the worst take being one that entirely arbitrarily blamed “DEI” hires).

Scott Hanselman, VP of Developer Community at Microsoft, said on X, responding to that latter take: “I’ve been coding 32 years. When something like this happens it’s an organizational failure. Yes, some human wrote a bad line. Someone can ‘git blame’ and point to a human and it’s awful.

"But it’s the testing, the Cl/CD, the A/B testing, the metered rollouts, an oh shit button to roll it back, the code coverage, the static analysis tools, the code reviews, the organizational health, and on and on [that count].

“It’s always one line of code but it’s NEVER one person... Engineering practices failed to find a bug multiple times, regardless of the seniority of the human who checked that code in. Solving the larger system thinking SDLC matters more than the null pointer check. This isn’t a ‘git gud C++ is hard’ issue and it damn well isn’t a DEI one,” he added. (Amen.)

Views on the incident, detection logic updates, Windows drivers, kernel mode EDR, QA, anything else you'd like to share? We're all ears.
