CrowdStrike has slipped out a preliminary post-incident review (PIR) into how a flawed update made it through its validation system before bringing over 8.5 million Windows systems to their knees last week.
The PIR shows that the Endpoint Detection and Response firm relied on a “content validator” to approve certain updates, rather than a deployment strategy that included actual tests of how detection updates performed when pushed to endpoints dynamically loading code in the kernel.
CrowdStrike has now pledged to implement a far more robust testing and deployment process for its Falcon engine’s “Rapid Response” content updates, including giving customers greater control over deployments.
The PIR was published at 03:35 UTC, meaning the bulk of the world’s technology community has yet to dive into it, but it raises big questions.
In a section titled “how do we prevent this from happening again” it gave one example as: “Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.”
Customers will be wondering why this was not happening previously.
Ditto a promise to now start doing “stress testing, fuzzing and fault injection” and “stability testing” on its “Rapid Response” updates.
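For readers unfamiliar with the term, a staggered deployment with a canary stage can be sketched roughly as below. This is a minimal illustration, not CrowdStrike's process: the stage fractions, the crash-rate threshold, and the `deploy` and `crash_rate` callbacks are all hypothetical names invented for the example.

```python
import hashlib

# Hypothetical cumulative share of the fleet reached at each stage;
# the first (1%) stage plays the role of the canary.
ROLLOUT_STAGES = [0.01, 0.10, 0.50, 1.00]
MAX_CRASH_RATE = 0.001  # assumed abort threshold, not a real CrowdStrike value


def host_bucket(host_id: str) -> float:
    """Map a host ID to a stable value in [0, 1) so stage membership is repeatable."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def staged_rollout(hosts, deploy, crash_rate):
    """Deploy stage by stage and stop as soon as the crash rate looks unhealthy.

    `deploy(host)` pushes the update to one host; `crash_rate(updated)` returns
    the observed crash rate among hosts updated so far. Both are supplied by
    the caller. Returns the list of hosts that received the update.
    """
    updated = []
    done = set()
    for fraction in ROLLOUT_STAGES:
        batch = [h for h in hosts if h not in done and host_bucket(h) < fraction]
        for host in batch:
            deploy(host)
            done.add(host)
            updated.append(host)
        # Check health before widening the rollout to the next stage.
        if updated and crash_rate(updated) > MAX_CRASH_RATE:
            break
    return updated
```

The point of the design is that a bad update which crashes the canary hosts never reaches the later, larger stages.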
Notably, CrowdStrike appears to have known it had gaps here.
It has been recruiting release engineers to “diagnose and debug complex issues” and focus on “making improvements in the release process”, adverts posted before the crash show.
A senior release engineer, for example, will have “familiarity with common CI/CD tools such as Jenkins, Git, or Bitbucket” and will “inform staff and management of multiple worldwide business units (including subject matter experts, design teams, and technology teams) of release risk and work to effectively mitigate it”, an advert posted in June 2024 shows.
CrowdStrike Rapid Response updates?
Its Rapid Response updates help provide “behavioral pattern-matching operations on the sensor… stored in a proprietary binary file that contains configuration data” and CrowdStrike had earlier clarified that the botched update had been released to target “newly observed, malicious named pipes being used by common C2 frameworks in cyberattacks.”
(That earlier comment was swiftly linked by some security researchers to a significant release (4.10) of the Cobalt Strike framework days earlier.)
Friday’s crisis was down to a Rapid Response Content update which shipped with an “undetected” flaw. The PIR details the release and testing process for both sensor (agent) content and Rapid Response Content.
The former is subjected to automated testing, internal dogfooding and release to early adopters before a full release to customers, who “have the option of selecting which parts of their fleet should install the latest sensor release”. The latter is not.
The PIR explains how CrowdStrike runs new Rapid Response updates through a “Content Validator” tester which performs checks on the content – but “due to a bug in the Content Validator [the botched update in question] passed validation despite containing problematic content data” – something that suggests the update was not tested in production.
The “timeline of events” reaches back to February when a new sensor update was released. Three forms of what amount to threat detection logic were rolled out through April, all performing as expected. But on July 19, 2024, two additional updates were deployed. One, Channel File 291, passed validation despite the buggy content data and was deployed into production, resulting in an out-of-bounds memory read that crashed systems.
Early birds on X and other platforms would at this point be spluttering that this does not sound like a robust testing procedure, amounting to ‘well, it worked last time so this time should be fine’. They will be confirmed in their view by the company’s steps to “prevent this happening again”.
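To make the failure class concrete: an out-of-bounds read typically happens when code trusts a declared size or field count without checking it against the data actually present. The sketch below uses an entirely hypothetical binary layout (a 2-byte field count followed by 32-bit values) and a hypothetical `EXPECTED_FIELDS` constant; the PIR does not describe the real channel file format, and this is not CrowdStrike's validator.

```python
import struct

EXPECTED_FIELDS = 4  # hypothetical number of fields the consuming code reads


def parse_fields(blob: bytes) -> list[int]:
    """Parse a hypothetical blob: little-endian uint16 count, then `count` uint32 values."""
    if len(blob) < 2:
        raise ValueError("blob too short for header")
    (count,) = struct.unpack_from("<H", blob, 0)
    needed = 2 + 4 * count
    # Bounds check: never read past the end of the data actually supplied.
    if len(blob) < needed:
        raise ValueError(f"header declares {count} fields but payload is truncated")
    return list(struct.unpack_from(f"<{count}I", blob, 2))


def validate(blob: bytes) -> bool:
    """Accept content only if it parses cleanly and has the field count the consumer expects."""
    try:
        fields = parse_fields(blob)
    except ValueError:
        return False
    return len(fields) == EXPECTED_FIELDS
```

In Python a bad read raises an exception; in kernel-mode C or C++ code the same missing check can read invalid memory and take the whole system down, which is why a validator alone is a thin safeguard compared with loading the content on a real test machine.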
CrowdStrike has promised a further, full root cause analysis (RCA).
Meanwhile it said it will “Provide customers with greater control over the delivery of Rapid Response Content updates by allowing granular selection of when and where these updates are deployed.”