CrowdStrike RCA leaves a lacuna – firm rebuts vuln claims

CrowdStrike has published its final Root Cause Analysis (RCA) into how it released an update that crashed millions of Windows machines globally.

It also separately published a blog post rebutting claims by Chinese cybersecurity firm Qihoo360 that the bug was exploitable.

The CrowdStrike RCA highlights five key technical causes of the incident.

These have been widely rehashed and relitigated elsewhere.

The incident boils down to CrowdStrike's failure to spot an input validation mismatch between a newly deployed threat detection configuration and its agent running in kernel mode. That mismatch triggered an out-of-bounds read.
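For readers who want the failure class spelled out, here is a minimal Python sketch of the kind of check that catches such a mismatch before any read takes place. It is an illustration only, not CrowdStrike's code: the function names and the field counts are hypothetical, and Python raises an error on a bad index where kernel-mode code would read memory it should not.

```python
def validate_rule(rule_field_indexes, supplied_input_count):
    """Return the field indexes a rule references that the caller cannot supply.

    Hypothetical helper: a rule is modelled simply as a list of input indexes.
    """
    return [i for i in rule_field_indexes if i >= supplied_input_count]


def evaluate(rule_field_indexes, inputs):
    """Read each referenced input, rejecting out-of-range indexes up front."""
    missing = validate_rule(rule_field_indexes, len(inputs))
    if missing:
        # Fail closed: refuse the rule rather than read past the end of inputs.
        raise ValueError(f"rule references missing input fields: {missing}")
    return [inputs[i] for i in rule_field_indexes]
```

The point of the sketch is the ordering: the configuration is checked against what the consumer actually supplies, not only against what the schema says it could supply.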

CrowdStrike’s promised mitigations include “test cases with additional scenarios that better reflect production usage” – always welcome.

Zack Allen, a security research director at Datadog, has also created a tidy visual synopsis of the RCA’s findings, shared below with kind permission.

The CrowdStrike RCA immediately faced criticism for its verbosity and for failing to address the organisational root causes that led to the issue.

The digested verdict from social media?

An architect: “A very long explanation for an out of bounds read, because of insufficient testing procedures and negligence in the kernel code.”

A lecturer: “What a bunch of absolute corporate yap for basic issues.”

Marketing appeared to have had a hand in what should have been a lucid exposition of the incident and the lessons learned from the crisis.

(The first paragraph of its explanatory section begins by telling readers “The CrowdStrike Falcon sensor delivers powerful on-sensor AI and machine learning models to protect customer systems by identifying and remediating the latest advanced threats.” No mention of the regex engine powering its agent’s “Content Interpreter” at this point, of course…)

QA: Getting an independent review

CrowdStrike said that it is “conducting an independent review of the end-to-end quality process from development through deployment.”

But many noted the RCA's lack of reflection on the organisational setup that led to the chain of control failures. (“‘Let's actually load the channel file onto an actual Falcon instance on an actual Windows machine’ was a bridge too far in testing rigor,” as one observer wrote. But why?)

As Lorin Hochstein wrote, reflecting on the CrowdStrike RCA: “Systems reach the current state that they’re in because, in the past, people within the system made rational decisions based on the information they had at the time, and the constraints that they were operating under.”

The software engineer added: “The only way to understand how incidents happen is to try and reconstruct the path that the system took to get here, and that means trying to as best as you can to recreate the context that people were operating under when they made those decisions.”

Whether CrowdStrike is trying to do this remains an open question.

Staged deployments ftw

Customers may be reassured, however, that it belatedly plans “staged deployment” in future that “mitigates impact if a new Template Instance causes failures such as system crashes, false-positive detection volume spikes or performance issues… New Template Instances that have passed canary testing are to be successively promoted to wider deployment rings or rolled back if problems are detected,” the RCA says.
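The promote-or-roll-back logic the RCA describes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not CrowdStrike's pipeline: the ring names, the `staged_rollout` function and the `is_healthy` callback are all hypothetical, and a real system would gate each promotion on telemetry collected over time.

```python
def staged_rollout(rings, is_healthy):
    """Promote a release ring by ring, stopping at the first unhealthy ring.

    rings: ordered list of ring names, smallest (canary) first.
    is_healthy: callable taking a ring name and returning True if no
        problems were detected there after deployment.
    Returns (status, rings_that_received_the_release).
    """
    deployed = []
    for ring in rings:
        deployed.append(ring)
        if not is_healthy(ring):
            # Stop promoting; the caller rolls back every ring in `deployed`.
            return "rolled_back", deployed
    return "promoted", deployed
```

A problem caught in an early ring never reaches the wider ones, which is the whole appeal: the blast radius is capped at whichever ring first shows trouble.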

CrowdStrike’s Falcon platform has now also “been updated to provide customers with increased control over the delivery of Rapid Response Content,” the firm said. (Rapid Response Content is intelligence used to “augment novel detections and preventions on the sensor without requiring sensor code changes”; it is delivered as channel files interpreted by a regex engine in the kernel.)

eBPF has entered the chat

Debate continues to rage, meanwhile, over whether eBPF might be the answer to this kind of issue. eBPF is a way to run sandboxed programs in the Linux kernel without changing kernel source, and is widely used by cloud security companies like Sysdig. eBPF for Windows is a “work in progress”.

But as Intel fellow Brendan Gregg puts it cheerfully on his blog: “Once Microsoft's eBPF support for Windows becomes production-ready, Windows security software can be ported to eBPF as well. These security agents will then be safe and unable to cause a Windows kernel crash.”

Critics say that whatever the tech deployed, it will require developers and their employers to ensure secure coding practices and robust testing. (Microsoft has meanwhile also set out its own views on what security vendors should and shouldn’t do in kernel mode.)
