How Fidelity’s “chaos buffet” pushed AWS to new Lambda tools

Fidelity International, a $925 billion financial services firm, operates a “chaos buffet” platform dedicated to stress-testing its cloud applications.

But the company until recently struggled to extend its chaos engineering* approach to applications powered by AWS Lambda serverless functions.

Fidelity has over 7,700 applications (75% of its estate) in the public cloud. Among these are its trade order management system, which relies heavily on Lambda's serverless, event-driven compute service.

Its “chaos buffet” approach (built on a range of techniques and approaches) sees it regularly run chaos testing of over 3,000 applications.

This has helped it test over 120,000+ varieties of potential fault to understand the impact of various types of incident on the business.

That’s according to Keith Blizard, head of site reliability engineering (SRE) and production services, speaking at an AWS re:Invent 2024 session.

"Everything breaks..."

He emphasised that a shift to the cloud entailed a very different approach to building resilience: “We used to build the resiliency into the hardware… as we moved to the cloud, we had to change that mindset.

“Because whether it is latency [issues], IAM failures… it will happen; everything breaks [at some point]” said Blizard on December 4.

“You need the right level of re-try logic. The thresholds and limits across an ecosystem [are not just about] an app-to-a-database. They’re about tens to hundreds of microservices connecting. [You need to understand failure points] across a business flow/business process you’re supporting.”

See also: AWS root default overhaul has cloud security leaders saying “thank f***”

Joining him on stage was AWS technical manager and Morgan Stanley veteran Joe Cho who explained how Fidelity is a big user of AWS’s serverless Lambda for asynchronous processing including of trades.

He said AWS and Fidelity had asked “can we ‘shift left’ earlier in the resilience lifecycle?” That's the essence of chaos engineering…”

Missing: A way to test Lambda failures

More specifically, Blizard explained: “[We needed a way to] execute faults in the CI/CD pipeline… not every dev needs to be an expert. We just give developers and squads configuration profiles to test against. We had to build a mechanism… to take decision making away from individuals.

“But what was missing for us was… a way to test Lambda – I really love Lambda. It’s been a really great capability. Allows us to just execute code.

“But it was really hard to include a fault without updating the code. In a system like this [Fidelity’s trade order management system] Lambda was used to synchronise the data, so if it had high latency, we needed to understand from a business focus flow, what would be affected…”

See also: AWS outage saw “cell management” system get flustered by big shards

Whilst AWS has its “Fault Injection System” or FIS for chaos engineering it lacked the ability to do much with Lambda, Fidelity initially found.

Its team sat down with AWS to try and come up with a solution.

Ultimately, the two worked out that they could inject faults into so-called Lambda “layers” that are separate to the business code. For example to emulate latencies that could be caused by, for example, a lack of allocated memory, they could inject a sleep function into the layer. (Layers are created as a python package, published, and attached to the Lambda.)

Prior to this collaboration, AWS customers wanting to run chaos experiments involving Lambda had to 1) modify code and use runtime-specific libraries; 2) Inject faults using self-managed Lambda extensions leveraging the Runtime API proxy pattern, AWS explained.

As a result of Fidelity’s demands, AWS has now built out the ability to run chaos engineering that impacts Lambda via its FIS managed service, using a “chaos” Lambda extension that runs as a separate process within the Lambda execution environment, and intercepting invocations before they reach the runtime, it explained. (That proposition went GA in November.)

As Blizard explained, pointing to the slide above: "When you talk about cloud outages, senior leaders they think it's that bottom left. [An entire outage]. We do have regional failures [but] Amazon is spending a lot of time enabling native active-active, active-passive regional failover capabilities.

"At Fidelity I [focus on] the circle on the right to understand and start building out controls to drive resilience across the environment."

Being able to now inject a range of "faults" into Lambda via FIS is invaluable, he concluded.

Join peers following The Stack on LinkedIn

*Chaos engineering involves intentionally injecting faults into a system to understand how it responds to failures and test its resilience.