"Banish politics": Learning lessons from outages

Major incidents are major moments of organisational learning. Direct or supply chain attacks, like SolarWinds, are damaging, but they also serve to help upgrade defences as holes are filled, patches applied and better policies enacted, writes Eduardo Crespo, Vice President of EMEA, PagerDuty. The same can be said of incidents accidentally caused by the organisation itself, such as the July 19 global IT outage. These can all teach us a considerable amount about the digital matters within our own control as software vendors or those tasked with responsibility for critical digital infrastructure.

Plan

A comprehensive incident response plan should outline all the necessary steps and protocols your operations team must take based on the infrastructure, services and personnel on hand. A structured plan helps organisations reduce the impact on customers (internal or external) and ensure the fastest, smoothest recovery process possible.

Where novel events occur, whether experienced by the organisation or merely observed, operations leadership must review their plans to ensure that they can prevent, mitigate or avoid anything similar occurring to them. There’s only one major failing of leadership in this situation: “We didn’t think that would happen to us.”

Digital infrastructure is becoming deeply complex and intertwined with cloud, data and AI resources dependent on third parties. The chance of an unexpected cascade toppling a service just keeps growing. Therefore, it’s more important than ever to understand the whole digital infrastructure and to plan for potential issues with appropriate automation.

Learn

Incidents are learning experiences. Anyone focussed solely on immediate restoration, without embedding those lessons into corporate knowledge and best practices, is a hindrance to building a more resilient enterprise. Resilience is an aspect of digital maturity, which is the organisation’s ability to create value through its digital resources. It is also a key predictor of success for companies launching a digital transformation.

A digitally mature enterprise combines a few things:

The right tooling – of which automation is critical – for rapid, effective incident response.
Effective DevOps practices, including a full-service ownership attitude.
Holistic or cross-functional organisational processes and responses.
A continuous learning culture that prioritises understanding, growth and improvement rather than blame and punishment.

Perform

The operations team must be able to see and understand the patterns making up each incident, regardless of who is on duty at the time. They must be able to focus on their tasks and, ideally, have the learnings packaged as part of the automatic logging process. Then, during the postmortem (or the better-termed ‘post-incident review’), they can examine root causes, understand related factors and, ultimately, refine and promote better solutions for future incidents.

Automation is the lifeline as complexity increases, and teams will want automation to support tasks such as suggesting queries or sharing unexpected data points that humans may overlook. Ultimately, anything that improves human performance frees up more focus for higher-level challenges, strategic imperatives and rigorous adherence to preventive and predictive maintenance.

Commit to the process

Ultimately, outages can increase digital resilience when handled maturely. Without the right attitude from technical and business leadership, however, there is the risk of merely limping on, or worse: a spiralling or cascading set of events impacting users and customers.

Technology can be bought and skills can be acquired, but the cultural element is also a critical part of becoming a mature and, therefore, resilient organisation. Leaders may wish to start with that human cognition element, and refer back to it often, if they really want incidents to become beneficial learning experiences.