Learning lessons from outages

Major incidents are major moments of organisational learning. Direct or supply chain attacks, like SolarWinds, are damaging, but they also serve to help upgrade defences as holes are filled, patches applied and better policies enacted, writes Eduardo Crespo, Vice President of EMEA, PagerDuty. The same can be said of incidents accidentally caused by the organisation itself, such as the July 19 global IT outage. These can all teach us a considerable amount about the digital matters within our own control as software vendors or those tasked with responsibility for critical digital infrastructure.

Plan

A comprehensive incident response plan should outline the steps your operations team must take and the protocols it must follow, based on the infrastructure, services and personnel on hand. A structured plan helps organisations reduce the impact on customers (internal or external) and ensure the fastest, smoothest recovery possible.

Where novel events occur, whether experienced by the organisation or merely observed, operations leadership must review their plans to ensure that they can prevent, mitigate or avoid anything similar happening to them. There’s only one major failing of leadership in this situation: “We didn’t think that would happen to us.”

Digital infrastructure is becoming deeply complex and intertwined, with cloud, data and AI resources dependent on third parties. The likelihood of an unexpected cascade toppling a service keeps growing. It is therefore more important than ever to understand the whole digital infrastructure and to plan how potential issues will be addressed, with appropriate automation.

Learn

Incidents are learning experiences. Anyone solely focussed on immediate restoration — without embedding those lessons into corporate knowledge and best practices — is a hindrance to building a more resilient enterprise. Resilience is an aspect of digital maturity, which is the organisation’s ability to create value through its digital resources. It is also a key predictor of success for companies launching a digital transformation. 

A digitally mature enterprise combines a few things: 

  • The right tooling – of which automation is critical – for rapid, effective incident response.
  • Effective DevOps practices, including a full-service ownership attitude. 
  • Holistic or cross-functional organisational processes and responses. 
  • A continuous learning culture that prioritises understanding, growth and improvement rather than blame and punishment.

Digital maturity is a prerequisite for quality preventive maintenance. Digitally mature organisations tend to do a few things very well, including having the time and headspace to apply a learning focus to their operations processes.

  • Automation and strong processes allow organisations to understand and act on patterns found within their digital infrastructure and resources. This information is used to refine and improve their operations to create greater resiliency.
  • Pattern spotting goes beyond the technical layer. If your operations team is continually stretched firefighting, there’s a non-trivial risk of burnout.
  • Analyse team members’ working patterns while mapping knowledge bottlenecks and where skills lie. Resilience stems from the interplay between people, organisational processes and digital infrastructure, so mistreating operations personnel adds directly to the risk of digital failure.
  • Find best practices that help the organisation commit to learning, with progress measured and tracked through metrics such as mean time to remediation.
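As a rough illustration of tracking a metric such as mean time to remediation, the sketch below averages the gap between detection and resolution across a set of incidents. The incident timestamps and the function name `mean_time_to_remediation` are hypothetical, invented for this example; real figures would come from whatever incident tooling the organisation uses, and teams may define the start and end of "remediation" differently.

```python
from datetime import datetime, timedelta

# Hypothetical incident records as (detected_at, resolved_at) pairs.
# These values are made up purely for illustration.
incidents = [
    (datetime(2024, 7, 1, 9, 0), datetime(2024, 7, 1, 10, 30)),    # 90 minutes
    (datetime(2024, 7, 8, 14, 0), datetime(2024, 7, 8, 14, 45)),   # 45 minutes
    (datetime(2024, 7, 19, 6, 0), datetime(2024, 7, 19, 9, 45)),   # 225 minutes
]


def mean_time_to_remediation(records):
    """Return the average detection-to-resolution duration as a timedelta."""
    if not records:
        raise ValueError("at least one incident is required")
    # Sum the individual durations, starting from a zero timedelta.
    total = sum(
        (resolved - detected for detected, resolved in records),
        timedelta(),
    )
    return total / len(records)


mttr = mean_time_to_remediation(incidents)
print(f"Mean time to remediation: {mttr}")
```

Tracked over successive quarters, a figure like this gives a simple signal of whether lessons from past incidents are shortening recovery, although it says nothing on its own about severity or customer impact.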

Learning also requires culture, which is always the hardest factor to shape. When incidents occur, the most important thing is fixing, learning and applying that knowledge institutionally. Finger-pointing and ceremonial sacrifices are not a mature response. Bias towards concrete action and banish politics. Leaning into organisational management and people dynamics is challenging, slow and hard to measure, but absolutely worthwhile.

Perform

The operations team must be able to see and understand the patterns making up each incident, regardless of who is on duty at the time. Responders must be free to focus on their tasks, ideally with the learnings captured as part of the automatic logging process. Then, during the postmortem (or the better-termed ‘post-incident review’), they can work through root causes, understand related factors and, ultimately, refine and promote better solutions for future incidents.

Automation is the lifeline as complexity increases, and teams will want automation to support tasks such as suggesting queries or surfacing unexpected data points that humans may overlook. Ultimately, anything that improves human performance frees more focus for higher-level challenges, strategic imperatives and rigorous adherence to preventive and predictive maintenance.

Commit to the process

Ultimately, outages can increase digital resilience when handled maturely. Without the right attitude from technical and business leadership, however, there is the risk of merely limping on, or worse: a spiralling or cascading set of events impacting users and customers.

Technology can be bought and skills can be acquired, but the cultural element is just as critical to becoming a mature and, therefore, resilient organisation. Leaders may wish to start with that human element, and refer back to it often, if they really want incidents to become beneficial learning experiences.
