Inside a tech meltdown that caused Bank Holiday flight chaos

Password problems, data anomalies and legacy system challenges caused a Bank Holiday air traffic outage that grounded 2,000 flights across the UK, an independent inquiry has found.

More than 700,000 passengers suffered delays and cancellations on Monday, 28 August 2023, after both primary and backup air traffic control systems collapsed in just 20 seconds, forcing staff to carry out manual processing.

The cause of the outage was "the inability of the system software to remain in a full operational state" when processing flight plan data for a specific flight from Los Angeles to Paris (Orly), the review found.

Issues with this data resulted in critical exception errors, which reduced the number of flight plans that could be processed from 800 per hour to approximately 60.
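For a sense of scale, the short calculation below works out the capacity drop from the two figures quoted above. The variable names are illustrative, and the manual rate is the approximate figure given in the findings.

```python
# Figures quoted in the inquiry's findings (the manual rate is approximate)
automated_per_hour = 800
manual_per_hour = 60

# Share of normal processing capacity still available under manual working
remaining_share = manual_per_hour / automated_per_hour
# Share of capacity lost when automatic processing stopped
reduction = 1 - remaining_share

print(f"Capacity remaining: {remaining_share:.1%}")
print(f"Capacity lost: {reduction:.1%}")
```

On those numbers, manual processing could handle roughly 7.5% of the normal load, a reduction of about 92.5%.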

When an engineer arrived on the scene to fix the issue, his password details "could not be readily verified due to the architecture of the system".

"Several factors contributed to the technical failure and that it is unlikely that the same unique set of circumstances would ever occur again, and that if they did, due to the actions already taken by NATS, the outcome would be different," said Louise Haigh MP, Secretary of State for Transport.


How data flows between flights and air traffic control

Anatomy of an airline outage

In a Major Incident Investigation Final Report, NATS (formerly National Air Traffic Services) gave a detailed, blow-by-blow account of the technical disaster.

NATS has operated a manual Flight Plan Reception Suite (FPRS) since the late 1990s. It was automated in 2004 and given an extra letter in its name to become FPRSA.

In August 2018, a supplier called Comsoft delivered a new version of the system and completed testing. The resulting FPRSA-R sub-system has now operated continuously since September 2018 and processed more than 15 million flight plans.

"Until the incident of 28th August 2023, there were no delays caused by the use of this system in the operation." NATS wrote.

On that fateful day, all systems were operating normally, and no system upgrades were being implemented.

"Given the nature of the incident, NATS could not have reasonably forecast the subsequent software exception ahead of the incident and there were no indications that this might occur," the incident report continued.

In the UK, air traffic controllers typically receive flight plan data via EUROCONTROL - a pan-European, civil-military organisation which manages the air traffic network across the continent.

Under normal circumstances, EUROCONTROL sends flight plan data from its Integrated Initial Flight Plan Processing System (IFPS) in a format called ATS Data Exchange Presentation (ADEXP). The data is transmitted through a system called the Aeronautical Messaging Switch (AMS-UK) to the FPRSA-R sub-system, which converts it into a format compatible with the UK National Airspace System (NAS).
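As a rough illustration of that hand-off, the sketch below models the chain in Python. The function names, the record fields and the output layout are all invented for illustration; they are assumptions and do not reproduce the real ADEXP or NAS formats.

```python
from dataclasses import dataclass


@dataclass
class FlightPlan:
    callsign: str
    departure: str
    destination: str
    route: list[str]


def ifps_emit(plan: FlightPlan) -> dict:
    """Stand-in for IFPS: publish the plan as an ADEXP-like record (hypothetical fields)."""
    return {
        "format": "ADEXP",
        "callsign": plan.callsign,
        "departure": plan.departure,
        "destination": plan.destination,
        "route": list(plan.route),
    }


def ams_uk_forward(message: dict) -> dict:
    """Stand-in for AMS-UK: relay the message unchanged."""
    return dict(message)


def fprsa_r_convert(message: dict) -> str:
    """Stand-in for FPRSA-R: convert the record to a made-up NAS-style text line."""
    if message["format"] != "ADEXP":
        raise ValueError("unexpected input format")
    route = " ".join(message["route"])
    return f'{message["callsign"]} {message["departure"]}-{message["destination"]} {route}'


# Placeholder callsign and waypoints, not taken from the incident
plan = FlightPlan("ABC123", "KLAX", "LFPO", ["WPT1", "WPT2"])
nas_line = fprsa_r_convert(ams_uk_forward(ifps_emit(plan)))
print(nas_line)
```

The point of the sketch is only the shape of the chain: one system publishes, one relays, and one converts, so a conversion failure at the last step stops plans reaching controllers automatically.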

The UK air traffic control system received a flight plan as usual. However, the data contained "six specific attributes relating to two identical waypoint names".

NATS' Incident Response command-and-control structure

When this "unique combination" collided with "logic applied by the system", it could not be processed and caused a critical software exception, whereupon FPRSA-R "acted according to its programming" and went into maintenance mode.

"In this instance, the primary system had not failed," NATS wrote. "It placed itself into maintenance mode to make sure irreconcilable - and therefore, potentially unsafe - information was not sent to an air traffic controller."

The system wrote a log entry noting that it had gone down, but this log cannot describe the precise nature of an issue. It records only the time at which an issue occurred, not which flight caused it.
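To illustrate why that matters, here is a minimal sketch contrasting a time-only log entry with one that also carries a message identifier. The logger name, the `plan_id` field and the messages are hypothetical; nothing here reflects FPRSA-R's actual logging.

```python
import io
import logging

# Capture log output in memory so the two entries can be compared
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
log = logging.getLogger("fprsa_sketch")  # hypothetical logger name
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False


def process(plan: dict) -> None:
    """Placeholder for plan processing that hits an unrecoverable error."""
    raise RuntimeError("irreconcilable route data")


plan = {"plan_id": "FP-0001", "route": ["WPT1", "WPT1"]}  # invented identifier
try:
    process(plan)
except RuntimeError as exc:
    # Time-only entry: records when something failed, not what triggered it
    log.critical("critical exception, entering maintenance mode")
    # Richer entry: also names the message being processed at the time
    log.critical("critical exception on plan_id=%s: %s", plan["plan_id"], exc)

entries = buffer.getvalue().splitlines()
```

With the second style of entry, an engineer reading the log could go straight to the offending message instead of having to work out which flight plan was in flight at the recorded time.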

After making its diary entry, the FPRSA-R system became non-operational and handed over control to a backup system hosted on separate hardware with its own power and data feeds.

Unfortunately, the backup system applied the same logic to the flight plan with the same result and followed its sibling into oblivion, raising its own critical exception, writing a log and then also slumping into maintenance mode.
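The pattern described here, where a standby running identical logic fails on the same input, can be sketched in a few lines. Everything below is an assumption made for illustration: the `Server` class, the `MaintenanceMode` exception and the "repeated waypoint name" rule are stand-ins, not the real FPRSA-R logic.

```python
class MaintenanceMode(Exception):
    """Raised when a server stops processing rather than emit unsafe data."""


def convert(plan: dict) -> str:
    # Hypothetical rule standing in for the real logic:
    # a repeated waypoint name cannot be reconciled.
    route = plan["route"]
    if len(set(route)) != len(route):
        raise MaintenanceMode(f"cannot reconcile route {route}")
    return " ".join(route)


class Server:
    def __init__(self, name: str):
        self.name = name
        self.operational = True

    def handle(self, plan: dict) -> str:
        try:
            return convert(plan)
        except MaintenanceMode:
            self.operational = False  # fail safe: stop rather than guess
            raise


def process_with_failover(plan: dict, servers: list) -> "str | None":
    for server in servers:
        if not server.operational:
            continue
        try:
            return server.handle(plan)
        except MaintenanceMode:
            continue  # hand the same plan to the next server
    return None  # no server left: manual processing is required


primary, backup = Server("primary"), Server("backup")
good = process_with_failover({"route": ["AAA", "BBB"]}, [primary, backup])
bad = process_with_failover({"route": ["DUP", "BBB", "DUP"]}, [primary, backup])
print(good, bad, primary.operational, backup.operational)
```

Separate hardware, power and data feeds protect against hardware faults, but because both servers run the same `convert` function on the same message, the second one fails in exactly the same way as the first.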

With both primary and backup FPRSA-R sub-systems out of action, flight plans could no longer be automatically processed, and manual intervention was now required.

NATS graphics demonstrating engineering events requiring fallback procedures between January and October 2023

At 8:59, engineers at NATS' air traffic control centre in Swanwick, Hampshire, tried to restart the system. Engineering teams are split into tiers, with Level 1 staff available around the clock and specialist Level 2 workers working during office hours - but available remotely outside these times.

Seven minutes after the attempted reboot, a Level 2 engineer was contacted. NATS’ executive team was then informed, and "regulations" were put in place across UK airspace, reducing the number of flights to 75% of their expected volume.

Hours later, a Level 2 engineer arrived on site at 11:47. "His journey to the office took longer than normally expected due to traffic congestion," NATS reported. A Level 3 engineer was contacted just five minutes later.

"For the next 35 minutes, several unsuccessful attempts were made to perform a full system restart of the FRPSA-R servers, including powering off the associated hardware," NATS continued.

When this half an hour of pain had ended, flights across the nation were again regulated downwards, meaning even fewer flights were in the air. When the regulations were first applied, Swanwick managed 300 flights per hour, and a second facility in Prestwick, Scotland, controlled 30. After the failed reboot, this was reduced to 20 flights per hour for Swanwick and 10 per hour for Prestwick.

At 13:26, after the Level 3 engineer had restored the system and resolved a database issue across the two servers, FPRSA-R processed an initial batch of test flight plans.

"The database in question was only present on one of the pair of servers," NATWS wrote. "If the server without the database was inadvertently started first, it could not validate start-up requirements."

During the sixth hour of the incident, a fix was applied. Regulations began to be lifted between 15:24 and 18:03. "Air traffic service resumed normal operations, though the wider impact of the incident lasted much longer for some airlines and their passengers," NATS wrote.

What went wrong?

Firstly, the flight plan data had six attributes which, in combination, confused the air traffic control system:

1) The flight plan contained duplicate waypoint names outside of UK airspace.

2) Both duplicates lay outside of UK airspace, "on either side" of the nation.

3) One of these waypoints was near a UK airspace boundary exit point, making it eligible for FPRSA-R search logic.

4) The first duplicate waypoint was present in a section of the flight plan message called ICAO4444.

5) The second duplicate waypoint, near the UK FIR exit point, was absent from ICAO4444.

6) The actual UK exit point was also absent from the ICAO4444 segment of the flight plan message.
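The six attributes are specific to FPRSA-R, but the general hazard of matching waypoints by name alone can be sketched briefly. The routes, waypoint names and functions below are invented for illustration and are not a reconstruction of the real search logic.

```python
def find_exit_index(route: list, exit_name: str) -> int:
    """Naive lookup: return the index of the FIRST waypoint with this name."""
    return route.index(exit_name)


def extract_uk_segment(route: list, entry_name: str, exit_name: str) -> list:
    """Return the portion of the route between the entry and exit waypoints."""
    entry = route.index(entry_name)
    exit_ = find_exit_index(route, exit_name)
    if exit_ <= entry:
        # The name matched a waypoint earlier in the route, so the plan
        # appears to leave the airspace before entering it.
        raise ValueError("exit point precedes entry point")
    return route[entry : exit_ + 1]


# Hypothetical routes: waypoint names are placeholders
unique = ["ORIG", "AAA", "UKIN", "UKMID", "BBB", "DEST"]
duplicated = ["ORIG", "DUP", "UKIN", "UKMID", "DUP", "DEST"]

segment = extract_uk_segment(unique, "UKIN", "BBB")

try:
    extract_uk_segment(duplicated, "UKIN", "DUP")
    outcome = "processed"
except ValueError as exc:
    outcome = str(exc)

print(segment, outcome)
```

With unique names the lookup behaves as intended. With a duplicated name, the first match lies on the wrong side of the entry point and the plan cannot be reconciled, which is why lookups of this kind usually need position as well as name.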

"The circumstances under which this incident could occur and lead to the software exception noted are extremely rare and specific," NATS noted.

NATS also operates a "joint decision-making model" relying on "pre-invocation, escalation, and invocation" phases.

"In this circumstance, a blend of joint decision making with a single individual providing overall incident oversight could have led to a quicker critical escalation path," it wrote.

The 26-minute password delay arose because the system had been brought back up on only one server.

"The system was brought back up on one server, which did not contain the password database," NATS revealed. "When the engineer entered the correct password, it could not be verified by the server."

NATS concluded that the event was rare and highly unusual, praising the day-to-day work of its air traffic teams.

"The events of the 28th August demonstrate the complexity of the air traffic management network which, as a rule, operates safely and efficiently throughout the year," it said. "Technical and operational issues are overcome on a daily basis without impacting the network. It is a tribute to the network’s highly skilled staff that their work goes unnoticed by the travelling public – because that means the system is working smoothly.

"Yet unforeseen circumstances of a highly technical nature inevitably occur, requiring a reduction of flight capacity for the skies to remain safe."

Get in touch with jasper@thestack.technology to share any information about this air traffic outage or any other incident.
