The Stack

Interxion LON1 outage takes down global metals exchange amid fury

Power cables in Interxion's Brick Lane campus. Credit: Ed Targett

A power outage on Monday at a central London data centre run by Interxion took down trading on the London Metal Exchange (LME). Billions in trades were put on ice for over five hours as the exchange was forced to switch to a different data centre. The incident raised troubling questions about the data centre’s failover capabilities, as Interxion customers bewailed dismal communication from the co-location provider.

The metals market had been due to open at 1am London time on Tuesday morning and eventually resumed at 6:15am, per Bloomberg, after the exchange migrated its systems to a backup data centre. (The LME moved the primary site for its LME Select matching engine to Interxion’s LON1 data centre in late 2016.)

That cut many traders off from the venue, which underpins the trade of some $64 billion in base metal futures contracts daily, with other associated software taking even longer to restore.

Interxion’s trio of London data centres, housed in the Old Truman Brewery on Brick Lane, are critical national infrastructure (CNI) and home to what the company describes as “three quarters of the Fortune 500 and… the most established financial services, banking and trading community in Europe”. They connect tenants to 90+ carriers and ISPs, including the London Internet Exchange (LINX) and the London Network Access Point (LONAP), for a client base of blue chips, including many plugged in to the London Stock Exchange, a 20 minutes’ walk to the southwest.

Services started failing for customers shortly after 6pm on Monday evening. Amid a deafening silence from the company, which was bought by US data centre giant Digital Realty for $8.4 billion in a deal that closed in March 2020, many scrambled engineers to the site, where they were eventually told that there had been a power outage.

That should, in theory, have resulted in a smooth failover to another power supply. There are 14 diesel generators and 140,000+ litres of diesel in the building’s basement, ready to kick in if either of the site’s two separate power supplies fails and its uninterruptible power supply (UPS) apparatus goes with it. (The electronic switchgear that swaps power from mains to generators failed, we understand. If you have more precise detail or something to say about the outage, including as a customer, please do get in touch.) As one customer noted, however: “for what happened at Interxion… a whole chain of things would have to be broken.”
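That customer’s point about the chain can be made concrete with some basic reliability arithmetic. The sketch below uses purely hypothetical availability figures (not Interxion’s) to show why redundant power sources help little when a single transfer switch sits in series with them: parallel paths multiply unavailability, while series components multiply availability, so the switch caps the whole chain.

```python
# Illustrative sketch with made-up numbers: availability of a power chain
# where redundant sources sit behind a single transfer switch.

def parallel(*avail):
    """Combined availability of redundant (parallel) components:
    the system is down only if every path is down."""
    unavail = 1.0
    for a in avail:
        unavail *= (1.0 - a)
    return 1.0 - unavail

def series(*avail):
    """Combined availability of components that must all work (series)."""
    p = 1.0
    for a in avail:
        p *= a
    return p

# Hypothetical per-component availabilities, for illustration only.
mains = 0.999
generators = 0.999   # diesel gensets plus fuel, treated as one redundant source
ups = 0.9999
switchgear = 0.9999  # a transfer switch is a series element in the chain

sources = parallel(mains, generators)     # redundant sources: very high
chain = series(sources, ups, switchgear)  # but capped by the series elements

print(f"redundant sources: {sources:.7f}")
print(f"whole chain:       {chain:.7f}")
```

However many redundant sources are added, the chain’s availability can never exceed that of its weakest series component, which is why a switchgear failure can defeat an otherwise heavily redundant design.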

Throughout the Interxion outage at LON1 the company kept resolutely silent, later reportedly telling customers that its customer support systems were taken out with the outage. (That failure mode continues to be a problem for even the biggest organisations out there: AWS admitted in December 2021 that during an outage in its US-EAST-1 region, “our Support Contact Center also relies on the internal AWS network, so the ability to create support cases was impacted”, adding that it plans to roll out a “new support system architecture that actively runs across multiple AWS regions to ensure we do not have delays in communicating with customers.”)

That did not, Interxion customers were quick to add, excuse it from failing to communicate even via Twitter, which appears to be staffed just by marketing. Owner Digital Realty was no help either. (Its “blue tick” Digital Realty EMEA Twitter page hasn’t been updated since December 4, 2020.)

The company later shared a statement saying: “At 1810 on Monday, 10th January, Interxion’s Hanbury Street facility (LON1), experienced a critical power interruption, which impacted some of the mains control equipment and caused outages across the services in LON1. The problem was quickly identified and services started to come back online from 1945 until 2230 when the facility returned to operational status.”

The company apologised “to all our customers and partners affected and for difficulties communicating during the outage”, adding that “a full investigation is underway to determine the root cause of the interruption, the findings of which will be used to ensure an even more resilient infrastructure in the future.”

In an addendum that may or may not have been appreciated by those affected, it added: “We remain proud of our reputation for global reliability and availability and have maintained five nines uptime over the past 14 years.”

What went wrong exactly? Were you affected? Get in touch.

Follow The Stack on LinkedIn
