
Server storage snafu stuffs up supernational supply chain

"An error occurred due to insufficient disk space" -- and 14 factories ground to a halt.

Toyota has apologised to partners after being forced to shut down 14 factories for 36 hours because IT maintenance went south – the plants, combined, represent approximately a third of its production capacity.

During a maintenance procedure, “data that had accumulated in the database was deleted and organized, and an error occurred due to insufficient disk space, causing the system to stop,” Toyota said.

It added: “Since these servers were running on the same system, a similar failure occurred in the backup function, and a switchover could not be made*. This led to the suspension of domestic plant operations.”

A server with larger capacity was spun up and production resumed the next day. Toyota said that “countermeasures” have been put in place.

Lee Denham, Engineering Director at Databarracks, told The Stack that it is difficult to assess precisely what happened without more detail, but that capacity management and monitoring problems “are avoidable if you monitor and plan in advance… Also, it is best practice to segregate production storage from backup storage.” He added: “This sounds like a virtual platform with both on the same environment.”
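Toyota has published nothing about its monitoring setup, but the kind of check Denham describes is not exotic. A minimal sketch in Python, assuming a hypothetical database volume path and arbitrary alert thresholds, might look like this:

```python
import shutil

# Illustrative thresholds; real values depend on how fast the workload grows.
WARN_THRESHOLD = 0.80   # warn at 80% used
CRIT_THRESHOLD = 0.90   # page someone at 90% used

def check_disk_usage(path: str = "/var/lib/database") -> str:
    """Return a severity level for the filesystem backing `path`."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction >= CRIT_THRESHOLD:
        return "critical"
    if used_fraction >= WARN_THRESHOLD:
        return "warning"
    return "ok"

if __name__ == "__main__":
    # Checking the root filesystem here purely as an example.
    print(f"disk usage level: {check_disk_usage('/')}")
```

Wired into a scheduler or monitoring agent, a check like this gives operators days or weeks of warning rather than a hard stop mid-maintenance.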

The incident comes after Toyota shut down 28 production lines at the same 14 factories in 2022, when supplier Kojima Industries said a file server had been infected with a virus. (The Nikkei noted at the time, raising the pulse of partners, that many of Toyota’s 400+ Tier 1 suppliers "are connected to the automaker's kanban just-in-time production control system...")

*Yes, even $279 billion (2022 revenues) companies' IT systems are sometimes held together with duct tape... No, we don't approve either.

Toyota outage: Sloppy thinking?

As Percona CEO Peter Zaitsev put it to The Stack: "A well documented, regularly scheduled backup plan should include redundancy for backups, including offsite and offline storage, a good automation plan with tested procedures for both backup and restore, proper encryption, and monitoring for the whole process. The most important element is to test your backups. If you can’t recover the data when you need it, you may as well not have a backup in place! The biggest challenge around recovery is how to carry this out effectively.

"Maintenance and changes to systems are always something more "dangerous" than normal operations. If something broke the primary, a better process would be to investigate that failure and not to deploy the changes on Secondary.  It is even better to make sure changes are first changed on the non-production environment, before they are then put into production.

"The second highlight is disk space - this is extremely important to monitor as well as overprovision for critical systems. The problem with disk space is unlike other resources, like CPU capacity which makes system slower when it is exceeded, running out of disk space tends to bring systems to an absolute halt causing downtime or worse, data loss. Testing how systems actually behave when running out of disk space is important for critical systems, yet it is unfortunately often omitted," he said.

Peter Pugh-Jones, Director of Financial Services at Confluent, suggested that the incident reflects an out-of-date design ethos for critical systems – prioritising the disaster recovery/failover process over the ability to tackle issues before they emerge. He told The Stack: “When it comes to the process of complex system design, companies are all too often the victim of legacy thinking. They are reliant upon core systems designed to prioritise old principles of disaster recovery and failover, and that reflect technological and engineering principles established in the 1980s...

“Warnings that low disk space might impact operations, for example, should’ve been flagged at multiple points throughout the process."

Organisations should instead be looking to build more resilient and responsive systems around data streaming analytics, he said.
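What that looks like in practice will vary, but the basic move is to publish infrastructure health as a continuous stream of events rather than waiting for a failure. The sketch below is an assumption-heavy illustration, not anything Confluent or Toyota describes: it uses the third-party kafka-python client, a broker at localhost, and a made-up topic name to stream disk-usage readings that downstream consumers could alert on.

```python
import json
import shutil
import socket
import time

from kafka import KafkaProducer  # third-party: kafka-python

# Illustrative broker address and topic name; not taken from the article.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    usage = shutil.disk_usage("/")
    event = {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "disk_used_fraction": usage.used / usage.total,
    }
    # Downstream consumers (alerting, dashboards) can flag rising usage
    # long before the volume actually fills.
    producer.send("infra.disk-usage", event)
    producer.flush()
    time.sleep(60)
```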

Toyota: Anyone home?

In another sign that all is not entirely well in Toyota’s IT estate, the company – a heavyweight of the automotive world with $279 billion in 2022 revenues – admitted in May this year that it had parked the data of millions of drivers, including vehicle location data, on a publicly accessible cloud database for over a decade, owing to human error that went undetected. That affected “almost the entire customer base who signed up for its main cloud service platforms since 2012” as well as Lexus customers.

Worse things have happened at sea…

The incident is not the biggest storage-related crisis to come out of Japan in recent years, however. The Stack recalls December 2021’s incident at the prestigious Kyoto University, which saw 77TB of research data wiped from its supercomputer after a software update pushed by system provider HPE deleted all files older than 10 days held in large-capacity backup disc storage on the Cray machine, rather than just the intended log files. Kyoto University said 34 million files from 14 research groups were deleted in the incident.

HPE explained that incident (translated from Japanese) as follows: “The backup script included a ‘find’ command to delete log files older than 10 days. In addition to functional improvements to the script, the variable name passed to the find command for deletion was changed to improve visibility and readability… However, there was a lack of consideration in the release procedure for this modified script. We were not aware of the side effects of this behavior and released the [updated] script, overwriting [a bash script] while it was still running," HPE admitted.

"This resulted in the reloading of the modified shell script in the middle of the execution, resulting in undefined variables. As a result, the original log files in /LARGE0 [backup disc storage] were deleted instead of the original process of deleting files saved in the log directory."

See also: Amid a ransomware endemic, don't nap on the importance of getting storage right
