A “death bug” in certain SSD drives from Western Digital’s SanDisk that causes permanent drive failure and data loss at 40,000 hours of operation continues to catch users unawares despite a firmware fix being issued in 2020.
The latest victim appears to have been the popular website Hacker News, which suffered a sustained outage on Friday, July 8 that it attributed to a disk failure. A “fallback server, which we switched to last night when the primary server failed” also failed, forcing the site to restore operations from mercifully available backups.
Various posts on that site and elsewhere suggest the failure relates to the well-known SanDisk death bug.
The issue first reared its head in November 2019, when HPE warned that a wide range of its own-branded solid state products would fail at 32,768 hours of operation. The company did not name the SSD manufacturer responsible at the time.
It re-emerged in March 2020, when server providers Dell and HPE urged customers to apply a firmware fix for a batch of newly identified products dependent on SanDisk SSDs. Neglecting to do so, HPE said, “will result in drive failure and data loss at 40,000 hours of operation and require restoration of data from backup if there is no fault tolerance, such as RAID 0 or even in a fault tolerance RAID mode if more SSDs fail than can be supported by the fault tolerance of the RAID mode on the logical drive.”
“After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously,” HPE added at the time. (Experienced IT teams will aim to build stacks with non-sequential serial numbers and diverse storage products, but it’s hard to get everything right all of the time and patches are not always applied promptly, as few would dispute…)
SanDisk drives ranging from 200GB to 1.6TB are understood to be affected. These can be found in a sprawling array of Dell and HPE servers: both companies furnished users with a full list of impacted products at the time. Other OEMs are likely to be affected and will also have alerted customers. HPE made Linux, VMware, and Windows scripts available that check SSD firmware for the 40,000 power-on-hours failure issue, as did Dell, which named SanDisk model numbers LT0200MO, LT0400MO, LT0800MO, LT1600MO, LT0200WM, LT0400WM, LT0800WM, LT0800RO and LT1600RO as being responsible.
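For teams that already hold drive model numbers and power-on hours in an inventory, a first-pass triage might look something like the Python sketch below. It is not the HPE or Dell script: the `Drive` record, its field names and the 2,000-hour warning window are assumptions made for illustration, and only the model numbers and the 40,000-hour threshold come from the advisories quoted above. A model match alone does not show whether the firmware fix has been applied, so the vendor tools remain the authoritative check.

```python
from dataclasses import dataclass

# Model numbers named in Dell's advisory, as listed in the article.
AFFECTED_MODELS = {
    "LT0200MO", "LT0400MO", "LT0800MO", "LT1600MO",
    "LT0200WM", "LT0400WM", "LT0800WM", "LT0800RO", "LT1600RO",
}
FAILURE_HOURS = 40_000  # power-on hours at which unpatched drives fail


@dataclass
class Drive:
    """Hypothetical inventory record; the field names are illustrative."""
    serial: str
    model: str
    power_on_hours: int


def hours_remaining(drive):
    """Return hours left before the 40,000-hour mark, or None if the model is not listed."""
    if drive.model.upper() not in AFFECTED_MODELS:
        return None
    return max(FAILURE_HOURS - drive.power_on_hours, 0)


def at_risk(drives, warn_within_hours=2_000):
    """Return (drive, hours_left) pairs for listed models inside the warning window,
    most urgent first. The window size is an arbitrary assumption."""
    flagged = []
    for drive in drives:
        left = hours_remaining(drive)
        if left is not None and left <= warn_within_hours:
            flagged.append((drive, left))
    return sorted(flagged, key=lambda pair: pair[1])
```

Fed with records from whatever tooling a site already uses to read power-on hours, `at_risk` returns the listed drives closest to the threshold, which is where a firmware check should start.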
The update corrects a logging check: “Assert had a bad check to validate the value of circular buffer’s index value. Instead of checking the max value as N, it checked for N-1,” Dell’s advisory said.
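The kind of off-by-one error Dell describes can be illustrated with a few lines of Python. This is a sketch of the general pattern only, not the SanDisk firmware, which is not public: the function names and the buffer size are hypothetical, and the sketch makes no claim about how the drive maps power-on hours to the buffer index.

```python
BUFFER_MAX_INDEX = 7  # hypothetical "N": the largest index the buffer may legitimately hold


def index_ok_buggy(index, n=BUFFER_MAX_INDEX):
    """Flawed check: treats N-1 as the maximum, so a legitimate index of N is rejected."""
    return 0 <= index <= n - 1


def index_ok_fixed(index, n=BUFFER_MAX_INDEX):
    """Corrected check: accepts every index up to and including N."""
    return 0 <= index <= n


def first_rejected(check, n=BUFFER_MAX_INDEX):
    """Walk the index upward and return the first valid value the check rejects, if any."""
    for index in range(n + 1):
        if not check(index, n):
            return index
    return None
```

With the flawed check, everything works until the index first reaches its true maximum, at which point the assertion fires. That matches the shape of a fault that stays hidden for years and then appears at a fixed point in a drive's life.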
SanDisk owner Western Digital told The Stack’s founder at the time that the company had “discovered a firmware issue in a specific line of older, end-of-life SanDisk SAS SSDs, and pre-emptively contacted and began collaborating with our OEM partners to quickly provide a solution for their customers. A firmware fix for this issue is available for use by customers. As part of our policy, we cannot comment any further.”
It directed any further questions to OEM partners.
It’s never too late to run a check for this issue and to check your backup strategy!