Wednesday, November 27, 2019

32,768

In yet another example of the class of bugs my old Bell Labs colleagues refer to as counter rollover - in this instance apparently an int16_t (signed sixteen-bit integer) variable used to count hours - Hewlett-Packard warns that some of their solid state drives will fail at 32,768 hours of use.
HP Warns That Some SSD Drives Will Fail at 32,768 Hours of Use
Bulletin: HPE SAS Solid State Drives - Critical Firmware Upgrade Required for Certain HPE SAS Solid State Drive Models to Prevent Drive Failure at 32,768 Hours of Operation
That's an MTBF of 32,767 hours, or 0x7FFF in hexadecimal, the largest signed integer you can fit into sixteen bits. That works out to short of four years, far less than the five year warranty offered on such drives.

Some users report several drives - presumably installed at the same time - all failed within a fifteen minute window. Bet we can guess how long it took the sysadmin to install those drives in a RAID (and so much for redundancy).

HP is providing a firmware update. (I've never updated the firmware on a solid state drive. I'm a little surprised it's even possible.)
The fact that this catastrophic rollover event only occurs between the third and fourth years of operation makes you appreciate the difficulty in testing such firmware. You can't run the devices for four years before you ship them. You have to find another way to ferret out bugs of this nature, such as code inspections, white box unit testing, simulation, effectively an accelerated wear testing of the firmware algorithms.

In the words of my former office mate at the Labs:
They missed a cardinal rule: when implementing a counter or timestamp, ensure its rollover happens only after your anticipated career EOL1.
To which I replied:
The advent of an efficient uint64_t data type on embedded processors was a huge boon to my apparent career success!
Footnotes

1 End Of Life

See Also

C. Overclock, "Time Flies", 2015-05-09

C. Overclock, "Time Flies Again", 2019-07-27

Updates

2019-11-28: minor edits, corrections, and reformatting.

No comments: