Wednesday, November 27, 2019

32,768

In yet another example of the class of bugs my old Bell Labs colleagues refer to as counter rollover - in this instance apparently an int16_t (signed sixteen-bit integer) variable used to count hours - Hewlett-Packard warns that some of their solid state drives will fail at 32,768 hours of use.
HP Warns That Some SSD Drives Will Fail at 32,768 Hours of Use
Bulletin: HPE SAS Solid State Drives - Critical Firmware Upgrade Required for Certain HPE SAS Solid State Drive Models to Prevent Drive Failure at 32,768 Hours of Operation
That's an MTBF of 32,767 hours, or 0x7FFF in hexadecimal, the largest signed integer you can fit into sixteen bits. That works out to a little short of four years (about 3.7), far less than the five-year warranty offered on such drives.
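
A minimal sketch of that arithmetic, assuming (as surmised above) that the firmware keeps its power-on hours in a signed sixteen-bit integer. The name `next_hour` is hypothetical, and `ctypes.c_int16` merely stands in for an `int16_t`; nothing here is taken from the actual drive firmware.

```python
import ctypes

INT16_MAX = 0x7FFF            # 32,767, the largest value an int16_t can hold
HOURS_PER_YEAR = 24 * 365.25  # average year, including leap days

def next_hour(power_on_hours: int) -> int:
    """Increment an hour count the way a wrapping int16_t would (assumed)."""
    # ctypes integer types do no overflow checking: the result is truncated
    # to sixteen bits and reinterpreted as a signed value.
    return ctypes.c_int16(power_on_hours + 1).value

years_to_rollover = (INT16_MAX + 1) / HOURS_PER_YEAR

print(next_hour(INT16_MAX))         # -32768
print(round(years_to_rollover, 2))  # 3.74
```

Whatever code consumes the counter presumably never expected a negative number of hours, which would be one plausible route from a wrapped counter to a dead drive.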

Some users report several drives - presumably installed at the same time - all failed within a fifteen minute window. Bet we can guess how long it took the sysadmin to install those drives in a RAID (and so much for redundancy).

HP is providing a firmware update. (I've never updated the firmware on a solid state drive. I'm a little surprised it's even possible.)
The fact that this catastrophic rollover event only occurs between the third and fourth years of operation makes you appreciate the difficulty in testing such firmware. You can't run the devices for four years before you ship them. You have to find another way to ferret out bugs of this nature: code inspections, white box unit testing, and simulation, which together amount to accelerated wear testing of the firmware algorithms.
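
One way such a white box test might look, sketched under loud assumptions: `HourCounter` and `survives_rollover` are hypothetical names, and the model simply mimics a signed sixteen-bit hour count. The point is only that a test can seed the counter just short of its limit rather than wait 3.7 years.

```python
INT16_MAX = 0x7FFF

class HourCounter:
    """Hypothetical model of a firmware power-on-hours counter."""

    def __init__(self, start: int = 0) -> None:
        # Letting the test choose the starting value is the whole trick.
        self.hours = start

    def tick(self) -> None:
        """Advance one hour, wrapping like a signed sixteen-bit integer."""
        self.hours = (self.hours + 1 + 0x8000) % 0x10000 - 0x8000

def survives_rollover(counter: HourCounter, ticks: int) -> bool:
    """Return True if the count never goes negative during `ticks` hours."""
    for _ in range(ticks):
        counter.tick()
        if counter.hours < 0:
            return False
    return True

# Start two hours short of the limit and run five simulated hours.
print(survives_rollover(HourCounter(start=INT16_MAX - 2), ticks=5))  # False
```

A real harness would exercise the firmware's own counter code rather than a model of it, but the seeding idea carries over.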

In the words of my former office mate at the Labs:
They missed a cardinal rule: when implementing a counter or timestamp, ensure its rollover happens only after your anticipated career EOL[1].
To which I replied:
The advent of an efficient uint64_t data type on embedded processors was a huge boon to my apparent career success!
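
To put rough numbers on that rule, here is a small sketch that estimates how long a counter of a given width lasts at a given tick rate. The function name and the example tick rates are illustrative assumptions; the post does not say what any particular `uint64_t` counted.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_until_rollover(bits: int, tick_seconds: float,
                         signed: bool = False) -> float:
    """Years of continuous counting before a `bits`-wide counter rolls over."""
    # A signed counter goes negative after 2**(bits - 1) ticks;
    # an unsigned counter wraps to zero after 2**bits ticks.
    ticks = 2 ** (bits - 1) if signed else 2 ** bits
    return ticks * tick_seconds / SECONDS_PER_YEAR

# Hours in an int16_t: a little under four years.
print(years_until_rollover(16, 3600.0, signed=True))
# Milliseconds in a uint32_t: under two months.
print(years_until_rollover(32, 1e-3))
# Nanoseconds in a uint64_t: several centuries, comfortably past career EOL.
print(years_until_rollover(64, 1e-9))
```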
Footnotes

1 End Of Life

See Also

C. Overclock, "Time Flies", 2015-05-09

C. Overclock, "Time Flies Again", 2019-07-27

Updates

2019-11-28: minor edits, corrections, and reformatting.

Sunday, September 01, 2019

Geotagging While Airborne

While uploading my vacation photos to Flickr, I was surprised that one photograph that I had taken from the window of the 757 was geotagged.

When I use my iPhone 7 in airplane mode, apps usually complain that GPS location isn't available. That's because typically the GPS receiver is integrated into the cellular radio - GPS being a requirement in the U.S. so that emergency services can locate you when you dial 911 - and when GPS isn't available (for example, when you're indoors), your phone may use cell tower triangulation and WiFi router identification to determine your location. In airplane mode, I'd expect all of these radios to be turned off.

My iPhone was in airplane mode, but I had just the WiFi radio enabled to use the in-flight WiFi - mostly, I confess, as an experiment, having worked on business aviation products that provide that service. Apparently this was enough to enable the GPS receiver, but, I assume, keep the RF section of the cellular radio turned off. This tells me something interesting about how iOS manages its RF resources.
(When I queried the Salmon of Knowledge, as I've come to call internet searches following my recent travels in Ireland, one article said that the iPhone has a GPS receiver separate from the cellular radio. Another said that in more recent versions of iOS, GPS is never disabled in airplane mode.)
Flickr typically displays the geotag, if one is available in the photograph's metadata, as a tiny map with a place name underneath it; the place name links to a search for photographs geotagged nearby. But in this case, it was just a square of blue labelled with "A mysterious place with no name".

When I looked at the URL, it had the latitude and longitude encoded as parameters.

https://www.flickr.com/search/?lat=56.798622&lon=-16.282175&radius=0.25&has_geo=1&view_all=1

I pasted 56.798622, -16.282175 into Google Maps and, as expected, got a result way out in the Atlantic.
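
The same extraction can be done programmatically. This is a minimal sketch using Python's standard `urllib.parse`; the parameter names `lat` and `lon` come from the URL above, while the helper name `extract_lat_lon` is made up for illustration.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlparse

url = ("https://www.flickr.com/search/"
       "?lat=56.798622&lon=-16.282175&radius=0.25&has_geo=1&view_all=1")

def extract_lat_lon(search_url: str) -> Tuple[float, float]:
    """Pull the latitude and longitude out of a search URL's query string."""
    # parse_qs maps each parameter name to a list of its values.
    query = parse_qs(urlparse(search_url).query)
    return float(query["lat"][0]), float(query["lon"][0])

lat, lon = extract_lat_lon(url)
print(f"{lat}, {lon}")  # 56.798622, -16.282175
```

The printed string is in the "latitude, longitude" form that was pasted into Google Maps.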

Dropping into satellite view, I got an orbital image of where the 757 was when I took the photograph.

Readers of my blog (all two of you) may recall that I've done some work integrating my own GPS software with Google Earth and remotely tracked my travel with a moving map display as I drove my automobile around, with a Raspberry Pi running my software and a GPS receiver and an LTE modem sitting on the dashboard [Better Never Than Late]. It hadn't really occurred to me until now that you could do something similar with an aircraft and create your own moving map display if you had internet access to Google while airborne.

Of course this is exactly what the aircraft does with its own moving map display, except that the map is onboard in some box in the avionics bay. Although I didn't work on that particular feature, at least two of the business aircraft products for which I was one of the developers did exactly that, using the GPS coordinates provided by the aircraft navigation system over an ARINC 429 serial bus.

Saturday, July 27, 2019

Time Flies Again

Friday, the European Union Aviation Safety Agency, the EU's equivalent of the U.S. Federal Aviation Administration (FAA), issued a revised mandatory Airworthiness Directive for the Airbus A350-941 jet airliner. Quoting from AD 2017-0129R1, the EASA says:
Prompted by in-service events where a loss of communication occurred between some avionics systems and avionics network, analysis has shown that this may occur after 149 hours of continuous aeroplane power-up. Depending on the affected aeroplane systems or equipment, different consequences have been observed and reported by operators, from redundancy loss to complete loss on a specific function hosted on common remote data concentrator and core processing input/output modules. 
This condition, if not corrected, could lead to partial or total loss of some avionics systems or functions, possibly resulting in an unsafe condition. 
To address this potential unsafe condition, Airbus issued the AOT to provide instructions to reset the internal timer. Consequently, EASA issued AD 2017-0129 to require repetitive on ground power cycles (resets).
This means exactly what you think it does: you need to power cycle your A350-941 aircraft no less often than every 149 hours unless and until a software fix is applied.

So: what's special about 149 hours? The AD doesn't say. Let's see if we can figure that out using a little firmware forensics.

Multiply 149 hours by 60 to get 8940 minutes.

Multiply that by 60 to get 536400 seconds.

Multiply that by 1000 to get 536400000 milliseconds.

Finally, multiply that by 4 to get 2145600000.

We could equivalently have multiplied 536400 seconds by one million to get microseconds, then divided it by 250 to get the same value.

So what?

2145600000 is perilously close to 2147483647, or 0x7FFFFFFF in hexadecimal, which is (2^31 - 1), the largest positive number representable in two's complement binary form in a 32-bit variable.
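
The chain of multiplications is easy to check mechanically. A short sketch follows; the variable names are just labels for the steps above.

```python
INT32_MAX = 0x7FFFFFFF  # 2147483647, or 2**31 - 1

hours = 149
minutes = hours * 60
seconds = minutes * 60
milliseconds = seconds * 1000
ticks = milliseconds * 4      # four 250-microsecond ticks per millisecond

print(minutes, seconds, milliseconds, ticks)
# 8940 536400 536400000 2145600000

# The equivalent route: microseconds divided by 250.
print(seconds * 1_000_000 // 250 == ticks)  # True

# Ticks still left before the limit at the 149-hour mark.
print(INT32_MAX - ticks)      # 1883647
```

Those 1,883,647 remaining ticks, at 250 microseconds each, come to a little under eight minutes.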

So I feel pretty confident in making this prediction: somewhere in the A350-941 firmware or software there is a 32-bit signed variable that is incremented every 250 microseconds. After doing so for exactly 149 hours, 7 minutes, and 50.91175 seconds, the very next 250 microsecond clock tick will make that variable overflow and its value will transition from positive to negative as it increments from 0x7FFFFFFF to 0x80000000.

Wackiness ensues.

(An alternative hypothesis is a 32-bit unsigned variable which is incremented twice as often, or every 125 microseconds. It eventually wraps from 0xFFFFFFFF to 0x00000000, similarly confusing an algorithm.)
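
Both hypotheses can be worked out exactly. The sketch below uses `fractions.Fraction` to avoid floating-point drift; the function name `time_to_wrap` is made up, and the tick periods are the guesses made above, not anything stated in the AD.

```python
from fractions import Fraction

def time_to_wrap(tick_count: int, tick_us: int):
    """Return (hours, minutes, seconds) needed to count `tick_count` ticks."""
    total_seconds = Fraction(tick_count * tick_us, 1_000_000)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return int(hours), int(minutes), float(seconds)

# Signed 32-bit counter, 250 us tick: time to climb from 0 to 0x7FFFFFFF.
print(time_to_wrap(0x7FFFFFFF, 250))  # (149, 7, 50.91175)

# Unsigned 32-bit counter, 125 us tick: time until it wraps back to zero.
print(time_to_wrap(2 ** 32, 125))     # (149, 7, 50.912)
```

The two scenarios differ by a single 250-microsecond tick, so the 149-hour figure alone cannot tell them apart.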

Longtime readers of my blog will recognize that this counter rollover bug is similar to the one I previously described in the Boeing 787 Dreamliner; that bug could result in the loss of all electrical power on the aircraft after 248 days.

Commercial aircraft in service can stay powered up for a long time. They transition from aircraft power to ground power when at the gate, maintaining power so that the interior can be cleaned, the galleys restocked, the air conditioning kept running while boarding, and so forth. Then they go back to on-board power until the flight lands and the cycle repeats.

248 days is a long time. But keeping an aircraft powered up for 149 hours seems not only plausible, but likely. So how was this not uncovered during testing?

Counter rollover is a class of bug that I spent a significant portion of my career ferreting out of firmware and software for telecommunications systems - the kinds of systems expected to run 24x7 with extraordinarily high reliability - during my time at Bell Labs. It kept cropping up in the products I was helping develop (sometimes, admittedly, in my own code), and also in other vendors' products with which we or our customers were integrating our equipment.

Part of systems engineering - the art and craft of discovering, defining, and specifying requirements for a product, and ensuring that they are met - must be deciding how long a product or component is expected to run before it must be power cycled. No human artifact is perfect, and nothing runs forever. I dimly remember during my mainframe days that IBM recommended periodically rebooting OS/360 on our 360/65 because of control block degradation; a euphemism, I suspect, for memory corruption or counter overflow.

But 149 hours? That does not seem like a very long time to me.