Saturday, April 04, 2020

Meet the new bug. Same as the old bug.

The Guardian in the U.K. reports on an FAA Airworthiness Directive that requires Boeing 787 aircraft be power cycled every fifty-one days to prevent "several potentially catastrophic failure scenarios".

From the AD:
The FAA is adopting a new airworthiness directive (AD) for all The Boeing Company Model 787–8, 787–9, and 787–10 airplanes. This AD requires repetitive cycling of the airplane electrical power. This AD was prompted by a report that the stale-data monitoring function of the common core system (CCS) may be lost when continuously powered on for 51 days. This could lead to undetected or unannunciated loss of common data network (CDN) message age validation, combined with a CDN switch failure. The FAA is issuing this AD to address the unsafe condition on these products.

My friend and colleague Doug Young, who knows a thing or two about aircraft and avionics, brought this to my attention. I find fifty-one to be an interesting number, because unlike many bugs of this class, it doesn't map quite so obviously onto some kind of power-of-two/number-of-bits/frequency issue.

I tried to talk Doug into it being related to the number of bits (nineteen) in the data field of an ARINC 429 message (A429 being a common avionics bus that he and I have worked with), but even I thought it was a bit of a stretch. It would require some system clock ticking at 10,000 ticks per day, at which rate the data field would overflow after 52.4287 days.
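For the curious, here is that arithmetic as a quick Python sketch. The 10,000 ticks per day rate is purely the hypothetical from the paragraph above, not something taken from any real system, and the names are made up for illustration.

```python
# Hypothetical: a counter the width of the A429 data field (19 bits),
# incremented at an assumed 10,000 ticks per day.
DATA_BITS = 19
TICKS_PER_DAY = 10_000  # assumed rate, not from any real avionics system

max_count = (1 << DATA_BITS) - 1         # largest value a 19-bit field can hold
days_to_max = max_count / TICKS_PER_DAY  # days until the counter tops out

print(f"{max_count} ticks -> {days_to_max:.4f} days")
# prints "524287 ticks -> 52.4287 days"
```

Close to fifty-one, but not close enough to be convincing.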

Both Doug and I independently arrived at the possibility of a uint32_t value in units of milliseconds, but that overflows after 49.71 days, a discrepancy that I find makes it unlikely.
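The millisecond figure is easy to check. This is only a sketch of the arithmetic, assuming a free-running unsigned 32-bit counter of milliseconds; nothing here says that is what the CCS actually uses.

```python
# Assumption: an unsigned 32-bit counter incremented once per millisecond.
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day
UINT32_RANGE = 1 << 32            # number of distinct uint32_t values

days_to_wrap = UINT32_RANGE / MS_PER_DAY
print(f"{days_to_wrap:.2f} days")  # prints "49.71 days"

# Going the other way: 51 days of milliseconds does not fit in 32 bits.
ms_in_51_days = 51 * MS_PER_DAY
print(ms_in_51_days > UINT32_RANGE - 1)  # prints "True"
```

So such a counter would already have wrapped more than a day before the fifty-one day mark.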

As I've mentioned before, I ran into stuff like this all the time in my Bell Labs telecommunications days. Occasionally, alas, it was in code I wrote, which definitely made for a learning experience.

Oh, and by the way, just last month HP announced yet another firmware bug in which some of their disk drives stop working after 40,000 hours of operation, also not an obvious power-of-two issue.

We will get fooled again.
