Saturday, May 09, 2015

Time Flies

From The Guardian, "US aviation authority: Boeing 787 bug could cause 'loss of control'" [May 1st 2015], describing a bug in the Dreamliner aircraft's electrical power generators that resulted in a U.S. Federal Aviation Administration (FAA) Airworthiness Directive (AD):
The plane’s electrical generators fall into a failsafe mode if kept continuously powered on for 248 days. The 787 has four such main generator-control units that, if powered on at the same time, could fail simultaneously and cause a complete electrical shutdown.
And from the FAA AD [2015-NM-058-AD May 1st 2015]:
We are adopting a new airworthiness directive (AD) for all The Boeing Company Model 787 airplanes. This AD requires a repetitive maintenance task for electrical power deactivation on Model 787 airplanes. This AD was prompted by the determination that a Model 787 airplane that has been powered continuously for 248 days can lose all alternating current (AC) electrical power due to the generator control units (GCUs) simultaneously going into failsafe mode. This condition is caused by a software counter internal to the GCUs that will overflow after 248 days of continuous power. We are issuing this AD to prevent loss of all AC electrical power, which could result in loss of control of the airplane.
Knowing nothing more than its periodicity, I'll make a prediction about the bug. I'm going to do this by making a successive series of transformations on the period until I find a number that I like. What "like" means in this context will become obvious to the reader.

248 days times 24 hours in a day is 5,952 hours.

5,952 hours times 60 minutes in an hour is 357,120 minutes.

357,120 minutes times 60 seconds in a minute is 21,427,200 seconds. Okay, I'm starting to like this number.

21,427,200 seconds times 10 is 214,272,000 1/10s of a second. Not quite there yet.

214,272,000 1/10s of a second times 10 is 2,142,720,000 1/100s of a second.

2,142,720,000 1/100s of a second. I bet you see it now, don't you? This value is perilously close to the maximum positive number in a signed 32-bit integer variable, which is 2,147,483,647 or 0x7fffffff in hexadecimal.

Suppose, just suppose, you were writing software to manage a power generator in a Boeing 787 Dreamliner. You needed to keep track of time, and you chose 1/100s of a second as your resolution, so that adding 1 to your variable indicated 1/100s of a second. Maybe that was the resolution of your hardware real-time clock. Maybe your RTOS had a time function that returned a value with that resolution. Or maybe - and this is the most common reason in my experience - you decided that was a convenient resolution because it was fine grained enough for your application while having a dynamic range - the difference between the possible maximum and minimum values - that you could not imagine ever exceeding the capacity of the signed 32-bit integer variable into which you stored it. 248 days is a long time. What are the chances of the system running for that long?

So, if we chose a signed 32-bit integer variable, and 1/100s of a second as the resolution of our "tick", when would the value in our variable roll over into the negative range? Because if it rolled over into the negative range - which would happen in the transition from 2,147,483,647, or 0x7fffffff hexadecimal, to 0x80000000, which in a signed variable would actually be interpreted as -2,147,483,648 - wackiness would ensue. Time would appear to be running backwards, comparisons between timestamps would not be correct, human sacrifice, cats and dogs living together, a disaster of biblical proportions.

That sounds pretty dire. Maybe we better check our math by working backwards.

So the maximum value of 2,147,483,647 in 1/100s of a second divided by 100 is 21,474,836 seconds.

21,474,836 seconds divided by 60 is 357,913 minutes.

357,913 minutes divided by 60 is 5,965 hours.

5,965 hours divided by 24 is 248 days.

I feel pretty confident that, knowing nothing other than the failure's periodicity, we should be looking for some place in the software where time is being kept with a resolution of 1/100s of a second in a signed 32-bit integer.

When this failure was first brought to my attention - which mostly happened because one of my clients is a company that develops avionic hardware and software for commercial and business aircraft - I did this calculation in just a few seconds and was pleased to discover once again that such a simple prediction was trivially easy to make.

I became really good at this kind of thing during my time working in a group at Bell Labs in the late 1990s developing software and firmware for ATM switches and network interface cards. We called this class of bug counter rollover because it manifests in all sorts of contexts, not just time, in any system which is expected to run 24x7 for as long as... well, in the telecommunications world, pretty much forever. I fixed this bug in the protocol stack in the ATM NIC we were developing, and then later proceeded to discover it in pretty much every ATM switch by every vendor our customers hooked our NIC up to.

It became a pretty regular occurrence that I'd be on a conference call - sometimes with our own field support organization, sometimes with developers working for other ATM switch vendors - and be told that the system was failing - disconnecting, or rebooting, or some such thing - with some periodicity like "once a day". Within a minute or two of noodling around in my lab notebook I'd astound everyone (except of course members of our own group) by announcing "You are maintaining your T125 timer in a 16-bit signed variable with a resolution of 5 seconds per tick. You have never tested your switch with a connection lasting longer than 22 hours."

Once, when I said something like this, there was a long pause on the line, and an awestruck developer on the other end agreed that that was the case. But once you see the arithmetic - it's so simple you can't really even call it math - it's like the magician showing you how the trick is done. It sorta takes the magic out of it.

There are a few techniques that can be used to help address counter rollover.

Unsigned variables: When an unsigned variable rolls over, unsigned subtraction between the variable and its prior value still yields the correct magnitude, provided the variable has rolled over only once. You just have to understand that the value in your variable isn't absolute, but is only useful when subtracted from some other unsigned value with the same units.

Wider variables or coarser resolution: Modern compilers and processors often offer 64-bit variables. This may increase the dynamic range to the point where a rollover won't occur until the Sun goes nova. Choosing a coarser resolution - a larger tick value - can have the same effect.

Detect the imminent counter rollover so that you can compensate for it: Instead of computing prime = value + increment, you first compute temporary = maximum - increment and check whether temporary < value; if so, then your level of entropy is about to increase, and you need to do something about it in the code.

As developers of systems that are intended to run a long time - longer maybe than we typically even test them for - and especially for systems whose failure may have safety consequences - and that includes avionics systems, and telecommunications systems, and lots of other kinds of systems - we have to be hyperaware of the consequences of our decisions. This is why, whether organizations think of it this way or not, developers are making critical systems engineering decisions on a daily basis just in the choice of measurement units and data types.

Update (2015-05-10)

Coincidentally, I recently took a two day class on DO-178, an FAA guideline on processes to develop avionics software at various levels of safety criticality. At the higher levels of criticality, code test coverage of 100% is required. Surely the software in the Boeing 787 GCUs had to meet the higher Design Assurance Levels (DAL) specified in DO-178, like maybe B or even A, which require extremely stringent processes. I've written before about how unit test code coverage is necessary but not sufficient. In that article I was talking about the need for range coverage. This is yet another example of the same thing, where the range is a duration of time. Maybe we can call this temporal coverage.