Saturday, May 09, 2015

Time Flies

From The Guardian, "US aviation authority: Boeing 787 bug could cause 'loss of control'" [May 1st 2015], describing a bug in the Dreamliner aircraft's electrical power generators that resulted in a U.S. Federal Aviation Administration (FAA) Airworthiness Directive (AD):
The plane’s electrical generators fall into a failsafe mode if kept continuously powered on for 248 days. The 787 has four such main generator-control units that, if powered on at the same time, could fail simultaneously and cause a complete electrical shutdown.
And from the FAA AD [2015-NM-058-AD May 1st 2015]:
We are adopting a new airworthiness directive (AD) for all The Boeing Company Model 787 airplanes. This AD requires a repetitive maintenance task for electrical power deactivation on Model 787 airplanes. This AD was prompted by the determination that a Model 787 airplane that has been powered continuously for 248 days can lose all alternating current (AC) electrical power due to the generator control units (GCUs) simultaneously going into failsafe mode. This condition is caused by a software counter internal to the GCUs that will overflow after 248 days of continuous power. We are issuing this AD to prevent loss of all AC electrical power, which could result in loss of control of the airplane.
Knowing nothing more than its periodicity, I'll make a prediction about the bug. I'm going to do this by making a successive series of transformations on the period until I find a number that I like. What "like" means in this context will become obvious to the reader.

248 days times 24 hours in a day is 5,952 hours.

5,952 hours times 60 minutes in an hour is 357,120 minutes.

357,120 minutes times 60 seconds in a minute is 21,427,200 seconds. Okay, I'm starting to like this number.

21,427,200 seconds times 10 is 214,272,000 1/10s of a second. Not quite there yet.

214,272,000 1/10s of a second times 10 is 2,142,720,000 1/100s of a second.

2,142,720,000 1/100s of a second. I bet you see it now, don't you? This value is perilously close to the maximum positive number in a signed 32-bit integer variable, which is 2,147,483,647 or 0x7fffffff in hexadecimal.

Suppose, just suppose, you were writing software to manage a power generator in a Boeing 787 Dreamliner. You needed to keep track of time, and you chose 1/100s of a second as your resolution, so that adding 1 to your variable indicated 1/100 of a second. Maybe that was the resolution of your hardware real-time clock. Maybe your RTOS had a time function that returned a value with that resolution. Or maybe - and this is the most common reason in my experience - you decided that was a convenient resolution because it was fine grained enough for your application while having a dynamic range - the difference between the possible maximum and minimum values - that you could not imagine ever exceeding the capacity of the signed 32-bit integer variable into which you stored it. 248 days is a long time. What are the chances of the system running for that long?

So, if we chose a signed 32-bit integer variable, and 1/100s of a second as the resolution of our "tick", when would the value in our variable roll over into the negative range? Because if it rolled over into the negative range - which would happen in the transition from 2,147,483,647 to 2,147,483,648 or 0x80000000 hexadecimal, which in a signed variable would actually be interpreted as -2,147,483,648 - wackiness would ensue. Time would appear to be running backwards, comparisons between timestamps would not be correct, human sacrifice, cats and dogs living together, a disaster of biblical proportions.

That sounds pretty dire. Maybe we better check our math by working backwards.

So the maximum value of 2,147,483,647 in 1/100s of a second divided by 100 is 21,474,836 seconds.

21,474,836 seconds divided by 60 is 357,913 minutes.

357,913 minutes divided by 60 is 5,965 hours.

5,965 hours divided by 24 is 248 days.

I feel pretty confident that, knowing nothing other than the failure's periodicity, we should be looking for some place in the software where time is being kept with a resolution of 1/100s of a second in a signed 32-bit integer.

When this failure was first brought to my attention - which mostly happened because one of my clients is a company that develops avionic hardware and software for commercial and business aircraft - I did this calculation in just a few seconds and was pleased to discover once again that such a simple prediction was trivially easy to make.

I became really good at this kind of thing during my time working in a group at Bell Labs in the late 1990s developing software and firmware for ATM switches and network interface cards. We called this class of bug counter rollover because it manifests in all sorts of contexts, not just time, in any system which is expected to run 24x7 for as long as... well, in the telecommunications world, pretty much forever. I fixed this bug in the protocol stack in the ATM NIC we were developing, and then later proceeded to discover it in pretty much every ATM switch by every vendor our customers hooked our NIC up to.

It became a pretty regular occurrence that I'd be on a conference call, sometimes with our own field support organization, sometimes with developers working for other ATM switch vendors, be told that the system was failing - disconnecting, or rebooting, or some such thing - with some periodicity like "once a day", and within a minute or two of noodling around in my lab notebook I'd astound everyone (except of course members of our own group) by announcing "You are maintaining your T125 timer in a 16-bit signed variable with a resolution of 5 seconds per tick. You have never tested your switch with a connection lasting longer than 22 hours."

Once, when I said something like this, there was a long pause on the line, and an awestruck developer on the other end agreed that that was the case. But once you see the arithmetic - it's so simple you can't really even call it math - it's like the magician showing you how the trick is done. It sorta takes the magic out of it.

There are a few techniques that can be used to help address counter rollover.

Unsigned variables: When an unsigned variable rolls over, unsigned subtraction between the variable and its prior value still yields the correct magnitude, providing the variable has only rolled over once. You just have to understand that the value in your variable isn't absolute, but is only useful when subtracted from some other unsigned value with the same units.

Wider variables or coarser resolution: Modern compilers and processors often offer 64-bit variables. This may increase the dynamic range to the point where a rollover won't occur until the Sun goes nova. Choosing a coarser resolution - a larger tick value - can have the same effect.

Detect the imminent counter rollover so that you can compensate for it: Instead of computing prime = value + increment you first compute temporary = maximum - increment and compare temporary < value; if so, then your level of entropy is about to increase. You need to do something about it in the code.

As developers of systems that are intended to run a long time - longer maybe than we typically even test them for - and especially for systems whose failure may have safety consequences - and that includes avionics systems, and telecommunications systems, and lots of other kinds of systems - we have to be hyperaware of the consequences of our decisions. This is why, whether organizations think of it this way or not, developers are making critical systems engineering decisions on a daily basis just in the choice of measurement units and data types.

Update (2015-05-10)

Coincidentally, I recently took a two day class on DO-178, an FAA guideline on processes to develop avionics software at various levels of safety criticality. At the higher levels of criticality, code test coverage of 100% is required. Surely the software in the Boeing 787 GCUs had to meet the higher Design Assurance Levels (DAL) specified in DO-178, like maybe B or even A, which require extremely stringent processes. I've written before about how unit test code coverage is necessary but not sufficient. In that article I was talking about the need for range coverage. This is yet another example of the same thing, where the range is a duration of time. Maybe we can call this temporal coverage.

Friday, April 03, 2015

Altered Finite States

Folks who have spent time in a flotation tank report that their first experience consisted mainly of figuring out just how the heck to use a flotation tank. Having now had two one-hour sessions in a flotation tank myself, I can confirm that this is accurate.

I-sopod Flotation Tank
Photograph: Wikimedia

Flotation tanks, also known as isolation tanks, and occasionally as sensory deprivation tanks, are these days large horizontal enclosed tubs filled with a few inches of water containing a very high concentration of dissolved epsom salts (magnesium sulfate). The resulting solution is so dense that you can’t help but float when you lie down in it. The solution in the flotation tank is heated to body temperature, sanitized, and continuously filtered. The solution does not taste salty; it tastes vile (take my word on this).

Flotation therapy is used for a variety of purposes, including relaxation and meditation, or (if you’re like me) because you’ve seen the movie Altered States or the television show Fringe. I am happy to report that I neither regressed to a neanderthal, nor did I emerge from the tank in a parallel universe. As far as I can tell, anyway. I did warn Mrs. Overclock ahead of time that she might get a phone call in the event I woke up naked in the Denver zoo.

March: The First Session

For my first session, I attended a local spa called A New Spirit. Yeah, I know. It was full of New Age music, gently falling water, and the smell of incense, just as you might imagine. Being an engineer, I would have been happier if it had been all stainless steel and the employees wore lab coats and carried clipboards. But what really surprised me was that it wasn’t until I completed my session and was ready to leave that I saw a woman who wasn’t an employee; up until then, all of the customers I saw sitting around wearing the terry cloth robes provided by the spa were men, some of whom were waiting their turn in one of the three flotation tanks.

Each tank was in a private room. I showered in one of the two bathrooms to remove surface dirt and oil from my skin. I stored all of my personal effects in the provided tote bag, put on the bath robe, entered the tank room that had been assigned to me, and closed the door. I hung up my robe, inserted the disposable foam earplugs provided by the spa, and, naked as the day I was born (but a lot fatter and hairier), I opened the hatch at the front of the tank and climbed in.

All three of the flotation tanks at this spa were of similar size, but this one had a hinged hatch at one end, while the other two tanks (which I peeked at when they weren’t being used) had sliding hatches in their mid-sections. The tank appeared to be fiberglass, in top and bottom halves. The bottom half had a water tight liner in it. Both the solution and the air inside the tank were heated to body temperature. I knelt down into the solution, closed the hatch, and lay down.

I spent the next hour trying, mostly unsuccessfully, to relax. At first I couldn’t see or hear anything. But gradually I became dimly aware of the New Age music, which didn’t seem that loud even outside of the tank room, but managed to permeate both the ear plugs and my tinnitus. Eventually the music became dominated by the sound of my pulse in my ears, and, after a bit, the noise of water trickling past the earplug in my left ear. This latter sound came to seem as loud as the rushing torrent of a stream in the mountains in the spring.

As my eyes became adjusted to the dark, I became aware of a ring of light around the seam between the top and bottom halves of the tank. That’s kind of remarkable, since the tank room itself was very dimly lit. But unless I turned my head to either side, I couldn’t really see this band of light, nor did it illuminate the interior of the tank enough to actually make out any detail.

I went through the (apparently common) issue of where to put my arms: down at my sides, up above my head, somewhere in between. I felt compelled to explore my surroundings. I’m of average height and I found that if I stretched my legs down and my arms up above my head, I could just barely touch the far ends of the tank. The tank was proportionately narrower, so that I could touch either side without extending my arms very much at all. As I tried to relax and float, I drifted around in the tank, occasionally bumping into the sides.

The spa provided one of those foam "swim noodle" tubular pool toys to use as a sort of pillow, and I had to experiment with that as well: use it, don’t use it, put it under my neck, beneath my head, etc. The spa web site mentioned that some people find sitting up in the tank more comfortable. I tried that, only to discover that the heating elements for the water were below the floor, making my bum uncomfortably warm. Being an old guy, my arms and legs would stiffen up and become uncomfortable, so I had to flex them from time to time, which caused the water to slosh around, which caused me to repeatedly bump into the sides of the tank.

After a loooooooooong subjective period of time, an attendant entered the room, knocked on the outside of the tank to let me know my hour was up, and then left. (They ask that you knock back so that they know you aren’t asleep, or dead, or in a parallel universe, and I did that.) I got out of the tank, dried off, and put my robe on. I left the room, took another shower to wash the epsom salts off, got dressed, paid, and left.

First impression: it was interesting. But I spent most of my hour flailing around. I really needed to try it again.

April: The Second Session

I changed a few things for my second flotation session at A New Spirit. I bought some disposable silicone swimmer’s earplugs that I hoped would do a better job of keeping the solution out of my ears. (They did, but didn’t do any better in blocking out the New Age music.) I took a shower before I left the house to save a little time at the spa. Other than that, this visit started out much the same as the first one. I was even assigned the same tank room.

But this time I spent less time figuring out what to do and just tried to relax. I did better this time. I still had to flex my stiffened arms and legs occasionally. I found that I was the most comfortable with my arms in a “hands up, don’t shoot” posture, below the foam float that I still used behind my head, and sometimes with my hands lightly clasped. I was able to concentrate more on my breathing and less on my no longer so strange environment. I had less tension in my neck and back. All things considered, the second session went a lot better. I had some quality thinking time.

What did I think about? Being a typical male, sex, every few minutes. Sometimes with Blair Brown. But mostly I thought about socket multiplexing.

No, really.

I’ve been working on some open source Linux/GNU-based socket handling code based on work I did under SunOS way back around 1989. (This is part of my Diminuto C-based library you can find on GitHub.)  I had wanted to do some refactoring of the socket multiplexing unit test, which is a simple running example of a single-threaded service provider. I worked out all the details while in the tank. Once I got back to my office at home, I completed what I had originally estimated to be a day’s work in a couple of hours. I also fit in lunch at the local deli on my way home from the spa, and a trip to the gym late in the afternoon to lift weights. I call that a good day.

Future Sessions

I’m going to keep up monthly sessions in the flotation tanks at A New Spirit for a while. It’s part of my effort to learn to relax, to try to reduce my addiction to constant mental stimulation, to increase my attention span, and improve my ability to get into the zone. I’m also going to work on learning to meditate. All of this in an effort to think deeper thoughts, to gain better insights, and to enhance my ability to think longer term, by at least temporarily removing the more or less trivial distractions that are ubiquitous in today's world.

Or maybe I just like lying around buck naked.

Monday, March 16, 2015

Being Evidence-Based Using the sizeof Operator

How long is an int in C?

How does the length of an int compare to that of a long in C++?

These and related questions crop up, I dunno, seems like weekly, in discussion forums related to embedded development on social media sites. What then follows is a lengthy, meandering comment thread full of misinformation, or at least many answers that are only applicable to a very specific toolchain or hardware target, probably the only target the commenter has ever used.

Folks, you don't have to guess at this. You can be evidence-based and find out exactly what the answer is for your particular combination of toolchain and target.

The sizeof operator in C and C++ determines the size in bytes of the variable, array, type, or structure to which it is applied, at compile-time. It can't tell you anything about the dynamic run-time behavior of how your application uses memory. Nor will it give you the results you might expect if the compiler cannot know the size of the argument. But it can tell you all sorts of other useful stuff.

For example, given

struct Header {
 struct Header * next;
 size_t size;
};

typedef struct Header header_t;

header_t datum;

header_t data[10];

header_t * that;

you can use expressions like

sizeof(struct Header)

sizeof(datum)

sizeof(data)

sizeof(*that)

and even

sizeof(data) / sizeof(data[0])

in your C and C++ code to automatically insert appropriate constant values of type size_t instead of coding those values as, say, preprocessor symbols, or hard coding them as integer constants.

You don't have to take my word for it. You can code it up, compile it, and run it yourself, and see what happens.

Below is a little C program that I compile and run every time I find myself using an unfamiliar compiler or processor. And sometimes even a familiar compiler or processor, because I am old and have senior moments. This is from my Diminuto library of stuff I have found useful for doing systems programming under Linux/GNU. But you can just write your own and omit all the Diminuto-specific types.

As long time readers of this blog know, I have run similar code on non-Linux systems like VxWorks-based PowerPCs; on Harvard architectures like Arduino and PIC micro-controllers; on chips that had twenty-four bit integers; on an architecture for which a signed integer type was not the same length as an unsigned of the same integer type. It pays to check and not guess. There's a lot of crazy stuff out there.

(Note that the blogger editor may wrap some of the source lines. You can always get the original source code from GitHub.)

/* vi: set ts=4 expandtab shiftwidth=4: */
/**
 * @file
 * Copyright 2014 Digital Aggregates Corporation, Colorado, USA
 * Licensed under the terms in README.h
 * Chip Overclock
 *
 * There's a lot of duplication here, but I'm paranoid that way. Remarkably,
 * I once worked on an embedded project using a proprietary non-GNU C compiler
 * in which the sizeof of signed type was not the same as the sizeof of the
 * unsigned of the same type. I was a little astounded by that. Also, note
 * that you can take the sizeof a void and of a function type (as opposed to a
 * void or function pointer). It's good to know these things.
 */

#include <stdio.h>
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>
#include "com/diag/diminuto/diminuto_types.h"

#define printsizeof(_TYPE_) printf("sizeof(%s)=%zu\nsizeof(%s*)=%zu\n", #_TYPE_, sizeof(_TYPE_), #_TYPE_, sizeof(_TYPE_*))

typedef enum Enum { ENUM = 0 } enum_t;

typedef void (function_t)(void);

int main(void)
{
    printsizeof(signed char);
    printsizeof(unsigned char);
    printsizeof(signed short);
    printsizeof(unsigned short);
    printsizeof(signed int);
    printsizeof(unsigned int);
    printsizeof(signed long);
    printsizeof(unsigned long);
    printsizeof(long long);
    printsizeof(signed long long);
    printsizeof(unsigned long long);
    return 0;
}

My current build system is a Dell PC with a four-core 2.4GHz Pentium processor. It's running Ubuntu 14.04, Linux 3.13.0, and gcc 4.8.2. Here is what I get when I run the program there.

sizeof(signed char)=1
sizeof(signed char*)=8
sizeof(unsigned char)=1
sizeof(unsigned char*)=8
sizeof(signed short)=2
sizeof(signed short*)=8
sizeof(unsigned short)=2
sizeof(unsigned short*)=8
sizeof(signed int)=4
sizeof(signed int*)=8
sizeof(unsigned int)=4
sizeof(unsigned int*)=8
sizeof(signed long)=8
sizeof(signed long*)=8
sizeof(unsigned long)=8
sizeof(unsigned long*)=8
sizeof(long long)=8
sizeof(long long*)=8
sizeof(signed long long)=8
sizeof(signed long long*)=8
sizeof(unsigned long long)=8
sizeof(unsigned long long*)=8

My current reference target is an Nvidia Jetson board with a TK1 four-core ARMv7 processor. It's running Ubuntu 14.04, Linux 3.10.24, and gcc 4.8.2. Here is what I get when I run the program there.

sizeof(signed char)=1
sizeof(signed char*)=4
sizeof(unsigned char)=1
sizeof(unsigned char*)=4
sizeof(signed short)=2
sizeof(signed short*)=4
sizeof(unsigned short)=2
sizeof(unsigned short*)=4
sizeof(signed int)=4
sizeof(signed int*)=4
sizeof(unsigned int)=4
sizeof(unsigned int*)=4
sizeof(signed long)=4
sizeof(signed long*)=4
sizeof(unsigned long)=4
sizeof(unsigned long*)=4
sizeof(long long)=8
sizeof(long long*)=4
sizeof(signed long long)=8
sizeof(signed long long*)=4
sizeof(unsigned long long)=8
sizeof(unsigned long long*)=4

See? Now was that so hard?