Saturday, July 27, 2019

Time Flies Again

Friday, the European Union Aviation Safety Agency, the EU's equivalent of the U.S. Federal Aviation Administration (FAA), issued a revised mandatory Airworthiness Directive for the Airbus A350-941 jet airliner. Quoting from AD 2017-0129R1, the EASA says:
Prompted by in-service events where a loss of communication occurred between some avionics systems and avionics network, analysis has shown that this may occur after 149 hours of continuous aeroplane power-up. Depending on the affected aeroplane systems or equipment, different consequences have been observed and reported by operators, from redundancy loss to complete loss on a specific function hosted on common remote data concentrator and core processing input/output modules. 
This condition, if not corrected, could lead to partial or total loss of some avionics systems or functions, possibly resulting in an unsafe condition. 
To address this potential unsafe condition, Airbus issued the AOT to provide instructions to reset the internal timer. Consequently, EASA issued AD 2017-0129 to require repetitive on ground power cycles (resets).
This means exactly what you think it does: you need to power cycle your A350-941 aircraft no less often than every 149 hours unless and until a software fix is applied.

So: what's special about 149 hours? The AD doesn't say. Let's see if we can figure that out using a little firmware forensics.

Multiply 149 hours by 60 to get 8940 minutes.

Multiply that by 60 to get 536400 seconds.

Multiply that by 1000 to get 536400000 milliseconds.

Finally, multiply that by 4 (the number of 250 microsecond ticks in each millisecond) to get 2145600000.

We could equivalently have multiplied 536400 seconds by one million to get microseconds, then divided it by 250 to get the same value.

So what?

2145600000 is perilously close to 2147483647, or 0x7FFFFFFF in hexadecimal, which is (2^31 - 1), the largest positive number that can be stored in a 32-bit variable in two's complement binary form.

So I feel pretty confident in making this prediction: somewhere in the A350-941 firmware or software there is a 32-bit signed variable that is incremented every 250 microseconds. After doing so for exactly 149 hours, 7 minutes, and 50.91175 seconds, the very next 250 microsecond clock tick will make that variable overflow and its value will transition from positive to negative as it increments from 0x7FFFFFFF to 0x80000000.

Wackiness ensues.

(An alternative hypothesis is a 32-bit unsigned variable which is incremented twice as often, or every 125 microseconds. It eventually wraps from 0xFFFFFFFF to 0x00000000, similarly confusing an algorithm.)
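A few lines of C make the arithmetic concrete. This is purely an illustration of the rollover, not the actual avionics code:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* A signed 32-bit tick counter incremented every 250 microseconds
       reaches its maximum value after INT32_MAX ticks. */
    double seconds = (double)INT32_MAX * 250E-6;
    printf("rollover after %.5f seconds (about %.2f hours)\n",
        seconds, seconds / 3600.0);

    /* The very next tick wraps it from the largest positive value to the
       most negative one. (The increment is done in unsigned arithmetic
       here only to avoid undefined behavior in the demonstration.) */
    int32_t counter = INT32_MAX;
    counter = (int32_t)((uint32_t)counter + 1U);
    printf("counter goes from %d to %d\n", (int)INT32_MAX, (int)counter);

    return 0;
}

On a typical two's complement machine this prints a rollover time of 536870.91175 seconds, which is the 149 hours, 7 minutes, and 50.91175 seconds cited above.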

Long time readers of my blog will recognize that this counter rollover bug is similar to the one I previously described in the Boeing 787 Dreamliner; that bug could result in the loss of all electrical power on the aircraft after 248 days.

Commercial aircraft in service can stay powered up for a long time. They transition from aircraft power to ground power when at the gate, maintaining power so that the interior can be cleaned, the galleys restocked, the air conditioning kept running while boarding, and so forth. Then they go back to on-board power until the flight lands and the cycle repeats.

248 days is a long time. But keeping an aircraft powered up for 149 hours seems not only plausible, but likely. So how was this not uncovered during testing?

Counter rollover is a class of bug that I spent a significant portion of my career ferreting out of firmware and software for telecommunications systems - the kinds of systems expected to run 24x7 with extraordinarily high reliability - during my time at Bell Labs. It kept cropping up in the products I was helping develop (sometimes, admittedly, in my own code), and also in other vendors' products with which we or our customers were integrating our equipment.

Part of systems engineering - the art and craft of discovering, defining, and specifying requirements for a product, and ensuring that they are met - must be deciding how long a product or component is expected to run before it must be power cycled. No human artifact is perfect, and nothing runs forever. I dimly remember during my mainframe days that IBM recommended periodically rebooting OS/360 on our 360/65 because of control block degradation; a euphemism, I suspect, for memory corruption or counter overflow.

But 149 hours? That does not seem like a very long time to me.

Monday, June 17, 2019

This Is What You Have To Deal With

I'm passing this anecdote along not because I'm asking for help (although any insight would be welcome), but as an example of what low-level real-time development looks like.

I spent several days troubleshooting what I still assume to be a bug in my software that deals with output from Global Navigation Satellite System (GNSS) receivers - of which the U.S. Global Positioning System (GPS) is one example. The problem concerned the synchronization of the NMEA, UBX, and RTCM state machines with the input stream from the U-blox UBX-ZED-F9P receiver (a U-blox 9 device) on my development system named Nickel (Intel NUC5i7RYH, Ubuntu 18.04.1).

NMEA is the typical output of GNSS receivers, consisting of "sentences" of printable ASCII characters. UBX is a proprietary binary format used for messaging to and from GNSS receivers manufactured by U-blox. RTCM is a standard binary format used for messaging to and from GNSS receivers that support differential GNSS (DGNSS). When a receiver like the F9P generates all three types of output, your software has to be able to grok all three formats and switch between them seamlessly.
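To give a flavor of what that switching involves, here is a minimal sketch (not the actual Hazer code) that classifies a frame by its leading bytes: NMEA sentences begin with a '$', UBX packets with the sync bytes 0xB5 0x62, and RTCM 3 messages with the preamble byte 0xD3.

#include <stddef.h>
#include <stdint.h>

enum format { FORMAT_UNKNOWN, FORMAT_NMEA, FORMAT_UBX, FORMAT_RTCM };

/* Sketch only: classify a buffer by its leading byte(s). The real state
   machines do this incrementally, byte by byte, and validate lengths and
   checksums before trusting the classification. */
static enum format classify(const uint8_t * buffer, size_t length)
{
    if ((length >= 1) && (buffer[0] == '$')) { return FORMAT_NMEA; }
    if ((length >= 2) && (buffer[0] == 0xB5) && (buffer[1] == 0x62)) { return FORMAT_UBX; }
    if ((length >= 1) && (buffer[0] == 0xD3)) { return FORMAT_RTCM; }
    return FORMAT_UNKNOWN;
}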

I tested the same code on Cadmium (NUC7i7BNH, Ubuntu 16.04.5) and saw the same symptom: every few tens of seconds my software would lose sync with the input stream, then regain it shortly afterwards.

I could not reproduce the problem on Gold, an ARM-based Raspberry Pi (Broadcom BCM2837B0 SoC, Raspbian 9.4) using exactly the same application software and the same receiver hardware, nor on Bodega or Mochila, also ARM-based Pis (also Broadcom BCM2837B0, Raspbian 9.8).

So I ran the standard Linux utility socat on Nickel to save about a minute's worth of NMEA data (UBX and RTCM weren't enabled on the F9P for this test). I ran my software against the file (so: no real-time constraints) and recorded the Sync Lost messages that included the offset into the data where the errors occurred. At this point I was pretty sure I'd found a bug in the state machines in my software.

Then I did a hex dump of the original file generated by socat, checked at those offsets, and lo and behold the NMEA stream was corrupted: periodically a spurious handful of bytes, looking suspiciously like a fragment of an NMEA sentence, was inserted at the end of a valid NMEA sentence.

So this corruption occurs without my software being involved at all.

I ran my software on Nickel using the GlobalSat BU353W10 receiver (a U-Blox 8 device), and I saw similar occasional loss of sync due to corruption of the NMEA stream. As before, the same software and GPS hardware on a Pi worked without problems.

I updated Cadmium to Ubuntu 19.04 and observed the same misbehavior with the BU353W10. I updated the BIOS firmware on Cadmium to the latest version and it still lost sync every few tens of seconds.

I built the same software and used the BU353W10 receiver on Mercury (Dell OptiPlex 7040, Ubuntu 14.04.4) and saw sync lost every few tens of seconds.

The evidence points to an issue either with the USB hardware or firmware on all of my Intel boxes, or with the USB software stacks in many versions of Ubuntu (even though both Ubuntu and Raspbian are based on Debian, making it likely that their USB stacks share a common provenance).

I'm not buying it. Either seems unlikely. What is the likelihood that I'm the only developer in existence to have run into this? Gotta be pretty much zero. And my search-fu has turned up nothing.

I'd like this to be a bug in my software - and I am still assuming it is, perhaps a flaw shared with socat - because then I could fix it. But its root cause remains a mystery.

I use the Intel-based NUCs as my development systems (the one Intel-based Dell box is a legacy development system), and the Pis as test systems, typically deploying to ARM-based systems in the field.

I have no clue at this point what this is. But when I figure it out - and I hope that I do - I will have learned something useful.

Update (2019-06-19)

Just by way of providing actual evidence, here is a snippet extracted from about one minute of data collected via the command
cat /dev/ttyACM0 > FILE
running on Nickel, the Intel NUC5i7RYH running Ubuntu 18.04.1, from a BU353W10 (U-blox 8) USB-dongle GPS receiver:

$GNGSA,M,3,07,30,08,28,09,11,27,18,51,48,13,01,1.04,0.55,0.88*1D\r\n
$GNGSA,M,3,78,87,88,77,68,69,79,67,81,,,,1.04,0.55,0.88*12\r\n
7A\r\n
$GNGGA,141257.00,3947.65236,N,10509.20014,W,2,12,0.57,1715.2,M,-21.5,M,,0000*4A\r\n
$GNGSA,M,3,07,30,08,28,09,11,27,51,48,13,01,05,1.05,0.57,0.88*12\r\n

The 7A\r\n ('7', 'A', carriage return, line feed) inserted between two sentences is obvious. It has the appearance of a checksum and standard end matter of another NMEA sentence (and in fact there are several sentences following this corruption that happen to have 0x7A as their checksum).
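For reference, the NMEA checksum is just the exclusive-OR of every character between the leading '$' and the trailing '*'. Here is a minimal sketch of verifying it (not the Hazer implementation); the example sentence is the GNGGA sentence from the capture above.

#include <stdio.h>

/* Compute the NMEA 0183 checksum: the XOR of every character between
   the '$' and the '*', exclusive of both. */
static unsigned char nmea_checksum(const char * sentence)
{
    unsigned char cs = 0;
    const char * p = sentence;
    if (*p == '$') { ++p; }
    while ((*p != '\0') && (*p != '*')) {
        cs ^= (unsigned char)*(p++);
    }
    return cs;
}

int main(void)
{
    const char * s = "$GNGGA,141257.00,3947.65236,N,10509.20014,W,2,12,0.57,1715.2,M,-21.5,M,,0000*4A";
    /* If the sentence arrived intact, the computed value matches the 4A following the '*'. */
    printf("computed checksum: %02X\n", nmea_checksum(s));
    return 0;
}

A spurious fragment like the 7A\r\n above has no leading '$' and no body to checksum, so a parser expecting the start of the next sentence loses synchronization right there.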

Keep in mind that these corruptions don't appear when I run the same test on an ARM-based Raspberry Pi. This makes me wonder if there is another process that runs on the NUC but not the Pi that is consuming data from the same port. This isn't as crazy as it sounds; I have inadvertently run redundant instances of my gpstool software or the gpsd GPS daemon and run into similar issues (yes, I already checked for that). However, running the commands
fuser /dev/ttyACM0
and
lsof | grep ttyACM0
(which displays what process has a particular file open, and all open files and the processes using them, respectively) only show the cat command when I run this test.

An interesting difference between the Ubuntu NUC and the Raspbian Pi is that on the former it isn't necessary to configure /dev/ttyACM0 as a serial port (baud rate, raw mode, etc.) in my application; on the latter, it is (although it doesn't seem to matter to what I set the baud rate).
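For the record, the kind of serial port configuration meant here looks roughly like the following sketch using termios (not the actual gpstool code):

#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/* Sketch: open a serial device and put it into raw mode at the given
   speed. Returns an open file descriptor, or -1 on error. */
static int open_raw(const char * device, speed_t speed)
{
    int fd = open(device, O_RDWR | O_NOCTTY);
    if (fd < 0) { return -1; }

    struct termios tio;
    if (tcgetattr(fd, &tio) < 0) { close(fd); return -1; }

    cfmakeraw(&tio);           /* no canonical processing, no echo, 8 data bits, no parity */
    cfsetispeed(&tio, speed);  /* e.g. B9600 */
    cfsetospeed(&tio, speed);

    if (tcsetattr(fd, TCSANOW, &tio) < 0) { close(fd); return -1; }

    return fd;
}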

Tuesday, March 26, 2019

Decoding Special Machine Instruction Sequences with Ghidra

Because of the low-level development I do, I am still from time to time called upon to decode existing, or write new, sequences of just a few assembler instructions to accomplish very specialized operations. I usually do this with the relevant processor handbook lying in my virtual lap. While it is true that during my career in the 1970s and 1980s I probably wrote hundreds of thousands of lines of assembler code for various machines (I wrote my master's thesis project in PDP-11 assembler, for example), most of that was done before I had access to a C compiler.

When I started playing with Ghidra, the U.S. National Security Agency's recently open sourced software reverse engineering (SRE) tool, I was curious how its C decompiler dealt with instruction sequences that were perhaps outside of the normal idiomatic sequences it had been programmed to recognize. This was easy enough to find out: I had several examples of such sequences in my own repositories on GitHub.

This work was done with Ghidra 9.0 dated 2019-02-28.

(As usual, you can click on an image to be taken to a larger version.)

Scattergun

Scattergun is my project to evaluate hardware entropy generators. These are devices which use transistor noise or even quantum mechanical effects to generate random bits for applications like creating unguessable encryption keys.

Two of the generators I tested in Scattergun were the Intel rdrand and rdseed machine instructions. rdrand is a general purpose hardware random number generator introduced in the Intel Ivy Bridge architecture. It generates random numbers using an algorithm that is seeded, and periodically reseeded, by a hardware entropy source built into the processor. rdseed is a hardware entropy generator introduced in the Intel Broadwell architecture. It provides access directly to the hardware entropy source built into the processor.

Scattergun provides the seventool utility that can be used to generate random bits using either rdseed or rdrand to feed to the Scattergun test suite. The functions in seventool that access those instructions look like this.

Screen Shot 2019-03-21 at 3.58.31 PM

There's a lot of conditional compilation here because I needed seventool to compile and run on several different vintages of GNU C compiler suites. Some of them did not have assemblers that recognized the rdrand and rdseed mnemonics. Some of them had compilers that already had built-in functions to generate the appropriate instruction sequences. But for the i7-6700T processor on which I ran these Scattergun tests, I used the compilation options that used the rdrand and rdseed assembler mnemonics. These functions (which weren't actually, as it turns out, inlined by the GNU C compiler) generate code to execute the appropriate machine instruction in-line, fetch the indication of success from the carry bit, and return that as the function value while saving the random goodness in a value-result parameter.
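As a rough idea of what such a wrapper looks like, here is a minimal sketch using GNU C extended inline assembly; the actual seventool functions, with all their conditional compilation, differ in detail. (Ghidra, below, was of course run against the actual seventool binary, not this sketch.)

#include <stdint.h>

/* Sketch: execute RDRAND, store the random word through the value-result
   parameter, and return the success indication from the carry flag
   (1 = valid random value, 0 = not ready, try again). */
static inline int my_rdrand64(uint64_t * valuep)
{
    unsigned char carry;
    __asm__ __volatile__ (
        "rdrand %0 ; setc %1"
        : "=r" (*valuep), "=qm" (carry)
        : /* no inputs */
        : "cc"
    );
    return (int)carry;
}

/* The rdseed wrapper is identical except for the mnemonic. */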

Here's what Ghidra had to say about these functions. In the first screen shot, I asked Ghidra to decompile the rdrand C function.

Screen Shot 2019-03-26 at 9.47.36 AM

 You can see the RDRAND machine instruction in the center screen showing the disassembly.

I did the same for the rdseed function.

Screen Shot 2019-03-26 at 9.43.37 AM

Similarly, you can see the disassembled RDSEED machine instruction in the disassembly.

Ghidra decompiled both the rdseed and rdrand instruction sequences into pseudo-function calls, representing both the actual machine instruction and the check of the carry bit for success. It's pretty clear that the x86_64 instruction decode for Ghidra has been configured to recognize both of these as idiomatic machine instruction sequences. That's pretty remarkable.

Or maybe not. I probably shouldn't be surprised that the premier encryption cracking organization on the planet is especially interested in detecting applications that use hardware features designed for creating encryption keys. (The fact that I was able to get to this point within about an hour of installing Ghidra might say something about the hazards of meta-analysis revealed by open sourcing tools like this.)

Diminuto

Diminuto, my C systems programming library for (mostly) Linux/GNU platforms, includes some definitions to make it easy for me to play with memory barriers: machine instruction sequences available on many architectures (not just x86_64) that enforce cache coherency on multi-core processors. The good news is that multi-core architectures have machine instructions to explicitly do this (that is, they do if they work), or at least instructions that do this as a side effect. The better news is that modern GNU C compilers have built-in pseudo-functions that generate the appropriate instruction sequences for each architecture, making it easier to write portable code that uses memory barriers.

(I confess there is a lot of "do as I say and not as I do" in the Diminuto memory barrier code, which is pretty experimental. You are for the most part much better off coding POSIX mutex operations, which use memory barriers as part of their implementation. Diminuto can help you with that, too.)

Here is a unit test program in Diminuto that calls the general memory barrier function diminuto_barrier several times. It then calls the read barrier function diminuto_acquire, and then the write barrier function diminuto_release. Interspersed are assignment operations to a volatile variable that in my naivety I had hoped would keep the compiler from optimizing them out.

Screen Shot 2019-03-19 at 2.14.20 PM

A read barrier (which is said to have acquire semantics) is intended to block the execution core until its cache has been updated with the latest values from other cores. A write barrier (with release semantics) is intended to block the execution core until its cache changes have been propagated to the other cores. The general barrier does both.

Here is the implementation of all three functions.

Screen Shot 2019-03-19 at 2.16.02 PM

You can see that they are all implemented as preprocessor definitions that expand to predefined GNU C functions. The __sync_synchronize function implements a general (both read and write) memory barrier. The read and write barriers are a little more complicated;  they use the GNU C __sync_lock_test_and_set and __sync_lock_release functions, not for their intended purpose as machine-level synchronization mechanisms using busy waiting (spin locks), but because they implement read and write memory barriers as a side effect.
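In the same spirit, here is a hedged sketch of how such definitions can be written; the names are hypothetical and the actual Diminuto macros differ in detail:

/* Full memory barrier: both read and write. */
#define my_barrier() __sync_synchronize()

/* Read barrier (acquire semantics), obtained as a side effect of the GNU C
   atomic test-and-set built-in; the value stored is never examined. */
#define my_acquire() \
    do { static int _lock = 0; (void)__sync_lock_test_and_set(&_lock, 1); } while (0)

/* Write barrier (release semantics), obtained as a side effect of the
   corresponding lock-release built-in. */
#define my_release() \
    do { static int _lock = 0; __sync_lock_release(&_lock); } while (0)

This is consistent with the MFENCE, XCHG, and MOV instructions that show up in the x86_64 disassembly below.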

Here is how Ghidra decodes this unit test for code compiled for the x86_64.

Screen Shot 2019-03-21 at 4.09.56 PM

The disassembly in the center screen is straightforward. There are the three MFENCE instructions corresponding to the diminuto_barrier function calls that each expanded into __sync_synchronize. You can see the MOV instructions that do the variable assignments. You can see the XCHG that implements the __sync_lock_test_and_set in diminuto_acquire (there's no loop because we just want the read barrier as a side effect), and a MOV that implements the __sync_lock_release in diminuto_release.

The C decompile on the right is more problematic. Not only are none of the memory barrier operations represented in any way, but the assignment statements that were in the original C source are missing as well. In fact, the only code shown in the decompile is that which was generated in the unit test framework EXIT macro that handles the end of test processing. The entire body of the unit test is gone.

It is no more than a guess on my part, but maybe Ghidra fell into a machine instruction sequence that it didn't recognize, and rather than add a notation in the decompile it simply ignored it. The authors of Ghidra have a lot more experience using it to reverse engineer software than I do (which is, basically, none). But this seems unfortunate, since the use of memory barriers is a common idiom in the kinds of low-level work that I do, particularly with devices that expose data registers by memory mapping them into the processor address space.

Hazer

Hazer is my project to evaluate Global Navigation Satellite System (GNSS) receivers, of which the Global Positioning System (GPS) receiver that may be in your mobile phone is one example. The gpstool utility in Hazer makes use of Diminuto scoped operators: operators that have an explicit begin (opening) and end (closing) syntax that correspond to underlying begin and end semantics.

One example makes use of the memory barriers we just discussed to implement the Coherent Section scoped operators.

Screen Shot 2019-03-26 at 10.43.46 AM

These Coherent Section operators, when used at the beginning and ending of a section of code, automatically perform a read barrier at the beginning of the section and a write barrier at the end of the section.
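The shape of such a pair of scoped operators, in a hedged sketch with hypothetical names (the actual Diminuto macros differ), building on the barrier sketch in the Diminuto section above:

/* A coherent section performs a read barrier on entry and a write barrier
   on exit; my_acquire() and my_release() are the barrier operations
   sketched earlier. */
#define MY_COHERENT_SECTION_BEGIN \
    do { my_acquire();

#define MY_COHERENT_SECTION_END \
    my_release(); } while (0)

static volatile int done = 0;

/* Example: fetch a flag that another thread of control may have set. */
static int check_done(void)
{
    int copy;
    MY_COHERENT_SECTION_BEGIN;
        copy = done;
    MY_COHERENT_SECTION_END;
    return copy;
}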

Perhaps a little more typical are the Critical Section scoped operators.

Screen Shot 2019-03-26 at 10.44.27 AM

The Critical Section begin operator automatically causes the application to acquire (block until it is available and then lock) a POSIX mutex passed in by the application, and the end operator automatically releases (unlocks) the mutex as the flow of control leaves.
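Again as a hedged sketch with hypothetical names; the real macros also handle the POSIX mutex clean up that shows up in the Ghidra analysis below, which is omitted here:

#include <pthread.h>

/* A critical section locks the caller's mutex on entry and unlocks it on
   exit, so the bracketed code executes under mutual exclusion. */
#define MY_CRITICAL_SECTION_BEGIN(_MUTEXP_) \
    do { pthread_mutex_t * my_mutexp = (_MUTEXP_); \
         pthread_mutex_lock(my_mutexp);

#define MY_CRITICAL_SECTION_END \
         pthread_mutex_unlock(my_mutexp); } while (0)

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int events = 0;

/* Example: called from a polling thread when an event is observed. */
static void signal_event(void)
{
    MY_CRITICAL_SECTION_BEGIN(&mutex);
        events += 1;    /* read-modify-write access under the lock */
    MY_CRITICAL_SECTION_END;
}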

The Hazer gpstool utility has an option to monitor the 1PPS (One Pulse Per Second) 1Hz timing signal that is generated by some GNSS receivers. One of the ways that serial-attached receivers generate this pulse is by toggling the Data Carrier Detect (DCD) control line on the serial port. gpstool uses a separate thread to follow DCD and communicate its state back to the main thread.

Screen Shot 2019-03-21 at 4.15.23 PM

The dcdpoller function uses the Coherent Section scoped operators to bracket the reading of the done variable that tells the thread to exit (when, for example, the application terminates), and the more conventional Critical Section scoped operators to synchronize read-modify-write access to the oneppsp variable that it uses to inform the main thread that a 1PPS event has occurred. (The function can also optionally strobe an external GPIO pin to pass the DCD state on to another device.)

I routinely run gpstool on both x86_64 and ARM architectures. Here is the Ghidra analysis of the dcdpoller function on the x86_64.

Screen Shot 2019-03-21 at 4.20.21 PM

The XCHG and MOV instructions that form the beginning and ending of the Coherent Section are clearly visible in the disassembly, but are completely absent in any form in the decompile. But at least the check of the done variable is present in the decompilation. The beginning and ending of the Critical Section, including the handling of the POSIX mutex clean up, is well represented.

Here is the Ghidra analysis of the same function compiled for the ARM.

Screen Shot 2019-03-21 at 4.49.36 PM

The ARM ldrex (Load Exclusive) and strex (Store Exclusive) instructions are how the ARM implements an atomic test-and-set operation. These are visible in the disassembly where the original C code enters the Coherent Section. The lock is released when the flow of control exits the Coherent Section via (I think) conventional mov (Move) and str (Store) instructions. The decompile represents this using two successive coproc_moveto_Data_Memory_Barrier pseudo function calls. The disassembly shows that the done value is accessed inside the Coherent Section, but this isn't at all obvious from the decompile. The decompile of the Critical Section using the POSIX mutex was recognizable as being the same operation as in the x86_64 analysis, which is no surprise; it's probably the same underlying implementation, with only minor differences for architecture specific operations like spin locks.

(It occurred to me later to do a Ghidra analysis of the Diminuto memory barrier unit test compiled for the ARM instead of the x86_64. I got results consistent with this Hazer analysis: the presence of many coproc_moveto_Data_Memory_Barrier pseudo function calls in the decompile, but the body of the unit test was missing except for an empty do-while-false block.)

Conclusions

I'm not an expert in either x86_64 or ARM assembler coding, but at least I was able to make my way through the disassembled listings without much drama, and with only one reference to a processor manual. The decompiled listings were useful as well, but it would have been easy to be misled had I not been the author of the original program being analyzed, and so known what to expect.

Ghidra is going to be my go-to tool for this kind of analysis. In fact, I can easily see using it as a static analysis tool when coding specialized instruction sequences in C or C++ to make sure I haven't botched something. I have already decided to make some changes to the few (okay, two) places in my libraries where I make use - probably ill-advisedly - of Coherent Sections outside of unit tests. I probably wouldn't have figured that out without Ghidra.

Ghidra ain't perfect, but it ain't bad - I'm amazed the decompiler works at all - and the price is right.