Monday, June 17, 2019

This Is What You Have To Deal With

I'm passing this anecdote along not because I'm asking for help (although any insight would be welcome), but as an example of what a low-level real-time development looks like.

I spent several days troubleshooting what I still assume to be a bug in my software that deals with output from Global Navigation Satellite System (GNSS) - of which the U.S. Global Positioning System (GPS) is one example - receivers, regarding the synchronization of the NMEA, UBX, and RTCM state machines with the input stream from the U-blox UBX-ZED-F9P receiver (a U-Blox 9 device) on my development system named Nickel (Intel NUC5i7RYH, Ubuntu 18.04.1).

NMEA is the typical output of GNSS receivers, consisting of "sentences" of printable ASCII characters. UBX is a proprietary binary format used for messaging to and from GNSS receivers manufactured by U-blox. RTCM is a standard binary format used for messaging to and from GNSS receivers that support differential GNSS (DGNSS). When a receiver like the F9P generates all three types of output, your software has to be able to grok all three formats and switch between them seamlessly.

I tested the same code on Cadmium (NUC7i7BNH, Ubuntu 16.04.5) and saw the same symptom: every tens of seconds or so my software would loose sync with the input stream than regain it shortly afterwards.

I could not reproduce the problem on Gold, an ARM-based Raspberry Pi (Broadcom BCM2837B0 SoC, Raspbian 9.4) using exactly the same application software and the same receiver hardware, nor on Bodega or Mochila, also ARM-based Pis (also Broadcom BCM2837B0, Raspbian 9.8).

So I ran the standard Linux utility socat on Nickel to save about a minute's worth of NMEA data (UBX and RTCM weren't enabled on the F9P for this test). I ran my software against the file (so: no real-time constraints) and recorded the Sync Lost messages that included the offset into the data where the errors occurred. At this point I was pretty sure I'd found a bug in the state machines in my software.

Then I did a hex dump of the original file generated by socat, checked at those offsets, and lo and behold the NMEA stream was corrupted: periodically a spurious handful of bytes, looking suspiciously like a fragment of an NMEA sentence, was inserted at the end of a valid NMEA sentence.

So this corruption occurs without my software being involved at all.

I ran my software on Nickel using the GlobalSat BU353W10 receiver (a U-Blox 8 device), and I saw similar occasional loss of sync due to corruption of the NMEA stream. As before, the same software and GPS hardware on a Pi worked without problems.

I updated Cadmium to Ubuntu 19.04 and observed the same misbehavior with the BU353W10. I updated the BIOS firmware on Cadmium to the latest version and it still lost sync every tens of seconds.

I built the same software and used the BU353W10 receiver on Mercury (Dell OptiPlex 7040, Ubuntu 14.04.4) and saw sync lost every tens of seconds.

The evidence points an issue either with the USB hardware or firmware on all of my Intel boxes, or with the USB software stacks in many versions of Ubuntu (even though both Ubuntu and Raspbian are based on Debian, making it likely that their USB stacks share a common provenance).

I'm not buying it. Either seem unlikely. What is the likelihood that I'm the only developer in existence to have run into this? Gotta be pretty much zero. And my search-fu has turned up nothing.

I'd like this to be a bug in my software - and I am still assuming it is, perhaps a flaw shared with socat - because then I could fix it. But its root cause remains a mystery.

I use the Intel-based NUCs as my development systems (the one Intel-based Dell box is a legacy development system), and the Pis as test systems, typically deploying to ARM-based systems in the field.

I have no clue at this point what this is. But when I figure it out - and I hope that I do - I will have learned something useful.

Update (2019-06-19)

Just by way of providing actual evidence, here is a snippet extracted from about one minute of data collected via the command
cat /dev/ttyACM0 > FILE
running on Nickel, the Intel NUC5i7RYH running Ubuntu 18.04.1, from a BU353W10 (U-blox 8) USB-dongle GPS receiver:


The 7A\r\n ('7', 'A', carriage return, line feed) inserted between two sentences is obvious. It has the appearance of a checksum and standard end matter of another NMEA sentence (and in fact there are several sentences following this corruption that happen have 0x7A as their checksum).

Keep in mind that these corruptions don't appear when I run the same test on an ARM-based Raspberry Pi. This makes me wonder if there is another process that runs on the NUC but not the Pi that is consuming data from the same port. This isn't as crazy as it sounds; I have inadvertently run redundant instances of my gpstool software or the gpsd GPS daemon and run into similar issues (yes, I already checked for that). However, running the commands
fuser /dev/ttyACM0
lsof | grep ttyACM0
(which displays what process has a particular file open, and all open files and the processes using them, respectively) only show the cat command when I run this test.

An interesting difference between the Ubuntu NUC and the Raspbian Pi is that on the former it isn't necessary to configure /dev/ttyACM0 as a serial port (baud rate, raw mode, etc.) in my application; on the latter, it is (although it doesn't seem to matter to what I set the baud rate).

Tuesday, March 26, 2019

Decoding Special Machine Instruction Sequences with Ghidra

Because of the low level of development I do, I am still from time to time called upon to decode existing, or write new, sequences of just a few assembler instructions to accomplish very specialized operations. I usually do this with the relevant processor handbook laying in my virtual lap. While it is true that during my career in the 1970s and 1980s I probably wrote hundreds of thousands of lines of assembler code for various machines (I wrote my master's thesis project in PDP-11 assembler, for example), most of that was done before I had access to a C compiler.

When I started playing with Ghidra, the U.S. National Security Agency's recently open sourced software reverse engineering (SRE) tool, I was curious how its C decompiler dealt with instruction sequences that were perhaps outside of the normal idiomatic sequences it had been programmed to recognize. This was easy enough to find out: I had several examples of such sequences in my own repositories on GitHub.

This work was done with Ghidra 9.0 dated 2019-02-28.
(As usual, you can click on an image to be taken to a larger version.)

Scattergun is my project to evaluate hardware entropy generators. These are devices which use transistor noise or even quantum mechanical effects to generate random bits for applications like creating unguessable encryption keys.

Two of the generators I tested in Scattergun were the Intel rdrand and rdseed machine instructions. rdrand is a general purpose hardware random number generator introduced in the Intel Ivy Bridge architecture. It generates random numbers using an algorithm that is seeded, and periodically reseeded, by a hardware entropy source built into the processor. rdseed is a hardware entropy generator introduced in the Intel Broadwell architecture. It provides access directly to the hardware entropy source built into the processor.

Scattergun provides the seventool utility that can be used to generate random bits using either rdseed or rdrand to feed to the Scattergun test suite. The functions in seventool that access those instructions look like this.

Screen Shot 2019-03-21 at 3.58.31 PM

There's a lot of conditional compilation here because I needed seventool to compile run on several different vintages of GNU C compiler suites. Some of them did not have assemblers that recognized the rdrand and rdseed mnemonics. Some of them had compilers that already had built-in functions to generate the appropriate instruction sequences. But for the i7-6700T processor on which I ran these Scattergun tests, I used the compilation options that used the rdrand and rdseed assembler mnemonics. These functions (which weren't actually, as it turns out, inlined by the GNU C compiler) generate code to execute the appropriate machine instruction in-line, fetch the indication of success from the carry bit, and return that as the function value while saving the random goodness in a value-result parameter.

Here's what Ghidra had to say about these functions. In the first screen shot, I asked Ghidra to decompile the rdrand C function.

Screen Shot 2019-03-26 at 9.47.36 AM

 You can see the RDRAND machine instruction in the center screen showing the disassembly.

I did the same for the rdseed function.

Screen Shot 2019-03-26 at 9.43.37 AM

Similarly, you can see the disassembled RDSEED machine instruction in the disassembly.

Ghidra decompiled both the rdseed and rdrand instruction sequences into pseudo-function calls, both the actual machine instruction, and the check of the carry bit for success. It's pretty clear that the x86_64 instruction decode for Ghidra has been configured to recognize both of these as idiomatic machine instruction sequences. That's pretty remarkable.

Or maybe not. I probably shouldn't be surprised that the premiere encryption cracking organization on the planet is especially interested in detecting applications that use hardware features especially designed for creating encryption keys. (The fact that I was able to get to this point within about an hour of installing Ghidra might say something about the hazards of meta-analysis revealed by open sourcing tools like this.)


Diminuto, my C systems programming library for (mostly) Linux/GNU platforms, includes some definitions to make it easy for me to play with memory barriers: machine instruction sequences available on many architectures (not just x86_64) that enforce cache coherency on multi-core processors. The good news is that multi-core architectures have machine instructions to explicitly do this (that is, they do if they work), or at least instructions that do this as a side effect. The better news is that modern GNU C compilers have built-in pseudo-functions that generate the appropriate instruction sequences for each different architectures, making it easier to write portable code that use memory barriers.
(I confess there is a lot of "do as I say and not as I do" in the Diminuto memory barrier code, which is pretty experimental. You are for the most part much better off coding POSIX mutex operations, which use memory barriers as part of their implementation. Diminuto can help you with that, too.)
Here is a unit test program in Diminuto that calls the general memory barrier function diminuto_barrier several times. It then calls the read barrier function diminuto_acquire, and then the write barrier function diminuto_release. Interspersed are assignment operations to a volatile variable that in my naivety I had hoped would keep the compiler from optimizing them out.

Screen Shot 2019-03-19 at 2.14.20 PM

A read barrier (which is said to have acquire semantics) is intended to block the execution core until its cache has been updated with the latest values from other cores. A write barrier (with release semantics) is intended to block the execution core until its cache changes have been propagated to the other cores. The general barrier does both.

Here is the implementation of all three functions.

Screen Shot 2019-03-19 at 2.16.02 PM

You can see that they are all implemented as preprocessor definitions that expand to predefined GNU C functions. The __sync_synchronize function implements a general (both read and write) memory barrier. The read and write barriers are a little more complicated;  they use the GNU C __sync_lock_test_and_set and __sync_lock_release functions, not for their intended purpose as machine-level synchronization mechanisms using busy waiting (spin locks), but because they implement read and write memory barriers as a side effect.

Here is how Ghidra decodes this unit test for code compiled for the x86_64.

Screen Shot 2019-03-21 at 4.09.56 PM

The disassembly in the center screen is straightforward. There are the three MFENCE instructions corresponding to the diminuto_barrier function calls that each expanded into __sync_synchronize. You can see the MOV instructions that does the variable assignment. You can see the XCHG that implements the __sync_lock_test_and_set in diminuto_acquire (there's no loop because we just want the read barrier as a side effect), and a MOV that implements the __sync_lock_release in diminuto_release.

The C decompile on the right is more problematic. Not only are none of the memory barrier operations represented in any way, but the assignment statements that were in the original C source are missing as well. In fact, the only code shown in the decompile is that which was generated in the unit test framework EXIT macro that handles the end of test processing. The entire body of the unit test is gone.

It is no more than a guess on my part, but maybe Ghidra fell into a machine instruction sequence that it didn't recognize, and rather than add a notation in the decompile it simply ignored it. The authors of Ghidra have a lot more experience using it to reverse engineer software using Ghidra than I do (which is, basically, none). But this seems unfortunate, since the use of memory barriers is a common idiom in the kinds of low level work that I do, particularly with devices that expose data registers by memory mapping them into the processor address space.


Hazer is my project to evaluate Global Navigation Satellite System (GNSS) receivers, of which the Global Position System (GPS) receiver that may be in your mobile phone is one example. The gpstool utility in Hazer makes use of Diminuto scoped operators: operators that have an explicit begin (opening) and end (closing) syntax that correspond to underlying begin and end semantics.

One example makes use of the memory barriers we just discussed to implement the Coherent Section scoped operators.

Screen Shot 2019-03-26 at 10.43.46 AM

These Coherent Section operators, when used at the beginning and ending of a section of code, automatically perform a read barrier at the beginning of the section and a write barrier at the end of the section.

Perhaps a little more typical are the Critical Section scoped operators.

Screen Shot 2019-03-26 at 10.44.27 AM

The Critical Section begin operator automatically causes the application to acquire (block until it is available and then lock) a POSIX mutex passed in by the application, and the end operator automatically releases (unlocks) the mutex as the flow of control leaves.

The Hazer gpstool utility has an option to monitor the 1PPS (One Pulse Per Second) 1Hz timing signal that is generated by some GNSS receivers. One of the ways that serial-attached receivers generate this pulse is by toggling the Data Carrier Detect (DCD) control line on the serial port. gpstool uses a separate thread to follow DCD and communicate its state back to the main thread.

Screen Shot 2019-03-21 at 4.15.23 PM

The dcdpoller function uses the Coherent Section scoped operators to bracket the reading of the done variable that tells the thread to exit (when, for example, the application terminates), and the more conventional Critical Section scoped operators to synchronize read-modify-write access to the oneppsp variable that it uses to inform the main thread that a 1PPS event has occurred. (The function can also optionally strobe an external GPIO pin to pass the DCD state on to another device.)

I routinely run gpstool on both x86_64 and ARM architectures. Here is the Ghidra analysis of the dcdpoller function on the x86_64.

Screen Shot 2019-03-21 at 4.20.21 PM

The XCHG and MOV instructions that form the beginning and ending of the Coherent Section are clearly visible in the disassembly, but are completely absent in any form in the decompile. But at least the check of the done variable is present in the decompilation. The beginning and ending of the Critical Section, including the handling of the POSIX mutex clean up, is well represented.

Here is the Ghidra analysis of the same function compiled for the ARM.

Screen Shot 2019-03-21 at 4.49.36 PM

The ARM ldrex (Load Exclusive) and strex (Store Exclusive) instructions are how the ARM implements an atomic test-and-set operation. These are visible in the disassembly where the original C code enters the Coherent Section. The lock is released when the flow of control exits the Coherent Section via (I think) conventional mov (Move) and str (Store) instructions. The decompile represents this using two successive coproc_moveto_Data_Memory_Barrier pseudo function calls. The disassembly shows that the done value is accessed inside the Coherent Section, but this isn't at all obvious from the decompile. The decompile of the Critical Section using the POSIX mutex was recognizable as being the same operation as in the x86_64 analysis, which is no surprise; it's probably the same underlying implementation, with only minor differences for architecture specific operations like spin locks.
(It occurred to me later to do a Ghidra analysis of the Diminuto memory barrier unit test compiled for the ARM instead of the x86_64. I got results consist with this Hazer analysis: the presence of many coproc_moveto_Data_Memory_Barrier pseudo function calls in the decompile, but the body of the unit test was missing except for an empty do-while-false block.) 

I'm not an expert in either x86_64 or ARM assembler coding, but at least I was able to make my way through the disassembled listings without much drama, and with only one reference to a processor manual. The decompiled listings were useful as well, but it would have been easy to have been mislead had I not been the author of the original program being analyzed and so knew what to expect.

Ghidra is going to be my go-to tool for this kind of analysis. In fact, I can easily see using it as a static analysis tool when coding specialized instruction sequences in C or C++ to make sure I haven't botched something. I have already decided to make some changes to the few (okay, two) places in my libraries where I make use - probably inadvisedly - of Coherent Sections outside of unit tests. I probably wouldn't have figured that out without Ghidra.

Ghidra ain't perfect, but it ain't bad - I'm amazed the decompiler works at all - and the price is right.

Monday, March 25, 2019


The National Security Agency, the U.S. intelligence organization responsible for cracking foreign governments' encrypted communications, while at the same time trying to ensure those same state actors don't crack their own, recently open sourced a tool kit they call Ghidra: a suite of tools for software reverse engineering (SRE).

("Ghidra" is the multi-headed Toho film monster whose name is probably derived from "Hydra", the many headed serpent from Greek mythology.

Photo credit: Wikipedia
He is more commonly known in Japan as "Ghidorah". )

Ghidra is written in Java (it requires OpenJDK 11), Jython (an implementation of Python that runs on the Java Virtual Machine), and C++. It runs on 64-bit Windows, MacOS, and Linux platforms. It has a graphical user interface and includes a disassembler, and even more remarkably, a decompiler.

In order to represent a file of machine instructions as decompiled code - C code in my examples - the decompiler has to recognize idiomatic instruction sequences that are specific to a compiler - GNU C in my examples - and to a machine architecture - x86_64 and ARM in my case. To distinguish instruction sequences from data, the disassembler has to follow the logical flow of instruction execution, decoding successive instruction sequences, and following all possible transfers of control (jumps, branches, function calls) to new sequences. This forms a cyclic graph of control paths to determine exactly what bits are instructions and what bits are data, the latter being bits to which there is no flow of control. This is no mean feat.

To get you interested in plowing through the rest of this article - or maybe just jumping right to playing with Ghidra yourself - I'll start off with an example.
(You can mouse click on any of the images to open a larger and hopefully more readable version. My fu with the Blogger platform kinda sucks at rendering code samples no matter what I try.)
Here is the prototype for a function from my Diminuto library of C systems programming tools that debounces digital input signals. The application calls this function at some fixed frequency - 100Hz (every 10ms) works well for typical mechanical contact switches - and gets a 0 (unasserted) or 1 (asserted) result.

Screen Shot 2019-03-25 at 8.46.58 AM

Here is its implementation. (I didn't invent this algorithm, I've just used the heck out of it. The header file for this feature lists the references from Dr. Marty and from Jack Ganssle.)

Screen Shot 2019-03-21 at 5.16.35 PM

Here is what the data structure that the debouncer uses to maintain state looks like. You see that I used bit fields, something that isn't obvious from the implementation.

Screen Shot 2019-03-22 at 8.06.27 AM

Bit fields aren't the greatest feature in C, mostly because the C standard doesn't specify in what order the bit fields occur, leaving that up to the implementation. That means you can't use bit fields portably. Big oops on the part of the C standards folks. But when I originally wrote this feature, it wasn't for Linux platforms, it was for tiny eight-bit PIC micro controllers, where data space was at a premium (fifty-six bytes of read-write memory on this particular firmware project). When I ported the code to Diminuto, I kept the bit fields.

Here is what Ghidra produces when analyzing this library compiled into a shared object file for a x86_64 target using GNU C.

Screen Shot 2019-03-25 at 8.51.10 AM

Indices into particular sections of the object module are shown on the left. The disassembly into x86_64 machine code is shown in the center (it's long, the screen doesn't contain the entire function). The decompiled C code is shown on the right.

The C code here is particularly helpful. You can see how the bit fields are handled by logical operations on the entire integer variable. (In contrast, the Harvard-architecture PIC micro controller for which this was originally written had machine instructions to do bit-wise operations which made its executable code stored on-chip in ROM much more compact.)

Many times I've had to resort to using GNU tools like objdump, or other similar target-specific tools, to disassamble files containing machine instructions - device drivers, libraries, shared objects, executables, ROM images, what have you - typically in order to debug a problem in third-party software, or to integrate some product development effort with existing legacy software, for which I didn't have access to the original source code. (Or, in a couple of darkly humorous episodes, for which my client had lost their own original source code.)

As much of a life saver as objdump is, it is limited as to what it can disassemble, and it requires that I brush up on my assembler skills, which these days are limited to writing or decoding short instruction sequences that are embedded in C or C++ code to do specialized things like implement memory barriers or generate random bits using built-in entropy generators. It would be a lot more productive to have a tool that produces (even approximate) C code.

There are commercial tools that do this. I've seen IDA Pro, a graphical Interactive DisAssembler tool produced by Belgium-based Hex-Rays, in action. The Pro version can be licensed to include a decompiler and interactive debugger for a specific architecture like x86_64 or ARM64. The licensing scheme is a little complicated, and a Pro license can run from hundreds to thousands of dollars per architecture. Many times I've longed to have IDA Pro on my laptop.

But apparently that longing never manifested into spending money on a IDA Pro license. (In hindsight, the timing for the release of Ghidra couldn't be better, since I recently revisited my budget for software reverse engineering tools.) So it was with great interest - and not a little trepidation - that I read about the NSA open sourcing their own SRE tool, and, shortly after that, installing Ghidra version 9.0 on one of my x86_64 development systems running Ubuntu 18.04. Subject to having an OpenJDK 11 available, Ghidra installed and ran without drama.

Ghidra supports a broad variety of processor architectures - many of which will cause the ears of an embedded developer to prick up - including 6502, ARM, AVR8,  MIPS, PIC, PowerPC, x86, and even Z80.

But while Ghidra is a remarkable tool, it isn't perfect. A common use case in my world is that I want to reverse engineer an ARM32 or ARM64 binary using my x86_64 laptop running Ubuntu in a VM under Windows. While for the most part Ghidra is up for this - this is where its write-once-run-anywhere Java implementation really pays off - it only took me a couple of hours of playing with it to run into my first documented bug in the 9.0 version (the release dated 2019-02-28): in binary files like shared object libraries that have multiple symbol tables, it misplaces some of the symbols so that it becomes convinced that one function (identified by a symbol) is in the middle of another function (regardless of the lack of idiomatic function entry instructions).

Screen Shot 2019-03-22 at 1.06.42 PM

In the ARMv7 shared object binary above, Ghidra identifies the diminuto_cue_debounce function, but is convinced that the apply_gain function (which was in a completely different object module, one that implements a Proportional-Integral-Differential or PID controller, which was linked along with the debouncer and many other object modules into this common shared object) starts in the middle of it. The Ghidra bug tracker has a comment that this bug will be fixed in the next release.
Update 2019-04-02: I installed Ghidra 9.0.1 dated 2019-03-25, created a new project, reimported the pertinent file, reanalyzed it, and still see this issue when the source is compiled for the ARMv7L target (but not when compiled for the x86_64 target). What I thought was the relevant bug has been marked closed. I submitted a new bug report.
Update 2019-04-03: A Ghidra developer made the following comment on my bug report: 
The import of this file looks fine. If your only concern is the 0x10000 offset when compared to objdump this is not an import error unless the file has been prelinked in which case we should use the prelinked image base. This file is a relocatable shared library which can generally be imported to any base address of your choosing (see importer options). We default to offset 0x10000 which is somewhat arbitrary. We try to avoid loading at offset 0 when possible to help avoid mistaking small constants for address offsets. The importer has an image base offset option which can be set.
I set the offset from the default of 0x10000 to 0 by selecting Import File... then Options... and editing the Image Base field. That seems to have worked. I don't know whether this is necessary for any targets other than the ARMv7L, or why it wasn't necessary for the x86_64 target which also has the Image Base set to 0x10000 by default.
Update 2019-04-04: The same Ghidra developer make the subsequent comment: 
When I originally looked at your sample I had only done an import to see where the ELF loader was placing symbols. Having just run analysis I now see the issue you pointed out. You may have also noticed a bunch of DWARF related errors in your log during analysis when an image base of 0x10000 was used. Using an image base of 0 likely avoids these errors. We currently are not handling relocation records which apply to the DWARF debug data which is based at 0 for your sample. This results in DWARF analysis laying down symbols at the incorrect offset since the image base adjustment had not been applied. When DWARF data exists the 0 image base should probably be used until we are able to handle DWARF debug data relocations.
Bugs aside, Ghidra is a useful tool. While it makes reverse engineering binaries that contain symbols almost straight-forward (albeit not necessarily easy), it can also be used to analyze binaries that lack symbols, binaries that may even be deliberately obfuscated. Although I haven't played with it yet, it appears to allow you to add your own notations and to improve the disassembly and decompilation with your own manual analysis.

But even without doing this human labor, Ghidra produces useful artifacts for reverse engineering. Here is a screen shot of Ghidra's analysis of an ARMv7 executable binary stripped of symbols.

Screen Shot 2019-03-25 at 9.41.47 AM

While Ghidra might not make it obvious, this program wipes data from a device by overwriting it with data generated from a polynomial whose initial value is seeded with a random number. It does this using synchronous I/O, bypassing the buffer cache, so that when a write operation completes, there is a chance that the data block really is overwritten, and not just scheduled to be overwritten sometime in the future. (The use case is wiping USB drives containing sensitive data in field.) Ghidra at least gives you a head start by providing a first-cut C decompile with synthesized symbol names that you can supplement with your own notes as you understand what's going on.
Important safety tip: Ghidra might make reverse engineering too easy. If you bother to read the typical End-User Licensing Agreement (EULA), you may find it specifically forbids reverse engineering the code that you just paid for. My colleagues and I have had some discussion about this when trying to integrate a recalcitrant third-party software package into our embedded product. It is not uncommon for a EULA to effectively prevent you from using the third-party product to which it applies.
Ghidra is evolving into a useful tool to keep in your embedded product development kit, along with static analysis tools like nm and objdump, as well as run-time tools like valgrind, ltrace, strace, and even gdb. In future articles I'll write about using Ghidra  to reverse engineer my own code that includes some of those specialized instruction sequences I mentioned earlier. And I'll keep an eye out for the next release.