Monday, October 31, 2011

Up the Source Code Organization

I've spent decades working in humongous C and C++ source code bases. Humongous means many millions of lines of code. Some of it I even wrote. Most of it I didn't. Most of it didn't even originate in the organization in which I found myself working. Integrating large source bases in which major components weren't necessarily designed to work together can be a challenge.

With the plethora of open source software now available, it's not unusual to discover that the new software stack that could save you months of development time requires you to install a handful of other software stacks on which it depends. And those stacks have their own dependencies. And so on. Gone are the days in which you could just get by with the standard C library.

These days at Digital Aggregates I've been working on a project that incorporates not only several third-party software stacks written in C or C++, but some of my own software libraries, some of which I wrote years ago. My integration experience has caused me to revisit how I put those libraries together. Laziness being a virtue among software developers, I've come up with some organizational techniques to make my life easier, at least when using my own software. These techniques exploit two things: the Digital Aggregates domain name, diag.com, and a project name that is unique within the company. They also borrow from similar techniques used in the Java world.

Every Digital Aggregates project gets a project name. The name itself doesn't have to have any significance to the project, although it usually does in at least a pun-ish way that may only have meaning to me. The name is not an acronym, nor is it a name already used at the time by some well known (to me, anyway) software package.

For example, today I wrote code using Desperado, a collection of C++ classes that implement design patterns I've found useful in embedded software development; Diminuto, a collection of C functions that implement design patterns I've found useful in systems programming on Linux and GNU based systems; Lariat, a small software wrapper around Google Test, Google's excellent C++ unit testing framework, that allows you to set resource limits, like real-time execution duration, on unit tests from the command line; and Hayloft, a work in progress. The project names help me keep everything straight.

As mundane as it sounds, the project name starts life on a manila file folder. I keep at least some paper documentation around, either temporarily or permanently. The manila file folder keeps all of it together, and it can be easily identified as it lies on my desk or is filed in the file cabinet. On my desk in my home office right now I have file folders labelled Biscuit, Lariat, and Hayloft. Just minutes ago I consulted the file folder labelled Desperado in the file cabinet.

I keep lab notebooks that are organized not by project but chronologically and by client. I use the project name to identify notes I make in the notebook.

Obviously, I use the project name when I write about my work here in my blog, as in Automating Maintenance on Embedded Systems with Biscuits where Biscuit is the project name.

I also use it as part of the URL for the project page on the Digital Aggregates web site, for example

which of course incorporates the Digital Aggregates domain name as well. Furthermore, the project name becomes part of the tar ball name. For example

is a compressed tar ball for the Zinc distribution of Desperado.

I use the project name as the repository name in Subversion, my current source code control system of choice. My Subversion layout for every project follows the pattern I used for Desperado, and is more or less right out of the Subversion documentation.

desperado/trunk/Desperado
desperado/tag/Desperado
desperado/branches/Desperado

desperado is the Subversion repository name. The directory trunk contains the main development branch, tag contains a checkpoint of each major release, and branches contains any temporary or ancillary development branches.

Directory names on disk incorporate the project name. For example, the implementation files for Desperado are in

${HOME}/src/Desperado

where this is just a checkout of the main development branch from Subversion.
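
For example, that checkout might look like the command below, where the repository URL is purely hypothetical; substitute wherever your Subversion server actually lives.

svn checkout svn://svn.example.com/desperado/trunk/Desperado ${HOME}/src/Desperado   # URL is hypothetical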

I'm a big fan of Eclipse, the open source GUI-based IDE, for C, C++, and Java development. The project name becomes the Eclipse project name, so the name Desperado shows up in the Project Explorer or C/C++ views in Eclipse, with all the source files underneath it.

Organizing header files for projects that use multiple components can be especially challenging, and I frequently see it done especially poorly. For my C and C++ projects that result in libraries intended to be used in other work, I've borrowed from some Java best practices and use both the domain name and the project name as part of the header file path name. For example, a source file I edited today had the following #include statements.

#include "com/diag/hayloft/Packet.h"
#include "com/diag/desperado/Platform.h"
#include "com/diag/desperado/Print.h"
#include "com/diag/desperado/Dump.h"

The domain name becomes part of the path used to include the header file for a specific project. This approach is used not just by source files using the libraries, but also in the source files that implement the library. This makes it perfectly clear from which project a header file is being included, and prevents any header file name collisions. For example, the Desperado header files are under

${HOME}/src/Desperado/include/com/diag/desperado

while the Hayloft header files are under

${HOME}/src/Hayloft/include/com/diag/hayloft

and the GNU g++ or gcc command line options

-I${HOME}/src/Desperado/include

and

-I${HOME}/src/Hayloft/include

are used to point the compiler to the right places. (You might choose to use the -iquote option instead.) Although these point to the source code directories where I do development, they could just as easily point to /usr/include or maybe /usr/local/include, and the same naming system would prevent any conflicts with other header files.
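
Putting it together, a full compile command for a translation unit that uses both libraries might look like this.

# Example.cpp here is a hypothetical translation unit that uses both libraries.
g++ -I${HOME}/src/Desperado/include -I${HOME}/src/Hayloft/include -c Example.cpp -o Example.o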

I'm also a fan of namespaces in C++, where I use a similar approach. All of the Desperado C++ symbols are in the namespace

::com::diag::desperado

whereas all of the Hayloft C++ symbols are in the namespace

::com::diag::hayloft

which results in code snippets that look like the one below.

namespace com {
namespace diag {
namespace hayloft {

class Logger : public ::com::diag::desperado::Logger {

public:

    /**
     * Ctor.
     */
    explicit Logger(::com::diag::desperado::Output &ro)
    : ::com::diag::desperado::Logger(ro)
    , mask(0)
    {}

protected:

    unsigned int mask; /* Member type assumed here just to make the excerpt self-contained. */

};

}
}
}

This is a constructor for the Hayloft class Logger that derives from the Desperado class Logger and uses a reference to an object of the Desperado Output class as an argument.

Furthermore, some projects require more complex namespace organizations, and this is reflected in both the implementation and header file directory hierarchies. For example, C++ symbols in the namespace

::com::diag::hayloft::s3

find their header files in

${HOME}/src/Hayloft/include/com/diag/hayloft/s3

and their implementation files in

${HOME}/src/Hayloft/s3

This may sound complex, but in practice it is easily done.

I predict that by now you are wondering what I do for C functions, where namespaces aren't an option. Using the Digital Aggregates domain name as part of C function names violates my laziness rule; it just seems to be more work than it's worth. But I do incorporate the project name, and a name roughly equivalent to a C++ class name, into the function name. For example

void * diminuto_map_map(
  uintptr_t start,    /* physical address to be mapped */
  size_t length,      /* length of the mapping in bytes */
  void ** startp,     /* returned: base of the mapping, kept for later unmapping */
  size_t * lengthp    /* returned: size of the mapping, kept for later unmapping */
);

is the prototype for a C function in the Diminuto library, part of its map feature, that maps a physical memory address to a virtual memory address.

I use a similar approach when defining macros that are expanded by the C preprocessor, resulting in names like

DIMINUTO_LOG_PRIORITY_NOTICE

or

diminuto_list_next

which can sometimes seem a little cumbersome.

I try not to use preprocessor macros at all when writing C++. But I do use the expanded naming system in both C and C++ when defining preprocessor guard symbols that prevent header files from being included more than once. This results, for example, in the guard preprocessor symbol

_H_COM_DIAG_HAYLOFT_S3_LOCATIONCONSTRAINT

for a C++ header file in the s3 sub-directory and sub-namespace.

I do use the domain name in the name of any environmental variables. For example, you can set the log level in Hayloft by setting the value of an environmental variable

COM_DIAG_HAYLOFT_LOGGER_MASK

which incorporates the domain name, the project name, and even the class name. Since environmental variables are in a global namespace that is shared among pretty much all of the software being used by a particular user, and because you typically don't have to type the environmental variable name very often, this seems the safest approach.
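
For example, from a Bourne shell, enabling more verbose logging might look like the line below; the numeric value shown is purely illustrative.

export COM_DIAG_HAYLOFT_LOGGER_MASK=0xffff   # value shown is illustrative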

In a future article I'll be talking about how I apply Google Test in projects like Hayloft, where I use a similar naming scheme that differs slightly to prevent collisions between the header files and classes under test and the test and mock classes themselves.

Update (2016-07-22)

Some time ago I converted from Subversion to Git, and started hosting my source code repositories on GitHub. I follow a convention similar to the one described above by giving my repositories names like com-diag-diminuto and com-diag-scattergun. This makes it really easy to keep track of stuff when I clone a repo into my src directory on my build server. When I use Eclipse, I give my projects the same name as the repo name. After using this scheme for almost five years now, I like it better and better.
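
Fetching a project onto the build server then becomes, for example, the command below, where the account name is just a placeholder.

git clone https://github.com/youraccount/com-diag-diminuto.git ${HOME}/src/com-diag-diminuto   # account name is a placeholder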

Friday, October 14, 2011

Automating Maintenance on Embedded Systems with Biscuits

If you make a living, as I have from time to time, developing embedded systems, it's not unusual to find your work day interrupted on a regular basis by application developers who want their own development systems updated to the latest platform release, or firmware installed to support a new version of an FPGA, or some other routine system maintenance chore. Even embedded systems based on Linux and GNU are frequently too small to implement the tools used by big iron: graphical interfaces like the Synaptic Package Manager for Ubuntu, or even command line tools like yum, apt, or dpkg. Sometimes the chore is as mundane as creating a tar ball of the system logs to ship back to you so you can debug a field problem. But you still don't want to burden a non-geek user with all the skills, and maybe even the root password, necessary to do that. What would be great is if you could automate that system maintenance chore in a secure fashion and get it done by just handing the user a USB thumb drive and telling them where to stick it. Biscuits do that.

A biscuit is a cpio archive that has been compressed with bzip2 and then encrypted using gpg (GNU Privacy Guard a.k.a. GNUPG) into a binary file. The biscuit command decrypts and decompresses the file, by default named biscuit.bin in the current working directory, and extracts its contents into a temporary directory. It then looks for an unencrypted executable file named biscuit, script or binary, in that directory. If it finds it, it modifies the PATH and LD_LIBRARY_PATH environmental variables to include the temporary directory, changes its current working directory to the original delivery media (for example, a USB drive), and then runs the executable. What happens next is up to the executable. When the executable exits, the temporary directory is cleaned up and deleted. While it runs, the executable has access to both the temporary directory, where any other collateral material from inside the biscuit can be found, and the working directory, where results may be placed.
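
To make the recipe concrete, here is a hedged sketch of the server-side packaging; the directory and key names are hypothetical, and the actual recipe is automated by the Makefile in the distribution.

cd payload                     # hypothetical directory: the executable named biscuit plus any collateral
find . -depth | cpio -o | bzip2 -c | \
  gpg --encrypt --recipient "ProductFlavorA" --output ../biscuit.bin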

Usage: biscuit [ -C working_directory ] [ -f biscuit_file ]

The encryption defaults to ElGamal (ELG-E), an asymmetric key encryption algorithm based on Diffie-Hellman key exchange, using 1024 bit keys. ElGamal is not patent encumbered, and although longer keys are possible, 1024 bits strikes a good balance between speed and security for slower embedded processors. The server where biscuits are packaged maintains a key ring containing the public (encrypting) keys for all the embedded systems on which biscuits may be deployed. Each embedded system has a key ring containing its own secret (decrypting) keys.

Different unique public/secret key pairs can be created, allowing you to distinguish between product lines, architectures, even individual systems, whatever makes sense for your business. This allows you to create biscuits for product flavor A that cannot be executed on product flavor B: the wrong biscuits simply won't successfully decrypt. The encryption serves as both an authentication mechanism (the biscuit comes from a reputable source) and an authorization mechanism (the biscuit can be executed using root privileges). Since users can't crack open biscuits, they not only can't create their own biscuits, they can't even see what your biscuits do. Biscuits are opaque binary files. They cannot be easily reverse engineered without cracking the encryption. They can be delivered on removable media such as USB thumb drives, CD-ROMs, or DVDs, or they can be downloaded across a network via ftp, tftp, or scp.

I've been using this or similar approaches to automating system maintenance of embedded systems for several years now. A link to a clean room implementation of biscuits can be found on the Biscuit project web page. It's mostly just a Makefile to build GNUPG for multiple architectures, generate and manage the keys for both the build server and the embedded hosts, create the biscuit command, and package up biscuit binary files, plus a couple of example biscuit scripts. But the distribution tar ball also contains some useful stuff for automating the use of biscuits in your embedded system, depending on your needs and comfort level.

Users can just run the biscuit command manually and point it at a biscuit.bin binary file. You can set up your embedded host so that only root can do so, forcing users to run biscuit via sudo.
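
For example, to run a biscuit from an already mounted thumb drive (the mount point here is hypothetical):

sudo biscuit -f /media/usb/biscuit.bin   # mount point is hypothetical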

If your embedded Linux system uses udev, as big iron distributions like Ubuntu do, you can install the 99-com-diag-biscuit.rules file into /etc/udev/rules.d/99-com-diag-biscuit.rules. That will cause your system to temporarily mount any block device that is inserted, look for a biscuit.bin file, run the biscuit command against it, and unmount the device when it completes. The udev rules I wrote for my use only apply to removable storage devices, by excluding the device names used for permanently attached block storage. You should customize these rules to fit your own platform.

If your system uses a simpler version of udev, as embedded distributions like Ångström do, you can install the mount.sh script into /etc/udev/scripts/mount.sh to do the same thing. Or just look at the one line I inserted into Ångström's mount.sh that runs biscuit against the auto-mounted media, and do something similar on your system.

If you have an older embedded system that doesn't use udev but instead uses hotplug, you can install the biscuit.hotplug.sh script into /etc/hotplug.d/block/biscuit.hotplug to accomplish the same thing.

If your kernel was configured to support hotplug, but you don't have any of the infrastructure, you can still do the above, plus you can install hotplug.sh into /sbin/hotplug. Look for /proc/sys/kernel/hotplug on your embedded host. If it exists, your kernel is built to support it. If its contents are blank, then you need to tell the kernel to invoke /sbin/hotplug by echo-ing that string into /proc/sys/kernel/hotplug every time you boot your system, or rebuild your kernel configured to do this automatically.
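
The check and the fix both fit on a command line.

# If this file exists, the kernel was built with hotplug support.
cat /proc/sys/kernel/hotplug
# If it prints an empty string, register the helper; this must be redone at every boot.
echo /sbin/hotplug > /proc/sys/kernel/hotplug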

There are some caveats you would be wise to heed.

Once biscuits are released into the wild, you must assume that they will go feral. People will dig up a USB drive you gave them years before, and insert it into a system. They may do so out of desperation, remembering that this did something useful long ago. Or they may simply have forgotten where the drive came from and don't even know that there is a biscuit on it. This is the dark side of biscuits: they are effectively viruses.

Biscuit scripts deal with the former by being paranoid: for example, before loading new firmware on a system, the script makes sure that the system isn't already running an even newer version of the firmware. I have also written scripts that deleted the original encrypted file from the USB drive, making it a use-once biscuit. Or the script dropped an identifying file on the USB drive (I like using a MAC address from the host embedded system as a file name), and checked for the presence of this breadcrumb whenever it ran, to see whether it had passed this way before. This approach works well because the user can delete the breadcrumb manually to force the biscuit to rerun.
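
Here is a hedged sketch of that breadcrumb idea as it might appear inside a biscuit script, recalling that biscuit runs the script with the delivery media as its current working directory; the network interface name is an assumption.

# Colons aren't legal in FAT file names, so substitute dashes (eth0 is an assumption).
MAC=$(cat /sys/class/net/eth0/address | tr ':' '-')
[ -f "${MAC}" ] && exit 0   # we have passed this way before
# ... perform the actual maintenance chore here ...
touch "${MAC}"              # leave the breadcrumb behind on the drive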

If you include collateral material on the USB drive (for example scripts, shared libraries, binaries, or data), try hard to put it inside the biscuit. I have seen scripts that invoked another script in cleartext on the USB drive outside of the encrypted file. This allows a villain to simply rewrite the cleartext script to do whatever they want, like starting a root shell on the console, subverting the authentication and authorization provided by the encryption. Sometimes the collateral is simply too large to put inside the biscuit because of limits in available temporary storage, which on embedded systems is often RAM disk, into which the biscuit binary file is unpacked. If that's the case, put checksums or hashes (I like MD5 or SHA-1) of the collateral inside the biscuit and have the script verify the likely integrity of the collateral before using it.
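
For example, a manifest of checksums travels inside the biscuit while the bulky collateral rides outside in cleartext, and the script does something like the line below before trusting it; the manifest name is hypothetical.

md5sum -c collateral.md5 || exit 1   # refuse to proceed if the collateral was altered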

Similarly, use some care when the biscuit script executes commands that are on the embedded host. Doing so is of course typical, for example running fundamental tools like bash. But don't run scripts or binaries that can be altered by unprivileged users. Your security is only as good as that of your root password.

It is important that you build both the build server gpg and the host embedded system gpg from the same GNUPG source tree. You must not assume that different versions of GNUPG are interoperable. That may be the case, but if it isn't you'll end up creating biscuits on your build server that your embedded hosts can't use. If you do upgrade to a new version of GNUPG on your build server while you have deployed embedded hosts using the older version, be very thorough in your interoperability testing.

Once you have generated keys and installed them on embedded hosts, check the keys into source code control or otherwise archive them. You will not get the same keys if you generate them again, even if you use exactly the same parameters. Again, you can render biscuits unusable if you are not careful. However, with some not too mad skills, you can recover lost keys from your deployed embedded systems themselves should a crisis occur.
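
A hedged sketch of what that archiving might look like, where the key ring directory and output file names are hypothetical:

gpg --homedir keys --armor --export > biscuit-public-keys.asc              # public (encrypting) keys
gpg --homedir keys --armor --export-secret-keys > biscuit-secret-keys.asc  # secret (decrypting) keys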

Be paranoid about your keys, not just on the build server, but on your embedded hosts too. On the embedded hosts, which may be deployed into the field and hence presumably beyond your physical control, the keys and the directory that holds them should be accessible only by root. Unprivileged users should use sudo to run the biscuit command manually, if they can do so at all.

But with some care, biscuits are a means of safely automating a variety of system maintenance functions without giving away the keys to the kingdom or forcing end users or even application developers to become embedded systems programmers. For complex tasks like software update, they are a way of implementing complex logic and making it a part of the software distribution package, not part of a command in the embedded host itself.