Friday, April 20, 2012

All the Interesting Problems Are Scalability Problems

In "It's Not Just About Moore's Law" [2006] I presented this graph based on research I had done while I was working in the computing division at the National Center for Atmospheric Research in Boulder, Colorado.

Power Curves

The vertical logarithmic axis shows how technologies change over linear time on the horizontal axis. Here are some of the assumptions I used, which were believed to be true at the time I mined the data [1997].

Microprocessor speed doubles every 2 years.
Memory density doubles every 1.5 years.
Bus speed doubles every 10 years.
Bus width doubles every 5 years.
Network connectivity doubles every year.
Network bandwidth increases by a factor of 10 every 10 years.
Secondary storage density increases by a factor of 10 every 10 years.
CPU cores per microprocessor chip double every 1.5 years.
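The consequence of those different doubling periods is easy to compute. Here's a little illustrative C++ (mine, not from the original study) that evaluates the growth factor 2^(years/period) over a single decade:

```cpp
#include <cmath>

// Growth factor after 'years' for a technology that doubles every
// 'period' years: 2^(years/period).
double growth(double period, double years) {
    return std::pow(2.0, years / period);
}

// Over a single decade the curves diverge dramatically:
//   growth( 2.0, 10.0) =   32  (microprocessor speed)
//   growth( 1.5, 10.0) ~  102  (memory density)
//   growth(10.0, 10.0) =    2  (bus speed)
//   growth( 1.0, 10.0) = 1024  (network connectivity)
```

A balanced architecture at year zero is a factor of five hundred out of balance between its bus and its network ten years later.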

This is old enough now that I probably need to revisit it. For example, microprocessor speed has stalled in recent times. But the basic idea is sound: all things change, but they do not change at the same rate. This means that over time the architectural decisions you made based on the technology at hand are probably no longer correct for the technology you have today. The balanced design you started with eventually no longer makes sense. The scalable solution you came up with five years ago may only scale up if everything you built it from scales up at the same rate. But over time, it doesn't. And it's just going to get worse.

It has been my experience that all the interesting problems are scalability problems. This graph shows that there is a temporal component to scalability.

People that don't work in technology have this idea that artifacts like hardware and software are somehow frozen in time. Those people are running Windows 98 on a 300MHz desktop with Office 97. People that work in technology know nothing could be further from the truth. Technology changes so quickly (see above) that it's a Red Queen's Race just to stay in one place and keep everything running.

In "The Total Cost of Code Ownership" [2006] I presented yet another graph, based mostly on data from Stephen Schach's book Object-Oriented and Classical Software Engineering [McGraw-Hill, 2002], which he in turn based on surveys of actual software development projects.

Software Life-Cycle Costs - Schach 2002

Notice that by far the largest share of the cost of the software development life cycle is maintenance, that is, making changes to the software code base after initial development has been completed. It amounts to two-thirds of the entire cost of the software base over its life cycle. If you could somehow completely eliminate the entire cost of all the initial development and testing, you would have reduced your software life cycle cost by only a third.

Surprised? Do you think that number is too high or too low? Most people that have never worked with a large code base that supported a broad product line think it's too high. Those are usually the same people that don't have processes with which to measure their actual software life cycle costs. But organizations that do have to support multi-million line code bases that are part of product lines that generate tens of millions of dollars in revenue annually think that number is too low. Way too low. I've heard the number 80% bandied about.

Les Hatton has observed that software maintenance can be broadly classified into three categories: corrective (fixing code that doesn't work), perfective (improving some aspect of working code, such as performance or even cost of maintenance), and adaptive. It's this latter category that brings these two graphs together. Adaptive maintenance is when you have to change your code because something outside of your control changed.

In my work in the embedded domain, adaptive maintenance frequently occurs because a vendor discontinued the manufacture of some critical hardware component on which your product depends, and there is no compatible substitute. (And by the way, for you hardware guys, pin-compatibility just means you can lay that new part down without spending hundreds of thousands of dollars to re-design, re-layout, and re-test your printed circuit board. With complex hardware components today that may have an entire microcontroller core and firmware hidden inside, that don't mean squat. Case in point: surface mount solid state disks, to pick an example completely at random.) I've seen a product abandoned because there was no cost effective solution.

In my work in the enterprise server-side domain, it's not any different. Vast amounts of software are based on commercial or open-source frameworks and libraries. You want to upgrade to the latest release to get critical bug fixes or new features upon which you've been waiting that make a difference between making or missing your ship date, only to discover that the software on which you so critically depend has been improved by the provider roto-tilling the application programming interface. This is the kind of thing that sends product managers running around like their hair was on fire.

So here's the deal. You can't get around this. It's like a law of thermodynamics: information system entropy. Because all things change, but not at the same rate, you always face a moving target. Adaptive maintenance is an eternal given. The only way to avoid it is for your product to fail in the marketplace. Short-lived products won't face this issue.

Or you can learn to deal with it. Which is one of the reasons that I like Arduino and its eight-bit Atmel megaAVR microcontroller as a teaching and learning platform.

Wait... what now?

Close Up of JTAG Pod and EtherMega Board

This is a Freetronics EtherMega, sitting a few inches from me right now, festooned with test clips connected to a JTAG debugging pod. It's a tiny (about four inches by two inches) Arduino-compatible board with an ATmega2560 microcontroller. The ATmega2560 runs at 16MHz and its Harvard architecture features 256KB of flash memory for executable code and persistent data and 8KB of SRAM for variable data.

Mac Mini

This is my Mac Mini, on which I'm writing this article right now. It's smaller than a manila file folder. It runs at 2.4GHz, and its von Neumann architecture features 4GB of RAM.

Ignoring stuff like instruction set, bus speed, and cache, my Mac Mini has a processor that runs about 150 times the speed of that on the EtherMega. But it has almost sixteen thousand times the memory. And that's ignoring the Mac's 320GB disk drive.

This means when writing code for the tiny EtherMega, any assumptions I may have been carrying around regarding space and time trade-offs based on prior experience on desktop or server platforms get thrown right out the window. For example, depending on the application, it is quite likely better to re-compute something every time you need it on the EtherMega than it is to compute it once and store it, because on the EtherMega bytes are far more precious than cycles.
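Here's a hypothetical illustration of that trade-off (mine, not Amigo code): computing the parity of a byte takes a handful of shifts and exclusive-ORs every time, while the classic desktop compute-once-and-store lookup table would eat 256 bytes of an ATmega328P's 2KB of SRAM.

```cpp
#include <cstdint>

// Recompute-every-time: byte parity in a handful of shifts and XORs,
// costing zero bytes of precious SRAM.
std::uint8_t parity(std::uint8_t byte) {
    byte ^= byte >> 4;
    byte ^= byte >> 2;
    byte ^= byte >> 1;
    return byte & 1u;
}

// Compute-once-and-store: the classic desktop idiom. A 256-entry table
// is trivial on a server, but on the ATmega328P it would consume 256 of
// the 2048 bytes of SRAM -- an eighth of all variable storage.
std::uint8_t parity_table[256];

void build_parity_table() {
    for (unsigned i = 0; i < 256; ++i) {
        parity_table[i] = parity(static_cast<std::uint8_t>(i));
    }
}
```

On the server you build the table; on the EtherMega you call the function.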

I can hear my colleagues from my supercomputing days at NCAR laughing now.

NCAR Mesa Lab (looking East)

I'm not the first to have noticed that the skill sets for developing software for embedded systems are very much the same as those required to develop for high performance computing or big distributed platforms. I built a career around that very fact. There is a trend in HPC (the domain formerly known as supercomputing) to re-compute rather than compute-and-store, or compute-and-transmit, because raw computing power has grown at a much faster rate than that of memory or communications bandwidth (again: see above). It turns out that developing software for itsy bitsy microcontrollers has more in common than you might think with developing for ginormous supercomputers.

Writing software for these tiny microcontrollers forces you to consider serious resource constraints. To face time-space tradeoffs right up front. To really think about not just how to scale up, but how to scale down. To come to grips with how things work under the hood and make good decisions. There is no room to be sloppy or careless.

Working with technologies like Arduino and FreeRTOS on my Amigo project has made me a better, smarter, more thoughtful software developer. I am confident it can do the same for you, regardless of your problem domain, big or small.

Thursday, April 12, 2012

Learning By Doing

I figured out a long time ago that I'm not happy unless I'm learning new things. It's more than just being happy, really. For me, learning new stuff is very much self-medication. And the only way I can learn new stuff, really internalize new information, is by applying the stuff I learn. And the best way for me to do that is to generate deliverables. This has resulted in a bunch of projects that contain all sorts of useful collateral, some of which has found its way into real products for paying clients. Even that which hasn't has served me well as a kind of reference design that I routinely go back to when I'm working in related areas. I attribute much of my career success (and I've had a lot of it) to this approach.

Amigo, my foray into low-power eight-bit microcontrollers, has been no different. Here, in no particular order, are some of the lessons I've learned, relearned, or have had reinforced.

C++ works just fine for embedded applications.

C++ isn't just usable for embedded applications, even those with real-time requirements that run on resource-constrained targets. It's superior to alternatives like C or even assembler. I've been using FreeRTOS, a popular real-time microkernel with a tiny footprint, on the Freetronics EtherMega board with the Atmel AVR ATmega2560 microcontroller. This is a platform with 256KB of flash and only 8KB of SRAM. I've written a C++ layer around the FreeRTOS facilities to provide classes like Queue, Task, MutexSemaphore, etc. I've written interrupt-driven device drivers in C++ for AVR hardware features like the USART and SPI.

Not only does C++ work, but the result was a cleaner, simpler, easier-to-use design than I could have accomplished in C. How much SRAM overhead does C++ add over using the FreeRTOS C API? For example: Queue, four bytes; Task, eight bytes; MutexSemaphore, four bytes. Two bytes of each of those could have been eliminated by not having virtual methods. The remaining extra bytes would have been added in a C layer as well. Thanks to inline C++ methods, there is frequently no additional overhead in flash to using the C++ API.
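Where do those two bytes per object come from? The first virtual method in a class adds a hidden vtable pointer to every instance. This sketch (illustrative, not the actual Amigo classes) makes the overhead measurable on any platform; the pointer is two bytes on the eight-bit AVR, pointer-sized elsewhere.

```cpp
// A wrapper holding only the handle of the underlying C object.
struct PlainQueue {
    void * handle;
};

// The same wrapper with a virtual destructor: every instance now also
// carries a hidden vtable pointer -- two bytes on the eight-bit AVR,
// pointer-sized on a desktop host.
struct VirtualQueue {
    virtual ~VirtualQueue() {}
    void * handle;
};
```

Comparing sizeof(PlainQueue) and sizeof(VirtualQueue) on your own target shows exactly what you are paying for polymorphism.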

Formal unit testing is the best thing since sliced bread.

I wish I could just port my favorite C++ unit testing framework, Google Test, to the AVR. I have considered writing a framework around Google Test to have it run on my desktop but execute unit tests on the target. But so far a few carefully written preprocessor macros like UNITTEST(__NAME__), FAILED(__LINE__), and PASSED() have been more than adequate. A single unit test C++ main() program that exercises most of what I've written so far consumes only about 27KB of flash and 6KB of SRAM, most of which is stack space for main() and four concurrent tasks. Below (mostly to record it for my own reference, truth be told) is the output of the unit test suite. (Update 2012-04-26: I pasted in the latest version.)

Unit Test Console
Unit Test Morse PASSED.
Unit Test Task PASSED.
Unit Test Sink
Now is the time for all good men to come to the aid of their country.
Unit Test sizeof
sizeof(signed char)=1
sizeof(unsigned char)=1
sizeof(signed short)=2
sizeof(unsigned short)=2
sizeof(signed int)=2
sizeof(unsigned int)=2
sizeof(signed long)=4
sizeof(unsigned long)=4
sizeof(long long)=8
sizeof(signed long long)=8
sizeof(unsigned long long)=8
Unit Test stack PASSED.
Unit Test heap PASSED.
Unit Test littleendian and byteorder PASSED.
Unit Test Low Precision delay PASSED.
Unit Test High Precision busywait PASSED.
Unit Test Dump
Unit Test Uninterruptible PASSED.
Unit Test BinarySemaphore PASSED.
Unit Test CountingSemaphore PASSED.
Unit Test MutexSemaphore PASSED.
Unit Test CriticalSection PASSED.
Unit Test PeriodicTimer PASSED.
Unit Test OneShotTimer PASSED.
Unit Test Digital I/O (requires text fixture on EtherMega) PASSED.
Unit Test Analog Output (uses red LED on EtherMega) PASSED.
Unit Test Analog Output (uses pin 9 on EtherMega) L M H M L PASSED.
Unit Test Analog Output (uses pin 8 on EtherMega) L M H M L PASSED.
Unit Test SPI (requires WIZnet W5100) PASSED.
Unit Test W5100 (requires WIZnet W5100) PASSED.
Unit Test Socket (requires internet connectivity) PASSED.
Unit Test errors=0 (so far)
Unit Test Source (type control-D to exit)
England expects each man to do his duty.
Unit Test errors=0
Type "<control-a><control-\>y" to exit Mac screen utility.

(Update 2013-03-22: that line above that says text fixture should be test fixture, and refers to the wiring on the board that allows the unit test -- really a functional test -- to succeed when it controls the actual hardware.)

When I make a change to existing code, I just build, upload, and execute the entire unit test suite. When it gets to the end in thirty seconds or so and declares Unit Test errors=0, I am pretty confident that I haven't screwed something up. When I add new functionality, I add another unit test code segment to main(). If I start to run out of space, I'll start deactivating selected tests by turning off the #if conditional compilation statements I've put in the code. But so far that isn't a problem, and it doesn't look likely to become one in the near future.

Unit tests do more than assure me I haven't done something stupid. I get a lot of feedback about my design by eating my own dog food. If I find using a feature that I've written to be cumbersome in the unit test, I know that I've botched the API.

The unit tests also serve as a living, functional example of how I expect my code to be used. I deliberately try to use my software in a way I expect it to be used in an application, or even to suggest ways in which it might be used. I often go back to my own unit tests to remind myself how to use my own code.
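For the curious, here is a sketch of what minimal macros in the spirit of UNITTEST(), FAILED(), and PASSED() might look like; the actual Amigo macros may well differ in detail.

```cpp
#include <cstdio>

// Hypothetical sketch of minimal unit test macros; the real Amigo
// versions may differ. A global error counter accumulates failures so
// the suite can print "Unit Test errors=0" at the end.
static int errors = 0;

#define UNITTEST(_NAME_) std::printf("Unit Test %s ", (_NAME_))
#define FAILED(_LINE_)   do { std::printf("FAILED at line %d!\n", (_LINE_)); ++errors; } while (0)
#define PASSED()         std::printf("PASSED.\n")

// Typical use inside the single test main():
//   UNITTEST("Queue");
//   if (!testQueue()) { FAILED(__LINE__); } else { PASSED(); }
//   std::printf("Unit Test errors=%d\n", errors);
```

The do/while wrapper on FAILED() is the standard trick to make a multi-statement macro behave like a single statement after an if.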

Tool chains for embedded projects can still be problematic.

I've written recently about my adventures in AVR tool chains, when I discovered that my code worked just fine with GCC 4.5.1 but failed mysteriously and catastrophically with GCC 4.3.4. I don't have much to add to that, except that the AVR, with its Harvard architecture and the broad range of configurations of microcontrollers in the AVR product line, can make code generation for these targets challenging. Unfortunately, this is likely true for a lot of microcontrollers, and indeed for any processor other than the mainstream Intel x86. My reading of disassembled code suggests to me that GNU C++ doesn't quite support virtual methods in the upper 128KB of flash on the AVR. I'd be happy to be proven wrong.

I'd be extremely reluctant to ever say I'd found a compiler bug. My reading and writing about memory models and support for the C and C++ volatile keyword has convinced me that these areas are subtle and fraught with peril not just for embedded developers but potentially for everyone, even with perfectly working compilers. But I am still puzzled why a function that returned a pointer to a volatile variable returned NULL when checking the value just before it was returned showed it to be correct. And why I had to cast the result of a sizeof() operator in order to print something other than a monstrously large number when sizeof(size_t) is two bytes. When you are writing code close to bare metal, strange and unexpected things can sometimes happen.
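The sizeof() mystery at least has a plausible explanation: sizeof yields a size_t, and an argument pushed through printf's variadic argument list must match its conversion specifier exactly, hence the defensive cast. A sketch of the idiom:

```cpp
#include <cstdio>

// sizeof yields a size_t, but arguments passed through printf's variadic
// list must match the conversion specifier exactly. On the AVR, where
// size_t is only two bytes, a mismatch reads the wrong bytes off the
// stack and prints a monstrously large number. The defensive idiom is an
// explicit cast so the argument and the specifier agree:
int printSizeofLong(void) {
    long value = 0;
    return std::printf("sizeof(long)=%u\n", (unsigned int)sizeof(value));
}
```

The cast costs nothing when the types already agree, and saves you when they don't.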

Lexical scoping can be like having a superpower.

C programmers (and just about everyone else) already know about scoping. When a local variable comes into scope, the compiler generates code to allocate it on the stack. When it goes out of scope, the compiler generates code to deallocate it from the stack.

{
    /* foo is out of scope. */
    int foo = 0; /* foo comes into scope. */
    /*
     * :
     * foo is in scope.
     * :
     */
    /* foo is about to go out of scope. */
}
/* foo is out of scope. */

C++ extends this to objects by automatically calling the class constructor when an object of that class comes into scope, and it automatically calls the class destructor when that object goes out of scope. The great thing about this is that constructors and destructors are methods that you write that can do all sorts of things, including things that may or may not be related to the object being allocated and deallocated. The Resource Acquisition is Initialization idiom is a way to exploit this.

For example, Amigo implements the class MutexSemaphore with give() and take() methods. The class CriticalSection stores a reference to a MutexSemaphore and calls take() against it in its constructor, and calls give() in its destructor. This is how you implement a critical section to protect data shared between concurrent tasks.

MutexSemaphore mutex; // Shared among tasks.

{
    CriticalSection cs(mutex);
    // Code accessing shared data goes here.
}

That's it. Everything else is done for you. No matter how you exit that lexical block, the compiler guarantees that the destructor will be called to release the recursive mutex semaphore.

Similarly, Uninterruptible saves the current interrupt state (by saving a copy of SREG, the status register) and disables interrupts in its constructor, and restores the interrupt state in its destructor. So here's a section of code that runs with interrupts disabled.

{
    Uninterruptible ui;
    // Code to run without interruption goes here.
}

Endianness is sometimes a function of the tool chain.

It is legitimate to say that the megaAVR architecture is little-endian: the first or lowest address of a multi-byte variable points to the least-significant byte in that variable. Except on the AVR, there is no such thing as a multi-byte variable. Everything is done in byte-sized chunks using eight-bit registers.

Some registers are split into multiple eight-bit registers that are logically concatenated. For example, the sixteen-bit stack pointer is split into SPL (low) memory-mapped to address 0x3d and SPH (high) at address 0x3e. That's little-endian. But there is no way for an application to atomically access both of these chunks at one time as a variable. It is the code generated by the compiler that assumes a byte ordering of short or long variables stored in two or four consecutive bytes of memory. This was a new idea to me.
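You can watch the compiler's convention at work with a few lines of portable C++ (illustrative only) that inspect the bytes of a sixteen-bit value:

```cpp
#include <cstdint>
#include <cstring>

// The byte ordering of a multi-byte variable is whatever the compiler's
// generated loads and stores make it. Inspecting the bytes of a
// sixteen-bit value reveals the convention the tool chain chose.
bool littleEndian(void) {
    std::uint16_t word = 0x3e3d; // Think SPH:SPL, 0x3e and 0x3d.
    std::uint8_t bytes[sizeof(word)];
    std::memcpy(bytes, &word, sizeof(word));
    return bytes[0] == 0x3d; // Least-significant byte at the lowest address?
}
```

The hardware never stored a sixteen-bit word; the compiler's two byte-sized stores created the ordering.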

Harvard architecture requires some new thinking.

Executable code in the megaAVR architecture resides in flash memory or program memory. Non-persistent data resides in static random access memory or data memory. This is generically known as Harvard architecture, as opposed to von Neumann architecture where everything is accessed from a common memory, or at least a common memory bus. On the megaAVR, program memory is (two-byte) word addressed, while data memory is byte addressed. Persistent constant data can be stored in program memory too, but requires special functions that call dedicated machine instructions to bridge the gap from flash to SRAM for processing.

Just to make it even more complicated, some megaAVRs have a three-byte program counter instead of two-byte because their flash memory exceeds 128KB; word addressed, remember? So pointers to stuff in program memory may be three bytes, depending on the model of megaAVR, but pointers to stuff in data memory will always be two bytes. You can have two pointers with exactly the same numerical value, but one points to data in flash, the other to data in SRAM; if you mix up their usage, wackiness ensues.

It's up to you to keep this straight. The GCC AVR tool chain does not deal with this automatically. The non-automatic part makes life more complicated: because the SRAM is so small -- a whopping 8KB on the ATmega2560, but a minuscule 2KB on the ATmega328P used on the Arduino Uno -- you absolutely must store large constant data in flash, or you'll find all your SRAM taken up with stuff like character strings. Ask me how I know this.

The GCC C and C++ compilers and the AVR C library include extensions, attributes, type definitions, functions, and preprocessor macros to enable you to write code to deal with all of this. But write code you must. You will find your source files littered with stuff like PROGMEM, PSTR(), strlen_P(), pgm_read_byte(), and the like. Once you get used to it, it's actually pretty straightforward. But it is definitely a different way of thinking. You also hope you'll never have to port this code to a different microcontroller architecture.
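Here is a hedged sketch of what that littering looks like; the host-side fallback macros are my own illustrative shims for non-AVR builds, not part of avr-libc.

```cpp
#ifdef __AVR__
#include <avr/pgmspace.h>
#else
// Host build: my illustrative shims. On a von Neumann target there is a
// single address space, so the program-space macros degenerate to
// ordinary memory access.
#include <cstring>
#define PROGMEM
#define PSTR(s) (s)
#define pgm_read_byte(p) (*(const unsigned char *)(p))
typedef const char * PGM_P;
#endif

// A constant string kept in flash on the AVR so it never occupies SRAM.
const char GREETING[] PROGMEM = "Now is the time";

// Copy a program-space string into a data-space buffer, byte by byte,
// bridging the gap from flash to SRAM.
void copyFromFlash(char * to, PGM_P from, unsigned size) {
    while (size-- > 0) {
        *(to++) = pgm_read_byte(from++);
    }
}
```

On the AVR, forgetting the pgm_read_byte() and dereferencing the flash pointer directly would silently read from the same numerical address in SRAM instead.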

Resource constraints make you a better developer.

It is a lot easier to write code for a whopping big server with gigabytes of real memory, gigabytes more of virtual memory, terabytes of disk space, and many processing cores each of which runs at many gigahertz, and a virtual machine that hides all the hardware from you. Heck, anybody can do that. Shoehorning a complex multi-tasking C++ application into 8KB of SRAM, that takes some real thought.

You are forced to make architectural decisions up front and careful design and implementation decisions as you go. You are forced into understanding the consequences of your actions as you decide: do I really need virtual functions in this class? You have to figure out how things actually work under the hood as you ponder: what happens when I go past the 128KB flash boundary in my application?

As they do in so many problem domains, constraints force you to confront the implications of your decisions head on. That makes you a better developer, and that is why I like the megaAVR as a teaching platform. I like to say "all the interesting problems are really scalability problems", and resource constraints allow you to see scalability problems while spending a lot less money.

C++ templates are a win in the eight-bit realm.

C++ templates are a form of code generation, like the C preprocessor but more structured, and so they must be used judiciously, especially in a resource constrained environment. But they can be used to solve some of the very problems that resource constraints bring to the table. And they can be used to make your software more reliable with little or no additional overhead.

When you have only 8KB of data memory into which you must squeeze all your variables, a sizable stack for each task, and a heap from which memory may be dynamically allocated, you may come to realize that your heap isn't going to be very big. C++ templates are a way of implementing variable sized objects as local variables on a stack, instead of using malloc() to dynamically allocate them. I've written about this before.
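A minimal sketch of the idea (hypothetical, not the actual Amigo classes): make the capacity a template parameter, and every instance becomes a fixed-size object that can live on a task's stack with no heap in sight.

```cpp
#include <cstddef>

// Hypothetical sketch: the buffer capacity is a compile-time template
// parameter, so each instance is a fixed-size object that can live on a
// task's stack (or in static storage) with no malloc() and no heap.
template <std::size_t _CAPACITY_>
class StackBuffer {
public:
    std::size_t capacity() const { return _CAPACITY_; }
    unsigned char * data() { return payload; }
private:
    unsigned char payload[_CAPACITY_];
};
```

A StackBuffer<64> and a StackBuffer<512> are different types, each sized exactly, and the compiler accounts for both in the task's stack requirement instead of the heap fragmenting at run-time.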

Templates can make your software more reliable by allowing you to implement generic code in a base class, then make it type specific in a derived class generated by a template.  For example, I wrote the C++ wrapper Queue around the FreeRTOS queue facility. FreeRTOS queues are synchronized ring buffers that can be used by an application to pass data back and forth with an interrupt-driven device driver, or for two concurrent tasks to pass data back and forth. Amigo uses them in both ways. A FreeRTOS queue can contain any number of fixed length objects. The Queue class is as generic as the underlying FreeRTOS functions. But the TypedQueue class extends Queue for a specific data type, and makes all of the Queue operations type safe. This makes it a lot harder to screw up and send the wrong message to the wrong queue.

template <typename _TYPE_>
class TypedQueue
: public Queue
{

public:

    explicit TypedQueue(Count count, const signed char * name = 0)
    : Queue(count, sizeof(_TYPE_), name)
    {}

    virtual ~TypedQueue() {}

    bool peek(_TYPE_ * buffer, Ticks timeout = IMMEDIATELY) { return Queue::peek(buffer, timeout); }

    bool receive(_TYPE_ * buffer, Ticks timeout = NEVER) { return Queue::receive(buffer, timeout); }

    bool receiveFromISR(_TYPE_ * buffer, bool & woken = unused.b) { return Queue::receiveFromISR(buffer, woken); }

    bool send(const _TYPE_ * datum, Ticks timeout = NEVER) { return Queue::send(datum, timeout); }

    bool sendFromISR(const _TYPE_ * datum, bool & woken = unused.b) { return Queue::sendFromISR(datum, woken); }

    bool express(const _TYPE_ * datum, Ticks timeout = NEVER) { return Queue::express(datum, timeout); }

    bool expressFromISR(const _TYPE_ * datum, bool & woken = unused.b) { return Queue::expressFromISR(datum, woken); }

};

Ring buffers are a fundamental interprocess communication mechanism.

By the way, synchronized ring buffers, that is, buffers that provide atomic reads and writes with synchronized access to concurrent tasks and that wrap around the underlying storage, have long been useful interprocess communication (IPC) mechanisms for solving general producer-consumer problems. Queues like those in FreeRTOS are most often, in my experience, used to store individual bytes of data. But they are equally adept at storing pointers to buffers or even to objects, and so can be thought of as a more general asynchronous message passing scheme.
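A minimal single-producer, single-consumer sketch of the underlying structure (illustrative; FreeRTOS adds blocking and task synchronization on top of exactly this):

```cpp
#include <cstddef>

// Minimal ring buffer sketch: fixed storage plus head and tail indices
// that wrap around the underlying array. FreeRTOS queues layer blocking
// and task synchronization on top of this basic producer-consumer
// structure; this sketch shows only the wrapping.
template <std::size_t _SIZE_>
class RingBuffer {
public:
    RingBuffer() : head(0), tail(0), count(0) {}
    bool send(unsigned char datum) {
        if (count == _SIZE_) { return false; }   // Full: producer must wait.
        payload[tail] = datum;
        tail = (tail + 1) % _SIZE_;
        ++count;
        return true;
    }
    bool receive(unsigned char * datum) {
        if (count == 0) { return false; }        // Empty: consumer must wait.
        *datum = payload[head];
        head = (head + 1) % _SIZE_;
        --count;
        return true;
    }
private:
    unsigned char payload[_SIZE_];
    std::size_t head, tail, count;
};
```

Store bytes in the payload and you have a serial driver's queue; store pointers and you have asynchronous message passing.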

C++ references are better than pointers.

C++ has pointers, just like C. But it also has references, which actually are pointers but with some useful restrictions, like: there is no such thing as a NULL reference. Yes, if you are quite clever, you can create a NULL reference, but your code will soon be on its way to a fatal error. Using references instead of pointers can make your code simpler and more reliable.

A common idiom for optional function arguments in C is to declare them to be pointers. If they aren't used in a particular call to a function, you pass a NULL pointer. Your function has to check for this. Sometimes you forget. Wackiness ensues.

In C++ you can use references and default parameters instead.

bool Queue::sendFromISR(const void * datum, bool & woken = unused);

This method of the Queue class is used to send data to a synchronized ring buffer from an interrupt service routine. It has two arguments: a pointer to the data to be sent, and a reference to a boolean variable that is returned to the caller with a value indicating whether or not this operation woke up a higher priority task. (Users of FreeRTOS will already be familiar with this idiom.)

C++ turns the second parameter into what is effectively a pointer, although the syntax for its use inside the instance method makes it look just like a variable. The pointer dereferencing stuff is all handled automatically by the compiler. That's why there is no possibility of a NULL pointer: there is no way syntactically for you to specify it, and hence you can't even check for it in the method.

Sometimes the application cares about the returned boolean value, and sometimes it doesn't. When it does, it passes its own boolean variable as the second argument, overriding the default parameter. When it doesn't, C++ passes the default parameter, a reference to the boolean variable unused. The instance method never has to check for a NULL pointer, because a NULL pointer can't ever be passed in. There is always a reference to a boolean variable for the method to use.

And what the deuce is unused? It's just a dummy variable, defined elsewhere, which is write-only: it's written to by Queue::sendFromISR() but no one ever reads it. It could be implemented, for example, as a private class (static) variable of the Queue class.
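Here's a self-contained sketch of the whole idiom, with unused as a class static; the names are hypothetical, not the actual Amigo declarations.

```cpp
// Hypothetical sketch of the default-reference-parameter idiom: callers
// who don't care about the out-parameter simply omit it, the compiler
// supplies a reference to a write-only dummy, and the method never has
// to check for a NULL pointer because one can't be passed in.
class Notifier {
public:
    bool signal(bool & woken = unused) {
        woken = true; // Always safe: 'woken' can never be NULL.
        return true;
    }
private:
    static bool unused; // Write-only dummy shared by all default calls.
};

bool Notifier::unused = false;
```

A caller that cares writes `bool woken; notifier.signal(woken);` and one that doesn't just writes `notifier.signal();`.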

I should mention: there are some odd things about references too. C++ deliberately makes it hard to have an instance variable to which a reference has not yet been assigned. The syntax for assigning a reference to the variable can look like an assignment statement, whereas an actual assignment statement is assigning something not to the reference variable but to the thing to which it refers. You may really only be able to tell them apart in context. That throws a lot of folks new to C++. When I use references in constructor arguments (which I do routinely) I actually prefer to convert them to pointers to be stored in pointer instance variables. I find that leads to fewer mistakes, both on my part and on the parts of maintenance developers who come after me.

Here's an example that does just that.

class CriticalSection
{

public:

    CriticalSection(MutexSemaphore & mutex)
    : mutexp(&mutex)
    {
        if (!mutexp->take()) {
            mutexp = 0;
        }
    }

    ~CriticalSection() {
        if (mutexp != 0) {
            mutexp->give();
        }
    }

private:

    MutexSemaphore * mutexp;

};

Doxygen is great even if you don't use Doxygen.

My love affair with Doxygen goes back more than a decade. Inspired by javadoc, Doxygen is a tool that scans your source code for comments written in a very specific format, and generates API documentation based on your code and those comments. You can use Doxygen with any of several programming languages (including Java) to automatically generate documentation in the form of HTML web pages, TeX files, PDF documents, etc. It works great with C and C++.

For example:

/**
 * This is the function from which nothing ever returns.
 * It disables interrupts, takes over the console serial port,
 * prints a message if it can using busy waiting, and infinite
 * loops. This version can be called from either C or C++
 * translation units.
 * @param file points to a file name in program space,
 * typically PSTR(__FILE__).
 * @param line is a line number, typically __LINE__.
 */
CXXCAPI void amigo_fatal(PGM_P file, int line);


Screen shot: amigo_fatal Doxygen comments

But I would still love Doxygen even if I never used any of the documentation that it generates. Doxygen enforces a very specific comment format and discipline for commenting functions, methods, parameters, classes, files, and even preprocessor symbols and macros. Running Doxygen against the source code base yields warnings about undocumented source code. Doxygen is like an automated code inspector that lets me know when I've slipped up. It's one of the many ways I keep myself honest.

Since the public and protected API is defined in header files, that is typically where I put the bulk of my Doxygen comments. Documenting my public API helps refine its design just like unit tests do: if while writing Doxygen comments I find myself thinking "This is rubbish! Who is the cretin that designed this?" I know my API design is lacking in credibility.

Big city techniques work just fine in small town microcontrollers.

I've discovered that with some care and discipline, the techniques I have used for the past decade or two for embedded and real-time development on larger platforms work just fine on tiny eight-bit microcontrollers, and they bring all the same advantages to the table.

I continue to learn.

Update 2012-05-14

Since writing this I have run my entire unit test suite without changes on an Arduino Mega ADK board with an Arduino Ethernet shield. Getting it to work took all of maybe ten minutes, and almost all of that was trying to figure out the pin alignment when plugging in the Ethernet Shield onto the Mega ADK. This says a lot about the compatibility of the Freetronics EtherMega board, which is supposed to behave like an Arduino Mega board with an Ethernet shield. Apparently it does.

Wednesday, April 11, 2012

Hitting a Moving Target While Flying Solo

Embedded developers inevitably seem to have to worry a lot more about their tool chains than folks who develop desktop or server-side applications. Tool chains are that vast collection of compilers and utilities you use to get from source code that you write to a binary executable image that runs on the actual hardware that you care about.

For one thing, when you are writing code that is close to bare metal, things that might seem otherwise trivial are actually really important. Like whether or not the machine code implementation of a function falls in the first 128 kilobytes of memory. Or whether an access can change the state of a variable that happens to be a memory mapped I/O register.

But the other big issue is that configuring and building a tool chain is no small feat. Configuring and building a tool chain for cross compilation -- that is, one that generates and processes executable machine code for a different hardware target than the one on which it is running -- is even more fraught with peril. It typically requires careful selection and configuration of a number of large, complex, and independent components, such as a specific GNU compiler collection package (from whence come the C and C++ compilers), a specific binary utilities package (which provides the linker among other necessary tools), a run-time library package, and so on.
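The shape of such a build looks roughly like the sketch below. This is illustrative only: the package versions, prefix, and directory names are assumptions, and a real build needs exactly the versions your target and run-time library require, built in this order because each component depends on the one before it.

```shell
# Illustrative cross-toolchain build sketch; versions and paths are
# hypothetical, and real builds often need patches and extra flags.
PREFIX=$HOME/avr-toolchain
TARGET=avr
export PATH=$PREFIX/bin:$PATH

# 1. Binutils first: provides the target assembler and linker.
(cd binutils-2.20 && ./configure --target=$TARGET --prefix=$PREFIX \
    && make && make install)

# 2. Then the compiler collection, built against those binutils.
(cd gcc-4.5.1 && ./configure --target=$TARGET --prefix=$PREFIX \
    --enable-languages=c,c++ && make && make install)

# 3. Finally the run-time library, compiled with the new cross compiler.
(cd avr-libc-1.8.0 && ./configure --host=$TARGET --prefix=$PREFIX \
    && make && make install)
```

Get any one of those version choices wrong relative to the others and you may not find out until something subtle breaks at run time, which is the story told below.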

This is a troublesome issue that desktop and server-side developers seldom have to worry about. Not that those folks don't have their own problems. Back in my Enterprise Java days I did have to give some thought before clicking on "Install" when a pop-up announced that a new version of Java was available. And the speed at which the many open source frameworks a large application might rely upon changed was astounding. But many of the desktop and server-side developers I hang out with today don't even know whether their system has a C or C++ compiler. And rightfully don't care.

I have had much success recently at building and running a multi-threaded interrupt-driven C++ application using FreeRTOS on the Freetronics EtherMega 2560 board, which uses the Atmel AVR ATmega2560 microcontroller. I have been using the AVR CrossPack package of GNU cross compilers on my desktop Mac with no problems whatsoever. I have also had the occasion to build and test my application under Windows 7 using the AVR Studio 5.1 IDE, which includes a very slightly older version of the GCC tool chain. So it seemed like a no brainer to try building on my big multicore Ubuntu 10.04 server using the cross-compilation tool chain installed by the Synaptic package manager.

Uh oh.

Yeah, the application built just fine, but went seriously south right about the time my code enabled interrupts on the microcontroller. South as in jumping to the reset vector and entering into a rolling reboot. My expensive JTAG debugger was useless in this case, since it's only supported on Windows, and the Windows build worked.

The AVR CrossPack for my Mac uses GCC 4.5.1 and AVR libc 1.8.0. AVR Studio on Windows uses GCC 4.5.1 and AVR libc 1.7.1. Those worked just fine. Ubuntu uses GCC 4.3.4 and AVR libc 1.6.7. Those sucked, at least for my application. I was so bold as to run the Mac and Ubuntu binary executable images through the AVR disassembler, figuring what the heck, I had a passing familiarity with AVR assembler, and how different could they be? Yeah, right. The graphical diff tool ran for a long time before finally disabusing me of that notion.


So last night, after futzing around for the better part of an afternoon with visions of having to spend a few days trying to lovingly handcraft a whole new tool chain, I posted a query to AVR Freaks, an international forum of AVR users. By this morning I had many suggestions from folks in the U.S., Denmark, Sweden, and Germany, one of which pointed me to a pre-built Debian package on a British web site that included just the versions of the tool chain I needed. It took me all of maybe fifteen minutes to go from reading that comment, through installing the package, modifying my makefile, and building and downloading my application, to all of my unit tests passing.

You guys rock.

That's the good news. Here's the bad news: this is not uncommon. Open source software, including tool chains, is a rapidly moving target. How many times have you heard a manager say "we don't have to develop any of that code, it's all open source, and it's free"? It's seldom that simple, for embedded developers, or for any other kind of developer for that matter. It's only free in the sense that the manager doesn't have to cut a purchase order. Or in the sense that their employees' time isn't considered valuable.

Whether using open source is easy or not will depend on your very specific requirements and the exact combination of tools, utilities, and libraries that you need, and even on what operating system distribution and release you are running on the machine upon which you want to install this stuff. The level of difficulty can range from a few minutes' work (see above) to something completely outside your schedule. You may not be able to reliably gauge where you are on this spectrum until you are deeply into it. Meanwhile: it's all mutating, each independent package changing at its own rate, driven by someone else's requirements which may or may not jibe with your own.

This can be an even more vexing issue for embedded developers. I remember a few years ago I was debugging a hang during boot in my custom Linux 2.6 build for a client's embedded project using a Freescale PowerPC processor. I traced it down to a bug in processor-specific code inside the Linux kernel that handled the hardware clock: depending on the non-deterministic value a hardware register powered up with, kernel initialization code executed during boot would wait until the clock wrapped around. That could take a while. Like maybe hours or days. This code, by sheer random chance, might work the first time you executed it. Maybe even the second time. But eventually your processor would go away and not come back until you lost patience and hit the reset button. It didn't take much testing to notice.
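The general shape of that class of bug can be sketched. Everything here is hypothetical, not the actual kernel code: a fake counter stands in for the free-running hardware clock, and the wait loop's starting point comes from a value that is non-deterministic at power-up.

```c
#include <stdint.h>

/* Stands in for a free-running hardware clock register; each read
 * advances it, the way time does. */
static uint32_t fake_counter = 0;
static uint32_t read_clock(void) { return fake_counter++; }

/* Spins until the clock reaches `start`. If `start` comes from an
 * uninitialized register and happens to power up near UINT32_MAX,
 * the loop effectively waits for the counter to wrap all the way
 * around, which on a slow tick source can take hours or days. */
uint32_t wait_past(uint32_t start)
{
    uint32_t now;
    do {
        now = read_clock();
    } while (now < start);
    return now;
}
```

With a small `start` the loop returns almost immediately, which is why the bug usually passes its first few boots and only bites when the register powers up with an unlucky value.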

When you are using open source software on a mainstream processor -- which these days means an Intel x86 of some vintage -- you can be reasonably sure that hundreds if not thousands of people have already been using the same software on a daily basis before you ever laid eyes on it. But in the embedded domain, you have to accept the fact that it is entirely possible that you are the only guy in the entire world using, or even to have ever used, that exact version of that exact software on that exact processor model.

In which case the adage "given enough eyeballs, all bugs are shallow", while perhaps true, isn't helpful.