Monday, January 29, 2007

Generic Embedded Programming Using Templates

A couple of years ago a senior technologist for the CTO organization of the company I was working for at the time remarked that real-time applications could not be written in C++. He said this in the presence of myself and another developer who had spent the past decade in product development for this same company writing real-time embedded code in C++. She and I just looked at each other and shrugged.

Developing embedded systems in C++ is not new. Embedded pundits like Jack Ganssle and Dan Saks have been writing about it for years in venues like Embedded Systems Design. But some of the C++ features that sound like old news to developers working in the mainstream still sound exotic to many embedded developers, who by nature are a conservative lot. And rightly so. Embedded systems frequently find themselves in mission critical applications that are intended to run 24x7. And embedded requirements often impose severe real-time and memory constraints on their implementations. No matter what language you use and what features of it you choose to exploit, writing embedded code requires knowledge of how the language works under the hood and a discipline regarding what features to use and what ones to stay away from.

One of the features of C++ that still seems exotic to many embedded developers is its support for generic programming. Generic programming is where the developer writes an algorithm in a data-type-independent way, and then tells the compiler to generate an implementation for a specific data type. Generic programming allows you to debug an algorithm once, then apply it to a much broader variety of circumstances. It reduces the amount of code that must be maintained, and since code maintenance is as much as two-thirds of the total software lifecycle cost, this can directly affect the bottom line of product development. Generic programming also abstracts out implementation details, which increases the reusability of code.

Most people associate generic programming in C++ with templates, and in this article so shall we. Templates directly support generic programming in exactly the way described above, by allowing you to specify an algorithm then apply the algorithm to different data types, leaving it up to the compiler to generate the appropriate code. As we shall see in later articles, there are other features of C++ (and even of C) that also support generic programming. But since templates are the first thing that folks think about when they think about generic programming, we’ll look at examples of using templates from the Desperado open source C++ library.

(Disclaimer: my goal here is not to teach you C++ or how to use templates. There are lots of books that can do that. I’m going to try to show you some examples of how templates, and generic programming in general, can be used to solve common problems in embedded applications. Rather than go into a lot of detail, I’m just trying to give you a flavor of how generic programming might be applied in an embedded context. Downloading and perusing the Desperado source code is a good idea if you really want to get into the nitty-gritty of the implementation using templates.)

Example: Memory Testing

Michael Barr wrote an article in Embedded Systems Programming (now Embedded Systems Design) on how to do efficient testing of memory in software. Barr implemented a suite of three tests: a data bus test, an address bus test, and a memory device test. The code for these tests is dependent on the width of the address and data busses. Depending on the design of the embedded memory system, these widths (which may not be the same) may be 16, 32, or 64 bits.

Desperado implements Barr’s memory test suite as a template, a model algorithm coded without exactly defining the data type, which it stores in the header file <Ram.h>. The definition of the Ram template in <Ram.h> starts out looking something like


template <typename TYPE>
class Ram { ... };


where TYPE is a data type passed as a compile-time parameter to the template when the C++ compiler generates the resulting type-specific code.

An application may ask the C++ compiler to generate the code for a 32-bit wide memory test with the following code snippet, using the fixed-width uint32_t 32-bit integer data type defined in the standard C header file <stdint.h>, and the implementation-defined size_t data type defined in the standard C header file <stddef.h>. (These system header files are included automatically by the header file <Ram.h>.)


#include <Ram.h>
...
Ram<uint32_t> test32;


The application then may run all three tests of the suite against the memory at location 0x80000000 that is 4KB in length:


uint32_t* address = (uint32_t*)0x80000000;
size_t size = 4096;
...
test32.address(address, size);
test32.data(address);
test32.device(address, size);


This assumes that the address and data buses are the same width. If they are not, the application can ask C++ to generate a second instance of the code with a different width and use both tests appropriately.


Ram<uint64_t> test64;
...
uint64_t* address = (uint64_t*)0x80000000;
...
test64.data(address);


Desperado makes this process even simpler by pre-instantiating code for 8, 16, 32 and 64-bit bus widths, which an application can use by just linking them in.


#include <uint32_Ram.h>
...
uint32_t* address = (uint32_t *)0x80000000;
size_t size = 4096;
...
uint32_Ram.address(address, size);
uint32_Ram.data(address);
uint32_Ram.device(address, size);


Example: Cyclic Redundancy Checks

Ross Williams wrote a great paper, A Painless Guide to CRC Error Detection Algorithms, which describes a cyclic redundancy check (CRC) algorithm that can be used to implement any of a wide variety of standard CRCs. CRCs are used to detect data corruption in just about every telecommunications and networking protocol, as well as in many data storage systems. The algorithm he provides is specific to the data type of the resulting CRC value that it generates; commonly used CRC algorithms produce either 16-bit or 32-bit values. Williams’ algorithm also takes four parameters: the CRC polynomial (which describes how the CRC is computed, encoded as a bit pattern the same width as the resulting CRC value), a boolean indicating whether or not the CRC is reflected (which changes how the value is computed), an initial value of the CRC before any data is passed through the algorithm, and a value with which the CRC partial result is exclusive-or’ed to get the final CRC value.

Desperado implements Williams’ CRC algorithm as a template, with the resulting CRC data type as its parameter, which it defines in the header file <Crc.h>. The resulting class generated by the C++ compiler has a constructor that takes the four parameters of Williams’ CRC algorithm. The definition of the Crc template in <Crc.h> starts out looking something like


template <typename TYPE>
class Crc { ... };


where TYPE is a data type passed as a compile-time parameter to the template when the C++ compiler generates the resulting type-specific code. The generic constructor has a prototype that looks something like


explicit Crc(TYPE poly, bool refl, TYPE init, TYPE xoro);


which declares the four parameters needed by Williams’ CRC algorithm. Note that the parameters poly (the polynomial), init (the initial value), and xoro (the xor bit mask) are all of type TYPE, which is not specified until the template is instantiated by the application. The code for the Crc algorithm is written without knowing exactly what that type is.

The application can ask the C++ compiler to generate the industry standard CRC32 algorithm like so


#include <Crc.h>
...
Crc<uint32_t> crc32(0x04c11db7, true, 0xffffffff, 0xffffffff);


which also passes the four parameters of the CRC32 algorithm to the constructor when it instantiates an object of type Crc<uint32_t>. (Note that the value 0x04c11db7 is a common bit-encoding used to represent the polynomial used to compute the standard CRC32 value.) The application then can use this CRC32 algorithm to compute a cyclic redundancy check across a buffer of data, first initializing the crc32 object. (The Desperado Crc template defines operators for the resulting class which allow an object of that type to be used as a functor, that is, an object which can be called like a function. Hence crc32 is not a function, but rather an object, and the following code is calling methods defined for that object. Williams’ CRC algorithm also requires a state variable, which Desperado chooses not to keep inside the Crc object so that the same Crc object may be used in reentrant code.)


uint32_t state;
...
crc32(state); // initialize the caller-owned state
uint32_t crc = crc32(state, buffer, size); // compute the CRC over the buffer


To make the power of templates for embedded systems a little more obvious, here are the complete definitions for the industry standard CRC16, CRC32 and ATM AAL5 cyclic redundancy check implementations using the Desperado Crc template.


#include <Crc.h>
...
Crc<uint16_t> crc16(0x8005, true, 0, 0);
Crc<uint32_t> crc32(0x04c11db7, true, 0xffffffff, 0xffffffff);
Crc<int32_t> aal5crc(0x04c11db7, false, 0xffffffff, 0xffffffff);


As before, Desperado pre-instantiates the commonly used CRC16 and CRC32 algorithms so that they can be used by an application by just linking them in.


#include <crc32_Crc.h>
...
uint32_t state;
crc32_Crc(state);
uint32_t crc = crc32_Crc(state, buffer, size);


Example: Variable-sized Objects on the Stack

Most people think of the compile-time template parameters as always being data types, but they can also be compile-time constant values of primitive types, like integers. This can be used to get around a constraint common to many embedded systems: the requirement not to allocate and free memory from the heap except at boot time.

Memory allocation can be really problematic for systems with real-time constraints. The time required for heap management is non-deterministic, dependent on the current state of the heap and what has been allocated and freed in the past. This makes it difficult or impossible to predict how long it will take to allocate and free memory in a time-constrained section of code.

Embedded systems very commonly address this issue by pre-allocating all objects at boot time and placing them in pools, which are often implemented as linked lists. When you need a buffer, you do not malloc or new memory from the heap, you merely unlink a buffer from the buffer pool. When you are done with the buffer, you do not free or delete it, you link it back onto the same buffer pool. If you need a buffer and the pool is empty, your choices are [1] break down and allocate a new buffer from the heap, making your pool a little larger when you return it, [2] try again later, or [3] signal an alarm, which might be as simple as lighting a little red LED on the front panel. Choice #3 is more common than you might think. Running out of a resource in an embedded application often indicates a serious failure in traffic engineering of the system in which it is embedded.

Unlike Java (prior to version 6 anyway), C++ also gives you the option of allocating an object temporarily on the stack. Stack variables exist as long as they are in scope, which means within the function or execution block in which they are defined. Often this is perfect for allocating an object whose lifespan is only as long as the current function. (Under no circumstances, however, should you allocate an object on the stack, pass a pointer to it to another functional unit, and then exit the function. The other functional unit holding the pointer now has the address of a section of the stack which has been de-allocated and which will eventually be reused for some other purpose. Wackiness will ensue.) This is done as simply as


#include <Transliterator.h>
...
{
uint8_t stack[16];
Transliterator transliterator(stack, 16);
...
}


where the object transliterator of type Transliterator will be deleted as soon as the thread of control passes the closing curly brace. In the example above, Transliterator is a simple general-purpose parser implemented in Desperado as a table-driven push-down automaton (the original meaning of PDA) and defined in the header file <Transliterator.h>. You can imagine an embedded application defining a simple language for a command line interpreter available via a console port for purposes of field troubleshooting. The application defines the language in the form of a table, which the Transliterator interprets to parse the language.

PDAs require a stack in which to keep track of their current context when the language that they are parsing contains nested elements, for example nested parentheses for a parser that interprets arithmetic expressions. This is what the array is used for in the example above. How big a context stack does a Desperado Transliterator need? It depends entirely on the language it is parsing, which is defined not by the Transliterator itself, but by the application that is using it.

If allocating objects on the heap were not a problem during run-time, the size of the necessary context stack could just be passed as a parameter to the Transliterator constructor and it would just allocate an array of that size to use during parsing. But since we want to avoid allocating and freeing memory after boot time, we instead allocate a context stack on the C++ stack, as shown above.

Desperado streamlines this process by defining a second class, TransliteratorType, defined in the header file <TransliteratorType.h>, that extends Transliterator by taking the desired size of the context stack as a template parameter. The definition of TransliteratorType in <TransliteratorType.h> looks something like this


template <size_t CONTEXTS>
class TransliteratorType : public Transliterator { ... };


where CONTEXTS is the size of the context stack. As part of the definition of the TransliteratorType class there is a field


uint8_t array[CONTEXTS];


that defines and allocates the context stack as part of the object.

When the application creates a TransliteratorType object by doing the following


#include <TransliteratorType.h>
...
{
TransliteratorType<16> transliterator;
...
}


then the TransliteratorType object, the Transliterator object that is its base class object, and the sixteen-entry context stack are all allocated on the C++ stack as part of the TransliteratorType object itself, to be automatically freed when the thread of control exits the code block.

I admit that this example is not a great one, since the context stack can be just as easily allocated on the C++ stack explicitly without resorting to the use of a template. But this technique can also be used in a similar fashion to define variable length objects to be allocated on the stack and later deallocated for use within a code block, without resorting to new and delete. And it does serve as an example of how to bundle such a variable length allocation into the class definition.

What’s the fine print?

There can be some significant compile-time overhead to using templates, particularly with very complex template definitions like those found in the Standard Template Library (STL), which is now part of the C++ standard library. However, if you are writing your own templates like the ones I described above that are part of Desperado, you are unlikely to notice much difference in compile time. Even when using the STL (which I recommend), the longer build times are usually a price worth paying. (I have had the 3.3.2 version of the GNU C++ compiler go into what appeared to be an infinite loop when using the template-related -frepo compiler option, so your mileage may vary.)

Templates by themselves add no run-time execution overhead: the compiler resolves template instantiations at compile time, so the generated code is no slower than the equivalent non-template C++ code. The indirection that people worry about comes from virtual methods, where method calls are mapped through a virtual table or v-table, whether or not templates are involved. I used to worry about this, but I got over it, since it only adds a very few machine instructions to every method call. Still, if your real-time constraints are that tight, you should worry not just about templates, but about any C++ virtual function. If you are not using virtual functions, your C++ code will be at least as efficient as the equivalent C code, if not better, thanks to the many improved optimization capabilities of C++.

Templates can cause code bloat, and this must be managed carefully in an embedded application. Instantiating a template at compile-time causes the C++ compiler to generate the code to implement the class given the template parameters. This can cause multiple instantiations of the same code to be generated by the compiler. In the best case this may be detected at link time, but in the worst case the resulting executable may be a lot larger than it needs to be.

Section 6.5, “Where’s the Template?”, of the GCC manual describes your options for controlling when and where the G++ compiler generates code for templates. Desperado’s makefile uses the G++ command line parameter


-fno-implicit-templates


which forces the application to explicitly define when and where code for templates is generated. For any template instantiations that Desperado uses itself or provides by default, those instantiations are generated once and only once and are stored as object files in the Desperado library, where they may be incorporated into an application at link time. For example, this is done in Desperado as simply as a translation unit containing just these two lines


#include <Crc.h>
template class Crc<uint32_t>;


which causes the compiler to generate and compile the code to implement a Crc object for an unsigned 32-bit value. Any other translation unit, in Desperado or an application using it, that uses the Crc template in a manner like


Crc<uint32_t> crc32(0x04c11db7, true, 0xffffffff, 0xffffffff);


would just link in this pre-defined and pre-compiled code. I recommend this explicit approach to using templates, although Section 6.5 offers other alternatives.

Templates are just one way for writing generic code. In future articles I will describe other techniques, some of which you may be already using without realizing it.

Update 2009-11-22

Nearly three years later, here I am doing an application with the Standard Template Library and still trying to use the -fno-implicit-templates compiler option. While I could probably eventually get it to compile by explicitly instantiating the requisite STL templated classes and functions, I have convinced myself that it's not worth it for my current project. The option might make sense if you (like me in 2007) have complete control over your template classes. But using the STL, or other complex templated frameworks, makes this fine control over template instantiation much harder. I say: drop the -fno-implicit-templates compiler option if the resulting compiled code fits your memory footprint.

Update 2012-04-30

But if you are really resource constrained, you probably aren't using the STL anyway, and the -fno-implicit-templates option is still a very good idea. For example, when using templates on tiny eight-bit microcontrollers, memory is so precious that explicitly instantiating templates is well worth the trouble. That's what I've done for my Amigo project, yet another three years later.

Sources

Michael Barr, “Software-Based Memory Testing”, Embedded Systems Programming, July 2000, pp. 28-40

Richard M. Stallman et al., Using the GNU Compiler Collection (GCC), 4.1.1, 2005

Ross Williams, A Painless Guide to CRC Error Detection Algorithms, Rocksoft Pty Ltd., Adelaide, Australia, 1993

Desperado, Digital Aggregates Corp., 2007

Amigo, Digital Aggregates Corp., 2012

3 comments:

mcjoe said...

Great stuff. A caveat should be made regarding STL use. Under the hood, the STL is doing memory allocation and deallocation, so it may not be appropriate everywhere.

Chip Overclock said...

You're absolutely right, my bad. It depends on the situation and the STL implementation, so your mileage may vary.

The last time I used the STL, I was using some of the collection classes, Map and MultiMap as I dimly recall. The STL implementation I was using, based on the SGI distribution I think, actually maintained pools of objects to avoid the overhead of constantly doing new and delete plus the risk of fragmenting the heap. I found this acceptable in my embedded application. You may not.

But I encourage folks to look closely at the STL. If you can use it, it will save you huge amounts of time. And you shouldn't be scared off just because it uses templates; this can be managed, even in an embedded application.

Just to be even more self-promotional than usual, Desperado has a C++ class (and Buckaroo has an equivalent Java class) that implements a clever (meaning: I didn't invent the idea, I stole it, in this case from the Linux kernel) doubly-linked circular list implementation, the Chain, that might be a suitable simple collection for really constrained systems that can't use the STL. It can be used to implement a queue or a stack, and has all sorts of useful properties. You can remove objects from anywhere on a Chain in constant time, you can find the root of the Chain in constant time from any object on that Chain, and you can embed the Link portion of the Chain inside your object or keep each Link as a separate object. And a single object can be easily managed on multiple Chains simultaneously.

But I love the STL the same way I love the Java collection classes. There's no glory in writing Yet Another hash table algorithm.

Thanks, Joe!

Chip Overclock said...

Dan Saks has kindly pointed out a mistake in the declaration of the address variable in the Ram test example, which I have corrected. Thanks, Dan!