Tuesday, April 04, 2017

Some Stuff That Has Worked For Me In C and C++

Diminuto (Spanish for "tiny") is a library of C functions and macros that support the kind of low level embedded, real-time, and systems programming work I am typically called upon to do on the Linux/GNU platform. It's no longer so tiny. I started it way back in 2008 and it is still going strong. Its open-source (FSF LGPL 2.1) code has found its way into a number of shipping commercial products for more than one of my clients. Diminuto can be found on GitHub along with eighteen (so far) other repositories (some private) of my work.

My forty year career has been a long, strange, and marvelous trip. Along the way, I've found a number of techniques of architecture, design, and implementation in C and C++ that have worked well for me, solving a number of recurring problems. I don't claim that these techniques are the best, or that I am the only, or even the first, developer to have thought them. Some of them I shamelessly borrowed, sometimes from other languages or operating systems, because I try to know a good idea when I see one.

This article documents some of those ideas.

Standard Headers and Types

Just about every big closed-source project I've worked on defined its own integer types. Even the Linux kernel does this. Types like u32 for unsigned 32-bit integers, or s8 for signed 8-bit integers. Historically - and I'm talking ancient history here - that was a good idea, way back when C only had integer types like int and char, and there weren't dozens and dozens of different processors with register widths that might be 8, 16, 32, or 64 bits.

But this kind of stuff is now available in standard ANSI C headers, and Diminuto leverages the heck out of them.

stdint.h defines types like uint32_t, and int8_t, and useful stuff like uintptr_t, which is guaranteed to be able to hold a pointer value, which on some platforms is 32-bits, and on others 64-bits.

stddef.h defines types like size_t and ssize_t that are used by a variety of POSIX and Linux systems calls and functions, and are guaranteed to be able to hold the value returned by the sizeof operator.

stdbool.h defines the bool type and constant values for true and false (although, more about that in a moment).

In at least one client project, I talked them into redefining their own proprietary types via typedef to use these ANSI types, which simplified porting their code to new platforms.

Boolean Values

Even though I may use the bool type defined in stdbool.h, I don't actually like the constants true and false. For sure, false is always 0. But what value is true? Is it 1? Is it all ones, e.g. 0xff? Is 9 true? How about -1?

In C, true is anything that is not 0. So that's how I code a true value: !0. I don't care how the C standard or the C compiler encodes true, because I know for sure that !0 represents it.

I normalize any value I want to use as a boolean using a double negation. For example, if I have two variables alpha and beta that are booleans. Do I use
(alpha == beta)
to check if they are equal? What if alpha is 2 and beta is 3? Both of those are true, but the comparison will fail. Do I use
(alpha && beta)
instead? No, because I want to know if they are the same, both true, or both false, not if they are both true. If C had a logical exclusive OR operator, I'd use that - or, actually, the negation of that - but it doesn't. I could do something like
(((alpha && beta) || ((!alpha) && (!beta)))
but that hurts my eyes.

Double negation addresses this: !!-1 equals !0 equals whatever the compiler uses for true. I use
((!!alpha) == (!!beta))
unless I am positive that both alpha and beta have already been normalized.

I also do this if I am assigning a value to a boolean
alpha = !!beta
unless I am very sure that beta has already been normalized.

Inferring Type

If I want to check if a variable being used as a boolean - no matter how it was originally declared - is true, I code the if statement this way.
if (alpha)
But if the variable is a signed integer and I want to know if it is not equal to zero, I code it this way.
if (alpha != 0)
And if it's an unsigned integer, I code it this way.
if (alpha > 0)
When I see a variable being used, I can usually infer its intended type, no matter how it might be declared elsewhere.

Parentheses and Operator Precedence

You may have already noticed that I use a lot of parentheses, even where they are not strictly speaking necessary. Can I remember the rules of operator precedence in C? Maybe. Okay, probably not. But here's the thing: I work on big development projects, hundreds of thousands or even millions of lines of code, with a dozen or more other developers. And if I do my job right, the stuff I work on is going to have a lifespan long after I leave the project and move on to something else. Just because I can remember the operator precedence of C, the next developer that comes along may not. So I want to make my assumptions explicit and unambiguous when I write expressions.

Also: sure, I may be writing C right now. But this afternoon, I may be elbow deep in C++ code. And tomorrow morning I may be writing some Python or Java. Tomorrow afternoon, sadly, I might be hacking some JavaScript that the UI folks wrote, even though I couldn't write a usable line of JavaScript if my life depended on it. Even if I knew the rules of operator precedence for C, that's not good enough; I need to know the rules for every other languages I may find myself working in.

But I don't. So I use a lot of parentheses.

This is the same reason I explicitly code the access permissions - private, protected, public - when I define a C++ class. Because the default rules in C++ are different for class versus struct (yes, you can define access permissions in a struct in C++), and different yet again for Java classes.

Exploiting the sizeof Operator

I never hard-code a value when it can be derived at compile time. This is especially true of the size of structures or variables, or values which can be derived from those sizes.

For example, let's suppose I need two arrays that must have the same number of elements, but not necessarily the same number of bytes.
int32_t alpha[4];
int8_t beta[sizeof(alpha)/sizeof(alpha[0])];
The number of array positions of beta is guaranteed to be the same as the number of array positions in alpha, even though they are different types, and hence different sizes. The expression
(sizeof(alpha)/sizeof(alpha[0]))
divides the total number of bytes in the entire array with the number of bytes in a single array position.

This has been so useful that I defined a countof macro in a header file to return the number of elements in an array, providing it can be determined at compile time. I've similarly defined macros like widthof, offsetof, memberof, and containerof.

Doxygen Comments

Doxygen is a documentation generator for C and C++, inspired by Java's documentation generator Javadoc. You write comments in a specific format, run the doxygen program across your source code base, and it produces an API document in HTML. You can use the HTML pages directly, or convert them using other tools into a PDF file or other formats.

/**
 * Wait until one or more registered file descriptors are ready for reading,
 * writing, or accepting, a timeout occurs, or a signal interrupt occurs. A
 * timeout of zero returns immediately, which is useful for polling. A timeout
 * that is negative causes the multiplexer to block indefinitely until either
 * a file descriptor is ready or one of the registered signals is caught. This
 * API call uses the signal mask in the mux structure that contains registered
 * signals.
 * @param muxp points to an initialized multiplexer structure.
 * @param timeout is a timeout period in ticks, 0 for polling, <0 for blocking.
 * @return the number of ready file descriptors, 0 for a timeout, <0 for error.
 */
static inline int diminuto_mux_wait(diminuto_mux_t * muxp, diminuto_sticks_t timeout)
{
    return diminuto_mux_wait_generic(muxp, timeout, &(muxp->mask));
}

Even if I don't use the doxygen program to generate documentation, the format enforces a useful discipline and consistent convention in documenting my code. If I find myself saying "Wow, this API is complicated to explain!", I know I need to revisit the design. I document the public functions in the .h header file, which is the de facto API definition. If there are private functions in the .c source file, I put Doxygen comments for those functions there.

Namespaces (updated 2017-04-25)

Most of the big product development projects I've worked on have involved writing about 10% new code, and 90% integrating existing closed-source code from prior client projects. When that integration involved using C or C++ code from many (many) different projects, name collisions were frequently a problem, either with symbols in the source code itself (both libraries have a global function named log), or in the header file names (e.g. both libraries have header files named logging.h).

In my C++ code, such as you'll find in my Grandote project on GitHub (a fork from Desperadito, itself a fork from Desperado, both since deprecated), there is an easy fix for the symbol collisions: C++ provides a mechanism to segregate compile-time symbols into namespaces.

namespace com {
namespace diag {
namespace grandote {

Condition::Condition()
{
 ::pthread_cond_init(&condition, 0);
}

Condition::~Condition() {
 ::pthread_cond_broadcast(&condition);
 ::pthread_yield();
 ::pthread_cond_destroy(&condition);
}

int Condition::wait(Mutex & mutex) {
 return ::pthread_cond_wait(&condition, &mutex.mutex); // CANCELLATION POINT
}

int Condition::signal() {
 return ::pthread_cond_broadcast(&condition);
}

}
}
}

It doesn't matter that another package defines a class named Condition, because the fully-qualified name of my class is actually com::diag::grandote::Condition. Any source code that is placed within the namespace com::diag::grandote will automatically default to using my Condition, not the other library's Condition, eliminating the need for me to specify that long name every time.

Note too that the namespace incorporates the domain name of my company in reverse order - borrowing an idea from how Java conventionally organizations its source code and byte code files. This eliminates any collisions that might have occurred because another library itself has the same library, class, or module name as part of its own namespace.

This works great for C++, but the problem remains for C. So there, I resort to just prepending the library name and the module name to the beginning of every function. You already may have noticed that in the diminuto_mux_wait function in the Doxygen example above: the wait operation in the mux module of the diminuto library.

But one problem that neither C++ namespaces nor my C naming convention, solves is header file collisions. So in both C and C++ I organize header file directories in a hierarchical fashion so that #include statements look like this, from my Assay project.

#include <stdio.h>
#include <errno.h>
#include "assay.h"
#include "assay_parser.h"
#include "assay_fixup.h"
#include "assay_scanner.h"
#include "com/diag/assay/assay_scanner_annex.h"
#include "com/diag/assay/assay_parser_annex.h"
#include "com/diag/diminuto/diminuto_string.h"
#include "com/diag/diminuto/diminuto_containerof.h"
#include "com/diag/diminuto/diminuto_log.h"
#include "com/diag/diminuto/diminuto_dump.h"
#include "com/diag/diminuto/diminuto_escape.h"
#include "com/diag/diminuto/diminuto_fd.h"
#include "com/diag/diminuto/diminuto_heap.h"

The first two header files, stdio.h and errno.h, are system header files. The next four header files whose names begin with assay_ are private header files that are not part of the public API and which are in the source directory with the .c files being compiled. The next two header files are in a com/diag/assay subdirectory that is part of the Assay project. The remaining header files are in a com/diag/diminuto subdirectory that is part of the Diminuto project.

Here's another example, from the application gpstool that makes use of both my Hazer and Diminuto projects.

#include <assert.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include "com/diag/hazer/hazer.h"
#include "com/diag/diminuto/diminuto_serial.h"
#include "com/diag/diminuto/diminuto_ipc4.h"
#include "com/diag/diminuto/diminuto_ipc6.h"
#include "com/diag/diminuto/diminuto_phex.h"

This convention is obviously not so important in my C header files, where the project name is part of the header file name. But it becomes a whole lot more important in C++, where my convention is to name the header file after the name of the class it defines, as this snippet from Grandote illustrates.

#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include "com/diag/grandote/target.h"
#include "com/diag/grandote/string.h"
#include "com/diag/grandote/Platform.h"
#include "com/diag/grandote/Print.h"
#include "com/diag/grandote/DescriptorInput.h"
#include "com/diag/grandote/DescriptorOutput.h"
#include "com/diag/grandote/PathInput.h"
#include "com/diag/grandote/PathOutput.h"
#include "com/diag/grandote/ready.h"
#include "com/diag/grandote/errno.h"
#include "com/diag/grandote/Grandote.h"

Once I came upon this system of organizing header files, not only did it become much easier to integrate and use multiple projects into a single application, but the source code itself became more readable, because it was completely unambiguous where everything was coming from. This strategy worked so well, I use it not just for C and C++, but to organize my Python, Java, and other code too. I also organize my GitHub repository names, and, when I use Eclipse, my project names, in a similar fashion:

  • com-diag-assay,
  • com-diag-grandote,
  • com-diag-diminuto,
  • com-diag-hazer

etc.

This is more important than it sounds. Digital Aggregates Corporation (www.diag.com) is my consulting company that has been around since 1995. It holds the copyright on my open-source code that may find its way into my clients' products. The copyright of my closed-source code is held by another of my companies, Cranequin LLC (www.cranequin.com). So if I see the header file directory com/cranequin or a project or repository with the prefix com-cranequin, I am reminded that I am dealing with my own proprietary code under a different license.

Time (updated 2017-07-26)

The way time is handled in POSIX or Linux - whether you are talking about time of day, a duration of time, or a periodic event with an interval time - is a bit of a mess. Let's see what I mean.

gettimeofday(2), which returns the time of day, uses the timeval structure that has a resolution of microseconds. It's cousin time uses an integer that has a resolution of seconds.

clock_gettime(2), which when used with the CLOCK_MONOTONIC_RAW argument returns an elapsed time suitable for measuring duration, uses the timespec structure that has a resolution of nanoseconds.

setitimer(2), which is be used to invoke an interval timer, uses the itimerval structure that has a resolution of microseconds.

timer_settime(2), which also invokes an interval timer, uses the itimerspec structure that has a resolution of nanoseconds.

select(2), which is be used to multiplex input/output with a timeout, uses the timerval structure and has a resolution of microseconds. Its cousin pselect uses the timespec structure that has a resolution of nanoseconds.

poll(2), which can also be used to multiplex input/output with a timeout, uses an int argument that has a resolution of milliseconds. It's cousin ppoll uses the timespec structure that has a resolution of nanoseconds.

nanosleep(2), which is used to delay the execution of the caller, uses the timespec structure that has a resolution of nanoseconds. Its cousin sleep(3) uses an unsigned int argument that has a resolution of seconds. Its other cousin usleep(3) uses an useconds_t argument that is a 32-bit unsigned integer and which has a resolution of microseconds.

Seconds, milliseconds, microseconds, nanoseconds. So many different granularities of time. Several different types. It's too much for me.

Diminuto has two types in which time is stored: diminuto_ticks_t and diminuto_sticks_t. Both are 64-bit integers, the only difference being one is unsigned and the other is signed. (I occasionally regret even that.) All time is maintained in a single unit of measure: nanoseconds. This unit is generically referred to as a Diminuto tick.

diminuto_time_clock returns the time of day (a.k.a. wall clock time) in Diminuto ticks. There is another function in the time module to convert the time of day from a ticks since the POSIX epoch into year, month, day, hour, minute, second, and fraction of a second in - you guessed it - Diminuto ticks. The POSIX clock_gettime(2) API is used with the CLOCK_REALTIME argument.

diminuto_time_elapsed returns the elapsed time in Diminuto ticks suitable for measuring duration. It also uses the clock_gettime(2) API but with the CLOCK_MONOTONIC_RAW argument. This makes the returned values unaffected by changes in the system clock wrought by system administrators, by the Network Time Protocol daemon, by Daylight Saving Time, or by the addition of leap seconds. However, this means the value returned by the function is only really useful in comparison or in arithmetic operations with prior or successive values returned by the same function.

diminuto_timer_periodic invokes a periodic interval timer, and diminuto_timer_oneshot invokes an interval timer that fires exactly once, both using Diminuto ticks to specify the interval. Both are implemented using the POSIX timer_create(2) API with the CLOCK_MONOTONIC argument.

diminuto_mux_wait uses pselect(2) for I/O multiplexing, and diminuto_poll_wait does the same but uses ppoll(2), both specifying the timeout in Diminuto ticks. (I believe pselect is now implemented in the Linux kernel using ppoll, but that wasn't true when I first wrote this code and they had very different performance characteristics.)

diminuto_delay delays the caller for the specified number of Diminuto ticks. It is implemented using the POSIX nanosleep(2) API.

Floating Point (updated 2017-07-26)

If you really need to use floating point, you'll probably know it. But using floating point is problematic for lots of reasons. Diminuto avoids using floating point except in a few applications or unit tests where it is useful for reporting final results.

Even though Diminuto uses a single unit of time, that doesn't mean the underlying POSIX or Linux implementation can support all possible values in that unit. So every module in Diminuto that deals with time has an inline function that the application can call to find out what resolution the underlying implementation supports. And every one of those functions returns that value in a single unit of measure: Hertz. Hertz - cycles per second - used because it can be expressed as an integer. It's inverse is the smallest time interval supported by the underlying implementation.

diminuto_frequency returns 1,000,000,000 Hz, the base frequency used by the library. The inverse of this is one billionth of a second or one nanosecond.

diminuto_time_frequency returns 1,000,000,000 Hz, the inverse being the resolution of timespec in seconds.

diminuto_timer_frequency returns 1,000,000,000 Hz, the inverse being the resolution of itimerspec in seconds.

diminuto_delay_frequency returns 1,000,000,000 Hz, the inverse being the resolution of timespec in seconds.

With just some minor integer arithmetic - which can often be optimized out at compile-time - the application can determine what is the smallest number of ticks the underlying implementation can meaningfully support in each module. (Note that this is merely the resolution representable in the POSIX or Linux API; the kernel or even the underlying hardware may support an even coarser resolution.)

Remarks

I first saw the countof technique in a VxWorks header file perhaps twenty years ago while doing embedded real-time development in C++ at Bell Labs. Similarly, a lot of these techniques have been picked up - or learned the hard way - while taking this long strange trip that has been my career. I have also benefitted greatly from having had a lot of mentors, people smarter than I am, who are so kind and generous with their time. Many of these techniques gestated in my C++ library Desperado, which I began putting together in 2005.

I started developing and testing the Diminuto C code for a number of reasons.
  • I found myself solving the same problems over and over again, sometimes for different clients, sometimes even for different projects for the same client. For one reason or another - sometimes good, sometimes not so good - the proprietary closed-source code I developed couldn't be shared between projects. But open-source code could easily be integrated into the code base.
  • I wanted to capture some of the design patterns I had found useful in my closed-source work. Working, unit tested, open-source code, developed in a clean room environment, and owned by my own company, was a useful way to do that.
  • I needed a way to get my head around some of the evolving C, C++, POSIX, and Linux APIs. Since I can only really learn by doing, I had to do, which for me meant writing and testing code.
  • I wanted to make an API that was more consistent, more easily interoperable, and less prone to error than was offered by raw POSIX, Linux, and GNU.
  • In the past I have been a big proponent of using C++ for embedded development. I have written hundreds of thousands of lines of C++ while doing such work over the years. But more recently, I have seen a gradual decline in the use of C++ by my clients, with a trend more towards segregating development into systems code in C and application code in languages like Python or Java. Although I miss C++, I have to agree with the economics that. And I fear that C++ has evolved into a language so complex that it is beyond the ken of developers that my clients can afford.
I am not a "language lawyer". I get paid to develop product that ships and generates revenue. I believe I have been successful in part because I am a very hands-on and evidence based person, but also because I am extremely pragmatic about what works and what doesn't.

This is some stuff that, over the span of many years, has worked.

Update (2017-04-14)

I recently forked Desperadito (which itself is a mashed up fork of both Desperado and Hayloft) into Grandote (https://github.com/coverclock/com-diag-grandote). What's different is Grandote uses the Diminuto C library as its underlying platform abstraction. It requires that you install Diminuto, Lariat (a Google Test helper framework), and Google Test (or Google Mock). That's some effort, but at least for me it wasn't overly burdensome when I built it all this morning on a Ubuntu 16.04 system. The other improvement over Desperadito is that both the old Desperado hand coded unit tests, and the newer Hayloft Goggle Test unit tests, work. Grandote would be my C++ framework going forward, if I were to need my own C++ framework. (And Grandote doesn't preclude using STL or Boost as well.)