Saturday, November 26, 2011

Making Both Archives and Shared Objects

It's not uncommon to want to support both static linking, with a library archive, and dynamic linking, with a shared object. I almost always choose the latter, but small embedded applications are frequently statically linked. This can actually reduce the memory footprint of the entire system when not many object files are shared. Or sometimes it's necessary for esoteric reasons of shared memory or what not. Google Test, my favorite C++ unit testing framework, actually recommends static linking against its libgtest.a archive.

When supporting both static and dynamic linking, I always generate the library archive first, for example libhayloft.a, and then automate the generation of the shared object, libhayloft.so, from it. Here's a Makefile snippet that does that. It simply extracts the entire archive into a temporary directory, creates a shared object from the extracted object files, then removes the directory. (This assumes the object files in the archive were compiled to be position independent, for example with -fPIC, as shared objects generally require. I've also resolved all the make variable names to make this a little more comprehensible.)

libhayloft.so: libhayloft.a
	HERE="`pwd`"; \
	THERE="`mktemp -d /tmp/hayloft.XXXXXXXXXX`"; \
	( cd $$THERE; ar xv $$HERE/libhayloft.a ); \
	gcc -shared -Wl,-soname,libhayloft.so -o \
		libhayloft.so $$THERE/*.o; \
	rm -rf $$THERE

Thursday, November 24, 2011

Dependency Generation with Subdirectories using gcc

Here I am on Thanksgiving morning working on my latest project, Hayloft. As is not uncommon, Mrs. Overclock (a.k.a. Dr. Overclock, Medicine Woman) has to work today, leaving your fearless leader to finish up the dinner preparations later in the day. So I thought I'd debug an issue that's been driving me to distraction: automatic dependency generation using the GNU gcc/g++ compilers.

In Hayloft, I have translation units (that's what the standards call C and C++ source files) organized into subdirectories, for example hayloft/Logger.cpp and s3/BucketCreate.cpp, but using just one Makefile. This makes it easier to manage the code base, with absolutely no additional effort on my part, since the make command's pattern matching causes a rule like

%.o: %.cpp
	g++ -o $@ -c $<

to do exactly what I want: The % in the target %.o matches the entire file name path, for example s3/BucketCreate, and is propagated into the prerequisite %.cpp. The object file ends up in the same subdirectory. Life is good.

Alas, if I do the usual command to generate dependencies

g++ -MM -MG s3/BucketCreate.cpp

it creates a rule not with the s3/BucketCreate.o target (which is what I want) but with BucketCreate.o. Since I'm running the Makefile from the project's root directory, there will never be a BucketCreate.o nor a requirement for it. If I edit a prerequisite for s3/BucketCreate.cpp (that is, any header file that it includes), s3/BucketCreate.o will never be automatically regenerated.

So I had to write a little rule to go through the source code base, generate the dependencies for each source file individually, and prepend the directory path for that source file onto the make target. Here's what this looks like. (Apologies as usual for any Blogger editor weirdness.)

DEPENDS:=${shell find . -type f \( -name '*.c' \
-o -name '*.cpp' \) -print}

depend:
	cp /dev/null dependencies.mk
	for F in $(DEPENDS); do \
		D=`dirname $$F | sed "s/^\.\///"`; \
		echo -n "$$D/" >> dependencies.mk; \
		$(CXX) $(CPPFLAGS) -MM -MG $$F \
			>> dependencies.mk; \
	done

-include dependencies.mk

It's a bit of a tribute to make and the shell that this is even possible.

Update (2012-03-14)

I think this rule might be a little simpler, having just now discovered the -MT option, although I had to separate the C and C++ files to handle their differing suffixes.

CFILES:=$(shell find . -type f -name '*.c' \
-print)
CXXFILES:=$(shell find . -type f -name '*.cpp' \
-print)

depend:
	cp /dev/null dependencies.mk
	for F in $(CFILES); do \
		D=`dirname $$F`; \
		B=`basename -s .c $$F`; \
		$(CC) $(CPPFLAGS) -MM -MT $$D/$$B.o -MG $$F \
			>> dependencies.mk; \
	done
	for F in $(CXXFILES); do \
		D=`dirname $$F`; \
		B=`basename -s .cpp $$F`; \
		$(CXX) $(CPPFLAGS) -MM -MT $$D/$$B.o -MG $$F \
			>> dependencies.mk; \
	done


Update (2013-11-06)

I upgraded my server to a later version of Ubuntu which doesn't appear to have the -s flag on the basename command. So now I'm back to something more like the prior approach. 

Monday, November 14, 2011

Abstraction in C++ using I/O Functors

In Abstraction in C Using Sources and Sinks, I wrote about how useful I've found it to abstract out I/O interfaces so that I could write software that didn't have to know from where its input was coming or to where its output was going: sockets, standard I/O library FILE pointers, memory buffers, the system log, etc. That C-based project, Concha, was a clean room reimplementation of work I had done for a client back in 2007. What I didn't mention was that it was in turn inspired by work I had done in 2006 for a C++-based project, Desperado. Now that I'm building yet another project, Hayloft, on top of Desperado, I'm reminded of how much I like the Desperado I/O abstraction.

The Desperado I/O abstraction defines two interfaces, Input and Output. These interfaces make use of the ability in C++ to overload the parentheses operator to create functors (a.k.a. function objects): objects that can be manipulated through a function call interface. Looking at the code, you will think you are looking at function calls. What you are really seeing are instance method calls against an object that overloads the parentheses operator.
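
If you haven't run into functors before, here's a tiny example of my own, unrelated to Desperado, of a class that overloads the parentheses operator. Calling the object looks exactly like calling a function.

#include <cstdio>

class Doubler {
public:
    // Overloading the parentheses operator turns instances into functors.
    int operator() (int value) { return value * 2; }
};

int main() {
    Doubler twice;
    // Looks like a function call, but it is really twice.operator()(21).
    std::printf("%d\n", twice(21));
    return 0;
}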

The Input interface defines the following four operations.

int operator() ();

This returns the next character as an integer, or EOF (end of file) to indicate that no more input is available.

int operator() (int ch);

This pushes a single character back into the input stream. (This vastly simplifies the implementation of many common parsing algorithms that require one-token look ahead.) The interface guarantees that at least one character can be pushed back. It need not be the most recent character read. How the push back is actually done is up to the underlying implementation; buffering it inside the object in the derived class implementation is acceptable. If the push back is successful, the character is returned, otherwise EOF is returned.

ssize_t operator() (char * buffer, size_t size);

This returns a line terminated by a newline character (\n) or by end of file. The optional size parameter allows the caller to place a limit on the number of characters returned. The result is guaranteed to be NUL terminated as long as the buffer is at least one byte in length. The actual number of characters input is returned, or EOF if none.

ssize_t operator() (void * buffer, size_t minimum, size_t maximum);

This is typically used for unformatted data. The minimum parameter indicates the minimum number of characters to be returned. The implementation blocks until that many characters are available or EOF is reached. The maximum parameter indicates the maximum number of characters to be returned if it can be done without blocking. Specifying a minimum of zero is a common way to implement non-blocking I/O using polling. Specifying the same value for minimum and maximum simply blocks for a fixed amount of data. Using the value one for minimum results in something similar to the POSIX read system call. The actual number of characters input is returned, or EOF if none.

That's it. No open or close: those are the job of either the caller, or of the implementation's constructor and destructor. The Input base class isn't pure: it actually implements all of these operators, returning EOF for all operations. That makes the base class the equivalent of /dev/null.
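
To make the shape of the interface concrete, here's a minimal sketch of what such an Input base class might look like. This is my own illustration, not the actual Desperado header; the defaults just return EOF, which is what makes the base class the equivalent of /dev/null.

#include <cstdio>      // for EOF
#include <stddef.h>    // for size_t
#include <sys/types.h> // for ssize_t

class Input {
public:
    virtual ~Input() {}
    // Return the next character, or EOF if no more input is available.
    virtual int operator() () { return EOF; }
    // Push a single character back into the input stream.
    virtual int operator() (int ch) { return EOF; }
    // Read a newline-terminated line of no more than size characters.
    virtual ssize_t operator() (char * buffer, size_t size) { return EOF; }
    // Read at least minimum and at most maximum bytes of unformatted data.
    virtual ssize_t operator() (void * buffer, size_t minimum, size_t maximum) { return EOF; }
};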

Since all implementations derive from the Input base class, you can pass a pointer or reference to any implementation to a function expecting an Input pointer or reference and it will read its data from that object without having any idea what the actual underlying data source is. Desperado implements a variety of derived classes, such as DescriptorInput (its constructor takes a file descriptor as an argument), FileInput (a FILE pointer), BufferInput (a read/write memory buffer), DataInput (a read-only memory buffer), and PathInput (a path name in the file system). All of these derived classes implement the full Input interface.
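
Here's a sketch, in the spirit of Desperado's FileInput but not its actual source, of how a derived class might implement that interface on top of a standard I/O FILE pointer; it derives from the Input sketch above.

#include <cstdio>
#include <cstring>

class MyFileInput : public Input {
public:
    explicit MyFileInput(FILE * fp) : file(fp) {}
    // Next character, or EOF when the stream is exhausted.
    virtual int operator() () { return std::fgetc(file); }
    // Push one character back; standard I/O guarantees at least one.
    virtual int operator() (int ch) { return std::ungetc(ch, file); }
    // Read a NUL-terminated line of at most (size - 1) characters.
    virtual ssize_t operator() (char * buffer, size_t size) {
        if (std::fgets(buffer, static_cast<int>(size), file) == NULL) { return EOF; }
        return static_cast<ssize_t>(std::strlen(buffer));
    }
    // Read up to maximum bytes; a real implementation would keep reading
    // until at least minimum bytes arrived or end of file was reached.
    virtual ssize_t operator() (void * buffer, size_t minimum, size_t maximum) {
        size_t actual = std::fread(buffer, 1, maximum, file);
        return (actual > 0) ? static_cast<ssize_t>(actual) : EOF;
    }
private:
    FILE * file;
};

An instance of this can be handed to anything that expects an Input reference, like the Parameter method shown below, which will have no idea it is reading from a FILE pointer.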

The Hayloft project leverages this in its Parameter class. Parameter takes an Input reference or pointer, uses the line input functor, and reads a parameter value into a C++ std::string that is an instance variable named parameter. It has no idea where this value is coming from: a file, a socket, a memory location, what have you. Here's the complete implementation of the method in Parameter that does this.

(I apologize in advance for any violence done to this and other code snippets by the Blogger formatter - which truly sucks at code examples even when editing raw HTML - and for any typos I make when transcribing this from actual working source code.)

void Parameter::source(Input & input, size_t maximum) {
    int ch;
    while (maximum > 0) {
        ch = input();
        if ((ch == EOF) || (ch == '\0') || (ch == '\n')) { break; }
        parameter += ch;
        --maximum;
    }
}

You can see the Input functor called on the fourth line: the input object reference is used just as if it were a function.

The Output interface defines the following five operations.

int operator() (int c);

A single character, passed as an integer, is emitted. The character is returned, or EOF if unsuccessful.

ssize_t operator() (const char * s, size_t size);

A NUL-terminated string is emitted. The optional size parameter places a limit on the number of characters emitted. The actual number of characters emitted is returned, or EOF if none.

ssize_t operator() (const char * format, va_list ap);

A variable length argument list is emitted according to the printf-style format string. The actual number of characters emitted is returned or EOF if none.

ssize_t operator() (const void * buffer, size_t minimum, size_t maximum);

At least a minimum and no more than a maximum number of bytes are emitted. As before, more than the minimum is emitted if it can be done without blocking. The actual number of characters emitted is returned or EOF if none.

int operator() ();

Any data buffered in the underlying implementation are flushed to the output stream. A non-negative number is returned for success, EOF for failure.

Similar to the Input interface, the Output base class is not pure: it implements all of these functors, each of which throws the data away and returns success. The Output base class is also the equivalent of /dev/null.
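
Here's the corresponding sketch of what the Output base class might look like, again my own illustration rather than the actual Desperado declaration; each functor discards its data and reports success.

#include <cstdarg>     // for va_list
#include <cstdio>      // for EOF
#include <stddef.h>    // for size_t
#include <sys/types.h> // for ssize_t

class Output {
public:
    virtual ~Output() {}
    // Emit a single character.
    virtual int operator() (int c) { return c; }
    // Emit a NUL-terminated string of at most size characters.
    virtual ssize_t operator() (const char * s, size_t size) { return size; }
    // Emit a printf-style format string and its variable length argument list.
    virtual ssize_t operator() (const char * format, va_list ap) { return 0; }
    // Emit at least minimum and at most maximum bytes.
    virtual ssize_t operator() (const void * buffer, size_t minimum, size_t maximum) { return maximum; }
    // Flush any buffered data to the underlying output stream.
    virtual int operator() () { return 0; }
};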

As you might expect, Desperado implements a variety of derived classes: DescriptorOutput, FileOutput, BufferOutput, PathOutput, and SyslogOutput (which writes all of its output to the system log).

The Desperado Print class makes effective use of Output functors. Its constructor takes a single Output reference as its only parameter.

Print::Print(Output & output);

Print defines its own functor which takes a variable length argument list. Here is its entire implementation.

ssize_t Print::operator() (const char * format, ...) {
    va_list ap;
    ssize_t rc;
    va_start(ap, format);
    rc = output(format, ap);
    va_end(ap);
    return rc;
}

You can see the output reference, which was saved by the constructor, used just like a function call on the fifth line.

In an application, you can now write code like this, which I do all the time. (Warning: head explosion may be imminent.)

extern Output * myoutputp;
Print printf(*myoutputp);

printf("An error occurred!\n");
printf("errno=%d\n", errno);

This illustrates one of the greatest strengths and greatest weaknesses of C++: if you encountered this while reading code, you would likely assume that the printf was the standard I/O function you know and love. But it isn't at all; it's a Print object. And that Print object has no idea where its output is going. Depending on the actual type of its constructor argument, it could be a socket, a file, or even a memory buffer.

It is possible to have a class that offers both an Input and an Output interface. Hayloft does this for its Packet class. Packet implements an infinite (as long as memory holds out anyway) bi-directional memory buffer. Although it offers specialized methods to prepend and append data, it also exposes both an Input and Output interface so that a Packet can be used as a data sink (by passing its Output interface) or as a data source (through its Input interface). A Packet can be used in this manner as a ring buffer.

This pattern is so common that Desperado defines an InputOutput interface for such implementations. This interface defines two methods that each return a reference to the appropriate interface.

Input & input();

Output & output();

This allows you to do things like

extern Packet * mypacketp;
Print printf(mypacketp->output());

to collect printed output into a Packet.
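
Here's a sketch of what that might look like end to end, using only the interfaces described above (the function name and the use of stderr are just for illustration): formatted output is printed into the Packet through its Output interface, then read back out a character at a time through its Input interface.

#include <cerrno>
#include <cstdio>

extern Packet * mypacketp;

void roundtrip() {
    // Collect formatted output into the Packet via its Output interface.
    Print printf(mypacketp->output());
    printf("errno=%d\n", errno);

    // Read the same bytes back out via the Packet's Input interface.
    Input & input = mypacketp->input();
    int ch;
    while ((ch = input()) != EOF) {
        std::fputc(ch, stderr);
    }
}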

I/O functors illustrate one of those capabilities that makes C++ both extremely powerful and often hard to understand, because it allows one to use C++ not just as a programming language but as a meta-language, effectively creating a new domain specific language with its own operations, one that just happens to look vaguely like C++. It also means that while reading C++ code, especially in large code bases, it can be very difficult to make any assumptions about what is going on without a broad and deep understanding of both C++ and the underlying code.