Monday, November 08, 2010

Arroyo Goes South for the Winter

Ever since moving to the Cortex-A8-based TI OMAP3530 processor on the BeagleBoard, I would have sworn I would never have a reason to revisit the ARM9-based Atmel AT91 processor on the AT91RM9200-EK evaluation kit. But recent events have proven me wrong. What were those events?

I screwed up.

I'm passing this along as a cautionary tale of working with tool chains in the embedded realm. Those of you who write a lot of C and C++ code for the Linux platform and compile on the same architecture on which your code will run -- for most of you that would be the Intel Pentium architecture -- may not know how good you've got it. When doing embedded development, it's quite common -- in fact typical -- to compile on one processor but run the resulting binaries on a completely different processor.

In my case, I build on a Linux-based Intel Pentium server, but run the resulting binaries (including a Linux kernel) on an ARM9, ARM11, or PowerPC processor, depending on the project. So my GNU C compiler has been compiled and configured to execute on an Intel Pentium processor, which is my host or build architecture, but it generates instructions for, say, an ARM9 processor, which is my target architecture. This is called cross compilation.

Not only does my gcc/g++ compiler generate completely different machine instructions in its .o and a.out files, but it uses a completely different set of GNU header files than it would if I were compiling for my Pentium server, header files that are specific to my target architecture. When I'm compiling very low-level code, the header files may be specific not only to the target architecture, but to the Linux kernel version running on the target.

This is where I tripped up. Back when I originally worked on Diminuto and Arroyo on the AT91RM9200-EK board, I just happened to be using the same Linux kernel version on my build server as I was on my ARM9 target processor. Everything worked just fine, but it worked coincidentally. I recently upgraded my build server to a later version of Linux, and also, in an unrelated effort, upgraded my compiler suite to the latest version of the excellent GNU-based CodeSourcery ARM tool chain that supports the Cortex-A8 instruction set. Just as a test I tried building Diminuto and Arroyo, and they wouldn't build. The first thing I suspected was the new tool chain. So I fell back to the old tool chain I had used to originally build these old projects. And they still wouldn't build.

It didn't take that much sleuthing to realize that the problem was in the kernel header files. The wrong kernel header files. I needed to use the kernel header files for the kernel I had built myself, not the ones that came with the tool chain, nor the ones on my build server.

(The fact that the build worked at all is still kind of remarkable, considering that, even with the same kernel version, my build server had the header files for the Pentium x86 architecture while Arroyo needed the header files for the ARM architecture. If you peruse the source distribution for the Linux kernel you will see that it can be built for many different architectures with many different architecture-dependent header files. Apparently the header files that Arroyo needed weren't that different.)

Fixing this was a bit of an effort. I had to churn not only the Diminuto and Arroyo make-based build systems, but also, as a precaution, the Desperado build system, just to make sure I hadn't messed up my C++ build process.

And of course, that wasn't enough: I had to do a soup-to-nuts Arroyo kernel and root file system build using the revised build process and new CodeSourcery tool chain, then go through a complete installation onto the AT91RM9200-EK board, and go through the update process just to make sure I hadn't screwed something else up. This all occurred at the same time I happened to be replacing my old IBM ThinkPad T30 running Windows XP with an Apple Mac Mini and a 27" Cinema Display as my desktop development environment. When something failed, I had lots of stuff to look at.

As you may guess, I eventually got it all to work. But it wasn't easy. It brings to mind how unexpectedly difficult (not to mention time consuming) embedded development can be, even (or, frankly, especially) when using open source tools and source code.

Below is a picture of the whole mess. The huge Cinema display dominates the picture. The compact Mac Mini sits on a shelf above it. You can see the green AT91RM9200-EK board to the lower right of the photograph. The temporarily abandoned tiny red BeagleBoard is looking a little lonely just below the center of the photograph.

Arroyo Goes South for the Winter

But wait, what's this: a Windows 7 notebook in the bottom center of the photograph? That's right. When you do embedded development, you often live and die by the serial console terminal. And despite having tried several USB-to-serial converters and two different serial console software tools, I have discovered that there is nothing so far in the Mac environment equivalent to my beloved PuTTY utility. I'd be glad to be proven wrong.

Update (2011-01-03)

As I mentioned in the comments, I'm pretty happy with MacWise, a commercial terminal emulator for the Mac. I'm using it with a TrendNet USB-to-Serial adaptor/cable. To get MacWise to work reliably with the TrendNet device I had to download a new driver for the USB-to-Serial chipset from the Apple support web site. But it's been smooth sailing since then.

Tuesday, October 19, 2010

Android and Contraption

Back in July I wrote briefly about Cascada, my little project to move my software developed as part of my Diminuto project from the AT91RM9200-EK board (which uses the Atmel AT91 ARM9 processor) to the smaller, much cheaper, and more powerful BeagleBoard (which uses the TI OMAP 3530 processor with its ARM Cortex-A8 core) with the TinCanTools Zippy2 expansion board. Cascada turned out to be so straightforward, I was a little disappointed. I didn't really feel like I'd gotten the learning experience I'd expected.

So I evolved the project into Contraption, where I run the "Rowboat" port of Android, the Google mobile device environment, on the BeagleBoard. This would have been a no-brainer too had I not insisted on getting the Zippy2 Ethernet port working. That required that I reverse engineer the configurations of the Angstrom U-Boot bootloader and Linux kernel for the BeagleBoard that support the Zippy2, use what I learned to build the Linux kernel that comes with Android (which is slightly different), substitute in the U-Boot that initializes the Zippy2, and get the whole mess running on my hardware.

A few evenings of head scratching, and the next thing you know I'm displaying my blog on the Android browser.


Scattered around my tiny lab bench is (roughly left to right) a WiFi to Ethernet bridge (which will play a role in a later article), the tiny BeagleBoard, an Ethernet switch, a BDI3000 JTAG debugger (mostly unnecessary), a fan (also mostly unnecessary except to keep me cool), and a powered USB hub. The keyboard and mouse on the left are connected to an ancient ThinkPad I use as a window and TFTP server, and the smaller keyboard and mouse on the right are connected to the BeagleBoard. Below is the Android home screen running on the BeagleBoard.

Google Android ("Rowboat") on BeagleBoard

Why Android? Lots of reasons, but the most compelling was that I had come across a technical report written by some students at the University of Applied Sciences Northwestern Switzerland about using Android in industrial automation. I'm all about frameworks for real-time software design, and this was right up my alley. It will also be an excuse to brush up my Java skills and learn how to develop in the Android environment.

Somewhat ironically, since I've spent the past nearly fifteen years working for one Bell Labs spin-off or another, my interest in Android is not really related to mobile phones. I have a whole other project in mind. But that will have to wait for a future article.

Wednesday, October 06, 2010

Design Patterns and the Hylaean Theoric World

Neal Stephenson's novel Anathem is a 980-page discourse on philosophy, mathematics, and science disguised as a damn good science fiction novel. The philosophers, mathematicians, and scientists in his novel have the concept of the Hylaean Theoric World (HTW), a realm in which the immutable truths of their universe and, as it turns out, ours, can be found. This is their version of our Platonic Idealism: the fact that the number three is prime is true no matter where you live, in what era you live, and what your laws of nature may be. The concept exists independent of everything else, as if it inhabited its own reality.

One of the more controversial ideas of HTW (and of Platonic Idealism) is that that independent reality is a real place. But while it's hard to argue with the fact that three is prime, it's a little tougher to convince people that there is an alternate universe where the fact that three is prime originates. I can appreciate, however, that point of view.

That's why for decades, scientists and science fiction writers pondering how to communicate with intellects on another planet have assumed that the starting point for any such dialogue would be mathematics. You would establish a symbolism and nomenclature to represent stuff that you knew your opposite number on the other planet would understand, regardless of the fact that they might have tentacles, breathe methane, and use smell as their principal sense.

Mathematics would be the root of a common inter-species language, but perhaps not science, at least not all science. There is some evidence, still quite controversial, that the laws of nature might not be exactly the same throughout the universe. So you run the risk of the tentacled folk pondering your latest missive and saying "These crazy aliens, just when we thought we understood them, they go on about this Fine Structure Constant as if it were constant! We have clearly misunderstood something." The good news, though, is that anyone with whom we are likely to have a dialogue approaching anything like real-time would have to be close enough to us that this isn't going to be an issue.

It's easy to see how mathematics would live firmly in the Hylaean Theoric World, even if all of chemistry and physics might not. What else?

How about design patterns? One of the struggles authors of design pattern books have is how to represent design patterns. You have to appreciate the problem here, because while design patterns are not tied to a specific language, they are often much more easily implemented in one language than another. (I probably did something like dependency injection even back when I used FORTRAN IV, but I have no idea how I might have done it. I remember frequently resorting to writing subroutines for FORTRAN in assembler back in those days, because assembler had pointers.)

The classic Gang of Four design pattern book uses a mixture of text, diagrams, and source code in Smalltalk and C++. Some design patterns books have been published in separate C++ and Java editions. Donald Knuth invented his own assembler-like language just to tackle the same issue in his classic The Art of Computer Programming series of books on algorithms.

But the design patterns themselves really are language-agnostic. So I claim they are examples of Platonic Ideals, and live in the HTW. If we could get past the representation issue, design patterns would be another way to communicate with our tentacled colleagues.

Many design patterns are not obvious, and understanding them and successfully applying them carries enormous potential consequences. Sure, the tentacled ones would nod each of their heads and appreciate our explanation of their well known adapter, facade, and proxy patterns. But when we beam more complicated ideas like map-reduce at them, their tentacles might start shaking as they understood the implications of a powerful distributed computing architecture. We could, perhaps inadvertently, change their world. Or vice versa: they could change ours by transmitting a meme so powerful that it captured our intellects and imaginations.

The other day I listened to some of my colleagues try to define what they meant by "elegant code". Here is how I have defined it in the past: elegant code is that which reveals a higher truth. But it wasn't until I was reading Anathem that I realized that what I was really saying was elegant code is that which reveals a Platonic Ideal pattern that was previously unknown to me, one that offers me a glimpse into the reality that is the Hylaean Theoric World. One that I am totally going to appropriate and use myself.

Because when you are an engineer, Rule One is: steal all the good ideas.

Saturday, September 25, 2010

I LOLed!

Last evening I was driving home from work on southbound I-25 just north of Denver when I passed a car with the following Colorado vanity license plate:


If you're out there, YMMD.

Saturday, September 18, 2010

The Fine Structure Constant Isn't

Ira Flatow's Science Friday on NPR's Talk of the Nation yesterday had a mind blowing story titled "Strange Physics: Dark Flow, Fine Structure Constant". Part of the story was how the fine structure constant (conventionally denoted as α or alpha) appears to vary over space but not over time.

The fine structure constant is a dimensionless number derived from some fundamental physical constants like the speed of light in a vacuum. Its value is roughly 1/137. It was established in the first few moments of the big bang as matter solidified and reality as we know it more or less settled out. If the fine structure constant differed by just a few percent in either direction, the laws of nature would differ to the extent that the chemical reactions that life as we know it depends upon could not take place. The anthropic principle is a philosophical argument that says these fundamental constants are what they are because if they weren't we couldn't be here to measure them.

We can in fact measure these fundamental physical constants, and so compute alpha, not just in any well equipped laboratory, but also elsewhere at any number of observatories around the world by pointing telescopes at distant stars and using very accurate spectroscopes. We know about how distant these stars are from the amount the light from each is red-shifted.

But when we do this, we get different numbers for alpha, and the numbers we get do not depend on how far away the stars are (which would imply that the constant changes with time, which would be less troublesome) but in what direction we look (implying that it changes with location). This is mind boggling, because either the fundamental laws of nature vary from place to place, or something else is affecting our instruments in a strange but very consistent manner.

Mrs. Overclock (a.k.a. Dr. Overclock, Medicine Woman) and I were discussing this over dinner when my science fictional lobe came up with these ideas:

1. If the constant varies across space, that could explain the Fermi Paradox: there isn't life (as we define it anyway, which determines how we try to detect it) elsewhere because there really is something very special about our region of space.

2. An ultra-super-hella-weapon that altered the fine structure constant in a region of space would be... useful... for eliminating troublesome civilizations. The term "weapon of mass destruction" seems laughably modest to describe such a device.

This is the kind of stuff I think about on a daily basis.

Update 2010-09-18

The science/fiction blog io9 had a great article on this very topic that I missed while Mrs. Overclock and I were in Australia (at the World Science Fiction Convention in Melbourne) and New Zealand (touring The Lord of the Rings filming locations on the south island near Queenstown and Wanaka). Check it out!

Update 2010-09-23

Dave Goldberg, Ph.D. physicist, wrote a detailed and very readable article on this topic, also over at io9. Check it out squared!

Saturday, August 14, 2010

Memory Models and Compiler Optimization

Those software developers who have the misfortune of being forced to hang around with me know one of my favorite topics is how memory models presented by optimizing compilers and modern microprocessors have broken the contract between most programming languages and the hardware on which they run.

This has prompted the C++ standards folks to propose augmenting the volatile keyword with a new atomic keyword in the 2010 draft of the ISO/IEC 14882 standard. This new keyword attempts to provide a synchronization mechanism similar to what Java currently does with its own volatile keyword, ensuring that memory is read and written atomically using whatever means is at hand on the underlying hardware platform. (The C++ volatile keyword merely means that memory can change unexpectedly, behind the executing program's back as it were, for example as in the case of I/O devices that have control and data registers mapped into memory.)

Sarita V. Adve (University of Illinois at Urbana-Champaign) and Hans-J. Boehm (Hewlett-Packard) are two names that crop up a lot as authors of much of the material I read on this topic. I just recently read their article in Communications of the ACM,

S. Adve, H. Boehm, "Memory Models: A Case for Rethinking Parallel Languages and Hardware", CACM, 53.8, August 2010, pp. 91-101

and I found it to be an excellent yet readable survey on the topic. One of the more amusing examples they give about the issue of compiler optimization involves the following C code snippet. Given external integer variables X and Y and two local integer variables R1 and R2:

R1 = X;
R2 = X;
Y = (R1 == R2);

This may make no sense to developers who are not in the habit of dealing with concurrency such as multithreaded code or device drivers, but to the rest of us it seems pretty clear: Y is set to false if and only if the value of X changed in between the two times it was accessed. But if X isn't declared volatile, the compiler doesn't think it makes any sense either. The developer has taken advantage of a data race condition of which the compiler is completely unaware. Here are the optimizations that are likely to be performed.

First, the compiler recognizes that there is no point in accessing X twice, since it already has the value in R1.

R1 = X;
R2 = R1;
Y = (R1 == R2);

Then it realizes that there is no point in checking for equality, since it knows that R1 and R2 have the same value:

R1 = X;
R2 = R1;
Y = 1;

Then it sees that the value of Y does not depend on the values of any of the other variables, so it hides the memory subsystem write latency by moving the write operation earlier.

Y = 1;
R1 = X;
R2 = R1;

The resulting code bears no resemblance functionally to what the programmer wrote, yet the compiler optimizations are perfectly legal, and in fact are likely to be performed by the C compiler you are using today. (I really like this example, because I have written C code almost exactly like the original snippet, where X is a volatile hardware time base register.)

Adve and Boehm assert, as many of their colleagues do, that relying on such a data race, instead of using real synchronization mechanisms such as memory barriers, atomic memory operations, mutexen, semaphores and the like, even for read-only variables, leads to unreliable code. Even if it works today, it might not work after your next compiler upgrade. And as others have pointed out, depending on the C and C++ volatile keyword to be implemented correctly by your compiler vendor is unfortunately also an iffy proposition.

Adve and Boehm point out some of the widespread practices in software development that make solving the memory model problem difficult:
  • global address space;
  • imperative languages;
  • object oriented design; and
  • pointer based data structures.
Holy cats, is that all? Yeah, I think that's enough. Reminds me of my graduate school days during the furor over the Japanese "Fifth Generation" project in which Japan was going to use massively parallel data flow architectures and artificial intelligence to make all traditional computing systems obsolete. The research group I was in was looking at operating system and programming language architectures for such highly concurrent systems. We played with designs incorporating functional languages, single assignment, lazy evaluation, and other far out ideas. As far out as those ideas were, though, many of them turned out to be used by the designers of today's optimizing compilers; it's all under the hood, used in intermediate forms that your source code takes along the path from C and C++ to machine code.

To paraphrase computer scientist Tony Hoare, I don't know what the super concurrent programming language for future massively many core processors will be, but I have an inkling it might not be called C or C++.

Saturday, July 17, 2010

Cascada and the BeagleBoard

Readers who are interested in embedded development may recall Diminuto and Arroyo, my projects to develop a simple Linux and GNU-based teaching environment around the commercially available ARM-based AT91RM9200-EK evaluation board. That evaluation board uses the AT91RM9200 System on a Chip (SoC) based on the 32-bit ARM9 processor core. This SoC and other AT91 variants have been very popular over the years in embedded systems in my experience. My only complaint about the AT91RM9200-EK was its price tag: around US$1200, which made it a little pricey for outfitting a lab with several workstations for teaching a class.

Some time ago, Texas Instruments introduced a new SoC family, the OMAP (Open Multimedia Application Platform). The BeagleBoard is a single board computer that uses the OMAP 3530 SoC, which is based on the 32-bit ARM Cortex-A8 processor core. The OMAP 3530 is an asymmetric multi-core system which includes a digital signal processor (DSP) core and a graphics processor, making it very popular in hand-held devices like mobile phones. The BeagleBoard retails for around US$150, and is trivially capable of running Linux, not to mention Linux-based systems like Google's Android. It includes peripherals such as USB ports, an SD card slot, video out, and audio in and out. All of this is in a form factor that will fit in your shirt pocket.

Several vendors have produced daughter cards for the BeagleBoard that provide additional peripherals. Digi-Key along with chip vendor Micrel produces one such daughter card, the Zippy2, which adds an RJ45 Ethernet jack, a second serial port, a real-time clock, a second SD card slot, and other useful stuff, all in a form factor identical to that of the BeagleBoard. The Zippy2 retails for about US$100. Add a few tens of dollars of cables and other miscellanea and you have yourself a real system.

BeagleBoard C4 and Zippy2 Expansion Card

I just completed soldering the expansion connector onto my very own BeagleBoard, joining the BeagleBoard to my Zippy2, installing the pre-built U-Boot and Linux images onto an SD card using my Linux server, and booting it all up. Entire process: about two hours. Seriously. That includes opening the boxes and setting up my soldering iron.

I expect to spend the next few months playing with this new project, which I have dubbed Cascada. This will include porting some of my Diminuto, Arroyo, and Desperado software over to it. You may expect to be subjected to an article or two on this effort.

Never has learning systems programming and embedded development been so easy or so cheap.

Sunday, July 11, 2010

Cognitive Dissonance

I reckon that spending fifteen years attending and working at a university qualifies me as a former academic, even though my time in that role ended two decades ago. I went on to work at a national laboratory and supercomputer center that was funded in part by the U.S. National Science Foundation (NSF), and then eventually to Bell Laboratories. I'm living proof that you can have a pretty good career just making it up as you go along, and I got a lot of insight into the full spectrum of the research and development work experience, from academia to federally funded big science to corporate R&D.

So it was with real interest that I read Beryl Lieff Benderly's article "The Real Science Gap" in Miller-McCune magazine, published by the non-profit Miller McCune Center for Research, Media, and Public Policy. Benderly observes something I've been wondering about for some time: if we're so short of scientists and engineers, as our politicians, high-technology pundits, and industry leaders would have us believe, how come there are so many people with advanced science and engineering degrees who are under-employed or unemployed?

Benderly describes how our current system of research in the United States evolved: following World War II, the U.S. government established a system of institutes and foundations that funded universities, the universities funded researchers, post-doctorate fellows, and graduate students, and the results of their research sometimes made it into products and technologies manufactured and applied by corporate America. But over time, the entry-level jobs for newly minted Ph.D.s dried up as funding focused on those few research organizations that have been very successful, leaving new Ph.D.s, with just a few exceptions, out of work or at best working at relatively low-wage jobs as post-doc research assistants. Benderly's article is well worth reading no matter where you are personally on the R&D employment spectrum.

Coincidentally, right after reading that article, I came across Andy Grove's article "How to Make an American Job" in the July 5/July 11 2010 issue of Bloomberg Businessweek. Grove, you may recall, is an engineer who was employee #3 of Intel and went on to run the company. Grove makes a similar observation about how corporate R&D used to be the driver of the creation of manufacturing jobs in the U.S., but now is mostly stimulating job creation overseas.

His article is also worth reading, as is the rebuttal by Vivek Wadhwa, "Why Andy Grove Is Wrong About Job Growth", the gist of which is: thanks to globalization, things have changed since Intel was founded in the 1970s, and who wants those Foxconn jobs anyway?

The fact is, there are useful nuggets of wisdom in all three articles. But rather than pontificating on them myself, I'd like to encourage you to read all three articles and see what you think.

Tuesday, May 25, 2010

Running with Scissors Around Diminuto and Arroyo

(I apologize in advance for the formatting of this article. If there is a way to present code in a good way using the Blogger editing tool, I haven't found it. Even hand-editing in HTML using the pre HTML tags allows leading whitespace to be removed.)

Two years ago I wrote a series of articles on Diminuto and Arroyo, the results of my pet project to put together a platform on which to teach systems programming and embedded software development on the commercially available ARM-based AT91RM9200-EK evaluation board. It was a bigger effort than I had anticipated, one that included tool chains, kernel images, root file system images (either an initramfs RAM disk for Diminuto or an SD-card-based EXT2 file system for Arroyo), as well as a library of systems programming functions and small useful tools built on top of it.

Most of Diminuto and Arroyo have remained unchanged. But libdiminuto and its collection of tools has been a moving target, as I have added things that I have found useful over the years. Here's some of the new stuff in version 1.3.0.

The library now contains a simple little function that lets you dump memory objects in hexadecimal in network byte order, for example as part of your debugging output. The output of the unit test for the function looks like this.

# ./unittest-dump
beb81b80:          00010203 04050607 08090a0b |    ............|
beb81b90: 0c0d0e0f 10111213 14151617 18191a1b |................|
beb81ba0: 1c1d1e1f 20212223 24252627 28292a2b |.... !"#$%&'()*+|
beb81bb0: 2c2d2e2f 30313233 34353637 38393a3b |,-./0123456789:;|
beb81bc0: 3c3d3e3f 40414243 44454647 48494a4b |<=>?@ABCDEFGHIJK|
beb81bd0: 4c4d4e4f 50515253 54555657 58595a5b |LMNOPQRSTUVWXYZ[|
beb81be0: 5c5d5e5f 60616263 64656667 68696a6b |\]^_`abcdefghijk|
beb81bf0: 6c6d6e6f 70717273 74757677 78797a7b |lmnopqrstuvwxyz{|
beb81c00: 7c7d7e7f 80818283 84858687 88898a8b ||}~.............|
beb81c10: 8c8d8e8f 90919293 94959697 98999a9b |................|
beb81c20: 9c9d9e9f a0a1a2a3 a4a5a6a7 a8a9aaab |................|
beb81c30: acadaeaf b0b1b2b3 b4b5b6b7 b8b9babb |................|
beb81c40: bcbdbebf c0c1c2c3 c4c5c6c7 c8c9cacb |................|
beb81c50: cccdcecf d0d1d2d3 d4d5d6d7 d8d9dadb |................|
beb81c60: dcdddedf e0e1e2e3 e4e5e6e7 e8e9eaeb |................|
beb81c70: ecedeeef f0f1f2f3 f4f5f6f7 f8f9fafb |................|
beb81c80: fcfdfeff                            |....            |

Since the function does all the heavy lifting, writing a dump tool that can dump files and such in the same format from standard input is trivial.

# dump -?
usage: dump [ -t ]
-t          Tee input to stdout, output to stderr
-?          Print menu

When the data you want to look at comes from a serial port or other asynchronous character-oriented device, this isn't quite the tool you want. You want something that can dump data in real-time, as it comes across the communication port, and let you examine both the printable and unprintable characters in the stream. The name phex is short for "print hex" but I pronounce it fex.

# phex -?
usage: phex [ -e ] [ -l BYTES ] [ -n ] [ -t ]
-e          Also escape the characters '"', '\'', and '?'
-l BYTES    Limit line length to BYTES instead of 80
-n          Do not escape newlines
-t          Tee input to stdout, output to stderr
-x          Print normally escaped characters in hex
-?          Print menu

When phex senses that its standard input is a serial port, it places the port in raw mode, so that the Linux terminal driver doesn't try to interpret any of the special characters. phex automatically handles line wrapping as it's dumping so that it doesn't drive your own display terminal crazy. It prints unprintable characters as C-style escape sequences. It prints each character in real-time as it reads it from standard input. It normally prints to standard output, but it can be easily told to print to the unbuffered standard error file descriptor.

# ./unittest-phex | phex
\x1d\x1e\x1f !"#$%&'()*+,-./0123456789:;<=>?@ABC

Dealing with memory-mapped hardware devices is part of the job when you are writing device drivers and doing other systems programming chores. Before you launch into writing thousands of lines of device driver code, it pays to poke around the device to learn how it works. memtool lets you do that. When run as root, it allows you to map an arbitrary physical address block to the virtual address space and manipulate data within it. It doesn't require any special device driver, and runs completely in user-space.

# memtool -?
usage: memtool [ -d ] [ -a ADDRESS ] [ -l BYTES ] [ -[1|2|4|8] ADDRESS ] [ -r | -[s|c|w] NUMBER ] [ -u USECONDS ] [ ... ]
-1 ADDRESS    Use byte at ADDRESS
-2 ADDRESS    Use halfword at ADDRESS
-4 ADDRESS    Use word at ADDRESS
-8 ADDRESS    Use doubleword at ADDRESS
-a ADDRESS    Optionally map region at ADDRESS
-c NUMBER     Clear NUMBER mask at ADDRESS
-d            Enable debug mode
-f            Proceed if the last result was 0
-l BYTES      Optionally map BYTES in length
-r            Read ADDRESS
-s NUMBER     Set NUMBER mask at ADDRESS
-t            Proceed if the last result was !0
-u USECONDS   Sleep for USECONDS microseconds
-?            Print menu

What kind of things can you do with memtool? On the AT91RM9200-EK board, there is a red LED controlled by the Programmable I/O (PIO) B controller. Here is a little bash script that uses memtool to flash the red LED on and off by directly manipulating the PIOB controller registers.

PIOB=0xfffff600 # PIOB controller base
SIZE=0x200      # PIOC - PIOB
PER=0xfffff600  # PIO Enable Register
OER=0xfffff610  # Output Enable Register
SODR=0xfffff630 # Set Output Data Register
CODR=0xfffff634 # Clear Output Data Register

GREEN=0x1 # PIOB0: disk activity
YELLOW=0x2 # PIOB1: cpu idle
RED=0x4 # PIOB2: ours

USEC=1000000 # 1 second

TRIES=30 # 30 tries (about a minute)

memtool -a ${PIOB} -l ${SIZE} \
    -4 ${PER} -s ${RED} \
    -4 ${OER} -s ${RED}

while [ ${TRIES} -gt 0 ]; do
    memtool -a ${PIOB} -l ${SIZE} \
        -4 ${CODR} -s ${RED} -u ${USEC} \
        -4 ${SODR} -s ${RED} -u ${USEC}
    TRIES=`expr ${TRIES} - 1`
done

I've gotten email from a couple of folks who have used Diminuto or Arroyo. (Believe me, no one could have been more surprised than me!) One of them mentioned that it would be great to have some examples of simple device drivers. One of the more common memory-mapped devices that I encounter is the Field Programmable Gate Array or FPGA. FPGAs are programmable logic devices that may implement functions as simple as a few logic gates or as complex as a full-blown processor core. They are frequently used as a sort of hardware glue to connect devices together on a board, or to implement complex application-specific logic. The logic that they implement frequently exposes control, status, and data registers on the processor memory bus. It occurred to me that it would be easy to write a generic memory mapped device driver that would work with all sorts of devices, including FPGAs.

The mmdriver device driver is exactly that. It is a loadable kernel module that provides managed, non-root access to the underlying memory-mapped device, whatever it is. The physical memory space that it controls can be specified at compile time through the use of compiler flags that define C preprocessor symbols. But it can also be dynamically configured at install time through the use of module parameters. If no configuration is provided, mmdriver maps the registers of the very same PIOB controller in the memtool example above. (If you think you know where this is going, you're probably right.)

# insmod diminuto_mmdriver.ko
# lsmod
Module                  Size  Used by    Not tainted
diminuto_mmdriver       6956  0
# mknod /dev/mmdriver c 240 0
# ./unittest-mmdriver
# rmmod diminuto_mmdriver

On top of mmdriver is mmdrivertool, a user-space tool very much like memtool but which uses ioctl commands to mmdriver, running in kernel-space, to manipulate registers in the managed device.

# mmdrivertool -?
usage: mmdrivertool [ -d ] [ -U DEVICE ] [ -[1|2|4|8] OFFSET ] [ -[s|c|w] NUMBER ] [ -u USECONDS ] [ ... ]
-1 OFFSET   Use byte at OFFSET
-2 OFFSET   Use halfword at OFFSET
-4 OFFSET   Use word at OFFSET
-8 OFFSET   Use doubleword at OFFSET
-c NUMBER   Clear NUMBER mask at OFFSET
-d          Enable debug mode
-f          Proceed if the last result was 0
-r          Read OFFSET
-s NUMBER   Set NUMBER mask at OFFSET
-t          Proceed if the last result was !0
-u USECONDS Sleep for USECONDS microseconds
-w NUMBER   Write NUMBER to OFFSET
-U DEVICE   Use DEVICE instead of /dev/mmdriver
-?          Print menu

Here is a script that blinks that very same red LED, this time using mmdrivertool and the mmdriver device driver.

PER=0x00  # PIO Enable Register
OER=0x10  # Output Enable Register
SODR=0x30 # Set Output Data Register
CODR=0x34 # Clear Output Data Register

GREEN=0x1 # PIOB0: disk activity
YELLOW=0x2 # PIOB1: cpu idle
RED=0x4 # PIOB2: ours

USEC=1000000 # 1 second

TRIES=30 # 30 tries (about a minute)

mmdrivertool \
    -4 ${PER} -s ${RED} \
    -4 ${OER} -s ${RED}

while [ ${TRIES} -gt 0 ]; do
    mmdrivertool \
        -4 ${CODR} -s ${RED} -u ${USEC} \
        -4 ${SODR} -s ${RED} -u ${USEC}
    TRIES=`expr ${TRIES} - 1`
done

With both memtool and mmdrivertool, here's the big disclaimer: when you are accessing physical hardware using these tools, you are handling a loaded firearm with the safeties off. Tools like this are difficult to test. Difficult enough that I haven't found a way to completely unit test all the code paths in them. Once, while poking around for a read/write hardware register that I could use to unit test both tools, I got my evaluation board so sideways that I had to power cycle it to get it back, only to discover that I had scrogged my persistent root file system in the process.

Get used to it. That's what systems programming and embedded software development are all about. You have to break a few eggs (or in my case, smoke a few boards) to learn this skill.

Update 2013-09-23: Later versions of memtool use offsets much as mmdrivertool does. This turns out to work a lot better with how memory mapped registers are typically defined in processor reference manuals and header files, as offsets from a base address.

$ memtool -?
usage: memtool [ -d ] [ -o ] [ -a ADDRESS ] [ -l BYTES ] [ -[1|2|4|8] OFFSET ] [ -m NUMBER ] [ -r | -[s|S|c|C|w] NUMBER ] [ -u USECONDS ] [ -t | -f ] [ ... ]
       -1 OFFSET     Use byte at OFFSET
       -2 OFFSET     Use halfword at OFFSET
       -4 OFFSET     Use word at OFFSET
       -8 OFFSET     Use doubleword at OFFSET
       -C NUMBER     Clear 1<<NUMBER mask at ADDRESS
       -S NUMBER     Set 1<<NUMBER mask at ADDRESS
       -a ADDRESS    Map region at ADDRESS
       -c NUMBER     Clear NUMBER mask at ADDRESS
       -d            Enable debug mode
       -f            Proceed if the last result was 0
       -l BYTES      Map length BYTES at ADDRESS
       -m NUMBER     Mask at ADDRESS with ~NUMBER prior to subsequent sets
       -o            Enable core dumps
       -r            Read ADDRESS
       -s NUMBER     Set NUMBER mask at ADDRESS
       -t            Proceed if the last result was !0
       -u USECONDS   Sleep for USECONDS microseconds
       -w NUMBER     Write NUMBER to ADDRESS
       -?            Print menu

Here's the script shown above ported to the new version of the tool. Warning: I haven't actually tested this script, but I have used the new memtool at length.

PIOB=0xfffff600 # PIOB
SIZE=0x200 # PIOC - PIOB
PER=0x00  # PIO Enable Register
OER=0x10  # Output Enable Register
SODR=0x30 # Set Output Data Register
CODR=0x34 # Clear Output Data Register

GREEN=0x1 # PIOB0: disk activity
YELLOW=0x2 # PIOB1: cpu idle
RED=0x4 # PIOB2: ours

USEC=1000000 # 1 second

TRIES=30 # 30 tries (about a minute)

memtool -a ${PIOB} -l ${SIZE} \
    -4 ${PER} -s ${RED} \
    -4 ${OER} -s ${RED}

while [ ${TRIES} -gt 0 ]; do
    memtool -a ${PIOB} -l ${SIZE} \
        -4 ${CODR} -s ${RED} -u ${USEC} \
        -4 ${SODR} -s ${RED} -u ${USEC}
    TRIES=`expr ${TRIES} - 1`
done

Saturday, May 08, 2010

The Real-Life Consequences of Complexity

Back in March, John Paul MacDuffie, management professor at the Wharton School at University of Pennsylvania, interviewed Takahiro Fujimoto, an economics professor at the University of Tokyo and a long-time observer of Toyota, about the recent issues Toyota has been facing regarding the recalls of its automobiles.

Toyota is the largest automobile manufacturer in the world and has enjoyed an enviable reputation for quality. I think it's fair to say that the issues it is facing surprised everyone. The edited transcript of the short interview is well worth reading.

Fujimoto places the blame squarely in the management and engineering design realm. He makes it very clear that he doesn't believe the issues are in manufacturing. There is much discussion about how the designers at Toyota are challenged to deal with the ever increasing levels of technical complexity which stem not just from the technologies that they use, but from the demands of their customers and the regulatory requirements of the international markets into which they sell. Fujimoto also implies that there was a significant amount of hubris involved in Toyota's over-confidence in their ability to deal with the complexity.

Although I don't design automobiles, or anything that (as far as I know anyway) goes into an automobile, I do routinely design and implement real-time systems on which people's lives depend. It always struck me as counter-intuitive that technical solutions often start out more complex than they need to be, and only become simpler over time. It is contrary to what you might expect, but in fact it is the simpler solution that takes more time to develop, because simplicity only comes with a more complete understanding of the problem by the designer. The overly complex solution is often the first one implemented. As schedules are compressed and budgets tightened, it is frequently the only one implemented.

I see this a lot. It has an impact not only on the consumer of the product, but on the manufacturer's ability to maintain the product over time. Software developers reading this have already realized that this is another manifestation of the need to continually refactor software during development and maintenance. But my observation is that this can still be a hard sell to project managers who, concerned with their schedules and budgets inside their silo or cost center or whatever the faddish term is these days, ask

"And if we do this refactoring, what will the end user see that is different?"

and are told by the technical lead

"Well, if we do everything absolutely perfectly, nothing."

That is not a good way to get work funded.

I wouldn't be surprised if a similar conversation occurred inside Toyota some time ago.

Friday, May 07, 2010

A Matter of Scale: The CAP Theorem and Memory Models

In his keynote for the 2000 ACM PODC conference, Eric A. Brewer made a remarkable assertion. He said that of three desirable qualities of distributed systems, consistency, availability, and partition tolerance, you can have at best two out of three. This came to be known as Brewer's Conjecture.

You might be inclined to think Brewer knew what he was talking about. He was not only a professor of computer science at U. C. Berkeley, but was the co-founder of Inktomi, a company that produced the HotBot Internet search engine. For a time, HotBot was the most popular Internet search engine, replacing Altavista, until its supremacy was upset by Google and Inktomi was acquired by Yahoo!

Even more remarkably, a couple of years later Seth Gilbert and Nancy Lynch at MIT formally proved Brewer's Conjecture for a couple of abstract (but highly applicable) models, leading to the CAP Theorem.

The CAP Theorem has been getting more press lately, thanks in part to Werner Vogels, the CTO of Amazon.com, writing about how that e-commerce giant gets around the fundamental architectural implications of the theorem for large distributed systems. So it's not a big surprise that I would have come across the CAP Theorem in my obsessively wide reading.

But although I've done my time in web services (Fun Facts to Know and Tell: Java Business Integration), high performance computing (Post Modern Deck Construction), and distributed systems (Traffic Management and System Scalability), lately I've been agonizing over the machine code produced by compilers and the bare metal on which it runs (Cache Coherency and Memory Models). So I was standing in the shower the other morning, listening to that little voice in my head, trying to figure out why I was so focused on the implications of the CAP Theorem when my current interests leaned more towards cache coherency or the lack of it.

And it finally occurred to me: they're the same problem. It is simply a matter of scale.

Here's the gist of the CAP Theorem as I understand it.

Consistency means serialization. To have strong consistency, transactions run across distributed servers on shared data must look as if they had been serialized and run one at a time as atomic transactions, whether that's the case or not. This is what the database and transaction people are talking about when they cite the ACID properties: Atomicity, Consistency, Isolation, Durability.

Availability means reliability. Every request gets a timely response, and no client is left hanging. It's this requirement for high availability that typically leads to distributed systems in the first place, since HA systems achieve it by eliminating single points of failure. What constitutes failure is relative in this context. Julian Browne has remarked that an extra tenth of a second in response time costs one percent in sales.

Partition-tolerance means the system deals with the occasional temporary network outage. This is a reality even on a LAN inside a centralized data center, but obviously more so in a geographically distributed system depending on a WAN. Failure here is relative depending on the application. Network partitioning can be seen as an increase in latency. For real-time applications, an increase in latency of milliseconds may constitute a network failure, while applications requiring only an eventual data reconciliation will be more tolerant.

Brewer remarked that there is no such thing as a 100% working system (something the high performance computing folks are grappling with as they try to build exascale systems from commodity parts), and no such thing as 100% fault tolerance. But hugely scalable cloud providers like Google, Amazon.com, and eBay achieve as close to 100% as they can by throwing a lot of parallel (to handle the workload) and redundant (to achieve high availability) hardware at the problem. Geographic location is also a single point of failure, not just from natural disasters but from availability of infrastructure like bandwidth, latency, and power. So such systems tend to be geographically distributed as well. This makes them highly dependent on network connectivity to communicate with one another.

The theorem states that at most two of the three, (strong) consistency, (high) availability, and (network) partition-tolerance, can be achieved. Each distributed system designer has to make their own decision as to where their application lies in the CAP space, and what compromises they are going to make, based on their business needs. A distributed system may provide strong consistency and high availability but can only do so as long as the servers can talk to one another. Or it may tolerate network outages but become completely unavailable while waiting for the network to heal. Or it might relax its need for strong consistency while being highly available even in the face of temporary network outages.

In their paper on the CAP Theorem, Gilbert and Lynch use two different models to construct their proof. The asynchronous model has no temporal dimension. Agents in the asynchronous model are simply state machines that are only concerned with message order and not with when messages arrive. The partially synchronous model has a sense of time duration. Agents in the synchronous model can take action when an expected message doesn't arrive within a time out interval. In the synchronous model, they show that it is possible to achieve a useful compromise between consistency and availability in the face of network partitioning.

While both models are applicable to real-world real-time systems that I have helped develop over the past few decades, it is the synchronous model that has been leveraged by the big e-commerce players. In particular, Vogels has written about how Amazon.com uses weak or eventual consistency as its own necessary sacrifice to the CAP Theorem.

So, finally, here's the thing: all this exists, writ small, in the world of multiple processors and cores, cache coherency protocols, and shared memory. The individual cores constitute the distributed agents. Their individual memory caches contain the copies of shared data. The interconnects between the cores form the network. The cache coherency protocols correspond to the transaction protocols. Threads in a multi-threaded application reading and writing shared data in main memory create the workload. The same issues of consistency, availability, and partition-tolerance apply. And the same trade-offs are made, whether you realize it or not.

Don't think the interconnect network in your multi-core system is ever partitioned? Sure it is. Network connectivity failures are never permanent, unless you are unplugging the power cord from your server and tossing the whole box into the dumpster. It's all just a question of how much latency your system is willing to tolerate. A distributed system of the likes of Google's may tolerate seconds, minutes, or even hours, relying on redundancy to carry the load until the network can heal. The kinds of message passing systems I've helped develop over the years might have a latency threshold measured in tens or hundreds of milliseconds. A microcontroller talking to an FPGA might tolerate microseconds. A microprocessor execution unit waiting for a word from cache might measure its latency threshold in fractions of a nanosecond before the execution pipeline stalls.

It's all the same model. Just the decimal point moves.

Ulrich Drepper, of Red Hat, authored a massive but indispensable tome on how memory actually works in modern microprocessors. In it he wrote of the MESI (Modified, Exclusive, Shared, Invalid) cache coherency protocol:

If multiple threads modify the same memory location concurrently, processors do not guarantee any specific result. This is a deliberate decision made to avoid costs which are unnecessary in 99.999% of all cases. For instance, if a memory location is in the ‘S’ state and two threads concurrently have to increment its value, the execution pipeline does not have to wait for the cache line to be available in the ‘E’ state before reading the old value from the cache to perform the addition. Instead it reads the value currently in the cache and, once the cache line is available in state ‘E’, the new value is written back. The result is not as expected if the two cache reads in the two threads happen simultaneously; one addition will be lost.

This is exactly the kind of compromise the distributed systems folk are talking about when they discuss the implications of the CAP Theorem. In fact, while reading Gilbert and Lynch's paper, I was struck by how much the model they were describing in their proof corresponded to the MESI protocol, and how closely their synchronous model described the kind of real-time systems, distributed and otherwise, with which I am familiar. In their quest for performance, microprocessor designers have left behind ACID for what the distributed systems folk refer to as BASE: Basically Available, Soft-state, Eventual consistency. And they've done it for all the same reasons.

Why is this correspondence important? Because I think the CAP Theorem is just as valuable a tool for designing and reasoning about multi-processor and multi-core systems as it is for globe-girdling distributed systems.

I was in that shower a long time. Mrs. Overclock (etc.) probably wondered what was going on.

Update (2010-05-08)

When I had my ah-ha moment in the shower, I knew it was obvious enough that someone must have thought of it before. But my own research, online and otherwise, didn't turn anything up. I had read Vogels' article in ACM Queue, but this morning I went back and read his original blog article from which that was taken. Sure enough, two years ago, in the comments section, Manish Vachharaiani made the same observation about the similarity between the CAP Theorem and cache coherency.

And in those same comments, Roy Friedman cites the first formal treatment of the CAP issue, Attiya and Welch, which I've added to my sources below.


Hagit Attiya, Jennifer L. Welch, "Sequential Consistency versus Linearizability", ACM Transactions on Computing Systems, 12(2):91-122, 1994

Eric A. Brewer, "Towards Robust Distributed Systems", ACM Symposium on Principles of Distributed Computing, July 2000

Julian Browne, "Brewer's CAP Theorem", January 2009

Ulrich Drepper, What Every Programmer Should Know About Memory, Red Hat Inc., November 2007

Armando Fox, Eric A. Brewer, "Harvest, Yield, and Scalable Tolerant Systems", Proceedings of the Seventh Workshop on Hot Topics in Operating Systems, IEEE Computer Society, 1999

J. L. Sloan, "Implications of Memory Models (or Lack of Them) for Software Developers", Digital Aggregates Corporation, April 2010

Werner Vogels, "Eventually Consistent", ACM Queue, October 2008

Saturday, March 13, 2010

More On Outsourcing for Small Businesses

In February 2007 I wrote Outsourcing Technical Services for Small Businesses. That blog article described how I set up my one-man company, Digital Aggregates Corporation. Since then, I've helped several others do exactly the same thing, and one of the first things I did was to point them towards that article as a good starting point. I'm pleased to say that as a result, I now work daily with others who are self-employed through their own corporations or LLCs.

Along the way I've been updating that article, which I've renamed with the slightly broader title Outsourcing for Small Businesses, although the URL remains the same. If you're still pondering whether to take the leap, you might want to re-read the article. The updated material is explicitly marked.