Tuesday, May 25, 2010

Running with Scissors Around Diminuto and Arroyo

(I apologize in advance for the formatting of this article. If there is a way to present code well using the Blogger editing tool, I haven't found it. Even hand-editing the HTML to use pre tags doesn't prevent leading whitespace from being stripped.)

Two years ago I wrote a series of articles on Diminuto and Arroyo, the results of my pet project to put together a platform on which to teach systems programming and embedded software development using the commercially available ARM-based AT91RM9200-EK evaluation board. It was a bigger effort than I had anticipated, one that included tool chains, kernel images, and root file system images (either an initramfs RAM disk for Diminuto or an SD-card-based EXT2 file system for Arroyo), as well as a library of systems programming functions and small useful tools built on top of it.

Most of Diminuto and Arroyo have remained unchanged. But libdiminuto and its collection of tools have been a moving target, as I have added things that I have found useful over the years. Here's some of the new stuff in version 1.3.0.

The library now contains a simple little function that lets you dump memory objects in hexadecimal in network byte order, for example as part of your debugging output. The output of the unit test for the function looks like this.

# ./unittest-dump
beb81b80:          00010203 04050607 08090a0b |    ............|
beb81b90: 0c0d0e0f 10111213 14151617 18191a1b |................|
beb81ba0: 1c1d1e1f 20212223 24252627 28292a2b |.... !"#$%&'()*+|
beb81bb0: 2c2d2e2f 30313233 34353637 38393a3b |,-./0123456789:;|
beb81bc0: 3c3d3e3f 40414243 44454647 48494a4b |<=>?@ABCDEFGHIJK|
beb81bd0: 4c4d4e4f 50515253 54555657 58595a5b |LMNOPQRSTUVWXYZ[|
beb81be0: 5c5d5e5f 60616263 64656667 68696a6b |\]^_`abcdefghijk|
beb81bf0: 6c6d6e6f 70717273 74757677 78797a7b |lmnopqrstuvwxyz{|
beb81c00: 7c7d7e7f 80818283 84858687 88898a8b ||}~.............|
beb81c10: 8c8d8e8f 90919293 94959697 98999a9b |................|
beb81c20: 9c9d9e9f a0a1a2a3 a4a5a6a7 a8a9aaab |................|
beb81c30: acadaeaf b0b1b2b3 b4b5b6b7 b8b9babb |................|
beb81c40: bcbdbebf c0c1c2c3 c4c5c6c7 c8c9cacb |................|
beb81c50: cccdcecf d0d1d2d3 d4d5d6d7 d8d9dadb |................|
beb81c60: dcdddedf e0e1e2e3 e4e5e6e7 e8e9eaeb |................|
beb81c70: ecedeeef f0f1f2f3 f4f5f6f7 f8f9fafb |................|
beb81c80: fcfdfeff                            |....            |

Since the function does all the heavy lifting, writing a dump tool that can dump files and such in the same format from standard input is trivial.


# dump -?
usage: dump [ -t ]
-t          Tee input to stdout, output to stderr
-?          Print menu
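
Here, in fact, is a minimal sketch of such a filter. I'm assuming an interface along the lines of diminuto_dump(FILE *, const void *, size_t); check diminuto_dump.h in your copy of libdiminuto for the actual signature.

#include <stdio.h>
#include "diminuto_dump.h"

/* Read standard input a buffer at a time and let libdiminuto do all
 * of the formatting. As in the unit test output above, the addresses
 * shown are the virtual addresses of the buffer, not offsets into
 * the input. */
int main(void)
{
    char buffer[4096];
    size_t length;

    while ((length = fread(buffer, 1, sizeof(buffer), stdin)) > 0) {
        diminuto_dump(stdout, buffer, length);
    }

    return 0;
}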

When the data you want to look at comes from a serial port or other asynchronous character-oriented device, this isn't quite the tool you want. You want something that can dump data in real-time, as it comes across the communication port, and let you examine both the printable and unprintable characters in the stream. The name phex is short for "print hex" but I pronounce it fex.

# phex -?
usage: phex [ -e ] [ -l BYTES ] [ -n ] [ -t ] [ -x ]
-e          Also escape the characters '"', '\'', and '?'
-l BYTES    Limit line length to BYTES instead of 80
-n          Do not escape newlines
-t          Tee input to stdout, output to stderr
-x          Print normally escaped characters in hex
-?          Print menu

When phex senses that its standard input is a serial port, it places the port in raw mode, so that the Linux terminal driver doesn't try to interpret any of the special characters. phex automatically handles line wrapping as it's dumping so that it doesn't drive your own display terminal crazy. It prints unprintable characters as C-style escape sequences. It prints each character in real-time as it reads it from standard input. It normally prints to standard output, but it can easily be told to print to the unbuffered standard error file descriptor.
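
Placing a port in raw mode is standard termios fare. Here is a minimal sketch of the sort of setup phex performs when it detects a terminal on standard input; the actual code in libdiminuto may differ in detail.

#include <unistd.h>
#include <termios.h>

/* If the file descriptor refers to a terminal device (like a serial
 * port), put it in raw mode so that the line discipline passes every
 * byte through unmodified: no canonical line editing, no echo, no
 * signal generation, no CR/LF translation. */
int maybe_raw(int fd)
{
    struct termios tios;

    if (!isatty(fd)) {
        return 0; /* not a terminal; nothing to do */
    }

    if (tcgetattr(fd, &tios) < 0) {
        return -1;
    }

    cfmakeraw(&tios);

    return tcsetattr(fd, TCSANOW, &tios);
}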

# ./unittest-phex | phex
\0\x01\x02\x03\x04\x05\x06\a\b\t\n\v\f\r\x0e\x0f\x10
\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c
\x1d\x1e\x1f !"#$%&'()*+,-./0123456789:;<=>?@ABC
DEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstu
vwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88
\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94
\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0
\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac
\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8
\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4
\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0
\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc
\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8
\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5
\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff
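
For example, suppose an instrument is spewing binary data at a serial port. (The device name here is just an example; substitute your own.)

# phex -t < /dev/ttyS0 > captured.dat

The -t flag tees the raw byte stream to standard output, here redirected into a file, while the escaped rendition goes to the unbuffered standard error file descriptor, which is presumably your display terminal.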

Dealing with memory-mapped hardware devices is part of the job when you are writing device drivers and doing other systems programming chores. Before you launch into writing thousands of lines of device driver code, it pays to poke around the device to learn how it works. memtool lets you do that. When run as root, it allows you to map an arbitrary physical address block to the virtual address space and manipulate data within it. It doesn't require any special device driver, and runs completely in user-space.

# memtool -?
usage: memtool [ -d ] [ -a ADDRESS ] [ -l BYTES ] [ -[1|2|4|8] ADDRESS ] [ -r | -[s|c|w] NUMBER ] [ -u USECONDS ] [ -t | -f ] [ ... ]
-1 ADDRESS    Use byte at ADDRESS
-2 ADDRESS    Use halfword at ADDRESS
-4 ADDRESS    Use word at ADDRESS
-8 ADDRESS    Use doubleword at ADDRESS
-a ADDRESS    Optionally map region at ADDRESS
-c NUMBER     Clear NUMBER mask at ADDRESS
-d            Enable debug mode
-f            Proceed if the last result was 0
-l BYTES      Optionally map BYTES in length
-r            Read ADDRESS
-s NUMBER     Set NUMBER mask at ADDRESS
-t            Proceed if the last result was !0
-u USECONDS   Sleep for USECONDS microseconds
-w NUMBER     Write NUMBER to ADDRESS
-?            Print menu

What kind of things can you do with memtool? On the AT91RM9200-EK board, there is a red LED controlled by the Programmable I/O (PIO) B controller. Here is a little bash script that uses memtool to flash the red LED on and off by directly manipulating the PIOB controller registers.

PIOB=0xfffff600 # PIOB base address
SIZE=0x200      # PIOC - PIOB
PER=0xfffff600  # PIO Enable Register
OER=0xfffff610  # Output Enable Register
SODR=0xfffff630 # Set Output Data Register
CODR=0xfffff634 # Clear Output Data Register

GREEN=0x1  # PIOB0: disk activity
YELLOW=0x2 # PIOB1: cpu idle
RED=0x4    # PIOB2: ours

USEC=1000000 # 1 second

TRIES=30 # 30 tries (about a minute)

memtool -a ${PIOB} -l ${SIZE} \
    -4 ${PER} -s ${RED} \
    -4 ${OER} -s ${RED}

while [ ${TRIES} -gt 0 ]; do
    memtool -a ${PIOB} -l ${SIZE} \
        -4 ${CODR} -s ${RED} -u ${USEC} \
        -4 ${SODR} -s ${RED} -u ${USEC}
    TRIES=`expr ${TRIES} - 1`
done

I've gotten email from a couple of folks who have used Diminuto or Arroyo. (Believe me, no one could have been more surprised than me!) One of them mentioned that it would be great to have some examples of simple device drivers. One of the more common memory-mapped devices that I encounter is the Field Programmable Gate Array or FPGA. FPGAs are programmable logic devices that may implement functions as simple as a few logic gates or as complex as a full-blown processor core. They are frequently used as a sort of hardware glue to connect devices together on a board, or to implement complex application-specific logic. The logic that they implement frequently exposes control, status, and data registers on the processor memory bus. It occurred to me that it would be easy to write a generic memory-mapped device driver that would work with all sorts of devices, including FPGAs.

The mmdriver device driver is exactly that. It is a loadable kernel module that provides managed, non-root access to the underlying memory-mapped device, whatever it is. The physical memory space that it controls can be specified at compile time through the use of compiler flags to define C preprocessor symbols. But it can also be dynamically configured at install time through the use of module parameters. If no configuration is provided, mmdriver maps the registers of the very same PIOB controller used in the memtool example above. (If you think you know where this is going, you're probably right.)
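
Here is a sketch of roughly what that load-time configuration looks like in a 2.6-era loadable module. The parameter names and defaults below are illustrative, not necessarily those of the actual mmdriver source.

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/init.h>

/* Default to the AT91RM9200 PIOB controller, which is what mmdriver
 * manages when no other configuration is provided. */
static unsigned long begin = 0xfffff600;
static unsigned long length = 0x200;

module_param(begin, ulong, 0444);
MODULE_PARM_DESC(begin, "physical base address of the managed region");

module_param(length, ulong, 0444);
MODULE_PARM_DESC(length, "length in bytes of the managed region");

static int __init sketch_init(void)
{
    /* The real driver would request_mem_region() and ioremap() the
     * physical address block here. */
    return 0;
}

static void __exit sketch_exit(void)
{
}

module_init(sketch_init);
module_exit(sketch_exit);
MODULE_LICENSE("GPL");

Given parameters like these, something like "insmod diminuto_mmdriver.ko begin=0xfffff800 length=0x200" would aim the driver at the adjacent PIOC controller instead, with no recompilation.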

# insmod diminuto_mmdriver.ko
# lsmod
Module                  Size  Used by    Not tainted
diminuto_mmdriver       6956  0
# mknod /dev/mmdriver c 240 0
# ./unittest-mmdriver
# rmmod diminuto_mmdriver

On top of mmdriver is mmdrivertool, a user-space tool very much like memtool but which uses ioctl commands to mmdriver, running in kernel-space, to manipulate registers in the managed device.

# mmdrivertool -?
usage: mmdrivertool [ -d ] [ -U DEVICE ] [ -[1|2|4|8] OFFSET ] [ -r | -[s|c|w] NUMBER ] [ -u USECONDS ] [ -t | -f ] [ ... ]
-1 OFFSET   Use byte at OFFSET
-2 OFFSET   Use halfword at OFFSET
-4 OFFSET   Use word at OFFSET
-8 OFFSET   Use doubleword at OFFSET
-c NUMBER   Clear NUMBER mask at OFFSET
-d          Enable debug mode
-f          Proceed if the last result was 0
-r          Read OFFSET
-s NUMBER   Set NUMBER mask at OFFSET
-t          Proceed if the last result was !0
-u USECONDS Sleep for USECONDS microseconds
-w NUMBER   Write NUMBER to OFFSET
-U DEVICE   Use DEVICE instead of /dev/mmdriver
-?          Print menu
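
Under the hood, each of these operations becomes an ioctl against the character device. Here is a sketch of what a register read might look like from user space; the request code and argument structure below are hypothetical stand-ins, since the real ones are defined by the mmdriver header file.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Hypothetical stand-ins for the real request code and argument
 * structure defined by the mmdriver header file. */
struct mm_op {
    unsigned long offset;
    unsigned long value;
};
#define MMDRIVER_READ_4 _IOWR('m', 4, struct mm_op)

int main(void)
{
    struct mm_op op = { 0x38, 0 }; /* PIOB ODSR: a readable register */
    int fd;

    if ((fd = open("/dev/mmdriver", O_RDWR)) < 0) {
        perror("open");
        return 1;
    }

    if (ioctl(fd, MMDRIVER_READ_4, &op) < 0) {
        perror("ioctl");
        close(fd);
        return 1;
    }

    printf("0x%lx: 0x%lx\n", op.offset, op.value);
    close(fd);

    return 0;
}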

Here is a script that blinks that very same red LED, this time using mmdrivertool and the mmdriver device driver.

PER=0x0   # PIO Enable Register
OER=0x10  # Output Enable Register
SODR=0x30 # Set Output Data Register
CODR=0x34 # Clear Output Data Register

GREEN=0x1  # PIOB0: disk activity
YELLOW=0x2 # PIOB1: cpu idle
RED=0x4    # PIOB2: ours

USEC=1000000 # 1 second

TRIES=30 # 30 tries (about a minute)

mmdrivertool \
    -4 ${PER} -s ${RED} \
    -4 ${OER} -s ${RED}

while [ ${TRIES} -gt 0 ]; do
    mmdrivertool \
        -4 ${CODR} -s ${RED} -u ${USEC} \
        -4 ${SODR} -s ${RED} -u ${USEC}
    TRIES=`expr ${TRIES} - 1`
done

With both memtool and mmdrivertool, here's the big disclaimer: when you are accessing physical hardware using these tools, you are handling a loaded firearm with the safeties off. Tools like this are difficult to test. Difficult enough that I haven't found a way to completely unit test all the code paths in them. Once, while poking around for a read/write hardware register that I could use to unit test both tools, I got my evaluation board so sideways that I had to power cycle it to get it back, only to discover that I had scrogged my persistent root file system in the process.

Get used to it. That's what systems programming and embedded software development are all about. You have to break a few eggs (or in my case, smoke a few boards) to learn this skill.

Update 2013-09-23: Later versions of memtool use offsets much as mmdrivertool does. This turns out to work a lot better with how memory mapped registers are typically defined in processor reference manuals and header files, as offsets from a base address.

$ memtool -?
usage: memtool [ -d ] [ -o ] [ -a ADDRESS ] [ -l BYTES ] [ -[1|2|4|8] OFFSET ] [ -m NUMBER ] [ -r | -[s|S|c|C|w] NUMBER ] [ -u USECONDS ] [ -t | -f ] [ ... ]
       -1 OFFSET     Use byte at OFFSET
       -2 OFFSET     Use halfword at OFFSET
       -4 OFFSET     Use word at OFFSET
       -8 OFFSET     Use doubleword at OFFSET
       -C NUMBER     Clear 1<<NUMBER mask at ADDRESS
       -S NUMBER     Set 1<<NUMBER mask at ADDRESS
       -a ADDRESS    Map region at ADDRESS
       -c NUMBER     Clear NUMBER mask at ADDRESS
       -d            Enable debug mode
       -f            Proceed if the last result was 0
       -l BYTES      Map length BYTES at ADDRESS
       -m NUMBER     Mask at ADDRESS with ~NUMBER prior to subsequent sets
       -o            Enable core dumps
       -r            Read ADDRESS
       -s NUMBER     Set NUMBER mask at ADDRESS
       -t            Proceed if the last result was !0
       -u USECONDS   Sleep for USECONDS microseconds
       -w NUMBER     Write NUMBER to ADDRESS
       -?            Print menu

Here's the script shown above ported to the new version of the tool. Warning: I haven't actually tested this script, but I have used the new memtool at length.


PIOB=0xfffff600 # PIOB base address
SIZE=0x200      # PIOC - PIOB
PER=0x00  # PIO Enable Register
OER=0x10  # Output Enable Register
SODR=0x30 # Set Output Data Register
CODR=0x34 # Clear Output Data Register

GREEN=0x1  # PIOB0: disk activity
YELLOW=0x2 # PIOB1: cpu idle
RED=0x4    # PIOB2: ours

USEC=1000000 # 1 second

TRIES=30 # 30 tries (about a minute)

memtool -a ${PIOB} -l ${SIZE} \
    -4 ${PER} -s ${RED} \
    -4 ${OER} -s ${RED}

while [ ${TRIES} -gt 0 ]; do
    memtool -a ${PIOB} -l ${SIZE} \
        -4 ${CODR} -s ${RED} -u ${USEC} \
        -4 ${SODR} -s ${RED} -u ${USEC}
    TRIES=`expr ${TRIES} - 1`
done

Saturday, May 08, 2010

The Real-Life Consequences of Complexity

Back in March, John Paul MacDuffie, a management professor at the Wharton School of the University of Pennsylvania, interviewed Takahiro Fujimoto, an economics professor at the University of Tokyo and a long-time observer of Toyota, about the issues Toyota has recently been facing regarding the recalls of its automobiles.


Toyota is the largest automobile manufacturer in the world and has enjoyed an enviable reputation for quality. I think it's fair to say that the issues it is facing surprised everyone. The edited transcript of the short interview is well worth reading.

Fujimoto places the blame squarely in the management and engineering design realm. He makes it very clear that he doesn't believe the issues are in manufacturing. There is much discussion about how the designers at Toyota are challenged to deal with ever-increasing levels of technical complexity, which stem not just from the technologies they use, but from the demands of their customers and the regulatory requirements of the international markets into which they sell. Fujimoto also implies that a significant amount of hubris was involved: Toyota was over-confident in its ability to deal with the complexity.

Although I don't design automobiles, or anything that (as far as I know anyway) goes into an automobile, I do routinely design and implement real-time systems on which peoples' lives depend. It always struck me as counter-intuitive that technical solutions often start out more complex than they need to be, and only become simpler over time. But it is the simpler solution that takes more time to develop, because simplicity only comes with a more complete understanding of the problem by the designer. The overly complex solution is often the first one implemented. As schedules are compressed and budgets tightened, it is frequently the only one implemented.

I see this a lot. It has an impact not only on the consumer of the product, but on the manufacturer's ability to maintain the product over time. Software developers reading this have already realized that this is another manifestation of the need to continually refactor software during development and maintenance. But my observation is that this can still be a hard sell to project managers who, concerned with their schedules and budgets inside their silo or cost center or whatever the faddish term is these days, ask

"And if we do this refactoring, what will the end user see that is different?"

and are told by the technical lead

"Well, if we do everything absolutely perfectly, nothing."

That is not a good way to get work funded.

I wouldn't be surprised if a similar conversation occurred inside Toyota some time ago.

Friday, May 07, 2010

A Matter of Scale: The CAP Theorem and Memory Models

In his keynote at the 2000 ACM Symposium on Principles of Distributed Computing (PODC), Eric A. Brewer made a remarkable assertion. He said that given three desirable qualities of distributed systems, consistency, availability, and partition tolerance, you can at best have two out of the three. This came to be known as Brewer's Conjecture.

You might be inclined to think Brewer knew what he was talking about. He was not only a professor of computer science at U. C. Berkeley, but also the co-founder of Inktomi, a company that produced the HotBot Internet search engine. For a time, HotBot was the most popular Internet search engine, having displaced AltaVista, until its supremacy was upset by Google and Inktomi was acquired by Yahoo!

Even more remarkably, a couple of years later Seth Gilbert and Nancy Lynch at MIT formally proved Brewer's Conjecture for a couple of abstract (but highly applicable) models, leading to the CAP Theorem.

The CAP Theorem has been getting more press lately, thanks in part to Werner Vogels, the CTO of Amazon.com, writing about how that e-commerce giant gets around the fundamental architectural implications of the theorem for large distributed systems. So it's not a big surprise that I would have come across the CAP Theorem in my obsessively wide reading.

But although I've done my time in web services (Fun Facts to Know and Tell: Java Business Integration), high performance computing (Post Modern Deck Construction), and distributed systems (Traffic Management and System Scalability), lately I've been agonizing over the machine code produced by compilers and the bare metal on which it runs (Cache Coherency and Memory Models). So I was standing in the shower the other morning, listening to that little voice in my head, trying to figure out why I was so focused on the implications of the CAP Theorem when my current interests leaned more towards cache coherency, or the lack of it.

And it finally occurred to me: they're the same problem. It is simply a matter of scale.

Here's the gist of the CAP Theorem as I understand it.

Consistency means serialization. To have strong consistency, transactions run across distributed servers on shared data must look as if they had been serialized and run one at a time as atomic transactions, whether that's the case or not. This is what the database and transaction people are talking about when they cite the ACID properties: Atomicity, Consistency, Isolation, Durability.

Availability means reliability. Every request gets a timely response, and no client is left hanging. It's this requirement for high availability that typically leads to distributed systems in the first place, since HA systems achieve it by eliminating single points of failure. What constitutes failure is relative in this context. Julian Browne has remarked that an extra tenth of a second in response time costs Amazon.com one percent in sales.

Partition-tolerance means the system deals with the occasional temporary network outage. This is a reality even on a LAN inside a centralized data center, but obviously more so in a geographically distributed system depending on a WAN. Failure here is relative depending on the application. Network partitioning can be seen as an increase in latency. For real-time applications, an increase in latency of milliseconds may constitute a network failure, while applications requiring only an eventual data reconciliation will be more tolerant.

Brewer remarked that there is no such thing as a 100% working system (something the high performance computing folks are grappling with as they try to build exascale systems from commodity parts), and no such thing as 100% fault tolerance. But hugely scalable cloud providers like Google, Amazon.com, and eBay achieve as close to 100% as they can by throwing a lot of parallel (to handle the workload) and redundant (to achieve high availability) hardware at the problem. Geographic location is also a single point of failure, not just from natural disasters but from the availability of infrastructure like bandwidth, latency, and power. So such systems tend to be geographically distributed as well. This makes them highly dependent on network connectivity to communicate with one another.

The theorem states that at most two of the three, (strong) consistency, (high) availability, and (network) partition-tolerance, can be achieved. Each distributed system designer has to make their own decision as to where their application lies in the CAP space, and what compromises they are going to make, based on their business needs. A distributed system may provide strong consistency and high availability but can only do so as long as the servers can talk to one another. Or it may tolerate network outages but become completely unavailable while waiting for the network to heal. Or it might relax its need for strong consistency while being highly available even in the face of temporary network outages.

In their paper on the CAP Theorem, Gilbert and Lynch use two different models to construct their proof. The asynchronous model has no temporal dimension. Agents in the asynchronous model are simply state machines that are only concerned with message order, not with when messages arrive. The partially synchronous model has a sense of time duration. Agents in the partially synchronous model can take action when an expected message doesn't arrive within a timeout interval. In the partially synchronous model, they show that it is possible to achieve a useful compromise between consistency and availability in the face of network partitioning.

While both models are applicable to the real-world real-time systems that I have helped develop over the past few decades, it is the partially synchronous model that has been leveraged by the big e-commerce players. In particular, Vogels has written about how Amazon.com uses weak or eventual consistency as its own necessary sacrifice to the CAP Theorem.

So, finally, here's the thing: all this exists, writ small, in the world of multiple processors and cores, cache coherency protocols, and shared memory. The individual cores constitute the distributed agents. Their individual memory caches contain the copies of shared data. The interconnects between the cores form the network. The cache coherency protocols correspond to the transaction protocols. Threads in a multi-threaded application reading and writing shared data in main memory create the workload. The same issues of consistency, availability, and partition-tolerance apply. And the same trade-offs are made, whether you realize it or not.

Don't think the interconnect network in your multi-core system is ever partitioned? Sure it is. Network connectivity failures are never permanent, unless you are unplugging the power cord from your server and tossing the whole box into the dumpster. It's all just a question of how much latency your system is willing to tolerate. A distributed system of the likes of Google's may tolerate seconds, minutes, or even hours, relying on redundancy to carry the load until the network can heal. The kinds of message passing systems I've helped develop over the years might have a latency threshold measured in tens or hundreds of milliseconds. A microcontroller talking to an FPGA might tolerate microseconds. A microprocessor execution unit waiting for a word from cache might measure its latency threshold in fractions of a nanosecond before the execution pipeline stalls.

It's all the same model. Just the decimal point moves.

Ulrich Drepper, of Red Hat, authored a massive but indispensable tome on how memory actually works in modern microprocessors. In it he wrote of the MESI (Modified, Exclusive, Shared, Invalid) cache coherency protocol:

If multiple threads modify the same memory location concurrently, processors do not guarantee any specific result. This is a deliberate decision made to avoid costs which are unnecessary in 99.999% of all cases. For instance, if a memory location is in the ‘S’ state and two threads concurrently have to increment its value, the execution pipeline does not have to wait for the cache line to be available in the ‘E’ state before reading the old value from the cache to perform the addition. Instead it reads the value currently in the cache and, once the cache line is available in state ‘E’, the new value is written back. The result is not as expected if the two cache reads in the two threads happen simultaneously; one addition will be lost.
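
You can watch this trade-off bite with just a few lines of code. Here is a sketch of the classic lost-update demonstration using POSIX threads: several threads each increment a shared counter many times without synchronization, and on a multi-core system the final total typically comes up short.

#include <stdio.h>
#include <pthread.h>

enum { THREADS = 4, INCREMENTS = 1000000 };

static volatile unsigned long counter = 0;

/* Each thread performs an unsynchronized read-modify-write on the
 * shared counter, exactly the case Drepper describes. */
static void * body(void * arg)
{
    int ii;

    for (ii = 0; ii < INCREMENTS; ++ii) {
        counter++;
    }

    return (void *)0;
}

int main(void)
{
    pthread_t threads[THREADS];
    int ii;

    for (ii = 0; ii < THREADS; ++ii) {
        pthread_create(&(threads[ii]), (pthread_attr_t *)0, body, (void *)0);
    }

    for (ii = 0; ii < THREADS; ++ii) {
        pthread_join(threads[ii], (void **)0);
    }

    /* We expect 4000000, but on a multi-core system the actual total
     * is typically smaller because concurrent increments are lost. */
    printf("expected=%d actual=%lu\n", THREADS * INCREMENTS, counter);

    return 0;
}

Replacing the bare increment with GCC's __sync_fetch_and_add(&counter, 1) makes the addition atomic again, at the cost of each core waiting to acquire the cache line exclusively.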

This is exactly the kind of compromise the distributed systems folk are talking about when they discuss the implications of the CAP Theorem. In fact, while reading Gilbert and Lynch's paper, I was struck by how much the model they were describing in their proof corresponded to the MESI protocol, and how closely their synchronous model described the kind of real-time systems, distributed and otherwise, with which I am familiar. In their quest for performance, microprocessor designers have left behind ACID for what the distributed systems folk refer to as BASE: Basically Available, Soft-state, Eventual consistency. And they've done it for all the same reasons.

Why is this correspondence important? Because I think the CAP Theorem is just as valuable a tool for designing and reasoning about multi-processor and multi-core systems as it is for globe-girdling distributed systems.

I was in that shower a long time. Mrs. Overclock (etc.) probably wondered what was going on.

Update (2010-05-08)

When I had my ah-ha moment in the shower, I knew it was obvious enough that someone must have thought of it before. But my own research, online and otherwise, didn't turn anything up. I had read Vogels' article in ACM Queue, but this morning I went back and read his original blog article from which that was taken. Sure enough, two years ago, in the comments section, Manish Vachharaiani made the same observation about the similarity between the CAP Theorem and cache coherency.

And in those same comments, Roy Friedman cites the first formal treatment of the CAP issue, Attiya and Welch, which I've added to my sources below.

Sources

Hagit Attiya, Jennifer L. Welch, "Sequential Consistency versus Linearizability", ACM Transactions on Computing Systems, 12(2):91-122, 1994

Eric A. Brewer, "Towards Robust Distributed Systems", ACM Symposium on Principles of Distributed Computing, July 2000

Julian Browne, "Brewer's CAP Theorem", January 2009

Ulrich Drepper, What Every Programmer Should Know About Memory, Red Hat Inc., November 2007

Armando Fox, Eric A. Brewer, "Harvest, Yield, and Scalable Tolerant Systems", Proceedings of the Seventh Workshop on Hot Topics in Operating Systems, IEEE Computer Society, 1999

Seth Gilbert, Nancy Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services", ACM SIGACT News, 33(2):51-59, 2002

J. L. Sloan, "Implications of Memory Models (or Lack of Them) for Software Developers", Digital Aggregates Corporation, April 2010

Werner Vogels, "Eventually Consistent", ACM Queue, October 2008