Sunday, June 14, 2009

Cache Coherency and Memory Models

Sometime around 2002 I was working on an object-oriented multi-threading framework in C++ for an embedded project when the most remarkable thing happened. I had an object that represented a thread. My application created the object via the usual new operator and started the thread. The first thing the thread did was access its own thread object. It core dumped. Core dump analysis using gdb revealed that the virtual pointer in the object was null.

I know what you're thinking.

That's impossible.

That's what I thought too. That event led to a personal research project that consumed many hours of my free time, and to my putting together a presentation on the Implications of Memory Models (or Lack of Them) for Software Developers. In it, I describe the research of others that reveals that modern processor memory subsystems have broken the memory models implicit in the software.

This includes software written in higher-level languages like C, C++, and Java (prior to Version 6 anyway), as well as software written in assembler. It includes multi-processors, multi-core uniprocessors, and even single-core uniprocessors that are hyper-threaded. The upshot is that, thanks to various processor, memory, and compiler optimizations, memory may not be read or written when you think it is from looking at your code. Or even at the compiled code.
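A minimal sketch of the kind of surprise this leads to (my own example, not one from the research I just mentioned): the flag-and-data handoff below looks correct in the source, yet without some form of synchronization both the compiler and the processor are free to reorder the two stores, or the two loads, so the consumer can see the flag set while still reading a stale payload.

    // Broken flag-and-data handoff: correct-looking source that the compiler
    // and/or the processor are free to reorder. Build with g++ and -lpthread.
    #include <pthread.h>
    #include <cstdio>

    int payload = 0;
    int ready   = 0;

    void* producer(void*) {
        payload = 42;       // store 1
        ready   = 1;        // store 2: may become visible before store 1
        return 0;
    }

    void* consumer(void*) {
        while (ready == 0) { }          // the compiler may even hoist this load
        std::printf("%d\n", payload);   // may print 0 instead of 42
        return 0;
    }

    int main() {
        pthread_t p, c;
        pthread_create(&c, 0, consumer, 0);
        pthread_create(&p, 0, producer, 0);
        pthread_join(p, 0);
        pthread_join(c, 0);
        return 0;
    }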

Java addressed this issue in Version 6, provided you use volatile or the language's built-in synchronization mechanisms. But C and C++ developers dealing with variables shared among threads must still resort to the appropriate use of the volatile keyword and explicit memory barrier instructions, which may be provided by your threading library's synchronization functions, provided you use them.

Fortunately, gcc at least offers the generic __sync_synchronize() built-in, which performs a platform-specific memory barrier with full fence semantics. Unfortunately, in their paper Volatiles Are Miscompiled, and What to Do about It, Eric Eide and John Regehr at the University of Utah reveal that code containing the volatile keyword is frequently miscompiled. The life of a systems programmer is seldom simple.
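As a sketch of how that built-in might be applied to the broken handoff above (again my own example; production code would more likely rely on the threading library's mutexes, which supply the barriers for you), volatile keeps the compiler from caching the flag while __sync_synchronize() orders the surrounding memory accesses.

    // The same handoff, with volatile to defeat compiler caching of the flag and
    // __sync_synchronize(), gcc's full-fence built-in, to order the accesses.
    #include <pthread.h>
    #include <cstdio>

    int payload = 0;
    volatile int ready = 0;

    void* producer(void*) {
        payload = 42;
        __sync_synchronize();           // make the payload store visible first
        ready = 1;
        return 0;
    }

    void* consumer(void*) {
        while (ready == 0) { }          // volatile forces a fresh load each pass
        __sync_synchronize();           // keep the payload load below the flag load
        std::printf("%d\n", payload);   // with the fences in place, 42
        return 0;
    }

    int main() {
        pthread_t p, c;
        pthread_create(&c, 0, consumer, 0);
        pthread_create(&p, 0, producer, 0);
        pthread_join(p, 0);
        pthread_join(c, 0);
        return 0;
    }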

Recently I discovered Ulrich Drepper's lengthy white paper What Every Programmer Should Know About Memory. He gives a detailed explanation of how the hardware architecture of cache and memory subsystems can affect the performance and correctness of software. I recommend it to everyone who does embedded or systems development.

(Actually, I recommend it to all developers. But my Java buddies have pointed out on more than one occasion that they don't really care to know how things work under the hood. That's an attitude that I can't relate to. But I was one of those kids who took clocks apart to see what made them tick.)

It took me a couple of weeks to read Drepper's 114-page paper cover to cover. The entire time I was reading his description of the MESI cache coherency protocol, I was trying, without much success, to reconcile it with my prior reading on processor memory models. (Briefly, the MESI protocol, which tries to keep the local caches of individual processor cores in sync, places each cache line in one of four states: Modified, Exclusive, Shared, or Invalid.)

Then finally I got to section 6.4.2, "Atomicity Operations", page 68:

If multiple threads modify the same memory location concurrently, processors do not guarantee any specific result. This is a deliberate decision made to avoid costs which are unnecessary in 99.999% of all cases. For instance, if a memory location is in the ‘S’ state and two threads concurrently have to increment its value, the execution pipeline does not have to wait for the cache line to be available in the ‘E’ state before reading the old value from the cache to perform the addition. Instead it reads the value currently in the cache and, once the cache line is available in state ‘E’, the new value is written back. The result is not as expected if the two cache reads in the two threads happen simultaneously; one addition will be lost.

If this doesn't run chills down your spine, you're not thinking clearly. Drepper goes on to discuss some of the same atomicity operations I talk about in my presentation. His paper also details the write reordering which I suspect was the cause of my null C++ virtual pointer.
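The lost update Drepper describes is easy to demonstrate with an ordinary increment of a shared counter. Here is a hedged sketch (my example, not Drepper's) that pairs the racy increment with gcc's __sync_fetch_and_add() built-in as the atomic alternative.

    // Two threads each increment a shared counter N times. The plain ++ is a
    // non-atomic read-modify-write, so increments can be lost exactly as Drepper
    // describes; the __sync_fetch_and_add() built-in always totals 2 * N.
    #include <pthread.h>
    #include <cstdio>

    enum { N = 1000000 };
    long racy_counter   = 0;
    long atomic_counter = 0;

    void* work(void*) {
        for (int i = 0; i < N; ++i) {
            racy_counter++;                           // lost updates likely
            __sync_fetch_and_add(&atomic_counter, 1); // atomic read-modify-write
        }
        return 0;
    }

    int main() {
        pthread_t a, b;
        pthread_create(&a, 0, work, 0);
        pthread_create(&b, 0, work, 0);
        pthread_join(a, 0);
        pthread_join(b, 0);
        std::printf("racy=%ld atomic=%ld expected=%d\n",
                    racy_counter, atomic_counter, 2 * N);
        return 0;
    }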

Why is all this stuff important?

In the early 1980s I was a computer science graduate student working on my thesis under the direction of my mentor Bob Dixon, in a research group that was looking at programming language and operating system architectures for massively parallel processors. This was when Japan's Fifth Generation project, dataflow architectures, and the massively parallel Connection Machine were big. At the time, it all seemed bleeding edge and, given our budget, largely hypothetical.

But just a few years later I found myself working at a national lab, the National Center for Atmospheric Research, which not only had a Connection Machine, but other exotic highly parallel systems including several Cray Research supercomputers with multiple processors and vector hardware.

A few years later still and I am more than a little surprised to find that I have a four-core Dell server running Linux in my basement. The future has a way of happening.

In The Landscape of Parallel Computing Research: A View from Berkeley, researchers David Patterson et al. describe the future that they see: not just multi-core, but many-core, the equivalent of a massively parallel processor running on your desktop, with the dozens or hundreds of cores sharing memory. Suddenly, understanding concurrency, synchronization, and memory models (or the eventual higher level languages that implement it all without the developer having to worry about it) seems pretty necessary.

For me, it's 1983 all over again.