Sunday, June 14, 2009

Cache Coherency and Memory Models

Sometime around 2002 I was working on an object-oriented multi-threading framework in C++ for an embedded project and the most remarkable thing happened. I had an object that represented a thread. My application created the object via the usual new operator and started the thread. The first thing the thread did was access its own thread object. It core dumped. Core dump analysis using gdb revealed that the virtual pointer in the object was null.

I know what you're thinking.

That's impossible.

That's what I thought too. That event led to a personal research project that consumed many of hours of my free time and led to my putting together a presentation on the Implications of Memory Models (or Lack of Them) for Software Developers. In it, I describe the research of others that reveals that modern processor memory subsystems have broken the memory models implicit in the software.

This includes software written in higher level languages like C, C++, and Java (prior to Version 6 anyway) and as well as that written in assembler. It includes multi-processors, multi-core uniprocessors, and even those single core uniprocessors that are hyper-threaded. The upshot is that, thanks to various processor, memory, and compiler optimizations, memory may not be read or written when you think it is from looking at your code. Or even the compiled code.

Java addressed this issue in Version 6, providing you use volatile or the language's built-in synchronization mechanisms. But C and C++ developers dealing with variables shared among threads must still resort to the appropriate use of the volatile and explicit memory barrier instructions, which may be provided by your threading library's synchronization functions, providing you use them.

Fortunately, at least gcc offers the generic __sync_synchronize() built-in that performs a platform-specific memory barrier with full fence semantics. Unfortunately, in their paper Volatiles are Miscompiled, and What to Do about It, Eric Eide and John Regehr at University at Utah reveal that code containing the volatile keyword is frequently miscompiled. The life of a systems programmer is seldom simple.

Recently I discovered Ulrich Drepper's lengthy white paper on What Every Programmer Should Know About Memory. He gives a detailed explanation of how the hardware architecture of cache and memory subsystems can affect performance and correctness of software. I recommend it to every one that does embedded or systems development.

(Actually, I recommend it to all developers. But my Java buddies have pointed out on more than one occasion that they don't really care to know how things work under the hood. That's an attitude that I can't relate to. But I was one of those kids who took clocks apart to see what made them tick.)

It took me a couple of weeks to read Drepper's 114 page paper cover to cover. The entire time I was reading his description of the MESI cache coherency protocol, I was trying, without much success, to reconcile it with my prior reading on processor memory models. (Briefly, the MESI protocol, which tries to keep the local caches of individual processor cores in sync, places each cache line in a state of Modified, Exclusive, Shared, or Invalid.) 

Then finally I got to section 6.4.2, "Atomicity Operations", page 68:

If multiple threads modify the same memory location concurrently, processors do not guarantee any speciļ¬c result. This is a deliberate decision made to avoid costs which are unnecessary in 99.999% of all cases. For instance, if a memory location is in the ‘S’ state and two threads concurrently have to increment its value, the execution pipeline does not have to wait for the cache line to be available in the ‘E’ state before reading the old value from the cache to perform the addition. Instead it reads the value currently in the cache and, once the cache line is available in state ‘E’, the new value is written back. The result is not as expected if the two cache reads in the two threads happen simultaneously; one addition will be lost.

If this doesn't run chills down your spine, you're not thinking clearly. Drepper goes on to discuss some of the same atomicity operations I talk about in my presentation. His paper also details the write reordering which I suspect was the cause of my null C++ virtual pointer.

Why is all this stuff important?

In the early 1980s I was a computer science graduate student working on my thesis under the direction of my mentor Bob Dixon, in a research group that was looking at programming language and operating system architectures for massively parallel processors. This was when Japan's Fifth Generation project, dataflow architectures, and the massively parallel Connection Machine were big. At the time, it all seemed bleeding edge and, given our budget, largely hypothetical.

But just a few years later I found myself working at a national lab, the National Center for Atmospheric Research, which not only had a Connection Machine, but other exotic highly parallel systems including several Cray Research supercomputers with multiple processors and vector hardware.

A few years later still and I am more than a little surprised to find that I have a four-core Dell server running Linux in my basement. The future has a way of happening.

In The Landscape of Parallel Computing Research: A View from Berkeley, researchers David Patterson et al. describe the future that they see: not just multi-core, but many-core, the equivalent of a massively parallel processor running on your desktop, with the dozens or hundreds of cores sharing memory. Suddenly, understanding concurrency, synchronization, and memory models (or the eventual higher level languages that implement it all without the developer having to worry about it) seems pretty necessary.

For me, it's 1983 all over again.


Chip Overclock said...

My friend and former colleague Craig Ruff made the following remarks which he has kindly permitted me to post:

I read your blog entry about the thread object core dump. I ran into the a similar problem, and was forced to rely on a Thread::Constructed method called from the most-derived constructor to defer the pthread_create call until all of the object's parts had been initialized. Even then I had a bit of final initialization to do after the pthread_create (to record the resulting thread id and to generate a readable thread object name), so the wrapper routine that is started waits on the thread object's condition variable before calling the objects Run method. If the Thread object is constructed in the idle state instead, it just continues to wait until the SetRunning method is called by someone else.

Chip Overclock said...

I've run into this issue, that of construction of the base class object versus the construction of the derived class object, in other contexts. In fact, I first ran into it during my Enterprise Java development days, and that experience has let me to a design pattern for both Java and C++ constructors i which I limit actions taken in the constructor to strictly simply memory initialization. It is for just this reason that many Java frameworks require the implementation of life cycle methods to control how and when an object may communicate with other objects during the various phases of its construction and initialization.

I've seen this pattern in other large software systems as well written in both C and C++. The phases may be along the order of [1] may initialize its own memory, [2] may access other objects in the same subsystem, [3] may access other objects in other subsystems, [4] may communicate across the network. This is keeps the construction and initialization of many components in lock step.

As a happy side effect, the act of keeping code out of constructors that access other objects or subsystems has made unit testing much simpler since such objects can be constructed without having to create mock objects or providing a rich simulated surround for the object to communicate with.