Sarita V. Adve and Hans-J. Boehm, "Memory Models: A Case for Rethinking Parallel Languages and Hardware", Communications of the ACM, 53.8, August 2010, pp. 90-101 (preprint)
by two experts on the topic. Adve (U. of Illinois at Urbana-Champaign) and Boehm (HP Laboratories) write frequently and cogently, together and separately, in this area that should be of interest to anyone who, like me, develops real-time and/or multi-threaded code. This includes everyone from the person writing a device driver in C destined for an embedded multi-core platform to the person writing in Java for an enterprise server-side application. I have in fact been both of those persons.
In this particular article Adve and Boehm suggest software-hardware co-development will be necessary to tackle the issue of broken memory models which are repaired with memory models too subtle and difficult for anyone except experts to understand and hence impossible to teach at the undergraduate level. What I really like about the article though is that six or so pages are an excellent, detailed introduction to the topic that is very accessible for those willing to make the effort.
One of their several tiny pseudo-code snippets used as examples actually made me laugh out loud: a speculative store by one thread based on an analysis of prior application behavior, either statically by the compiler at compile-time or dynamically by the processor at run-time, causes another thread to act such that this speculation turns out to be correct post-hoc, creating a "causality loop" resulting in a "self-fulfilling speculation". This can result in the storing of a speculative value (42 in their example) created out of thin air that has nothing to do with the actual algorithm being executed.
They make the point that although no such processor or compiler does this today, such an operation isn't that different from speculative operations that are performed today by processors and compilers, and current memory models do nothing to indicate that such an operation is incorrect. They also say that it is remarkably difficult to define a formal memory model that disallows such an operation without disallowing optimizations that are routinely done today.
I feel like Stephen King reading about the funeral home that tossed bodies in the pond out back when their cremation oven broke down: you can't make this stuff up; no one would believe it! There is also some discussion about the forthcoming atomic keyword in C++, which is like the volatile keyword in Java, which is completely different in semantics from the volatile keyword in C. This is the kind of thing will keep me gainfully employed for the next few years.
Adve and Boehm also make the excellent point that the use of memory barriers, in the form of either compiler directives and machine instructions, is problematic because for nearly all compilers and processors such barriers affect all of the active variables (for the former) or all of cached memory (for the latter), not just the one variable shared between threads with which the developer is concerned. This can potentially have a huge impact on performance, including cache hit rate and use of memory bus bandwidth.
Thirty years ago I found myself in graduate school teaching a course in real-time software design where my students wrote multi-threaded applications, device drivers, and interrupt service routines, while simultaneously working on my thesis as part of a research group looking at programming language designs for massively-parallel data-flow machines. In their technical report on the advent of many-core processors
K. Asanovic at al., The Landscape of Parallel Computing Research: A View from Berkeley, EECS Department, U. C. Berkeley, UCB/EECS 2006-183, December 2006
the authors (which includes David Patterson of Patterson and Hennesy) remark that the skill sets for embedded development and for many-core development are remarkably similar. That's obviously been my experience too. At my extremely advanced age, pondering the implications of many-core processors, with dozens or even hundreds of cores, it's making me positively nostalgic for the old days of non-procedural single-assignment languages with lazy-evaluation running on massively-parallel data-flow architectures.