There is a danger in succumbing to the seduction of choosing reading material that agrees with your point of view. Sure, you enjoy reading it in a "look how smart I am" kind of way. But in the end, it isn't clear you've actually learned anything. So it is with some trepidation that I recommend The Landscape of Parallel Computing Research: A View from Berkeley, a recent white paper by a multidisciplinary group of researchers at the University of California at Berkeley. Maybe I like it because it supports some of my recent assertions, but I prefer to think it is because I learned quite a bit from it. It includes a list of old conventional wisdom versus new conventional wisdom about microprocessor design that gave me pause. I'm tempted to just cut and paste the list here, but in the spirit of fair use, I'll just list some of the issues in my own words, in the hope that you'll be inspired to read the white paper.
- In a reversal of what has been true in the past, power is expensive, but transistors are free; we can put more transistors on a chip than we have the power to turn on.
- For desktops and servers, static power consumed by leakage can be forty percent of the total power.
- Chip masks for feature sizes below sixty-five nanometers are so expensive to produce that researchers can no longer afford to build physical prototypes of new designs.
- For many technologies, bandwidth improves by at least the square of the improvement in latency.
- Modern microprocessors may be able to do a floating-point multiply in four clock cycles, but can take as many as 200 cycles to access DRAM, leading to a reversal in which loads and stores are slow, but floating-point operations are fast.
- There are diminishing returns in achieving higher degrees of instruction-level parallelism using tricks like branch prediction, out-of-order execution, speculative execution, and the like. (And, I would add, these tricks have already broken the memory models of many popular programming languages.)
- The doubling of sequential microprocessor performance has slowed; continuing the growth in performance as per Moore's Law will require increasing degrees of parallelism.
The authors cite a chip used by Cisco in a router that incorporates 188 RISC cores using a 130-nanometer process. They go on to predict that as many as 1000 cores may be possible on a single chip using a thirty-nanometer feature size. They mention the many advantages of building microprocessors from multiple cores:
- Achieving greater performance by parallelism is energy efficient.
- Multiple processors can be exploited in fault-tolerant or redundant designs.
- For parallel codes, many small cores yield the highest performance per unit area.
- Smaller processing elements are easier to design and verify.
Thankfully, they go into some detail about the difficulties one might encounter when writing software for a 1000-core microprocessor. This gives me some hope that hardware designers might have a clue about how challenging it is to write such fine-grained parallel code. Some of the issues they mention are: inefficiencies in cache-coherence protocols; the difficulty in writing reliable code using synchronization locks; the immaturity of techniques, like transactional memory, which provide higher levels of abstraction for parallelism; and the cognitive effort required to use programming models like MPI, used by my friends in the high performance computing (HPC) community, which require that the developer explicitly map computational tasks onto specific processors.
It's not like it's a new problem. I remember back in my HPC days how researchers much preferred the eight-processor Cray Research supercomputer to the 64,000-processor Connection Machine MPP that sat right next to it. I explained it to Mrs. Overclock this way: if you want to build a deck on the back of your house, which is easier: doing it with eight master carpenters, or with 64,000 junior high schoolers taking shop? If you could figure out some way to keep all of the students productive, you might just get it done faster. But the effort involved might also lead you to conclude that you would have been better off just building the deck by yourself.
And it's also not like we aren't already facing this issue. Today's microprocessors have already broken the memory models of many programming languages; witness the fixes that had to go into Java 1.5 so that common multithreaded design patterns like double-checked locking would work reliably. There are legacy C/C++ code bases using multiple threads that cannot run reliably on a multi-core processor, or even on a hyper-threaded processor that merely pretends to have multiple cores, without a significant porting effort. Until these issues are resolved, such code bases have stepped off the Moore's Law curve. Just let me off at the corner, thanks, I'll walk from here.
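For readers who haven't run into it, here is a minimal sketch of double-checked locking in the form that is generally considered safe under the revised Java 1.5 memory model. The class name LazyConfig and its single field are hypothetical, made up purely for illustration. The part that matters is the volatile keyword: without it, the idiom relies on ordering guarantees that the compiler and processor are free to break.

```java
// Hypothetical example class; the names are illustrative only.
final class LazyConfig {
    // volatile is essential: under the Java 1.5 memory model it prevents
    // another thread from seeing a reference to a partially built object.
    private static volatile LazyConfig instance;

    private final String name;

    private LazyConfig(String name) {
        this.name = name;
    }

    static LazyConfig getInstance() {
        LazyConfig local = instance;            // first check, without the lock
        if (local == null) {
            synchronized (LazyConfig.class) {
                local = instance;               // second check, holding the lock
                if (local == null) {
                    local = new LazyConfig("default");
                    instance = local;           // publish via the volatile write
                }
            }
        }
        return local;
    }

    String name() {
        return name;
    }
}
```

Every caller of LazyConfig.getInstance() gets the same object, and only the first few callers ever pay for the lock. Remove the volatile and the code still compiles and usually still works, which is exactly what makes this class of bug so unpleasant.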
I find it unlikely that developers will be able to effectively utilize a thousand-core processor using the kind of coarse-grained parallelism offered by the POSIX thread library or the Java thread model. It's more likely that languages will have to evolve, much as the standard language of supercomputing, FORTRAN, has: by incorporating fine-grained parallelism into language constructs like the do-loop, as part of the syntax. Or even by implicitly defining parallelism as part of the language semantics, in a way similar to the research projects I remember from graduate school when the Japanese Fifth Generation project was looming on the Eastern horizon.
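To make the contrast concrete, here is a rough sketch of a loop whose parallelism is stated declaratively rather than by creating and joining threads by hand. It assumes the streams library that arrived in later versions of Java (java.util.stream, Java 8 and up), which is not something the white paper discusses; the class and method names are hypothetical. It is a library feature rather than true language syntax, so take it as an approximation of the idea, not an example of it.

```java
import java.util.stream.IntStream;

// Hypothetical example class; the names are illustrative only.
final class ParallelLoopSketch {
    // Sum of squares of 0..n-1 written as a data-parallel loop.
    // The runtime, not the developer, decides how to split the range
    // across the available cores.
    static long sumOfSquares(int n) {
        return IntStream.range(0, n)
                .parallel()
                .mapToLong(i -> (long) i * i)
                .sum();
    }

    // The same computation as an ordinary sequential loop, for comparison.
    static long sumOfSquaresSequential(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += (long) i * i;
        }
        return total;
    }
}
```

Nothing in sumOfSquares names a thread, a lock, or a processor. That is the property I would expect a thousand-core programming model to need, although whether this particular mechanism would scale that far is an open question.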
Parallelism has been a running theme during my thirty-plus-year career, from writing interrupt-driven real-time code, to designing distributed systems, to supporting supercomputers, to working on embedded projects with as many as thirty separate processing units on a single board, to developing enterprise applications suitable for multi-core servers. I look forward to seeing where things will go.
K. Asanovic et al., The Landscape of Parallel Computing Research: A View from Berkeley, UCB/EECS-2006-183, Electrical Engineering and Computer Science, U. C. Berkeley, December 18, 2006.