Monday, February 12, 2007

Small is Beautiful, but Many is Scary

There is a danger in succumbing to the seduction of choosing reading material that agrees with your point of view. Sure, you enjoy reading it in a "look how smart I am" kind of way. But in the end, it isn't clear you've actually learned anything. So it is with some trepidation that I recommend The Landscape of Parallel Computing Research: A View from Berkeley, a recent white paper by a multidisciplinary group of researchers at the University of California at Berkeley. Although maybe I like it because it supports some of my recent assertions, I prefer to think it is because I learned quite a bit from it. It includes a list of old conventional wisdom versus new conventional wisdom about microprocessor design that gave me pause. I'm tempted to just cut and paste the list here, but in the spirit of fair use, I'll just list some of the issues in my own words, in the hope that you'll be inspired to read the white paper.

  • In a reversal of what has been true in the past, power is expensive, but transistors are free; we can put more transistors on a chip than we have the power to turn on.
  • For desktops and servers, static power consumed by leakage can be forty percent of the total power.
  • Chip masks for feature sizes below sixty-five nanometers are so expensive to produce that researchers can no longer afford to build physical prototypes of new designs.
  • For many technologies, bandwidth improves by at least the square of the improvement in latency.
  • Modern microprocessors may be able to do a floating-point multiply in four clock cycles, but take as many as 200 cycles to access DRAM, leading to a reversal in which loads and stores are slow, but floating point operations are fast.
  • There are diminishing returns in achieving higher degrees of instruction-level parallelism using tricks like branch prediction, out-of-order execution, speculative execution, and the like. (And, I would add, these tricks have already broken the memory models of many popular programming languages.)
  • The doubling of sequential microprocessor performance has slowed; continuing the growth in performance as per Moore's Law will require increasing degrees of parallelism.
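
To put rough numbers on the reversal between slow loads and fast floating point, here is a minimal Python sketch. The two cycle counts come from the list above; the function name, the workloads, and the assumption that computation and memory access never overlap are mine, chosen only to keep the arithmetic simple.

```python
# Cycle counts quoted in the list above; everything else is illustrative.
FP_MULTIPLY_CYCLES = 4
DRAM_ACCESS_CYCLES = 200


def memory_stall_fraction(multiplies, dram_accesses):
    """Fraction of total cycles spent waiting on DRAM.

    Assumes computation and memory access never overlap, which is a
    deliberate simplification of what real hardware does.
    """
    compute_cycles = multiplies * FP_MULTIPLY_CYCLES
    memory_cycles = dram_accesses * DRAM_ACCESS_CYCLES
    return memory_cycles / (compute_cycles + memory_cycles)


if __name__ == "__main__":
    # One DRAM access per multiply: nearly all the time goes to waiting.
    print(f"{memory_stall_fraction(1000, 1000):.1%}")  # 98.0%
    # One DRAM access per fifty multiplies: an even split.
    print(f"{memory_stall_fraction(1000, 20):.1%}")    # 50.0%
```

Even under this crude model, a kernel has to do about fifty multiplies per trip to DRAM just to spend half its time computing.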

The authors cite a chip used by Cisco in a router that incorporates 188 RISC cores using a 130 nanometer process. They go on to predict that as many as 1000 cores may be possible on a single chip using a thirty nanometer feature size. They mention the many advantages of building microprocessors from multiple cores.

  • Achieving greater performance by parallelism is energy efficient.
  • Multiple processors can be exploited in fault-tolerant or redundant designs.
  • For parallel codes, many small cores yield the highest performance per unit area.
  • Smaller processing elements are easier to design and verify.
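
The energy argument can be sketched with the textbook approximation that dynamic CMOS power scales as capacitance times voltage squared times frequency. That formula is my addition, not something quoted from the white paper, and the operating point below (two cores, half the clock, seventy percent of the voltage) is a hypothetical one. The sketch also ignores static leakage, which as noted above can be a large share of the total.

```python
def relative_dynamic_power(cores, frequency_scale, voltage_scale):
    """Dynamic power relative to one core at full frequency and voltage,
    using the textbook CMOS approximation P ~ C * V^2 * f per core."""
    return cores * voltage_scale ** 2 * frequency_scale


def relative_throughput(cores, frequency_scale):
    """Throughput relative to one full-speed core, assuming the workload
    parallelizes perfectly across the cores (an optimistic assumption)."""
    return cores * frequency_scale


if __name__ == "__main__":
    # Hypothetical operating point: two cores at half the clock rate,
    # with the supply voltage lowered to 70% of nominal.
    print(relative_throughput(2, 0.5))                    # 1.0
    print(f"{relative_dynamic_power(2, 0.5, 0.7):.2f}")   # 0.49
```

Under these assumptions the two slower cores match the throughput of the single fast one at about half the dynamic power, but only if the work really does split evenly.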

Thankfully, they go into some detail about the difficulties one might encounter when writing software for a 1000 core microprocessor. This gives me some hope that hardware designers might have a clue about how challenging it is to write such fine-grained parallel code. Some of the issues they mention are: inefficiencies in cache-coherence protocols; the difficulty in writing reliable code using synchronization locks; the immaturity of techniques, like transactional memory, which provide higher levels of abstraction for parallelism; and the cognitive effort required to use programming models like MPI, used by my friends in the high performance computing (HPC) community, which requires that the developer explicitly map computational tasks onto specific processors.

It's not like it's a new problem. I remember back in my HPC days how researchers much preferred the eight-processor Cray Research supercomputer to the 64,000 processor Connection Machine MPP that sat right next to it. I explained it to Mrs. Overclock this way: if you want to build a deck on the back of your house, which is easier: doing it with eight master carpenters, or with 64,000 junior high schoolers taking shop? If you could figure out some way to keep all of the students productive, you might just get it done faster. But the effort involved might also lead you to conclude that you would have been better off just building the deck by yourself.
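
The deck analogy can be given rough numbers with Amdahl's law, extended with a per-worker speed. This is my own back-of-the-envelope model, not anything measured on the Cray or the Connection Machine, and the ninety-nine percent parallel fraction and the student speed of one hundredth of a master carpenter are made-up values.

```python
def speedup(parallel_fraction, workers, worker_speed=1.0):
    """Speedup over one full-speed worker doing the whole job alone.

    Amdahl's law with one extension: every worker runs at worker_speed
    relative to the baseline worker. The serial part of the job cannot
    be shared, so it runs on a single worker at that speed.
    """
    serial_fraction = 1.0 - parallel_fraction
    serial_time = serial_fraction / worker_speed
    parallel_time = parallel_fraction / (workers * worker_speed)
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    # Eight master carpenters on a job that is 99% parallelizable:
    # roughly a 7.5x speedup.
    print(f"{speedup(0.99, 8):.2f}")
    # 64,000 students, each (hypothetically) 1/100th as fast:
    # slightly worse than one carpenter working alone.
    print(f"{speedup(0.99, 64000, worker_speed=0.01):.2f}")
```

The serial one percent is what sinks the students: no matter how many of them there are, one slow worker still has to do that part.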

And it's also not like we aren't already facing this issue. Today's microprocessors have already broken the memory models of many programming languages; witness the fixes that had to go into Java 1.5 so that common multi-threaded design patterns like double-checked locking would work reliably. There are legacy C/C++ code bases using multiple threads that cannot run reliably on a multi-core processor, or even on a hyper-threaded processor that merely pretends to have multiple cores, without a significant porting effort. Until these issues are resolved, such code bases have stepped off the Moore's Law curve. Just let me off at the corner, thanks, I'll walk from here.

I find it unlikely that developers will be able to effectively utilize a thousand-core processor using the kind of coarse-grained parallelism offered by the POSIX thread library or the Java thread model. It's more likely that languages will have to evolve, in much the way that the standard language of supercomputing, FORTRAN, has: to incorporate fine-grained parallelism in language constructs like the do-loop, as part of its syntax. Or even to implicitly define parallelism as part of the language semantics, in a way similar to the research projects I remember from graduate school when the Japanese Fifth Generation project was looming on the Eastern horizon.
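
As a loose illustration of the difference, here is a Python sketch in which the loop body is declared independent and the runtime, not the programmer, decides which worker runs which iteration. It is a library-level analogue only, not language syntax of the kind FORTRAN has grown, and the function names are hypothetical. CPython threads will not actually speed up arithmetic like this; the point is the shape of the code.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_and_offset(x):
    """Loop body with no dependence on any other iteration."""
    return 2.0 * x + 1.0


def sequential_loop(values):
    """The ordinary loop: iterations run one after another."""
    return [scale_and_offset(x) for x in values]


def parallel_loop(values, workers=4):
    """The same loop, with iterations handed to a pool of workers.

    The caller never says which worker handles which element;
    Executor.map returns results in the original order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale_and_offset, values))


if __name__ == "__main__":
    data = [float(i) for i in range(10)]
    print(parallel_loop(data) == sequential_loop(data))  # True
```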

Parallelism has been a running theme during my thirty-plus-year career, from writing interrupt-driven real-time code, to designing distributed systems, to supporting supercomputers, to working on embedded projects with as many as thirty separate processing units on a single board, to developing enterprise applications suitable for multi-core servers. I look forward to seeing where things will go.


K. Asanovic et al., The Landscape of Parallel Computing Research: A View from Berkeley, UCB/EECS-2006-183, Electrical Engineering and Computer Science, U. C. Berkeley, December 18, 2006


Chip Overclock said...

When space cowboy Steve Tarr isn't sending missions to deep space, he comments on my blog:

A comment on your "Small is Beautiful..." article. I think the Berkeley folks see the big picture (having not read their work), but I think your example is on the fringe. The many are the ubiquitous micro-controllers and in ever increasing ASIC/FPGA implementations. The future is not a bunch of RISC supporting execution of a single large task (super computer centric thinking), but small processors pre/post processing and controlling things. Robotics is a good example. Sensor preprocessors feed image analysis and object identification processors that feed attitude and direction processors that send commands to motion control that feeds motor controllers.

Chip Overclock said...

Having stood in both the HPC world and in the embedded world, I get your drift. There are a lot of commonalities between these two worlds, but most of the time they don't even realize the other exists. However, this Berkeley paper surprised me in that it explicitly stated that many of their ideas were drawn from the embedded world. The paper mentions specifically that the traditional embedded real-time approach to parallelization - message passing - would likely be a usable approach for high-core-count designs. What I think will happen is that instead of building boards that contain thirty distinct processing elements (as you and I have worked on together), cores will be dedicated to specific purposes in a product even though they are all in a single microprocessor. (It is already common, as you know, that embedded processors have multiple special purpose cores for communications or digital signal processing.) Embedded guru Jack Ganssle mentions in his "Better Firmware Faster" seminar that spreading development across multiple processors actually reduces time-to-market, so I expect this multi-core approach to be very attractive to both the HPC and the embedded camps.
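
For what it's worth, the message-passing style can be sketched in a few lines of Python, with threads standing in for dedicated cores and queues standing in for whatever interconnect the hardware provides. The stage functions are hypothetical, loosely modeled on the robotics pipeline in the comment above; none of this comes from the Berkeley paper.

```python
import queue
import threading

STOP = object()  # sentinel message that tells a stage to shut down


def stage(transform, inbox, outbox):
    """Receive messages from inbox, transform them, and pass them on.

    A stage shares no state with its neighbors; the queues are its
    only contact with the rest of the system.
    """
    while True:
        message = inbox.get()
        if message is STOP:
            outbox.put(STOP)  # propagate shutdown downstream
            return
        outbox.put(transform(message))


def run_pipeline(readings, transforms):
    """Run each transform as its own thread, chained by FIFO queues."""
    queues = [queue.Queue() for _ in range(len(transforms) + 1)]
    threads = [
        threading.Thread(target=stage, args=(t, queues[i], queues[i + 1]))
        for i, t in enumerate(transforms)
    ]
    for thread in threads:
        thread.start()
    for reading in readings:
        queues[0].put(reading)
    queues[0].put(STOP)
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Hypothetical stages: scale a raw range reading, classify it,
    # then choose a motor command.
    stages = [
        lambda raw: raw * 0.5,
        lambda meters: "near" if meters < 1.0 else "far",
        lambda label: "stop" if label == "near" else "forward",
    ]
    print(run_pipeline([0.5, 4.0, 1.0], stages))
    # ['stop', 'forward', 'stop']
```

Because each stage is a single thread reading from a FIFO queue, results come out in the order the readings went in, and no locks appear anywhere in the stage code.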

Chip Overclock said...

Embedded developer and long-time adopted nephew Todd Blachowiak offers the following comments, echoing many of the same thoughts as Mr. Tarr:

I'm not sure I buy into the homogenous massively multi-core model for the embedded world. Much of the embedded world is also dealing with external devices, so a model of two CPU classes makes more sense to me. One class that is optimized to deal with computation and memory in a high performance way and another class that is device control oriented. Modern high performance cores with the memory buffering, coalescing and caching can play havoc on device control if the programmer is not highly tuned into each specific architecture in the path to/from the device they are controlling. This is only exaggerated if trying to program in a high level language.

A distinct CPU designed to aid device control is an approach I'd like to see. This would allow you to have an additional device control bus. This limits bus interference from the high performance general purpose cores.

But certainly give me many of each CPU type.