Sunday, December 21, 2014

Is the General-Purpose Processor a Myth?

In a recent article in ACM Queue, the Association for Computing Machinery's magazine for practicing software engineers, David Chisnall argues that "There's No Such Thing as a General Purpose Processor" [12.10, 2014-11-06]. And he has some interesting stuff to say along the way.
It's therefore not enough for a processor to be Turing complete in order to be classified as general purpose; it must be able to run all programs efficiently. The existence of accelerators (including GPUs) indicates that all attempts thus far at building a general-purpose processor have failed. If they had succeeded, then they would be efficient at running the algorithms delegated to accelerators, and there would be no market for accelerators.
His argument is that modern processors implement a specific memory and processing model suited for the execution of C-like languages, which makes them unsuited for the kinds of applications for which we use various specialized hardware like graphical processing units (GPUs) and digital signal processors (DSPs). If modern CPUs were indeed general purpose, they would be able to run GPU- or DSP-style applications efficiently.

I don't agree with him -- I would say being general purpose means that modern processors can run graphical or digital signal processing applications at all, not that they are necessarily optimal for doing so -- but I get his point. Modern processors are as specialized as GPUs and DSPs in the sense that they are designed around a particular application model.
The ability to run an operating system is fundamental to the accepted definition. If you remove the ability to run an operating system from a processor that is considered general purpose, then the result is usually described as a microcontroller. Some devices that are now regarded as microcontrollers were considered general-purpose CPUs before the ability to run a multitasking, protected-mode operating system became a core requirement.
I kinda sorta like this too as the definition of a microcontroller. True, I am among many who have run FreeRTOS on an eight-bit microcontroller, but FreeRTOS isn't a protected-mode operating system. And, remarkably, Dmitry Grinberg actually booted Linux on an eight-bit microcontroller, although it wasn't pretty.
Parallelism in software comes in a variety of forms and granularity. The most important form for most CPUs is ILP (instruction-level parallelism). Superscalar architectures are specifically designed to take advantage of ILP. They translate the architectural instruction encodings into something more akin to a static single assignment form (ironically, the compiler spends a lot of effort translating from such a form into a finite-register encoding) so that they can identify independent instructions and dispatch them in parallel.
Any mention of single assignment gets my attention since it was the basis of my master's thesis a few decades ago. In a single assignment form, a variable can be assigned a value once and only once. Single assignment languages (like the one I implemented) sound impossible to write software in, but in fact they just demand a different coding style. For example, iteration can be done by recursion, where the different values a variable takes on in successive iterations are in fact held in different locations in the stack frame. I was surprised, years later, to discover that compiler writers depended upon translating conventional programming languages into a single assignment form for purposes of parallelization (which was exactly why I was using it in the early 1980s).
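Here is a minimal sketch in C of what I mean. C is of course not a single-assignment language, and the function names are mine, purely for illustration; the point is that the recursive version never reassigns any variable, because each "iteration" gets its own stack frame.

#include <stdio.h>

/* Iterative version: the variable sum is reassigned on every pass. */
static int sum_iterative(int n) {
    int sum = 0;
    for (int i = 1; i <= n; ++i) {
        sum = sum + i; /* multiple assignments to sum */
    }
    return sum;
}

/* Single-assignment style: nothing is ever reassigned; each
 * "iteration" is a recursive call, so each successive value of
 * total lives in its own stack frame. */
static int sum_recursive(int i, int n, int total) {
    if (i > n) {
        return total;
    }
    return sum_recursive(i + 1, n, total + i);
}

int main(void) {
    printf("%d %d\n", sum_iterative(10), sum_recursive(1, 10, 0));
    return 0;
}

Both functions compute the same sum; the recursive one is just the same computation expressed so that every name is bound exactly once, which is what makes the data dependencies (and hence the parallelism) easy for a compiler or a superscalar dispatcher to see.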
It's worth noting that, in spite of occupying four times the die area and consuming four times the power, clock-for-clock the ARM Cortex A15 (three-issue, superscalar, out-of-order) achieves only 75-100 percent more performance than the (two-issue, in-order) A7, in spite of being able (theoretically) to exploit a lot more ILP.
Chisnall is applying Amdahl's Law here: no matter how and at what level in the hardware architecture parallelism is implemented, an application has to be specifically designed to take advantage of it, and only a portion of it is likely to be able to do so. My former supercomputer colleagues would recognize his argument immediately, and would understand that algorithms designed to run efficiently on GPUs and DSPs are every bit as specialized as those that run well on supercomputers, the latter by virtue of being embarrassingly parallel.
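For reference, Amdahl's Law says that if a fraction p of a program's work can be parallelized across n processing elements, the overall speedup is bounded by 1 / ((1 - p) + p / n). A tiny sketch in C, with numbers I made up purely for illustration:

#include <stdio.h>

/* Amdahl's Law: if a fraction p of a program can be parallelized
 * across n processing elements, the overall speedup is bounded by
 * 1 / ((1 - p) + p / n), no matter how fast the parallel part runs. */
static double amdahl(double p, double n) {
    return 1.0 / ((1.0 - p) + (p / n));
}

int main(void) {
    /* Illustrative numbers only: three quarters of the work
     * parallelized across sixteen processing elements. */
    printf("speedup = %.2f\n", amdahl(0.75, 16.0));
    return 0;
}

Even with three quarters of the work parallelized, sixteen processing elements buy a speedup of only about 3.4; the serial fraction dominates, which is why the extra issue slots in a wider superscalar core pay off so modestly.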

Chisnall's article is worth a read. He has made me ponder how we may be missing out on radical improvements in efficiency because of the blinders we have on as we continue to design around special-purpose processors like the Pentium and ARM.