Saturday, October 28, 2006

Gene Amdahl and Albert Einstein

Gene Amdahl was a hardware architect from the age of Big Iron. His company, Amdahl, manufactured mainframes that were plug-compatible with those sold by IBM. In the 1960s there was a trend towards multiprocessor architectures, and towards the parallelization of software applications to take advantage of such multi-headed systems. Amdahl coined what would become known as Amdahl's Law, a rule of thumb that can be used to estimate the actual performance improvement you might expect from parallelization.

Amdahl's Law most frequently takes the form of

speedup = 1 / (F + ((1 - F) / N))

where F is the fraction of the computation that is by nature serial, (1-F) is the fraction that can be parallelized, and N is the number of processors you can spread the (1-F) portion across. What Amdahl was saying is: no matter how many processors you have, the speedup due to parallelization is limited by how much of the computation can actually take advantage of all those processors. Like all the best rules of thumb, this is kind of glaringly obvious and at the same time subtle. It all depends on making (1-F) big. If (1-F) is small, you can waste a lot of money making N as large as you want, to no avail.
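
Here is that arithmetic in a few lines of C. The ten percent serial fraction is just an assumption for the sake of illustration, but it shows how fast the returns diminish: no matter how big N gets, the speedup can never exceed 1/F, or ten in this case.

    #include <stdio.h>

    /* Speedup predicted by Amdahl's Law: f is the serial fraction,
     * n is the number of processors applied to the (1 - f) portion. */
    static double amdahl(double f, int n)
    {
        return 1.0 / (f + ((1.0 - f) / n));
    }

    int main(void)
    {
        int n;
        /* With an assumed serial fraction of ten percent, the speedup
         * can never exceed 1/F = 10, no matter how many processors
         * you throw at the rest. */
        for (n = 1; n <= 1024; n *= 4) {
            printf("F=0.10 N=%4d speedup=%.2f\n", n, amdahl(0.10, n));
        }
        return 0;
    }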

As many people are learning today with the new multi-core microprocessors, and as my mainframe and supercomputer friends could have told you thirty years ago, designing a software application to take advantage of multiple processors is no small feat. Typically, in all but a handful of very specialized applications, the kind that seem to occur only if you are doing classified research for the Department of Energy, only a portion, sometimes a small portion, of a program lends itself to the kind of fine-grained parallelization needed to take advantage of large numbers of processors. The programs that do lend themselves to massively parallel processors tend to be composed of many independent computations. Otherwise the communications overhead or the serialization required for synchronization kills you.

Google has made stunningly effective use of large numbers of clustered processors using their MapReduce software architecture. I got a little verklempt with nostalgia when I read the paper by Dean and Ghemawat on MapReduce, because it reminded me of my days in graduate school decades ago, working on software designs for the kind of massively parallel hardware architectures envisioned by the Japanese and their Fifth Generation research. But as clever as MapReduce is, the applications that map well to it are still far too limited for just the same reasons.

Just to make matters worse, as folks like David Bacon, Scott Meyers, and Andrei Alexandrescu are quick to point out, even coarse-grained parallelization, like the kind you find in multi-threaded applications, is fraught with subtle peril, thanks to hardware memory models that bear little resemblance to how most developers think memory actually works. I sometimes find myself thinking of the thousands and thousands of lines of embedded C, C++ and Java code I have written in the past decade, wondering if any of it would actually work reliably on a modern multi-core processor.
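
As a concrete illustration of the kind of peril those gentlemen write about, here is the classic double-checked locking idiom, sketched in C with POSIX threads rather than in the Java and C++ of their papers; the Widget type and its initialization are made up for the example, but the shape of the bug is the same one they dissect.

    #include <stdlib.h>
    #include <pthread.h>

    typedef struct Widget { int value; } Widget;

    static Widget * instance = NULL;
    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

    /* The classic double-checked locking idiom: check the pointer once
     * without the lock to avoid the cost of locking on every call, then
     * check it again with the lock held before creating the singleton. */
    Widget * widget_instance(void)
    {
        if (instance == NULL) {                 /* first, unlocked, check */
            pthread_mutex_lock(&mutex);
            if (instance == NULL) {             /* second, locked, check */
                Widget * temp = (Widget *)malloc(sizeof(Widget));
                temp->value = 42;               /* initialize */
                instance = temp;                /* publish */
            }
            pthread_mutex_unlock(&mutex);
        }
        /* BROKEN: the compiler or the processor may reorder the
         * initialization and the publication of the pointer, so a second
         * thread passing the unlocked check can see a non-null instance
         * whose fields have not yet been initialized. Without explicit
         * memory barriers this idiom cannot be made reliable. */
        return instance;
    }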

But the beauty of Amdahl's Law is that it is broadly applicable, well beyond just making effective use of lots of processors. My favorite example is latency in distributed applications.

When you try to send data from one computer to another, whether it's a book order to Amazon.com in the form of a SOAP message, or a pirated movie from your favorite peer-to-peer site, it takes a while to get there. If you try to speed things up by getting a faster network, you find out that what constitutes faster is not so simple. Network latency occurs in lots of places between the time that first bit leaves the near end and the last bit arrives at the far end. There's the software processing latency that takes place in the application, in the network protocol stack, and in the NIC device driver, before the data even gets on the wire. There's the bandwidth of the wire itself. There's the same software processing latency at the far end.

And then, as Albert Einstein would tell you, there's the speed of light.

See, when you upgrade your unbearably slow ten megabit per second network interface card to that shiny new gigabit per second card, all you have really done is improve one small (1-F) of the total network latency equation: the transmission latency. In fact, if your computer is not capable of feeding that NIC with data at a gigabit per second, you may not have really accomplished anything at all. But even if it can, each bit is still governed by the ultimate speed limit: the rate at which a signal can propagate across the wire, and that rate is never faster than the speed of light, about 300 million meters per second. No matter how fast your network is, that first bit is still going to take some time to travel end to end. 300 million meters per second sounds fast, until you start moving Serious Data over serious distances; then you start to realize that the propagation latency is going to kill you.
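
A little back of the envelope code makes the point. The one megabyte transfer and the four thousand kilometer distance are assumptions I picked for illustration, and a real signal propagates at something closer to two thirds of c, but the shape of the result holds: the faster NIC shrinks only the transmission term, while the propagation term stays right where Einstein put it.

    #include <stdio.h>

    int main(void)
    {
        const double BITS = 8.0 * 1024.0 * 1024.0;  /* a one megabyte transfer (assumed) */
        const double METERS = 4000.0 * 1000.0;      /* say, coast to coast (assumed) */
        const double C = 300.0e6;                   /* speed of light, meters per second */
        double bandwidth;

        for (bandwidth = 10.0e6; bandwidth <= 1000.0e6; bandwidth *= 10.0) {
            double transmission = BITS / bandwidth; /* time to clock the bits onto the wire */
            double propagation = METERS / C;        /* time for the first bit to arrive */
            printf("%5.0f Mb/s: transmission %.4fs + propagation %.4fs = %.4fs\n",
                bandwidth / 1.0e6, transmission, propagation, transmission + propagation);
        }

        return 0;
    }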

How do different network technologies actually achieve ever higher bandwidths? Sometimes they really do make data travel faster down the wire (but never faster than the speed of light). The now defunct Cray Research Corp. wrapped their signal wires in Gore-Tex insulation because it lowered the dielectric constant, which actually decreased the propagation delay. The speed of signal propagation in fiber-optic cable is faster than it is in copper wire. But mostly, whether using copper or fiber, they cleverly encode a bunch of bits in such a way that a single signal carries multiple bits of data, all of which arrive at the same time. The signal may take the same time to arrive at the far end, but once it gets there, you get more for your money.

Of course, on Star Trek, they were always using tachyon beams, Einstein be damned.

There are other ways to slow down even infinitely high bandwidth networks, some of which impact software designs at the application level. If you send a byte of data to the far end and have to wait for a reply before sending any more, you are going to take double the signal propagation hit on every single byte. You will have to wait for the byte to get to the far end, then wait for the reply to come back, so that you know your transmission was successful and you do not have to resend your byte. Network protocols and the applications that use them get around this latency by sending honking big packets of data at a time, and sending many packets before they have to stop and wait for an acknowledgement.
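
Here is a small sketch of how badly the wait-for-a-reply pattern hurts; the thirty millisecond round trip is just an assumed figure, but with only one payload in flight per round trip the effective throughput is simply the payload size divided by the round trip time, no matter how fat the pipe is.

    #include <stdio.h>

    int main(void)
    {
        const double RTT = 0.030;   /* an assumed thirty millisecond round trip */
        double bytes;

        /* One payload per round trip: throughput = payload / RTT,
         * independent of the bandwidth of the wire. */
        for (bytes = 1.0; bytes <= 65536.0; bytes *= 256.0) {
            printf("%8.0f bytes per exchange: %12.0f bytes per second\n",
                bytes, bytes / RTT);
        }

        return 0;
    }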

You can compute how much out-going data you have to buffer in memory in order to make effective use of your network pipe. Network protocols typically call this the window size. It is computed from the bandwidth-delay product. Take the size of your network pipe in terms of, say, bits per second, and multiply it by the round trip propagation delay in seconds for sending a bit down the wire and getting a reply bit back. The result is the number of bits you have to buffer in memory waiting for an acknowledgement from the far end.

The nasty part about this is that the higher the bandwidth of the network, the larger the bandwidth-delay product is, and you can never reduce the round trip propagation delay below that implied by the speed of light. I once worked on a project to send data between two supercomputers on an OC-3 link via a geosynchronous satellite. The bandwidth was 155 megabits per second. The round trip propagation delay was half a second. Do the math. We had to buffer nine megabytes of data at both ends of the socket. And this was on manly high speed SRAM memory, not wimpy DRAM with a virtual backing store. This was surely one expensive distributed application.
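
Doing that math in a few lines of C, using the OC-3 and satellite numbers above, gives the window size directly; the little helper function is just my own illustration of the bandwidth-delay product, not code from that project.

    #include <stdio.h>

    /* Bandwidth-delay product: the bits that must be in flight, and
     * therefore buffered awaiting acknowledgement, to keep the pipe full. */
    static double window_bits(double bits_per_second, double round_trip_seconds)
    {
        return bits_per_second * round_trip_seconds;
    }

    int main(void)
    {
        /* The OC-3 over geosynchronous satellite example: 155 megabits
         * per second with a half second round trip propagation delay. */
        double bits = window_bits(155.0e6, 0.5);

        printf("window: %.0f bits = %.0f bytes (about %.1f binary megabytes)\n",
            bits, bits / 8.0, bits / 8.0 / 1048576.0);

        return 0;
    }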

So before resorting to the easy trick of upgrading to a faster network technology, do some back of the envelope calculations to see if your investment will pay off. And do the same before buying that shiny new multi-processor/multi-core server.

Sources

David Bacon et al., "The 'Double-Checked Locking is Broken' Declaration"

Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004

Scott Meyers and Andrei Alexandrescu, "C++ and the Perils of Double-Checked Locking", Dr. Dobb's Journal, July and August 2004

J. L. Sloan, "Introduction to TCP Windows and Window Shifting/Scaling", Digital Aggregates Corp., 2005

J. L. Sloan, "Vaster Than Empires and More Slow: The Dimensions of Scalability", Digital Aggregates Corp., 2006

Wikipedia, "Amdahl's Law", October 2006

2 comments:

Chip Overclock said...

One of the dangers of having a Ph.D. physicist on the payroll is he is always correcting your math. Dr. Howard says:

"The speed of bits in transmission media is governed by the group velocity, which is always less than the phase velocity. In vacuum, the group and phase velocities are both c, and their product remains c squared.
The index of refraction can be interpreted as the ratio of c to the phase velocity. Hence for glass (n~1.5), the phase velocity (or speed of bits) is about 2/3c.
But the general point about optimizing is quite common in all system applications. One must always consider how each component contributes to the process you are trying to improve. Sometimes, making one less efficient is required to improve the whole, but owners of that subprocess frequently are hesitant to accept that conclusion."

Chip Overclock said...

Network engineer and former colleague Paul Hyder adds:

"And just to add to your throughput information it turns out that the maximum throughput with a TCP like protocol is:

Rate = MSS/RTT * 0.7/sqrt(p)

Where MSS is the Max Segment Size, RTT is the round trip time, and p is the probability of packet loss. Basically at speeds above 1Gbit/sec you need either extremely low packet loss or larger MSS. (Mathis et al. 1997)"