Monday, December 25, 2006
Traffic Management
Actually, don't. Even your sewer system is based on some stochastic assumptions, and when those assumptions are violated, congestion occurs. In the case of toilets, I think you know what kind of congestion I mean.
Even in relatively simple messaging systems involving a single process or a single Java virtual machine, shared resources can be temporarily exhausted if many components try to firehose messages at each other with no feedback mechanism to effect some kind of rate control. At best, performance slows down for a spell until things sort themselves out. At worst, buffer pools run dry, the heap is exhausted, and the entire system may even deadlock. This is a well-known failure mode in real-time systems, yet I routinely see developers writing code that assumes that they can fire away messages as quickly and as often as possible with little or no regard as to whether the recipient is in a position to receive them.
There are a lot of mechanisms used in real-time systems that can control how quickly a message producer may send a message so as to not overwhelm the message consumer that receives it. (It's important to understand that producer and consumer are roles played by communicating applications for the duration of a single message exchange. An application may and typically will act as both a producer and a consumer.) Below are brief descriptions of some of the traffic management mechanisms I've encountered, from the simplest to the most complex.
XON/XOFF
XON/XOFF is a form of traffic management used for flow control that will be familiar to anyone who remembers ASCII terminals. XON is the ASCII character produced by typing control-Q. XOFF is a control-S. The X is a common abbreviation for "trans" as in "transmit". When the consumer's buffer is full (or nearly so) it sends an XOFF to the producer, who suspends sending until it receives an XON. XON/XOFF is probably the most widely used traffic management mechanism on the planet by virtue of it being implemented in nearly every RS-232 serial device in existence.
The downside of XON/XOFF is that it doesn't scale well with latency and bandwidth. Just to make the problem obvious, let's assume that your producer and consumer are using a 155 megabit per second (OC-3) communication channel over a geosynchronous satellite link. Data transmitted via geosynchronous satellites have a round-trip time (RTT) of about half a second just due to the speed of light; it takes a quarter of a second for the data to go up to the satellite and back down to the consumer, and another quarter of a second for the reply to make the same trip in the other direction.
By the time the consumer sends an XOFF to the producer, the producer operating at full channel speed will already have sent over thirty-eight megabits of data that is just travelling along the radio signal up to the satellite and back down to earth, and will send another thirty-eight megabits in the time it takes the XOFF to travel from the consumer to the producer over that same channel.
To use a simple mechanism like XON/XOFF, the consumer has to be able to send the XOFF to the producer while it still has seventy-seven megabits of slack in its input buffer in order to receive all the data that is in transit, or will be in transit, before the producer can shut off.
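The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `xoff_slack_bits` is my own invention, and the figures are the ones assumed in the text, a 155 megabit per second channel with a quarter-second one-way delay.

```python
def xoff_slack_bits(bandwidth_bps: float, one_way_delay_s: float) -> float:
    """Buffer slack the consumer needs at the moment it sends XOFF.

    Data already in flight toward the consumer, plus data the producer
    keeps sending while the XOFF travels back, must both fit.
    """
    in_flight = bandwidth_bps * one_way_delay_s          # already on its way
    sent_during_xoff = bandwidth_bps * one_way_delay_s   # sent before XOFF lands
    return in_flight + sent_during_xoff


slack = xoff_slack_bits(155e6, 0.25)
print(f"{slack / 1e6:.1f} megabits of slack")  # 77.5 megabits of slack
```

Doubling either the bandwidth or the delay doubles the slack, which is the scaling problem in a nutshell.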
ACK/NAK
ACKnowledge and Negative AcKnowledge is probably the most straightforward traffic management mechanism. The producer is blocked upon sending a message until the consumer either acknowledges or fails to acknowledge receiving the message. It reliably prevents the producer from overwhelming the consumer.
Its biggest downside is performance. Assume the worst case where the producer sends one bit of data over that same satellite link, then waits for the acknowledgement from the consumer. It takes the one bit of data a quarter second to go from the producer to the consumer. Ignoring any processing latency, it takes the acknowledgement another quarter of a second to go from the consumer to the producer. It took a half-second to transmit a bit from producer to consumer before a second bit could be sent, so your 155Mb/s channel has an effective bandwidth of two bits per second. This is why communications channels are more efficient with larger packet sizes.
Suppose the producer instead sends a megabit before it waits for a reply. The first bit of that megabit still takes a quarter of a second to get from the producer to the consumer, and the last bit arrives more than six milliseconds after the first at 155Mb/s. Then the acknowledgment takes another quarter of a second to go from the consumer back to the producer. Your channel now has an effective bandwidth of just under two megabits per second. You are still only utilizing a little over one percent of your channel.
What is the magic number for packet size? With a simple protocol like ACK/NAK, there is no magic number. No matter how many bits you send before turning the channel around, you are still wasting bandwidth when you stop transmitting and wait for an acknowledgement.
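Here is a minimal sketch of that reasoning, assuming the same 155Mb/s channel and half-second RTT and ignoring processing latency. The function name `stop_and_wait_bps` is hypothetical. It computes the effective bandwidth when the producer stops and waits after every message.

```python
def stop_and_wait_bps(message_bits: float, bandwidth_bps: float, rtt_s: float) -> float:
    """Effective throughput when the producer blocks after every message."""
    transmit_time = message_bits / bandwidth_bps   # time to clock the bits out
    return message_bits / (transmit_time + rtt_s)  # one message per cycle


CHANNEL = 155e6   # bits per second
RTT = 0.5         # seconds

for bits in (1, 1e6, 1e9):
    rate = stop_and_wait_bps(bits, CHANNEL, RTT)
    print(f"{bits:>13,.0f} bits per ACK: {rate / CHANNEL:7.2%} of the channel")
```

However large the message, the RTT term in the denominator never goes away, so utilization creeps toward one hundred percent without ever reaching it.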
The ACK/NAK pattern shows up more often than you might think, even if your application isn't using a communications network at all. Synchronous message passing, where the producer blocks waiting for the consumer to receive the message, is a form of the ACK/NAK pattern. Synchronous message passing does not scale well as the complexity of the application grows, yet it is very common in web services or in any remote procedure call (RPC) mechanism like DCE, CORBA, or Axis. Applications may unwittingly make requests concurrently to each other, each blocking while it waits for an acknowledgement that will never come because the two ends are now deadlocked waiting for each other.
Synchronous message passing can also be a subtle form of priority inversion. While an application acting as a producer is blocked waiting for an acknowledgement from its consumer, other perhaps higher priority applications are blocked trying to send messages to the blocked application in its role as a consumer.
Windowing and Credit
To use the full capacity of your data channel, you would like to keep it full all the time. To do that, the producer has to be able to put enough new data in the channel to keep it full while an acknowledgement for prior data travels from the consumer. Communications protocols handle this by implementing a windowing or credit mechanism.
A window is a fixed amount of data that may be outstanding at any one time, as yet unacknowledged by the consumer. The producer is allowed to send this much data before being forced to wait for an acknowledgement. The producer guarantees that it has at least this much buffer space so that it can save the data it has already sent, in case it receives a negative acknowledgement from the consumer and has to send it again. Examples of windowing protocols are ISDN's Link Access Protocol for the D-Channel (LAP-D) which is used for signaling in non-IP digital telephony, and IP's Transmission Control Protocol (TCP). The producer and consumer may agree on a window size as part of their initial connection setup, or the window size may be fixed as part of the protocol itself.
A credit is the amount of free buffer space in the consumer signalled to the producer as part of an acknowledgement. This feedback mechanism lets the producer know how much more data it is allowed to send before having to stop and wait for more credit. The consumer guarantees that it has buffer space at least as large as the credit it sends. Because a consumer may be receiving messages from more than one producer into the same buffer space, the credit that a producer sees may bear little relation to the amount of data it has actually sent to the consumer. TCP/IP also uses a credit mechanism.
The optimal window or initial credit size is the bandwidth-delay product. The bandwidth-delay product of our satellite link is the 155 megabits per second bandwidth multiplied by the half second RTT, or more than nine megabytes. Although most communication channels don't suffer the RTT of a geosynchronous satellite link, many, such as gigabit Ethernet, have substantially higher bandwidth, still yielding a large bandwidth-delay product. To make effective use of channels with high bandwidth or high RTT (or both), both the producer and the consumer require large amounts of memory for buffering outgoing and incoming data. (Think what implications there might be regarding the reliability of your communications network when you have a high bandwidth-delay product. In an unreliable network, the producer may well spend most of its time and network bandwidth retransmitting a huge amount of buffered data.)
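The bandwidth-delay product is simple enough to sketch in a few lines. The function name is hypothetical, and the 10 millisecond RTT for the gigabit link is purely my assumption for illustration; only the satellite figures come from the text above.

```python
def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the channel full for one RTT."""
    return bandwidth_bps * rtt_s / 8


links = {
    "OC-3 over geosynchronous satellite": (155e6, 0.5),
    "gigabit link, assumed 10 ms RTT": (1e9, 0.010),  # RTT is an assumption
}
for name, (bandwidth, rtt) in links.items():
    megabytes = bandwidth_delay_product_bytes(bandwidth, rtt) / 1e6
    print(f"{name}: {megabytes:.2f} megabytes")
```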
The windowing and credit patterns also show up more often than you would think. They are both a form of asynchronous message passing in which a limit is placed on how far a producer may get ahead of a consumer before being blocked waiting for an acknowledgement. Allowing applications to use unfettered asynchronous message passing does not scale well as load increases. Resources in the underlying message passing mechanism can be quickly exhausted by an enthusiastic producer with a slow-witted consumer. This can lead to a deadlock of the entire system.
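To show how the credit pattern might look inside a single process, here is a sketch using nothing but the Python standard library. The `CreditChannel` class and the `demo` function are hypothetical names of my own, not part of any messaging framework. The producer spends one credit per message and blocks when it runs out; the consumer hands a credit back each time it takes a message.

```python
import queue
import threading


class CreditChannel:
    """Hypothetical in-process channel with credit-based flow control."""

    def __init__(self, credits: int) -> None:
        self._credits = threading.Semaphore(credits)
        self._messages = queue.Queue()

    def send(self, message) -> None:
        self._credits.acquire()        # blocks when the credit is exhausted
        self._messages.put(message)

    def receive(self):
        message = self._messages.get()
        self._credits.release()        # acknowledge: grant one more credit
        return message

    def outstanding(self) -> int:
        return self._messages.qsize()  # messages sent but not yet received


def demo(credits: int = 4, count: int = 20) -> int:
    """Run an eager producer against this thread; return the high-water mark."""
    channel = CreditChannel(credits)

    def producer() -> None:
        for n in range(count):
            channel.send(n)

    thread = threading.Thread(target=producer)
    thread.start()
    high_water = 0
    received = []
    while len(received) < count:
        high_water = max(high_water, channel.outstanding())
        received.append(channel.receive())
    thread.join()
    assert received == list(range(count))
    return high_water


print("most messages ever outstanding:", demo())
```

No matter how enthusiastic the producer, the number of outstanding messages never exceeds the initial credit, so the memory the channel can consume is bounded up front.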
Traffic Shaping and Policing
All of the flow control mechanisms discussed so far require consent between the producer and the consumer. Because they all involve temporarily pausing the producer, they also introduce jitter into the data stream. Jitter is variation in the inter-arrival time (IAT) of successive packets in the data stream.
Jitter isn't generally a problem for data transfer using applications like FTP. It can be inconvenient when using interactive applications like TELNET. But it wreaks havoc with data streams that have real-time constraints like audio or video streams or VOIP. Jittered packets in real-time streams may arrive too early for playback and have to be buffered, or too late and have to be discarded completely. (What the audio or video player or VOIP telephone may play back in place of the discarded packet is an interesting problem left as an exercise to the reader.)
Technologies like Asynchronous Transfer Mode (ATM) impose traffic contracts on individual data streams. A traffic contract is a formal description of the bandwidth and burstiness characteristics of the data stream plus any real-time constraints it may have regarding jitter. ATM devices police traffic by marking incoming packets that exceed their data stream's specified traffic contract. Marked packets may be immediately dropped by the hardware, or dropped later downstream should congestion occur. Your local speed trap is a form of traffic policing. ATM devices may shape traffic by timing the introduction of outgoing packets into the network to conform to their data stream's traffic contract. Timed stoplights at freeway entrance ramps are a form of traffic shaping.
Although traffic policing and shaping to an arbitrary traffic contract may sound complicated, there are relatively simple algorithms, such as the virtual scheduling algorithm described in the ATM Forum's Traffic Management 4.0 specification, that implement them. Such algorithms are efficient enough that in ATM equipment they are implemented in commercially available chipsets.
Traffic shaping is implemented solely by the producer, and traffic policing solely by the consumer. Traffic shaping smooths data flow, while traffic policing prevents congestion. I like traffic policing and shaping mechanisms not only because I've implemented a few in commercial products, but also because they allow me to reason about the maximum inflow and outflow rates of packets while troubleshooting a customer system.
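To give a feel for how small a policing algorithm can be, here is my own sketch in the spirit of the virtual scheduling algorithm mentioned above. It is not copied from the specification, and the class name, method name, and the numbers in the example are all hypothetical. The policer tracks a theoretical arrival time (TAT) for the next packet; a packet that shows up earlier than the TAT by more than the allowed limit is non-conforming.

```python
class VirtualScheduler:
    """Sketch of a virtual-scheduling policer for one data stream."""

    def __init__(self, increment: float, limit: float) -> None:
        self.increment = increment  # nominal spacing between packets
        self.limit = limit          # tolerance for packets that arrive early
        self.tat = None             # theoretical arrival time of next packet

    def conforms(self, arrival: float) -> bool:
        if self.tat is None or arrival > self.tat:
            self.tat = arrival                  # first or late: restart from now
        elif self.tat - arrival > self.limit:
            return False                        # too early: TAT is unchanged
        self.tat += self.increment              # schedule the next packet
        return True


policer = VirtualScheduler(increment=1.0, limit=0.5)
arrivals = [0.0, 1.0, 1.6, 2.0, 5.0]
print([policer.conforms(t) for t in arrivals])
# [True, True, True, False, True]
```

The packet at 2.0 is marked because the stream has already used up its tolerance by sending the previous packet early. A shaper can use the same state the other way around, by holding each outgoing packet until it would conform.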
Connection Admission Control
Connection Admission Control (CAC) in its simplest form is a function that returns true or false when given a traffic contract as an argument. CAC decides whether a network device can handle a new connection with the specified contract. CAC takes into account the traffic contracts of all existing connections, and its intent is to guarantee the fulfillment of existing contracts even if that means rejecting a new connection request.
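In its most naive form, CAC could be sketched as below. This version simply refuses any connection whose peak rate would push the total committed bandwidth past the capacity of the link. The class names and the single `peak_rate_bps` field are hypothetical simplifications of mine; a real traffic contract carries more than a peak rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficContract:
    peak_rate_bps: float   # hypothetical: the only parameter this sketch uses


class AdmissionController:
    """Peak-rate allocation: never commit more than the link can carry."""

    def __init__(self, capacity_bps: float) -> None:
        self.capacity_bps = capacity_bps
        self.admitted = []

    def admit(self, contract: TrafficContract) -> bool:
        committed = sum(c.peak_rate_bps for c in self.admitted)
        if committed + contract.peak_rate_bps > self.capacity_bps:
            return False                 # would endanger existing contracts
        self.admitted.append(contract)
        return True


cac = AdmissionController(capacity_bps=155e6)
print(cac.admit(TrafficContract(100e6)))  # True
print(cac.admit(TrafficContract(60e6)))   # False
print(cac.admit(TrafficContract(55e6)))   # True
```

Allocating by peak rate is safe but wasteful, because bursty connections rarely all peak at once.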
In service provider networks, rejecting a new connection often means rejecting revenue, so this is not done lightly. The ATM switch or other network device that can accept more connections than another similar device and still fulfill all of its contracts has a leg up on the competition. This kind of intelligence is typically the result of complex statistical traffic models. I have implemented several CAC algorithms in commercial products, but my designs were based upon theoretical work by Ph.D.s in ivory towers. (That didn't keep me from giving the occasional talk on the topic.)
SOA and EDA
I have tried to stress here and there that traffic management is important even if you are not using an actual communications network. Lately where I have seen its need emerge is in projects implementing a Service Oriented Architecture (SOA) or Event Driven Architecture (EDA), often over some kind of Enterprise Service Bus (ESB). Whether their designers realize it or not, these systems are real-time messaging systems, and suffer from all of the traffic management issues of more traditional communications systems.
For the past year I have been working with a team of engineers on an application that uses an implementation of the Java Business Integration (JBI) standard (JSR-208), a specification for a Java-based ESB. JBI defines both synchronous and asynchronous message passing mechanisms, but traffic management and flow control are left up to the applications that use it, by virtue of not being discussed in the specification. Companies using JBI or any other ESB (or indeed, implementing any kind of SOA or EDA at all) would be well advised to take into account issues of traffic management in their systems.
Sources
Chip Overclock, "Gene Amdahl and Albert Einstein", October 2006
N. Giroux, et al., Traffic Management Specification Version 4.0, ATM Forum, af-tm-0056.000, April 1996
L. He, et al., "Connection Admission Control Design for GlobeView-2000 ATM Core Switches", Bell Labs Technical Journal, 3.1, January-March 1998
J. L. Sloan, "Introduction to TCP Windows and Window Shifting/Scaling", Digital Aggregates Corp., April 2005
J. L. Sloan, "ATM Traffic Management", Digital Aggregates Corp., August 2005
Ron Ten-Hove, et al., Java Business Integration (JBI), JSR-208, August 2005
Thursday, December 21, 2006
Rocket Science as Product Development
At the recommendation of my friend and colleague Scott Thomas, I read Steve Squyres' book Roving Mars, a fascinating glimpse into how science, engineering, and politics play together in the rarefied world of interplanetary exploration. Squyres was the principal investigator for the Mars rover missions. In his book, Squyres describes how his science team came up with the idea of sending a mobile robot - a kind of instrument-laden all-terrain vehicle that you could fit in the trunk of your car - tens of millions of miles to another planet. How politics led them to send not one but two Mars Exploration Rovers (MERs), Spirit and Opportunity, to two very different locations on the Red Planet. And what they found when they got there.
If like me you occasionally gaze up longingly at a clear night sky, or if you've been known to read a novel or two by authors such as Banks, Baxter, Benford, or Brin, then you can be forgiven if your eyes, like mine, get a little teary at the thought of this kind of adventure.
The science described in the second half of the book is alone enough to make it compelling reading. Proof that Mars was a damp if not downright wet place sometime in the distant past. But for us engineers in the audience, the most riveting part is the first half, which is the best description I have ever read about the drama that is product development.
No, really.
Mars and Earth would be in perfect alignment and proximity for a launch in the summer of 2003. The launch window was only five weeks wide. And if they missed it, those perfect celestial conditions would not occur again for eighteen years. The twin-rover program cost something in the neighborhood of eight hundred million dollars, and was sixteen years in the making from inception to launch. If the engineers and developers missed their dates by more than that five week window, it was more than an inconvenience. It was more than a career limiting move. It was even more than a disaster. Some of the engineers who poured their hearts and minds into that project were old enough (like, my age) that it was within the realm of possibility that they would not be alive for another launch opportunity. It was pretty much hit that launch window or... well, the alternative didn't bear thinking about too much.
And so Squyres launches into a detailed description of the many, frequently painful, time-space-budget-feature tradeoffs that are part of every single product development program. And for this product development, failure was not an option. If you have friends and relatives that wonder what you do, give them this book, and tell them that maybe you don't send rovers to Mars (or maybe you do). But your day to day life is a lot like the people's in this book.
Those that follow the U.S. space program already know the ending of this story. Not only were the rovers successful, but wildly so. They far outlived their projected lifespans, continuing to do useful science well after the date when their batteries should have died because their solar panels were too covered in dust and the Martian winter of sixty below zero temperatures had frozen all their joints and axles. Part of this was just luck: the occasional Martian windstorm cleaned the dust from their solar panels, extending their lives. But part of it was good design and engineering.
One of the most edge-of-your-seat portions of the book was when the MER team at the Jet Propulsion Laboratory in Pasadena lost contact with Spirit. Squyres describes how the engineers, working on some hypotheses regarding firmware failures, pored through code to try to predict how such hypothetical faults could have occurred. One of the ideas they had was that the on-board flash memory used for storing data had somehow become corrupted, leading to rolling reboots of the main processor. (Other sources have revealed that this was a VxWorks system running on a RAD6000, a radiation-hardened processor in the RS/6000 family.) It wasn't a perfect hypothesis, because they could not figure out a way in which the flash file system could have gotten corrupted, nor could they duplicate it on the test rover on Earth. But the pattern of failure, where the rover would transmit mostly useless status information for just a few minutes then quit, fit most of the forensic data.
To test this theory, they had to transmit a command to the rover over a ten bit per second auxiliary channel (yes, about one character a second) with about a six minute round trip latency to tell it to reboot with a temporary file system built in RAM. They had to hope the rover received the command during the brief window when it was likely to be operating during the sunny Martian day and before it rebooted. That it would successfully execute the command. And that it would bring up the high bandwidth channel at the right time when there was enough solar power to transmit, when the rotation of Mars placed the rover correctly, when the relay satellites in orbit around Mars were in the right places, and when the reply could be received by one of three Earth stations in the U.S., Australia, and Spain. If this didn't work, half the mission and several hundred million dollars were down the tubes because of a corrupted flash file system. This is what people mean when they use the term "mission critical".
It worked. They were able to reassert control over the rover, reformat the flash, and return the rover to normal operation. But here's the part that only people who are in product development will really appreciate. The command they used to save their bacon was not in any requirements document. It was put in place by an engineer who placed the success of the mission ahead of meeting requirements.
Now having been both an engineer and a manager, I know it is a very slippery slope to allow engineers to insert their own unplanned features into a product. This has probably killed more products than it will ever save. It raises issues of economics and security. But what I think it comes down to is this: listen carefully to your engineers, particularly to what they are worried about. They may be trying to save your bacon.
I totally and completely identified with the engineers in this book. I can't tell you how many times I've worked on a project where half the team was sitting in front of a display talking to some firmware via a diagnostic port at a remote customer site over a 9600 baud modem connection, typing in simple commands trying to diagnose a severe problem, asking the on-site technician on the phone questions like "Would you describe the pattern of blinking of the red LED as 'Jingle Bells'?", while the other half of the team was poring over a hundred thousand lines of C++ code trying to come up with a failure mode that fit the facts. Is it any wonder I preach the need to include the capability for remote field troubleshooting at the lowest levels?
No doubt about it: space men, earth men, we are all in the same tribe.
Sources
Steve Squyres, Roving Mars, Hyperion, 2005
Wednesday, December 20, 2006
Rules of Thumb
My degrees are in computer science, so when I had to start wearing a tie - purely metaphorically speaking - I knew I needed some formal training. But some of the best stuff I learned about management, and about being a manager, I didn't learn in any class or seminar. Some of it I learned on the job. Some of it from a good mentor or two. Some by observation. And much of it in the school of hard knocks. I also read a hell of a lot.
Here are a few of the things I have learned.
It takes you at least as long to get out of trouble as it took you to get into it.
Say you have a corporate culture that sucks. How long will it take to turn it around? Jack Welch might be right that people's outward behavior can be changed quickly. But just because they are smiling and nodding doesn't mean they trust you as far as they can throw you. Or aren't thinking "up yours" to every point you try to make. If you had poor personnel policies for a couple of years, expect to take at least a couple more years before you have eradicated all of their negative effects.
For every action there is an equal and opposite reaction.
I once heard a department head complain about a lack of employee loyalty after the company we both worked for had spent the last several years laying off half of its employees. What did she expect? If you treat people with mistrust, expect to not be trusted. If you show disrespect, don't be surprised when you get no respect. If you treat people like idiots, don't expect to be winning any MacArthur grants.
You are not saving money if you have merely moved costs from where they can be measured to where they can't.
The classic case of this is eliminating desktop and laptop computer support to save those precious IT dollars that, let's face it, are just overhead. Like a lot of overhead, it enables your highly paid technical staff to do their jobs. By shifting the simpler support functions to your engineers, you now have a bunch of extremely expensive albeit disgruntled technicians. I've also seen engineers earning six figures doing the kind of clerical work that my administrative assistant used to do (and was really good at) at a fraction of the cost. Keep saving money like this and eventually you'll go bankrupt.
People pretty much act the way they are incented to act.
If your people exhibit some behaviors that are, let's say, contra-indicated, your first step should be to look at your system of incentives. I've written about and given talks on Robert Austin's book Measuring and Managing Performance in Organizations (Dorset House, 1996) in which he writes about how incentive programs drive dysfunction into organizations. Nelson Repenning of MIT has written on fire-fighting in organizations, how rewarding fire-fighting leads to more fires to fight, and if unchecked, leads to a death spiral. Managers try to motivate employees with all sorts of stuff that can only be classified as motherhood and apple pie. Some of it they might even believe themselves. But people have so much on their plates, they can't meet impossible deadlines, come in on tiny budgets, and produce high quality work. It's just impossible. The only way people know what the company really values is by carefully watching who is rewarded for doing what.
If you waste people's time, you are sending them a message that it is okay to be wasteful.
I once worked for a company that estimated the cost of big meetings by multiplying the number of participants by their average fully loaded cost. The division director would actually say "If we gather everyone in the auditorium, that's a $4000 meeting. Is it that important?" Everyone understood that time was money. And because it was wrong to waste time, it was also wrong to waste anything else.
On their deathbed, no one ever said "I wish I'd spent more time at work."
But you might say "I wish I'd done more writing." Or "spent more time with my kids." Or even "spent more time with smart engineers." When I'm racing out the door on my way to work, I always stop and pet the cats. Whatever it is for you, take the time to do it. Keep your priorities straight.
What rules of thumb do you use?
Saturday, December 16, 2006
Lunch-Time Tales
-- Moopheus
Sources
Rod Black
Jerry Dallmann
Doug Gibbons
Jim Homer
John Meiners
Tam Noirot
John Sloan
Tom Staley, The Matrix Transcript
The Matrix, a Wachowski Brothers film, 1999
Wednesday, November 22, 2006
Question-Driven Development
It took me more than thirty-five years of writing code to put a name to my personal software process: Question-Driven Development. I find that I am constantly asking questions about my code as it, more or less, magically appears on the screen in front of my eyes. Sometimes the question is what the hell is this? But just as frequently it is one of the following.
How will I test this during development?
- Can I unit test every single code path?
- Do I have error legs that are difficult to exercise?
- Have I checked my code coverage using tools like Cobertura?
- Do I have confidence that taking a rare error leg won't somehow escalate the problem?
- Are there private inner classes that need to be tested independently?
- Have I tested my setters and getters?
- Do I have reasonable defaults if the setters aren't used?
- If setters must be used, why aren't those settings set during construction?
- Can I have a no-argument constructor, and if not, why not?
- Have I tested my component for memory leaks using tools like JProfiler?
- Have I done static analysis using tools like Klocwork?
- Is this code thread-safe, and if not, why not?
How will I debug this during integration?
- How can I tell if it's a bug in my component or another component?
- Are the interfaces between components independently verifiable, and if not, why not?
- Are there independent ways to indict my component versus other components?
- Do I have a reasonable message logging framework?
- Have I chosen the severity of each logged message appropriately?
- Can I quickly indict the failing subsystem?
- Will turning up the log level while under load firehose the screen or the log file?
- Will turning up the log level while under load somehow escalate the problem?
- If performance problems arise, can I determine where the time/space is being spent?
- Do I have a way of independently testing my component in the context of the larger system?
- Do I have a way of testing my component end-to-end with the other components?
- Can I indict our software versus other software?
- Can I efficiently capture the forensic information I need in a transportable form?
- Can I quickly get the customer's system back up and running?
- What is my priority when a problem occurs: forensic data capture or returning to an operational system?
- Can I quickly determine a plausible story to reassure the customer?
- Is that story likely to be correct or am I going to look stupid later?
- Are development resources available to quickly fix bugs and turn around a new release?
- Have I given the field support folks the tools they need to get their job done?
- Can the field support folks do their jobs without calling me all the time at all hours?
- Have I chosen the severity of each of my external alarms appropriately?
- What is the cost of raising an alarm?
- Can the system recover automatically or are there circumstances where our field support folks need to roll a truck?
How will I transition this to another developer during my move to another project?
- Will the new developer need to be a subject matter expert like me?
- Do I need paper documentation or is a wiki or javadoc more appropriate?
- Have I left the new developer a mess to clean up?
- Do I have deprecated code that I need to remove?
- How do I know for sure that no one is using the deprecated code?
- Is the deprecated code unit tested?
What other questions should I be asking?
Does anybody really know what time it is?
Calendars and timekeeping, like music, model railroading, firearms, motorcycles, and a few other pastimes, are one of those interests that seem to keep cropping up among technical folk. I fell into it myself a while back, and it consumed two years of my copious free time.
My interest was motivated by two things that irritated the heck out of me. First, more than once I ran into bugs in production software that miscalculated dates and times, causing me no end of grief and misery. I kept thinking, this can't be that hard. (I was both right and wrong on that count.) Second, on a single legacy software product that I worked on, there must have been at least a half dozen different ways in which the date and time was represented to the end user. I kept thinking, there has got to be a standard way of doing this. (I was right on that one.)
Realizing that I never really understood something until I either taught it or implemented it (typically both), I ended up developing the time and date routines in the Desperado library of embedded open source C++ components available from Digital Aggregates. Developers working on platforms like Linux or Java, both of which provide quite usable support for dates and times, may never really appreciate how much work must have gone into those systems. Developers working on embedded platforms which lack such support should hope they never need it.
This article is a collection of fun facts to know and tell that I learned along the way about calendars and timekeeping. It is in no way a tutorial or a technical reference (although I provide some of those at the end).
There is a standard for representing date and time: ISO 8601. It allows a few variations in the basic format to allow for cultural differences in representation and for readability by both humans and machines. But it enforces a consistency that avoids the ambiguity that still arises as to whether 02/01 is the first of February or the second of January. Plus, its format results in the character-representation of dates sorting into chronological order. One form of an ISO 8601 conformant date and time stamp would look like "2006-11-22T11:54:04-07:00", which encodes the date, local time, and the offset from UTC.
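On a platform with decent date and time support this takes only a few lines. Here is a sketch using Python's standard `datetime` module that reproduces the time stamp above; the variable names are my own, and the fixed seven-hour offset stands in for a real time zone database.

```python
from datetime import datetime, timedelta, timezone

# A fixed offset of seven hours behind UTC, as in the example time stamp.
seven_behind_utc = timezone(timedelta(hours=-7))
stamp = datetime(2006, 11, 22, 11, 54, 4, tzinfo=seven_behind_utc)

text = stamp.isoformat()
print(text)                                   # 2006-11-22T11:54:04-07:00
print(datetime.fromisoformat(text) == stamp)  # True

# Date strings in this format sort chronologically as plain text.
dates = ["2006-12-25", "2006-11-22", "2006-12-16"]
print(sorted(dates))  # ['2006-11-22', '2006-12-16', '2006-12-25']
```

The text sort only matches chronological order for full time stamps when they all carry the same UTC offset.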
There is a standard calendar, the Common Era, which is a proleptic assumption of the Gregorian calendar. Proleptic means that the Common Era calendar extends backwards in time as if the Gregorian calendar had always existed. See, here's the thing. The Gregorian calendar was named for and endorsed by Pope Gregory XIII circa 1582, replacing the Julian calendar named for Julius Caesar, which tells you something about how long it had been around. But the Gregorian calendar was not universally adopted until 1923. Yeah, that's right, folks were getting confused on dates well into the twentieth century.
What this means for those of you who are building that time machine in your basement is: I have no idea how your date-and-time controls are going to work. The date is going to depend on exactly when and where you are going. If you are dropping into Greece in 1901, you are on your own, because their 1901 is not likely to be your 1901. You might be tempted just to represent everything in seconds-in-the-past, but that's problematic too (we'll get to that in a bit).
The Common Era has no Year Zero. The first year is 1 C.E. (A.D. is a Gregorian calendar term that has unpleasant connotations for non-Christians), and the year before was 1 B.C.E. (B.C. ditto).
The Common Era calendar (and the Gregorian calendar) repeats every 400 years. Yes, that's right, the algorithm you've known since childhood of a leap year every four years is wrong. A year divisible by four is not a leap year if it falls on the century (the year ends in 00), unless it is every fourth century (the year is divisible by 400), in which case, it is. So 1900 was not a leap year, but 2000 was.
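The rule fits in one line of code. This is a small sketch with names of my own choosing, not the Desperado implementation; it also counts the leap years in one 400-year cycle.

```java
// Hypothetical helper for illustration only.
final class LeapYearSketch {

    // Gregorian rule: every fourth year, except centuries, except every fourth century.
    static boolean isLeap(int year) {
        return (year % 4 == 0) && ((year % 100 != 0) || (year % 400 == 0));
    }

    // Count the leap years in the half-open range [from, to).
    static int countLeapYears(int from, int to) {
        int count = 0;
        for (int year = from; year < to; year++) {
            if (isLeap(year)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(isLeap(1900));               // false
        System.out.println(isLeap(2000));               // true
        System.out.println(countLeapYears(2000, 2400)); // 97
    }
}
```

A 400-year cycle therefore holds 97 leap years rather than the 100 that the childhood rule would predict.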
Around 1882, a Protestant rector in Germany named Christian Zeller needed to predict on what Sunday Easter would fall. He became immortalized by Zeller's Congruence, a formula that calculates on what day of the week any date in the Gregorian calendar will fall. Everyone uses this; don't even think about developing your own algorithm.
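For the curious, here is a sketch of the Gregorian form of the congruence as it is commonly stated. The class and method names are mine, and the sketch assumes positive years; January and February are treated as months 13 and 14 of the preceding year, and a result of zero means Saturday.

```java
// Hypothetical helper for illustration only.
final class ZellerSketch {

    private static final String[] DAYS = {
        "Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"
    };

    // Returns 0 for Saturday, 1 for Sunday, and so on through 6 for Friday.
    static int zeller(int year, int month, int day) {
        int m = month;
        int y = year;
        if (m < 3) {        // January and February belong to the previous year
            m += 12;
            y -= 1;
        }
        int k = y % 100;    // year within the century
        int j = y / 100;    // zero-based century
        return (day + (13 * (m + 1)) / 5 + k + k / 4 + j / 4 + 5 * j) % 7;
    }

    static String dayOfWeek(int year, int month, int day) {
        return DAYS[zeller(year, month, day)];
    }

    public static void main(String[] args) {
        System.out.println(dayOfWeek(2006, 12, 25)); // Monday
        System.out.println(dayOfWeek(2000, 1, 1));   // Saturday
    }
}
```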
There is a standard clock. In fact, there are several standard clocks, depending on what kind of time you want to keep. We like to think of a day containing twenty-four hours. The problem is the concept of a "day", which is tied to the Earth's rotation. Apparently the Earth hasn't been wound up lately, because it appears to be slowing down.
International Atomic Time (TAI) is the time kept by the most precise timepieces we know how to make, atomic clocks that measure the natural vibration of the cesium atom. TAI has no correlation to the Earth's rotation. Atomic clocks aren't perfect either, but they are the best we know how to build.
Coordinated Universal Time (UTC) is the time we all know and love, the time at the Prime Meridian, once known as Greenwich Mean Time (GMT). (If you have never been to the Greenwich Observatory, I recommend it. Take the ferry.) UTC is tied to the Earth's rotation. So to keep UTC in sync with the slowing of the daily cycle of our planet, occasionally a leap second has to be inserted into the UTC timekeeping.
The decision as to when this is done is made by the International Earth Rotation Service. ("Hands up! International Earth Rotation Service! Nobody move!") So far, a leap second has been added to UTC twenty-three times, first in 1972, the most recent in 2005. The IERS is capable of inserting more than one leap second at a time, or even removing leap seconds in case the Earth starts speeding up (don't laugh, they're serious when they say this). Leap seconds are generally inserted at the end of a calendar year, but not always: nine times a leap second has been inserted (one imagines, on some kind of emergency basis) at the end of June.
(The acronyms TAI and UTC represent the official names of these measurements as written in French.)
The Global Positioning System (GPS) works strictly off computations based on the differences in timestamps received from satellites whose orbits are known with very high precision. (When I read about this, my first thought was: this really is Rocket Science!) Every GPS satellite carries multiple redundant cesium or rubidium atomic clocks. Because of this, GPS time is absolutely without question the most accurate time we know how to keep. Your cell phone is by far the most accurate timepiece you are ever likely to own, because, unless your service provider really hoses it up, the telephone system is synced to GPS time. (Before GPS existed, AT&T actually owned a couple cesium atomic clocks just for this purpose.)
Trick question: how many time zones are there? Answer: twenty-five. Not twenty-four. The time zone that falls on the Prime Meridian extends thirty minutes on either side of it. Take a one-eighty around the Earth and you find the International Date Line (IDL), which effectively splits its time zone in two thirty-minute zones, each twenty-four hours apart. So in that sixty minute window, it is the same clock time on either side of the IDL, but the zones are a day apart.
The U.S. military has a cool way of indicating time zones using a single letter, as in 23:30T. Z indicates the time zone on the Prime Meridian. Since Z is spoken as "Zulu" in the NATO phonetic alphabet (all the cool kids are doing it), that's why time given in UTC is sometimes said to be "Zulu time". The time zone you are standing in can be indicated with a J, so local time is "Juliet", no matter what time zone you are in. That conveniently leaves twenty-four other letters to indicate the twenty-four time zones other than that of the Prime Meridian. The U.S. Mountain Standard Time zone I live in is T or "Tango", which is UTC-07:00. This of course completely ignores the fact that lots of places have time zone offsets of fractions of an hour. This may explain why the U.S. has never invaded Newfoundland (UTC-03:30). On the other hand, they did invade Afghanistan (UTC+04:30), so your mileage may vary.
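The letter scheme is simple enough to sketch. I am assuming the usual assignment here, in which A through M (skipping J) run east of the Prime Meridian and N through Y run west of it; the class and method names are my own invention.

```java
// Hypothetical helper for illustration only.
final class MilitaryZoneSketch {

    // Returns the whole-hour offset from UTC for a military time zone letter.
    static int offsetHours(char letter) {
        char c = Character.toUpperCase(letter);
        if (c == 'Z') {
            return 0;                    // Zulu: the Prime Meridian
        }
        if (c == 'J') {
            throw new IllegalArgumentException("J means local time and has no fixed offset");
        }
        if (c >= 'A' && c <= 'I') {
            return c - 'A' + 1;          // A..I are UTC+1 through UTC+9
        }
        if (c >= 'K' && c <= 'M') {
            return c - 'K' + 10;         // K..M are UTC+10 through UTC+12
        }
        if (c >= 'N' && c <= 'Y') {
            return -(c - 'N' + 1);       // N..Y are UTC-1 through UTC-12
        }
        throw new IllegalArgumentException("not a time zone letter: " + letter);
    }

    public static void main(String[] args) {
        System.out.println(offsetHours('T')); // -7
        System.out.println(offsetHours('Z')); // 0
    }
}
```

As the text says, a whole-hour scheme like this has nothing to offer Newfoundland or Afghanistan.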
Daylight Saving Time (DST) (note: "Saving" and not "Savings") is a wonderful complication if you are trying to write code. First of all, different countries that have some form of DST - and many do not - have different notions of when it begins and ends. Most of the U.S. implements DST, but not all. The U.S. has thus far used three different dates when DST begins and ends, first in 1966, then changing their minds in 1986 and 2007. The dates when DST begins and ends are always in the form of "the second Sunday in March" or "the last Sunday in October", making life especially pleasant for the developer. The decision of when to change when DST begins and ends in the U.S. is done by our elected representatives, and as you might guess, is fraught with politics.
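To give a flavor of what "the second Sunday in March" means to the developer, here is a sketch that turns such a rule into a calendar date. It leans on java.time, which postdates this article, and the names are mine. Production code should consult a maintained time zone database rather than hard-coding rules that legislators can change.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.Month;
import java.time.temporal.TemporalAdjusters;

// Hypothetical helper for illustration only.
final class DstRuleSketch {

    // The ordinal-th Sunday of the month, for example the second Sunday in March.
    static LocalDate nthSunday(int year, Month month, int ordinal) {
        return LocalDate.of(year, month, 1)
            .with(TemporalAdjusters.dayOfWeekInMonth(ordinal, DayOfWeek.SUNDAY));
    }

    // The last Sunday of the month, for example the last Sunday in October.
    static LocalDate lastSunday(int year, Month month) {
        return LocalDate.of(year, month, 1)
            .with(TemporalAdjusters.lastInMonth(DayOfWeek.SUNDAY));
    }

    public static void main(String[] args) {
        System.out.println(nthSunday(2007, Month.MARCH, 2)); // 2007-03-11
        System.out.println(lastSunday(2006, Month.OCTOBER)); // 2006-10-29
    }
}
```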
Two years of hobby labor resulted in the Desperado C++ classes CommonEra, LocalTime, AtomicSeconds, LeapSeconds, TimeZone, DaylightSavingTime, and others. I am reasonably confident (so far) that I got it right.
GPS time already forgoes the use of leap seconds, and as a result, just in the time that GPS time has existed, the wall clock time we know from the rotation of the Earth, slowing due to the drag caused by the lunar tides, has drifted fourteen seconds from the time GPS keeps through the use of atomic clocks. Hardware and software that displays wall clock time, but which synchronizes to GPS time, has to keep track of those missing leap seconds and add them back in. If such a change were to be made, it would cause UTC to be completely abstracted away from the wall clock time we use on a day to day basis.
Sources
I. R. Bartky, E. Harrison, "Standard and Daylight-saving Time", Scientific American, May 1979
P. Chan et al., The Java Class Libraries, Second Edition, Volume 1, Addison-Wesley, 1998, pp. 1775-1776
ISO, Data elements and interchange formats - Information interchange - Representation of dates and times, ISO8601:1988(E), International Organization for Standardization, Geneva, Switzerland, First edition, 1988-06-15
ISO, Portable Operating System Interface (POSIX) - Part 1, ISO9945-1:1996(E), Annex B, 2.2.2, International Organization for Standardization, Geneva, Switzerland
ITU-R, Standard-frequency and time-signal emissions, ITU-R TF.460-6, International Telecommunications Union, Radiocommunication Assembly, 2002
G. Klyne, C. Newman, Date and Time on the Internet: Timestamps, RFC 3339, The Internet Society, July 2002
NIST, The NIST Reference on Constants, Units, and Uncertainty
E. M. Reingold, N. Dershowitz, Calendrical Calculations, Millennium Edition, Cambridge University Press, Cambridge, U.K., 2002
J. R. Stockton, Date and Time
B. N. Taylor, Guide for the Use of the International System of Units (SI), Special Publication 811, NIST, 1995
B. N. Taylor, Metric System of Measurements: Interpretation of the International System of Units for the United States; Notice, Part II, NIST, July 1998
B. N. Taylor, ed., The International System of Units (SI), Special Publication 330, NIST, 2001
U. S. Code, Title 15, Chapter 6, Subchapter IX, "Standard Time"
"Leap Seconds", U. S. Naval Observatory
M. Wolf, C. Wicksteed, Date and Time Formats, NOTE-datetime, World Wide Web Consortium, September 1997
Desperado, Digital Aggregates Corp., 2006
Monday, November 20, 2006
Chip's Instant Managed Beans
But message logging sometimes isn't all it's cracked up to be. Frequently it isn't even possible. Most embedded projects I've worked on in the past three decades had no persistent store of any kind. The smaller projects were lucky to have a console serial port buried on the board somewhere, and the more complex ones might have had an Ethernet port. But even if there was a consenting syslog server to which to log messages, logging wasn't always practical or possible on real-time systems because of the processing overhead and bandwidth required to log the potential firehose of output when things really got sideways. More than once, unfortunately sometimes while troubleshooting a customer system, I have turned up the log level only to discover that the CPU cost was so high the hardware missed its watchdog poke. Wackiness ensued.
Even under the best of circumstances, a huge message log is frequently just more information than you need. Nothing like having your trusty field support person FTP a twenty megabyte log file to your desktop and tell you the customer wants to know what happened by close of business. (Fortunately, a helpful manager will call you every twenty minutes to "assist with your analysis".)
My embedded mentors Tamarra Noirot and Randy Billinger patiently taught me another approach that is an alternative or sometimes a supplement to logging: collections of counters managed as objects and exposed upon demand to the outside world. Sometimes you would give an eye tooth to know if at any time in the past had the board ever encountered a hard error on an I/O port, had the signaling stack ever thrown an exception, had a DSP ever been rebooted, had an incoming packet ever arrived with an invalid checksum. If you (or your trusty field support person) could just query the system to see if any of those error counters had ever incremented, you could, perhaps, at least come up with a plausible story of what probably happened. And maybe even get a clue that would help you fix the problem. Or at least shift the blame. Incrementing a counter is a whole lot cheaper than logging a message, and frequently carries just as much information.
There are several mechanisms to expose counters to the outside world. Sometimes it's a simple command line interpreter and a serial port. On systems which have an internet protocol stack, it's telnet or secure shell. On yet more complex systems, it's an SNMP MIB.
If you are developing in Java, you're in luck: the latest release comes with a remote management capability: managed beans. A bean is a common Java design pattern. It is a class which has a no-argument constructor, its properties are accessible and modifiable via public getter and setter methods named according to a convention, and it is serializable. Hence, Java objects that are beans can be created, accessed, modified and even persisted, using Java's reflection mechanism, without any prior knowledge of the implementation. A managed bean, or mbean, can be registered with an mbean server so that it can be accessed and modified through the server, even remotely across a network. Java 1.5 has a built-in platform mbean server, and also includes jconsole, a GUI tool with which to browse registered mbeans.
This stuff makes embedded developers' mouths water. If you're not into Java, imagine having a standard mechanism that allowed you to export C++ objects to an SNMP MIB, then having a MIB browser that let you examine the contents of those objects, change them, and even call methods inside of them. It would give you the capability to monitor and control applications from the laptop in your office, even if those applications are deeply embedded in your remote servers without any external user interface.
As much as this sounds like rocket science, implementing a standard mbean in Java is actually pretty simple. But standard mbeans, whose capabilities and interface are fixed at compile time, aren't that flexible. Dynamic mbeans, on the other hand, are incredibly flexible, but not that simple to implement.
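To show how little a standard mbean asks of you, here is a minimal sketch using only the javax.management classes that ship with the JDK. The ErrorCount class, its interface, and the object names are hypothetical examples of mine and are not part of Buckaroo. The one convention that matters is that the management interface is public and is named after the implementing class with an MBean suffix.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.ObjectName;

// Hypothetical example for illustration only.
final class StandardMBeanSketch {

    // The management interface: a getter becomes an attribute, other methods become operations.
    public interface ErrorCountMBean {
        long getCount();
        void reset();
    }

    // The implementation; its name plus "MBean" must match the interface name.
    public static final class ErrorCount implements ErrorCountMBean {
        private final AtomicLong count = new AtomicLong();

        void increment() {
            count.incrementAndGet();
        }

        @Override
        public long getCount() {
            return count.get();
        }

        @Override
        public void reset() {
            count.set(0);
        }
    }

    // Register the bean with the platform mbean server under the given object name.
    static ObjectName register(ErrorCount bean, String name) {
        try {
            ObjectName objectName = new ObjectName(name);
            ManagementFactory.getPlatformMBeanServer().registerMBean(bean, objectName);
            return objectName;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Read the Count attribute back through the server, as a management tool would.
    static long readCount(ObjectName objectName) {
        try {
            return (Long) ManagementFactory.getPlatformMBeanServer()
                .getAttribute(objectName, "Count");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ErrorCount errors = new ErrorCount();
        ObjectName name = register(errors, "example:type=ErrorCount");
        errors.increment();
        System.out.println(readCount(name)); // 1
    }
}
```

The price of this simplicity is that the attribute list is frozen into the interface at compile time, which is exactly the inflexibility described above.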
What we need is a serving of Chip's Instant Managed Beans. Instant mbeans are general-purpose dynamic mbeans that allow you to instrument your application with almost no effort at all. Instant mbeans come in two delicious flavors: Counters and Parameters.
The Counters instant mbean lets you expose an array of long integer counters inside your application to a management tool like jconsole. You can watch activity counters inside your application increment, giving you that warm fuzzy feeling that your software is actually doing something. You can reset error counters and see if they change. You can monitor high water marks. You can even alter tuning parameters in real-time.
And instant mbeans are so easy to use. Here is a code snippet that creates an instant mbean for some error counters in an application.
enum Counter {
    RECEIVED_INVALID_MESSAGE,
    ILLEGAL_STATE_ENCOUNTERED,
    CAUGHT_IO_EXCEPTION,
    SHOULD_BE_IMPOSSIBLE
}

Counters counters = new Counters(Counter.class);
Next we register the instant mbean with an mbean server under a Java object name. Sure, we could use any mbean server and object name we wanted to, but if we don't do anything more than the code snippet below, the Counters mbean will be registered with the default platform mbean server under a perfectly usable default object name.
counters.start();
Now if our application encounters an illegal state in its state machine, it executes the following code snippet to increment the appropriate counter. It can even log it, as shown below.
counters.inc(Counter.ILLEGAL_STATE_ENCOUNTERED);
logger.warning("illegal state encountered: count="
    + counters.get(Counter.ILLEGAL_STATE_ENCOUNTERED));
If we remembered to run our application with the JVM flag
-Dcom.sun.management.jmxremote
so that the platform mbean server allows connections from the outside world, then all we need to do is run jconsole on the server itself, or remotely using the appropriate URL, and the values of all our error counters will be visible through the GUI to watch and even to modify.
Here is a screen snapshot of jconsole managing an application using exactly the code from above.
Chip's Instant Managed Beans are just that simple. But wait, there's more!
If your application is configured using, for example, a Java properties file, you can use the Parameters instant mbean to expose the Properties object. You can observe through jconsole what the configuration parameters are, and even alter them in real-time. And no inefficient polling here! The Parameters instant mbean lets you register a callback object so that your application will be notified when a parameter has been changed through jconsole.
Given the properties file
ConfigurationFile.properties
with the following keyword=value pairs
AcceptTimeoutPeriodMs=2000
MaximumQueueSize=200
Temporary=/var/tmp
RootPath=./Buckaroo
here is a code snippet to create a Parameters instant mbean from a Properties object.
Properties properties = new Properties();
InputStream stream =
    new FileInputStream("ConfigurationFile.properties");
properties.load(stream);
Parameters parameters = new Parameters(properties);
parameters.start();
Now we can use jconsole to see how our application was configured, and even to change the configuration as it runs.
Here is another screen snapshot of jconsole managing an application using exactly the code from above.
You can start out using instant mbeans to quickly get your application instrumented. Then, as the mood strikes you, use the instant mbeans as examples for developing your own more complicated managed beans.
The Counters and Parameters instant mbeans are part of the Buckaroo open source Java library available from Digital Aggregates under the Apache License. Chip's Instant Managed Beans makes remote application management easy and fun for the whole family!
Sources
Buckaroo, Digital Aggregates Corp., 2006
Saturday, October 28, 2006
Gene Amdahl and Albert Einstein
Amdahl's Law most frequently takes the form of
1/(F + ((1-F) / N))
where F is the fraction of the computation that is by nature serial, (1-F) is the fraction that can be parallelized, and N is the number of processors you can spread the (1-F) portion across. What Amdahl was saying is: no matter how many processors you have, the speedup due to parallelization is governed by that fraction of the computation that can actually take advantage of all those processors, and it can never exceed 1/F. Like all the best rules of thumb, this is kind of glaringly obvious and at the same time subtle. It all depends on making (1-F) big. If (1-F) is small, you can waste a lot of money on making N as large as you want to no avail.
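A few lines of code make the point better than hand waving. This sketch simply evaluates the formula above; the class name and the sample values of F and N are my own.

```java
// Hypothetical helper for illustration only.
final class AmdahlSketch {

    // Speedup = 1 / (F + (1 - F) / N), where F is the serial fraction and N the processor count.
    static double speedup(double serialFraction, int processors) {
        if (serialFraction < 0.0 || serialFraction > 1.0 || processors < 1) {
            throw new IllegalArgumentException("need 0 <= F <= 1 and N >= 1");
        }
        return 1.0 / (serialFraction + (1.0 - serialFraction) / processors);
    }

    public static void main(String[] args) {
        // With ten percent of the work serial, more processors buy less and less.
        for (int n : new int[] {1, 2, 8, 64, 1024}) {
            System.out.printf("N=%d speedup=%.2f%n", n, speedup(0.10, n));
        }
    }
}
```

With F at ten percent, the speedup creeps toward ten and never reaches it, no matter how large N gets.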
As many people are learning today with the new multi-core microprocessors, and as my mainframe and supercomputer friends could have told you thirty years ago, designing a software application to take advantage of multiple processors is no small feat. Typically in all but a handful of very specialized applications, the kind that seem to occur only if you are doing classified research for the Department of Energy, only a portion, sometimes a small portion, of a program lends itself to the kind of fine-grained parallelization needed to take advantage of large numbers of processors. The programs that do lend themselves to massively parallel processors tend to be composed of many independent computations. Otherwise the communications overhead or the serialization required for synchronization kills you.
Google has made stunningly effective use of large numbers of clustered processors using their MapReduce software architecture. I got a little verklempt with nostalgia when I read the paper by Dean and Ghemawat on MapReduce, because it reminded me of my days in graduate school decades ago, working on software designs for the kind of massively parallel hardware architectures envisioned by the Japanese and their Fifth Generation research. But as clever as MapReduce is, the applications that map well to it are still far too limited for just the same reasons.
Just to make matters worse, as folks like David Bacon, Scott Meyers, and Andrei Alexandrescu are quick to point out, even coarse-grained parallelization, like the kind you find in multi-threaded applications, is fraught with subtle peril, thanks to hardware memory models that bear little resemblance to how most developers think memory actually works. I sometimes find myself thinking of the thousands and thousands of lines of embedded C, C++ and Java code I have written in the past decade, wondering if any of it would actually work reliably on a modern multi-core processor.
But the beauty of Amdahl's Law is that it is very broadly applicable way beyond just making effective use of lots of processors. My favorite is latency in distributed applications.
When you try to send data from one computer to another, whether it's a book order to Amazon.com in the form of a SOAP message, or a pirated movie from your favorite peer-to-peer site, it takes a while to get there. If you try to speed things up by getting a faster network, you find out that what constitutes faster is not so simple. Network latency occurs in lots of places between the time that first bit leaves the near end and the last bit arrives at the far end. There's the software processing latency that takes place in the application, in the network protocol stack, and in the NIC device driver, before the data even gets on the wire. There's the bandwidth of the wire itself. There's the same software processing latency at the far end.
And then, as Albert Einstein would tell you, there's the speed of light.
See, when you upgrade your unbearably slow ten megabit per second network interface card with that shiny new gigabit per second card, all you have really done is affect just one small (1-F) of the total network latency equation, the transmission latency. In fact, if your computer is not capable of feeding that NIC with data at a gigabit per second, you might not have really accomplished anything. But even if it can, each bit is still governed by the ultimate speed limit: the rate at which a signal can propagate across the wire, and that rate is never faster than 300 million meters per second. No matter how fast your network is, that first bit is still going to take some time to travel end to end. 300 million meters per second sounds fast, until you start looking at Serious Data, then you start to realize that the propagation latency is going to kill you.
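Propagation latency is easy to estimate on the back of an envelope. The sketch below divides distance by the speed of light in a vacuum, which is the best case. The geosynchronous altitude of roughly 35,786 kilometers is an illustrative input of my own, and the round trip assumes ground stations directly beneath the satellite; the names are hypothetical.

```java
// Hypothetical helper for illustration only.
final class PropagationSketch {

    static final double SPEED_OF_LIGHT_MPS = 299_792_458.0; // meters per second, in a vacuum
    static final double GEO_ALTITUDE_M = 35_786_000.0;      // approximate geosynchronous altitude

    // Best-case one-way delay for a signal to cover the given distance.
    static double delaySeconds(double meters) {
        return meters / SPEED_OF_LIGHT_MPS;
    }

    // Request goes up and down, reply goes up and down: four traversals of the altitude.
    static double geoRoundTripSeconds() {
        return delaySeconds(4.0 * GEO_ALTITUDE_M);
    }

    public static void main(String[] args) {
        System.out.printf("across a 100 m machine room: %.9f s%n", delaySeconds(100.0));
        System.out.printf("geosynchronous round trip:   %.3f s%n", geoRoundTripSeconds());
    }
}
```

No network upgrade changes either number; only moving the endpoints closer together does.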
How do different network technologies actually achieve ever higher bandwidths? Sometimes they really do make data travel faster down the wire (but never faster than the speed of light). The now defunct Cray Research Corp. wrapped their signal wires in Gore-Tex insulation because it lowered the dielectric constant, which actually decreased the propagation delay. The speed of signal propagation in fiber-optic cable is faster than it is in copper wire. But mostly, whether using copper or fiber, they cleverly encode a bunch of bits in such a way that a single signal carries multiple bits of data, all of which arrive at the same time. The signal may take the same time to arrive at the far end, but once it gets there, you get more for your money.
Of course, on Star Trek, they were always using tachyon beams, Einstein be damned.
There are other ways to slow down even infinitely high bandwidth networks, some of which impact software designs at the application level. If you send a byte of data to the far end and have to wait for a reply before sending any more, you are going to take double the signal propagation hit on every single byte. You will have to wait for the byte to get to the far end, then wait for the reply to come back, so that you know your transmission was successful and you do not have to resend your byte. Network protocols and the applications that use them get around this latency by sending honking big packets of data at a time, and sending many packets before they have to stop and wait for an acknowledgement.
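Here is a rough sketch of why waiting for every reply hurts so much. It uses a simplified stop-and-wait model of my own devising: each payload is clocked onto the wire, and then the sender idles for one full round trip before sending again. It ignores the time to transmit the reply itself, and the names and sample numbers are hypothetical.

```java
// Hypothetical helper for illustration only.
final class StopAndWaitSketch {

    // Effective throughput when the sender waits one round trip after every payload.
    static double throughputBitsPerSecond(double payloadBits,
                                          double linkBitsPerSecond,
                                          double roundTripSeconds) {
        double transmitSeconds = payloadBits / linkBitsPerSecond;
        return payloadBits / (transmitSeconds + roundTripSeconds);
    }

    public static void main(String[] args) {
        double gigabit = 1.0e9;
        double roundTrip = 0.5; // seconds
        // One byte per round trip versus a 1500-byte packet per round trip.
        System.out.printf("%.1f bits/s%n", throughputBitsPerSecond(8.0, gigabit, roundTrip));
        System.out.printf("%.1f bits/s%n", throughputBitsPerSecond(1500.0 * 8.0, gigabit, roundTrip));
    }
}
```

On a gigabit link with a half-second round trip, one byte at a time yields about sixteen bits per second. Bigger packets help, but only keeping many of them in flight fills the pipe.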
You can compute how much out-going data you have to buffer in memory in order to make effective use of your network pipe. Network protocols typically call this the window size. It is computed from the bandwidth-delay product. Take the size of your network pipe in terms of, say, bits per second, and multiply it by the round trip propagation delay in seconds for sending a bit down the wire and getting a reply bit back. The result is the number of bits you have to buffer in memory waiting for an acknowledgement from the far end.
The nasty part about this is that the higher the bandwidth of the network, the larger the bandwidth-delay product is, and you can never reduce the round trip propagation delay below that implied by the speed of light. I once worked on a project to send data between two supercomputers on an OC-3 link via a geosynchronous satellite. The bandwidth was 155 megabits per second. The round trip propagation delay was half a second. Do the math. We had to buffer nine megabytes of data at both ends of the socket. And this was on manly high speed SRAM memory, not wimpy DRAM with a virtual backing store. This was surely one expensive distributed application.
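Doing the math looks like this. The sketch multiplies bandwidth by round trip delay and converts bits to bytes, using the 155 megabits per second and half second figures from the satellite project; the class and method names are mine.

```java
// Hypothetical helper for illustration only.
final class BandwidthDelaySketch {

    // Bytes that must be buffered to keep the pipe full while awaiting an acknowledgement.
    static long windowBytes(long bitsPerSecond, long roundTripMillis) {
        long bitsInFlight = bitsPerSecond * roundTripMillis / 1000L;
        return bitsInFlight / 8L;
    }

    public static void main(String[] args) {
        long bytes = windowBytes(155_000_000L, 500L);
        System.out.println(bytes); // 9687500
    }
}
```

That is a little over nine and a half million bytes in flight, the nine-odd megabytes per end quoted above.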
So before resorting to the easy trick of upgrading to a faster network technology, do some back of the envelope calculations to see if your investment will pay off. And do the same before buying that shiny new multi-processor/multi-core server.
Sources
David Bacon et al., The "Double-Checked Locking Pattern is Broken" Declaration
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004
Scott Meyers and Andrei Alexandrescu, "C++ and the Perils of Double-Checked Locking", Dr. Dobb's Journal, July and August 2004
J. L. Sloan, "Introduction to TCP Windows and Window Shifting/Scaling", Digital Aggregates Corp., 2005
J. L. Sloan, "Vaster Than Empires and More Slow: The Dimensions of Scalability", Digital Aggregates Corp., 2006
Wikipedia, "Amdahl's Law", October 2006
Monday, September 11, 2006
1890: A Tipping Point
Being my usual anal-retentive self, I began researching the 1955. While it is not nearly as famous as an earlier Browning design, the 1911 (which is known by most as the iconic ".45 Automatic") it does have some interesting claims to fame. The 1955 was a re-release of an earlier design, the 1910, which itself was an earlier version of what evolved into the 1911. And the tiny Browning 1910 was the handgun used to assassinate Archduke Franz Ferdinand in 1914, an event which sparked off the First World War.
Author and television host James Burke, through his PBS shows such as Connections and The Day the Universe Changed, was always taking his viewers down a long contorted path of history to show how inventions and events were improbably linked together. And so shall I. WWI led to the collapse of Germany under the leadership of Kaiser Wilhelm II. The collapse of Germany led to the rise of the Nazi party and of Adolf Hitler. This led to WWII (which unlike WWI really was a global conflict). And WWII led to the Atomic Age.
So you could argue that 1914 was a tipping point in history, and that the assassination of an archduke with a Browning 1910 Pocket Pistol led to the Atomic Age. But I'm going to argue that there was an even more interesting tipping point connected to this chain of events: the year 1890.
Wilhelm II was an ambitious, impatient, and it turns out maybe a little brain-damaged, Emperor of Germany and King of Prussia. In 1890, just two years after ascending to the throne, Wilhelm fired his Chancellor, Otto von Bismarck, because Bismarck just wasn't getting with the program for the conquest of Europe. Now by all accounts, Bismarck was a Prussian's Prussian. He has been described as brilliant. And by all accounts, although he may not have been the nicest guy, he really understood that diplomacy was a whole lot cheaper than warfare. So clearly he had to go. The canning of Bismarck in 1890 was a pivotal point in history as part of a chain of events that eventually led to the Atomic Age.
The year 1890 was an interesting one for other reasons. In that year, the United States Census Bureau declared, based on the 1880 census, that the western frontier was closed. All the unclaimed land had been claimed, most of it had been settled, and thanks to the Transcontinental Railroad, the western United States was safe from incursion by foreign powers. (Just as Eisenhower saw the need for the Interstate Highway System for moving troops based on his experience with the German autobahns during WWII, Lincoln saw the need for the Transcontinental Railroad to move troops to the west coast to secure it from invasion by the European powers via Canada or Mexico). Because of this, most historians consider 1890 as the end of the era we think of as the "Old West".
I think it must be hard for folks from outside of the United States to appreciate the mythic quality the Old West has in our country. The period lasted barely a generation, from the end of the American Civil War in 1865 until 1890. If you added up all of the western films and television shows ever made (including the spaghetti westerns made in Italy), the total viewing hours might be longer than the Old West period actually lasted. But today, we're still making western movies and television series, still writing western novels, and millions of us (including me) still own at least one pair of cowboy boots that are likely to never set foot in a stirrup. I might own a couple of cowboy hats too.
The U.S. Census Bureau played another crucial role in the year 1890. The 1880 census took so long to tabulate, seven years, that the Bureau was seriously worried that the 1890 census might take longer than a decade to complete. They turned to a recent Ph.D. graduate, Herman Hollerith, for help. Hollerith had designed a mechanical tabulating machine that could sort and collate information stored in the form of holes punched on paper cards the size of the 1890-era U.S. dollar. The Bureau adopted Hollerith's invention, and the 1890 Census was completed in two and a half years. Hollerith went on to found the Tabulating Machine Company, which in the fullness of time became IBM.
So here's the crux of it: 1890 was the year in which the U.S. Census Bureau ended the Old West and began the Information Age. And it was the year in which the sacking of Otto von Bismarck would lead to two World Wars and the Atomic Age.
Living in Denver, Colorado and working in information technology, I find it remarkable, and very resonant, that the Old West ended and the Information Age began in the same year, and through the same agency of the U.S. Government. And actions taking place that same year led to a chain of events that so thoroughly defined our current world.
Monday, July 10, 2006
In Defense of Misbehavior
Some years ago I read Robert Austin’s 1996 book Measuring and Managing Performance in Organizations. It’s not the kind of book I would have normally chosen to read at that point in my life. But I was on a tear reading books on software engineering methodology and people management, and I kept stumbling across references to it. Reading it changed my whole perspective on Life, the Universe, and Everything. The recent death of Enron chairman Kenneth Lay inspired me to try to organize my thoughts on Austin’s topic of measurement dysfunction.
Austin’s thesis can be summed up as follows:
- To be effective, incentive plans must be tied to objective measurements of employees’ performance against objectives.
- To incent employees to produce the optimal desired results and avoid unintended negative consequences, all possible aspects of their performance must be objectively measured.
- In any but the most trivial of tasks, such complete measurement is at worst impossible, or at best too expensive to be practical.
The implications of this should be terrifying to managers trying to steer the ship of industry, because it says that the helm is at best only loosely coupled to the rudder. Steering inputs may have little effect, the opposite effect, or no effect at all. The linkage may be so complicated as to appear non-deterministic. Or, perhaps worst of all, the lack of complete objective measures may lull the person at the helm into thinking everything is fine when in fact they are steering blindly in a field of icebergs.
In my experience, this seems unbelievable to some, obvious to others. What Austin did, though, was apply agency theory to demonstrate this mathematically. Agency theory is an application of game theory, the very same branch of mathematics that made Nobel laureates of folks like Robert Aumann, John Nash, and Thomas Schelling. It is the basis of much of modern contract law and employment practices. Common employment practices, such as overtime pay for hourly employees, are based on the math behind agency theory.
What is so compelling about Austin’s work is that while you can disagree with his premise when stated as a bold fact, it is a lot harder to argue with the math. Okay, so you don’t like his results; what part of the math don’t you agree with? That the less free time employees have, the more valuable is their remaining free time? That employees’ production is a mixture of results measurable by different metrics (for example, quality, functionality, time to market)? That there is some specific mix of results that is optimal? That some metrics are expensive or impossible to measure?
We have all heard of examples of measurement dysfunction, possibly under different terminology in other contexts. Incentive distortion, unintended consequences, perverse incentives, and moral hazard are just a few I have come across in reading articles on economics, law, management, and ethics. Measurement dysfunction is so fundamental that once you grasp the basic idea, you start seeing it in the news (or experiencing it first hand) everywhere.
Amazon.com’s call center agents were measured by the number of calls they processed per hour, inciting them to hang up on customers in the middle of conversations.
Gateway’s technical support agents shipped whole new computers to customers with relatively minor (but time consuming) problems in order to make their monthly bonuses.
The incentives for corporate executives like Kenneth Lay and Jeffrey Skilling to lie, cheat and steal were so great that they overcame the intrinsic motivators towards honesty and good behavior and the extrinsic disincentives like heavy fines and jail time. In fact, if you can make tens of millions of dollars deceiving your employees, your shareholders, and your government, mightn’t some jail time seem worth the risk? When you face the prospect of forty million dollars in the bank and a few years in jail, paying off the Aryan Brotherhood for protection and bribing a few prison guards suddenly seems doable (whether it really is or not). It may not be until such corporate officers face the prospect of getting letters from their families, written on toilet paper from the community soup kitchen and describing how they are sleeping in cardboard boxes because federal agencies froze all their assets, that the disincentives towards crime appear adequate. The incentives for finding loopholes even in the 2002 Sarbanes-Oxley Act are great.
Software developers and the organizations that employ them seem particularly prone to measurement dysfunction. As Joel Spolsky has pointed out (and as has been my personal experience), if you incent developers to write bug-free code, they will go out of their way to cover up the bugs they can’t find, which will hence be shipped with the product. If you incent developers to fix bugs, they will inflate their metrics by introducing more bugs for them to find. If you measure programmer productivity on lines of code delivered, don’t expect any efforts at code reuse or code optimization to succeed; you’re not rewarding them for the lines of code they didn’t write. If you reward developers for customer support, subtle bugs will increasingly appear in production code so that developers can take heroic action. Nelson Repenning has written much on the topic of how rewarding fire fighting in organizations leads to more fires to fight.
I recall talking to the head of a large software development organization who was asking for a magic quality filter through which code could be run in order to add that objective measure to the incentive program. Folks, software developers solve problems and reverse engineer complex systems for a living. If they’re good at it, it is like their brains are hardwired to the task. And they love a challenge. I am completely confident that the developers I work with on a daily basis are perfectly capable of gaming any incentive system that their employer puts in place, without necessarily actually achieving any stated goal of the program. Plus, some aspects of software algorithm quality, such as whether it contains an infinite loop, are actually provably impossible to detect in all cases (the so-called "halting problem" from my graduate school days).
Which, finally, brings me to the real points of this article.
First: any incentive program is bound to drive dysfunction into an organization. If you must have an incentive program, expect to spend large sums of money and much time tuning it to minimize the dysfunction. Don’t expect to even recognize that dysfunction is occurring. When I have worked in an organization that employed forced-distribution of ranking of employees, and have brought the topic of measurement dysfunction up to managers, every single one of them said “Yes, I understand that this is a risk, but so far it isn’t happening here.” Folks, of course it’s happening here. You have merely provided incentives for your employees to hide it from you. Or maybe you’re in denial. Either way, I see the dysfunction every single day as suboptimal results are delivered in order to improve objective, but partial, metrics.
Second: don’t blame your employees for responding to your incentive program. It is the incentive program that is at fault, not the employee. Upper management says all sorts of things that come under the heading of “motherhood and apple pie”. The only way an employee really knows what upper management truly values is via the incentive program. You may say “quality is job one”. But if you can’t measure quality (and you cannot measure all dimensions of it), yet you continue to terminate employees who don’t make their dates, then it is clear to everyone that “quality is job seven, or maybe nine, and besides anything past job four isn’t really important”.
At the time of his dissertation, Austin, now a faculty member at the Harvard Business School, was an executive with the Ford Motor Company Europe. The insight his work gave me motivated me to go so far as to order his Ph.D. dissertation on which his very readable book is based. One wonders what insight from his personal experience managing both technology and people Austin was able to bring to his research.
Sources
Douglas Adams, Life, the Universe, and Everything, Del Rey, 2005
Robert Austin, Measuring and Managing Performance in Organizations, Dorset House, 1996
Robert Daniel Austin, Theories of measurement and dysfunction in organizations, (dissertation), Carnegie Mellon University, University Microfilm, #9522945, 1995
Robert Cenek, "Forced Ranking Forces Fear", Cenek Report, 2006
Robert Cenek, "Forced Ranking Forces Fear: An Update", Cenek Report, 2006
Tom DeMarco and Timothy Lister, Peopleware: Productive Projects and Teams, 2nd edition, Dorset House, 1999
W. Edwards Deming, Out of the Crisis, MIT Press, 2000, p. 23
Jena McGregor, "The Struggle to Measure Performance", BusinessWeek, January 9, 2006
Jena McGregor, "Forget Going With Your Gut", BusinessWeek, March 20, 2006
Jeffrey Pfeffer and Robert I. Sutton, Hard Facts, Dangerous Half-Truths, & Total Nonsense, Harvard Business School Press, 2006
Nelson Repenning et al., "Past the Tipping Point: The Persistence of Firefighting in Product Development", California Management Review, 43, 4:44-63, 2001
Nelson Repenning et al., "Nobody Ever Gets Credit For Fixing Defects that Didn't Happen: Creating and Sustaining Process Improvement", California Management Review, 43, 4:64-68, 2001
Joel Spolsky, "Measurement", Joel On Software, July 15, 2002
Friday, June 09, 2006
People Are Our Most Important Fungible Commodity
My favorite example that he brought up: the company encouraged their engineering staff to forgo vacations in order to make dates. Then they instituted a policy that forbade carrying over vacation into the next year. Then they generously offered to pay fifty cents on the dollar for any unused vacation. Then, to add insult to injury, they remarked that this policy was illegal in California, so those lucky bastards would be reimbursed at full price. Brilliant!
Completely coincidentally, I was poking around in the HR Policy Manual of this same company just a few days later and I noticed a section entitled "Retention Policy". The section had a single line which basically said "under review". Geeze, I didn't realize until then that what I had said was literally true. I was just being my usual smart ass Chip Overclock-self at the time.
There are lots of obvious reasons why this is a Really Bad Idea. Like: if your business is developing high technology products, and if you have no manufacturing or you have outsourced it (as this company had), then most of your capital is intellectual capital. That is, the knowledge and experience inside your employees' heads. Most of this knowledge is domain specific stuff that you can't hire off the street, but can only acquire from your competitors or create anew, either way at relatively high expense. And when I say "capital", I really mean it. The accounting trend appears to be to capitalize the cost of software development so that it appears in the financial report as an asset instead of an expense. Watching intellectual capital walk out the door should be like watching part of your factory burn down, from the point of view of your financials.
I had seen some talented people walk out the door of this company in the past year. Bad enough that four engineers that I knew personally had recently left, but just a few months before the company's chief product-line architect had walked out the door to work at their biggest competitor. His parting email said they had made him an offer he couldn't refuse. The rumor mill said he had felt marginalized and unappreciated by upper management, who had probably been too busy flying customers around in one of their corporate jets to realize that the guy with their entire product-line roadmap in his head was about to defect.
No one probably really knows the true cost of these defections, but for the engineers you can at least estimate the cost of replacing them. In the second edition of their book Peopleware, Tom DeMarco and Timothy Lister write about "assessing the investment in human capital". There is the cost of interviewing and selecting a candidate. There is the lost productivity of other employees as they valiantly try to temporarily cover the workload. There is the ramp-up cost of the new employee during the time they are relatively unproductive and they require a lot of mentoring and hand-holding. Assuming a six-month ramp-up time, this 1999 book put the cost at around $150,000. The ramp-up time may be worse; another manager at this same company estimated it at about two years for some product areas. Note that none of this covers the cost of having your architects and engineers working for your competitor. We're talking probably at least a few hundred thousand dollars. Makes cutting back on office supplies seem kind of pointless.
Let me say this as simply as I can: whether or not you have an employee retention strategy, for sure your competitors all have an employee acquisition strategy targeted at you.
In the high-technology job market, your competitors include a lot of companies outside of your market domain, most of whom you have probably never heard of, although some of them may be your biggest customers. Think outsourcing overseas is going to solve this? Sorry, your competitors for intellectual capital have already thought of that too, and are right this very instant hiring the best that Bangalore, Prague, and Dublin have to offer. Or they'll wait until you train them, then they'll hire them away from you. Right now, IBM is hiring all the best tech talent in India to the tune of six billion dollars over the next three years. So long, and thanks for all the fish!
Of course, if you really are banking on the job market not improving, then what you're really saying is that you hope the economy surrounding your pool of current and potential employees doesn't improve. If this is the same economy into which you sell your goods and services, this is probably not a growth strategy. Rule of thumb: success should not hinge on two mutually exclusive conditions. Just my opinion.
Relax. It could be worse.
Back in 1995 I was privileged to spend a month working, lecturing, and traveling around mainland China. I remember vividly the director of a laboratory in Beijing, funded by the PRC's equivalent of the National Science Foundation, showing me a bar chart that displayed the age distribution of the researchers at the lab. There was this huge hole in the chart. It slowly dawned on me that this was the impact of Mao's Cultural Revolution. For about a decade, the PRC's production of scientists fell to zero. (The director himself gave a group of us a tour of the farmland he once worked as part of his reeducation as a freshly minted Ph.D.) The director knew that eventually the most senior researchers would retire and there would be no next tier to take their place, to lead the younger Ph.D.s, to fill the mentoring and research needs of the lab.
Want to achieve a similar effect? Freeze hiring in R&D for several years while times are lean. Then wonder what upper management is going to do when all the fifty-and-older baby boomers burn out and there aren't enough senior people left to provide the leadership you need for the younger folk.
Of course, Mao would have told you, there's always farming.
Sources
HR Policy Manual of nameless multinational corporation whose products you may be using
Douglas Adams, The Hitchhiker's Guide to the Galaxy, Random House, 1981
Tom DeMarco and Timothy Lister, Peopleware: Productive Projects and Teams, 2nd ed., Dorset House, 1999
Barry Karafin, Business Management for Technologists, seminar, 2005
Paul McDougall, "How 6 Billion IBM Dollars Helped Chase Apple Out Of India", InformationWeek, June 6, 2006
Jeffrey Pfeffer and Robert I. Sutton, Hard Facts, Dangerous Half-Truths, & Total Nonsense, Harvard Business School Press, 2006
Saturday, June 03, 2006
The End of Civilization As We Know It
Thirty years later I found myself a cranky old man thinking of leaving software development to the offshore outsourcing firms and getting a job at Starbucks. I was seriously disappointed that the predictions hadn't come true, and I wondered what I was going to do with the contents of that six-foot-high gun safe in the basement. The Club of Rome did not foresee the economic forces that would lead to the discovery of new resource deposits, push technology to give us greater energy efficiency, and just flat out eliminate the need for much of our energy consumption. See, there is this newfangled thing called the Internet. Some folks think it is going to be big.
The Club of Rome and prognosticators of their ilk were not much better at predicting the future than my beloved science fiction authors from the 1940s through the 1950s who missed the boat on mobile phones, laptop computers, and the World Wide Web. Yes, some folks hit the mark here and there (Vannevar Bush comes to mind), but we still don't have our flying cars. Here's something else they all missed: the coming global catastrophe of declining population.
The United Nations predicts, based on national population censuses and current trends, that the global population, currently around 6.5 billion, will slow in growth, and peak at 9.1 billion around 2050. Probably to no one's surprise, all growth will be in the developing countries, with the most rapid growth in the least developed countries. The populations of the fifty least developed countries will double, while the populations of the most developed nations will, for the most part, shrink (except for immigration).
The flattening of the growth curve has two reasons: decreasing fertility rates, and increasing mortality rates. This is no surprise to anyone who knows a shred of arithmetic, but what might raise a few eyebrows is where these trends are occurring. Fertility rates are dropping in the developed world even in traditionally Catholic nations like Spain and Italy. And mortality rates are rising in Eastern Europe. The U.N. credits this latter effect to the spread of HIV, although other studies have also suggested that the young men of the former Soviet Bloc are drinking themselves to death.
Just nine countries are expected to account for half of the world's population growth between now and 2050: India, Pakistan, Nigeria, the Congo, Bangladesh, Uganda, the United States of America, Ethiopia, and China (listed in order of the size of their contributions to population growth).
These projections may be just as shaky as the Club of Rome's, and for similar reasons. And maybe this is just the direction of the current swing of the pendulum. But it is interesting to think of the consequences of these trends, should they come to pass.
- Reduction in the number of consumers in the developed countries, and a general slowing in the production of new consumers globally: probably good for the environment; probably bad for the economies of the developed world.
- Larger pools of skilled workers in countries like India, Pakistan, the United States, and China: probably good for high-technology firms; probably bad for skilled workers native to the developed countries.
- Larger pools of software engineers available to develop and support open source solutions: probably bad for vendors of closed/proprietary technology solutions.
- Growing influx of immigration from the developing countries to the developed countries providing cheap manual labor: probably good for producers of hard goods and labor services, probably bad for laborers native to the developed countries.
- A shift in national identity of the developed countries, due to immigration, and a change in global culture, depending on where population growth occurs: this suggests immigration and fertility may be used as weapons of ideology. (However, in an article in Wired, Stuart Luman documents a drop in fertility rates in Islamic nations.)
- An increase in the mean age of the population (which has been going on for some time in the developed world): a change in cultural emphasis and in national priorities to deal with an aging population.
- The United States continues to grow while Europe shrinks: the marginalization of the European Union as a world power.
Think even further ahead and consider what kind of world may exist if population levels continue to decline below today's levels. This will happen much sooner in the developed nations (unless immigration really steps up) than in the developing world. Some countries are taking this very seriously. In an editorial in Newsweek, Robert Samuelson remarks that Russian president Vladimir Putin has proposed a baby bounty to remedy the situation.
Fortunately, Al Gore, Colin Campbell, and others have suggested that global catastrophe may still be within reach in my lifetime, thanks to global warming and oil depletion.
There's still hope.
Sources
Vannevar Bush, "As We May Think", Atlantic Monthly, July 1945
Donella H. Meadows, et al., The Limits to Growth: A Report for the Club of Rome's Project on the Predicament of Mankind, Pan Macmillan, 1974
"Population Growth Slows", The Futurist, 37.4, July-August 2003
Stuart Luman, "The Decline of Civilization", Wired, November 2004, p. 78
United Nations, "World Population to Grow from 6.5 Billion to 9.1 Billion by 2050", press release, February 24, 2005
United Nations, World Population Prospects - The 2004 Revision - Highlights, February 2005
Robert J. Samuelson, "The End of Motherhood", Newsweek, May 29, 2006
Tuesday, May 30, 2006
It's Not Just About Moore's Law
Almost a decade ago I was working in a group that maintained a hierarchical mass storage system responsible for managing data to and from a facility full of CRAYs and other supercomputers, as well as to the desktops of several hundred researchers. Concerned about the scalability of the system over time, I was pondering Moore's Law. Gordon Moore, co-founder of Intel, observed that the complexity of integrated circuits doubles with respect to minimum component cost every twenty-four months. This is typically restated, somewhat inaccurately, as processor speed doubling every two years, and is sometimes misquoted as every eighteen months.
It occurred to me that processor speed could not be the only thing changing. After poking around a bit in a variety of sources ranging from NASA to the National Science Foundation to industry trade magazines, I built the following table of geometric progressions that I have tried to keep updated over the years.
Microprocessor speed doubles every 2 years.
Memory density doubles every 1.5 years.
Bus speed doubles every 10 years.
Bus width doubles every 5 years.
Network connectivity doubles every year.
Network bandwidth increases by a factor of 10 every 10 years.
Secondary storage density increases by a factor of 10 every 10 years.
Minimum feature size halves (density doubles) every 7 years.
Die size halves (density doubles) every 5 years.
Transistors per die doubles every 2 years.
CPU cores per microprocessor chip double every 1.5 years.
I plugged some of these into a spreadsheet and generated the following logarithmic graph. Note that some of the curves are hard to see because they fall on top of one another.
What we have here is a set of mostly disparate exponential curves that illustrate how the performance of major components changes with respect to one another over time. And not much time, either. For example, in a decade, processor speed will increase by a factor of roughly thirty, while bus speed will merely double, and network bandwidth will increase by an order of magnitude.
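The arithmetic behind those decade figures is easy to check. Here is a minimal Python sketch, assuming each progression is a clean exponential (something that multiplies by m every p years grows by m^(t/p) after t years). The rates come from the table above; the names PROGRESSIONS and growth_factor are made up for this illustration and are not from any library.

```python
# Rough check of how far each technology curve moves over a fixed horizon.
# Each entry is (multiplier, period in years), taken from the table above.
# PROGRESSIONS and growth_factor are hypothetical names for this sketch.
PROGRESSIONS = {
    "Microprocessor speed": (2, 2),
    "Memory density": (2, 1.5),
    "Bus speed": (2, 10),
    "Bus width": (2, 5),
    "Network connectivity": (2, 1),
    "Network bandwidth": (10, 10),
    "Secondary storage density": (10, 10),
    "Transistors per die": (2, 2),
    "CPU cores per chip": (2, 1.5),
}


def growth_factor(multiplier, period_years, horizon_years):
    """Growth after horizon_years for something that multiplies by
    `multiplier` every `period_years`, assuming a pure exponential."""
    return multiplier ** (horizon_years / period_years)


HORIZON_YEARS = 10
for name, (multiplier, period) in PROGRESSIONS.items():
    factor = growth_factor(multiplier, period, HORIZON_YEARS)
    print(f"{name}: about {factor:,.0f}x in {HORIZON_YEARS} years")
```

Changing HORIZON_YEARS to the expected lifetime of your own system gives a quick feel for how lopsided the component ratios may get, for as long as the assumed rates hold.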
Admittedly, some of the numbers from my original research may have gotten a little shaky as manufacturers have had to resort to more and more arcane measures to maintain their rate of improvement. But if I can at least convince you that different technologies are on very different curves of performance improvement, then you are pretty much forced to concede that the balanced system architecture you design today might not cut the mustard just a few years down the road as components and surrounding infrastructure are upgraded. The design decisions that made sense at the time, and the software implementations that were based on those assumptions, may not seem as wise later.
In fact, depending on the half-life of the system that is based on these assumptions, its basic architecture may have to be revisited more than once during its lifetime.
Sources
Computerworld
EE Times
NASA
National Media Laboratory
NSF Engineering Research Centers