Tuesday, January 10, 2012

Since when has consistency not been eventual?

Database isn't really my area.

But you don't have to be a database expert to understand what those folks mean when they use the acronym ACID (Härder and Reuter, 1983), which stands for four guarantees made for every database transaction:

Atomicity: either all or none of its changes are made;
Consistency: before and after, the database is in a consistent state;
Isolation: it either succeeds or fails regardless of concurrent transactions;
Durability: once it succeeds, its changes are persistent.

These requirements of database transactions are necessary to make sure, for example, that when you make a withdrawal from your bank account, the money you had yesterday is still there. Real-time guys like me get this because under the hood database systems use the same or similar synchronization primitives as those we depend on daily to make sure planes keep flying and telephones keep ringing. For those readers not into this stuff, a lot of the synchronization mechanisms amount to a bank teller saying to the next guy in line "I'll be with you in a moment once I'm finished with this customer", except it has to be implemented all the way down to the lowest level of the hardware.
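
To make the teller analogy concrete, here is a minimal sketch in C++ of that kind of mutual exclusion using a standard mutex. The Account type and its fields are made up purely for illustration; no real banking (or database) system is anywhere near this simple.

// A minimal sketch (not real database code): the "one customer at a time"
// rule from the teller analogy, expressed with a standard mutex.
#include <mutex>
#include <stdexcept>

struct Account {
    long long cents = 0;    // balance in cents, made up for this example
    std::mutex teller;      // "I'll be with you in a moment..."
};

// Withdraw atomically with respect to other threads: either the whole
// debit happens or none of it does, and nobody observes a half-done state.
void withdraw(Account & account, long long amount) {
    std::lock_guard<std::mutex> window(account.teller); // the next guy in line waits here
    if (account.cents < amount) {
        throw std::runtime_error("insufficient funds"); // nothing was changed
    }
    account.cents -= amount; // the change is seen by others only after the lock is released
}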

This was challenging enough when databases were hosted on single servers so large they frequently needed multiple processors to keep up with the workload. But when databases got so huge that they had to be distributed across multiple servers on a network, satisfying ACID suddenly got really hard. Brewer's Theorem (Brewer, 2000) and its more narrowly but formally defined cousin, the CAP Theorem (Gilbert and Lynch, 2002), state that a distributed system can deliver at most two of the following three qualities:

Consistency: each server sees the same data at the same time;
Availability: each request gets a response, success or failure;
Partition tolerance: the system continues to run even when the network fails.

The need for availability and partition tolerance is especially compelling given that the network is basically unreliable. In fact, this assumption is at the very core of the initial design of what today we think of as the Internet, which started out as a network for military communication in the face of a nuclear war. Network traffic would be briefly interrupted as cities were vaporized by megaton Soviet warheads and packets were routed around the flaming craters. Armageddon turns out to have been a good model for other catastrophic failures too, like someone accidentally powering down an equipment rack or inadvertently cutting a fiber optic cable with a backhoe.

Designers of software for distributed systems have to take this basic assumption of unreliability into account. It is in fact the first of the eight Fallacies of Distributed Computing (Deutsch, 1994; Gosling, 1997; et al.); I'll sketch one way of coping with the first couple of them right after the list:

1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
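
Coping with the first two fallacies often starts with something as humble as a bounded retry with backoff. Here is a minimal C++ sketch of the idea; sendRequest() is a hypothetical stand-in for whatever transport, protocol, and timeout the real application uses.

#include <chrono>
#include <cstdlib>
#include <thread>

// Hypothetical stand-in: in a real client this would make one network
// request with its own timeout; here it just simulates intermittent failure.
static bool sendRequest() {
    return (std::rand() % 4) == 0;
}

// Try a few times, backing off between attempts so we don't make a
// congested network worse, and eventually give up and let the caller decide.
static bool sendWithRetry(int maxAttempts = 5) {
    std::chrono::milliseconds backoff(100);
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (sendRequest()) {
            return true;                          // the happy path
        }
        std::this_thread::sleep_for(backoff);     // latency is not zero either
        backoff *= 2;                             // exponential backoff
    }
    return false;                                 // the network really is unreliable
}

The point isn't the particular numbers, which are made up, but the posture: assume the request can fail, bound how long you'll keep trying, and don't hammer on a network that is already struggling.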

For e-commerce giants like Amazon.com, it became a lot more important to tolerate some inconsistency than to have a system that didn't respond when I hit the Buy now with 1-Click button. (Which I do. All too often. 1-Click ordering is the bane of my credit card.) Sure, sometimes you can order something that is actually out of stock. Or order something twice. But those kinds of consistency errors are infrequent, and when they do occur, they can be detected and cleaned up after the fact. The same approach has been adopted by many cloud computing architectures for the same reason: it's the only way to build scalable systems, because scalability requires distribution.

Amazon.com dubbed this architectural willingness to incur some inconsistency eventual consistency (Vogels, 2008), and the database folks adopted, as the alternative to ACID, the punning acronym BASE (Pritchett, 2008):

Basically: more or less
Available: reliable
Soft state: with some inconsistencies here and there
Eventual consistency: that we'll fix later.

This approach allows, for example, Amazon.com to spread their Amazon Web Services (AWS) infrastructure across thousands of individual servers in a network that is distributed internationally. I've mentioned before that in testing Hayloft, my C++ interface to Amazon Simple Storage Service (S3), I could not only see the effects on my subsequent transactions of S3 updating its bicoastal server farm in the United States to take into account my prior transactions, but also see the increase in latency of those updates on Black Friday, the day after Thanksgiving in the U.S., which is typically the biggest day of the year for retail commerce here, e- or otherwise.

So here's the thing. Eventual consistency is not a new idea. It's probably not new to the database community, and it's certainly not new to the community of developers I hang out with who write software for both small embedded and gigantic distributed systems. (Those are the same people.) Or at least it shouldn't be. Let me give you an example.

The biggest distributed system I ever worked on was a Private Branch eXchange (PBX). This system had several distributed redundant servers controlling up to sixty-four cabinets of hardware. Each cabinet had as many as five racks. Each rack had as many as a dozen or so boards. The most complex of these boards (which I happened to have worked on) had three microprocessor chips, twenty-four individual digital signal processor (DSP) chips, and several big field programmable gate array (FPGA) chips. At least one of the microprocessor chips had dozens of concurrent software threads running at all times. The main controlling server, which these days probably has multiple cores, also had dozens, if not hundreds, of parallel threads. A single PBX could (and did) span national boundaries. And each PBX typically communicated with other PBXes, with the public telephone network, and with other equally distributed systems.

All of this stuff routinely communicated via message passing, over a combination of backplane wiring, dedicated telecommunications circuits, and internet connections. Even the simplest board in this system had at least one chip on it capable of responding to a limited set of messages coming across the backplane. When you look at the big picture, that's a lot of processors all independently doing stuff, all at the same time.

So what the heck was all this stuff doing? For the most part, each element was doing what it was told to do, either by a message from the main controlling server, from another board in the system, or, in the case of inter-PBX connections, from another system entirely. But whatever it was doing, it took time to do it. A typical pattern was: take a message from the queue of incoming messages, inject the message into a state machine that determines what to do given the current circumstances, and execute the resulting action. Some of these actions might (and typically did) result in one or more messages being dispatched to other threads on the same board, or other boards in the same cabinet, or back to the controlling server, or even to a completely different PBX.
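
Here is a minimal C++ sketch of that pattern, with made-up message and state types standing in for the real (and far messier) ones: dequeue a message, run it through the state machine, and let the resulting action dispatch more messages elsewhere.

#include <functional>
#include <queue>

enum class State { IDLE, CONNECTING, TALKING };

struct Message { int event; int payload; };

class Element {
public:
    void inject(const Message & message) { incoming.push(message); }

    // One turn of the event loop: dequeue a message, run it through the
    // state machine, and perform the resulting action, which may dispatch
    // more messages to other threads, boards, or systems.
    void service(const std::function<void(const Message &)> & dispatch) {
        if (incoming.empty()) { return; }
        Message message = incoming.front();
        incoming.pop();
        switch (state) {                               // decide based on the current state...
        case State::IDLE:
            state = State::CONNECTING;
            dispatch(Message{ 1, message.payload });   // ...which may already be stale elsewhere
            break;
        case State::CONNECTING:
            state = State::TALKING;
            break;
        case State::TALKING:
            dispatch(Message{ 2, message.payload });
            break;
        }
    }

private:
    State state = State::IDLE;
    std::queue<Message> incoming;
};

Multiply something like this by hundreds of threads on dozens of processors, all exchanging messages with one another, and you have something like the system I'm describing.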

Remember those fallacies of distributed computing? Not only is the network not reliable, but latency is not zero, and bandwidth is not infinite. So it takes a while for those messages to arrive at their destinations. If they arrive at all. Only to be queued up at the receiving end, perhaps behind other messages. Maybe many other messages, all waiting for a thread to finish what it's currently doing and go back and receive them. Thanks to communications latency, queueing latency, and processing latency, it might be seconds between the time a state machine makes a decision and the time the resulting message is dispatched, received, and acted upon.

If you could somehow take an instantaneous snapshot of the entire, gigantic system, what you would see is dozens, hundreds, thousands of processors, each acting on a message that doesn't describe the current state of things at all, but instead the state of things a while ago. How long a while? It's different for every system, every processor, every application, every workload, every instant from one to the next. It's not just eventually consistent. It's more or less perpetually inconsistent. Why? Because it was the only way to build a scalable system.

Being slightly out of sync all the time can result in a kind of positive feedback loop, where two threads, each having a different idea of the current state of affairs, both of them possibly being wrong, keep telling one another to do the wrong thing. The good news is that if this isn't addressed, it frequently results in something fairly obvious, like a communications system serving thousands of people crashing. But it can also lead to much more subtle emergent behavior in which the system as a whole appears mysteriously non-deterministic.

Once you appreciate the magnitude of it, it's a little miracle anything works at all. I used the example of a PBX because that's the biggest distributed system I could think of that I've worked on. But a similar analysis applies to any large distributed architecture: Amazon Web Services, traffic light systems, air traffic control systems, automated stock exchanges, Strategic Air Command. All of these systems have always had their own mechanisms for dealing with their special brand of eventual consistency. Here are some mechanisms I've seen used.

Audits. An audit simply means fixing a consistency issue when you encounter it, either in real-time (what the distributed database people call read repair and write repair), or by having software specifically tasked to go and look for it (asynchronous repair). More than once I've received email from Amazon.com telling me that something I ordered was actually out of stock, or asking me to verify that I meant to order two of an item. These are all the results of audits. Big distributed systems like PBXes routinely have dozens of audits running, all the time, just to apply corrective action to achieve eventual consistency.
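
Here is a minimal sketch in C++ of what an asynchronous repair audit boils down to, with a hypothetical record type and repair function standing in for the real data model.

#include <chrono>
#include <thread>
#include <vector>

// Hypothetical record: some replicated value with what it should be and
// what this server currently thinks it is.
struct Record { int expected; int actual; };

static bool consistent(const Record & record) { return record.expected == record.actual; }
static void repair(Record & record) { record.actual = record.expected; } // corrective action

// A low-priority background task that just keeps sweeping the data,
// fixing whatever inconsistencies it finds: eventual consistency, eventually.
static void audit(std::vector<Record> & records) {
    for (;;) {
        for (Record & record : records) {
            if (!consistent(record)) {
                repair(record);
            }
        }
        std::this_thread::sleep_for(std::chrono::minutes(5));
    }
}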

Out of Band Messaging, Priority Messaging. Not all messages are created equal. Some messages get to take the equivalent of the passing lane and jump ahead of all the other queued messages. This is done either by using a different communications path or by jumping the message ahead of all the others in the queue. Once a thread receives such a message, it still has to decide what to do with each of the stale messages behind it: throw it away, reply to it with failure, or even reply with success on the assumption that what the sender doesn't know won't hurt it. Sometimes the result of such a high priority message is for the receiver to exit, reset, or reboot.
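
Here is a minimal sketch of the queueing side of this, using a standard C++ priority queue with hypothetical message fields; real systems typically have their own queue machinery, but the idea is the same.

#include <queue>
#include <vector>

// Hypothetical message: larger priority means more urgent; the sequence
// number keeps messages of equal priority in arrival order.
struct Message {
    int priority;
    long sequence;
    int event;
};

struct Order {
    bool operator()(const Message & a, const Message & b) const {
        if (a.priority != b.priority) { return a.priority < b.priority; } // urgent first
        return a.sequence > b.sequence;                                   // then first come, first served
    }
};

// Urgent messages pop out ahead of everything already queued; the stale
// messages behind them still have to be dealt with one way or another.
using MessageQueue = std::priority_queue<Message, std::vector<Message>, Order>;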

Hysteresis, Feedback. For a lot of real-time processes, it makes sense to take your time when you are changing something visible to others. When you receive a message telling you that you can increase your bandwidth, don't jack up your bandwidth all at once. Instead, turn that tap slowly, injecting some hysteresis or delay into the process. This gives everyone downstream more time to adjust to your changes, and to complain about it if it doesn't work for them. If you get active feedback, if for example you are monitoring real-time error rates or you start getting flow control messages from your downstream neighbors, you can stop turning the tap.
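
Here is a minimal C++ sketch of turning the tap slowly, with hypothetical rate-setting and feedback hooks standing in for whatever the real system provides.

#include <chrono>
#include <thread>

// Hypothetical hooks into the real system: apply a new rate, and ask
// whether the downstream neighbors are complaining (errors, flow control).
static void setRate(long bitsPerSecond) { (void)bitsPerSecond; }
static bool downstreamComplaining() { return false; }

// Ramp toward the granted rate in small steps, pausing between them so
// everyone downstream has time to adjust, and stop if they object.
static void rampTo(long currentRate, long grantedRate) {
    if (grantedRate <= currentRate) { return; }                // nothing to ramp up to
    const long step = ((grantedRate - currentRate) / 10) + 1;  // ten or so gentle increments
    while (currentRate < grantedRate) {
        currentRate += step;
        if (currentRate > grantedRate) { currentRate = grantedRate; }
        setRate(currentRate);
        std::this_thread::sleep_for(std::chrono::seconds(1));  // let everyone adjust
        if (downstreamComplaining()) {
            break;                                             // hold here; the rest can wait
        }
    }
}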

Traffic Shaping, Traffic Policing, Flow Control. All of these are mechanisms for controlling the rate at which a thread may send or receive messages. For a lot of applications, real-time data has an expiration date, and it is better to just throw it away than to process it late. If a thread is told that a message it is about to send will arrive with an unusual amount of delay (we might say that the message exceeds the traffic contract between the sender and the receiver), the sender might choose to implement a different strategy or take corrective action.
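
One classic mechanism for enforcing a traffic contract is a token bucket, which can be used for shaping (delay the message) or policing (discard it). Here is a minimal C++ sketch with made-up parameters; real implementations are more careful about clock resolution and arithmetic than this one.

#include <algorithm>
#include <chrono>

// A token bucket: credit accumulates at a sustained rate up to a burst
// limit, and a message conforms to the traffic contract only if there is
// enough credit to cover it right now.
class TokenBucket {
public:
    TokenBucket(double tokensPerSecond, double burst)
        : rate(tokensPerSecond), capacity(burst), tokens(burst),
          then(std::chrono::steady_clock::now()) {}

    bool admit(double cost = 1.0) {
        const auto now = std::chrono::steady_clock::now();
        const double elapsed = std::chrono::duration<double>(now - then).count();
        then = now;
        tokens = std::min(capacity, tokens + (elapsed * rate));  // accumulate credit
        if (tokens < cost) {
            return false;   // shaping would delay and retry; policing would discard
        }
        tokens -= cost;
        return true;
    }

private:
    double rate;        // sustained rate, in tokens per second
    double capacity;    // maximum burst, in tokens
    double tokens;      // current credit
    std::chrono::steady_clock::time_point then;
};

A sender might call admit() before transmitting and, on a false return, either wait and try again or decide the data has already expired and drop it.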

This article is in some ways my love letter to real-time systems. I have known the joy of analyzing a post mortem hemorrhage of thousands of lines of log messages late at night so that I could tell a customer why their multi-million dollar system crashed. Of watching LEDs flash on a board from across a data center equipment room and suddenly wondering "WHAT THE F...!?!?!?!" Of staring at a diagnostic laptop as I watched a system servicing an entire building increment error counters towards MAXINT. When I first started reading about eventual consistency I knew that it was something with which I was already familiar.

Even though database isn't really my area.

2 comments:

Paul Moorman said...

IT is an interesting profession. On the one hand we deal with hundreds or thousands of cores executing billions or trillions of instructions every second, all of which need to execute to perfection to keep the phone from ringing in the middle of the night. Or blow up something unintentionally. Having a digital and logical brain is a big asset, and having good compilers, architectures and designs help too. Nice to have good and dedicated testers. :-)

Next up is the business side of IT, typically running a few percentage points of sales. That's a lot of money, people, contracts, assets, problems, processes, etc. So we need to be really good business people in our own right.

And hardest of all for us left-brained creatures, we get to apply this technology to solve business problems, to design compelling web sites and tablet applications. This is much more an analog world than a digital one, with things like trust, intuition, marketing and communications skills at the forefront.

John, your article just made me reflect not only on the daunting nature of our technology, but the daunting nature of our other challenges.

Wouldn't choose another career, not in 1974 (at the start of it all), nor in 2012. Just gets better and better.

Chip Overclock said...

I agree completely. Hardly a day goes by I don't look back and realize how fortunate I've been to end up where I am, doing what I do, with no end of new stuff to learn in sight.

Thanks as always for your insightful remarks.