Monday, February 18, 2013

Imperfect People Build Imperfect Systems

I've done a lot of work over the decades on systems that were expected to be fault-tolerant and highly reliable. These projects ranged from enormous, geographically distributed enterprise communications systems that spanned national boundaries to tiny sensor networks that were part of an environmental control system. I've done a lot of reading on best practices for developing fault-tolerant, high-reliability systems at all scales, and have done my best to incorporate those lessons into my own product development work.

But because imperfect people build imperfect systems, I also did a lot of reading on how humans muck things up more or less inevitably. Here are some of the books I recommend, in the order that I read them, and what I took away from them.

James R. Chiles, Inviting Disaster: Lessons from the Edge of Technology, HarperBusiness, 2001

In 1982, during a storm off the coast of Newfoundland, the drilling rig Ocean Ranger capsized and sank with its crew of eighty-four men. There were no survivors. The British rigid airship R101 was the largest flying machine ever built, and would remain so until Germany built the Hindenburg some years later. When the R101 made its final voyage in 1930, it was unable to keep its own weight airborne, shearing off the roofs of cottages in the English countryside before it finally crashed, killing forty-eight of the fifty-four people on board. In 1979, World War III was narrowly averted when data from a training tape was inadvertently displayed on the screens of four U.S. command centers with no indication that it was a simulation.

These are just three of the dozens of case studies that Chiles presents and dissects in his book. Some of them are more familiar: Apollo 13, the Challenger space shuttle, Three Mile Island, and Chernobyl. Chiles' book manages to be both fascinating and horrifying.

Many of these disasters were the result of ordinary mistakes made in extraordinary circumstances. Sometimes the people involved didn't realize a disaster was in the offing. Sometimes they chose to ignore the evidence right in front of them. Often they depended on their own unreliable intuition, a failure in cognition that Dörner revisits in the book below. Transitions are when mistakes and errors are most likely to occur. Sometimes people did not appreciate the scale of the potential consequences even when the probability of failure was low. Sometimes they did not appreciate the hidden technology buried deep in the systems they were using. Valiant heroism often occurs in the midst of disasters but, unlike in the movies, it seldom succeeds.

The leaders of an organization have tremendous influence, sometimes unwittingly, on safety, or the lack of it, through their own attitudes and priorities, which shape the organization's culture. A common theme in these case studies is that leaders seldom insist on hearing bad news; in fact, subordinates are often punished for bringing bad news upstairs. Chiles also talks about how important it is to "fill in the cracks": make minor course corrections before small mistakes expand into systemic breakdowns.

For me, one of the biggest lessons from this book is that we frequently do not design systems to recover from multiple failures that occur simultaneously. It's just too hard to think about, and it seems so unlikely. Yet it was exactly that kind of seemingly impossible chain of independent failures, lining up in time, that led to the deaths of the crew of the Ocean Ranger.
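A little back-of-the-envelope arithmetic, mine and not Chiles', shows why such chains get dismissed at design time and why dismissing them is a mistake: each individual combination looks negligible on paper, but a long service life gives even a one-in-a-billion combination plenty of chances to occur. All of the rates below are invented for illustration.

```python
# Toy arithmetic, not data from the book: three independent safeguards,
# each failing on one demand in a thousand.
p_single = 1e-3
p_all_three = p_single ** 3
print(p_all_three)            # 1e-09: looks safely impossible on paper

# But an offshore rig is "demanded" by weather, equipment, and operations
# every day for decades, and there are many rigs. Over enough exposure,
# "impossible" combinations become merely unlikely.
demands = 1000 * 365 * 30     # a hypothetical fleet of 1000 rigs over 30 years
p_somewhere = 1 - (1 - p_all_three) ** demands
print(round(p_somewhere, 4))  # roughly 0.0109: about a 1% chance somewhere
```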

Dietrich Dörner, The Logic of Failure: Recognizing and Avoiding Error in Complex Situations, Basic Books, 1996

Dörner is a psychologist whose book describes a series of computer simulations, run with a wide variety of volunteers, each of whom had to manage a dynamic system of interrelated components. His goal was to discover the common cognitive errors humans make when trying to manage such systems.

For example: an African tribe whose livelihood depends on herds of cattle and sheep, which depend on the grassland, which depends on the water table. Well-meaning participants in the simulation would bring Western health care to the tribe. The mortality rate would go down. The population would go up. The number of animals needed to support the tribe would rise. The amount of grass needed to feed the herds would outstrip the water supply. Famine would result. In fact, pretty much anything the participants tried to do to improve the lot of the tribe would, sooner or later, result in famine.

The game wasn't rigged. The system was simply beyond the capability of most people, even trained economists and ecologists, to understand all the interdependencies when they had to discover them in real time and make iterative decisions. The irony is that if the participants had just stood back and done nothing, the tribe would have been fine. Natural forces had, over many years, optimized all the components in the system for the available resources. Another simulation, involving a watch factory in a small European community, had similar results.
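Dörner's simulations aren't reproduced in the book as code, but the shape of the trap is easy to sketch. Below is a deliberately tiny Python model, assuming a population coupled to a herd and to grass that regrows at a fixed rate set by the water table; every name and number is an invented illustration, not his model. The only intervention is the one the well-meaning participants made: lowering the death rate.

```python
# A toy coupled system, not Dörner's simulation: people depend on a herd,
# the herd depends on grass, and grass regrows at a fixed rate limited by
# the water table. All of the numbers below are illustrative assumptions.

def simulate(death_rate, years=50):
    population = 100.0   # people
    grass = 1000.0       # standing forage
    regrowth = 200.0     # forage regrown per year, limited by the water table
    history = []
    for year in range(years):
        demand = population * 2.0              # forage the tribe's herd must eat
        grass = min(grass + regrowth, 2000.0)  # regrowth, capped by the land
        if demand <= grass:
            grass -= demand
            births, deaths = 0.04, death_rate
        else:                                  # overgrazing: famine
            grass = 0.0
            births, deaths = 0.01, death_rate + 0.10
        population *= 1.0 + births - deaths
        history.append((year, round(population), round(grass)))
    return history

# Hands off, births and deaths balance and the system is stable indefinitely.
# "Western health care" halves the death rate; the herd then outgrows the
# fixed regrowth rate, and famine arrives a couple of simulated decades later.
for label, rate in (("hands-off", 0.04), ("lower mortality", 0.02)):
    rows = simulate(rate)
    print(label, [rows[y] for y in (0, 10, 20, 30)])
```

Run hands-off, the toy system sits at its carrying capacity forever; with the lower death rate, the grass buffer drains slowly for two simulated decades and then the population crashes, much like the participants' famines.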

It was hard not to think about capitalism and market forces versus communism and central planning while reading this book, even though I don't recall the author making that comparison explicitly. Dörner compares the results of these simulations with real-life mistakes that often had disastrous consequences, such as Chernobyl.

Humans suck at understanding and managing dynamic systems. Systems may have limited resources, with only so much buffering between interrelated components. Variables in the system may be related through both positive and negative feedback. We are too quick to apply our own conditioned responses to situations that differ from those we have experienced before. We ignore what others have successfully done before us in the same circumstances -- approaches frequently adopted for very good reasons. We make abstractions that can hide or discard vitally important detail.

The author remarks that many of these cognitive mistakes are probably the result of economizing the scarce and slow resource that is human thought. He talks about the necessity of "redundancy of potential command", by which he means the delegation of decision making to empowered subordinates, a point that Gawande reiterates in his book below. His studies also suggest that we are highly motivated by a desire to preserve a positive view of our own competence, sometimes with deadly consequences. One of the big takeaways from this book is the need for communication in organizations, both vertically and horizontally, and the absolute requirement for empowered delegation, bringing a kind of distributed and parallel processing to the human domain.

Two great quotes:

"Advocates of progress often have too low an opinion of what already exists."
-- Bertolt Brecht

"One jumps into the fray, then figures out what to do next."
-- Napoleon

Atul Gawande, The Checklist Manifesto: How to Get Things Right, Picador, 2009

Mrs. Overclock, a.k.a. Dr. Overclock, Medicine Woman, read this book and passed it along to me. I'm glad she did. Gawande is a surgeon in Boston, a MacArthur fellow, and an author who participated in a World Health Organization project to improve surgical outcomes and reduce complications, not just in the kinds of places you might expect WHO to operate, but also in places like the United States, Britain, and New Zealand.

The result was the evolution of a series of checklists. I mean literally: a poster on the wall or a page on a clipboard with just a few brief bullet items on it, listing simple but important tasks that must have been completed, to be read, reviewed, and verified at pre-defined pause points during the surgery: before anesthetic is administered, before the incision is made, before the patient leaves the operating room.

The use of surgical checklists has led to a drastic reduction in post-operative complications like infection, and has saved tens of millions of dollars, not to mention many, many lives.

WHO Surgical Safety Checklist

This will all seem quite familiar to the aviators in my audience. Gawande devotes much of the book to describing how checklists evolved from the early days of aviation, when it seemed that multi-engine aircraft might be too complicated for humans to fly, and how checklists, both printed and computerized, are now a routine part of every flight, commercial or private. He also talks about how checklists, in the form of project management charts, have enabled construction firms to build skyscrapers and nuclear submarines, keeping track of thousands of details and tasks for hundreds of skilled craftspeople. And he talks about the role of checklists in the safe ditching of US Airways Flight 1549 in the Hudson River in 2009.

Like the authors mentioned above, Gawande talks about cognitive weaknesses in humans, in this case our inability to deal with fine levels of detail. He talks about the difference between the simple, the complicated (lots of moving parts, but once an algorithm is discovered, success is easily repeatable), and the complex (every situation is unique). Humans are not wired for the complicated, but we may be the only ones that can tackle the complex. My friend, occasional colleague, and embedded wonk John Lowe likes to say, in a tone of voice that is a combination of wistful and ironic: "If only there were a machine that could somehow take mundane, repetitive tasks, and automate them...". For many complicated jobs, the checklist is that machine.

The author discusses the difference between a do-confirm checklist and a read-do checklist. He also talks about the need for both task actions and coordination actions, the latter being a pre-programmed time to touch base with those involved to make sure everyone is on the same page. In various contexts, this may be called a meeting, a briefing, a huddle, or a scrum.

A good checklist -- and a lot of effort goes into creating and refining a good checklist -- has enormous leverage. While a single instance of a failure may affect only one person, that class of failure can affect many, so eliminating it has broad consequences. A good checklist is one that gets used. For that to happen, a checklist should have only five to nine items, about the capacity of short-term memory, and should take only sixty to ninety seconds to go through.
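Those properties map naturally onto a small data structure. Here is a minimal sketch in Python: the do-confirm/read-do distinction, the task/coordination distinction, the pause points, and the five-to-nine-item rule come from the book, while the type names and the sample items are my own illustrative assumptions, only loosely modeled on the WHO sign-in list.

```python
# A sketch of a checklist as a data structure; names and sample items are
# illustrative, not Gawande's or WHO's exact wording.
from dataclasses import dataclass, field
from enum import Enum, auto

class Mode(Enum):
    DO_CONFIRM = auto()   # do the work from memory, then pause and confirm it
    READ_DO = auto()      # read each item aloud and do it as it is read

class Kind(Enum):
    TASK = auto()          # a concrete action to verify
    COORDINATION = auto()  # a forced conversation among the team

@dataclass
class Item:
    text: str
    kind: Kind = Kind.TASK

@dataclass
class Checklist:
    pause_point: str       # where the work stops to run the list
    mode: Mode
    items: list[Item] = field(default_factory=list)

    def validate(self) -> None:
        # Rule of thumb from the book: five to nine items, sixty to ninety
        # seconds to run, or the checklist will not get used.
        if not 5 <= len(self.items) <= 9:
            raise ValueError(f"{self.pause_point}: {len(self.items)} items, aim for 5-9")

# A hypothetical example, loosely modeled on the WHO "sign in" pause point.
sign_in = Checklist(
    pause_point="before induction of anaesthesia",
    mode=Mode.READ_DO,
    items=[
        Item("patient has confirmed identity, site, procedure, and consent"),
        Item("surgical site marked"),
        Item("anaesthesia safety check completed"),
        Item("pulse oximeter on patient and functioning"),
        Item("anaesthetist asked: does the patient have a known allergy?",
             Kind.COORDINATION),
    ],
)
sign_in.validate()   # complains if the list has grown past what people will use
```

The validate method is the only enforcement here; in practice, as Gawande describes, the real work is field-testing the wording and the pause points until the list actually fits into sixty to ninety seconds.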

One of the things that makes checklists difficult to adopt is ego: some professionals feel the checklist usurps their authority, particularly when it is a subordinate reading the checklist. But those professionals suck at the fine detail, as much as they would like to think they do not. Surgeons and pilots use checklists because the checklist helps them succeed.

I will continue to read about failures -- in both technology and people -- and about how to prevent them. We could use a dose of fault-tolerance and high-reliability in both our systems and ourselves.

2 comments:

Anonymous said...

I would like to make several comments.

1. The way humans can deal with complexity is by keeping their eyes and minds open. John Lowe also famously stresses the importance of gathering data, and it is why he solves so many problems. Your characterization of complexity as situational uniqueness highlights the need to pay attention to what is different now and needs addressing.

2. I have had good luck with checklists. In my role as volunteer IT staff I created procedures for people who had to administer systems with which they were uncomfortable. Because I presented the procedures as the equivalent of a pilot's checklist - a mnemonic to validate what you are doing - they found following these simple cookbooks palatable.

3. This is also why many development programs go awry. Talented people dislike process constraints, and they are correct that constraints can inhibit their creativity. The key to managing such teams is to have a good process in place and to encourage case-by-case deviations. However, each deviation has to be recognized as such, and vetted to be sure none of the reasons the process was put in place are being forgotten.

Ken Howard

Chip Overclock said...

An old friend and former colleague, the late Paul Hyder, once remarked "engineers need structure", by which he meant that we need process and infrastructure and rules, even if we occasionally chafe against them. This remark is a part of him that continues to live in me.

He's right, and his rightness has only become more apparent as I grow older and more experienced, and as I see organizations that lack process and rules and structure. (I find this happens a lot in organizations that haven't realized that they are software development companies, or that have realized it but are struggling to scale up to tackle larger projects. But that's a topic for another article.) Talented people dislike process constraints, but experienced people recognize the need for them.

Checklists are a good thing, but checklists are like software APIs: it's more difficult to create a good one that people will use and stick with than might be apparent if you have never created one. Also, like APIs, using a checklist you created yourself is critical to refining it. Checklists are no more a silver bullet than any other tool, but they sure are a good tool to keep in your toolkit.

What interested me about the Dörner book is the cognitive failures that seem to be common to humans. It says something, I think, about the deep structure of ourselves. And knowing what cognitive limitations exist, we can, perhaps, guard against them. With structure and process, and checklists.

Thanks, as always, for your insightful comments.