Monday, January 05, 2009

House M.D. and Evidence-Based Troubleshooting

Mrs. Overclock (a.k.a. Dr. Overclock, Medicine Woman) and I got sucked into the television series House M.D. over the holidays. The USA cable network was running a marathon of House reruns. We haven't watched the first-run episodes (which run on the Fox broadcast network) except mostly by accident, and the last thing we need is to watch more television. But Mrs. Overclock is understandably interested in medical dramas (and, unfettered by the need to fill an hour of air time, will often beat House and his diagnostic team to the solution). I was a little surprised to find that House tickles certain centers of my brain, too.

If you're not familiar with the show, Gregory House (played by British comic actor Hugh Laurie in an impressive dramatic turn) leads a diagnostic group of physicians at a hospital in New Jersey. Each week they try to cure some critically ill patient with a mysterious illness. They don't always succeed. It's a tribute to evidence-based medicine. They have to choose tests that won't kill the patient, without really knowing what's wrong with the patient. They keep eliminating possibilities, and racking their brains to think up new ones. Frequently there is more than one problem, and they interact in strange ways. And always, the clock is ticking.

It's exactly like troubleshooting large, complex, distributed, real-time, high-availability, production systems, like a PBX.

I've done my share of field support of such systems. I've lived in a small room with several other developers, clustered around a workstation looking at a remote customer system. And I've gotten on a plane with a laptop and a protocol analyzer in my checked luggage. It's just like House. You keep brainstorming ideas of what could be going wrong. You try to think of tests to isolate the problem, to indict a particular hardware or software component, without crashing the production system. You pore over log files, frequently inventing filtering tools on the fly, to make sense of the fire hose of information, to eliminate possibilities. You always have to keep track of what you know you know and of what you know you don't know, always being prepared to discard a much-loved hypothesis in the face of new evidence. And always, the clock is ticking.
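A throwaway log filter of the kind described above might look something like the following. This is only a minimal sketch in Python: the log format, the bracketed component tags, the sample lines, and the function names `filter_log` and `count_by_component` are all assumptions made up for illustration, not taken from any real system.

```python
import re
from collections import Counter


def filter_log(lines, include, exclude=None):
    """Yield lines that match `include` and do not match `exclude`."""
    keep = re.compile(include)
    drop = re.compile(exclude) if exclude else None
    for line in lines:
        if keep.search(line) and not (drop and drop.search(line)):
            yield line.rstrip("\n")


def count_by_component(lines, pattern=r"\[(\w+)\]"):
    """Tally lines by their first bracketed token (a hypothetical component tag)."""
    tag = re.compile(pattern)
    counts = Counter()
    for line in lines:
        match = tag.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample data; a real log would be read from a file or a pipe.
SAMPLE_LOG = [
    "12:00:01 [trunk] ERROR link down on port 3",
    "12:00:01 [audit] INFO heartbeat ok",
    "12:00:02 [callproc] ERROR setup timeout",
    "12:00:03 [trunk] ERROR link down on port 3",
    "12:00:04 [trunk] WARN retrying",
]

# Keep only the complaints, then see which component complains the most.
errors = list(filter_log(SAMPLE_LOG, r"ERROR|WARN", exclude=r"heartbeat"))
suspects = count_by_component(errors)

if __name__ == "__main__":
    for component, count in suspects.most_common():
        print(component, count)
```

The tally does not prove anything by itself. It only suggests which component to indict first, and the hypothesis still has to survive the next test.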

Frequently when I do this kind of work, and particularly when I am doing development for these kinds of systems, I have to keep reminding myself that not only are careers at stake, but lives. Customers may depend on my software to dial 911 when someone has chest pain. Even in a non-safety-critical situation, rebooting to see if it fixes the problem isn't an option if it means dropping hundreds of in-progress calls. House's team can't afford to take a cavalier attitude, and neither could I.

I found a lot to relate to, watching House M.D. If you want to know what developing for high-availability systems is like, you would be well advised to check it out.


Paul Moorman said...

The big lesson I've learned over the years in complex problem solving is to assign a probability to each "fact" I encounter, with the limits being 1 and 99; in other words, never rule anything completely out or completely in, period. Those problems that involve more than one underlying problem are particularly baffling and I've more than once uttered to myself, "I can't wait until this one is solved and it all makes sense." Never been much of a fan of games like Dungeons and Dragons, but finding the root cause of a nasty computer problem is a joy.

Chip Overclock said...

For sure, a white board is one of the best diagnostic tools ever invented. As is a lab notebook. Space cowboy Steve Tarr (ask him about developing software for a comet impactor probe) always called this process "peeling the onion": you just kept peeling away layers trying to find the center.

Tom said...

Completely agree. Fixing a car has many of the same characteristics as well.