Thursday, December 21, 2006

Rocket Science as Product Development

Yep, you parsed that title correctly.

At the recommendation of my friend and colleague Scott Thomas, I read Steve Squyres' book Roving Mars, a fascinating glimpse into how science, engineering, and politics play together in the rarefied world of interplanetary exploration. Squyres was the principal investigator for the Mars rover missions. In his book, Squyres describes how his science team came up with the idea of sending a mobile robot - a kind of instrument-laden all-terrain vehicle that you could fit in the trunk of your car - tens of millions of miles to another planet. How politics led them to send not one but two Mars Exploration Rovers (MERs), Spirit and Opportunity, to two very different locations on the Red Planet. And what they found when they got there.

If, like me, you occasionally gaze up longingly at a clear night sky, or if you've been known to read a novel or two by authors such as Banks, Baxter, Benford, or Brin, then you can be forgiven if your eyes, like mine, get a little teary at the thought of this kind of adventure.

The science described in the second half of the book is alone enough to make it compelling reading. Proof that Mars was a damp if not downright wet place sometime in the distant past. But for us engineers in the audience, the most riveting part is the first half, which is the best description I have ever read of the drama that is product development.

No, really.

Mars and Earth would be in perfect alignment and proximity for a launch in the summer of 2003. The launch window was only five weeks wide. And if they missed it, those perfect celestial conditions would not occur again for eighteen years. The twin-rover program cost something in the neighborhood of eight hundred million dollars, and was sixteen years in the making from inception to launch. If the engineers and developers missed their dates by more than that five-week window, it was more than an inconvenience. It was more than a career-limiting move. It was even more than a disaster. Some of the engineers who poured their hearts and minds into that project were old enough (like, my age) that it was within the realm of possibility that they would not be alive for another launch opportunity. It was pretty much hit that launch window or... well, the alternative didn't bear thinking about too much.

And so Squyres launches into a detailed description of the many, frequently painful, time-space-budget-feature tradeoffs that are part of every single product development program. And for this product development, failure was not an option. If you have friends and relatives who wonder what you do, give them this book, and tell them that maybe you don't send rovers to Mars (or maybe you do). But your day-to-day life is a lot like that of the people in this book.

Those who follow the U.S. space program already know the ending of this story. Not only were the rovers successful, but wildly so. They far outlived their projected lifespans, continuing to do useful science well after the date by which their batteries should have died, their solar panels smothered in dust and their joints and axles frozen by the sixty-below-zero temperatures of the Martian winter. Part of this was just luck: the occasional Martian windstorm cleaned the dust from their solar panels, extending their lives. But part of it was good design and engineering.

One of the most edge-of-your-seat portions of the book was when the MER team at the Jet Propulsion Laboratory in Pasadena lost contact with Spirit. Squyres describes how the engineers, working on some hypotheses regarding firmware failures, pored over code to try to predict how such hypothetical faults could have occurred. One of the ideas they had was that the on-board flash memory used for storing data had somehow become corrupted, leading to rolling reboots of the main processor. (Other sources have revealed that this was a VxWorks system running on a RAD6000 processor, a radiation-hardened relative of the RS/6000.) It wasn't a perfect hypothesis, because they could not figure out a way in which the flash file system could have gotten corrupted, nor could they duplicate it on the test rover on Earth. But the pattern of failure, where the rover would transmit mostly useless status information for just a few minutes then quit, fit most of the forensic data.

To test this theory, they had to transmit a command to the rover over a ten-bit-per-second auxiliary channel (yes, about one character a second) with about a six-minute round-trip latency to tell it to reboot with a temporary file system built in RAM. They had to hope the rover received the command during the brief window when it was likely to be operating during the sunny Martian day and before it rebooted. That it would successfully execute the command. And that it would bring up the high-bandwidth channel at the right time when there was enough solar power to transmit, when the rotation of Mars placed the rover correctly, when the relay satellites in orbit around Mars were in the right places, and when the reply could be received by one of three Earth stations in the U.S., Australia, and Spain. If this didn't work, half the mission and several hundred million dollars were down the tubes because of a corrupted flash file system. This is what people mean when they use the term "mission critical".
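
To put that auxiliary channel in perspective, here is a back-of-the-envelope sketch in Python. The ten bits per second and the six-minute round trip come from the account above; the ten bits per character (eight data bits plus two framing bits) and the forty-character command length are my own assumptions for illustration, not figures from the mission.

```python
# Back-of-the-envelope timing for a very slow command uplink.
# LINK_BPS and ROUND_TRIP_S come from the text; BITS_PER_CHAR and the
# 40-character command length are illustrative assumptions.

LINK_BPS = 10            # auxiliary channel rate, bits per second
ROUND_TRIP_S = 6 * 60    # approximate round-trip latency, seconds
BITS_PER_CHAR = 10       # assumed: 8 data bits plus 2 framing bits


def transmit_seconds(num_chars: int) -> float:
    """Time needed to clock num_chars characters onto the link."""
    return num_chars * BITS_PER_CHAR / LINK_BPS


def earliest_confirmation_seconds(num_chars: int) -> float:
    """Transmit time plus one round trip.

    Ignores on-board processing time and the length of the reply.
    """
    return transmit_seconds(num_chars) + ROUND_TRIP_S


if __name__ == "__main__":
    n = 40  # hypothetical command length in characters
    print(f"{n} chars take {transmit_seconds(n):.0f} s to send")
    print(f"earliest confirmation after {earliest_confirmation_seconds(n):.0f} s")
```

Even under these generous assumptions, every attempt costs minutes before anyone on Earth can know whether it worked, which is why each command had to be chosen so carefully.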

It worked. They were able to reassert control over the rover, reformat the flash, and return the rover to normal operation. But here's the part that only people who are in product development will really appreciate. The command they used to save their bacon was not in any requirements document. It was put in place by an engineer who placed the success of the mission ahead of meeting requirements.

Now, having been both an engineer and a manager, I know it is a very slippery slope to allow engineers to insert their own unplanned features into a product. This has probably killed more products than it will ever save. It raises issues of economics and security. But what I think it comes down to is this: listen carefully to your engineers, particularly to what they are worried about. They may be trying to save your bacon.

I totally and completely identified with the engineers in this book. I can't tell you how many times I've worked on a project where half the team was sitting in front of a display talking to some firmware via a diagnostic port at a remote customer site over a 9600 baud modem connection, typing in simple commands trying to diagnose a severe problem, asking the on-site technician on the phone questions like "Would you describe the pattern of blinking of the red LED as 'Jingle Bells'?", while the other half of the team was poring over a hundred thousand lines of C++ code trying to come up with a failure mode that fit the facts. Is it any wonder I preach the need to include the capability for remote field troubleshooting at the lowest levels?
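
For the curious, here is roughly the kind of thing I mean, sketched in Python rather than the C or C++ it would be in real firmware. It is a minimal line-oriented command handler of the sort you might hang off a diagnostic port. Every name in it (`DeviceState`, `status`, `nextfs`) is hypothetical and invented for this illustration; it is not taken from the rover's software or from any real product.

```python
# A minimal line-oriented diagnostic command handler.
# All command names and DeviceState fields are hypothetical.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DeviceState:
    reboot_count: int = 0
    use_ram_fs_next_boot: bool = False   # if True, skip flash on next boot
    log: List[str] = field(default_factory=list)


def cmd_status(state: DeviceState, args: List[str]) -> str:
    """Report a one-line summary that is cheap to send over a slow link."""
    fs = "ram" if state.use_ram_fs_next_boot else "flash"
    return f"OK reboots={state.reboot_count} nextfs={fs}"


def cmd_nextfs(state: DeviceState, args: List[str]) -> str:
    """Select which file system the next boot should use."""
    if len(args) != 1 or args[0] not in ("ram", "flash"):
        return "ERR usage: nextfs ram|flash"
    state.use_ram_fs_next_boot = (args[0] == "ram")
    return f"OK nextfs={args[0]}"


COMMANDS: Dict[str, Callable[[DeviceState, List[str]], str]] = {
    "status": cmd_status,
    "nextfs": cmd_nextfs,
}


def handle_line(state: DeviceState, line: str) -> str:
    """Parse one command line and return a one-line reply."""
    parts = line.split()
    if not parts:
        return "ERR empty"
    handler = COMMANDS.get(parts[0].lower())
    if handler is None:
        return f"ERR unknown command: {parts[0]}"
    reply = handler(state, parts[1:])
    state.log.append(f"{line.strip()} -> {reply}")  # keep an audit trail
    return reply
```

The design choice worth noticing is that commands and replies are short single lines of plain text, so they survive a slow, flaky connection and can be read aloud over the phone by an on-site technician if it comes to that.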

No doubt about it: space men, earth men, we are all in the same tribe.


Steve Squyres, Roving Mars, Hyperion, 2005
