Thursday, September 04, 2014

War Stories

Nearly forty years ago I was taking a required senior/graduate-level computer science course for which we had to write, in assembler, a multitasking system with device drivers and such for a PDP-11/05. I would go into the lab first thing in the morning before work and my software would work flawlessly. I would go into the lab in the evening after work and could not get my DMA disk device driver to work at all. This went on for days. I was nearly in tears, pulling my hair out.

I had to demonstrate my software to the professor to pass the course. So I signed up for a morning slot in his schedule. My software worked. I passed.

After the term was over I had reason to go back into that lab during the break and I ran into the hardware support guys taking the system apart.

"What's the deal?"

"We think there's a loose solder joint or something in the disk controller. It quits working when it gets warm."

I smiled and nodded and went on my way.

(I would go on to teach this class as a graduate student, the original professor would be my thesis advisor, and what I learned in that class formed the basis for my entire career since then. It would also form the basis for Mrs. Overclock's Rule: "If Mr. Overclock spends too much time debugging his software, he should start looking at the hardware.")

* * *

Decades ago I ran a systems administration and lab support group at a state university. It was the end of the academic term and I was deleting the course accounts to clean up the disk on a VAX/750 running Berkeley Unix 4.2 in one of the labs I was responsible for. This is something my student assistants normally did, but I thought I would get started on it.

The cleanup actually took a long time to execute, so I was going to run it as a background process so I could do other stuff on the system console as it ran. I logged in and executed the following commands.

cd /
rm -rf home/cs123 *

I noticed that I didn't get a shell prompt back as I expected to. I waited for a moment or two more, began to get concerned, then started looking more closely at exactly what I had typed.

Have you ever noticed that the * character and the & character are right next to each other on the QWERTY keyboard?

I tried to cancel the command but it was too late. I had just started a process that would delete the entire root file system — including the operating system — from the disk.

One of my student assistants walked by and noticed me staring at the console. She asked "How is it going?"

I sighed and said "Could you please go fetch me the backup tapes?"

(I would go on to automate the process of creating and deleting student directories with shell scripts so that this would be unlikely to ever occur again.)
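
The scripts themselves are long gone, but the idea can be sketched. The following is a minimal illustration, not the original code: the function name `remove_course_dir` and the `HOME_BASE` variable are hypothetical, and it assumes each course's accounts live in a single directory under one base directory. The point is that the operator never types a path or a wildcard by hand; the script validates a plain course name and builds an absolute, quoted path itself.

```shell
#!/bin/sh
# Hypothetical sketch: delete one course's account directory by name.
# HOME_BASE and remove_course_dir are assumptions made for illustration.

HOME_BASE="${HOME_BASE:-/home}"

remove_course_dir() {
    course="$1"

    # Accept only a plain name made of letters and digits.
    # This rejects empty names, spaces, slashes, and wildcards.
    case "$course" in
        ""|*[!A-Za-z0-9]*)
            echo "refusing suspicious course name: '$course'" >&2
            return 1
            ;;
    esac

    # Build an absolute path so the current directory never matters.
    target="$HOME_BASE/$course"

    # Only remove a real directory, never a symbolic link or a missing path.
    if [ ! -d "$target" ] || [ -L "$target" ]; then
        echo "not a directory: $target" >&2
        return 1
    fi

    echo "removing $target"
    rm -rf -- "$target"
}
```

With something like this in place, `remove_course_dir cs123` removes only that one directory, while a slip such as `remove_course_dir "cs123 *"` is refused before `rm` ever runs.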

* * *

Back in my Bell Labs days I was in a lab struggling to troubleshoot some complex real-time traffic shaping firmware I had written for an ATM network interface card that had an OC-3 fiber optic physical interface. Using fiber optics meant the test equipment was all horrendously expensive.

I was working late one night, and truth be told a little peeved at myself for taking this long to debug my code, when it suddenly dawned on me that between the ATM broadband analyzer, the ATM protocol analyzer, the multi-cabinet telecom equipment under test, the network traffic generators, and all the fiber optic cable I had strewn all over the place, I was probably using a million dollars' worth of equipment, just to debug my code. It was a major insight: I could never have done that kind of work in a smaller organization.

With all the emphasis these days on cheap computers and free open source software (which much of my current work certainly takes advantage of), that's something I think is often unappreciated: there are some problems you just can't tackle without a million dollars' worth of equipment.

* * *

A longtime client asked me to come in for an afternoon to one of their labs to help debug some cellular telecom equipment that had been returned from the field and for which I was one of the principal platform developers. We sat at the lab bench watching log messages scroll by on a laptop connected to the unit while a technician got the unit to go into its failure mode.

"Okay", I began, "this is likely to be a hardware problem. There is a failure with the connection between the ARM processor and the PowerPC processor. It sure looks like an intermittent solder joint failure."

"Oh, no", said the technician, "we think this is a software problem. We were thinking you could..."

As he spoke I slammed my hand against the side of the cabinet and the problem went away.

"... oh... Okay, I'll mark that one down as a hardware problem."

Of course, I had no idea what was going to happen when I hit the side of the cabinet. I was just doing that as a diagnostic step.

But it did make me look like a fraking genius.

1 comment:

bookwench said...

My very first experience with troubleshooting was in the classroom where we were learning about a massive satcom system. We were given a system with a predetermined instructor-set bug to track down and fix. There were lectures on oscilloscopes, lectures on voltmeters, lectures on fireberds. It was very theory-heavy. While they were discussing the final steps to troubleshooting, something failed for real, with a popping noise in the back of the classroom where the set-up for the practical was sitting. Immediately they dropped the teacher faces and looked excited, and bolted back to the equipment rack. One of them was checking a reading and the other was hauling equipment out, sniffing at it to find out where the smell of the burnt resistor was coming from. It was an excellent lesson on the difference between theoretical troubleshooting and troubleshooting for real in the field; equipment is physical stuff, and sometimes the symptoms will not be just symptoms. Sometimes they'll smell funny and make noise.

At a later duty station, we used to have an upconverter with a loose connector on one of the circuit boards. It was tremendously impressive when the circuit would go down and we could walk back, pull the UC out of the rack a bit, and hit it in the back quarter of the top cover - and watch the circuit lights go green.