Thursday, March 20, 2014

Python, Bash, and Embedded Systems

Lately I've been working on a little development project, Hackamore, as an excuse to learn the programming language Python. Hackamore is a multi-threaded framework that connects to one or more Asterisk PBXes via their Asterisk Management Interface (AMI) ports with the goal of dynamically modeling their channel and call states, including calls that cross PBXes via SIP trunks. Here, for example, is the heart of its I/O multiplexing loop, a generator that services each socket's pending input and yields the resulting events to its caller.

candidates = [ candidate for candidate in self.sources.values() if candidate.fileno() >= 0 ]
effective = 0.0
while candidates:
    # Service all pending I/O on every open Socket. Our goal here
    # is to consume data in the platform buffers as quickly as possible.
    for source in select.select(candidates, [], [], effective)[0]:
        source.service()
    active = False
    # Process queued events on every open socket. Should we process all
    # Events on each Source before moving onto the next one, or should
    # we round robin? There's probably no answer that will be right
    # every time. The code below does the former. Mostly we want to
    # stimulate the Model with each Event as quickly as possible
    # regardless of the Source.
    for source in candidates:
        while True:
            event = source.get(self)
            if event is None:
                break
            active = True
            message = Event(event, source.logger)
            yield message
    effective = 0.0 if active else timeout
    if effective > self.effective:
        self.logger.debug("Multiplex.multiplex: WAITING. %s", str(self))
    self.effective = effective

It's definitely a work in progress. Hackamore is more of a developer's and tester's tool than it is any real means for an administrator to monitor their PBX activity. Its output is just a straightforward ASCII report with some minimal ANSI terminal control that works with an X terminal. And mostly it was a way to figure out how to meet all my usual needs -- multithreading, synchronization, sockets, and general source code organization -- with a programming language I had never used before.


I like Python. It reminds me somewhat of the experimental languages like FP, BADJR, and BAFL that I played with in the early 1980s when I was in graduate school and working in a research group that was investigating programming language and operating system architectures for hypothetical hardware inspired by the Japanese Fifth Generation project. Python's evolution started in 1989, so I sometimes wonder if its creator, Guido van Rossum, had similar inspiration.

But there's a more pragmatic reason to like Python. Being the Type A sort of personality, I start every morning with a list of things I want to accomplish. The other day I settled in to my office around 0800 with my list in mind to work on Hackamore. By 1000 I was done.

"Huh," I thought, "guess I better start making a longer list."

That's happened a lot on this project. And it's why embedded developers need to start thinking beyond assembler, C, and even C++, and start considering the use of interpreted languages, byte code languages, and shell scripts whenever possible. While my colleagues who work on tiny eight-bit micro-controllers may remain firmly in the assembler and C camps, and those who work on resource constrained platforms with an RTOS may never get past C++ (although I would encourage them to at least consider going that far), the rest of us need to think beyond int main(int argc, char ** argv).

With the growing number of embedded systems that run Linux, it is becoming increasingly possible that a significant portion of our applications can be written in languages that feature an enormous productivity improvement. This improvement comes from a vastly shortened duration of the compile-test-debug iteration, from better development tools, from a capability to develop off target, from the availability of a large number of open source frameworks, and from an ability to work at a significantly higher level of abstraction where an application takes a few dozen lines of code instead of a few thousand. Python may not be your embedded tool of choice. But a shell script, even the relatively simple ash shell that is implemented in the ubiquitous BusyBox embedded tool, might be sufficient.

Years ago I remember talking to a colleague about some code we needed for a commercial Linux-based embedded telecommunications product. This little piece of code was something that would only be run very occasionally, and only on demand by someone logged into the system. It became clear in the conversation that my colleague wanted to go off and start writing C code so we could have something to use in a few days. "Or," I said, "we could write a twenty line bash script and be done in an hour." And that's what we did. It might have taken a wee bit more than an hour.

That happens a lot too. When all you have is a hammer, everything looks like a nail. And many an embedded developer's first instinct is to go straight to Kernighan & Ritchie. It doesn't help that most project managers are too clueless technically to know that this is a really expensive decision. I've had developers argue with me about the performance of scripting languages, but when you're talking about a program that will only be run occasionally and has no real-time requirements, the difference in total cumulative execution time between a script and a compiled C program may only amount to minutes over the entire lifetime of the commercial product in which it is used.

Even applications that talk to hardware can be scripted. That's why I wrote memtool, a utility written in C that makes it easy to read, write, and modify memory-mapped hardware registers from the command line. For sure, memtool is useful interactively. But where it really pays off is in shell scripts where you can do stuff like manipulate an FPGA or interrogate a status register without having to write a single line of new C code. (The shell output below was scraped right off a BeagleBoard of mine running Android.)

bash-3.2# memtool -?
usage: memtool [ -d ] [ -o ] [ -a ADDRESS ] [ -l BYTES ] [ -[1|2|4|8] ADDRESS ] [ -r | -[s|S|c|C|w] NUMBER ] [ -u USECONDS ] [ -t | -f ] [ ... ]
-1 ADDRESS    Use byte at ADDRESS
-2 ADDRESS    Use halfword at ADDRESS
-4 ADDRESS    Use word at ADDRESS
-8 ADDRESS    Use doubleword at ADDRESS
-C NUMBER     Clear 1<<NUMBER mask at ADDRESS
-S NUMBER     Set 1<<NUMBER mask at ADDRESS
-a ADDRESS    Optionally map region at ADDRESS
-c NUMBER     Clear NUMBER mask at ADDRESS
-d            Enable debug mode
-f            Proceed if the last result was 0
-l BYTES      Optionally map BYTES in length
-o            Enable core dumps
-r            Read ADDRESS
-s NUMBER     Set NUMBER mask at ADDRESS
-t            Proceed if the last result was !0
-u USECONDS   Sleep for USECONDS microseconds
-w NUMBER     Write NUMBER to ADDRESS
-?            Print menu

Even if your embedded target is too small to host even a simple shell interpreter, learning programming languages that are not natively compiled to machine code will prove valuable. This is true of Python in particular. Python is so easily interfaced with C-based libraries that hardware vendors are starting to provide Python bindings for libraries that interface with their chips, so that developers can trivially write code to monitor and manipulate their products. My friend and occasional colleague Doug Gibbons was just telling me the other day that he was using Python to monitor the performance of his signal processing code on DSPs. Python and other similar languages offer such an enormous productivity boost that I expect this trend to continue. I'm also seeing Quality Assurance testers using Python more and more to automate functional testing of the embedded systems on which I work. Knowing a little Python helps me relate to them.
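
For what it's worth, here's a minimal sketch of what such a binding can look like using Python's standard ctypes module. The library name, function names, and register offset below are invented for illustration; only the ctypes calls themselves are real.

import ctypes

# Load a hypothetical vendor library (the name and functions are made up
# for this example; ctypes itself is part of the Python standard library).
lib = ctypes.CDLL("libvendordsp.so")

# Declare the C signatures so that ctypes marshals arguments correctly.
lib.vendor_open.restype = ctypes.c_int
lib.vendor_read_register.argtypes = [ctypes.c_int, ctypes.c_uint32]
lib.vendor_read_register.restype = ctypes.c_uint32

handle = lib.vendor_open()
status = lib.vendor_read_register(handle, 0x10)
print("status register: 0x%08x" % status)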

If you're an embedded developer, you are quickly running out of excuses for not learning some of the new programming languages, even if you never expect to run those languages directly on the embedded target for which you're developing.

I'm old as the hills. If I can do this, so can you.

Tuesday, March 18, 2014

Maybe the System Isn't the Solution - Maybe It's the Problem

I read a lot. It takes discipline; I set aside at least an hour every single day for reading. I read both fiction and non-fiction, but as I've gotten older I've definitely trended more towards non-fiction. Even the most speculative of fiction authors often have a hard time matching the amazing stuff I find in the news every day. I think it's one of the reasons that one of my favorite fiction authors, William Gibson, has transitioned from writing about the near future to writing about the recent past. You can't make this stuff up.

Much of my non-fiction reading has been in the realms of economics and of how and why complex systems fail. My initial motivation for both of these topics was in fact professional interest. I'm a little surprised to discover that my recent choices in reading material tend to combine these two topics, a relation that has been slowly dawning on me over the past decade or so.

Charles Perrow, Normal Accidents: Living with High Risk Technologies, Princeton University Press, 1999

On June 1, 1974 in Flixborough, England, a chemical plant that manufactured a component of nylon exploded. The blast was estimated to be equivalent to as much as 45 tons of TNT. The plant was destroyed. All of the maintenance records that might have provided more details as to the exact cause of the disaster were destroyed. Buildings a thousand feet away from the plant were damaged. Windows a mile and three quarters away were broken. Twenty-eight people were killed, and at least eighty-nine were injured. It is believed that a cloud of cyclohexane vapor, a byproduct of production, escaped and ignited, generating the immense fuel-air explosion.

This is one of the many catastrophic accidents that Perrow describes in detail in his book. I would be lying if I didn't admit to a certain amount of fascinated schadenfreude for some of the apocalyptic events he documents. Two common themes emerge: non-linear dynamics and tight-coupling.

Perrow writes that the exact functioning of many of the reactions used for production in large commercial chemical plants is not well understood theoretically, even by experts in the field. In an effort to make such plants more efficient, their production systems are full of positive and negative feedback loops which can (and do) react in non-linear ways, and which are only stable within a narrow range of operating conditions. Even the physics may not be well understood; Perrow talks about how neutron bombardment embrittlement of nuclear reactor vessels came as a surprise.

It's no coincidence that the phrase "going non-linear" is a euphemism for a situation going very wrong very quickly. In The Logic of Failure: Recognizing and Avoiding Error in Complex Situations (which I wrote about in Imperfect People Build Imperfect Systems) [1], psychologist Dietrich Dörner writes about how humans really suck at conceptualizing and understanding complex non-linear systems. This means when complex systems like chemical plants enter an unexpected region of their operational parameters, typically due to one or more failures or mistakes lining up in time, it is difficult for human operators to understand what is happening, or to even recognize that something is happening at all. My own experience suggests it is difficult for designers of digital control systems to anticipate such failures, particularly when the systems under their control suddenly and unexpectedly depart from their normal operational envelope. Perrow captures the essence of non-linear behavior when he writes that it is a sensitivity to "trivial events in nontrivial systems" (p. 43).

Perrow echoes Dörner when he writes about how the cognitive limitations of humans limit their response to failures in complex systems.

We construct an expected world because we can't handle the complexity of the present one, and then process the information that fits the expected world, and find reason to exclude the information that might contradict it. (p. 214)
Confronted with ambiguous signals, the safest reality was constructed. (p. 217)

Those old enough to remember The Church of the Subgenius from the 1970s may or may not be surprised to find that that UFO religion parody had it right: "slack" is vitally important. Perrow writes about how tight-coupling in complex systems means that there is little margin for error. When things go bad in a system with little or no slack, the badness cascades through the system very quickly. Slack, which is just another word for decoupling, gives more opportunities to respond, including improvisationally, to failure in both space and time. This will seem familiar to designers of real-time software: the insertion of loose-coupling, for example in the form of message queues, between producer and consumer processes may increase stability at the expense of real-time response. Chemical and manufacturing plant designers face the same trade-offs.
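
As a concrete (and entirely made up) illustration of that trade-off, here's a minimal Python sketch of a bounded message queue decoupling a producer from a slower consumer; the queue is the slack, absorbing bursts at the cost of added latency.

import queue
import threading
import time

q = queue.Queue(maxsize=16)     # the "slack": a bounded buffer between stages

def producer():
    for n in range(100):
        q.put(n)                # blocks only when all of the slack is used up
    q.put(None)                 # sentinel: tell the consumer we're done

def consumer():
    while True:
        n = q.get()
        if n is None:
            break
        time.sleep(0.01)        # pretend the consumer is slower than the producer

threading.Thread(target=producer).start()
consumer()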

Perrow makes the distinction between processes - chemical, nuclear, and otherwise - that are "transformative" versus "additive". The former can have much more complex interactions between their components, and may feature non-linear behavior which may not be understood by the designer of the process. He likens the additive production process to making a sandwich, whereas the transformative process is making a soufflé: the end result is drastically greater than the sum of its component parts.

Perrow writes about how safety systems frequently have unintended consequences that nullify their effects. The introduction of shipboard radar did not yield the expected reduction in maritime accidents because ships used it as an excuse to go faster. This is true in chemical and manufacturing processes as well; safety features allow the plants to be run at higher levels of production, resulting in greater revenue at the same accident rate.

The title of Perrow's book stems from his assertion that given complex systems with tight-coupling and non-linear behavior, in which safety subsystems are either subverted or exploited, accidents are inevitable, or in some sense, normal, to be expected.

My biggest criticism of Normal Accidents is that Perrow allows his own politics to influence his more analytical writing to the point that sometimes he sounds like a conspiracy theorist. He frequently talks about how the "elite" (by which he means executive management) may sacrifice safety for personal gain. In fact, people act in the way that they are incentivized to act (that's a good basic definition of the field of economics right there). Perrow recognizes this at some level: he uses the term "practical drift" to describe an economically-motivated departure of formal processes, in the form of a local adaptation, from what was required by reality, for example reducing the frequency of routine maintenance to reduce down-time and save money. That's the principal theme of Sidney Dekker's book featured in the next section.
(Edit 2020-01-28: Subsequent reading on my part suggests that the term "elites" is widely used amongst sociologists, at least those I read who are writing in the field of failure analysis. I find the term value-laden and disconnected from my own experience.)
Perrow was a sociology professor at Yale when he originally wrote Normal Accidents in 1984, inspired by the events at Three Mile Island in 1979. This book, updated in 1999, has been widely cited by accident researchers since its publication. His book originally came under fire for his prediction of the likelihood of more nuclear accidents, but the events at Fukushima may have caused people to reconsider that criticism. The enormous chemical plant explosion in West, Texas in 2013 makes you suspect that not much has changed since Perrow wrote his chapter on chemical plant catastrophes. His book predates the events of 9/11, but in his chapter on aircraft safety he spookily mentions that although aircraft could hit buildings, they somehow remarkably do not (although he wasn't considering deliberate acts of terrorism).

Sidney Dekker, Drift Into Failure: From Hunting Broken Components to Understanding Complex Systems, Ashgate Publishing Company, 2011

[Bhopal, Flixborough, and Chernobyl] were the effects of a systematic migration of organizational behavior under the influence of pressure toward cost-effectiveness in an aggressive, competitive environment. 
- Jens Rasmussen and Inge Svedung, Proactive Risk Management in a Dynamic Society, Räddningsverket, Swedish Rescue Services Agency, 2000

When I read that quote on page one of Drift Into Failure, I knew I was hooked. Dekker is the first researcher I have read on this topic who has expressed what has been percolating in my brain for several years now: market forces play a significant role, perhaps the most important role, in why complex systems fail. (He also made me a fan of Danish safety researcher Jens Rasmussen, who has apparently been writing about this very thing for many years.)

The gist of Dekker's assertion is that economic forces frequently cause an incremental departure from normal operating procedure, creating circumstances in which systems gradually over time move outside of their envelope of safe operation, and finally, sometimes catastrophically, beyond their safety margin. This incremental departure - for example, gradually increasing maintenance intervals to reduce downtime and save money, a change which is eventually accepted as the norm because nothing bad happened (yet) - is termed normalization of deviance (quoting the research of Diane Vaughan [2], of whom I am also now a fan). The perhaps unwitting acceptance of a greater level of risk - during which the system has not yet failed even though it is operating beyond its design parameters - is risk homeostasis. This gradual slide into the danger zone is what Dekker refers to as a drift into failure. Software engineers will recognize that Dekker is describing an effect in our field known as technical debt: the continual and gradual cutting of corners in the design and implementation of a software system that eventually results in the system becoming overwhelmingly unmaintainable.

Dekker describes a diagram from another paper by Rasmussen [3] that succinctly sums this up.

[Figure: Rasmussen's drift model diagram, as described by Dekker.]

Market forces ("Management Pressure toward Efficiency") and labor forces ("Gradient toward Least Effort") tend to cause the operational point of a complex system to migrate into its error margin and beyond. Once the operational point passes the "Boundary of functionally acceptable performance", the system may fail, suddenly and catastrophically.

I glommed onto this so strongly because I've been thinking about this for years. It has surely colored my choices in reading material as I've read more and more popular books on economics, ranging from traditional micro- and macro-economics to the more esoteric game theory, and most recently a book on organization economics. I like reading about economics for the same reason I enjoy reading about physics: economics describes reality in a way that I find useful. Dekker's book was the first I've read to make explicit the connection I'd been suspecting between economics and system failure. The near inevitability of economic forces pushing systems into failure harkens back to Perrow's "normal accidents".

Drift into failure isn't merely a hypothesis. Dekker, himself both a safety researcher and a commercial airline pilot, writes at length about Alaska Airlines flight 261, a McDonnell Douglas MD-80 that on January 31, 2000 took off from Puerto Vallarta, Mexico destined for Seattle, Washington. In Dekker's harrowing transcription of the flight deck voice recorder, the pilots vainly try to regain control of the aircraft after the sudden failure of its trim jackscrew-nut assembly, a mechanical component that controlled its horizontal stabilizer. The rear stabilizer eventually slid to its far end of travel and stuck, creating enormous downward pitch forces and rendering the aircraft uncontrollable. For a brief period the pilots resorted to flying the huge commercial airliner inverted, a tactic that almost worked. But shortly afterwards the aircraft hit the Pacific Ocean near Port Hueneme, California, breaking apart, killing all five crew and eighty-three passengers.

The story of how the jackscrew-nut on flight 261 failed became a classic case study in normalization of deviance. When the MD-80 was launched in the mid-1960s, McDonnell Douglas recommended that the trim jackscrew assembly be lubricated every 300 to 350 flight hours, the so-called "B" maintenance interval. That means grounding the plane every few weeks. The access hatch to reach the assembly was small and awkwardly located, making the job difficult and time consuming. So in 1985, the airline extended the lubrication interval to every 700 hours, or every other "B" maintenance interval. In 1987, the "B" maintenance itself was moved to every 500 flight hours, making the interval between lubrications 1000 flight hours. In 1988, the "B" maintenance was eliminated, and that work was redistributed into the "A" and "C" maintenance. The lubrication was scheduled for every eighth "A" maintenance, done every 125 hours, so the jackscrew was still lubricated every 1000 hours. In 1991, the "A" maintenance was extended to 150 flight hours, leaving the lubrication to be done every 1200 hours. In 1994, the "A" maintenance was extended to 200 hours: lubrication every 1600 flight hours. In 1996, the airline moved the lubrication task from the "A" maintenance task list to a task list that was done every eight calendar months, regardless of flight time. For flight 261 that might have translated to the jackscrew assembly being lubricated every 2550 hours. The jackscrew assembly recovered from the ocean floor did not show any evidence of ever having been lubricated.

Each individual change in the maintenance schedule for the jackscrew assembly was by itself a minor alteration. Doubling the interval from 350 hours to 700 hours still placed the maintenance well within the error margin for that component. Each change was come by honestly, to save money and reduce effort, with no conscious intent to impact safety. But over time, each incremental change, done without consideration of any of the prior changes, gradually pushed the system to a maintenance interval that was nearly ten times the duration recommended by the manufacturer.
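
To see how quickly those individually modest steps compound, here's a short sketch of my own that just tabulates the figures quoted above, treating the 1996 calendar-based schedule as the roughly 2550 flight hours estimated for flight 261, and using 300 hours, the low end of the original recommendation, as the baseline.

# Effective lubrication interval for the trim jackscrew assembly, in flight
# hours, per the schedule changes described above.
RECOMMENDED = 300
schedule = [
    ("original", 350),      # every "B" check, 300 to 350 flight hours
    ("1985",     700),      # every other "B" check
    ("1987",    1000),      # "B" check interval moved to 500 flight hours
    ("1988",    1000),      # "B" check eliminated; every eighth 125 hour "A" check
    ("1991",    1200),      # "A" check extended to 150 flight hours
    ("1994",    1600),      # "A" check extended to 200 flight hours
    ("1996",    2550),      # calendar-based; estimated flight hours for flight 261
]
for year, hours in schedule:
    print("%-8s %4d flight hours (%.1f times the original recommendation)"
          % (year, hours, hours / float(RECOMMENDED)))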

If, like me, you have ever worked in a code base of millions of lines of code that were the result of efforts of hundreds of developers over the span of decades, this may all sound familiar. As I myself once said: "How does a switch statement get to be four thousand lines long? One line at a time, baby, one line at a time."

John Gall, Systemantics: How Systems Work and Especially How They Fail, Pocket Books, 1975

It's hard to believe from the perspective of 2014, but way back in the 1970s it was acceptable, and even not that uncommon, to write a business best seller that was short, funny, and wise. Books like The Peter Principle and Up the Organization and Systemantics. How I long for those days.

It was in the 1970s, when I was a systems programmer in an IBM mainframe shop, that I read Systemantics because it was recommended to me by my colleague (and now old friend) Mike Manuel. I returned to it and re-read it recently because I saw it cited in a contemporary article on software engineering.

Gall's book holds up. Gall, who was, of all things, a pediatrician, spent much of his life studying and writing about why systems fail. He was studying more or less human systems -- organizations -- but what he wrote about applies to technological systems as well. He summarizes his research in a few dozen short dictums. Here are a few that are among my favorites.

The real world is whatever is reported to the system. 
The bigger the system, the narrower and more specialized the interface with individuals. 
A simple system, designed from scratch, sometimes works. 
A complex system that works is invariably found to have evolved from a simple system that works. 
A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system. 
Complex systems usually operate in failure mode. 
When a fail-safe system fails, it fails by failing to fail safe. 
Loose systems last longer and work better.

To no one's surprise, Gall is frequently quoted in the software engineering literature. Many of Gall's principles described in this short (158 page) and very entertaining book are expounded upon in much longer and far less readable tomes. As a litmus test, if you didn't find yourself nodding at many - perhaps all - of the dictums quoted above, then I question whether or not you are an experienced software engineer. Heck, the real world is whatever is reported to the system and complex systems usually operate in failure mode alone are worth the price of admission. I had to learn those the hard way while developing error recovery algorithms for a large telecommunications system.

Gall coins the term anergy as the opposite of energy, equivalent in a biological system to torpor: "A coiled spring is full of energy. When fully uncoiled, it is full of anergy." (p. 149) Anergy is measured in units of effort required to bring about the desired change. Hence:

The Law of Conservation of Anergy: the total amount of anergy in the universe is constant.

It may not seem obvious at first glance, but this simple conservation law explains unintended consequences, incentive distortion, measurement dysfunction, perverse incentives, the tragedy of the commons, really a whole class of problems discussed in business and economics. If you fix one problem, the total number of problems in the universe is not reduced. Because anergy must be conserved, other problems must be spontaneously created as a result of your actions. There is no end to it.

Gall's book was so successful that he published a whole raft of follow-on books. But those are mostly just expansions of the basic ideas he laid out so succinctly in his original paperback. In a way, they just represent one more large complex system that evolved from a small simple system that worked.

Footnotes

[1] Dietrich Dörner, The Logic of Failure, Basic Books, 1997

[2] Diane Vaughan, The Challenger Launch Decision, University of Chicago Press, 1997

[3] Jens Rasmussen, "Risk Management In A Dynamic Society: A Modelling Problem", Safety Science, 27.2/3, Elsevier Science Ltd., 1997, pp. 183-213

(Footnotes added 2020-01-23)

Monday, March 03, 2014

Ahoy-hoy!

Little did I know when I wrote a little bit of telephone dialing history in DTMF Tones, Fourier Transforms, and Spectral Analysis that just a few days later I would be living it. (If you're into this sort of thing, I just updated that original article with photographs of vintage telephones from my personal collection for illustrative purposes. Doesn't everyone have a collection of vintage telephones?)

Mrs. Overclock and I recently paid a visit to Tucson, Arizona to spend time with family on the eve of our thirtieth wedding anniversary. We spent a couple of nights at the downtown Hotel Congress, built in 1919 and best known historically as having played a role in the capture of the notorious John Dillinger in 1934.

One of the first things I noticed when we entered our room was the rotary dial phone.

[Photo: the rotary dial telephone in our room.]

The rotary dial room phones were in turn connected to a manual telephone switchboard stationed behind the reception desk in the main lobby. (This photograph is actually of one of the two additional inactive manual switchboards on display in the hotel lobby.) Notice the switchboard also has a rotary dial.

[Photo: a manual telephone switchboard on display in the hotel lobby.]

Here's a close up of the tip-and-ring connectors on one of those switchboards. As you can see, they are almost identical to the "phone plug" shown in my original article, which I purchased at my local Radio Shack, intended for applications like quarter-inch stereo headphone jacks.

[Photo: a close up of the switchboard's tip-and-ring plugs.]

As I've said before, telephony technologies tend to be long lived. This is partly due to the network effect: the more people that were able to connect using a technology, the more useful and hence the more valuable it was. This was frequently an issue for telecom developers of yore as they discovered that customers expected to amortize their telecommunications equipment purchases over a decade or more. Hence, telephony equipment -- both hardware and software -- was typically built to last.