Tuesday, March 18, 2014

Maybe the System Isn't the Solution - Maybe It's the Problem

I read a lot. It takes discipline; I set aside at least an hour every single day for reading. I read both fiction and non-fiction, but as I've gotten older I've definitely trended more towards non-fiction. Even the most speculative of fiction authors often have a hard time matching the amazing stuff I find in the news every day. I think it's one of the reasons that one of my favorite fiction authors, William Gibson, has transitioned from writing about the near future to writing about the recent past. You can't make this stuff up.

Much of my non-fiction reading has been in the realms of economics and of how and why complex systems fail. My initial motivation for both topics was professional interest. I'm a little surprised to discover that my recent choices in reading material tend to combine the two, a relationship that has been slowly dawning on me over the past decade or so.

Charles Perrow, Normal Accidents: Living with High Risk Technologies, Princeton University Press, 1999

On June 1, 1974, in Flixborough, England, a chemical plant that manufactured a component of nylon exploded. The blast was estimated to be equivalent to as much as 45 tons of TNT. The plant was destroyed, along with all of the maintenance records that might have provided more details as to the exact cause of the disaster. Buildings a thousand feet away from the plant were damaged. Windows a mile and three quarters away were broken. Twenty-eight people were killed, and at least eighty-nine were injured. It is believed that a cloud of cyclohexane vapor, a chemical used in the production process, escaped and ignited, generating the immense fuel-air explosion.

This is one of the many catastrophic accidents that Perrow describes in detail in his book. I would be lying if I didn't admit to a certain amount of fascinated schadenfreude over some of the apocalyptic events he documents. Two common themes emerge: non-linear dynamics and tight coupling.

Perrow writes that the exact functioning of many of the reactions used for production in large commercial chemical plants is not well understood theoretically, even by experts in the field. In an effort to make such plants more efficient, their production systems are full of positive and negative feedback loops which can (and do) react in non-linear ways, and which are only stable within a narrow range of operating conditions. Even the physics may not be well understood; Perrow talks about how neutron bombardment embrittlement of nuclear reactor vessels came as a surprise.
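
For the software-minded, here is a tiny toy of my own devising - nothing from Perrow's book - that illustrates what a narrow stable operating range looks like. The classic logistic map stands in for any process governed by feedback: nudge its gain a little and its behavior flips from settling quietly to a fixed point to never settling at all.

/* A toy of my own, not from Perrow's book, showing how a simple non-linear
 * feedback loop can be stable in one narrow range of operating conditions and
 * unpredictable just outside it. The logistic map x' = r * x * (1 - x) stands
 * in for any process governed by feedback. */

#include <stdio.h>

static void iterate(double gain)
{
    double x = 0.5;
    for (int ii = 0; ii < 100; ++ii) {
        x = gain * x * (1.0 - x);           /* the feedback loop */
    }
    printf("gain %.1f:", gain);
    for (int ii = 0; ii < 5; ++ii) {        /* show the next five values */
        x = gain * x * (1.0 - x);
        printf(" %.6f", x);
    }
    printf("\n");
}

int main(void)
{
    iterate(2.9);                           /* stable: settles to a fixed point */
    iterate(3.9);                           /* chaotic: never settles down */
    return 0;
}

Chemical plants and nuclear reactors are not logistic maps, of course; the point is only that a system driven by non-linear feedback can behave completely differently just outside the range its operators are accustomed to.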

It's no coincidence that the phrase "going non-linear" is a euphemism for a situation going very wrong very quickly. In The Logic of Failure: Recognizing and Avoiding Error in Complex Situations (which I wrote about in Imperfect People Build Imperfect Systems) [1], psychologist Dietrich Dörner writes about how humans really suck at conceptualizing and understanding complex non-linear systems. This means when complex systems like chemical plants enter an unexpected region of their operational parameters, typically due to one or more failures or mistakes lining up in time, it is difficult for human operators to understand what is happening, or to even recognize that something is happening at all. My own experience suggests it is difficult for designers of digital control systems to anticipate such failures, particularly when the systems under their control suddenly and unexpectedly depart from their normal operational envelope. Perrow captures the essence of non-linear behavior when he writes that it is a sensitivity to "trivial events in nontrivial systems" (p. 43).

Perrow echoes Dörner when he writes about how the cognitive limitations of humans limit their response to failures in complex systems.

We construct an expected world because we can't handle the complexity of the present one, and then process the information that fits the expected world, and find reason to exclude the information that might contradict it. (p. 214)
Confronted with ambiguous signals, the safest reality was constructed. (p. 217)

Those old enough to remember the Church of the SubGenius from the 1970s may or may not be surprised to find that that UFO religion parody had it right: "slack" is vitally important. Perrow writes about how tight-coupling in complex systems means that there is little margin for error. When things go bad in a system with little or no slack, the badness cascades through the system very quickly. Slack, which is just another word for decoupling, gives more opportunities to respond to failure, even improvisationally, in both space and time. This will seem familiar to designers of real-time software: the insertion of loose-coupling, for example in the form of message queues, between producer and consumer processes may increase stability at the expense of real-time response. Chemical and manufacturing plant designers face the same trade-offs.
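
For concreteness, here is a minimal sketch, entirely my own and not from any of the books discussed here, of what that slack looks like in code: a bounded message queue, built on POSIX threads, decoupling a producer from a slower consumer. The capacity of the queue is the slack; the producer blocks only when the slack is exhausted, and the price paid for the added stability is latency.

/* A minimal sketch of a bounded message queue decoupling a producer from a
 * slower consumer. This is my own illustration; all names are invented. The
 * queue capacity is the "slack": the producer can run ahead of the consumer
 * by up to CAPACITY messages before it blocks. */

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define CAPACITY 8              /* the slack between producer and consumer */
#define MESSAGES 32             /* total messages in this toy run */

static int queue[CAPACITY];
static int head = 0;
static int tail = 0;
static int count = 0;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t notfull = PTHREAD_COND_INITIALIZER;
static pthread_cond_t notempty = PTHREAD_COND_INITIALIZER;

static void * producer(void * arg)
{
    (void)arg;
    for (int ii = 0; ii < MESSAGES; ++ii) {
        pthread_mutex_lock(&mutex);
        while (count == CAPACITY) {             /* slack exhausted: block */
            pthread_cond_wait(&notfull, &mutex);
        }
        queue[tail] = ii;
        tail = (tail + 1) % CAPACITY;
        ++count;
        pthread_cond_signal(&notempty);
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

static void * consumer(void * arg)
{
    (void)arg;
    for (int ii = 0; ii < MESSAGES; ++ii) {
        pthread_mutex_lock(&mutex);
        while (count == 0) {                    /* nothing queued: block */
            pthread_cond_wait(&notempty, &mutex);
        }
        int message = queue[head];
        head = (head + 1) % CAPACITY;
        --count;
        pthread_cond_signal(&notfull);
        pthread_mutex_unlock(&mutex);
        printf("consumed %d\n", message);
        usleep(1000);                           /* the consumer is the slow side */
    }
    return NULL;
}

int main(void)
{
    pthread_t producing;
    pthread_t consuming;
    pthread_create(&producing, NULL, producer, NULL);
    pthread_create(&consuming, NULL, consumer, NULL);
    pthread_join(producing, NULL);
    pthread_join(consuming, NULL);
    return 0;
}

Shrink CAPACITY to one and the producer and consumer are effectively tightly coupled, marching in lock step; grow it and each side gets more room to absorb the other's hiccups, at the cost of messages sitting in the queue longer.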

Perrow makes the distinction between processes - chemical, nuclear, and otherwise - that are "transformative" versus "additive". The former can have much more complex interactions among their components, and may feature non-linear behavior which may not be understood by the designer of the process. He likens the additive production process to making a sandwich, whereas the transformative process is like making a soufflé: the end result is drastically greater than the sum of its component parts.

Perrow writes about how safety systems frequently have unintended consequences that nullify their effects. The introduction of shipboard radar did not yield the expected reduction in maritime accidents because ships used it as an excuse to go faster. This is true in chemical and manufacturing processes as well; safety features allow the plants to be run at higher levels of production, resulting in greater revenue at the same accident rate.

The title of Perrow's book stems from his assertion that, given complex systems with tight-coupling and non-linear behavior, in which safety subsystems are either subverted or exploited, accidents are inevitable - in some sense normal, to be expected.

My biggest criticism of Normal Accidents is that Perrow allows his own politics to influence his more analytical writing, to the point that he sometimes sounds like a conspiracy theorist. He frequently talks about how the "elite" (by which he means executive management) may sacrifice safety for personal gain. In fact, people act in the way that they are incentivized to act (that's a good basic definition of the field of economics right there). Perrow recognizes this at some level: he uses the term "practical drift" to describe an economically motivated departure of formal processes, in the form of a local adaptation, from what reality requires, for example reducing the frequency of routine maintenance to reduce down-time and save money. That's the principal theme of Sidney Dekker's book featured in the next section.
(Edit 2020-01-28: Subsequent reading on my part suggests that the term "elites" is widely used amongst sociologists, at least those I have read who write in the field of failure analysis. I find the term value-laden and disconnected from my own experience.)
Perrow was a sociology professor at Yale when he originally wrote Normal Accidents in 1984, inspired by the events at Three Mile Island in 1979. The book, updated in 1999, has been widely cited by accident researchers since its publication. It originally came under fire for its prediction of more nuclear accidents, but the events at Fukushima may have caused people to reconsider that criticism. The enormous fertilizer plant explosion in West, Texas in 2013 makes you suspect that not much has changed since Perrow wrote his chapter on chemical plant catastrophes. His book predates the events of 9/11, but in his chapter on aircraft safety he spookily mentions that although aircraft could hit buildings, they somehow remarkably do not (although he wasn't considering deliberate acts of terrorism).

Sidney Dekker, Drift Into Failure: From Hunting Broken Components to Understanding Complex Systems, Ashgate Publishing Company, 2011

[Bhopal, Flixborough, and Chernobyl] were the effects of a systematic migration of organizational behavior under the influence of pressure toward cost-effectiveness in an aggressive, competitive environment. 
- Jens Rasmussen and Inge Svedung, Proactive Risk Management in a Dynamic Society, Räddningsverket, Swedish Rescue Services Agency, 2000
When I read that quote on page one of Drift Into Failure, I knew I was hooked. Dekker is the first researcher I have read on this topic who has expressed what has been percolating in my brain for several years now: market forces play a significant role, perhaps the most important role, in why complex systems fail. (He also made me a fan of Danish safety researcher Jens Rasmussen, who has apparently been writing about this very thing for many years.)

The gist of Dekker's assertion is that economic forces frequently cause an incremental departure from normal operating procedure, creating circumstances in which systems gradually over time move outside of their envelope of safe operation, and finally, sometimes catastrophically, beyond their safety margin. This incremental departure - for example, gradually increasing maintenance intervals to reduce downtime and save money, a change which is eventually accepted as the norm because nothing bad happened (yet) - is termed normalization of deviance (quoting the research of Diane Vaughan [2], of whom I am also now a fan). The perhaps unwitting acceptance of a greater level of risk - during which the system has not yet failed even though it is operating beyond its design parameters - is risk homeostasis. This gradual slide into the danger zone is what Dekker refers to as a drift into failure. Software engineers will recognize that Dekker is describing an effect in our field known as technical debt: the continual and gradual cutting of corners in the design and implementation of a software system that eventually results in the system becoming overwhelmingly unmaintainable.

Dekker describes a diagram from another paper by Rasmussen [3] that succinctly sums this up.



Market forces ("Management Pressure toward Efficiency") and labor forces ("Gradient toward Least Effort") tend to cause the operational point of a complex system to migrate into its error margin and beyond. Once the operational point passes the "Boundary of functionally acceptable performance", the system may fail, suddenly and catastrophically.
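
To make the diagram concrete, here is a toy numerical sketch of my own - not Rasmussen's or Dekker's actual model, and with every number in it arbitrary - in which an operating point takes a small, locally reasonable step each period, biased slightly outward by cost pressure and by the gradient toward least effort. No single step looks alarming, but the bias eventually carries the point across the boundary.

/* A toy numerical sketch of the migration that Rasmussen's diagram depicts.
 * This is my own illustration, not Rasmussen's or Dekker's model. The
 * operating point takes a small random step each period; cost pressure and
 * the gradient toward least effort bias those steps toward the boundary of
 * functionally acceptable performance. */

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double point = 0.0;                     /* drift from the original design point */
    const double boundary = 10.0;           /* boundary of acceptable performance */
    const double cost_pressure = 0.05;      /* management pressure toward efficiency */
    const double effort_gradient = 0.05;    /* pressure toward least effort */

    srand(1);                               /* fixed seed: reproducible run */
    for (int period = 1; period <= 1000; ++period) {
        /* Each period's adjustment is small and locally reasonable, but biased. */
        double step = ((double)rand() / RAND_MAX) - 0.5;
        point += step + cost_pressure + effort_gradient;
        if (point > boundary) {
            printf("crossed the boundary at period %d\n", period);
            return 0;
        }
    }
    printf("still inside the boundary after 1000 periods\n");
    return 0;
}

The interesting part is that removing the two bias terms removes the systematic drift, while each individual step looks exactly the same either way; no one adjustment is the culprit.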

I glommed onto this so strongly because I've been thinking about this for years. It has surely colored my choices in reading material as I've read more and more popular books on economics, ranging from traditional micro- and macro-economics to the more esoteric game theory, and most recently a book on organization economics. I like reading about economics for the same reason I enjoy reading about physics: economics describes reality in a way that I find useful. Dekker's book was the first I've read to make explicit the connection I'd been suspecting between economics and system failure. The near inevitability of economic forces pushing systems into failure harkens back to Perrow's "normal accidents".

Drift into failure isn't merely a hypothesis. Dekker, himself both a safety researcher and a commercial airline pilot, writes at length about Alaska Airlines flight 261, a McDonnell Douglas MD-80 that on January 31, 2000, took off from Puerto Vallarta, Mexico, destined for Seattle, Washington. In Dekker's harrowing transcription of the flight deck voice recorder, the pilots vainly try to regain control of the aircraft after the sudden failure of its trim jackscrew-nut assembly, the mechanical component that moved its horizontal stabilizer. The stabilizer eventually slid to the far end of its travel and jammed, creating enormous downward pitch forces and rendering the aircraft uncontrollable. For a brief period the pilots resorted to flying the huge commercial airliner inverted, a tactic that almost worked. But shortly afterwards the aircraft hit the Pacific Ocean near Port Hueneme, California, breaking apart and killing all five crew and eighty-three passengers.

The story of how the jackscrew-nut on flight 261 failed became a classic case study in normalization of deviance. When the MD-80's ancestor, the DC-9, was launched in the mid-1960s, the manufacturer recommended that the trim jackscrew assembly be lubricated every 300 to 350 flight hours, the so-called "B" maintenance interval. That means grounding the plane every few weeks. The access hatch to reach the assembly was small and awkwardly located, making the job difficult and time consuming. So in 1985, the airline stretched the lubrication interval to every 700 hours, or every other "B" maintenance. In 1987, the "B" maintenance itself was moved to every 500 flight hours, making the interval between lubrications 1000 flight hours. In 1988, the "B" maintenance was eliminated, and that work redistributed into the "A" and "C" maintenance. The lubrication was scheduled for every eighth "A" maintenance, done every 125 hours, so the jackscrew was still lubricated every 1000 hours. In 1991, the "A" maintenance was extended to 150 flight hours, leaving the lubrication to be done every 1200 hours. In 1994, the "A" maintenance was extended to 200 hours: lubrication every 1600 flight hours. In 1996, the airline moved the lubrication task from the "A" maintenance task list to a task list that was done every eight calendar months, regardless of flight time. For flight 261 that might have translated to the jackscrew assembly being lubricated every 2550 hours. The jackscrew assembly recovered from the ocean floor showed no evidence of ever having been lubricated.
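
Because the numbers are easy to lose track of, here is a small program of my own that simply tabulates the schedule changes described above (the final figure being the approximate flight-hour equivalent of the eight-calendar-month interval quoted in the text), showing how each locally modest revision compounded.

/* A small program of my own that tabulates the schedule changes described
 * above. The hours are the figures quoted in the text; the 1996 entry is the
 * approximate flight-hour equivalent of the eight-calendar-month interval. */

#include <stdio.h>

int main(void)
{
    struct { const char * when; const char * change; double hours; } schedule[] = {
        { "original", "lubricate at every \"B\" maintenance",         350.0 },
        { "1985",     "every other \"B\" maintenance",                700.0 },
        { "1987",     "\"B\" maintenance stretched to 500 hours",    1000.0 },
        { "1988",     "every eighth \"A\" maintenance at 125 hours", 1000.0 },
        { "1991",     "\"A\" maintenance stretched to 150 hours",    1200.0 },
        { "1994",     "\"A\" maintenance stretched to 200 hours",    1600.0 },
        { "1996",     "every eight calendar months (approximate)",   2550.0 },
    };
    const int changes = sizeof(schedule) / sizeof(schedule[0]);
    const double baseline = schedule[0].hours;

    for (int ii = 0; ii < changes; ++ii) {
        printf("%-8s %-44s %6.0f flight hours (%.1fx original)\n",
               schedule[ii].when, schedule[ii].change,
               schedule[ii].hours, schedule[ii].hours / baseline);
    }
    return 0;
}

The last column is the point: no single step is much more than double the one before it, yet the end state is more than seven times the original recommendation.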

Each individual change in the maintenance schedule for the jackscrew assembly was by itself a minor alteration. Doubling the interval from 350 hours to 700 hours still placed the maintenance well within the error margin for that component. Each change was made honestly, to save money and reduce effort, with no conscious intent to compromise safety. But over time, each incremental change, done without consideration of any of the prior changes, gradually pushed the system to a maintenance interval more than seven times the duration recommended by the manufacturer.

If, like me, you have ever worked in a code base of millions of lines of code that were the result of efforts of hundreds of developers over the span of decades, this may all sound familiar. As I myself once said: "How does a switch statement get to be four thousand lines long? One line at a time, baby, one line at a time."

John Gall, Systemantics: How Systems Work and Especially How They Fail, Pocket Books, 1975

It's hard to believe from the perspective of 2014, but way back in the 1970s it was acceptable, and even not that uncommon, to write a business best seller that was short, funny, and wise. Books like The Peter Principle and Up the Organization and Systemantics. How I long for those days.

It was in the 1970s, when I was a systems programmer in an IBM mainframe shop, that I read Systemantics because it was recommended to me by my colleague (and now old friend) Mike Manuel. I returned to it and re-read it recently because I saw it cited in a contemporary article on software engineering.

Gall's book holds up. Gall, who was, of all things, a pediatrician, spent much of his life studying and writing about why systems fail. He was studying more or less human systems - organizations - but what he wrote about applies to technological systems as well. He summarizes his research in a few dozen short dictums. Here are a few that are among my favorites.

The real world is whatever is reported to the system. 
The bigger the system, the narrower and more specialized the interface with individuals. 
A simple system, designed from scratch, sometimes works. 
A complex system that works is invariably found to have evolved from a simple system that works. 
A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system. 
Complex systems usually operate in failure mode. 
When a fail-safe system fails, it fails by failing to fail safe. 
Loose systems last longer and work better.

To no one's surprise, Gall is frequently quoted in the software engineering literature. Many of Gall's principles described in this short (158 page) and very entertaining book are expounded upon in much longer and far less readable tomes. As a litmus test, if you didn't find yourself nodding at many - perhaps all - of the dictums quoted above, then I question whether you are an experienced software engineer. Heck, the real world is whatever is reported to the system and complex systems usually operate in failure mode alone are worth the price of admission. I had to learn those the hard way while developing error recovery algorithms for a large telecommunications system.

Gall coins the term anergy as the opposite of energy, equivalent in a biological system to torpor: "A coiled spring is full of energy. When fully uncoiled, it is full of anergy." (p. 149) Anergy is measured in units of effort required to bring about the desired change. Hence:

The Law of Conservation of Anergy: the total amount of anergy in the universe is constant.

It may not seem obvious at first glance, but this simple conservation law explains unintended consequences, incentive distortion, measurement dysfunction, perverse incentives, the tragedy of the commons, really a whole class of problems discussed in business and economics. If you fix one problem, the total number of problems in the universe is not reduced. Because anergy must be conserved, other problems must be spontaneously created as a result of your actions. There is no end to it.

Gall's book was so successful that he published a whole raft of follow-on books. But those are mostly just expansions of the basic ideas he laid out so succinctly in his original paperback. In a way, they just represent one more large complex system that evolved from a small simple system that worked.

Footnotes

[1] Dietrich Dörner, The Logic of Failure, Basic Books, 1997

[2] Diane Vaughan, The Challenger Launch Decision, University of Chicago Press, 1997

[3] Jens Rasmussen, "Risk Management In A Dynamic Society: A Modelling Problem", Safety Science, 27.2/3, Elsevier Science Ltd., 1997, pp. 183-213

(Footnotes added 2020-01-23)

5 comments:

chris said...

'The real world is whatever is reported to the system'
=
'The world is everything that is the case.' Ludwig Wittgenstein.

'If you fix one problem, the total number of problems in the universe is not reduced. Because anergy must be conserved, other problems must be spontaneously created as a result of your actions. There is no end to it.' = the second law of thermodynamics.

Chip Overclock said...

I don't know the context from which the quote from Wittgenstein was taken, but since he died in 1951, I suspect he didn't have quite the same perspective as I do. When writing error recovery algorithms for large complex systems, all decisions are made not on the basis of reality, but on the basis of sensor (of some kind) input. My experience is that the two views may differ substantially.

I also think that Gall was being much more cynical^H^H^H^H^H^H^Hpragmatic than just entropy suggests. Biological systems actually battle entropy by creating higher states of order. But Gall is suggesting that this doesn't matter. That's been my experience too, and I think is more in the domain of Economics than Physics.

Todd Hoff said...

When I read "In fact, people act in the way that they are incentivized to act " it seems to slice away individual responsibility and shift agency to some other entity, in this case a system of economics, which is of course produced by humans, creating a feedback loop. The Pinto case, for example, is perfectly rational and is dictated by economics, but it was also immoral. They could have acted differently. Nothing was dictated. Good coverage in The Lucifer Effect. This is a version of an age old discussion of how to create a virtuous society (Plato, Augustine, Hobbes, Puritans, US Constitution, etc) reformulated in terms of complex systems and market forces instead of forms of government organization and their relation to the individual. And as such seems equally hopeless :-)

Chip Overclock said...

Todd: first, a big THANK YOU for the shout out on your popular blog "High Scalability"

http://highscalability.com

which I always appreciate. Now: I don't mean to excuse people for being deliberately immoral or unethical. But it's a rare person who has the luxury of losing their jobs versus, say, cutting costs by 20%, particularly when it can't be determined except in hindsight that cutting costs leads to catastrophe. For sure, there are evil people that do bad things. But I agree with Dekker and Rasmussen who write that executives moving complex systems into their error margin aren't deliberately or consciously doing so but merely responding to economic pressures. They make a change, and nothing bad happens. They make another change, and still nothing bad happens. The fact that the system they are changing is running on the hairy edge of disaster isn't apparent... at least to them. This is also addressed in the book I just finished, THE ORG, by Fisman and Sullivan, on organizational economics. They devote an entire chapter to the debacle that was British Petroleum (BP), whose disastrous Deepwater Horizon oil spill was merely the culmination of decades of driving costs out of the business, resulting in other disasters. Hanlon's Razor: "Never ascribe to malice that which is adequately explained by stupidity."

Anonymous said...

What is reality for those systems if not their sensor input? It is what is seen and 'understood' by the system. One could rephrase it as: the world is everything that is the case. Or maybe we are lost in translation from the German?
'Die Welt ist alles, was der Fall ist.'

I think that each system can be defined by a few factors. One of them is entropy. According to the second law of thermodynamics, entropy is ever growing.