Saturday, December 29, 2012

Scalability, Economics, and the Fermi Paradox

For some time now I've been wondering if my professional interests in technological scalability and my dilettante interests in economics and the Fermi Paradox might all be connected.

In "Is US Economic Growth Over?" economist Robert Gordon recently argues that the U.S. and the rest of the developed world is at an end of a third, and smaller, industrial revolution.
The analysis in my paper links periods of slow and rapid growth to the timing of the three industrial revolutions: 
  • IR #1 (steam, railroads) from 1750 to 1830;  
  • IR #2 (electricity, internal combustion engine, running water, indoor toilets, communications, entertainment, chemicals, petroleum) from 1870 to 1900; and
  • IR #3 (computers, the web, mobile phones) from 1960 to present. 
It provides evidence that IR #2 was more important than the others and was largely responsible for 80 years of relatively rapid productivity growth between 1890 and 1972.
Once the spin-off inventions from IR #2 (airplanes, air conditioning, interstate highways) had run their course, productivity growth during 1972-96 was much slower than before. In contrast, IR #3 created only a short-lived growth revival between 1996 and 2004. Many of the original and spin-off inventions of IR #2 could happen only once – urbanisation, transportation speed, the freedom of women from the drudgery of carrying tons of water per year, and the role of central heating and air conditioning in achieving a year-round constant temperature.
In "Is Growth Over?" economist Paul Krugman ponders this in light of the kind of non-scarcity economy known to anyone who is familiar with the idea of the technological singularity, in which robots replace most manual laborers and artificial intelligences (strong or weak) replace most information workers. As he points out, if all labor done by machines, you can raise the production per capita to any level you want, providing the robots and AIs are not part of what gets counted in the capita. (I should mention that Krugman is a science fiction fan himself and is certainly familiar with the writings of folks like Vernor Vinge and Charles Stross. I saw Stross interview Krugman in person at the World Science Fiction Convention in Montreal in 2009. Where do you go to hear Nobel Laureates speak?)

Krugman doesn't explain, however, where the raw materials for this production will come from. In his book The Great Stagnation, economist Tyler Cowen has suggested that economic growth, particularly in the United States, was due to our having taken advantage of low-hanging fruit in the form of things like natural resources, cheap energy, and education. Now that those natural resources are more or less consumed, growth may become much more difficult.

Also recently, on the science fiction blog io9, George Dvorsky has written about The Great Filter, one of the possible explanations for the Fermi Paradox. Long-time readers of this blog, and anyone who knows me well, will recall that I find the Fermi Paradox troubling. The Fermi Paradox is this: given the vast size of space, the vast span of time, and the vast numbers of galaxies, stars, and planets, at least some percentage of which must be habitable, why haven't we heard any radio signals from extraterrestrial sources? It hasn't been for lack of trying. The Great Filter is the hypothesis that there is some fundamental and insurmountable barrier to development that all civilizations come up against.

Another possible explanation for what has been called The Great Silence is that mankind indeed holds a privileged position in the Universe. This can be seen as a pro-religion argument, but it needn't be. It is possible that life is much, much rarer than we have believed. There is actually some evidence to suggest that the Earth itself may occupy a unique region of space in which the physical constants permit life to thrive. (I've written about this in The Fine Structure Constant Isn't.)

Unfortunately, the explanation I find the most compelling (and like the least) is this: the Prisoner's Dilemma in Game Theory suggests that the dominant strategy is for spacefaring civilizations to wipe one another out before they themselves are wiped out by their competition. I call this the "Borg Strategy" (although rather than assimilation, I find a more credible mechanism to be weaponized von Neumann machines). Compare this to the optimal game strategy of cooperation, which I call the "United Federation of Planets Strategy". (I've written about this in The Prisoner's Dilemma, The Fermi Paradox, and War Games.)

In my professional work, particularly with large distributed systems and supercomputing, I have frequently seen issues with scalability. Often it becomes difficult to scale up performance with problem size. Cloud providers like Google and Amazon.com have addressed many problems that we thought were intractable in the past, as has the application of massively parallel processing to many traditional supercomputer applications. But the ugly truth is that cloud/MPP really only solves problems that are "embarrassingly parallel", that is, that naturally break up into many small and mostly independent parts. (I've written about this in Post-Modern Deck Construction.)

Many problems will remain intractable because they fall into the NP category: "Non-Deterministic Polynomial" time (thanks to my old friend David Hemmendinger for that correction), meaning that the only deterministic algorithms known to solve them scale, for example, exponentially with problem size. There are lots of problems in the NP category. Lucky for all of us, encryption and decryption are in the P category, while cryptographic code breaking (so far) appears not to be. True, encryption becomes easier to break as processing power increases, but adding a few more bits to the key increases the work necessary to crack codes exponentially.
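
To make that last point concrete, here's a trivial back-of-the-envelope calculation, not a statement about any particular cipher: every bit added to a key doubles the key space a brute-force attacker has to search. The assumed rate of a trillion keys per second is invented purely for illustration.

#include <stdio.h>

/* Back-of-the-envelope only: years to exhaust a key space of 2^bits keys,
 * assuming a purely hypothetical attacker testing 10^12 keys per second.
 * The point is the doubling per bit, not the absolute numbers. */
int main(void) {
  const double keys_per_second = 1e12;
  const double seconds_per_year = 365.25 * 24.0 * 3600.0;
  double keyspace = 1.0;
  int bits;
  for (bits = 1; bits <= 128; ++bits) {
    keyspace *= 2.0;
    if ((bits % 16) == 0) {
      printf("%3d-bit key: about %.3g years\n",
        bits, keyspace / keys_per_second / seconds_per_year);
    }
  }
  return 0;
}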

What problems in economics are in fact computationally NP? For example, it could be that strategies necessary to more or less optimally manage an economy are fundamentally NP. This is one of the reasons that pro-free-market people give for free markets, where market forces encourage people to "do the right thing" independent of any central management. It really is a kind of crowd-sourced economic management system.

But suppose that there's a limit - in terms of computation or time or both - to how well an economy can work as a function of the number of actors (people, companies) in the economy relative to its available resources. Maybe there's some fundamental threshold past which, if a civilization hasn't yet achieved interstellar travel, it becomes impossible for it ever to do so. This can be compared to the history of the Pacific islanders who became trapped on Easter Island when they cut down the last tree: no more big ocean-going canoes, and, as a side effect of the deforestation, ecological collapse.

This doesn't really need to be an NP-class problem. It may just be that if a civilization doesn't make getting off of its planet a global priority, the time comes when it no longer has the resources necessary for an interplanetary diaspora or even for interstellar communication.

Henry Kissinger once famously said "Every civilization that has ever existed has ultimately collapsed." Is it possible that this is the result of fundamentally non-scalable economic principles, and is in part the explanation for the Fermi Paradox?

Update (2013-01-13)

I just finished Stephen Webb's If the Universe Is Teeming with Aliens... WHERE IS EVERYBODY? Fifty Solutions to the Fermi Paradox and the Problem of Extraterrestrial Life (Springer-Verlag, 2002). Webb presents forty-nine explanations, plus one of his own, at about five pages apiece, that have been put forth by scientists, philosophers, and others. Besides being a good survey of this topic, it's also a good layman's introduction to a number of related science topics like plate tectonics, planetary formation, and the neurological support for language acquisition and processing. I recommend it.

Wednesday, December 26, 2012

Leaving versus Staying

Let us suppose that people in the workforce can be categorized into those that need a compelling reason to leave a job (Group L) and those that need a compelling reason to stay in a job (Group S).

What would the benefits and retention strategies of a company predominantly employing people in Group L look like? Economics suggests that the company employing Group L employees would reach for the lowest common denominator in terms of benefits. The company would cite "best practices" in its industry to justify expending the least amount of money possible: the most basic health insurance, minimal vacation, the least flexible work hours, etc. There would be no financial incentive to do any more than the least it had to do to keep its Group L employees. In fact, if it wanted to get rid of Group L employees, it would have to provide the compelling reason for them to leave, for example, layoffs. The company wouldn't have to worry about retaining Group S employees, nor would it have to worry about getting rid of them. They would likely have already left.

Companies employing folks from Group L have it easy: they just have to make sure they don't give them a reason to leave. Except when they want them to leave. And my guess is that most people in Group L have similar reasons they would find compelling to leave.

What would the benefits and retention strategies of a company predominantly employing people in Group S look like? It would have to go out of its way to offer crazy generous benefits that were so obviously better than any of its competitors' in order to retain Group S employees. Like free day care. Free food. Subsidized mass transit. Time to work on personal projects. Or it would have to put so much money on the table, maybe in the form of bonuses, that it might seem outrageous to those on the outside, and a Group S employee would be foolish to leave. Provided, of course, they didn't think they could get that kind of crazy money through other means. In which case, the outrageous bonuses aren't really a useful retention tool either.

Companies employing folks from Group S have it a lot harder. My guess is most people in Group S have different compelling reasons for staying. So the Group S company has to really scramble to keep their Group S employees.

Is it possible for a Group S employee to work at a Group L company? Sure, although the company probably has no idea why the Group S employee stays. The compelling reason for the Group S employee to stay may be something quite personal, even private. The management of the Group L company will be surprised when the Group S employee leaves, because from the Group L company's point of view, nothing has changed that could motivate the employee to leave.

Is it possible for a Group L employee to work at a Group S company? Maybe, and the Group L employee is probably amazed at how good they've got it. But the Group S company probably tries hard not to hire Group L employees. This could be done by placing all sorts of barriers in the interview and hiring process. Group S employees have to have a really good reason to apply for a particular job, even if it is with a Group S company.

I haven't said anything about the kinds of people that may be in Group L versus Group S. I have merely proposed a way in which you can decide which kind of company you work for. This probably says more about how your employer sees you than it says about you.

Thursday, December 20, 2012

Dead Man Walking

This article is about the event that was the greatest disaster of my career. It is also about the event that was the greatest stroke of luck of my career. It was the same event.

In 2005 I was working for a large Telecommunications Equipment Manufacturer. TEM had an impressive product line ranging from pizza-box-sized VOIP gateways to enormous global communications systems. At the time I was a software developer for TEM's Enterprise Communications Product, software with a code base of millions of lines, the oldest sections of which stretched back to the invention of the C language. ECP was the crown jewel that, directly or indirectly, paid all the bills. Although I was an experienced product developer, I was fairly new to this area of the company, having spent the past several years writing software and firmware, mostly in C++, closer to the hardware. But I had been working on a team that had developed a major new feature for ECP that was going to be in the upcoming release.

TEM was eager to win a big contract with one of their largest customers, a large, well-known Wealth Management Firm. It is likely that some or all of your retirement funds were under their management. WMF wanted a unified communications system for both their big office buildings, of which they had several, and for their smaller satellite offices scattered all over the country, of which they had many.

TEM was so eager to win this big contract, and the timing for WMF's acquisition was such, that my employer decided to preview the latest release of ECP by sending a team to one of WMF's data centers to install it on one of their servers so that WMF could see just how awesome it was, especially with this new feature that I had helped develop. But this new release of ECP not only wasn't in beta with any other customers yet, it hadn't even passed through TEM's own Quality Assurance organization. It was, at best, a development load, and not a terribly mature one at that. But millions of dollars were riding on TEM convincing WMF that ECP was the way to go.

When they asked me to get on the plane, being the fearless sort, I said yes.

Even given my relative inexperience with this code base, I was probably the logical choice. I had been one of the developers of one of the features in which WMF was interested. And getting this new release of ECP on WMF's server was not the usual process easily handled by one of TEM's technical support people. Because of the immaturity of the release, it wasn't a simple update, but a disk swap that required that I open up WMF's server and do some surgery. I had to back up the configuration and parameter database from the prior ECP release, swap the disks, and restore it to the new release. I was traveling with two disk drives in a hard case and a tool kit.

The conditions under which my work was done at the WMF data center were not optimal. I was in a tightly packed equipment room taking up most of a floor of an office building on a large WMF campus. All the work had to be done in the dead of night outside of normal business hours. I was locked in the equipment room without any means of communicating with anyone. If I walked out even to go to the euphemism, I couldn't get back in without finding a phone outside the room and calling someone. I had a mobile phone, but it couldn't get a signal inside the equipment room. For security reasons, there was no internet access. I had to get the install done quickly so that the two other TEM developers who came on site could administer the new features and we could demo them before our narrow maintenance window expired. Security was tight, and time was short. I spent almost all my time at WMF sitting on the floor in a very narrow aisle with tools and parts strewn all around me, and a laptop in my lap connected to the server's maintenance Ethernet port. I got the DB backed up, the disks swapped, and the DB restored.

ECP would not come up.

It core dumped during initialization. I didn't even get to a maintenance screen. The system log file told me nothing that the stack trace didn't already. The ECP log was useless. I swapped the disks again and verified that the prior system came up just fine with the same DB, as expected. I tried the spare disk that I had brought with me, to no avail. I desperately needed a clue. The catastrophic failure was in some part of the enormous application that I knew nothing about. Even if I had, I didn't have access to the code base or any of the usual diagnostic tools while sitting on the floor with my laptop. I had saved the DB, the stack trace, and the core dump on my laptop, but had no way to diagnose this level of failure on site, and no way to reach anyone who could. I knew that I was going to have to declare failure and cart everything back home for analysis.

Later, back at TEM, there were lots of meetings and post-mortems, but, remarkably, not a lot of finger pointing. We all knew it was a very immature release. I engaged other, more experienced ECP developers to diagnose the failure and they set upon fixing it in the development base. Once that was done, I set up an ECP server, sans any actual telecommunications hardware, in my office, and installed WMF's DB on it to verify that it did indeed now come up. In the meantime, TEM's QA organization began testing this new ECP release on their own servers, which did have actual hardware. Just a week or two passed before the powers that be decided that the new release had percolated enough that another attempt would be made. WMF would give TEM and ECP another chance.

I said yes, again. In hindsight, I'm a little surprised they asked.

This time I had a copy of the entire ECP code base on my laptop, although I still had no access to any of the complex diagnostic tools used to troubleshoot the system. The circumstances were identical: the same cast of characters, the same cramped cold equipment room, the same DB, the exact same server. Once again, I backed up the DB, swapped the disk, and restored the DB.

ECP came up. But it refused to do a firmware load onto the boards that allowed the server to communicate with any of the distributed equipment cabinets. ECP was up, but it was effectively useless.

We hadn't seen anything like this in our own QA testing of the new release, even though it used the same boards. My intuition told me that it probably had something to do with WMF's specific DB. We weren't able to test with that DB in QA because the data in the DB is quite specific to the exact hardware configuration of the system, which involved hundreds if not thousands of individual components that we were unable to exactly replicate. The error didn't appear to be in the ECP software base itself, but in the firmware for the communications board, the source code of which I didn't have. And in any case I was not familiar with the hundreds of thousands of lines of C++ that made up that firmware. I personally knew folks at TEM who were, but even though they were standing by in the middle of the night back at the R&D facility, I had no way to contact them while connected to the server in front of me. After some consulting with the other TEM folk on site, and as our narrow maintenance window was closing, I once again declared failure.

As I got on the plane back east to return home, I knew that this was the end of my career at TEM. I took it as a compliment that they didn't fire me. They didn't even ax me in the inevitable next wave of layoffs. There were lots more meetings and post-mortems, some perhaps harsh but in my opinion well-deserved words from TEM management to me, and a lot of discussion about a possible Plan C. But WMF's acquisition timetable had closed. And I knew that I would never be trusted with anything truly important at TEM ever again.

This is not the end of the story.

If you've never worked in a really big product development organization, it may help to know how these things operate.

ECP wasn't a single software product. It was a broad and deep product line incorporating several different types of servers, several possible configurations for some of the servers, many different hardware communications interface boards, and a huge number of features and options, some of which targeted very specific industries or market verticals. Just the ECP software that ran on a server alone was around eight million lines of code, mostly C. The code bases for all of the firmware that ran on the dozens of individual interface and feature boards manufactured by TEM, incorporating many different microprocessors, microcontrollers, FPGAs, and ASICs, and written in many different languages ranging from assembler to C++ to VHDL, added another few million lines of code. As ECP features were added and modified and new hardware introduced, all of this had to be kept in sync by a big globally distributed development organization of hundreds of developers and other experts.

The speed at which new ECP releases were generated by TEM was such that dozens of developers were kept busy fixing bugs in perhaps two prior releases or more, while another team of developers was writing new features for the next release. It was this bleeding edge development release that I had hand carried to WMF. So it was not at all unusual to have at least three branches or forks of the ECP base in play at any one time. As bugs were found in the two prior forks, the fixes had to be ported forward to the latest fork. This was not always simple, since the code the bug fix touched in the older fork may have been refactored, that is, modified, replaced, or even eliminated, in the course of new feature development in the latest fork. While a single code base might have been desirable, it simply wasn't practical given the demands of TEM's large installed user base all over the world, where customers just wanted their bugs fixed so that they could get on with their work and weren't at all interested in new and exciting bugs.

Once I got back home, and got some breathing space between meetings with understandably angry and disappointed managers, I started researching both of the WMF failures. Here is what I discovered: both of the issues I encountered trying to get ECP to run at WMF were known problems. They were documented in the bug reporting system for the prior release, not the development release that I had. The two bug reports were written as a result of TEM's own testing of the prior release. At WMF. At the same data center. On the same server. Those two known bugs had been fixed in the prior release, the very release of ECP that was already running on WMF's test server, but the fixes had not yet been ported forward to the development release that I was using for either of my two site visits. I hadn't known about these issues before; I was new enough to this particular part of the organization that I hadn't been completely conversant with the fu required to search its bug reporting system.

Both of the times I had gotten on the plane to fly to WMF, I was carefully hand carrying disk drives containing software that was already known to be incapable of working there. In hindsight, my chances of success were guaranteed to be zero. It had always been a suicide mission.

Here's what keeps me awake some nights. There was a set of folks at TEM who knew we were taking this development release to WMF. There was a set of folks at TEM who knew this development release would not work at WMF. Was the intersection of those two sets empty? Surely it was. What motivation could anyone have to allow such a fiasco to occur?

But sometimes, in my darker moments, I remember that at the time TEM had an HR policy that included that enlightened system of forced ranking. And someone has to occupy those lower rating boxes. Would you pass up the opportunity to eliminate the competition for the rankings at the more rarified altitudes?

Nevertheless, I have always preferred to believe that the WMF fiasco was simply the result of the right hand not knowing what the left hand was doing. One of the lessons I carried away from this experience is that socializing high risk efforts widely through an organization might be a really good idea.

Ironically, WMF decided to go ahead and purchase TEM's ECP solution, the very product I had failed to get working, twice, for their main campuses, but to go with TEM's major competitor for the small satellite offices. Technically, it was actually a good solution for WMF, since it played to the strengths of both vendors. Sometimes I wonder what my life would have become if WMF had simply gone with that solution in the first place and we could have avoided both of my ill-fated site visits.

WMF itself, once firmly in the Fortune 100, ceased to exist, having immolated under the tinder of bad debt in the crucible of the financial crisis.

Many of my former colleagues are still at TEM, albeit fewer with each wave of layoffs, still working in that creaky old huge C code base that features things like a four thousand line switch statement. It's probably significantly bigger by now.

As for me, a chance to transfer from the ECP development organization to another project came along. The new project was developing a distributed Java-based communications platform using SOA/EDA with an enterprise service bus. I moved to the new project, and worked there happily for over a year, learning all sorts of new stuff, some of which I've written about in this blog. ECP was probably relieved to see me go.

But knowing that I had made a career-limiting mistake, I eventually chose to leave TEM to be self-employed. My decision surprised a lot of people, most of whom knew nothing or only a small part of the WMF story. It was one of the best career decisions I've ever made. I'm happier and less stressed, and I've learned more and made more money than I would have had I stayed at TEM.

Funny how these things work out. Would I ever have followed one of my life-long ambitions had the WMF fiasco not occurred? Or do we sometimes need a little creative destruction in our lives to set us on the right path?

Monday, December 17, 2012

Passion Practice Proficiency Profession

Back in the early 1980s I was a humble graduate student sitting in my office grading papers when I overheard one of the academic advisors for the computer science department talking to one of the undergraduates in the hallway. The student was saying "I really don't like programming but I'm majoring in computer science because I want to make a lot of money". This was a guy who was going to spend many years and a lot of money getting a degree so that he could be miserable in his job for the rest of his life. I'm also pretty sure he was never going to make a lot of money. I didn't understand it then, and I don't understand it now.

* * *

A few years ago I found myself at a banquet at that same university at which I was unexpectedly called upon to speak. I had to come up with something extemporaneously. Those who know me well will understand that this wasn't a big problem for me. This is more or less what I said.

"Most high technologies have a half life of about five years. Some technologies have done better than that: C, TCP/IP. Most haven't. This means that no matter what technologies you are teaching when a freshman enters the university, they will almost certainly not be what you are teaching when that senior graduates. And whatever technologies that student learns will not be what he ends up needing expertise in when he enters the workforce. Every six months or so I am expected in my job to become the world's greatest living expert in some technology that I may have never heard of beforehand. The most valuable thing I was taught during my time at this university was how to learn. Continuous, life-long learning isn't a buzzword, it's a requirement. Core skills, and learning how to learn, is what your students need. Not the latest fad. People who grasp specific technologies but can't quickly learn new ones on their own are the ones who are going to be laid off or whose jobs are going to be outsourced."

* * *

One of my favorite movies is the 1948 British film The Red Shoes. The film tells the story of the career of a ballerina and features a beautiful dance sequence based on the Hans Christian Andersen story of the same name. But the film is about ballet in the same way that the book Moby Dick is about the whaling industry in New England in the mid-1800s.

One of my favorite scenes in the movie has the aspiring ballerina chatting up a famous ballet company impresario at a dinner party, something that probably happens to him several times a day. Finally he snaps at her: "Why do you dance?" She coolly replies: "Why do you breathe?" "Why... why I don't know. I just know that I must." "That's my answer too."

* * *

Not so many years ago I found myself on a chairlift with my niece, who was getting ready to graduate from high school and was planning to go to college to major in the performing arts. To their credit, her father, a professor of engineering, and her mother, at one time a technical writer, were, to my knowledge, never anything but supportive of her career choice. But given the professions of her parents, and the fact that her older brother was graduating with a degree in mechanical engineering, she was a little nervous. While her mother and my wife were getting caught up with the sister thing in the next chair back, this is more or less what I told her.

"To be happy in any profession, you have to be successful at it. To be successful, you have to be proficient at it. To be proficient at it, you have to have spent thousands of hours practicing at it, no matter what your natural skill at it may be. You have to be passionate about it, otherwise you'll never spend enough time at practice. You have to love it so much, you can't imagine not doing it. So much, you'd do it anyway even if you didn't get paid to do it. There is no point in choosing a career in anything that you don't love to do that much. No point in anything that you aren't compelled to be better at than anyone."

 * * *

The past six years or so have had some challenges. I lost both my mom and dad, although that wasn't a big surprise: mom was 86 when she died, dad was 94. I hope I do as well. I lost an old friend about my age to stroke. Another to cancer. Two colleagues to separate vehicular accidents. Three former colleagues to suicide. One friend to murder. It was after the sudden and unexpected death of one of those people that I went home and told Mrs. Overclock: "If I go to work tomorrow and don't come back, I just want you to know, it's all been good."

I don't know how I was lucky enough to end up in a profession that I can't imagine not doing. That I love so much I practice it even when I'm not being paid to do so. That I managed to make a decent living from. And that gave me an opportunity to routinely work with people smarter than myself and from whom I could learn.

But always with a work-life balance better than that of a certain aspiring ballerina.

Saturday, December 08, 2012

Arduino Due Data Types

Just the other day my Arduino Due arrived from one of my favorite suppliers, nearby Sparkfun Electronics, based in Boulder, Colorado. Unlike the Arduino Uno, which uses an 8-bit Atmel AVR ATmega328 microcontroller, the Due uses an Atmel AT91SAM3X8E microcontroller which has a 32-bit ARM Cortex-M3 core. But like those AVR-based Arduinos, the Due's processor is a Harvard architecture, different from many other ARM-based processors which are von Neumann architectures. The Due has a whopping 512KB of flash for instructions and 96KB of SRAM for data.

First order of business was of course to run my little Arduino sketch that prints the sizes of all the data types. This is my version of the classic "Hello, World!" program. I like it because it not only verifies that the tool chain and platform software all work, and serves as a basic sanity test for the board, but also tells me something useful about the underlying hardware target.

#include <stdint.h>

void setup() {
  Serial.begin(115200);
}

void loop() {
  Serial.print("sizeof(byte)="); Serial.println(sizeof(byte));
  Serial.print("sizeof(char)="); Serial.println(sizeof(char));
  Serial.print("sizeof(short)="); Serial.println(sizeof(short));
  Serial.print("sizeof(int)="); Serial.println(sizeof(int));
  Serial.print("sizeof(long)="); Serial.println(sizeof(long));
  Serial.print("sizeof(long long)="); Serial.println(sizeof(long long));
  Serial.print("sizeof(bool)="); Serial.println(sizeof(bool));
  Serial.print("sizeof(boolean)="); Serial.println(sizeof(boolean));
  Serial.print("sizeof(float)="); Serial.println(sizeof(float));
  Serial.print("sizeof(double)="); Serial.println(sizeof(double));
  Serial.print("sizeof(int8_t)="); Serial.println(sizeof(int8_t));
  Serial.print("sizeof(int16_t)="); Serial.println(sizeof(int16_t));
  Serial.print("sizeof(int32_t)="); Serial.println(sizeof(int32_t));
  Serial.print("sizeof(int64_t)="); Serial.println(sizeof(int64_t));
  Serial.print("sizeof(uint8_t)="); Serial.println(sizeof(uint8_t));
  Serial.print("sizeof(uint16_t)="); Serial.println(sizeof(uint16_t));
  Serial.print("sizeof(uint32_t)="); Serial.println(sizeof(uint32_t));
  Serial.print("sizeof(uint64_t)="); Serial.println(sizeof(uint64_t));
  Serial.print("sizeof(char*)="); Serial.println(sizeof(char*));
  Serial.print("sizeof(int*)="); Serial.println(sizeof(int*));
  Serial.print("sizeof(long*)="); Serial.println(sizeof(long*));
  Serial.print("sizeof(float*)="); Serial.println(sizeof(float*));
  Serial.print("sizeof(double*)="); Serial.println(sizeof(double*));
  Serial.print("sizeof(void*)="); Serial.println(sizeof(void*));
  Serial.println();
  delay(5000);
}

Here are the results. You can compare these to those of the Arduino Uno from when I ran a similar program on it.

sizeof(byte)=1
sizeof(char)=1
sizeof(short)=2
sizeof(int)=4
sizeof(long)=4
sizeof(long long)=8
sizeof(bool)=1
sizeof(boolean)=1
sizeof(float)=4
sizeof(double)=8
sizeof(int8_t)=1
sizeof(int16_t)=2
sizeof(int32_t)=4
sizeof(int64_t)=8
sizeof(uint8_t)=1
sizeof(uint16_t)=2
sizeof(uint32_t)=4
sizeof(uint64_t)=8
sizeof(char*)=4
sizeof(int*)=4
sizeof(long*)=4
sizeof(float*)=4
sizeof(double*)=4
sizeof(void*)=4

Wednesday, September 19, 2012

Brass Tacks

Sometimes when your low level code running on a microprocessor or microcontroller is talking to an external device and things just are not going as expected -- perhaps, as we said in my mainframe days, "no warnings, no errors, no output" -- you find yourself in desperate need of a clue. This article is about some of the hardware tools I've recently used on a paying gig to get a clue: a logic analyzer, an oscilloscope, and a logic-level serial to USB converter.

Here's a photograph of a board under test. It's about the size of the palm of your hand and is covered in logic clips for the logic analyzer and the converter, and with the oscilloscope probes. The multi-wire connector on the left goes to a debugging pod. Power and an RS-485 bus connection enters the board via the blue cable from behind. The firmware I wrote for this product is in a dialect of C, runs on a Microchip PIC16F1823 microcontroller unit (MCU), and has no operating system.

I2C_debug_detail

Logic Analyzer

A logic analyzer is a hardware device that allows you to make sense of one or more digital signals. The term digital here is important: the signals have to be logic voltage levels, for example 3.3 or 5 volts. Much more than that and your logic analyzer is not only not useful, it may actually be damaged. The signals have to be logic zero or one. The analyzer will interpret voltages below a threshold as a zero, and voltages above another threshold as a one. That's all it knows how to do, zero or one, so analog signals that don't follow those rules will be misinterpreted. The signals have to change with a frequency no greater than the speed at which the hardware inside your logic analyzer can sample. The ability of the logic analyzer to recognize, capture, store, and display, after the fact, digital events that occur faster than any human could possibly see them is the essential goodness of the tool. And this speed thing is important. It's the reason perfectly useful logic analyzers range in price from as little as US$150 to US$25,000 or more. The ability to capture higher frequency signals means more dollars.

I've used logic analyzers at the high end when someone else paid for them. But when it came to investing in one of my own, I bought a Saleae Logic, an eight-channel 24-megaHertz logic analyzer. I love it. At US$149, it's so ridiculously functional for its price, it'll pay for itself the first time you use it.

I've written about the Logic before in my articles about reverse engineering the original AR.drone and generating pulse width modulation signals on an Arduino board. So here are some photographs of the Logic that you may have seen before.

Saleae Logic Contents

The Logic has a tiny aluminum electronics pod that is almost covered by my company's property sticker. It comes with a wiring harness that plugs into the pod, small logic clips that fit on the wires on the harness and then clip to the digital pins you want to look at, and a USB cable. You can see how small it is by comparing it to my little Swiss Army pen knife at the top of the photograph. All of this fits in a small zippered case that comes with the Logic.

Saleae Logic with HP 110 Mini Netbook

The Logic depends on a computer (Windows, Mac, or Linux) with a fast USB port for power and to do the heavy lifting of its user interface (UI). Here it is connected to a small HP 110 Mini netbook that I carry around in one of my field cases that live in the trunk of my car. But Saleae makes it trivial to install the same software on any computer that is handy at a customer site.

Here's a Logic trace for a board that uses the Enhanced Universal Synchronous Asynchronous Receiver Transmitter (EUSART), also known as a serial port controller, on the PIC MCU to talk to an RS-485 bus transceiver chip. (Like all the photographs here, you can click on it to get access to larger versions.) Three digital signals go from the PIC to the transceiver: Receiver Output (RO) (receive or RX on the EUSART), Driver Input (DI) (transmit or TX), and Driver Enable (DE). You can see that I labeled the signal traces using the Logic UI.

RS485_RO_DI_DE_ECHO_TEST

RS-485 is a multi-drop half-duplex differential serial communications bus standard. Only one node on the bus can talk at a time. The DE signal tells the transceiver when it can drive an output signal onto the bus. This signal is explicitly controlled by code in my firmware, not by the EUSART controller. I wanted to verify that I was setting DE high before the EUSART began clocking the character out onto the bus, and then setting DE low after (but not too long after) the last data bit was clocked onto the bus.
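
Here's roughly what that control flow looks like, reduced to a polled sketch. The register and bit names (TXREG, TXIF, TRMT) are from the PIC16F1823 data sheet, though exactly how your compiler exposes them varies; the latch bit driving DE is a hypothetical stand-in for whatever pin your board actually uses, and the real firmware was interrupt driven rather than spinning like this.

/* Sketch: transmit one octet on a half-duplex RS-485 bus.
 * DE_LAT is a hypothetical name for the GPIO latch bit wired to the
 * transceiver's Driver Enable pin; TXREG, TXIF, and TRMT are EUSART
 * register and status bit names from the PIC16F1823 data sheet. */
static void rs485_put(unsigned char octet)
{
  DE_LAT = 1;             /* enable the transceiver's bus driver */
  TXREG = octet;          /* hand the octet to the EUSART */
  while (!TXIF) { }       /* wait for TXREG to drain into the shift register */
  while (!TRMT) { }       /* wait for the last bit to actually leave the pin */
  DE_LAT = 0;             /* release the bus so other nodes can talk */
}

The interesting part is the last two lines: drop DE too early and you truncate the final character; drop it too late and you risk stepping on the next node's reply. That interval is exactly what the timing flags in the trace below are measuring.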

You'll notice one of the great features of the Logic: it actually decodes the serial characters and tells me what they are (an ASCII 'B' in this case). This makes the Logic a kind of simple protocol analyzer as well. You can also see the two green timing flags the Logic UI provides that I moused around to have Logic calculate the time interval between the last bit of serial data going out on the bus and the DE signal being brought low: a completely comfortable 36.5 microseconds.

The Logic proved valuable in debugging two other issues with the RS-485 bus. When I first started using the RS-485 bus, I was getting garbage on the bus. The Logic told me that the characters I was outputting had framing errors. Thanks to the decode, I could also see that the characters being output onto the bus weren't even the right characters.

RS485_FRAMING

But the real clue was the fact that the DI line appeared to be idle low (zero) between successive characters when the correct idle state was high (one). RS-485 uses differential signaling over two wires that are by convention called A (negative or inverting) and B (positive or non-inverting). I had carefully followed the labeling on the board schematics, and verified that the schematics matched the data sheets for the RS-485 transceiver chip we were using. But the Logic suggested that I had A and B reversed. Some judicious Googling revealed that many transceiver manufacturers reverse the meaning of A and B from that of the RS-485 standard. Srsly? Reversing the wires fixed this problem.

Later, I noticed that I was occasionally getting a spurious character output onto the bus as my firmware started up. The character was always a hex 0xFF. I was sure I had some bug in my buffer handling.

RS485_0XFF

But twenty seconds of looking at the Logic trace told me that this character was being output well before the bus driver in my firmware emitted its first character. In fact, it was being output before I even initialized the DE signal to its correct state, one of the first things my firmware does. The problem was back in my hardware initialization code: I wasn't setting the DI pin to the correct initial value and in the correct order relative to initialization of the DE pin. The values of the not-yet-initialized pins just conspired to look like I was putting a 0xFF character onto the bus.

One of the boards I'm writing firmware for on this gig has a light sensor chip that the PIC MCU controls via an Inter-Integrated Circuit (I2C or, more conventionally, I²C) bus. I2C is a two-wire serial bus standard that is very commonly used to allow two chips on a board to communicate. (I2C is not, I repeat not, an interprocessor communications interface. But that is an article for another day.) It uses two digital signals, Serial Data (SDA) and Serial Clock (SCL).

I2C_WRITE_UPPER_LOWER_THRESHOLD

In this first trace you can see the goodness of the Logic's decoder capability, telling me what the digital pulses on the SDA line really mean in an I2C context. In this example, my firmware is writing a value into a register in the light sensor chip. You can see that the Logic helpfully marks the beginning of I2C protocol sequences with green dots, and the end of the sequence with a red dot.

I2C_READ_ONE

This trace is an example in which my firmware is reading a register in the light sensor chip. The decode tells me what the value read was. Of course, the best tool in the world can't help you if you're having a senior moment.

I2C_READ_ONE_FAIL

Initially my I2C state machine wasn't completing correctly, so my I2C handler was getting hung up. This Logic trace showed me that the light sensor chip was holding the I2C bus in its non-idle (zero) state as if it thought it had something else to do. I knew my code was doing something bone-headed with the I2C communication with the light sensor. Things would have gone a lot faster had I noticed that the Logic I2C decoder wasn't displaying the helpful red dot showing that the I2C sequence on the bus had terminated correctly. It took a colleague sitting near me to remark "Are you sending an I2C NAK instead of an ACK at the end to terminate the sequence?" D'oh! Five minutes and one minor code change later, I had a working I2C handler.
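
For what it's worth, here's the shape of the fix, reduced to a sketch. The i2c_* helpers below are hypothetical stand-ins for whatever primitives your driver provides (mine drove the PIC's I2C controller registers directly), but the protocol point is real: an I2C master signals that it's done reading by NAKing the final byte it receives, and only then issuing the stop condition.

/* Hypothetical I2C master primitives; a real driver would implement
 * these against the MCU's I2C controller registers. */
extern void i2c_start(void);                 /* start or repeated start */
extern void i2c_stop(void);                  /* stop condition */
extern void i2c_write(unsigned char octet);  /* write one octet, consume the ACK */
extern unsigned char i2c_read(int ack);      /* read one octet; ack!=0 ACKs it, ack==0 NAKs it */

/* Sketch: read a single register from an I2C slave like the light sensor.
 * addr7 is the slave's seven-bit address, reg is the register number. */
unsigned char sensor_read_register(unsigned char addr7, unsigned char reg)
{
  unsigned char value;

  i2c_start();
  i2c_write((unsigned char)((addr7 << 1) | 0));  /* address + write: select the register */
  i2c_write(reg);
  i2c_start();                                   /* repeated start to turn the bus around */
  i2c_write((unsigned char)((addr7 << 1) | 1));  /* address + read */
  value = i2c_read(0);                           /* NAK: this is the last (and only) byte */
  i2c_stop();

  return value;
}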

Since I bought the Logic, Saleae introduced the sixteen-channel Logic16 for US$299. That same colleague bought one after seeing me use my Logic. After seeing his Logic16, I ordered one too. Tool envy: it's not pretty.

Oscilloscope

Back in the day when I was an undergraduate and wooly mammoths roamed the plains, I actually had to take a short course in how to use an oscilloscope. This, even though I was majoring in computer science, not computer engineering. This was when oscilloscopes were the size of a carry-on suitcase and logic analyzers were more in the realm of Star Trek than something a guy like me might actually own two of. The wisdom of my professors revealed itself as time passed and, yes, I came to own an oscilloscope too.

Although logic analyzers have replaced oscilloscopes for a lot of applications, oscopes are still vital tools for the embedded developer wherever analog signals crop up. As their name implies, oscopes are devices that allow you to see oscillating signals. More specifically: analog signals, which may vary continuously in time instead of being discrete lows or highs, which may vary in voltage from zero (or even negative) to something much higher than digital logic levels, or which are supposed to be digital but which are somehow not well behaved and hence misinterpreted by a logic analyzer.

The oscilloscope I was trained on eons ago was probably a mostly analog device. It may even have had vacuum tubes in it. Or maybe twigs and bark. But a modern digital storage oscilloscope (DSO) is a purely digital device that samples an analog signal and displays it continuously in real-time. Like logic analyzers, DSOs have a wide price range depending on sampling rate and capability. And like logic analyzers, very useful models are well within the budget of guys like me.

Velleman PCSU1000 Oscilloscope

I own a Velleman PCSU1000, a two-channel 60-megaHertz DSO capable of handling analog signal levels up to 30 volts. I paid around US$350 for mine off Amazon.com, if you can believe it. Like the Saleae Logic, the PCSU1000 depends on a laptop with a fast USB port (Windows only, alas) to store the digital samples and implement the user interface.

Velleman PCSU1000 Oscilloscope Front Panel

The PCSU1000 is about the size of a netbook. Its front panel accommodates two channels and an optional external trigger. The USB cable plugs in on the back.

One of the devices I wrote firmware for was a motion detector chip that has two passive-infrared (PIR) sensors. The difference between the outputs of the two PIRs was expressed as an analog signal. The PIC read this analog signal using an analog-to-digital converter (ADC) built into the MCU.

PIR_5mV_200ms

Here is what the signal going into the PIC looked like as displayed by the PCSU1000. It's clearly not a digital signal: it's measured in millivolts, a signal level that a logic analyzer would probably interpret as a zero. Also, the duration of an entire event from the motion sensor is about a second, an eternity in the digital domain. Using the PCSU1000 to reveal the amplitude, shape, and duration of this waveform helped me configure the ADC on the PIC and understand its results once I had the code working.

Another task I had was to generate an analog output, in the form of a pulse width modulated (PWM) signal, to act as a kind of intelligent rheostat for lighting control. There were two phases to this output signal: the essentially digital PWM output from the MCU that my firmware generated, and the analog voltage going to the light fixture that a backend analog circuit amplified up to 10 volts.

pwm_duty_20_period_2048us

Here is the PWM output generated by my firmware on the PIC MCU. This is a digital signal that the Logic could have easily interpreted as well. But the PSCU1000 allowed me to watch it in real-time as I changed the PWM duty cycle. This trace is a 20% duty cycle, meaning the signal is in the on-state for 20% of its total period. To an analog device, this looks like an analog voltage that is 20% of the maximum voltage, because the analog device effectively sees the integral of the voltage of the digital signal.
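
In case you're wondering what the firmware side of that looks like, here's a sketch of setting the duty cycle as a percentage, assuming the PIC's CCP1 module is already configured for PWM mode against Timer2. The register names and the ten-bit duty formula come from the PIC16F1823 data sheet; my actual code differed in the details, so treat this as illustrative only.

/* Sketch: set the PWM duty cycle as a percentage on the PIC's CCP1 module.
 * Per the data sheet the ten-bit duty value is split across CCPR1L
 * (upper eight bits) and CCP1CON bits 5:4 (lower two bits), and
 * duty ratio = value / (4 * (PR2 + 1)). Assumes CCP1 and Timer2 are
 * already set up for PWM; register names are from the PIC16F1823. */
static void pwm_set_duty_percent(unsigned char percent)
{
  unsigned long value;

  if (percent > 100) {
    percent = 100;
  }

  /* Full scale is 4 * (PR2 + 1) timer counts. */
  value = (((unsigned long)PR2 + 1UL) * 4UL * percent) / 100UL;

  CCPR1L = (unsigned char)(value >> 2);                               /* upper eight bits */
  CCP1CON = (CCP1CON & ~0x30) | (((unsigned char)value & 0x3) << 4);  /* lower two bits */
}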

pwm_duty_50_period_2048us

This is a 50% duty cycle. You can see the duration of the on-state has gotten longer.

pwm_duty_80_period_2048us

And this is an 80% duty cycle.



But as cool as that is, here is the really useful thing: this is a movie of the PCSU1000 displaying the output of the 10 volt amplifier circuit as my test firmware turns the voltage up and down by varying the PWM duty cycle of the digital signal from 0 to 100% and back again, just as if I were manually turning a wall-mounted dimmer switch up and down. The logic analyzer could never do this.

Rheostat at 100% PWM Duty Cycle

And because, as my colleague John Lowe says, the voltmeter never lies: here is my handy Radio Shack multimeter displaying the analog voltage at 100% duty cycle.

Logic-Level Serial-to-USB Converter

Actual RS-232 signal levels can be as high as 12 volts or more, and bits on an RS-232 signal wire are encoded as both positive and negative voltages. The serial output of a USART on your microprocessor or microcontroller is nothing like this. Instead, it uses logic-level signals ranging from 0 to something like 3.3 or 5 volts. The serial port on an MCU is frequently used for talking to some other chip on the board; in my case, an RS-485 transceiver chip. But sometimes it really pays to have a way to borrow the serial port for purposes of debugging. I use an FTDI Friend from Adafruit Industries to do this.

FTDI Friend Logic-Level to USB Serial Converter

The FTDI Friend is a tiny circuit board smaller than your thumb that has on it a chip made by Future Technology Devices International that converts logic-level serial signals to a USB serial connection. I've confessed my love for FTDI before. When it comes to serial-to-USB conversion of any kind, I won't use anything else. Because everything else is crap.

FTDI Friend Logic-Level to USB Serial Converter

I added a USB cable, some jumper wires, and a few logic clips, and I now have a device I can clip onto the serial port pins of a microprocessor or microcontroller under test, hook the USB cable onto a laptop, fire up my favorite terminal emulator like PuTTY or screen, and spy on what its serial port is saying. I do this. A lot.

LUX32_DUMP

Here is the output from some test firmware I wrote to continuously query the light sensor chip via I2C and dump its measurements in hex encoded in fractions of a lux to the serial port. I inserted the FTDI Friend between the PIC MCU and the RS-485 bus transceiver, clipping its leads right to the pins on the bus transceiver, which were larger and hence easier to use than those on the MCU.

ADC_DUMP

Similarly, here is the output of some test firmware I wrote to query the motion sensor chip via the ADC and dump its measurements in hex encoded as a ten-bit value relative to a 5 volt reference voltage to the serial port.

The FTDI Friend is invaluable for stuff like this. It's around US$15 plus a few bucks for the logic clips and jumper wires.

If, like me, you do embedded development, or indeed any kind of development where you are running close to bare metal, you have to have tools that allow you to peer into that hidden world and see what's going on under the hood. And if, like me, you are self-employed, or indeed are employed at all, you have to know what your time is worth. Tools like the Saleae Logic analyzer, the Velleman PCSU1000 DSO, and the Adafruit FTDI Friend, have saved me countless hours of debugging and guesswork. You owe it to yourself as a professional to have the best tools you can afford.

Update (2012-10-27)

I have another little gadget similar to the FTDI Friend that I like even better. It's a USB to 3.3V/5V Auto Sensing Adapter that I bought right off Amazon.com. It's functionally equivalent to the FTDI Friend, but it's an integrated cable in which the FTDI chip is embedded inside the hood of the USB connector, which makes it even easier to use.

USB to 3.3V/5V TTL Auto Sensing Adapter

I didn't mention it in the original draft of this article because it took a bit of fabrication to build it. It came with a Molex-style connector on the end which I clipped off and replaced with jumper wires to which I could attach logic clips as shown here. If you are comfortable using a soldering iron, a heat gun, and shrink wrap, it only takes a few minutes to make one of these, and it will become an indispensable part of your embedded toolkit.

Update (2013-01-02)

Here's another little gadget that's useful if you deal with hardware with legacy RS-232 interfaces. It's a USB 2.0 Serial RS-232 DB9 Mini Adapter from Amazon.com that's not much bigger than a DB9 connector hood. It's perfect for converting boards that bring their RS-232 serial port out on a female DB9 connector to serial USB. (If they come with a male DB9 connector you'll likely need a null modem adapter too.)

USB 2.0 Serial RS-232 Mini Adapter

You can see these in use in some of my articles on older BeagleBoards or the more recent Samsung ODROID reference platform. These are so useful that I've purchased several, and I just keep them screwed onto the boards even when the project is in storage. Or, for example, in the box with my Abatron BDI3000 JTAG debugging pod, which occasionally needs to be reconfigured or reflashed using a serial cable.

Update (2017-01-17)

For those of you who spend time peering at Wireshark traces, this is the next gadget you'll want to buy: a SharkTap. Inside is a Broadcom three-port (at least) gigabit Ethernet switch programmed to pass through everything between ports "blue" and "green", and mirror all of that traffic to port "red". It's powered over USB. It's not cheap. But it might just be indispensable.

SharkTap

Thursday, August 16, 2012

Big Things In Small Packages

In All the Interesting Problems Are Scalability Problems I remarked that the Mac Mini on which I am writing this article runs at 150 times the processor speed of the AVR microcontroller for which I was developing, but has 16,000 times the memory. This radical disparity led to some interesting design decisions and tradeoffs. That observation was reinforced -- in spades -- recently on a gig for which I was writing firmware in C for a PIC (for Peripheral Interface Controller) microcontroller.

Like the Atmel AVR ATmega2560 I was originally writing about, and typical of many other microcontrollers, the Microchip Technology Inc. PIC16F1823 is a Harvard architecture: the executable code and the data reside in two completely different memories. Instructions live in on-chip flash from which they are executed directly. Data lives in on-chip RAM. This particular PIC has an 8MHz instruction clock, so it executes an instruction every 125 nanoseconds. Long ago but within my memory (although I am so old I am basically a brain in a jar), that would have been considered impressively fast, considerably faster in fact than the original IBM PC. The PIC16F1823 has a scant two kilowords of flash, a word being equivalent to a single machine instruction. And ninety-six bytes of RAM.

Let me repeat that: ninety-six bytes; there is no K there. It has a 128 byte address space by virtue of its whopping seven bit addressing. But thirty-two of those bytes are dedicated to registers used to configure and control its several I/O controllers, the PIC being a system on a chip (SoC). Just naming the controllers that I have written drivers for, the PIC provides eight- and sixteen-bit timers, a pulse width modulation (PWM) generator, analog-to-digital converters (ADCs), an asynchronous serial port, an I2C serial bus interface, and the usual general purpose I/O (GPIO) pins. The package I am using has sixteen pins, of which it uses fourteen; the remaining two are unconnected and serve merely to hold the chip down to the printed circuit board.

How complex an application can you write in ninety-six bytes? Quite complex, as it turns out. One of the boards for which I've written code, all in C, has six interrupt service routines and can exhibit quite sophisticated behavior. Not bad for a processor with a unit price of about a buck fifty U.S., and quantity pricing under a dollar. But every single line of code I write requires some agonizing. Do I really need this variable? Does it really need to be two bytes? Can it be one byte? Can I do without it completely?

I say I'm writing in C, but it's a dialect of C specific to this device, one that provides capabilities beyond that of ANSI C or even the GNU enhancements. For example, there is a bit data type that, you guessed it, takes up a single bit. As you might expect under the circumstances, I use those whenever I can. When I recently wrote a function that really really needed a four-byte integer, I nearly had a stroke.  Just one of those variables takes up more than four percent of the entire available RAM.
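
To make that arithmetic concrete, here's the kind of accounting every declaration forces. This is a contrived example: the bit type is the compiler's extension, not ANSI C, and exactly how tightly the compiler packs bit variables depends on the version and target.

/* Contrived illustration of the RAM budget on a ninety-six byte part. */
bit motion_detected;     /* one bit; the compiler packs several of these per byte */
bit lamp_enabled;        /* likewise */
unsigned char duty;      /* one byte: a bit over one percent of available RAM */
unsigned int threshold;  /* two bytes: roughly two percent */
unsigned long lux_accum; /* four bytes: the four-plus percent that nearly caused the stroke */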

In Hitting a Moving Target While Flying Solo I talked about the challenges of using compilers and other tool chain elements for embedded targets where those elements were not nearly as widely used as those of more mainstream processors like the Intel Pentium. Those words came back to haunt me, as I was using the Hi-Tech PICC C compiler provided to me by the customer.

I was debugging my I2C state machine, which sure seemed to be doing impossible things. I was fortunate enough to have a debugger with which I could single step through the state machine as it was stimulated, and simultaneously watch its state variable change value. I was stunned -- stunned is the only word for it -- to see the state machine, which was implemented as a C switch statement, entering the wrong case statements based on the value of the state variable. I assumed the debugger was lying to me. So I used a few precious bytes to instrument my code. Nope, the debugger was right: the C switch statement did not work correctly.

I came to find out that this is a known problem in the 9.81 compiler I was using, documented in the release notes for the 9.82 version. WTF? When's the last time you used a C compiler in which the switch statement didn't work? Ever? This is what I'm talking about.

However, the Hi-Tech PICC compiler is quite clever in other respects. Other processors I've used have a stack that languages like C and C++ use to push return addresses during function calls, to create stack frames containing function parameters, and to allocate memory for automatic variables within the scope of a function.

The PIC16F1823 has a stack implemented in a memory that is separate from either program or data memory. It is used solely to push return addresses for functions and interrupt service routines. It has a fixed depth of sixteen words. The microcontroller has no data stack.

The Hi-Tech PICC compiler deals with this by performing a static call-tree analysis at link time, determining the maximum possible depth of function call and interrupt service routine nesting, and warning you if you exceed it. It also uses this information to allocate space for function parameters and automatic variables at fixed memory locations, just as if they were static variables. It uses the call-tree information to overlap these allocations such that there is no call path for which there is a memory use conflict. It is for this reason that C functions for this target cannot be reentrant, and cannot be called recursively.
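
A sketch of what that means in practice; these functions are hypothetical, but they illustrate the scheme.

/* Because foo() never calls bar() and bar() never calls foo(), the
 * static call-tree analysis knows they are never active at the same
 * time, so their parameters and automatic variables can be overlaid
 * at the same fixed RAM addresses. */
static unsigned char foo(unsigned char value)
{
    unsigned char scaled;       /* placed at a fixed address */
    scaled = (unsigned char)(value << 1);
    return scaled;
}

static unsigned char bar(unsigned char value)
{
    unsigned char masked;       /* may share foo()'s address */
    masked = (unsigned char)(value & 0x0f);
    return masked;
}

/* This, on the other hand, can never work on this target: there is no
 * data stack on which to keep a separate copy of n for each nested call. */
static unsigned char factorial(unsigned char n)
{
    return (n <= 1) ? 1 : (unsigned char)(n * factorial(n - 1));
}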

So whether you exceed your generous allotment of ninety-six bytes depends not only on how many static variables you have, but also on the exact pattern of function call nesting you implement. This creates a conflict in trade-offs: you are highly motivated to factor all duplicated code into separate functions to save program space, as long as the code generated in doing so is shorter than the duplicated code. But you constantly run the risk of creating a function call path that cannot be supported by the compiler. This definitely encourages a simpler-is-better approach to developing code for this target.

In Welcome to the Major Leagues I mentioned how hard it was to define embedded development exactly, and about the surprising (to some anyway) overlap between developing for the very small and the very large. This latest effort has been a great learning opportunity for me to add a few new tricks to my toolbox.

Thursday, July 19, 2012

Good Day Sunshine On My Arduino

Here's an update on my article Sunshine On My Arduino Makes Me Happy. This is part of my Amigo project, which among other things experiments with alternative power sources for eight-bit AVR microcontrollers.

I went from a 1.5W solar panel to a larger 5W panel, then to a much larger 15W panel. We'll see if that's enough to charge the 12V battery during the day and let the system run all night. I also had to go to Xbee Series 2 radios with external antennas on both the Arduino Uno with the Xbee shield in the "instrument pod" and the Xbee Explorer that is USB connected to my desktop, in order to get the range I needed. The chip antennas couldn't cover the few tens of yards from the south-western edge of my back yard through a wall to my home office on the south side of my house.

Here's the instrument pod with its external antenna visible at the lower left. The pod looks fluorescent green in the photograph; it was originally a translucent white that I spray painted fluorescent yellow. The pod looks a lot better in the photograph than it does in real life; it definitely has a cobbled-together look about it.

Instrument Pod: Cover Off

Crammed into this little box is the Arduino Uno with the Xbee shield, a battery meter activated by a tiny pushbutton switch, the solar charge controller, and a 12V sealed battery. The software on the Arduino merely pings my desktop every second. (In the past I've written about various environmental sensors I've already tried on this platform.)

Here is the instrument pod at the edge of my back yard connected to the largish (about 42" x 15" or 105cm x 38cm) 15W solar panel.

Instrument Pod and 15W Solar Panel

Here is the tiny Xbee Explorer, dwarfed by its own external antenna, USB attached to my desktop Mac.

Xbee Coordinator on Xbee Explorer

I'll monitor this for the next few days and see what happens.

Update (2012-07-23)

The instrument pod has been up continuously for over four days now. It pings my desktop once a second. The state of charge meter I have hooked up to the 12V battery, activated by a little pushbutton inside the pod, shows the battery to be completely charged. So far so good.

Update (2012-07-28)

The solar-recharged instrument pod has been continuously wirelessly pinging my desktop via its Zigbee radio for just short of nine days now, having survived several rain storms. Here's a snippet from the log file. The ISO 8601-style timestamp is generated by the logging script on my desktop Mac; the duration timestamp, showing eight days and twenty-three hours, is generated by the remote Arduino in the instrument pod, and indicates how long the Arduino has been running since it last powered up.


2012-07-28T10:00:03 8:23:36:30
2012-07-28T10:00:04 8:23:36:31
2012-07-28T10:00:05 8:23:36:32
2012-07-28T10:00:06 8:23:36:33
2012-07-28T10:00:07 8:23:36:34
2012-07-28T10:00:08 8:23:36:35 
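
The Arduino derives that duration from its free-running millisecond counter; a conversion along these lines would produce it. This is a sketch for illustration, not the code actually running in the pod; note that a thirty-two bit millisecond counter wraps after about forty-nine days, which is plenty for this experiment.

#include <stdio.h>

/* Convert an elapsed-millisecond count into a D:HH:MM:SS duration. */
static void format_uptime(unsigned long milliseconds, char * buffer, size_t size)
{
    unsigned long seconds = milliseconds / 1000UL;
    unsigned long minutes = seconds / 60UL;
    unsigned long hours = minutes / 60UL;
    unsigned long days = hours / 24UL;

    snprintf(buffer, size, "%lu:%02lu:%02lu:%02lu",
        days, hours % 24UL, minutes % 60UL, seconds % 60UL);
}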

Update (2012-08-09)

The instrument pod finally lost power after running continuously for more than ten days. I could tell by periodically checking the battery's state of charge that the solar panel wasn't recharging the battery during the day fast enough to make up for the drain during the night. But while I would have preferred it stay up, this gives me some hope that by just mounting the solar panel in a better, sunnier location, it might work continuously during the summer. Whether it will do so during the winter is another matter.

Tuesday, July 17, 2012

The Death of Hard Power Off

You've already noticed this: when you hit the power button, it takes several seconds for the device to go dark. If it has a display, your device might show a little spinning or blinking icon to indicate it is doing something. This is true for your hand-held mobile devices: your tablet, your mobile phone. It is also true for your laptop and your desktop, and for the rack-mounted servers at your data center. It is also the case for consumer devices you perhaps don't give that much thought to: your digital video recorder, your MP3 player, or, perhaps, even your automobile.

Increasingly, digital devices implement a soft power off. Which is to say, when you press the power off button, you are not turning the power off. You are informing a piece of software of your desire for it to turn the power off. A million lines of executed code later, the power turns off. Usually.

Compare this to a hard power off, which is more like a simple light switch: the instant the purely mechanical switch separates a set of metallic contacts, the electrical circuit is interrupted and power is immediately removed from the device.

Soft power off has permeated our digital devices, more or less without users ever thinking about it, for one reason: the need to maintain a consistent state.

This state could be your device remembering your web page history. Or its position in your music play-list. Or where you paused your movie. Or the last number you dialed. But state can also be something a lot more abstract, data the device has to save as part of some function or service it is doing on your behalf, or even something in the realm of routine maintenance, the details of which might make your eyes glaze over if you actually had to know about it. For example, devices with global positioning system capabilities - which is nearly everything now - like to save information about the GPS satellites used during the last position fix because this can vastly speed up acquisition of the same satellites the next time you turn the device on, providing you haven't moved very far or it hasn't been turned off for very long. You appreciate this capability even if you don't know about it.

This state could be saved on a remote server for network attached devices, whether they are wireless or wireful. But more often than not these days, state is saved on a persistent read-write storage drive embedded directly in your device. The growth of read-write storage in embedded devices has exploded in recent years. Very early mobile digital devices actually had tiny surface-mount spinning disk drives. But less expensive flash memory -- read-write persistent semiconductor memory with no mechanical parts -- now dominates the mobile device market, and is beginning to dominate even the less mobile laptop market.

Sometimes this flash memory is used directly by the device; the operating system uses a file system implementation like the Journalling Flash File System 2 (JFFS2) or Yet Another Flash File System (YAFFS) that makes the flash behave less like memory and more like a disk drive, and that provides the usual functional capabilities like directories and files and permission bits and the like.
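
On a Linux-based device this is just another mount; a sketch of what it looks like from C, assuming a hypothetical flash partition exposed as /dev/mtdblock0 and a mount point of /mnt/flash.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Mount the raw flash partition as a JFFS2 file system so that the
     * application sees ordinary directories, files, and permission bits. */
    if (mount("/dev/mtdblock0", "/mnt/flash", "jffs2", 0, NULL) < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}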

Some read-write persistent storage devices, like a USB memory stick or a microSD memory card, offer a slightly more disk-like hardware interface on top of the underlying flash, and the operating system conspires to make the storage device behave like a disk to the application software. My little shirt-pocket-sized ODROID-A4, a battery-powered and WiFi-connected platform reference device produced by Hardkernel for developers writing low level code for Samsung's Galaxy Android smartphones and tablets, uses a microSD card for its persistent storage. But the A4 layers disk partitioning and multiple EXT4 file systems on top of it, something you would in the past have expected to find on a server at the data center.

Solid state disks (SSDs) are storage devices which emulate a full-blown disk hardware interface on top of the underlying flash memory. Not even the operating system may be able to tell the difference between the SSD and a spinning disk. I've built embedded products using SSDs that used the stock disk drivers in Linux. On these systems I had no choice but to use file system implementations tailored for disk drives, like EXT3, because that's the hardware interface I had to work with.

The introduction of read-write persistent disk-like semantics to mobile devices brings with it not just all the convenience and capabilities of having spinning disks, but all the issues that plague our mobile devices' bigger cousins that traditionally use those spinning disks. I've already written about issues of data remanence and solid-state storage devices. But here, I'm talking about basic reliability.

Perhaps you have learned the hard way to put your desktop system on an uninterruptible power supply. You may not appreciate the fact that your laptop has its own built-in UPS, but you depend on that fact just the same. And pulling the power cord out of a running server at your data center is a good way to get escorted to the door by your organization's security apparatus. There is a reason why all of these devices now implement soft power off. And why Google added a twelve volt battery to each individual server.

The reason is that as application software has become more and more complex, its demands on its underlying storage system have increasingly come to resemble a database transaction, either in fact (because it uses an actual database) or in function (because it requires atomically consistent behavior to be reliable). It is for this reason that, no matter what the nature of the underlying storage device, file system implementations like EXT3 and EXT4 have borrowed from the database world and are journalled file systems: a single atomic write to a sequential file or journal on the storage device is done first to record the intent of the more complex multiple write operations that follow, which may be spread across the storage device. If a failure occurs during the multiple writes, the journal is consulted during the restart to repair the file system. (Log-structured file systems do away with the subsequent multiple write operations completely and merely reconstruct the vision of the file system as seen by the application software from the sequential log file as a kind of dynamic in-memory hallucination, with some performance penalty.)
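
Stripped to its essentials, the write-ahead idea looks something like this. This is a conceptual sketch in POSIX C, not how EXT3 or EXT4 actually lay out their journals, and the record formats are made up for illustration.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Record the intent of an update in the journal, force it to stable
 * storage, perform the scattered data writes, then record the commit.
 * If power fails anywhere in between, the journal tells the restart
 * code what to repair or to discard. */
int journalled_update(int journal_fd, int data_fd,
                      const char * block, size_t length, off_t offset)
{
    char record[64];

    snprintf(record, sizeof(record), "INTENT %ld %lu\n", (long)offset, (unsigned long)length);
    if (write(journal_fd, record, strlen(record)) < 0) { return -1; }
    if (fsync(journal_fd) < 0) { return -1; }       /* intent is now durable */

    if (pwrite(data_fd, block, length, offset) < 0) { return -1; }
    if (fsync(data_fd) < 0) { return -1; }          /* data writes are durable */

    snprintf(record, sizeof(record), "COMMIT %ld\n", (long)offset);
    if (write(journal_fd, record, strlen(record)) < 0) { return -1; }
    return fsync(journal_fd);                       /* commit is durable */
}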

Update 2016-03-09: Something I failed to make clear here is that in the case of journalled file systems, only the meta-data -- that is, only the writes done to modify the structure of the file system itself -- are saved in the journal, not the writes of the data blocks. This allows the file system to be repaired following a power cycle, such that the file system structure is intact and consistent. But the data writes in progress at the time of the power cycle are lost. One of the symptoms I've seen of this is zero-length files. The file entry in the directory was completed from its record in the journal, but the actual data payload was not.

The need for consistent file system semantics has led to a lot of research in file system architectures, because techniques like journalling are not perfect, and sometimes not adequate for application software that has more complex consistency requirements than just knowing whether a particular disk block has been committed reliably to the storage device. But more practically, it has led to the end of hard power off as a hardware design. Soft power off gives the software stack time to commit pending writes to storage to ensure a consistent state on restart. (And for network connected devices which may depend on consistent state on remote servers, it allows for a more orderly notification of the far-end and shutdown of communication channels.)
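
On a Linux-based embedded device, the tail end of that soft power off sequence might look something like this. A sketch only; a real shutdown path does much more, and the mount point is an assumption.

#include <sys/mount.h>
#include <sys/reboot.h>
#include <unistd.h>

/* Flush everything the kernel is still caching, quiesce the file
 * system, and only then allow the hardware to actually go dark. */
void soft_power_off(const char * mount_point)
{
    sync();                                 /* commit dirty buffers to storage */
    umount2(mount_point, MNT_DETACH);       /* detach the file system */
    reboot(RB_POWER_OFF);                   /* now it is safe to remove power */
}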

The web is full of woeful tales of users who bricked their devices by cutting the power to them at an inopportune moment. And I have my own horror stories of products on which I've worked with read-write persistent storage but architected with only hard power off.

Hard power off is such an issue in maintaining the integrity of SSDs that the more reliable ones (by which I mean, the only ones you should ever use) implement their own soft power off, in the form of a capacitor-based energy storage system, to keep the device running long enough to reach a consistent internal state. There are a lot of SSDs that don't do this. Those SSDs are crap. As you will learn the hard way once you've cycled power on them just a few times. (If you are using SSDs in any context, adding a UPS to the system in which they are used is not sufficient. As long as power is applied, the tiny controller inside the SSD is doing all sorts of stuff, all the time, asynchronously, whether or not your system is using it, even if your system has been shut down. Like garbage collecting flash sectors for erasure as a background task. Only the controller inside the SSD knows when it's reached a consistent state; neither the operating system nor even the BIOS has any visibility into that activity.)

This is just going to get worse. The decreasing cost of solid-state read-write persistent storage makes it more likely that it will be used in less and less expensive (and hence a greater and greater number of) small digital devices. Increasing memory sizes on digital devices allow more complex software, which places greater demands on the storage system. Larger memory also increases the amount of data cached there, typically for reasons of performance, which stretches the latency in committing the modified data to storage, and increases the likelihood that an inconsistency will result should a failure occur. (One of the principal differences between the EXT3 and the EXT4 file systems is that the latter caches data more aggressively.) We should have expected this just by looking at the disparate technology growth curves.
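
The application itself can narrow that window, at some cost in performance, by forcing critical data out of the cache explicitly. Here is a sketch of the sort of thing I mean, using a hypothetical settings file as the example.

#include <fcntl.h>
#include <unistd.h>

/* Write a user setting and force it out of the page cache before
 * reporting to the user that it has been saved. */
int save_setting(const char * path, const void * data, size_t length)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { return -1; }
    if (write(fd, data, length) != (ssize_t)length) { close(fd); return -1; }
    if (fsync(fd) < 0) { close(fd); return -1; }    /* commit data and metadata */
    return close(fd);
}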

As you consider expanding your product line to include more digital control in your embedded products, it will occur to you that adding some solid-state read-write persistent storage would be a really good thing: to store user settings, to allow firmware to be updated, to implement more and smarter features. Once you take that step, remember that you now face the issue of soft power off where perhaps you didn't before. Because with today's digital devices, hard power off is dead.

Update 2016-03-09: Nearly four years after having written this article, I am still trying to convince my clients that they cannot design their embedded products with a hard power off switch, even as the complexity of these products, in both software and hardware, evolves to look more and more like data center servers. And failing that, helping them try to figure out how to make their products more recoverable in the field when their file systems -- much much larger than they were four years ago -- are hopelessly scrogged.

Thursday, July 12, 2012

STEM, RPN, and Market Forces

There was a time when giants walked the Earth. Giants who were into science, technology, engineering, and mathematics, or what today is known as STEM. Giants who landed men on the moon and brought them back. And when those giants needed to solve some problems, problems so hard they couldn't just solve them in their heads (which were mighty problems indeed), they pulled out their trusty Hewlett Packard calculators and made mysterious incantations in what is known as Reverse Polish Notation or RPN.

HP made an entire line of domain-specific calculators just as today we have domain-specific programming languages, calculators designed for specialized job functions. Here are three of them that I own and still use on a regular basis.

Hewlett Packard 11C Scientific Calculator

The 11C is a scientific and engineering calculator. It features all the usual logarithmic, trigonometric, and exponential functions. It's programmable: you can save sequences of steps, including iteration and conditional expressions, to calculate long equations automatically. If you took science and engineering courses in high school or college, this calculator would have been your faithful servant. You can even use it to balance your checkbook.

Hewlett Packard 12C Financial Calculator

The 12C is a financial calculator. It computes depreciation, loan amortization, and net present value (NPV), stuff I actually have to do now and then on those occasions when I have to put on my management hat. It also does a bunch of stuff that I know little or nothing about, but the MBAs who read my blog (remarkably, there are one or two) will recognize. You can tell this calculator was meant for guys on Wall Street because its metal trim is gold instead of silver. No, I'm not kidding.

Hewlett Packard 16C Programmer's Calculator

The 16C is a programmer's calculator. It handles calculations in decimal (base ten), hexadecimal (base sixteen), and octal (base eight). The ability to do octal arithmetic will be appreciated by my old comrades from my PDP-11 days, since that minicomputer insisted on organizing everything in three-bit units to match the fact that it had a whopping big set of eight registers. The 16C handles logical operations like and, or, and exclusive-or, bit shifts and rotations, and both one's and two's complement. These are the kinds of calculations you routinely do when you do the kind of work I do. I use my 16C on nearly a daily basis.
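
For the record, these are the sorts of calculations I mean. They are trivial to express in C, as in the little sketch below with a made-up sixteen-bit value, but it is the quick sanity checking at the keyboard that the 16C is for.

#include <stdio.h>

int main(void)
{
    unsigned int word = 0xA5C3U;

    printf("hexadecimal %04X octal %06o decimal %u\n", word, word, word);
    printf("low nibble %X\n", word & 0x000FU);                      /* mask */
    printf("shifted right four bits %04X\n", word >> 4);            /* shift */
    printf("one's complement %04X\n", ~word & 0xFFFFU);             /* invert */
    printf("two's complement %04X\n", (~word + 1U) & 0xFFFFU);      /* negate */

    return 0;
}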

People will tell you that the software calculator they have on their laptop or tablet is just as good as these old HP calculators. Those people are wrong. What they really mean is that once in a while they have to add numbers bigger than they can do in their head, and the little calculator that came on their Windows or Mac laptop suffices. I use those calculators too, and will even admit that the Calculator program found in the Applications directory on Mac OS X (not the dumb-as-a-brick calculator widget) is actually pretty good. I once had an excellent calculator utility on my Palm Pilot for which I paid real money; it emulated an HP calculator right down to the colors, shape, and placement of the buttons. But for the most part, calculator software applications seem more like someone's freshman computer science project. When the going gets tough, the tough scientists and engineers get out their HP calculators and start slinging RPN.

Or at least, they used to. I use my 16C so often that it occurred to me that maybe for not too much money I could own two of them, one to keep in my office at home and one to keep in the briefcase I take to client sites. After some web perusal, what I discovered is guys like me (both in age and profession) paying ridiculous prices for used HP 16C calculators on eBay. The cheapest one I saw today was US$71, the most expensive US$389. Three hundred and eighty-nine dollars! They've become fraking collectors' items!

I immediately turned around in my office chair and put all three of my HP calculators in my floor safe.

But I don't mean to imply that you can't buy any of these calculators new. HP still manufactures one of them, and it looks exactly like mine that is pictured above. Can you guess which one?

If you guessed the 12C financial calculator, then you probably have some insight into what all the fuss is about when people express concern that not enough students are majoring in STEM-related fields. Seriously. The fact that HP, once the premier manufacturer of scientific and engineering calculators, now makes just one of these little shirt-pocket-sized marvels, for those folks with a business school background, tells you something about how they see the market for their products.

To be fair, HP does make some pretty nice looking larger models; I find the 35s scientific calculator to be kind of sexy, and it still supports RPN. But I don't see anything even remotely like my beloved 16C. I'm guessing those Java developers with their newfangled multi-core servers and their fancy sixty-four bit addressing just don't need to do hexadecimal arithmetic anymore.

I hate how I sound when I get this way. Hey, you kids! Get off my lawn! I blame Congress.