Saturday, March 30, 2013

Observations on Product Development: Part 3

  1. Your product forms a unified hardware-firmware-software-marketing ecosystem.
  2. This is true even if you are only really interested in one part of it.
  3. Your product has a lifecycle that extends from conception through decommission.
  4. When your product first ships it will be, at best, one-third of the way through its lifecycle.
  5. Your product's ecosystem and its lifecycle are orthogonal concerns.
  6. You need to control costs in both; otherwise they will eat you alive.
  7. Architect, design, implement, and deploy with the entire ecosystem in mind.
  8. Architect, design, implement, and deploy with the entire lifecycle in mind.
  9. Successful product development organizations don't do this because they are large.
  10. They got large because they did this and it allowed them to scale.

Sunday, March 24, 2013

Observations on Product Development: Part 2

  1. Product developers need order and structure to do their jobs.
  2. This is true whether they admit it or not.
  3. Every process - waterfall, agile, you name it - makes its own assumptions.
  4. If those assumptions don't hold, that process may not yield the results you expect.
  5. Everyone wants to use a process that guarantees success.
  6. There is no silver bullet.
  7. If it were easy, anybody could do it.
  8. There is no substitute for smart, engaged people and face-to-face communication.

Saturday, March 16, 2013

Observations on Product Development: Part 1

  1. All product development is fractally iterative.
  2. This is true whether you want it to be or not.
  3. Success comes from generating revenue.
  4. Revenue cannot be sustainably generated without shipping a product.
  5. No one has ever shipped a perfect product.
  6. You won't be the first.

Monday, March 04, 2013

Hard Power Off Is Dead But Not Buried

In The Death of Hard Power Off I talked about how the use of persistent flash-based read-write storage -- flash file systems and solid state disks -- had made soft power off a requirement for the systems that use those technologies: software mechanisms that perform an orderly shutdown of the device before power is actually removed. Failing to quiesce flash-based read-write storage before cycling power will eventually corrupt the file system. Do it enough, and you stand a good chance of bricking the entire device.

Just the other day, my occasional colleague Paul Gross passed this very recent gem of a paper along to me, for which I am grateful.
Mai Zheng, Joseph Tucek, Feng Qin, Mark Lillibridge, "Understanding the Robustness of SSDs under Power Fault", 11th USENIX Conference on File and Storage Technologies (FAST '13), San Jose CA USA, February 12-15, 2013
It's worth a read. Here's just a brief section from the abstract.
Applying our testing framework, we test fifteen commodity SSDs from five different vendors using more than three thousand fault injection cycles in total. Our experimental results reveal that thirteen out of the fifteen tested SSD devices exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, meta-data corruption, and total device failure.
These researchers from the Ohio State University and HP Labs take the same approach as unpublished (and kinda clever) work by my occasional colleague Julie Remington, a hardware engineer. While we were troubleshooting a system that had a surface-mount SSD, she hooked it up to a computer-controlled power supply and proceeded to cycle power on it in a controlled, scripted fashion. Each power cycle waited until the system under test was up and stable, and logged everything from the system's serial console. Her results: after a few iterations, the system had to use fsck to repair the EXT3 file system on the SSD (file system corruption, not unexpected under the circumstances); after a few more, the SSD began reporting a bogus device type and serial number to hdparm (internal meta-data corruption); and after just a few more, the device quit responding to I/O commands entirely. It could only be recovered by removing the tiny flash chips from the top of the SSD chip itself and replacing them with uncorrupted chips from an identical SSD. Which one of Julie's colleagues did. Which is kinda, you know, hard core.
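The shape of that scripted power-cycle rig can be sketched in outline. This is a hypothetical skeleton, not Julie's actual code: the power supply and serial console interfaces are stand-ins, injected as callables, so the loop itself can be exercised (or dry-run) without any lab hardware.

```python
# Sketch of a scripted power-cycle fault-injection harness, along the
# lines described above. power_on/power_off stand in for whatever drives
# the computer-controlled power supply; read_console stands in for the
# serial console; is_stable(log) decides when the system under test is
# up and stable. All four are hypothetical, injected dependencies.

import time

def power_cycle_test(power_on, power_off, read_console, is_stable,
                     cycles=100, settle_seconds=1.0):
    """Repeatedly hard power cycle the system under test, capturing
    the serial console log from each boot for later analysis."""
    logs = []
    for cycle in range(cycles):
        power_on()
        # Wait until the system under test is up and stable, logging
        # everything it prints on the serial console along the way.
        log = []
        while True:
            log.extend(read_console())
            if is_stable(log):
                break
            time.sleep(settle_seconds)
        logs.append(log)
        # Pull power with no orderly shutdown: hard power off.
        power_off()
    return logs
```

The interesting data are the captured console logs themselves: in the experiment above they would show fsck repairs first, then bogus hdparm identification, then a device that never comes up at all.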

But it doesn't have to be an SSD. I've seen similar failure modes in embedded systems using JFFS2 (Journaling Flash File System version 2) under Linux. JFFS2 does all the same kinds of things behind the scenes that an SSD does, except its controller is implemented in software instead of hardware, on top of commodity NAND flash. Just as the controller inside an SSD is rewriting its flash behind the scenes, the JFFS2 garbage collector kernel thread (which will show up as something like [jffs2_gcd_mtd2], interpreted as "JFFS2 Garbage Collector Daemon for Memory Technology Device partition 2", when you run a ps command) is rolling along on its merry way, erasing and rewriting flash blocks with little or no regard to the fact that your finger is on the power switch.

But with JFFS2, at least you have some prayer of coming to an orderly stop if you do something like a shutdown -h now before turning off the power. Not that systems in the field with hard power off will have an opportunity to do so, of course. But at least you might save a system or two in the development lab.
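That orderly stop amounts to a short, fixed sequence. Here's a minimal sketch of a soft power off helper; the flash mount point is a made-up example, and the command runner is injected so the sequence can be dry-run without actually halting the machine.

```python
# Minimal sketch of a soft power off for a JFFS2-based embedded system:
# flush dirty buffers, unmount the flash file system so the JFFS2
# garbage collector quiesces, then perform an orderly halt. The mount
# point /mnt/flash is a hypothetical example; run() defaults to really
# executing the commands but can be replaced for testing.

import subprocess

def soft_power_off(flash_mount="/mnt/flash", run=subprocess.check_call):
    commands = [
        ["sync"],                     # flush dirty buffers to storage
        ["umount", flash_mount],      # quiesce the flash file system
        ["shutdown", "-h", "now"],    # orderly halt before power removal
    ]
    for command in commands:
        run(command)
    return commands
```

In the field, of course, no one calls this before yanking the plug; the point is that in the lab you at least have the option.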

The problem with SSDs is that they are not only asynchronous -- doing stuff behind the scenes -- but also autonomous -- doing stuff that you don't even know about and have no control over. The SSD controller continues to erase and rewrite flash blocks even if you unmount the file system. Even if you shut down the operating system. Even if you hold the damn processor in reset.

It's like the honey badger. Honey badger SSD don't care.