Monday, May 30, 2011

MSST 2011

Last week I attended the 27th IEEE Symposium on Massive Storage Systems and Technologies, a.k.a. MSST 2011. A couple of decades ago, when large scale storage was my field, I attended this conference annually. This year I attended it because it was a twenty-minute drive from the Palatial Overclock Estate (what the media insists on calling the Heavily-Armed Overclock Compound) and because it was being held at the Brown Palace Hotel, a stunning Victorian-era historic fixture in downtown Denver, Colorado. It was also a good excuse to catch up on the state of the art.

Back when I was a regular at this conference, large meant terabytes, which is kind of remarkable considering today you can trivially buy a terabyte disk drive at your local big box store. Nowadays, large means exabytes; a single tape cartridge can hold several terabytes, and a single robotic tape library can hold thousands of tape cartridges.

Applications of storage at the exabyte scale fall into a variety of categories, such as long term archiving of data from observations and experiments (e.g. weather observations going back decades, satellite imagery, results from experiments with particle colliders, DNA sequencing of individuals); short or long term storage of the output of applications running on high performance computers; and cloud storage, much of which is driven by the consumer space. All of these domains were represented at this international conference, with speakers and attendees from national laboratories and agencies, storage vendors, cloud users and providers including Yahoo!, and academic and corporate researchers doing original work in this area.

Here are some of the notes I wrote down while attending this week-long conference. All original ideas are those of the speakers; all interpretations, and misinterpretations, are mine. And don't be surprised if some of these contradict one another. The notes are in chronological order. You can find a complete list of the papers, speakers, and panels from whence they came here.

Geoff Arnold/Yahoo!: Cloud storage: trends and questions

10x Effect: refers to the observation that scaling your application by a factor of ten requires that you re-architect it.

Trend: a shift in the components used in mass storage systems from enterprise-grade devices to consumer-grade devices. The consumer-grade devices may be less reliable, but you have to accommodate failures even in enterprise-grade devices anyway, and the consumer-grade devices offer better price/performance.

Fail-in-place: a strategy where, when something fails, you don't replace it, you just power it off and let the system work around it. It will be replaced anyway by the next technology refresh, which happens so often that replacing a failed component with the same technology isn't worth the cost.

RAID (Redundant Array of Inexpensive Disks) is dead. Instead: JBOD (Just a Bunch Of Disks, or Just a Box Of Drives), with geographic replication for reliability. Part of the problem with RAID is that it doesn't scale: a RAID array appears to be one I/O device with one serialized command queue, whereas commands to the individual drives in a JBOD can be parallelized.

SAN (Storage Area Network, not to be confused with Network Attached Storage) is dead. Instead: converged networks using a single interconnect fabric. (But there were lots of folks proposing SAN architectures, too.)

Immutable objects: the concept where once an object is stored, it is never modified, except perhaps to be deleted. This vastly simplifies the storage system design. (This is very similar to the idea of single-assignment in programming languages, where variables can never be changed once assigned. As bizarre as this sounds, it's quite workable. I implemented a single-assignment programming language as part of my master's thesis, and another graduate student implemented a Prolog interpreter in my language.)
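
Here's a minimal sketch, in Python, of what a write-once object store might look like; the class and its semantics are my own illustration, not anything presented at the conference.

```python
class ImmutableObjectStore:
    """Toy write-once object store: objects can be created and deleted,
    never modified in place. Names and semantics are illustrative only."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        # Refuse to overwrite: immutability means no update-in-place.
        if key in self._objects:
            raise KeyError("object %r already exists and is immutable" % key)
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def delete(self, key):
        # Deletion is the only permitted "modification".
        del self._objects[key]
```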

POSIX file semantics are dead, because they don't scale. The big problems are file locking and file modification. Instead: the future scalable storage interface will be RESTful, implementing HTTP-like operations limited to GET, PUT, POST, and DELETE, and operating on objects instead of files and blocks.
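
Here's a rough idea, again in Python, of what that interface looks like from the client side; the host name and URL layout are hypothetical, not any particular provider's actual API.

```python
import http.client

# Hypothetical object-store endpoint; the host and URL layout are
# illustrative, not any particular vendor's API.
HOST = "objects.example.com"

def put_object(name, data):
    conn = http.client.HTTPConnection(HOST)
    conn.request("PUT", "/objects/" + name, body=data)
    status = conn.getresponse().status
    conn.close()
    return status

def get_object(name):
    conn = http.client.HTTPConnection(HOST)
    conn.request("GET", "/objects/" + name)
    response = conn.getresponse()
    data = response.read()
    conn.close()
    return data

def delete_object(name):
    conn = http.client.HTTPConnection(HOST)
    conn.request("DELETE", "/objects/" + name)
    status = conn.getresponse().status
    conn.close()
    return status
```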

Trend: movement from synchronous to asynchronous operations, because the former don't scale. One of the reasons POSIX file semantics are dead is we have to give up the API that has "an implicit guarantee of transactional consistency". Eventual consistency scales.

Trend: replacing MTBF (Mean Time Between Failures) with MTTR (Mean Time To Recover). The former doesn't scale. You have to design your system around the assumption that devices will fail, and optimize your ability to recover from those failures. (The routing architecture of the Internet Protocol was based on this concept, and the telecoms have been making this assumption for decades, for exactly the same reasons. I have also been attending an exascale computing conference, sponsored by the U.S. Department of Energy, for the past couple of years. To scale up to an exascale supercomputer with commodity memory chips, the MTBF calculation suggests that such a system will suffer a RAM chip failure every few minutes.)
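
Here's the kind of back-of-the-envelope arithmetic that leads to that conclusion; the per-chip MTBF and chip count are my own assumptions, not numbers from the talk.

```python
# Back-of-the-envelope: how often does *some* RAM chip fail in a huge system?
# Both figures below are assumptions for illustration.
CHIP_MTBF_HOURS = 5.0e6        # assumed MTBF of one commodity DRAM chip
CHIPS_IN_SYSTEM = 50.0e6       # assumed chip count in an exascale machine

system_mtbf_hours = CHIP_MTBF_HOURS / CHIPS_IN_SYSTEM
print("expect a chip failure every %.1f minutes" % (system_mtbf_hours * 60.0))
# With these assumptions: one failure roughly every six minutes.
```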

Trend: replacing a "predictable business model" with one of "chaotic innovation". (I wish the speaker had said more about that. But I like the premise. It fits in well with my own reading on asymmetric warfare and John Boyd's OODA loops.)

Trend: replacing human operators with automation and self-service. The former doesn't scale. Sun Microsystems was quoted as observing that "the largest source of failures in the data center was humans opening the door".

Cloud computing requires operational refactoring: outsourcing lower portions of the application stack, taking advantage of economies of scale, implementing service level agreements, and leveraging specialization on the part of your provider.

Virtualization is a nightmare for security. Hypervisor vulnerabilities are the major concern. No matter how bulletproof your application stack is, once you run it in a virtual machine, all bets are off. VMs make the cloud providers very, very nervous. (This was a recurring theme.)

People get nervous about cloud computing, yet there are centuries of precedent for outsourcing critical infrastructure. You put all your money in a bank, and that's your money, for Pete's sake. Not to mention hospitals, delivery services, lawyers, etc.

Stovepiped: when your application stack is vertically integrated. This is a problem when moving to cloud architectures since doing so means migrating the lower portions of the stack into the cloud. To gain efficiencies of scale requires cloud vendors to provide basically the same generic services to everyone. (That was a new term for me.)

The recent failure of Amazon.com's storage system was not in their Simple Storage Service (S3), which has RESTful semantics, but in their Elastic Block Store (EBS), which has POSIX semantics. And even then, the failure fell within their Service Level Agreement. Cloud storage architects point to this as an example of why POSIX semantics don't scale.

On moving part of your application stack into the cloud: "Until a new airliner crashes, you have no idea what will happen when that airliner crashes".

Trend: cloud providers have tended to optimize their stacks for "north-south" traffic (vertical within the stack) but are seeing a lot of growth in "east-west" traffic (horizontally between applications within the stack).

We need to automate systems originally designed for human time scales. An example of this is the internet's Domain Name System (DNS), which was designed to be updated by a human using a command line interface, not by automated systems in clouds registering and deleting many domains per second.

Joshua McKenty/Piston Cloud Computing: Openstack: cloud's big tent

Trend: moving computing closer to or even into the storage environment. This becomes more important as the datasets get larger. (This was reiterated time and again by many speakers.)

Tombstoned: destined or marked for deletion.

Patrick Quaid/Yahoo!: Lessons learned from a global DB

Trend: the cloud stack becomes (from the top down): SaaS (Software as a Service), KaaS (Knowledge as a Service), PaaS (Platform as a Service), IaaS (Infrastructure as a Service).

Trend: in social networking, the movement from "me-view" to "we-view". (I interpreted this to mean applications leveraging crowd sourcing.)

Cloud providers have found that implementing big-market infrastructures is wasted in developing markets. One size does not fit all.

You can't retrofit consistency into your architecture.

Data remanence: how do you guarantee the data is deleted? In domains like defense, finance, and medicine, there are legal requirements to ensure that deleted data cannot be recovered (particularly by the wrong parties). This is especially troublesome given the trend of geographic replication. You could encrypt the data and later delete the keys, but there is no guarantee that the keys themselves are not recoverable, and also no legal precedent that such a strategy is adequate. On the flip side, there are opposing legal requirements to ensure that some data flows can be wiretapped.
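
Here's a sketch of that encrypt-then-destroy-the-key strategy (sometimes called crypto-shredding), using the third-party Python cryptography package; whether this satisfies any legal requirement is exactly the open question.

```python
from cryptography.fernet import Fernet  # third-party "cryptography" package

# Crypto-erasure sketch: store and replicate only ciphertext; "deleting" the
# data means destroying the key. The record below is obviously a placeholder.
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(b"patient record 12345")

# ... replicate the ciphertext far and wide ...

del key  # once every copy of the key is gone, the replicas are (in theory) unreadable
```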

Scale: billions of records, ~600,000 users, one million reads per second, ~500,000 writes per second.

Sean Roberts/Yahoo!: Implementation problems and design principles

Black swan events: singularities like the death of Michael Jackson, the Royal Wedding.

No-ops are the enemy.

Chris King/LinkedIn: Storage infrastructure design and management at scale

Google is moving from consumer-grade storage devices to enterprise-grade devices, because the latter are cheaper in the long run. Even if the enterprise-grade device lasts just a little longer before it fails, the savings in labor costs to have someone replace the device makes up for it.

Conway's Law: a system's architecture will recapitulate the bureaucracy that created it. This is a very old (circa 1968) assertion, but cloud providers have observed it in their domain.

Geoff Arnold/Yahoo!: Private clouds

Forget about the long tail; the problems are with the big head.

In a slowly changing environment, it's easy to assume some things are constant, when they in fact aren't. Result: correlators don't correlate, invariants aren't invariant. Example: IP addresses. (I've been bit by this one too. I used to use MAC addresses as a unique system identifier, until I started seeing systems in which users routinely reprogrammed their MAC addresses.)

On demand assumes that each individual customer is small compared to the entire pool of customers, and that there is no correlation of behavior among customers. These assumptions are frequently wrong.

A sufficiently large data replication process looks like a distributed denial of service (DDoS) attack.

Joshua McKenty/Piston Cloud Computing: Security

Blue Pill Attack: when the virtual machine attacks the underlying real machine. This is the opposite of the usual concern. (From the movie The Matrix, this is specifically when a real machine is compromised by it having been unwittingly virtualized, hence the attacker has wrested control of the real machine from the host operating system.)

Michael Ernst/Brookhaven National Laboratory: Experience with 30PB of data from the LHC

Storing petabytes of data is easy compared to moving petabytes of data to where it is needed. (This was a recurring theme, driving the rethinking of architecture.)

Ron Oldfield/Sandia National Laboratory: Addressing scalable IO challenges for exascale

Electrical power is the largest hurdle to exascale. At current efficiencies, an exascale system will require two gigawatts of power. Data movement is the largest user of power. (At the last exascale conference I attended someone mentioned that the Chinese have committed to building their supercomputer centers with their own nuclear power plants. Everyone wonders how we'll compete with that.)

Applications use the storage system as a communications channel.

Trend: it may be more economical to recompute rather than to store. (This is a theme I see recurring at all scales, from massive storage systems to variables in a single application.)
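
The trade-off is simple arithmetic once you put numbers on it; the numbers below are invented purely for illustration.

```python
# Toy economics of recompute-versus-store; every figure here is made up.
DATASET_PB = 1.0                    # size of the derived dataset
STORAGE_COST_PER_PB_MONTH = 20000.0 # assumed cost to keep a petabyte online
MONTHS_RETAINED = 12.0
RECOMPUTE_COST = 150000.0           # assumed cost of one re-run of the pipeline
EXPECTED_REREADS = 1.0              # how many times we expect to need it again

store = DATASET_PB * STORAGE_COST_PER_PB_MONTH * MONTHS_RETAINED
recompute = RECOMPUTE_COST * EXPECTED_REREADS
print("store: $%.0f, recompute: $%.0f" % (store, recompute))
# With these numbers, recomputing on demand is cheaper; change the assumptions
# and the answer flips.
```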

More than eighty percent of all I/O is done for reasons of resilience.

Applications today are not designed to survive failures. The larger the application, the higher the probability of failure. This is the reason for checkpoints (where the entire application state is periodically dumped to storage). As applications grow larger, and their data requirements bigger, checkpoints are increasingly expensive to generate and to store. (This is just another take on the theme that you have to expect failures and be able to recover from them.)
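
A checkpoint is conceptually trivial; here's a toy sketch in Python with placeholder state, interval, and file name. The expense at scale is entirely in the equivalent of that pickle.dump when the state is petabytes.

```python
import pickle
import time

# Minimal periodic-checkpoint sketch: dump the whole application state to
# storage every so often so a crash only loses work since the last dump.
CHECKPOINT_INTERVAL_S = 600.0

def run(state, step, checkpoint_path="checkpoint.pkl"):
    last = time.monotonic()
    while not state.get("done"):
        step(state)                              # advance the computation
        if time.monotonic() - last >= CHECKPOINT_INTERVAL_S:
            with open(checkpoint_path, "wb") as f:
                pickle.dump(state, f)            # the expensive part at scale
            last = time.monotonic()
```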

Asynchronous I/O still buries the network, and jitter becomes a real issue. But, storing data persistently at each intermediate stage is actually very resilient. (This was said in the context of Big Science, such as the Large Hadron Collider.)

Dave Fellinger/Data Direct Networks: Simplifying collaboration in the cloud

IDC predicts that by 2020 the world will have thirty-five zettabytes of digital storage.

There is no bandwidth faster than a 747 filled with tapes. (This is a variation on the idea that Fedex operates the highest bandwidth network in the world by virtue of overnight delivery of tapes.)
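
The arithmetic is fun to do; these figures are my own rough 2011-era guesses, not the speaker's.

```python
# How fast is a 747 full of tape? All figures are rough assumptions.
CARTRIDGE_TB = 1.5          # e.g. an LTO-5 cartridge, uncompressed
CARTRIDGES = 500000         # assumed payload of a fully loaded freighter
FLIGHT_HOURS = 12.0         # long-haul flight time

terabytes = CARTRIDGE_TB * CARTRIDGES
gbps = terabytes * 8.0e12 / (FLIGHT_HOURS * 3600.0) / 1.0e9
print("%.0f TB in %d hours is about %.0f gigabits/second" % (terabytes, FLIGHT_HOURS, gbps))
# With these assumptions: roughly 140,000 gigabits/second of effective bandwidth.
```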

Alan Hall/NOAA: National Climate Data Center's plan for reprocessing large datasets

"80% of the world's data is on tape." (The speaker was quoting the CTO of a manufacturer of tape libraries.)

Trend: archiving the software applications along with the data they produced or processed. The software effectively becomes part of the metadata, because it is the only formal description of the data. The application, the operating system on which it runs, and the data would all be deployed to a VM from the mass storage system.

Harry Hulen/IBM: Operational concepts and methods for using RAIT in high availability tape archives

Rule of thumb: it takes about one full time equivalent (FTE) of staff to manage every five petabytes of tape archive.

Todd Abrahamson/Imation: Tape media reliability

The black box flight recorder from the Columbia space shuttle disaster hit a tree in Texas at an estimated Mach 18. It took weeks to be found, during which time water got into the device. The data on its magnetic tape were still recoverable. (The speaker was the engineer that recovered and interpreted this data.)

Rule of thumb: current magnetic tape technologies offer about a thirty-year longevity when stored at seventy degrees Fahrenheit and thirty percent relative humidity.

Current hard disk drive controller boards are tuned to the specific spindle with which they are manufactured. This makes it difficult to recover a disk whose controller has been damaged. You can't just replace the controller with one from another disk of the same model.

Panel Discussion

Trend: embed flash memory in tape cartridges so as to store metadata in the cartridge itself. (Cartridges today use an RFID chip to carry identity information.)

Economics drive tape technology migration, not reliability or longevity.

Latency means uncertainty.

The National Institute of Standards and Technology (NIST, formerly the U.S. National Bureau of Standards) defines the required archive duration for medical records as "the life of the Republic plus one hundred years".

Dimitry Ozerov/DESY: Data preservation in high energy physics

Both theory and common sense evolve over time. This requires that we periodically revisit legacy data.

Ian Fisk/Fermi National Accelerator Laboratory: Evolving WLCG data access strategy and storage management strategies

There is an eight-order-of-magnitude (10^8) difference in data popularity between the most and least popular datasets. (This strikes me as another example of the long tail phenomenon.)

Like events tend to aggregate (be clumped together) in a dataset due to buffering, resulting in jitter in the data. This is an issue when the data is reprocessed for analysis, since it is not stored in strictly the order in which it occurred in the experiment. (This was said in the context of Big Science, like the Large Hadron Collider, where the raw datasets generated in real-time from a single experiment are petabytes in size.)

Panel Discussion

Funding agencies are no longer willing to pay for peak capacity. Sharing resources is necessary.

Amazon.com's Amazon Web Service (AWS) established not only that infrastructure has value, but what that value was, and it demonstrated that customers would be willing to pay for infrastructure.

Peter Braam/Xyratex: Bridging the peta- to exa-scale I/O gap

Rule of thumb for exascale storage requirements: memory capacity times thirty equals scratch space capacity, times three equals archive storage capacity per month. (I believe this was said in the context of Big Science.)

Metrics for an exascale system: 10^8 CPU cores, 10 gigaFLOPS per core, 1 gigabyte of RAM per core, 5x10^3 cores per node, 5 terabytes of RAM per node, 50 teraFLOPS per node, 15 gigabytes/second of I/O per node, 20x10^3 nodes per cluster, 100 petabytes of RAM per cluster, 300 terabytes/second of I/O per cluster, file system capacity greater than 1 exabyte.
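
Those numbers hang together, and they also line up with the scratch/archive rule of thumb above; here's the arithmetic spelled out, using only the quoted figures.

```python
# Sanity-check the exascale metrics above, plus the scratch/archive rule of
# thumb from the previous note. Pure arithmetic on the quoted figures.
cores            = 1.0e8
gflops_per_core  = 10.0
cores_per_node   = 5.0e3
ram_per_core_gb  = 1.0
nodes            = 20.0e3
io_per_node_gbs  = 15.0

print("peak:", cores * gflops_per_core / 1.0e9, "exaFLOPS")                     # 1.0
print("RAM/node:", cores_per_node * ram_per_core_gb / 1.0e3, "TB")              # 5.0
print("RAM/cluster:", nodes * cores_per_node * ram_per_core_gb / 1.0e6, "PB")   # 100.0
print("I/O/cluster:", nodes * io_per_node_gbs / 1.0e3, "TB/s")                  # 300.0

# Rule of thumb: memory x 30 = scratch; scratch x 3 = archive per month.
ram_pb = 100.0
print("scratch:", ram_pb * 30.0 / 1000.0, "EB")                                 # 3.0
print("archive/month:", ram_pb * 30.0 * 3.0 / 1000.0, "EB")                     # 9.0
```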

Architectures of exascale systems will have to be heterogeneous.

Dan Duffy/Goddard Space Flight Center: Alternative architectures for mass storage

NCCS, the smaller of NASA's two data centers, has enough storage for an iPod playlist 34,000 years long.

Mary Baker/HP Labs: Preserving bread crumbs

"All problems in computer science can be solved by another layer of indirection. Except for the problem of too many layers of indirection." -- David Wheeler (1927-2004) (Cited as the first person to be awarded a Ph.D. in Computer Science.)

The Japanese have developed a unique strategy for preserving their monuments: they periodically rebuild them from scratch. An example of this is the Sanctuary of Ise, in which the temple is rebuilt every twenty years on an adjacent plot using the prior temple as a model. This is also the strategy for long term data preservation: continuous renewal by copying to new media using new technology and verifying against the prior copy. (I've heard this strategy referred to as George Washington's Axe: "This is George Washington's axe. The handle rotted away so we replaced it. Then the head rusted away so we replaced that too.")

How do we communicate the value of data archives to future generations?

MTDL: Mean Time to Data Loss.

Phil Carns/Argonne National Lab: Understanding and improving computational science storage access through continuous characterization

MiB: a Mebibyte is 2^20 or 1,048,576 bytes.

MB: a Megabyte is 10^6 or 1,000,000 bytes (corrected).

Raja Appusuwamy/Free University of Amsterdam: Flexible, modular file system virtualization in Loris

Trend: designing a storage stack in a manner similar to a network stack. Layers: virtual file system, naming, cache, logical, physical, disk device driver, hardware.

Yangwook Kang/U. C. Santa Cruz: Object-based SCM: an efficient interface for storage class memories

SCM: storage-class memory e.g. flash memory and related memory-based persistent storage devices like solid state disk (SSD).

Yulai Xie/U. C. Santa Cruz: Design and evaluation of Oasis: an active storage framework based on T10 OSD standard

Trend: moving computation to the data via mechanisms like running workloads on the storage servers, instead of moving the data to the compute servers. At least one site is doing this using Java applications with JVMs running on the storage system, or by running complete packages of application plus operating system on a VM on the storage system.

Panel Discussion

Blocks are dead. Instead: object stores. (This is another take on the death of POSIX file semantics, and was a common theme.)

Humans are bad at remembering why they saved something. We have to automate the capture of the provenance of the data being saved.

Disks will be with us forever. Probably tape too. We can't build enough chip fabrication facilities to make enough solid state storage for all the world's data requirements.

Fewer and fewer students are studying the "systems" area. (This has been my experience as well, and is principally why my career has lasted as long as it has.)

Academic research should be like science fiction: it should ask what if questions, and expect a high rate of failure. Lots of useful collateral may be produced by even a failed line of research.

Academia should have been doing research on flash memories way earlier; by the time it did, flash was already everywhere. (This was said in response to a number of research papers given on alternative flash translation layers.)

Storage systems are hard. Distributed systems are hard. Distributed storage systems are really hard.

In three to five years there will be no spinning media in any consumer device. (This was very controversial, but I'm not sure I disagree with it. Part of the reason is that cloud storage will eliminate much of the requirement for big storage in consumer devices.)

Startups can't fund long horizon research and development. Only large companies can do that, thanks to their existing revenue stream. (As a former employee of Bell Labs, I don't think this bodes well for long horizon R&D.)

There are too many people designing new systems without understanding what existing systems do and how they work. (In fact, a couple of the speakers on cloud computing remarked that they had come to realize they were on trails previously broken by the telecommunications industry who are used to developing very large scalable geographically distributed networked systems that had to recover automatically from frequent failures. Having spent time at Bell Labs working on such systems, this had occurred to me as well.)

BHAG: Big Hairy Audacious Goal. BHAGs are what the Defense Advanced Research Projects Agency (DARPA) and the National Science Foundation (NSF) should be funding. They are currently way too oriented towards near future applications and incremental improvements.

The reliability problem extends to people.

Specific workloads matter. I/Os per second (IOPS) are crap as a useful measure. (This same sentiment has been expressed by the high performance computing community regarding supercomputer vendors quoting GFLOPS of performance for their wares. HPC pundit Bill Buzbee was exactly right in saying "Any claims of performance by a vendor should be considered a guarantee that their product cannot exceed them.")

Trend: The Rise of the Stupid Storage Device (this is actually my term, not the speaker's), moving the storage management intelligence further up in the application stack. (This is a more general statement of the comment above about RAID, where you get better performance by moving disk management outside of the storage device. Compare with: The Rise of the Stupid Network.)

Negative results are still results. Disproving your hypothesis does not mean you failed. More is learned by failing gloriously in an ambitious endeavor than by succeeding in a modest incremental goal.

Wes Felter/IBM: Reliability-aware energy management for hybrid storage system

Energy-Proportional: Google's term for energy use that is proportional to the level of activity. Hard disk drives are not energy-proportional: spinning the disk takes energy whether I/O is being done or not. Spinning a disk up from idle takes a lot of energy. You have to determine the break-even point to know when or if to spin a disk down. Reliability is adversely affected by spinning a disk up. (I've also had some colleagues who had to deal with disk drives that had to be periodically spun down; otherwise, cavitation would cause bubbles to form in the working fluid of the drive, leading to a disk crash. I've always assumed my colleagues were talking about the fluid bearings in the drive, but I might be wrong.)
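
The break-even calculation is straightforward once you assume some numbers (mine, not the speaker's).

```python
# Back-of-the-envelope break-even point for spinning a disk down; all figures
# are assumptions for illustration, not measurements from the talk.
IDLE_WATTS    = 8.0     # power while spinning but idle
STANDBY_WATTS = 1.0     # power while spun down
SPINUP_JOULES = 450.0   # extra energy to spin back up (and it costs reliability, too)

# Spinning down saves (IDLE - STANDBY) watts, but costs SPINUP_JOULES later.
break_even_s = SPINUP_JOULES / (IDLE_WATTS - STANDBY_WATTS)
print("spin down only if the idle period exceeds about %.0f seconds" % break_even_s)
```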

Dean Hildebrand/IBM: ZoneFS: Stripe remodeling in cloud data centers

The MapReduce design pattern is a form of moving the computation to the data, since at least the map portion of the computation can be applied to local data.
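
A toy word count in Python shows the shape of it: the map step can run wherever each partition of the data already lives, and only the small per-partition results have to move to the reducer. The partitions here are just lists standing in for local data.

```python
from collections import Counter

# Toy MapReduce: map runs "near the data" (one call per local partition),
# reduce combines the small per-partition results.
partitions = [["tape", "disk", "tape"], ["disk", "flash"], ["tape"]]

def map_partition(words):            # runs where the partition lives
    return Counter(words)

def reduce_counts(partial_counts):   # runs anywhere; its inputs are small
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

print(reduce_counts(map(map_partition, partitions)))
# Counter({'tape': 3, 'disk': 2, 'flash': 1})
```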

Muthukumar Murugan/U. Minnesota: Rejuvenator: a static wear leveling algorithm for NAND flash memory with minimized overhead

Rule of thumb: a read of a page of flash takes on the order of 10^1 microseconds (from another source: this is a random read, whereas a sequential read takes on the order of 10^-2 microseconds); a write of a page of flash, 10^2 microseconds; an erase of a block of flash, 10^3 microseconds. Pages are typically two or four kilobytes; blocks are typically 128 kilobytes. Single-level cell (SLC) flash can endure about 10^5 erase cycles per block; multi-level cell (MLC) flash, about 10^4 erase cycles.
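
Those endurance figures sound alarming until you do the arithmetic. Here's a rough estimate using the MLC figure above, with an assumed drive capacity and write rate, perfect wear leveling, and write amplification ignored.

```python
# Rough flash endurance estimate; drive capacity, daily write volume, and
# perfect wear leveling are all assumptions for illustration.
CAPACITY_GB      = 256.0
BLOCK_KB         = 128.0
ERASE_CYCLES     = 1.0e4          # MLC figure from the rule of thumb
WRITE_GB_PER_DAY = 500.0

total_block_erases = (CAPACITY_GB * 1.0e6 / BLOCK_KB) * ERASE_CYCLES
erases_per_day     = WRITE_GB_PER_DAY * 1.0e6 / BLOCK_KB
print("lifetime: about %.0f years" % (total_block_erases / erases_per_day / 365.0))
# With these assumptions: about fourteen years before wear-out.
```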

4 comments:

Craig Ruff said...

When I was still working on tape archives, I encountered two cases where the cartridge memory chip contents became corrupted and tape drives subsequently refused to load the cartridge. Supposedly the memory chip was just a backup to the on-tape metadata, but the firmware designers forgot that fact. The cartridges had to be sent to the manufacturer for data recovery.

Chip Overclock said...

The good news is with flash memory inside the cartridge, there will be a much larger capacity of data that can be corrupted. (But in my line of work, the likelihood of having a flash programmer is much higher than having an RFID programmer.)

Anonymous said...

Great post, thank you! Any context or further information on:

"Google is moving from consumer-grade storage devices to enterprise-grade devices, because the latter are cheaper in the long run. Even if the enterprise-grade device lasts just a little longer before it fails, the savings in labor costs to have someone replace the device makes up for it."

vs (and oft referenced)
Google and commodity disks

Chip Overclock said...

It was a comment made during Chris King's (from LinkedIn) presentation. My notes suggest he made it, but it's been long enough now that I can't be sure.

But it is consistent with other comments made during the conference and elsewhere: people are so expensive compared to hardware that organizations are now optimizing for labor costs instead of capital costs. This isn't new (see: Industrial Revolution), but it's interesting that even a relatively small difference in real MTBF can make a big difference in long term costs.

Thanks for the comment!