Friday, December 16, 2011

The Cost of Experimenting with the Cloud

So after two months of my part-time experimentation with the Simple Storage Service (S3) from Amazon Web Services (AWS) using Hayloft, my company finally got charged for its usage. Three cents. It will be interesting to see if it actually shows up on my company credit card bill, or if Amazon.com defers it until the amount gets a little larger.

It's going to take a while. Typically I run the automated test suite many times a day, and each execution issues dozens of transactions to S3, but each transaction involves a tiny amount of data. Still, considering all I've learned while working on this project, this is a bargain.

Update 2011-12-18

I just counted: 198 individual S3 operations in the Hayloft test suite as of today, each one involving at least one request and response with S3.

Friday, December 09, 2011

Fibonacci Scaling

Circa 1996 I found myself in an ideal situation: spending long hours working on a very technically challenging project with a multi-disciplinary bunch of engineers who were all smarter than I was. Good times.

During that project one of my esteemed colleagues, John Meiners, needed a way to scale up the bandwidth on communications channels allocated in real-time on an as-needed basis. We were using Asynchronous Transfer Mode (ATM), a technology in which endpoints had to negotiate with the network for the amount of bandwidth they needed. Asking for more bandwidth than we needed meant more cost for our customers. Asking for less meant data loss. Tearing down an existing channel to allocate a bigger one would be viewed as a catastrophic failure. Due to the hard real-time nature of the application, doing everything via best effort was not even on the table.

So we started small, then as load increased we allocated additional channels. But some back-of-the-envelope calculations suggested that if we chose a fixed bandwidth increment, things could get out of hand quickly with the number of channels we had to manage. We needed to start small, but allocate larger channels as the need arose. What algorithm should we use?

John looked at various exponential algorithms, like doubling the bandwidth request of each subsequent channel. But that got really big (and hence really expensive) really quickly, and the scaling didn't match our expected use cases. What he finally settled on was Fibonacci numbers:

F(n) = F(n-1) + F(n-2)

where

F(0) = 0
F(1) = 1

Starting with F(1), just as Fibonacci himself did in 1202 (since using zero as a bandwidth multiplier doesn't make much sense), produces a sequence that looks like this:

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...

This turned out to be pretty much ideal. It starts small and conservative, expands but not too quickly, and is trivially computed since you just have to keep track of the last two numbers in the sequence and merely do an addition.
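
It's so simple it barely needs code, but here's a minimal sketch (my own illustration, not John's actual implementation; the 64 kb/s base increment is a number I made up) showing that picking the next channel size is just a matter of remembering the last two terms and adding them:

#include <cstdio>

int main(void) {
    const unsigned int basekbps = 64; // hypothetical base bandwidth increment
    unsigned int previous = 0;        // F(0)
    unsigned int current = 1;         // F(1)
    for (int channel = 1; channel <= 10; ++channel) {
        printf("channel %d: %u kb/s\n", channel, current * basekbps);
        unsigned int next = previous + current;
        previous = current;
        current = next;
    }
    return 0;
}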

Fibonacci numbers have been around a long time. Although Leonardo of Pisa (a.k.a. Fibonacci) gets the credit, the sequence was written down earlier by Indian mathematicians. It occurs frequently in nature, most notably in spirals such as those found in pine cones and shells. John's idea seemed pretty innovative in 1996, but lots of high-technology applications of the Fibonacci sequence exist today, including network back-off algorithms, lossy data compression, pseudo-random number generators, and tree and heap data structures.

I'm thinking about this because I am once again using this sequence as a way to scale up back-off time delays when recovering from network errors in Hayloft, my little C++ research project using Amazon Web Services Simple Storage Service (S3). I have no doubt that Fibonacci's simple little recurrence will continue to serve me well.
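
Here's a minimal sketch of what that kind of back-off might look like in a generic retry loop. The operation callback and the one-second base delay are stand-ins of my own, not Hayloft's actual API; the point is just the Fibonacci growth of the delay between attempts.

#include <unistd.h>

// Retry an operation with Fibonacci back-off: wait 1, 1, 2, 3, 5, ... seconds between attempts.
bool retryWithFibonacciBackoff(bool (*operation)(), int limit) {
    unsigned int previous = 0; // F(0)
    unsigned int current = 1;  // F(1)
    for (int attempt = 0; attempt < limit; ++attempt) {
        if (operation()) { return true; }
        sleep(current); // back off for F(n) seconds before the next attempt
        unsigned int next = previous + current;
        previous = current;
        current = next;
    }
    return false;
}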

Thursday, December 08, 2011

Venkatesh Rao on Developeronomics

Venkatesh Rao had an article on the Forbes web site on the valuation of software developers: "The Rise of Developeronomics". He comes closer than anyone else so far (including me) to describing how I see things going career-wise for software developers, although it's written from the perspective of someone desperately trying to hire competent developers to compete in a world in which every company has to become a software company, explicitly or implicitly. It's worth a read.

[I]f you don’t have a skill, like baking, which the developer-centric economy can actually use, you are in deep trouble. One reason the Occupy Wall Street movement is not having the impact it seems like it should, based on sheer numbers of people involved, is that many participants simply have no cards left to play in this national game of economic poker.

Investing in good developers is such a good bet at the moment, that if you have money and you happen to find a talented developer who seems to like you and wants to work with you, you should give him/her your money to build something, anything, even if you have no really good product ideas (those are cheap; I’ll sell you a dozen for a dollar). If you don’t have money, you should offer up whatever other kind of surplus you do have. The NPV on a strong and positive relationship with a talented developer today is ridiculously high. If you gain the trust of a talented developer to the point that they are likely to drop any active gig in the future in favor of joining one of your projects, the value is through the roof. The world is your oyster, which you with your developer will open.

The idea of using Net Present Value (NPV) to compare the value of a software developer with alternative investments is one of those things that for me is initially surprising yet obvious in hindsight. In today's economy, where instead of owning a factory and a warehouse of parts all of a company's worth is in intellectual capital (which really means in the brains of its employees), it makes sense.
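
For the record, the arithmetic behind that comparison is the standard NPV formula: discount each expected future cash flow back to the present and sum them. Here's a minimal sketch with numbers I made up purely for illustration; nothing below comes from Rao's article.

#include <cmath>
#include <cstdio>

// NPV = sum over t of cashflow[t] / (1 + rate)^t
double npv(const double * cashflows, int periods, double rate) {
    double total = 0.0;
    for (int t = 0; t < periods; ++t) {
        total += cashflows[t] / std::pow(1.0 + rate, static_cast<double>(t));
    }
    return total;
}

int main(void) {
    // Year 0: up-front cost (say, a developer's loaded salary); years 1 through 4: expected returns.
    const double cashflows[] = { -100.0, 40.0, 45.0, 50.0, 55.0 };
    printf("NPV at a 10%% discount rate: %.2f\n", npv(cashflows, 5, 0.10));
    return 0;
}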

Rao writes the ribbonfarm.com blog where he ponders, among other things, organizational issues in high-technology firms, which is probably how he appeared on my radar screen. I don't always agree with him. Sometimes I'm not sure I'm smart enough to agree or disagree with him. But he's got some interesting ideas.

Tuesday, December 06, 2011

Eventual Consistency and Hayloft

Given my thirty-plus years of experience in distributed computing, it's no surprise that I've had to more or less continuously grapple with the Fallacies of Distributed Computing as laid out by L. Peter Deutsch, James Gosling, and others:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

The implications of these myths led Eric Brewer to make Brewer's Conjecture, which became the CAP Theorem once it was proven by Seth Gilbert and Nancy Lynch.

A distributed system can guarantee at most two of the following three qualities:

  1. all nodes can see the same data at the same time (consistency);
  2. every request receives a response as to whether it succeeded or failed (availability);
  3. the system continues to operate despite message loss (partition tolerance).

I've been thinking a lot about the implications of this for storage systems ever since attending the 27th IEEE Symposium on Massive Storage Systems and Technologies, a.k.a. MSST 2011. One of the big takeaways I had from that conference was that distributed storage systems cannot reliably support POSIX semantics, in particular the POSIX features of file locking and consistency in file modification. Thanks to the CAP Theorem, POSIX semantics don't scale.

(I find that all of the really interesting problems are scalability problems, regardless of the problem domain. If solving a problem at scale n seems easy, try applying the same solution at scale n x 1000. Not so much. That's why domains like cloud computing and exascale high performance systems have caught my interest. You can't just take current architectures and multiply by one thousand.)

This is one of the reasons that cloud storage has become popular: the various web-based mechanisms for data transfer used by the cloud, like the RESTful HTTP GET and PUT, and XML-based SOAP, have semantics that have proven to be highly scalable. Hence they map well to large distributed system architectures.

It was this very issue that led Amazon.com to choose availability and partition tolerance over consistency for its globally distributed Amazon Web Services (AWS), including S3, its non-POSIX-compliant Simple Storage Service, a strategy its CTO Werner Vogels refers to as eventual consistency:

Data inconsistency in large-scale reliable distributed systems has to be tolerated for two reasons: improving read and write performance under highly concurrent conditions; and handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running.

Indeed, the well publicized failures in AWS have been not in S3, but in its Elastic Block Store (EBS), which tries to support POSIX semantics on top of a globally distributed system.

I can see the Fallacies, and the eventual consistency architecture, at work when I play with Hayloft, my little C++ research project that uses S3. Any realistic application using S3 has to deal with network failures, where simply retrying the operation may be sufficient. But it also has to deal with the issues of consistency convergence: the fact that every S3 action may take place on a different AWS server that doesn't yet know what actions have taken place on other AWS servers, even though those actions may have been performed on the same storage objects.

Hayloft features both a synchronous and an asynchronous interface. The former makes it easy to play with. The latter is more likely to be what a production application would use. The asynchronous interface features a mechanism to block until an S3 action completes, or to execute multiple S3 actions incrementally and in parallel. Unlike in my prior article on Hayloft, I'll use the asynchronous interface here, but with the blocking completion mechanism because it's simpler. (As before, all of this code is taken from working unit tests in the Hayloft distribution, but edited for readability. Apologies for any typos.)

Here is a code snippet that creates a bucket to hold objects. It starts the action and blocks until it completes.

Multiplex multiplex;
BucketCreate bucket("Bucket", multiplex);
bucket.start();
multiplex.complete();

But that's not sufficient. Sometimes (rarely), the action will return a status indicating a temporary failure, like "connection failed", "name lookup error", or "request timed out". The network isn't reliable, and the failure is more likely to be on my end, or somewhere along the convoluted path between my system and S3, than in S3 itself. Simply retrying the action after a one-second delay is usually sufficient unless the outage is severe. (Note that "success" is not a retryable condition.)

Multiplex multiplex;
BucketCreate bucket("Bucket", multiplex);
for (int ii = 0; ii < 10; ++ii) {
    bucket.start();
    multiplex.complete();
    if (!bucket.isRetryable()) { break; }
    platform.yield(platform.frequency()); // Delay about one second before retrying.
}

Once the bucket is created, we can store an object in it. Here's a code snippet similar to the one above except it has to deal with re-initializing the input data source if we need to retry the action. You can see that this might get a smidge complicated depending on where your data is coming from.

PathInput * source = new PathInput("./file.txt");
Size bytes = size(*source);
ObjectPut put("InputFile.txt", bucket, multiplex, source, bytes);
for (int ii = 0; ii < 10; ++ii) {
    put.start();
    multiplex.complete();
    if (!put.isRetryable()) { break; }
    platform.yield(platform.frequency());
    // Re-create the input source so the retry starts from the beginning of the file.
    source = new PathInput("./file.txt");
    bytes = size(*source);
    put.reset(source, bytes);
}

But network outages aren't the only failures we have to worry about. The following code snippet retrieves the metadata for the object we just created.

ObjectHead head(put, multiplex);
for (int ii = 0; ii < 10; ++ii) {
    head.start();
    multiplex.complete();
    if (!head.isRetryable()) { break; }
    platform.yield(platform.frequency());
}

This usually works. Except when it doesn't. Sometimes (and much more frequently than I see retries due to temporary network failures) the action will return a non-retryable status indicating the object wasn't found. WTF?

The object head action was serviced by a different AWS server than that of the object put action, one that hadn't yet been notified of the object put. This is eventual consistency in action. The unit test tries a code snippet like the following, treating the non-existence of the object as a retryable error, since it knows darn well the object put was successful.

ObjectHead head(put, multiplex);
for (int ii = 0; ii < 10; ++ii) {
    head.start();
    multiplex.complete();
    if (head.isRetryable()) {
        // Keep trying.
    } else if (head.isNonexistent()) {
        // Keep trying.
    } else {
        break;
    }
    platform.yield(platform.frequency());
}

Eventually this works. But it's even more complicated than that. If you ran this exact same code snippet again, it might still fail at first, because it was run on yet another server which had not yet been updated to be consistent with the others. In fact, its success on a subsequent attempt may be because it just happened to be executed on the original server, not because all the servers had been updated. Applications have to be designed around this. Amazon's S3 best practices document recommends not using code snippets like this to verify the successful put of an object.

The unit tests log a message every time they retry. I can actually watch how long it takes for the AWS servers to become consistent, and how this consistency convergence changes with the load on the system. Typically it converges so quickly that the unit test doesn't have to retry. Rarely, it takes a second or two. Or sometimes, not so rarely. I was testing a lot of this code on the day after the Thanksgiving holiday in the United States, also known as Black Friday, typically the busiest shopping day of the Christmas season. It frequently took two or three seconds for the system to converge to a consistent state. I haven't seen this kind of latency before or since. But, to its credit, S3 always converged and the unit tests all eventually ran successfully to completion.

Whether or not this is an issue for you depends on your S3 application. For a lot of applications, it won't matter. For applications where read-after-write consistency is important, AWS offers a higher (more expensive) tier of service where objects are co-located in one of their big data centers instead of possibly being spread out among multiple centers; this tier offers read-after-write consistency. If immediate consistency is an issue for you, S3 may not be the right platform for your application. Then again, I would argue that in that case distributed computing may not be the right paradigm for you either.

The cloud is not the place for everyone.

Thursday, December 01, 2011

Amazon.com's Simple Storage Service and Hayloft

Information technology is like fashion: you live long enough, you see it change many times. And sometimes the new stuff looks kinda like the old stuff. But that doesn't mean you don't get excited about it. Change is inevitable; you might as well learn to enjoy it.

I'm watching the IT world evolve to a model of handheld, battery-powered consumer devices, which these days we're calling smartphones and tablets, communicating wirelessly with enormous distributed computing systems, which we now call cloud computing. I'm all excited. For lots of reasons. Not the least of which is the fact that my professional career over the past thirty-five years has wobbled back and forth between developing software for small embedded systems and for high performance computers, gigantic distributed systems, and server farms. Turns out, the skill sets are the same. I am not the first person to have noticed this.

My most recent project is to develop Hayloft, a C++ object oriented interface to Amazon Web Services (AWS) Simple Storage Service (S3). My goal is for Hayloft to make it easy to use S3 from embedded devices. S3 is Amazon.com's web-based storage service that can (if you do it right) reliably and inexpensively store blobs of arbitrary data. How much data? A single blob, or object, can be up to five terabytes in size. And you can store a lot of objects. Objects are organized into buckets. Each bucket can be thought of as a web site with its own domain name, which means bucket names have to conform to the internet's Domain Name System (DNS) syntax. Each object is independently addressable via its own URL. To S3, these objects are simply opaque data files in the classic mass storage system reference model sense: they can range from HTML pages to scientific databases to virtual machine images to what have you.

(Indeed, I strongly suspect you could build a distributed mass storage system from S3 just as Cycle Computing built a 10,000 core distributed supercomputer from Amazon.com's Elastic Compute Cloud (EC2) for one of their customers. I'm pretty excited about that, too.)

Why would you want to access S3 from an embedded device? Can you think of any applications for an infinitely large, internet-accessible, web-addressable, reliable, secure, storage system? Whether you are building clever consumer applications for the commercial space, developing sensor platforms for the defense and intelligence domain, or creating information processing applications for the industrial, financial, or medical markets, if the thought of this doesn't make your hands shake, you aren't thinking clearly. Seriously. Terabytes of storage are just a wireless connection away from your handheld and embedded applications.

Here's a little taste of what Hayloft is like. These examples are taken directly from the unit test suite included with Hayloft with all the EXPECT and ASSERT statements removed. I've also removed all of the error recovery and consistency convergence code, which is very necessary in any practical circumstance; I'll talk more about that in later articles. Hayloft presents both a synchronous and an asynchronous interface. These examples use the simpler synchronous interface: when the C++ constructor completes, all the S3 work is done.

Here's a code snippet that creates a new bucket, and writes a local file into an object in it. The user key id and secret access key, which are sort of like your login and password for AWS, and which are provided by Amazon.com, are in environment variables. Also in an environment variable is a bucket suffix appended to all bucket names to make them globally unique; I use my internet domain name. All of the other S3 parameters, like location constraint, endpoint, access control list, and so forth, can be omitted, because for experimenting, the defaults are reasonable. Input and output are handled using Desperado I/O functors.

BucketCreate create("bucket");
PathInput input("./oldfile.txt");
Size bytes = size(input);
ObjectPut put("Object.txt", create, input, bytes);

By default, buckets and objects are created by Hayloft with private access. With just a few more lines you can specify public read access so that you can (and I have) retrieve this object using your web browser.

AccessPublicRead access;
Context context;
context.setAccess(access);
BucketCreate create("bucket", context);
Properties properties;
properties.setAccess(access);
PathInput input("./oldfile.txt");
Size bytes = size(input);
ObjectPut put("Object.txt", create, input, bytes, properties);

This object can now be accessed using the following URL.

http://bucket.hayloft.diag.com.s3.amazonaws.com/Object.txt

This is a code snippet that reads the object into a local file, then deletes the object.

PathOutput output("./newfile.txt");
ObjectGet get("Object.txt", create, output);
ObjectDelete objectdelete("Object.txt", create); // "delete" is a reserved word in C++, so the delete action gets a different variable name.

Other than some #include statements for header files, and some curly brackets and what not, that's about all there is to it, if you ignore (at your peril) error recovery. Here's a snippet that gets a table of contents from an existing bucket, then uses the C++ Standard Template Library (STL) to iterate through it and print the object names.

BucketManifest manifest("bucket");
BucketManifest::Manifest::const_iterator here = manifest.getManifest().begin();
BucketManifest::Manifest::const_iterator there = manifest.getManifest().end();
while (here != there) {
    const char * key = here->first.c_str();
    printf("%s\n", key);
    ++here;
}

While using Hayloft is easy, installing it may take a few minutes. Hayloft is built for GNU and Linux on top of Desperadito, my C++ systems programming library (a subset of the much larger Desperado), and libs3, Bryan Ischo's excellent C-based S3 library. libs3 is built on top of cURL, OpenSSL, and libxml2, all of which can probably be acquired through the standard package manager on your Linux system. The unit tests for Hayloft are built with Google Test (or Google Mock, which includes Google Test), Google's outstanding C++ unit testing framework, and Lariat, my thin layer over Google Test.

But once you get past the installation, starting to play with S3 in C++ applications is as simple as pulling out your credit card, signing up for an AWS account, and writing a few lines of code.