Just listening to Alyssa Henry’s keynote talk at FAST ’09. She is the General Manager of S3 at Amazon. She used a great analogy to explain the difficult choice of which thing to spend resources on to protect against failures in a highly distributed system. For some things we choose to have expensive redundancy, e.g. we use both seat belts as well as air bags. Protecting one’s life in a catastrophic situation is important enough to warrant the extra expense. But we tend not to use both waist belts as well as suspenders 🙂
Alyssa also talked about “retry” as an important part of building resilient systems. To handle failures in distributed systems where messages may be lost or nodes may go down, just retry. But what about a message to charge a customer some amount of money? Do you really want to resend that request? The point was that they needed to think about making some operations idempotent by design.
According to Alyssa, the next failure after retry was solved, was surge/overload. Retries can be overwhelming to a system recovering from failure. So rate limiting might be used e.g. exponential backoff. Related are cache time-to-live (TTL) leases expiring but the underlying system which is the source of the data is down. As that system is comming back up, it would get overwhelmed. Alyssa suggested to try extending the TTL to keep the underlying system from breaking down when it comes back up. For example, there is a service at Amazon that checks if a customer’s Account is live. In case that service is down, it’s client systems just continue to assume that the customer is still in good standing.
She also talked about trading consistency with availability. When you write to S3, they will send data to multiple data centers. They write pointers to more datacenters than the data itself. If the pointer writes don’t come back, they just keep going which implies preferring availability against consistency. Then use eventual consistency and come back later and propogate the pointer changes everywhere. She called that offline process “anti-entropy”.
Another interesting thing they do is force exercising of their failure handling code. For example, if they have to do a maintenance on a machine, they just unplug it’s power. Similarly, they just unplug entire datacenters for maintenance to force the failure path to be exercised.
Diversity is important. Amazon’s hardware is very diverse. As an anecdote, Alyssa described how in 2006 they receive an entire batch of bad drives. But their drive diversity saved them! Correlated failures do happen especially in mono-cultures. Even diversity of workloads was found to be of benefit because different types of customer loads have different resource needs. So multiplexing them is beneficial.
Alyssa used examples to suggest that application level checksuming is critical: don’t rely on just tcp checksums, e.g. Also Amazon also does integrity checking for data at rest. They go through every single object in the system in an effort to catch errors quickly. It wasn’t clear how often they do this.
Rosenthall of Stanford Library asked about reliability of the data stored in S3. e.g. There are financial penalties against Amazon for lack of service availability but not for losing your data. This was a very interesting observation. Alyssa suggested that if their customers wanted it they would consider financial penalties for data loss as well. She refused to provide actual data on how much data loss they have had.
A Cal Poly student asked re: telemetry. It was interesting that Amazon chose to focus their telemetry first and most on noticing problems before their customers do.
Overall, a reasonably interesting talk. But sadly lacking in interesting observations on where future research should be focused or a list of open problems. I found it hard to believe that they have met all their design goals!