Just listening to Alyssa Henry’s keynote talk at FAST ’09. She is the General Manager of S3 at Amazon. She used a great analogy to explain the difficult choice of where to spend resources to protect against failures in a highly distributed system. For some things we choose to have expensive redundancy, e.g. we use both seat belts and air bags. Protecting one’s life in a catastrophic situation is important enough to warrant the extra expense. But we tend not to wear both a waist belt and suspenders 🙂
Alyssa also talked about “retry” as an important part of building resilient systems. To handle failures in distributed systems where messages may be lost or nodes may go down, just retry. But what about a message to charge a customer some amount of money? Do you really want to resend that request? The point was that they needed to think about making some operations idempotent by design.
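To make the idempotency point concrete, here is a minimal sketch of one common way to do it: attach a unique key to the logical operation and reuse that key on every retry, so the receiver can recognize a duplicate. Everything here (PaymentService, LossyLink, charge_with_retry) is a hypothetical in-memory toy, not anything Alyssa showed or an Amazon API.

```python
import uuid


class PaymentService:
    """Hypothetical in-memory stand-in for a remote payment endpoint."""

    def __init__(self):
        self.charged = {}   # customer -> total amount charged
        self._seen = {}     # idempotency key -> result already produced

    def charge(self, key, customer, amount):
        # A repeated key means a retry of a request we already applied:
        # return the stored result instead of charging again.
        if key in self._seen:
            return self._seen[key]
        self.charged[customer] = self.charged.get(customer, 0) + amount
        result = {"customer": customer, "amount": amount, "status": "charged"}
        self._seen[key] = result
        return result


class LossyLink:
    """Delivers every request but drops the first `drops` responses."""

    def __init__(self, service, drops=1):
        self.service = service
        self.drops = drops

    def charge(self, key, customer, amount):
        result = self.service.charge(key, customer, amount)
        if self.drops > 0:
            self.drops -= 1
            raise ConnectionError("response lost")
        return result


def charge_with_retry(service, customer, amount, attempts=3):
    # One key per logical charge, generated once and reused on every retry.
    key = str(uuid.uuid4())
    last_error = None
    for _ in range(attempts):
        try:
            return service.charge(key, customer, amount)
        except ConnectionError as exc:
            last_error = exc
    raise last_error or ConnectionError("no attempts made")
```

With a link that loses the first two responses, charge_with_retry(LossyLink(PaymentService(), drops=2), "alice", 25) sends the request three times, yet the customer is charged only once.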
According to Alyssa, the next failure mode, once retry was solved, was surge/overload. Retries can overwhelm a system that is recovering from failure, so rate limiting might be used, e.g. exponential backoff. A related problem is cache time-to-live (TTL) leases expiring while the underlying system that is the source of the data is down. As that system is coming back up, it would get overwhelmed. Alyssa suggested extending the TTL to keep the underlying system from breaking down when it comes back up. For example, there is a service at Amazon that checks if a customer’s account is live. If that service is down, its client systems just continue to assume that the customer is still in good standing.
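Both ideas in the paragraph above can be sketched in a few lines. The first function produces capped exponential backoff delays with random jitter (the jitter is my addition, a common refinement so that clients do not retry in lockstep). The second is a cache that, when a lease expires and the source is unreachable, extends the lease on the old value instead of failing. The names backoff_delays and StaleTolerantCache are made up for illustration, and this is only one possible shape for such code, not how Amazon implements it.

```python
import random
import time


def backoff_delays(base=0.1, cap=10.0, attempts=5, rng=random.random):
    """Yield one delay per retry: a random fraction of a doubling, capped ceiling."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


class StaleTolerantCache:
    """Cache that keeps serving an expired value while the source is down."""

    def __init__(self, fetch, ttl, clock=time.monotonic):
        self._fetch = fetch      # callable(key) -> value, may raise ConnectionError
        self._ttl = ttl          # lease length in seconds
        self._clock = clock
        self._entries = {}       # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]      # lease still valid
        try:
            value = self._fetch(key)
        except ConnectionError:
            if entry is None:
                raise            # nothing cached, nothing to fall back on
            # Source is down: extend the lease on the old value so we do
            # not hammer the source on every request while it recovers.
            self._entries[key] = (entry[0], now + self._ttl)
            return entry[0]
        self._entries[key] = (value, now + self._ttl)
        return value
```

In the account-status example, fetch would be the call to the account service, and a failure there simply leaves the last known "good standing" answer in place for another TTL.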
She also talked about trading consistency for availability. When you write to S3, they send the data to multiple data centers. They write pointers to more data centers than the data itself.
The USENIX Conference on File and Storage Technologies (FAST) is the premier place to send papers on all things storage. The program committee is usually the who’s who of the field. For the last few years, VMware has been holding a birds of a feather (BoF) session on the intersection of virtualization and storage/filesystem technologies. The BoF chair this year is a good friend of mine, Ajay Gulati.
Ajay has set up a really cool program that I think will attract a large crowd. Take a look at the following and be sure to drop by if you are lucky enough to be attending the conference (or even if you are not, but find yourself in the area, you are welcome to drop by our meeting room). I’m particularly excited about the demos!
Storage Technologies and Challenges in Virtualized Environments
VMware Vendor BoF
Thursday, February 26, 7:30 p.m.–8:30 p.m., San Francisco C
Do you wonder what VMware has to do with storage? Are you interested in learning about VMware technologies beyond core server virtualization? Do you want to get a glimpse of some of the future products and what storage applications they can enable?
Join engineers from VMware in a discussion about a number of novel storage-related technologies that VMware has been working on. We will also discuss some of the currently open problems and challenges related to better storage performance and management.
We will give two live demos:
1) Online storage migration (Storage VMotion)
2) Transparent and efficient workload characterization of VM workloads inside ESX Server
In addition, there will be a number of manned stations with posters and demos of technologies such as Distributed Storage IO Resource Management, VMware’s Cluster File System (VMFS), ESX’s Pluggable Storage Stack, VM aware storage (VMAS) and our dynamic Virtual Machine instrumentation tool called VProbes.
As part of our PARDA research, we examined how IO latency varies with increases in overall load (queue length) at the array, using one to five hosts accessing the same storage array. The attached image (Figure 6 from the paper) shows the aggregate throughput and average latency observed in the system with increasing contention at the array. The generated workload consists of uniform 16 KB IOs, 67% reads and 70% random, with 32 IOs outstanding from each host. It can be clearly seen that, for this experiment, throughput peaked at three hosts, while overall latency continued to increase with load. In fact, in some cases, beyond a certain level of workload parallelism, throughput can even drop.
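One way to see why latency keeps climbing after throughput flattens is Little's law: with a fixed number of IOs outstanding, average latency equals outstanding IOs divided by throughput. The small sketch below applies it to the 32-IOs-per-host setup. The IOPS figures are invented purely for illustration and are not the measurements from the figure.

```python
def average_latency_ms(hosts, outstanding_per_host, iops):
    """Little's law: latency = IOs in flight / throughput (converted to ms)."""
    return 1000.0 * hosts * outstanding_per_host / iops


# Hypothetical throughput per host count: rises, then saturates at three hosts.
hypothetical_iops = {1: 3000, 2: 5000, 3: 6000, 4: 6000, 5: 6000}

for hosts, iops in hypothetical_iops.items():
    latency = average_latency_ms(hosts, 32, iops)
    print(f"{hosts} host(s): {iops} IOPS, {latency:.1f} ms average latency")
```

Once the array is saturated, each extra host adds 32 more IOs in flight without adding throughput, so latency grows roughly linearly with the number of hosts.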
An important question to consider for application performance is whether bandwidth or latency is more important. If the former, then pushing the outstanding IOs higher might make sense up to a point. However, for latency-sensitive workloads, it is better to set a target latency and to stop increasing the load (outstanding IOs) on the array beyond that point. The latter is the key observation that PARDA is built around. We use a control equation that takes as input a target latency beyond which the array is considered overloaded. Using our equation, we adjust the outstanding IO count across VMware ESX hosts in a distributed fashion to stay close to the target IO latency. In the paper, we also detail how our equation incorporates proportional sharing and fairness. Our experimental results show the technique to be effective.
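For readers who want a feel for what a latency-driven control loop looks like, here is a rough sketch. The exact equation is in the paper; the form below is an assumed, simplified version written for this post, and the function name, parameter names and default values are all illustrative. Each host shrinks its window of outstanding IOs when observed latency is above the target and grows it when latency is below, with a smoothing factor gamma and a per-host term beta standing in for that host's share.

```python
def next_window(window, observed_latency_ms, target_latency_ms,
                gamma=0.2, beta=1.0, w_min=1.0, w_max=64.0):
    """One step of an assumed latency-based window update for a single host.

    window              current limit on outstanding IOs for this host
    observed_latency_ms average IO latency seen over the last interval
    target_latency_ms   latency beyond which the array counts as overloaded
    gamma               smoothing factor in (0, 1]; smaller reacts more slowly
    beta                per-host additive term (a larger beta gets a larger share)
    """
    # Scale the window by target/observed latency, then add the host's share.
    proposed = (target_latency_ms / observed_latency_ms) * window + beta
    # Blend the old window with the proposal to avoid oscillation.
    smoothed = (1.0 - gamma) * window + gamma * proposed
    # Keep the window within sane bounds.
    return max(w_min, min(w_max, smoothed))


# Usage: latency at twice the target pulls the window down over a few steps.
w = 32.0
for _ in range(3):
    w = next_window(w, observed_latency_ms=50.0, target_latency_ms=25.0)
```

Because every host runs the same update using only the latency it observes, no central coordinator is needed, which matches the distributed flavor described above.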