
Tuesday, July 3, 2012

Yet another post on High Availability

It looks like High Availability and Scalability are becoming the most abused pieces of jargon these days, and Cloud Computing has helped these words reach the common man. I was really annoyed by an article on High Availability on CloudAve. The article was posted immediately after the latest AWS outage and talks about how companies like Netflix and Pinterest have all got it wrong and how to "program" for failover. Or is that failure? Here are the main points the article makes:

  1. Programming for failover / failure. Anyone who doesn't is a fool
  2. Doing an S3 sync can help you avoid outages
  3. Elastic Load Balancing can ensure regions are healthy


1. Agreed. Any large-scale system should definitely be designed for failure, and I am sure the folks at Netflix have not ignored this fact. Netflix is probably one of the largest customers of AWS, and they work very closely with AWS. To get a sense of their scale: they were not happy with the feature set of the AWS Management Console, so they built their own custom AWS management solution called Asgard and open sourced it. The news even appeared on the AWS blog. I would be at a loss for words if Netflix hadn't thought about using multiple AZs, load balancers, etc. That is the bare minimum at their scale.

2. Syncing between S3 buckets. Here we are talking about syncing a large number of objects between S3 buckets located on different continents, over the Internet. This looks like a simple task on paper. Anyone who has thought about the actual implementation will appreciate the complexity and cost involved, and of course that cost may outweigh the benefit of staying available during an outage. But before we even get to that, there are many implementation bottlenecks to address. You cannot just put up a single EC2 instance (oh yeah, we are talking about HA, so let's make it one instance in each AZ) and write a custom script to perform the sync. The bandwidth available from a single EC2 instance, the availability of that setup during an outage, and inter-region bandwidth charges all have to be considered before going for such a design. Also, in large-scale architectures, companies rely heavily on a Content Distribution Network (CDN) such as Akamai or AWS CloudFront to deliver media assets. If the content is stored in S3, it is most likely served through a CDN. In that case, even if the origin is not available (here, because of an outage at AWS), content that is already cached can still be served from the edge locations. Apart from media assets, companies at the scale of Netflix heavily use other technologies such as NoSQL stores (MongoDB, Cassandra), relational databases (RDS) and cache clusters, and they are probably running hundreds of instances in each. Replicating them and building failover for them is a slow and complex process.
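To make the bandwidth and cost point a bit more concrete, here is a minimal back-of-the-envelope sketch. The function name `estimate_sync` and all the numbers passed to it are my own assumptions for illustration; the per-GB price in particular is a placeholder, not an actual AWS rate, and the calculation ignores request overhead, retries and throttling.

```python
def estimate_sync(data_tb, bandwidth_mbps, price_per_gb):
    """Rough estimate of a one-off cross-region copy.

    data_tb        -- amount of data to copy, in terabytes (decimal units)
    bandwidth_mbps -- sustained throughput actually achieved, in megabits/s
    price_per_gb   -- assumed inter-region transfer price (placeholder value)

    Returns (hours, cost) for a single full pass over the data.
    """
    data_gb = data_tb * 1000
    data_megabits = data_gb * 1000 * 8      # GB -> MB -> megabits
    seconds = data_megabits / bandwidth_mbps
    hours = seconds / 3600
    cost = data_gb * price_per_gb
    return hours, cost


if __name__ == "__main__":
    # Hypothetical inputs: 500 TB of media, 1 Gbps sustained, $0.02 per GB.
    hours, cost = estimate_sync(500, 1000, 0.02)
    print(f"about {hours / 24:.1f} days and ${cost:,.0f} for one full pass")
```

Even with generous assumptions, a single instance pushing data over one pipe ends up measured in days, which is why a one-box sync script does not really count as a design.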

3. Elastic Load Balancing is a load balancing service from AWS. It can be used to bring HA into your architecture, but only within a region. You cannot use ELB to ensure regions are healthy: ELB is restricted to a specific region and operates within it. To detect health across multiple regions, one needs a solution at the DNS level, such as a managed DNS service that can perform health checks against multiple geo-distributed data centers and fail over between them. AWS Route 53 can do that. Route 53 Latency Based Routing (LBR) can detect latency across regions and route traffic accordingly.
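The decision a DNS-level solution has to make can be sketched roughly as below. This is not Route 53's API or its actual algorithm, just a toy illustration of "skip unhealthy regions, then prefer the lowest latency". The function name `pick_region`, the region names and the latency figures are all assumptions.

```python
from typing import Dict, Optional


def pick_region(health: Dict[str, bool],
                latency_ms: Dict[str, float]) -> Optional[str]:
    """Return the healthy region with the lowest measured latency.

    health     -- result of the latest health check per region
    latency_ms -- measured latency from the client's network to each region
    Returns None when no region is healthy, so the caller can fall back
    to a static "we are down" page or similar.
    """
    healthy = [r for r, ok in health.items() if ok and r in latency_ms]
    if not healthy:
        return None
    return min(healthy, key=lambda r: latency_ms[r])


if __name__ == "__main__":
    # Hypothetical snapshot: the closest region is failing its health check.
    health = {"us-east-1": False, "us-west-2": True, "eu-west-1": True}
    latency = {"us-east-1": 20.0, "us-west-2": 60.0, "eu-west-1": 90.0}
    print(pick_region(health, latency))
```

Note that this logic has to live outside any single region, which is exactly why a region-scoped service like ELB cannot do the job.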

Building a Highly Available solution is not just about utilizing multiple data centers or syncing data. It requires a very deep understanding of the application and the infrastructure, and of how they work together. Availability is what the user perceives, so it varies from application to application. For example, a photo sharing website might have three major tiers: photo viewer, photo upload and commenting. If the website is completely hosted on AWS, then during an outage (depending on which AWS services are affected) some of the tiers can still be perceived to be working. The photo viewer might continue to serve pictures from a CDN while users are unable to add new comments. Or new comments might work for the users who are posting them but show up with a delay for the ones who are viewing (achieved through queues and cache clusters). Architects can think of various ways to build such systems, and that is how one increases the overall availability of the system. It is not a one-time activity but a continuous process requiring patience, careful examination and planning. I guess such architectures will quietly work behind the scenes and let others abuse words like High Availability and Scalability for eternity.
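For what it is worth, the photo sharing example can be written down as a tiny dependency map. Everything here is hypothetical: the tier names follow the example above, but the service names and which tier depends on which service are assumptions made purely for illustration.

```python
# Hypothetical mapping of user-facing tiers to the backend services they need.
TIER_DEPENDENCIES = {
    "photo_viewer": {"cdn"},
    "photo_upload": {"app_servers", "object_storage"},
    "commenting": {"app_servers", "database"},
}


def tier_status(failed_services):
    """Report which tiers a user would still perceive as working.

    A tier is "up" only if none of the services it depends on has failed.
    """
    failed = set(failed_services)
    return {
        tier: "down" if deps & failed else "up"
        for tier, deps in TIER_DEPENDENCIES.items()
    }


if __name__ == "__main__":
    # Assumed outage: only the database layer is affected.
    for tier, status in tier_status({"database"}).items():
        print(f"{tier}: {status}")
```

With only the database down, viewing and uploading still work in this toy model, and that partial picture is what the user experiences as "availability".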