Wednesday, March 27, 2013

AWS VPC NAT Instance Failover and High Availability

Amazon Virtual Private Cloud (VPC) is a great way to set up an isolated portion of AWS and control the network topology. It is also a good way to extend your data center and use AWS for burst requirements. With the latest "VPC for Everyone" announcement, what was earlier split into "Classic" and "VPC" in AWS will soon be only VPC. That is, every deployment in AWS will be on a VPC, even if it does not need all the additional features that VPC provides. Over time, one might start using VPC features such as multiple Subnets, network isolation, Network ACLs, etc. Those who have already worked with VPCs understand the role of the NAT Instance in a VPC.

When you create a VPC, you typically create it with multiple Subnets (Public and Private). Instances launched in the Public Subnet have direct internet connectivity and send and receive internet traffic through the Internet Gateway of the VPC. Typically, internet-facing servers such as web servers are kept in the Public Subnet. A Private Subnet can be used to launch Instances that do not require direct access from the internet. Instances in a Private Subnet can access the internet without exposing their private IP addresses by routing their traffic through a Network Address Translation (NAT) Instance in the Public Subnet. AWS provides an AMI that can be launched as a NAT Instance. The following diagram represents a standard VPC as provisioned through the AWS Management Console wizard.
Standard Private and Public Subnets in a VPC
The above architecture has
  • A Public Subnet that has direct internet connectivity through the Internet Gateway. Web Instances can be placed within the Public Subnet
  • The custom Route Table associated with Public Subnet will have the necessary routing information to route traffic to the Internet Gateway
  • A NAT Instance is also provisioned in the Public Subnet
  • A Private Subnet that has outbound internet connectivity through the NAT Instance in the Public Subnet
  • The Main Route Table is by default associated with the Private Subnet. This will have necessary routing information to route internet traffic to the NAT Instance
  • Instances in the Private Subnet will use the NAT Instance for outbound internet connectivity. Examples include DB backups from a standby that need to be stored in S3, and background programs that make external web service calls
Of course, the above architecture has limited High Availability since all the Subnets are created within the same Availability Zone. We can avoid this by creating multiple Subnets in multiple Availability Zones.
Public and Private Subnets with multiple Availability Zones

  • Additional Subnets (Public and Private) are created in a second Availability Zone
  • Both Private Subnets are associated with the Main Route Table
  • Both Public Subnets are associated with the same custom Route Table
  • Instances in both Private Subnets continue to use the single NAT Instance for outbound internet connectivity
Though we increased availability by using multiple Availability Zones, the NAT Instance is still a Single Point of Failure. A NAT Instance is just another EC2 Instance that can become unavailable at any time. The updated architecture below uses two NAT Instances to provide failover and High Availability for the NAT Instances.
NAT Instance High Availability
  • Each Subnet is associated with its own Route Table
  • NAT1 is provisioned in Public Subnet 1
  • NAT2 is provisioned in Public Subnet 2
  • Private Subnet 1's Route Table (RT) has a routing entry that sends internet traffic to NAT1
  • Private Subnet 2's Route Table (RT) has a routing entry that sends internet traffic to NAT2
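The routing entries in this layout can be sketched with the current AWS CLI (an assumption; it postdates the tools available when this post was written). The route table and instance IDs below are hypothetical placeholders, and the function only prints the commands rather than running them.

```shell
#!/bin/sh
# Print (not run) the AWS CLI command that points a route table's default
# route (0.0.0.0/0) at a target. kind is "nat" for a NAT Instance or
# "igw" for an Internet Gateway. All IDs used here are hypothetical.
default_route_cmd() {
    rt_id="$1"
    kind="$2"
    target_id="$3"
    case "$kind" in
        nat) flag="--instance-id" ;;
        igw) flag="--gateway-id" ;;
        *)   echo "unknown target kind: $kind" >&2; return 1 ;;
    esac
    echo "aws ec2 create-route --route-table-id $rt_id --destination-cidr-block 0.0.0.0/0 $flag $target_id"
}

# Private Subnet 1 sends internet traffic to NAT1, Private Subnet 2 to NAT2.
default_route_cmd rtb-private1 nat i-nat1
default_route_cmd rtb-private2 nat i-nat2
```

Giving each Private Subnet its own Route Table is what makes it possible to repoint one subnet's internet route without touching the other.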
NAT Instance HA Illustration

A script can be installed on both NAT Instances so that they monitor each other and take over the other's route if one of them fails. For example, if NAT1 detects that NAT2 is not responding to its ping requests, it can change the internet route in Private Subnet 2's Route Table to point to NAT1. Once NAT2 becomes operational again, the route can be swapped back. AWS has good documentation on this, along with a sample script for the swap.
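As a rough illustration of the idea (this is not the AWS sample script), here is a minimal sketch of the check NAT1 might run against NAT2. It assumes the current AWS CLI and Linux ping; every ID and address is a hypothetical placeholder, and with DRY_RUN=1 it only prints the command it would run.

```shell
#!/bin/sh
# Sketch of one monitoring pass on NAT1. All values are hypothetical.
PEER_NAT_IP="10.0.1.10"      # private IP of the other NAT Instance (NAT2)
PEER_RT_ID="rtb-private2"    # Route Table of the peer's Private Subnet
MY_INSTANCE_ID="i-nat1"      # this NAT Instance
DRY_RUN=1                    # 1 = only print the command, do not call AWS

# Exit status 0 if the peer answers at least one of three pings (Linux ping).
peer_is_alive() {
    ping -c 3 "$1" > /dev/null 2>&1
}

# Print the AWS CLI command that repoints a default route at an instance.
replace_route_cmd() {
    echo "aws ec2 replace-route --route-table-id $1 --destination-cidr-block 0.0.0.0/0 --instance-id $2"
}

# One pass: take over the peer's default route if the peer is unreachable.
check_once() {
    if peer_is_alive "$PEER_NAT_IP"; then
        echo "peer healthy"
        return 0
    fi
    cmd=$(replace_route_cmd "$PEER_RT_ID" "$MY_INSTANCE_ID")
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $cmd"
    else
        $cmd
    fi
}

# A real monitor would loop, for example:
# while true; do check_once; sleep 10; done
```

The sketch leaves out the reverse swap and any attempt to restart the failed peer. A production script would also need IAM permissions to modify routes and some protection against both instances taking over at once.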

Apart from HA, the above architecture also provides better overall throughput, since under normal conditions both NAT Instances serve the outbound internet traffic of the VPC. If there are workloads that require a lot of outbound internet connectivity, having more than one NAT Instance makes sense. Of course, since a Route Table holds only one default route, you are still limited to one NAT Instance per Subnet at any given time.


BeardedLady said...

How did you create multiple public subnets in one VPC? I thought that VPC was limited to a single public subnet.

Raghu said...

You can always create multiple public and private subnets in a VPC. In fact, to use multiple AZs within a VPC, you need multiple public subnets, because each subnet is tied to a single AZ.

electricWombat said...

In case this saves people some time: I don't know if the response format has changed for some of the API calls, but in order to get the script to work properly, I had to change this line:

NAT_STATE=`/opt/aws/bin/ec2-describe-instances $NAT_ID -U $EC2_URL | grep INSTANCE | awk '{print $4;}'`

to:

NAT_STATE=`/opt/aws/bin/ec2-describe-instances $NAT_ID -U $EC2_URL | grep INSTANCE | awk '{print $5;}'`

i.e., the instance state is the 5th field, not the 4th. Otherwise the condition can never be true, so the process sleeps forever and never reboots the stopped NAT instance.
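Since the column position seems to differ between setups, one way to avoid hard-coding it is to look for a known EC2 state name anywhere on the INSTANCE line. The sketch below is only an illustration, not part of the AWS script; extract_state is a hypothetical helper name.

```shell
#!/bin/sh
# Print the first field on an INSTANCE line that is a valid EC2 instance
# state, regardless of which column it appears in.
extract_state() {
    awk '/^INSTANCE/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^(pending|running|shutting-down|terminated|stopping|stopped)$/) {
                print $i
                exit
            }
        }
    }'
}

# Possible use with the command from the comment above:
# NAT_STATE=$(/opt/aws/bin/ec2-describe-instances "$NAT_ID" -U "$EC2_URL" | extract_state)
```

One caveat: if some other field (a key name, say) happens to be exactly one of these state words and comes earlier on the line, it would be picked up instead.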

Justin Holzer said...

@Chris Jomaron
I just had to make the same change to the script. I've notified the Solutions Architect we've been working with at AWS about the error in the script they're referencing, and hopefully he will be able to get in touch with someone who can actually update it.