Friday, December 28, 2012

Log archive & analysis with Amazon S3 and Glacier - Part III

A quick recap from the previous posts: we set the overall context and log storage structure, configured CloudFront access logging, and set up local log storage on the web tier. In this post, we will see how to push logs from the local storage to the central log store - Amazon S3 - and what the considerations are.

Amazon S3 folder structure
The logs are going to be generated throughout the day, so we need a proper folder structure to store them in S3. Logs are particularly useful for analysis - debugging a production issue, or finding usage patterns such as feature adoption by users per month. Hence it makes sense to store them in a year/month/day/hour_of_the_day structure.

Multiple Instances
The web tier is automatically scaled, and we will always have multiple Instances running for High Availability and Scalability. So even if we store logs on an hourly basis, we will have log files with similar names coming from multiple Instances. Hence the folder structure needs to factor in the Instances as well. The resultant folder structure in S3 will look something like this:

Amazon S3 Log Folder Structure
Note in the above picture (as encircled) that we are storing "Instance" wise logs for every hour.
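The hour-wise, Instance-wise prefix described above can be sketched as a tiny shell helper. The bucket name and Instance Id below are illustrative placeholders, and the exact layout can of course vary:

```shell
#!/bin/sh
# Build the year/month/day/hour/instance S3 prefix described above.
# BUCKET and EC2_INSTANCE_ID are placeholders for illustration.
BUCKET="my-global-logs"
EC2_INSTANCE_ID="i-0123abcd"

s3_log_prefix() {
    echo "s3://$BUCKET/web-logs/$(date +%Y)/$(date +%m)/$(date +%d)/$(date +%H)/$EC2_INSTANCE_ID"
}

s3_log_prefix
```

Every uploader then only needs to prepend this prefix to its file name, and log files from different Instances never collide.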

Log Rotation
Every logging framework has an option to rotate log files by size, date, etc. Since we will be periodically pushing the log files to Amazon S3, it might make sense to, say, rotate the log file every hour and push it to S3. But the downside is that we cannot anticipate the traffic to the web tier - that is exactly why the web tier scales automatically on demand. A sudden surge in traffic can generate large log files that start filling up the file system, eventually making the Instance unavailable. Hence it is better to rotate the log files by size.

You can use the default logrotate available on Linux systems to rotate the log files by size. Logrotate can be configured to call a post-rotate script, which we will use to push the newly rotated files to S3. A sample logrotate configuration will look like this:

Note: If you are using logrotate, make sure your logging framework isn't configured to rotate the files as well.

/var/log/applogs/httpd/web {
        rotate 52
        size 50M
        create 644 root root
        sharedscripts
        postrotate
                /usr/local/admintools/ web &> /var/log/custom/web_logrotate.log
        endscript
}

The above configuration rotates the "httpd" log files whenever the size reaches 50M. It also calls a "postrotate" script to compress the rotated file and upload it to S3.

Upload to S3
The next step is to upload the rotated log file to S3.
  • We need a mechanism to access the S3 API from the shell to upload the files. s3cmd is a widely used command line tool for accessing the S3 APIs. We need to set up the tool on the Instance
  • We are rotating by size but we will be uploading to a folder structure that maintains log files by the hour
  • We will also be uploading from multiple Instances, so we need to fetch the Instance Id to store the files in the corresponding folder. Within EC2, there is an easy way to get Instance metadata: fetching "http://169.254.169.254/latest/meta-data/" lists metadata such as the instance-id, public DNS, etc. For example, "wget -q -O - http://169.254.169.254/latest/meta-data/instance-id" returns the current Instance Id
The following set of commands will compress the rotated file and upload it to the corresponding S3 bucket

# Perform Rotated Log File Compression
tar -czPf /var/log/httpd/"$1".1.gz /var/log/httpd/"$1".1

# Fetch the instance id from the instance metadata service
EC2_INSTANCE_ID="`wget -q -O - http://169.254.169.254/latest/meta-data/instance-id`"
if [ -z "$EC2_INSTANCE_ID" ]; then
        echo "Error: Couldn't fetch Instance ID .. Exiting .."
        exit 1
fi

# Upload log file to Amazon S3 bucket
/usr/bin/s3cmd -c /.s3cfg put /var/log/httpd/"$1".1.gz s3://$BUCKET_NAME/$(date +%Y)/$(date +%m)/$(date +%d)/$EC2_INSTANCE_ID/"$1"/$(hostname -f)-$(date +%H%M%S)-"$1".gz

# Remove the rotated compressed log file
rm -f /var/log/httpd/"$1".1.gz

Now that the files are automatically getting rotated, compressed and uploaded to S3 there is one last thing to be taken care of.

Run at Shutdown
Since the web tier automatically scales depending upon the load, Instances can be pulled off (terminated) when load decreases. In such scenarios, we might still be left with some log files (at most 50MB each) that didn't get rotated and uploaded. During shutdown, we can have a small script forcefully call logrotate to rotate the final set of files, compress and upload them.
stop() {
        echo -n "Force rotation of log files and upload to S3 initiated"
        /usr/sbin/logrotate -f /etc/logrotate.d/web
        exit 0
}

We need to provide an Access Key and Secret Access Key to the s3cmd utility for S3 API access. Do NOT provide the AWS account's Access Key and Secret Access Key. Create an IAM user who has access only to the specific S3 bucket where we are uploading the files and use that IAM user's Access Key and Secret Access Key. A sample policy allowing the IAM user access to the S3 log bucket would be:

{
  "Statement": [
    {
      "Sid": "Stmt1355302111002",
      "Action": [
        "s3:*"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::my-global-logs",
        "arn:aws:s3:::my-global-logs/*"
      ]
    }
  ]
}
  • The above policy allows the IAM user to perform all actions on the S3 bucket. The user will not have permission to access any other buckets or services
  • If you want to restrict further, instead of allowing all actions on the S3 bucket, allow only PutObject (s3:PutObject) for uploading the files
  • Through the above approach, you will be storing the IAM credentials on the EC2 Instance itself. An alternative approach is to use IAM Roles so that the EC2 Instance will obtain the API credentials at runtime
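As noted above, a tighter variant of this policy would grant only the upload permission. A sketch of such a policy - the Sid is illustrative, and the bucket name is the one used in this series:

```json
{
  "Statement": [
    {
      "Sid": "Stmt-logs-putonly",
      "Action": [
        "s3:PutObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::my-global-logs/*"
      ]
    }
  ]
}
```

With this variant, a compromised Instance can add log files but cannot read, list or delete anything in the bucket.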
With that, we have the web tier log files automatically getting rotated, compressed and uploaded to Amazon S3, stored in a central location and accessible by year/month/day/hour and Instance.

Wednesday, December 19, 2012

Log archive & analysis with Amazon S3 and Glacier - Part II

In the previous post, we saw how to configure logging in AWS CloudFront and start collecting access logs. Let's now move on to the next tier in the architecture - web/app. The following are the key considerations for logging in the web/app layer:
  • Local Log Storage - choosing the local storage for logging. Using a storage option that is sufficient, cost-effective and meets the performance requirement for logging
  • Central Log Storage - how do we centrally store log files for log analysis in future
  • Dynamic Infrastructure - how do we collect logs from multiple servers that are provisioned on demand
Local Log Storage
Except in a very few cases, EBS-backed Instances are the most sought-after Instance type. They launch quickly and it is easier to build Images out of them. But they come with a couple of limitations from a logging perspective:
  • Limited storage - EBS-backed AMIs provided by AWS or third-party providers come with limited storage. For example, a typical RHEL AMI comes with around 6GB of EBS attached as the root partition; similarly, Windows AMIs come with 30GB of EBS attached as C:\
  • Growing EBS - log files tend to grow fast, and it becomes difficult to grow the root EBS (or an additional EBS) as the log file sizes increase
  • Performance - any I/O operation on an EBS Volume goes over the network and tends to be slower than local disk writes. Specifically for logging, it is always better to remove the I/O bottleneck; otherwise a lot of system resources could be spent on logging
Every EC2 Instance comes with ephemeral storage - local disks directly attached to the host on which the Instance is running. Ephemeral storage does not persist between stop-start cycles of an EBS-backed Instance, but it is available while the Instance is running and persists through reboots. Ephemeral storage has a couple of advantages:
  • It is locally attached to the physical host on which the Instance runs and hence has better I/O throughput compared to EBS Volumes
  • It comes in pretty good sizes - for example, an m1.large Instance comes with 850GB of ephemeral storage
  • It comes free of cost - unlike EBS, you aren't charged per GB or for I/O operations on ephemeral storage
This makes ephemeral storage the ideal candidate for storing log files. For an EBS-backed Instance, however, the ephemeral storage is not mounted by default. Hence the following steps are needed to start using it for storing log files:
  • The logging framework usually comes with a configuration file to configure logging parameters. The log file path needs to be configured to point to the ephemeral storage mount directory that we create below
  • All application related files (such as binaries, configuration files, web/app server) will be installed on the root EBS. Before the final AMI is created, the ephemeral storage needs to be setup and configured
  • Run fdisk to list all the storage devices that are mounted
fdisk -l
  • Create a directory such as "/var/log/applogs". This is the directory where the ephemeral storage will be mounted
mkdir /var/log/applogs
  • Mount the storage device in this directory using the "mount" command
mount /dev/xvdj /var/log/applogs
  • Add "fstab" entries so that the ephemeral storage is mounted in the same directory after stop/start or when new Instances are launched out of this AMI
# xfs example
/dev/xvdj  /var/log/applogs  xfs   defaults,noatime,nobarrier,allocsize=64k,logbufs=8  0 2
# ext3 example
/dev/xvdj  /var/log/applogs  ext3  defaults  0 0

The last step is essential, especially from an AutoScaling point of view. When AutoScaling launches new Instances, the ephemeral storage needs to be automatically mounted in the directory so that the application can start logging. Now we can go ahead and create the final AMI and launch Instances from it. The new Instances will have the ephemeral storage automatically mounted at "/var/log/applogs" and applications can start storing their log files there.
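Since AutoScaling launches fresh Instances from this AMI, it can be worth verifying at boot that the fstab entry is actually present before the application starts logging. A minimal sketch - the has_fstab_entry helper is hypothetical and is demonstrated against a temporary file instead of the real /etc/fstab:

```shell
#!/bin/sh
# Check that a mount point has an fstab entry.
# $1 = fstab file, $2 = mount point. Illustrative helper only.
has_fstab_entry() {
    grep -qs "[[:space:]]$2[[:space:]]" "$1"
}

# Demo against a temporary copy rather than the real /etc/fstab
TMP_FSTAB=$(mktemp)
echo "/dev/xvdj /var/log/applogs    ext3    defaults        0   0" > "$TMP_FSTAB"
if has_fstab_entry "$TMP_FSTAB" /var/log/applogs; then
    echo "entry present"
fi
rm -f "$TMP_FSTAB"
```

On a real Instance you would call the helper against /etc/fstab from an init script and mount (or alert) if the entry is missing.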

Friday, December 14, 2012

Log archive & analysis with Amazon S3 and Glacier - Part I

Since this would be a multi part article, here's an outline in terms of how the different parts will be arranged
  1. First we will set the context in terms of taking a web application and identifying the areas of log generation for analysis
  2. Next we will define the overall log storage structure since we have logs being generated from different sources
  3. We will then look at each tier and look at how logs can be collected, uploaded to a centralized storage, what are the considerations
  4. Finally, we will look at other factors such as cost implications, alternate storage options, how to utilize the logs for analysis

Let's take an e-commerce web application which has the following tiers
  • Content Distribution Network - a CDN to serve the static assets of the website. AWS CloudFront
  • Web/App Server running Apache / Nginx / Tomcat on Amazon EC2
  • Search Server running Solr on Amazon EC2
  • Database - Amazon RDS
The first three tiers are the major sources of log information. Your CDN provider will provide access logs in a standard format with information such as the edge location serving the request, the client IP address, the referrer, the user agent, etc. The web servers and search servers will write access logs, error logs and application logs (custom logging by your application).

Log Analysis Architecture

In AWS, Amazon S3 is the natural choice for centralized log storage. Since S3 comes with unlimited storage and is internally replicated for redundancy, it is well suited for storing the log files generated by the CDN provider, web servers and search servers. Per the above architecture, all of these tiers will be configured to push their respective logs to Amazon S3. We will evaluate each layer independently and look at how to set up logging and the considerations associated with it.

S3 Log Storage Structure
Since we have logs coming in from different sources, it is better to create a bucket structure to organize them. Let's say we have the following S3 bucket structure:
S3 Log Storage Bucket Structure
  • my-global-logs: Bucket containing all the logs
  • cf-logs: Folder under the bucket for storing CloudFront logs
  • web-logs: Folder under the bucket for storing Web Server logs
  • solr-logs: Folder under the bucket for storing Solr Server logs

AWS CloudFront
AWS CloudFront is the Content Distribution Network service from AWS. With a growing list of 37 edge locations, it serves as a vital component of e-commerce applications hosted in AWS for serving static content. By using CloudFront, one can deliver static assets and streaming videos to users from the nearest edge location, thereby reducing latency and round trips while offloading such delivery from the web servers.

Enable CloudFront Access Logging
You can configure CloudFront to log all access information during the "Create Distribution" step. You "Enable Logging" and specify the bucket to which CloudFront should push the logs.
Configure CloudFront for Access Logging

  • Specify the bucket that we created above in the "Bucket for Logs" option. This field will accept only a bucket in your account and not any sub-folders in the bucket
  • Since we have a folder called "cf-logs" under the bucket to store the logs, mention the name of that folder in the "Log Prefix" option
  • CloudFront will start pushing access logs to this location every hour. The logs are in the W3C extended format and are compressed by AWS, since the original size could be significant for websites that attract massive traffic
Once this is setup CloudFront will periodically start pushing access logs to this folder.
CloudFront Logs
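Once logs land in this folder, they can be pulled down with s3cmd, gunzipped and parsed. CloudFront access logs are tab-separated; the sketch below uses a made-up sample line (real logs include header rows and more fields, so check your actual files before relying on field positions):

```shell
#!/bin/sh
# Extract the edge location and status code from a CloudFront-style
# tab-separated log line. Sample line and field positions are illustrative.
edge_and_status() {
    printf '%s\n' "$1" | awk -F'\t' '{print $3, $9}'
}

LOG_LINE=$(printf '2012-12-14\t10:15:30\tIAD2\t1234\t192.0.2.10\tGET\td111111abcdef8.cloudfront.net\t/images/logo.png\t200')
edge_and_status "$LOG_LINE"   # prints: IAD2 200
```

Grouping such output with sort and uniq -c gives a quick per-edge-location or per-status breakdown without any heavier tooling.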
In the next post, we will see how to configure the web tier to push logs to S3 and the different considerations involved.

Tuesday, December 11, 2012

Log archive & analysis with Amazon S3 and Glacier - Introduction

Logging is an essential part of any system. It lets you understand what's going on in your system, especially serving as a vital source for debugging. Many systems use logging primarily to let developers debug issues in the production environment. But there are systems where logging becomes essential to understand the following
  • User Behavior - understanding user behavior patterns, such as which areas of the system are being used
  • Feature Adoption - evaluating new feature adoption by tracking how a new feature is being used. Do users vanish after a particular step in a particular flow? Do people from a specific geography use it during a specific time of the day?
  • Click-through analysis - let's say you are placing relevant ads across different pages of your websites. You would like to know how many users clicked them, their demographics and so on
  • System performance
    • Any abnormal behavior in certain areas in the system - a particular step in a workflow resulting in error/exception conditions
    • Analyzing performance of different areas in the system - such as finding out if a particular screen takes more time to load because of a longer query getting executed. Should we optimize the database? Should we probably introduce a caching layer?
Any architect would enforce logging as a core component of the technical architecture. While logging is definitely required, inefficient logging - too much logging, or inappropriate log levels - often leads to the following
  • Underperformance of the system - the system could be spending more resources on logging than on actively serving requests
  • Huge log files - log files grow very fast, especially when inappropriate log levels are used, such as "debug" for all log statements
  • Inadequate data - if the log contains only debug information meant for the developer, there will not be much analysis that can be performed
On the other hand, the infrastructure architecture also needs to support efficient logging and analysis
  • Local Storage - how do you efficiently store the log files on the local server without running out of disk space; especially when log files tend to grow
  • Central Log Storage - how do you centrally store log files so that it can be used later for analysis
  • Dynamic Server Environment - how do you make sure you collect & store all the log files in a dynamic server environment where servers will be provisioned and de-provisioned on demand depending upon load
  • Multi source - handling log files from different sources - like your web servers, search servers, Content Distribution Network logs, etc...
  • Cost effective - when your application grows, so do your log files. How do you store the log files in the most cost-effective manner without burning a lot of cash
In this multi-post article let's take up a case of a typical e-commerce web application with the above characteristics and setup a best practice architecture for logging, analysis and archiving in AWS. We will see how different AWS services can be used effectively to store and process the logs from different sources in a cost effective and efficient manner.

Tuesday, December 4, 2012

Amazon VPC Essentials

Amazon Virtual Private Cloud (VPC) is a great way to setup an isolated portion of AWS and control the network topology. It is a great way to extend your data center and use AWS for burst requirements. In this post, I will list down the key areas that one needs to consider when working with VPC. This will help one decide the best architecture / solution that may fit the given requirement.

Except for Cluster Compute Quadruple Extra Large, all other Instance types (including Micro) are available within VPC. Of course, be sure to check which Instance types are available in a region. For example, Second Generation Standard On-Demand Instances are available only in US-East

Know which services and which features of a service are available within VPC
    • DynamoDB, ElastiCache, SQS, SES, SNS and CloudSearch are not yet available in VPC
    • Amazon RDS Micro Instances are not available in a VPC at this time - Update: AWS just announced (few minutes after posting this article) availability of Micro RDS Instances in VPC #paceofawsinnovation
    • RDS Instances launched in VPC cannot be accessed over the internet (through the end point). You will need a jump box (bastion) in the Public Subnet to connect to RDS or of course you can connect through the VPN from DC
    • DevPay AMIs cannot be launched within VPC
BGP is no longer a requirement for VPN connectivity from your devices; static routing can be used for non-BGP devices. But BGP is recommended since its liveness check is better. Also, each VPN connection has two tunnels for redundancy - if one of them goes down, the failover process is fairly simple with BGP

Customer Gateway Compatibility
The following customer gateway devices are tested and known to be working with AWS VPC. Make sure you are using one of these devices at the Customer Gateway side. Other devices may be compatible as well
  • Static Routing
    • Cisco ASA 5500 Series version 8.2 (or later) software
    • Cisco ISR running Cisco IOS 12.4 (or later) software
    • Juniper J-Series Service Router running JunOS 9.5 (or later) software
    • Juniper SRX-Series Services Gateway running JunOS 9.5 (or later) software
    • Juniper SSG running ScreenOS 6.1, or 6.2 (or later) software
    • Juniper ISG running ScreenOS 6.1, or 6.2 (or later) software
    • Microsoft Windows Server 2008 R2 (or later) software
    • Yamaha RTX1200 router
  • Dynamic Routing using BGP
    • Astaro Security Gateway running version 8.3 (or later)
    • Astaro Security Gateway Essential Firewall Edition running version 8.3 (or later)
    • Cisco ISR running Cisco IOS 12.4 (or later) software
    • Fortinet Fortigate 40+ Series running FortiOS 4.0 (or later) software
    • Juniper J-Series Service Router running JunOS 9.5 (or later) software
    • Juniper SRX-Series Services Gateway running JunOS 9.5 (or later) software
    • Juniper SSG running ScreenOS 6.1, or 6.2 (or later) software
    • Juniper ISG running ScreenOS 6.1, or 6.2 (or later) software
    • Palo Alto Networks PA Series running PANOS 4.1.2 (or later) software
    • Vyatta Network OS 6.5 (or later) software
    • Yamaha RTX1200 router
IP Ranges
When setting up a VPC you are essentially fixing the network of the VPC. And if the VPC requires VPN connectivity (as in most of the cases), care should be taken to choose the IP range of the VPC and avoid any IP conflicts.
  • Assess the IP ranges used at the customer gateway side
  • If the customer gateway side has more than one data center which already have VPN connectivity between them, assess all the data centers
  • Also check the MPLS link between the data centers and the IP range used by the MPLS provider
The above two points (device support and IP ranges) become the critical points in setting up the VPC. If the chosen IP ranges result in conflicts, the entire network architecture is jeopardized and existing systems in the customer data center can potentially be brought down
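As a sanity check before fixing the VPC range, the candidate CIDR can be compared against the data center ranges. A rough shell sketch (illustrative only; a proper IP-planning tool is preferable for real audits):

```shell
#!/bin/bash
# Check whether two CIDR blocks overlap by comparing both networks
# under the shorter (wider) of the two prefixes. Illustrative only.
ip_to_int() {
    local IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

cidr_overlap() {
    local net1=${1%/*} len1=${1#*/} net2=${2%/*} len2=${2#*/}
    local len=$(( len1 < len2 ? len1 : len2 ))
    local mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
    if [ $(( $(ip_to_int "$net1") & mask )) -eq $(( $(ip_to_int "$net2") & mask )) ]; then
        echo overlap
    else
        echo ok
    fi
}

cidr_overlap 10.0.0.0/16 10.0.1.0/24    # prints: overlap
cidr_overlap 10.0.0.0/16 172.16.0.0/12  # prints: ok
```

Running the candidate VPC CIDR against every data center and MPLS range in a loop flags conflicts before the VPC is created, when changing the range is still cheap.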

Public and Private Subnets
The VPC network can be divided further into smaller network segments called Subnets. Any VPC will have at least one Subnet
  • You can setup a Public Subnet which will have internet connectivity. Instances launched within a Public Subnet will have both outbound and inbound (through EIP) internet connectivity through the Internet Gateway attached to the Public Subnet
  • Private Subnets are completely locked down. They do not have internet connectivity by default
  • Create the number of Public and Private Subnets your architecture demands. Place public-facing servers such as web servers and search servers in the Public Subnet. Keep DB servers, cache nodes and application servers in the Private Subnet
Security Groups
VPC Security Groups are different from normal EC2 Security Groups. With EC2 Security Groups you can control only the ingress into your EC2 Instance; with VPC Security Groups you can control both inbound and outbound traffic. When something is not accessible, check both the inbound and outbound rules of the VPC Security Group

ELB Security Group 
When you launch an ELB within VPC, you have the option to specify a VPC Security Group to be attached with the ELB. This is not available for ELB launched outside VPC in normal EC2. With this additional option, you can control access to specific ELB ports from specific IP sources. On the backend EC2 Instances' Security Group, you can allow access to the VPC Security Group that you associated with the ELB

Internal ELB 
When you launch an ELB within VPC, you also have additional option to launch it as an "Internal Load Balancer". You can use an "Internal Load Balancer" to load balance your application tier from the web tier above. Please refer to my earlier article on Internal Load Balancer

NAT Instance 
By default, the Private Subnets in a VPC do not have internet connectivity: they cannot be accessed over the internet, and neither can they make outbound connections to internet resources. But let's say you have set up a database on an EC2 Instance in the Private Subnet and implemented a backup mechanism. You would want to push the backups to Amazon S3, but Instances in the Private Subnet cannot access S3 since there is no internet connectivity. You can achieve this by placing a NAT Instance in the VPC
  • Through a NAT Instance, outbound internet connectivity for Private Subnet Instances can be achieved. The Instances will still not be reachable from the internet (inbound)
  • You need to configure the VPC Routing Table to enable all outbound internet traffic for the Private Subnet to go through the NAT Instance
  • AWS provides a ready NAT AMI (ami-f619c29f) which you can use to launch the NAT Instance
  • You can have only one NAT Instance per VPC
  • Since you can have only one NAT Instance per VPC, you need to be aware that it becomes a Single Point Of Failure in the architecture. If the architecture depends on the NAT Instance for any critical connectivity, it is an area to be reviewed
  • And you are limited by the bandwidth availability of a single NAT Instance. So do not build architecture that will have internet bandwidth requirements from the Private Subnet with NAT
Elastic Network Interface
This is a powerful feature of VPC which lets you solve many problems around licensing, hosting multiple SSL websites, etc...
  • You can attach multiple ENIs per Instance, essentially creating multiple private IPs and the possibility of attaching multiple EIPs per Instance
  • One interesting result of attaching multiple ENIs per Instance: previously you launched an EC2 Instance into a single Subnet; now an EC2 Instance can span Subnets. For example
    • You can have two ENIs - one attached to the public Subnet and other attached to the private Subnet
    • You attach a Security Group to each of these ENIs. The Security Group of public Subnet ENI will allow access on port 80 from the internet but no access on, let's say, port 22
    • The other Security Group attached to the Private Subnet ENI will allow port 22 access from the VPN gateway essentially securing the Instances more
  • If there are products with licenses tied to the MAC address, you can now set the MAC address on an ENI and have the EC2 Instance derive it from the ENI
Private IP and Elastic IP
  • You can fix the Private IP of an Instance when you launch it within the VPC. Let's say you plan to launch multiple tiers (web servers and search servers) within a Subnet, with multiple Instances per tier. Make a logical allocation of IP ranges for each tier - for example, a specific range reserved for web servers. Whenever you launch a Web Server Instance, use one of the private IPs from that logical list. This will help in managing the potentially vast number of Instances
  • The Private IP of an Instance does not change when you stop and start the Instance
  • If you attach an Elastic IP to the Instance, the EIP does not change if you stop and start the Instance