Wednesday, July 31, 2013

Auto-Scaling with AWS Spot Instances

While working on inPowered's back-end systems we use a lot of different AWS services. One of our sub-components recently needed to handle many times the load it had been handling very well for over a year. Since we're using auto-scaling groups (ASGs) for pretty much everything, the obvious solution was to simply scale up more instances. However, we had to go from a few dozen to a few hundred instances almost overnight, and even though this ASG was utilizing t1.micros, controlling cost became a concern. So we decided to try using spot instances.


EC2 Spot Instances

For those not familiar with spot: it's basically a bidding system for EC2 instances. You put in a maximum bid with an instance request, AWS calculates a market price, and if your bid is at or above that price and instances are available, you get some - possibly fewer than you requested, or none at all. Also, Amazon may shut spot instances down at any time, so they should only run applications that can tolerate being killed at random.
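To make the mechanics concrete, here is a minimal sketch of such a request using boto3 (which postdates this post); the AMI ID, bid price and instance count are purely illustrative:

    import boto3

    ec2 = boto3.client("ec2")

    # Bid for two t1.micro spot instances at a maximum price of $0.01/hour.
    # AMI ID, price and count are placeholders.
    response = ec2.request_spot_instances(
        SpotPrice="0.010",
        InstanceCount=2,
        Type="one-time",
        LaunchSpecification={
            "ImageId": "ami-12345678",
            "InstanceType": "t1.micro",
        },
    )

    for request in response["SpotInstanceRequests"]:
        # Requests start out "open" and are only fulfilled while the market
        # price stays at or below the bid and capacity is available.
        print(request["SpotInstanceRequestId"], request["State"])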


Auto-Scaling

Like many of our back-end systems, this ASG scaled on SQS queue length. The instances got their input from an SQS queue, performed some tasks and wrote the output to another queue - a perfect setup for auto-scaling and spot instances (in my opinion ASGs are one of the best features of EC2). We configured the ASG to add one more instance if the queue contained more than a certain number of messages - say 10,000. Because the load varies a lot over the day, we added another trigger that adds 3 more instances if the queue grows even larger, say to 50,000 messages (effectively adding 4 instances when the queue reached that size).
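In boto3 terms the 10,000-message trigger looked roughly like the sketch below (illustrative only - we did not use boto3 at the time, and the names and numbers are made up); the 50,000-message trigger pointed a second alarm at a +3 policy set up the same way:

    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    ASG_NAME = "worker-asg"      # placeholder names
    QUEUE_NAME = "input-queue"

    # Simple scaling policy: add one instance whenever the alarm fires.
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName=ASG_NAME,
        PolicyName="scale-up-by-1",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=1,
        Cooldown=300,            # seconds between scaling activities
    )

    # Alarm on the SQS backlog that triggers the policy above.
    cloudwatch.put_metric_alarm(
        AlarmName="queue-over-10k",
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        Threshold=10000,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )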

ASGs using Spot

This worked very well for an extended period of time, but as soon as we switched to spot instances, our system had a hard time catching up. Since we had just increased the load by a lot, the first thought was that we simply needed more instances. However, that didn't really change anything. Suddenly a system that had worked well pretty much unattended needed manual intervention to keep up with the load. Another thought was that our scaling rules just didn't fit the higher load. And finally, it's also possible that you don't get all (or any) of the instances you requested because of the market price or availability (we have seen up to two hours of not getting any instances).

Turns out none of the above actually caused the issue. Instead, what was happening was the following: our triggers requested to increase the desired capacity of the ASG by 1 (or 3, respectively). Since launching a spot instance isn't guaranteed to happen - it is only a request for instances - the time to fulfill the request completely was, in many cases, longer than the cool-down period of our alarm. When the trigger fired again, the ASG tried to increase the capacity from its current state, canceled all previously pending spot instance requests and put in new ones. I could verify this was happening using the CLI tool as-describe-scaling-activities. In the worst case you never get any instances because the requests are always canceled before anything happens. In our case we sporadically got an instance here and there.
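The same check can be done with a few lines of boto3 (a newer SDK than the as-* tools we used; the group name is a placeholder):

    import boto3

    autoscaling = boto3.client("autoscaling")

    # Recent scaling activities; cancelled spot requests show up here with a
    # status message explaining why no capacity was added.
    activities = autoscaling.describe_scaling_activities(
        AutoScalingGroupName="worker-asg",   # placeholder name
        MaxRecords=20,
    )

    for activity in activities["Activities"]:
        print(activity["StartTime"], activity["StatusCode"],
              activity["Description"], activity.get("StatusMessage", ""))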

This problem was amplified by the fact that we had 2 alarms set up with the same cool-down period, effectively cutting the time available to fulfill a request in half. In our case, removing one alarm, doubling the remaining cool-down period and tripling the number of additional instances per request brought the system back into a stable state.
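Since put_scaling_policy creates or updates a policy (and the cool-down lives on the scaling policy in boto3 terms), the fix amounts to re-issuing the remaining policy with a longer cool-down and a bigger increment - roughly like this, again with illustrative names and numbers:

    import boto3

    autoscaling = boto3.client("autoscaling")

    # put_scaling_policy creates or updates a policy, so re-issuing it with
    # the same name replaces the old settings in one call.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="worker-asg",   # placeholder name
        PolicyName="scale-up",               # the single remaining policy
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=3,                 # tripled increment (illustrative)
        Cooldown=600,                        # doubled cool-down (illustrative)
    )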

Learnings:
  • When auto-scaling spot instances, don't use more than one trigger.
  • Don't try to scale up a few spot instances in small intervals; instead, increase the interval and the number of additional instances started per request.
