What to do During a Spot Instance Interruption

by Jing Xie

Cloud providers offer their underutilized computing capacity at reduced prices as Spot instances. This is an opportunity for customers to use cloud resources in a cost-effective way. However, unlike regular on-demand instances, Spot instances can be reclaimed by the provider with very little warning. This has the potential to disrupt your computing workflows, particularly for important tasks.

In this article, we’ll dive into effective strategies to ensure your applications and deployments can withstand Spot instance interruptions while keeping your operations smooth and costs under control.

What are spot instances?

Spot instances are virtual server instances offered at a discounted rate compared to the on-demand price. Cloud computing platforms such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure all offer them. They are drawn from the cloud provider’s excess capacity, allowing you to lease computing power at much lower costs (up to 90% off) than regular on-demand instances.

What is the purpose of spot instances?

Spot instances are a way to use cloud resources at a lower rate. They are an economical option for executing workloads that can tolerate interruptions, and are typically used for processes like batch jobs, scientific simulations, or any task that isn’t time-sensitive. They give companies an opportunity to scale their compute resources economically while managing their budget effectively.

Can spot instances be interrupted?

Yes. Spot instances can be interrupted by Amazon EC2 with a two-minute notification. This is because Spot instances run on spare capacity that Amazon Web Services offers at a discounted price. When AWS needs that capacity back, it can interrupt your EC2 Spot instance and reclaim the CPUs or GPUs that were provisioned to your job.

There are a few potential reasons why AWS could interrupt your spot instance.

If there is a surge in demand for EC2 instances, AWS may need to reclaim the capacity that your Spot instance is using.
You set the maximum price you’re willing to pay for an EC2 Spot instance. If the market price for Spot instances rises above that maximum, your instance will be interrupted.
If you’ve set specific requirements for your Spot instance, such as a particular Availability Zone, and those requirements can no longer be met, your instance may be interrupted. You can view these details in the instance metadata.
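The maximum price and the Availability Zone constraint mentioned above are both set when the Spot request is created. As a minimal sketch, assuming boto3 is used to make the request, the helper below only assembles the keyword arguments for `request_spot_instances`; the function name, the AMI ID, and the price are hypothetical values chosen for illustration.

```python
def build_spot_request(ami_id, instance_type, max_price_usd, availability_zone=None):
    """Assemble keyword arguments for boto3's ec2.request_spot_instances().

    Hypothetical helper: it only builds the request, it does not call AWS.
    """
    launch_spec = {"ImageId": ami_id, "InstanceType": instance_type}
    if availability_zone is not None:
        # Pinning one Availability Zone is a constraint: it narrows the
        # capacity the request can draw from.
        launch_spec["Placement"] = {"AvailabilityZone": availability_zone}
    return {
        "InstanceCount": 1,
        # The maximum hourly price is passed to the API as a string.
        "SpotPrice": f"{max_price_usd:.4f}",
        "LaunchSpecification": launch_spec,
    }


request = build_spot_request("ami-0123456789abcdef0", "m5.large", 0.05, "us-east-1a")
print(request["SpotPrice"])
```

With boto3 installed and credentials configured, the result could be passed on as `boto3.client("ec2").request_spot_instances(**request)`. Leaving `availability_zone` unset avoids the constraint entirely.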

How long can an AWS spot instance run?

An AWS Spot instance can run as long as there is enough capacity in the Availability Zone and the Spot price is below your maximum price. This of course comes with the risk that AWS could unpredictably interrupt (terminate, stop, or hibernate) your Spot instance with a two-minute warning.
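One way to notice the two-minute warning from inside the instance is to poll the instance metadata service. The sketch below assumes the documented `spot/instance-action` metadata path and IMDSv2 token flow; the function names are hypothetical, and `fetch_interruption_notice` only works when run on an EC2 instance.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

METADATA_BASE = "http://169.254.169.254/latest"


def parse_instance_action(body):
    """Parse the JSON document served at spot/instance-action."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return notice["action"], when.replace(tzinfo=timezone.utc)


def seconds_remaining(when, now=None):
    """Seconds left before the scheduled interruption (never negative)."""
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def fetch_interruption_notice(timeout=2):
    """Return (action, time) if an interruption is scheduled, else None."""
    token_request = urllib.request.Request(
        f"{METADATA_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as response:
        token = response.read().decode()
    notice_request = urllib.request.Request(
        f"{METADATA_BASE}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(notice_request, timeout=timeout) as response:
            return parse_instance_action(response.read().decode())
    except urllib.error.HTTPError as error:
        if error.code == 404:  # no interruption is currently scheduled
            return None
        raise
```

A worker could call `fetch_interruption_notice()` every few seconds and start its shutdown routine as soon as the result is not `None`.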

In the past, AWS offered a feature called Spot Blocks, where you could run EC2 Spot instances for a predefined duration of 1-6 hours without interruption. This feature was discontinued for new customers as of July 2021.

What happens when an Amazon EC2 Spot instance is stopped or terminated?

When an AWS EC2 instance is stopped, it goes through the following series of events.

Your instance will be shut down after saving its state.
Once a Spot instance is stopped by an interruption, only Amazon EC2 can restart it. It will restart the stopped Spot instance when there is enough capacity in the Availability Zone.
If the instance belongs to a fleet that maintains a target capacity, Amazon EC2 launches a replacement instance so that the capacity is not lost.
While the instance is stopped, you won’t be charged for the instance itself. You will still be charged for its Elastic Block Store (EBS) volumes, which are block-level storage volumes that you can create and attach to EC2 instances.
You will be able to modify some of the Spot instance’s attributes, but not the instance type.
While it is stopped, you can also terminate the Spot instance entirely. If you cancel the Spot request or fleet that launched it, Amazon EC2 will terminate the associated stopped Spot instances.

Unlike stopping, the termination of spot instances means that the instance is completely shut down and removed from your account. Here are the events that typically follow in sequence.

Once an instance is terminated by AWS EC2, any and all processes and applications running on it will cease.
Any data stored locally on the instance’s ephemeral storage will be lost. It’s best to back up the data to a permanent storage solution like AWS EBS volumes or Amazon S3 buckets.
Billing for the last partial hour depends on who ends the instance. For instances that are billed by the hour, if you terminate the Spot instance yourself, you will be charged a full hour for the partial hour, whereas if Amazon EC2 interrupts the Spot instance, you will not be charged for the interrupted partial hour.

Once it is terminated, the computing capacity previously used by the spot instance will be reclaimed by AWS and reassigned.

Stops and interruptions of spot instances can disrupt your applications. Therefore, it’s important to design your applications to be as fault-tolerant as possible to mitigate these risks.

What will trigger a spot instance interruption?

An AWS Spot instance interruption happens mainly under two conditions: a lack of Spot capacity, or a Spot price that rises above your maximum price.

Lack of spot capacity

When demand for EC2 resources increases, the available spot capacity decreases. If your spot instance is using resources that are needed for higher-priority on-demand or reserved instances, AWS will reclaim them. This can lead to an interruption of your spot instance.

Price

The market price for spot instances fluctuates based on supply and demand. If the market price rises above your maximum bid, AWS automatically interrupts your spot instance to ensure that you do not pay more than what you’ve agreed to.

What are the risks of spot instances?

Apart from the unpredictable interruptions, these low-priced computing solutions come with more risks.

Because Spot prices move with supply and demand, they can be very dynamic. At any given moment the price of the Spot instance could rise above your maximum price, which in turn can interrupt your instance.

Spot instances have limited or no support in some AWS services, such as CodeDeploy, Elastic Beanstalk, and OpsWorks.

EC2 Spot instances are not suitable for workloads that are inflexible, fault-intolerant, or stateful.

What to do during a spot instance interruption?

When using Spot instances, you always have to be prepared for interruptions. There are three steps you can follow to mitigate a Spot instance interruption.

Step 1: Monitor Spot Instance Interruption Notices

Setting up monitoring for interruption notices is important, as the two-minute warnings will alert you to upcoming interruptions. Use Amazon EventBridge to monitor these notices and automate responses to them, for example by triggering an AWS Lambda function.
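As a rough sketch of that setup, the snippet below shows an EventBridge event pattern that matches Spot interruption warnings and a Lambda handler that reacts to them. The `source` and `detail-type` values follow the event AWS emits for these warnings; `drain_instance` is a hypothetical hook standing in for whatever cleanup your workload needs.

```python
INTERRUPTION_DETAIL_TYPE = "EC2 Spot Instance Interruption Warning"

# Event pattern for the EventBridge rule that triggers the Lambda function.
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": [INTERRUPTION_DETAIL_TYPE],
}


def drain_instance(instance_id):
    """Hypothetical hook: deregister the instance, flush work, and so on."""
    print(f"draining {instance_id}")


def lambda_handler(event, context):
    """React to a Spot interruption warning delivered by EventBridge."""
    if event.get("detail-type") != INTERRUPTION_DETAIL_TYPE:
        return {"handled": False}
    detail = event["detail"]
    drain_instance(detail["instance-id"])
    return {
        "handled": True,
        "instance_id": detail["instance-id"],
        "action": detail["instance-action"],
    }
```

Keeping the handler small matters here: the whole response, including any cleanup it kicks off, has to fit inside the two-minute window.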

Step 2: Handle interruptions carefully

When you receive a warning, you need to handle the shutdown process carefully to minimize disruptions. This includes implementing mechanisms like CloudWatch configurations and checkpoints, so that you can resume work where you left off once a new instance is available. Use AWS Auto Scaling groups to launch a replacement instance automatically when interruptions occur.
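Checkpointing can be illustrated with a small, self-contained sketch. It assumes a job whose progress fits in a JSON file; the function names, the state layout, and the `should_stop` callback are all hypothetical. The checkpoint is written to a temporary file and then renamed, so an interruption in the middle of a write cannot leave a half-written checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write the checkpoint atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {"next_item": 0, "total": 0}


def run_job(items, path, should_stop=lambda: False, checkpoint_every=100):
    """Sum the items, resuming from the last checkpoint if there is one."""
    state = load_checkpoint(path)
    for index in range(state["next_item"], len(items)):
        if should_stop():  # e.g. an interruption notice has arrived
            break
        state["total"] += items[index]
        state["next_item"] = index + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # final save, also on early exit
    return state
```

On a real Spot instance the checkpoint file would need to live on storage that survives the instance, such as an EBS volume or a copy in Amazon S3, and `should_stop` could be wired to whatever signals the interruption notice.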

Step 3: Fault tolerance

One of the most important practices when using Amazon EC2 Spot instances is to design the architecture of your system with fault tolerance and redundancy. This means you should distribute your workloads across many Spot instances to reduce the risk of downtime. You can use tools like AWS Elastic Load Balancing to distribute traffic across instances. Spot Fleet diversification also helps: by drawing on a collection of instance types, you are less likely to face simultaneous interruptions.
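Diversification can also be expressed through an Auto Scaling group with a mixed instances policy. The sketch below only builds the keyword arguments for boto3's `create_auto_scaling_group`; the helper name, the sizing choices, and the choice of the `capacity-optimized` allocation strategy are assumptions made for illustration, not a recommendation from the article.

```python
def build_diversified_asg(name, launch_template_name, instance_types, subnet_ids, desired=4):
    """Assemble arguments for an all-Spot group spread over several instance types.

    Hypothetical helper: it only builds the request, it does not call AWS.
    """
    if len(set(instance_types)) < 2:
        raise ValueError("use at least two instance types to diversify")
    return {
        "AutoScalingGroupName": name,
        "MinSize": 0,
        "MaxSize": desired * 2,
        "DesiredCapacity": desired,
        # Subnets in different Availability Zones add a second axis of diversity.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": launch_template_name,
                    "Version": "$Latest",
                },
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                "OnDemandBaseCapacity": 0,
                "OnDemandPercentageAboveBaseCapacity": 0,  # 100% Spot
                "SpotAllocationStrategy": "capacity-optimized",
            },
        },
    }
```

The result could be passed on as `boto3.client("autoscaling").create_auto_scaling_group(**kwargs)`. Raising `OnDemandBaseCapacity` above zero keeps a floor of on-demand instances that interruptions cannot touch.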

How often do spot instances get interrupted?

How often Spot instances are interrupted depends on several factors, including the instance type and Region. According to AWS, the historical average interruption frequency across all Regions and instance types is less than 5%. However, the Amazon EC2 Spot Instance Advisor indicates that the frequency of interruption for a given instance type could range from less than 5% to over 20%. Therefore, it’s best to be prepared.

How do I stop a spot instance in EC2?

In 2020, AWS introduced a feature for customers to stop spot instances if they become unnecessary.

Before you start the process, make sure that the Spot instance request is of type ‘persistent’ rather than ‘one-time’. This should have been chosen when launching the Spot instance. A persistent request is highly recommended, as it allows Spot instances to automatically restart after an interruption. You should also verify that the instance has an EBS-backed root volume.

To stop a Spot instance, go to the EC2 instances section, select the Spot instance you want to stop, and choose the ‘stop’ option from the Actions menu. The options to stop and start a Spot instance become available as soon as it is launched. Once the Spot instance is stopped, its EBS volume is preserved. You can also use the ‘stop-instances’ command from the AWS CLI to stop the instance.

You can also use the EC2 API, which can both manage the instance lifecycle and report on instance health.
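The preconditions and the stop call can be combined in a short script. This is a sketch under assumptions: it covers an instance launched from a plain Spot instance request, it expects a boto3 EC2 client to be passed in, and the function name is hypothetical. The calls it makes (`describe_instances`, `describe_spot_instance_requests`, `stop_instances`) are standard EC2 API operations.

```python
def stop_spot_instance(ec2_client, instance_id):
    """Stop a Spot instance after checking that it is allowed to be stopped."""
    reservations = ec2_client.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]

    request_id = instance.get("SpotInstanceRequestId")
    if not request_id:
        raise ValueError(f"{instance_id} is not a Spot instance")
    if instance.get("RootDeviceType") != "ebs":
        raise ValueError(f"{instance_id} does not have an EBS-backed root volume")

    requests = ec2_client.describe_spot_instance_requests(
        SpotInstanceRequestIds=[request_id]
    )["SpotInstanceRequests"]
    if requests[0]["Type"] != "persistent":
        raise ValueError(f"Spot request {request_id} is not persistent")

    # Equivalent to: aws ec2 stop-instances --instance-ids <instance_id>
    return ec2_client.stop_instances(InstanceIds=[instance_id])
```

With boto3 installed, this would be called as `stop_spot_instance(boto3.client("ec2"), "i-...")`. Passing the client in as a parameter also makes the checks easy to exercise against a stub before touching a real account.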

How do you simulate spot interruptions?

To ensure your system can withstand spot instance interruptions, it’s possible to simulate these disruptions. There are three main methods for simulating a spot instance interruption.

Cloud-based fault injection: On cloud platforms like AWS, you’ll be able to find built-in tools to simulate faults, including spot instance interruptions. A notable tool would be the AWS Fault Injection Simulator. This allows you to target specific instances, design an experiment that mimics an interruption, and run it.
Manual scripting: This is a more complex process that requires you to write custom scripts, but that does mean it is more customizable. The commands you need to use depend on the operating system you use and your environment.
Spot Fleet with Planned Interruptions: You can configure a Spot Fleet in AWS with a specific interruption behavior. This allows you to define a percentage of your fleet instances that can be interrupted at any given time.
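For the fault injection route, an experiment is described by a template. The sketch below builds such a template as a plain dictionary. The action ID `aws:ec2:send-spot-instance-interruptions`, the `aws:ec2:spot-instance` resource type, and the `durationBeforeInterruption` parameter are given as I understand the AWS Fault Injection Simulator documentation and should be checked against it; the helper name, tag, and role ARN are hypothetical.

```python
def build_spot_interruption_experiment(role_arn, tag_key, tag_value, notice="PT2M"):
    """Assemble a fault injection template that interrupts one tagged Spot instance.

    Hypothetical helper: it only builds the template, it does not call AWS.
    """
    return {
        "description": "Interrupt one running Spot instance with a two-minute notice",
        "roleArn": role_arn,
        "stopConditions": [{"source": "none"}],
        "targets": {
            "oneSpotInstance": {
                "resourceType": "aws:ec2:spot-instance",
                "resourceTags": {tag_key: tag_value},
                "filters": [{"path": "State.Name", "values": ["running"]}],
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "interruptSpotInstance": {
                "actionId": "aws:ec2:send-spot-instance-interruptions",
                # ISO 8601 duration between the notice and the interruption.
                "parameters": {"durationBeforeInterruption": notice},
                "targets": {"SpotInstances": "oneSpotInstance"},
            }
        },
    }
```

With boto3, the template could be registered through the `fis` client's `create_experiment_template` call, which additionally expects a `clientToken`. Tagging only test instances keeps the experiment away from production workloads.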

How does MMCloud help with Spot interruptions?

The Memory Machine Cloud (MMCloud) is a software platform designed by MemVerge to help automate running workloads on Spot instances and handle Spot interruptions.

MMCloud automatically handles Spot interruptions by saving the in-memory state of a batch job or application and migrating it to another instance. This means you don’t have to restart from scratch when a Spot instance is reclaimed by your provider.
MMCloud uses a memory snapshot technology called AppCapsule, which is used not only to checkpoint and restore workloads running on Spot instances but also to intelligently autoscale to larger and smaller instances based on real-time CPU and memory utilization. This helps you avoid issues such as tasks needing to restart because of insufficient resources, or budget overruns due to running on larger machines than are actually required.
As a bonus, MMCloud gives the user real-time visibility into application resource use, helping you act on rebalance recommendations and optimize your resources.

Users of open-source workflow managers like Nextflow can easily get started with MMCloud by creating an account on MMCloud.io and following the directions in the MMCloud Quick Guides.

Integrating Spot Instances with AWS Services

For good measure, here are a few more tools that can be used with spot instances.

EKS (Amazon Elastic Kubernetes Service) is a managed Kubernetes service provided by AWS. Spot instances can be integrated with EKS clusters to reduce costs.
EMR (Amazon Elastic MapReduce) is a managed big data processing service. Spot instances can be used for EMR clusters to reduce costs, but users need to configure the clusters so they can handle instance interruptions and job retries appropriately.
IAM (Identity and Access Management) is used to manage access to AWS resources securely. It’s crucial in the context of spot instances to define roles and permissions to control who can manage spot instance requests and access instance metadata.
ECS (Amazon Elastic Container Service) is a container orchestration service. Spot instances can be used with ECS to reduce costs, but users need to configure task placement strategies to handle interruptions gracefully.