Memory Machine™
Checkpoint Engine
For AWS Batch
Bring More Workloads onto Low-cost Spot Instances
with Memory Machine Checkpoint Engine featuring SpotSurfer
AWS Batch is designed to help you to run batch computing workloads on the AWS Cloud by configuring and managing the required infrastructure, and by automatically provisioning and optimizing compute resources.
SpotSurfer is a checkpointing and recovery feature that gracefully handles Spot terminations. The service combines checkpointing with cloud automation to make it possible for big stateful workloads to run safely on Spot instances, saving up to 90% in compute cost.
Installing Memory Machine Checkpoint Engine for AWS Batch adds SpotSurfer to an AWS Batch environment, so more workloads can run on low-cost Spot instances.
Read how these Memory Machine Cloud use cases benefit from SpotSurfer today and can benefit your AWS Batch environment in the future.
“reduced the time from several weeks to a few days and at 50-80% lower cost”
“I was getting up to 80% batch failure rates …we have already brought failure rates due to spot reclaims to below 1%…”
“a transparent, low-overhead incremental checkpoint / restore solution that makes these EDA jobs resilient (hot restart) to Spot pre-emptions”
Memory Machine™ Cloud
How it Works
The Memory Machine Checkpoint Engine integrates with AWS Batch ECS. Once integrated, it captures the entire running state of an AWS Batch Job into a consistent image and restores the Job on a new Compute Instance without losing any work progress. This maintains a high quality of service at the Batch level, even when the Jobs run on low-cost but unreliable Spot-based Compute Instances.
The Memory Machine Checkpoint Engine’s key features include:
- Full integration into the customer Batch environment
- Automated checkpoint and restore
- No change to the customer workflow
- No change to the Job applications and Workflow Manager scripts
- Scalable across thousands of Batch Jobs and Compute Instances
- Secure data processing within the customer VPC
The integration architecture is shown below:
For Compute Instances running on Spot VMs, the following diagram shows the Memory Machine Checkpoint Engine protecting jobs when the instance is preempted:
- Compute Instance 1 has a root volume EBS 1, and two Jobs, A & B, are running on the instance.
- On a Spot preemption warning, the Memory Machine Checkpoint Engine checkpoints both Jobs, capturing the processes inside the Docker container and the working files on EBS 1, and stores the checkpoint images on S3.
- When Batch reschedules Job A onto a new Compute Instance 2, Job A would normally restart from the beginning and lose all previous work progress. Instead, the Memory Machine Checkpoint Engine restores the Job A process and working files to the attached EBS 2, so Job A’s progress carries over to Compute Instance 2.
- Likewise for Job B on Compute Instance 3.
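The sequence above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the Memory Machine Checkpoint Engine's implementation: the names Job, Instance, checkpoint_all, restore and parse_spot_notice are hypothetical, a plain dictionary stands in for S3, and a step counter plus a dictionary of files stand in for process state and EBS working files. The warning format assumed here is the JSON document that EC2 publishes at the instance metadata path spot/instance-action.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


def parse_spot_notice(body: str) -> tuple:
    """Parse a Spot interruption notice body into (action, UTC deadline)."""
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return notice["action"], when.replace(tzinfo=timezone.utc)


@dataclass
class Job:
    name: str
    steps_done: int = 0                        # stand-in for in-memory process state
    files: dict = field(default_factory=dict)  # stand-in for working files on EBS

    def run(self, steps: int) -> None:
        """Advance the job and update its working file after every step."""
        for _ in range(steps):
            self.steps_done += 1
            self.files["progress.txt"] = str(self.steps_done)


@dataclass
class Instance:
    name: str
    jobs: dict = field(default_factory=dict)


def checkpoint_all(instance: Instance, store: dict) -> None:
    """On a preemption warning, save every job's state and files to the store."""
    for job in instance.jobs.values():
        store[job.name] = json.dumps(
            {"steps_done": job.steps_done, "files": job.files}
        )


def restore(job_name: str, instance: Instance, store: dict) -> Job:
    """Place a job on an instance, resuming from its checkpoint if one exists."""
    image = store.get(job_name)
    if image is None:
        job = Job(job_name)  # no checkpoint: the job starts from the beginning
    else:
        state = json.loads(image)
        job = Job(job_name, state["steps_done"], state["files"])
    instance.jobs[job_name] = job
    return job


if __name__ == "__main__":
    store = {}  # hypothetical stand-in for an S3 bucket
    ci1 = Instance("compute-instance-1")
    restore("job-a", ci1, store).run(5)
    restore("job-b", ci1, store).run(3)

    # Example notice body in the assumed format; the values are illustrative.
    action, deadline = parse_spot_notice(
        '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
    )
    checkpoint_all(ci1, store)

    job_a = restore("job-a", Instance("compute-instance-2"), store)
    job_b = restore("job-b", Instance("compute-instance-3"), store)
    print(action, job_a.steps_done, job_b.steps_done)
```

In this model a job with no checkpoint image starts at step zero, which corresponds to the default Batch behavior the list describes, while a job with an image resumes at the step count it had when the warning arrived.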
AWS Partnership
MemVerge is a proud member of the AWS Partner Network, a global community of businesses using AWS to develop customer-focused solutions and services. The company has also earned membership in the ISV Accelerate Program, Amazon EC2 Spot Ready Program, and Qualified Software Program.