Memory Machine™ Cloud – AppCapsule

CPU / GPU Checkpoint & Restore

Memory Machine Cloud is a suite of powerful and intuitive container orchestration services for running data-intensive pipelines, such as bioinformatics workflows, and interactive computing applications, such as EDA.

Memory Machine Cloud AppCapsule is a checkpoint and restore engine that must be qualified on an app-by-app basis. Once it is integrated with an application, CPU and GPU state can be checkpointed, which allows a pipeline or interactive app to be hot-restarted from a specific point in time.

Memory Machine™ Cloud

OpCenter

Optimize & Automate

Air

Point-and-Click Launch

Managed Services

Compute-as-a-Service

AppCapsule

Checkpoint / Restore

Key Features & Benefits

Drives the ability to surf cloud resources and continuously rightsize

AppCapsule snapshots drive both the SpotSurfer and WaveRider features, which migrate data automatically. An AppCapsule captures the entire state of an application, including CPU registers, cache, memory, and storage. The captured state can then be saved to persistent storage or transported to another server.
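The idea of capturing state and then saving or transporting it can be illustrated with a minimal sketch. This is not AppCapsule's implementation or API: `ToyState`, `capture`, and `resume` are hypothetical names, and a small JSON-serializable record stands in for the real register, memory, and storage state.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ToyState:
    """Hypothetical stand-in for the state an application snapshot would hold."""
    registers: dict = field(default_factory=dict)
    memory: list = field(default_factory=list)
    files: dict = field(default_factory=dict)


def capture(state: ToyState) -> bytes:
    """Serialize the state into bytes that can be written to storage or sent elsewhere."""
    return json.dumps(asdict(state)).encode("utf-8")


def resume(snapshot: bytes) -> ToyState:
    """Rebuild the state from a snapshot, for example on a different server."""
    return ToyState(**json.loads(snapshot.decode("utf-8")))
```

Because the snapshot is plain bytes, the same value can be written to persistent storage or sent over the network before `resume` is called on the receiving side.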

Facilitates node maintenance, node rightsizing, and workload migration

AppCapsule allows infrastructure architects and MLOps teams to implement use cases such as spot instance preemption protection, cloud bursting, job/batch prioritization, and instance rightsizing. After an easy installation, AppCapsule can be managed through RESTful APIs as well as CLIs. For security-sensitive organizations, AppCapsule can be configured to run as a normal user instead of root or superuser, which limits the scope of any unintended consequences.

See this NVIDIA blog to learn more about how this transparent checkpoint/hot restart feature can facilitate node maintenance, node rightsizing, and workload migration/bursting.

Checkpoint for Applications using GPU accelerators

Many AI applications require the use of GPUs. Because GPUs use High Bandwidth Memory (HBM), checkpointing these applications is more complex: the state stored in HBM must also be captured by the checkpoint. The following is a high-level overview of the steps involved.

Step A is Simultaneous Freeze: CPU and GPU processes are paused at the same time, and a new CUDA driver function is used to copy GPU memory to system memory. Step B is Quick Resume: the application resumes running immediately while MemVerge constructs the checkpoint metadata and offloads the complete image to storage in the background.
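The ordering of Steps A and B can be sketched in a few lines of Python. This is a toy model under stated assumptions, not MemVerge's code: `ToyApp` and `checkpoint` are hypothetical names, a plain list stands in for GPU memory in HBM, and a background thread stands in for the offload to storage.

```python
import copy
import json
import threading


class ToyApp:
    """Hypothetical application with separate CPU and GPU state."""

    def __init__(self):
        self.cpu_memory = {"step": 0}
        self.gpu_memory = [0.0] * 4  # stand-in for data held in HBM
        self.paused = False

    def step(self):
        if self.paused:
            raise RuntimeError("application is frozen")
        self.cpu_memory["step"] += 1
        self.gpu_memory = [x + 1.0 for x in self.gpu_memory]


def checkpoint(app, path):
    # Step A: freeze CPU and GPU work together, then stage a copy of
    # both states in host memory so the two sides stay consistent.
    app.paused = True
    staged = {
        "cpu_memory": copy.deepcopy(app.cpu_memory),
        "gpu_memory": list(app.gpu_memory),
    }

    # Step B: resume the application right away; the staged copy is
    # written to storage by a background thread.
    app.paused = False

    def offload():
        with open(path, "w") as f:
            json.dump(staged, f)

    writer = threading.Thread(target=offload)
    writer.start()
    return writer
```

The application is frozen only for as long as the in-memory copy takes. Calling `app.step()` after `checkpoint` returns does not change the staged image, and `writer.join()` waits until the image is fully written.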

Restore

Step C is to restore CPU memory, GPU memory, and file content from the checkpoint image in storage. Step D is to restore process states in the CPU and GPU and restart the application.
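The restore order in Steps C and D can be sketched the same way. This is a minimal illustration, assuming a hypothetical JSON image with `cpu_memory`, `gpu_memory`, `files`, and `process_state` keys. The real image format is not described here, and `restore` is an illustrative name.

```python
import json
import os


def restore(image_path, workdir):
    """Rebuild a toy application from a checkpoint image (hypothetical format)."""
    with open(image_path) as f:
        image = json.load(f)

    # Step C: bring back CPU memory, GPU memory, and file content first.
    app = {
        "cpu_memory": dict(image["cpu_memory"]),
        "gpu_memory": list(image["gpu_memory"]),
        "running": False,
    }
    for name, content in image["files"].items():
        with open(os.path.join(workdir, name), "w") as f:
            f.write(content)

    # Step D: restore process state only after memory and files are in
    # place, then mark the application as restarted.
    app["process_state"] = dict(image["process_state"])
    app["running"] = True
    return app
```

Memory and files are restored before process state so that the application never resumes against data that is not yet in place.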

Available Now

Memory Machine™ Cloud