Memory Machine Checkpoint Engine

Transparent Checkpointing & Restore Increases AI Availability

Memory Machine is a suite of powerful and intuitive container orchestration services for running data-intensive pipelines, such as bioinformatics, and interactive computing applications, such as EDA.

Memory Machine Checkpoint Engine is a powerful checkpoint and restore engine that must be qualified on an app-by-app basis. Once it is integrated, CPU and GPU resources can be checkpointed, which allows a pipeline or interactive app to be hot restarted from a specific point in time.

The software is included with Memory Machine Cloud and Memory Machine AI and is available as a stand-alone application.

Key Features & Benefits

Drives the ability to surf cloud resources and continuously rightsize

Checkpoint Engine snapshots drive the capabilities of both the SpotSurfer and WaveRider features that migrate data automatically. Checkpoint Engine captures the entire state of an application, including CPU registers, cache, memory, and storage. The captured state can then be saved to persistent storage or transported to another server.

Facilitates node maintenance, node rightsizing, and workload migration

Checkpoint Engine allows infrastructure architects and MLOps teams to implement use cases such as spot instance preemption protection, cloud bursting, job/batch prioritization, and instance rightsizing. After an easy installation, it can be managed through RESTful APIs as well as CLIs. For security-sensitive organizations, it can be configured to operate as a normal user instead of root or superuser, which limits the risk of unintended consequences.

Read this NVIDIA blog to learn more about how this transparent checkpoint/hot restart feature can facilitate node maintenance, node rightsizing, and workload migration/bursting.

Checkpoint for Applications using GPU accelerators

Many AI applications require GPUs. Because GPUs use High Bandwidth Memory (HBM), checkpointing these applications is more complex: the state stored in HBM must also be captured by the checkpoint. The following is a high-level overview of the steps involved.

Step A is Simultaneous Freeze: CPU and GPU processes are paused at the same time, leveraging a new CUDA driver function to copy GPU memory to system memory.

Step B is Quick Resume: the application resumes running immediately while MemVerge efficiently constructs checkpoint metadata and offloads the complete image to storage in the background.
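The freeze-then-resume pattern in Steps A and B can be illustrated with a small toy model. This is only a sketch under stated assumptions, not MemVerge's implementation: `ToyApp`, `checkpoint`, and the image layout are hypothetical names invented for illustration, a Python lock stands in for pausing CPU and GPU processes, and a `bytearray` stands in for GPU memory.

```python
import pickle
import tempfile
import threading
from pathlib import Path


class ToyApp:
    """Hypothetical stand-in for an application with CPU and GPU state."""

    def __init__(self):
        self.cpu_state = {"step": 0}
        self.gpu_memory = bytearray(16)  # pretend device (HBM) buffer
        self.lock = threading.Lock()     # held while the app is "frozen"

    def run_step(self):
        with self.lock:
            self.cpu_state["step"] += 1
            self.gpu_memory[0] = self.cpu_state["step"] % 256


def checkpoint(app, image_path):
    # Step A (Simultaneous Freeze): hold the lock so no work runs, then
    # copy CPU state and "GPU memory" into a host-side staging buffer.
    with app.lock:
        staged = {
            "cpu_state": dict(app.cpu_state),
            "gpu_memory": bytes(app.gpu_memory),
        }

    # Step B (Quick Resume): the lock is released, so the app can run
    # again while the staged copy is written to storage in the background.
    def offload():
        Path(image_path).write_bytes(pickle.dumps(staged))

    writer = threading.Thread(target=offload)
    writer.start()
    return writer


app = ToyApp()
app.run_step()
image = Path(tempfile.mkdtemp()) / "checkpoint.img"
writer = checkpoint(app, image)
app.run_step()   # the application keeps running during the offload
writer.join()    # wait for the background write to finish
saved = pickle.loads(image.read_bytes())
```

The point of the design is that the application is paused only for the in-memory copy; the slower write to storage happens after it has resumed, so the saved image reflects the moment of the freeze rather than the later state.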

Restore

Step C is to restore CPU memory, GPU memory, and file content from the checkpoint image in storage.

Step D is to restore process states in the CPU and GPU and restart the application.
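The ordering of the restore steps can be sketched the same way. Again, this is a toy model under stated assumptions rather than the product's actual mechanism: the `restore` function, the image fields (`cpu_memory`, `gpu_memory`, `files`, `process_state`), and the file names are all hypothetical and exist only for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def restore(image_path, workdir):
    """Toy restore: rebuild memory and files first, then process state."""
    # pickle is used only because this toy image is trusted local data.
    image = pickle.loads(Path(image_path).read_bytes())

    # Step C: restore CPU memory, GPU memory, and file content.
    cpu_memory = dict(image["cpu_memory"])
    gpu_memory = bytearray(image["gpu_memory"])  # stand-in for device memory
    for name, content in image["files"].items():
        (Path(workdir) / name).write_bytes(content)

    # Step D: restore process state and mark the application as running.
    return {
        "cpu_memory": cpu_memory,
        "gpu_memory": gpu_memory,
        "next_step": image["process_state"]["next_step"],
        "running": True,
    }


# Build a small hypothetical checkpoint image so the example is self-contained.
storage = Path(tempfile.mkdtemp())
image_path = storage / "checkpoint.img"
image_path.write_bytes(pickle.dumps({
    "cpu_memory": {"loss": 0.25},
    "gpu_memory": bytes([7, 7, 7, 7]),
    "files": {"progress.log": b"epoch 3 done\n"},
    "process_state": {"next_step": 4},
}))

workdir = Path(tempfile.mkdtemp())
process = restore(image_path, workdir)
```

Memory and file content are rebuilt before process state so that, when the application restarts, everything it refers to is already in place.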