CXL Use Case

Fabric-Attached CXL Memory Accelerates Ray

Ray and GISMO (Global IO-free Shared Memory Objects)

MemVerge Memory Machine for CXL includes Fabric-Attached Memory (FAM) software features for Artificial Intelligence (AI), Machine Learning (ML), and database workloads. One of these is GISMO, a memory object store API that lets applications create and access memory objects across multiple nodes using memory semantics. GISMO reduces or eliminates data transfer over the network, the most costly step of network-based message passing, by letting applications access data directly in the shared memory pool while maintaining cache coherence between processors in different servers.

In a baseline Ray environment, sharing data between processes by message passing is a three-step process:

    1. Writing data to local memory in node A
    2. Passing the message across the network
    3. Writing the data to local memory in node B

With GISMO, node A writes to shared memory and node B reads from that same shared memory. GISMO maintains cache coherence between the nodes and delivers high throughput and low latency in single-writer, multiple-reader application environments such as Ray-based AI workloads.
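The contrast between the two data paths can be sketched on a single host with Python's standard library. This is an analogy only: `multiprocessing.shared_memory` stands in for the fabric-attached memory pool, `pickle` plus a byte copy stands in for network message passing, and every function name below is hypothetical. None of it is the GISMO or Ray API.

```python
import pickle
from multiprocessing import shared_memory


def message_passing_get(payload: bytes) -> bytes:
    """Baseline path: three steps, each touching the data."""
    wire = pickle.dumps(payload)     # step 1: node A writes (serializes) locally
    received = bytes(wire)           # step 2: stand-in for the network transfer
    return pickle.loads(received)    # step 3: node B writes into its local memory


def shared_memory_put(payload: bytes) -> shared_memory.SharedMemory:
    """Writer ("node A") stores the object once in a shared segment."""
    segment = shared_memory.SharedMemory(create=True, size=len(payload))
    segment.buf[:len(payload)] = payload
    return segment


def shared_memory_get(name: str, size: int) -> bytes:
    """Reader ("node B") attaches to the segment by name and reads it."""
    segment = shared_memory.SharedMemory(name=name)
    try:
        # Copied out here only so the result can be compared; a real reader
        # would work on segment.buf in place to stay zero-copy.
        return bytes(segment.buf[:size])
    finally:
        segment.close()


def demo() -> bool:
    """Run both paths on the same payload and check they agree."""
    payload = b"ray-object" * 1000
    via_messages = message_passing_get(payload)
    segment = shared_memory_put(payload)
    try:
        via_shared = shared_memory_get(segment.name, len(payload))
    finally:
        segment.close()
        segment.unlink()   # the single writer owns and frees the segment
    return via_messages == payload == via_shared
```

Calling `demo()` runs both paths on the same payload. The single-writer, multiple-reader pattern appears in the ownership rule: only the writer creates and unlinks the segment, while any number of readers attach and close.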

Mixed DIMM and CXL Memory Configurations

6.75x Faster Remote Get and 2.8x Faster Shuffle Across 4 Nodes

Memory Machine for CXL Fabric-Attached Memory makes Ray clusters IO-free by eliminating object serialization and network transfers for remote object access. It also creates a zero-copy environment, with no duplicate object copies on different nodes. The fabric-attached memory software also reduces object spilling and data skew for each node accessing the memory pool.
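The effect of removing duplicate copies can be illustrated with a small footprint calculation. This is a rough sketch under a stated assumption: without a shared pool, the writing node and every reading node each hold one full copy of the object, while with a shared pool a single copy serves all nodes. The function name and parameters are hypothetical.

```python
def cluster_footprint_gb(object_gb: float, reader_nodes: int, shared_pool: bool) -> float:
    """Total memory used across the cluster by one object, in GB.

    Assumption: without a shared pool, the writer plus each reader node
    holds a full copy; with a shared pool, exactly one copy exists.
    """
    if reader_nodes < 0:
        raise ValueError("reader_nodes must be non-negative")
    copies = 1 if shared_pool else 1 + reader_nodes
    return object_gb * copies


# A 1 GB object read by 3 other nodes in a 4-node cluster.
duplicated = cluster_footprint_gb(1.0, reader_nodes=3, shared_pool=False)
pooled = cluster_footprint_gb(1.0, reader_nodes=3, shared_pool=True)
print(f"duplicated: {duplicated} GB, pooled: {pooled} GB")
```

Under this assumption the memory saved grows linearly with the number of reading nodes, which is also why a shared pool leaves more room before objects have to be spilled.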

In testing performed by MemVerge using software emulation of a pooled CXL memory-sharing environment, Memory Machine for CXL Fabric-Attached Memory software delivered the same access time for a local get of a 1 GB object (0.4 sec), about 6.75x faster access for a remote get of a 1 GB object (0.4 sec versus 2.7 sec), and about 2.8x faster completion of a 50 GB shuffle across 4 nodes (185 sec versus 515 sec).

Shuffle Benchmark Results

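| Test                                                      | Baseline Ray | With GISMO | Difference                                 |
|-----------------------------------------------------------|--------------|------------|--------------------------------------------|
| Local Get 1 GB object                                     | 0.4 sec      | 0.4 sec    | CXL shared memory as fast as local memory  |
| Remote Get 1 GB object                                    | 2.7 sec      | 0.4 sec    | About 6.75x faster                         |
| Shuffle 50 GB, 4 nodes, each 4 cores, 128 GB object store | 515 sec      | 185 sec    | About 2.8x faster                          |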
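
The speedup factors follow directly from the timings in the results above. The short check below recomputes them as baseline time divided by GISMO time; the function and dictionary names are illustrative only.

```python
def speedup(baseline_sec: float, gismo_sec: float) -> float:
    """Return how many times faster the GISMO run is than the baseline."""
    if gismo_sec <= 0:
        raise ValueError("gismo_sec must be positive")
    return baseline_sec / gismo_sec


# (baseline seconds, GISMO seconds) as reported in the benchmark results.
timings = {
    "Local Get 1 GB": (0.4, 0.4),
    "Remote Get 1 GB": (2.7, 0.4),
    "Shuffle 50 GB, 4 nodes": (515.0, 185.0),
}

for test_name, (baseline_sec, gismo_sec) in timings.items():
    print(f"{test_name}: {speedup(baseline_sec, gismo_sec):.2f}x")
# Local Get 1 GB: 1.00x
# Remote Get 1 GB: 6.75x
# Shuffle 50 GB, 4 nodes: 2.78x
```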