CXL Use Case
Fabric-Attached CXL Memory Accelerates Ray
Ray and GISMO (Global IO-free Shared Memory Objects)
MemVerge Memory Machine X includes Fabric-Attached Memory (FAM) software features for Artificial Intelligence (AI), Machine Learning (ML), and database workloads. It includes a memory object store API called GISMO that allows applications to create and access memory objects across multiple nodes using memory semantics. Applications access data directly in the shared memory pool, and GISMO maintains cache coherence between processors in different servers. This reduces or eliminates transferring data over the network, the most costly step of network-based message passing.
In a baseline Ray environment, sharing data between processes using message passing involves a 3-step process:
- Writing data to local memory in node A
- Passing the message across the network
- Writing the data to local memory in node B
Using GISMO, node A writes to shared memory and node B reads from shared memory. GISMO maintains cache coherence between the nodes and delivers high throughput and low latency in single-writer, multiple-reader application environments such as Ray-based AI.
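The difference between the two paths can be sketched on a single machine. The example below is only an analogy: it uses Python's standard `multiprocessing.shared_memory` module as a stand-in for a fabric-attached pool and does not use or represent the GISMO API. The function names `message_passing` and `shared_memory_path` are hypothetical, and the "network hop" is simulated with an in-memory copy.

```python
import pickle
from multiprocessing import shared_memory

# 1 MiB stand-in for a large object (assumption: size chosen for illustration).
payload = bytes(range(256)) * 4096


def message_passing(data: bytes) -> bytes:
    """Three-step path: write on node A, cross the network, write on node B."""
    wire = pickle.dumps(data)       # step 1: node A serializes its local copy
    received = bytearray(wire)      # step 2: simulated network hop (extra copy)
    return pickle.loads(received)   # step 3: node B rebuilds the object locally


def shared_memory_path(data: bytes) -> bytes:
    """Single writer places the object once; a reader attaches and reads it."""
    n = len(data)
    writer = shared_memory.SharedMemory(create=True, size=n)
    try:
        writer.buf[:n] = data       # the writer stores the object one time
        reader = shared_memory.SharedMemory(name=writer.name)
        try:
            # The reader sees the same bytes in place, with no serialization.
            with reader.buf[:n] as view:
                # Copied here only so the result outlives the segment.
                return bytes(view)
        finally:
            reader.close()
    finally:
        writer.close()
        writer.unlink()


if __name__ == "__main__":
    assert message_passing(payload) == payload
    assert shared_memory_path(payload) == payload
    print("both paths return the same object")
```

In the first function the object exists three times (original, wire copy, rebuilt copy). In the second, readers attach to the one stored copy, which is the single-writer, multiple-reader pattern described above.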
Mixed DIMM and CXL Memory Configurations
675% Faster Remote Get and 280% Faster Shuffle across 4 nodes
Memory Machine X Fabric-Attached Memory makes Ray clusters IO-free by eliminating object serialization and transfers over the network for remote object access. Memory Machine X also creates a zero-copy environment, with no duplicate object copies on different nodes. The fabric-attached memory software also reduces object spilling and data skew for each node accessing the memory pool.
In testing performed by MemVerge using software emulation of a pooled CXL memory sharing environment, Memory Machine X Fabric-Attached Memory software delivered the same access time for a local get of a 1 GB object (0.4 sec), 675% faster access time for a remote get of a 1 GB object (0.4 sec versus 2.7 sec), and 280% better performance for a 50 GB shuffle across 4 nodes (185 sec versus 515 sec).
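To see why a shuffle is sensitive to network transfers, it helps to estimate how much data would leave its node in a baseline cluster. The sketch below assumes an all-to-all shuffle with keys spread uniformly across nodes, so each node keeps 1/n of its output and sends the rest elsewhere. That uniform-partition assumption and the helper name `cross_node_gb` are illustrative and do not come from the MemVerge test description.

```python
def cross_node_gb(total_gb: float, nodes: int) -> float:
    """Estimate the data that crosses node boundaries in an all-to-all shuffle.

    Assumes uniform partitioning: of every node's output, one partition
    stays local and the other (nodes - 1) partitions go to other nodes.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return total_gb * (nodes - 1) / nodes


if __name__ == "__main__":
    # Shuffle size and node count taken from the benchmark description.
    print(f"{cross_node_gb(50, 4)} GB would cross the network")
```

Under these assumptions, three quarters of the shuffled data is remote to the node that consumes it, which is the portion a shared memory pool would serve without a network transfer.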
Shuffle Benchmark Results
Benchmark | Baseline Ray | With GISMO | Difference |
---|---|---|---|
Local Get 1GB object | 0.4 sec | 0.4 sec | CXL shared memory as fast as local memory |
Remote Get 1GB object | 2.7 sec | 0.4 sec | 675% faster |
Shuffle 50GB, 4 nodes, each 4 cores, 128 GB object store | 515 sec | 185 sec | 280% faster |
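The figures in the Difference column can be reproduced from the timings as baseline time divided by GISMO time. The short sketch below performs that division; the `speedup` helper and the `results` dictionary are illustrative names, and the timings are copied from the table.

```python
def speedup(baseline_sec: float, gismo_sec: float) -> float:
    """Return how many times faster the GISMO run is than the baseline."""
    return baseline_sec / gismo_sec


# (baseline seconds, GISMO seconds) for each benchmark row.
results = {
    "Local Get 1GB object": (0.4, 0.4),
    "Remote Get 1GB object": (2.7, 0.4),
    "Shuffle 50GB, 4 nodes": (515, 185),
}

if __name__ == "__main__":
    for name, (baseline, gismo) in results.items():
        print(f"{name}: {speedup(baseline, gismo):.2f}x")
```

The ratios come out to about 1.00, 6.75, and 2.78, which the table expresses as equal performance, 675%, and (rounded) 280%.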