Emulating CXL Shared Memory Devices in QEMU

by Ryan Willis and Gregory Price

Overview

In this article, we will accomplish the following:

  1. Building and installing a working branch of QEMU
  2. Launching a pre-made QEMU instance with a CXL Memory Expander
  3. Creating a memory region for the CXL Memory Expander
  4. Converting that memory region between DEVDAX and NUMA Modes

Out of scope:

  • Networking. We will not cover setting up ssh/networking for this image.


Build Prerequisites

This guide is written for a Fedora 38 host system; the below list of dependencies may have slightly different naming or versions on other distributions.

QEMU-CXL

As of 6/22/2023, the mainline QEMU does not have full support for creating CXL volatile memory devices, so we need to build a working branch.

This was the list of relevant dependencies we needed to install for a Fedora host. This list may not be entirely accurate, depending on your distribution and version.

sudo dnf group install "C Development Tools and Libraries" "Development Tools"

sudo dnf install golang meson pixman pixman-devel zlib zlib-devel python3 \
    bzip2 bzip2-devel acpica-tools pkgconf-pkg-config libaio libaio-devel \
    liburing liburing-devel libzstd libzstd-devel libgudev ruby rubygem-ncursesw \
    libssh libssh-devel kernel-devel numactl numactl-devel libpmem libpmem-devel \
    libpmem2 libpmem2-devel daxctl daxctl-devel cxl-cli cxl-devel python3-sphinx \
    genisoimage ninja-build libdisk-devel parted-devel util-linux-core bridge-utils \
    libslirp libslirp-devel dbus-daemon dwarves perl

# Optional requirements

sudo dnf install liburing liburing-devel libnfs libnfs-devel libcurl libcurl-devel \
    libgcrypt libgcrypt-devel libpng libpng-devel

# Optional, Fedora 37 or newer

sudo dnf install blkio blkio-devel

If the host runs Ubuntu, install the following dependencies instead. This list may be incomplete on some systems; the QEMU configure step will report any dependencies that are still missing.

sudo apt-get install libaio-dev liburing-dev libnfs-dev \
    libseccomp-dev libcap-ng-dev libxkbcommon-dev libslirp-dev \
    libpmem-dev python3.10-venv numactl libvirt-dev

Jonathan Cameron maintains the CXL subsystem in QEMU. We will use one of his checkpoint branches, which integrate upcoming work such as volatile memory support.

git clone https://gitlab.com/jic23/qemu.git

cd qemu

git checkout cxl-2023-05-25

mkdir build

cd build

../configure --prefix=/opt/qemu-jic23 --target-list=i386-softmmu,x86_64-softmmu --enable-libpmem --enable-slirp

# At this step, you may need to install additional dependencies

make -j all

sudo make install

Our QEMU build is now installed at /opt/qemu-jic23. To validate:

/opt/qemu-jic23/bin/qemu-system-x86_64 --version

QEMU emulator version 7.2.50 (v6.2.0-8711-gbeb0973a68)

Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
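Beyond the version string, it can be worth confirming that the build includes the CXL device models used later in this article. The following is a minimal sketch and not part of the original setup: qemu_has_device is a hypothetical helper name, and it assumes that "-device help" prints one line per device model containing name "<device>".

```shell
# Hypothetical helper: succeed if the QEMU binary lists the named device model.
# Assumes "-device help" prints one line per device containing: name "<device>"
qemu_has_device() {
    local qemu=$1 device=$2
    "$qemu" -device help 2>/dev/null | grep -q "name \"$device\""
}

# Example usage against the build installed above:
#   for dev in pxb-cxl cxl-rp cxl-type3; do
#       if qemu_has_device /opt/qemu-jic23/bin/qemu-system-x86_64 "$dev"; then
#           echo "$dev: available"
#       else
#           echo "$dev: MISSING"
#       fi
#   done
```

If any of the three device models is reported missing, the checkout or the configure step is the first place to look.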

CXL-Enabled Image

First, download the MemVerge guest VM OS image that has CXL support. We recommend that you do not try to build your own for this introduction, as a custom kernel build is required to enable a few non-default options.

VM Guest Image Download: https://app.box.com/shared/static/9wpc6nh0hmz4rrv9mtuk9ini41bd3p11.tgz

wget -O memexp.tgz https://app.box.com/shared/static/9wpc6nh0hmz4rrv9mtuk9ini41bd3p11.tgz

tar -xzf ./memexp.tgz

cd memexp

This image is configured with 4 vCPU, 4GB of DRAM, and a single 4GB CXL memory expander.

sudo /opt/qemu-jic23/bin/qemu-system-x86_64 \
    -drive file=./memexp.qcow2,format=qcow2,index=0,media=disk,id=hd \
    -m 4G,slots=4,maxmem=8G \
    -smp 4 \
    -machine type=q35,cxl=on \
    -daemonize \
    -net nic \
    -net user,hostfwd=tcp::2222-:22 \
    -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
    -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
    -object memory-backend-ram,id=mem0,size=4G \
    -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
    -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G

Note that KVM is NOT enabled. This is because of a bug in emulation which will cause QEMU/KVM to crash if code is executed from the CXL Memory Expander device. This means your execution is likely to be quite slow but should be sufficient for functional testing.

To launch the image, run ./memexp.sh. You should see the following output:

$ ./memexp.sh

VNC server running on 127.0.0.1:5904

The QEMU instance is now booting, and once the boot is complete it will be accessible via SSH.

$ ssh -p 2222 fedora@localhost

fedora@localhost's password:

The password on the instance is password. Once you have entered it, you have shell access to the QEMU instance.
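Because the instance is daemonized, nothing on the host signals when the guest has finished booting. One way to wait is to poll the forwarded port until an SSH server answers. The sketch below is our assumption rather than something shipped with the image: wait_for_ssh and the KEYSCAN variable are hypothetical names, and the check relies on ssh-keyscan printing at least one host key once sshd is up.

```shell
# KEYSCAN is overridable only so the loop can be exercised without a real guest.
KEYSCAN=${KEYSCAN:-ssh-keyscan}

# Hypothetical helper: poll until an SSH server on host:port returns a host key.
# Gives up after the given number of attempts (default 60, one second apart).
wait_for_ssh() {
    local host=$1 port=$2 tries=${3:-60}
    local i=0
    while [ "$i" -lt "$tries" ]; do
        # -T sets the per-connection timeout, -p the port to contact.
        if $KEYSCAN -T 5 -p "$port" "$host" 2>/dev/null | grep -q .; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example usage after ./memexp.sh:
#   wait_for_ssh localhost 2222 300 && ssh -p 2222 fedora@localhost
```

We fetch a host key instead of only probing the port because QEMU's user-mode networking can accept connections on the forwarded port before the guest's sshd is running. Since KVM is disabled, the boot can take several minutes, so a generous attempt count is sensible.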

CXL Memory Regions and Memory Modes

First, launch the image and log in.

  • Username: fedora
  • Password: password

In the fedora user's home directory, you will find a README and three shell scripts.

[fedora@localhost ~]$ ls

create_region.sh dax_mode.sh numa_mode.sh README.txt

You can see the CXL topology by running the cxl list command.

[fedora@localhost ~]$ cxl list

[

{

"memdev":"mem0",

"ram_size":4294967296,

"serial":0,

"host":"0000:35:00.0"

}

]

You can view the extended topology with roots, ports, decoders, and endpoints with cxl list -v[vv].
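The ram_size field in the listing above is reported in bytes. As a small illustration, the sketch below converts it to GiB using only grep and sed. memdev_ram_gib is a hypothetical helper, and the text matching is a shortcut that assumes the "ram_size":<number> form shown above; a real JSON parser would be more robust.

```shell
# Hypothetical helper: read "cxl list" JSON on stdin and print each ram_size in GiB.
# Relies on plain text matching of "ram_size":<number>, not on real JSON parsing.
memdev_ram_gib() {
    grep -o '"ram_size":[[:space:]]*[0-9][0-9]*' |
    sed 's/.*://; s/[[:space:]]//g' |
    while read -r bytes; do
        echo "$((bytes / 1024 / 1024 / 1024)) GiB"
    done
}

# Example usage inside the guest:
#   cxl list | memdev_ram_gib
```

For the single expander in this image, 4294967296 bytes corresponds to 4 GiB, which matches the size passed to memory-backend-ram on the QEMU command line.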

Creating a Memory Region

We must create a memory region to enable the CXL memory for use. This can be done by running the ./create_region.sh script. A new memory-only NUMA node should appear in numactl --hardware.

[fedora@localhost ~]$ ./create_region.sh

[ 301.749627] cxl_region region0: Bypassing cpu_cache_invalidate_memregion() for testing!

{

"region":"region0",

"resource":"0x390000000",

"size":"4.00 GiB (4.29 GB)",

"type":"ram",

"interleave_ways":1,

"interleave_granularity":4096,

"decode_state":"commit",

"mappings":[

{

"position":0,

"memdev":"mem0",

"decoder":"decoder2.0"

}

]

}

cxl region: cmd_create_region: created 1 region

[ 302.038616] Fallback order for Node 1: 1 0

[ 302.039703] Built 1 zonelists, mobility grouping on. Total pages: 973364

[ 302.040478] Policy zone: Normal

[ 302.181976] Fallback order for Node 0: 0 1

[ 302.182061] Fallback order for Node 1: 1 0

[ 302.182082] Built 2 zonelists, mobility grouping on. Total pages: 1006132

[ 302.183361] Policy zone: Normal

onlined memory for 1 device

[fedora@localhost ~]$ numactl --hardware

available: 2 nodes (0-1)

node 0 cpus: 0 1 2 3

node 0 size: 3901 MB

node 0 free: 3521 MB

node 1 cpus:

node 1 size: 4096 MB

node 1 free: 4096 MB

node distances:

node 0 1

0: 10 20

1: 20 10
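In the output above, node 1 has memory but an empty CPU list, which is how the CXL region shows up. The following sketch finds such CPU-less nodes from sysfs. cpuless_nodes is a hypothetical helper, and the sysfs root is a parameter only so that the function can be tried against a fake directory tree.

```shell
# Hypothetical helper: print the IDs of NUMA nodes whose cpulist is empty.
# Defaults to the standard sysfs location; pass another root to test it.
cpuless_nodes() {
    local root=${1:-/sys/devices/system/node} dir
    for dir in "$root"/node[0-9]*; do
        [ -r "$dir/cpulist" ] || continue
        if [ -z "$(cat "$dir/cpulist" | tr -d '[:space:]')" ]; then
            echo "${dir##*/node}"
        fi
    done
}

# Example usage inside the guest, binding a command's allocations to the
# first CPU-less node that is found:
#   node=$(cpuless_nodes | head -n 1)
#   numactl --membind="$node" ls
```

Binding with numactl --membind is a simple way to check that allocations actually land on the expander: the free count for that node in numactl --hardware should drop while a larger workload is running.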

Converting between NUMA and DAX Mode

The memory expander can also be used in direct-access (DAX) mode, in which software accesses the memory by calling mmap on /dev/dax0.0.

Our image includes scripts to convert between modes after creating the memory region.

To convert to DAX mode, run ./dax_mode.sh.

[fedora@localhost ~]$ ./dax_mode.sh

[ 518.680697] Fallback order for Node 0: 0 1

[ 518.680801] Fallback order for Node 1: 1 0

[ 518.682309] Built 2 zonelists, mobility grouping on. Total pages: 973364

[ 518.682988] Policy zone: Normal

offlined memory for 1 device

[ 519.226271] memmap_init_zone_device initialised 1048576 pages in 23ms

[

{

"chardev":"dax0.0",

"size":4294967296,

"target_node":1,

"align":2097152,

"mode":"devdax"

}

]

reconfigured 1 device

To convert back to NUMA mode, run ./numa_mode.sh.

[fedora@localhost ~]$ ./numa_mode.sh

[ 563.620382] Fallback order for Node 1: 1 0

[ 563.620506] Built 1 zonelists, mobility grouping on. Total pages: 973364

[ 563.621969] Policy zone: Normal

[

{

"chardev":"dax0.0",

"size":4294967296,

"target_node":1,

"align":2097152,

"mode":"system-ram",

"online_memblocks":0,

"total_memblocks":32

}

]

reconfigured 1 device

[ 564.118291] Fallback order for Node 0: 0 1

[ 564.118385] Fallback order for Node 1: 1 0

[ 564.118409] Built 2 zonelists, mobility grouping on. Total pages: 1006132

[ 564.119729] Policy zone: Normal

onlined memory for 1 device

Advanced Information

Kernel Image

As noted earlier, the kernel image on this VM is a custom build with a few kernel configuration options changed from the default. If you want to update the kernel on this image, one of these options is REQUIRED in order for memory expanders to operate correctly in QEMU.

CONFIG_CXL_REGION_INVALIDATION_TEST

This option is REQUIRED. If you do not enable it, attempts to create a memory region will fail with the following error:

[ 4825.630746] cxl_region region0: Failed to synchronize CPU cache state

CONFIG_CXL_MEM_RAW_COMMANDS

This option is optional. It allows users to write raw CXL mailbox commands to the memory expander device via the character device in /dev/. This will be used in future articles.

CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE

This option is optional for this article but will be required for multi-headed devices to ensure un-provisioned memory is not set to online by default. This will be relevant in future articles.
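Before rebooting into an updated kernel, the three options above can be checked against that kernel's config file. This is a minimal sketch: kconfig_state is a hypothetical helper, and the /boot/config-$(uname -r) path in the usage comment is an assumption that holds on Fedora-style installs but not on every distribution.

```shell
# Hypothetical helper: print the value of a kernel config option (for example
# "y" or "m"), or "not set" when the option is absent or commented out.
kconfig_state() {
    local file=$1 opt=$2 line
    if line=$(grep -E "^${opt}=" "$file"); then
        echo "${line#*=}"
    else
        echo "not set"
    fi
}

# Example usage (the config path is an assumption, see above):
#   for opt in CONFIG_CXL_REGION_INVALIDATION_TEST \
#              CONFIG_CXL_MEM_RAW_COMMANDS \
#              CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE; do
#       echo "$opt: $(kconfig_state "/boot/config-$(uname -r)" "$opt")"
#   done
```

Only the first option has to be enabled for the steps in this article; the other two are listed for completeness.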

NDCTL Installation

This image also uses a manually installed NDCTL distribution. NDCTL provides the cxl-cli and daxctl commands. We used v77 because it is synced with the kernel 6.3.x release. If you update the kernel, you may also need to update this package.

https://github.com/pmem/ndctl

CXL-CLI and DAXCTL Commands

The three scripts used to create memory regions and change memory modes are very simple 2-line scripts that execute cxl-cli and daxctl commands.

create_region.sh

#!/bin/bash

sudo cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0

sudo daxctl online-memory dax0.0

cxl create-region creates the memory region. daxctl online-memory converts the DAX device created by cxl create-region into NUMA mode and onlines the memory as system-ram.

After cxl create-region, the CXL topology includes region0 and dax_region0, and the DAX topology includes a new dax0.0 device. After daxctl online-memory, a new memory-only NUMA node is created and the memory is onlined.

[fedora@localhost ~]$ ls /sys/bus/cxl/devices/

decoder0.0 decoder2.0 decoder2.2 endpoint2 nvdimm-bridge0 root0

decoder1.0 decoder2.1 decoder2.3 mem0 port1

[fedora@localhost ~]$ cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0

[ 301.749627] cxl_region region0: Bypassing cpu_cache_invalidate_memregion() for testing!

{

"region":"region0",

"resource":"0x390000000",

"size":"4.00 GiB (4.29 GB)",

"type":"ram",

"interleave_ways":1,

"interleave_granularity":4096,

"decode_state":"commit",

"mappings":[

{

"position":0,

"memdev":"mem0",

"decoder":"decoder2.0"

}

]

}

cxl region: cmd_create_region: created 1 region

[ 302.038616] Fallback order for Node 1: 1 0

[ 302.039703] Built 1 zonelists, mobility grouping on. Total pages: 973364

[ 302.040478] Policy zone: Normal

[ 302.181976] Fallback order for Node 0: 0 1

[ 302.182061] Fallback order for Node 1: 1 0

[ 302.182082] Built 2 zonelists, mobility grouping on. Total pages: 1006132

[ 302.183361] Policy zone: Normal

onlined memory for 1 device

[fedora@localhost ~]$ ls /sys/bus/cxl/devices/

dax_region0 decoder1.0 decoder2.1 decoder2.3 mem0 port1 root0

decoder0.0 decoder2.0 decoder2.2 endpoint2 nvdimm-bridge0 region0

[fedora@localhost ~]$ ls /sys/bus/dax/devices/

dax0.0

[fedora@localhost ~]$ sudo daxctl online-memory dax0.0

[ 1714.185899] Fallback order for Node 0: 0 1

[ 1714.186018] Fallback order for Node 1: 1 0

[ 1714.186042] Built 2 zonelists, mobility grouping on. Total pages: 1006132

[ 1714.187274] Policy zone: Normal

onlined memory for 1 device

[fedora@localhost ~]$ numactl --hardware

available: 2 nodes (0-1)

node 0 cpus: 0 1 2 3

node 0 size: 3901 MB

node 0 free: 3511 MB

node 1 cpus:

node 1 size: 4096 MB

node 1 free: 4096 MB

node distances:

node 0 1

0: 10 20

1: 20 10

dax_mode.sh

sudo daxctl offline-memory dax0.0

sudo daxctl reconfigure-device -m devdax dax0.0

This script offlines the memory from the DAX device and drops the NUMA node from the topology. The /dev/dax0.0 device can now be used with mmap in software. Note that the /dev/dax0.0 device cannot be used with mmap while in NUMA mode.

[fedora@localhost ~]$ sudo daxctl offline-memory dax0.0

[ 1887.176101] Fallback order for Node 0: 0 1

[ 1887.176224] Fallback order for Node 1: 1 0

[ 1887.176253] Built 2 zonelists, mobility grouping on. Total pages: 973365

[ 1887.177886] Policy zone: Normal

offlined memory for 1 device

[fedora@localhost ~]$ sudo daxctl reconfigure-device -m devdax dax0.0

[ 1892.888340] memmap_init_zone_device initialised 1048576 pages in 24ms

[

{

"chardev":"dax0.0",

"size":4294967296,

"target_node":1,

"align":2097152,

"mode":"devdax"

}

]

reconfigured 1 device

numa_mode.sh

sudo daxctl reconfigure-device -m system-ram --no-online dax0.0

sudo daxctl online-memory dax0.0

This script does the opposite of dax_mode.sh. It creates the NUMA node from dax0.0, then onlines the memory for that NUMA node.

[fedora@localhost ~]$ sudo daxctl reconfigure-device -m system-ram --no-online dax0.0

[ 1922.086858] Fallback order for Node 1: 1 0

[ 1922.086966] Built 1 zonelists, mobility grouping on. Total pages: 973365

[ 1922.087620] Policy zone: Normal

[

{

"chardev":"dax0.0",

"size":4294967296,

"target_node":1,

"align":2097152,

"mode":"system-ram",

"online_memblocks":0,

"total_memblocks":32

}

]

reconfigured 1 device

[fedora@localhost ~]$ sudo daxctl online-memory dax0.0

[ 2011.746550] Fallback order for Node 0: 0 1

[ 2011.746678] Fallback order for Node 1: 1 0

[ 2011.747532] Built 2 zonelists, mobility grouping on. Total pages: 1006133

[ 2011.748632] Policy zone: Normal

onlined memory for 1 device
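The two conversion scripts can also be folded into a single toggle that first asks daxctl which mode the device is in. The sketch below is not one of the scripts on the image. toggle_dax_mode, dax_mode_from_json, and the DAXCTL variable are hypothetical names, and the mode is pulled out of the JSON with sed as a shortcut instead of a real JSON parser.

```shell
# DAXCTL is overridable only so the logic can be exercised without a DAX device.
DAXCTL=${DAXCTL:-"sudo daxctl"}

# Hypothetical helper: print the first "mode" value found in daxctl JSON on stdin.
dax_mode_from_json() {
    sed -n 's/.*"mode":[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: switch a DAX device to the other mode, using the same
# command sequences as dax_mode.sh and numa_mode.sh.
toggle_dax_mode() {
    local dev=$1 mode
    mode=$($DAXCTL list -d "$dev" | dax_mode_from_json)
    case "$mode" in
        system-ram)
            $DAXCTL offline-memory "$dev" &&
            $DAXCTL reconfigure-device -m devdax "$dev"
            ;;
        devdax)
            $DAXCTL reconfigure-device -m system-ram --no-online "$dev" &&
            $DAXCTL online-memory "$dev"
            ;;
        *)
            echo "unknown mode for $dev: '$mode'" >&2
            return 1
            ;;
    esac
}

# Example usage inside the guest:
#   toggle_dax_mode dax0.0
```

Each branch stops at the first failing step, so a device that cannot be offlined is left in system-ram mode instead of being reconfigured.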

Creating additional Memory Expanders

The easiest way to create new memory expanders on the current QEMU branch is to add additional CXL buses and root ports. Technically, it is possible to add multiple memory expanders to the same bus, but we have found this to be somewhat problematic. For now, consider the following configuration.

sudo /opt/qemu-jic23/bin/qemu-system-x86_64 \
    -drive file=./memexp.qcow2,format=qcow2,index=0,media=disk,id=hd \
    -m 4G,slots=4,maxmem=8G \
    -smp 4 \
    -machine type=q35,cxl=on \
    -nographic \
    -device virtio-net \
    -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
    -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=191 \
    -device pxb-cxl,id=cxl.2,bus=pcie.0,bus_nr=230 \
    -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
    -device cxl-rp,id=rp1,bus=cxl.1,chassis=0,port=1,slot=1 \
    -device cxl-rp,id=rp2,bus=cxl.2,chassis=0,port=2,slot=2 \
    -object memory-backend-ram,id=mem0,size=4G \
    -object memory-backend-ram,id=mem1,size=4G \
    -object memory-backend-ram,id=mem2,size=4G \
    -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,sn=12345 \
    -device cxl-type3,bus=rp1,volatile-memdev=mem1,id=cxl-mem1,sn=34567 \
    -device cxl-type3,bus=rp2,volatile-memdev=mem2,id=cxl-mem2,sn=56789 \
    -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,cxl-fmw.2.targets.0=cxl.2,cxl-fmw.2.size=4G

This configuration has 3 CXL buses with one root port each. Each root port is attached to a memory expander backed by 4GB of RAM. There are 3 CFMW (CXL Fixed Memory Windows), one for each memory expander.
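The per-expander arguments follow a regular pattern: one pxb-cxl bus, one root port, one memory backend, one cxl-type3 device, and one fixed memory window entry, all sharing the same index. The sketch below generates those arguments from a list of bus numbers. cxl_expander_args is a hypothetical helper, it groups the arguments per expander instead of per device type, and it omits the sn= serial numbers, so treat it as an illustration of the pattern and not as a drop-in replacement for the command above.

```shell
# Hypothetical helper: print the QEMU arguments for one 4G CXL memory expander
# per bus number given, one argument per line, plus the combined -M option.
cxl_expander_args() {
    local i=0 size=4G fmw="" bus_nr
    for bus_nr in "$@"; do
        echo "-device pxb-cxl,id=cxl.$i,bus=pcie.0,bus_nr=$bus_nr"
        echo "-device cxl-rp,id=rp$i,bus=cxl.$i,chassis=0,port=$i,slot=$i"
        echo "-object memory-backend-ram,id=mem$i,size=$size"
        echo "-device cxl-type3,bus=rp$i,volatile-memdev=mem$i,id=cxl-mem$i"
        # Append one fixed memory window per expander, comma separated.
        fmw="${fmw:+$fmw,}cxl-fmw.$i.targets.0=cxl.$i,cxl-fmw.$i.size=$size"
        i=$((i + 1))
    done
    echo "-M $fmw"
}

# Example usage with the bus numbers from the configuration above:
#   cxl_expander_args 52 191 230
```

Printing the arguments first makes it easy to compare them with the hand-written configuration before pasting them into a launch script.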

It can be difficult to know which decoder is related to which memory expander. If you encounter issues creating memory regions, try mixing and matching decoder0.X and memY in the cxl create-region command until it works.

cxl create-region -m -t ram -d decoder0.X -w 1 -g 4096 memY

Currently, the kernel does not expose a way to determine which decoder maps to which memory device in this topology, so you may need to write a small script to brute-force the combinations.
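A brute-force loop of this kind can be sketched as follows. It tries every root decoder for each memory device with the same cxl create-region arguments used earlier and reports the first pairing that succeeds. brute_force_regions and the CXL_CMD variable are hypothetical names, and the sketch assumes that a failed create-region attempt leaves no partial region behind.

```shell
# CXL_CMD is overridable only so the loop can be exercised without CXL devices.
CXL_CMD=${CXL_CMD:-"sudo cxl"}

# Hypothetical helper: for each memdev, try each decoder until create-region
# succeeds, then print "<decoder> <memdev>" for the pairing that worked.
# Both arguments are space-separated lists.
brute_force_regions() {
    local decoders=$1 memdevs=$2
    local d m
    for m in $memdevs; do
        for d in $decoders; do
            if $CXL_CMD create-region -m -t ram -d "$d" -w 1 -g 4096 "$m" \
                    >/dev/null 2>&1; then
                echo "$d $m"
                break
            fi
        done
    done
}

# Example usage inside the guest, taking the names from sysfs:
#   cd /sys/bus/cxl/devices
#   brute_force_regions "$(echo decoder0.*)" "$(echo mem*)"
```

Memory devices for which no decoder works produce no output line, so comparing the output against the list of memdevs shows which ones still lack a region.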

Running Multiple Instances

If you wish to run multiple instances on the same machine, you'll need to change the host port that QEMU forwards to avoid a conflict. You can use any available host port, but it must map to port 22 on the QEMU instance for SSH.

This can be done by editing memexp.sh.

The following example exposes the instance on port 2223 rather than 2222 (see line 8):

sudo /opt/qemu-jic23/bin/qemu-system-x86_64 \
    -drive file=./memexp.qcow2,format=qcow2,index=0,media=disk,id=hd \
    -m 4G,slots=4,maxmem=8G \
    -smp 4 \
    -machine type=q35,cxl=on \
    -daemonize \
    -net nic \
    -net user,hostfwd=tcp::2223-:22 \
    -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
    -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
    -object memory-backend-ram,id=mem0,size=4G \
    -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
    -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
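When several instances are involved, deriving the port from an instance number avoids editing the value by hand each time. The sketch below is one way to do that. hostfwd_arg is a hypothetical helper, and the convention that instance N uses host port 2222+N is our assumption, chosen so that instance 0 matches the original script.

```shell
# Hypothetical helper: print the "-net user" argument for instance number N,
# mapping host port <base>+N (base defaults to 2222) to guest port 22.
hostfwd_arg() {
    local index=$1 base=${2:-2222}
    echo "user,hostfwd=tcp::$((base + index))-:22"
}

# Example usage in a copy of memexp.sh for the second instance:
#   -net "$(hostfwd_arg 1)" \
# which can then be reached with:
#   ssh -p 2223 fedora@localhost
```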

Future Work

Stay tuned for multi-headed device emulation and memory pooling!