Emulating CXL Shared Memory Devices in QEMU
by Ryan Willis and Gregory Price
Overview
In this article, we will accomplish the following:
- Building and installing a working branch of QEMU
- Launching a pre-made QEMU instance with a CXL Memory Expander
- Creating a memory region for the CXL Memory Expander
- Converting that memory region between DEVDAX and NUMA Modes
Out of scope:
- Networking. We will not cover setting up ssh/networking for this image.
References:
- QEMU branch: cxl-2023-05-25 (https://gitlab.com/jic23/qemu.git)
- MemVerge's CXL Memory Expander QEMU Image: https://app.box.com/shared/static/9wpc6nh0hmz4rrv9mtuk9ini41bd3p11.tgz
- VM login username/password: fedora/password
Build Prerequisites
This guide is written for a Fedora 38 host system; the below list of dependencies may have slightly different naming or versions on other distributions.
QEMU-CXL
As of 6/22/2023, mainline QEMU does not fully support creating CXL volatile memory devices, so we need to build a working branch.
This was the list of relevant dependencies we needed to install for a Fedora host. This list may not be entirely accurate, depending on your distribution and version.
sudo dnf group install "C Development Tools and Libraries" "Development Tools"
sudo dnf install golang meson pixman pixman-devel zlib zlib-devel python3 \
    bzip2 bzip2-devel acpica-tools pkgconf-pkg-config libaio libaio-devel \
    liburing liburing-devel libzstd libzstd-devel libgudev ruby rubygem-ncursesw \
    libssh libssh-devel kernel-devel numactl numactl-devel libpmem libpmem-devel \
    libpmem2 libpmem2-devel daxctl daxctl-devel cxl-cli cxl-devel python3-sphinx \
    genisoimage ninja-build libdisk-devel parted-devel util-linux-core bridge-utils \
    libslirp libslirp-devel dbus-daemon dwarves perl

# Optional requirements
sudo dnf install liburing liburing-devel libnfs libnfs-devel libcurl libcurl-devel \
    libgcrypt libgcrypt-devel libpng libpng-devel

# Optional, Fedora 37 or newer
sudo dnf install blkio blkio-devel
If the host system runs Ubuntu, install the following dependencies instead. This list may be slightly incomplete depending on your specific system; missing dependencies will be reported during QEMU's configure step.
sudo apt-get install libaio-dev liburing-dev libnfs-dev \
    libseccomp-dev libcap-ng-dev libxkbcommon-dev libslirp-dev \
    libpmem-dev python3.10-venv numactl libvirt-dev
Jonathan Cameron is the QEMU maintainer of the CXL subsystem; we will use one of his checkpoint branches, which integrates future work such as volatile memory support.
git clone https://gitlab.com/jic23/qemu.git
cd qemu
git checkout cxl-2023-05-25
mkdir build
cd build
../configure --prefix=/opt/qemu-jic23 --target-list=i386-softmmu,x86_64-softmmu --enable-libpmem --enable-slirp
# At this step, you may need to install additional dependencies
make -j all
sudo make install
Our QEMU build is now installed at /opt/qemu-jic23. To validate:
/opt/qemu-jic23/bin/qemu-system-x86_64 --version
QEMU emulator version 7.2.50 (v6.2.0-8711-gbeb0973a68)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
CXL-Enabled Image
First, download the MemVerge guest VM OS image that has CXL support. We recommend that you do not try to build your own for this introduction, as a custom kernel build is required to enable a few non-obvious configuration options (see Advanced Information below).
VM Guest Image Download: https://app.box.com/shared/static/9wpc6nh0hmz4rrv9mtuk9ini41bd3p11.tgz
wget -O memexp.tgz https://app.box.com/shared/static/9wpc6nh0hmz4rrv9mtuk9ini41bd3p11.tgz
tar -xzf ./memexp.tgz
cd memexp
This image is configured with 4 vCPUs, 4GB of DRAM, and a single 4GB CXL memory expander. It is launched with the following command:
sudo /opt/qemu-jic23/bin/qemu-system-x86_64 \
    -drive file=./memexp.qcow2,format=qcow2,index=0,media=disk,id=hd \
    -m 4G,slots=4,maxmem=8G \
    -smp 4 \
    -machine type=q35,cxl=on \
    -daemonize \
    -net nic \
    -net user,hostfwd=tcp::2222-:22 \
    -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
    -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
    -object memory-backend-ram,id=mem0,size=4G \
    -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
    -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
Note that KVM is NOT enabled. This is due to an emulation bug that causes QEMU/KVM to crash if code is executed from the CXL memory expander device. Execution will therefore be quite slow, but it should be sufficient for functional testing.
To launch the image, simply run ./memexp.sh. You should see the following output:
$ ./memexp.sh
VNC server running on 127.0.0.1:5904
The QEMU instance is now booting, and once the boot is complete it will be accessible via SSH.
$ ssh -p 2222 fedora@localhost
fedora@localhost's password:
The password on the instance is password; once entered, you have access to the QEMU instance.
CXL Memory Regions and Memory Modes
First, launch the image and login.
- Username: fedora
- Password: password
In the fedora user's home directory, you will find a README and 3 shell scripts.
[fedora@localhost ~]$ ls
create_region.sh dax_mode.sh numa_mode.sh README.txt
You can see the CXL topology by running the cxl list command.
[fedora@localhost ~]$ cxl list
[
{
"memdev":"mem0",
"ram_size":4294967296,
"serial":0,
"host":"0000:35:00.0"
}
]
You can view the extended topology with roots, ports, decoders, and endpoints with cxl list -v[vv].
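If you want to use these fields in scripts, the JSON output can be parsed with standard tools. As a minimal sketch (list_memdevs is our own helper, not part of cxl-cli), this extracts the memdev names with sed, demonstrated here against a sample line of the output above:

```shell
# Hypothetical helper (not part of cxl-cli): pull "memdev" names out of
# `cxl list` JSON so scripts can iterate over devices.
list_memdevs() {
    sed -n 's/.*"memdev":"\([^"]*\)".*/\1/p'
}

# Sample line from the output above; on the guest, use: cxl list | list_memdevs
echo '"memdev":"mem0",' | list_memdevs
# -> mem0
```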
Creating a Memory Region
To make the CXL memory usable, we must first create a memory region. This can be done by running the ./create_region.sh script; afterward, a new memory-only NUMA node should appear in numactl --hardware.
[fedora@localhost ~]$ ./create_region.sh
[ 301.749627] cxl_region region0: Bypassing cpu_cache_invalidate_memregion() for testing!
{
"region":"region0",
"resource":"0x390000000",
"size":"4.00 GiB (4.29 GB)",
"type":"ram",
"interleave_ways":1,
"interleave_granularity":4096,
"decode_state":"commit",
"mappings":[
{
"position":0,
"memdev":"mem0",
"decoder":"decoder2.0"
}
]
}
cxl region: cmd_create_region: created 1 region
[ 302.038616] Fallback order for Node 1: 1 0
[ 302.039703] Built 1 zonelists, mobility grouping on. Total pages: 973364
[ 302.040478] Policy zone: Normal
[ 302.181976] Fallback order for Node 0: 0 1
[ 302.182061] Fallback order for Node 1: 1 0
[ 302.182082] Built 2 zonelists, mobility grouping on. Total pages: 1006132
[ 302.183361] Policy zone: Normal
onlined memory for 1 device
[fedora@localhost ~]$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 3901 MB
node 0 free: 3521 MB
node 1 cpus:
node 1 size: 4096 MB
node 1 free: 4096 MB
node distances:
node 0 1
0: 10 20
1: 20 10
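A quick way for a script to identify the CXL-backed node is to look for nodes that have memory but no CPUs. The memory_only_nodes helper below is our own sketch (not a standard tool), matched against the numactl --hardware output format shown above:

```shell
# Hypothetical helper: print memory-only NUMA nodes (nodes with an empty
# "cpus:" list) from `numactl --hardware` output; these are the CXL nodes.
memory_only_nodes() {
    awk '/^node [0-9]+ cpus:/ && NF == 3 { print $2 }'
}

# On the guest: numactl --hardware | memory_only_nodes
printf 'node 0 cpus: 0 1 2 3\nnode 1 cpus:\n' | memory_only_nodes
# -> 1
```

Once you know the node number, you can bind a workload's allocations to the expander with, for example, numactl --membind=1 ./your_workload (your_workload being any program you want to test).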
Converting between NUMA and DAX Mode
The memory expander can also be used in direct-access (DAX) mode, in which software accesses the memory by mmap-ing /dev/dax0.0.
Our image includes scripts to convert between the two modes after the memory region has been created.
To convert to dax mode, run ./dax_mode.sh.
[fedora@localhost ~]$ ./dax_mode.sh
[ 518.680697] Fallback order for Node 0: 0 1
[ 518.680801] Fallback order for Node 1: 1 0
[ 518.682309] Built 2 zonelists, mobility grouping on. Total pages: 973364
[ 518.682988] Policy zone: Normal
offlined memory for 1 device
[ 519.226271] memmap_init_zone_device initialised 1048576 pages in 23ms
[
{
"chardev":"dax0.0",
"size":4294967296,
"target_node":1,
"align":2097152,
"mode":"devdax"
}
]
reconfigured 1 device
To convert back to NUMA mode, run ./numa_mode.sh.
[fedora@localhost ~]$ ./numa_mode.sh
[ 563.620382] Fallback order for Node 1: 1 0
[ 563.620506] Built 1 zonelists, mobility grouping on. Total pages: 973364
[ 563.621969] Policy zone: Normal
[
{
"chardev":"dax0.0",
"size":4294967296,
"target_node":1,
"align":2097152,
"mode":"system-ram",
"online_memblocks":0,
"total_memblocks":32
}
]
reconfigured 1 device
[ 564.118291] Fallback order for Node 0: 0 1
[ 564.118385] Fallback order for Node 1: 1 0
[ 564.118409] Built 2 zonelists, mobility grouping on. Total pages: 1006132
[ 564.119729] Policy zone: Normal
onlined memory for 1 device
Advanced Information
Kernel Image
As noted earlier, the kernel on this VM is a custom build with a few configuration options changed from the defaults. If you want to update the kernel on this image, the first option below is REQUIRED for memory expanders to operate correctly in QEMU.
CONFIG_CXL_REGION_INVALIDATION_TEST
This option is REQUIRED. If you do not enable this kernel option, when attempting to create a memory region, you will see a failure condition.
[ 4825.630746] cxl_region region0: Failed to synchronize CPU cache state
CONFIG_CXL_MEM_RAW_COMMANDS
This option is optional. It allows users to write raw CXL mailbox commands to the memory expander device via the character device in /dev/. This will be used in future articles.
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
This option is optional for this article but will be required for multi-headed devices to ensure un-provisioned memory is not set to online by default. This will be relevant in future articles.
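Before rebuilding the kernel, you can check whether the running kernel already has an option enabled. The check_kconfig helper below is our own sketch; on Fedora guests, the running kernel's config is typically available at /boot/config-$(uname -r):

```shell
# Hypothetical helper: verify that a kernel config option is built in (=y).
# On the guest:
#   check_kconfig CONFIG_CXL_REGION_INVALIDATION_TEST /boot/config-"$(uname -r)"
check_kconfig() {
    local opt=$1 config=$2
    if grep -q "^${opt}=y" "$config"; then
        echo "${opt} is enabled"
    else
        echo "${opt} is NOT enabled" >&2
        return 1
    fi
}
```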
NDCTL Installation
This image also uses a manually installed NDCTL distribution. NDCTL provides the cxl-cli and daxctl commands. We used v77 because it is synced with the kernel 6.3.x release. If you update the kernel, you may also need to update this package.
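If you need to rebuild NDCTL after a kernel update, the upstream tree builds with meson. The steps below are a sketch of what we would expect for v77 (verify against the ndctl README for your version):

```shell
# Sketch: build and install ndctl v77 from source (provides ndctl, daxctl,
# and cxl-cli). Assumes the upstream meson-based build; check the README.
git clone https://github.com/pmem/ndctl.git
cd ndctl
git checkout v77
meson setup build
meson compile -C build
sudo meson install -C build
```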
CXL-CLI and DAXCTL Commands
The three scripts used to create memory regions and change memory modes are very simple 2-line scripts that execute cxl-cli and daxctl commands.
create_region.sh
#!/bin/bash
sudo cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0
sudo daxctl online-memory dax0.0
cxl create-region creates the memory region. daxctl online-memory then converts the DAX device created by cxl create-region into NUMA mode and onlines the memory as system-ram.
After cxl create-region, the CXL topology changes to include region0 and dax_region0, and the DAX topology includes a new dax0.0 device. After daxctl online-memory, a new memory-only NUMA node is created and the memory is onlined.
[fedora@localhost ~]$ ls /sys/bus/cxl/devices/
decoder0.0 decoder2.0 decoder2.2 endpoint2 nvdimm-bridge0 root0
decoder1.0 decoder2.1 decoder2.3 mem0 port1
[fedora@localhost ~]$ cxl create-region -m -t ram -d decoder0.0 -w 1 -g 4096 mem0
[ 301.749627] cxl_region region0: Bypassing cpu_cache_invalidate_memregion() for testing!
{
"region":"region0",
"resource":"0x390000000",
"size":"4.00 GiB (4.29 GB)",
"type":"ram",
"interleave_ways":1,
"interleave_granularity":4096,
"decode_state":"commit",
"mappings":[
{
"position":0,
"memdev":"mem0",
"decoder":"decoder2.0"
}
]
}
cxl region: cmd_create_region: created 1 region
[ 302.038616] Fallback order for Node 1: 1 0
[ 302.039703] Built 1 zonelists, mobility grouping on. Total pages: 973364
[ 302.040478] Policy zone: Normal
[ 302.181976] Fallback order for Node 0: 0 1
[ 302.182061] Fallback order for Node 1: 1 0
[ 302.182082] Built 2 zonelists, mobility grouping on. Total pages: 1006132
[ 302.183361] Policy zone: Normal
onlined memory for 1 device
[fedora@localhost ~]$ ls /sys/bus/cxl/devices/
dax_region0 decoder1.0 decoder2.1 decoder2.3 mem0 port1 root0
decoder0.0 decoder2.0 decoder2.2 endpoint2 nvdimm-bridge0 region0
[fedora@localhost ~]$ ls /sys/bus/dax/devices/
dax0.0
[fedora@localhost ~]$ sudo daxctl online-memory dax0.0
[ 1714.185899] Fallback order for Node 0: 0 1
[ 1714.186018] Fallback order for Node 1: 1 0
[ 1714.186042] Built 2 zonelists, mobility grouping on. Total pages: 1006132
[ 1714.187274] Policy zone: Normal
onlined memory for 1 device
[fedora@localhost ~]$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 3901 MB
node 0 free: 3511 MB
node 1 cpus:
node 1 size: 4096 MB
node 1 free: 4096 MB
node distances:
node 0 1
0: 10 20
1: 20 10
dax_mode.sh
sudo daxctl offline-memory dax0.0
sudo daxctl reconfigure-device -m devdax dax0.0
This script offlines the memory from the DAX device and drops the NUMA node from the topology. The /dev/dax0.0 device can now be used with mmap in software. Note: /dev/dax0.0 cannot be used with mmap while in NUMA mode.
[fedora@localhost ~]$ sudo daxctl offline-memory dax0.0
[ 1887.176101] Fallback order for Node 0: 0 1
[ 1887.176224] Fallback order for Node 1: 1 0
[ 1887.176253] Built 2 zonelists, mobility grouping on. Total pages: 973365
[ 1887.177886] Policy zone: Normal
offlined memory for 1 device
[fedora@localhost ~]$ sudo daxctl reconfigure-device -m devdax dax0.0
[ 1892.888340] memmap_init_zone_device initialised 1048576 pages in 24ms
[
{
"chardev":"dax0.0",
"size":4294967296,
"target_node":1,
"align":2097152,
"mode":"devdax"
}
]
reconfigured 1 device
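When mapping /dev/dax0.0, the mmap length (and offset) must be a multiple of the device alignment, which daxctl reports above as "align":2097152 (2 MiB). This round-up helper is our own sketch for computing a valid mapping length:

```shell
# Hypothetical helper: round a requested size up to the devdax alignment
# (2097152 bytes, i.e. 2 MiB, as reported by daxctl for this device).
dax_align_up() {
    local size=$1 align=${2:-2097152}
    echo $(( (size + align - 1) / align * align ))
}

dax_align_up 1000000
# -> 2097152
```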
numa_mode.sh
sudo daxctl reconfigure-device -m system-ram --no-online dax0.0
sudo daxctl online-memory dax0.0
This script does the opposite of dax_mode.sh. It creates the NUMA node from dax0.0, then onlines the memory for that NUMA node.
[fedora@localhost ~]$ sudo daxctl reconfigure-device -m system-ram --no-online dax0.0
[ 1922.086858] Fallback order for Node 1: 1 0
[ 1922.086966] Built 1 zonelists, mobility grouping on. Total pages: 973365
[ 1922.087620] Policy zone: Normal
[
{
"chardev":"dax0.0",
"size":4294967296,
"target_node":1,
"align":2097152,
"mode":"system-ram",
"online_memblocks":0,
"total_memblocks":32
}
]
reconfigured 1 device
[fedora@localhost ~]$ sudo daxctl online-memory dax0.0
[ 2011.746550] Fallback order for Node 0: 0 1
[ 2011.746678] Fallback order for Node 1: 1 0
[ 2011.747532] Built 2 zonelists, mobility grouping on. Total pages: 1006133
[ 2011.748632] Policy zone: Normal
onlined memory for 1 device
Creating additional Memory Expanders
The easiest way to create new memory expanders on the current QEMU branch is to add additional CXL buses and root ports. Technically it is possible to attach multiple memory expanders to the same bus, but we have found this to be somewhat problematic. For now, consider the following configuration.
sudo /opt/qemu-jic23/bin/qemu-system-x86_64 \
    -drive file=./memexp.qcow2,format=qcow2,index=0,media=disk,id=hd \
    -m 4G,slots=4,maxmem=8G \
    -smp 4 \
    -machine type=q35,cxl=on \
    -nographic \
    -device virtio-net \
    -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
    -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=191 \
    -device pxb-cxl,id=cxl.2,bus=pcie.0,bus_nr=230 \
    -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
    -device cxl-rp,id=rp1,bus=cxl.1,chassis=0,port=1,slot=1 \
    -device cxl-rp,id=rp2,bus=cxl.2,chassis=0,port=2,slot=2 \
    -object memory-backend-ram,id=mem0,size=4G \
    -object memory-backend-ram,id=mem1,size=4G \
    -object memory-backend-ram,id=mem2,size=4G \
    -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0,sn=12345 \
    -device cxl-type3,bus=rp1,volatile-memdev=mem1,id=cxl-mem1,sn=34567 \
    -device cxl-type3,bus=rp2,volatile-memdev=mem2,id=cxl-mem2,sn=56789 \
    -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,cxl-fmw.2.targets.0=cxl.2,cxl-fmw.2.size=4G
This configuration has 3 CXL buses with one root port each. Each root port is attached to a memory expander backed by 4GB of RAM. There are 3 CXL Fixed Memory Windows (CFMWs), one for each memory expander.
It can be difficult to know which decoder is related to which memory expander. If you encounter issues creating memory regions, try mixing and matching decoder0.X and memY in the cxl create-region command until it works.
cxl create-region -m -t ram -d decoder0.X -w 1 -g 4096 memY
Currently, the kernel does not expose a way to determine which decoder maps to which memory device in this topology, so you may need to write a small script to brute-force the combinations.
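As a starting point for such a script, the loop below prints a create-region command for every decoder/memdev pairing; the loop bounds are hypothetical and should match your topology. Run the commands (by hand, or pipe them to a shell) until one succeeds, then note which pairing worked:

```shell
# Hypothetical brute-force sketch: generate a create-region command for
# every decoder/memdev combination in a 3-expander topology.
for d in 0 1 2; do
    for m in 0 1 2; do
        echo "sudo cxl create-region -m -t ram -d decoder0.${d} -w 1 -g 4096 mem${m}"
    done
done
```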
Running Multiple Instances
If you wish to run multiple instances on the same machine, you'll need to change the host port QEMU uses to avoid conflicts. You can use any available port, but it must forward to port 22 on the QEMU instance for SSH.
This can be done by editing memexp.sh.
The following example exposes the instance on port 2223 rather than 2222 (see the hostfwd option):
sudo /opt/qemu-jic23/bin/qemu-system-x86_64 \
    -drive file=./memexp.qcow2,format=qcow2,index=0,media=disk,id=hd \
    -m 4G,slots=4,maxmem=8G \
    -smp 4 \
    -machine type=q35,cxl=on \
    -daemonize \
    -net nic \
    -net user,hostfwd=tcp::2223-:22 \
    -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=52 \
    -device cxl-rp,id=rp0,bus=cxl.0,chassis=0,port=0,slot=0 \
    -object memory-backend-ram,id=mem0,size=4G \
    -device cxl-type3,bus=rp0,volatile-memdev=mem0,id=cxl-mem0 \
    -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
Future Work
Stay tuned for multi-headed device emulation and memory pooling!