Introducing Weighted Interleaving in Linux for Enhanced Memory Bandwidth Management

With the release of Linux Kernel 6.9, system administrators have gained a powerful new tool for managing memory distribution across NUMA nodes: Weighted Interleaving. This feature is especially beneficial in systems utilizing various types of memory, including traditional DRAM and Compute Express Link (CXL) attached memory. In this article, we’ll explore Weighted Interleaving, how it works, and how to use it.

Rationale for Weighted Interleaving

As applications continue to demand higher bandwidth and larger memory capacities, traditional DRAM configurations face significant limitations. One constraint is the number of available DIMM slots on a motherboard. To meet the growing needs of modern applications, system administrators often have to purchase higher-capacity DDR modules, such as 128 GB, 256 GB, and 512 GB. However, these high-capacity modules come at a significantly higher price per gigabyte than the more common 16 GB, 32 GB, and 64 GB modules. Additionally, even with high-capacity modules, the number of DIMM slots on a motherboard imposes a hard limit on the memory bandwidth available in a system. This limitation can bottleneck the performance of memory-intensive applications, leading to inefficiencies and increased operational costs.

Compute Express Link (CXL) memory devices offer a compelling solution to these challenges. By utilizing CXL memory devices, systems can significantly expand their memory capacity and bandwidth without the prohibitive costs associated with high-capacity DDR modules.

Understanding Weighted Interleaving

For years, memory allocation across NUMA nodes has been a challenge. Existing interleave methods can often result in less-than-optimal distributions of data. The default round-robin scheme, for example, allocates an equal share of memory to every node in the interleave set, which can over-subscribe devices with less bandwidth while leaving devices with higher bandwidth underutilized. This can increase memory latency and ultimately degrade performance. Weighted Interleaving addresses this issue by allowing administrators to assign weights to each NUMA node. Pages can then be allocated asymmetrically – according to each node’s weight – which ensures more efficient distributions of memory in heterogeneous systems.

CXL memory devices will be available from many vendors, and each device SKU will have different capacity, latency, and bandwidth characteristics. While a system’s DRAM modules must be homogeneous in both type and capacity, the CXL devices attached to that same system can have a wide range of characteristics. By assigning weights, it is possible to differentiate CXL devices that are installed within the server from those connected remotely. Alternatively, weights can be used to differentiate high-performance CXL devices from lower-performance devices. In effect, Weighted Interleaving gives users fine-grained control over both memory bandwidth and capacity.

How Weighted Interleaving Works

Weighted Interleaving allows you to specify a unique weight for each NUMA node. These weights determine the distribution of newly allocated pages: a node with a higher weight receives a proportionally larger share of pages, balancing the load according to the different nodes’ capabilities. Configuring weights according to available bandwidth or type of memory (e.g., local or remote) can enhance overall system performance.

Let’s look at a few scenarios where Weighted Interleaving is used.

Scenario 1

Consider a system with two CPU Sockets with attached DRAM (NUMA Nodes 0 and 1), and a CXL Type 3 memory device connected to each CPU (NUMA Nodes 2 and 3). Assume each DRAM NUMA node has a local CPU throughput of up to 300GB/s, and each CXL device has up to 60GB/s to its local CPU. This configuration has a bandwidth ratio of 5:1, meaning DRAM has 5x better bandwidth than CXL. We can configure Weighted Interleaving to use this ratio. For every 5 pages allocated on DRAM, 1 page will be allocated on CXL, thus improving the overall memory bandwidth by 20%.
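As a sketch of how this ratio might be configured (using the per-node sysfs interface described in detail later in this article, with the node numbering from this scenario), the DRAM nodes get a weight of 5 and the CXL nodes a weight of 1:

echo 5 > /sys/kernel/mm/mempolicy/weighted_interleave/node0   # DRAM on socket 0
echo 5 > /sys/kernel/mm/mempolicy/weighted_interleave/node1   # DRAM on socket 1
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2   # CXL attached to socket 0
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node3   # CXL attached to socket 1

With this 5:1 weighting, each socket’s interleave set can draw on roughly 300GB/s + 60GB/s = 360GB/s, the 20% improvement noted above.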

Scenario 2

Consider a system with two CPU Sockets with attached DRAM (NUMA Nodes 0 and 1), and four CXL Type 3 memory devices connected to each CPU (NUMA Nodes 2 to 9). Assume each DRAM NUMA node has a local CPU throughput of up to 300GB/s, and each CXL device has up to 60GB/s to its local CPU. Per socket, that is 300GB/s of DRAM bandwidth versus 4 x 60GB/s = 240GB/s of aggregate CXL bandwidth, a ratio of 1.25:1. As before, we can configure Weighted Interleaving to use this ratio. For every 5 pages allocated on DRAM, 4 pages will be allocated on CXL, almost doubling the available memory bandwidth.
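A sketch of how this might be configured, again using the sysfs interface covered later in this article and the node numbering from this scenario: each DRAM node gets a weight of 5, and each of the eight CXL nodes a weight of 1, so the four CXL devices behind a socket collectively receive 4 pages for every 5 pages placed in that socket’s DRAM.

echo 5 > /sys/kernel/mm/mempolicy/weighted_interleave/node0   # DRAM on socket 0
echo 5 > /sys/kernel/mm/mempolicy/weighted_interleave/node1   # DRAM on socket 1
for n in $(seq 2 9); do                                       # CXL nodes 2 through 9
    echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node$n
done

With these weights, each socket can draw on roughly 300GB/s + 240GB/s = 540GB/s, which is where the “almost doubling” comes from.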

Using Weighted Interleaving

The numactl utility has been updated to support Weighted Interleaving with the -w, --weighted-interleave option and is expected to be available in numactl version 2.0.19 or later. Its usage is described in the updated man page entry:

--weighted-interleave=nodes, -w nodes
       Set a weighted memory interleave policy. Memory will be allocated
       using the weighted ratio for each node, which can be read from
       /sys/kernel/mm/mempolicy/weighted_interleave/node*. 
       When memory cannot be allocated on the current interleave 
       target fall back to other nodes.

As the description outlines, the desired weight for each NUMA node (the number of pages that node receives in each interleave round) must be echoed into /sys/kernel/mm/mempolicy/weighted_interleave/node*. For example, if a user wanted to interleave pages across nodes 0, 1, and 2 in a 5:3:1 ratio, they would run the following lines in their shell:

echo 5 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
echo 3 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2
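You can read the weights back at any time to confirm the new values took effect; as the man page entry notes, the same sysfs files report the current per-node weights:

# print each node file alongside its current weight
grep . /sys/kernel/mm/mempolicy/weighted_interleave/node*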

Then, the ratio can be applied to an application via numactl:

numactl --weighted-interleave=<nodes> [OPTIONS] command args ...
  • <nodes> specifies a comma-delimited list of node numbers, ranges, or all for all nodes.
  • OPTIONS refers to any other numactl option.
  • command args ... executes the application command with optional arguments.

Example Usage

Consider an example where you want to run a program with memory allocated across three NUMA nodes with the weights assigned above. With a 5:3:1 ratio, for every 9 pages allocated, Node 0 will get 5 pages, Node 1 will get 3 pages, and Node 2 will get 1 page (assuming there’s space available in the nodes). You can achieve this with the following command:

numactl --weighted-interleave=0,1,2 ./your_program --args

This configuration ensures that memory allocations are distributed according to the specified weights, optimizing performance based on your system’s memory architecture.
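Per the option description above, the node list can also be given as a range or as all. For example, to weight-interleave allocations across every node in the system:

numactl --weighted-interleave=all ./your_program --args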

Weighted Interleaving in Action

You can experiment with Weighted Interleaving using a simple Python application. This example assumes a 2-socket server (Nodes 0 and 1) with one CXL device attached to Socket 0 (Node 2).

Configure the node/page ratios as shown below:

echo 3 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
echo 2 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2

Paste the following Python invocation into your command line. The script will allocate and initialize 64MiB of memory, and numactl --weighted-interleave=0,1,2 will spread the allocated data across nodes 0, 1, and 2.

echo "import time
size = 4096*4096*4
arr = bytearray(size)
for i in range(size):
    arr[i] = 1
while True:
    time.sleep(1)" |
numactl --weighted-interleave=0,1,2 python3 &

To verify that pages were interleaved across nodes as expected, check the per-node memory usage with numastat:

numastat -p `pidof python3`

Your output should look something like this:

# numastat -p `pidof python3`

Per-node process memory usage (in MBs) for PID 913 (python3)
                           Node 0          Node 1          Node 2
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         0.36            0.24            0.12
Stack                        0.04            0.02            0.02
Private                     33.36           27.62           11.02
----------------  --------------- --------------- ---------------
Total                       33.75           27.89           11.15

                            Total
                  ---------------
Huge                         0.00
Heap                         0.71
Stack                        0.08
Private                     72.00
----------------  ---------------
Total                       72.79

Note that the ratio of memory usage on nodes 0, 1, and 2 is 33:28:11. Barring pages that are bound to local nodes (for example, to ensure that executable pages and shared libraries aren’t allocated on remote memory, which would significantly impede performance), this tracks closely with our desired 3:2:1 distribution.

Conclusion

Weighted Interleaving, introduced with Linux kernel 6.9, represents a significant advancement in memory management for systems that utilize complex NUMA and CXL memory architectures. This feature optimizes page allocation by allowing administrators to assign weights to different NUMA nodes, balancing load according to each node’s capacity. This enhances performance and offers a cost-effective solution for expanding memory capacity and bandwidth.

Explore the possibilities of this innovative feature, optimize your memory allocation strategies, and unlock the full potential of your systems with Weighted Interleaving and CXL memory devices. Your journey to more efficient memory management starts here.


Trying it out

Prerequisites

  • Ubuntu 24.04 or later
  • Kernel 6.9 or later
  • A server with one or more CXL memory devices or QEMU with emulated CXL devices

Update your system

The v6.9 mainline Kernel was officially released on 12th May 2024. Ubuntu doesn’t ship this Kernel version yet but will soon, so we provide two installation methods below. At the time of writing, the dependencies required by the 6.9 Kernel can only be found in Ubuntu 24.04 LTS or later.

To avoid any potential conflicts, it is recommended to update your system before proceeding by running the following commands:

sudo apt update && sudo apt upgrade

If the upgrade installed a newer Kernel version, reboot your system for the changes to take effect:

sudo systemctl reboot

Upgrading Ubuntu to 24.04

Kernel v6.9 depends on packages only available in Ubuntu 24.04. If your system has Ubuntu 22.04, 23.04, or 23.10, use the following steps to upgrade to 24.04. If your system has an older version, it is recommended that you perform a clean installation of 24.04.

Install the ubuntu-release-upgrader-core package:

sudo apt install ubuntu-release-upgrader-core

To start the upgrade to the 24.04 LTS version, use the following:

sudo do-release-upgrade -d

You should see this message indicating you are about to upgrade to 24.04:

$ sudo do-release-upgrade -d
Checking for a new Ubuntu release

= Welcome to Ubuntu 24.04 LTS 'Noble Numbat' =

[...snip...]

Follow the prompts to complete the upgrade process.

Install Kernel 6.9 or later

Use either of the following methods to install the Kernel.

Method 1: Use the mainline Kernel Manager

The Ubuntu Mainline Kernel Installer is distributed through its official Launchpad PPA, which you must add first. In your terminal, run the following command:

sudo add-apt-repository ppa:cappelikan/ppa -y

To update the package list on your system with the newly imported PPA, run the following command:

sudo apt update

To install the Ubuntu Mainline Kernel Installer tool, run the following command in the terminal:

sudo apt install -y mainline

To list all available mainline kernel versions, use the following command:

mainline list

Installing the Latest Kernel Version

Run the following command to download and install the latest mainline kernel version automatically:

sudo mainline install-latest

Verify no errors were reported during the installation process.

Installing a Specific Kernel Version

To install a specific Linux Kernel version, use:

sudo mainline install <version>

Replace <version> with the target version number, for example 6.9. This command simplifies the installation process, ensuring you get the exact kernel version you need.
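For instance, to install the 6.9 Kernel used throughout this article:

sudo mainline install 6.9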

Verify no errors were reported during the installation process.

Reboot the System

You must reboot the system to load the new Kernel:

sudo systemctl reboot

When the system boots, verify the correct Kernel version is being used:

uname -r

Note: If the system fails to boot and displays the message “bad shim signature - you need to load the kernel first”, it is because mainline Kernel packages are not signed (signing is left to the distributions), and the Secure Boot feature prevents unsigned Kernels from loading for security reasons. Enter the BIOS/UEFI settings and disable Secure Boot, and your system should boot with the 6.9 Kernel. Disabling Secure Boot is not recommended for production systems.

Uninstall a Specific Kernel Version

To uninstall a specific kernel version, use:

sudo mainline uninstall <version>

Method 2: Install the Kernel DEB packages

Download the DEB packages from https://kernel.ubuntu.com/mainline/v6.9/amd64/ (this assumes an x86_64 system):

mkdir kernel-6.9
cd kernel-6.9
wget https://kernel.ubuntu.com/mainline/v6.9/amd64/linux-headers-6.9.0-060900-generic_6.9.0-060900.202405122134_amd64.deb
wget https://kernel.ubuntu.com/mainline/v6.9/amd64/linux-headers-6.9.0-060900_6.9.0-060900.202405122134_all.deb
wget https://kernel.ubuntu.com/mainline/v6.9/amd64/linux-image-unsigned-6.9.0-060900-generic_6.9.0-060900.202405122134_amd64.deb
wget https://kernel.ubuntu.com/mainline/v6.9/amd64/linux-modules-6.9.0-060900-generic_6.9.0-060900.202405122134_amd64.deb

Install the downloaded Kernel packages:

sudo dpkg -i *.deb

You must reboot the system to load the new Kernel:

sudo systemctl reboot

When the system boots, verify the correct Kernel version is being used:

uname -r
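The output should match the version of the packages you installed; for the 6.9.0-060900 packages downloaded above, it should look like this:

6.9.0-060900-generic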

Note: If the system fails to boot and displays the message “bad shim signature - you need to load the kernel first”, it is because mainline Kernel packages are not signed (signing is left to the distributions), and the Secure Boot feature prevents unsigned Kernels from loading for security reasons. Enter the BIOS/UEFI settings and disable Secure Boot, and your system should boot with the 6.9 Kernel. Disabling Secure Boot is not recommended for production systems.

Building numactl

The version of numactl provided by your package manager may not support --weighted-interleave. To use this feature, you can build numactl from source.

The following commands install the build dependencies, build the numactl toolset, and place the compiled binaries in a local directory inside your home directory (${HOME}/local):

sudo apt update
sudo apt install build-essential autoconf automake libtool
git clone https://github.com/numactl/numactl
cd numactl
./autogen.sh
./configure --prefix=${HOME}/local
make install

You can then execute ${HOME}/local/bin/numactl or add ${HOME}/local/bin to your ${PATH} environment variable in your shell startup file (~/.bashrc or ~/.zshrc, for example).
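As a quick sanity check, confirm that the freshly built binary runs and can see your system’s NUMA topology:

${HOME}/local/bin/numactl --hardware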

Create a QEMU Environment

See this article for a comprehensive reference regarding NUMA configurations and QEMU.

In the example below, we create a virtualized environment with two CPU sockets and an additional memory-only node. The memory node is not associated with any CPU (similar to how a CXL memory device is seen by the kernel). It assumes you already have a bootable Ubuntu virtual machine image (file=./images/ubuntu.qcow2). If not, you need to install Ubuntu 24.04 inside the virtual machine.

sudo ./qemu-system-x86_64 \
  -drive file=./images/ubuntu.qcow2,format=qcow2,index=0,media=disk,id=hd \
  -nographic \
  -m 4G,slots=4,maxmem=32G \
  -smp 4 \
  -machine type=q35,hmat=on \
  -enable-kvm \
  -object memory-backend-ram,id=m0,size=2G \
  -object memory-backend-ram,id=m1,size=1G \
  -object memory-backend-ram,id=m2,size=1G \
  -numa node,nodeid=0,memdev=m0,cpus=0-1 \
  -numa node,nodeid=1,memdev=m1,cpus=2-3 \
  -numa node,nodeid=2,memdev=m2 \
  -nic user,model=virtio-net-pci

Then follow the instructions above to install Kernel 6.9 and numactl.
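Once the guest is running Kernel 6.9, verify that the topology matches the configuration above; node 2 should be listed with memory but no CPUs, just like a CXL memory expander would appear:

numactl --hardware

From there, you can set per-node weights and run workloads under numactl --weighted-interleave exactly as described earlier in this article.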

Enjoy!