# Enhancing CXL Memory RAS through a Forgotten Coding Theory




Tong Zhang, ScaleFlux Inc.









#### **RS Codes for DRAM Fault Tolerance**

□ Data centers demand strong DRAM fault tolerance

- ✓ Higher reliability & stability of data centers
- ✓ Lower system TCO by accommodating less reliable, lower cost DRAM chips
- ✓ Higher resilience to the security risk caused by DRAM RowHammer attacks

Reed-Solomon (RS) code: One of the most widely used error correction codes (ECC)

Adopted by server CPUs for DRAM fault tolerance

□ Standard RS decoding algorithms correct **up to**  $t = \left\lfloor \frac{d-1}{2} \right\rfloor$  symbol errors, where *d* is the minimum distance between any two codewords
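As a quick illustration of the bound above: an $(n, k)$ RS code is maximum distance separable, so $d = n - k + 1$. The sketch below computes $d$ and $t$ for the (80, 64) code that is evaluated later in this deck. The function name is illustrative and not taken from the talk.

```python
def rs_correction_capability(n, k):
    """Return (d, t) for an (n, k) Reed-Solomon code."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    d = n - k + 1      # RS codes meet the Singleton bound (MDS property)
    t = (d - 1) // 2   # guaranteed correction radius of standard decoding
    return d, t


d, t = rs_correction_capability(80, 64)
print(f"d = {d}, t = {t}")  # d = 17, t = 8
```

With 16 parity symbols, standard decoding therefore guarantees correction of at most 8 symbol errors per codeword.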



#### **Beyond-Minimum-Distance Decoding**

Correct more than $t$ errors?!

□ List decoding, proposed by Professor Peter Elias at MIT in the 1950s

> Rationale: Minimum-distance-based decoding *significantly under-utilizes* the Euclidean code space
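The rationale can be illustrated with a toy experiment. The sketch below is not the ScaleFlux decoder: it is a brute-force list decoder for a hypothetical (10, 2) RS code over GF(11), chosen only because its codebook of 121 codewords is small enough to enumerate. It injects $t + 1$ symbol errors and returns every codeword within that radius of the received word. All names and parameters are assumptions made for illustration.

```python
import random
from itertools import product

P = 11            # toy prime field GF(11) (assumption, not the talk's field)
N, K = 10, 2      # toy (10, 2) RS code: d = 9, t = 4
T = (N - K) // 2


def encode(msg):
    """Evaluate the polynomial msg[0] + msg[1]*x at points 0..N-1 over GF(P)."""
    return tuple(sum(c * pow(x, i, P) for i, c in enumerate(msg)) % P
                 for x in range(N))


CODEBOOK = [encode(m) for m in product(range(P), repeat=K)]


def hamming(a, b):
    """Number of symbol positions in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def list_decode(received, radius):
    """Brute force: all codewords within `radius` symbol errors of `received`."""
    return [c for c in CODEBOOK if hamming(c, received) <= radius]


def add_errors(codeword, errors):
    """Apply symbol errors given as {position: nonzero delta}."""
    word = list(codeword)
    for pos, delta in errors.items():
        word[pos] = (word[pos] + delta) % P
    return tuple(word)


def unique_fraction(num_errors, trials=300, seed=0):
    """Fraction of sampled error patterns whose list is exactly [sent]."""
    rng = random.Random(seed)
    unique = 0
    for _ in range(trials):
        sent = rng.choice(CODEBOOK)
        positions = rng.sample(range(N), num_errors)
        errors = {p: rng.randrange(1, P) for p in positions}
        if list_decode(add_errors(sent, errors), num_errors) == [sent]:
            unique += 1
    return unique / trials


sent = encode((3, 7))
received = add_errors(sent, {0: 1, 1: 1, 2: 2, 3: 3, 4: 5})  # T + 1 = 5 errors
print(list_decode(received, T + 1) == [sent])  # True
print(unique_fraction(T + 1))                  # sampled estimate
```

For this particular pattern the transmitted codeword is the only candidate within radius $t + 1$, even though standard decoding would give up. Some patterns do yield more than one candidate, which is why the decoder returns a list and needs a rule for choosing among its entries.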



## Implementation of Standard RS Decoding

Standard RS decoding algorithm **bounded** by minimum distance



**Computational Complexity** 

FROM IDEAS TO IMPACT



Implementation of ultra-low-latency standard RS decoder at reasonable silicon cost





#### Implementation of Standard RS Decoding: Codeword Interleaving




✓ Short decoding latency at low silicon cost

2024

✗ Much weaker DRAM fault tolerance strength
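The trade-off can be made concrete with a small calculation. The slide does not give the interleaving parameters, so the sketch below assumes a hypothetical 4-way interleave that splits one (80, 64) codeword into four (20, 16) codewords, with symbol `i` assigned to sub-codeword `i % 4`. Each short codeword is cheap to decode but only guarantees correction of 2 symbol errors.

```python
def t_of(n, k):
    """Guaranteed correction radius of standard decoding for an (n, k) RS code."""
    return (n - k) // 2


def interleaved_correctable(error_positions, ways, n, k):
    """True if every sub-codeword sees no more errors than its own t.

    Assumes (hypothetically) that symbol i belongs to sub-codeword i % ways.
    """
    sub_t = t_of(n // ways, k // ways)
    per_lane = [0] * ways
    for pos in set(error_positions):
        per_lane[pos % ways] += 1
    return all(count <= sub_t for count in per_lane)


N, K, WAYS = 80, 64, 4
errors = [0, 4, 8]  # three symbol errors that all land in sub-codeword 0

print(t_of(N, K))                                   # 8
print(t_of(N // WAYS, K // WAYS))                   # 2
print(len(errors) <= t_of(N, K))                    # True
print(interleaved_correctable(errors, WAYS, N, K))  # False
```

Under these assumptions, three errors concentrated in one sub-codeword defeat the interleaved scheme, while the single long codeword handles them with margin to spare.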



## ScaleFlux DRAM-Oriented RS List Decoding





# **Application to CXL Memory**








# **Evaluation and Implementation**

□ Three possible outcomes of decoding:


1. Success: outputs the *correct* codeword (good)
2. Detected failure: fails to find a valid codeword and declares a decoding failure (bad)
3. Mis-correction: outputs an incorrect but valid codeword (really bad)

□ FPGA-based platform to confirm the superior error correction power

- Given *e* symbol errors, define $P_{suc}(e)$, $P_{df}(e)$, and $P_{mc}(e)$ as the probabilities of success, detected failure, and mis-correction, respectively

| (80, 64) RS | $e \in [0, 8]$ | $e = 9$ | $e = 10$ | $e = 11$ | $e = 12$ |
|---|---|---|---|---|---|
| SFX list decoding | $P_{suc}(e) = 1$<br>$P_{df}(e) = 0$<br>$P_{mc}(e) = 0$ | $P_{suc}(e) \approx 1$<br>$P_{df}(e) \approx 0$<br>$P_{mc}(e) = 0$ | $P_{suc}(e) \approx 1$<br>$P_{df}(e) \approx 0$<br>$P_{mc}(e) = 0$ | $P_{suc}(e) \approx 1$<br>$P_{df}(e) \approx 2 \times 10^{-11}$<br>$P_{mc}(e) \approx 0$ | $P_{suc}(e) \approx 1$<br>$P_{df}(e) \approx 6 \times 10^{-11}$<br>$P_{mc}(e) \approx 2 \times 10^{-11}$ |
| Standard decoding | $P_{suc}(e) = 1$<br>$P_{df}(e) = 0$<br>$P_{mc}(e) = 0$ | $P_{suc}(e) = 0$<br>$P_{df}(e) \approx 1$<br>$P_{mc}(e) \approx 0$ | $P_{suc}(e) = 0$<br>$P_{df}(e) \approx 1$<br>$P_{mc}(e) \approx 0$ | $P_{suc}(e) = 0$<br>$P_{df}(e) \approx 1$<br>$P_{mc}(e) \approx 0$ | $P_{suc}(e) = 0$<br>$P_{df}(e) \approx 1$<br>$P_{mc}(e) \approx 0$ |
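The three outcomes and their probabilities can be tallied as in the sketch below. This is a minimal illustration of the bookkeeping, not the FPGA test harness from the talk. It assumes a decoder that returns either a codeword or `None` on a declared failure; the function names and the sample trials are made up.

```python
from collections import Counter

OUTCOMES = ("success", "detected_failure", "mis_correction")


def classify(sent, decoded):
    """Classify one decoding attempt.

    `decoded` is the decoder's output codeword, or None if the decoder
    declared a failure (an assumed interface, for illustration only).
    """
    if decoded is None:
        return "detected_failure"
    return "success" if decoded == sent else "mis_correction"


def estimate(trials):
    """Estimate (P_suc, P_df, P_mc) from (sent, decoded) pairs at a fixed e."""
    counts = Counter(classify(sent, decoded) for sent, decoded in trials)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no trials supplied")
    return {name: counts[name] / total for name in OUTCOMES}


# Made-up trials: two successes, one detected failure, one mis-correction.
trials = [
    ((1, 2, 3), (1, 2, 3)),
    ((4, 5, 6), (4, 5, 6)),
    ((7, 8, 9), None),
    ((1, 1, 1), (2, 2, 2)),
]
print(estimate(trials))
# {'success': 0.5, 'detected_failure': 0.25, 'mis_correction': 0.25}
```

Measuring probabilities on the order of $10^{-11}$, as in the table, requires far more trials than a software loop like this can run in reasonable time, which is consistent with the use of an FPGA-based platform.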


□ Silicon implementation: one decoder @ 38GB/s decoding throughput and 5ns decoding latency

Integrated into ScaleFlux CXL 3.1 memory controller (launch in early 2025)



For the very first time, beyond-minimum-distance error correction is brought into DRAM fault tolerance



□ One RS list decoder: 38GB/s decoding throughput and 5ns decoding latency

□ First use case: CXL memory controller (launch early 2025)

Support both DDR4 and DDR5 with zero-cost per-cacheline metadata embedding

**Raise the Standard of DRAM Fault Tolerance** 







#### Call to Action

- How to fully exploit such stronger-than-normal error correction?
  - ✓ System reliability: the impact on the overall system reliability
  - ✓ Memory cost: deploy less reliable, lower cost DRAM chips
  - ✓ System security: mitigate security risk caused by DRAM RowHammer attacks
- Industry-wide collaboration across the device, hardware, and software stacks



## Thank you!

