Private bug reported:

Compute Express Link (CXL) enables shared and pooled memory through
CXL.mem, allowing multiple hosts and devices to access external memory
expanders. While this improves scalability and utilization, it
introduces challenges in maintaining availability when faults occur in
shared memory regions or along the CXL fabric.

CXL.mem isolation is a key RAS (Reliability, Availability,
Serviceability) capability that ensures faults (e.g., media errors, link
failures, poison propagation) are contained within affected memory
regions, devices, or paths without impacting the entire system or other
tenants. Isolation mechanisms include address range containment, poison
handling, device-level fencing, and dynamic removal of faulty regions
from the system memory map.
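
As an illustration of address range containment, here is a minimal
sketch in Python. It is a toy model, not kernel code: the Range type,
the isolate() helper, and the granule parameter are all hypothetical
names chosen for this example. It assumes the faulty span is widened to
a fixed granule before being carved out of the usable ranges.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Hypothetical address range: start inclusive, end exclusive."""
    start: int
    end: int


def isolate(usable: list[Range], faulty: Range, granule: int) -> tuple[list[Range], Range]:
    """Carve a granule-aligned fenced span around `faulty` out of `usable`.

    Returns (remaining usable ranges, fenced range).
    """
    if granule <= 0 or faulty.end <= faulty.start:
        raise ValueError("granule must be positive and faulty range non-empty")

    # Widen the faulty span outward to granule boundaries.
    lo = faulty.start - faulty.start % granule
    hi = -(-faulty.end // granule) * granule  # round up
    fenced = Range(lo, hi)

    remaining: list[Range] = []
    for r in usable:
        if r.end <= lo or r.start >= hi:
            remaining.append(r)  # no overlap, keep untouched
            continue
        if r.start < lo:
            remaining.append(Range(r.start, lo))  # part below the fence
        if r.end > hi:
            remaining.append(Range(hi, r.end))  # part above the fence
    return remaining, fenced
```

In this model only the fenced granule leaves service; the rest of the
region stays usable, which is the property the request calls
"contained within affected memory regions".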

In the Linux kernel, CXL support (via modules such as cxl_core and
cxl_mem, and integration with memory hotplug and NUMA) enables basic
management of CXL memory devices. However, fine-grained isolation
capabilities for fault containment, especially in multi-tenant and
pooled memory environments, are still evolving. Enhancing OS support is
critical to ensure high availability and resilience in CXL-based
systems.

Feature request:
Capabilities requested in the OS:
  Enable fine-grained isolation of faulty CXL.mem regions (range-based
    isolation).
  Support poison detection, containment, and controlled propagation
    handling.
  Integrate CXL.mem errors with EDAC and RAS frameworks.
  Enable dynamic offlining/removal of affected memory regions (memory
    hot-remove).
  Support device-level isolation (fencing faulty CXL devices or links).
  Provide sysfs/debugfs interfaces for monitoring isolation events and
    memory health.
  Enable coordination with firmware for error containment and recovery
    workflows.
  Support multi-tenant isolation in shared memory pool environments.
  Integrate with NUMA and memory tiering for workload-aware isolation
    and migration.
  Provide tools for fault injection, validation, and debugging of
    isolation mechanisms.
  Document isolation policies, workflows, and best practices for CXL
    deployments.
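
For the dynamic offlining item, the following Python sketch shows how a
faulty physical address range might be mapped to memory blocks using
the existing memory hotplug sysfs interface (block_size_bytes and the
per-block state file under /sys/devices/system/memory). The function
names are hypothetical, and the sketch assumes the faulty range is
already known as a physical address and length. On a real system the
write needs root and may fail if the block cannot be offlined.

```python
from pathlib import Path


def memory_blocks_for_range(start, length, sysfs_root="/sys/devices/system/memory"):
    """Return the memoryN directories covering [start, start + length)."""
    if length <= 0:
        raise ValueError("length must be positive")
    root = Path(sysfs_root)
    # block_size_bytes is reported in hexadecimal.
    block_size = int((root / "block_size_bytes").read_text().strip(), 16)
    first = start // block_size
    last = (start + length - 1) // block_size
    return [root / f"memory{i}" for i in range(first, last + 1)]


def offline_blocks(blocks):
    """Request offlining of each block that is not already offline."""
    for block in blocks:
        state = block / "state"
        if state.read_text().strip() != "offline":
            state.write_text("offline")
```

Passing sysfs_root explicitly lets the mapping logic be exercised
against a scratch directory before it is pointed at a live system.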

Business Justification:
  Improves system availability by isolating faults without full system
    downtime.
  Enables safe operation of shared and pooled memory environments.
  Reduces impact of memory and link failures on running workloads.
  Supports multi-tenant cloud and hyperscale deployments.
  Enhances resilience and fault tolerance in CXL-based architectures.
  Aligns OS capabilities with advanced RAS requirements for
    disaggregated memory systems.

References:
  CXL 2.0 / 3.0 Specifications (CXL.mem, RAS, Poison Handling) 
  Linux Kernel CXL Subsystem Documentation 
  Linux Memory Hotplug and NUMA Documentation 
  Industry Whitepapers on Memory Disaggregation and High-Availability Systems

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

** Information type changed from Public to Private

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2146672

Title:
  Request for RAS Availability Support – CXL.mem Isolation

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2146672/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
