Private bug reported:
In modern high-speed interconnects such as PCIe and CXL, timely and
accurate error reporting is essential for effective serviceability.
In-band error reporting refers to mechanisms where error information is
communicated over the same data path as functional traffic, enabling
faster detection and response without relying solely on out-of-band
channels.

MPRAS (Multi-Protocol RAS, where RAS is reliability, availability, and
serviceability) in-band error reporting enables unified and
protocol-aware error signaling across PCIe/CXL fabrics. It allows
devices, switches, and endpoints to propagate error information (e.g.,
link errors, protocol violations, poison events) through the fabric to
the host in a standardized manner. This is particularly important in
complex topologies involving switches, multi-level fabrics, and shared
resources.

In the Linux kernel, current error reporting relies heavily on
mechanisms such as PCIe Advanced Error Reporting (AER), ACPI APEI, and
vendor-specific logs. However, support for in-band, fabric-level error
propagation mechanisms like MPRAS is limited or not fully standardized.
Enhancing support would improve observability, reduce detection latency,
and simplify debugging in large-scale deployments.
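
For context on the existing mechanisms mentioned above: the kernel already
exposes per-device AER statistics through sysfs, in files such as
aer_dev_correctable (see Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats).
The following is a minimal sketch of reading them, assuming a kernel built
with AER support and a device that implements the AER capability. The helper
names parse_aer_stats and read_aer_counters are illustrative only and are not
part of any kernel tool. Nothing here is MPRAS-specific.

```python
from pathlib import Path

# Per-device AER statistics files exposed under /sys/bus/pci/devices/<BDF>/.
AER_STAT_FILES = ("aer_dev_correctable", "aer_dev_nonfatal", "aer_dev_fatal")


def parse_aer_stats(text):
    """Parse lines of the form '<error name> <count>' into a dict."""
    counters = {}
    for line in text.splitlines():
        # Split on the last run of whitespace so names containing spaces survive.
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_aer_counters(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return {file name: {error name: count}} for one PCI device.

    Files that do not exist (no AER capability, or AER not enabled in the
    kernel) are skipped, so the result may be empty.
    """
    device_dir = Path(sysfs_root) / bdf
    result = {}
    for name in AER_STAT_FILES:
        path = device_dir / name
        if path.is_file():
            result[name] = parse_aer_stats(path.read_text())
    return result
```

Calling read_aer_counters("0000:00:1c.0") (the address is only an example)
returns an empty dict on systems where the files are absent.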

Feature Request:
Capabilities requested for enablement in the OS:
- Enable support for in-band error reporting mechanisms (MPRAS) in the
  PCIe/CXL subsystems.
- Integrate MPRAS error events with PCIe AER, CXL RAS, and system logging
  frameworks.
- Support unified error decoding across multiple protocols (PCIe, CXL.io,
  CXL.mem, CXL.cache).
- Provide sysfs/debugfs interfaces for accessing in-band error logs and
  telemetry.
- Enable propagation and aggregation of errors across switches and
  multi-level fabrics.
- Support correlation of in-band errors with hardware components (device,
  link, switch).
- Enable firmware-to-OS handoff of MPRAS capabilities and configuration.
- Provide tools for debugging, validation, and fault injection of in-band
  error scenarios.
- Ensure compatibility with PCIe Gen5/Gen6 and CXL 2.0/3.0 fabrics.
- Document MPRAS workflows, configuration, and error interpretation
  guidelines.
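
To make the requests for unified decoding, component correlation, and
aggregation more concrete, here is a purely illustrative sketch of what a
normalized event record might look like. Everything in it is hypothetical:
the FabricErrorEvent type, its field names, and the component identifiers are
assumptions for discussion and do not correspond to any existing kernel
interface or to a published MPRAS format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricErrorEvent:
    """Hypothetical protocol-neutral error record (not a real kernel type)."""
    protocol: str   # e.g. "PCIe", "CXL.io", "CXL.mem", "CXL.cache"
    component: str  # hypothetical component id, e.g. "switch0/port3"
    kind: str       # e.g. "link_error", "protocol_violation", "poison"
    severity: str   # "correctable", "nonfatal", or "fatal"


def aggregate_by_component(events):
    """Count events per (component, severity) pair."""
    return Counter((event.component, event.severity) for event in events)


def worst_components(events, severity="fatal"):
    """Return components that reported the given severity, most events first."""
    counts = Counter(e.component for e in events if e.severity == severity)
    return [component for component, _ in counts.most_common()]
```

A record of this shape would let errors from different protocols be grouped
by the device, link, or switch that reported them, which is the correlation
the request asks for.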

Business Justification:
- Reduces error detection latency and improves response time.
- Enhances serviceability in complex PCIe/CXL fabric deployments.
- Simplifies debugging and root-cause analysis across multi-level
  topologies.
- Provides unified error reporting across multiple protocols.
- Improves operational efficiency for data center and hyperscale
  environments.
- Aligns OS capabilities with next-generation fabric-level RAS mechanisms.

References:
- PCI-SIG PCIe Specifications (AER and RAS Enhancements)
- CXL 2.0 / 3.0 Specifications (RAS and Fabric Error Handling)
- Linux Kernel PCIe AER and CXL RAS Documentation
- Industry Whitepapers on In-Band Error Reporting and Fabric
  Serviceability
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
** Information type changed from Public to Private
https://bugs.launchpad.net/bugs/2146673
Title:
Request for RAS Serviceability Support – In-band Error Reporting
(MPRAS)