Private bug reported:
Off-package interconnects (e.g., PCIe, CXL, and other high-speed SerDes-based
links) are increasingly critical in modern platforms, enabling communication
between CPUs, accelerators, memory expanders, and I/O devices. With higher data
rates (e.g., PCIe Gen5/Gen6), signal integrity challenges increase, leading to
higher bit error rates.
To ensure reliable communication, these links implement multiple layers of
error detection and correction mechanisms:
FEC (Forward Error Correction): Corrects bit errors at the physical layer
without retransmission.
CRC (Cyclic Redundancy Check): Detects data corruption at the data link layer.
Replay Mechanism: Retransmits a packet when the CRC check detects corruption
that was not corrected at a lower layer.
These mechanisms work together to provide robust data integrity and minimize
data loss across off-package links. While largely handled in hardware, they
generate error events and telemetry that are essential for system-level RAS,
diagnostics, and performance tuning.
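
As a rough illustration of how CRC detection and replay interact, here is a toy
Python sketch. It is not the PCIe LCRC or the hardware replay buffer: it uses
zlib.crc32 as a stand-in checksum, does not model FEC, and the names
send_with_replay and make_flaky_channel are hypothetical helpers invented for
this example.

```python
import zlib


def make_flaky_channel(corrupt_first_n):
    """Return a hypothetical channel that flips one bit in the first N transfers."""
    state = {"calls": 0}

    def channel(data: bytes) -> bytes:
        state["calls"] += 1
        if state["calls"] <= corrupt_first_n:
            # Flip the lowest bit of the first byte to simulate a link error.
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data

    return channel


def send_with_replay(payload: bytes, channel, max_replays: int = 3):
    """Send payload; retransmit while the receiver-side CRC check fails.

    Returns (received_bytes, number_of_replays_used).
    """
    expected_crc = zlib.crc32(payload)  # checksum computed by the sender
    for attempt in range(max_replays + 1):
        received = channel(payload)
        if zlib.crc32(received) == expected_crc:  # receiver-side check
            return received, attempt
    raise RuntimeError("replay limit exceeded; link would need recovery")


if __name__ == "__main__":
    data, replays = send_with_replay(b"TLP payload", make_flaky_channel(2))
    print(f"delivered {data!r} after {replays} replay(s)")
```

The replay count returned here is the kind of per-link counter this request
asks the OS to expose.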
In the Linux kernel, existing support includes PCIe Advanced Error
Reporting (AER) and basic link error handling. However, detailed
visibility into FEC corrections, CRC errors, and replay events is
limited or vendor-specific. Enhancing OS-level support would improve
observability, proactive fault management, and reliability in high-speed
interconnect environments.
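
For context, part of this information can already be read today. On kernels
that expose per-device AER statistics, devices with the AER capability have an
aer_dev_correctable file in sysfs whose counters include BadTLP, BadDLLP,
Rollover (replay number rollover), and Timeout (replay timer timeout). The
Python sketch below reads those counters; the exact set of labels depends on
the kernel version, and the helper names are illustrative only. No FEC
statistics are available through this file.

```python
from pathlib import Path
from typing import Dict


def parse_aer_stats(text: str) -> Dict[str, int]:
    """Parse 'Name count' lines as found in aer_dev_correctable."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def read_correctable_counters(
    sysfs_root: str = "/sys/bus/pci/devices",
) -> Dict[str, Dict[str, int]]:
    """Collect correctable AER counters for every device that exposes them."""
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return result
    for dev in sorted(root.iterdir()):
        try:
            text = (dev / "aer_dev_correctable").read_text()
        except OSError:
            continue  # device has no AER capability or file is unreadable
        result[dev.name] = parse_aer_stats(text)
    return result


if __name__ == "__main__":
    for bdf, counters in read_correctable_counters().items():
        # Replay-related counters only; .get() tolerates missing labels.
        print(bdf, {k: counters.get(k) for k in ("BadTLP", "BadDLLP", "Rollover", "Timeout")})
```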
Feature Request:
Requested OS capabilities:
Extend PCIe/CXL error reporting to include FEC correction statistics and
thresholds.
Enhance AER framework to capture CRC error counts and replay events.
Provide standardized interfaces (sysfs/debugfs) for link health monitoring.
Enable per-link telemetry for error rates, replay counts, and correction
activity.
Integrate link error data with RAS frameworks and system logging.
Support proactive fault management (e.g., link retraining, degradation
alerts).
Enable firmware-to-OS handoff of link reliability metrics and thresholds.
Ensure compatibility with PCIe Gen5/Gen6 and CXL link features.
Provide tools for debugging and validating link reliability issues.
Document interpretation of FEC/CRC/replay metrics and recommended actions.
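
To make the telemetry and degradation-alert items more concrete, the Python
sketch below shows one possible way userspace could turn two counter snapshots
into error rates and compare them against thresholds. Everything here is an
assumption for illustration: the LinkSample fields, the threshold names, and
the numbers do not correspond to any existing kernel interface, and counter
wraparound is not handled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LinkSample:
    """Hypothetical snapshot of cumulative per-link counters."""
    timestamp_s: float
    fec_corrected: int
    crc_errors: int
    replays: int


def rates_per_second(prev: LinkSample, curr: LinkSample) -> Dict[str, float]:
    """Convert two cumulative snapshots into per-second event rates."""
    dt = curr.timestamp_s - prev.timestamp_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        "fec_corrected": (curr.fec_corrected - prev.fec_corrected) / dt,
        "crc_errors": (curr.crc_errors - prev.crc_errors) / dt,
        "replays": (curr.replays - prev.replays) / dt,
    }


def degradation_alerts(
    prev: LinkSample, curr: LinkSample, thresholds: Dict[str, float]
) -> List[str]:
    """Return the names of metrics whose rate exceeds its threshold."""
    rates = rates_per_second(prev, curr)
    return [
        name
        for name, rate in rates.items()
        if rate > thresholds.get(name, float("inf"))  # no threshold: never alert
    ]


if __name__ == "__main__":
    before = LinkSample(0.0, fec_corrected=100, crc_errors=2, replays=1)
    after = LinkSample(10.0, fec_corrected=1100, crc_errors=2, replays=6)
    limits = {"fec_corrected": 50.0, "replays": 1.0}  # made-up thresholds
    print(degradation_alerts(before, after, limits))
```

A rising FEC correction rate with no CRC errors, as in this made-up data, is
the kind of early signal that could trigger a degradation alert before replays
or uncorrectable errors appear.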
Business Justification:
Improves reliability of high-speed interconnects under increasing data
rates.
Enables early detection of signal integrity issues and hardware degradation.
Reduces risk of data corruption and system instability.
Supports mission-critical and high-performance workloads.
Enhances observability and diagnostics for platform validation and support
teams.
Aligns OS capabilities with advanced link-level RAS features in modern
hardware.
References:
PCI-SIG PCIe Gen5/Gen6 Specifications (FEC, CRC, Replay Mechanisms)
CXL 2.0 / 3.0 Specifications
Linux PCIe AER Documentation
High-Speed SerDes and Link Reliability Whitepapers
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
** Information type changed from Public to Private
https://bugs.launchpad.net/bugs/2146665
Title:
Request for RAS Reliability Support – Off-Package Links FEC + CRC +
Replay