Private bug reported:
Error Correcting Code (ECC) is a fundamental Reliability, Availability,
and Serviceability (RAS) feature in modern memory subsystems, enabling
detection and correction of memory errors to ensure system stability.
Advanced Memory Data Correction (AMDC) extends traditional ECC
capabilities by providing enhanced error detection, multi-bit
correction, and improved fault isolation mechanisms within DRAM and
memory controller architectures.
AMDC is particularly relevant for next-generation platforms (e.g., DDR5,
MRDIMM, high-density DIMMs) where increasing memory capacity and speed
raise the likelihood of transient and persistent errors. It enhances
reliability by supporting advanced correction techniques beyond standard
SECDED (Single Error Correction, Double Error Detection), including
multi-symbol correction and improved handling of chip-level failures.
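As a point of reference for the SECDED baseline mentioned above, here is a minimal sketch of an extended Hamming (8,4) code. It is illustrative only: real DRAM ECC uses much wider codewords (and AMDC-class schemes use symbol-based codes), and the function names `encode` and `decode` are hypothetical, not taken from any kernel or firmware interface.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 form Hamming(7,4)
    with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity for double-error detection
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of set positions is 0 for a valid word
    overall = sum(code) % 2                 # 1 means an odd number of bits flipped
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error: the syndrome is the flipped position
        # (0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with consistent overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
    return status, nibble
```

A single flipped bit is reported as "corrected", while two flipped bits are only detected. That second case is the gap that multi-symbol correction is meant to close.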
In the Linux kernel, ECC support is typically handled via the EDAC
(Error Detection and Correction) subsystem, which reports memory errors
to the OS. However, advanced capabilities such as AMDC are not fully
exposed or utilized, limiting visibility into enhanced correction events
and reducing the ability to perform proactive fault management.
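To illustrate what EDAC exposes today, the sketch below reads the per-memory-controller corrected and uncorrected error counters from sysfs. It assumes the usual EDAC layout under /sys/devices/system/edac/mc with `ce_count` and `ue_count` files; the helper name `read_edac_counts` is hypothetical. Note that these counters do not distinguish AMDC corrections from ordinary ECC corrections.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    `root` is a parameter so the function can be pointed at a test
    directory; an absent or empty directory yields an empty dict.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None          # file missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```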
Feature Request:
Capabilities requested in the OS:
- Extend the EDAC subsystem to support AMDC-specific error reporting and
  classification.
- Enable detection and reporting of multi-bit errors and other errors
  corrected by AMDC.
- Provide visibility into AMDC correction events (corrected, uncorrected,
  and deferred errors).
- Integrate AMDC telemetry with RAS frameworks and logging systems.
- Support memory controller drivers that expose AMDC capabilities and
  statistics.
- Enable proactive fault management (e.g., page offlining, predictive
  failure analysis).
- Provide sysfs/debugfs interfaces for monitoring AMDC-related metrics.
- Ensure compatibility with DDR5, MRDIMM, and future memory technologies.
- Coordinate with firmware (BIOS/UEFI) for standardized AMDC error reporting.
- Document AMDC behavior, limitations, and debugging workflows.
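As a rough illustration of how the requested error classification could feed proactive fault management, the sketch below flags pages for offlining from a stream of classified events. Everything here is an assumption made for illustration: the `Severity` and `MemoryErrorEvent` types, the `pages_to_offline` helper, and the threshold policy are hypothetical and do not correspond to an existing kernel or EDAC interface.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CORRECTED = "corrected"
    DEFERRED = "deferred"
    UNCORRECTED = "uncorrected"


@dataclass(frozen=True)
class MemoryErrorEvent:
    page_frame: int        # hypothetical: page frame number of the faulting address
    severity: Severity


def pages_to_offline(events, ce_threshold=3):
    """Return a sorted list of page frames that the policy would retire.

    Assumed policy: a page is flagged after `ce_threshold` corrected
    errors, or immediately on any deferred or uncorrected error.
    """
    ce_per_page = Counter()
    flagged = set()
    for ev in events:
        if ev.severity is Severity.CORRECTED:
            ce_per_page[ev.page_frame] += 1
            if ce_per_page[ev.page_frame] >= ce_threshold:
                flagged.add(ev.page_frame)
        else:
            flagged.add(ev.page_frame)
    return sorted(flagged)
```

A real implementation would also need to decide how the flagged pages are retired, for example through the kernel's existing soft-offline mechanism, and how the threshold is tuned per platform.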
Business Justification:
- Improves system reliability and uptime through advanced error correction.
- Enables early detection of degrading memory components.
- Reduces risk of data corruption in mission-critical workloads.
- Supports high-density memory deployments in data centers.
- Enhances RAS capabilities for enterprise and hyperscale environments.
- Aligns OS support with advanced hardware memory protection features.
References:
- JEDEC DDR5 ECC specifications
- Linux EDAC subsystem documentation
- Platform memory controller documentation
- Industry whitepapers on Advanced Memory Error Correction (AMDC)
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
** Information type changed from Public to Private
--
https://bugs.launchpad.net/bugs/2146664
Title:
Request for RAS Reliability Support – DRAM ECC with AMDC