Private bug reported:
Multiplexed Rank DIMM (MRDIMM) is an advanced memory technology designed
to increase memory bandwidth and capacity by multiplexing multiple ranks
over a shared data interface. MRDIMMs introduce a buffer (multiplexer)
on the module that enables higher effective data rates and improved
scalability compared to traditional RDIMMs.
Because of this added complexity, MRDIMMs incorporate enhanced RAS
(Reliability, Availability, Serviceability) features such as advanced
ECC schemes, command/address parity, data buffer protection, and
improved fault isolation at the rank and sub-rank levels. These features
are essential for maintaining reliability at higher speeds and densities.
While much MRDIMM RAS functionality is handled in hardware (the DIMM and the
memory controller), the OS plays a critical role in error reporting, logging,
and proactive fault management. In the Linux kernel, the EDAC subsystem
already provides basic ECC error reporting but lacks full visibility into
MRDIMM-specific RAS features such as multiplexing errors, buffer faults,
and advanced correction events.
Enhancing OS-level support for MRDIMM RAS would improve observability, enable
proactive maintenance, and ensure reliability for next-generation memory
subsystems.
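
For context, the generic per-DIMM counters that EDAC exposes today can be
read from sysfs. The following is a minimal sketch, assuming the usual EDAC
layout under /sys/devices/system/edac/mc (mc*/dimm*/ with dimm_label,
dimm_ce_count and dimm_ue_count). The function name and the edac_root
parameter are illustrative only and are not part of any existing tool.

```python
from pathlib import Path


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def _to_int(text):
    """Convert a counter string to int, or None if missing or malformed."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def read_edac_dimm_counters(edac_root="/sys/devices/system/edac/mc"):
    """Collect per-DIMM corrected (ce) and uncorrected (ue) error counts.

    Assumes the mc*/dimm*/ layout; returns an empty list when the
    directory is absent (for example, no EDAC driver loaded).
    """
    root = Path(edac_root)
    if not root.is_dir():
        return []
    rows = []
    for dimm in sorted(root.glob("mc*/dimm*")):
        if not dimm.is_dir():
            continue
        rows.append({
            "mc": dimm.parent.name,
            "dimm": dimm.name,
            "label": _read(dimm / "dimm_label"),
            "ce": _to_int(_read(dimm / "dimm_ce_count")),
            "ue": _to_int(_read(dimm / "dimm_ue_count")),
        })
    return rows


if __name__ == "__main__":
    for row in read_edac_dimm_counters():
        print(row)
```

These files carry only aggregate corrected/uncorrected counts per DIMM, so
buffer or multiplexing faults cannot be told apart from ordinary ECC events
through this interface.
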
Feature Request:
Capabilities requested at the OS level:
Extend EDAC subsystem to support MRDIMM-specific error types and reporting.
Enable visibility into multiplexing-related errors and buffer (RCD/DB) faults.
Support advanced ECC reporting (multi-bit correction, chip-level faults).
Provide rank/sub-rank level error granularity for fault isolation.
Expose MRDIMM telemetry via sysfs/debugfs (error counts, health metrics).
Integrate MRDIMM RAS data with system RAS frameworks and logging.
Enable proactive fault management (e.g., page offlining, DIMM de-rating,
  predictive failure alerts).
Support firmware-to-OS handoff of MRDIMM RAS capabilities and thresholds.
Ensure compatibility with DDR5 and future high-bandwidth DIMM technologies.
Provide tools for diagnostics, validation, and performance/reliability
  analysis.
Document MRDIMM RAS features, error interpretation, and mitigation workflows.
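
To illustrate the proactive fault management and threshold items above, here
is a minimal sketch of a sliding-window policy that flags a DIMM once it logs
too many corrected errors in a given period. Everything in it is
hypothetical: the class name, the default threshold and the window length
are placeholders, and real thresholds would be expected to come from the
firmware-to-OS handoff requested above.

```python
from collections import defaultdict, deque


class CorrectedErrorMonitor:
    """Hypothetical policy: flag a DIMM that logs `threshold` or more
    corrected errors within `window_seconds`."""

    def __init__(self, threshold=10, window_seconds=3600.0):
        # Placeholder defaults, not vendor or firmware values.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # DIMM label -> event timestamps

    def record(self, dimm, timestamp):
        """Record one corrected error for `dimm` at `timestamp` (seconds).

        Returns True when the number of events still inside the window
        reaches the threshold, which a caller could treat as a
        predictive-failure alert.
        """
        events = self._events[dimm]
        events.append(timestamp)
        # Discard events that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


if __name__ == "__main__":
    monitor = CorrectedErrorMonitor(threshold=3, window_seconds=10.0)
    for t in (0.0, 5.0, 9.0):
        print(t, monitor.record("DIMM_A1", t))
```

Keeping the counts per DIMM label means the same policy could later be keyed
by rank or sub-rank if the kernel exposed that granularity.
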
Business Justification:
Improves reliability and stability of high-bandwidth memory subsystems.
Enables efficient fault isolation and faster root-cause analysis.
Reduces downtime through proactive memory failure detection.
Supports next-generation server platforms using MRDIMMs.
Enhances RAS capabilities for enterprise and hyperscale deployments.
Aligns OS support with emerging memory technologies and standards.
References:
JEDEC MRDIMM and DDR5 Specifications
Linux EDAC Subsystem Documentation
Memory Controller and DIMM Vendor RAS Documentation
Industry Whitepapers on High-Bandwidth DIMM Architectures and Reliability
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
** Information type changed from Public to Private
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2146668
Title:
Request for RAS Reliability Support – MRDIMM RAS Support