Private bug reported:
Out-of-Band (OOB) RAS mechanisms enable system health monitoring and
error reporting independent of the host OS, typically via the Baseboard
Management Controller (BMC). Machine Check Architecture (MCA) is a key
mechanism for reporting CPU, memory, and interconnect errors.
OOB MCA FRU (Field Replaceable Unit) text enhances serviceability by
providing human-readable, component-level fault information (e.g., DIMM
slot, CPU socket, PCIe device) associated with MCA errors. This
information is generated by platform firmware and made available to the
BMC, allowing operators to quickly identify and replace faulty hardware
without deep analysis of raw error logs.
By exposing MCA error details with FRU-level granularity through OOB
channels, systems can significantly reduce troubleshooting complexity
and downtime. This is especially valuable in large-scale data centers
where rapid fault isolation and replacement are critical.
On Linux, MCA errors are typically handled in-band: the kernel logs them
and user-space tools such as mcelog and rasdaemon decode and record them.
OOB exposure of decoded FRU text, by contrast, is limited or not
standardized. Enhancing integration between firmware, BMC, and OS would
improve visibility and correlation of errors across management layers.
Feature Request:
Requested capabilities to be enabled in the OS:
- Enable support for OOB delivery of MCA error information with FRU text via
  BMC interfaces.
- Integrate OOB MCA FRU data with OS RAS frameworks and logging systems.
- Provide mechanisms to correlate OOB FRU information with in-band MCA events.
- Expose FRU-level fault details (e.g., DIMM slot, CPU core, PCIe slot) to
  user space.
- Support standardized interfaces (e.g., IPMI, Redfish) for accessing
  FRU-enriched error data.
- Enable firmware-to-BMC-to-OS data flow for consistent error reporting.
- Provide tools/utilities to decode, display, and analyze FRU-based MCA
  errors.
- Support alerting and automation workflows based on FRU-level fault
  identification.
- Ensure compatibility with CPU vendor-specific MCA/SMCA implementations.
- Document workflows for interpreting and acting on FRU-based MCA reports.
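
As one possible reading of the correlation request above, the sketch below
pairs in-band MCA events with OOB FRU records by socket, bank, and
timestamp proximity. Everything here is an assumption made for
illustration: the record types `InBandEvent` and `OobFruRecord`, their
fields, the `correlate` function, and the 5-second window are all
hypothetical. No existing kernel, BMC, or Redfish interface is implied to
provide data in this shape.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class InBandEvent:
    """Hypothetical in-band MCA event as an OS-side tool might record it."""
    timestamp: float  # seconds since epoch
    socket: int
    bank: int
    status: int       # raw MCi_STATUS value


@dataclass(frozen=True)
class OobFruRecord:
    """Hypothetical OOB record carrying firmware-generated FRU text."""
    timestamp: float
    socket: int
    bank: int
    fru_text: str


def correlate(
    events: Iterable[InBandEvent],
    records: Iterable[OobFruRecord],
    window_s: float = 5.0,
) -> List[Tuple[InBandEvent, Optional[str]]]:
    """Attach to each event the FRU text of the closest-in-time OOB record
    for the same socket and bank, or None if none falls within window_s."""
    by_key = {}
    for rec in records:
        by_key.setdefault((rec.socket, rec.bank), []).append(rec)

    matched = []
    for ev in events:
        candidates = [
            rec
            for rec in by_key.get((ev.socket, ev.bank), [])
            if abs(rec.timestamp - ev.timestamp) <= window_s
        ]
        best = min(
            candidates,
            key=lambda rec: abs(rec.timestamp - ev.timestamp),
            default=None,
        )
        matched.append((ev, best.fru_text if best else None))
    return matched
```

A real implementation would need a key that firmware, BMC, and OS all
agree on; matching on timestamps alone is fragile when the BMC and host
clocks drift, which is part of why a standardized data flow is requested.
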
Business Justification:
- Reduces mean time to repair (MTTR) through precise fault localization.
- Simplifies debugging by providing human-readable error information.
- Enhances serviceability in large-scale and remote-managed environments.
- Minimizes downtime by enabling faster hardware replacement decisions.
- Improves coordination between OS, firmware, and BMC management layers.
- Aligns with enterprise data center operational best practices.
References:
CPU Vendor MCA/SMCA Documentation
ACPI Platform Error Interface (APEI) Specification
IPMI and Redfish Specifications
Linux RAS Tools (mcelog, rasdaemon) Documentation
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
** Information type changed from Public to Private
--
https://bugs.launchpad.net/bugs/2146676
Title:
Request for RAS Serviceability Support – Out-of-Band (OOB) MCA FRU
Text