** Description changed:
[Impact]
rasdaemon does not know how to decode MCE events from various new platforms,
making it difficult to interpret errors reported up from the platform.
[Test Case]
+ On an AMD SMCA-capable system:
+ #!/bin/bash
+ modprobe mce-inject
- [Fix]
+ EINJ=/sys/kernel/debug/mce-inject
+
+ # See /sys/kernel/debug/mce-inject/README
+
+ echo hw > $EINJ/flags
+ echo 0x9c2030000000011b > $EINJ/status
+ echo 0x040000035dd8bfc0 > $EINJ/addr
+ echo 0x0000c2030b404000 > $EINJ/synd
+ echo 0 > $EINJ/bank
+
+ # Wait for MCE to appear in dmesg
+ sudo ras-mc-ctl --errors
+ There should be a new MCE event in the output:
+ 1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU
2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c,
status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d,
cpuid=0x00830f10
+
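As a cross-check on the expected output above, the flags that rasdaemon prints can be read straight from the injected status value. This is a minimal sketch and not part of the original test case. The helper name mca_bit is hypothetical, and the bit positions (Val=63, UC=61, En=60, AddrV=58, and on AMD SMCA SyndV=53, UECC=45, Deferred=44) reflect my reading of the MCA_STATUS layout, so verify them against the processor documentation.

```shell
#!/bin/bash
# Print bit N (0 or 1) of a 64-bit MCA status value.
# mca_bit is a hypothetical helper, not something shipped with rasdaemon.
# bash arithmetic is 64-bit signed; masking with 1 after the shift
# discards the sign extension, so bit 63 is still reported correctly.
mca_bit() {
    local status=$1 bit=$2
    echo $(( (status >> bit) & 1 ))
}

status=0x9c2030000000011b   # same value written to $EINJ/status in the test

# Bit positions below are assumptions taken from the MCA_STATUS layout.
for field in Val:63 UC:61 En:60 AddrV:58 SyndV:53 UECC:45 Deferred:44; do
    name=${field%%:*}
    bit=${field##*:}
    printf '%-8s (bit %2d) = %d\n' "$name" "$bit" "$(mca_bit "$status" "$bit")"
done
```

For this status value the sketch reports UC clear with UECC and Deferred both set, which is consistent with the "Deferred error" and "mci UECC" fields in the expected event.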
+
+ For Skylake, I regression-tested using mce-inject with its "corrected" test,
as I'm not sure how to inject a Skylake-specific event there.
+ git clone https://github.com/andikleen/mce-inject
+ cd mce-inject
+ make
+ sudo ./mce-inject < test/corrected
+ sudo ras-mc-ctl --errors
+ No Memory errors.
+
+ No PCIe AER errors.
+
+ No Extlog errors.
+
+ MCE events:
+ 1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci
Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000,
addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
+ 2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci
Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000,
addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654,
apicid=0x00000002, bank=0x00000002
+
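The check on the ras-mc-ctl output can also be scripted rather than read by eye. The following is a rough sketch under stated assumptions: count_mce_status is a hypothetical helper, and it assumes each record carries a ", status=<value>," field formatted as in the sample output above. The sample text is copied from that output so the snippet runs without rasdaemon installed.

```shell
#!/bin/bash
# Count MCE records on stdin whose status field equals the given value.
# count_mce_status is a hypothetical helper, not part of rasdaemon.
# The leading space in the pattern keeps "mcgstatus=" from matching.
count_mce_status() {
    local want=$1
    grep -oF " status=${want}," | wc -l | tr -d ' '
}

# Sample records copied from the expected output (one record per line).
# On a live system, pipe in the real output instead:
#   sudo ras-mc-ctl --errors | count_mce_status 0x9400000000000000
sample='1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002'

count=$(printf '%s\n' "$sample" | count_mce_status 0x9400000000000000)
if [ "$count" -ge 2 ]; then
    echo "PASS: found $count injected events"
else
    echo "FAIL: expected 2 injected events, found $count"
fi
```

Matching on the status value rather than the timestamp or record number keeps the check stable across repeated runs, though it will also count events left over from earlier injections.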
[Regression Risk]
+ The new code should only run on the newly supported systems, so
regressions should be restricted to those systems. There, a bug in the
decoding code could cause a failure such as a rasdaemon crash. That risk is
mitigated by testing on those newly supported platforms.
** Changed in: rasdaemon (Ubuntu Eoan)
Status: New => In Progress
** Changed in: rasdaemon (Ubuntu Bionic)
Status: New => In Progress
--
https://bugs.launchpad.net/bugs/1871965
Title:
new platform support: Intel SkyLake, AMD Scalable MCA