** Description changed:

  [Impact]
  rasdaemon does not know how to decode MCE events from various new platforms, 
making it difficult to interpret errors reported up from the platform.
  
  [Test Case]
  On an AMD SMCA-capable system:
  #!/bin/bash
  modprobe mce-inject
  
  EINJ=/sys/kernel/debug/mce-inject
  
  # See /sys/kernel/debug/mce-inject/README
  
  echo hw > $EINJ/flags
  echo 0x9c2030000000011b > $EINJ/status
  echo 0x040000035dd8bfc0 > $EINJ/addr
  echo 0x0000c2030b404000 > $EINJ/synd
  echo 0 > $EINJ/bank
  
  # Wait for MCE to appear in dmesg
  sudo ras-mc-ctl --errors
  There should be a new MCE event in the output:
  1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 
2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, 
status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, 
cpuid=0x00830f10
  
- 
  For Skylake, I regression tested by using mce-test w/ the "corrected" test, 
as I'm not sure how to inject a Skylake-specific event there.
  git clone https://github.com/andikleen/mce-inject
  cd mce-inject
  make
  sudo ./mce-inject < test/corrected
  sudo ras-mc-ctl --errors
  No Memory errors.
  
  No PCIe AER errors.
  
  No Extlog errors.
  
  MCE events:
  1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci 
Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, 
addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
  2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci 
Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, 
addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, 
apicid=0x00000002, bank=0x00000002
  
- 
  [Regression Risk]
- The new code added should only run on the newly supported systems, so 
regressions should be restricted to those systems. On those systems, a bug in 
the decoding code could cause an issue on these systems such as a crash in 
rasdaemon, etc. That is mitigated by testing on those newly supported platforms.
+ The new code added should only run on the newly supported systems, so 
regressions should be restricted to those systems. On those systems, a bug in 
the decoding code could cause an issue on these systems such as a crash in 
rasdaemon, etc. That is mitigated by testing on those newly supported 
platforms. Note that one code path I could not exercise is the Hygon Dhyana 
support as I don't have that hardware - that patch is a trivial "do the same 
thing as AMD Zen", as it is a derivative platform.

** Description changed:

  [Impact]
  rasdaemon does not know how to decode MCE events from various new platforms, 
making it difficult to interpret errors reported up from the platform.
  
  [Test Case]
  On an AMD SMCA-capable system:
  #!/bin/bash
  modprobe mce-inject
  
  EINJ=/sys/kernel/debug/mce-inject
  
  # See /sys/kernel/debug/mce-inject/README
  
  echo hw > $EINJ/flags
  echo 0x9c2030000000011b > $EINJ/status
  echo 0x040000035dd8bfc0 > $EINJ/addr
  echo 0x0000c2030b404000 > $EINJ/synd
  echo 0 > $EINJ/bank
  
  # Wait for MCE to appear in dmesg
  sudo ras-mc-ctl --errors
  There should be a new MCE event in the output:
  1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 
2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, 
status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, 
cpuid=0x00830f10
  
  For Skylake, I regression tested by using mce-test w/ the "corrected" test, 
as I'm not sure how to inject a Skylake-specific event there.
  git clone https://github.com/andikleen/mce-inject
  cd mce-inject
  make
  sudo ./mce-inject < test/corrected
  sudo ras-mc-ctl --errors
  No Memory errors.
  
  No PCIe AER errors.
  
  No Extlog errors.
  
  MCE events:
  1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci 
Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, 
addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
  2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci 
Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, 
addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, 
apicid=0x00000002, bank=0x00000002
  
+ [Fix]
+ 
https://github.com/mchehab/rasdaemon/commit/b30a7fd4e5df8c4e61c7441f79e52d8f5f115974
+ 
https://github.com/mchehab/rasdaemon/commit/a16ca0711001957ee98f2c124abce0fa1f801529
+ 
https://github.com/mchehab/rasdaemon/commit/8704a85d8dc3483423ec2934fee8132f85f8fdb6
+ 
https://github.com/mchehab/rasdaemon/commit/22f2d8bb1d1065dede59b73b148ad4b4e2177c33
+ 
  [Regression Risk]
  The new code added should only run on the newly supported systems, so 
regressions should be restricted to those systems. On those systems, a bug in 
the decoding code could cause an issue on these systems such as a crash in 
rasdaemon, etc. That is mitigated by testing on those newly supported 
platforms. Note that one code path I could not exercise is the Hygon Dhyana 
support as I don't have that hardware - that patch is a trivial "do the same 
thing as AMD Zen", as it is a derivative platform.

** Description changed:

  [Impact]
  rasdaemon does not know how to decode MCE events from various new platforms, 
making it difficult to interpret errors reported up from the platform.
  
  [Test Case]
  On an AMD SMCA-capable system:
  #!/bin/bash
  modprobe mce-inject
  
  EINJ=/sys/kernel/debug/mce-inject
  
  # See /sys/kernel/debug/mce-inject/README
  
  echo hw > $EINJ/flags
  echo 0x9c2030000000011b > $EINJ/status
  echo 0x040000035dd8bfc0 > $EINJ/addr
  echo 0x0000c2030b404000 > $EINJ/synd
  echo 0 > $EINJ/bank
  
  # Wait for MCE to appear in dmesg
  sudo ras-mc-ctl --errors
  There should be a new MCE event in the output:
  1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 
2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, 
status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, 
cpuid=0x00830f10
  
  For Skylake, I regression tested by using mce-test w/ the "corrected" test, 
as I'm not sure how to inject a Skylake-specific event there.
  git clone https://github.com/andikleen/mce-inject
  cd mce-inject
  make
  sudo ./mce-inject < test/corrected
  sudo ras-mc-ctl --errors
  No Memory errors.
  
  No PCIe AER errors.
  
  No Extlog errors.
  
  MCE events:
  1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci 
Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, 
addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
  2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci 
Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, 
addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, 
apicid=0x00000002, bank=0x00000002
  
  [Fix]
  
https://github.com/mchehab/rasdaemon/commit/b30a7fd4e5df8c4e61c7441f79e52d8f5f115974
  
https://github.com/mchehab/rasdaemon/commit/a16ca0711001957ee98f2c124abce0fa1f801529
  
https://github.com/mchehab/rasdaemon/commit/8704a85d8dc3483423ec2934fee8132f85f8fdb6
  
https://github.com/mchehab/rasdaemon/commit/22f2d8bb1d1065dede59b73b148ad4b4e2177c33
  
  [Regression Risk]
- The new code added should only run on the newly supported systems, so 
regressions should be restricted to those systems. On those systems, a bug in 
the decoding code could cause an issue on these systems such as a crash in 
rasdaemon, etc. That is mitigated by testing on those newly supported 
platforms. Note that one code path I could not exercise is the Hygon Dhyana 
support as I don't have that hardware - that patch is a trivial "do the same 
thing as AMD Zen", as it is a derivative platform.
+ The new code added should only run on the newly supported systems, so 
regressions should be restricted to those systems. On those systems, a bug in 
the decoding code could cause e.g. a crash in rasdaemon. That is mitigated 
by testing on those newly supported platforms. Note that one code path I could 
not exercise is the Hygon Dhyana support as I don't have that hardware - that 
patch is a trivial "do the same thing as AMD Zen", as it is a derivative 
platform.
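
As a cross-check of the expected output in the test case, the injected status values can be decoded by hand. The following is a minimal sketch, not rasdaemon's decoder: `decode_status` is a hypothetical helper, and the bit positions are assumed from the Linux kernel's MCI_STATUS_* definitions (bits 53 and 46-43 are AMD-specific and mean something else on Intel).

```shell
#!/bin/bash
# Hypothetical helper (not part of rasdaemon): print the names of the
# flag bits set in an MCA status value.
# Assumed bit positions follow the Linux kernel's MCI_STATUS_* macros;
# SyndV/CECC/UECC/Deferred/Poison are AMD-specific.
decode_status() {
    # Bash arithmetic is signed 64-bit; large hex constants wrap to a
    # negative number, but testing individual bits still works.
    local status=$(( $1 ))
    local out="" entry bit name
    for entry in 63:Val 62:Overflow 61:UC 60:En 59:MiscV 58:AddrV \
                 57:PCC 53:SyndV 46:CECC 45:UECC 44:Deferred 43:Poison; do
        bit=${entry%%:*}
        name=${entry#*:}
        if (( (status >> bit) & 1 )); then
            out+="${out:+ }$name"
        fi
    done
    echo "$out"
}

# Status injected in the AMD SMCA test case
decode_status 0x9c2030000000011b   # Val En MiscV AddrV SyndV UECC Deferred
# Status from mce-inject's test/corrected (only the generic bits apply)
decode_status 0x9400000000000000   # Val En AddrV
```

The first value has Deferred and UECC set with UC clear, which is consistent with the "Deferred error, no action required." and "mci UECC" fields in the AMD output above. The second has UC clear and En set, consistent with "Corrected_error Error_enabled".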

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1871965

Title:
  new platform support: Intel SkyLake, AMD Scalable MCA

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/rasdaemon/+bug/1871965/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
