Public bug reported:

The current implementation of the Machine Check handler and the HMI
handler in Linux goes down the kernel panic path for unrecoverable
errors. On FSP-based systems, the FSP is also notified about these
errors and forwards them to PRD (which runs on the FSP) for error
analysis and gard record creation.

On OpenPower (BMC-based systems, e.g. Habanero from TYAN), where PRD
runs in the Linux host, PRD never gets a chance to do error analysis at
the time of the Linux crash, and no gard record is created for such
errors. Since the faulty component never gets de-configured, the system
remains vulnerable to being hit by the same HW error again.

To fix this issue, a new OPAL call, 'opal_cec_reboot2()', has been
introduced to trigger a checkstop on BMC-based systems and inform
BMC/OCC about the error, so that the BMC can collect relevant data for
error analysis and decide which component to de-configure before
rebooting. The Linux kernel should invoke this OPAL call for
unrecoverable MCE and HMI errors before calling kernel panic, so that
the OCC is informed about the error.
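
The intended ordering (try the firmware checkstop first, fall back to
panic) can be sketched in user-space C as below. This is only an
illustration, not the code from the posted patches: the enum values,
the mock firmware functions and handle_unrecoverable() are hypothetical
names, and the real 'opal_cec_reboot2()' signature should be taken from
the patches themselves. In a real kernel a successful call would not
return, because the checkstop takes the machine down.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical reboot types; the real values live in the OPAL API headers. */
enum reboot_type { REBOOT_NORMAL = 0, REBOOT_PLATFORM_ERROR = 1 };

/* What the handler ended up doing, so the flow can be observed in a test. */
enum final_action { ACTION_CHECKSTOP, ACTION_PANIC };

/* Stand-in for the firmware call: returns 0 if the request was accepted. */
typedef int64_t (*cec_reboot2_fn)(uint32_t type, const char *diag);

/* Mock firmware that supports the new call (BMC-based system, new OPAL). */
int64_t mock_fw_supported(uint32_t type, const char *diag)
{
    printf("firmware: checkstop requested, type=%u, diag=%s\n",
           (unsigned)type, diag);
    return 0;
}

/* Mock firmware that does not implement the call (older OPAL). */
int64_t mock_fw_unsupported(uint32_t type, const char *diag)
{
    (void)type;
    (void)diag;
    return -1;
}

/*
 * Unrecoverable MCE/HMI path: ask firmware for a platform-error reboot
 * first so that BMC/OCC see the error; only fall back to the plain
 * panic path when the call is missing or rejected.
 */
enum final_action handle_unrecoverable(cec_reboot2_fn reboot2, const char *msg)
{
    assert(msg != NULL);

    if (reboot2 != NULL && reboot2(REBOOT_PLATFORM_ERROR, msg) == 0)
        return ACTION_CHECKSTOP;

    fprintf(stderr, "panic: %s\n", msg);
    return ACTION_PANIC;
}
```

The fallback matters because the same kernel may boot on firmware that
does not provide the new call; in that case the behaviour stays as it is
today.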

The kernel changes have already been posted upstream and are listed
below:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-May/128341.html
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-May/128342.html
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132045.html
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132114.html 

The above patches need to be included in Ubuntu 14.04.3+.

We will update this bug with commit IDs once the above patches are
accepted upstream.

Contact Information = [email protected] 
 
---uname output---
Linux rcx2d403 3.19.0-26-generic #27 SMP Tue Aug 4 01:38:15 CDT 2015 ppc64le ppc64le ppc64le GNU/Linux
 
---Additional Hardware Info---
Habanero pass2 system 

 
Machine Type = OpenPower, Habanero 
 
---System Hang---
If the system is hung, it can be recovered by sending IPMI power off/on commands:
$ ipmitool -H <BMC> -I lanplus -U <user> -P <passwd> power off
$ ipmitool -H <BMC> -I lanplus -U <user> -P <passwd> power on

** Affects: ubuntu
     Importance: Undecided
         Status: New


** Tags: architecture-ppc64le bugnameltc-128601 severity-high 
targetmilestone-inin---

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1482343

Title:
  Trigger a checkstop on unrecoverable MCE/HMI errors to inform BMC/OCC
  about the error.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+bug/1482343/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
