[Kernel-packages] [Bug 1512593] Re: :Ubuntu NV: Panic timeout=0 means Ubuntu does not reboot and recover from HMI (e.g. Core Unit Checkstop)
** Changed in: linux (Ubuntu)
     Assignee: Canonical Kernel Team (canonical-kernel-team) => Chris J Arges (arges)

** Changed in: linux (Ubuntu)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu)
       Status: Triaged => In Progress

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1512593

Title:
  :Ubuntu NV: Panic timeout=0 means Ubuntu does not reboot and recover from HMI (e.g. Core Unit Checkstop)

Status in linux package in Ubuntu:
  In Progress

Bug description:
  Problem Description
  ==
  I attempted to inject a Core Unit Checkstop error by flipping Core FIR bit 5 on a K80 (Nvidia) 42L server (FSP - gp4fp1.aus.stglabs.ibm.com). I was expecting the server to crash due to a Sapphire assert (PEL with SRC BB821410) and a Sapphire dump to be collected in the process. However, on injecting the error, the host (Ubuntu NV) crashed and never recovered. OPAL, however, seemed to stay up, and the FSP logged a plethora of mailbox errors (B182953C).
  Error Inject
  --
  $ putscom pu.ex 10013100 5 1 1 -ib -p3 -c5
  s1.ex k0:n0:s0:p03:c5
  ecmd_ppc putscom pu.ex 10013100 5 1 1 -ib -p3 -c5

  Console message
  --
  [htx@gp4p01] /sys/devices/system/cpu#
  [ 1652.976253] Fatal Hypervisor Maintenance interrupt [Not recovered]
  [ 1652.976332] Error detail: Malfunction Alert
  [ 1652.976402] HMER: 8040
  [ 1652.976450] Kernel panic - not syncing: Unrecoverable HMI exception
  [ 1652.976467] CPU: 24 PID: 1261 Comm: kworker/24:1 Tainted: P OE 3.16.0-37-generic #51~14.04.1-Ubuntu
  [ 1652.976530] Workqueue: events hmi_event_handler
  [ 1652.976561] Call Trace:
  [ 1652.976571] [c000189bf9e0] [c0017330] show_stack+0x170/0x290 (unreliable)
  [ 1652.976647] [c000189bfac0] [c09eb8e4] dump_stack+0x90/0xbc
  [ 1652.976674] [c000189bfaf0] [c09e2b5c] panic+0x104/0x2a8
  [ 1652.976703] [c000189bfb80] [c007306c] hmi_event_handler+0x19c/0x2b0
  [ 1652.976732] [c000189bfc50] [c00d62dc] process_one_work+0x1ac/0x4d0
  [ 1652.976772] [c000189bfce0] [c00d6b80] worker_thread+0x190/0x630
  [ 1652.976800] [c000189bfd80] [c00e0024] kthread+0x114/0x140
  [ 1652.976837] [c000189bfe30] [c000a468] ret_from_kernel_thread+0x5c/0x74
  [ 1652.977085] ---[ end Kernel panic - not syncing: Unrecoverable HMI exception
  . . . .
  |--|
  | 0x50227CFC 06/14/2015 22:27:43 System Hypervisor Firmware mbox |
  | 0x50227CFC Processed Predictive Error B182953C | --> Unexpected mail box error , needs investigation
  |--|
  | 0x50227CDD 06/14/2015 22:27:21 System Hypervisor Firmware spif |
  | 0x50227CDD Processed Predictive Error B182951C | --> Unexpected mail box error , needs investigation
  |--|
  | 0x50227CC0 06/14/2015 22:26:57 System Hypervisor Firmware mbox |
  | 0x50227CC0 Processed Predictive Error B182953C | --> Unexpected mail box error , needs investigation
  |--|
  | 0x50227C9E 06/14/2015 22:26:31 System Hypervisor Firmware spif |
  | 0x50227C9E Processed Predictive Error B182951C | --> Unexpected mail box error , needs investigation
  |--|
  | 0x50227C88 06/14/2015 22:26:12 System Hypervisor Firmware mbox |
  | 0x50227C88 Processed Predictive Error B182953C | --> Unexpected mail box error , needs investigation
  |--|
  | 0x50227C7E 06/14/2015 22:26:06 System Hypervisor Firmware spif |
  | 0x50227C7E Processed Predictive Error B182951C | --> Unexpected mail box error , needs investigation
  |--|
  | 0x50227C4F 06/14/2015 22:25:27 System Hypervisor Firmware mbox |
  | 0x50227C4F Processed Predictive Error B182953C | --> Unexpected mail box error , needs investigation
  |--|
  | 0x50227C40 06/14/2015 22:25:16 System Hypervisor Firmware spif |
  | 0x50227C40 Processed Predictive Error B182951C | --> Unexpected mail box
[Kernel-packages] [Bug 1512593] Re: :Ubuntu NV: Panic timeout=0 means Ubuntu does not reboot and recover from HMI (e.g. Core Unit Checkstop)
** Package changed: ubuntu => linux (Ubuntu)

** Changed in: linux (Ubuntu)
       Status: New => Triaged

** Changed in: linux (Ubuntu)
     Assignee: Taco Screen team (taco-screen-team) => Canonical Kernel Team (canonical-kernel-team)
[Kernel-packages] [Bug 1512593] Re: :Ubuntu NV: Panic timeout=0 means Ubuntu does not reboot and recover from HMI (e.g. Core Unit Checkstop)
This is not a bug in our kernel config. It may be possible to change the userspace tool to check for this behavior and set the sysctl (kernel.panic) to the desired value. Marking as Invalid for now; please feel free to reopen if you feel this is still an issue. Thanks!

** Changed in: linux (Ubuntu)
       Status: In Progress => Invalid
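The suggestion above, having a userspace tool check the panic timeout and adjust the sysctl, could look roughly like the following Python sketch. It assumes the sysctl in question is `kernel.panic` (exposed as `/proc/sys/kernel/panic`), whose documented semantics are: 0 means the kernel waits forever after a panic, a negative value means reboot immediately, and a positive value means reboot after that many seconds. The function names are hypothetical, and this is not part of any existing Ubuntu or HTX tool.

```python
from pathlib import Path

# Location of the kernel.panic sysctl on Linux.
PANIC_SYSCTL = Path("/proc/sys/kernel/panic")


def parse_panic_timeout(text: str) -> int:
    """Parse the contents of /proc/sys/kernel/panic (e.g. "0\n") into an int."""
    return int(text.strip())


def describe_panic_timeout(value: int) -> str:
    """Describe what the kernel does after a panic for a given kernel.panic value."""
    if value == 0:
        # The case reported in this bug: the kernel loops forever after panicking.
        return "no reboot: the host stays hung after a panic"
    if value < 0:
        return "reboot immediately after a panic"
    return f"reboot {value} second(s) after a panic"


def panic_would_hang(value: int) -> bool:
    """True if a panic (such as an unrecoverable HMI) would leave the host hung."""
    return value == 0


def main() -> None:
    if not PANIC_SYSCTL.exists():
        print(f"{PANIC_SYSCTL} not found; is this a Linux host?")
        return
    value = parse_panic_timeout(PANIC_SYSCTL.read_text())
    print(f"kernel.panic = {value}: {describe_panic_timeout(value)}")
    if panic_would_hang(value):
        # 10 seconds is only an illustrative value, not one recommended in this bug.
        print("consider: sudo sysctl -w kernel.panic=10")


if __name__ == "__main__":
    main()
```

`sysctl -w` changes the value only until the next boot. To make it persistent, the same setting can go in a file under `/etc/sysctl.d/` (for example `kernel.panic = 10`), or be passed on the kernel command line as `panic=10`.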