[Kernel-packages] [Bug 1512593] Re: :Ubuntu NV: Panic timeout=0 means Ubuntu does not reboot and recover from HMI (e.g. Core Unit Checkstop)

2015-11-03 Thread Chris J Arges
** Changed in: linux (Ubuntu)
   Assignee: Canonical Kernel Team (canonical-kernel-team) => Chris J Arges (arges)

** Changed in: linux (Ubuntu)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu)
   Status: Triaged => In Progress

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1512593

Title:
  :Ubuntu NV: Panic timeout=0 means Ubuntu does not reboot and recover
  from HMI (e.g. Core Unit Checkstop)

Status in linux package in Ubuntu:
  In Progress

Bug description:
  Problem Description
  ==
  I attempted to inject a Core Unit Checkstop error by flipping Core FIR bit 5
  on a K80 (Nvidia) 42L server (FSP: gp4fp1.aus.stglabs.ibm.com). I expected
  the server to crash due to a Sapphire assert (PEL with SRC BB821410) and a
  Sapphire dump to be collected in the process.

  However, on injecting the error, the host (Ubuntu NV) crashed and never
  recovered. OPAL, however, seemed to stay up, and the FSP logged a
  plethora of mailbox errors (B182953C).

  Error Inject
  --
  $ putscom pu.ex 10013100 5 1 1 -ib -p3 -c5
  s1.ex   k0:n0:s0:p03:c5
  ecmd_ppc putscom pu.ex 10013100 5 1 1 -ib -p3 -c5

  
  Console message
  --
  [htx@gp4p01] /sys/devices/system/cpu#
  [ 1652.976253] Fatal Hypervisor Maintenance interrupt [Not recovered]
  [ 1652.976332]  Error detail: Malfunction Alert
  [ 1652.976402]  HMER: 8040
  [ 1652.976450] Kernel panic - not syncing: Unrecoverable HMI exception
  [ 1652.976467] CPU: 24 PID: 1261 Comm: kworker/24:1 Tainted: P   OE 3.16.0-37-generic #51~14.04.1-Ubuntu
  [ 1652.976530] Workqueue: events hmi_event_handler
  [ 1652.976561] Call Trace:
  [ 1652.976571] [c000189bf9e0] [c0017330] show_stack+0x170/0x290 (unreliable)
  [ 1652.976647] [c000189bfac0] [c09eb8e4] dump_stack+0x90/0xbc
  [ 1652.976674] [c000189bfaf0] [c09e2b5c] panic+0x104/0x2a8
  [ 1652.976703] [c000189bfb80] [c007306c] hmi_event_handler+0x19c/0x2b0
  [ 1652.976732] [c000189bfc50] [c00d62dc] process_one_work+0x1ac/0x4d0
  [ 1652.976772] [c000189bfce0] [c00d6b80] worker_thread+0x190/0x630
  [ 1652.976800] [c000189bfd80] [c00e0024] kthread+0x114/0x140
  [ 1652.976837] [c000189bfe30] [c000a468] ret_from_kernel_thread+0x5c/0x74
  [ 1652.977085] ---[ end Kernel panic - not syncing: Unrecoverable HMI exception
   

  .
  .
  .
  .
  
  |--|
  | 0x50227CFC 06/14/2015 22:27:43 System Hypervisor Firmware   mbox |
  | 0x50227CFC Processed   Predictive Error B182953C | --> Unexpected mail box error, needs investigation
  |--|
  | 0x50227CDD 06/14/2015 22:27:21 System Hypervisor Firmware   spif |
  | 0x50227CDD Processed   Predictive Error B182951C | --> Unexpected mail box error, needs investigation
  |--|
  | 0x50227CC0 06/14/2015 22:26:57 System Hypervisor Firmware   mbox |
  | 0x50227CC0 Processed   Predictive Error B182953C | --> Unexpected mail box error, needs investigation
  |--|
  | 0x50227C9E 06/14/2015 22:26:31 System Hypervisor Firmware   spif |
  | 0x50227C9E Processed   Predictive Error B182951C | --> Unexpected mail box error, needs investigation
  |--|
  | 0x50227C88 06/14/2015 22:26:12 System Hypervisor Firmware   mbox |
  | 0x50227C88 Processed   Predictive Error B182953C | --> Unexpected mail box error, needs investigation
  |--|
  | 0x50227C7E 06/14/2015 22:26:06 System Hypervisor Firmware   spif |
  | 0x50227C7E Processed   Predictive Error B182951C | --> Unexpected mail box error, needs investigation
  |--|
  | 0x50227C4F 06/14/2015 22:25:27 System Hypervisor Firmware   mbox |
  | 0x50227C4F Processed   Predictive Error B182953C | --> Unexpected mail box error, needs investigation
  |--|
  | 0x50227C40 06/14/2015 22:25:16 System Hypervisor Firmware   spif |
  | 0x50227C40 Processed   Predictive Error B182951C | --> Unexpected mail box

[Kernel-packages] [Bug 1512593] Re: :Ubuntu NV: Panic timeout=0 means Ubuntu does not reboot and recover from HMI (e.g. Core Unit Checkstop)

2015-11-03 Thread Leann Ogasawara
** Package changed: ubuntu => linux (Ubuntu)

** Changed in: linux (Ubuntu)
   Status: New => Triaged

** Changed in: linux (Ubuntu)
   Assignee: Taco Screen team (taco-screen-team) => Canonical Kernel Team (canonical-kernel-team)

[Kernel-packages] [Bug 1512593] Re: :Ubuntu NV: Panic timeout=0 means Ubuntu does not reboot and recover from HMI (e.g. Core Unit Checkstop)

2015-11-03 Thread Chris J Arges
This is not a bug in our kernel config. It may be possible to change the
userspace tool so that it checks for this behavior and sets the sysctl to
the desired value.
Marking as Invalid for now; please feel free to reopen if you feel this is
still an issue.
Thanks!

** Changed in: linux (Ubuntu)
   Status: In Progress => Invalid
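
The comment above suggests having a userspace tool check and set the
sysctl. A minimal sketch of such a check follows, assuming the sysctl in
question is kernel.panic (exposed as /proc/sys/kernel/panic), which the
thread does not name explicitly. The function names are hypothetical and
are not taken from any existing tool. In the kernel's documented
semantics, 0 means wait forever, a positive value means reboot after that
many seconds, and a negative value means reboot immediately.

```python
from pathlib import Path

# Assumption: the "sysctl value" in the comment is kernel.panic,
# which procfs exposes at this path on Linux.
PANIC_SYSCTL = Path("/proc/sys/kernel/panic")


def describe_panic_timeout(timeout: int) -> str:
    """Explain what a kernel.panic value means for behavior after a panic."""
    if timeout == 0:
        return "no automatic reboot: the kernel waits forever"
    if timeout < 0:
        return "reboot immediately after a panic"
    return f"reboot {timeout} second(s) after a panic"


def read_panic_timeout(path: Path = PANIC_SYSCTL) -> int:
    """Read the current timeout; the file holds a single decimal integer."""
    return int(path.read_text().strip())


def write_panic_timeout(timeout: int, path: Path = PANIC_SYSCTL) -> None:
    """Set the timeout (writing the real procfs file requires root)."""
    path.write_text(f"{timeout}\n")
```

On a host configured as in the bug title, describe_panic_timeout(read_panic_timeout())
would report the no-reboot case. A tool that wants automatic recovery could
call write_panic_timeout() with a positive number of seconds; the right
value is a policy choice that this sketch does not make.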
