[Bug 1661684] [NEW] ISST-LTE:pVM:roselp4:ubuntu 16.04.2: drop in xmon when running dlpar tests under stress

bugproxy Fri, 03 Feb 2017 09:21:12 -0800

Public bug reported:

== Comment: #0 - Ping Tian Han <[email protected]> - 2016-12-26 21:59:52 ==
---Problem Description---
While running DLPAR tests (slot, cpu and mem) under stress on roselp4, the
system dropped into xmon:


roselp4 login: [   95.511790] sysrq: SysRq : Changing Loglevel
[   95.511816] sysrq: Loglevel set to 9
[  289.363833] mlx4_en 0292:60:00.0: removed PHC
[  293.123896] iommu: Removing device 0292:60:00.0 from group 3
[  303.173744] pci_bus 0292:60: busn_res: [bus 60-ff] is released
[  303.173865] rpadlpar_io: slot PHB 658 removed
[  335.853779] iommu: Removing device 0021:01:00.0 from group 0
[  345.893764] pci_bus 0021:01: busn_res: [bus 01-ff] is released
[  345.893869] rpadlpar_io: slot PHB 33 removed
[  382.204003] min_free_kbytes is not updated to 16885 because user defined 
value 551564 is preferred
[  446.143648] cpu 152 (hwid 152) Ready to die...
[  446.464057] cpu 153 (hwid 153) Ready to die...
[  446.473525] cpu 154 (hwid 154) Ready to die...
[  446.474077] cpu 155 (hwid 155) Ready to die...
[  446.483529] cpu 156 (hwid 156) Ready to die...
[  446.493532] cpu 157 (hwid 157) Ready to die...
[  446.494078] cpu 158 (hwid 158) Ready to die...
[  446.503527] cpu 159 (hwid 159) Ready to die...
[  446.664534] cpu 144 (hwid 144) Ready to die...
[  446.964113] cpu 145 (hwid 145) Ready to die...
[  446.973525] cpu 146 (hwid 146) Ready to die...
[  446.974094] cpu 147 (hwid 147) Ready to die...
[  446.983944] cpu 148 (hwid 148) Ready to die...
[  446.984062] cpu 149 (hwid 149) Ready to die...
[  446.993518] cpu 150 (hwid 150) Ready to die...
[  446.993543] Querying DEAD? cpu 150 (150) shows 2
[  446.994098] cpu 151 (hwid 151) Ready to die...
[  447.133726] cpu 136 (hwid 136) Ready to die...
[  447.403532] cpu 137 (hwid 137) Ready to die...
[  447.403772] cpu 138 (hwid 138) Ready to die...
[  447.403839] cpu 139 (hwid 139) Ready to die...
[  447.403887] cpu 140 (hwid 140) Ready to die...
[  447.403937] cpu 141 (hwid 141) Ready to die...
[  447.403979] cpu 142 (hwid 142) Ready to die...
[  447.404038] cpu 143 (hwid 143) Ready to die...
[  447.513546] cpu 128 (hwid 128) Ready to die...
[  447.693533] cpu 129 (hwid 129) Ready to die...
[  447.693999] cpu 130 (hwid 130) Ready to die...
[  447.703530] cpu 131 (hwid 131) Ready to die...
[  447.704087] Querying DEAD? cpu 132 (132) shows 2
[  447.704102] cpu 132 (hwid 132) Ready to die...
[  447.713534] cpu 133 (hwid 133) Ready to die...
[  447.714064] Querying DEAD? cpu 134 (134) shows 2
cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
    pc: 000000001ec3072c
    lr: 000000001ec2fee0
    sp: 1faf6bd0
   msr: 8000000102801000
   dar: 212d6c1a2a20c
 dsisr: 42000000
  current = 0xc000000474c6d600
  paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
    pid   = 0, comm = swapper/134
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 
21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
WARNING: exception is not recoverable, can't continue
enter ? for help
SP (1faf6bd0) is in userspace
86:mon> 
86:mon> t
SP (1faf6bd0) is in userspace
86:mon> r
R00 = 000212d6c1a2a20f   R16 = c000000000ff1c38
R01 = 000000001faf6bd0   R17 = c000000474c9c080
R02 = 000000001ed1be80   R18 = c000000474c9c000
R03 = 000000001faf6c80   R19 = c0000000013fdf08
R04 = 0000000000000018   R20 = c000000474c9c080
R05 = 00000000000000e0   R21 = c0000000013e8ad0
R06 = 0000000000009e04   R22 = c000000474c9c000
R07 = 000000001faf6d30   R23 = c00000047a9a1c40
R08 = 000000001faf6d28   R24 = 0000000000000002
R09 = 000212d6c1a2a20c   R25 = c000000000fd4e6c
R10 = 000000001ec1b118   R26 = c000000000fd4e6c
R11 = 000000001ee7e040   R27 = c0000000014daae0
R12 = 000000000163c1d8   R28 = 0000000000000000
R13 = c000000007b6b600   R29 = 0000000000000086
R14 = c0000000014defb0   R30 = c000000000fd4e68
R15 = 0000000000000001   R31 = 000000001faf6bd0
pc  = 000000001ec3072c
cfar= 000000001ec2fedc
lr  = 000000001ec2fee0
msr = 8000000102801000   cr  = 42000000
ctr = 000000001ec48788   xer = 0000000000000020   trap =  300
dar = 000212d6c1a2a20c   dsisr = 42000000
86:mon> 


 
Contact Information = Ping Tian Han/[email protected] 
 
---uname output---
Linux roselp4 4.8.0-34-generic #36~16.04.1-Ubuntu SMP Wed Dec 21 18:53:20 UTC 
2016 ppc64le ppc64le ppc64le GNU/Linux
 
Machine Type = lpar 
 
---Debugger Data---
cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
    pc: 000000001ec3072c
    lr: 000000001ec2fee0
    sp: 1faf6bd0
   msr: 8000000102801000
   dar: 212d6c1a2a20c
 dsisr: 42000000
  current = 0xc000000474c6d600
  paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
    pid   = 0, comm = swapper/134
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 
21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
WARNING: exception is not recoverable, can't continue
enter ? for help
SP (1faf6bd0) is in userspace
86:mon> e
cpu 0x86: Vector: 300 (Data Access) at [c000000007b0fd40]
    pc: 000000001ec3072c
    lr: 000000001ec2fee0
    sp: 1faf6bd0
   msr: 8000000102801000
   dar: 212d6c1a2a20c
 dsisr: 42000000
  current = 0xc000000474c6d600
  paca    = 0xc000000007b6b600   softe: 0        irq_happened: 0x01
    pid   = 0, comm = swapper/134
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 
21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
86:mon> t
SP (1faf6bd0) is in userspace
86:mon> r
R00 = 000212d6c1a2a20f   R16 = c000000000ff1c38
R01 = 000000001faf6bd0   R17 = c000000474c9c080
R02 = 000000001ed1be80   R18 = c000000474c9c000
R03 = 000000001faf6c80   R19 = c0000000013fdf08
R04 = 0000000000000018   R20 = c000000474c9c080
R05 = 00000000000000e0   R21 = c0000000013e8ad0
R06 = 0000000000009e04   R22 = c000000474c9c000
R07 = 000000001faf6d30   R23 = c00000047a9a1c40
R08 = 000000001faf6d28   R24 = 0000000000000002
R09 = 000212d6c1a2a20c   R25 = c000000000fd4e6c
R10 = 000000001ec1b118   R26 = c000000000fd4e6c
R11 = 000000001ee7e040   R27 = c0000000014daae0
R12 = 000000000163c1d8   R28 = 0000000000000000
R13 = c000000007b6b600   R29 = 0000000000000086
R14 = c0000000014defb0   R30 = c000000000fd4e68
R15 = 0000000000000001   R31 = 000000001faf6bd0
pc  = 000000001ec3072c
cfar= 000000001ec2fedc
lr  = 000000001ec2fee0
msr = 8000000102801000   cr  = 42000000
ctr = 000000001ec48788   xer = 0000000000000020   trap =  300
dar = 000212d6c1a2a20c   dsisr = 42000000
86:mon>  
 
---System Hang---
 drop into xmon
 
---Steps to Reproduce---
 1. Run IO stress tests on roselp4.
 2. Run slot/cpu/mem DLPAR tests on roselp4.
 
Stack trace output:
 no
 
Oops output:
 no
 
System Dump Info:
  The system was configured to capture a dump, however a dump was not produced.
 
*Additional Instructions for Ping Tian Han/[email protected]: 
-Post a private note with access information to the machine that is currently 
in the debugger. 
-Attach sysctl -a output to the bug.

== Comment: #4 - PAWAN K. SINGH <[email protected]> - 2016-12-27 02:19:58 ==


== Comment: #7 - Ping Tian Han <[email protected]> - 2016-12-27 20:53:59 ==


== Comment: #8 - Ping Tian Han <[email protected]> - 2016-12-27 20:59:04 ==


== Comment: #14 - Ping Tian Han <[email protected]> - 2016-12-28 03:17:50 ==
FYI. With default min_free_kbytes, roselp4 still drops into xmon:

Ubuntu 16.04.1 LTS roselp4 hvc0

roselp4 login: [  260.094141] sysrq: SysRq : Changing Loglevel
[  260.094161] sysrq: Loglevel set to 9
[  266.614273] cpu 152 (hwid 152) Ready to die...
[  266.794136] cpu 153 (hwid 153) Ready to die...
[  266.794694] cpu 154 (hwid 154) Ready to die...
[  266.804248] cpu 155 (hwid 155) Ready to die...
[  266.804302] cpu 156 (hwid 156) Ready to die...
[  266.804354] cpu 157 (hwid 157) Ready to die...
[  266.804410] cpu 158 (hwid 158) Ready to die...
[  266.804465] cpu 159 (hwid 159) Ready to die...
[  266.935065] cpu 144 (hwid 144) Ready to die...
[  267.144140] cpu 145 (hwid 145) Ready to die...
[  267.144683] cpu 146 (hwid 146) Ready to die...
[  267.154692] cpu 147 (hwid 147) Ready to die...
[  267.164134] cpu 148 (hwid 148) Ready to die...
[  267.164702] cpu 149 (hwid 149) Ready to die...
[  267.174819] cpu 150 (hwid 150) Ready to die...
[  267.184684] cpu 151 (hwid 151) Ready to die...
[  267.324831] cpu 136 (hwid 136) Ready to die...
[  267.614138] cpu 137 (hwid 137) Ready to die...
[  267.614745] cpu 138 (hwid 138) Ready to die...
[  267.624135] cpu 139 (hwid 139) Ready to die...
[  267.624716] cpu 140 (hwid 140) Ready to die...
[  267.634637] Querying DEAD? cpu 141 (141) shows 2
cpu 0x8d: Vector: 300 (Data Access) at [c000000007ad7d40]
    pc: 000000001ec26be0
    lr: 000000001ec26ab4
    sp: 1faf6920
   msr: 8000000102801000
   dar: fffffe801faf6bc0
 dsisr: 40000000
  current = 0xc000000474c51e00
  paca    = 0xc000000007b6f500   softe: 0        irq_happened: 0x01
    pid   = 0, comm = swapper/141
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 
21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
WARNING: exception is not recoverable, can't continue
enter ? for help
SP (1faf6920) is in userspace
8d:mon> cpu 0x8e: Vector: 300 (Data Access) at [c000000007acfd40]
    pc: 000000001ec22614
    lr: 000000001ec22d5c
    sp: 1faf6b00
   msr: 8000000102801000
   dar: 20000000
 dsisr: 40000000
  current = 0xc000000474c7c800
  paca    = 0xc000000007b6fe00   softe: 0        irq_happened: 0x01
    pid   = 0, comm = swapper/142
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 
21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
WARNING: exception is not recoverable, can't continue

8d:mon> 
Unrecognized command: \x1be (type ? for help)
8d:mon> e
cpu 0x8d: Vector: 300 (Data Access) at [c000000007ad7d40]
    pc: 000000001ec26be0
    lr: 000000001ec26ab4
    sp: 1faf6920
   msr: 8000000102801000
   dar: fffffe801faf6bc0
 dsisr: 40000000
  current = 0xc000000474c51e00
  paca    = 0xc000000007b6f500   softe: 0        irq_happened: 0x01
    pid   = 0, comm = swapper/141
Linux version 4.8.0-34-generic (buildd@bos01-ppc64el-026) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #36~16.04.1-Ubuntu SMP Wed Dec 
21 18:53:20 UTC 2016 (Ubuntu 4.8.0-34.36~16.04.1-generic 4.8.11)
8d:mon> t
SP (1faf6920) is in userspace
8d:mon>
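
In both captures, the first CPU to enter xmon is the one named in the
"Querying DEAD?" line printed just before it (0x86 = 134 in the first run,
0x8d = 141 here). Below is a minimal sketch for checking that correlation in a
saved console log. The helper names are made up for illustration, and the
excerpt is copied from the capture above.

```python
import re

# Excerpt from the second console capture in this report.
console = """\
[  267.624716] cpu 140 (hwid 140) Ready to die...
[  267.634637] Querying DEAD? cpu 141 (141) shows 2
cpu 0x8d: Vector: 300 (Data Access) at [c000000007ad7d40]
"""


def last_queried_cpu(text):
    """CPU number from the last 'Querying DEAD?' line in text, or None."""
    hits = re.findall(r"Querying DEAD\? cpu (\d+)", text)
    return int(hits[-1]) if hits else None


def first_faulting_cpu(text):
    """CPU number from the first xmon exception banner (printed in hex), or None."""
    match = re.search(r"^cpu 0x([0-9a-f]+): Vector", text, re.MULTILINE)
    return int(match.group(1), 16) if match else None


queried = last_queried_cpu(console)
faulting = first_faulting_cpu(console)
print(f"last queried: {queried}, first faulting: {faulting}, "
      f"match: {queried == faulting}")
```

The match only shows that the fault hits a CPU that is in the middle of being
offlined. It does not by itself say why.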

== Comment: #15 - Ping Tian Han <[email protected]> - 2016-12-28 03:22:22 ==


== Comment: #19 - Ping Tian Han <[email protected]> - 2016-12-29 00:13:39 ==


== Comment: #20 - Kevin W. Rudd <[email protected]> - 2016-12-29 12:49:15 ==
Nathan or Laurent,

In the dmesg output, I'm seeing similar behavior to the problem reported
in Bug 146931.  The following error and completely bogus NIP/LR values
appear to be the same scenario:

pseries-hotplug-cpu: Failed to release drc (10000098) for CPU PowerPC,POWER8, rc: -17

The NIP and LR values appear to be completely bogus, so I'm not sure
what about the Bug 146931 scenario matched the issue being tracked in
Bug 146776.

This looks to be a side issue of hotplugging the CPUs.

Please review and provide your thoughts on this observed behavior.

Thanks.


== Comment: #25 - Nathan D. Fontenot <[email protected]> - 2017-01-13 13:32:51 ==
My first thought is that the swapper thread for a cpu appears to have been
scheduled to run on a cpu that has been removed. This may explain the bogus pc
and lr values. There have been a lot of updates to the generic kernel cpu
hotplug code recently; perhaps one of them is causing this. It would be
interesting to see whether this occurs on older kernels.
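
To make the "bogus pc and lr" observation concrete, here is a small sketch
that decodes the MSR value from the first xmon dump and checks whether pc, lr
and sp fall in kernel address space. The bit positions are assumed from the
Linux powerpc MSR layout (arch/powerpc/include/asm/reg.h), and
0xc000000000000000 is assumed as the kernel base. Both should be verified
against the kernel in use.

```python
# Sketch only. Bit positions are assumed from the Linux powerpc MSR layout;
# verify against asm/reg.h for the kernel in use.
MSR_BITS = {
    "SF": 63, "HV": 60, "TM": 32, "VEC": 25, "VSX": 23,
    "EE": 15, "PR": 14, "FP": 13, "ME": 12,
    "IR": 5, "DR": 4, "RI": 1, "LE": 0,
}

# Assumed base of ppc64 kernel virtual addresses.
KERNEL_BASE = 0xC000000000000000


def decode_msr(msr):
    """Return the sorted names of the MSR bits that are set."""
    return sorted(name for name, bit in MSR_BITS.items() if (msr >> bit) & 1)


def is_kernel_address(addr):
    """True if addr lies at or above the assumed kernel base."""
    return addr >= KERNEL_BASE


# Values copied from the first xmon dump in this report.
dump = {
    "pc": 0x1EC3072C,
    "lr": 0x1EC2FEE0,
    "sp": 0x1FAF6BD0,
    "current": 0xC000000474C6D600,
}
msr_flags = decode_msr(0x8000000102801000)

print("MSR bits set:", msr_flags)
for name, value in dump.items():
    where = "kernel" if is_kernel_address(value) else "not kernel"
    print(f"{name} = {value:#x} ({where})")
```

For the value in the dump, RI, IR and DR all come out clear. RI clear is
consistent with xmon's "exception is not recoverable" warning. IR and DR clear
suggests address translation was off, so the low pc/lr/sp values may be
real-mode addresses rather than user-space ones. That is an inference from the
dump, not something confirmed in this report.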

As for the rtas set-indicator call returning -17, I don't know how that
is possible. A return value of -17 is not even a defined return value in
the PAPR. This could be a side effect of whatever is causing the crash,
though, so the crash should be resolved first; then we can see whether
this still occurs.
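
Since -17 is not a PAPR status, one thing worth ruling out is that the value is
a Linux errno that reached the rc path, rather than something RTAS returned.
The sketch below only looks up what errno 17 is called. Whether that is
actually where the value comes from is unverified.

```python
import errno

rc = -17  # value from the "Failed to release drc" message quoted in this bug

# Look up the symbolic errno name, if any, for the magnitude of rc.
name = errno.errorcode.get(-rc)
print(f"rc {rc} read as a Linux errno: {name}")
```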

== Comment: #31 - Fernando Seiti Furusato <[email protected]> - 2017-02-02 11:37:37 ==
Mirroring so Canonical is aware of this bug.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New


** Tags: architecture-ppc64le bugnameltc-150119 severity-high 
targetmilestone-inin---

** Tags added: architecture-ppc64le bugnameltc-150119 severity-high
targetmilestone-inin---

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1661684

