Public bug reported:

== Comment: #0 - Pridhiviraj Paidipeddi <[email protected]> - 2016-08-13 
08:28:54 ==
---Problem Description---
Install a P8 PowerNV 8284-22A machine with the latest FW860 firmware (build 
SV860_028), then install Ubuntu 16.10 on top of it. During EEH frozen PE error 
injection, the kernel reports "Oops: Kernel access of bad area, sig: 11 [#1]".
 
Contact Information = [email protected] 
 
---uname output---
Linux lep8b 4.4.0-34-generic #53-Ubuntu SMP Wed Jul 27 16:04:07 UTC 2016 
ppc64le ppc64le ppc64le GNU/Linux
 
Machine Type = PowerNV 8284-22A 
 
---System Hang---
 The system hangs and needs a hard power off/on to bring it up again.
 
---Debugger---
A debugger is not configured
 
---Steps to Reproduce---
1. Install FW860 firmware level SV860_028 on a P8 PowerNV 8284-22A machine.
2. Install Ubuntu 16.10 on top of it.
3. Inject the frozen PE EEH error below:
echo 0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct && lspci -ns 0004:00:00.0; echo $?
4. A kernel Oops follows almost immediately.

 
*Additional Instructions for [email protected]: 
-Post a private note with access information to the machine that the bug is 
occurring on.


Call Traces:
root@lep8b:~# echo 0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct && 
lspci -ns 0004:00:00.0; echo $?
[  271.110859] EEH: Frozen PE#0 on PHB#4 detected
[  271.110967] EEH: PE location: N/A, PHB location: N/A
0004:00:00.0 0604: 1014:03dc
0
root@lep8b:~# [  277.108098] Unable to handle kernel paging request for data at 
address 0x00000010
[  277.108183] Faulting instruction address: 0xc000000000083c7c
[  277.108198] Oops: Kernel access of bad area, sig: 11 [#1]
[  277.108253] SMP NR_CPUS=2048 NUMA PowerNV
[  277.108310] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp 
bridge stp llc kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter 
ip6_tables iptable_filter ip_tables x_tables leds_powernv ibmpowernv 
powernv_rng ipmi_powernv uio_pdrv_genirq ipmi_msghandler uio ib_iser rdma_cm 
iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi 
scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 
multipath linear ses enclosure be2net lpfc vxlan ip6_udp_tunnel udp_tunnel 
scsi_transport_fc ipr
[  277.109391] CPU: 9 PID: 973 Comm: eehd Not tainted 4.4.0-34-generic 
#53-Ubuntu
[  277.109467] task: c000000feb3c2a20 ti: c000000feb408000 task.ti: 
c000000feb408000
[  277.109542] NIP: c000000000083c7c LR: c000000000083c78 CTR: c000000000083c20
[  277.109617] REGS: c000000feb40b760 TRAP: 0300   Not tainted  
(4.4.0-34-generic)
[  277.109691] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28008822  
XER: 00000000
[  277.109880] CFAR: c000000000008468 DAR: 0000000000000010 DSISR: 40000000 
SOFTE: 1 
GPR00: c000000000083c78 c000000feb40b9e0 c0000000015b5d00 0000000000000000 
GPR04: 0000000000000001 c000000feb40bac0 c000002d74b54220 0000000000000f9f 
GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000026 
GPR12: c000000000083c20 c000000007b45580 c0000000000e63d8 c000002d74c40100 
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000d42468 
GPR24: c000000000d42440 0000000000000100 c000000000036460 0000000000000000 
GPR28: c00000000161a3f0 0000000000000001 c000002ffff81000 c0000000fe440000 
[  277.110878] NIP [c000000000083c7c] pnv_eeh_reset+0x5c/0x170
[  277.110931] LR [c000000000083c78] pnv_eeh_reset+0x58/0x170
[  277.110981] Call Trace:
[  277.111009] [c000000feb40b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 
(unreliable)
[  277.111098] [c000000feb40ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[  277.111175] [c000000feb40bb00] [c000000000af472c] eeh_reset_device+0xd8/0x228
[  277.111255] [c000000feb40bba0] [c00000000003c4c0] 
eeh_handle_normal_event+0x390/0x440
[  277.111429] [c000000feb40bc20] [c00000000003c964] 
eeh_handle_event+0x184/0x370
[  277.111601] [c000000feb40bcd0] [c00000000003cd28] 
eeh_event_handler+0x1d8/0x1e0
[  277.111772] [c000000feb40bd80] [c0000000000e64e0] kthread+0x110/0x130
[  277.111910] [c000000feb40be30] [c000000000009538] 
ret_from_kernel_thread+0x5c/0xa4
[  277.112068] Instruction dump:
[  277.112143] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 
2f890002 
[  277.112385] 419e0054 7fe3fb78 4bfb7065 60000000 <e9230010> 2fa90000 419e00dc 
e9290010 
[  277.112629] ---[ end trace a6aa80c26ba676f6 ]---
[  277.116859] 
[  277.116910] Sending IPI to other CPUs
[  277.118085] IPI complete
[  277.120271] kexec: waiting for cpu 0 (physical 32) to enter OPAL
 -> smp_release_cpus()
spinning_secondaries = 191
 <- smp_release_cpus()
 <- setup_system()
[    0.397633] Kernel panic - not syncing: Out of memory and no killable 
processes...
[    0.397633] 
[    0.397769] CPU: 4 PID: 1 Comm: swapper/1 Not tainted 4.4.0-34-generic 
#53-Ubuntu
[    0.397843] Call Trace:
[    0.397870] [c00000000c583190] [c000000008af983c] dump_stack+0xb0/0xf0 
(unreliable)
[    0.397959] [c00000000c5831d0] [c000000008af5a70] panic+0x100/0x2c0
[    0.398035] [c00000000c583260] [c000000008231e04] out_of_memory+0x5e4/0x5f0
[    0.398114] [c00000000c583310] [c00000000823a434] 
__alloc_pages_nodemask+0xc54/0xc90
[    0.398204] [c00000000c583500] [c0000000082a0a6c] 
alloc_page_interleave+0x6c/0xe0
[    0.398292] [c00000000c583550] [c0000000082a1558] 
alloc_pages_current+0x138/0x1a0
[    0.398381] [c00000000c5835a0] [c00000000822cdcc] 
__page_cache_alloc+0x11c/0x160
[    0.398470] [c00000000c5835e0] [c00000000822cf84] 
pagecache_get_page+0x174/0x2a0
[    0.398558] [c00000000c583650] [c00000000822d4b4] 
grab_cache_page_write_begin+0x54/0x80
[    0.398646] [c00000000c583690] [c00000000831d484] 
simple_write_begin+0x54/0x180
[    0.398735] [c00000000c5836e0] [c00000000822ca64] 
generic_perform_write+0x104/0x280
[    0.398823] [c00000000c583780] [c00000000822ed08] 
__generic_file_write_iter+0x208/0x250
[    0.398912] [c00000000c5837e0] [c00000000822ee40] 
generic_file_write_iter+0xf0/0x280
[    0.399000] [c00000000c583830] [c0000000082e1844] new_sync_write+0xc4/0x120
[    0.399076] [c00000000c5838d0] [c0000000082e2640] vfs_write+0xc0/0x230
[    0.399152] [c00000000c583920] [c0000000082e367c] SyS_write+0x6c/0x110
[    0.399229] [c00000000c583970] [c000000008ea700c] xwrite+0x4c/0xb4
[    0.399305] [c00000000c5839b0] [c000000008ea7164] do_copy+0xf0/0x170
[    0.399381] [c00000000c5839e0] [c000000008ea6774] write_buffer+0x5c/0x88
[    0.399458] [c00000000c583a10] [c000000008ea67fc] flush_buffer+0x5c/0xf0
[    0.399534] [c00000000c583a60] [c000000008eea034] __gunzip+0x378/0x470
[    0.399610] [c00000000c583ae0] [c000000008ea75ac] 
unpack_to_rootfs+0x1f8/0x34c
[    0.399699] [c00000000c583ba0] [c000000008ea7910] populate_rootfs+0x94/0x164
[    0.399775] [c00000000c583c20] [c00000000800b49c] do_one_initcall+0x12c/0x2a0
[    0.399852] [c00000000c583cf0] [c000000008ea4204] 
kernel_init_freeable+0x28c/0x37c
[    0.399940] [c00000000c583dc0] [c00000000800be0c] kernel_init+0x2c/0x160
[    0.400016] [c00000000c583e30] [c000000008009538] 
ret_from_kernel_thread+0x5c/0xa4
[    0.418756] ---[ end Kernel panic - not syncing: Out of memory and no 
killable processes...
[    0.418756] 


root@lep8b:~# uname -a
Linux lep8b 4.4.0-34-generic #53-Ubuntu SMP Wed Jul 27 16:04:07 UTC 2016 
ppc64le ppc64le ppc64le GNU/Linux
root@lep8b:~# cat /etc/os-release 
NAME="Ubuntu"
VERSION="16.10 (Yakkety Yak)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.10"
VERSION_ID="16.10"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
UBUNTU_CODENAME=yakkety
root@lep8b:~# update_flash -d
Current firwmare version :
  P side    : FW860.00 (SV860_026)
  T side    : FW860.00 (SV860_028)
  Boot side : FW860.00 (SV860_028)
root@lep8b:~# cat /sys/firmware/opal/msglog | grep -i skiboot
[45182541432,5] SkiBoot skiboot-5.3.0-rc2 starting...
root@lep8b:~# 
root@lep8b:~# lspci
0000:00:00.0 PCI bridge: IBM Device 03dc
0000:01:00.0 RAID bus controller: IBM Obsidian-E PCI-E SCSI controller (rev 01)
0001:00:00.0 PCI bridge: IBM Device 03dc
0001:01:00.0 PCI bridge: PLX Technology, Inc. PEX 8732 32-lane, 8-Port PCI 
Express Gen 3 (8.0 GT/s) Switch (rev ca)
0001:02:01.0 PCI bridge: PLX Technology, Inc. PEX 8732 32-lane, 8-Port PCI 
Express Gen 3 (8.0 GT/s) Switch (rev ca)
0001:02:08.0 PCI bridge: PLX Technology, Inc. PEX 8732 32-lane, 8-Port PCI 
Express Gen 3 (8.0 GT/s) Switch (rev ca)
0001:02:09.0 PCI bridge: PLX Technology, Inc. PEX 8732 32-lane, 8-Port PCI 
Express Gen 3 (8.0 GT/s) Switch (rev ca)
0001:03:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 
Gigabit Ethernet PCIe (rev 01)
0001:03:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 
Gigabit Ethernet PCIe (rev 01)
0001:03:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 
Gigabit Ethernet PCIe (rev 01)
0001:03:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 
Gigabit Ethernet PCIe (rev 01)
0001:04:00.0 RAID bus controller: IBM PCI-E IPR SAS Adapter (ASIC) (rev 01)
0002:00:00.0 PCI bridge: IBM Device 03dc
0002:01:00.0 Fibre Channel: Emulex Corporation Lancer-X: LightPulse Fibre 
Channel Host Adapter (rev 10)
0002:01:00.1 Fibre Channel: Emulex Corporation Lancer-X: LightPulse Fibre 
Channel Host Adapter (rev 10)
0003:00:00.0 PCI bridge: IBM Device 03dc
0003:01:00.0 PCI bridge: PLX Technology, Inc. Device 8748 (rev ca)
0003:02:01.0 PCI bridge: PLX Technology, Inc. Device 8748 (rev ca)
0003:02:08.0 PCI bridge: PLX Technology, Inc. Device 8748 (rev ca)
0003:02:09.0 PCI bridge: PLX Technology, Inc. Device 8748 (rev ca)
0003:02:10.0 PCI bridge: PLX Technology, Inc. Device 8748 (rev ca)
0003:02:11.0 PCI bridge: PLX Technology, Inc. Device 8748 (rev ca)
0003:03:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI 
Host Controller (rev 02)
0003:04:00.0 RAID bus controller: IBM PCI-E IPR SAS Adapter (ASIC) (rev 01)
0003:05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 
Gigabit Ethernet PCIe (rev 01)
0003:05:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 
Gigabit Ethernet PCIe (rev 01)
0003:05:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 
Gigabit Ethernet PCIe (rev 01)
0003:05:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 
Gigabit Ethernet PCIe (rev 01)
0003:0b:00.0 Fibre Channel: Emulex Corporation Saturn-X: LightPulse Fibre 
Channel Host Adapter (rev 03)
0003:0b:00.1 Fibre Channel: Emulex Corporation Saturn-X: LightPulse Fibre 
Channel Host Adapter (rev 03)
0004:00:00.0 PCI bridge: IBM Device 03dc
0005:00:00.0 PCI bridge: IBM Device 03dc
0005:01:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) 
(rev 10)
0005:01:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) 
(rev 10)
0005:01:00.2 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) 
(rev 10)
0005:01:00.3 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) 
(rev 10)
0005:01:00.4 Fibre Channel: Emulex Corporation OneConnect FCoE Initiator 
(Lancer) (rev 10)
0005:01:00.5 Fibre Channel: Emulex Corporation OneConnect FCoE Initiator 
(Lancer) (rev 10)
0006:00:00.0 PCI bridge: IBM Device 03dc
0006:01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 
Gigabit Ethernet PCIe (rev 01)
0006:01:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 
Gigabit Ethernet PCIe (rev 01)
0006:01:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 
Gigabit Ethernet PCIe (rev 01)
0006:01:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 
Gigabit Ethernet PCIe (rev 01)

== Comment: #1 - Milton D. Miller II <[email protected]> - 2016-09-09 19:05:32 
==
From the opcode, the faulting instruction is dereferencing offset 0x10 from a
pointer, and the DAR was 0x10, so the pointer was NULL.

Disassembly of the printed opcodes shows that an out-of-module call was 
made and its result used as a base; the loaded value is compared against 
NULL, and then the loaded value is itself used as a base with the same 
16-byte offset.
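
To make the opcode reading above concrete, here is a minimal sketch of a decoder for the Power ISA DS-form load encoding (primary opcode 58). The names ds_form, decode_ds and ds_mnemonic are made up for this illustration and are not kernel code. Fed the bracketed word from the instruction dump, <e9230010>, it yields a load of r9 from 16(r3); GPR03 is 0 in the register dump, which is consistent with DAR = 0x10.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper types for illustration; not taken from the kernel. */
struct ds_form {
    unsigned opcode; /* primary opcode: top 6 bits of the word */
    unsigned rt;     /* target register number */
    unsigned ra;     /* base register number */
    int disp;        /* sign-extended byte displacement (DS field << 2) */
    unsigned xo;     /* low 2 bits: selects ld / ldu / lwa for opcode 58 */
};

/* Split a 32-bit instruction word into its DS-form fields. */
struct ds_form decode_ds(uint32_t insn)
{
    struct ds_form f;

    f.opcode = insn >> 26;
    f.rt = (insn >> 21) & 0x1f;
    f.ra = (insn >> 16) & 0x1f;
    /* The low 16 bits with the XO bits masked off are the displacement. */
    f.disp = (int)((insn & 0xfffc) ^ 0x8000) - 0x8000;
    f.xo = insn & 0x3;
    return f;
}

/* Mnemonic for primary opcode 58, or "?" for anything else. */
const char *ds_mnemonic(const struct ds_form *f)
{
    static const char *const names[] = { "ld", "ldu", "lwa", "?" };

    return f->opcode == 58 ? names[f->xo] : "?";
}

/* Print the word in the usual "mnemonic rT,disp(rA)" assembler form. */
void print_ds(uint32_t insn)
{
    struct ds_form f = decode_ds(insn);

    printf("%08x: %s r%u,%d(r%u)\n", (unsigned)insn, ds_mnemonic(&f),
           f.rt, f.disp, f.ra);
}
```

Calling print_ds(0xe9230010) and print_ds(0xe9290010) covers the faulting load and the later load through r9 from the same dump.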

Looking at upstream, eeh_pe_bus_get can return NULL,
and in pnv_eeh_reset both the returned bus and the bus->parent are
checked for pci_is_root_bus which checks the word at offset 16 for NULL.
The parent field is immediately after a list head and lines up.
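
The offset argument can be checked with a small userspace model. This is only a sketch: list_head_model and pci_bus_model are stand-ins that copy just the first two members described above (a list head followed by the parent pointer), not the real struct pci_bus, and is_root_bus_model mirrors the "parent is NULL" test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's doubly linked list head: two pointers. */
struct list_head_model {
    struct list_head_model *next, *prev;
};

/* Stand-in for the start of struct pci_bus; only the layout matters here. */
struct pci_bus_model {
    struct list_head_model node;  /* occupies the first two pointer slots */
    struct pci_bus_model *parent; /* NULL on a root bus */
};

/* With 8-byte pointers this places parent at byte offset 16 (0x10). */
static_assert(offsetof(struct pci_bus_model, parent) == 2 * sizeof(void *),
              "parent follows the list head");

/* Model of the root-bus test: a bus with no parent is a root bus. */
bool is_root_bus_model(const struct pci_bus_model *bus)
{
    return bus->parent == NULL;
}
```

Passing a NULL bus to is_root_bus_model would read from address 0 + 16, which matches the DAR value in the oops.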

Without looking at the full function disassembly, it would appear that 
pnv_eeh_reset needs to handle the case where the bus returned from 
eeh_pe_bus_get is NULL before checking whether the bus or its parent is a 
root bus.
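
As a rough illustration of the suggested ordering (not the actual patch), the sketch below checks the looked-up bus for NULL before either root-bus test. Everything here is a hypothetical userspace model: bus_model, pe_model, pe_bus_get_model, reset_model and the RESET_* codes are invented names, and the real function does considerably more.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical models; the real kernel structures carry many more fields. */
struct bus_model {
    struct bus_model *parent; /* NULL on a root bus */
};

struct pe_model {
    struct bus_model *bus;    /* may be NULL when no bus can be found */
};

enum { RESET_OK = 0, RESET_NO_BUS = -1 }; /* illustrative result codes */

/* Model of a bus lookup that is allowed to fail and return NULL. */
struct bus_model *pe_bus_get_model(const struct pe_model *pe)
{
    return pe->bus;
}

/*
 * Decide whether the reset targets a root bus or a bus directly below one.
 * The NULL check comes first, so neither parent test can touch a NULL bus.
 */
int reset_model(const struct pe_model *pe, bool *root_reset)
{
    struct bus_model *bus = pe_bus_get_model(pe);

    if (bus == NULL)
        return RESET_NO_BUS;

    /* Short-circuit: bus->parent->parent is read only if parent is set. */
    *root_reset = bus->parent == NULL || bus->parent->parent == NULL;
    return RESET_OK;
}
```

A caller would treat RESET_NO_BUS as a failed reset and report it rather than continue with a NULL bus.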

== Comment: #2 - Russell Currey <[email protected]> - 2016-09-11 21:46:21 ==
Thanks for the details Milton, you're right.  I'll write a patch to fix this in 
EEH and make sure all eeh_pe_bus_get calls check for failure.

== Comment: #3 - Russell Currey <[email protected]> - 2016-09-12
00:19:27 ==


== Comment: #4 - Russell Currey <[email protected]> - 2016-09-12 00:20:25 ==
Attached a patch that should stop the oops, can you test?

Note that not being able to find a bus is still an issue that we need to
find the cause of.

== Comment: #5 - Milton D. Miller II <[email protected]> - 2016-09-12 12:36:18 
==
Originator: There is a second problem: the kdump process failed because it 
ran out of memory.

Please open a second defect to investigate that (unless you are aware of
instructions for setting up kdump that were not followed).

You should be able to recreate that via echo c > /proc/sysrq-trigger and
look for the message:

[    0.397633] Kernel panic - not syncing: Out of memory and no killable
processes...

[note: it appears to have failed unpacking the initrd early in the dump
process on your machine.  This may be related to the partition
definition such as memory size and distribution policy]

== Comment: #6 - Pridhiviraj Paidipeddi <[email protected]> - 2017-04-11 
07:15:03 ==
@mamatha
Please create an Ubuntu mirror request for this; the patches are merged 
upstream.
https://patchwork.ozlabs.org/patch/668552/


Please backport the patches to the respective 16.04.2 / 16.10 kernels.

** Affects: kerneloops (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New


** Tags: architecture-ppc64le bugnameltc-144961 severity-high 
targetmilestone-inin1610

** Tags added: architecture-ppc64le bugnameltc-144961 severity-high
targetmilestone-inin1610

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1683699

Title:
  [LTCTest][Opal][FW860] Oops: Kernel access of bad area, sig: 11 [#1]
  during frozen PE EEH error injection.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/kerneloops/+bug/1683699/+subscriptions

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
