[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-24 Thread bugproxy
--- Comment From chngu...@us.ibm.com 2018-05-24 18:16 EDT---
(In reply to comment #259)
> In bug #167562, Canonical reports that these fixes have been put in
> bionic-proposed (assumed to mean linux-image-4.15.0-23-generic). We need to
> test this ASAP in order to prevent the patches from being reverted. Can we
> get the latest -proposed Ubuntu Bionic installed and checked out on the
> systems where we saw this issue?
>
> This is urgent. Starting by setting NEEDINFO for Chanh, although someone
> else may need to pick that up.

I installed it on boslcp3 and it works. I don't see the crash like we used to see.
root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 17:59:00 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~# lspci |grep QLogic
0030:01:00.0 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
0030:01:00.1 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
root@boslcp3:~#

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1762844

Title:
  ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3: Host crashed & enters into
  xmon after moving to 4.15.0-15.16 kernel

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1762844/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-24 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-24 14:37 EDT---
In bug #167562, Canonical reports that these fixes have been put in 
bionic-proposed (assumed to mean linux-image-4.15.0-23-generic). We need to 
test this ASAP in order to prevent the patches from being reverted. Can we get 
the latest -proposed Ubuntu Bionic installed and checked out on the systems 
where we saw this issue?

This is urgent. Starting by setting NEEDINFO for Chanh, although someone
else may need to pick that up.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-21 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-21 13:20 EDT---
*** Bug 168018 has been marked as a duplicate of this bug. ***

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-11 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-02 14:39 EDT---
The SAN incident in the previous dmesg log shows only a single port (WWPN) 
glitching. The logs from panics showed two ports glitching at the same time. 
Also, this incident did not show the port logging back in for about 8 minutes, 
whereas the panics showed immediate/concurrent login. So, I'm not certain if 
we've proven the fix yet.

--- Comment From kla...@br.ibm.com 2018-05-02 16:32 EDT---
I think next steps here are:

1) apply all the known firmware workarounds (GH 1158)
2) Bring up system with Doug's recommendations  for log verbosity (comment 211 
and 215). Also capture the console output to a separate file if possible.
3) re-start the test using this same kernel, but with no stress on the host: 
proceed to restart the 3 guests with stress, and have a 4th guest migrating 
between boslcp3 and 4.

--- Comment From dougm...@us.ibm.com 2018-05-02 16:36 EDT---
(In reply to comment #218)
> I think next steps here are:
>
> 1) apply all the known firmware workarounds (GH 1158)
> 2) Bring up system with Doug's recommendations  for log verbosity (comment
> 211 and 215). Also capture the console output to a separate file if possible.
> 3) re-start the test using this same kernel, but with no stress on the host:
> proceed to restart the 3 guests with stress, and have a 4th guest migrating
> between boslcp3 and 4.

Klaus, let's hold off on making more changes right now. I'd like to let
things run as-is a little longer.

--- Comment From indira.pr...@in.ibm.com 2018-05-02 23:21 EDT---
Attached boslcp3 host console tee logs.


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-11 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-11 12:12 EDT---
Some information coming in on the SAN where this reproduces. It appears that 
there is some undesirable configuration, where fast switches are backed by 
slower switches between host and disks. The current theory is that other 
activity on the fabric causes bottle-necks in the slow switches and results in 
the temporary loss of login. Working on a way to reproduce this on-demand.

But, if this is true, I think this probably is not likely to be hit by
customers. Seems like customers would not be mixing slow switches with
fast, especially in such a dysfunctional setup.

Still investigating, though, so nothing conclusive yet.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-10 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-10 14:13 EDT---
Being able to reproduce this on ltc-boston113 seems to have been a temporary 
condition. I can no longer reproduce it there, on either Pegas or Ubuntu. 
Without some idea of what external conditions are causing this, it will be 
very difficult to pursue.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-10 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-10 12:59 EDT---
I have had some luck reproducing this, on ltc-boston113 (previously unable to 
reproduce there). I had altered the boot parameters to remove "quiet splash" 
and added "qla2xxx.logging=0x1e40", and got the kworker panic during boot 
(did not even reach login prompt). I also hit this panic while booting the 
Pegas 1.1 installer, so it looks like Pegas is also affected. I am completing 
the Pegas install with qla2xxx blacklisted, and will characterize some more.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-09 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-09 11:34 EDT---
There was a period of SAN instability observed on boslcp1 this morning, at 
about May  9 05:01:28 to 05:51:56. This involved 2 ports simultaneously 
handling relogins. This was a Pegas kernel that should be susceptible to the 
panic, but no panic was seen. But since we don't know enough about the exact 
timing required to produce the panic, we can't say just what that means.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-08 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-08 12:09 EDT---
It appears that there were some SAN incidents yesterday on boslcp3, approx. 
times were May  7 12:44:54 through 14:28:17. All were for one port, so not 
exactly the situation I think caused the panic. If we could correlate these SAN 
incidents with other activity on neighboring systems, that might help.

[207374.827928] = first incident
[213578.181860] = last incident
[287293.677076] Tue May  8 10:56:52 CDT 2018
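
The last marker pairs a dmesg offset with a known date, so each incident's wall-clock time is that date minus the difference in offsets. A minimal Python sketch of that arithmetic, assuming the dmesg clock tracked wall-clock time over this interval (no drift or suspend) and treating the date as naive CDT local time; the function name is only illustrative:

```python
from datetime import datetime, timedelta


def dmesg_to_wallclock(ts, ref_ts, ref_time):
    """Convert a dmesg offset (seconds since boot) to wall-clock time,
    given one reference offset whose wall-clock time is known."""
    return ref_time - timedelta(seconds=ref_ts - ts)


# Reference pair taken from the marker line above (CDT, naive datetime).
REF_TS = 287293.677076
REF_TIME = datetime(2018, 5, 8, 10, 56, 52)

first = dmesg_to_wallclock(207374.827928, REF_TS, REF_TIME)
last = dmesg_to_wallclock(213578.181860, REF_TS, REF_TIME)

print("first incident:", first.strftime("%b %d %H:%M:%S"))
print("last incident: ", last.strftime("%b %d %H:%M:%S"))
```

Both results land within about a second of the approximate May 7 times quoted above; the reference date only has one-second resolution, so closer agreement should not be expected.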

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-07 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-07 12:10 EDT---
Of the "boslcp" systems, only 3 appear to have QLogic adapters. Of those, one 
has been running without the extended error logging and so collected no data, 
and one has been down (or non-functional) for about 36 hours. Of the data 
collected, though, there is no evidence of any SAN instability since Friday - 
before starting the patched kernels. This means that we have no new data on 
whether the patches fix the problem.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-05 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-05 13:23 EDT---
The boslcp6 logs look characteristic of the qla2xxx issue (panic in 
process_one_work()). Don't have detailed qla2xxx logging so can't determine SAN 
disposition.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-05 Thread bugproxy
--- Comment From cdead...@us.ibm.com 2018-05-05 10:31 EDT---
Yesterday, the decision was made at Padma's daily KVM meeting to only track 
System Firmware Mustfix issues using the LC GA1 Mustfix label since that is all 
that applies to the Supermicro team. The OS Kernel/KVM issues will be managed 
with a spreadsheet tracked by the KVM team and also in the internal slack 
channel. Removing the Mustfix label.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-04 Thread bugproxy
--- Comment From indira.pr...@in.ibm.com 2018-05-04 11:10 EDT---
We were not able to install the 'sar' package due to the prior 166588 patch. 
Also, 'xfs' was being used on the system from the prior run. To overcome both, 
we planned a fresh installation. Installed the latest Ubuntu 18.04 kernel 
(4.15.0-20) on the LSI disk and booted from that disk. The login prompt 
appeared and we gave credentials. Immediately, in less than a minute, the 
system dumped messages and started rebooting. It does not allow time to run 
anything at the console prompt.

Made multiple attempts to boot with the latest kernel; once logged in, the
system reboots with call traces as below.

Ubuntu 18.04 LTS boslcp3 hvc0

boslcp3 login: [   51.679446] sd 3:0:1:0: rejecting I/O to offline device
[   58.251326] Unable to handle kernel paging request for data at address 0xbf52a78fa0cf2419
[   58.251413] Faulting instruction address: 0xc038ae70
[   58.251462] Oops: Kernel access of bad area, sig: 11 [#1]
[   58.251500] LE SMP NR_CPUS=2048 NUMA PowerNV
[   58.251543] Modules linked in: rpcsec_gss_krb5(E) nfsv4(E) nfs(E) fscache(E) 
binfmt_misc(E) dm_service_time(E) dm_multipath(E) scsi_dh_rdac(E) 
scsi_dh_emc(E) scsi_dh_alua(E) joydev(E) input_leds(E) mac_hid(E) 
idt_89hpesx(E) at24(E) uio_pdrv_genirq(E) uio(E) vmx_crypto(E) ofpart(E) 
crct10dif_vpmsum(E) cmdlinepart(E) powernv_flash(E) mtd(E) opal_prd(E) 
ipmi_powernv(E) ibmpowernv(E) ipmi_devintf(E) ipmi_msghandler(E) nfsd(E) 
auth_rpcgss(E) nfs_acl(E) sch_fq_codel(E) lockd(E) grace(E) sunrpc(E) 
ip_tables(E) x_tables(E) autofs4(E) ses(E) enclosure(E) hid_generic(E) 
usbhid(E) hid(E) qla2xxx(E) ast(E) i2c_algo_bit(E) ttm(E) mpt3sas(E) ixgbe(E) 
drm_kms_helper(E) nvme_fc(E) syscopyarea(E) sysfillrect(E) nvme_fabrics(E) 
sysimgblt(E) fb_sys_fops(E) nvme_core(E) raid_class(E) crc32c_vpmsum(E) drm(E) 
i40e(E)
[   58.252067]  scsi_transport_sas(E) aacraid(E) scsi_transport_fc(E) mdio(E)
[   58.252120] CPU: 80 PID: 1740 Comm: ureadahead Tainted: GE
4.15.0-20-generic #21+bug166588
[   58.252186] NIP:  c038ae70 LR: c038ae5c CTR: c0621860
[   58.252245] REGS: c00fd98b76c0 TRAP: 0380   Tainted: GE 
(4.15.0-20-generic)
[   58.252309] MSR:  90009033   CR: 24002844  
XER: 
[   58.252373] CFAR: c0016e1c SOFTE: 1
[   58.252373] GPR00: c038ad34 c00fd98b7940 c16eae00 
0001
[   58.252373] GPR04: 007f2daa2bd342ac 05ea 0001 
05e9
[   58.252373] GPR08: 7f2daa2bd34242b4   

[   58.252373] GPR12: 2000 cfab7000 c00fd9d9f848 
c00fd9d9fab8
[   58.252373] GPR16: c00fd98b7c90 002a 0001fe80 

[   58.252373] GPR20:   002a 
7f528781f8910018
[   58.252373] GPR24: c000200e585e2401 bf52a78fa0cf2419 c0b2142c 
c00ff901ee00
[   58.252373] GPR28:  015004c0 c000200e585e2401 
c00ff901ee00
[   58.252879] NIP [c038ae70] kmem_cache_alloc_node+0x2f0/0x350
[   58.252927] LR [c038ae5c] kmem_cache_alloc_node+0x2dc/0x350
[   58.252974] Call Trace:
[   58.252996] [c00fd98b7940] [c038ad34] 
kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
[   58.253066] [c00fd98b79b0] [c0b2142c] __alloc_skb+0x6c/0x220
[   58.253116] [c00fd98b7a10] [c0b2332c] 
alloc_skb_with_frags+0x7c/0x2e0
[   58.253174] [c00fd98b7aa0] [c0b16f8c] 
sock_alloc_send_pskb+0x29c/0x2c0
[   58.253233] [c00fd98b7b50] [c0c492c4] 
unix_stream_sendmsg+0x264/0x5c0
[   58.253292] [c00fd98b7c30] [c0b11424] sock_sendmsg+0x64/0x90
[   58.253342] [c00fd98b7c60] [c0b11508] sock_write_iter+0xb8/0x120
[   58.253401] [c00fd98b7d00] [c03d0434] new_sync_write+0x104/0x160
[   58.253459] [c00fd98b7d90] [c03d3b78] vfs_write+0xd8/0x220
[   58.253509] [c00fd98b7de0] [c03d3e98] SyS_write+0x68/0x110
[   58.253560] [c00fd98b7e30] [c000b184] system_call+0x58/0x6c
[   58.253607] Instruction dump:
[   58.253637] 7c97ba78 fb210038 38a50001 7f19ba78 fb29 f8aa 4bc8bfb1 
6000
[   58.253698] 7fb8b840 419e0028 e93f0022 e91f0140 <7d59482a> 7d394a14 7d4a4278 
7fa95040
[   58.253760] ---[ end trace 21f1ccbedad3db06 ]---
[   58.360858] device-mapper: multipath: Reinstating path 65:240.
[   58.362107] sd 3:0:1:0: Power-on or device reset occurred
[   58.369695] sd 2:0:1:0: Power-on or device reset occurred
[   58.371943] sd 3:0:1:0: alua: port group 00 state A non-preferred supports 
tolusna
[   58.376534] sd 3:0:0:0: Power-on or device reset occurred
[   58.381190] sd 2:0:0:0: Power-on or device reset occurred
[   58.391738] sd 3:0:0:0: alua: port group 01 state N non-preferred supports 
tolusna
[   59.265054]

Attached boslcp3 host console logs
Please let us know if this is a different issue to be 

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-03 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-03 08:20 EDT---
There were a large number of SAN incidents in the evening, although none 
involved two ports at the same time. Still, many involved relogin while the 
logout was still being processed - so there is some confidence that the patches 
may be working.

There was a large period of SAN instability between May  2 21:42:09 and
21:58:47. This involved only one port (21:00:00:24:ff:7e:f6:fe). It
would be interesting if this could be traced back to some activity,
either on this machine or on the SAN (e.g. was migration being tested on
other machines at this point?).

We still have not seen the same situation that was associated with the
panics (two or more ports experiencing instability at the same time), so
it's not clear if we can conclude that the patches fix the original
problem. If we could find some trigger for the instability, we might be
able to orchestrate the situation originally seen.
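
The condition described above (two or more ports unstable at the same time) can be checked mechanically once the incidents are reduced to per-port time windows. A rough Python sketch, assuming the windows have already been extracted by hand from the qla2xxx log; the tuple layout, the function name and the second port are hypothetical, and only the first window comes from this comment:

```python
from datetime import datetime
from itertools import combinations


def overlapping_ports(windows):
    """Return pairs of distinct ports whose instability windows overlap.

    windows: list of (port, start, end) tuples with datetime bounds.
    """
    hits = []
    for (p1, s1, e1), (p2, s2, e2) in combinations(windows, 2):
        if p1 != p2 and s1 <= e2 and s2 <= e1:
            hits.append((p1, p2))
    return hits


# Window reported above for the one affected port.
observed = [
    ("21:00:00:24:ff:7e:f6:fe",
     datetime(2018, 5, 2, 21, 42, 9), datetime(2018, 5, 2, 21, 58, 47)),
]
print(overlapping_ports(observed))

# Hypothetical second port, added only to show what a match looks like.
with_second = observed + [
    ("hypothetical-port-b",
     datetime(2018, 5, 2, 21, 50, 0), datetime(2018, 5, 2, 21, 51, 0)),
]
print(overlapping_ports(with_second))
```

With only the observed window there is nothing to pair, which matches the statement that the panic scenario has not been seen again yet.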

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-03 Thread bugproxy
--- Comment From indira.pr...@in.ibm.com 2018-05-03 03:22 EDT---
boslcp3 host console dumps messages related to qlogic driver.

Latest tee logs for boslcp3 host :

kte111.isst.aus.stglabs.ibm.com 9.3.111.155 [kte/don2rry]

kte111:/LOGS/boslcp3-host-may1.txt

[ipjoga@kte (AUS) ~]$ ls -l /LOGS/boslcp3-host-may1.txt
-rwxrwxr-x 1 ipjoga ipjoga 20811302 May  3 02:12 /LOGS/boslcp3-host-may1.txt

Regards,
Indira

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-02 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-02 14:39 EDT---
The SAN incident in the previous dmesg log shows only a single port (WWPN) 
glitching. The logs from panics showed two ports glitching at the same time. 
Also, this incident did not show the port logging back in for about 8 minutes, 
whereas the panics showed immediate/concurrent login. So, I'm not certain if 
we've proven the fix yet.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-05-02 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-05-02 14:27 EDT---
Unfortunately, the current test run was executed without "dmesg -n debug" so 
the captured console output has no value. I corrected that, and so future 
console output should have what we need.

The good news is that the dmesg buffer had not wrapped yet, so I could
still grab all of that. There was only one "SAN instability" incident,
about an hour ago, and it did not cause a panic. However, I'm not sure
if that is conclusive yet.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-27 Thread bugproxy
--- Comment From indira.pr...@in.ibm.com 2018-04-27 08:14 EDT---
(In reply to comment #211)
> - We have decided to replace the qlogic by Emulex.
> - Apply the new kernel patch in 208.
> - add the slub_debug=FZPU
> System is up with latest kernel and ready now.
> root@boslcp3:~# uname -a
> Linux boslcp3 4.15.0-20-generic #21+bug166588 SMP Thu Apr 26 15:05:59 CDT
> 2018 ppc64le ppc64le ppc64le GNU/Linux
> root@boslcp3:~# cat /proc/cmdline
> root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro slub_debug=FZPU splash
> quiet
> crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:
> 4096M@128M
> root@boslcp3:~#

Started tests on guests - boslcp3 host with latest kernel

root@boslcp3:/kte/tools/setup.d# uname -a
Linux boslcp3 4.15.0-20-generic #21+bug166588 SMP Thu Apr 26 15:05:59 CDT 2018 
ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:/kte/tools/setup.d# cat /proc/cmdline
root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro slub_debug=FZPU splash quiet 
crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M@128M

boslcp3 host - IO run & stress-ng IO class run on emulex disks
boslcp3g1 - guest crashed on boslcp4 (dev looking into it); logs are attached in 
bug #166303-c46
boslcp3g3 - Freshly installed using cdrom & booted with kernel 4.15.0-19. 
(Requested Gustavo to provide an NMI patch on top of the 19 kernel in bug #166877, c29)
boslcp3g4 - LTP run 4 hours done

Regards,
Indira

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-27 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-27 07:33 EDT---
(In reply to comment #211)
> - We have decided to replace the qlogic by Emulex.
> - Apply the new kernel patch in 208.
> - add the slub_debug=FZPU
> System is up with latest kernel and ready now.
> root@boslcp3:~# uname -a
> Linux boslcp3 4.15.0-20-generic #21+bug166588 SMP Thu Apr 26 15:05:59 CDT
> 2018 ppc64le ppc64le ppc64le GNU/Linux
> root@boslcp3:~# cat /proc/cmdline
> root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro slub_debug=FZPU splash
> quiet
> crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:
> 4096M@128M
> root@boslcp3:~#

Wait, if you are no longer running with the qlogic card, your tests are
not applicable to this kernel. This new kernel was patched with fixes to
the qlogic driver. Also, by making changes to the FC SAN we risk losing
the ability to reproduce the qlogic problem.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-26 Thread bugproxy
--- Comment From chngu...@us.ibm.com 2018-04-26 20:38 EDT---
- We have decided to replace the qlogic by Emulex.
- Apply the new kernel patch in 208.
- add the slub_debug=FZPU
System is up with latest kernel and ready now.
root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-20-generic #21+bug166588 SMP Thu Apr 26 15:05:59 CDT 2018 
ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~# cat /proc/cmdline
root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro slub_debug=FZPU splash quiet 
crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M@128M
root@boslcp3:~#
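
For reference, the crashkernel= value above uses the range syntax: each `start-end:size` entry applies when total RAM falls in that range, and `@128M` is the offset of the reservation. A small Python sketch of how such a value resolves, assuming only uppercase K/M/G suffixes; this is not the kernel's own parser, and the RAM sizes in the example are hypothetical since the thread does not state how much memory boslcp3 has:

```python
import re

UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Parse a size such as '320M' or '2G' into bytes."""
    m = re.fullmatch(r"(\d+)([KMG]?)", text)
    if not m:
        raise ValueError(f"bad size: {text!r}")
    return int(m.group(1)) * UNITS.get(m.group(2), 1)


def crashkernel_reservation(value, ram_bytes):
    """Return (size, offset) in bytes for the given RAM size,
    or None if no range matches. offset is None when no '@' is given."""
    spec, _, offset = value.partition("@")
    for entry in spec.split(","):
        rng, _, size = entry.partition(":")
        start, _, end = rng.partition("-")
        lo = parse_size(start)
        hi = parse_size(end) if end else float("inf")  # open-ended last range
        if lo <= ram_bytes < hi:
            return parse_size(size), (parse_size(offset) if offset else None)
    return None


# Value copied from /proc/cmdline above.
VALUE = "2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M@128M"

for ram_gib in (16, 512):  # hypothetical RAM sizes
    size, offset = crashkernel_reservation(VALUE, ram_gib << 30)
    print(f"{ram_gib}G RAM -> reserve {size >> 20}M at offset {offset >> 20}M")
```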

--- Comment From chngu...@us.ibm.com 2018-04-26 20:39 EDT---
FYI: We are doing the migration between boslcp3 & boslcp4 using guest boslcp3g1 
at this moment. Feel free to add more stress on system.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-26 Thread bugproxy
--- Comment From mauri...@br.ibm.com 2018-04-26 13:57 EDT---
The skiroot kernel build is available at:
http://dorno.rch.stglabs.ibm.com/~mauricfo/kernel/skiroot/bz166588/zImage.epapr_4.15.14-openpower1.bz166588c132

(In reply to comment #200)
> Dwip and I talked, and we don't feel there is anything new to be learned by
> changing focus to petitboot right now. [...]

Since the build was already in progress, I'll post it here, just in case
it might still help.

You can flash it as described in bug 167103 comment 22
on systems with secure-boot _disabled_ (bug 167103 comment 30),
i.e., there is no 'secure-enabled' property in the device tree.

Copying the instructions here for reference:

# ls /proc/device-tree/ibm,secureboot/secure-enabled
ls: cannot access '/proc/device-tree/ibm,secureboot/secure-enabled': No such file or directory

# lsprop /proc/device-tree/ibm,secureboot/
compatible   "ibm,secureboot-v2"
hw-key-hash-size 0040 (64)
hw-key-hash  40d487ff 7380ed6a d54775d5 795fea0d
e2f541fe a9db06b8 466a42a3 20e65f75
b4866546 0017d907 515dc2a5 f9fc5095
4d6ee0c9 b67d219d fb708535 1d01d6d1
phandle  0086 (134)
name "ibm,secureboot"

On Petitboot:

/ # uname -r
4.13.16-openpower1

/ # wget -O zImage.epapr http://dorno.rch.stglabs.ibm.com/~mauricfo/kernel/skiroot/bz166588/zImage.epapr_4.15.14-openpower1.bz166588c132

/ # pflash -e -p zImage.epapr -P BOOTKERNEL
About to erase 0x017e1000..0x02ca5478 !
WARNING ! This will modify your HOST flash chip content !
Enter "yes" to confirm:yes
<...>

/ # reboot
<...>

EXPECTED:
/ # uname -r
4.15.14-openpower1.bz166588c132
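
The secure-boot precondition in the instructions above can also be checked from a script before flashing. This is a minimal sketch, assuming the device tree is exposed under /proc/device-tree as in the transcript; the function name `secure_boot_enabled` and its `dt_root` parameter are illustrative and not part of any tool mentioned in this bug.

```python
from pathlib import Path


def secure_boot_enabled(dt_root="/proc/device-tree"):
    """Return True if the 'secure-enabled' property is present.

    The property path comes from the instructions above. dt_root is a
    hypothetical parameter so the check can be pointed at a test directory.
    """
    return (Path(dt_root) / "ibm,secureboot" / "secure-enabled").exists()
```

If `secure_boot_enabled()` returns False, the system matches the "secure-boot disabled" case for which the flashing steps were written.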

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-26 Thread bugproxy
--- Comment From dnban...@us.ibm.com 2018-04-26 10:58 EDT---
I took a quick look at the crash stacks mentioned in c191-c193. Since we don't
have a debug kernel for "4.15.0-15-generic #16+bug166588", I just looked
at the stacks. From them it seems reasonable to conclude that these all
appear to be manifestations of issues we have seen before. I tried to
categorize them below. Note that some of these were hit before booting into
the actual kernel, so it would be a good idea to install a skiroot kernel with
the above patches as well (as was indeed decided in the meeting and as Klaus
mentions in #194).

crash 201804260138
==
[   27.682301] NIP [c0389760] kmem_cache_alloc+0x2e0/0x340
[   27.682343] LR [c038974c] kmem_cache_alloc+0x2cc/0x340
[   27.682386] Call Trace:
[   27.682406] [c5fef5c0] [c5fef610] 0xc5fef610 (unreliable)
[   27.682459] [c5fef620] [c02dfacc] mempool_alloc_slab+0x2c/0x40
[   27.682510] [c5fef640] [c02dff18] mempool_alloc+0x88/0x1e0
[   27.682555] [c5fef6d0] [c06724fc] bio_alloc_bioset+0x1ac/0x2e0
[   27.682607] [c5fef740] [c042a904] submit_bh_wbc+0xd4/0x240
[   27.682650] [c5fef790] [c042b9a0] ll_rw_block+0x130/0x1a0
[   27.682694] [c5fef7f0] [c042bae4] __breadahead+0x44/0xb0
[   27.682739] [c5fef820] [c04cb9a8] __ext4_get_inode_loc+0x448/0x5c0
[   27.682789] [c5fef8e0] [c04cffbc] ext4_iget+0x9c/0xc40
[   27.682832] [c5fef9d0] [c04ef234] ext4_lookup+0x1b4/0x2e0
GPR24: e6eef6af4c054c5f c000200e585a3901 26eed6a1145f755e c02dfacc
GPR28: c00ff901ee00 01011200 c000200e585a3901 c00ff901ee00

This appears to be kmem cache corruption,
likely another instance of the double free issue.

crash 201804252219
==
[   84.702368] NIP [c0389ed0] kmem_cache_alloc_node+0x2f0/0x350
[   84.702407] LR [c0389ebc] kmem_cache_alloc_node+0x2dc/0x350
[   84.702446] Call Trace:
[   84.702463] [c5e77940] [c0389d94] kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
[   84.702520] [c5e779b0] [c0b2eb6c] __alloc_skb+0x6c/0x220
[   84.702560] [c5e77a10] [c0b30a6c] alloc_skb_with_frags+0x7c/0x2e0
[   84.702608] [c5e77aa0] [c0b246cc] sock_alloc_send_pskb+0x29c/0x2c0
[   84.702655] [c5e77b50] [c0c569e4] unix_stream_sendmsg+0x264/0x5c0
[   84.702703] [c5e77c30] [c0b1eb64] sock_sendmsg+0x64/0x90
[   84.702743] [c5e77c60] [c0b1ec48] sock_write_iter+0xb8/0x120
[   84.702791] [c5e77d00] [c03cf494] new_sync_write+0x104/0x160
[   84.702838] [c5e77d90] [c03d2bd8] vfs_write+0xd8/0x220
[   84.702878] [c5e77de0] [c03d2ef8] SyS_write+0x68/0x110
[   84.702919] [c5e77e30] [c000b184] system_call+0x58/0x6c

GPR24: c000200e585ebc01 26eed6a1145bf0fd c0b2eb6c c00ff901ee00
GPR28:  015004c0 c000200e585ebc01 c00ff901ee00

This appears to be kmem cache corruption,
possibly another case of double free.

crash 201804251933
=
[ 7083.142916] NIP [c013277c] process_one_work+0x3c/0x5a0
[ 7083.142965] LR [c0132d78] worker_thread+0x98/0x630
[ 7083.143004] Call Trace:
[ 7083.143026] [c000200bb70b7c90] [c01329f4] process_one_work+0x2b4/0x5a0 (unreliable)
[ 7083.143085] [c000200bb70b7d20] [c0132d78] worker_thread+0x98/0x630
[ 7083.143134] [c000200bb70b7dc0] [c013b9a8] kthread+0x1a8/0x1b0
[ 7083.143185] [c000200bb70b7e30] [c000b528] ret_from_kernel_thread+0x5c/0xb4
GPR08: c000200e60eb7df0  2040 c000200e60ea10a8


This is the worker object issue again.

crash 201804251726
==
[   48.707329] NIP [c0389ed0] kmem_cache_alloc_node+0x2f0/0x350
[   48.707376] LR [c0389ebc] kmem_cache_alloc_node+0x2dc/0x350
[   48.707422] Call Trace:
[   48.707444] [c000200e46c07890] [c0389d94] kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
[   48.707511] [c000200e46c07900] [c0b2eb6c] __alloc_skb+0x6c/0x220
[   48.707561] [c000200e46c07960] [c0cf4004] kobject_uevent_env+0x804/0xa40
[   48.707620] [c000200e46c07a40] [c0aa3338] dm_kobject_uevent+0x78/0xd0
[   48.707676] [c000200e46c07ae0] [c0aab930] dev_suspend+0x360/0x390
[   48.707725] [c000200e46c07b30] [c0aac110] ctl_ioctl+0x200/0x5a0
[   48.707773] [c000200e46c07d20] [c0aac4d0] dm_ctl_ioctl+0x20/0x30
[   48.707822] [c000200e46c07d40] [c03ef9f4] do_vfs_ioctl+0xd4/0xa00
[   48.707870] [c000200e46c07de0] [c03f03e4] SyS_ioctl+0xc4/0x130
[   48.707920] [c000200e46c07e30] [c000b184] system_call+0x58/0x6c
GPR24: c000200e585e3a01 26eed6a1145b76a7 c0b2eb6c c00ff901ee00
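
The manual categorization above keys off the function named on the NIP line of each oops. A rough sketch of that heuristic in Python follows; the regular expression, the function name `classify`, and the category strings are assumptions made for illustration, and the result is only a first guess that does not replace reading the full stack.

```python
import re

# Matches lines such as:
# "[   27.682301] NIP [c0389760] kmem_cache_alloc+0x2e0/0x340"
NIP_RE = re.compile(r"NIP \[[0-9a-f]+\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def classify(trace):
    """Guess a crash category from the symbol on the NIP line of an oops."""
    match = NIP_RE.search(trace)
    if match is None:
        return "unknown"
    symbol = match.group(1)
    # Faults inside the slab allocator were read above as cache corruption.
    if symbol.startswith("kmem_cache_alloc"):
        return "kmem cache corruption (possible double free)"
    # A fault in process_one_work itself was read as the worker object issue.
    if symbol == "process_one_work":
        return "worker object issue"
    return "uncategorized: " + symbol
```

Feeding it the NIP lines quoted above yields the same grouping as the hand analysis: the kmem_cache_alloc and kmem_cache_alloc_node faults land in the first category and the process_one_work fault in the second.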

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-26 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-26 10:48 EDT---
The crashdumps that were collected are for a different/custom kernel. That 
kernel was built using the same name as the stock Ubuntu kernel, which causes 
more confusion. We need to have the dbgsym version of the kernel to analyze 
them.

I want to propose an attempt to fix the problem by using a newer qla2xxx
driver, version 10.00.00.04-k. I have built that for the stock Ubuntu
-15 kernel, and will attach it here.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-26 Thread bugproxy
--- Comment From mauri...@br.ibm.com 2018-04-26 10:22 EDT---
(In reply to comment #194)
> 3) Mauricio will use the information in [2] above to rebuild the Skiroot in
> Bug 167103 comment 22, but with Dwip's patch replaced by the patches in [2].
> In other words, a Skiroot with the tlbie fix + whatever is pointed out in [2]

(In reply to comment #195)
> I am attaching the FOUR commits identified before:

> commit d8630bb95f46ea118dede63bd75533faa64f9612
> commit 9cd883f07a54e5301d51e259acd250bb035996be
> commit 1ae634eb28533b82f9777a47c1ade44cb8c0182b
> commit eaf75d1815dad230dac2f1e8f1dc0349b2d50071

> I also sent Gustavo and Mauricio L'Notes email with the same.

Ack.
Working on the skiroot build, and will post instructions per Dwip's request.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-26 Thread bugproxy
--- Comment From dnban...@us.ibm.com 2018-04-26 10:12 EDT---
I am attaching the FOUR commits identified before:

===
commit d8630bb95f46ea118dede63bd75533faa64f9612
Author: Quinn Tran 
Date:   Thu Dec 28 12:33:43 2017 -0800

commit 9cd883f07a54e5301d51e259acd250bb035996be

commit 1ae634eb28533b82f9777a47c1ade44cb8c0182b
Author: Quinn Tran 
Date:   Thu Dec 28 12:33:44 2017 -0800

commit eaf75d1815dad230dac2f1e8f1dc0349b2d50071
Author: Quinn Tran 
Date:   Thu Feb 1 10:33:17 2018 -0800

(the last one is for the double free issue)

I also sent Gustavo and Mauricio L'Notes email with the same.

I believe Doug took a look as well and he also built an upstream
based version (#179). I will look some more as well (but those
4 above would appear to be a necessary starting point)

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-26 Thread bugproxy
--- Comment From kla...@br.ibm.com 2018-04-26 09:59 EDT---
In the KVM Scrum discussion today, it was decided that:

1) Doug will jump on boslcp3 and reboot (multiple times if needed) in an
attempt to reproduce the PETITBOOT issue described in comment 191
(process_one_work crash). Once in that condition, Doug will try to
blacklist the qla2xxx driver in PETITBOOT and retry to see if that makes
things any better. If Chanh is available, Doug could also ask him to yank out
the Qlogic adapter from the box and see if the process_one_work crash
can be reproduced. This is an attempt at bringing more data into the
boot condition that apparently triggers after the Canonical Host Kernel
hangs/crashes, but somehow persists across reboots.

2) In parallel with the above, Dwip will identify the set of upstream
qla2xxx commits (and any important dependencies) that would serve as the
equivalent for his tentative patch in comment 132.

3) Mauricio will use the information in [2] above to rebuild the Skiroot
in Bug 167103 comment 22, but with Dwip's patch replaced by the patches
in [2]. In other words, a Skiroot with the tlbie fix + whatever is
pointed out in [2]

4) Similarly, Gustavo Walbon will use the information in [2] and apply
it to the latest proposed Canonical Kernel (comment 186), also ensuring
that the patch for Bug 166877 is applied (if not already part of the
official Canonical build).

5) Once Doug is done with [1] and Mauricio is done with [3], Doug can
test the skiroot in [3] to validate if it makes the PETITBOOT issue
better

6) Once Doug is done with [5], Chanh can restart the long-running test,
with the Skiroot provided in [3] and with the kernel provided in [4].

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-26 Thread bugproxy
--- Comment From indira.pr...@in.ibm.com 2018-04-26 04:18 EDT---
boslcp3 hit 166588 again. It was running with Gustavo's patch mentioned
in c151. The system had been running for the last 40 hours, but we observed
slowness yesterday evening. This morning it was not reachable and was found
sitting in petitboot. Multiple symptoms are seen now:

(1) When we exited from petitboot it rebooted, but crashed a couple of
times and generated vmcores.

vmcore path:

kte111.isst.aus.stglabs.ibm.com 9.3.111.155 [kte/don2rry]
kte111:/LOGS/boslcp3/201804252219
kte111:/LOGS/boslcp3/201804260138

(2) It crashed again, and before reaching petitboot it dropped into
xmon.

Linux version 4.15.14-openpower1 (smc@smc-desktop) (gcc version 6.4.0 
(Buildroot 2018.02-2-g05e5240)) #2 SMP Fri Apr 20 09:34:06 PDT 2018
enter ? for help
[c000200ff1743cb0] c0089dc8 process_one_work+0x24c/0x328 (unreliable)
[c000200ff1743d40] c008a4a4 worker_thread+0x2e4/0x3a8
[c000200ff1743dc0] c008ffc0 kthread+0x14c/0x154
[c000200ff1743e30] c000b594 ret_from_kernel_thread+0x5c/0xc8
68:mon

(3) Used X in xmon to trigger a dump; it came out of xmon, booted, and
entered the petitboot menu.

(4) Selected the kernel and booted, whereupon it rebooted again.

Attached console logs for steps (3) and (4).

(5) Powered off and on again and tried to boot the kernel, whereupon it dumped
RCU stall traces and the error below, then rebooted again:

[   37.614879] Kernel panic - not syncing: Fatal exception in interrupt

Attached console logs.

(6) After multiple reboot attempts, the boslcp3 host came up with the kernel:

root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-15-generic #16+bug166588 SMP Mon Apr 23 11:50:06 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux

NOTE: If we reboot the system we can recreate the above issue.

Regards,
Indira

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-25 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-25 17:09 EDT---
I have been trying to reproduce this using portdisable/portenable on the FC 
switch. So far, no problem seen. I made some runs with extra qla2xxx debug 
logging, and see the timing is not quite the same as seen on the fabric 
connected to boslcp3. Also, this system has only one port connected to the 
fabric.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-25 Thread bugproxy
--- Comment From chngu...@us.ibm.com 2018-04-25 16:45 EDT---
(In reply to comment #181)
> boslcp3 host is running with IO run on qlogic disks & stress-ng IO class
> & 2 guests are running 30+ hours of stress run. boslcp3g4 guest is facing
> out of network issue( updated bug#165570- c41 for guest out of network issue)
>
> root@boslcp3:~# virsh list --all
>  IdName   State
> 
>  1 boslcp3g1  running
>  2 boslcp3g3  running
>  4 boslcp3g4  running
>
> root@boslcp3:~# uptime
>  06:28:42 up 1 day, 15:07, 10 users,  load average: 117.72, 116.54, 117.92
> root@boslcp3:~#
>
> Regards,
> Indira

We were able to bring up boslcp4 and were about to try out the migration when
we discovered that boslcp3 had become unresponsive.

As of now, it can still be pinged but cannot be accessed via ssh or the direct
console.
On the SOL console, the "virsh list" command just hangs there...

root@boslcp3:~# uptime
14:02:40 up 1 day, 22:41,  6 users,  load average: 216.49, 215.61, 214.52
root@boslcp3:~# virsh list

<< it stays here; if we hit enter again, the cursor moves down one more line

On the monitor connected via the VGA port, after we enter the username/password
to log in, it prints "Last login: Wed Apr 25 10:55" and then stays there. We
cannot do anything from there.

Please let us know what the next step is here... We need to do the
migration since we have boslcp4 up and running.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-25 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-25 10:20 EDT---
I was not able to find any suspect tasks in the 04/18 crashdump, aside from what 
Dwip already mentioned. I found 3 tasks that were in __queue_work(), but all 
those target pools were currently empty so they did not exhibit the problem 
seen on the offending pool. They also did not involve qla2xxx, although the 
theory is that other (non-qla2xxx) work could get added to a broken pool.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-25 Thread bugproxy
--- Comment From kla...@br.ibm.com 2018-04-25 09:04 EDT---
(In reply to comment #181)
> boslcp3 host is running with IO run on qlogic disks & stress-ng IO class
> & 2 guests are running 30+ hours of stress run. boslcp3g4 guest is facing
> out of network issue( updated bug#165570- c41 for guest out of network issue)
>
> root@boslcp3:~# virsh list --all
>  IdName   State
> 
>  1 boslcp3g1  running
>  2 boslcp3g3  running
>  4 boslcp3g4  running
>
> root@boslcp3:~# uptime
>  06:28:42 up 1 day, 15:07, 10 users,  load average: 117.72, 116.54, 117.92
> root@boslcp3:~#
>
> Regards,
> Indira

We can either try to continue with this kernel, or try something new. I
was hoping we could repro the original issue in the Development Boston
so we could allow this one to continue the test train.

Breno/Walbon: Will Canonical have an official kernel soon similar to the
one you built us (i.e., 165988 reverted, 166877 applied, plus any other
confirmed fixes)?

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-25 Thread bugproxy
--- Comment From indira.pr...@in.ibm.com 2018-04-25 07:39 EDT---
The boslcp3 host is running an IO run on the QLogic disks plus the stress-ng IO
class, and 2 guests have been running a stress run for 30+ hours. The boslcp3g4
guest is facing a network outage (updated bug#165570 c41 for the guest network issue).

root@boslcp3:~# virsh list --all
 Id    Name         State
 1     boslcp3g1    running
 2     boslcp3g3    running
 4     boslcp3g4    running

root@boslcp3:~# uptime
06:28:42 up 1 day, 15:07, 10 users,  load average: 117.72, 116.54, 117.92
root@boslcp3:~#

Regards,
Indira

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-24 Thread bugproxy
--- Comment From indira.pr...@in.ibm.com 2018-04-24 23:51 EDT---
(In reply to comment #174)
> (In reply to comment #85)
> > Copied the  dump to our kte server
> >
> > kte111.isst.aus.stglabs.ibm.com 9.3.111.155 [kte/don2rry]
> >
> > kte111:/LOGS/boslcp3/BZ166588/
> >
> > h# ls -l /LOGS/boslcp3/BZ166588/
> > total 4
> > drwxr-xr-x 2 root root 4096 Apr 19 02:42 201804181042
> >
> > Thanks.
>
> The dump is inaccessible to normal users:
>
> [kte@kte (AUS) 201804181042]$ ls -l
> total 1986812
> -rw--- 1 root root 237028 Apr 19 02:42 dmesg.201804181042
> -rw--- 1 root root 2034352684 Apr 19 02:42 dump.201804181042

Can you please check now?

[ipjoga@kte (AUS) 201804181042]$ ls -l
total 1986908
-rwxrwxrwx 1 ipjoga ipjoga 237028 Apr 18 22:15 dmesg.201804181042
-rwxrwxrwx 1 ipjoga ipjoga 2034352684 Apr 18 22:15 dump.201804181042
[ipjoga@kte (AUS) 201804181042]$ pwd
/home/ipjoga/201804181042
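
Whether a copied dump is readable by other users can be confirmed from its mode bits before asking someone to retry. A minimal sketch in Python, using only the standard `os` and `stat` modules; the function name `readable_by_others` is illustrative, and it inspects only the file itself, not the permissions of its parent directories.

```python
import os
import stat


def readable_by_others(path):
    """Return True if the 'other' read bit is set on path."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IROTH)
```

A root-owned file with owner-only permissions would return False here, which matches the "inaccessible to normal users" report; the world-readable copies listed above would return True.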

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-24 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-24 13:48 EDT---
I was able to compile the upstream qla2xxx driver version 10.00.00.04-k (commit 
1d1db6a3ca32ad52e97ed42d5c005d49fda7b589) under Ubuntu kernel 4.15.0-15-generic 
without errors or warnings. I have not tried it yet, and I also can't reproduce 
the issue on this system. This, of course, grabs all the commits since the 
10.00.00.02-k version that is in Ubuntu 18.04.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-24 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-24 12:50 EDT---
My attempts at running stress-ng on ltc-boston1 don't seem to use the QLogic 
disks. Are there options or config files needed to get it to stress certain 
disks?

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-24 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-24 11:27 EDT---
(In reply to comment #85)
> Copied the  dump to our kte server
>
> kte111.isst.aus.stglabs.ibm.com 9.3.111.155 [kte/don2rry]
>
> kte111:/LOGS/boslcp3/BZ166588/
>
> h# ls -l /LOGS/boslcp3/BZ166588/
> total 4
> drwxr-xr-x 2 root root 4096 Apr 19 02:42 201804181042
>
> Thanks.

The dump is inaccessible to normal users:

[kte@kte (AUS) 201804181042]$ ls -l
total 1986812
-rw--- 1 root root 237028 Apr 19 02:42 dmesg.201804181042
-rw--- 1 root root 2034352684 Apr 19 02:42 dump.201804181042

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-24 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-24 11:06 EDT---
I'm not up to speed on the double-free issue, but if multiple work queues/pools 
referenced the same work item, you could get a double free situation. 
Essentially, the qla2xxx driver doing the double (triple, ...) insertion of a 
work item on multiple pools could result in additional work items, appended to 
that fatal qla2xxx item, getting executed on multiple kworker threads. So, it 
is possible that the skb-related failures were an odd case where some network 
work item was unlucky enough to get appended to one of these offending qla2xxx 
items, and we saw collateral damage (panics) as a result. If that unlucky work 
item were dynamically allocated, it could get freed multiple times.
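
The collateral-damage theory above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: it is plain Python rather than kernel code, the classes `Work` and `Pool` are invented for the illustration, and the single `next` link only loosely stands in for the intrusive list linkage of a real work item.

```python
class Work:
    """Toy work item with one intrusive link shared by every queue it is on."""

    def __init__(self, name):
        self.name = name
        self.next = None
        self.frees = 0  # how many times the item was "freed" after running


class Pool:
    """Toy worker pool: a singly linked queue of Work items."""

    def __init__(self):
        self.head = None

    def queue(self, work):
        # Append at the tail by following the items' own links.
        if self.head is None:
            self.head = work
            return
        tail = self.head
        while tail.next is not None:
            tail = tail.next
        tail.next = work

    def run(self):
        # "Execute" every reachable item, then free it.
        item = self.head
        while item is not None:
            item.frees += 1
            item = item.next


def simulate():
    bad = Work("qla2xxx-item")     # inserted twice: the suspected driver bug
    victim = Work("network-item")  # unrelated, dynamically allocated work
    pool_a, pool_b = Pool(), Pool()
    pool_a.queue(bad)
    pool_b.queue(bad)      # the same item now heads both pools
    pool_a.queue(victim)   # appended behind 'bad', so pool_b reaches it too
    pool_a.run()
    pool_b.run()
    return bad.frees, victim.frees
```

In this model `simulate()` reports two frees for both items: the unrelated item is freed twice purely because it was queued behind the doubly inserted one, which is the kind of collateral damage described above.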

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-24 Thread bugproxy
--- Comment From dnban...@us.ibm.com 2018-04-24 10:52 EDT---


Obviously this corruption happened a while ago. I poked around a
bit to see if there is any smoking gun around but nothing that
meets the eye. Since we have been seeing all this in the context
of other qla2xxx issues (where there were trails), I tried to see
if we can find those actors here. And indeed they are in this
scene too:

PID: 1085   TASK: c000200e48e2e000  CPU: 94  COMMAND: "kworker/94:1"
#0 [c000200e48de7940] __schedule at c0d05d24
#1 [c000200e48de7a10] schedule at c0d065b0
#2 [c000200e48de7a30] schedule_timeout at c0d0b3d0
#3 [c000200e48de7b30] msleep at c01b5e2c
#4 [c000200e48de7b60] qlt_free_session_done at c0080ef1faf0 [qla2xxx]
#5 [c000200e48de7c90] process_one_work at c0132bd8
#6 [c000200e48de7d20] worker_thread at c0132f78
#7 [c000200e48de7dc0] kthread at c013bba8
#8 [c000200e48de7e30] ret_from_kernel_thread at c000b528

PID: 1750   TASK: c000200e4daf3c00  CPU: 94  COMMAND: "kworker/94:2"
#0 [c000200e4db17940] __schedule at c0d05d24
#1 [c000200e4db17a10] schedule at c0d065b0
#2 [c000200e4db17a30] schedule_timeout at c0d0b3d0
#3 [c000200e4db17b30] msleep at c01b5e2c
#4 [c000200e4db17b60] qlt_free_session_done at c0080ef1faf0 [qla2xxx]
#5 [c000200e4db17c90] process_one_work at c0132bd8
#6 [c000200e4db17d20] worker_thread at c0132f78
#7 [c000200e4db17dc0] kthread at c013bba8
#8 [c000200e4db17e30] ret_from_kernel_thread at c000b528

PID: 3937   TASK: c000200e3b1a3f00  CPU: 94  COMMAND: "kworker/94:3"
#0 [c000200e3b2f3940] __schedule at c0d05d24
#1 [c000200e3b2f3a10] schedule at c0d065b0
#2 [c000200e3b2f3a30] schedule_timeout at c0d0b3d0
#3 [c000200e3b2f3b30] msleep at c01b5e2c
#4 [c000200e3b2f3b60] qlt_free_session_done at c0080ef1faf0 [qla2xxx]
#5 [c000200e3b2f3c90] process_one_work at c0132bd8
#6 [c000200e3b2f3d20] worker_thread at c0132f78
#7 [c000200e3b2f3dc0] kthread at c013bba8
#8 [c000200e3b2f3e30] ret_from_kernel_thread at c000b528

While they are all sleeping on cpu 94's worker threads, they do
belong to different fc_ports. However, their existence and
their propensity to cause issues because of the way they can be
scheduled does give pause for thought.

While I didn't want to get tunnel vision w.r.t. the free_work/del_work
issues we have seen elsewhere for qla2xxx, I did want to include
the information for completeness. Also, we haven't seen this
with the patch encoding the initial experiment w.r.t. (free_/del_)work
described in #132.

I then decided to take a look at the bug again, back in history.
And then this thing caught my eye! It was captured in a log by
Indira.

[31751.586142] Sending NMI from CPU 104 to CPUs 1:
[31751.586257] NMI backtrace for cpu 1
[31751.586260] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-15-generic 
#16-Ubuntu
[31751.586262] NIP:  c00a40b4 LR: c00a40b4 CTR: c0008000
[31751.586264] REGS: c00ff91bbc40 TRAP: 0100   Not tainted  
(4.15.0-15-generic)
[31751.586265] MSR:  90001033   CR: 24004482  
XER: 
[31751.586270] CFAR: c00ff91bbda0 SOFTE: -4611685949823549440
[31751.586270] GPR00: c00a40b4 c00ff91bbda0 c16eb400 
c00ff91bbc40
[31751.586270] GPR04: b0cpu 0x79: Vector: 700 (Program Check) at 
[c000200e5831b450]
pc: c038ba38: kmem_cache_free+0xc8/0x2b0
lr: c02dfd4c: mempool_free_slab+0x2c/0x40
sp: c000200e5831b6d0
msr: 90029033
current = 0xc000200e58205c00
paca= 0xc7a73300   softe: 0irq_happened: 0x01
pid   = 0, comm = swapper/121
kernel BUG at /build/linux-QzAGR9/linux-4.15.0/mm/slub.c:296!
Linux version 4.15.0-15-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 
(Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 (Ubuntu 
4.15.0-15.16-generic 4.15.15)
cpu 0x9: Vector: 100 (System Reset) at [c7f39d80]
pc: c00ed874: kvmppc_got_guest+0x1cc/0x380
lr: c00ed7f0: kvmppc_got_guest+0x148/0x380
sp: c0042754f4d0
msr: 900102883003
current = 0xc003a3e87300
paca= 0xc7a26300   softe: 0irq_happened: 0x01
pid   = 33539, comm = CPU 3/KVM
Linux version 4.15.0-15-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 
(Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 (Ubuntu 
4.15.0-15.16-generic 4.15.15)
cpu 0x22: Vector: 100 (System Reset) at [c7e0dd80]
pc: c00eddb8: mc_cont+0x38/0x13c
lr: c00ee5b0: hcall_try_real_mode+0x60/0x7c
sp: c004275374d0
msr: 90081033
current = 0xc003a3eb8100
paca= 0xc7a37600   softe: 0irq_happened: 0x01
pid   = 33540, comm = CPU 4/KVM
Linux version 4.15.0-15-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 
(Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-24 Thread bugproxy
--- Comment From kla...@br.ibm.com 2018-04-24 10:42 EDT---
(In reply to comment #166)
> this point, I don't see a connection to KVM or even Ubuntu vs. Pegas. This
> appears to be something that will happen in any distro that has the right
> vintage of qla2xxx driver. Not sure why we think this did not happen on
> Ubuntu -13 kernel - I see nothing in the diffs of qla2xxx that would affect
> this.

What is puzzling is that we had multiple reproductions (with guests,
without guests) prior to Dwip's patch.

With Dwip's patch, which arguably does not cover all scenarios, we
didn't get any repro.

Without Dwip's patch, but reverting 165988, so far it looks clean as
well. I couldn't find a definitive link between reverting 165988 and
running clean for this testcase, other than speculating that for this
system, "cpu_present_mask" is different from "cpu_possible_mask", or
that there are other changes in how IRQs are distributed whose effect
I'm not seeing.

I'd feel more comfortable having another system reproducing the original
issue, where we can debug/experiment better, while allowing boslcp3 to
continue the test run with the current kernel (it is the closest thing
we have to what will appear in GA anyway, I think).
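
The speculation about "cpu_present_mask" versus "cpu_possible_mask" can be checked from userspace, since Linux exposes both as CPU lists in sysfs. A minimal sketch in Python, assuming the standard /sys/devices/system/cpu/possible and /sys/devices/system/cpu/present files; the helper names `parse_cpu_list` and `read_mask` are illustrative.

```python
def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0-3,8,10-11" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_mask(name, base="/sys/devices/system/cpu"):
    """Read one CPU list file, e.g. name="possible" or name="present"."""
    with open(f"{base}/{name}") as f:
        return parse_cpu_list(f.read())
```

On the host, `read_mask("possible") - read_mask("present")` would list any CPUs that are possible but not present; an empty set would mean the two masks are identical on that system.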

--- Comment From dnban...@us.ibm.com 2018-04-24 10:50 EDT---
When we first started looking at this bug, the captured issue seemed
slightly different - a case where an skb allocation appeared to be
failing due to (what appeared to be) a corrupted slab cache.

Thereafter we got a series of kworker thread related failures which
have been diagnosed above. However, while looking at those crashes
I chanced upon an instance that seems more related to the corrupted
slab cache.

I decided to dig a bit to see if those instances are clearly
correlated (any relation?) or if there are other things we need to be
aware of.

There is a crash from sometime on Friday April 20 (likely on the wrk_dbg
kernel that just had the debug object configuration turned on).

The stack trace was a little different and it worked with the stock
kernel ...

KERNEL: /usr/lib/debug/boot/vmlinux-4.15.0-15-generic
DUMPFILE: dump.201804201534  [PARTIAL DUMP]
CPUS: 160
DATE: Fri Apr 20 15:33:10 2018
UPTIME: 00:03:51
LOAD AVERAGE: 3.22, 0.76, 0.25
TASKS: 1791
NODENAME: boslcp3
RELEASE: 4.15.0-15-generic
VERSION: #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018
MACHINE: ppc64le  (2134 Mhz)
MEMORY: 128 GB
PANIC: "Unable to handle kernel paging request for data at address 0x26eed6a1145b0a2a"
PID: 5874
COMMAND: "systemd-udevd"
TASK: c00fe6482e00  [THREAD_INFO: c00fe9394000]
CPU: 80
STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 5874   TASK: c00fe6482e00  CPU: 80  COMMAND: "systemd-udevd"
#0 [c00fe93975a0] crash_kexec at c01e3950
#1 [c00fe93975e0] oops_end at c0025888
#2 [c00fe9397660] bad_page_fault at c006a900
#3 [c00fe93976d0] slb_miss_bad_addr at c0027764
#4 [c00fe93976f0] bad_addr_slb at c0008a1c
Data SLB Access [380] exception frame:
R0:  c0389874   R1:  c00fe93979e0   R2:  c16eb400
R3:  0001   R4:  00e608fe511d18e9   R5:  03cc
^^BAD
R6:  0001   R7:  03cb   R8:  e608fe511d1854c2
^^BAD
R9:     R10:    R11: 00f1
R12: 2000   R13: c7a57000   R14: c00fdc28f080
R15:    R16: 0001   R17: c00fae88d800
R18: 0002   R19:    R20:
R21:    R22:    R23: c1621200
R24: e6eef6af4c054c2b   R25: c000200e585e4601   R26: 26eed6a1145b0a2a
^^^BAD^^^
^^^ptr^^^freelist_ptr^
R27: c0b32514   R28: c00ff901ee00   R29: 014000c0
kmem_cache^^ gfpflags^
R30: c000200e585e4601   R31: c00ff901ee00
^^^object^^^
NIP: c03899a0   MSR: 90009033   OR3: c0016e1c
CTR:    LR:  c038998c   XER:
CCR: 28002808   MQ:  0001   DAR: 26eed6a1145b0a2a
DSISR:    Syscall Result:
#5 [c00fe93979e0] kmem_cache_alloc at c03899a0
[Link Register] [c00fe93979e0] kmem_cache_alloc at c038998c  
(unreliable)
#6 [c00fe9397a40] skb_clone at c0b32514
#7 [c00fe9397a70] netlink_broadcast_filtered at c0ba84a0
#8 [c00fe9397b30] netlink_sendmsg at c0babae4
#9 [c00fe9397bc0] sock_sendmsg at c0b1ec64
#10 [c00fe9397bf0] ___sys_sendmsg at c0b20abc
#11 [c00fe9397d90] __sys_sendmsg at c0b221ec
#12 [c00fe9397e30] system_call at c000b184
System Call [c00] exception frame:
R0:  0155   R1:  7fffe05d2e20   R2:  7b8f72337f00
R3:  000e   R4:  7fffe05d2ec8   R5:
R6:  

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-24 Thread bugproxy
--- Comment From dnban...@us.ibm.com 2018-04-24 10:39 EDT---
Doug, I somehow missed that note about the dump.

It is on boslcp3 (root/don2rry): /var/crash/201804181042 .

I believe they may have mirrored it to some other location as
well (I thought I saw a note about that, somewhere in this bug).
Yes indeed,  comment #85.

Regarding the crash, for this particular invocation:

In this case unfortunately it appears that threads on CPU 104 and
CPU 129 interfered in the work object scheduling/execution. CPU 129
was executing the worker function when qla2xxx driver scheduled
another instance of the same work object. As a result, after
the object was deleted from 129's queue it was inserted into
104's queue and the work->data was first written with the
correct pwq info (from the scheduling/insertion part) but then
overwritten with the marker value from the execution/deletion path.
The pool->lock won't protect against this kind of access. As a
result we have a work object in the list with the special
value instead of the proper (pwq) queue value.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-24 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-24 10:09 EDT---
The two commits that Dwip mentions look very pertinent. There may be others,
though, as it appears that a fair amount of work has been done in this area.

I still haven't gotten access to the dump(s), but another issue is that
the re-login attempt is started before the delete/free actions complete.
This appears to be initiated by qlt_unreg_sess() before scheduling
qlt_free_session_done(), so it seems possible that they could race.
Those two commits may address this adequately, though.

It appears that flaky, or fluttering, FC ports (or fabric) are
the cause of this. The login retry is succeeding before the
logout/delete/free is fully completed. That may be why we are not
reproducing this on other systems - the fabric just does not have the
instability seen on boslcp3. At this point, I don't see a connection to
KVM or even Ubuntu vs. Pegas. This appears to be something that will
happen in any distro that has the right vintage of qla2xxx driver. Not
sure why we think this did not happen on Ubuntu -13 kernel - I see
nothing in the diffs of qla2xxx that would affect this.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-23 Thread bugproxy
--- Comment From dnban...@us.ibm.com 2018-04-24 00:12 EDT---
While at it, please pull in the following commit as well (whenever
the next composite test kernel is being built)...

###

commit eaf75d1815dad230dac2f1e8f1dc0349b2d50071
Author: Quinn Tran 
Date:   Thu Feb 1 10:33:17 2018 -0800

scsi: qla2xxx: Fix double free bug after firmware timeout

#

I will post a detailed analysis later (tomorrow, since it is quite late now)
as to why that one appears necessary.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-23 Thread bugproxy
--- Comment From dnban...@us.ibm.com 2018-04-23 23:49 EDT---
I decided to take a look at qla2xxx driver's free and delete paths a little
more since my gut feeling was that these kinds of issues have to be
encountered by others too. Looking a little deeper I discovered these:

(Note this was from a quick  perusal)


commit d8630bb95f46ea118dede63bd75533faa64f9612
Author: Quinn Tran 
Date:   Thu Dec 28 12:33:43 2017 -0800

scsi: qla2xxx: Serialize session deletion by using work_lock

for session deletion, replace sess_lock with work_lock.
Under certain case sess_lock is not feasiable to acquire.
The lock is needed temporarily to make sure a single
call to schedule of the work element.



commit 9cd883f07a54e5301d51e259acd250bb035996be

+   /* use cancel to push work element through before re-queue */
+   cancel_work_sync(&sess->del_work);
    INIT_WORK(&sess->del_work, qla24xx_delete_sess_fn);
    queue_work(sess->vha->hw->wq, &sess->del_work);



commit 1ae634eb28533b82f9777a47c1ade44cb8c0182b
Author: Quinn Tran 
Date:   Thu Dec 28 12:33:44 2017 -0800

scsi: qla2xxx: Serialize session free in qlt_free_session_done

Add free_pending flag to serialize queueing of
free_work element onto the work queue

Signed-off-by: Quinn Tran 
Signed-off-by: Himanshu Madhani 
Signed-off-by: Martin K. Petersen 

diff --git a/drivers/scsi/qla2xxx/qla_target.c b/drivers/scsi/qla2xxx/qla_target.c
index 72b452d..0d3c3f6 100644
--- a/drivers/scsi/qla2xxx/qla_target.c
+++ b/drivers/scsi/qla2xxx/qla_target.c
@@ -1105,6 +1105,7 @@ static void qlt_free_session_done(struct work_struct *work)
            sess->plogi_link[QLT_PLOGI_LINK_SAME_WWN] = NULL;
        }
    }
+
    spin_unlock_irqrestore(&ha->tgt.sess_lock, flags);

    ql_dbg(ql_dbg_tgt_mgt, vha, 0xf001,
@@ -1118,6 +1119,9 @@ static void qlt_free_session_done(struct work_struct *work)
    wake_up_all(&vha->fcport_waitQ);

    base_vha = pci_get_drvdata(ha->pdev);
+
+   sess->free_pending = 0;
+
    if (test_bit(PFLG_DRIVER_REMOVING, &base_vha->pci_flags))
        return;
@@ -1140,11 +1144,20 @@ static void qlt_free_session_done(struct work_struct *work)
 void qlt_unreg_sess(struct fc_port *sess)
 {
    struct scsi_qla_host *vha = sess->vha;
+   unsigned long flags;

    ql_dbg(ql_dbg_disc, sess->vha, 0x210a,
        "%s sess %p for deletion %8phC\n",
        __func__, sess, sess->port_name);

+   spin_lock_irqsave(&sess->vha->work_lock, flags);
+   if (sess->free_pending) {
+       spin_unlock_irqrestore(&sess->vha->work_lock, flags);
+       return;
+   }
+   sess->free_pending = 1;
+   spin_unlock_irqrestore(&sess->vha->work_lock, flags);
+
    if (sess->se_sess)
        vha->hw->tgt.tgt_ops->clear_nacl_from_fcport_map(sess);



The last one is obviously a much more refined and well-thought-out
version of the free_work changes...

Obviously the code given in #132 was an attempt to move the testing/debug
forward and validate the cause analysis for the crashes.

Going forward, the relevant set of these changes needs to be selected
with extreme diligence and pulled into the kernel (Canonical?). I only
did a very quick walk-through, so the changes need to be picked carefully.
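
For readers who want to see the effect of the free_pending change in isolation, here is a small userspace sketch. It is only a model under assumptions, not the driver code: struct session_model and the session_*() helpers are hypothetical names, and a C11 atomic exchange stands in for the check-and-set that the patch does under work_lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few fc_port fields that matter here. */
struct session_model {
    atomic_bool free_pending;  /* true while a free is queued or running */
    atomic_int  times_queued;  /* how many times the free work was queued */
};

void session_model_init(struct session_model *s)
{
    atomic_init(&s->free_pending, false);
    atomic_init(&s->times_queued, 0);
}

/*
 * Models the guard added to qlt_unreg_sess(): only the caller that flips
 * free_pending from false to true may queue the free work.
 * Assumption: atomic_exchange() replaces the spinlock-protected test of
 * the real patch.
 */
bool session_request_free(struct session_model *s)
{
    if (atomic_exchange(&s->free_pending, true))
        return false;               /* a free is already pending */
    atomic_fetch_add(&s->times_queued, 1);
    return true;
}

/* Models qlt_free_session_done() clearing the flag once the work has run. */
void session_free_done(struct session_model *s)
{
    atomic_store(&s->free_pending, false);
}

/* Three back-to-back requests; returns how often the work was queued. */
int demo_three_unreg_calls(void)
{
    struct session_model s;

    session_model_init(&s);
    session_request_free(&s);
    session_request_free(&s);
    session_request_free(&s);
    return atomic_load(&s.times_queued);
}
```

In this model, repeated requests collapse into a single queued free until session_free_done() runs, which is the behaviour the commit message describes.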

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-23 Thread bugproxy
--- Comment From chngu...@us.ibm.com 2018-04-23 17:35 EDT---
(In reply to comment #154)
> (In reply to comment #153)
> Chanh, please also clarify the steps you're using on your test. We have a
> Dev P9 system ready to start reproducing/debugging this (comment 144), we
> need direction to hit that issue.

I have started the test on boslcp3:

root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-15-generic #16+bug166588 SMP Mon Apr 23 11:50:06 CDT 2018 
ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~# uptime
16:30:35 up  1:09,  6 users,  load average: 106.57, 103.36, 82.17
root@boslcp3:~# virsh list
 Id    Name                           State
----------------------------------------------------
 1     boslcp3g1                      running    (rootvg is on SAN via qlogic 16Gb)
 2     boslcp3g3                      running    (rootvg is on LSI)
 4     boslcp3g4                      running    (rootvg is on LSI)

root@boslcp3:~# ps -ef |grep stress-ng |head -2
root  12103   6161  0 16:12 pts/0    00:00:00 ./stress-ng -t 44h --aio 4 --hdd 4 --rawdev 10 --readahead 10 --revio 10 --seek 10 --sync-file 10
root  12104  12103  0 16:12 pts/0    00:00:02 ./stress-ng -t 44h --aio 4 --hdd 4 --rawdev 10 --readahead 10 --revio 10 --seek 10 --sync-file 10

**
Steps we are running on boslcp3:
- Install Host with kernel patch in #151.
- Install guest with kernel patch 166877.
- 3 guests are running the ltp testsuite. Add stress-ng if needed.
- Start stress on Host:
nohup ./stress-ng -t 44h --aio 4 --hdd 4 --rawdev 10 --readahead 10 --revio 10 --seek 10 --sync-file 10 > /tmp/iolog 2>&1

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-23 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-23 16:14 EDT---
Regarding the "init" condition of the work item in the crash analysis, besides 
INIT_WORK() this condition would also be present after using list_del_init(), 
which is done just prior to executing the work function. So, if multiple pools 
were pointing to this work item, when one (any, all) starts to execute it then 
this condition will exist. This state should show the work item removed from 
the pool, although it is only possible for that remove to work for one pool 
(the last one to which the entry was added), and any other pools will not be 
able to get rid of that work item. The exact result of all this is still not 
clear, but it seems there are several possible ways that the tasks get 
effectively hung.
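
The point about list_del_init() can be tried out in a tiny userspace model. This is a sketch under assumptions, not kernel code: struct wl_node and the wl_*() helpers are hypothetical names that mimic the usual circular doubly-linked list, and the two "pools" are just two list heads.

```c
#include <stdbool.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct wl_node {
    struct wl_node *next, *prev;
};

/* Self-linked state, as left by INIT_LIST_HEAD() or list_del_init(). */
void wl_init(struct wl_node *n)
{
    n->next = n;
    n->prev = n;
}

/* Insert n at the tail of the list anchored at head. */
void wl_add_tail(struct wl_node *n, struct wl_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink n from the neighbours it currently records, then re-init it. */
void wl_del_init(struct wl_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    wl_init(n);
}

bool wl_is_self_linked(const struct wl_node *n)
{
    return n->next == n && n->prev == n;
}

/*
 * Queue one item on pool A and then, without removing it, on pool B.
 * Removing it afterwards only repairs B. Returns true if A still points
 * at the item even though the item itself looks freshly initialised.
 */
bool demo_stale_entry_after_del_init(void)
{
    struct wl_node pool_a, pool_b, item;

    wl_init(&pool_a);
    wl_init(&pool_b);
    wl_add_tail(&item, &pool_a);
    wl_add_tail(&item, &pool_b);   /* erroneous second queueing */

    wl_del_init(&item);

    return wl_is_self_linked(&item) &&
           wl_is_self_linked(&pool_b) &&   /* B was cleaned up */
           pool_a.next == &item;           /* A was not */
}
```

In the model, the removal only fixes the last list the item was added to; the first list keeps a stale pointer to an item that now points at itself.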

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-23 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-23 16:04 EDT---
Looking closer at the logs of the crash in comment #81, I see that there are 3 
calls into qlt_unreg_sess(), for the same port, in a span of less than 2 
seconds. Between the time that the first instance of qlt_free_session_done() is 
queued and it actually executes, the port has a re-login event. It's difficult 
to tell exactly what all the login messages mean, but it appears that the login 
procedure continues even though qlt_free_session_done() appears to have logged 
out the port. There are then two more qlt_unreg_sess() calls during the login
sequence, but both of the kworker threads associated with those seem to get
stuck in the "wait for logout complete" loop - at least according to the stack
traces in the dump. qla24xx_fcport_handle_login() seems to be getting called
repeatedly, or at least as many times as qlt_unreg_sess() gets called. It is
unclear whether that is normal, or a reaction to simultaneous attempts to log
out and shut down the port.

It also seems strange that more mutexing is not happening, although I
can't see conclusive proof that these contradictory events are happening
simultaneously.

I do see that qlt_free_session_done(), among others, makes calls to
qla2x00_fcport_event_handler() which would also handle the login. There
may be some interesting stack traces in that core file, that might give
a better picture of what is going on.

Was that core file preserved someplace that we can look at?

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-23 Thread bugproxy
--- Comment From kla...@br.ibm.com 2018-04-23 15:51 EDT---
(In reply to comment #153)
> Current status of boslcp3 for record here:
> root@boslcp3:~# uname -a
> Linux boslcp3 4.15.15tst1 #4 SMP Sat Apr 21 16:57:31 CDT 2018 ppc64le
> ppc64le ppc64le GNU/Linux
> root@boslcp3:~# uptime
>  14:32:57 up 1 day, 19:53,  8 users,  load average: 147.86, 146.92, 146.12
> root@boslcp3:~# virsh list
>  Id    Name                           State
> ----------------------------------------------------
>  8     boslcp3g4                      running
>  22    boslcp3g1                      running
>  25    boslcp3g3                      running
>
> root@boslcp3:~#
>
> I am going to apply the new kernel in comment #151 and restart our test.

Chanh, please also clarify the steps you're using on your test. We have
a Dev P9 system ready to start reproducing/debugging this (comment 144),
and we need direction to hit that issue.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-23 Thread bugproxy
--- Comment From kla...@br.ibm.com 2018-04-23 15:30 EDT---
Chanh, please provide results of the testing with kernel in comment 151

--- Comment From chngu...@us.ibm.com 2018-04-23 15:35 EDT---
Current status of boslcp3 for record here:
root@boslcp3:~# uname -a
Linux boslcp3 4.15.15tst1 #4 SMP Sat Apr 21 16:57:31 CDT 2018 ppc64le ppc64le 
ppc64le GNU/Linux
root@boslcp3:~# uptime
14:32:57 up 1 day, 19:53,  8 users,  load average: 147.86, 146.92, 146.12
root@boslcp3:~# virsh list
 Id    Name                           State
----------------------------------------------------
 8     boslcp3g4                      running
 22    boslcp3g1                      running
 25    boslcp3g3                      running

root@boslcp3:~#

I am going to apply the new kernel in comment #151 and restart our test.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-23 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-23 09:58 EDT---
Regarding rcu_sched stalls, or any other manifestation of hangs, having a work 
item in this condition (next, prev point to itself) on an active worklist would 
effectively cause a linked-list loop. When a kworker thread reaches this work 
item, it will effectively hang, performing this work function over and over. 
That would likely result in any of Hard LOCKUP, soft lockup, and/or rcu_sched 
stall warnings. The key malfunction here seems to be that this work item had 
INIT_WORK() performed on it while it was still on an (active) pool->worklist. 
The fact that two kworker threads are both running this work item (probably 
both hung in the loop) suggests that this work item got added to two pools, but 
it's not clear why we see the work item in the INIT_WORK state - unless a third 
instance was in qlt_unreg_sess() and had just performed INIT_WORK.
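
The linked-list loop described above is easy to reproduce in a userspace model. The following is a sketch only: struct ring_node and the ring_*() helpers are hypothetical names, ring_init() plays the role of the list re-initialisation done by INIT_WORK(), and the walk is bounded so that the demo terminates where a real kworker would not.

```c
/* Minimal userspace stand-in for the kernel's struct list_head. */
struct ring_node {
    struct ring_node *next, *prev;
};

/* Self-linked state: what an entry looks like right after initialisation. */
void ring_init(struct ring_node *n)
{
    n->next = n;
    n->prev = n;
}

/* Insert n at the tail of the list anchored at head. */
void ring_add_tail(struct ring_node *n, struct ring_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Walk forward from head. Returns the number of entries seen if the walk
 * gets back to head, or -1 if it is still going after max_steps entries.
 */
int ring_walk(const struct ring_node *head, int max_steps)
{
    const struct ring_node *pos;
    int seen = 0;

    for (pos = head->next; pos != head; pos = pos->next) {
        if (seen == max_steps)
            return -1;
        seen++;
    }
    return seen;
}

/* Returns the walk result after re-initialising an entry that is queued. */
int demo_reinit_while_queued(void)
{
    struct ring_node head, a, b, c;

    ring_init(&head);
    ring_add_tail(&a, &head);
    ring_add_tail(&b, &head);
    ring_add_tail(&c, &head);

    ring_init(&b);   /* models INIT_WORK() on an item still on the worklist */

    return ring_walk(&head, 1000);
}
```

Once the queued entry points at itself, the walker reaches it and never gets back to the list head, which is the kind of spin that would surface as lockup or rcu_sched stall warnings.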

A closer look at the kworker pools might be in order, as well as
searching for any tasks currently in qlt_unreg_sess().

The qlt_unreg_sess() function performs two main operations on the work
item: INIT_WORK() and schedule_work(). There are no normal program
breaks between these two operations, so ordinarily they would both get
performed in a very short period of time. However, it is conceivable
that a thread might be performing INIT_WORK() at the same time that
another thread(s) is executing the work item, and that might cause the
panic and leave the work item in this partially-queued state.

It still seems we must be getting multiple threads performing
qlt_unreg_sess() on the same port at the same time. Not sure the best
way to try and catch that, but perhaps some way of verifying the state
of the work item before doing INIT_WORK might do the job. One problem
may be if the work item is left in an indeterminate state after it is
run. I don't see any sort of mutex in the fc_port structure, so not sure
how qla2xxx ensures only one operation is performed at a time. Also,
this may not be two threads actually running in qlt_unreg_sess() at the
same time, but simply two threads successively running qlt_unreg_sess()
without the work item being completed in between.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-23 Thread bugproxy
--- Comment From indira.pr...@in.ibm.com 2018-04-23 09:46 EDT---
Latest update on boslcp3

The boslcp3g1 and boslcp3g4 guests have been up and running for 32 hours
without any hang/crash.
The boslcp3g3 guest run went fine for 24 hours, but after that we saw "nfs:
server 10.33.11.31 not responding, timed out" messages, and show.report.py and
other commands stopped responding. So we restarted the guest; since then the
guest has become very slow and started dumping rcu stall messages, as below:

root@boslcp3:~# virsh console --force boslcp3g3

Connected to domain boslcp3g3
Escape character is ^]
[ 2648.583544] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2648.704426] 9-: (1 GPs behind) idle=ca8/0/0 softirq=18376/18379 fqs=2805
[ 2648.738639] (detected by 4, t=10132 jiffies, g=25268, c=25267, q=2218)
[ 2649.063533] rcu_sched kthread starved for 2036 jiffies! g25268 c25267 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=4

Host is doing fine as of now.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-22 Thread bugproxy
--- Comment From cha...@us.ibm.com 2018-04-22 11:52 EDT---
*** Bug 167045 has been marked as a duplicate of this bug. ***

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-22 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-22 11:43 EDT---
I think the fact that two threads are in qlt_free_session_done() for the same 
fcport at the same time is definitely a problem. I'm not sure how two different 
kworker pools could contain the same work item. If the same work item got put 
on two pools, that would effectively merge the two pool lists at that point 
(the item's next pointer would point to the remainder of the list for the last 
pool the item was added to). This would mean that two kworker threads would end 
up working the same list at the same time, leading to something like what we 
see. But, the work item that crashes might not be the one that was erroneously 
added to both pools (although it is probably likely).
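
The "merged lists" effect can be shown with a short userspace model. This is a sketch under assumptions: struct pl_node and the pl_*() helpers are hypothetical names for a plain circular doubly-linked list, and the two pools are simply two list heads with one entry each.

```c
#include <stdbool.h>

/* Minimal userspace stand-in for the kernel's struct list_head. */
struct pl_node {
    struct pl_node *next, *prev;
};

void pl_init(struct pl_node *n)
{
    n->next = n;
    n->prev = n;
}

/* Insert n at the tail of the list anchored at head. */
void pl_add_tail(struct pl_node *n, struct pl_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Walk forward from head for at most max_steps entries and report whether
 * target is encountered before the walk returns to head.
 */
bool pl_walk_reaches(const struct pl_node *head, const struct pl_node *target,
                     int max_steps)
{
    const struct pl_node *pos = head->next;
    int i;

    for (i = 0; i < max_steps && pos != head; i++) {
        if (pos == target)
            return true;
        pos = pos->next;
    }
    return false;
}

/*
 * Pool A holds a1, pool B holds b1. The shared item is queued on A and
 * then, without being removed, on B. Returns true if a walk of A's
 * worklist now runs into B's entry b1.
 */
bool demo_item_on_two_pools(void)
{
    struct pl_node pool_a, pool_b, a1, b1, item;

    pl_init(&pool_a);
    pl_init(&pool_b);
    pl_add_tail(&a1, &pool_a);
    pl_add_tail(&b1, &pool_b);

    pl_add_tail(&item, &pool_a);
    if (pl_walk_reaches(&pool_a, &b1, 16))
        return false;               /* lists are still separate here */

    pl_add_tail(&item, &pool_b);    /* erroneous second queueing */
    return pl_walk_reaches(&pool_a, &b1, 16);
}
```

After the second queueing, the item's next pointer leads into pool B, so a worker walking pool A ends up processing B's entries as well.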

This scenario might explain the strange condition of the pool pointers
and work item. It appears that the offending work_struct has had
INIT_WORK() run on it, but the pool still has it on the worklist. The
only place I see INIT_WORK() run on the free_work item is from
qlt_unreg_sess(), but this is the predecessor to qlt_free_session_done()
being run, and should not ordinarily happen on the same CPU/thread.
Still, the INIT_WORK() immediately precedes schedule_work() so the work
item should only exist in this condition for a short time - unless
something preempts the code or another CPU/thread races in and crashes
before it can complete. Are there any tasks (in the dumps) that are
running in qlt_unreg_sess()?

It's looking as if some locking malfunction or memory corruption (which
could cause lock malfunction) is happening here.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-22 Thread bugproxy
--- Comment From indira.pr...@in.ibm.com 2018-04-22 10:12 EDT---
(In reply to comment #130)
> (In reply to comment #128)
> > Machine still up...
> >
> > root@boslcp3:~# virsh list
> >  Id    Name                           State
> > ----------------------------------------------------
> >  8     boslcp3g4                      running
> >  14    boslcp3g3                      running
> >  20    boslcp3g1                      running
> >
> > root@boslcp3:~# uptime
> >  08:29:09 up 13:49,  8 users,  load average: 77.56, 84.04, 86.55
> >
> > logs are still pretty clean..
> >
> > I had built on that system ...but will put a patch up here later.
>
> Hi Dwip,
>
> Run is going fine on boslcp3g3, & boslcp3g4 guests but boslcp3g1 testcases
> failing .
> so restarting the run on boslcp3g1. We need to wait for some more hours to
> conclude wrt to the issue recreation.
>
> Please wait for some more time.
>
> Regards,
> Indira
>

root@boslcp3:~# uname -a
Linux boslcp3 4.15.15tst1 #4 SMP Sat Apr 21 16:57:31 CDT 2018 ppc64le ppc64le 
ppc64le GNU/Linux
root@boslcp3:~# uptime
09:10:20 up 14:30,  7 users,  load average: 96.45, 98.60, 94.12
root@boslcp3:~# date
Sun Apr 22 09:10:23 CDT 2018

root@boslcp3:~# virsh list --all
 Id    Name                           State
----------------------------------------------------
 8     boslcp3g4                      running
 14    boslcp3g3                      running
 22    boslcp3g1                      running

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-22 Thread bugproxy
--- Comment From dnban...@us.ibm.com 2018-04-22 09:45 EDT---
In response to Klaus #124 ...

This change is not related to anything before ...165988 or the reverted
stuff.

The fix only relates to the Qlogic adapter's delete/free work handling flow
for session unregistration; it attempts to ensure that only one instance is
active at a time.

--- Comment From indira.pr...@in.ibm.com 2018-04-22 09:46 EDT---
(In reply to comment #128)
> Machine still up...
>
> root@boslcp3:~# virsh list
>  Id    Name                           State
> ----------------------------------------------------
>  8     boslcp3g4                      running
>  14    boslcp3g3                      running
>  20    boslcp3g1                      running
>
> root@boslcp3:~# uptime
>  08:29:09 up 13:49,  8 users,  load average: 77.56, 84.04, 86.55
>
> logs are still pretty clean..
>
> I had built on that system ...but will put a patch up here later.

Hi Dwip,

The run is going fine on the boslcp3g3 and boslcp3g4 guests, but testcases are
failing on boslcp3g1, so we are restarting the run on boslcp3g1. We need to
wait some more hours before concluding anything about the issue recreation.

Please wait for some more time.

Regards,
Indira

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-22 Thread bugproxy
--- Comment From dnban...@us.ibm.com 2018-04-22 09:36 EDT---
Machine still up...

root@boslcp3:~# virsh list
 Id    Name                           State
----------------------------------------------------
 8     boslcp3g4                      running
 14    boslcp3g3                      running
 20    boslcp3g1                      running

root@boslcp3:~# uptime
08:29:09 up 13:49,  8 users,  load average: 77.56, 84.04, 86.55

logs are still pretty clean..

I had built on that system ...but will put a patch up here later.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-22 Thread bugproxy
--- Comment From kla...@br.ibm.com 2018-04-22 08:26 EDT---
(In reply to comment #121)
> The kernel mentioned in #114 was a quick, rough attempt to force one instance
> of free_work/del_work pending at a time.
>
> With that, the machine still seems to be up.
>
> root@boslcp3:~# uptime
>  22:21:50 up  3:41,  2 users,  load average: 77.72, 82.67, 81.90
> root@boslcp3:~# virsh list
>  Id    Name                           State
> ----------------------------------------------------
>  1     boslcp3g3                      running
>  2     boslcp3g4                      running
>  4     boslcp3g1                      running
>
> And the logs still look clean!

Thanks for the excellent investigation here and on Bugzilla 167103,
Dwip.

Can you clarify the changes to the kernel mentioned in comment 114? How
do they compare to the changes in Bz 165988 that we're considering
reverting?

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From dnban...@us.ibm.com 2018-04-21 23:43 EDT---
The kernel mentioned in #114 was a quick, rough attempt to force one instance
of free_work/del_work pending at a time.

With that, the machine still seems to be up.

root@boslcp3:~# uptime
22:21:50 up  3:41,  2 users,  load average: 77.72, 82.67, 81.90
root@boslcp3:~# virsh list
 Id    Name                           State
----------------------------------------------------
 1     boslcp3g3                      running
 2     boslcp3g4                      running
 4     boslcp3g1                      running

And the logs still look clean!

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From dnban...@us.ibm.com 2018-04-21 23:33 EDT---
Continuing from the description at #84...

This is going to be a long post. I debated putting it in as an
attachment, but placing it in the main body will probably
help with searching in the future.

===

So the pool_mayday_timeout() routine basically walked into the
weird/corrupted work item on the pool workqueue corresponding to
cpu 0x68 (104). So the critical question is who may have put
that item there with the strange work->data value
data = R10: 2040

BTW, pool_mayday_timeout got the pool thus:

static void pool_mayday_timeout(struct timer_list *t)
{
	struct worker_pool *pool = from_timer(pool, t, mayday_timer);
and then ...
	list_for_each_entry(work, &pool->worklist, entry)
		send_mayday(work);
}
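
As a side note, from_timer() is a thin wrapper around container_of(): it turns the pointer to the embedded mayday_timer back into the worker_pool that contains it. Below is a minimal user-space sketch of that pointer arithmetic. The struct layouts are trimmed stand-ins (an assumption for illustration), and pool_from_mayday_timer() is a made-up helper name, not a kernel function.

```c
#include <stddef.h>

/* Trimmed stand-ins; the real kernel structures have many more fields. */
struct timer_list {
    unsigned long expires;
    unsigned int flags;
};

struct worker_pool {
    int cpu;
    int id;
    struct timer_list mayday_timer;   /* embedded, not a pointer */
};

/* Same idea as the kernel's container_of(): subtract the member offset. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* What from_timer(pool, t, mayday_timer) boils down to in this sketch. */
static struct worker_pool *pool_from_mayday_timer(struct timer_list *t)
{
    return container_of(t, struct worker_pool, mayday_timer);
}
```

The same arithmetic explains the addresses used later in this post: a list entry at ...75b0 belongs to the work_struct at ...75a8 because entry sits 8 bytes into the structure.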

This is the timer:

struct timer_list {
entry = {
next = 0x5deadbeef200,
pprev = 0x0
},
expires = 0x1018d5103,
function = 0xc012e790 ,
flags = 0x168
}

crash> rd jiffies
c1713b00:  0001018d5380.S..

The corresponding worker pool:
-
struct worker_pool {
lock = {
{
rlock = {
raw_lock = {
slock = 0x8068
}
}
}
},
cpu = 0x68,
node = 0x8,
id = 0xd0,  Note the id!
flags = 0x1,
watchdog_ts = 0x1018d50b0,
worklist = {
next = 0xc00fe2a0b020,
prev = 0xc00fe2a075b0
},
nr_workers = 0x2,
nr_idle = 0x0,
idle_list = {
next = 0xc000200e60eb7db8,
prev = 0xc000200e60eb7db8
},
idle_timer = {
entry = {
next = 0x5deadbeef200,
pprev = 0x0
},
expires = 0x1018c6fa3,
function = 0xc012e980 ,
flags = 0x41c80068
},
mayday_timer = {
entry = {
next = 0x5deadbeef200,
pprev = 0x0
},
expires = 0x1018d5103,
function = 0xc012e790 ,
flags = 0x168
},
...
workers = {
next = 0xc0002000cb9fa138,
prev = 0xc0002000cb9f1708
},
detach_completion = 0x0,
worker_ida = {
ida_rt = {
gfp_mask = 0x700,
rnode = 0xc000200e51923bd8
}
},
attrs = 0xc00ff92cddf8,
hash_node = {
next = 0x0,
pprev = 0x0
},
refcnt = 0x1,
nr_running = {
counter = 0x0
},
rcu = {
next = 0x0,
func = 0x0
}
}

This pool->id will be used to compute a marker for a
deleted/executed item in the future.
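
A rough sketch of how that marker is derived, assuming the mainline 4.15 layout of work->data in include/linux/workqueue.h with CONFIG_DEBUG_OBJECTS_WORK disabled (four flag bits, one off-queue flag bit, pool id above that). The helper names are made up for this example, and the constants should be re-checked against the exact kernel source in use.

```c
/* Assumed layout: mainline 4.15, CONFIG_DEBUG_OBJECTS_WORK off. */
enum {
    WORK_OFFQ_FLAG_BASE  = 4,   /* first bit after the four WORK_STRUCT_* flag bits */
    WORK_OFFQ_FLAG_BITS  = 1,   /* WORK_OFFQ_CANCELING */
    WORK_OFFQ_POOL_SHIFT = WORK_OFFQ_FLAG_BASE + WORK_OFFQ_FLAG_BITS,
};

/* Value left in work->data once the item is off the queue:
 * the pool id shifted above the flag bits, with PENDING cleared. */
static unsigned long offq_marker(int pool_id)
{
    return (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT;
}

/* Inverse: recover the pool id from an off-queue data word. */
static int offq_pool_id(unsigned long data)
{
    return (int)(data >> WORK_OFFQ_POOL_SHIFT);
}
```

Under those assumptions, an item last handled by pool 0xd0 would carry 0x1a00 in work->data.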

Now let's walk through the workers on the work list:

The TWO entries there are:
c00fe2a075b0
c00fe2a0b020

Work struct #1
===
crash> work_struct c00fe2a075a8
struct work_struct {
data = {
counter = 0xc000200e60eba305 <<< DATA GOOD!!! (a pwq)
},
entry = {
next = 0xc000200e60eb7da0,
prev = 0xc00fe2a0b020
},
func = 0xc0080b1af5a8
}

Work struct #2
===
crash> work_struct 0xc00fe2a0b018
struct work_struct {
data = {
counter = 0x2040 <<<--  DATA BAD!!!
},
entry = {
next = 0xc00fe2a0b020,
prev = 0xc00fe2a0b020
},
func = 0xc0080b1af5a8
}

Note that Work struct #2 is the PROBLEM work item!!
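
To make the GOOD/BAD distinction concrete, here is a small decoder for the two data words above. It is only a sketch: the bit positions assume the mainline 4.15 layout with CONFIG_DEBUG_OBJECTS_WORK disabled, and the struct and function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed mainline 4.15 layout, CONFIG_DEBUG_OBJECTS_WORK off. */
#define WORK_STRUCT_PENDING      (UINT64_C(1) << 0)  /* item is queued or about to be */
#define WORK_STRUCT_PWQ          (UINT64_C(1) << 2)  /* upper bits hold a pool_workqueue pointer */
#define WORK_STRUCT_FLAG_BITS    8                   /* 4 flag bits + 4 colour bits */
#define WORK_STRUCT_WQ_DATA_MASK (~((UINT64_C(1) << WORK_STRUCT_FLAG_BITS) - 1))

struct work_data_view {
    bool pending;
    bool has_pwq;
    uint64_t pwq;   /* meaningful only when has_pwq is true */
};

static struct work_data_view decode_work_data(uint64_t data)
{
    struct work_data_view v;

    v.pending = (data & WORK_STRUCT_PENDING) != 0;
    v.has_pwq = (data & WORK_STRUCT_PWQ) != 0;
    v.pwq = v.has_pwq ? (data & WORK_STRUCT_WQ_DATA_MASK) : 0;
    return v;
}
```

With these assumptions, 0xc000200e60eba305 decodes as pending with a pwq pointer of 0xc000200e60eba300, which is what a queued item should look like. 0x2040 has neither bit set, so code that expects a pwq for an item found on a worklist would get NULL.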

One important thing to note: BOTH THE WORK ENTRIES HAVE THE SAME
WORK FUNCTION - i.e., the same entity likely created this work.

crash> dis 0xc0080b1af5a8 1
0xc0080b1af5a8 : addis   r2,r12,4

So these work entries were created by the QLogic driver!



--- Comment From dnban...@us.ibm.com 2018-04-21 23:36 EDT---


The call
===
void qlt_unreg_sess(struct fc_port *sess)
...
	INIT_WORK(&sess->free_work, qlt_free_session_done);
	schedule_work(&sess->free_work);
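
One way a worklist can end up with the link pattern seen in the dump above is if an item's list entry is re-initialised while it is still queued, since INIT_WORK() runs INIT_LIST_HEAD() on work->entry. Whether that is what happened here is an assumption at this point, not an established fact. The user-space sketch below only demonstrates the mechanism; the helpers are local re-implementations and bounded_walk() is a made-up name.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Same effect as the kernel's INIT_LIST_HEAD(): point the entry at itself. */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Append item at the tail of the list rooted at head. */
static void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Walk from head; return the number of entries, or -1 if the walk
 * does not get back to head within limit steps. */
static int bounded_walk(const struct list_head *head, int limit)
{
    int n = 0;

    for (const struct list_head *p = head->next; p != head; p = p->next)
        if (++n > limit)
            return -1;
    return n;
}
```

Queue item2 and then item1 on a worklist, then call init_list_head(&item2): the head still points at item2, item1 still points back at item2, but item2 now points only at itself. That is the same shape as Work struct #2, and a walk from the head no longer returns to the head.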

Now we can look at the embedding structure, which is the following:

crash> fc_port c00fe2a0af58
struct fc_port {
list = {
next = 0xc00fe2a074e8,
prev = 0xc00fe2a09f68
},
vha = 0xc000200e458b69a0,
node_name = "P\005\ah\001\000\241\245",
port_name = "P\005\ah\001\020\241\245",
d_id = {
b24 = 0x8cfdc0,
b = {
al_pa = 0xc0,
area = 0xfd,
domain = 0x8c,
rsvd_1 = 0x0
}
},
loop_id = 0x1000,
old_loop_id = 0x0,
conf_compl_supported = 0x0,
deleted = 0x2,
local = 0x0,
logout_on_delete = 0x1,
logo_ack_needed = 0x0,
keep_nport_handle = 0x0,
send_els_logo = 0x0,
login_pause = 0x0,
login_succ = 0x0,
query = 0x0,
nvme_del_work = {
data = {
counter = 0x0
},
entry = {
next = 0x0,
prev = 0x0
},
func = 0x0
},
nvme_del_done = {
done = 0x0,
wait = {
lock = {
{
rlock = {
raw_lock = {
slock = 0x0
}
}
}
},
head = {
next = 0x0,
prev = 0x0
}
}
},
nvme_prli_service_param = 0x0,
nvme_flag = 0x0,
conflict = 0x0,
logout_completed = 0x0,
generation = 0x0,
se_sess = 0x0,
sess_kref = {
refcount = {
refs = {
counter = 0x0
}
}
},
tgt = 0x0,
expires = 0x0,
del_list_entry = {
next = 0x0,
prev = 0x0
},
free_work = {  ---  THIS IS OUR (CORRUPTED) WORK!
data = {
counter = 0x2040
},
entry = {
next = 0xc00fe2a0b020,
prev = 0xc00fe2a0b020
},
func = 0xc0080b1af5a8 
},
plogi_link = {0x0, 0x0},
tgt_id = 0x0,
old_tgt_id = 0x0,
fcp_prio = 0x0,
fabric_port_name = " \b\000'\370\037N\261",
fp_speed = 0x,
port_type = FCT_TARGET,
state = {
counter = 0x3
},
flags = 0xb,
login_retry = 0x1d,
rport = 0xc00fd91399c8,
drport = 0x0,
supported_classes = 0x8,
fc4_type = 0x8,
fc4f_nvme = 0x0,
scan_state = 

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From dnban...@us.ibm.com 2018-04-21 21:30 EDT---
I gave a test kernel to Chanh to try out on boslcp3, based on the observations
from the crash (it has taken a while ...). It is just a quick initial attempt.

I will soon be posting the analysis. Meanwhile, boslcp3 still seems to
be up:

root@boslcp3:~# uptime
20:27:40 up  1:47,  3 users,  load average: 75.78, 76.19, 63.86
root@boslcp3:~# virsh list
 Id    Name                           State
----------------------------------------------------
 1     boslcp3g3                      running
 2     boslcp3g4                      running
 4     boslcp3g1                      running

Don't see any hung tasks (yet?) in the logs...


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From chngu...@us.ibm.com 2018-04-21 20:09 EDT---
Dwip provided a new kernel and we started the test on 3 guests.
root@boslcp3:~# date
Sat Apr 21 19:00:07 CDT 2018
root@boslcp3:~# uptime
19:00:09 up 20 min,  3 users,  load average: 0.01, 0.42, 0.48
root@boslcp3:~# uname -a
Linux boslcp3 4.15.15tst1 #4 SMP Sat Apr 21 16:57:31 CDT 2018 ppc64le ppc64le 
ppc64le GNU/Linux
root@boslcp3:~# virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     boslcp3g3                      running
 2     boslcp3g4                      running
 4     boslcp3g1                      running
 -     boslcp3g5                      shut off


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-21 19:17 EDT---
We need to see the console messages *before* things go wrong. Please capture 
the SOL console output from boot until this happens.


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From kla...@br.ibm.com 2018-04-21 18:55 EDT---
(In reply to comment #80)
> (In reply to comment #79)
> > Machine still seems to be up... will check if I can observe anything
> > interesting ...
>
> System just crashes it now. The vmcore is at /var/crash/201804181042

Can we retry this test on the P8 system using Brian's kernel in comment
#94?

Also, please post access information for this system.


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From chngu...@us.ibm.com 2018-04-21 17:02 EDT---
Not sure what is going on. The SOL console is printing out all of these messages...
rcu_sched self-detected stall on CPU
[20705.652053]  95-: (1 GPs behind) idle=c72/2/0 softirq=179/180 fqs=2586003
[20705.652101]   (t=5172329 jiffies g=213 c=212 q=74736)
[20705.652140] Task dump for CPU 95:
[20705.652164] swapper/95  R  running task0 0  1 0x0804
[20705.652213] Call Trace:
[20705.652231] [c000200fff3d3460] [c0149ed8] 
sched_show_task.part.16+0xd8/0x110 (unreliable)
[20705.652288] [c000200fff3d34d0] [c01a9e9c] 
rcu_dump_cpu_stacks+0xd4/0x138
[20705.652336] [c000200fff3d3520] [c01a8f68] 
rcu_check_callbacks+0x8e8/0xb40
[20705.652385] [c000200fff3d3650] [c01b7208] 
update_process_times+0x48/0x90
[20705.652433] [c000200fff3d3680] [c01cef54] 
tick_sched_handle.isra.5+0x34/0xd0
[20705.652482] [c000200fff3d36b0] [c01cf050] tick_sched_timer+0x60/0xe0
[20705.652530] [c000200fff3d36f0] [c01b7db4] 
__hrtimer_run_queues+0x144/0x370
[20705.652578] [c000200fff3d3770] [c01b8d0c] 
hrtimer_interrupt+0xfc/0x350
[20705.652627] [c000200fff3d3840] [c00248f0] 
__timer_interrupt+0x90/0x260
[20705.652675] [c000200fff3d3890] [c0024d08] timer_interrupt+0x98/0xe0
[20705.652716] [c000200fff3d38c0] [c0009014] 
decrementer_common+0x114/0x120
[20705.652765] --- interrupt: 901 at _raw_spin_lock_irqsave+0x88/0x110
[20705.652765] LR = _raw_spin_lock_irqsave+0x80/0x110
[20705.652836] [c000200fff3d3bb0] [c000200fff3d3bf0] 0xc000200fff3d3bf0 
(unreliable)
[20705.652885] [c000200fff3d3bf0] [c0904e90] 
scsi_end_request+0x110/0x270
[20705.652933] [c000200fff3d3c50] [c0905414] 
scsi_io_completion+0x424/0x750
[20705.652981] [c000200fff3d3d10] [c08f949c] 
scsi_finish_command+0x11c/0x1b0
[20705.653029] [c000200fff3d3d90] [c0904428] 
scsi_softirq_done+0x198/0x220
[20705.653078] [c000200fff3d3e20] [c068fe98] blk_done_softirq+0xb8/0xe0
[20705.653126] [c000200fff3d3e60] [c0cffb08] __do_softirq+0x158/0x3e4
[20705.653167] [c000200fff3d3f40] [c0115968] irq_exit+0xe8/0x120
[20705.653207] [c000200fff3d3f60] [c0017788] __do_irq+0x88/0x1c0
[20705.653248] [c000200fff3d3f90] [c002a1b0] call_do_irq+0x14/0x24
[20705.653289] [c000200e582fba90] [c001795c] do_IRQ+0x9c/0x130
[20705.653330] [c000200e582fbae0] [c0009b04] 
h_virt_irq_common+0x114/0x120
[20705.653379] --- interrupt: ea1 at replay_interrupt_return+0x0/0x4
[20705.653379] LR = arch_local_irq_restore+0x74/0x90
[20705.653459] [c000200e582fbdd0] [005f] 0x5f (unreliable)
[20705.653500] [c000200e582fbdf0] [c0ac16d0] 
cpuidle_enter_state+0xf0/0x450
[20705.653549] [c000200e582fbe50] [c017311c] call_cpuidle+0x4c/0x90
[20705.653590] [c000200e582fbe70] [c0173530] do_idle+0x2b0/0x330
[20705.653631] [c000200e582fbec0] [c01737ec] cpu_startup_entry+0x3c/0x50
[20705.653679] [c000200e582fbef0] [c004a050] start_secondary+0x4f0/0x510
[20705.653727] [c000200e582fbf90] [c000aa6c] 
start_secondary_prolog+0x10/0x14
[   ***] (2 of 2) A start job is running for? polling (5h 45min 33s / no limit)


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-21 14:26 EDT---
I have ltc-boston1 set up with Ubuntu kernel 4.15.0-15, but there is no SAN
connected to the QLE2742. I see no problem there right now. I have reserved the
system for this bug until Monday evening.


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From kla...@br.ibm.com 2018-04-21 09:21 EDT---
Should we go back to the stock Ubuntu kernel in an attempt to identify if bug 
167104 is a result of the custom kernel or the newest PNOR?

--- Comment From prad...@us.ibm.com 2018-04-21 13:17 EDT---
(In reply to comment #106)
> Should we go back to the stock Ubuntu kernel in an attempt to identify if
> bug 167104 is a result of the custom kernel or the newest PNOR?

Looking at this plethora of bugs on the host and guests, it seems
like a constantly shifting problem. I don't think that this particular
instance was caused by the custom kernel (reverting some QLogic
patches). In my opinion, the memory corruption theory seems to be
gaining currency.

Are systems without the QLogic adapters seeing any of the problems
reported here?

How do we debug with constantly moving pieces? Can we get a stable base
to start with? By that I mean we go back to a kernel and pnor that
worked in the past. Then using the same kernel advance the pnor and
validate how it works. We might need to limit the testing to "cater" to
the various bugs.

Once we have isolated the pnor, then we repeat with the kernels. Not
sure how long these activities will take, but we might need to consider
running a parallel exercise.

--- Comment From dnban...@us.ibm.com 2018-04-21 14:17 EDT---
boslcp3 seems to have gone off the rails again ... the console is spitting out a
lot of messages like:

[   ***] (2 of 2) A start job is running for?urnal Service

and it is not pingable


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From chngu...@us.ibm.com 2018-04-20 16:48 EDT---
Boslcp3 is back with the new kernel from #94.

root@boslcp3:~# cat /proc/cmdline
root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro splash quiet 
crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M@128M
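
For reference, the crashkernel= value above uses the range syntax: each start-end:size entry applies when the amount of system RAM falls in that range, and the @128M suffix is the requested offset. The sketch below shows that selection with the ranges transcribed by hand. The table, the names and the "0 means open-ended" convention are assumptions of this example, not kernel code.

```c
#include <stddef.h>

#define GiB (1ULL << 30)
#define MiB (1ULL << 20)

struct ck_range {
    unsigned long long start;   /* inclusive */
    unsigned long long end;     /* exclusive; 0 means open-ended */
    unsigned long long size;    /* reservation for this range */
};

/* The ranges from the command line above, transcribed by hand. */
static const struct ck_range ranges[] = {
    {   2 * GiB,   4 * GiB,  320 * MiB },
    {   4 * GiB,  32 * GiB,  512 * MiB },
    {  32 * GiB,  64 * GiB, 1024 * MiB },
    {  64 * GiB, 128 * GiB, 2048 * MiB },
    { 128 * GiB,         0, 4096 * MiB },
};

/* Reservation size for a machine with ram bytes, or 0 if no range matches. */
static unsigned long long crashkernel_size(unsigned long long ram)
{
    for (size_t i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
        const struct ck_range *r = &ranges[i];

        if (ram >= r->start && (r->end == 0 || ram < r->end))
            return r->size;
    }
    return 0;
}
```

A host with 128G of RAM or more would therefore get the 4096M reservation, and one with less than 2G would get none.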

I will launch our test soon.

--- Comment From chngu...@us.ibm.com 2018-04-20 20:05 EDT---
(In reply to comment #98)
> Boslcp3 is back with the new kernel from #94.
>
> root@boslcp3:~# uname -a
> Linux boslcp3 4.15.0-18-generic #19 SMP Fri Apr 20 12:45:38 CDT 2018 ppc64le
> ppc64le ppc64le GNU/Linux
> root@boslcp3:~# cat /proc/cmdline
> root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro splash quiet
> crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:
> 4096M@128M
>
> I will launch our test soon.

It is not looking good on boslcp3. Within 3 hours of starting the test, the
system is still pingable but I cannot ssh to it. Looking at the console, I see
these messages all over:

[ 8785.370897] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 8785.370962]  1-...0: (4 GPs behind) idle=ca2/141/0 
softirq=15273/15273 fqs=1075891
[ 8785.371035]  (detected by 3, t=2179442 jiffies, g=2107, c=2106, q=386665)
[ 8785.371090] Task dump for CPU 1:
[ 8785.371123] kworker/1:3 R  running task0  4111  2 0x0804
[ 8785.371195] Call Trace:
[ 8785.371221] [c000d5c4fa00] [c8133cf8] worker_thread+0x98/0x630 
(unreliable)
[ 8848.390897] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 8848.390964]  1-...0: (4 GPs behind) idle=ca2/141/0 
softirq=15273/15273 fqs=1083603
[ 8848.391037]  (detected by 3, t=2195197 jiffies, g=2107, c=2106, q=389679)
[ 8848.391092] Task dump for CPU 1:
[ 8848.391125] kworker/1:3 R  running task0  4111  2 0x0804
[ 8848.391197] Call Trace:
[ 8848.391223] [c000d5c4fa00] [c8133cf8] worker_thread+0x98/0x630 
(unreliable)
[ 8857.031091] systemd[1]: systemd-journald.service: Start operation timed out. 
Terminating.
***
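
As an aside, the t= values in these stall reports are in jiffies, and two consecutive reports are enough to recover the tick rate from the log itself. The sketch below does that arithmetic; the function names are made up for this example.

```c
/* Tick rate implied by two stall reports:
 * jiffies delta divided by the printk timestamp delta. */
static double implied_hz(unsigned long t0_jiffies, double ts0_sec,
                         unsigned long t1_jiffies, double ts1_sec)
{
    return (double)(t1_jiffies - t0_jiffies) / (ts1_sec - ts0_sec);
}

/* Convert a jiffies count to seconds for a given tick rate. */
static double jiffies_to_seconds(unsigned long jiffies, double hz)
{
    return (double)jiffies / hz;
}
```

Feeding in the two reports above (t=2179442 at 8785.371035 and t=2195197 at 8848.391037) gives roughly 250 ticks per second. At that rate t=2179442 is about 8718 seconds, so the stall appears to have started roughly a minute after boot.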

root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-18-generic #19 SMP Fri Apr 20 12:45:38 CDT 2018 ppc64le 
ppc64le ppc64le GNU/Linux

--- Comment From chetj...@in.ibm.com 2018-04-21 08:08 EDT---
(In reply to comment #103)
> Updated  boslcp3 with latest PNOR:0420 & restarted tests on guests with
> kernel '4.15.0-18-generic'.
>
> $ ./ipmis bmc-boslcp3 fru print 47
> Product Name  : OpenPOWER Firmware
> Product Version   : open-power-SUPERMICRO-P9DSU-V1.11-20180420-imp
> Product Extra : op-build-4d27fab
> Product Extra : skiboot-v5.11-70-g5307c0ec7899-pc34e21f
> Product Extra : hostboot-742640c
> Product Extra : linux-4.15.14-openpower1-p81c2d44
> Product Extra : petitboot-v1.7.1-p8b80147
> Product Extra : machine-xml-32ce616
> Product Extra : occ-4f49f6
>
> root@boslcp3:~# uname -a
> Linux boslcp3 4.15.0-18-generic #19 SMP Fri Apr 20 12:45:38 CDT 2018 ppc64le
> ppc64le ppc64le GNU/Linux
> root@boslcp3:~# uname -r
> 4.15.0-18-generic
>
> Guests kernel:
> 
> root@boslcp3g3:~# uname -a
> Linux boslcp3g3 4.15.0-15-generic #16+bug166877 SMP Wed Apr 18 14:47:30 CDT
> 2018 ppc64le ppc64le ppc64le GNU/Linux
> root@boslcp3g3:~# uname -r
> 4.15.0-15-generic
>
> Regards,
> Indira

Two guests are impacted in the new run today (bug# 167104), and the
3rd one is unreachable but is not dumping any console logs; we are
waiting to see what happens!

The host is running fine so far, but without a guest run I'm not sure
how soon we can verify this.

Please check on bug#167104 to make this recreate/verify fast.
The host continue


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-21 08:45 EDT---
The latest logs show a panic in process_one_work() on CPU 145, some sort of 
NULL pointer fault, followed by 2 CPUs (22, 125) getting a "Bad interrupt in 
KVM entry/exit code, sig: 6" panic (possibly in response to the panic IPI).
Those 2 CPUs time out and the KDUMP kexec starts.

The KDUMP then gets the same process_one_work() panic, this time on CPU
1, followed by Hard LOCKUP detected on CPUs 0 and 1. rcu_sched then
starts detecting the stalled CPU(s), only trying to dump CPU 1.

The problem seems to keep changing. Originally it was a panic on a very
strange address in kmem_cache_alloc_node() from socket code. Later we
see a NULL pointer issue in pool_mayday_timeout() from KVM. Now we are
seeing a panic in process_one_work() from a kworker thread (unknown
workqueue). If these different panics all have the same cause, it would
seem to be something like memory corruption. Not being able to get a
clean dump is going to be a problem.


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From indira.pr...@in.ibm.com 2018-04-21 02:01 EDT---
Updated  boslcp3 with latest PNOR:0420 & restarted tests on guests with kernel 
'4.15.0-18-generic'.

$ ./ipmis bmc-boslcp3 fru print 47
Product Name  : OpenPOWER Firmware
Product Version   : open-power-SUPERMICRO-P9DSU-V1.11-20180420-imp
Product Extra : op-build-4d27fab
Product Extra : skiboot-v5.11-70-g5307c0ec7899-pc34e21f
Product Extra : hostboot-742640c
Product Extra : linux-4.15.14-openpower1-p81c2d44
Product Extra : petitboot-v1.7.1-p8b80147
Product Extra : machine-xml-32ce616
Product Extra : occ-4f49f6

root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-18-generic #19 SMP Fri Apr 20 12:45:38 CDT 2018 ppc64le 
ppc64le ppc64le GNU/Linux
root@boslcp3:~# uname -r
4.15.0-18-generic

Guests kernel:

root@boslcp3g3:~# uname -a
Linux boslcp3g3 4.15.0-15-generic #16+bug166877 SMP Wed Apr 18 14:47:30 CDT 
2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3g3:~# uname -r
4.15.0-15-generic

Regards,
Indira


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-21 Thread bugproxy
--- Comment From prad...@us.ibm.com 2018-04-21 01:53 EDT---
Looks like an Oops similar to the previous one in comment#39 starting a 
sequence of events

root@boslcp3:~# [ 2837.030181] Unable to handle kernel paging request for data 
at address 0x0008
[ 2837.030253] Faulting instruction address: 0xc01336fc
[ 2837.030295] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2837.030328] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 2837.030364] Modules linked in: vhost_net vhost macvtap macvlan tap 
xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat 
nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack 
libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter 
ebtables devlink ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 
nfsv4 nfs fscache kvm_hv binfmt_misc kvm dm_service_time dm_multipath 
scsi_dh_rdac scsi_dh_emc scsi_dh_alua joydev input_leds idt_89hpesx mac_hid 
vmx_crypto crct10dif_vpmsum at24 ofpart cmdlinepart uio_pdrv_genirq uio 
powernv_flash mtd ibmpowernv ipmi_powernv ipmi_devintf ipmi_msghandler opal_prd 
nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel ip_tables x_tables 
autofs4 btrfs xor zstd_compress raid6_pq ses enclosure hid_generic
[ 2837.030909]  usbhid hid qla2xxx ast i2c_algo_bit ttm ixgbe drm_kms_helper 
mpt3sas nvme_fc syscopyarea sysfillrect nvme_fabrics sysimgblt fb_sys_fops 
nvme_core raid_class crc32c_vpmsum drm i40e scsi_transport_sas 
scsi_transport_fc mdio aacraid
[ 2837.031053] CPU: 145 PID: 1182 Comm: kworker/145:1 Not tainted 
4.15.0-18-generic #19
[ 2837.031107] NIP:  c01336fc LR: c0133cf8 CTR: c0cfefa0
[ 2837.031156] REGS: c000200e44c77a10 TRAP: 0300   Not tainted  
(4.15.0-18-generic)
[ 2837.031204] MSR:  90009033   CR: 28000822  
XER: 
[ 2837.031257] CFAR: c0133cf4 DAR: 0008 DSISR: 4000 
SOFTE: 0
[ 2837.031257] GPR00: c0133cf8 c000200e44c77c90 c16eae00 
c000200e44bda5c0
[ 2837.031257] GPR04: c00fdf6f7da0 c000200e618f7da0 c000200e618fa305 
c00fdf6f7cc8
[ 2837.031257] GPR08: c000200e6190c960 2440  
c0080f04e0f8
[ 2837.031257] GPR12:  c7a83b00 c013c788 
c000200e50ebf3c0
[ 2837.031257] GPR16:    

[ 2837.031257] GPR20: c000200e618f7d80   
fef7
[ 2837.031257] GPR24: 0402  c000200e618f8100 
c1713b00
[ 2837.031257] GPR28: c000200e618f7da0  c000200e618f7d80 
c000200e44bda5c0
[ 2837.031687] NIP [c01336fc] process_one_work+0x3c/0x5a0
[ 2837.031727] LR [c0133cf8] worker_thread+0x98/0x630
[ 2837.031760] Call Trace:
[ 2837.031778] [c000200e44c77c90] [c0133974] 
process_one_work+0x2b4/0x5a0 (unreliable)
[ 2837.031828] [c000200e44c77d20] [c0133cf8] worker_thread+0x98/0x630
[ 2837.031885] [c000200e44c77dc0] [c013c928] kthread+0x1a8/0x1b0
[ 2837.031928] [c000200e44c77e30] [c000b528] 
ret_from_kernel_thread+0x5c/0xb4
[ 2837.031976] Instruction dump:
[ 2837.032001] 6000 7d908026 fba1ffe8 fbc1fff0 91810008 f821ff71 e924 
712a0004
[ 2837.032052] 793d05e4 40820008 3ba0 ebc30048  815e0010 81290100 
714a0004
[ 2837.032104] ---[ end trace ae121b1a8fbe89f8 ]---

A cascading series of events follows, ending up in hard lockups. However,
that likely happens when IPIs fail, and these are secondary events.


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-20 Thread bugproxy
--- Comment From dnban...@us.ibm.com 2018-04-20 16:33 EDT---
Klaus, I am not aware of the particular tests being run.

But I pinged Chanh so that he can start a new round of tests.

However ... I do see that boslcp3 now has reverted to the prior kernel:
Linux boslcp3 4.13.0-25-generic #29-Ubuntu SMP Mon Jan 8 21:15:55 UTC 2018 
ppc64le ppc64le ppc64le GNU/Linux

I am not sure if there were other plans, but I let Chanh know about the
existence of the new patch which we would like to be tested. And he
kindly agreed to start the tests after installing the patch in #94.

p.s. the old kernel (according to Chanh) is due to one of the disks
getting corrupted.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1762844

Title:
  ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3: Host crashed & enters into
  xmon after moving to 4.15.0-15.16 kernel

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1762844/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-20 Thread bugproxy
--- Comment From kla...@br.ibm.com 2018-04-20 15:39 EDT---
Padma is reporting that boslcp3 is available.

Dwip, I think Indira won't be available at this time of the day. Can you
jump in and try to reproduce with the debug kernel in comment #94 above?

Thanks


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-20 Thread bugproxy
--- Comment From bjki...@us.ibm.com 2018-04-20 15:11 EDT---
Below is a test kernel with the four QLogic commits that were added to the 
4.15.0-15.16 kernel reverted, plus the patch from 166877. Please run this and 
update the bug if the crash is still seen.

https://ibm.ent.box.com/s/n29uregixfwrywyle4ursgovmrbjcxtd

--- Comment From bjki...@us.ibm.com 2018-04-20 15:11 EDT---
These are the four qlogic driver commits that were added between 4.15.0-13 and 
4.15.0-15.16:

79c67fb6fa21774c67bba59619eaa908c18de759 scsi: qla2xxx: Fix crashes in 
qla2x00_probe_one on probe failure
21af711d6011c857f11717d20b57516c334d5dd0 scsi: qla2xxx: Fix logo flag for 
qlt_free_session_done()
60b5e40ad28c93a2752fff0988660fa28fe7905d scsi: qla2xxx: Fix NULL pointer access 
for fcport structure
e4caf5c1b7d847400f2cb6525e7cf83167863241 scsi: qla2xxx: Fix smatch warning in 
qla25xx_delete_{rsp|req}_que


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-19 Thread bugproxy
--- Comment From rajanikanth...@in.ibm.com 2018-04-19 03:51 EDT---
Copied the  dump to our kte server

kte111.isst.aus.stglabs.ibm.com 9.3.111.155 [kte/don2rry]

kte111:/LOGS/boslcp3/BZ166588/

h# ls -l /LOGS/boslcp3/BZ166588/
total 4
drwxr-xr-x 2 root root 4096 Apr 19 02:42 201804181042

Thanks.


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-18 Thread bugproxy
--- Comment From kla...@br.ibm.com 2018-04-18 12:27 EDT---
Nick made some interesting comments about lockups in LTC bug 166684, comment 
#24 about the hard lockup watchdog being added in Kernel 4.13. Also other 
comments about RCU stall warnings being too aggressive, but at least in this 
last log RCU doesn't complain until after the first few traces/lockups...


[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-16 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-16 17:21 EDT---
We're waiting for a reproduction and a kdump. We also need more logs,
including firmware logs, eSELs, etc.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-15 Thread bugproxy
--- Comment From indira.pr...@in.ibm.com 2018-04-16 01:24 EDT---
(In reply to comment #54)
> Please collect the dmesg log and a crashdump.

Collected dl logs from the xmon prompt, but was unable to take a crashdump
from the xmon prompt; we have bug#10 opened.

Regards,
Indira

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-15 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-15 16:36 EDT---
Please collect the dmesg log and a crashdump.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-13 Thread bugproxy
--- Comment From dougm...@us.ibm.com 2018-04-13 14:51 EDT---
I believe that the "1" in c000200e5848b701 is a flag. The address actually used
will be c000200e5848b700. The flags PAGE_MAPPING_ANON and/or
PAGE_MAPPING_MOVABLE are stored in the low bits of the page->mapping pointer,
and are stripped off before dereferencing. If that R30 value is something like
"anon_mapping = (unsigned long)READ_ONCE(page->mapping)" then it will contain
those flags. Not sure if that applies to your situation or not.
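
To illustrate the masking described above, here is a minimal Python sketch.
The constant values mirror PAGE_MAPPING_ANON (0x1) and PAGE_MAPPING_MOVABLE
(0x2) from the kernel's include/linux/page-flags.h; the helper name
split_mapping is hypothetical, and whether R30 really holds a page->mapping
value is still an assumption.

```python
# Low-bit flags the kernel keeps in page->mapping (include/linux/page-flags.h).
PAGE_MAPPING_ANON = 0x1
PAGE_MAPPING_MOVABLE = 0x2
PAGE_MAPPING_FLAGS = PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE


def split_mapping(raw):
    """Split a raw page->mapping value into (pointer, flags).

    Hypothetical helper for illustration only; it just masks the two
    low bits off, which is what happens before the pointer is used.
    """
    flags = raw & PAGE_MAPPING_FLAGS
    pointer = raw & ~PAGE_MAPPING_FLAGS
    return pointer, flags


# The value seen in R30, assuming it came from page->mapping.
pointer, flags = split_mapping(0xC000200E5848B701)
print(hex(pointer))                      # address actually dereferenced
print(bool(flags & PAGE_MAPPING_ANON))   # ANON bit set?
print(bool(flags & PAGE_MAPPING_MOVABLE))
```

If the register value does come from page->mapping, the odd low bit on its
own is not evidence of corruption; only the masked pointer needs to be valid.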

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-13 Thread bugproxy
--- Comment From bjki...@us.ibm.com 2018-04-13 14:24 EDT---
Dwip - excellent suggestion; I agree with your proposed next steps. If this
is a double free, we need to catch it earlier than where we are crashing.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-12 Thread bugproxy
--- Comment From indira.pr...@in.ibm.com 2018-04-12 11:10 EDT---
Hi,

Today I tried rebooting the boslcp3 system and the crash was recreated.

On the first attempt, the host booted with the latest kernel after the
reboot. I then ran the commands to disable stop4 and stop5, and the host
immediately crashed and entered xmon with a stack trace similar to the
one reported in the bug (recreation steps). I tried to take a dump from
the xmon prompt using the 'X' option; it did not work and came back to
the shell prompt.

On the second reboot attempt, the host booted with the latest kernel. I
issued the kdump-config status command and the host then crashed with
the same stack trace as reported in the recreation steps. I again tried
to take a dump from the xmon prompt using 'X', which did not work; it
came back to the shell prompt.

I have attached the host console logs for both reboot attempts.

Regards,
Indira

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-11 Thread bugproxy
--- Comment From chetj...@in.ibm.com 2018-04-12 01:24 EDT---
(In reply to comment #18)
> Can you see if the bug happens with any of these mainline kernels?  We can
> perform a kernel bisect if we can narrow down to the last good kernel
> version and first bad one:
>
> v4.14 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14/
> v4.15-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc1/
> v4.15-rc4: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc4/
> v4.15 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/
>
> You don't have to test every kernel, just up until the kernel that first has
> this bug.
>
> Thanks in advance!

We need to make progress in testing other firmware and guest issues. We
will come back to this later.

Meanwhile, the problem happened again today on reboot. We tried to
collect the vmcore using 'X', but it was not collected. Indira, please
add those details.
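
For reference, the narrowing-down that the quoted comment asks for can be
sketched as a binary search over the listed mainline kernels. This is only an
illustration: first_bad and the is_bad callback are hypothetical names, and
the "pretend" result below is made up to exercise the code, not a test result
from boslcp3.

```python
from typing import Callable, Optional, Sequence


def first_bad(kernels: Sequence[str],
              is_bad: Callable[[str], bool]) -> Optional[str]:
    """Return the earliest kernel for which is_bad() is true, or None.

    Assumes kernels are ordered oldest to newest, and that once the bug
    appears it stays present in every later kernel.
    """
    lo, hi = 0, len(kernels)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(kernels[mid]):
            hi = mid          # bug present: first bad is here or earlier
        else:
            lo = mid + 1      # bug absent: first bad is later
    return kernels[lo] if lo < len(kernels) else None


# The mainline builds listed in the quoted comment, oldest first.
KERNELS = ["v4.14", "v4.15-rc1", "v4.15-rc4", "v4.15"]

# Made-up outcome for illustration: pretend the crash starts at v4.15-rc4.
pretend_bad = {"v4.15-rc4", "v4.15"}
print(first_bad(KERNELS, lambda k: k in pretend_bad))
```

Since the crash here does not show up on every reboot, is_bad would need to
cover several reboots per kernel before a version is called good.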

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-11 Thread bugproxy
--- Comment From chetj...@in.ibm.com 2018-04-11 03:14 EDT---
(In reply to comment #16)
> Can you test again on a third system?
> Can this be a hw problem on the first system?

No. This cannot be a hardware issue, since we have been running fine on
the same system for the last 4 months with multiple kernel updates.

Also, the system came back up again automatically on the 3rd and 4th
reboots, so the underlying problem still remains.

[Bug 1762844] Comment bridged from LTC Bugzilla

2018-04-10 Thread bugproxy
--- Comment From cha...@us.ibm.com 2018-04-10 21:32 EDT---
According to the test team, they have another BostonLC (boslcp4); they
updated it to this kernel and the system is booting up normally.
root@boslcp4:~# uname -a
Linux boslcp4 4.15.0-15-generic #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp4:~# date
Tue Apr 10 16:37:37 CDT 2018
root@boslcp4:~# uptime
16:37:38 up 40 min,  2 users,  load average: 0.00, 0.03, 0.12

Additionally, I rebooted boslcp3 a third time to add the slub_debug=FZ
kernel option. The system booted to the login prompt and I logged in
successfully. I rebooted a fourth time and it succeeded again.

root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-15-generic #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~# cat /proc/cmdline
root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro xmon=on splash quiet slub_debug=FZ

Strange.
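
For anyone checking other hosts, here is a small Python sketch that decodes
the slub_debug letters from a kernel command line. The letter meanings follow
the kernel's SLUB documentation (F = sanity checks, Z = red zoning, and so
on); parse_cmdline and describe_slub_debug are hypothetical helper names, and
the sample string is the /proc/cmdline output shown above.

```python
# Meanings of the slub_debug option letters, per the kernel SLUB docs.
SLUB_DEBUG_FLAGS = {
    "F": "sanity checks",
    "Z": "red zoning",
    "P": "poisoning",
    "U": "user tracking",
    "T": "trace",
}


def parse_cmdline(cmdline):
    """Turn a kernel command line into a dict; bare words map to True."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")  # split at the first '=' only
        params[key] = value if sep else True
    return params


def describe_slub_debug(params):
    """List the debug features selected by slub_debug=<letters>[,<slabs>]."""
    value = params.get("slub_debug")
    if not isinstance(value, str):
        return []
    letters = value.split(",")[0]  # drop an optional slab-name filter
    return [SLUB_DEBUG_FLAGS.get(ch, "unknown flag " + ch) for ch in letters]


# Command line copied from the boslcp3 output above.
CMDLINE = ("root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro xmon=on "
           "splash quiet slub_debug=FZ")

params = parse_cmdline(CMDLINE)
print(params["xmon"])
print(describe_slub_debug(params))
```

On a live system the input would come from reading /proc/cmdline instead of
the hard-coded string.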
