------- Comment From [email protected] 2020-06-17 06:59 EDT-------
O.K., we have some new insights here.

@[email protected] did some experiments on my behalf with a slightly
modified Ubuntu kernel (based on 5.4.0-29) where I removed commit
3060781f2664 ("s390/qdio: allow to scan all Output SBALs in one go")
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3060781f2664d34af641247aeac62696405a3fde).
We had a suspicion that this might be related to the
queue-stalls/-slowdowns we always saw in the past before the crash in
the WBT code. And to my slight surprise, the queue-stalls/-slowdowns
did disappear, but the WBT crash still persisted. I checked all
available logs and our driver traces from the dump and didn't find any
indication whatsoever that scsi-EH was ever invoked, nor that we went
through adapter recovery at any time after the initial instance boot -
no command timeouts or anything.
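
(For reproducibility, a minimal sketch of how such a test kernel can
be prepared - assuming the commit still reverts cleanly on top of the
Ubuntu 5.4.0-29 source tree; the exact procedure used may have
differed:

  # Hypothetical reconstruction: revert the qdio commit on top of the
  # Ubuntu kernel source before building as usual.
  git revert 3060781f2664
)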

So in my mind, while I can't prove yet that 3060781f2664 was really
responsible for the queue-stalls/-slowdowns - that might still just be
coincidence (although it *did* happen quite persistently before, and
now not once... so that is rather suspicious to me) - it shows that the
crash in the WBT code is independent. So that seems to be something
that can happen without any transport interruptions.

Here is the backtrace from that particular run, where no
queue-stalls/-slowdowns were seen, but WBT still crashed:

[22808.815235] Unable to handle kernel pointer dereference in virtual kernel address space
[22808.815247] Failing address: 00007fe010a50000 TEID: 00007fe010a50403
[22808.815249] Fault in home space mode while using kernel ASCE.
[22808.815252] AS:000003dc9b67c00b R2:000003fd0000800b R3:000003fd0000c007 S:000003fba9b84800 P:0000000000000400
[22808.815368] Oops: 0011 ilc:2 [#1] SMP
[22808.815376] Modules linked in: xfs vhost_net vhost macvtap macvlan tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter bridge dm_service_time aufs overlay dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua s390_trng chsc_sch eadm_sch vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio 8021q garp mrp stp llc sch_fq_codel drm drm_panel_orientation_quirks i2c_core ip_tables x_tables btrfs zstd_compress zlib_deflate raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 linear dm_mirror dm_region_hash dm_log qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common zfcp scsi_transport_fc dasd_eckd_mod dasd_mod qeth qdio ccwgroup
[22808.815519] CPU: 14 PID: 185372 Comm: CPU 0/KVM Kdump: loaded Not tainted 5.4.0-2901-generic #01
[22808.815521] Hardware name: IBM 8561 T01 708 (LPAR)
[22808.815523] Krnl PSW : 0404e00180000000 000003dc9a6dd9be (try_to_wake_up+0x4e/0x700)
[22808.815535]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[22808.815602] Krnl GPRS: 000003fbd9ab7588 00007fe000000000 00007fe00000000f 0000000000000003
[22808.815605]            0000000000000000 0000000000000039 04007fe001ef7a88 0000000000000003
[22808.815607]            0000000000000003 00007fe010a50284 0000000000000000 00007fe010a4f930
[22808.815609]            000003f588fd6600 000003dc9af0f070 00007fe001ef7ae0 00007fe001ef7a60
[22808.815620] Krnl Code: 000003dc9a6dd9b2: 41902954            la      %r9,2388(%r2)
                          000003dc9a6dd9b6: 582003ac            l       %r2,940
                         #000003dc9a6dd9ba: a7180000            lhi     %r1,0
                         >000003dc9a6dd9be: ba129000            cs      %r1,%r2,0(%r9)
                          000003dc9a6dd9c2: a77401c9            brc     7,000003dc9a6ddd54
                          000003dc9a6dd9c6: e310b0080004        lg      %r1,8(%r11)
                          000003dc9a6dd9cc: b9800018            ngr     %r1,%r8
                          000003dc9a6dd9d0: a774001f            brc     7,000003dc9a6dda0e
[22808.815637] Call Trace:
[22808.816011] ([<00007fff809861e8>] __key.84156+0x10/0xfffffffffffb7e28 [xfs])
[22808.816022]  [<000003dc9ab596ba>] rq_qos_wake_function+0x8a/0xa0
[22808.816025]  [<000003dc9a6fcbde>] __wake_up_common+0x9e/0x1b0
[22808.816028]  [<000003dc9a6fd0e4>] __wake_up_common_lock+0x94/0xe0
[22808.816029]  [<000003dc9a6fd15a>] __wake_up+0x2a/0x40
[22808.816034]  [<000003dc9ab70640>] wbt_done+0x90/0xe0
[22808.816036]  [<000003dc9ab597be>] __rq_qos_done+0x3e/0x60
[22808.816040]  [<000003dc9ab455b0>] blk_mq_free_request+0xe0/0x140
[22808.816045]  [<000003dc9ace7c60>] dm_softirq_done+0x140/0x230
[22808.816046]  [<000003dc9ab43fbc>] blk_done_softirq+0xbc/0xe0
[22808.816051]  [<000003dc9af06710>] __do_softirq+0x100/0x360
[22808.816054]  [<000003dc9a6ad25e>] irq_exit+0x9e/0xc0
[22808.816057]  [<000003dc9a638b18>] do_IRQ+0x78/0xb0
[22808.816059]  [<000003dc9af05c28>] ext_int_handler+0x128/0x12c
[22808.816060]  [<000003dc9af05306>] sie_exit+0x0/0x46
[22808.816065] ([<000003dc9a67144a>] __vcpu_run+0x27a/0xc30)
[22808.816068]  [<000003dc9a67a9a8>] kvm_arch_vcpu_ioctl_run+0x2d8/0x840
[22808.816072]  [<000003dc9a665242>] kvm_vcpu_ioctl+0x282/0x770
[22808.816077]  [<000003dc9a90df66>] do_vfs_ioctl+0x376/0x690
[22808.816078]  [<000003dc9a90e304>] ksys_ioctl+0x84/0xb0
[22808.816080]  [<000003dc9a90e39a>] __s390x_sys_ioctl+0x2a/0x40
[22808.816082]  [<000003dc9af055f2>] system_call+0x2a6/0x2c8
[22808.816084] Last Breaking-Event-Address:
[22808.816087]  [<000003dc9a6de07e>] wake_up_process+0xe/0x20
[22808.816159] Kernel panic - not syncing: Fatal exception in interrupt
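
To put the trace in some context: rq_qos_wake_function() and
wake_up_process() (the Last Breaking-Event-Address) are the WBT
wait-queue wake-up path in block/blk-rq-qos.c. Below is a paraphrased
sketch of that code as of v5.4 - reconstructed from memory, so details
may differ from the actual sources. The relevant property is that the
rq_qos_wait_data the waker operates on is embedded in the *waiting*
task's stack frame inside rq_qos_wait(). One plausible reading of the
fault is therefore a race where the wake-up runs after the waiter has
already left rq_qos_wait(), so wake_up_process() chases a task pointer
read from reused stack memory - which would fit the userspace-looking
failing address above.

  /* Paraphrased sketch of v5.4 block/blk-rq-qos.c, not verbatim. */
  struct rq_qos_wait_data {
          struct wait_queue_entry wq;
          struct task_struct *task;   /* the task sleeping in rq_qos_wait() */
          struct rq_wait *rqw;
          acquire_inflight_cb_t *cb;
          void *private_data;
          bool got_token;
  };

  static int rq_qos_wake_function(struct wait_queue_entry *curr,
                                  unsigned int mode, int wake_flags,
                                  void *key)
  {
          /* 'data' is embedded in the stack frame of the task that
           * queued itself in rq_qos_wait() */
          struct rq_qos_wait_data *data =
                  container_of(curr, struct rq_qos_wait_data, wq);

          if (!data->cb(data->rqw, data->private_data))
                  return -1;

          data->got_token = true;
          smp_wmb();
          list_del_init(&curr->entry);
          /* If the waiter has already returned from rq_qos_wait(),
           * 'data' is stale stack memory and 'data->task' is garbage;
           * try_to_wake_up() would then fault, as in the backtrace. */
          wake_up_process(data->task);
          return 1;
  }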

For the moment I'll stop chasing the WBT crash - I already provided a
workaround for it with the udev rule I wrote before (although, I might
note, I have no idea what this workaround means for performance... WBT
is primarily a performance feature that is intended to stop I/O
starvation in the face of excessive page-cache writeback from
individual tasks). We will definitely continue working on the
queue-stalls/-slowdowns, but that now seems independent of this
particular bug-report.
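
(For reference, a minimal sketch of what such a rule can look like -
assuming the workaround disables WBT by zeroing its latency target via
sysfs; the file name is hypothetical and the actual rule from my
earlier comment may differ:

  # /etc/udev/rules.d/99-disable-wbt.rules (hypothetical name)
  # Writing 0 to queue/wbt_lat_usec disables writeback throttling
  # for that request queue.
  ACTION=="add|change", SUBSYSTEM=="block", ATTR{queue/wbt_lat_usec}="0"
)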

https://bugs.launchpad.net/bugs/1881109

Title:
  [Ubuntu 20.04] LPAR crashes in block layer under high stress. Might be
  triggered by scsi errors.

