On Wednesday, September 13, 2017 4:28:23 PM IST Eryu Guan wrote:
> Hi all,
>
> Recently I noticed multiple crashes triggered by xfs/104 on ppc64 hosts
> in my upstream 4.13 kernel testings. The block layer is involved in the
> call trace so I add linux-block to cc list too. I append the full
> console log to the end of this mail.
>
> Now I can reproduce the crash on x86_64 hosts too by running xfs/104
> many times (usually within 100 iterations). A git-bisect run (I ran it
> for 500 iterations before calling it good in bisect run to be safe)
> pointed the first bad commit to commit acdda3aae146 ("xfs: use
> iomap_dio_rw").
>
> I confirmed the bisect result by checking out a new branch with commit
> acdda3aae146 as HEAD, xfs/104 would crash kernel within 100 iterations,
> then reverting HEAD, xfs/104 passed 1500 iterations.

I am able to recreate the issue on my ppc64 guest. I added some printk()
statements and got this:

xfs_fs_fill_super:1670: Filled up sb c006344db800.
iomap_dio_bio_end_io:784: sb = c006344db800; inode->i_sb->s_dio_done_wq =
(null), &dio->aio.work = c006344bb5b0.

In iomap_dio_rw(), I had added the following printk() statement:

	ret = sb_init_dio_done_wq(inode->i_sb);
	if (ret < 0)
		iomap_dio_set_error(dio, ret);
	printk("%s:%d: sb = %p; Created s_dio_done_wq.\n",
		__func__, __LINE__, inode->i_sb);

In the crashing case, I don't see the above message being printed.

Btw, I am unable to recreate this issue on today's linux-next though. Maybe
that is because the race condition is accidentally masked there.

I will continue debugging this and provide an update.

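To make the ordering suggested by the printk() output concrete, here is a
small user-space model. It is only a sketch, not kernel code: the model_*
names are made up for illustration, and it assumes nothing beyond what the
output above shows, namely that the completion path can run for a superblock
whose s_dio_done_wq was never set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures (illustration only). */
struct model_wq {
	unsigned int flags;
};

struct model_sb {
	struct model_wq *s_dio_done_wq;	/* NULL until lazily created */
};

/*
 * Models the lazy setup done by sb_init_dio_done_wq(): install the
 * workqueue only if the superblock does not have one yet.
 */
static int model_init_dio_done_wq(struct model_sb *sb, struct model_wq *wq)
{
	if (!sb->s_dio_done_wq)
		sb->s_dio_done_wq = wq;
	return 0;
}

/*
 * Models the check the completion path would need before queueing work.
 * Returns false in the situation where the real code would pass a NULL
 * workqueue pointer down to __queue_work().
 */
static bool model_end_io_can_queue(const struct model_sb *sb)
{
	return sb->s_dio_done_wq != NULL;
}
```

With a zero-initialised model_sb, model_end_io_can_queue() returns false
until model_init_dio_done_wq() has run for that superblock, which matches
the crashing case where the "Created s_dio_done_wq" message never appears.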
>
> On one of my test vms, the crash happened as
>
> [ 340.419429] BUG: unable to handle kernel NULL pointer dereference at
> 0102
> [ 340.420408] IP: __queue_work+0x32/0x420
>
> and that IP points to
>
> (gdb) l *(__queue_work+0x32)
> 0x9cf32 is in __queue_work (kernel/workqueue.c:1383).
> 1378        WARN_ON_ONCE(!irqs_disabled());
> 1379
> 1380        debug_work_activate(work);
> 1381
> 1382        /* if draining, only works from the same workqueue are allowed */
> 1383        if (unlikely(wq->flags & __WQ_DRAINING) &&
> 1384            WARN_ON_ONCE(!is_chained_work(wq)))
> 1385                return;
> 1386 retry:
> 1387        if (req_cpu == WORK_CPU_UNBOUND)
>
> So looks like "wq" is null. The test vm is a kvm guest running 4.13
> kernel with 4 vcpus and 8G memory.
>
> If more information is needed please let me know.
>
> Thanks,
> Eryu
>
> P.S. console log when crashing
>
> [ 339.746983] run fstests xfs/104 at 2017-09-13 17:38:26
> [ 340.027352] XFS (vda6): Unmounting Filesystem
> [ 340.207107] XFS (vda6): Mounting V5 Filesystem
> [ 340.217553] XFS (vda6): Ending clean mount
> [ 340.419429] BUG: unable to handle kernel NULL pointer dereference at
> 0102
> [ 340.420408] IP: __queue_work+0x32/0x420
> [ 340.420408] PGD 215373067
> [ 340.420408] P4D 215373067
> [ 340.420408] PUD 21210d067
> [ 340.420408] PMD 0
> [ 340.420408]
> [ 340.420408] Oops: [#1] SMP
> [ 340.420408] Modules linked in: xfs ip6t_rpfilter ipt_REJECT nf_reject_ipv4
> nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
> nf_defrag_ipv6 xt_conntrack nf_conntrack libcrc32c ip_set nfnetlink
> ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security
> ip6table_raw iptable_mangle iptable_security iptable_raw ebtable_filter
> ebtables ip6table_filter ip6_tables iptable_filter btrfs xor raid6_pq ppdev
> i2c_piix4 parport_pc i2c_core parport virtio_balloon pcspkr nfsd auth_rpcgss
> nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi
> virtio_net virtio_blk ata_piix libata virtio_pci virtio_ring serio_raw virtio
> floppy
> [ 340.420408] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.13.0 #64
> [ 340.420408] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
> [ 340.420408] task: 8b1d96222500 task.stack: b06bc0cb8000
> [ 340.420408] RIP: 0010:__queue_work+0x32/0x420
> [ 340.420408] RSP: 0018:8b1d9fd83d18 EFLAGS: 00010046
> [ 340.420408] RAX: 0096 RBX: 0002 RCX:
> 8b1d9489e6d8
> [ 340.420408] RDX: 8b1d903c2090 RSI: RDI:
> 2000
> [ 340.420408] RBP: 8b1d9fd83d58 R08: 0400 R09:
> 0009
> [ 340.420408] R10: 8b1d9532b400 R11: R12:
> 2000
> [ 340.420408] R13: R14: 8b1d903c2090 R15:
> 7800
> [ 340.420408] FS: () GS:8b1d9fd8()
> knlGS:
> [ 340.420408] CS: 0010 DS: ES: CR0: 80050033
> [ 340.420408] CR2: 0102 CR3: 0002152ce000 CR4:
> 06e0
> [ 340.420408] Call Trace:
> [ 340.420408]
> [ 340.420408] ? __slab_free+0x8e/0x260
> [