RE: crash in iscsi/scsi initiator with linux-4.15.0-rc1

2017-12-20 Thread Steve Wise
> > Hey Ewan, Yan, Bart,
> >
> > I'm still seeing this issue with 4.15-rc4.  Is the issue still outstanding?
> >
> > Steve.
> >
> 
> Please apply the following commit from the 4.15/scsi-fixes branch of
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git
> 
> and advise if it does not fix your issue.  It should.

This seems to resolve my issue.  Thanks!

If you want, you can add a Tested-by: from me. 

Steve.



RE: crash in iscsi/scsi initiator with linux-4.15.0-rc1

2017-12-19 Thread Ewan D. Milne
On Tue, 2017-12-19 at 13:31 -0600, Steve Wise wrote:
> > > Hey,
> > >
> > > I'm seeing this null pointer dereference with linux-4.15.0-rc1.  To reproduce
> > > it, I connect two ram disks via iscsi/TCP, and start an fio:
> > >
> > > iscsiadm -m discovery --op update --type sendtargets -p 172.16.1.10:3260
> > > iscsiadm -m node -p 172.16.1.10:3260 -l
> > > ISCSI_DISKS=/dev/sdd:/dev/sde; fio --rw=randrw --name=random --norandommap
> > > --ioengine=libaio --size=400m --group_reporting --exitall --fsync_on_close=1
> > > --invalidate=1 --direct=1 --filename=$ISCSI_DISKS --time_based --runtime=300
> > > --iodepth=128 --numjobs=8 --unit_base=1 --bs=64k --kb_base=1000
> > >
> > > Then on the initiator node, while the fio test is running, I detach the 
> > > devices:
> > >
> > > iscsiadm -m node -p 172.16.1.10:3260 -I iser -u
> > >
> > > Then I hit this crash.  Has anyone else encountered this issue?  Wondering if
> > > there is a fix handy. :)
> > >
> > 
> > This is the same problem that is being discussed under the thread:
> > "[PATCH] scsi: fix race condition when removing target".
> > 
> > We had good test results with both Jason Yan's patch and Bart's patch
> > applied; however, the ultimate solution is still in progress (see James'
> > comments).
> > 
> > You could also try reverting fbce4d97fd "scsi: fixup kernel warning
> > during rmmod()" if you just need to get past this.
> > 
> > -Ewan
> > 
> 
> Hey Ewan, Yan, Bart, 
> 
> I'm still seeing this issue with 4.15-rc4.  Is the issue still outstanding?  
> 
> Steve.
> 

Please apply the following commit from the 4.15/scsi-fixes branch of

git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git

and advise if it does not fix your issue.  It should.



commit 81b6c999897919d5a16fedc018fe375dbab091c5
Author: Hannes Reinecke 
Date:   Wed Dec 13 14:21:37 2017 +0100

scsi: core: check for device state in __scsi_remove_target()

As it turned out device_get() doesn't use kref_get_unless_zero(), so we
will be always getting a device pointer.  Consequently, we need to check
for the device state in __scsi_remove_target() to avoid tripping over
deleted objects.

Fixes: fbce4d97fd43 ("scsi: fixup kernel warning during rmmod()")
Reported-by: Jason Yan 
Signed-off-by: Hannes Reinecke 
Reviewed-by: Bart Van Assche 
Reviewed-by: Ewan D. Milne 
Signed-off-by: Martin K. Petersen 
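
The reasoning in the commit message above can be illustrated with a small
userspace sketch. This is not the kernel code and not the actual patch: the
struct, the state names, and every sim_* function are hypothetical stand-ins.
It only shows why an unconditional "get" always hands back a pointer, even for
an object whose count already reached zero, and why the removal path therefore
has to look at the object state itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real SCSI structures. */
enum sim_state { SIM_RUNNING, SIM_DEL };

struct sim_dev {
    int refcount;          /* models the reference count of the object */
    enum sim_state state;  /* models the object's lifecycle state */
};

/* Unconditional get: succeeds even if the count already dropped to zero. */
static struct sim_dev *sim_get(struct sim_dev *d)
{
    d->refcount++;
    return d;
}

/* Conditional get: refuses to revive an object whose count is zero. */
static struct sim_dev *sim_get_unless_zero(struct sim_dev *d)
{
    if (d->refcount == 0)
        return NULL;
    d->refcount++;
    return d;
}

/* Drop a reference; the last put marks the object as deleted. */
static void sim_put(struct sim_dev *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0)
        d->state = SIM_DEL;
}

/*
 * Removal-path sketch: since sim_get() never fails, the caller must
 * check the state explicitly and skip objects that are already deleted.
 */
static bool sim_may_remove(struct sim_dev *d)
{
    if (d->state == SIM_DEL)
        return false;
    sim_get(d);
    /* ... removal work would happen here ... */
    sim_put(d);
    return true;
}
```

In this sketch, sim_get() on a deleted object still returns a valid-looking
pointer, whereas sim_get_unless_zero() returns NULL; the explicit state check
in sim_may_remove() is what keeps the removal path away from deleted objects.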

> ---
> 
> [ 1002.205103] BUG: unable to handle kernel NULL pointer dereference at   
> (null)
> [ 1002.213022] IP: _raw_spin_lock_irqsave+0x1e/0x40
> [ 1002.217740] PGD 0 P4D 0
> [ 1002.220382] Oops: 0002 [#1] SMP
> [ 1002.223637] Modules linked in: iw_cxgb4 cxgb4 nvme_rdma nvme_fabrics 
> rdma_ktest(O) rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi 
> scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp 
> ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib 
> ib_core libcxgb vfat intel_rapl fat iosf_mbi x86_pkg_temp_thermal 
> intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd iTCO_wdt 
> iTCO_vendor_support mxm_wmi mei_me ipmi_si lpc_ich mei pcspkr i2c_i801 
> mfd_core ipmi_devintf shpchp sg ipmi_msghandler wmi nfsd auth_rpcgss nfs_acl 
> lockd grace sunrpc ip_tables ext4 mbcache jbd2 mlx4_en mgag200 drm_kms_helper 
> syscopyarea sysfillrect sysimgblt fb_sys_fops ttm sd_mod igb drm ahci libahci 
> dca mlx4_core
> [ 1002.295663]  ptp libata pps_core crc32c_intel nvme i2c_algo_bit i2c_core 
> nvme_core [last unloaded: cxgb4]
> [ 1002.305563] CPU: 4 PID: 5156 Comm: fio Tainted: G   O 
> 4.15.0-rc4 #3
> [ 1002.313223] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.2a 07/09/2015
> [ 1002.320555] RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
> [ 1002.326077] RSP: 0018:c900070cbd10 EFLAGS: 00010046
> [ 1002.331692] RAX:  RBX: 0246 RCX: 
> 
> [ 1002.339225] RDX: 0001 RSI: 88085fd0e038 RDI: 
> 
> [ 1002.346763] RBP: 880855a65f18 R08:  R09: 
> 0744
> [ 1002.354315] R10: 03ff R11: 0001 R12: 
> 88084992e180
> [ 1002.361873] R13: 880855a67000 R14: 880855a65800 R15: 
> 880856d7d5a8
> [ 1002.369447] FS:  () GS:88085fd0() 
> knlGS:
> [ 1002.377995] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1002.384209] CR2:  CR3: 01c09005 CR4: 
> 000606e0
> [ 1002.391826] Call Trace:
> [ 1002.394774]  scsi_device_dev_release_usercontext+0x40/0x230
> [ 1002.400858]  execute_in_process_context+0x58/0x60
> [ 1002.406085]  device_release+0x2d/0x80
> [ 1002.410277]  kobject_cleanup+0x5e/0x180
> [ 1002.414659]  scsi_disk_put+0x2b/0x40 

RE: crash in iscsi/scsi initiator with linux-4.15.0-rc1

2017-12-19 Thread Steve Wise
> > Hey,
> >
> > I'm seeing this null pointer dereference with linux-4.15.0-rc1.  To reproduce
> > it, I connect two ram disks via iscsi/TCP, and start an fio:
> >
> > iscsiadm -m discovery --op update --type sendtargets -p 172.16.1.10:3260
> > iscsiadm -m node -p 172.16.1.10:3260 -l
> > ISCSI_DISKS=/dev/sdd:/dev/sde; fio --rw=randrw --name=random --norandommap
> > --ioengine=libaio --size=400m --group_reporting --exitall --fsync_on_close=1
> > --invalidate=1 --direct=1 --filename=$ISCSI_DISKS --time_based --runtime=300
> > --iodepth=128 --numjobs=8 --unit_base=1 --bs=64k --kb_base=1000
> >
> > Then on the initiator node, while the fio test is running, I detach the 
> > devices:
> >
> > iscsiadm -m node -p 172.16.1.10:3260 -I iser -u
> >
> > Then I hit this crash.  Has anyone else encountered this issue?  Wondering if
> > there is a fix handy. :)
> >
> 
> This is the same problem that is being discussed under the thread:
> "[PATCH] scsi: fix race condition when removing target".
> 
> We had good test results with both Jason Yan's patch and Bart's patch
> applied; however, the ultimate solution is still in progress (see James'
> comments).
> 
> You could also try reverting fbce4d97fd "scsi: fixup kernel warning
> during rmmod()" if you just need to get past this.
> 
> -Ewan
> 

Hey Ewan, Yan, Bart, 

I'm still seeing this issue with 4.15-rc4.  Is the issue still outstanding?  

Steve.

---

[ 1002.205103] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 1002.213022] IP: _raw_spin_lock_irqsave+0x1e/0x40
[ 1002.217740] PGD 0 P4D 0
[ 1002.220382] Oops: 0002 [#1] SMP
[ 1002.223637] Modules linked in: iw_cxgb4 cxgb4 nvme_rdma nvme_fabrics 
rdma_ktest(O) rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi 
scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib 
rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx4_ib ib_core libcxgb 
vfat intel_rapl fat iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm 
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel 
crypto_simd glue_helper cryptd iTCO_wdt iTCO_vendor_support mxm_wmi mei_me 
ipmi_si lpc_ich mei pcspkr i2c_i801 mfd_core ipmi_devintf shpchp sg 
ipmi_msghandler wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 
mbcache jbd2 mlx4_en mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt 
fb_sys_fops ttm sd_mod igb drm ahci libahci dca mlx4_core
[ 1002.295663]  ptp libata pps_core crc32c_intel nvme i2c_algo_bit i2c_core 
nvme_core [last unloaded: cxgb4]
[ 1002.305563] CPU: 4 PID: 5156 Comm: fio Tainted: G   O 4.15.0-rc4 #3
[ 1002.313223] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.2a 07/09/2015
[ 1002.320555] RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
[ 1002.326077] RSP: 0018:c900070cbd10 EFLAGS: 00010046
[ 1002.331692] RAX:  RBX: 0246 RCX: 
[ 1002.339225] RDX: 0001 RSI: 88085fd0e038 RDI: 
[ 1002.346763] RBP: 880855a65f18 R08:  R09: 0744
[ 1002.354315] R10: 03ff R11: 0001 R12: 88084992e180
[ 1002.361873] R13: 880855a67000 R14: 880855a65800 R15: 880856d7d5a8
[ 1002.369447] FS:  () GS:88085fd0() 
knlGS:
[ 1002.377995] CS:  0010 DS:  ES:  CR0: 80050033
[ 1002.384209] CR2:  CR3: 01c09005 CR4: 000606e0
[ 1002.391826] Call Trace:
[ 1002.394774]  scsi_device_dev_release_usercontext+0x40/0x230
[ 1002.400858]  execute_in_process_context+0x58/0x60
[ 1002.406085]  device_release+0x2d/0x80
[ 1002.410277]  kobject_cleanup+0x5e/0x180
[ 1002.414659]  scsi_disk_put+0x2b/0x40 [sd_mod]
[ 1002.419559]  __blkdev_put+0x1b5/0x1d0
[ 1002.423777]  ? disk_flush_events+0x24/0x60
[ 1002.428430]  blkdev_close+0x21/0x30
[ 1002.432484]  __fput+0xd5/0x210
[ 1002.436111]  task_work_run+0x82/0xa0
[ 1002.440262]  do_exit+0x2be/0xb20
[ 1002.444074]  ? syscall_trace_enter+0x1af/0x290
[ 1002.449110]  do_group_exit+0x39/0xa0
[ 1002.453287]  SyS_exit_group+0x10/0x10
[ 1002.457557]  do_syscall_64+0x61/0x1a0
[ 1002.461829]  entry_SYSCALL64_slow_path+0x25/0x25
[ 1002.467064] RIP: 0033:0x7f9abb1c8529
[ 1002.471266] RSP: 002b:7ffe53be40d8 EFLAGS: 0206 ORIG_RAX: 
00e7
[ 1002.479482] RAX: ffda RBX: 0010 RCX: 7f9abb1c8529
[ 1002.487279] RDX: 0005 RSI: 000a RDI: 0005
[ 1002.495079] RBP: 7f9a9c9de818 R08: 003c R09: 00e7
[ 1002.502882] R10: ff60 R11: 0206 R12: 0006
[ 1002.510690] R13: 0006 R14:  R15: 0172a440
[ 1002.518497] Code: f4 66 90 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 53 
9c 58 66 66 90 66 90 48 89 c3 fa 66 66 90 66 66 90 31 c0 ba 01 00 00 00  0f 
b1 17 85 c0 75 05 48 89 d8 5b c3 89 c6 e8 77 06 9e ff eb
[ 1002.538742] RIP: 

RE: crash in iscsi/scsi initiator with linux-4.15.0-rc1

2017-12-01 Thread Steve Wise
> > Then I hit this crash.  Has anyone else encountered this issue?  Wondering if
> > there is a fix handy. :)
> >
> 
> This is the same problem that is being discussed under the thread:
> "[PATCH] scsi: fix race condition when removing target".
> 
> We had good test results with both Jason Yan's patch and Bart's patch
> applied; however, the ultimate solution is still in progress (see James'
> comments).
> 
> You could also try reverting fbce4d97fd "scsi: fixup kernel warning
> during rmmod()" if you just need to get past this.
> 
> -Ewan


Thanks Ewan, I'll back out that commit just to verify I'm seeing the same issue.
I'm also happy to test any final fix.

Steve.



Re: crash in iscsi/scsi initiator with linux-4.15.0-rc1

2017-12-01 Thread Ewan D. Milne
On Fri, 2017-12-01 at 11:00 -0600, Steve Wise wrote:
> Hey,
> 
> I'm  seeing this null pointer dereference with linux-4.15.0-rc1.  To reproduce
> it, I connect two ram disks via iscsi/TCP, and start an fio:
> 
> iscsiadm -m discovery --op update --type sendtargets -p 172.16.1.10:3260
> iscsiadm -m node -p 172.16.1.10:3260 -l
> ISCSI_DISKS=/dev/sdd:/dev/sde; fio --rw=randrw --name=random --norandommap
> --ioengine=libaio --size=400m --group_reporting --exitall --fsync_on_close=1
> --invalidate=1 --direct=1 --filename=$ISCSI_DISKS --time_based --runtime=300
> --iodepth=128 --numjobs=8 --unit_base=1 --bs=64k --kb_base=1000
> 
> Then on the initiator node, while the fio test is running, I detach the 
> devices:
> 
> iscsiadm -m node -p 172.16.1.10:3260 -I iser -u
> 
> Then I hit this crash.  Has anyone else encountered this issue?  Wondering if
> there is a fix handy. :)
> 

This is the same problem that is being discussed under the thread:
"[PATCH] scsi: fix race condition when removing target".

We had good test results with both Jason Yan's patch and Bart's patch
applied; however, the ultimate solution is still in progress (see James'
comments).

You could also try reverting fbce4d97fd "scsi: fixup kernel warning
during rmmod()" if you just need to get past this.

-Ewan

> Thanks,
> 
> Steve.
> 
> 
> 
> [  127.175953] scsi 8:0:0:0: alua: Detached
> [  127.175955] scsi 8:0:0:0: alua: Detached
> [  127.175981] ------------[ cut here ]------------
> [  127.175984] list_del corruption. prev->next should be 8803382f1240, but
> was 88039ab0f780
> [  127.176010] WARNING: CPU: 5 PID: 373 at lib/list_debug.c:53
> __list_del_entry_valid+0x7c/0xa0
> [  127.176011] Modules linked in: iscsi_tcp libiscsi_tcp rpcrdma ib_isert
> iscsi_target_mod libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp
> scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm
> iw_cm libcxgb mlx5_ib ext4 ib_core dm_mirror dm_region_hash dm_log dm_mod
> mbcache jbd2 coretemp kvm iTCO_wdt ppdev irqbypass iTCO_vendor_support 
> gpio_ich
> i2c_i801 pcspkr lpc_ich parport_pc i5400_edac sg parport i5k_amb shpchp nfsd
> auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod nouveau
> cdrom sd_mod ata_generic pata_acpi video mxm_wmi wmi drm_kms_helper 
> syscopyarea
> sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm igb cxgb4 ahci 
> firewire_ohci
> ata_piix libahci firewire_core dca i2c_algo_bit devlink libata ptp serio_raw
> i2c_core crc_itu_t pps_core [last unloaded: ib_iser]
> [  127.176055] CPU: 5 PID: 373 Comm: kworker/u16:4 Not tainted 4.15.0-rc1+ #6
> [  127.176056] Hardware name: Supermicro X7DWA/X7DWA, BIOS 6.00 12/21/2007
> [  127.176074] Workqueue: scsi_wq_9 __iscsi_unbind_session
> [scsi_transport_iscsi]
> [  127.176075] task: 88039a498000 task.stack: c9000288
> [  127.176076] RIP: 0010:__list_del_entry_valid+0x7c/0xa0
> [  127.176076] RSP: 0018:c90002883d38 EFLAGS: 00010082
> [  127.176077] RAX:  RBX: 8803382f1240 RCX: 
> 
> [  127.176078] RDX: 0001 RSI: 0002 RDI: 
> 0092
> [  127.176079] RBP: 8803982129c0 R08: 0054 R09: 
> 823d60e0
> [  127.176079] R10: 0473 R11:  R12: 
> 880398212800
> [  127.176080] R13: 880396701800 R14: 880396701800 R15: 
> 8801afc31000
> [  127.176081] FS:  () GS:8803bfd4()
> knlGS:
> [  127.176082] CS:  0010 DS:  ES:  CR0: 80050033
> [  127.176083] CR2: 7f6a80028038 CR3: 00039a957000 CR4: 
> 06e0
> [  127.176084] Call Trace:
> [  127.176091]  alua_bus_detach+0x5c/0xc0
> [  127.176095]  scsi_dh_release_device+0x18/0x50
> [  127.176098]  scsi_device_dev_release_usercontext+0x25/0x230
> [  127.176107]  execute_in_process_context+0x58/0x60
> [  127.176110]  device_release+0x2d/0x80
> [  127.176113]  kobject_cleanup+0x5e/0x180
> [  127.176115]  scsi_remove_target+0x16b/0x1b0
> [  127.176119]  __iscsi_unbind_session+0xb3/0x160 [scsi_transport_iscsi]
> [  127.176121]  process_one_work+0x141/0x340
> [  127.176123]  worker_thread+0x47/0x3e0
> [  127.176124]  kthread+0xf5/0x130
> [  127.176126]  ? rescuer_thread+0x380/0x380
> [  127.176127]  ? kthread_associate_blkcg+0x90/0x90
> [  127.176129]  ret_from_fork+0x1f/0x30
> [  127.176130] Code: ff 31 c0 c3 48 89 fe 31 c0 48 c7 c7 60 19 a9 81 e8 3a 33 
> d0
> ff 0f ff 31 c0 c3 48 89 fe 31 c0 48 c7 c7 20 19 a9 81 e8 24 33 d0 ff <0f> ff 
> 31
> c0 c3 48 89 fe 31 c0 48 c7 c7 e8 18 a9 81 e8 0e 33 d0
> [  127.176145] ---[ end trace e7e378e0f32966e0 ]---
> [  127.176148] scsi 9:0:0:0: alua: Detached
> [  127.466362] BUG: unable to handle kernel NULL pointer dereference at
> (null)
> [  127.474355] IP: _raw_spin_lock_irqsave+0x1e/0x40
> [  127.479136] PGD 399e70067 P4D 399e70067 PUD 3966cd067 PMD 0
> [  127.484961] Oops: 0002 [#1] SMP
> [  127.488269] Modules linked in: iscsi_tcp libiscsi_tcp