Hello,

In my FC environment, the target machine always hung while the initiator machine was rebooting. I was able to capture the following call trace:

[19236.146988] rport-11:0-0: blocked FC remote port time out: removing target and saving binding
[19236.157185] rport-10:0-0: blocked FC remote port time out: removing target and saving binding
[19236.157288] scsi scan: 37 byte inquiry failed. Consider BLIST_INQUIRY_36 for this device
[19236.157290] scsi scan: 37 byte inquiry failed. Consider BLIST_INQUIRY_36 for this device
[19236.157412] BUG: unable to handle kernel NULL pointer dereference at (null)
[19236.157416] IP: [<ffffffff8141d20f>] scsi_device_put+0xf/0x50
[19236.157423] PGD 0
[19236.157425] Oops: 0000 [#1] SMP
[19236.157427] Modules linked in: iscsi_scst(O) scst_vdisk(O) qla2x00tgt(O) scst(O) sch_htb rpcsec_gss_krb5 nls_iso8859_1 nls_cp437 vfat fat zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) crc32c_intel sg qla2xxx(O) scsi_transport_fc mpt2sas(O) raid_class scsi_transport_sas button acpi_cpufreq mperf processor ixgbe(O) igb(O) ptp pps_core aufs [last unloaded: scst]
[19236.157449] CPU: 0 PID: 28914 Comm: kworker/0:0 Tainted: P O 3.10.92-oe64-ge331686 #15
[19236.157451] Hardware name: Supermicro X8DTS/X8DTS, BIOS 2.1 06/25/2012
[19236.157457] Workqueue: fc_wq_10 fc_starget_delete [scsi_transport_fc]
[19236.157459] task: ffff88030d8741a0 ti: ffff8802ec38e000 task.ti: ffff8802ec38e000
[19236.157461] RIP: 0010:[<ffffffff8141d20f>] [<ffffffff8141d20f>] scsi_device_put+0xf/0x50
[19236.157464] RSP: 0018:ffff8802ec38fdf0  EFLAGS: 00010202
[19236.157466] RAX: 0000000000000000 RBX: ffff88030be48800 RCX: 00000001810000ba
[19236.157467] RDX: 00000001810000bb RSI: ffff88030e4b0860 RDI: ffff88030be48800
[19236.157469] RBP: ffff88032ca8d000 R08: 0000000000000000 R09: ffffea000c392c00
[19236.157470] R10: ffff880332803d00 R11: ffffffff8142992c R12: ffff88032b951860
[19236.157472] R13: ffff88032ca8d010 R14: ffff8802ef3e0c00 R15: ffff88030be48800
[19236.157474] FS: 0000000000000000(0000) GS:ffff880332e00000(0000) knlGS:0000000000000000
[19236.157475] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[19236.157477] CR2: 0000000000000000 CR3: 000000000195e000 CR4: 00000000000007f0
[19236.157478] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19236.157480] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[19236.157481] Stack:
[19236.157482] ffff88032ca8d000 ffff88032ca8d000 ffffffff81429aba 0000000000000286
[19236.157484] ffff8802dd800800 ffff88032b951b08 ffff880332e11680 0000000000000000
[19236.157487] ffffe8ffffa05900 0000000000000001 ffffffff8105ce4d ffffffff8105a4a7
[19236.157489] Call Trace:
[19236.157494]  [<ffffffff81429aba>] ? scsi_remove_target+0x16a/0x250
[19236.157499]  [<ffffffff8105ce4d>] ? process_one_work+0x13d/0x3b0
[19236.157502]  [<ffffffff8105a4a7>] ? pwq_activate_delayed_work+0x27/0x40
[19236.157504]  [<ffffffff8105d7b1>] ? worker_thread+0x121/0x3d0
[19236.157507]  [<ffffffff8105d690>] ? manage_workers.isra.26+0x280/0x280
[19236.157510]  [<ffffffff81062e92>] ? kthread+0xc2/0xd0
[19236.157514]  [<ffffffff81070000>] ? sched_clock_cpu+0x30/0x100
[19236.157517]  [<ffffffff81062dd0>] ? kthread_create_on_node+0x110/0x110
[19236.157521]  [<ffffffff8169db98>] ? ret_from_fork+0x58/0x90
[19236.157524]  [<ffffffff81062dd0>] ? kthread_create_on_node+0x110/0x110
[19236.157525] Code: 7d 58 4c 89 fe e8 92 a2 27 00 48 89 d8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 40 00 55 53 48 89 fb 48 8b 07 48 8b 80 c0 00 00 00 <48> 8b 28 48 85 ed 74 0d 48 89 ef e8 71 c4 c6 ff 48 85 c0 75 14
[19236.157548] RIP  [<ffffffff8141d20f>] scsi_device_put+0xf/0x50
[19236.157551]  RSP <ffff8802ec38fdf0>
[19236.157552] CR2: 0000000000000000
[19236.157555] ---[ end trace 37bfa3906f93d93a ]---
[19236.157578] BUG: unable to handle kernel paging request at ffffffffffffffd8
[19236.157580] IP: [<ffffffff810633c7>] kthread_data+0x7/0x10
[19236.157583] PGD 1961067 PUD 1963067 PMD 0
[19236.157586] Oops: 0000 [#2] SMP
[19236.157587] Modules linked in: iscsi_scst(O) scst_vdisk(O) qla2x00tgt(O) scst(O) sch_htb rpcsec_gss_krb5 nls_iso8859_1 nls_cp437 vfat fat zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) crc32c_intel sg qla2xxx(O) scsi_transport_fc mpt2sas(O) raid_class scsi_transport_sas button acpi_cpufreq mperf processor ixgbe(O) igb(O) ptp pps_core aufs [last unloaded: scst]
[19236.157605] CPU: 0 PID: 28914 Comm: kworker/0:0 Tainted: P D O 3.10.92-oe64-ge331686 #15
[19236.157606] Hardware name: Supermicro X8DTS/X8DTS, BIOS 2.1 06/25/2012
[19236.157617] task: ffff88030d8741a0 ti: ffff8802ec38e000 task.ti: ffff8802ec38e000
[19236.157618] RIP: 0010:[<ffffffff810633c7>] [<ffffffff810633c7>] kthread_data+0x7/0x10
[19236.157621] RSP: 0018:ffff8802ec38fa48  EFLAGS: 00010002
[19236.157623] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[19236.157624] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88030d8741a0
[19236.157626] RBP: ffff88030d8741a0 R08: 0000000000000000 R09: ffff880332803a00
[19236.157627] R10: ffff880332e14a80 R11: ffffea000b862a00 R12: 0000000000000000
[19236.157629] R13: ffff88030d874490 R14: ffff88030d874190 R15: 0000000000000246
[19236.157630] FS: 0000000000000000(0000) GS:ffff880332e00000(0000) knlGS:0000000000000000
[19236.157632] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[19236.157634] CR2: 0000000000000028 CR3: 000000000195e000 CR4: 00000000000007f0
[19236.157635] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[19236.157637] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[19236.157638] Stack:
[19236.157639] ffffffff8105dd48 ffff880332e11e00 ffffffff816963bb ffff8802ec38ffd8
[19236.157641] ffff8802ec38ffd8 ffff8802ec38ffd8 ffff88030d8741a0 ffff88030d8741a0
[19236.157643] ffff8802ec38faf8 ffff8802ec38fb00 ffff88030d874438 ffff88030d874440
[19236.157645] Call Trace:
[19236.157648]  [<ffffffff8105dd48>] ? wq_worker_sleeping+0x8/0x90
[19236.157653]  [<ffffffff816963bb>] ? __schedule+0x3db/0x6a0
[19236.157656]  [<ffffffff81070ddd>] ? task_cputime+0x2d/0x50
[19236.157659]  [<ffffffff81048843>] ? do_exit+0x7e3/0xa40
[19236.157662]  [<ffffffff81698837>] ? oops_end+0x97/0xe0
[19236.157666]  [<ffffffff81036c7d>] ? no_context+0xfd/0x2e0
[19236.157669]  [<ffffffff8169af9a>] ? __do_page_fault+0xea/0x510
[19236.157672]  [<ffffffff81070c44>] ? arch_vtime_task_switch+0x74/0xa0
[19236.157675]  [<ffffffff8106a9b9>] ? finish_task_switch+0x29/0xb0
[19236.157678]  [<ffffffff8169624d>] ? __schedule+0x26d/0x6a0
[19236.157680]  [<ffffffff8105c289>] ? flush_work+0x19/0x150
[19236.157682]  [<ffffffff8105c289>] ? flush_work+0x19/0x150
[19236.157687]  [<ffffffff813e6340>] ? dev_vprintk_emit+0x40/0x50
[19236.157690]  [<ffffffff8169b3e2>] ? do_page_fault+0x22/0x40
[19236.157693]  [<ffffffff81697c38>] ? page_fault+0x28/0x30
[19236.157695]  [<ffffffff8142992c>] ? scsi_remove_device+0x1c/0x30
[19236.157698]  [<ffffffff8141d20f>] ? scsi_device_put+0xf/0x50
[19236.157700]  [<ffffffff81429aba>] ? scsi_remove_target+0x16a/0x250
[19236.157703]  [<ffffffff8105ce4d>] ? process_one_work+0x13d/0x3b0
[19236.157705]  [<ffffffff8105a4a7>] ? pwq_activate_delayed_work+0x27/0x40
[19236.157708]  [<ffffffff8105d7b1>] ? worker_thread+0x121/0x3d0
[19236.157710]  [<ffffffff8105d690>] ? manage_workers.isra.26+0x280/0x280
[19236.157713]  [<ffffffff81062e92>] ? kthread+0xc2/0xd0
[19236.157715]  [<ffffffff81070000>] ? sched_clock_cpu+0x30/0x100
[19236.157718]  [<ffffffff81062dd0>] ? kthread_create_on_node+0x110/0x110
[19236.157721]  [<ffffffff8169db98>] ? ret_from_fork+0x58/0x90
[19236.157724]  [<ffffffff81062dd0>] ? kthread_create_on_node+0x110/0x110
[19236.157725] Code: 00 00 00 00 65 48 8b 04 25 c0 b6 00 00 48 8b 80 80 02 00 00 48 8b 40 c8 48 c1 e8 02 83 e0 01 c3 0f 1f 40 00 48 8b 87 80 02 00 00 <48> 8b 40 d8 c3 0f 1f 40 00 48 83 ec 08 48 8b b7 80 02 00 00 ba
[19236.157748] RIP  [<ffffffff810633c7>] kthread_data+0x7/0x10
[19236.157751]  RSP <ffff8802ec38fa48>
[19236.157752] CR2: ffffffffffffffd8
[19236.157753] ---[ end trace 37bfa3906f93d93b ]---
[19236.157755] Fixing recursive fault but reboot is needed!

This happened because of a race condition between scsi_remove_target (called from stgt_delete_work) and scsi_probe_and_add_lun (called from scan_work). I created a patch that always cancels scan_work before stgt_delete_work is scheduled.
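
To make the ordering argument concrete, here is a minimal userspace sketch in C. It is only a sequential model under my own assumptions, not kernel code: every model_* name is hypothetical, the pending flag merely stands in for a queued scan_work, and the model counts the overlap between scan and delete rather than reproducing the actual NULL dereference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_target {
	bool removed;		/* set once the target has been torn down */
};

struct model_rport {
	struct model_target target;
	bool scan_pending;	/* stands in for a queued scan_work */
	int overlaps;		/* scans that ran against a removed target */
};

/* Stand-in for queuing scan_work. */
void model_queue_scan(struct model_rport *rp)
{
	rp->scan_pending = true;
}

/*
 * Stand-in for cancelling scan_work before the delete is scheduled.
 * Returns true if a pending scan was actually cancelled.
 */
bool model_cancel_scan(struct model_rport *rp)
{
	bool was_pending = rp->scan_pending;

	rp->scan_pending = false;
	return was_pending;
}

/* Stand-in for stgt_delete_work removing the target. */
void model_delete_target(struct model_rport *rp)
{
	rp->target.removed = true;
}

/* Stand-in for the workqueue eventually running the queued scan. */
void model_run_pending_scan(struct model_rport *rp)
{
	if (!rp->scan_pending)
		return;
	rp->scan_pending = false;
	if (rp->target.removed)
		rp->overlaps++;	/* scan touched a target that is already gone */
}

/* Runs one teardown sequence and returns the number of overlaps seen. */
int model_teardown(bool cancel_first)
{
	struct model_rport rp = { .target = { .removed = false } };

	model_queue_scan(&rp);
	if (cancel_first)
		model_cancel_scan(&rp);
	model_delete_target(&rp);
	model_run_pending_scan(&rp);
	assert(!rp.scan_pending);	/* the queue is drained either way */
	return rp.overlaps;
}
```

Calling model_teardown(false) models the unpatched ordering, where the queued scan still runs after the target has been removed; model_teardown(true) models cancelling the scan first, so no scan is left to run against the removed target.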

Here's the patch for the 3.10.93 kernel:

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index e106c27..472a16e 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -3143,6 +3144,7 @@ fc_timeout_deleted_rport(struct work_struct *work)
                        " a FCP target, removing starget\n");
                spin_unlock_irqrestore(shost->host_lock, flags);
                scsi_target_unblock(&rport->dev, SDEV_TRANSPORT_OFFLINE);
+               cancel_work_sync(&rport->scan_work);
                fc_queue_work(shost, &rport->stgt_delete_work);
                return;
        }
@@ -3227,13 +3229,19 @@ fc_timeout_deleted_rport(struct work_struct *work)
                 * all attached scsi devices.
                 */
                rport->flags |= FC_RPORT_DEVLOSS_CALLBK_DONE;
+
+               /* cancel pending scan work */
+               spin_unlock_irqrestore(shost->host_lock, flags);
+               cancel_work_sync(&rport->scan_work);
+               spin_lock_irqsave(shost->host_lock, flags);
+
                fc_queue_work(shost, &rport->stgt_delete_work);
                do_callback = 1;
        }
-
        spin_unlock_irqrestore(shost->host_lock, flags);
+
        /*
         * Notify the driver that the rport is now dead. The LLDD will
         * also guarantee that any communication to the rport is terminated


--
Best regards
Arkadiusz Bubała
Open-E Poland Sp. z o.o.
www.open-e.com

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
