Re: Crash in TCM-LIO
On Wed, 2016-10-26 at 09:01 +, Gurumurthy, Anil wrote: > Hello Nicholas, > > Thanks for confirming. The ABORT_TASK TMR_TASK_DOES_NOT_EXIST > exceptions here do not depend on the missing SCF_ACK_KREF bit > assignment. > > The earlier SCF_ACK_KREF reference leak fix is specific to TMR that > reference active se_cmd->cmd_kref tags, when I/O is still outstanding > to associated target-core se_cmd->se_dev backends. > > > > > > > On this particular kernel version (4.7), I am unable to get a crash > > > dump, so cannot really fathom whats going on. > > > > > > > Note there is a v4.1+ reference leak regression for ABORT_TASK + session > > shutdown here: > > > > https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c > > 6d6afdb1e3e > > > > > Have you seen or have been notified of this behaviour? > > > > TomK (CC') reported something similar using v4.8.4 with ESX hosts > > ABORT_TASK + tcm/qla2xxx ports. > > > > > Any ideas/thoughts on how to proceed? > > > > > > > > > > Currently unsure if this list corruption is related to the above > > regression, or not. > > > > Please verify using the patch on v4.7.y code during tcm/qla2xxx > session shutdown -> restart, once ABORT_TASK has occurred. > > [Anil] I see the issue even when I apply this patch to my kernel. > > > > From above + TomK's list corruption logs, it looks like a > se_cmd->cmd_kref is prematurely reaching zero + freeing memory while > se_cmd memory is still outstanding to target-core backend. > > The se_cmd->state_list is not used by TMR, so AFAICT list corruption > here is specific to qla_tgt_cmd->se_cmd dispatched into target-core, > released while se_cmd is still outstanding. > > To confirm verify the theory, please change the list_debug warn above > into BUG_ON with LKCD logic in place, and let's have a look. > > [Anil] > With the BUG_ON, I pretty much see the same thing. Getting a crash > dump has been a significant challenge - never been able to generate > one with kernel 4.6 and later. > > [ 3513.258979] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: > 1258720 > [ 3513.258998] qla2xxx [:0b:00.1]-e900:4: RESET-TMR > online/active/old-count/new-count = 1/0/0/1. > [ 3513.259045] BUG: unable to handle kernel NULL pointer dereference at > 00e0 > [ 3513.259177] IP: [] kmem_cache_free+0x11a/0x200 > [ 3513.259277] PGD 0 > [ 3513.259312] Oops: [#1] SMP > [ 3513.259362] Modules linked in: target_core_pscsi tcm_qla2xxx(OE) > qla2xxx(OE) iscsi_target_mod target_core_file target_core_iblock > target_core_mod netconsole ebtable_nat ebtables ipt_MASQUERADE > nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM > iptable_mangle bridge 8021q mrp garp stp llc ipt_REJECT nf_reject_ipv4 > nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT > nf_reject_ipv6[ 3513.260012] [ cut here ] > [ 3513.260018] WARNING: CPU: 2 PID: 389 at lib/list_debug.c:61 > __list_del_entry+0x65/0xb0 > [ 3513.260019] list_del corruption. prev->next should be 8806ec7c3328, > but was 8806ec7bcf68 > [ 3513.260019] Modules linked in: target_core_pscsi tcm_qla2xxx(OE) > qla2xxx(OE) iscsi_target_mod target_core_file t Ok, so it sounds like there are two issues here. 1) The TMR list_corruption observed here with tcm_qla2xxx, and 2) >= v4.6.y CONFIG_CRASH_DUMP=y breakage in your environment. Looking at the difference between tcm_qla2xxx in v4.x.y code: - v4.7.y+ contains commit e3dc0e3 to convert to private sess_kref - v4.6.y+ contains commit 1b655b1 to convert to target_alloc_session() - v4.5.y+ contains commit a07100e to fix TMR ABORT interaction issue At this point without a vmcore to root cause #1, I'd recommend generating a vmcore on the earliest kernel (v4.5.y) possible in order to confirm the regression. For >= v4.6.y CONFIG_CRASH_DUMP=y breakage in #2, it needs to be reported to LKML. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Crash in TCM-LIO
Hello Nicholas, -Original Message- From: Nicholas A. Bellinger [mailto:n...@linux-iscsi.org] Sent: 26 October 2016 11:49 To: Gurumurthy, Anil <anil.gurumur...@cavium.com> Cc: Anil Gurumurthy <anil.gurumur...@qlogic.com>; linux-scsi <linux-scsi@vger.kernel.org>; Malavali, Giridhar <giridhar.malav...@cavium.com>; Tran, Quinn <quinn.t...@cavium.com>; TomK <t...@mdevsys.com> Subject: Re: Crash in TCM-LIO Hello Anil & Co, On Tue, 2016-10-25 at 05:33 +, Gurumurthy, Anil wrote: > On Mon, 2016-10-24 at 06:36 +, Anil Gurumurthy wrote: > > > > [71884.588748] BUG: unable to handle kernel NULL pointer dereference > > at 00e0 > > > > [71884.51] IP: [] kmem_cache_free+0x11a/0x200 > > > > [71884.588981] PGD 0 > > > > [71884.589017] Oops: [#1] SMP > > > > [71884.589041] [ cut here ] > > > > [71884.589048] WARNING: CPU: 2 PID: 20783 at lib/list_debug.c:62 > > __list_del_entry+0x86/0xd0 > > > > [71884.589049] list_del corruption. next->prev should be > > 8806f8daeb68, but was 8806f8db79e8 > > > > [71884.589075] Modules linked in: target_core_pscsi tcm_qla2xxx(OE) > > qla2xxx(OE) iscsi_target_mod target_core_file target_core_iblock > > target_core_mod netconsole ebtable_nat ebtables ipt_MASQUERADE > > nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM > > iptable_mangle bridge 8021q mrp garp stp llc ipt_REJECT > > nf_reject_ipv4 > > nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables > > ip6t_REJECT > > nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state > > nf_conntrack ip6table_filter ip6_tables binfmt_misc vhost_net > > macvtap macvlan vhost tun uinput sg serio_raw iTCO_wdt > > iTCO_vendor_support ipmi_ssif ipmi_si ipmi_msghandler hpilo hpwdt > > bnx2 intel_powerclamp coretemp kvm_intel kvm irqbypass > > crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel > > pcspkr acpi_power_meter lpc_ich mfd_core i7core_edac edac_core > > shpchp ext4(E) mbcache(E) jbd2(E) sd_mod(E) > > sr_mod(E) cdrom(E) scsi_transport_fc(E) hpsa(E) > > scsi_transport_sas(E) > > aesni_intel(E) ablk_helper(E) cryptd(E) lrw(E) gf128mul(E) > > glue_helper(E) pata_acpi(E) ata_generic(E) ata_piix(E) libata(E) > > radeon(E) ttm(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) > > sysimgblt(E) > > sysfillrect(E) syscopyarea(E) i2c_algo_bit(E) i2c_core(E) > > dm_mirror(E) > > dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: qla2xxx] > > > > [71884.589091] CPU: 2 PID: 20783 Comm: kworker/2:0 Tainted: GW > > IOE 4.7.0-rc1+ #2 > > > > [71884.589092] Hardware name: HP ProLiant DL380 G7, BIOS P67 > > 05/05/2011 > > > > [71884.589107] Workqueue: target_completion target_complete_ok_work > > [target_core_mod] > > > > [71884.589109] 880408d1fae8 81352197 > > 81370ef6 > > > > [71884.589110] 880408d1fb48 880408d1fb48 > > 880408d1fb38 > > > > [71884.589111] 8108a12d 07c3 003e0246 > > 0246 > > > > [71884.589112] Call Trace: > > > > [71884.589115] [] dump_stack+0x67/0x90 > > > > [71884.589117] [] ? __list_del_entry+0x86/0xd0 > > > > [71884.589119] [] __warn+0xfd/0x120 > > > > [71884.589120] [] warn_slowpath_fmt+0x49/0x50 > > > > [71884.589122] [] __list_del_entry+0x86/0xd0 > > > > [71884.589123] [] list_del+0x11/0x40 > > > > [71884.589132] [] target_remove_from_state_list > > +0x6e/0x80 [target_core_mod] > > > > [71884.589140] [] transport_cmd_check_stop > > +0xe4/0x120 [target_core_mod] > > > > [71884.589151] [] > > transport_cmd_check_stop_to_fabric+0x15/0x20 [target_core_mod] > > > > [71884.589160] [] target_complete_ok_work > > +0x14e/0x280 [target_core_mod] > > > > [71884.589162] [] ? > > pwq_dec_nr_in_flight+0x50/0xa0 > > > > [71884.589164] [] process_one_work+0x183/0x4d0 > > > > [71884.589166] [] ? __schedule+0x1ff/0x5c0 > > > > [71884.589167] [] ? schedule+0x40/0xb0 > > > > [71884.589169] [] worker_thread+0x16d/0x530 > > > > [71884.589171] [] ? __switch_to+0x1cd/0x5e0 > > > > [71884.589173] [] ? __schedule+0x1ff/0x5c0 > > > > [71884.589175] [] ? __wake_up_common+0x56/0x90 > > > > [71884.589177] [] ? > > maybe_create_worker+0x120/0x120 > > > > [71884.589178] [] ? schedule+0x4
Re: Crash in TCM-LIO
Hello Anil & Co, On Tue, 2016-10-25 at 05:33 +, Gurumurthy, Anil wrote: > On Mon, 2016-10-24 at 06:36 +, Anil Gurumurthy wrote: > > > > [71884.588748] BUG: unable to handle kernel NULL pointer dereference > > at 00e0 > > > > [71884.51] IP: [] kmem_cache_free+0x11a/0x200 > > > > [71884.588981] PGD 0 > > > > [71884.589017] Oops: [#1] SMP > > > > [71884.589041] [ cut here ] > > > > [71884.589048] WARNING: CPU: 2 PID: 20783 at lib/list_debug.c:62 > > __list_del_entry+0x86/0xd0 > > > > [71884.589049] list_del corruption. next->prev should be > > 8806f8daeb68, but was 8806f8db79e8 > > > > [71884.589075] Modules linked in: target_core_pscsi tcm_qla2xxx(OE) > > qla2xxx(OE) iscsi_target_mod target_core_file target_core_iblock > > target_core_mod netconsole ebtable_nat ebtables ipt_MASQUERADE > > nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM > > iptable_mangle bridge 8021q mrp garp stp llc ipt_REJECT nf_reject_ipv4 > > nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT > > nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack > > ip6table_filter ip6_tables binfmt_misc vhost_net macvtap macvlan vhost > > tun uinput sg serio_raw iTCO_wdt iTCO_vendor_support ipmi_ssif ipmi_si > > ipmi_msghandler hpilo hpwdt bnx2 intel_powerclamp coretemp kvm_intel > > kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel > > ghash_clmulni_intel pcspkr acpi_power_meter lpc_ich mfd_core > > i7core_edac edac_core shpchp ext4(E) mbcache(E) jbd2(E) sd_mod(E) > > sr_mod(E) cdrom(E) scsi_transport_fc(E) hpsa(E) scsi_transport_sas(E) > > aesni_intel(E) ablk_helper(E) cryptd(E) lrw(E) gf128mul(E) > > glue_helper(E) pata_acpi(E) ata_generic(E) ata_piix(E) libata(E) > > radeon(E) ttm(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) sysimgblt(E) > > sysfillrect(E) syscopyarea(E) i2c_algo_bit(E) i2c_core(E) dm_mirror(E) > > dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: qla2xxx] > > > > [71884.589091] CPU: 2 PID: 20783 Comm: kworker/2:0 Tainted: GW > > IOE 4.7.0-rc1+ #2 > > > > [71884.589092] Hardware name: HP ProLiant DL380 G7, BIOS P67 > > 05/05/2011 > > > > [71884.589107] Workqueue: target_completion target_complete_ok_work > > [target_core_mod] > > > > [71884.589109] 880408d1fae8 81352197 > > 81370ef6 > > > > [71884.589110] 880408d1fb48 880408d1fb48 > > 880408d1fb38 > > > > [71884.589111] 8108a12d 07c3 003e0246 > > 0246 > > > > [71884.589112] Call Trace: > > > > [71884.589115] [] dump_stack+0x67/0x90 > > > > [71884.589117] [] ? __list_del_entry+0x86/0xd0 > > > > [71884.589119] [] __warn+0xfd/0x120 > > > > [71884.589120] [] warn_slowpath_fmt+0x49/0x50 > > > > [71884.589122] [] __list_del_entry+0x86/0xd0 > > > > [71884.589123] [] list_del+0x11/0x40 > > > > [71884.589132] [] target_remove_from_state_list > > +0x6e/0x80 [target_core_mod] > > > > [71884.589140] [] transport_cmd_check_stop > > +0xe4/0x120 [target_core_mod] > > > > [71884.589151] [] > > transport_cmd_check_stop_to_fabric+0x15/0x20 [target_core_mod] > > > > [71884.589160] [] target_complete_ok_work > > +0x14e/0x280 [target_core_mod] > > > > [71884.589162] [] ? pwq_dec_nr_in_flight+0x50/0xa0 > > > > [71884.589164] [] process_one_work+0x183/0x4d0 > > > > [71884.589166] [] ? __schedule+0x1ff/0x5c0 > > > > [71884.589167] [] ? schedule+0x40/0xb0 > > > > [71884.589169] [] worker_thread+0x16d/0x530 > > > > [71884.589171] [] ? __switch_to+0x1cd/0x5e0 > > > > [71884.589173] [] ? __schedule+0x1ff/0x5c0 > > > > [71884.589175] [] ? __wake_up_common+0x56/0x90 > > > > [71884.589177] [] ? maybe_create_worker+0x120/0x120 > > > > [71884.589178] [] ? schedule+0x40/0xb0 > > > > [71884.589179] [] ? maybe_create_worker+0x120/0x120 > > > > [71884.589180] [] kthread+0xcc/0xf0 > > > > [71884.589183] [] ? do_syscall_64+0x78/0x1d0 > > > > [71884.589185] [] ? schedule_tail+0x1e/0xc0 > > > > [71884.589188] [] ret_from_fork+0x1f/0x40 > > > > [71884.589189] [] ? kthread_freezable_should_stop > > +0x70/0x70 > > > > [71884.589190] ---[ end trace 7f24d6c863b6e35b ]--- > > > > [71884.589204] [ cut here ] > > > > [71884.589206] WARNING: CPU: 2 PID: 20783 at lib/list_debug.c:59 > > __list_del_entry+0xa5/0xd0 > > > > > > Was list corruption preceded by hung kernel task warnings..? > > No. The only messages prior to them were similar to this: > [ 3345.864359] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: > 1219168 [ 3345.864378] qla2xxx [:0b:00.1]-e900:4: RESET-TMR > online/active/old-count/new-count = 1/0/0/1. > [ 3345.864418] BUG: unable to handle kernel NULL pointer dereference at > 00e0 [ 3345.864549] IP: [] > kmem_cache_free+0x11a/0x200 > Thanks for confirming. The ABORT_TASK TMR_TASK_DOES_NOT_EXIST exceptions here do not depend on the
RE: Crash in TCM-LIO
Hello Nicholas, -Original Message- From: Gurumurthy, Anil Sent: 24 October 2016 14:50 To: 'Nicholas A. Bellinger' <n...@linux-iscsi.org>; Anil Gurumurthy <anil.gurumur...@qlogic.com> Cc: linux-scsi <linux-scsi@vger.kernel.org>; Malavali, Giridhar <giridhar.malav...@cavium.com>; Tran, Quinn <quinn.t...@cavium.com>; TomK <t...@mdevsys.com> Subject: RE: Crash in TCM-LIO -Original Message- From: Nicholas A. Bellinger [mailto:n...@linux-iscsi.org] Sent: 24 October 2016 12:30 To: Anil Gurumurthy <anil.gurumur...@qlogic.com> Cc: linux-scsi <linux-scsi@vger.kernel.org>; Malavali, Giridhar <giridhar.malav...@cavium.com>; Tran, Quinn <quinn.t...@cavium.com>; TomK <t...@mdevsys.com> Subject: Re: Crash in TCM-LIO Hello Anil & Co, On Mon, 2016-10-24 at 06:36 +, Anil Gurumurthy wrote: > Hello Nicholas, > > I was trying to get DIF working on TCM-LIO with a QLogic FC adapter > on kernel version 4.7. I noticed a crash when there was an abort > received by the target with the following stack trace: > > Thanks for reporting. > > [71884.588748] BUG: unable to handle kernel NULL pointer dereference > at 00e0 > > [71884.51] IP: [] kmem_cache_free+0x11a/0x200 > > [71884.588981] PGD 0 > > [71884.589017] Oops: [#1] SMP > > [71884.589041] [ cut here ] > > [71884.589048] WARNING: CPU: 2 PID: 20783 at lib/list_debug.c:62 > __list_del_entry+0x86/0xd0 > > [71884.589049] list_del corruption. next->prev should be > 8806f8daeb68, but was 8806f8db79e8 > > [71884.589075] Modules linked in: target_core_pscsi tcm_qla2xxx(OE) > qla2xxx(OE) iscsi_target_mod target_core_file target_core_iblock > target_core_mod netconsole ebtable_nat ebtables ipt_MASQUERADE > nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM > iptable_mangle bridge 8021q mrp garp stp llc ipt_REJECT nf_reject_ipv4 > nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT > nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack > ip6table_filter ip6_tables binfmt_misc vhost_net macvtap macvlan vhost > tun uinput sg serio_raw iTCO_wdt iTCO_vendor_support ipmi_ssif ipmi_si > ipmi_msghandler hpilo hpwdt bnx2 intel_powerclamp coretemp kvm_intel > kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel > ghash_clmulni_intel pcspkr acpi_power_meter lpc_ich mfd_core > i7core_edac edac_core shpchp ext4(E) mbcache(E) jbd2(E) sd_mod(E) > sr_mod(E) cdrom(E) scsi_transport_fc(E) hpsa(E) scsi_transport_sas(E) > aesni_intel(E) ablk_helper(E) cryptd(E) lrw(E) gf128mul(E) > glue_helper(E) pata_acpi(E) ata_generic(E) ata_piix(E) libata(E) > radeon(E) ttm(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) sysimgblt(E) > sysfillrect(E) syscopyarea(E) i2c_algo_bit(E) i2c_core(E) dm_mirror(E) > dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: qla2xxx] > > [71884.589091] CPU: 2 PID: 20783 Comm: kworker/2:0 Tainted: GW > IOE 4.7.0-rc1+ #2 > > [71884.589092] Hardware name: HP ProLiant DL380 G7, BIOS P67 > 05/05/2011 > > [71884.589107] Workqueue: target_completion target_complete_ok_work > [target_core_mod] > > [71884.589109] 880408d1fae8 81352197 > 81370ef6 > > [71884.589110] 880408d1fb48 880408d1fb48 > 880408d1fb38 > > [71884.589111] 8108a12d 07c3 003e0246 > 0246 > > [71884.589112] Call Trace: > > [71884.589115] [] dump_stack+0x67/0x90 > > [71884.589117] [] ? __list_del_entry+0x86/0xd0 > > [71884.589119] [] __warn+0xfd/0x120 > > [71884.589120] [] warn_slowpath_fmt+0x49/0x50 > > [71884.589122] [] __list_del_entry+0x86/0xd0 > > [71884.589123] [] list_del+0x11/0x40 > > [71884.589132] [] target_remove_from_state_list > +0x6e/0x80 [target_core_mod] > > [71884.589140] [] transport_cmd_check_stop > +0xe4/0x120 [target_core_mod] > > [71884.589151] [] > transport_cmd_check_stop_to_fabric+0x15/0x20 [target_core_mod] > > [71884.589160] [] target_complete_ok_work > +0x14e/0x280 [target_core_mod] > > [71884.589162] [] ? pwq_dec_nr_in_flight+0x50/0xa0 > > [71884.589164] [] process_one_work+0x183/0x4d0 > > [71884.589166] [] ? __schedule+0x1ff/0x5c0 > > [71884.589167] [] ? schedule+0x40/0xb0 > > [71884.589169] [] worker_thread+0x16d/0x530 > > [71884.589171] [] ? __switch_to+0x1cd/0x5e0 > > [71884.589173] [] ? __schedule+0x1ff/0x5c0 > > [71884.589175] [] ? __wake_up_common+0x56/0x90 > > [71884.589177] [] ? maybe_create_worker+0x120/0x120 > > [71884.589178] [] ? schedule+0x40/0xb0 > > [71884.58
RE: Crash in TCM-LIO
-Original Message- From: Nicholas A. Bellinger [mailto:n...@linux-iscsi.org] Sent: 24 October 2016 12:30 To: Anil Gurumurthy <anil.gurumur...@qlogic.com> Cc: linux-scsi <linux-scsi@vger.kernel.org>; Malavali, Giridhar <giridhar.malav...@cavium.com>; Tran, Quinn <quinn.t...@cavium.com>; TomK <t...@mdevsys.com> Subject: Re: Crash in TCM-LIO Hello Anil & Co, On Mon, 2016-10-24 at 06:36 +, Anil Gurumurthy wrote: > Hello Nicholas, > > I was trying to get DIF working on TCM-LIO with a QLogic FC adapter > on kernel version 4.7. I noticed a crash when there was an abort > received by the target with the following stack trace: > > Thanks for reporting. > > [71884.588748] BUG: unable to handle kernel NULL pointer dereference > at 00e0 > > [71884.51] IP: [] kmem_cache_free+0x11a/0x200 > > [71884.588981] PGD 0 > > [71884.589017] Oops: [#1] SMP > > [71884.589041] [ cut here ] > > [71884.589048] WARNING: CPU: 2 PID: 20783 at lib/list_debug.c:62 > __list_del_entry+0x86/0xd0 > > [71884.589049] list_del corruption. next->prev should be > 8806f8daeb68, but was 8806f8db79e8 > > [71884.589075] Modules linked in: target_core_pscsi tcm_qla2xxx(OE) > qla2xxx(OE) iscsi_target_mod target_core_file target_core_iblock > target_core_mod netconsole ebtable_nat ebtables ipt_MASQUERADE > nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM > iptable_mangle bridge 8021q mrp garp stp llc ipt_REJECT nf_reject_ipv4 > nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT > nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack > ip6table_filter ip6_tables binfmt_misc vhost_net macvtap macvlan vhost > tun uinput sg serio_raw iTCO_wdt iTCO_vendor_support ipmi_ssif ipmi_si > ipmi_msghandler hpilo hpwdt bnx2 intel_powerclamp coretemp kvm_intel > kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel > ghash_clmulni_intel pcspkr acpi_power_meter lpc_ich mfd_core > i7core_edac edac_core shpchp ext4(E) mbcache(E) jbd2(E) sd_mod(E) > sr_mod(E) cdrom(E) scsi_transport_fc(E) hpsa(E) scsi_transport_sas(E) > aesni_intel(E) ablk_helper(E) cryptd(E) lrw(E) gf128mul(E) > glue_helper(E) pata_acpi(E) ata_generic(E) ata_piix(E) libata(E) > radeon(E) ttm(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) sysimgblt(E) > sysfillrect(E) syscopyarea(E) i2c_algo_bit(E) i2c_core(E) dm_mirror(E) > dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: qla2xxx] > > [71884.589091] CPU: 2 PID: 20783 Comm: kworker/2:0 Tainted: GW > IOE 4.7.0-rc1+ #2 > > [71884.589092] Hardware name: HP ProLiant DL380 G7, BIOS P67 > 05/05/2011 > > [71884.589107] Workqueue: target_completion target_complete_ok_work > [target_core_mod] > > [71884.589109] 880408d1fae8 81352197 > 81370ef6 > > [71884.589110] 880408d1fb48 880408d1fb48 > 880408d1fb38 > > [71884.589111] 8108a12d 07c3 003e0246 > 0246 > > [71884.589112] Call Trace: > > [71884.589115] [] dump_stack+0x67/0x90 > > [71884.589117] [] ? __list_del_entry+0x86/0xd0 > > [71884.589119] [] __warn+0xfd/0x120 > > [71884.589120] [] warn_slowpath_fmt+0x49/0x50 > > [71884.589122] [] __list_del_entry+0x86/0xd0 > > [71884.589123] [] list_del+0x11/0x40 > > [71884.589132] [] target_remove_from_state_list > +0x6e/0x80 [target_core_mod] > > [71884.589140] [] transport_cmd_check_stop > +0xe4/0x120 [target_core_mod] > > [71884.589151] [] > transport_cmd_check_stop_to_fabric+0x15/0x20 [target_core_mod] > > [71884.589160] [] target_complete_ok_work > +0x14e/0x280 [target_core_mod] > > [71884.589162] [] ? pwq_dec_nr_in_flight+0x50/0xa0 > > [71884.589164] [] process_one_work+0x183/0x4d0 > > [71884.589166] [] ? __schedule+0x1ff/0x5c0 > > [71884.589167] [] ? schedule+0x40/0xb0 > > [71884.589169] [] worker_thread+0x16d/0x530 > > [71884.589171] [] ? __switch_to+0x1cd/0x5e0 > > [71884.589173] [] ? __schedule+0x1ff/0x5c0 > > [71884.589175] [] ? __wake_up_common+0x56/0x90 > > [71884.589177] [] ? maybe_create_worker+0x120/0x120 > > [71884.589178] [] ? schedule+0x40/0xb0 > > [71884.589179] [] ? maybe_create_worker+0x120/0x120 > > [71884.589180] [] kthread+0xcc/0xf0 > > [71884.589183] [] ? do_syscall_64+0x78/0x1d0 > > [71884.589185] [] ? schedule_tail+0x1e/0xc0 > > [71884.589188] [] ret_from_fork+0x1f/0x40 > > [71884.589189] [] ? kthread_freezable_should_stop > +0x70/0x70 > > [71884.589190] ---[ end trace 7f24d6c863b6e35b ]--- > > [71884.589204] ---
RE: Crash in TCM-LIO
-Original Message- From: Nicholas A. Bellinger [mailto:n...@linux-iscsi.org] Sent: 24 October 2016 12:30 To: Anil Gurumurthy <anil.gurumur...@qlogic.com> Cc: linux-scsi <linux-scsi@vger.kernel.org>; Malavali, Giridhar <giridhar.malav...@cavium.com>; Tran, Quinn <quinn.t...@cavium.com>; TomK <t...@mdevsys.com> Subject: Re: Crash in TCM-LIO Hello Anil & Co, On Mon, 2016-10-24 at 06:36 +, Anil Gurumurthy wrote: > Hello Nicholas, > > I was trying to get DIF working on TCM-LIO with a QLogic FC adapter > on kernel version 4.7. I noticed a crash when there was an abort > received by the target with the following stack trace: > > Thanks for reporting. > > [71884.588748] BUG: unable to handle kernel NULL pointer dereference > at 00e0 > > [71884.51] IP: [] kmem_cache_free+0x11a/0x200 > > [71884.588981] PGD 0 > > [71884.589017] Oops: [#1] SMP > > [71884.589041] [ cut here ] > > [71884.589048] WARNING: CPU: 2 PID: 20783 at lib/list_debug.c:62 > __list_del_entry+0x86/0xd0 > > [71884.589049] list_del corruption. next->prev should be > 8806f8daeb68, but was 8806f8db79e8 > > [71884.589075] Modules linked in: target_core_pscsi tcm_qla2xxx(OE) > qla2xxx(OE) iscsi_target_mod target_core_file target_core_iblock > target_core_mod netconsole ebtable_nat ebtables ipt_MASQUERADE > nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM > iptable_mangle bridge 8021q mrp garp stp llc ipt_REJECT nf_reject_ipv4 > nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT > nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack > ip6table_filter ip6_tables binfmt_misc vhost_net macvtap macvlan vhost > tun uinput sg serio_raw iTCO_wdt iTCO_vendor_support ipmi_ssif ipmi_si > ipmi_msghandler hpilo hpwdt bnx2 intel_powerclamp coretemp kvm_intel > kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel > ghash_clmulni_intel pcspkr acpi_power_meter lpc_ich mfd_core > i7core_edac edac_core shpchp ext4(E) mbcache(E) jbd2(E) sd_mod(E) > sr_mod(E) cdrom(E) scsi_transport_fc(E) hpsa(E) scsi_transport_sas(E) > aesni_intel(E) ablk_helper(E) cryptd(E) lrw(E) gf128mul(E) > glue_helper(E) pata_acpi(E) ata_generic(E) ata_piix(E) libata(E) > radeon(E) ttm(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) sysimgblt(E) > sysfillrect(E) syscopyarea(E) i2c_algo_bit(E) i2c_core(E) dm_mirror(E) > dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: qla2xxx] > > [71884.589091] CPU: 2 PID: 20783 Comm: kworker/2:0 Tainted: GW > IOE 4.7.0-rc1+ #2 > > [71884.589092] Hardware name: HP ProLiant DL380 G7, BIOS P67 > 05/05/2011 > > [71884.589107] Workqueue: target_completion target_complete_ok_work > [target_core_mod] > > [71884.589109] 880408d1fae8 81352197 > 81370ef6 > > [71884.589110] 880408d1fb48 880408d1fb48 > 880408d1fb38 > > [71884.589111] 8108a12d 07c3 003e0246 > 0246 > > [71884.589112] Call Trace: > > [71884.589115] [] dump_stack+0x67/0x90 > > [71884.589117] [] ? __list_del_entry+0x86/0xd0 > > [71884.589119] [] __warn+0xfd/0x120 > > [71884.589120] [] warn_slowpath_fmt+0x49/0x50 > > [71884.589122] [] __list_del_entry+0x86/0xd0 > > [71884.589123] [] list_del+0x11/0x40 > > [71884.589132] [] target_remove_from_state_list > +0x6e/0x80 [target_core_mod] > > [71884.589140] [] transport_cmd_check_stop > +0xe4/0x120 [target_core_mod] > > [71884.589151] [] > transport_cmd_check_stop_to_fabric+0x15/0x20 [target_core_mod] > > [71884.589160] [] target_complete_ok_work > +0x14e/0x280 [target_core_mod] > > [71884.589162] [] ? pwq_dec_nr_in_flight+0x50/0xa0 > > [71884.589164] [] process_one_work+0x183/0x4d0 > > [71884.589166] [] ? __schedule+0x1ff/0x5c0 > > [71884.589167] [] ? schedule+0x40/0xb0 > > [71884.589169] [] worker_thread+0x16d/0x530 > > [71884.589171] [] ? __switch_to+0x1cd/0x5e0 > > [71884.589173] [] ? __schedule+0x1ff/0x5c0 > > [71884.589175] [] ? __wake_up_common+0x56/0x90 > > [71884.589177] [] ? maybe_create_worker+0x120/0x120 > > [71884.589178] [] ? schedule+0x40/0xb0 > > [71884.589179] [] ? maybe_create_worker+0x120/0x120 > > [71884.589180] [] kthread+0xcc/0xf0 > > [71884.589183] [] ? do_syscall_64+0x78/0x1d0 > > [71884.589185] [] ? schedule_tail+0x1e/0xc0 > > [71884.589188] [] ret_from_fork+0x1f/0x40 > > [71884.589189] [] ? kthread_freezable_should_stop > +0x70/0x70 > > [71884.589190] ---[ end trace 7f24d6c863b6e35b ]--- > > [71884.589204] --
Re: Crash in TCM-LIO
Hello Anil & Co, On Mon, 2016-10-24 at 06:36 +, Anil Gurumurthy wrote: > Hello Nicholas, > > I was trying to get DIF working on TCM-LIO with a QLogic FC adapter > on kernel version 4.7. I noticed a crash when there was an abort > received by the target with the following stack trace: > > Thanks for reporting. > > [71884.588748] BUG: unable to handle kernel NULL pointer dereference > at 00e0 > > [71884.51] IP: [] kmem_cache_free+0x11a/0x200 > > [71884.588981] PGD 0 > > [71884.589017] Oops: [#1] SMP > > [71884.589041] [ cut here ] > > [71884.589048] WARNING: CPU: 2 PID: 20783 at lib/list_debug.c:62 > __list_del_entry+0x86/0xd0 > > [71884.589049] list_del corruption. next->prev should be > 8806f8daeb68, but was 8806f8db79e8 > > [71884.589075] Modules linked in: target_core_pscsi tcm_qla2xxx(OE) > qla2xxx(OE) iscsi_target_mod target_core_file target_core_iblock > target_core_mod netconsole ebtable_nat ebtables ipt_MASQUERADE > nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM > iptable_mangle bridge 8021q mrp garp stp llc ipt_REJECT nf_reject_ipv4 > nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT > nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack > ip6table_filter ip6_tables binfmt_misc vhost_net macvtap macvlan vhost > tun uinput sg serio_raw iTCO_wdt iTCO_vendor_support ipmi_ssif ipmi_si > ipmi_msghandler hpilo hpwdt bnx2 intel_powerclamp coretemp kvm_intel > kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel > ghash_clmulni_intel pcspkr acpi_power_meter lpc_ich mfd_core > i7core_edac edac_core shpchp ext4(E) mbcache(E) jbd2(E) sd_mod(E) > sr_mod(E) cdrom(E) scsi_transport_fc(E) hpsa(E) scsi_transport_sas(E) > aesni_intel(E) ablk_helper(E) cryptd(E) lrw(E) gf128mul(E) > glue_helper(E) pata_acpi(E) ata_generic(E) ata_piix(E) libata(E) > radeon(E) ttm(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) sysimgblt(E) > sysfillrect(E) syscopyarea(E) i2c_algo_bit(E) i2c_core(E) dm_mirror(E) > dm_region_hash(E) dm_log(E) dm_mod(E) [last unloaded: qla2xxx] > > [71884.589091] CPU: 2 PID: 20783 Comm: kworker/2:0 Tainted: GW > IOE 4.7.0-rc1+ #2 > > [71884.589092] Hardware name: HP ProLiant DL380 G7, BIOS P67 > 05/05/2011 > > [71884.589107] Workqueue: target_completion target_complete_ok_work > [target_core_mod] > > [71884.589109] 880408d1fae8 81352197 > 81370ef6 > > [71884.589110] 880408d1fb48 880408d1fb48 > 880408d1fb38 > > [71884.589111] 8108a12d 07c3 003e0246 > 0246 > > [71884.589112] Call Trace: > > [71884.589115] [] dump_stack+0x67/0x90 > > [71884.589117] [] ? __list_del_entry+0x86/0xd0 > > [71884.589119] [] __warn+0xfd/0x120 > > [71884.589120] [] warn_slowpath_fmt+0x49/0x50 > > [71884.589122] [] __list_del_entry+0x86/0xd0 > > [71884.589123] [] list_del+0x11/0x40 > > [71884.589132] [] target_remove_from_state_list > +0x6e/0x80 [target_core_mod] > > [71884.589140] [] transport_cmd_check_stop > +0xe4/0x120 [target_core_mod] > > [71884.589151] [] > transport_cmd_check_stop_to_fabric+0x15/0x20 [target_core_mod] > > [71884.589160] [] target_complete_ok_work > +0x14e/0x280 [target_core_mod] > > [71884.589162] [] ? pwq_dec_nr_in_flight+0x50/0xa0 > > [71884.589164] [] process_one_work+0x183/0x4d0 > > [71884.589166] [] ? __schedule+0x1ff/0x5c0 > > [71884.589167] [] ? schedule+0x40/0xb0 > > [71884.589169] [] worker_thread+0x16d/0x530 > > [71884.589171] [] ? __switch_to+0x1cd/0x5e0 > > [71884.589173] [] ? __schedule+0x1ff/0x5c0 > > [71884.589175] [] ? __wake_up_common+0x56/0x90 > > [71884.589177] [] ? maybe_create_worker+0x120/0x120 > > [71884.589178] [] ? schedule+0x40/0xb0 > > [71884.589179] [] ? maybe_create_worker+0x120/0x120 > > [71884.589180] [] kthread+0xcc/0xf0 > > [71884.589183] [] ? do_syscall_64+0x78/0x1d0 > > [71884.589185] [] ? schedule_tail+0x1e/0xc0 > > [71884.589188] [] ret_from_fork+0x1f/0x40 > > [71884.589189] [] ? kthread_freezable_should_stop > +0x70/0x70 > > [71884.589190] ---[ end trace 7f24d6c863b6e35b ]--- > > [71884.589204] [ cut here ] > > [71884.589206] WARNING: CPU: 2 PID: 20783 at lib/list_debug.c:59 > __list_del_entry+0xa5/0xd0 > > Was list corruption preceded by hung kernel task warnings..? > > On this particular kernel version (4.7), I am unable to get a crash > dump, so cannot really fathom whats going on. > Note there is a v4.1+ reference leak regression for ABORT_TASK + session shutdown here: https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e > Have you seen or have been notified of this behaviour? TomK (CC') reported something similar using v4.8.4 with ESX hosts ABORT_TASK + tcm/qla2xxx ports. > Any ideas/thoughts on how to proceed? > > Currently unsure if this list corruption is related to the above