qla2xxx firmware crashes in target mode

2015-10-19 Thread Chris Boot
Hi folks,

So this is a bit of a strange situation I'm in, where my *target*
qla2xxx firmware appears to get stuck when the *initiator* kernel is 4.1+.

The target is an Intel system with a QLE2464 running kernel 4.2.1 (from
Debian) and using fw=7.03.00. The initiator is another Intel system with
a QLE2460 and using fw=7.03.00. They are connected by a direct fibre link;
there are no switches / fabric involved.

The initiator and target are both stable when the initiator is running
kernel 4.0 or lower. When the initiator is running a 4.1 or 4.2 kernel,
the *target* firmware becomes unstable and the initiator times out IOs
and generally becomes very unhappy.

When booting a 4.1+ kernel on the initiator, everything appears to work
well for a little while (up to an hour or so) before the issue manifests
itself. At some point I see the "ISP System Error" message and IO locks
up. To get out of this situation I need to reboot the initiator; the
target appears to recover by itself.

Do you know about this issue? I can debug further (e.g. try to bisect
it) if required, but there is no point if you already know about it.

dmesg from the target end (I haven't been able to capture the initiator
end):

[484701.194971] qla2xxx [:05:00.0]-5003:9: ISP System Error -
mbx1=c19h mbx2=10h mbx3=0h mbx7=0h.
[484701.222021] qla2xxx [:05:00.0]-d001:9: Firmware dump saved to
temp buffer (9/c90002b84000), dump status flags (0x3f).
[484701.222082] qla2xxx [:05:00.0]-00af:9: Performing ISP error
recovery - ha=8800ab7c4000.
[484702.063799] qla2xxx [:05:00.0]-500a:9: LOOP UP detected (4 Gbps).
[484702.112814] qla2xxx [:05:00.0]-0121:9: Failed to enable
receiving of RSCN requests: 0x2.
[484702.743687] qla2xxx [:05:00.0]-5003:9: ISP System Error -
mbx1=c19h mbx2=10h mbx3=0h mbx7=0h.
[484702.754050] qla2xxx [:05:00.0]-d007:9: Firmware has been
previously dumped (c90002b84000) -- ignoring request.
[484703.619362] qla2xxx [:05:00.0]-00af:9: Performing ISP error
recovery - ha=8800ab7c4000.
[484704.459181] qla2xxx [:05:00.0]-500a:9: LOOP UP detected (4 Gbps).
[484704.508170] qla2xxx [:05:00.0]-0121:9: Failed to enable
receiving of RSCN requests: 0x2.
[484704.854664] qla2xxx [:05:00.0]-5003:9: ISP System Error -
mbx1=c19h mbx2=10h mbx3=0h mbx7=0h.
[484704.865014] qla2xxx [:05:00.0]-d007:9: Firmware has been
previously dumped (c90002b84000) -- ignoring request.
[484734.867554] qla2xxx [:05:00.0]-d007:9: Firmware has been
previously dumped (c90002b84000) -- ignoring request.
[484764.883993] qla2xxx [:05:00.0]-d007:9: Firmware has been
previously dumped (c90002b84000) -- ignoring request.
[484794.900464] qla2xxx [:05:00.0]-d007:9: Firmware has been
previously dumped (c90002b84000) -- ignoring request.
[484824.916954] qla2xxx [:05:00.0]-d007:9: Firmware has been
previously dumped (c90002b84000) -- ignoring request.
[484854.933415] qla2xxx [:05:00.0]-d007:9: Firmware has been
previously dumped (c90002b84000) -- ignoring request.
[484884.953887] qla2xxx [:05:00.0]-d007:9: Firmware has been
previously dumped (c90002b84000) -- ignoring request.
[484914.974377] qla2xxx [:05:00.0]-d007:9: Firmware has been
previously dumped (c90002b84000) -- ignoring request.
[484918.761483] INFO: task kworker/2:17:36759 blocked for more than 120
seconds.
[484918.778839]   Not tainted 4.2.0-0.bpo.1-amd64 #1
[484918.793941] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[484918.812578] kworker/2:17D 88042e855840 0 36759  2
0x
[484918.812597] Workqueue: qla_tgt_wq qlt_create_sess_from_atio [qla2xxx]
[484918.812607]  880108076500 0046 88009e473d80
880107cef040
[484918.812613]  0286 88009e474000 880426a5f9a4
880108076500
[484918.812624]   880426a5f9a8 0296
8154f26f
[484918.812626] Call Trace:
[484918.812632]  [] ? schedule+0x2f/0x70
[484918.812635]  [] ? schedule_preempt_disabled+0xe/0x20
[484918.812643]  [] ? __mutex_lock_slowpath+0x85/0x100
[484918.812649]  [] ? mutex_lock+0x1b/0x30
[484918.812659]  [] ?
qlt_create_sess_from_atio+0x12a/0x1c0 [qla2xxx]
[484918.812668]  [] ? process_one_work+0x14a/0x3d0
[484918.812671]  [] ? worker_thread+0x65/0x470
[484918.812675]  [] ? rescuer_thread+0x2f0/0x2f0
[484918.812677]  [] ? kthread+0xd3/0xf0
[484918.812680]  [] ? kthread_create_on_node+0x170/0x170
[484918.812684]  [] ? ret_from_fork+0x3f/0x70
[484918.812687]  [] ? kthread_create_on_node+0x170/0x170
[484944.994831] qla2xxx [:05:00.0]-d007:9: Firmware has been
previously dumped (c90002b84000) -- ignoring request.
[484975.019311] qla2xxx [:05:00.0]-d007:9: Firmware has been
previously dumped (c90002b84000) -- ignoring request.
[484975.559187] qla2xxx [:05:00.0]-00af:9: Performing ISP error
recovery - ha=8800ab7c4000.
[484976.430963] qla2xxx [:05:00.0]-500a:9: LOOP UP detected (4 Gbps).
[484976.448002] 
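
For anyone triaging a capture like the one above, it can help to see which driver messages repeat and how often. The following is only a minimal sketch, not part of the original report: the function name `tally_message_ids` and the regular expression are my own assumptions, based purely on the prefix format visible in this log (`qla2xxx [<pci address>]-<id>:<host>:`). It counts occurrences of each four-digit message ID.

```python
import re
from collections import Counter

# Assumed prefix format, taken from the log above, e.g.
# "qla2xxx [:05:00.0]-5003:9:" -> captures the message ID "5003".
MSG_ID = re.compile(r"qla2xxx \[[0-9a-fA-F:.]*\]-([0-9a-fA-F]{4}):\d+:")


def tally_message_ids(log_text):
    """Count how often each qla2xxx message ID occurs in a dmesg capture."""
    return Counter(MSG_ID.findall(log_text))


# A few lines copied from the capture above, used as sample input.
sample = """\
[484701.194971] qla2xxx [:05:00.0]-5003:9: ISP System Error -
mbx1=c19h mbx2=10h mbx3=0h mbx7=0h.
[484701.222021] qla2xxx [:05:00.0]-d001:9: Firmware dump saved to
temp buffer (9/c90002b84000), dump status flags (0x3f).
[484702.743687] qla2xxx [:05:00.0]-5003:9: ISP System Error -
mbx1=c19h mbx2=10h mbx3=0h mbx7=0h.
[484702.754050] qla2xxx [:05:00.0]-d007:9: Firmware has been
previously dumped (c90002b84000) -- ignoring request.
"""

counts = tally_message_ids(sample)
for msg_id, n in counts.most_common():
    print(msg_id, n)
```

Run against the full capture, this would make the pattern of repeated 5003 (ISP System Error) and d007 (dump already taken) messages easier to see at a glance.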
