Re: Deconnection Problem with netapp

Mike Christie Wed, 23 Nov 2011 11:12:05 -0800

In the log below it seems that we just do not get any IO for 5 secs, so
the initiator sends a nop as a ping. We do not get a response for 5
secs. So the initiator is just not seeing any data get through.


Is the network trace you sent for the same test run as the log snippet
is for? Is it possible that the nop-out/ping that times out is the one
in packet 361? In there we also see that READ not get completed correctly.

If the initiator is not causing the RST in packet 345, and if we are
missing a kernel network event we might be sending the nop/ping that
times out after the RST and we see the log or something like it below.

I made this kernel:

http://people.redhat.com/mchristi/kernel-2.6.18-274.el5.iscsidebug1.i686.rpm

http://people.redhat.com/mchristi/kernel-PAE-2.6.18-274.el5.iscsidebug1.i686.rpm

http://people.redhat.com/mchristi/kernel-2.6.18-274.el5.iscsidebug1.x86_64.rpm

It is the rhel 5.7 kernel with some extra debugging output in the
function that handles state changes for the socket and in the iscsi eh
code. It could tell us if we are mishandling a event vs we are causing
the RST. It will not tell
us what is initiating the RST if it is not us, but it will hopefully
tell us if the
iscsi layer is or if it possibly something else in the kernel.

Just run it like normal. You do not need any debugging on.



On 11/21/2011 08:46 AM, Amaury FRANCOIS wrote:
> Hi, here is the output of /var/log/messages :
> 
> Nov 21 10:32:21 bdd4a kernel:  session1: iscsi_free_task freeing task
> itt 0xb state 1 sc 0000000000000000
> Nov 21 10:32:21 bdd4a kernel:  connection1:0:
> iscsi_check_transport_timeouts Setting next tmo 4790025113
> Nov 21 10:32:25 bdd4a kernel:  connection2:0:
> iscsi_check_transport_timeouts Sending nopout as ping
> Nov 21 10:32:25 bdd4a kernel:  connection2:0:
> iscsi_check_transport_timeouts Setting next tmo 4790029108
> Nov 21 10:32:25 bdd4a kernel:  session2: iscsi_prep_mgmt_task mgmtpdu
> [op 0x0 hdr->itt 0x3000001d datalen 0]
> Nov 21 10:32:26 bdd4a kernel:  connection1:0:
> iscsi_check_transport_timeouts Sending nopout as ping
> Nov 21 10:32:26 bdd4a kernel:  connection1:0:
> iscsi_check_transport_timeouts Setting next tmo 4790030113
> Nov 21 10:32:26 bdd4a kernel:  session1: iscsi_prep_mgmt_task mgmtpdu
> [op 0x0 hdr->itt 0x10000009 datalen 0]
> Nov 21 10:32:26 bdd4a kernel:  session1: __iscsi2_complete_pdu [op
> 0x20 cid 0 itt 0x9 len 0]
> Nov 21 10:32:26 bdd4a kernel:  session1: iscsi_complete_task complete
> task itt 0x9 state 3 sc 0000000000000000
> Nov 21 10:32:26 bdd4a kernel:  session1: iscsi_free_task freeing task
> itt 0x9 state 1 sc 0000000000000000
> Nov 21 10:32:26 bdd4a kernel:  connection1:0:
> iscsi_check_transport_timeouts Setting next tmo 4790030113
> Nov 21 10:32:30 bdd4a kernel:  connection2:0: ping timeout of 5 secs
> expired, recv timeout 5, last rx 4790019108, last ping 4790024108, now
> 4790029108
> Nov 21 10:32:30 bdd4a kernel:  connection2:0: detected conn error
> (1011)
> Nov 21 10:32:30 bdd4a kernel:  session2: iscsi2_queuecommand iscsi:
> cmd 0x28 is not queued (7)
> Nov 21 10:32:30 bdd4a last message repeated 2287 times
> Nov 21 10:32:30 bdd4a kernel:  session2: iscsi2_queueued (7)
> Nov 21 10:32:30 bdd4a kernel:  session2: iscsi2_queued (7)
> Nov 21 10:32:30 bdd4a kernel:  session2: iscsi2_queuecomued (7)
> Nov 21 10:32:30 bdd4a kernel:  session2: iscsi2_queueued (7)
> Nov 21 10:32:30 bdd4a kernel:  session2: iscsi2_qued (7)
> Nov 21 10:32:30 bdd4a kernel:  session2: iscsi2_qued (7)
> Nov 21 10:32:30 bdd4a kernel:  session2: iscsi2_queuued (7)
> Nov 21 10:32:30 bdd4a kernel:  session2: iscsi2_queued (7)
> Nov 21 10:32:30 bdd4a kernel:  session2: iscsi2_queuued (7)
> .............
> And so on
> .............
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi2_queuecommued (7)
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi2_queued (7)
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi2_ued (7)
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi2_queuecued (7)
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsiued (7)
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi2_quued (7)
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi2_queuued (7)
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi2_queuecon2:
> iscsi2_queuecommand iscsi: cmd 0x28 is not queued (7)
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi2_queuecommand iscsi:
> cmd 0x28 is not queued (7)
> Nov 21 10:32:31 bdd4a last message repeated 7702 times
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi_start_session_recovery
> blocking session
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi2_queuecommand iscsi:
> cmd 0x28 is not queued (7)
> Nov 21 10:32:31 bdd4a kernel:  session2: fail_mgmt_tasks failing mgmt
> itt 0x1d state 3
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi_complete_task complete
> task itt 0x1d state 3 sc 0000000000000000
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi_free_task freeing task
> itt 0x1d state 5 sc 0000000000000000
> Nov 21 10:32:31 bdd4a kernel:  session2: iscsi2_queuecommand iscsi:
> cmd 0x28 is not queued (786432)
> Nov 21 10:32:31 bdd4a iscsid: Kernel reported iSCSI connection 2:0
> error (1011) state (3)
> Nov 21 10:32:31 bdd4a kernel:  connection1:0:
> iscsi_check_transport_timeouts Sending nopout as ping
> Nov 21 10:32:31 bdd4a kernel:  connection1:0:
> iscsi_check_transport_timeouts Setting next tmo 4790035113
> Nov 21 10:32:31 bdd4a kernel:  session1: iscsi_prep_mgmt_task mgmtpdu
> [op 0x0 hdr->itt 0x1000000c datalen 0]
> Nov 21 10:32:31 bdd4a kernel:  session1: __iscsi2_complete_pdu [op
> 0x20 cid 0 itt 0xc len 0]
> Nov 21 10:32:31 bdd4a kernel:  session1: iscsi_complete_task complete
> task itt 0xc state 3 sc 0000000000000000
> Nov 21 10:32:31 bdd4a kernel:  session1: iscsi_free_task freeing task
> itt 0xc state 1 sc 0000000000000000
> Nov 21 10:32:31 bdd4a kernel:  connection1:0:
> iscsi_check_transport_timeouts Setting next tmo 4790035113
> Nov 21 10:32:32 bdd4a kernel:  session1: iscsi_prep_scsi_cmd_pdu iscsi
> prep [read cid 0 sc ffff8105860860c0 cdb 0x28 itt 0xd len 4096
> bidi_len 0 cmdsn 50514 win 256]
> Nov 21 10:32:32 bdd4a kernel:  session1: __iscsi2_complete_pdu [op
> 0x25 cid 0 itt 0xd len 0]
> Nov 21 10:32:32 bdd4a kernel:  session1: iscsi_data_in_rsp data in
> with status done [sc ffff8105860860c0 res 0 itt 0xd]
> Nov 21 10:32:32 bdd4a kernel:  session1: iscsi_complete_task complete
> task itt 0xd state 3 sc ffff8105860860c0
> Nov 21 10:32:32 bdd4a kernel:  session1: iscsi_free_task freeing task
> itt 0xd state 1 sc ffff8105860860c0
> Nov 21 10:32:32 bdd4a kernel:  session1: iscsi_prep_scsi_cmd_pdu iscsi
> prep [read cid 0 sc ffff81057d7113c0 cdb 0xc0 itt 0xe len 256 bidi_len
> 0 cmdsn 50515 win 256]
> Nov 21 10:32:32 bdd4a kernel:  session1: __iscsi2_complete_pdu [op
> 0x21 cid 0 itt 0xe len 0]
> Nov 21 10:32:32 bdd4a kernel:  session1: iscsi_scsi_cmd_rsp cmd rsp
> done [sc ffff81057d7113c0 res 0 itt 0xe]
> Nov 21 10:32:32 bdd4a kernel:  session1: iscsi_complete_task complete
> task itt 0xe state 3 sc ffff81057d7113c0
> Nov 21 10:32:32 bdd4a kernel:  session1: iscsi_free_task freeing task
> itt 0xe state 1 sc ffff81057d7113c0
> Nov 21 10:32:36 bdd4a kernel:  session2: session recovery timed out
> after 5 secs
> Nov 21 10:32:36 bdd4a kernel:  session2: iscsi2_queuecommand iscsi:
> cmd 0x28 is not queued (983040)
> Nov 21 10:32:36 bdd4a kernel: sd 3:0:0:0: SCSI error: return code =
> 0x000f0000
> Nov 21 10:32:36 bdd4a kernel: end_request: I/O error, dev sdb, sector
> 0
> Nov 21 10:32:36 bdd4a kernel:  connection1:0:
> iscsi_check_transport_timeouts Setting next tmo 4790036132
> Nov 21 10:32:36 bdd4a kernel: device-mapper: multipath: Failing path
> 8:16.
> Nov 21 10:32:36 bdd4a multipathd: sdb: directio checker reports path
> is down
> Nov 21 10:32:36 bdd4a multipathd: checker failed path 8:16 in map
> quorum
> Nov 21 10:32:36 bdd4a multipathd: quorum: remaining active paths: 1
> Nov 21 10:32:36 bdd4a multipathd: dm-8: add map (uevent)
> Nov 21 10:32:36 bdd4a multipathd: dm-8: devmap already registered
> 
> On Nov 18, 9:06 pm, Mike Christie <micha...@cs.wisc.edu> wrote:
>> On 11/16/2011 09:58 AM, Amaury FRANCOIS wrote:
>>
>>> Hi,
>>
>>> I've configured iscsi on RHEL 5.7 (last kernel) with a netapp disk
>>> array and I'm seeing connection problems every 20 minutes (the
>>> connection is dropped and comes back a few minutes or seconds later).
>>
>>> Here are the messages in /var/log/messages :
>>
>>> Nov  6 04:29:01 bdd4a kernel:  connection1:0: ping timeout of 5 secs
>>> expired, recv timeout 5, last rx 4356175664, last ping 4356180664, now
>>> 4356185664
>>
>> This means that we did not see any data/pdus from the target for 5
>> seconds (node.conn[0].timeo.noop_out_interval), so we sent a iscsi ping
>> to make sure the target was still up. The target did not respond to the
>> iscsi ping in 5 seconds (node.conn[0].timeo.noop_out_timeout), so we
>> assumed there was a connection problem. The initiator will then drop the
>> tcp/ip connection and try to reconnect and relogin and restart IO.
>>
>>> Nov  6 04:29:01 bdd4a kernel:  connection1:0: detected conn error
>>> (1011)
>>> Nov  6 04:29:02 bdd4a iscsid: Kernel reported iSCSI connection 1:0
>>> error (1011) state (3)
>>> Nov  6 04:29:07 bdd4a kernel:  session1: session recovery timed out
>>> after 5 secs
>>
>> We tried to relogin/reconnect for 5 seconds
>> (node.session.timeo.replacement_timeout)  but could not, so the
>> initiator was instructed to fail IO.
>>
>> It then looks like we are able to log back in around 30 seconds later.
>> When this happens it normally is a temporary issue with the network. For
>> some reason we just cannot reconnect to the target for a long'ish (30
>> secs) time.
>>
>>
>>
>>> I've also traced the network with tcpdump and analyzed it with
>>> wireshark :
>>
>>> ARP request about the server from Netapp
>>> ARP response from the server -> the problem always seem to happen
>>> after an ARP request from the netapp array
>>> RST sent from the server on the iscsi connection -> !!!!!
>>
>>> All NOP in, NOP out seemed to be OK just before, any idea why this
>>> problem happen ?
>>
>> So in the network trace you can see the NOP get sent and its response
>> before the tcp disconnection? If so then that might be a bug in the
>> iscsi layer not processing it right. From the log above it looks like we
>> did not get a response.
>>
>> Can you run the initiator in debug mode? It will lead to lots and lots
>> of log output.
>>
>> echo 1 > /sys/module/libiscsi2/parameters/debug_libiscsi
>>
>> And could you also send your net trace if you have it.
> 

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.

Re: Deconnection Problem with netapp

Reply via email to