ping timeouts, conn error (1011), reset received on connection using open iscsi and Dell EqualLogic
I'm running RHEL5 (kernel 2.6.18) in a VM under VMware ESX 4.0.0, build 261974. We have several VMs that don't have any problems, but they are all running some flavor of Windows. This is our only Linux box using a targeted iSCSI volume.

# yum info device-mapper-multipath
Loaded plugins: rhnplugin, security
Installed Packages
Name   : device-mapper-multipath
Arch   : x86_64
Version: 0.4.7
Release: 34.el5_5.6
Size   : 6.9 M
Repo   : installed

# yum info iscsi-initiator-utils
Loaded plugins: rhnplugin, security
Installed Packages
Name   : iscsi-initiator-utils
Arch   : x86_64
Version: 6.2.0.871
Release: 0.20.el5_5
Size   : 1.9 M
Repo   : installed

We are using multipath connections, and one of the paths frequently gets ping timeouts (usually about every 5-15 minutes), followed by a connection error that causes a problematic pause in IO while the multipathing recovers. The server is our campus web server and is running Apache, MySQL, and Joomla. I've been a sysadmin on other OSes for many years but am new to Linux server administration, so any help would be greatly appreciated. More details are below; if there is anything else I can provide that would be helpful, please let me know:

# dmesg
device-mapper: multipath: Failing path 8:32.
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4355241960, last ping 4355246960, now 4355251960
connection2:0: detected conn error (1011)
device-mapper: multipath: Failing path 8:32.
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4355898393, last ping 4355903393, now 4355908393
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4357043430, last ping 4357048430, now 4357053430
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4358334347, last ping 4358339347, now 4358344347
connection2:0: detected conn error (1011)
device-mapper: multipath: Failing path 8:32.
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 435899, last ping 4358916119, now 4358921119
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4360295386, last ping 4360300386, now 4360305386
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4360717303, last ping 4360722303, now 4360727303
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4361937494, last ping 4361942494, now 4361947494
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4363713740, last ping 4363718740, now 4363723740
connection2:0: detected conn error (1011)
device-mapper: multipath: Failing path 8:32.
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4364203932, last ping 4364208932, now 4364213932
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4365150602, last ping 4365155602, now 4365160602
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4365512045, last ping 4365517045, now 4365522045
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4365977352, last ping 4365982352, now 4365987352
connection2:0: detected conn error (1011)
connection2:0: detected conn error (1019)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4366877780, last ping 4366882780, now 4366887780
connection2:0: detected conn error (1011)
connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4367263655, last ping 4367268655, now 4367273655
connection2:0: detected conn error (1011)

# tail /var/log/messages
Nov 12 10:40:32 techcms kernel: connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4368811299, last ping 4368816299, now 4368821299
Nov 12 10:40:32 techcms kernel: connection2:0: detected conn error (1011)
Nov 12 10:40:33 techcms multipathd: sdc: readsector0 checker reports path is down
Nov 12 10:40:33 techcms multipathd: checker failed path 8:32 in map san-techcms
Nov 12 10:40:33 techcms multipathd: san-techcms: remaining active paths: 1
Nov 12 10:40:33 techcms multipathd: dm-2: add map (uevent)
Nov 12 10:40:33 techcms kernel: device-mapper: multipath: Failing path 8:32.
Nov 12 10:40:33 techcms multipathd: dm-2: devmap already registered
Nov 12 10:40:33 techcms iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
Nov 12 10:40:55 techcms multipathd: sdc: readsector0 checker reports path is up
Nov 12 10:40:55 techcms multipathd: 8:32: reinstated
Nov 12 10:40:55 techcms multipathd: san-techcms: remaining active paths: 2
Nov 12 10:40:55 techcms multipathd: dm-2: add map (uevent)
Nov 12 10
Re: ping timeouts, conn error (1011), reset received on connection using open iscsi and Dell EqualLogic
On 11/12/2010 01:54 PM, macmike wrote:
> I'm running RHEL5 kernel 2.6.18 in a VM under VMware ESX 4.0.0,

What version of RHEL 5 and what kernel (uname -a will tell you the kernel)? I think you just need to update the kernel. Some modules would mark an iSCSI ping/nopout/noop_out as failed too quickly.

# Time interval to wait for on connection before sending a ping.
node.conn[0].timeo.noop_out_interval = 5

# To specify the time to wait for a Nop-out response before failing
# the connection, edit this line. Failing the connection will
# cause IO to be failed back to the SCSI layer. If using dm-multipath
# this will cause the IO to be failed to the multipath layer.
node.conn[0].timeo.noop_out_timeout = 5

You also might want to increase these values. In one case it looked like it did take a while to get logged back in, so it might have been a valid problem. You can reset those values by setting them in iscsid.conf and then re-discovering your targets, or by running:

iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_interval -v your_new_value
iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_timeout -v your_new_value

--
You received this message because you are subscribed to the Google Groups open-iscsi group. To post to this group, send email to open-is...@googlegroups.com. To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.
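As a concrete illustration of the change Mike suggests: assuming the stock RHEL 5 layout, where the file lives at /etc/iscsi/iscsid.conf, and picking 15 seconds purely as an example value, the edit would look like this (settings apply to sessions discovered after the change):

```
# /etc/iscsi/iscsid.conf -- example values only, not a recommendation
# Send a ping only after 15s of silence instead of 5s ...
node.conn[0].timeo.noop_out_interval = 15
# ... and give the target 15s to answer the Nop-out before the
# connection is failed back to dm-multipath.
node.conn[0].timeo.noop_out_timeout = 15
```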
Re: ping timeouts
Ulrich Windl wrote:
> Hmm, I wonder: If every packet exchange for an iSCSI target re-triggers a timer that prevents NOPs from being sent/expected, the problem under heavy load should go away as well, right? I see little sense in sending extra NOP queries when there is heavy traffic for a target. (MHO)

Ach, if it were so easy. Yes, of course you are correct. And yes, we shouldn't be sending NOPs when under high load. In fact I hooked into the tcp_data_ready() callback to make sure the timer is updated whenever a new skb is available from the network stack, and even in the inner while() loops in the xmit and recv code paths. The problem is that I'm _still_ seeing the occasional nopouts.

I did some analysis and printed the MIB statistics (i.e. octets sent/received) while the NOP was active. And it really _looks_ as if there is no traffic on this connection during that time. What's more worrying, apparently we're sending a NOP and not getting a reply.

I'm still fighting to get a wireshark trace, but dumping two _heavily_ loaded connections gives you massive traces. So far the disk space has run out before I got anything interesting captured :-(

More updates to follow.

Cheers,
Hannes
--
Dr. Hannes Reinecke            zSeries Storage
h...@suse.de                   +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
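The timer-refresh idea Hannes describes can be sketched as a small userspace analogue (the struct and function names here are illustrative, not the actual open-iscsi symbols): any received data refreshes a last-rx timestamp, and a NOP-Out is only considered due once the connection has been idle for a full noop interval.

```c
#include <stdint.h>

/* Illustrative userspace analogue of the NOP scheduling described
 * above; names are made up and do not match the kernel code. */
struct conn {
	uint64_t last_rx;       /* timestamp of last received data */
	uint64_t noop_interval; /* idle time before a NOP-Out is sent */
};

/* Hooked from the data-ready path: any incoming traffic counts as
 * proof of life and postpones the next NOP-Out. */
static void conn_data_ready(struct conn *c, uint64_t now)
{
	c->last_rx = now;
}

/* Timer callback: a NOP-Out is due only on a fully idle connection. */
static int nop_due(const struct conn *c, uint64_t now)
{
	return now - c->last_rx >= c->noop_interval;
}
```

With this structure, heavy traffic keeps pushing last_rx forward and no NOP should ever be scheduled, which is exactly why the nopouts Hannes still sees on an apparently idle connection are puzzling.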
Re: ping timeouts
Ulrich Windl wrote:
> Hmm, I wonder: If every packet exchange for an iSCSI target re-triggers a timer that prevents NOPs from being sent/expected, the problem under heavy load should go away as well, right? I see little sense in sending extra NOP queries when there is heavy traffic for a target. (MHO)

But the point is that there _is_ no traffic at that point. However, after some heavy instrumenting I found this:

[ .. ]
[ 2666.376858] connection2:0: mgmtpdu [itt 0xa05 p 8100795675c0] delayed, cmd 0 mgmt 0 req 0 exp 1913193 max 1913177 queued 1913194

The 'cmd', 'mgmt', and 'req' parameters are just !list_empty(cmdqueue), !list_empty(mgmtqueue), and !list_empty(requeue). (I said I'm running on an older code base.) So at this point there are _no_ requests queued. Really curious.

Cheers,
Hannes
Re: ping timeouts
Hannes Reinecke wrote:
> Ulrich Windl wrote:
>> Hmm, I wonder: If every packet exchange for an iSCSI target re-triggers a timer that prevents NOPs from being sent/expected, the problem under heavy load should go away as well, right? I see little sense in sending extra NOP queries when there is heavy traffic for a target. (MHO)
>
> But the point is that there _is_ no traffic at that point. However, after some heavy instrumenting I found this:
>
> [ .. ]
> [ 2666.376858] connection2:0: mgmtpdu [itt 0xa05 p 8100795675c0] delayed, cmd 0 mgmt 0 req 0 exp 1913193 max 1913177 queued 1913194
>
> The 'cmd', 'mgmt', and 'req' parameters are just !list_empty(cmdqueue), !list_empty(mgmtqueue), and !list_empty(requeue). (I said I'm running on an older code base.) So at this point there are _no_ requests queued. Really curious.

Ah-ha:

Jul 21 09:50:01 esk kernel: [ 456.807621] connection2:0: Sending nopout, cmd 0 mgmt 0 req 0 exp 15661 max 15598 queued 15599

(queued is actually now session->cmdsn, as MikeC suggested). So it's not ExpCmdSN which is the problem here, it's CmdSN. We're actually sending a NOP-Out with a CmdSN _larger_ than MaxCmdSN, which is happily ignored by the target. And looking further, we indeed never check in the xmit code path whether CmdSN might exceed MaxCmdSN, in which case we really should wait until we've got an update for MaxCmdSN. Checking.

Cheers,
Hannes
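The missing xmit-path check Hannes identifies could be sketched as follows. This is a hypothetical standalone illustration (not the actual open-iscsi code) of the RFC 3720 rule that a non-immediate PDU's CmdSN must lie in the target's window ExpCmdSN..MaxCmdSN inclusive, compared with wraparound-safe serial-number arithmetic:

```c
#include <stdint.h>

/* Serial-number "a <= b" for 32-bit sequence numbers, so the check
 * keeps working when CmdSN wraps around 2^32. */
static int sna_lte(uint32_t a, uint32_t b)
{
	return a == b || (int32_t)(a - b) < 0;
}

/* Nonzero if cmdsn may be transmitted now; zero means the command
 * window is closed and the PDU should be held back until the target
 * advances MaxCmdSN. */
static int cmdsn_in_window(uint32_t cmdsn, uint32_t exp_cmdsn,
			   uint32_t max_cmdsn)
{
	return sna_lte(exp_cmdsn, cmdsn) && sna_lte(cmdsn, max_cmdsn);
}
```

With the values from the trace above (exp 1913193, max 1913177, queued 1913194) this check reports a closed window, so the NOP-Out would have been deferred instead of silently ignored by the target.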
ping timeouts
Hi all,

we do have a problem with NOP processing. Under heavy I/O, the connection starts spitting out 'ping timeout' error messages and resetting the connection. There has been _no_ error from the target side nor from the connection, so it's us, sadly. However, after some heavy instrumenting I found this:

[ 2664.456270] connection2:0: Sending nopout exp 1913193 max 1913177 queued 1913194
[ 2664.462592] connection2:0: mgmtpdu [itt xa05 p 8100795675c0] queued
[ 2584.459166] connection2:0: mgmtpdu [op 0x0 hdr-itt 0xa05 datalen 0]
[ 2584.466259] connection2:0: mgmtpdu [itt 0xa05 p 8100795675c0] done
[ 2585.232044] connection2:0: mgmtpdu [op 0x2 hdr-itt 0xa06 datalen 0]
[ 2585.235410] connection2:0: mgmtpdu [itt 0xa06 p 810079567540] done
[ 2666.376858] connection2:0: mgmtpdu [itt 0xa05 p 8100795675c0] delayed, cmd 0 mgmt 0 req 0 exp 1913193 max 1913177 queued 1913194
[ 2667.169094] connection2:0: mgmtpdu [itt 0xa05 p 8100795675c0] delayed, cmd 0 mgmt 0 req 0 exp 1913193 max 1913177 queued 1913194
[ 2667.179683] connection2:0: Sending nopout, cmd 0 mgmt 0 req 0 exp 1913193 max 1913177 queued 1913194
[ 2669.092002] connection2:0: mgmtpdu [itt 0xa05 p 8100795675c0] delayed, cmd 0 mgmt 0 req 0 exp 1913193 max 1913177 queued 1913194
[ 2669.093935] connection2:0: ping timeout of 5 secs expired, stat 7880612636/308802796/2 state 1/-150303960
[ 2669.097568] connection2:0: iscsi: detected conn error (1011)
[ 2669.201246] session2: blocking session
[ 2669.206148] connection2:0: mgmtpdu [itt xa05 p 8100795675c0] finished

(ignore the clock skew, it's a multicore TSC problem ...). The numbers are ExpCmdSN, MaxCmdSN, and the queued CmdSN. So, as you can see, we're for some reason sending a NOP-Out with CmdSN 1913194, ExpCmdSN 1913193, and MaxCmdSN 1913177. But reading through RFC 3720, I found this:

   The target MUST NOT transmit a MaxCmdSN that is less than
   ExpCmdSN-1. For non-immediate commands, the CmdSN field can take any
   value from ExpCmdSN to MaxCmdSN inclusive. The target MUST silently
   ignore any non-immediate command outside of this range or
   non-immediate duplicates within the range. The CmdSN carried by
   immediate commands may lie outside the ExpCmdSN to MaxCmdSN range.
   For example, if the initiator has previously sent a non-immediate
   command carrying the CmdSN equal to MaxCmdSN, the target window is
   closed. For group task management commands issued as immediate
   commands, CmdSN indicates the scope of the group action (e.g., on
   ABORT TASK SET indicates which commands are aborted).

So no wonder we're never seeing any replies to that one. The question remains, though, why we're starting to send PDUs with a CmdSN outside the MaxCmdSN window ...

Apologies, as I'm running on an older codebase, but the same error / 'behaviour' is present in mainline as well.

Cheers,
Hannes
Re: ping timeouts
On 07/20/2009 12:16 PM, Mike Christie wrote:
> hdr->cmd_sn = session->cmdsn++;
>
> so after we send a hdr with the cmdsn we incremented the session

Not "send". I meant to write "prepped" above.

> tracking one and so the tracking one is going to be one higher than what we are actually sending.
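Mike's point, reduced to a standalone sketch (the struct names here are hypothetical, not the real open-iscsi ones), is that the post-increment leaves the session-wide counter one ahead of the CmdSN the prepped PDU actually carries:

```c
#include <stdint.h>

/* Hypothetical miniature of the bookkeeping described above; these
 * are not the actual open-iscsi structures. */
struct session { uint32_t cmdsn; };
struct pdu { uint32_t cmd_sn; };

static void prep_pdu(struct session *s, struct pdu *p)
{
	/* The PDU carries the current CmdSN; after the post-increment
	 * the session counter already names the *next* CmdSN. */
	p->cmd_sn = s->cmdsn++;
}
```

That matches the trace earlier in the thread: "queued 15599" is the tracking counter, one higher than the CmdSN 15598 carried by the last prepped PDU.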