ping timeouts, conn error (1011), reset received on connection using open-iscsi and Dell EqualLogic

2010-11-13 Thread macmike
I'm running a RHEL5 2.6.18 kernel in a VM under VMware ESX 4.0.0
(build 261974). We have several VMs that don't have any problems, but
they are all running some flavor of Windows. This is our only Linux
box using an iSCSI target volume.

# yum info device-mapper-multipath
Loaded plugins: rhnplugin, security
Installed Packages
Name   : device-mapper-multipath
Arch   : x86_64
Version: 0.4.7
Release: 34.el5_5.6
Size   : 6.9 M
Repo   : installed

# yum info iscsi-initiator-utils
Loaded plugins: rhnplugin, security
Installed Packages
Name   : iscsi-initiator-utils
Arch   : x86_64
Version: 6.2.0.871
Release: 0.20.el5_5
Size   : 1.9 M
Repo   : installed

We are using multipathed connections, and one of the paths frequently
gets ping timeouts (usually about every 5 to 15 minutes) followed by
the connection error, which causes a problematic pause in I/O while
multipathing recovers.
The server is our campus web server and is running Apache, MySQL, and
Joomla.

I've been a sysadmin on other OSes for many years but am new to Linux
server administration; any help would be greatly appreciated.

Below are more details. If there is anything else I can provide that
would be helpful, please let me know:


# dmesg
device-mapper: multipath: Failing path 8:32.
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4355241960, last ping 4355246960, now 4355251960
 connection2:0: detected conn error (1011)
device-mapper: multipath: Failing path 8:32.
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4355898393, last ping 4355903393, now 4355908393
 connection2:0: detected conn error (1011)
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4357043430, last ping 4357048430, now 4357053430
 connection2:0: detected conn error (1011)
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4358334347, last ping 4358339347, now 4358344347
 connection2:0: detected conn error (1011)
device-mapper: multipath: Failing path 8:32.
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 435899, last ping 4358916119, now 4358921119
 connection2:0: detected conn error (1011)
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4360295386, last ping 4360300386, now 4360305386
 connection2:0: detected conn error (1011)
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4360717303, last ping 4360722303, now 4360727303
 connection2:0: detected conn error (1011)
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4361937494, last ping 4361942494, now 4361947494
 connection2:0: detected conn error (1011)
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4363713740, last ping 4363718740, now 4363723740
 connection2:0: detected conn error (1011)
device-mapper: multipath: Failing path 8:32.
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4364203932, last ping 4364208932, now 4364213932
 connection2:0: detected conn error (1011)
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4365150602, last ping 4365155602, now 4365160602
 connection2:0: detected conn error (1011)
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4365512045, last ping 4365517045, now 4365522045
 connection2:0: detected conn error (1011)
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4365977352, last ping 4365982352, now 4365987352
 connection2:0: detected conn error (1011)
 connection2:0: detected conn error (1019)
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4366877780, last ping 4366882780, now 4366887780
 connection2:0: detected conn error (1011)
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last
rx 4367263655, last ping 4367268655, now 4367273655
 connection2:0: detected conn error (1011)

# tail /var/log/messages

Nov 12 10:40:32 techcms kernel:  connection2:0: ping timeout of 5 secs
expired, recv timeout 5, last rx 4368811299, last ping 4368816299, now
4368821299
Nov 12 10:40:32 techcms kernel:  connection2:0: detected conn error
(1011)
Nov 12 10:40:33 techcms multipathd: sdc: readsector0 checker reports
path is down
Nov 12 10:40:33 techcms multipathd: checker failed path 8:32 in map
san-techcms
Nov 12 10:40:33 techcms multipathd: san-techcms: remaining active
paths: 1
Nov 12 10:40:33 techcms multipathd: dm-2: add map (uevent)
Nov 12 10:40:33 techcms kernel: device-mapper: multipath: Failing path
8:32.
Nov 12 10:40:33 techcms multipathd: dm-2: devmap already registered
Nov 12 10:40:33 techcms iscsid: Kernel reported iSCSI connection 2:0
error (1011) state (3)
Nov 12 10:40:55 techcms multipathd: sdc: readsector0 checker reports
path is up
Nov 12 10:40:55 techcms multipathd: 8:32: reinstated
Nov 12 10:40:55 techcms multipathd: san-techcms: remaining active
paths: 2
Nov 12 10:40:55 techcms multipathd: dm-2: add map (uevent)
Nov 12 10

Re: ping timeouts, conn error (1011), reset received on connection using open-iscsi and Dell EqualLogic

2010-11-13 Thread Mike Christie

On 11/12/2010 01:54 PM, macmike wrote:

I'm running RHEL5 kernel 2.6.18 in a VM under VMware ESX 4.0.0,


What version of RHEL5 and what kernel (uname -a will tell you the
kernel)? I think you just need to update the kernel. Some modules
would mark an iSCSI ping (nopout/noop_out) as failed too quickly.




# Time interval to wait for on connection before sending a ping.
node.conn[0].timeo.noop_out_interval = 5

# To specify the time to wait for a Nop-out response before failing
# the connection, edit this line. Failing the connection will
# cause IO to be failed back to the SCSI layer. If using dm-multipath
# this will cause the IO to be failed to the multipath layer.
node.conn[0].timeo.noop_out_timeout = 5


You might also want to increase these values. In one case it looked
like it took a while to get logged back in, so it might have been a
valid problem.


You can reset those values by setting them in iscsid.conf and then
rediscovering your targets, or by running:


iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_interval -v your_new_value
iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_timeout -v your_new_value





Re: ping timeouts

2009-07-21 Thread Hannes Reinecke

Ulrich Windl wrote:
 Hmm, I wonder: If every packet exchange for an iSCSI target
 re-triggers a timer that prevents NOPs from being sent/expected, the
 problem under heavy load should go away as well, right? I see little
 sense to send extra NOP queries when there is heavy traffic for a
 target. (MHO)
 
Ach, if it were so easy.
Yes, of course you are correct. And yes, we shouldn't be sending NOPs
when under high load. In fact I hooked into the tcp_data_ready()
callback to make sure the timer is updated whenever a new skb is
available from the network stack, and even inside the while() loops in
the xmit and recv code paths.
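
For illustration, a minimal sketch of that hook; the names follow the
iscsi_tcp/libiscsi style (sk_user_data assumed to hold the connection
pointer, conn->last_recv as the activity timestamp), but this is not
the exact patch:

static void iscsi_sw_tcp_data_ready(struct sock *sk, int bytes)
{
	struct iscsi_conn *conn = sk->sk_user_data;

	/* A new skb is available from the network stack: record the
	 * receive activity so the NOP/ping timer sees real traffic. */
	conn->last_recv = jiffies;

	/* ... then hand the data off to the normal iSCSI recv path ... */
}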

The problem is that I'm _still_ seeing the occasional nopouts. I did
some analysis and printed the MIB statistics (i.e. octets sent and
received) while the NOP was active. And it really _looks_ as if there
is no traffic on this connection during that time. What's more
worrying, apparently we're sending a NOP and not getting a reply.

I'm still fighting to get a wireshark trace, but dumping two _heavily_
loaded connections produces massive traces. So far disk space has run
out before I captured anything interesting in them :-(

More updates to follow.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)




Re: ping timeouts

2009-07-21 Thread Hannes Reinecke

Ulrich Windl wrote:
 Hmm, I wonder: If every packet exchange for an iSCSI target
 re-triggers a timer that prevents NOPs from being sent/expected, the
 problem under heavy load should go away as well, right? I see little
 sense to send extra NOP queries when there is heavy traffic for a
 target. (MHO)
 
But the point is there _is_ no traffic at that point.


 However, after some heavy instrumenting I found this:
[ .. ]

 [ 2666.376858]  connection2:0: mgmtpdu [itt 0xa05 p 8100795675c0] delayed, cmd 0 mgmt 0 req 0 exp 1913193 max 1913177 queued 1913194

The 'cmd', 'mgmt', and 'req' parameters are just !list_empty(cmdqueue), 
!list_empty(mgmtqueue), !list_empty(requeue).
(I said I'm running on an older code base).
So at this point there are _no_ requests queued.
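
For reference, the instrumented print presumably looked roughly like
this (a reconstruction, not the actual debug patch; the queue names
match libiscsi):

printk(KERN_DEBUG "mgmtpdu [itt 0x%x p %p] delayed, "
       "cmd %d mgmt %d req %d exp %u max %u queued %u\n",
       task->itt, task,
       !list_empty(&conn->cmdqueue),	/* cmd  */
       !list_empty(&conn->mgmtqueue),	/* mgmt */
       !list_empty(&conn->requeue),	/* req  */
       session->exp_cmdsn, session->max_cmdsn, session->cmdsn);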

Really curious.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)




Re: ping timeouts

2009-07-21 Thread Hannes Reinecke

Hannes Reinecke wrote:
 Ulrich Windl wrote:
  Hmm, I wonder: If every packet exchange for an iSCSI target
  re-triggers a timer that prevents NOPs from being sent/expected, the
  problem under heavy load should go away as well, right? I see little
  sense to send extra NOP queries when there is heavy traffic for a
  target. (MHO)

 But the point is there _is_ no traffic at that point.
 
 However, after some heavy instrumenting I found this:
 [ .. ]
 [ 2666.376858]  connection2:0: mgmtpdu [itt 0xa05 p 8100795675c0] delayed, cmd 0 mgmt 0 req 0 exp 1913193 max 1913177 queued 1913194

 The 'cmd', 'mgmt', and 'req' parameters are just !list_empty(cmdqueue), 
 !list_empty(mgmtqueue), !list_empty(requeue).
 (I said I'm running on an older code base).
 So at this point there are _no_ requests queued.
 
 Really curious.
 
Ah-ha:

Jul 21 09:50:01 esk kernel: [  456.807621]  connection2:0: Sending nopout, cmd 0 mgmt 0 req 0 exp 15661 max 15598 queued 15599

(queued is actually now session->cmdsn, as MikeC suggested).
So it's not ExpCmdSN that is the problem here, it's CmdSN.
We're actually sending a NOP Out with a CmdSN _larger_ than MaxCmdSN,
which is happily ignored by the target.

And looking further, we indeed never check in the xmit code path
whether CmdSN might exceed MaxCmdSN; in that case we really should
wait until we've got an update for MaxCmdSN.
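
Something like the following check in the xmit path would close the
hole (a sketch using serial number arithmetic so CmdSN wrap-around is
handled; the helper and field names are illustrative, not the actual
fix):

/* n1 <= n2 in 32-bit serial number arithmetic (RFC 1982) */
static inline int sna_lte(u32 n1, u32 n2)
{
	return n1 == n2 || (s32)(n1 - n2) < 0;
}

/* before queueing a non-immediate PDU: if CmdSN would fall outside
 * the ExpCmdSN..MaxCmdSN window, hold it until MaxCmdSN is updated */
if (!sna_lte(session->cmdsn, session->max_cmdsn))
	return -EAGAIN;	/* window closed; retry after the next update */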

Checking.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)




ping timeouts

2009-07-20 Thread Hannes Reinecke

Hi all,

we do have a problem with NOP processing. Under heavy I/O, the
connection starts spitting out 'ping timeout' error messages and
resetting the connection.
There has been _no_ error from the target side nor on the connection,
so it's us, sadly.

However, after some heavy instrumenting I found this:
[ 2664.456270]  connection2:0: Sending nopout exp 1913193 max 1913177 queued 1913194
[ 2664.462592]  connection2:0: mgmtpdu [itt xa05 p 8100795675c0] queued
[ 2584.459166]  connection2:0: mgmtpdu [op 0x0 hdr->itt 0xa05 datalen 0]
[ 2584.466259]  connection2:0: mgmtpdu [itt 0xa05 p 8100795675c0] done
[ 2585.232044]  connection2:0: mgmtpdu [op 0x2 hdr->itt 0xa06 datalen 0]
[ 2585.235410]  connection2:0: mgmtpdu [itt 0xa06 p 810079567540] done
[ 2666.376858]  connection2:0: mgmtpdu [itt 0xa05 p 8100795675c0] delayed, cmd 0 mgmt 0 req 0 exp 1913193 max 1913177 queued 1913194
[ 2667.169094]  connection2:0: mgmtpdu [itt 0xa05 p 8100795675c0] delayed, cmd 0 mgmt 0 req 0 exp 1913193 max 1913177 queued 1913194
[ 2667.179683]  connection2:0: Sending nopout, cmd 0 mgmt 0 req 0 exp 1913193 max 1913177 queued 1913194
[ 2669.092002]  connection2:0: mgmtpdu [itt 0xa05 p 8100795675c0] delayed, cmd 0 mgmt 0 req 0 exp 1913193 max 1913177 queued 1913194
[ 2669.093935]  connection2:0: ping timeout of 5 secs expired, stat 7880612636/308802796/2 state 1/-150303960
[ 2669.097568]  connection2:0: iscsi: detected conn error (1011)
[ 2669.201246]  session2: blocking session
[ 2669.206148]  connection2:0: mgmtpdu [itt xa05 p 8100795675c0] finished


(ignore the clock skew, it's a multicore TSC problem ...).

The numbers are ExpCmdSN, MaxCmdSN, and the queued CmdSN.
So, as you can see, we're for some reason sending a NOP-Out with CmdSN
1913194, ExpCmdSN 1913193, and MaxCmdSN 1913177.
But reading through RFC 3720, I found this:

   The target MUST NOT transmit a MaxCmdSN that is less than
   ExpCmdSN-1.  For non-immediate commands, the CmdSN field can take any
   value from ExpCmdSN to MaxCmdSN inclusive.  The target MUST silently
   ignore any non-immediate command outside of this range or non-
   immediate duplicates within the range.  The CmdSN carried by
   immediate commands may lie outside the ExpCmdSN to MaxCmdSN range.
   For example, if the initiator has previously sent a non-immediate
   command carrying the CmdSN equal to MaxCmdSN, the target window is
   closed.  For group task management commands issued as immediate
   commands, CmdSN indicates the scope of the group action (e.g., on
   ABORT TASK SET indicates which commands are aborted).

So no wonder we're never seeing any replies to that one.
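
To make the window arithmetic concrete, here is a small standalone C
check using the numbers from the trace above (a sketch, assuming the
RFC 1982-style serial number comparison the initiator uses):

#include <stdio.h>
#include <stdint.h>

/* n1 <= n2 in 32-bit serial number arithmetic (RFC 1982) */
static int sna_lte(uint32_t n1, uint32_t n2)
{
	return n1 == n2 || (int32_t)(n1 - n2) < 0;
}

int main(void)
{
	uint32_t exp_cmdsn = 1913193;	/* ExpCmdSN from the trace */
	uint32_t max_cmdsn = 1913177;	/* MaxCmdSN from the trace */
	uint32_t cmdsn     = 1913194;	/* CmdSN of the queued NOP-Out */

	/* RFC 3720: a non-immediate command is valid only if
	 * ExpCmdSN <= CmdSN <= MaxCmdSN */
	if (sna_lte(exp_cmdsn, cmdsn) && sna_lte(cmdsn, max_cmdsn))
		printf("CmdSN %u is inside the window\n", cmdsn);
	else
		printf("CmdSN %u is outside the window; the target "
		       "silently ignores it\n", cmdsn);
	return 0;
}

With these values the check fails, matching the NOP-Out that never got
a reply.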

Question remains, though, why we're starting to send PDUs with invalid
MaxCmdSN numbers ...

Apologies, as I'm running on an older codebase, but the same error /
'behaviour' is present in mainline as well.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)




Re: ping timeouts

2009-07-20 Thread Mike Christie

On 07/20/2009 12:16 PM, Mike Christie wrote:
 hdr->cmd_sn = session->cmdsn++;
 so after we send a hdr with the cmdsn we incremented the session

Not 'send'. I meant to write 'prepped' above.

 tracking one, and so the tracking one is going to be one higher than
 what we are actually sending.
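
In other words, a toy illustration of the post-increment (not the
driver code; the numbers are from the trace quoted earlier):

#include <stdio.h>

int main(void)
{
	unsigned int cmdsn = 15598;	/* session->cmdsn before prepping */
	unsigned int hdr_cmd_sn;

	hdr_cmd_sn = cmdsn++;		/* hdr->cmd_sn = session->cmdsn++; */

	/* the PDU is prepped with CmdSN 15598, but a debug print of the
	 * session counter now shows 15599, one higher than the wire */
	printf("prepped CmdSN: %u, tracking counter: %u\n",
	       hdr_cmd_sn, cmdsn);
	return 0;
}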


