Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-19 Thread Matthew Dickinson



On 11/10/09 11:39 AM, Mike Christie micha...@cs.wisc.edu wrote:
 
 What version of open-iscsi were you using and what kernel, and were you
 using the iscsi kernel modules with open-iscsi.org tarball or from the
 kernel?

iscsi-initiator-utils-6.2.0.871-0.10.el5
kernel-2.6.18-164.2.1.el5

RedHat RPMs

 
 
 It looks like we are sending more IO than the target can handle. In one
 of those cases it took more than 30 or 60 seconds (depending on your
 timeout value).
 
 What is the value of
 
 cat /sys/block/sdXYZ/device/timeout
 
 ?
 
 If it is 30 or 60 could you increase it to 360? After you login to the
 target do
 
 echo 360 > /sys/block/sdXYZ/device/timeout

I've tried setting this, but it appears to have no effect - it was 60, and I
increased to 360.

 
 And what is the value of:
 
 iscsiadm -m node -T your_target | grep node.session.cmds_max
 
 If that is 128, then could you decrease that to 32 or 16?
 
 Run
 
 iscsiadm -m node -T your_target -u
 iscsiadm -m node -T your_target -o update -n node.session.cmds_max -v 32
 iscsiadm -m node -T your_target -l

I've tried setting to both 16 and 32, but it behaves about the same.

 
 
 And if those prevent the io errors then could you do
 
 echo noop > /sys/block/sdXYZ/queue/scheduler
 
 to see if performance increases with a different scheduler.


I really think I'm back to the duplicate ACK problem - see the attached
packet dump - at one point  there's 30 duplicate ACKs... Interestingly, the
storage has worked for the past week - I'm using it as  D2D backup.  This
morning (about 7 days later), it's giving all these duplicate ACKs.

I'm currently running into messages such as:

Nov 19 09:46:58 backup iscsid: Kernel reported iSCSI connection 2:0 error
(1011) state (3)
Nov 19 09:47:00 backup kernel:  session2: target reset succeeded
Nov 19 09:47:01 backup iscsid: connection2:0 is operational after recovery
(1 attempts)
Nov 19 09:47:10 backup kernel: sd 9:0:0:0: SCSI error: return code =
0x000e
Nov 19 09:47:10 backup kernel: end_request: I/O error, dev sdf, sector 8856
Nov 19 09:47:10 backup kernel: device-mapper: multipath: Failing path 8:80.
Nov 19 09:47:10 backup kernel: sd 9:0:0:0: SCSI error: return code =
0x000e
Nov 19 09:47:10 backup kernel: end_request: I/O error, dev sdf, sector 74424
Nov 19 09:47:10 backup kernel: sd 9:0:0:2: SCSI error: return code =
0x000e
Nov 19 09:47:10 backup kernel: end_request: I/O error, dev sdm, sector
8845240
Nov 19 09:47:10 backup kernel: device-mapper: multipath: Failing path 8:192.
Nov 19 09:47:10 backup kernel: sd 9:0:0:2: SCSI error: return code =
0x000e
Nov 19 09:47:10 backup kernel: end_request: I/O error, dev sdm, sector
62915456
Nov 19 09:47:10 backup kernel: sd 9:0:0:2: timing out command, waited 300s
Nov 19 09:47:10 backup multipathd: /sbin/mpath_prio_alua exitted with 1
Nov 19 09:47:10 backup multipathd: error calling out /sbin/mpath_prio_alua
/dev/sdm 
Nov 19 09:47:10 backup multipathd: 3600d0230061d4479bfb83902: switch
to path group #2 

This is also interesting:

Nov 18 01:48:30 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 8
Nov 18 20:16:29 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 7
Nov 18 20:16:29 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 6
Nov 18 20:16:29 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 5
Nov 18 20:16:29 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 4
Nov 18 20:16:34 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 5
Nov 18 20:32:09 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 6
Nov 18 20:43:05 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 7
Nov 18 20:48:08 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 8
Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 7
Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 6
Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 5
Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 4
Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 3
Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 2
Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 1
Nov 18 20:53:36 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 0
Nov 18 20:53:41 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 1
Nov 18 20:59:09 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 2
Nov 18 21:04:37 backup multipathd: 3600d0230061d4479bfb83902:
remaining active paths: 3
Nov 18 21:10:05 backup multipathd: 
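
A minimal sketch for condensing a "remaining active paths" timeline like the one above into a single line. The function name path_summary is an illustrative assumption; the only thing taken from the log is the message text itself.

```shell
# Summarize multipathd "remaining active paths" messages from a syslog file.
# path_summary is a hypothetical helper name, not part of multipath-tools.
# Usage: path_summary /var/log/messages
path_summary() {
    # Split each line on the message text; the count is whatever follows it.
    awk -F'remaining active paths: ' '
        NF > 1 {
            n = $2 + 0
            if (count == 0 || n < min) min = n
            if (n == 0) outages++      # every path to the LUN was down
            count++
        }
        END {
            if (count > 0)
                printf "events=%d min_paths=%d total_outages=%d\n", count, min, outages
        }
    ' "$1"
}
```

A non-zero total_outages means that at some point no path was left, which is when I/O errors would be expected to reach the filesystem unless multipath is configured to queue.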

Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-19 Thread Mike Christie
Matthew Dickinson wrote:
 
 
 On 11/10/09 11:39 AM, Mike Christie micha...@cs.wisc.edu wrote:
 What version of open-iscsi were you using and what kernel, and were you
 using the iscsi kernel modules with open-iscsi.org tarball or from the
 kernel?
 
 iscsi-initiator-utils-6.2.0.871-0.10.el5
 kernel-2.6.18-164.2.1.el5
 
 RedHat RPMs
 

 It looks like we are sending more IO than the target can handle. In one
 of those cases it took more than 30 or 60 seconds (depending on your
 timeout value).

 What is the value of

 cat /sys/block/sdXYZ/device/timeout

 ?

 If it is 30 or 60 could you increase it to 360? After you login to the
 target do

 echo 360 > /sys/block/sdXYZ/device/timeout
 
 I've tried setting this, but it appears to have no effect - it was 60, and I
 increased to 360.
 
 And what is the value of:

 iscsiadm -m node -T your_target | grep node.session.cmds_max

 If that is 128, then could you decrease that to 32 or 16?

 Run

 iscsiadm -m node -T your_target -u
 iscsiadm -m node -T your_target -o update -n node.session.cmds_max -v 32
 iscsiadm -m node -T your_target -l
 
 I've tried setting to both 16 and 32, but it behaves about the same.
 

 And if those prevent the io errors then could you do

 echo noop > /sys/block/sdXYZ/queue/scheduler

 to see if performance increases with a different scheduler.
 
 
 I really think I'm back to the duplicate ACK problem - see the attached
 packet dump - at one point  there's 30 duplicate ACKs... Interestingly, the

I did not get the attachment.

 storage has worked for the past week - I'm using it as  D2D backup.  This
 morning (about 7 days later), it's giving all these duplicate ACKs.
 
 I'm currently running into messages such as:
 
 Nov 19 09:46:58 backup iscsid: Kernel reported iSCSI connection 2:0 error
 (1011) state (3)
 Nov 19 09:47:00 backup kernel:  session2: target reset succeeded

If you are using Red Hat RPMs, please file a Red Hat Bugzilla at
https://bugzilla.redhat.com/. CC mchri...@redhat.com on the Bugzilla or
email me at that address once you have filed it. I will then
add some network people to it. Attach your trace to the Bugzilla.

--

You received this message because you are subscribed to the Google Groups 
open-iscsi group.
To post to this group, send email to open-is...@googlegroups.com.
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=.




Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-12 Thread Hoot, Joseph

Sorry, wrong information.  Here is the correct information.  I was doing some
testing in VMware Fusion VMs for a presentation that I'm giving.  The
storage server is CentOS 5.3, which serves IETD targets to my OVM
servers.  The OVM 2.2 environment is as follows:

[r...@ovm1 ~]# uname -r
2.6.18-128.2.1.4.9.el5xen
[r...@ovm1 ~]# rpm -qa | grep iscsi
iscsi-initiator-utils-6.2.0.871-0.7.el5
[r...@ovm1 ~]# 



On Nov 10, 2009, at 2:30 PM, Mike Christie wrote:

 
 Hoot, Joseph wrote:
 [r...@storage ~]# uname -r
 2.6.18-164.el5
 [r...@storage ~]# rpm -qa | grep iscsi
 iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
 [r...@storage ~]#
 
 
 Weird.
 
 Is 2.6.18-164.el5 the kernel being used in the virtual machine/DomU? Is 
 that where you are using iscsi? It looks like the Oracle enterprise 
 linux kernel is 2.6.18-164.el5, which looks like it is based on RHEL 
 5.4. The iscsi code in there is the same as RHEL/upstream. No sendwait 
 patch.
 
 However, it looks like there is a 2.6.18-128.2.1.4.9 kernel (comes with 
 the Oracle VM rpms). In here we have a different iscsi version. It looks 
 a little older than what is in 2.6.18-164.el5, but it has the sendwait 
 patch I sent to Dell. Do you use this kernel in the Dom0? Are you using 
 this kernel with iscsi?
 
 
 
 On Nov 10, 2009, at 12:17 PM, Mike Christie wrote:
 
 Hoot, Joseph wrote:
 I've had about 3 threads of dt (kicking off a bit randomly) on (3) 
 separate volumes for over a week and haven't had a single disconnect yet.  
  I am currently using whatever rpm is distributed with Oracle VM v2.2.  I 
 know for sure that they have included the 871 base, plus I believe at 
 least a one off patch.  I can get more details if you'd like.
 
 But so far so good for now
 
 I think I have the source they are using. Could you do a uname -r, so I 
 can see what kernel they are using.
 
 

===
Joseph R. Hoot
Lead System Programmer/Analyst
(w) 716-878-4832
(c) 716-759-HOOT
joe.h...@itec.suny.edu
GPG KEY:   7145F633
===





Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-10 Thread Mike Christie

Hoot, Joseph wrote:
 I've had about 3 threads of dt (kicking off a bit randomly) on (3) separate 
 volumes for over a week and haven't had a single disconnect yet.   I am 
 currently using whatever rpm is distributed with Oracle VM v2.2.  I know for 
 sure that they have included the 871 base, plus I believe at least a one off 
 patch.  I can get more details if you'd like.
 
 But so far so good for now
 

I think I have the source they are using. Could you do a uname -r, so I 
can see what kernel they are using.




Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-10 Thread Hoot, Joseph

[r...@storage ~]# uname -r
2.6.18-164.el5
[r...@storage ~]# rpm -qa | grep iscsi
iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
[r...@storage ~]#

On Nov 10, 2009, at 12:17 PM, Mike Christie wrote:

 
 Hoot, Joseph wrote:
 I've had about 3 threads of dt (kicking off a bit randomly) on (3) separate 
 volumes for over a week and haven't had a single disconnect yet.   I am 
 currently using whatever rpm is distributed with Oracle VM v2.2.  I know for 
 sure that they have included the 871 base, plus I believe at least a one off 
 patch.  I can get more details if you'd like.
 
 But so far so good for now
 
 
 I think I have the source they are using. Could you do a uname -r, so I 
 can see what kernel they are using.
 
  

===
Joseph R. Hoot
Lead System Programmer/Analyst
(w) 716-878-4832
(c) 716-759-HOOT
joe.h...@itec.suny.edu
GPG KEY:   7145F633
===





Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-10 Thread Mike Christie

Matthew Dickinson wrote:
 On 11/6/09 3:39 PM, Matthew Dickinson matt-openis...@alpha345.com wrote:
 

 Try disabling nops by setting

 node.conn[0].timeo.noop_out_interval = 0
 node.conn[0].timeo.noop_out_timeout = 0
 
 I'm still getting errors:
 
 Nov 10 09:08:04 backup kernel:  connection12:0: detected conn error (1011)
 Nov 10 09:08:05 backup iscsid: Kernel reported iSCSI connection 12:0 error
 (1011) state (3)
 Nov 10 09:08:08 backup iscsid: connection12:0 is operational after recovery
 (1 attempts)
 Nov 10 09:09:43 backup kernel:  connection11:0: detected conn error (1011)
 Nov 10 09:09:43 backup kernel:  connection12:0: detected conn error (1011)
 Nov 10 09:09:44 backup kernel:  connection11:0: detected conn error (1011)
 Nov 10 09:09:44 backup iscsid: Kernel reported iSCSI connection 11:0 error
 (1011) state (3)
 Nov 10 09:09:44 backup iscsid: Kernel reported iSCSI connection 12:0 error
 (1011) state (3)
 Nov 10 09:09:44 backup iscsid: Kernel reported iSCSI connection 11:0 error
 (1011) state (1)
 Nov 10 09:09:46 backup kernel:  session11: target reset succeeded
 Nov 10 09:09:47 backup iscsid: connection11:0 is operational after recovery
 (1 attempts)
 Nov 10 09:09:47 backup iscsid: connection12:0 is operational after recovery
 (1 attempts)
 Nov 10 09:09:56 backup kernel: sd 18:0:0:2: SCSI error: return code =
 0x000e
 Nov 10 09:09:56 backup kernel: end_request: I/O error, dev sdv, sector
 60721248
 Nov 10 09:09:56 backup kernel: device-mapper: multipath: Failing path 65:80.
 Nov 10 09:09:56 backup kernel: sd 18:0:0:2: SCSI error: return code =
 0x000e
 Nov 10 09:09:56 backup kernel: end_request: I/O error, dev sdv, sector
 60727648
 Nov 10 09:09:56 backup kernel: sd 18:0:0:2: SCSI error: return code =
 0x000e
 Nov 10 09:10:31 backup kernel: device-mapper: multipath: Failing path
 65:112.
 
 Interestingly, I tried a Windows Server 2008 R2 host talking over a single
 connection to the storage unit, configured to access it via just one
 interface, and was able to sustain 20MB/s, so it would "appear" to be a
 Linux-related issue - I'm only able to get 9MB/s out of Linux even when
 using 8 interfaces on both controllers.
 

What version of open-iscsi were you using and what kernel, and were you 
using the iscsi kernel modules with open-iscsi.org tarball or from the 
kernel?


It looks like we are sending more IO than the target can handle. In one 
of those cases it took more than 30 or 60 seconds (depending on your 
timeout value).

What is the value of

cat /sys/block/sdXYZ/device/timeout

?

If it is 30 or 60 could you increase it to 360? After you login to the 
target do

echo 360 > /sys/block/sdXYZ/device/timeout
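
With several LUNs and paths there are many sdXYZ devices, so the write above has to be repeated for each one. A minimal sketch of a loop that does this, assuming you want the same value on every sd* disk (it does not distinguish iSCSI disks from local ones). The function name set_sd_timeout and its optional second argument are illustrative assumptions:

```shell
# Write a new SCSI command timeout (in seconds) to every sd* disk.
# set_sd_timeout is a hypothetical helper name. The second argument is the
# sysfs mount point; it defaults to /sys and is only a parameter so the
# loop can be tried against a scratch directory first.
set_sd_timeout() {
    secs=$1
    sysfs=${2:-/sys}
    for f in "$sysfs"/block/sd*/device/timeout; do
        [ -e "$f" ] || continue        # the glob matched nothing
        old=$(cat "$f")
        echo "$secs" > "$f" && echo "${f}: ${old} -> ${secs}"
    done
}
```

Run as root, e.g. `set_sd_timeout 360`, after logging in to the target. The value lives in sysfs only, so it has to be reapplied whenever the devices are created again.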

And what is the value of:

iscsiadm -m node -T your_target | grep node.session.cmds_max

If that is 128, then could you decrease that to 32 or 16?

Run

iscsiadm -m node -T your_target -u
iscsiadm -m node -T your_target -o update -n node.session.cmds_max -v 32
iscsiadm -m node -T your_target -l
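
When trying several cmds_max values (32, then 16), it can help to generate the logout / update / login sequence rather than retype it. A minimal sketch that only prints the commands; the function name cmds_max_commands and the example IQN are illustrative assumptions:

```shell
# Print the logout / update / login sequence for a new cmds_max value.
# cmds_max_commands is a hypothetical helper name. Nothing is executed:
# review the output, then pipe it to `sh` as root if it looks right.
cmds_max_commands() {
    target=$1
    value=$2
    printf 'iscsiadm -m node -T %s -u\n' "$target"
    printf 'iscsiadm -m node -T %s -o update -n node.session.cmds_max -v %s\n' "$target" "$value"
    printf 'iscsiadm -m node -T %s -l\n' "$target"
}

# Example with a made-up target name:
#   cmds_max_commands iqn.2009-11.com.example:backup 32
```

Logging out drops the block devices for that session, so do this only while nothing is using them.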


And if those prevent the io errors then could you do

echo noop > /sys/block/sdXYZ/queue/scheduler

to see if performance increases with a different scheduler.




Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-10 Thread Mike Christie

Hoot, Joseph wrote:
 [r...@storage ~]# uname -r
 2.6.18-164.el5
 [r...@storage ~]# rpm -qa | grep iscsi
 iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
 [r...@storage ~]#
 

Weird.

Is 2.6.18-164.el5 the kernel being used in the virtual machine/DomU? Is 
that where you are using iscsi? It looks like the Oracle enterprise 
linux kernel is 2.6.18-164.el5, which looks like it is based on RHEL 
5.4. The iscsi code in there is the same as RHEL/upstream. No sendwait 
patch.

However, it looks like there is a 2.6.18-128.2.1.4.9 kernel (comes with 
the Oracle VM rpms). In here we have a different iscsi version. It looks 
a little older than what is in 2.6.18-164.el5, but it has the sendwait 
patch I sent to Dell. Do you use this kernel in the Dom0? Are you using 
this kernel with iscsi?



 On Nov 10, 2009, at 12:17 PM, Mike Christie wrote:
 
 Hoot, Joseph wrote:
 I've had about 3 threads of dt (kicking off a bit randomly) on (3) separate 
 volumes for over a week and haven't had a single disconnect yet.   I am 
 currently using whatever rpm is distributed with Oracle VM v2.2.  I know 
 for sure that they have included the 871 base, plus I believe at least a 
 one off patch.  I can get more details if you'd like.

 But so far so good for now

 I think I have the source they are using. Could you do a uname -r, so I 
 can see what kernel they are using.

 





Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-09 Thread Santi Saez

On 06/11/09 14:10, mdaitc wrote:

Hi mdaitc,

 I’m seeing similar TCP “weirdness” as the other posts mention as  well
 as the below errors.

(..)

 Nov  2 08:15:14 backup kernel:  connection33:0: detected conn error
 The performance isn’t what I’d expect:

(..)

What happens if you disable the TCP window scaling option on the RHEL servers?

# echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

In our case, the iSCSI conn errors stopped after disabling it, but we still
have a lot of TCP “weirdness” on the network, mainly dup ACK packets.
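
Before comparing results between hosts it is worth confirming what each one is actually running with. A minimal sketch, assuming the standard /proc layout; the function name window_scaling_state and its optional argument are illustrative assumptions:

```shell
# Report whether TCP window scaling is enabled on this host.
# window_scaling_state is a hypothetical helper name. The optional argument
# is the /proc mount point (defaults to /proc).
window_scaling_state() {
    proc=${1:-/proc}
    case $(cat "$proc/sys/net/ipv4/tcp_window_scaling") in
        0) echo disabled ;;
        *) echo enabled ;;
    esac
}

# Writing 0 to that file is equivalent to:
#   sysctl -w net.ipv4.tcp_window_scaling=0
# and is lost at reboot unless /etc/sysctl.conf also contains:
#   net.ipv4.tcp_window_scaling = 0
```

Window scaling is negotiated when a TCP connection is set up, so existing iSCSI sessions would need to reconnect before a change takes effect on them.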

Regards,

-- 
Santi Saez
http://woop.es




Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-09 Thread Hoot, Joseph

What version of OiS are you using?  I had lots of weirdness and the
same types of disconnects to our Dell EqualLogic when we were
(actually still are in production) using 868 code.  I'm now using
open-iscsi-871 code plus a sendwait patch and haven't had the issue.  I've
now been slamming my storage for a week and a half with multiple
threads of dt.


On Nov 9, 2009, at 4:33 AM, Santi Saez wrote:


 On 06/11/09 14:10, mdaitc wrote:

 Hi mdaitc,

 I’m seeing similar TCP “weirdness” as the other posts mention as   
 well
 as the below errors.

 (..)

 Nov  2 08:15:14 backup kernel:  connection33:0: detected conn error
 The performance isn’t what I’d expect:

 (..)

 What happens if you disable the TCP window scaling option on the RHEL servers?

 # echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

 In our case, the iSCSI conn errors stopped after disabling it, but we still
 have a lot of TCP “weirdness” on the network, mainly dup ACK packets.

 Regards,

 -- 
 Santi Saez
 http://woop.es

 

===
Joseph R. Hoot
Lead System Programmer/Analyst
(w) 716-878-4832
(c) 716-759-HOOT
joe.h...@itec.suny.edu
GPG KEY:   7145F633
===





Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-09 Thread Gopu Krishnan
Hi all,

I am working on iSCSI Enterprise Target (IET). Could someone please explain
its performance? If so, how was the performance measured, and what
throughput was achieved?

Thanks
Gopala krishnan Varatharajan

On Sat, Nov 7, 2009 at 3:09 AM, Matthew Dickinson 
matt-openis...@alpha345.com wrote:


 On 11/6/09 3:08 PM, Mike Christie micha...@cs.wisc.edu wrote:

 
  Could you send more of the log? Do you see a message like
connection1:0 is
  operational after recovery (1 attempts)
  after you see the conn errors (how many attempts)?

 Here's one particular connection:

 Nov  4 05:12:14 backup kernel:  connection22:0: ping timeout of 5 secs
 expired, recv timeout 5, last rx 4321648393, last ping 4321653393, now
 4321658393
 Nov  4 05:12:14 backup kernel:  connection22:0: detected conn error (1011)
 Nov  4 05:12:21 backup iscsid: connection22:0 is operational after recovery
 (1 attempts)
 Nov  4 05:12:46 backup kernel:  connection22:0: ping timeout of 5 secs
 expired, recv timeout 5, last rx 4321680691, last ping 4321685691, now
 4321690691
 Nov  4 05:12:46 backup kernel:  connection22:0: detected conn error (1011)
 Nov  4 05:12:58 backup iscsid: connection22:0 is operational after recovery
 (1 attempts)
 Nov  4 07:46:03 backup kernel:  connection22:0: ping timeout of 5 secs
 expired, recv timeout 5, last rx 4330877890, last ping 4330882890, now
 4330887890
 Nov  4 07:46:03 backup kernel:  connection22:0: detected conn error (1011)
 Nov  4 07:46:10 backup iscsid: connection22:0 is operational after recovery
 (1 attempts)
 Nov  4 07:46:27 backup kernel:  connection22:0: ping timeout of 5 secs
 expired, recv timeout 5, last rx 4330901733, last ping 4330906733, now
 4330911733
 Nov  4 07:46:27 backup kernel:  connection22:0: detected conn error (1011)
 Nov  4 07:46:32 backup iscsid: connection22:0 is operational after recovery
 (1 attempts)
 Nov  4 07:47:21 backup kernel:  connection22:0: ping timeout of 5 secs
 expired, recv timeout 5, last rx 4330955414, last ping 4330960414, now
 4330965414
 Nov  4 07:47:21 backup kernel:  connection22:0: detected conn error (1011)
 Nov  4 07:47:28 backup iscsid: connection22:0 is operational after recovery
 (1 attempts)
 Nov  4 07:48:28 backup kernel:  connection22:0: ping timeout of 5 secs
 expired, recv timeout 5, last rx 4331023213, last ping 4331028213, now
 4331033213
 Nov  4 07:48:28 backup kernel:  connection22:0: detected conn error (1011)
 Nov  4 07:48:35 backup iscsid: connection22:0 is operational after recovery
 (1 attempts)

 FWIW:

 [r...@backup ~]# cat /var/log/messages | grep "after recovery" |
   awk '{print $11" "$12}' | sort | uniq
 (113 attempts)
 (1 attempts)
 (24 attempts)
 (2 attempts)
 (3 attempts)
 (4 attempts)
 (5 attempts)
 (66 attempts)
 (68 attempts)
 (6 attempts)
 (7 attempts)
 (8 attempts)
 (9 attempts)
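
The list above shows which attempt counts occurred but not which connections recover most often. A minimal sketch that counts recoveries per connection, assuming the standard syslog layout seen in this thread (the connection name is the sixth whitespace-separated field). The function name recoveries_by_connection is an illustrative assumption:

```shell
# Count "operational after recovery" messages per connection, busiest first.
# recoveries_by_connection is a hypothetical helper name.
# Usage: recoveries_by_connection /var/log/messages
recoveries_by_connection() {
    awk '/is operational after recovery/ { n[$6]++ }
         END { for (c in n) print n[c], c }' "$1" | sort -rn
}
```

If one or two connections dominate the output, that points at a particular path (NIC, switch port, or controller port) rather than at the target as a whole.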

 
  Try disabling nops by setting
 
  node.conn[0].timeo.noop_out_interval = 0
  node.conn[0].timeo.noop_out_timeout = 0

 Ok, I'll let you know how it pans out.

 Thanks,

 Matthew



 



--




Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-09 Thread Mike Christie

Hoot, Joseph wrote:
 What version of OiS are you using?  I had lots of weirdness and the  
 same types of disconnects to our Dell EqualLogic when we were  
 (actually still are in production) using 868 code.  I'm now using
 open-iscsi-871 code plus a sendwait patch and haven't had the issue.  I've

What is the sendwait patch? Is it a patch for open-iscsi or to the 
kernel network code?




Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-09 Thread Hoot, Joseph

It was for the OiS 871 code prior to the RHEL 5.4 release (not sure if the
release includes it or not).  I'm not sure who came up with it.  I was
working with Don Williams from Dell EqualLogic.  He got hold of it
somehow.  I applied it and it seemed to improve things.


On Nov 9, 2009, at 2:31 PM, Mike Christie wrote:


 Hoot, Joseph wrote:
 What version of OiS are you using?  I had lots of weirdness and the
 same types of disconnects to our Dell EqualLogic when we were
 (actually still are in production) using 868 code.  I'm now using
 open-iscsi-871 code plus a sendwait patch and haven't had the issue.  I've

 What is the sendwait patch? Is it a patch for open-iscsi or to the
 kernel network code?

 

===
Joseph R. Hoot
Lead System Programmer/Analyst
(w) 716-878-4832
(c) 716-759-HOOT
joe.h...@itec.suny.edu
GPG KEY:   7145F633
===





Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-09 Thread Mike Christie

Hoot, Joseph wrote:
 It was for the OiS 871 code prior to the RHEL 5.4 release (not sure if the
 release includes it or not).  I'm not sure who came up with it.  I was
 working with Don Williams from Dell EqualLogic.  He got ahold of it  
 somehow.  I applied it and it seemed to improve things.
 

Ah ok. I think it was the patch I sent to Don.

If you just used 871 without the patch (or what is in the stock RHEL 5.4 
kernel) does it work ok? There were a couple changes from 868 to 871 
that I thought would also fix the problem, so I was waiting for Don and 
them to retest just 871 and get back to me.

 
 On Nov 9, 2009, at 2:31 PM, Mike Christie wrote:
 
 Hoot, Joseph wrote:
 What version of OiS are you using?  I had lots of weirdness and the
 same types of disconnects to our Dell EqualLogic when we were
 (actually still are in production) using 868 code.  I'm now using
 open-iscsi-871 code plus a sendwait patch and haven't had the issue.  I've
 What is the sendwait patch? Is it a patch for open-iscsi or to the
 kernel network code?

 





Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-09 Thread Hoot, Joseph

I've had about 3 threads of dt (kicking off a bit randomly) on (3) separate 
volumes for over a week and haven't had a single disconnect yet.   I am 
currently using whatever rpm is distributed with Oracle VM v2.2.  I know for 
sure that they have included the 871 base, plus I believe at least a one off 
patch.  I can get more details if you'd like.

But so far so good for now

 

On Nov 9, 2009, at 6:18 PM, Mike Christie wrote:

 
 Hoot, Joseph wrote:
 It was for the OiS 871 code prior to the RHEL 5.4 release (not sure if the
 release includes it or not).  I'm not sure who came up with it.  I was
 working with Don Williams from Dell EqualLogic.  He got ahold of it  
 somehow.  I applied it and it seemed to improve things.
 
 
 Ah ok. I think it was the patch I sent to Don.
 
 If you just used 871 without the patch (or what is in the stock RHEL 5.4 
 kernel) does it work ok? There were a couple changes from 868 to 871 
 that I thought would also fix the problem, so I was waiting for Don and 
 them to retest just 871 and get back to me.
 
 
 On Nov 9, 2009, at 2:31 PM, Mike Christie wrote:
 
 Hoot, Joseph wrote:
 What version of OiS are you using?  I had lots of weirdness and the
 same types of disconnects to our Dell EqualLogic when we were
 (actually still are in production) using 868 code.  I'm now using
 open-iscsi-871 code plus a sendwait patch and haven't had the issue.  I've
 What is the sendwait patch? Is it a patch for open-iscsi or to the
 kernel network code?
 
 

===
Joseph R. Hoot
Lead System Programmer/Analyst
(w) 716-878-4832
(c) 716-759-HOOT
joe.h...@itec.suny.edu
GPG KEY:   7145F633
===





Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-04 Thread Santi Saez

On 03/11/09 0:52, Mike Christie wrote:

Dear Mike,

 You can turn off ping/nops by setting

 node.conn[0].timeo.noop_out_interval = 0
 node.conn[0].timeo.noop_out_timeout = 0

 (set that in iscsid.conf, then rediscover the target, or run iscsiadm -m
 node -T your_target -o update -n name_of_param_above -v 0)

Thanks!! As I said to James in the previous email, disabling TCP window
scaling *partially solves* this problem; we have kept the nop pings enabled
in the configuration. But we still have too many TCP Dup ACKs on the network :-S


 This might just be a workaround. What might happen is that you will not see
 the nop/ping and conn errors and instead would just see a slow down in
 the workloads being run.

I have sent your contact details to the Infortrend developers; an engineer
will contact you. Thanks!

Regards,

-- 
Santi Saez
http://woop.es




Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-02 Thread Mike Christie

Santi Saez wrote:
 
 Hi,
 
 We randomly get Open-iSCSI conn errors when connecting to an
 Infortrend A16E-G2130-4 storage array. We discussed this
 earlier on the list, see:
 
   http://tr.im/DVQm
   http://tr.im/DVQp
 
 Open-iSCSI logs this:
 
 ===
 Nov  2 18:34:02 vz-17 kernel: ping timeout of 5 secs expired, last rx  
 408250499, last ping 408249467, now 408254467
 Nov  2 18:34:02 vz-17 kernel:  connection1:0: iscsi: detected conn  
 error (1011)
 Nov  2 18:34:03 vz-17 iscsid: Kernel reported iSCSI connection 1:0  
 error (1011) state (3)
 Nov  2 18:34:07 vz-17 iscsid: connection1:0 is operational after  
 recovery (1 attempts)
 Nov  2 18:34:52 vz-17 kernel: ping timeout of 5 secs expired, last rx  
 408294833, last ping 408299833, now 408304833
 Nov  2 18:34:52 vz-17 kernel:  connection1:0: iscsi: detected conn  
 error (1011)
 Nov  2 18:34:53 vz-17 iscsid: Kernel reported iSCSI connection 1:0  
 error (1011) state (3)
 Nov  2 18:34:57 vz-17 iscsid: connection1:0 is operational after  
 recovery (1 attempts)
 ===
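
The numbers in the "ping timeout" lines above are kernel jiffies, so the gaps can be worked out by subtraction. A small sketch using the first log line follows. HZ=1000 is an assumption (it is not stated in the thread), but it is consistent with the 5000-jiffy gap between "last ping" and "now" being reported as 5 secs. The helper name jiffies_to_ms is made up for illustration.

```shell
#!/bin/sh
# Jiffies copied from the first "ping timeout" log line.
last_rx=408250499
last_ping=408249467
now=408254467
HZ=1000   # assumption: timer ticks per second on this kernel

# Convert a jiffies delta to milliseconds (hypothetical helper).
jiffies_to_ms() { echo $(( $1 * 1000 / HZ )); }

since_ping_ms=$(jiffies_to_ms $(( now - last_ping )))
since_rx_ms=$(jiffies_to_ms $(( now - last_rx )))

echo "ms since last ping was sent: $since_ping_ms"
echo "ms since last data was received: $since_rx_ms"
```

With these inputs the script reports 5000 ms since the last ping and 3968 ms since the last receive. The same subtraction on the second log line (last rx 408294833, last ping 408299833, now 408304833) gives 5000 and 10000 jiffies.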
 
 Running on CentOS 5.4 with iscsi-initiator-utils-6.2.0.871-0.10.el5;  
 I think it's not an Open-iSCSI bug, as Mike suggested at:
 
 http://groups.google.com/group/open-iscsi/msg/fe37156096b2955f
 
 I only get this error when connecting to Infortrend storage, and not  
 with NetApp, Nexsan, etc. *connected in the same SAN*.
 
 Using Wireshark I see a lot of TCP Dup ACK, TCP ACKed lost  
 segment, etc. and the iSCSI session finally ends in a timeout; see a  
 screenshot here:
 
 http://tinyurl.com/ykpvckn
 
 Using Wireshark IO graphs I get this strange report about TCP/IP errors:
 
 http://tinyurl.com/ybm4m8x
 
 And this is another report in the same SAN connecting to a NetApp:
 
 http://tinyurl.com/ycgc8ul
 
 Those TCP/IP errors only occur when connecting to Infortrend  
 storage, and not with other targets in the same SAN (using the same switch  
 infrastructure). Is there any way to deal with this using Open-iSCSI?  
 From what I see on the Internet, a lot of Infortrend users are suffering  
 from this behavior.
 

You can turn off ping/nops by setting

node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0

(set that in iscsid.conf then rediscover the target, or run iscsiadm -m 
node -T your_target -o update -n name_of_param_above -v 0)

Or you might want to set them higher.
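
Spelled out for both parameters, the iscsiadm route could look like the sketch below. It only prints the commands rather than running them. It assumes the session has to be logged out and back in for the updated node record to take effect, as with the cmds_max change suggested elsewhere in this thread. "your_target" is a placeholder and noop_cmds is a made-up helper name.

```shell
#!/bin/sh
# Print (do not run) the iscsiadm sequence that changes both NOP-Out timers
# for one target: logout, update the two node record values, login again.
noop_cmds() {
    target="$1"
    value="$2"
    echo "iscsiadm -m node -T $target -u"
    for param in 'node.conn[0].timeo.noop_out_interval' \
                 'node.conn[0].timeo.noop_out_timeout'; do
        echo "iscsiadm -m node -T $target -o update -n $param -v $value"
    done
    echo "iscsiadm -m node -T $target -l"
}

# Defaults: placeholder target name, value 0 (pings disabled).
noop_cmds "${1:-your_target}" "${2:-0}"
```

Passing a larger value as the second argument instead of 0 gives the "set them higher" variant; review the printed commands before pasting them into a root shell.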

This might just be a workaround. What might happen is that you will not see 
the nop/ping and conn errors and instead would just see a slow down in 
the workloads being run.

If you guys can get hold of any Infortrend people, let me know, because 
I would be happy to work with them on this.




Re: Infortrend + iSCSI: detected conn error (1011) + TCP Dup ACK

2009-11-02 Thread Ulrich Windl

Hi!

I would guess that your network/storage system is occasionally overloaded, and 
then network packets are significantly (5s) delayed. It sounds unlikely, but 
that would explain a duplicate ACK IMHO.

Here with SLES10 SP2 on x86_64 with a HP EVA 6000 + iSCSI connectivity option 
(MPX100) there is no such problem, even though there are about 20 hosts 
connected to the storage system, most of them directly through a fabric switch.

However the iSCSI load is light. Here is what one of the two MPX boxes reports 
(the load will rise; it's still early morning):
# show perf byte

  WARNING: Valid data is only displayed for port(s) that are not
   associated with any configured FCIP routes.

  Displaying bytes/sec (total)...  (Press any key to stop display)

  GE1     GE2     FC1     FC2

  53K     606K    43K     620K
  774     774     1K      1K
  651K    868     17K     636K
  2K      26K     18K     10K
  49K     0       22K     26K
  55K     805K    799K    66K
  29K     17K     40K     7K
  597K    7K      534K    72K
  172     196K    150K    47K
  49K     516     25K     28K
  294K    43K     29K     313K
  774     774     1K      1K
  228K    868     65K     171K
  18K     20K     528     43K
  6K      6K      0       12K
  145K    516     101K    45K
  15K     29K     15K     30K

(GEx are the gigabit Ethernet ports, FCx are the FibreChannel ports that are 
connected with the storage system via a fabric switch)

Some more statistics (for those who care or like inspiration):
# show stats

  FC Port Statistics

  FC Port                          1
  Interrupt Count                  82812963
  Target Command Count             0
  Initiator Command Count          83526964
  Link Failure Count               0
  Loss of Sync Count               0
  Loss of Signal Count             0
  Primitive Sequence Error Count   0
  Invalid Transmission Word Count  0
  Invalid CRC Error Count          0

  FC Port                          2
  Interrupt Count                  81551444
  Target Command Count             0
  Initiator Command Count          82221649
  Link Failure Count               0
  Loss of Sync Count               0
  Loss of Signal Count             0
  Primitive Sequence Error Count   0
  Invalid Transmission Word Count  0
  Invalid CRC Error Count          0

  iSCSI Port Statistics
  ---
  iSCSI Port                       1
  Interrupt Count                  190227478
  Target Command Count             100974368
  Initiator Command Count          0
  MAC Xmit Frames                  122882662
  MAC Xmit Byte Count              28445475500
  MAC Xmit Multicast Frames        0
  MAC Xmit Broadcast Frames        0
  MAC Xmit Pause Frames            0
  MAC Xmit Control Frames          0
  MAC Xmit Deferrals               0
  MAC Xmit Late Collisions         0
  MAC Xmit Aborted                 0
  MAC Xmit Single Collisions       0
  MAC Xmit Multiple Collisions     0
  MAC Xmit Collisions              0
  MAC Xmit Dropped Frames          0
  MAC Xmit Jumbo Frames            1702770
  MAC Rcvd Frames                  156585933
  MAC Rcvd Byte Count              187107689124
  MAC Rcvd Unknown Control Frames  0
  MAC Rcvd Pause Frames            0
  MAC Rcvd Control Frames          0
  MAC Rcvd Dribbles                0
  MAC Rcvd Frame Length Errors     0
  MAC Rcvd Jabbers                 0
  MAC Rcvd Carrier Sense Errors    0
  MAC Rcvd Dropped Frames          0
  MAC Rcvd CRC Errors              0
  MAC Rcvd Encoding Errors         0
  MAC Rcvd Length Errors Large     0
  MAC Rcvd Length Errors Small     0
  MAC Rcvd Multicast Frames        178239
  MAC Rcvd Broadcast Frames        48

  iSCSI Port                       2
  Interrupt Count                  182729151
  Target Command Count             97555857
  Initiator Command Count          0
  MAC Xmit Frames                  120067355
  MAC Xmit Byte Count              27948414986
  MAC Xmit Multicast Frames        0
  MAC Xmit Broadcast Frames        0
  MAC Xmit Pause Frames            0
  MAC Xmit Control Frames          0
  MAC Xmit Deferrals               0
  MAC Xmit Late Collisions         0
  MAC Xmit Aborted                 0
  MAC Xmit Single Collisions       0
  MAC Xmit Multiple Collisions     0
  MAC Xmit Collisions              0
  MAC Xmit Dropped Frames          0
  MAC Xmit Jumbo Frames            1718670
  MAC Rcvd Frames                  156126093
  MAC Rcvd Byte Count              196255090216
  MAC Rcvd Unknown Control Frames  0
  MAC Rcvd Pause Frames            0
  MAC Rcvd Control Frames          0
  MAC Rcvd Dribbles                0
  MAC Rcvd Frame Length Errors     0
  MAC Rcvd Jabbers                 0
  MAC Rcvd Carrier Sense Errors    0
  MAC Rcvd Dropped Frames          0
  MAC Rcvd CRC Errors              0
  MAC Rcvd Encoding Errors         0
  MAC Rcvd Length Errors Large     0
  MAC Rcvd Length Errors Small     0