Re: connection, host resets, I/O errors eventually (DRBD, but not only)

2009-01-11 Thread Mike Christie

Tomasz Chmielewski wrote:
 Mike Christie wrote:
 
 The SCSI layer sets a timeout on each command. If a command does not
 complete within X seconds, the iSCSI error handler (eh) runs.

 So you can increase the SCSI command timeout:

 To modify the udev rule open /etc/udev/rules.d/50-udev.rules, and find the
 following lines:

 ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="0|7|14", \
  RUN+="/bin/sh -c 'echo 60 > /sys$$DEVPATH/timeout'"


 and you probably want to decrease the number of outstanding commands by 
 setting node.session.cmds_max for that session. With 50 kB/s you 
 might as well set this to 1 command.
 
 This helps a bit, but after some time, something weird happens.
 
 
 I increased the timeout to 240 seconds.
 
 The data flows fine for some time, but after a couple of minutes, every
 program running on that initiator machine seems to freeze (e.g. ping
 stops pinging, top stops refreshing, and they can't be
 interrupted / won't exit with ctrl+c).
 There is no traffic any more between the target and the initiator.
 
 The machine is still partly alive: it replies to pings, responds to
 magic SysRq, and I can switch VTs (ctrl+alt+F1...).
 
 
 The machine has its root filesystem on iSCSI (over a fast LAN, on a
 different target). Could that somehow contribute to the problem? It 
 runs a 2.6.22 kernel.
 Could there be some bad interaction when the initiator is connected to two
 targets with different IPs and the connection to one of them is very slow?
 

There should not be. Each session/connection to the target is going to 
get its own threads for sending IO. The receiving is done in the network 
softirq and cannot sleep or dominate the use.

Did you set the queue limit lower too? If so, did you do it globally (set 
it in iscsid.conf and rediscover the targets) or did you do it for a 
specific session (run iscsiadm -m node -T target -p ip:port -o update 
-n ..)? If you did it globally, the lower queue depth may be 
slowing the IO execution and affecting the apps. This is probably not 
the case though. I only know of things like a big database not liking 
having its IO slowed down, and I do not think other apps would notice the 
slowdown as long as IO completes.
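
The per-session variant might look like the sketch below. The target name
and portal are placeholders, and set_cmds_max is a hypothetical wrapper that
only prints the command unless DRY_RUN=0 is set. The setting name is the
node.session.cmds_max parameter mentioned earlier in the thread; the updated
record should apply the next time that session logs in.

```shell
#!/bin/sh
# Print (or run) the iscsiadm command that updates cmds_max for one node record.
# Set DRY_RUN=0 to actually execute it; the default only prints the command.
set_cmds_max() {
    target=$1; portal=$2; value=$3
    # Build the command line as the function's positional parameters.
    set -- iscsiadm -m node -T "$target" -p "$portal" \
        -o update -n node.session.cmds_max -v "$value"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Placeholder target name and portal; substitute your own.
set_cmds_max iqn.2009-01.example:storage.slow 192.0.2.10:3260 1
```

Updating only the slow session this way leaves the queue depth of the other
sessions (for example the one carrying the root filesystem) untouched.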

Or were there any iscsi or IO messages in the logs?

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
open-iscsi group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: connection, host resets, I/O errors eventually (DRBD, but not only)

2009-01-09 Thread Tomasz Chmielewski

Mike Christie wrote:

 The SCSI layer sets a timeout on each command. If a command does not
 complete within X seconds, the iSCSI error handler (eh) runs.
 
 So you can increase the SCSI command timeout:
 
 To modify the udev rule open /etc/udev/rules.d/50-udev.rules, and find the
 following lines:
 
 ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="0|7|14", \
  RUN+="/bin/sh -c 'echo 60 > /sys$$DEVPATH/timeout'"
 
 
 and you probably want to decrease the number of outstanding commands by 
 setting node.session.cmds_max for that session. With 50 kB/s you 
 might as well set this to 1 command.

This helps a bit, but after some time, something weird happens.


I increased the timeout to 240 seconds.
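
The timeout can also be changed on a running system through sysfs instead of
the udev rule. A minimal sketch, assuming the usual
/sys/block/<dev>/device/timeout attribute; set_scsi_timeout and the
SYSFS_ROOT override are hypothetical names used only for illustration, and
sdb is a placeholder device:

```shell
#!/bin/sh
# Set the SCSI command timeout (in seconds) for one disk via sysfs.
# SYSFS_ROOT defaults to /sys; it can be overridden so the function can be
# tried against a scratch directory first.
set_scsi_timeout() {
    dev=$1; seconds=$2
    file="${SYSFS_ROOT:-/sys}/block/$dev/device/timeout"
    # Refuse to continue if the attribute is missing or not writable.
    [ -w "$file" ] || { echo "cannot write $file" >&2; return 1; }
    echo "$seconds" > "$file"
}

# Example (needs root on a real system):
#   set_scsi_timeout sdb 240
```

A value set this way does not survive a reboot or a re-login of the session,
which is why the udev rule is the persistent place for it.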

The data flows fine for some time, but after a couple of minutes, every
program running on that initiator machine seems to freeze (e.g. ping
stops pinging, top stops refreshing, and they can't be
interrupted / won't exit with ctrl+c).
There is no traffic any more between the target and the initiator.

The machine is still partly alive: it replies to pings, responds to
magic SysRq, and I can switch VTs (ctrl+alt+F1...).


The machine has its root filesystem on iSCSI (over a fast LAN, on a
different target). Could that somehow contribute to the problem? It 
runs a 2.6.22 kernel.
Could there be some bad interaction when the initiator is connected to two
targets with different IPs and the connection to one of them is very slow?


No such phenomenon on a machine with rootfs on SATA, where everything
works fine.


-- 
Tomasz Chmielewski
http://wpkg.org





Re: connection, host resets, I/O errors eventually (DRBD, but not only)

2009-01-09 Thread Bart Van Assche

On Thu, Jan 8, 2009 at 12:44 PM, Tomasz Chmielewski man...@wpkg.org wrote:
 Anyone using iSCSI over DRBD? And a slow internet link perhaps?

How reliable is the link you are using, i.e. what percentage of packets
is lost? You can test this with the ping command, for example. The following
command generates about 32 KB/s of network traffic and reports the
percentage of lost packets:

# ping -q -i 0.01 -c1000 -s160 ${remote_ip}
PING 192.168.1.102 (192.168.1.102) 160(188) bytes of data.

--- 192.168.1.102 ping statistics ---
1000 packets transmitted, 1000 received, 0% packet loss, time 8997ms
rtt min/avg/max/mdev = 0.000/0.048/0.474/0.027 ms
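
To track the loss figure over repeated runs, the percentage can be pulled out
of the summary line. This is only a sketch: loss_percent is a hypothetical
helper, and it assumes the Linux iputils summary format shown above.

```shell
#!/bin/sh
# Extract the packet-loss percentage from the summary that ping prints.
# Reads ping output on stdin and prints the number without the % sign.
loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Typical use (remote_ip is a placeholder):
#   ping -q -i 0.01 -c1000 -s160 "$remote_ip" | loss_percent
printf '%s\n' '1000 packets transmitted, 1000 received, 0% packet loss, time 8997ms' | loss_percent
```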


Bart.




Re: connection, host resets, I/O errors eventually (DRBD, but not only)

2009-01-09 Thread Bart Van Assche

On Fri, Jan 9, 2009 at 3:22 PM, Tomasz Chmielewski man...@wpkg.org wrote:
 Bart Van Assche wrote:
 # ping -q -i 0.01 -c1000 -s160 ${remote_ip}

 I get about 1% losses.

IMHO running iSCSI over a slow link should work, but a packet loss of
1% is troublesome. On a local network the packet loss rate is about
0.001% (1e-5) for 1000-byte packets.

Bart.




Re: connection, host resets, I/O errors eventually (DRBD, but not only)

2009-01-09 Thread Tomasz Chmielewski

Bart Van Assche wrote:
 On Fri, Jan 9, 2009 at 3:22 PM, Tomasz Chmielewski man...@wpkg.org wrote:
 Bart Van Assche wrote:
 # ping -q -i 0.01 -c1000 -s160 ${remote_ip}
 I get about 1% losses.
 
 IMHO running iSCSI over a slow link should work, but a packet loss of
 1% is troublesome. On a local network the packet loss rate is about
 0.001% (1e-5) for 1000-byte packets.

It's not really running iSCSI over a slow link in this case.

DRBD synchronizes two block devices, over a slow link in this case:

P - primary node, accessed by the target, accessed by the initiator
S - secondary node / synchronized area
U - unsynchronized area


PPP
   slow link
SSS


The slow link is used to transfer data for the unsynchronized area.

Now, if the initiator begins to write data, DRBD has to transfer it to 
the secondary node before the write is completed: writes flow over a 
slow link and compete with background synchronization in the meantime.

As a result, we can say that iSCSI is running over a slow link.


Mike's suggestions do help though: increasing the timeouts and decreasing the 
number of outstanding commands improves things here.


One more note: I also see such connection/host resets from time to time 
when using gigabit Ethernet and a heavily loaded target (no I/O errors 
though; everything recovers in time with the default values).


-- 
Tomasz Chmielewski
http://wpkg.org





connection, host resets, I/O errors eventually (DRBD, but not only)

2009-01-08 Thread Tomasz Chmielewski

Anyone using iSCSI over DRBD? And a slow internet link perhaps?

If yes, you are likely to see connection errors, host resets, and eventually,
I/O errors reported, for example:

Jan  7 21:47:09 vmware1 kernel:  connection23:0: iscsi: detected conn error (1011)
Jan  7 21:47:10 vmware1 kernel: iscsi: host reset succeeded
Jan  7 21:47:50 vmware1 kernel:  connection23:0: iscsi: detected conn error (1011)
Jan  7 21:47:50 vmware1 kernel: iscsi: host reset succeeded
Jan  7 21:48:00 vmware1 kernel: sd 22:0:0:1: SCSI error: return code = 0x0002
Jan  7 21:48:00 vmware1 kernel: end_request: I/O error, dev sdw, sector 1494720
Jan  7 21:48:00 vmware1 kernel: Buffer I/O error on device sdw, logical block 186840
Jan  7 21:48:00 vmware1 kernel: lost page write due to I/O error on sdw
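
To see how often this happens, the two kinds of message can be counted in the
kernel log. A small sketch; count_iscsi_errors is a hypothetical helper, and
the log path in the comment is an assumption that varies by distribution.

```shell
#!/bin/sh
# Count iSCSI connection errors and block-layer I/O errors in kernel log
# text read from stdin.
count_iscsi_errors() {
    awk '
        /iscsi: detected conn error/ { conn++ }
        /end_request: I\/O error/    { io++ }
        END { printf "conn_errors=%d io_errors=%d\n", conn, io }
    '
}

# Typical use (log path is an assumption):
#   count_iscsi_errors < /var/log/messages
```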


This is due to the fact that open-iscsi doesn't seem to like low-speed (but
stable) connections to the target.

To reproduce:

1) set up a connection with limited speed between the target and the initiator;
for example, with openvpn, one would use the --shaper 5 option to limit the
speed to 50 kB/s.

2) log in to the target from the initiator over this connection (can also be on a LAN)

3) start reading and writing... after some time you will see connection
errors and host resets, followed by I/O errors and possibly data corruption


A harder (but somewhat more realistic) way to reproduce it would be to set up DRBD,
start background synchronization at high speed (thus leaving not much bandwidth
for normal writes), and start reading and writing...


I can reproduce it with tgtd and IET, so I guess open-iscsi is to blame.

Any ideas what's wrong and why it fails?


-- 
Tomasz Chmielewski
http://wpkg.org




Re: connection, host resets, I/O errors eventually (DRBD, but not only)

2009-01-08 Thread Mike Christie

Tomasz Chmielewski wrote:
 Anyone using iSCSI over DRBD? And a slow internet link perhaps?
 
 If yes, you are likely to see connection errors, host resets, and eventually,
 I/O errors reported, for example:
 
 Jan  7 21:47:09 vmware1 kernel:  connection23:0: iscsi: detected conn error (1011)
 Jan  7 21:47:10 vmware1 kernel: iscsi: host reset succeeded
 Jan  7 21:47:50 vmware1 kernel:  connection23:0: iscsi: detected conn error (1011)
 Jan  7 21:47:50 vmware1 kernel: iscsi: host reset succeeded
 Jan  7 21:48:00 vmware1 kernel: sd 22:0:0:1: SCSI error: return code = 0x0002
 Jan  7 21:48:00 vmware1 kernel: end_request: I/O error, dev sdw, sector 1494720
 Jan  7 21:48:00 vmware1 kernel: Buffer I/O error on device sdw, logical block 186840
 Jan  7 21:48:00 vmware1 kernel: lost page write due to I/O error on sdw
 
 
 This is due to the fact that open-iscsi doesn't seem to like low-speed (but
 stable) connections to the target.
 
 To reproduce:
 
 1) set up a connection with limited speed between the target and the initiator;
 for example, with openvpn, one would use the --shaper 5 option to limit the
 speed to 50 kB/s.
 
 2) log in to the target from the initiator over this connection (can also be on a LAN)
 
 3) start reading and writing... after some time you will see connection
 errors and host resets, followed by I/O errors and possibly data corruption
 
 
 A harder (but somewhat more realistic) way to reproduce it would be to set up DRBD,
 start background synchronization at high speed (thus leaving not much bandwidth
 for normal writes), and start reading and writing...
 
 
 I can reproduce it with tgtd and IET, so I guess open-iscsi is to blame.
 
 Any ideas what's wrong and why it fails?
 

The SCSI layer sets a timeout on each command. If a command does not
complete within X seconds, the iSCSI error handler (eh) runs.

So you can increase the SCSI command timeout:

To modify the udev rule open /etc/udev/rules.d/50-udev.rules, and find the
following lines:

ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="0|7|14", \
 RUN+="/bin/sh -c 'echo 60 > /sys$$DEVPATH/timeout'"


and you probably want to decrease the number of outstanding commands by 
setting node.session.cmds_max for that session. With 50 kB/s you 
might as well set this to 1 command.
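
A rough back-of-the-envelope sketch of why a 60 second timeout can expire on
such a link. The queue depth of 128 and the 64 KiB per command are
assumptions picked for illustration, not values measured on this setup, and
the 50 kB/s link is treated as 50 KiB/s; drain_seconds is a hypothetical
helper.

```shell
#!/bin/sh
# Estimate how long the last queued command waits if every outstanding
# command has to cross the slow link before it can complete.
# Usage: drain_seconds <outstanding_cmds> <kib_per_cmd> <link_kib_per_sec>
drain_seconds() {
    cmds=$1; kib_per_cmd=$2; link_kib_s=$3
    # Integer division, rounded up to whole seconds.
    echo $(( (cmds * kib_per_cmd + link_kib_s - 1) / link_kib_s ))
}

# Assumed values: 128 outstanding commands of 64 KiB each over ~50 KiB/s.
drain_seconds 128 64 50
# A single outstanding command of the same size.
drain_seconds 1 64 50
```

With these assumed numbers the full queue needs 164 seconds to drain, well
past a 60 second timeout, while a single command needs about 2 seconds.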
