iSCSI Failover time too long

Akshay Lal Wed, 08 Jul 2009 23:07:40 -0700

Devs,

    We seem to be having an issue with iSCSI failover time. The end
goal is to force a failover to an alternate path, as defined by
dm-multipath, within 10 seconds.


Distro: CentOS
Kernel version: 2.6.29.5
dm-multipath version: device-mapper-multipath-0.4.7-17.el5
iscsid version: iscsi-initiator-utils-6.2.0.868-0.7.el5

We have dm-multipath installed and configured with the following
configurations:
                udev_dir                     /dev
                polling_interval            3
                selector                      "round-robin 0"
                path_grouping_policy   failover
                getuid_callout             "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout                 /bin/true
                path_checker              tur
                rr_min_io                    10
                max_fds                     8192
                rr_weight                    uniform
                failback                      manual
                no_path_retry             fail
                user_friendly_names   yes

We have also modified scsi PDU timeout:
                ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="0|7|14", \
                                RUN+="/bin/sh -c 'echo 60 > /sys$$DEVPATH/timeout'"
in the /etc/udev/rules.d/50-udev.rules.

We have also modified some parameters in /etc/iscsi/iscsid.conf:
                node.session.timeo.replacement_timeout = 5
                node.conn[0].timeo.login_timeout = 5
                node.conn[0].timeo.logout_timeout = 5
                node.conn[0].timeo.noop_out_interval = 5
                node.conn[0].timeo.noop_out_timeout = 10

Given the above configuration the failover takes place in 2 minutes.


After reading a few posts on this group, I tried changing the scsi
PDU timeout and node.session.timeo.replacement_timeout, but that
still didn't change the failover time.

However, if I set the scsi PDU timeout to 3,
node.conn[0].timeo.noop_out_interval = 1 and
node.conn[0].timeo.noop_out_timeout = 2, we get a failover in about
65 seconds (that's still too long for our purposes).

/etc/iscsi/iscsid.conf:
                node.session.timeo.replacement_timeout = 120
                node.conn[0].timeo.login_timeout = 15
                node.conn[0].timeo.logout_timeout = 15
                node.conn[0].timeo.noop_out_interval = 1
                node.conn[0].timeo.noop_out_timeout = 2

/etc/udev/rules.d/50-udev.rules:
                ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="0|7|14", \
                                RUN+="/bin/sh -c 'echo 60 > /sys$$DEVPATH/timeout'"

Not sure why these values actually cause a difference in the failover
time, but apparently changing any other parameter doesn't really help.
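
Since it is unclear which settings actually take effect, one way to
check is to read the values the kernel is currently using. The
following is only a minimal sketch, assuming the usual Linux sysfs
layout: /sys/block/sdX/device/timeout for the per-device timeout that
the udev rule writes, and /sys/class/iscsi_session/sessionN/recovery_tmo
for the session's replacement timeout. The function name
read_effective_timeouts is hypothetical and was not part of the
original setup.

```python
from pathlib import Path


def read_effective_timeouts(sysfs_root: str = "/sys") -> dict:
    """Collect SCSI device timeouts and iSCSI recovery timeouts from sysfs.

    The sysfs paths are assumptions about a typical Linux layout.
    """
    root = Path(sysfs_root)
    values = {}
    # Per-device timeout, i.e. the file the udev rule echoes a value into.
    for path in sorted(root.glob("block/sd*/device/timeout")):
        values[path.relative_to(root).as_posix()] = path.read_text().strip()
    # Per-session recovery timeout (node.session.timeo.replacement_timeout).
    for path in sorted(root.glob("class/iscsi_session/session*/recovery_tmo")):
        values[path.relative_to(root).as_posix()] = path.read_text().strip()
    return values


if __name__ == "__main__":
    for name, value in read_effective_timeouts().items():
        print(f"{name}: {value}")
```

If the printed values differ from those in the configuration files,
the edited settings have probably not been applied to the running
session or device.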

/var/log/messages:
When the cable (power) is pulled from the primary:
     Jul  8 15:38:23 cschi-mbxdsg-0226.cleversafelabs.com kernel: connection2:0: ping timeout of 2 secs expired, last rx 4295169719, last ping 4295170719, now 4295172719
     Jul  8 15:38:23 cschi-mbxdsg-0226.cleversafelabs.com kernel: connection2:0: detected conn error (1011)
     Jul  8 15:38:24 cschi-mbxdsg-0226.cleversafelabs.com iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
.
.
No messages for a bit and then: (the failover occurs at this point)
     Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd 2:0:0:0: timing out command, waited 18s
     Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd 2:0:0:0: [sdb] Unhandled error code
     Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd 2:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK,SUGGEST_OK
     Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: end_request: I/O error, dev sdb, sector 1599400
     Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: device-mapper: multipath: Failing path 8:16.
     Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd 2:0:0:0: timing out command, waited 18s
     Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd 2:0:0:0: [sdb] Unhandled error code
     Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd 2:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK,SUGGEST_OK
     Jul  8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: end_request: I/O error, dev sdb, sector 1600424
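
To put a number on the gap in the log excerpt, the timestamp of the
ping timeout can be subtracted from the timestamp of the "Failing
path" message. A small sketch of that arithmetic follows; the function
name seconds_between is hypothetical, and the year is an assumption
because syslog timestamps do not carry one.

```python
from datetime import datetime


def seconds_between(first: str, second: str, year: int = 2009) -> float:
    """Return the seconds elapsed between two syslog-style timestamps."""
    fmt = "%Y %b %d %H:%M:%S"
    # Syslog pads single-digit days with an extra space; split() and
    # join() collapse it so strptime sees single spaces only.
    t1 = datetime.strptime(f"{year} " + " ".join(first.split()), fmt)
    t2 = datetime.strptime(f"{year} " + " ".join(second.split()), fmt)
    return (t2 - t1).total_seconds()


# Ping timeout vs. "Failing path 8:16" in the excerpt.
gap = seconds_between("Jul  8 15:38:23", "Jul  8 15:39:35")
print(gap)  # 72.0
```

For this excerpt the gap is 72 seconds between the connection error
being detected and dm-multipath failing the path.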


Any clues on how we can reduce this failover time would be
appreciated.


--

Akshay Lal

