Mike Christie wrote:
> Erez Zilber wrote:
>
>> Mike,
>>
>> We're testing open-iscsi + multipath. In order to make failover faster,
>> we changed the following defaults:
>>
>> node.session.timeo.replacement_timeout = 30
>> node.conn[0].timeo.noop_out_timeout = 5
>>
>
> Is .timeo.noop_out_interval 10?
>

Sorry for the late response (been busy with too many other things). Yes, timeo.noop_out_interval is 10.

>
>> So, we see that ep_disconnect is called and then "session recovery timed
>>
>
> Before you see the ep_disconnect getting called you should see all the
> running commands failed and sent to dm:
>

Yes

> This code in initiator.c should stop the conn, and when that happens,
> libiscsi will fail the running commands to the scsi layer, which should
> fail them to dm right away because failfast is set.
>
>     if (do_stop) {
>         /* state: STATE_CLEANUP_WAIT */
>         if (ipc->stop_conn(session->t->handle, session->id,
>                            conn->id, do_stop)) {
>             log_error("can't stop connection %d:%d (%d)",
>                       session->id, conn->id, errno);
>             delay = 5;
>             goto queue_reopen;
>         }
>         log_debug(3, "connection %d:%d is stopped for recovery",
>                   session->id, conn->id);
>     }
>     conn->session->t->template->ep_disconnect(conn);
>
>> out after 30 secs". After that, we still have to wait more than a minute
>> until the SCSI device becomes offline. For example, if we run sg_map -i
>> -x at that time, it doesn't return until the device becomes offline. We
>>
>
> This is expected. If a command gets sent to the path while the scsi
> layer's eh is running (or if the nop timeout does not catch the problem
> before the scsi command timeout fires) you have to wait up to
> node.session.timeo.replacement_timeout + scsi command timeout for
> commands to be failed.
>

Is it because scsi-ml doesn't handle new commands while eh is running?

>
>> think that this may be due to a timeout in scsi-ml, is it? How can we
>> control it (because failover is really slow now - 1.5-2 minutes)?
>>
>
> If your problem is that there is no IO to the path, you pull a cable,
> then send IO to the path, with your current settings the failover is
> going to take node.session.timeo.replacement_timeout + scsi command
> timeout seconds. On most distros that will be 1.5 minutes (30 sec
> replacement timer and 50 sec scsi timer).
> So set the scsi command timer lower and set the replacement timer lower.

OK. Is it configurable? Where?

> If you search the list, people that have wanted really fast failovers
> and rely on dm's queueing use much lower values than I mentioned in
> the README.
>
> If your problem is that there is IO on the path, you pull a cable, and
> then you do not see those IOs getting failed by the stop conn call
> within noop interval + noop timeout seconds, then there is a bug in the
> iscsi layer. You should turn on debugging and send the output.
>

No, this is not the problem.

Thanks for the very detailed answer.
Erez