Mike Christie wrote:
> Erez Zilber wrote:
>   
>> Mike,
>>
>> We're testing open-iscsi + multipath. In order to make failover faster,
>> we changed the following defaults:
>>
>> node.session.timeo.replacement_timeout = 30
>> node.conn[0].timeo.noop_out_timeout = 5
>>     
>
> Is .timeo.noop_out_interval 10?
>   

Sorry for the late response (been busy with too many other things). Yes,
timeo.noop_out_interval is 10.

>   
>> So, we see that ep_disconnect is called and then "session recovery timed
>>     
>
> Before you see the ep_disconnect getting called you should see all the 
> running commands failed and sent to dm:
>   

Yes

> This code in initiator.c should stop the conn, and when that happens 
> libiscsi will fail the running commands to the scsi layer, which should 
> fail them to dm right away because failfast is set.
>
>          if (do_stop) {
>                  /* state: STATE_CLEANUP_WAIT */
>                  if (ipc->stop_conn(session->t->handle, session->id,
>                                     conn->id, do_stop)) {
>                          log_error("can't stop connection %d:%d (%d)",
>                                    session->id, conn->id, errno);
>                          delay = 5;
>                          goto queue_reopen;
>                  }
>                  log_debug(3, "connection %d:%d is stopped for recovery",
>                            session->id, conn->id);
>          }
>          conn->session->t->template->ep_disconnect(conn);
>
>
>   
>> out after 30 secs". After that, we still have to wait more than a minute
>> until the SCSI device becomes offline. For example, if we run sg_map -i
>> -x at that time, it doesn't return until the device becomes offline. We
>>     
>
> This is expected. If a command gets sent to the path while the scsi 
> layer's eh is running (or if the nop timeout does not catch the problem 
> before the scsi command timeout fires), you have to wait up to 
> node.session.timeo.replacement_timeout + scsi command timeout for 
> commands to be failed.
>   

Is it because scsi-ml doesn't handle new commands while eh is running?

>   
>> think that this may be due to a timeout in scsi-ml, is it? How can we
>> control it (because failover is really slow now - 1.5-2 minutes)?
>>
>>     
>
> If your problem is that there is no IO to the path, you pull a cable, 
> and then send IO to the path, then with your current settings the 
> failover is going to take node.session.timeo.replacement_timeout + 
> scsi command timeout seconds. On most distros that will be 1.5 minutes 
> (30 sec replacement and scsi timer is 50 secs). So set the scsi command 
> timer lower and set the replacement timer lower.

OK. Is it configurable? Where?
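
Just to check that I read the arithmetic correctly, here is a minimal sketch of the worst case described above. The function name is made up for illustration and is not part of open-iscsi; both arguments are in seconds.

```c
/*
 * Sketch only: upper bound on path failover time when IO is first sent
 * to a path whose link is already down. Hypothetical helper, not
 * open-iscsi code.
 */
unsigned int worst_case_failover_secs(unsigned int replacement_timeout,
                                      unsigned int scsi_cmd_timeout)
{
        /* Per the explanation above, the two timers add up. */
        return replacement_timeout + scsi_cmd_timeout;
}
```

Calling worst_case_failover_secs(30, 50) with the figures quoted above returns 80, so lowering either timer lowers the bound by the same amount.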

>  If you search the list, 
> people that have wanted really fast failovers and rely on dm's queueing, 
> use a lot lower values than I mentioned in the README.
>
> If your problem is that there is IO on the path, you pull a cable, and 
> then you do not see those IOs getting failed by the stop conn call 
> within noop interval + noop timeout seconds, then there is a bug in the 
> iscsi layer. You should turn on debugging and send the output.
>   

No, this is not the problem.
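
For reference, the window we checked against can be written as a small sketch. The function name is hypothetical, not an open-iscsi API; the arguments correspond to timeo.noop_out_interval and timeo.noop_out_timeout, in seconds.

```c
/*
 * Sketch only: time within which running IOs should be failed by the
 * stop conn call after the link goes down. Hypothetical helper.
 */
unsigned int noop_detection_window_secs(unsigned int noop_out_interval,
                                        unsigned int noop_out_timeout)
{
        /* noop interval + noop timeout, as stated in the quoted text. */
        return noop_out_interval + noop_out_timeout;
}
```

With our settings (interval 10, timeout 5), noop_detection_window_secs(10, 5) returns 15.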

Thanks for the very detailed answer.

Erez

