Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.

2008-05-19 Thread Mike Christie

Konrad Rzeszutek wrote:
> On Thu, Apr 17, 2008 at 06:07:12AM -0400, Konrad Rzeszutek wrote:
>>> It looks like the network is off but the session is still running. We 
>>> eventually get to the kernel shutoff here. Is your init script getting 
>>> run? If not then run it. If you left the session on on purpose then you 
>>> cannot turn the network off because the scsi layer will want to do its 
>>> shutdown when the kernel is stopped.
>> Ah. Thanks for the explanation. The init script was run, but it didn't
>> log off all the sessions (it would selectively log off instead of doing
>> all of them).
> 
> After I made sure that 'iscsiadm -m session -U all' was called during shutdown
> a QA engineer here was able to make the 'iscsiadm' hang during this sequence.
> 
> The result was that some of the iSCSI sessions did log out while some others
> did not, and the machine hung during the "Synchronizing SCSI cache for disk .."

I am not sure whether you meant that you are running iscsiadm -m session
-U all and still hanging, and if so, where. To make sure you are not
hitting the cache sync bug, let me try to separate all the shutdown bugs
we have seen :)

Notes:
iscsiadm will not return until all sessions you asked to logout are 
stopped and the sync caches are completed, but iscsiadm ignores sessions 
marked as being used for boot.

If iscsiadm is run during the iscsi init script's shutdown while the
network is already down, we will hang and no session should get logged
out.

If the network is up and iscsiadm is hanging, then it is probably the
cache sync iscsiadm hang, caused by us using the delete sysfs file and
expecting it not to return until the cache sync has completed. In this
case some sessions could get logged out correctly and others could hang.
It is a race, so it is fun and not always predictable :)

If you are hanging when the kernel shuts down (when the driver model
shutdown functions get called, which in the sd.c case will send the cache
sync), then if your system is not set up we will hang on the cache sync
because iscsid is not up to handle the connection error that results
when the scsi eh fails. We can also hang if the network is up right at
this time but we experience a connection error or timeout, or the target
drops the connection for whatever reason it wants to. In this case we
hit the same problem: we send a connection error to userspace but
iscsid is not there. In older kernels the scsi command would time out,
and in newer kernels iscsi_eh_cmd_timed_out will keep resetting the
command timer.
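
To keep the cases above apart when triaging a report, here is a minimal C
sketch of the same decision tree. The enum and function names are made up
for illustration (they are not open-iscsi identifiers), and it only encodes
what the notes above say, nothing more.

```c
#include <stdbool.h>

/* Where the hang is observed (hypothetical names, for illustration only). */
enum hang_point {
    HANG_IN_ISCSIADM,        /* iscsiadm run from the init script never returns */
    HANG_IN_KERNEL_SHUTDOWN  /* hang at "Synchronizing SCSI cache for disk" */
};

/* Likely cause according to the notes above. */
enum hang_cause {
    CAUSE_NETWORK_DOWN_BEFORE_LOGOUT,   /* no session gets logged out */
    CAUSE_SYSFS_DELETE_CACHE_SYNC_RACE, /* some sessions log out, others hang */
    CAUSE_NO_ISCSID_FOR_CONN_ERROR      /* conn error sent to userspace, nobody listening */
};

static enum hang_cause classify_hang(enum hang_point where, bool network_up)
{
    if (where == HANG_IN_ISCSIADM)
        return network_up ? CAUSE_SYSFS_DELETE_CACHE_SYNC_RACE
                          : CAUSE_NETWORK_DOWN_BEFORE_LOGOUT;

    /* At kernel shutdown both variants (network down, or network up with
     * a connection error/timeout) end in the same place: iscsid is gone. */
    return CAUSE_NO_ISCSID_FOR_CONN_ERROR;
}
```

Note that the network state only matters for the iscsiadm case; at kernel
shutdown the answer is the same either way.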


> My thought was that since the logic of the SCSI reset is to call first
> 'iscsi_eh_abort', then 'iscsi_eh_device_reset' and then finally
> 'iscsi_eh_host_reset' we could in iscsi_eh_host_reset check the tmf_state
> as so:
> 
> 
> diff --git a/kernel/libiscsi.c b/kernel/libiscsi.c
> index f8f4cf9..a5b5b37 100644
> --- a/kernel/libiscsi.c
> +++ b/kernel/libiscsi.c
> @@ -1163,7 +1163,8 @@ failed:
>          wait_event_interruptible(conn->ehwait,
>                                   session->state == ISCSI_STATE_TERMINATE ||
>                                   session->state == ISCSI_STATE_LOGGED_IN ||
> -                                 session->state == ISCSI_STATE_RECOVERY_FAILED);
> +                                 session->state == ISCSI_STATE_RECOVERY_FAILED ||
> +                                 conn->tmf_state == TMF_TIMEDOUT);
>          if (signal_pending(current))
>                  flush_signals(current);
> 
> That does make it possible for the SCSI reset sequence to finish off and the
> machine reboots fine. It does not change the behavior during a normal run of
> iSCSI (ie, unplugging a cable for couple of seconds, minutes, etc).

Are you sure about this? I do not think you tried all the timing 
scenarios or had IO running when testing.

> 
> But I have this gnawing feeling that I am missing something here ...
>

The problem with the patch is that if we return SUCCESS from 
iscsi_eh_host_reset then we tell the scsi layer that the problem is 
solved and to retry IO if it can. If we return FAILED then we tell the 
scsi layer we are kaput and it will fail IO.

For a non-shutdown run, we want to wait replacement_timeout seconds for
the session to come back. If the connection problem is detected after
the command timer has fired, so that iscsi_eh_cmd_timed_out cannot check
the status of the connection, then we send TMFs, they time out, and with
your patch we immediately fail the IO. Or, if there is a connection
error while sending TMFs and this causes one to time out, then again
with your patch we fail the IO too soon.

Also, for the shutdown case, the patch does not handle the situation
where the connection problem is detected before the command times out.
In this case iscsi_eh_cmd_timed_out is going to keep asking scsi-ml to
reset the command timer.
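
A rough model of the first problem, as a sketch only. The enums below are
simplified stand-ins for session->state and conn->tmf_state, and
model_host_reset_result() assumes that iscsi_eh_host_reset returns SUCCESS
only if the session is logged in again when the wait ends; check that
against your tree.

```c
#include <stdbool.h>

/* Simplified stand-ins, not the kernel's own definitions. */
enum model_session_state {
    MODEL_STATE_LOGGED_IN,
    MODEL_STATE_IN_RECOVERY,     /* connection dropped, relogin still possible */
    MODEL_STATE_RECOVERY_FAILED, /* replacement_timeout expired */
    MODEL_STATE_TERMINATE
};

enum model_tmf_state { MODEL_TMF_INITIAL, MODEL_TMF_QUEUED, MODEL_TMF_TIMEDOUT };

enum model_eh_result { MODEL_EH_SUCCESS, MODEL_EH_FAILED };

/* Wait condition in iscsi_eh_host_reset, with and without the proposed patch. */
static bool model_wait_is_over(enum model_session_state s,
                               enum model_tmf_state t, bool patched)
{
    if (s == MODEL_STATE_TERMINATE || s == MODEL_STATE_LOGGED_IN ||
        s == MODEL_STATE_RECOVERY_FAILED)
        return true;
    return patched && t == MODEL_TMF_TIMEDOUT;
}

/* Assumption: anything other than "logged in again" is reported as FAILED,
 * which tells the scsi layer to fail the IO. */
static enum model_eh_result model_host_reset_result(enum model_session_state s)
{
    return s == MODEL_STATE_LOGGED_IN ? MODEL_EH_SUCCESS : MODEL_EH_FAILED;
}
```

With the session still in recovery and a timed-out TMF, the patched wait
ends at once and the result is FAILED, so the IO is failed without waiting
replacement_timeout seconds for the session to come back.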

My original solution was to call iscsi_block_session when 
iscsi_conn_failure/ISCSI_SUSPEND_BIT is run. We can do this in newer 
kernels, but for the compat modules this will not work b

Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.

2008-05-17 Thread Tomasz Chmielewski

Konrad Rzeszutek wrote:
>> "Synchronizing SCSI cache for disk" happens because:
>>
>> - iSCSI sessions were not properly disconnected, and
> 
> Correct.
> 
>> - they can't be properly disconnected any more, because the network is 
>> already disabled.
> 
> Kind of. There is a kernel timer that gets activated during the logout
> sequence that waits for up to 120 seconds (or what you have set in
> node.session.timeo.replacement_timeout) and if the logout sequence hasn't
> completed releases the kernel resources.
> 
>> Most distributions shut down all network interfaces when a "halt" 
>> command is started (i.e., they add "-i" option to the halt command):
>>
>>  -i: shut down all network interfaces.
>>
>> Without this flag, everything should shut down properly, even when it's 
> 
> Right. And this situation will hang the kernel during reboot b/c the
> SCSI error handlers wait for a logout state condition that never happens.
> 
>> not possible to logout all sessions earlier (i.e., a diskless machine 
>> started off iSCSI).
> 
> And the patch I attached in the previous e-mail describes a solution
> to this.

BTW, a similar hack (not disabling the network) is also needed when we
reboot the system using kexec (without that patch, that is):

  -x, --no-ifdown  Don't bring down network interfaces.


-- 
Tomasz Chmielewski
http://wpkg.org

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.

2008-05-16 Thread Konrad Rzeszutek

> "Synchronizing SCSI cache for disk" happens because:
> 
> - iSCSI sessions were not properly disconnected, and

Correct.

> - they can't be properly disconnected any more, because the network is 
> already disabled.

Kind of. There is a kernel timer that gets activated during the logout
sequence. It waits for up to 120 seconds (or whatever you have set in
node.session.timeo.replacement_timeout) and, if the logout sequence hasn't
completed by then, releases the kernel resources.
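
As a rough sketch of that timer's decision (a simplified model, not the
kernel code; the function name and the logout_completed flag are made up
for illustration, and 120 is the value mentioned above):

```c
#include <stdbool.h>

/* Value mentioned above; normally taken from
 * node.session.timeo.replacement_timeout. */
#define DEFAULT_REPLACEMENT_TIMEOUT_SECS 120u

/* Hypothetical helper: true once the timer would stop waiting and
 * release the kernel resources. */
static bool timer_gives_up(unsigned int elapsed_secs,
                           unsigned int timeout_secs,
                           bool logout_completed)
{
    return !logout_completed && elapsed_secs >= timeout_secs;
}
```

For example, timer_gives_up(119, DEFAULT_REPLACEMENT_TIMEOUT_SECS, false)
is false, while the same call with 120 elapsed seconds is true.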

> 
> Most distributions shut down all network interfaces when a "halt" 
> command is started (i.e., they add "-i" option to the halt command):
> 
>  -i: shut down all network interfaces.
> 
> Without this flag, everything should shut down properly, even when it's 

Right. And this situation will hang the kernel during reboot b/c the
SCSI error handlers wait for a logout state condition that never happens.

> not possible to logout all sessions earlier (i.e., a diskless machine 
> started off iSCSI).

And the patch I attached in the previous e-mail describes a solution
to this.




Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.

2008-05-16 Thread Tomasz Chmielewski

Konrad Rzeszutek wrote:
> On Thu, Apr 17, 2008 at 06:07:12AM -0400, Konrad Rzeszutek wrote:
>>> It looks like the network is off but the session is still running. We 
>>> eventually get to the kernel shutoff here. Is your init script getting 
>>> run? If not then run it. If you left the session on on purpose then you 
>>> cannot turn the network off because the scsi layer will want to do its 
>>> shutdown when the kernel is stopped.
>> Ah. Thanks for the explanation. The init script was run, but it didn't
>> log off all the sessions (it would selectively log off instead of doing
>> all of them).
> 
> After I made sure that 'iscsiadm -m session -U all' was called during shutdown
> a QA engineer here was able to make the 'iscsiadm' hang during this sequence.
> 
> The result was that some of the iSCSI sessions did log out while some others
> did not, and the machine hung during the "Synchronizing SCSI cache for disk .."

I didn't follow the thread very closely, but a hang during 
"Synchronizing SCSI cache for disk" happens because:

- iSCSI sessions were not properly disconnected, and
- they can't be properly disconnected any more, because the network is 
already disabled.

Most distributions shut down all network interfaces when a "halt" 
command is started (i.e., they add "-i" option to the halt command):

 -i: shut down all network interfaces.

Without this flag, everything should shut down properly, even when it's 
not possible to logout all sessions earlier (i.e., a diskless machine 
started off iSCSI).


-- 
Tomasz Chmielewski
http://wpkg.org




Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.

2008-05-16 Thread Konrad Rzeszutek

On Thu, Apr 17, 2008 at 06:07:12AM -0400, Konrad Rzeszutek wrote:
> > It looks like the network is off but the session is still running. We 
> > eventually get to the kernel shutoff here. Is your init script getting 
> > run? If not then run it. If you left the session on on purpose then you 
> > cannot turn the network off because the scsi layer will want to do its 
> > shutdown when the kernel is stopped.
> 
> Ah. Thanks for the explanation. The init script was run, but it didn't
> log off all the sessions (it would selectively log off instead of doing
> all of them).

After I made sure that 'iscsiadm -m session -U all' was called during shutdown
a QA engineer here was able to make the 'iscsiadm' hang during this sequence.

The result was that some of the iSCSI sessions did log out while some others
did not, and the machine hung during the "Synchronizing SCSI cache for disk .."

We can't reproduce it anymore but I started digging in the code to see if there
is a way to shield ourselves if this happens again.

The hang happens when iscsi_eh_host_reset waits for this condition to become
true:

        debug_scsi("iscsi_eh_host_reset wait for relogin\n");
        wait_event_interruptible(conn->ehwait,
                                 session->state == ISCSI_STATE_TERMINATE ||
                                 session->state == ISCSI_STATE_LOGGED_IN ||
                                 session->state == ISCSI_STATE_RECOVERY_FAILED);

The logic that would change the state to ISCSI_STATE_RECOVERY_FAILED is
in 'iscsi_session_recovery_timedout', which gets put on a queue and is
eventually executed (if the timeout expires and if the connection has truly
failed) only when user-land sends an ISCSI_UEVENT_STOP_CONN IPC event.

But during a reboot sequence wherein iscsiadm gets wedged, this IPC event
never gets to the kernel and the machine waits for a state change that
never happens.
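
A toy model of that dependency, as a sketch under the assumptions stated
above. The struct, the state names and the sim_* helpers are hypothetical
stand-ins, not the libiscsi code:

```c
#include <stdbool.h>

enum sim_state { SIM_LOGGED_IN, SIM_FAILED, SIM_RECOVERY_FAILED, SIM_TERMINATE };

struct sim_session {
    enum sim_state state;
    bool recovery_timer_queued; /* set only by the STOP_CONN event */
};

/* Stand-in for user-land delivering ISCSI_UEVENT_STOP_CONN. */
static void sim_stop_conn_event(struct sim_session *s)
{
    s->recovery_timer_queued = true;
}

/* Stand-in for the recovery timeout expiring: it can only move the
 * session to "recovery failed" if the timer was queued at all. */
static void sim_timeout_expires(struct sim_session *s)
{
    if (s->recovery_timer_queued && s->state == SIM_FAILED)
        s->state = SIM_RECOVERY_FAILED;
}

/* Mirrors the wait condition quoted above. */
static bool sim_host_reset_can_return(const struct sim_session *s)
{
    return s->state == SIM_TERMINATE || s->state == SIM_LOGGED_IN ||
           s->state == SIM_RECOVERY_FAILED;
}
```

If sim_stop_conn_event() is never called, sim_timeout_expires() has no
effect and sim_host_reset_can_return() stays false, which is the hang.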

However, there is another timer that does get executed before
iscsi_eh_host_reset gets called: iscsi_eh_abort calls
iscsi_exec_task_mgmt_fn, which sets up an 'iscsi_tmf_timedout' timer
which, when it expires, sets the tmf_state:

        if (conn->tmf_state == TMF_QUEUED) {
                conn->tmf_state = TMF_TIMEDOUT;
                debug_scsi("tmf timedout\n");
                /* unblock eh_abort() */
                wake_up(&conn->ehwait);
        }
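
The same guard as a user-space sketch, to show that only a TMF that is
still queued gets marked as timed out. The sim_* names and the wakeups
counter are stand-ins for illustration; the real code wakes conn->ehwait
instead of counting:

```c
enum sim_tmf_state {
    SIM_TMF_INITIAL,
    SIM_TMF_QUEUED,
    SIM_TMF_SUCCESS,
    SIM_TMF_TIMEDOUT
};

struct sim_conn {
    enum sim_tmf_state tmf_state;
    int wakeups; /* counts modelled wake_up() calls on ehwait */
};

/* Modelled timer callback: a TMF that already completed is left alone. */
static void sim_tmf_timedout(struct sim_conn *conn)
{
    if (conn->tmf_state == SIM_TMF_QUEUED) {
        conn->tmf_state = SIM_TMF_TIMEDOUT;
        conn->wakeups++;
    }
}
```

So a timer firing late, after the TMF response has already arrived, neither
changes the state nor wakes anyone.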

My thought was that, since the logic of the SCSI reset is to call
'iscsi_eh_abort' first, then 'iscsi_eh_device_reset' and finally
'iscsi_eh_host_reset', we could check the tmf_state in iscsi_eh_host_reset
like so:


diff --git a/kernel/libiscsi.c b/kernel/libiscsi.c
index f8f4cf9..a5b5b37 100644
--- a/kernel/libiscsi.c
+++ b/kernel/libiscsi.c
@@ -1163,7 +1163,8 @@ failed:
         wait_event_interruptible(conn->ehwait,
                                  session->state == ISCSI_STATE_TERMINATE ||
                                  session->state == ISCSI_STATE_LOGGED_IN ||
-                                 session->state == ISCSI_STATE_RECOVERY_FAILED);
+                                 session->state == ISCSI_STATE_RECOVERY_FAILED ||
+                                 conn->tmf_state == TMF_TIMEDOUT);
         if (signal_pending(current))
                 flush_signals(current);

That does make it possible for the SCSI reset sequence to finish, and the
machine reboots fine. It does not change the behavior during a normal run of
iSCSI (i.e., unplugging a cable for a couple of seconds, minutes, etc.).

But I have this gnawing feeling that I am missing something here ...
 




Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.

2008-04-17 Thread Konrad Rzeszutek

> It looks like the network is off but the session is still running. We 
> eventually get to the kernel shutoff here. Is your init script getting 
> run? If not then run it. If you left the session on on purpose then you 
> cannot turn the network off because the scsi layer will want to do its 
> shutdown when the kernel is stopped.

Ah. Thanks for the explanation. The init script was run, but it didn't
log off all the sessions (it would selectively log off instead of doing
all of them).




Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.

2008-04-17 Thread Mike Christie

Konrad Rzeszutek wrote:
> Firstly, I haven't dug in this yet but this is more of a call: 
> "have-you-seen-this-too?"
> 

This is probably on the list 20 times :)

> When I reboot the machine without logging off from iSCSI targets I can
> hang the reboot sequence. This is with 869-rc4 userspace, SLES 10 SP2 Beta
> kernel, with 869-rc4 kernels compiled out of tree. (With a SLES 10 SP2 Beta
> kernel, which has a back-port of 868-rc1, I get the same bug)
>
> I enabled the debugging in the kernel (DEBUG_SCSI) and added a dump_stack()
> in the iscsi_check_transport_timeouts, and this is what I get:
>

The timer is still running because the session is.

> iscsi: Sending nopout as ping on conn 88007a0b8a50
> iscsi: Setting next tmo 4294974247
> iscsi: mtask deq [cid 0 itt 0xa06]
> iscsi: mgmtpdu [op 0x0 hdr->itt 0xa06 datalen 0]
> Sending SIGKILL to all processes.
> Please stand by while rebooting the system.
> md: stopping all md devices.
> Synchronizing SCSI cache for disk sdl: 


It looks like the network is off but the session is still running. We 
eventually get to the kernel shutoff here. Is your init script getting 
run? If not then run it. If you left the session on on purpose then you 
cannot turn the network off because the scsi layer will want to do its 
shutdown when the kernel is stopped.
