Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.
Konrad Rzeszutek wrote:
> On Thu, Apr 17, 2008 at 06:07:12AM -0400, Konrad Rzeszutek wrote:
>>> It looks like the network is off but the session is still running. We
>>> eventually get to the kernel shutoff here. Is your init script getting
>>> run? If not then run it. If you left the session on on purpose then you
>>> cannot turn the network off because the scsi layer will want to do its
>>> shutdown when the kernel is stopped.
>> Ah. Thanks for the explanation. The init script was run, but it didn't
>> log off all of the sessions (it would selectively log off instead of
>> doing all of them).
>
> After I made sure that 'iscsiadm -m session -U all' was called during
> shutdown a QA engineer here was able to make 'iscsiadm' hang during this
> sequence.
>
> The result was that some of the iSCSI sessions did log out while some
> others did not, and the machine hung during the "Synchronizing SCSI cache
> for disk .."

I am not sure whether you meant that you are running
'iscsiadm -m session -U all' and still hanging and, if so, where. So, to
make sure you are not hitting the cache sync bug, let me try to separate
all the shutdown bugs we have seen :)

Notes: iscsiadm will not return until all the sessions you asked to log out
are stopped and the cache syncs are completed, but iscsiadm ignores
sessions marked as being used for boot.

If iscsiadm is run during the iscsi init script shutdown and the network is
down, we will hang and no session should get logged off. If the network is
up and iscsiadm is hanging, then it is probably the cache sync iscsiadm
hang, caused by us using the delete sysfs file and expecting it not to
return until the cache sync has completed. In this case some sessions could
get logged out correctly and others could hang.
It is a race so it is fun and not always predictable :)

If you are hanging when the kernel shuts down (when the driver model
shutdown functions get called, which in the sd.c case will send the cache
sync), then if your system is not set up we will hang on the cache sync,
because iscsid is not up to handle the connection error that results when
the scsi eh fails. We can also hang if the network is up right at this
time but we experience a connection error or timeout, or the target drops
the connection for whatever reason it wants to. In this case we hit the
same problem: we send a connection error to userspace but iscsid is not
there. In older kernels the scsi command would time out, and in newer
kernels iscsi_eh_cmd_timed_out will keep resetting the command timer.

> My thought was that since the logic of the SCSI reset is to call first
> 'iscsi_eh_abort', then 'iscsi_eh_device_reset' and then finally
> 'iscsi_eh_host_reset' we could in iscsi_eh_host_reset check the tmf_state
> as so:
>
> diff --git a/kernel/libiscsi.c b/kernel/libiscsi.c
> index f8f4cf9..a5b5b37 100644
> --- a/kernel/libiscsi.c
> +++ b/kernel/libiscsi.c
> @@ -1163,7 +1163,8 @@ failed:
>      wait_event_interruptible(conn->ehwait,
>              session->state == ISCSI_STATE_TERMINATE ||
>              session->state == ISCSI_STATE_LOGGED_IN ||
> -            session->state == ISCSI_STATE_RECOVERY_FAILED);
> +            session->state == ISCSI_STATE_RECOVERY_FAILED ||
> +            conn->tmf_state == TMF_TIMEDOUT);
>      if (signal_pending(current))
>          flush_signals(current);
>
> That does make it possible for the SCSI reset sequence to finish off and
> the machine reboots fine. It does not change the behavior during a normal
> run of iSCSI (ie, unplugging a cable for a couple of seconds, minutes,
> etc).

Are you sure about this? I do not think you tried all the timing scenarios
or had IO running when testing.

> But I have this gnawing feeling that I am missing something here ...
>

The problem with the patch is that if we return SUCCESS from
iscsi_eh_host_reset then we tell the scsi layer that the problem is solved
and to retry IO if it can. If we return FAILED then we tell the scsi layer
we are kaput and it will fail IO.

For a non-shutdown run we want to wait for replacement_timeout seconds for
the session to come back. If the connection problem was detected after the
command timer fired, so that iscsi_eh_cmd_timed_out cannot check the status
of the connection, then we send TMFs, they will time out, and with your
patch we immediately fail the IO. Or if there is a connection error while
sending TMFs and this causes one to time out, then again with your patch we
fail the IO too soon.

Also, for the shutdown case, it does not handle the situation where the
connection problem is detected before the command times out. In this case
iscsi_eh_cmd_timed_out is going to keep asking scsi-ml to reset the command
timer.

My original solution was to call iscsi_block_session when
iscsi_conn_failure/ISCSI_SUSPEND_BIT is run. We can do this in newer
kernels, but for the compat modules this will not work b
Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.
Konrad Rzeszutek wrote:
>> "Synchronizing SCSI cache for disk" happens because:
>>
>> - iSCSI sessions were not properly disconnected, and
>
> Correct.
>
>> - they can't be properly disconnected any more, because the network is
>> already disabled.
>
> Kind of. There is a kernel timer that gets activated during the logout
> sequence that waits for up to 120 seconds (or what you have set in
> node.session.timeo.replacement_timeout) and if the logout sequence hasn't
> completed releases the kernel resources.
>
>> Most distributions shut down all network interfaces when a "halt"
>> command is started (i.e., they add "-i" option to the halt command):
>>
>> -i: shut down all network interfaces.
>>
>> Without this flag, everything should shut down properly, even when it's
>
> Right. And this situation will hang the kernel during reboot b/c the
> SCSI error handlers wait for a logout state condition that never happens.
>
>> not possible to log out of all sessions earlier (i.e., a diskless machine
>> started off iSCSI).
>
> And the patch I attached in the previous e-mail describes a solution
> to this.

BTW, a similar hack (not disabling the network) is also needed when we
reboot the system using kexec (without that patch, that is):

  -x, --no-ifdown  Don't bring down network interfaces.

--
Tomasz Chmielewski
http://wpkg.org

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---
Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.
> "Synchronizing SCSI cache for disk" happens because:
>
> - iSCSI sessions were not properly disconnected, and

Correct.

> - they can't be properly disconnected any more, because the network is
> already disabled.

Kind of. There is a kernel timer that gets activated during the logout
sequence that waits for up to 120 seconds (or what you have set in
node.session.timeo.replacement_timeout) and, if the logout sequence hasn't
completed by then, releases the kernel resources.

> Most distributions shut down all network interfaces when a "halt"
> command is started (i.e., they add "-i" option to the halt command):
>
> -i: shut down all network interfaces.
>
> Without this flag, everything should shut down properly, even when it's

Right. And this situation will hang the kernel during reboot b/c the
SCSI error handlers wait for a logout state condition that never happens.

> not possible to log out of all sessions earlier (i.e., a diskless machine
> started off iSCSI).

And the patch I attached in the previous e-mail describes a solution
to this.
Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.
Konrad Rzeszutek wrote:
> On Thu, Apr 17, 2008 at 06:07:12AM -0400, Konrad Rzeszutek wrote:
>>> It looks like the network is off but the session is still running. We
>>> eventually get to the kernel shutoff here. Is your init script getting
>>> run? If not then run it. If you left the session on on purpose then you
>>> cannot turn the network off because the scsi layer will want to do its
>>> shutdown when the kernel is stopped.
>> Ah. Thanks for the explanation. The init script was run, but it didn't
>> log off all of the sessions (it would selectively log off instead of
>> doing all of them).
>
> After I made sure that 'iscsiadm -m session -U all' was called during
> shutdown a QA engineer here was able to make 'iscsiadm' hang during this
> sequence.
>
> The result was that some of the iSCSI sessions did log out while some
> others did not, and the machine hung during the "Synchronizing SCSI cache
> for disk .."

I didn't follow the thread very closely, but a hang during "Synchronizing
SCSI cache for disk" happens because:

- iSCSI sessions were not properly disconnected, and
- they can't be properly disconnected any more, because the network is
  already disabled.

Most distributions shut down all network interfaces when a "halt" command
is started (i.e., they add the "-i" option to the halt command):

  -i: shut down all network interfaces.

Without this flag, everything should shut down properly, even when it's not
possible to log out of all sessions earlier (i.e., a diskless machine
started off iSCSI).

--
Tomasz Chmielewski
http://wpkg.org
Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.
On Thu, Apr 17, 2008 at 06:07:12AM -0400, Konrad Rzeszutek wrote:
> > It looks like the network is off but the session is still running. We
> > eventually get to the kernel shutoff here. Is your init script getting
> > run? If not then run it. If you left the session on on purpose then you
> > cannot turn the network off because the scsi layer will want to do its
> > shutdown when the kernel is stopped.
>
> Ah. Thanks for the explanation. The init script was run, but it didn't
> log off all of the sessions (it would selectively log off instead of doing
> all of them).

After I made sure that 'iscsiadm -m session -U all' was called during
shutdown, a QA engineer here was able to make 'iscsiadm' hang during this
sequence.

The result was that some of the iSCSI sessions did log out while some
others did not, and the machine hung during the "Synchronizing SCSI cache
for disk .."

We can't reproduce it anymore, but I started digging in the code to see if
there is a way to shield ourselves if this happens again. The hang happens
when iscsi_eh_host_reset waits for this condition to become true:

    debug_scsi("iscsi_eh_host_reset wait for relogin\n");
    wait_event_interruptible(conn->ehwait,
            session->state == ISCSI_STATE_TERMINATE ||
            session->state == ISCSI_STATE_LOGGED_IN ||
            session->state == ISCSI_STATE_RECOVERY_FAILED);

The logic that would change the state to ISCSI_STATE_RECOVERY_FAILED is in
'iscsi_session_recovery_timedout', which gets put on the queue and
eventually executed (if the timeout expires and if the connection is truly
failed) only when user-land sends an ISCSI_UEVENT_STOP_CONN IPC event. But
during a reboot sequence in which iscsiadm gets wedged, this IPC event
never gets to the kernel and the machine waits for a state change that
never happens.
However, there is another timer that does get executed before
iscsi_eh_host_reset gets called: iscsi_eh_abort calls
iscsi_exec_task_mgmt_fn, which sets up an 'iscsi_tmf_timedout' timer that,
when it expires, sets the tmf_state:

    if (conn->tmf_state == TMF_QUEUED) {
        conn->tmf_state = TMF_TIMEDOUT;
        debug_scsi("tmf timedout\n");
        /* unblock eh_abort() */
        wake_up(&conn->ehwait);
    }

My thought was that since the logic of the SCSI reset is to call first
'iscsi_eh_abort', then 'iscsi_eh_device_reset' and then finally
'iscsi_eh_host_reset', we could check the tmf_state in iscsi_eh_host_reset
as so:

diff --git a/kernel/libiscsi.c b/kernel/libiscsi.c
index f8f4cf9..a5b5b37 100644
--- a/kernel/libiscsi.c
+++ b/kernel/libiscsi.c
@@ -1163,7 +1163,8 @@ failed:
     wait_event_interruptible(conn->ehwait,
             session->state == ISCSI_STATE_TERMINATE ||
             session->state == ISCSI_STATE_LOGGED_IN ||
-            session->state == ISCSI_STATE_RECOVERY_FAILED);
+            session->state == ISCSI_STATE_RECOVERY_FAILED ||
+            conn->tmf_state == TMF_TIMEDOUT);
     if (signal_pending(current))
         flush_signals(current);

That does make it possible for the SCSI reset sequence to finish off and
the machine reboots fine. It does not change the behavior during a normal
run of iSCSI (ie, unplugging a cable for a couple of seconds, minutes,
etc).

But I have this gnawing feeling that I am missing something here ...
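As a rough illustration of the wait condition and the proposed change
discussed above, here is a minimal userspace sketch in C. It is not kernel
code: the model_conn struct, the M_* enum values and both predicate
functions are hypothetical stand-ins that only mirror the names quoted in
the thread, and M_STATE_FAILED is an assumed placeholder for "connection
lost, recovery still pending".

```c
#include <stdbool.h>

/* Hypothetical model of the session states named in the thread. */
enum model_session_state {
    M_STATE_LOGGED_IN,
    M_STATE_FAILED,           /* assumed: connection lost, still recovering */
    M_STATE_RECOVERY_FAILED,  /* recovery gave up after the timeout */
    M_STATE_TERMINATE
};

/* Hypothetical model of the TMF states named in the thread. */
enum model_tmf_state {
    M_TMF_INITIAL,
    M_TMF_QUEUED,
    M_TMF_TIMEDOUT
};

/* One connection and its session, reduced to the two fields that matter. */
struct model_conn {
    enum model_session_state session_state;
    enum model_tmf_state tmf_state;
};

/* Wake-up condition as quoted from iscsi_eh_host_reset before the patch. */
bool original_wait_done(const struct model_conn *c)
{
    return c->session_state == M_STATE_TERMINATE ||
           c->session_state == M_STATE_LOGGED_IN ||
           c->session_state == M_STATE_RECOVERY_FAILED;
}

/* Wake-up condition with the proposed extra tmf_state check. */
bool patched_wait_done(const struct model_conn *c)
{
    return original_wait_done(c) || c->tmf_state == M_TMF_TIMEDOUT;
}
```

In this model, a session that is still recovering but has a timed-out TMF
satisfies only the patched predicate. That is what lets the reboot finish,
and it is also the situation in which, as the reply in this thread points
out, IO could be failed before replacement_timeout has expired.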
Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.
> It looks like the network is off but the session is still running. We
> eventually get to the kernel shutoff here. Is your init script getting
> run? If not then run it. If you left the session on on purpose then you
> cannot turn the network off because the scsi layer will want to do its
> shutdown when the kernel is stopped.

Ah. Thanks for the explanation. The init script was run, but it didn't
log off all of the sessions (it would selectively log off instead of doing
all of them).
Re: connection2:0: detected conn error (1011).. when rebooting the machine hangs the reboot sequence.
Konrad Rzeszutek wrote:
> Firstly, I haven't dug in this yet but this is more of a call:
> "have-you-seen-this-too?"
>

This is probably on the list 20 times :)

> When I reboot the machine without logging off from iSCSI targets I can
> hang the reboot sequence. This is with 869-rc4 userspace, SLES 10 SP2 Beta
> kernel, with 869-rc4 kernels compiled out of tree. (With a SLES 10 SP2
> Beta kernel, which has a back-port of 868-rc1, I get the same bug)
>
> I enabled the debugging in the kernel (DEBUG_SCSI) and added a dump_stack()
> in the iscsi_check_transport_timeouts, and this is what I get:
>

The timer is still running because the session is.

> iscsi: Sending nopout as ping on conn 88007a0b8a50
> iscsi: Setting next tmo 4294974247
> iscsi: mtask deq [cid 0 itt 0xa06]
> iscsi: mgmtpdu [op 0x0 hdr->itt 0xa06 datalen 0]
> Sending SIGKILL to all processes.
> Please stand by while rebooting the system.
> md: stopping all md devices.
> Synchronizing SCSI cache for disk sdl:

It looks like the network is off but the session is still running. We
eventually get to the kernel shutoff here. Is your init script getting
run? If not then run it. If you left the session on on purpose then you
cannot turn the network off because the scsi layer will want to do its
shutdown when the kernel is stopped.