Re: logout call doesn't return
On Tue, Jun 23, 2009 at 8:29 PM, Mike Christie wrote:
>
> Erez Zilber wrote:
>> Mike,
>>
>> I'm trying to debug a problem that we have with iscsiadm: I'm running
>> open-iscsi against multiple targets. At some point, I'm closing the
>> connection from one of the targets (i.e. on the target side). Then, I
>> try to logout from the initiator side, but something goes wrong. The
>> last thing that iscsiadm does is call recv from iscsid_response and it
>> doesn't return (at least not after 10 minutes). I also see that in the
>> kernel, __iscsi_unbind_session calls scsi_remove_target and doesn't
>> return. I guess that this causes iscsiadm to wait on the recv call.
>
> Yeah, iscsiadm will wait for the iscsid operations like the unbind to
> complete, and that can take a while.
>
> If you stop the target and then we start the session shutdown process
> while we still think the session is up (we have not got a tcp connection
> error or rst or any other indication that is bad like a nop timing out),
> then we are going to end up firing the iscsi or scsi eh.
>
> If you have IO running or if your LU requires a cache sync to be sent
> when shutting it down, then the worst case is that you have nops turned
> off, and for some reason the network layer does not return an error (just
> returns something we think is retryable like EAGAIN) when we try to do
> sendpage/sendmsg. This will result in the scsi commands timing out. Then
> the aborts and other tmfs will timeout, and then we will wait for
> replacement_timeout seconds to try and reconnect.
>
> If you have nops on or the net layer returns an error, it would be a
> little faster because you do not have to wait for scsi commands to time
> out. The nop will timeout after noop_timeout seconds, then we will wait
> for replacement_timeout seconds to reconnect. After that time we will
> fail everything.
>
> If you do not have IO running and your device does not require cache
> syncs, then it should be a lot shorter, but still may be a minute. The
> __iscsi_unbind_session/scsi_remove_target should complete quickly since
> they do not have to wait on IO and cache syncs to complete. We would
> just wait for the logout iscsi pdu to timeout.
>
> There is also a bug, where we retry the sending of data even though we
> know the connection is bad. This patch helps
> http://git.kernel.org/?p=linux/kernel/git/mnc/linux-2.6-iscsi.git;a=commit;h=b138adb2df49967bf0a035143f734d33c4263963
> but what we want is to be able to break from the sendpage/sendsg wait. I
> am working on a patch, but have hit some problems (for some reason if I
> send a signal it does not break from the wait). This problem only adds
> maybe 30 seconds extra for the logout of a session, so I am not sure
> that is what you are hitting.
>
> So first check if your device needs a cache sync. You can check that by
> looking at /var/log/messages when the device is discovered. You will see
> something like:
>
> kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled,
> doesn't support DPO or FUA
>
> If write cache is enabled then the scsi layer will send cache syncs.
>
> Then check your replacement_timeout. If that is really long, then we
> might be hitting that.
>
>>
>> BTW - I'm not running with the latest code. My HEAD is commit
>> ef0357c4728ebba1a4b91a7f6d69c729a5f9e6e3. I don't know if any relevant
>> bug fixes were made lately.
>
> Just so you know, I normally work on linux-2.6-iscsi, which tracks
> upstream, then port to open-iscsi/kernel, so the newest kernel patches
> will be in there.

In the end, it turned out to be caused by an internal bug that we had.
After fixing it, things look OK. Thanks for your help.

Erez

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---
Re: logout call doesn't return
Erez Zilber wrote:
> Mike,
>
> I'm trying to debug a problem that we have with iscsiadm: I'm running
> open-iscsi against multiple targets. At some point, I'm closing the
> connection from one of the targets (i.e. on the target side). Then, I
> try to logout from the initiator side, but something goes wrong. The
> last thing that iscsiadm does is call recv from iscsid_response and it
> doesn't return (at least not after 10 minutes). I also see that in the
> kernel, __iscsi_unbind_session calls scsi_remove_target and doesn't
> return. I guess that this causes iscsiadm to wait on the recv call.

Yeah, iscsiadm will wait for the iscsid operations like the unbind to
complete, and that can take a while.

If you stop the target and then we start the session shutdown process
while we still think the session is up (we have not got a tcp connection
error or rst or any other indication that is bad like a nop timing out),
then we are going to end up firing the iscsi or scsi eh.

If you have IO running or if your LU requires a cache sync to be sent
when shutting it down, then the worst case is that you have nops turned
off, and for some reason the network layer does not return an error (just
returns something we think is retryable like EAGAIN) when we try to do
sendpage/sendmsg. This will result in the scsi commands timing out. Then
the aborts and other tmfs will timeout, and then we will wait for
replacement_timeout seconds to try and reconnect.

If you have nops on or the net layer returns an error, it would be a
little faster because you do not have to wait for scsi commands to time
out. The nop will timeout after noop_timeout seconds, then we will wait
for replacement_timeout seconds to reconnect. After that time we will
fail everything.

If you do not have IO running and your device does not require cache
syncs, then it should be a lot shorter, but still may be a minute. The
__iscsi_unbind_session/scsi_remove_target should complete quickly since
they do not have to wait on IO and cache syncs to complete. We would
just wait for the logout iscsi pdu to timeout.

There is also a bug, where we retry the sending of data even though we
know the connection is bad. This patch helps
http://git.kernel.org/?p=linux/kernel/git/mnc/linux-2.6-iscsi.git;a=commit;h=b138adb2df49967bf0a035143f734d33c4263963
but what we want is to be able to break from the sendpage/sendsg wait. I
am working on a patch, but have hit some problems (for some reason if I
send a signal it does not break from the wait). This problem only adds
maybe 30 seconds extra for the logout of a session, so I am not sure
that is what you are hitting.

So first check if your device needs a cache sync. You can check that by
looking at /var/log/messages when the device is discovered. You will see
something like:

kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled,
doesn't support DPO or FUA

If write cache is enabled then the scsi layer will send cache syncs.

Then check your replacement_timeout. If that is really long, then we
might be hitting that.

> BTW - I'm not running with the latest code. My HEAD is commit
> ef0357c4728ebba1a4b91a7f6d69c729a5f9e6e3. I don't know if any relevant
> bug fixes were made lately.

Just so you know, I normally work on linux-2.6-iscsi, which tracks
upstream, then port to open-iscsi/kernel, so the newest kernel patches
will be in there.
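
[Editor's note] The two checks suggested in this reply (does the device
have its write cache enabled, and how long could the logout plausibly
block) can be sketched in Python. This is only an illustration under
stated assumptions: the function names are made up for this note and are
not part of open-iscsi, the regular expression is based solely on the
sample log line quoted above, and the wait estimate is a simplification
of the sequence described in the reply (it ignores the extra ~30 seconds
from the retry bug). The numeric values in the usage lines are arbitrary
examples, not values from the thread.

```python
import re

# Based on the sample kernel line quoted in the reply, e.g.:
#   kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, ...
WRITE_CACHE_RE = re.compile(
    r"\[(?P<dev>\w+)\] Write cache: (?P<state>enabled|disabled)"
)


def write_cache_states(log_text):
    """Return {device: bool} for every 'Write cache:' line in log_text.

    True means the write cache is enabled, so the scsi layer would send
    cache syncs on shutdown. If a device is reported more than once,
    the last report wins.
    """
    states = {}
    for line in log_text.splitlines():
        match = WRITE_CACHE_RE.search(line)
        if match:
            states[match.group("dev")] = match.group("state") == "enabled"
    return states


def worst_case_logout_wait(replacement_timeout, nops_on, noop_timeout=0,
                           scsi_cmd_timeout=0, tmf_timeout=0):
    """Rough upper bound (seconds) on how long the logout may block.

    Hypothetical model of the reply's description:
      nops on:  nop times out, then wait replacement_timeout
      nops off: scsi commands time out, then aborts/tmfs time out,
                then wait replacement_timeout
    tmf_timeout is the assumed total time spent on aborts and other tmfs.
    """
    if nops_on:
        return noop_timeout + replacement_timeout
    return scsi_cmd_timeout + tmf_timeout + replacement_timeout


sample = ("kernel: sd 0:0:0:0: [sda] Write cache: enabled, "
          "read cache: enabled, doesn't support DPO or FUA")
print(write_cache_states(sample))  # {'sda': True}
# Example values only (assumed, not from the thread).
print(worst_case_logout_wait(120, nops_on=True, noop_timeout=15))  # 135
```

In practice the log text would come from reading /var/log/messages, and
the timeout values from the initiator's own configuration.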
logout call doesn't return
Mike,

I'm trying to debug a problem that we have with iscsiadm: I'm running
open-iscsi against multiple targets. At some point, I'm closing the
connection from one of the targets (i.e. on the target side). Then, I
try to logout from the initiator side, but something goes wrong. The
last thing that iscsiadm does is call recv from iscsid_response and it
doesn't return (at least not after 10 minutes). I also see that in the
kernel, __iscsi_unbind_session calls scsi_remove_target and doesn't
return. I guess that this causes iscsiadm to wait on the recv call.

BTW - I'm not running with the latest code. My HEAD is commit
ef0357c4728ebba1a4b91a7f6d69c729a5f9e6e3. I don't know if any relevant
bug fixes were made lately.

Thanks,
Erez