Re: logout call doesn't return

2009-06-29 Thread Erez Zilber

On Tue, Jun 23, 2009 at 8:29 PM, Mike Christie wrote:
>
> Erez Zilber wrote:
>> Mike,
>>
>> I'm trying to debug a problem that we have with iscsiadm: I'm running
>> open-iscsi against multiple targets. At some point, I'm closing the
>> connection from one of the targets (i.e. on the target side). Then, I
>> try to logout from the initiator side, but something goes wrong. The
>> last thing that iscsiadm does is call recv from iscsid_response and it
>> doesn't return (at least not after 10 minutes). I also see that in the
>> kernel, __iscsi_unbind_session calls scsi_remove_target and doesn't
>> return. I guess that this causes iscsiadm to wait on the recv call.
>
> Yeah, iscsiadm will wait for the iscsid operations like the unbind to
> complete, and that can take a while.
>
> If you stop the target and then we start the session shutdown process
> while we still think the session is up (we have not got a tcp connection
> error or rst or any other indication that it is bad, like a nop timing out),
> then we are going to end up firing the iscsi or scsi eh.
>
> If you have IO running or if your LU requires a cache sync to be sent
> when shutting it down, then the worst case is that you have nops turned
> off, and for some reason the network layer does not return an error (just
> returns something we think is retryable like EAGAIN) when we try to do
> sendpage/sendmsg. This will result in the scsi commands timing out. Then
> the aborts and other tmfs will timeout, and then we will wait for
> replacement_timeout seconds to try and reconnect.
>
> If you have nops on or the net layer returns an error, it would be a
> little faster because you do not have to wait for scsi commands to time
> out. The nop will timeout after noop_timeout seconds, then we will wait
> for replacement_timeout seconds to reconnect. After that time we will
> fail everything.
>
> If you do not have IO running and your device does not require cache
> syncs, then it should be a lot shorter, but still may be a minute. The
> __iscsi_unbind_session/scsi_remove_target should complete quickly since
> they do not have to wait on IO and cache syncs to complete. We would
> just wait for the logout iscsi pdu to timeout.
>
>
> There is also a bug, where we retry the sending of data even though we
> know the connection is bad. This patch helps
> http://git.kernel.org/?p=linux/kernel/git/mnc/linux-2.6-iscsi.git;a=commit;h=b138adb2df49967bf0a035143f734d33c4263963
> but what we want is to be able to break from the sendpage/sendmsg wait. I
> am working on a patch, but have hit some problems (for some reason if I
> send a signal it does not break from the wait). This problem only adds
> maybe 30 seconds extra for the logout of a session, so I am not sure
> that is what you are hitting.
>
>
>
> So first check if your device needs a cache sync. You can check that by
> looking at /var/log/messages when the device is discovered. You will see
> something like:
>
> kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled,
> doesn't support DPO or FUA
>
> If write cache is enabled then the scsi layer will send cache syncs.
>
> Then check your replacement_timeout. If that is really long, then we
> might be hitting that.
>
>
>
>
>>
>> BTW - I'm not running with the latest code. My HEAD is commit
>> ef0357c4728ebba1a4b91a7f6d69c729a5f9e6e3. I don't know if any relevant
>> bug fixes were made lately.
>
>
>
> Just so you know, I normally work on linux-2.6-iscsi, which tracks
> upstream, then port to open-iscsi/kernel, so the newest kernel patches
> will be in there.

In the end, it turned out to be caused by an internal bug that we had.
After fixing it, things look OK. Thanks for your help.

Erez

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: logout call doesn't return

2009-06-23 Thread Mike Christie

Erez Zilber wrote:
> Mike,
> 
> I'm trying to debug a problem that we have with iscsiadm: I'm running
> open-iscsi against multiple targets. At some point, I'm closing the
> connection from one of the targets (i.e. on the target side). Then, I
> try to logout from the initiator side, but something goes wrong. The
> last thing that iscsiadm does is call recv from iscsid_response and it
> doesn't return (at least not after 10 minutes). I also see that in the
> kernel, __iscsi_unbind_session calls scsi_remove_target and doesn't
> return. I guess that this causes iscsiadm to wait on the recv call.

Yeah, iscsiadm will wait for the iscsid operations like the unbind to
complete, and that can take a while.

If you stop the target and then we start the session shutdown process 
while we still think the session is up (we have not got a tcp connection 
error or rst or any other indication that it is bad, like a nop timing out),
then we are going to end up firing the iscsi or scsi eh.

If you have IO running or if your LU requires a cache sync to be sent 
when shutting it down, then the worst case is that you have nops turned
off, and for some reason the network layer does not return an error (just
returns something we think is retryable like EAGAIN) when we try to do
sendpage/sendmsg. This will result in the scsi commands timing out. Then 
the aborts and other tmfs will timeout, and then we will wait for 
replacement_timeout seconds to try and reconnect.

If you have nops on or the net layer returns an error, it would be a
little faster because you do not have to wait for scsi commands to time 
out. The nop will timeout after noop_timeout seconds, then we will wait 
for replacement_timeout seconds to reconnect. After that time we will 
fail everything.

If you do not have IO running and your device does not require cache
syncs, then it should be a lot shorter, but still may be a minute. The 
__iscsi_unbind_session/scsi_remove_target should complete quickly since 
they do not have to wait on IO and cache syncs to complete. We would 
just wait for the logout iscsi pdu to timeout.
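
To put rough numbers on the three cases above, here is a minimal sketch
of the arithmetic. The function name and every default value are
placeholders chosen for illustration, not values taken from any
particular setup, and the model simply adds the waits in the order
described above. Plug in the timeouts from your own configuration.

```python
def estimate_logout_delay(io_or_cache_sync, nops_enabled,
                          scsi_cmd_timeout=30, tmf_timeouts=(15, 30),
                          noop_timeout=5, replacement_timeout=120,
                          logout_timeout=15):
    """Rough upper bound, in seconds, on how long a logout may block.

    All defaults are illustrative placeholders (assumptions), not
    measured or documented values.
    """
    if not io_or_cache_sync:
        # No IO and no cache sync: only the logout PDU has to time out.
        return logout_timeout
    if nops_enabled:
        # The nop times out first, then we wait for the reconnect window.
        return noop_timeout + replacement_timeout
    # Nops off and the net layer keeps returning a retryable error:
    # the scsi command times out, then the aborts and other tmfs time
    # out, then we wait for the reconnect window.
    return scsi_cmd_timeout + sum(tmf_timeouts) + replacement_timeout


print(estimate_logout_delay(io_or_cache_sync=True, nops_enabled=False))
print(estimate_logout_delay(io_or_cache_sync=True, nops_enabled=True))
print(estimate_logout_delay(io_or_cache_sync=False, nops_enabled=False))
```

With a large replacement_timeout the first two cases are dominated by
that one term, which is why it is the first setting worth checking.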


There is also a bug, where we retry the sending of data even though we 
know the connection is bad. This patch helps
http://git.kernel.org/?p=linux/kernel/git/mnc/linux-2.6-iscsi.git;a=commit;h=b138adb2df49967bf0a035143f734d33c4263963
but what we want is to be able to break from the sendpage/sendmsg wait. I
am working on a patch, but have hit some problems (for some reason if I 
send a signal it does not break from the wait). This problem only adds 
maybe 30 seconds extra for the logout of a session, so I am not sure 
that is what you are hitting.



So first check if your device needs a cache sync. You can check that by 
looking at /var/log/messages when the device is discovered. You will see 
something like:

kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, 
doesn't support DPO or FUA

If write cache is enabled then the scsi layer will send cache syncs.
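
If you have a lot of devices, a small script can do the check for you.
This is only a sketch: the regular expression is an assumption based on
the log line quoted above (taken as a single unwrapped line), and the
function name is made up for the example.

```python
import re

# Matches e.g. "[sda] Write cache: enabled" in a kernel log line.
WRITE_CACHE_RE = re.compile(
    r"\[(?P<dev>\w+)\]\s+Write cache:\s+(?P<state>enabled|disabled)"
)


def write_cache_states(log_lines):
    """Map device name -> True if the log reports its write cache enabled."""
    states = {}
    for line in log_lines:
        match = WRITE_CACHE_RE.search(line)
        if match:
            # A later line for the same device overrides an earlier one.
            states[match.group("dev")] = match.group("state") == "enabled"
    return states


sample = [
    "kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, "
    "doesn't support DPO or FUA",
]
print(write_cache_states(sample))
```

To run it against a real log, pass the open file object for
/var/log/messages (or wherever your distro puts kernel messages)
instead of the sample list.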

Then check your replacement_timeout. If that is really long, then we 
might be hitting that.
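
One way to check it, sketched below, is to parse the "key = value"
lines of a node record (for example the text printed by iscsiadm -m
node for your target and portal, saved to a file). The helper names and
the 120 second threshold are assumptions for illustration. What counts
as "really long" depends on your setup.

```python
REPLACEMENT_KEY = "node.session.timeo.replacement_timeout"


def parse_record(text):
    """Parse 'key = value' lines, skipping blanks and '#' comment lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def long_replacement_timeout(text, threshold=120):
    """True if replacement_timeout is set and exceeds threshold seconds.

    The threshold is a hypothetical cut-off, not a recommended value.
    """
    value = parse_record(text).get(REPLACEMENT_KEY)
    return value is not None and int(value) > threshold


sample = """\
# BEGIN RECORD
node.session.timeo.replacement_timeout = 600
node.conn[0].timeo.noop_out_timeout = 5
# END RECORD
"""
print(long_replacement_timeout(sample))
```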




> 
> BTW - I'm not running with the latest code. My HEAD is commit
> ef0357c4728ebba1a4b91a7f6d69c729a5f9e6e3. I don't know if any relevant
> bug fixes were made lately.



Just so you know, I normally work on linux-2.6-iscsi, which tracks 
upstream, then port to open-iscsi/kernel, so the newest kernel patches 
will be in there.




logout call doesn't return

2009-06-23 Thread Erez Zilber

Mike,

I'm trying to debug a problem that we have with iscsiadm: I'm running
open-iscsi against multiple targets. At some point, I'm closing the
connection from one of the targets (i.e. on the target side). Then, I
try to logout from the initiator side, but something goes wrong. The
last thing that iscsiadm does is call recv from iscsid_response and it
doesn't return (at least not after 10 minutes). I also see that in the
kernel, __iscsi_unbind_session calls scsi_remove_target and doesn't
return. I guess that this causes iscsiadm to wait on the recv call.

BTW - I'm not running with the latest code. My HEAD is commit
ef0357c4728ebba1a4b91a7f6d69c729a5f9e6e3. I don't know if any relevant
bug fixes were made lately.

Thanks,
Erez
