Re: [PATCH 0/2] Update TMF handling

2009-07-29 Thread Mike Christie
Hannes Reinecke wrote:
> Mike Christie wrote:
>> Mike Christie wrote:
>>> Mike Christie wrote:
 Mike Christie wrote:
> Mike Christie wrote:
>> Hannes Reinecke wrote:
>>> Mike Christie wrote:
> The second patch is the more important one, as it
> fixes an error during LUN Reset handling in the
> initiator. When sending a LUN Reset during an
> ongoing R2T transfer, we're suspending Tx and
> aborting all _SCSI_ tasks. However, once we're
> done there we're resuming Tx and the R2T transfer
> will happily continue. So we should rather be
 This should not be happening. When iscsi_suspend_tx returns the tx 
 thread has stopped so we know there are no users accessing the task 
 (well, there could be if a target is sending a tmf response then a 
 r2t, 
 but if the target is following the rfc there should not be).

 So when fail_scsi_tasks calls

 fail_scsi_task ->iscsi_complete_task (this will cleanup conn->task if 
 this is the same task) -> __iscsi_put_task

 this should be the last put on the task and that should release it 
 calling iscsi_free_task which should call cleanup_task to kill any 
 pending r2t handling and it would remove it from the requeue list.

 If we are sending a data-out for a task that has had fail_scsi_task 
 ->iscsi_complete_task -> __iscsi_put_task called for it then we are in 
 bigger trouble because the last put should have been called on it and 
 we 
   are accessing a bad task.

>>> This is the log I'm getting:
>>>
>>>
>>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_eh_device_reset LU Reset 
>>> [sc 88007b94d080 lun 6]
>>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_exec_task_mgmt_fn tmf set 
>>> timeout
>>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x3a lun 6 abort 
>>> transfer
>>> Jul 29 10:34:48 tyne kernel:  session1: mgmtpdu [op 0x2 hdr->itt 0x5d 
>>> datalen 0]
>>> Jul 29 10:34:48 tyne kernel:  connection1:0: mgmtpdu [itt 0x5d task 
>>> 88007a01fc00] xmit
>>> Jul 29 10:34:48 tyne kernel:  connection1:0: tmf rsp [itt 0x5d] 
>>> response 0 state 1
>>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x72 lun 6 abort 
>>> transfer
>>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_suspend_tx suspend Tx
>>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_complete_task task itt 
>>> 0x72 sc 88007b5bc580 still active
>>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x57 lun 6 abort 
>>> transfer
>>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x59 lun 6 abort 
>>> transfer
>>> Jul 29 10:34:48 tyne kernel:  session1: Tx suspended!
>>>
>>> So we're indeed would have continued the R2T task (itt 0x57 and itt 
>>> 0x59) even though we've
>>> already received a valid TMF response.
>>> So I'm afraid it's us ...
>> Ah, I misunderstood you. I do not think it has to do with the cleanup 
>> still leaving r2ts. I am not sure where you are putting printks, but I 
>> think it is this:
>>
>>  while (!list_empty(&conn->requeue)) {
>>  if (conn->session->fast_abort && conn->tmf_state != 
>> TMF_INITIAL)
>>  break;
>>
>> Once the tmf completes, we will start sending data again.
>>
> Ooops. I am too sleepy. Ignore that. I am wrong there.
>
 I guess if fast_abort is 0 though, we will hit this problem. And we will 
 send data-outs when getting tmf responses as well as when we are sending 
 the tmf.
>>>
>>> I think the problem is wording like in 10.5.1:
>>>
>>> For ABORT TASK SET and CLEAR TASK SET, the issuing initiator MUST
>>> continue to respond to all valid target transfer tags (received via
>>> R2T, Text Response, NOP-In, or SCSI Data-In PDUs) related to the
>>> affected task set, even after issuing the task management request.
>>>
>>> I think in some other doc (probably the one Mathew and Ulrich mentioned) 
>>> there is wording about doing similar for abort and lu resets.
>>>
>>> The things is that I think half of targets want us to respond to r2ts 
>>> and half do not. This is where the fast_abort comes from. If set then we 
>>> reply to r2ts and if not set we do not. I think once we get a successful 
>> Fudge. I am really going to be now. I mean if it is set we do not reply 
>> to r2ts. If not set then we reply.
>>
> Actually, I think it's a race condition:
> 
> drivers/scsi/libiscsi.c:iscsi_eh_device_reset()
>   rc = SUCCESS;
>   spin_unlock_bh(&session->lock);
> 
>   iscsi_suspend_tx(conn);
> 
> So the workqueue thread could wedge in after we've
> unlocked the session lock and start sending data
> even though we're meant to suspend transmitting here.
> 
> Wi

Re: [PATCH 0/2] Update TMF handling

2009-07-29 Thread Hannes Reinecke

Hannes Reinecke wrote:
> Mike Christie wrote:
>> Mike Christie wrote:
[ .. ]
>>>
>>> The things is that I think half of targets want us to respond to r2ts 
>>> and half do not. This is where the fast_abort comes from. If set then we 
>>> reply to r2ts and if not set we do not. I think once we get a successful 
>> Fudge. I am really going to be now. I mean if it is set we do not reply 
>> to r2ts. If not set then we reply.
>>
> Actually, I think it's a race condition:
> 
> drivers/scsi/libiscsi.c:iscsi_eh_device_reset()
>   rc = SUCCESS;
>   spin_unlock_bh(&session->lock);
> 
>   iscsi_suspend_tx(conn);
> 
> So the workqueue thread could wedge in after we've
> unlocked the session lock and start sending data
> even though we're meant to suspend transmitting here.
> 
> Will be trying it.
> 
Sigh. Yes, it is a race condition.
But the above is not the only one:

drivers/scsi/libiscsi.c:iscsi_exec_task_mgmt_fn()

wait_event_interruptible(conn->ehwait, age != session->age ||
 session->state != ISCSI_STATE_LOGGED_IN ||
 conn->tmf_state != TMF_QUEUED);
if (signal_pending(current))
flush_signals(current);
del_timer_sync(&conn->tmf_timer);

mutex_lock(&session->eh_mutex);
spin_lock_bh(&session->lock);

Plenty of time here to wedge in.
So a simple reshuffle will not suffice here.

Okay, preparing patches.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: [PATCH 0/2] Update TMF handling

2009-07-29 Thread Hannes Reinecke

Mike Christie wrote:
> Mike Christie wrote:
>> Mike Christie wrote:
>>> Mike Christie wrote:
 Mike Christie wrote:
> Hannes Reinecke wrote:
>> Mike Christie wrote:
 The second patch is the more important one, as it
 fixes an error during LUN Reset handling in the
 initiator. When sending a LUN Reset during an
 ongoing R2T transfer, we're suspending Tx and
 aborting all _SCSI_ tasks. However, once we're
 done there we're resuming Tx and the R2T transfer
 will happily continue. So we should rather be
>>> This should not be happening. When iscsi_suspend_tx returns the tx 
>>> thread has stopped so we know there are no users accessing the task 
>>> (well, there could be if a target is sending a tmf response then a r2t, 
>>> but if the target is following the rfc there should not be).
>>>
>>> So when fail_scsi_tasks calls
>>>
>>> fail_scsi_task ->iscsi_complete_task (this will cleanup conn->task if 
>>> this is the same task) -> __iscsi_put_task
>>>
>>> this should be the last put on the task and that should release it 
>>> calling iscsi_free_task which should call cleanup_task to kill any 
>>> pending r2t handling and it would remove it from the requeue list.
>>>
>>> If we are sending a data-out for a task that has had fail_scsi_task 
>>> ->iscsi_complete_task -> __iscsi_put_task called for it then we are in 
>>> bigger trouble because the last put should have been called on it and 
>>> we 
>>>   are accessing a bad task.
>>>
>> This is the log I'm getting:
>>
>>
>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_eh_device_reset LU Reset 
>> [sc 88007b94d080 lun 6]
>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_exec_task_mgmt_fn tmf set 
>> timeout
>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x3a lun 6 abort 
>> transfer
>> Jul 29 10:34:48 tyne kernel:  session1: mgmtpdu [op 0x2 hdr->itt 0x5d 
>> datalen 0]
>> Jul 29 10:34:48 tyne kernel:  connection1:0: mgmtpdu [itt 0x5d task 
>> 88007a01fc00] xmit
>> Jul 29 10:34:48 tyne kernel:  connection1:0: tmf rsp [itt 0x5d] response 
>> 0 state 1
>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x72 lun 6 abort 
>> transfer
>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_suspend_tx suspend Tx
>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_complete_task task itt 
>> 0x72 sc 88007b5bc580 still active
>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x57 lun 6 abort 
>> transfer
>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x59 lun 6 abort 
>> transfer
>> Jul 29 10:34:48 tyne kernel:  session1: Tx suspended!
>>
>> So we're indeed would have continued the R2T task (itt 0x57 and itt 
>> 0x59) even though we've
>> already received a valid TMF response.
>> So I'm afraid it's us ...
> Ah, I misunderstood you. I do not think it has to do with the cleanup 
> still leaving r2ts. I am not sure where you are putting printks, but I 
> think it is this:
>
>  while (!list_empty(&conn->requeue)) {
>  if (conn->session->fast_abort && conn->tmf_state != 
> TMF_INITIAL)
>  break;
>
> Once the tmf completes, we will start sending data again.
>
 Ooops. I am too sleepy. Ignore that. I am wrong there.

>>> I guess if fast_abort is 0 though, we will hit this problem. And we will 
>>> send data-outs when getting tmf responses as well as when we are sending 
>>> the tmf.
>>
>>
>> I think the problem is wording like in 10.5.1:
>>
>> For ABORT TASK SET and CLEAR TASK SET, the issuing initiator MUST
>> continue to respond to all valid target transfer tags (received via
>> R2T, Text Response, NOP-In, or SCSI Data-In PDUs) related to the
>> affected task set, even after issuing the task management request.
>>
>> I think in some other doc (probably the one Mathew and Ulrich mentioned) 
>> there is wording about doing similar for abort and lu resets.
>>
>> The things is that I think half of targets want us to respond to r2ts 
>> and half do not. This is where the fast_abort comes from. If set then we 
>> reply to r2ts and if not set we do not. I think once we get a successful 
> 
> Fudge. I am really going to be now. I mean if it is set we do not reply 
> to r2ts. If not set then we reply.
> 
Actually, I think it's a race condition:

drivers/scsi/libiscsi.c:iscsi_eh_device_reset()
rc = SUCCESS;
spin_unlock_bh(&session->lock);

iscsi_suspend_tx(conn);

So the workqueue thread could wedge in after we've
unlocked the session lock and start sending data
even though we're meant to suspend transmitting here.

Will be trying it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 740

Re: [PATCH 0/2] Update TMF handling

2009-07-29 Thread Mike Christie

Mike Christie wrote:
> Mike Christie wrote:
>> Mike Christie wrote:
>>> Mike Christie wrote:
 Hannes Reinecke wrote:
> Mike Christie wrote:
>>> The second patch is the more important one, as it
>>> fixes an error during LUN Reset handling in the
>>> initiator. When sending a LUN Reset during an
>>> ongoing R2T transfer, we're suspending Tx and
>>> aborting all _SCSI_ tasks. However, once we're
>>> done there we're resuming Tx and the R2T transfer
>>> will happily continue. So we should rather be
>> This should not be happening. When iscsi_suspend_tx returns the tx 
>> thread has stopped so we know there are no users accessing the task 
>> (well, there could be if a target is sending a tmf response then a r2t, 
>> but if the target is following the rfc there should not be).
>>
>> So when fail_scsi_tasks calls
>>
>> fail_scsi_task ->iscsi_complete_task (this will cleanup conn->task if 
>> this is the same task) -> __iscsi_put_task
>>
>> this should be the last put on the task and that should release it 
>> calling iscsi_free_task which should call cleanup_task to kill any 
>> pending r2t handling and it would remove it from the requeue list.
>>
>> If we are sending a data-out for a task that has had fail_scsi_task 
>> ->iscsi_complete_task -> __iscsi_put_task called for it then we are in 
>> bigger trouble because the last put should have been called on it and we 
>>   are accessing a bad task.
>>
> This is the log I'm getting:
>
>
> Jul 29 10:34:48 tyne kernel:  session1: iscsi_eh_device_reset LU Reset 
> [sc 88007b94d080 lun 6]
> Jul 29 10:34:48 tyne kernel:  session1: iscsi_exec_task_mgmt_fn tmf set 
> timeout
> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x3a lun 6 abort 
> transfer
> Jul 29 10:34:48 tyne kernel:  session1: mgmtpdu [op 0x2 hdr->itt 0x5d 
> datalen 0]
> Jul 29 10:34:48 tyne kernel:  connection1:0: mgmtpdu [itt 0x5d task 
> 88007a01fc00] xmit
> Jul 29 10:34:48 tyne kernel:  connection1:0: tmf rsp [itt 0x5d] response 
> 0 state 1
> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x72 lun 6 abort 
> transfer
> Jul 29 10:34:48 tyne kernel:  session1: iscsi_suspend_tx suspend Tx
> Jul 29 10:34:48 tyne kernel:  session1: iscsi_complete_task task itt 0x72 
> sc 88007b5bc580 still active
> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x57 lun 6 abort 
> transfer
> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x59 lun 6 abort 
> transfer
> Jul 29 10:34:48 tyne kernel:  session1: Tx suspended!
>
> So we're indeed would have continued the R2T task (itt 0x57 and itt 0x59) 
> even though we've
> already received a valid TMF response.
> So I'm afraid it's us ...
 Ah, I misunderstood you. I do not think it has to do with the cleanup 
 still leaving r2ts. I am not sure where you are putting printks, but I 
 think it is this:

  while (!list_empty(&conn->requeue)) {
  if (conn->session->fast_abort && conn->tmf_state != 
 TMF_INITIAL)
  break;

 Once the tmf completes, we will start sending data again.

>>> Ooops. I am too sleepy. Ignore that. I am wrong there.
>>>
>> I guess if fast_abort is 0 though, we will hit this problem. And we will 
>> send data-outs when getting tmf responses as well as when we are sending 
>> the tmf.
> 
> 
> 
> I think the problem is wording like in 10.5.1:
> 
> For ABORT TASK SET and CLEAR TASK SET, the issuing initiator MUST
> continue to respond to all valid target transfer tags (received via
> R2T, Text Response, NOP-In, or SCSI Data-In PDUs) related to the
> affected task set, even after issuing the task management request.
> 
> I think in some other doc (probably the one Mathew and Ulrich mentioned) 
> there is wording about doing similar for abort and lu resets.
> 
> The things is that I think half of targets want us to respond to r2ts 
> and half do not. This is where the fast_abort comes from. If set then we 
> reply to r2ts and if not set we do not. I think once we get a successful 

Fudge. I am really going to be now. I mean if it is set we do not reply 
to r2ts. If not set then we reply.

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: [PATCH 0/2] Update TMF handling

2009-07-29 Thread Mike Christie

Mike Christie wrote:
> Mike Christie wrote:
>> Mike Christie wrote:
>>> Hannes Reinecke wrote:
 Mike Christie wrote:
>> The second patch is the more important one, as it
>> fixes an error during LUN Reset handling in the
>> initiator. When sending a LUN Reset during an
>> ongoing R2T transfer, we're suspending Tx and
>> aborting all _SCSI_ tasks. However, once we're
>> done there we're resuming Tx and the R2T transfer
>> will happily continue. So we should rather be
> This should not be happening. When iscsi_suspend_tx returns the tx 
> thread has stopped so we know there are no users accessing the task 
> (well, there could be if a target is sending a tmf response then a r2t, 
> but if the target is following the rfc there should not be).
>
> So when fail_scsi_tasks calls
>
> fail_scsi_task ->iscsi_complete_task (this will cleanup conn->task if 
> this is the same task) -> __iscsi_put_task
>
> this should be the last put on the task and that should release it 
> calling iscsi_free_task which should call cleanup_task to kill any 
> pending r2t handling and it would remove it from the requeue list.
>
> If we are sending a data-out for a task that has had fail_scsi_task 
> ->iscsi_complete_task -> __iscsi_put_task called for it then we are in 
> bigger trouble because the last put should have been called on it and we 
>   are accessing a bad task.
>
 This is the log I'm getting:


 Jul 29 10:34:48 tyne kernel:  session1: iscsi_eh_device_reset LU Reset [sc 
 88007b94d080 lun 6]
 Jul 29 10:34:48 tyne kernel:  session1: iscsi_exec_task_mgmt_fn tmf set 
 timeout
 Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x3a lun 6 abort 
 transfer
 Jul 29 10:34:48 tyne kernel:  session1: mgmtpdu [op 0x2 hdr->itt 0x5d 
 datalen 0]
 Jul 29 10:34:48 tyne kernel:  connection1:0: mgmtpdu [itt 0x5d task 
 88007a01fc00] xmit
 Jul 29 10:34:48 tyne kernel:  connection1:0: tmf rsp [itt 0x5d] response 0 
 state 1
 Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x72 lun 6 abort 
 transfer
 Jul 29 10:34:48 tyne kernel:  session1: iscsi_suspend_tx suspend Tx
 Jul 29 10:34:48 tyne kernel:  session1: iscsi_complete_task task itt 0x72 
 sc 88007b5bc580 still active
 Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x57 lun 6 abort 
 transfer
 Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x59 lun 6 abort 
 transfer
 Jul 29 10:34:48 tyne kernel:  session1: Tx suspended!

 So we're indeed would have continued the R2T task (itt 0x57 and itt 0x59) 
 even though we've
 already received a valid TMF response.
 So I'm afraid it's us ...
>>> Ah, I misunderstood you. I do not think it has to do with the cleanup 
>>> still leaving r2ts. I am not sure where you are putting printks, but I 
>>> think it is this:
>>>
>>>  while (!list_empty(&conn->requeue)) {
>>>  if (conn->session->fast_abort && conn->tmf_state != 
>>> TMF_INITIAL)
>>>  break;
>>>
>>> Once the tmf completes, we will start sending data again.
>>>
>> Ooops. I am too sleepy. Ignore that. I am wrong there.
>>
> 
> I guess if fast_abort is 0 though, we will hit this problem. And we will 
> send data-outs when getting tmf responses as well as when we are sending 
> the tmf.



I think the problem is wording like in 10.5.1:

For ABORT TASK SET and CLEAR TASK SET, the issuing initiator MUST
continue to respond to all valid target transfer tags (received via
R2T, Text Response, NOP-In, or SCSI Data-In PDUs) related to the
affected task set, even after issuing the task management request.

I think in some other doc (probably the one Mathew and Ulrich mentioned) 
there is wording about doing similar for abort and lu resets.

The things is that I think half of targets want us to respond to r2ts 
and half do not. This is where the fast_abort comes from. If set then we 
reply to r2ts and if not set we do not. I think once we get a successful 
response we should always stop.

What are you using for fast_abort right now? If it was not set, could 
you set it and retry your test.

I think if that works, we still want to fix the suspend bit testing, 
because I think you are right and we may still be running due to a bad 
test and set of that bit and fast_abort may be turned off in a valid setup.

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: [PATCH 0/2] Update TMF handling

2009-07-29 Thread Mike Christie

Mike Christie wrote:
> Mike Christie wrote:
>> Hannes Reinecke wrote:
>>> Mike Christie wrote:
> The second patch is the more important one, as it
> fixes an error during LUN Reset handling in the
> initiator. When sending a LUN Reset during an
> ongoing R2T transfer, we're suspending Tx and
> aborting all _SCSI_ tasks. However, once we're
> done there we're resuming Tx and the R2T transfer
> will happily continue. So we should rather be
 This should not be happening. When iscsi_suspend_tx returns the tx 
 thread has stopped so we know there are no users accessing the task 
 (well, there could be if a target is sending a tmf response then a r2t, 
 but if the target is following the rfc there should not be).

 So when fail_scsi_tasks calls

 fail_scsi_task ->iscsi_complete_task (this will cleanup conn->task if 
 this is the same task) -> __iscsi_put_task

 this should be the last put on the task and that should release it 
 calling iscsi_free_task which should call cleanup_task to kill any 
 pending r2t handling and it would remove it from the requeue list.

 If we are sending a data-out for a task that has had fail_scsi_task 
 ->iscsi_complete_task -> __iscsi_put_task called for it then we are in 
 bigger trouble because the last put should have been called on it and we 
   are accessing a bad task.

>>> This is the log I'm getting:
>>>
>>>
>>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_eh_device_reset LU Reset [sc 
>>> 88007b94d080 lun 6]
>>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_exec_task_mgmt_fn tmf set 
>>> timeout
>>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x3a lun 6 abort 
>>> transfer
>>> Jul 29 10:34:48 tyne kernel:  session1: mgmtpdu [op 0x2 hdr->itt 0x5d 
>>> datalen 0]
>>> Jul 29 10:34:48 tyne kernel:  connection1:0: mgmtpdu [itt 0x5d task 
>>> 88007a01fc00] xmit
>>> Jul 29 10:34:48 tyne kernel:  connection1:0: tmf rsp [itt 0x5d] response 0 
>>> state 1
>>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x72 lun 6 abort 
>>> transfer
>>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_suspend_tx suspend Tx
>>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_complete_task task itt 0x72 
>>> sc 88007b5bc580 still active
>>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x57 lun 6 abort 
>>> transfer
>>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x59 lun 6 abort 
>>> transfer
>>> Jul 29 10:34:48 tyne kernel:  session1: Tx suspended!
>>>
>>> So we're indeed would have continued the R2T task (itt 0x57 and itt 0x59) 
>>> even though we've
>>> already received a valid TMF response.
>>> So I'm afraid it's us ...
>> Ah, I misunderstood you. I do not think it has to do with the cleanup 
>> still leaving r2ts. I am not sure where you are putting printks, but I 
>> think it is this:
>>
>>  while (!list_empty(&conn->requeue)) {
>>  if (conn->session->fast_abort && conn->tmf_state != 
>> TMF_INITIAL)
>>  break;
>>
>> Once the tmf completes, we will start sending data again.
>>
> 
> Ooops. I am too sleepy. Ignore that. I am wrong there.
> 

I guess if fast_abort is 0 though, we will hit this problem. And we will 
send data-outs when getting tmf responses as well as when we are sending 
the tmf.

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: [PATCH 0/2] Update TMF handling

2009-07-29 Thread Mike Christie

Mike Christie wrote:
> Hannes Reinecke wrote:
>> Mike Christie wrote:
 The second patch is the more important one, as it
 fixes an error during LUN Reset handling in the
 initiator. When sending a LUN Reset during an
 ongoing R2T transfer, we're suspending Tx and
 aborting all _SCSI_ tasks. However, once we're
 done there we're resuming Tx and the R2T transfer
 will happily continue. So we should rather be
>>> This should not be happening. When iscsi_suspend_tx returns the tx 
>>> thread has stopped so we know there are no users accessing the task 
>>> (well, there could be if a target is sending a tmf response then a r2t, 
>>> but if the target is following the rfc there should not be).
>>>
>>> So when fail_scsi_tasks calls
>>>
>>> fail_scsi_task ->iscsi_complete_task (this will cleanup conn->task if 
>>> this is the same task) -> __iscsi_put_task
>>>
>>> this should be the last put on the task and that should release it 
>>> calling iscsi_free_task which should call cleanup_task to kill any 
>>> pending r2t handling and it would remove it from the requeue list.
>>>
>>> If we are sending a data-out for a task that has had fail_scsi_task 
>>> ->iscsi_complete_task -> __iscsi_put_task called for it then we are in 
>>> bigger trouble because the last put should have been called on it and we 
>>>   are accessing a bad task.
>>>
>> This is the log I'm getting:
>>
>>
>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_eh_device_reset LU Reset [sc 
>> 88007b94d080 lun 6]
>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_exec_task_mgmt_fn tmf set 
>> timeout
>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x3a lun 6 abort 
>> transfer
>> Jul 29 10:34:48 tyne kernel:  session1: mgmtpdu [op 0x2 hdr->itt 0x5d 
>> datalen 0]
>> Jul 29 10:34:48 tyne kernel:  connection1:0: mgmtpdu [itt 0x5d task 
>> 88007a01fc00] xmit
>> Jul 29 10:34:48 tyne kernel:  connection1:0: tmf rsp [itt 0x5d] response 0 
>> state 1
>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x72 lun 6 abort 
>> transfer
>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_suspend_tx suspend Tx
>> Jul 29 10:34:48 tyne kernel:  session1: iscsi_complete_task task itt 0x72 sc 
>> 88007b5bc580 still active
>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x57 lun 6 abort 
>> transfer
>> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x59 lun 6 abort 
>> transfer
>> Jul 29 10:34:48 tyne kernel:  session1: Tx suspended!
>>
>> So we're indeed would have continued the R2T task (itt 0x57 and itt 0x59) 
>> even though we've
>> already received a valid TMF response.
>> So I'm afraid it's us ...
> 
> Ah, I misunderstood you. I do not think it has to do with the cleanup 
> still leaving r2ts. I am not sure where you are putting printks, but I 
> think it is this:
> 
>  while (!list_empty(&conn->requeue)) {
>  if (conn->session->fast_abort && conn->tmf_state != 
> TMF_INITIAL)
>  break;
> 
> Once the tmf completes, we will start sending data again.
> 

Ooops. I am too sleepy. Ignore that. I am wrong there.

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: [PATCH 0/2] Update TMF handling

2009-07-29 Thread Mike Christie

Hannes Reinecke wrote:
> Mike Christie wrote:
>>> The second patch is the more important one, as it
>>> fixes an error during LUN Reset handling in the
>>> initiator. When sending a LUN Reset during an
>>> ongoing R2T transfer, we're suspending Tx and
>>> aborting all _SCSI_ tasks. However, once we're
>>> done there we're resuming Tx and the R2T transfer
>>> will happily continue. So we should rather be
>> This should not be happening. When iscsi_suspend_tx returns the tx 
>> thread has stopped so we know there are no users accessing the task 
>> (well, there could be if a target is sending a tmf response then a r2t, 
>> but if the target is following the rfc there should not be).
>>
>> So when fail_scsi_tasks calls
>>
>> fail_scsi_task ->iscsi_complete_task (this will cleanup conn->task if 
>> this is the same task) -> __iscsi_put_task
>>
>> this should be the last put on the task and that should release it 
>> calling iscsi_free_task which should call cleanup_task to kill any 
>> pending r2t handling and it would remove it from the requeue list.
>>
>> If we are sending a data-out for a task that has had fail_scsi_task 
>> ->iscsi_complete_task -> __iscsi_put_task called for it then we are in 
>> bigger trouble because the last put should have been called on it and we 
>>   are accessing a bad task.
>>
> This is the log I'm getting:
> 
> 
> Jul 29 10:34:48 tyne kernel:  session1: iscsi_eh_device_reset LU Reset [sc 
> 88007b94d080 lun 6]
> Jul 29 10:34:48 tyne kernel:  session1: iscsi_exec_task_mgmt_fn tmf set 
> timeout
> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x3a lun 6 abort 
> transfer
> Jul 29 10:34:48 tyne kernel:  session1: mgmtpdu [op 0x2 hdr->itt 0x5d datalen 
> 0]
> Jul 29 10:34:48 tyne kernel:  connection1:0: mgmtpdu [itt 0x5d task 
> 88007a01fc00] xmit
> Jul 29 10:34:48 tyne kernel:  connection1:0: tmf rsp [itt 0x5d] response 0 
> state 1
> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x72 lun 6 abort 
> transfer
> Jul 29 10:34:48 tyne kernel:  session1: iscsi_suspend_tx suspend Tx
> Jul 29 10:34:48 tyne kernel:  session1: iscsi_complete_task task itt 0x72 sc 
> 88007b5bc580 still active
> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x57 lun 6 abort 
> transfer
> Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x59 lun 6 abort 
> transfer
> Jul 29 10:34:48 tyne kernel:  session1: Tx suspended!
> 
> So we're indeed would have continued the R2T task (itt 0x57 and itt 0x59) 
> even though we've
> already received a valid TMF response.
> So I'm afraid it's us ...

Ah, I misunderstood you. I do not think it has to do with the cleanup 
still leaving r2ts. I am not sure where you are putting printks, but I 
think it is this:

 while (!list_empty(&conn->requeue)) {
 if (conn->session->fast_abort && conn->tmf_state != 
TMF_INITIAL)
 break;

Once the tmf completes, we will start sending data again.

This sort of lines up with where I think you put your printks. Is 
"iscsi_suspend_tx suspend Tx" getting printed out before or after the 
the flush_workqueue.


> 
> I really do think part of the problem is that we setting the SUSPEND bit
> without holding the session lock. We _check_ it under the lock in
> iscsi_xmit(), but setting is done without the lock.
> Which of course causes all sorts of race conditions.

Yeah, we use atomics to set it then just do a if (conn->suspend_tx) 
under the lock to test it.

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: [PATCH 0/2] Update TMF handling

2009-07-29 Thread Hannes Reinecke

Mike Christie wrote:
> 
>> The second patch is the more important one, as it
>> fixes an error during LUN Reset handling in the
>> initiator. When sending a LUN Reset during an
>> ongoing R2T transfer, we're suspending Tx and
>> aborting all _SCSI_ tasks. However, once we're
>> done there we're resuming Tx and the R2T transfer
>> will happily continue. So we should rather be
> 
> This should not be happening. When iscsi_suspend_tx returns the tx 
> thread has stopped so we know there are no users accessing the task 
> (well, there could be if a target is sending a tmf response then a r2t, 
> but if the target is following the rfc there should not be).
> 
> So when fail_scsi_tasks calls
> 
> fail_scsi_task ->iscsi_complete_task (this will cleanup conn->task if 
> this is the same task) -> __iscsi_put_task
> 
> this should be the last put on the task and that should release it 
> calling iscsi_free_task which should call cleanup_task to kill any 
> pending r2t handling and it would remove it from the requeue list.
> 
> If we are sending a data-out for a task that has had fail_scsi_task 
> ->iscsi_complete_task -> __iscsi_put_task called for it then we are in 
> bigger trouble because the last put should have been called on it and we 
>   are accessing a bad task.
> 
This is the log I'm getting:


Jul 29 10:34:48 tyne kernel:  session1: iscsi_eh_device_reset LU Reset [sc 
88007b94d080 lun 6]
Jul 29 10:34:48 tyne kernel:  session1: iscsi_exec_task_mgmt_fn tmf set timeout
Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x3a lun 6 abort transfer
Jul 29 10:34:48 tyne kernel:  session1: mgmtpdu [op 0x2 hdr->itt 0x5d datalen 0]
Jul 29 10:34:48 tyne kernel:  connection1:0: mgmtpdu [itt 0x5d task 
88007a01fc00] xmit
Jul 29 10:34:48 tyne kernel:  connection1:0: tmf rsp [itt 0x5d] response 0 
state 1
Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x72 lun 6 abort transfer
Jul 29 10:34:48 tyne kernel:  session1: iscsi_suspend_tx suspend Tx
Jul 29 10:34:48 tyne kernel:  session1: iscsi_complete_task task itt 0x72 sc 
88007b5bc580 still active
Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x57 lun 6 abort transfer
Jul 29 10:34:48 tyne kernel:  connection1:0: task itt 0x59 lun 6 abort transfer
Jul 29 10:34:48 tyne kernel:  session1: Tx suspended!

So we're indeed would have continued the R2T task (itt 0x57 and itt 0x59) even 
though we've
already received a valid TMF response.
So I'm afraid it's us ...

I really do think part of the problem is that we setting the SUSPEND bit
without holding the session lock. We _check_ it under the lock in
iscsi_xmit(), but setting is done without the lock.
Which of course causes all sorts of race conditions.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: [PATCH 0/2] Update TMF handling

2009-07-29 Thread Hannes Reinecke

Mike Christie wrote:
> 
>> The second patch is the more important one, as it
>> fixes an error during LUN Reset handling in the
>> initiator. When sending a LUN Reset during an
>> ongoing R2T transfer, we're suspending Tx and
>> aborting all _SCSI_ tasks. However, once we're
>> done there we're resuming Tx and the R2T transfer
>> will happily continue. So we should rather be
> 
> This should not be happening. When iscsi_suspend_tx returns the tx 
> thread has stopped so we know there are no users accessing the task 
> (well, there could be if a target is sending a tmf response then a r2t, 
> but if the target is following the rfc there should not be).
> 
> So when fail_scsi_tasks calls
> 
> fail_scsi_task ->iscsi_complete_task (this will cleanup conn->task if 
> this is the same task) -> __iscsi_put_task
> 
> this should be the last put on the task and that should release it 
> calling iscsi_free_task which should call cleanup_task to kill any 
> pending r2t handling and it would remove it from the requeue list.
> 
> If we are sending a data-out for a task that has had fail_scsi_task 
> ->iscsi_complete_task -> __iscsi_put_task called for it then we are in 
> bigger trouble because the last put should have been called on it and we 
>   are accessing a bad task.
> 
I fully agree, this is something which shouldn't happen.

However, using this patch stops me from receiving invalid R2T PDUs.
So I can't be that far off the mark.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



Re: [PATCH 0/2] Update TMF handling

2009-07-29 Thread Mike Christie


> 
> The second patch is the more important one, as it
> fixes an error during LUN Reset handling in the
> initiator. When sending a LUN Reset during an
> ongoing R2T transfer, we're suspending Tx and
> aborting all _SCSI_ tasks. However, once we're
> done there we're resuming Tx and the R2T transfer
> will happily continue. So we should rather be

This should not be happening. When iscsi_suspend_tx returns the tx 
thread has stopped so we know there are no users accessing the task 
(well, there could be if a target is sending a tmf response then a r2t, 
but if the target is following the rfc there should not be).

So when fail_scsi_tasks calls

fail_scsi_task ->iscsi_complete_task (this will cleanup conn->task if 
this is the same task) -> __iscsi_put_task

this should be the last put on the task and that should release it 
calling iscsi_free_task which should call cleanup_task to kill any 
pending r2t handling and it would remove it from the requeue list.

If we are sending a data-out for a task that has had fail_scsi_task 
->iscsi_complete_task -> __iscsi_put_task called for it then we are in 
bigger trouble because the last put should have been called on it and we 
  are accessing a bad task.

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---



[PATCH 0/2] Update TMF handling

2009-07-29 Thread Hannes Reinecke

Hi all,

these two patches update the TMF handling to make it more
efficient and less error prone.

The first patch is just a minor tweak to allow new TMF
tasks as soon as we've received a response for the pending
one. Reasoning here is that eg LUN Reset might take
quite a while to abort all outstanding tasks, during which
time we cannot send any other LUN Reset even to another
LUN. So obviously, allowing another LUN Reset here
is the right thing to do. And even if we would be sending
a LUN Reset to this LUN we wouldn't do any harm as the
SCSI command abort is protected by a lock, so nothing
will happen here for consecutive LUN Resets.
And of course we're observing the error recovery
hierarchy, so an ABORT TASK will be rejected if LUN
Reset is in progress.

The second patch is the more important one, as it
fixes an error during LUN Reset handling in the
initiator. When sending a LUN Reset during an
ongoing R2T transfer, we're suspending Tx and
aborting all _SCSI_ tasks. However, once we're
done there we're resuming Tx and the R2T transfer
will happily continue. So we should rather be
checking for ongoing TMF tasks in iscsi_task_xmit
and terminate the I/O of the task affects us.
Note we're not actually interested in the outcome
of the TMF task as the I/O will be stopped
anyway even if the TMF task fails.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~--~~~~--~~--~--~---