Re: Lost active R2T transfers during reset
Mike Christie wrote:
> Hannes Reinecke wrote:
>> Hi Mike,
>>
>> as you might've seen, I finally found the problem for the MSA dropping the connection. It seems that it follows this section from the RFC:
>>
>>     For the LOGICAL UNIT RESET function, the target MUST behave as dictated by the Logical Unit Reset function in [SAM2].
>>
>> where SAM2 says:
>>
>>     When a logical unit is aborting one or more tasks from a SCSI initiator port with the TASK ABORTED status it should complete all of those tasks before entering additional tasks from that SCSI initiator port into the task set.
>>
>> So the tasks must be _completed_ at the target. Which can be interpreted as requiring the target to send an ABORT_TASK_SET to each outstanding task, so that this section applies:
>>
>>     For ABORT TASK SET and CLEAR TASK SET, the issuing initiator MUST continue to respond to all valid target transfer tags (received via R2T, Text Response, NOP-In, or SCSI Data-In PDUs) related to the affected task set, even after issuing the task management request. The issuing initiator SHOULD however terminate (i.e., by setting the F-bit to 1) these response sequences as quickly as possible. The target on its part MUST wait for responses on all affected target transfer tags before acting on either of these two task management requests. In case all or part of the response sequence is not received (due to digest errors) for a valid TTT, the target MAY treat it as a case of within-command error recovery class (see Section 6.1.4.1 Recovery Within-command) if it is supporting ErrorRecoveryLevel = 1, or alternatively may drop the connection to complete the requested task set function.
>>
>> This is clarified by RFC 5048 section 4.1.2:
>>
>>     The initiator iSCSI layer:
>>     a. MUST continue to respond to each TTT received for the affected tasks
>
> 4.1.2 and the passage above it from 3720 applies to lu reset too right? That is my understanding. The comment about sending an ABORT_TASK_SET confused me.
>
>> [ .. ]
>>     The target iSCSI layer:
>>     a. MUST wait for responses on currently valid target-transfer tags of the affected tasks from the issuing initiator.
>>
>> Which is exactly what I've seen with the 'ttt tracking' patch:
>>
>> Aug 4 13:58:10 tyne kernel: session2: iscsi_eh_device_reset LU Reset [sc 88005cf4ba80 lun 1]
>> Aug 4 13:58:10 tyne kernel: session2: iscsi_exec_task_mgmt_fn tmf set timeout
>> Aug 4 13:58:10 tyne kernel: session2: iscsi_eh_device_reset dev reset result = SUCCESS
>> Aug 4 13:58:12 tyne kernel: session2: iscsi_eh_device_reset LU Reset [sc 88005cc12880 lun 2]
>> Aug 4 13:58:12 tyne kernel: session2: iscsi_exec_task_mgmt_fn tmf set timeout
>> Aug 4 13:58:12 tyne kernel: session2: fail_scsi_task task itt 0xe ttt 0xc5cf6a01 sc 8800378c9d80 still active
>> Aug 4 13:58:12 tyne kernel: session2: fail_scsi_task task itt 0x15 ttt 0x2590d700 sc 88007a5c8980 still active
>> Aug 4 13:58:12 tyne kernel: session2: fail_scsi_task task itt 0x18 ttt 0x4926d000 sc 880078d8da80 still active
>> Aug 4 13:58:12 tyne kernel: session2: fail_scsi_task task itt 0x1f ttt 0x89ac9500 sc 88007a5de080 still active
>> Aug 4 13:58:12 tyne kernel: session2: fail_scsi_task task itt 0x27 ttt 0x7d0d4201 sc 8800378c9680 still active
>> Aug 4 13:58:12 tyne kernel: session2: fail_scsi_task task itt 0x28 ttt 0x4e2c1b01 sc 8800724cf680 still active
>
> I think what is being checked in the ttt tracking patch and what is mentioned in the RFC are different.

Might well be; it's not that I've understood the iscsi stack in all its subtleties.

> I think we only need to respond to commands like r2t from the target in order to satisfy the ttt comment. If fast_abort is 0/No, then when we get an R2T we will send the data for it. This completes the sequence that the target is waiting for.

Correct. My point here is that there might still be some Data-out PDUs stuck in the queue, which will never get sent as we break out on the first non-eligible PDU.
> We might slightly violate the RFC in that we send all the data for the r2t, and the RFC says to terminate the sequence quickly so maybe it wanted us to send a data-out with the F bit set but not all the data. I do not know. It probably does not matter.

Don't know either, there is this section in the spec:

    An R2T MAY be answered with one or more SCSI Data-Out PDUs with a matching Target Transfer Tag. If an R2T is answered with a single Data-Out PDU, the Buffer Offset in the Data PDU MUST be the same as the one specified by the R2T, and the data length of the Data PDU MUST be the same as the Desired Data Transfer Length specified in the R2T. If the R2T is answered with a sequence of Data PDUs, the Buffer Offset and Length MUST be within the range of those specified by R2T, and the last PDU MUST have the F bit set to 1. If the last PDU (marked with the F bit) is received before the Desired Data
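
[Editor's sketch] The R2T rule quoted above (every Data-Out carries the R2T's Target Transfer Tag, stays inside the offset/length range the R2T asked for, and only the last PDU has the F bit set) can be illustrated with a small model. This is not the libiscsi code; `DataOut`, `answer_r2t` and `max_segment` are hypothetical names, with `max_segment` standing in for whatever per-PDU data limit the session negotiated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataOut:
    ttt: int            # Target Transfer Tag copied from the R2T
    buffer_offset: int  # offset of this segment within the command's data
    length: int         # number of data bytes carried by this PDU
    final: bool         # F bit: set only on the last PDU of the sequence


def answer_r2t(ttt: int, r2t_offset: int, desired_length: int,
               max_segment: int) -> List[DataOut]:
    """Split one R2T into Data-Out PDUs of at most max_segment bytes each."""
    if desired_length <= 0 or max_segment <= 0:
        raise ValueError("lengths must be positive")
    pdus: List[DataOut] = []
    sent = 0
    while sent < desired_length:
        chunk = min(max_segment, desired_length - sent)
        pdus.append(DataOut(ttt=ttt,
                            buffer_offset=r2t_offset + sent,
                            length=chunk,
                            final=(sent + chunk == desired_length)))
        sent += chunk
    return pdus
```

This sketch always sends the full Desired Data Transfer Length, which matches the "we send all the data for the r2t" behaviour discussed in the thread. Whether an initiator should instead cut the sequence short with an early F bit during a task management request is exactly the open question above, and the sketch does not settle it.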
Re: Lost active R2T transfers during reset
Mike Christie wrote:
> Mike Christie wrote:
>> Note: if you are running with fast_abort=1/Yes, then we have that problem I mentioned before where a task can get stuck at the head of the requeue/cmd list and so tasks after it will not get run, and in that case r2ts might not get answered.
>
> Oh yeah, make sure you are not using your cmdsn window patch, because your patch will prevent data-outs from being sent in response to r2ts if the window is closed.

Yes, did so. Nae bother.

Cheers,

Hannes
--
Dr. Hannes Reinecke		      zSeries Storage
h...@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
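
[Editor's sketch] The "stuck at the head of the list" problem described above, and the earlier remark about breaking out on the first non-eligible PDU, can be reproduced with a toy queue. The function names and the string-based PDUs are purely illustrative and are not taken from the initiator code; the second function exists only for contrast and is not a proposed fix.

```python
from collections import deque
from typing import Callable, Deque, List


def drain_stop_at_first(queue: Deque[str],
                        eligible: Callable[[str], bool]) -> List[str]:
    """Send from the head and stop at the first PDU that may not be sent."""
    sent: List[str] = []
    while queue:
        if not eligible(queue[0]):
            break  # everything behind the blocked head stays queued
        sent.append(queue.popleft())
    return sent


def drain_skip_blocked(queue: Deque[str],
                       eligible: Callable[[str], bool]) -> List[str]:
    """Send every eligible PDU and leave blocked ones queued in order."""
    sent: List[str] = []
    kept: Deque[str] = deque()
    while queue:
        pdu = queue.popleft()
        if eligible(pdu):
            sent.append(pdu)
        else:
            kept.append(pdu)
    queue.extend(kept)
    return sent
```

With a queue of `["cmd-blocked", "data-out-1", "data-out-2"]` and a predicate that only allows Data-Outs, the first function sends nothing, so the target never sees the answers to its R2Ts. The second sends both Data-Outs and leaves only the blocked command behind.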
Re: Lost active R2T transfers during reset
On 08/05/2009 01:52 AM, Hannes Reinecke wrote:
> Aug 5 08:46:16 tyne kernel: session2: fail_scsi_task task itt 0x5 ttt 0xf2e7e801 sc 880078866180 still active
> Aug 5 08:46:16 tyne kernel: connection2:0: pending r2t itt 0x5 ttt 0xf2e7e801 dropped
> Aug 5 08:46:16 tyne kernel: session2: fail_scsi_task task itt 0xf ttt 0x65c42601 sc 880073d5c480 still active
> Aug 5 08:46:16 tyne kernel: connection2:0: pending r2t itt 0xf ttt 0x65c42601 dropped
> Aug 5 08:46:16 tyne kernel: session2: iscsi_eh_device_reset dev reset result = SUCCESS
>
> So my patch wasn't that far off the mark :-)

Yeah :) Something is screwing up. I will do some digging in that code. The original r2t code had some weird optimizations that always screw me up.
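
[Editor's sketch] To make the "pending r2t ... dropped" lines concrete, here is a minimal bookkeeping model. It records an R2T when it arrives, clears it when the final Data-Out for that TTT has been sent, and reports what is still unanswered at reset time. `R2TTracker` and its methods are hypothetical; this is neither the 'ttt tracking' patch nor the libiscsi implementation. It follows the reading that a TTT counts as answered once the Data-Out sequence is finished, regardless of whether the task's SCSI response has arrived yet.

```python
from typing import Dict, List, Tuple


class R2TTracker:
    """Track R2Ts that have not yet been answered with a final Data-Out."""

    def __init__(self) -> None:
        # (itt, ttt) -> Desired Data Transfer Length from the R2T
        self._pending: Dict[Tuple[int, int], int] = {}

    def on_r2t(self, itt: int, ttt: int, desired_length: int) -> None:
        """Remember an R2T received from the target."""
        self._pending[(itt, ttt)] = desired_length

    def on_data_out_sent(self, itt: int, ttt: int, final: bool) -> None:
        """Clear the R2T once the Data-Out carrying the F bit has gone out."""
        if (itt, ttt) not in self._pending:
            raise KeyError("Data-Out for unknown itt/ttt pair")
        if final:
            del self._pending[(itt, ttt)]

    def outstanding(self) -> List[Tuple[int, int]]:
        """Return the (itt, ttt) pairs the target is still waiting on."""
        return sorted(self._pending)
```

Calling `outstanding()` just before a LU reset completes would list the pairs that correspond to the "dropped" messages in the log above.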
Re: Lost active R2T transfers during reset
Hannes Reinecke wrote:
> Hi Mike,
>
> as you might've seen, I finally found the problem for the MSA dropping the connection. It seems that it follows this section from the RFC:
>
>     For the LOGICAL UNIT RESET function, the target MUST behave as dictated by the Logical Unit Reset function in [SAM2].
>
> where SAM2 says:
>
>     When a logical unit is aborting one or more tasks from a SCSI initiator port with the TASK ABORTED status it should complete all of those tasks before entering additional tasks from that SCSI initiator port into the task set.
>
> So the tasks must be _completed_ at the target. Which can be interpreted as requiring the target to send an ABORT_TASK_SET to each outstanding task, so that this section applies:
>
>     For ABORT TASK SET and CLEAR TASK SET, the issuing initiator MUST continue to respond to all valid target transfer tags (received via R2T, Text Response, NOP-In, or SCSI Data-In PDUs) related to the affected task set, even after issuing the task management request. The issuing initiator SHOULD however terminate (i.e., by setting the F-bit to 1) these response sequences as quickly as possible. The target on its part MUST wait for responses on all affected target transfer tags before acting on either of these two task management requests. In case all or part of the response sequence is not received (due to digest errors) for a valid TTT, the target MAY treat it as a case of within-command error recovery class (see Section 6.1.4.1 Recovery Within-command) if it is supporting ErrorRecoveryLevel = 1, or alternatively may drop the connection to complete the requested task set function.
>
> This is clarified by RFC 5048 section 4.1.2:
>
>     The initiator iSCSI layer:
>     a. MUST continue to respond to each TTT received for the affected tasks

4.1.2 and the passage above it from 3720 applies to lu reset too right? That is my understanding. The comment about sending an ABORT_TASK_SET confused me.

> [ .. ]
>     The target iSCSI layer:
>     a. MUST wait for responses on currently valid target-transfer tags of the affected tasks from the issuing initiator.
>
> Which is exactly what I've seen with the 'ttt tracking' patch:
>
> Aug 4 13:58:10 tyne kernel: session2: iscsi_eh_device_reset LU Reset [sc 88005cf4ba80 lun 1]
> Aug 4 13:58:10 tyne kernel: session2: iscsi_exec_task_mgmt_fn tmf set timeout
> Aug 4 13:58:10 tyne kernel: session2: iscsi_eh_device_reset dev reset result = SUCCESS
> Aug 4 13:58:12 tyne kernel: session2: iscsi_eh_device_reset LU Reset [sc 88005cc12880 lun 2]
> Aug 4 13:58:12 tyne kernel: session2: iscsi_exec_task_mgmt_fn tmf set timeout
> Aug 4 13:58:12 tyne kernel: session2: fail_scsi_task task itt 0xe ttt 0xc5cf6a01 sc 8800378c9d80 still active
> Aug 4 13:58:12 tyne kernel: session2: fail_scsi_task task itt 0x15 ttt 0x2590d700 sc 88007a5c8980 still active
> Aug 4 13:58:12 tyne kernel: session2: fail_scsi_task task itt 0x18 ttt 0x4926d000 sc 880078d8da80 still active
> Aug 4 13:58:12 tyne kernel: session2: fail_scsi_task task itt 0x1f ttt 0x89ac9500 sc 88007a5de080 still active
> Aug 4 13:58:12 tyne kernel: session2: fail_scsi_task task itt 0x27 ttt 0x7d0d4201 sc 8800378c9680 still active
> Aug 4 13:58:12 tyne kernel: session2: fail_scsi_task task itt 0x28 ttt 0x4e2c1b01 sc 8800724cf680 still active

I think what is being checked in the ttt tracking patch and what is mentioned in the RFC are different.

I think we only need to respond to commands like r2t from the target in order to satisfy the ttt comment. If fast_abort is 0/No, then when we get an R2T we will send the data for it. This completes the sequence that the target is waiting for. We might slightly violate the RFC in that we send all the data for the r2t, and the RFC says to terminate the sequence quickly so maybe it wanted us to send a data-out with the F bit set but not all the data. I do not know. It probably does not matter.
Once we send the data-outs for all the data that the r2t requested, then the target can send another r2t, send a response for the task (it can send a scsi cmd pdu indicating an error), or it can respond to the TMF that was affecting it.

Your patch considers the TTT completed when the entire command/task is completed. So you are waiting for the initiator to get the task's status in a scsi cmd pdu (for writes). If we do not get status that the task is completed then your patch prints an error.

What your patch is expecting to happen with the current code is for the lu reset to be sent, then the R2T to be responded to, then the target to send a scsi cmd response pdu for the tasks affected by the TMF. I do not think this is right, because when the target sends the TMF response then the response applies to all the affected tasks and we do not need a response for each individual scsi command.

If you want to see if r2ts are being dropped you can