Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
I did set the scsi_logging_level as you wrote, but it still resulted into exactly the same dmesg (resp. sudo journalctl -b 0) output with the entire patchset (including the scmd-eh_eflags = 0;). Except for the reverted do nothings, for course. However, I did miss to mention the following which is printed once with or without the scsi_logging_level change: [16254.735372] sd 9:0:0:0: Device offlined - not ready after error recovery [16254.735393] sd 9:0:0:0: rejecting I/O to offline device [16254.735397] sd 9:0:0:0: [sdf] killing request [16254.735488] sd 9:0:0:0: [sdf] Unhandled error code [16254.735489] sd 9:0:0:0: [sdf] [16254.735490] Result: hostbyte=0x01 driverbyte=0x00 [16254.735492] sd 9:0:0:0: [sdf] CDB: [16254.735493] cdb[0]=0x2a: 2a 00 03 1c 10 f8 00 00 f0 00 [16254.735498] end_request: I/O error, dev sdf, sector 52171000 Another: [19389.907015] sd 11:0:0:0: Device offlined - not ready after error recovery [19389.907037] sd 11:0:0:0: [sdf] Unhandled error code [19389.907044] sd 11:0:0:0: [sdf] [19389.907049] Result: hostbyte=0x05 driverbyte=0x00 [19389.907055] sd 11:0:0:0: [sdf] CDB: [19389.907060] cdb[0]=0x28: 28 00 03 13 0d 28 00 00 08 00 [19389.907083] end_request: I/O error, dev sdf, sector 51580200 [19389.907121] sd 11:0:0:0: rejecting I/O to offline device [19389.907126] sd 11:0:0:0: killing request In case it matters, the stick's partition stays readable with regard to any data on it that has been cached. On 10.04.2014 13:37, Hannes Reinecke wrote: Andreas, can you test with the entire patch series and enable 'scsi_logging_level -s -E 5' prior to running the tests? -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
That patch appears to work in preventing the crashes, judged on one repeated appearance of the bug. dmesg had the usual [ 215.229903] usb 4-2: usb_disable_lpm called, do nothing [ 215.336941] usb 4-2: reset SuperSpeed USB device number 3 using xhci_hcd [ 215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b829c0 [ 215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b82a08 [ 215.350621] usb 4-2: usb_enable_lpm called, do nothing repeated five times, followed by one [ 282.795801] sd 8:0:0:0: Device offlined - not ready after error recovery and then as often as something tried to read from it: [ 295.585472] sd 8:0:0:0: rejecting I/O to offline device The stick could then be properly un- and remounted (the latter if it had been physically replugged) without issue — for the bug to reoccur after one to three minutes. I tried this three times, no dmesg difference except the ep addresses varied on two of that. Andreas Reis On 09.04.2014 20:02, Alan Stern wrote: On Wed, 9 Apr 2014, Hannes Reinecke wrote: I finally got a chance to try it out. It does seem to do what we want. I didn't track the flow of control in complete detail, but the command definitely got aborted both times it was issued. Good, so it is as I thought. James, can we include this patch instead of your prior solution? First, we should have the original bug reporter try it out. Andreas, the patch in question can be found here: http://marc.info/?l=linux-usbm=13962706597w=2 Can you try this in place of the 1/3 patch posted by James? It should have the same effect, of preventing your system from crashing when the READ command fails. Alan Stern -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
Only your 0/3 patch to which Alan linked, along with two other patches by Mathias Nyman (disable usb3 on intel hosts and disable all lpm related control transfers, one of which is the source of the do nothings). I'll revert the latter two and apply the rest of the set. Which I'm guessing currently consists of said 0/3 patch — http://www.spinics.net/lists/linux-scsi/msg73502.html — plus 2/3 and 3/3? Or should I just omit 0/3 and try whichever of the two in 1/3 works best? Rather confusing ATM. Anyway, for whatever reason the bug is happening rather frequently now. I've spotted the following occurring after the Device offlined line two times now: [ 206.901385] sd 11:0:0:0: [sdg] Unhandled error code [ 206.901394] sd 11:0:0:0: [sdg] [ 206.901397] Result: hostbyte=0x01 driverbyte=0x00 [ 206.901400] sd 11:0:0:0: [sdg] CDB: [ 206.901403] cdb[0]=0x2a: 2a 00 02 25 1b 50 00 00 08 00 [ 206.901419] end_request: I/O error, dev sdg, sector 35986256 The second time had sd 12:0:0:0, cdb[0]=0x28: 28 00 03 94 77 20 00 00 08 00 and a different sector. Andreas Reis On 10.04.2014 13:37, Hannes Reinecke wrote: On 04/10/2014 12:58 PM, Andreas Reis wrote: That patch appears to work in preventing the crashes, judged on one repeated appearance of the bug. dmesg had the usual [ 215.229903] usb 4-2: usb_disable_lpm called, do nothing [ 215.336941] usb 4-2: reset SuperSpeed USB device number 3 using xhci_hcd [ 215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b829c0 [ 215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b82a08 [ 215.350621] usb 4-2: usb_enable_lpm called, do nothing repeated five times, followed by one [ 282.795801] sd 8:0:0:0: Device offlined - not ready after error recovery and then as often as something tried to read from it: [ 295.585472] sd 8:0:0:0: rejecting I/O to offline device The stick could then be properly un- and remounted (the latter if it had been physically replugged) without issue — for the bug to reoccur after one to three minutes. I tried this three times, no dmesg difference except the ep addresses varied on two of that. Was this just that patch you've tested with or the entire patch series? If the latter, Alan, is this the expected outcome? I would've thought the error recover should _not_ run into offlining devices here, but rather the device should be recovered eventually. Andreas, can you test with the entire patch series and enable 'scsi_logging_level -s -E 5' prior to running the tests? THX. Cheers, Hannes -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On Thu, 10 Apr 2014, Hannes Reinecke wrote: On 04/10/2014 12:58 PM, Andreas Reis wrote: That patch appears to work in preventing the crashes, judged on one repeated appearance of the bug. dmesg had the usual [ 215.229903] usb 4-2: usb_disable_lpm called, do nothing [ 215.336941] usb 4-2: reset SuperSpeed USB device number 3 using xhci_hcd [ 215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b829c0 [ 215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b82a08 [ 215.350621] usb 4-2: usb_enable_lpm called, do nothing repeated five times, followed by one [ 282.795801] sd 8:0:0:0: Device offlined - not ready after error recovery and then as often as something tried to read from it: [ 295.585472] sd 8:0:0:0: rejecting I/O to offline device The stick could then be properly un- and remounted (the latter if it had been physically replugged) without issue � for the bug to reoccur after one to three minutes. I tried this three times, no dmesg difference except the ep addresses varied on two of that. Was this just that patch you've tested with or the entire patch series? If the latter, Alan, is this the expected outcome? Yes, it is. The same thing should happen with the entire patch series. I would've thought the error recover should _not_ run into offlining devices here, but rather the device should be recovered eventually. The command times out, it is aborted, and the command is retried. The same thing happens, and we repeat five times. Eventually the SCSI core gives up and declares the device to be offline. Alan Stern -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On 04/10/2014 05:31 PM, Alan Stern wrote: On Thu, 10 Apr 2014, Hannes Reinecke wrote: On 04/10/2014 12:58 PM, Andreas Reis wrote: That patch appears to work in preventing the crashes, judged on one repeated appearance of the bug. dmesg had the usual [ 215.229903] usb 4-2: usb_disable_lpm called, do nothing [ 215.336941] usb 4-2: reset SuperSpeed USB device number 3 using xhci_hcd [ 215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b829c0 [ 215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b82a08 [ 215.350621] usb 4-2: usb_enable_lpm called, do nothing repeated five times, followed by one [ 282.795801] sd 8:0:0:0: Device offlined - not ready after error recovery and then as often as something tried to read from it: [ 295.585472] sd 8:0:0:0: rejecting I/O to offline device The stick could then be properly un- and remounted (the latter if it had been physically replugged) without issue � for the bug to reoccur after one to three minutes. I tried this three times, no dmesg difference except the ep addresses varied on two of that. Was this just that patch you've tested with or the entire patch series? If the latter, Alan, is this the expected outcome? Yes, it is. The same thing should happen with the entire patch series. I would've thought the error recover should _not_ run into offlining devices here, but rather the device should be recovered eventually. The command times out, it is aborted, and the command is retried. The same thing happens, and we repeat five times. Eventually the SCSI core gives up and declares the device to be offline. Hmm. Ok. If you are fine with it who am I to argue here. James, shall I resent the patch series? Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On Thu, 2014-04-10 at 19:52 +0200, Hannes Reinecke wrote: On 04/10/2014 05:31 PM, Alan Stern wrote: On Thu, 10 Apr 2014, Hannes Reinecke wrote: On 04/10/2014 12:58 PM, Andreas Reis wrote: That patch appears to work in preventing the crashes, judged on one repeated appearance of the bug. dmesg had the usual [ 215.229903] usb 4-2: usb_disable_lpm called, do nothing [ 215.336941] usb 4-2: reset SuperSpeed USB device number 3 using xhci_hcd [ 215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b829c0 [ 215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b82a08 [ 215.350621] usb 4-2: usb_enable_lpm called, do nothing repeated five times, followed by one [ 282.795801] sd 8:0:0:0: Device offlined - not ready after error recovery and then as often as something tried to read from it: [ 295.585472] sd 8:0:0:0: rejecting I/O to offline device The stick could then be properly un- and remounted (the latter if it had been physically replugged) without issue � for the bug to reoccur after one to three minutes. I tried this three times, no dmesg difference except the ep addresses varied on two of that. Was this just that patch you've tested with or the entire patch series? If the latter, Alan, is this the expected outcome? Yes, it is. The same thing should happen with the entire patch series. I would've thought the error recover should _not_ run into offlining devices here, but rather the device should be recovered eventually. The command times out, it is aborted, and the command is retried. The same thing happens, and we repeat five times. Eventually the SCSI core gives up and declares the device to be offline. Hmm. Ok. If you are fine with it who am I to argue here. James, shall I resent the patch series? You mean the one patch? No, it's OK, I have it. It's still not complete, though, as I've said a couple of times. The problem is that we have abort memory on any eh command as well, which this doesn't fix. The scenario is abort command, set flag, abort completes, send TUR, TUR doesn't return, so we now try to abort the TUR, but scsi_abort_eh_cmnd() will skip the abort because the flag is set and move straight to reset. The fix is this, I can just add it as well. James --- diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 771c16b..7516e2c 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -920,6 +920,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses, ses-prot_op = scmd-prot_op; scmd-prot_op = SCSI_PROT_NORMAL; + scmd-eh_eflags = 0; scmd-cmnd = ses-eh_cmnd; memset(scmd-cmnd, 0, BLK_MAX_CDB); memset(scmd-sdb, 0, sizeof(scmd-sdb)); -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On 04/10/2014 10:36 PM, James Bottomley wrote: On Thu, 2014-04-10 at 19:52 +0200, Hannes Reinecke wrote: On 04/10/2014 05:31 PM, Alan Stern wrote: On Thu, 10 Apr 2014, Hannes Reinecke wrote: On 04/10/2014 12:58 PM, Andreas Reis wrote: That patch appears to work in preventing the crashes, judged on one repeated appearance of the bug. dmesg had the usual [ 215.229903] usb 4-2: usb_disable_lpm called, do nothing [ 215.336941] usb 4-2: reset SuperSpeed USB device number 3 using xhci_hcd [ 215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b829c0 [ 215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called with disabled ep 880427b82a08 [ 215.350621] usb 4-2: usb_enable_lpm called, do nothing repeated five times, followed by one [ 282.795801] sd 8:0:0:0: Device offlined - not ready after error recovery and then as often as something tried to read from it: [ 295.585472] sd 8:0:0:0: rejecting I/O to offline device The stick could then be properly un- and remounted (the latter if it had been physically replugged) without issue � for the bug to reoccur after one to three minutes. I tried this three times, no dmesg difference except the ep addresses varied on two of that. Was this just that patch you've tested with or the entire patch series? If the latter, Alan, is this the expected outcome? Yes, it is. The same thing should happen with the entire patch series. I would've thought the error recover should _not_ run into offlining devices here, but rather the device should be recovered eventually. The command times out, it is aborted, and the command is retried. The same thing happens, and we repeat five times. Eventually the SCSI core gives up and declares the device to be offline. Hmm. Ok. If you are fine with it who am I to argue here. James, shall I resent the patch series? You mean the one patch? No, it's OK, I have it. It's still not complete, though, as I've said a couple of times. The problem is that we have abort memory on any eh command as well, which this doesn't fix. The scenario is abort command, set flag, abort completes, send TUR, TUR doesn't return, so we now try to abort the TUR, but scsi_abort_eh_cmnd() will skip the abort because the flag is set and move straight to reset. The fix is this, I can just add it as well. James --- diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 771c16b..7516e2c 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -920,6 +920,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct scsi_eh_save *ses, ses-prot_op = scmd-prot_op; scmd-prot_op = SCSI_PROT_NORMAL; + scmd-eh_eflags = 0; scmd-cmnd = ses-eh_cmnd; memset(scmd-cmnd, 0, BLK_MAX_CDB); memset(scmd-sdb, 0, sizeof(scmd-sdb)); Oh yes, that is correct. Acked-by: Hannes Reinecke h...@suse.de Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On 04/07/2014 05:26 PM, Alan Stern wrote: On Wed, 2 Apr 2014, Hannes Reinecke wrote: On 04/01/2014 11:28 PM, Alan Stern wrote: On Tue, 1 Apr 2014, Hannes Reinecke wrote: So if the above reasoning is okay then this patch should be doing the trick: diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 771c16b..0e72374 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd) /* * Retry after abort failed, escalate to next level. */ + scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED; SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_INFO, scmd, scmd %p previous abort failed\n, scmd)); (Beware of line breaks) Can you test with it? I don't understand. This doesn't solve the fundamental problem (namely that you escalate before aborting a running command). All it does is clear the SCSI_EH_ABORT_SCHEDULED flag before escalating. Which was precisely the point :-) Hmm. The comment might've been clearer. What this patch is _supposed_ to be doing is that it'll clear the SCSI_EH_ABORT_SCHEDULED flag it it has been set. Which means this will be the second time scsi_abort_command() has been called for the same command. IE the first abort went out, did its thing, but now the same command has timed out again. So this flag gets cleared, and scsi_abort_command() returns FAILED, and _no_ asynchronous abort is being scheduled. scsi_times_out() will then proceed to call scsi_eh_scmd_add(). But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag the SCSI_EH_CANCEL_CMD flag will continue to be set, and the command will be aborted with the main SCSI EH routine. It looks to me as if it should do what you desire, namely abort the command asynchronously the first time, and invoking the SCSI EH the second time. Am I wrong? I don't know -- I'll have to try it out. Currently I'm busy with a bunch of other stuff, so it will take some time. I finally got a chance to try it out. It does seem to do what we want. I didn't track the flow of control in complete detail, but the command definitely got aborted both times it was issued. Good, so it is as I thought. James, can we include this patch instead of your prior solution? Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On Wed, 9 Apr 2014, Hannes Reinecke wrote: I finally got a chance to try it out. It does seem to do what we want. I didn't track the flow of control in complete detail, but the command definitely got aborted both times it was issued. Good, so it is as I thought. James, can we include this patch instead of your prior solution? First, we should have the original bug reporter try it out. Andreas, the patch in question can be found here: http://marc.info/?l=linux-usbm=13962706597w=2 Can you try this in place of the 1/3 patch posted by James? It should have the same effect, of preventing your system from crashing when the READ command fails. Alan Stern -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On Wed, 2 Apr 2014, Hannes Reinecke wrote: On 04/01/2014 11:28 PM, Alan Stern wrote: On Tue, 1 Apr 2014, Hannes Reinecke wrote: So if the above reasoning is okay then this patch should be doing the trick: diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 771c16b..0e72374 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd) /* * Retry after abort failed, escalate to next level. */ + scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED; SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_INFO, scmd, scmd %p previous abort failed\n, scmd)); (Beware of line breaks) Can you test with it? I don't understand. This doesn't solve the fundamental problem (namely that you escalate before aborting a running command). All it does is clear the SCSI_EH_ABORT_SCHEDULED flag before escalating. Which was precisely the point :-) Hmm. The comment might've been clearer. What this patch is _supposed_ to be doing is that it'll clear the SCSI_EH_ABORT_SCHEDULED flag it it has been set. Which means this will be the second time scsi_abort_command() has been called for the same command. IE the first abort went out, did its thing, but now the same command has timed out again. So this flag gets cleared, and scsi_abort_command() returns FAILED, and _no_ asynchronous abort is being scheduled. scsi_times_out() will then proceed to call scsi_eh_scmd_add(). But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag the SCSI_EH_CANCEL_CMD flag will continue to be set, and the command will be aborted with the main SCSI EH routine. It looks to me as if it should do what you desire, namely abort the command asynchronously the first time, and invoking the SCSI EH the second time. Am I wrong? I don't know -- I'll have to try it out. Currently I'm busy with a bunch of other stuff, so it will take some time. I finally got a chance to try it out. It does seem to do what we want. I didn't track the flow of control in complete detail, but the command definitely got aborted both times it was issued. Alan Stern -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On 04/01/2014 11:28 PM, Alan Stern wrote: On Tue, 1 Apr 2014, Hannes Reinecke wrote: So if the above reasoning is okay then this patch should be doing the trick: diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 771c16b..0e72374 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd) /* * Retry after abort failed, escalate to next level. */ + scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED; SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_INFO, scmd, scmd %p previous abort failed\n, scmd)); (Beware of line breaks) Can you test with it? I don't understand. This doesn't solve the fundamental problem (namely that you escalate before aborting a running command). All it does is clear the SCSI_EH_ABORT_SCHEDULED flag before escalating. Which was precisely the point :-) Hmm. The comment might've been clearer. What this patch is _supposed_ to be doing is that it'll clear the SCSI_EH_ABORT_SCHEDULED flag it it has been set. Which means this will be the second time scsi_abort_command() has been called for the same command. IE the first abort went out, did its thing, but now the same command has timed out again. So this flag gets cleared, and scsi_abort_command() returns FAILED, and _no_ asynchronous abort is being scheduled. scsi_times_out() will then proceed to call scsi_eh_scmd_add(). But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag the SCSI_EH_CANCEL_CMD flag will continue to be set, and the command will be aborted with the main SCSI EH routine. It looks to me as if it should do what you desire, namely abort the command asynchronously the first time, and invoking the SCSI EH the second time. Am I wrong? I don't know -- I'll have to try it out. Currently I'm busy with a bunch of other stuff, so it will take some time. Looking through the code, I have to wonder why scsi_times_out() modifies scmd-result. Won't this value get overwritten by the LLDD when the command eventually terminates? Yes. However, the 'DID_TIME_OUT' is just a marker that the internal timeout code triggered. If the LLDD overwrites it with a 'real' error code everything's fine. Even worse, what happens in the event of a race where the command terminates normally just before scsi_times_out() changes scmd-result? _That_ is the least of our worries. _If_ the LLDD completes the command while scsi_times_out() is running the whole thing is going down the drain anyway, as the command will be terminated by the LLDD, and we can only hope that we didn't mess up our reference counting. Otherwise we'd have EH running on a stale command, which is going to be a fun to watch. But looking closer, it might be that the line modifying the result in scsi_times_out() is indeed pointless, seeing that it's being set in scsi_eh_abort_handler(), too. I'll be checking if we can simply remove that line. Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On 03/31/2014 05:03 PM, James Bottomley wrote: [lets split the thread] On Mon, 2014-03-31 at 16:37 +0200, Hannes Reinecke wrote: On 03/31/2014 03:33 PM, Alan Stern wrote: On Mon, 31 Mar 2014, Hannes Reinecke wrote: On 03/28/2014 08:29 PM, Alan Stern wrote: On Fri, 28 Mar 2014, James Bottomley wrote: Maybe scmd_eh_abort_handler() should check the flag before doing anything. Is there any sort of sychronization to prevent the same incarnation of a command from being aborted twice (or by two different threads at the same time)? If there is, it isn't obvious. See above. scsi_times_out() will only ever called once. What can happen, though, is that _theoretically_ the LLDD might decide to call -done() on a timed out command when scsi_eh_abort_handler() is still pending. That's okay. We can expect the LLDD to have sufficient locking to handle that sort of thing without confusion (usb-storage does, for example). (Also, what's going on at the start of scsi_abort_command()? Contrary to what one might expect, the first part of the function _cancels_ a scheduled abort. And it does so without clearing the SCSI_EH_ABORT_SCHEDULED flag.) The original idea was this: SCSI_EH_ABORT_SCHEDULED is sticky _per command_. Point is, any command abort is only ever send for a timed-out command. And the main problem for a timed-out command is that we simply _do not_ know what happened for that command. So _if_ a command timed out, _and_ we've send an abort, _and_ the command times out _again_ we'll be running into an endless loop between timeout and aborting, and never returning the command at all. So to prevent this we should set a marker on that command telling it to _not_ try to abort the command again. I disagree. We _have_ to abort the command again -- how else can we stop a running command? To prevent the loop you described, we should avoid _retrying_ the command after it is aborted the second time. The actual question is whether it's worth aborting the same command a second time. In principle any reset (like LUN reset etc) should clear the command, too. And the EH abort functionality is geared around this. If, for some reason, the transport layer / device driver requires a command abort to be send then sure, we need to accommodate for that. We already discussed this (that was my first response too). USB needs all outstanding commands aborted before proceeding to reset. I'm starting to think the actual way to fix this is to reset the abort scheduled only if we send something else, so this might be a better fix. It doesn't matter if we finish a command with abort scheduled because the command then gets freed and the flags cleaned. We can take our time with this because the other two patches, which I can send separately fix the current deadlock (we no longer send an unaborted request sense before the reset), and it can go via rc fixes. Yes, agreed. The USB case seems to be a bit more tricky, and at least I need some more time to fully understand the details and implications of this. Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On 04/01/2014 12:29 AM, Alan Stern wrote: On Mon, 31 Mar 2014, Hannes Reinecke wrote: Ah. Correct. But that's due to the first patch being incorrect. Cf my response to the original first patch. See my response to your response. :-) Okay, So I probably should refrain from issueing a response to your response to my response lest infinite recursion happens :-) Indeed. What should happen if some other pathway manages to call scsi_try_to_abort_cmd() while scmd-abort_work is still sitting on the work queue? The command would be aborted and the flag would be cleared, but the queued work would remain. Can this ever happen? Not that I could see. A command abort is only ever triggered by the request timeout from the block layer. And the timer is _not_ rearmed once the timeout function (here: scsi_times_out()) is called. Hence I fail to see how it can be called concurrently. scsi_try_to_abort_cmd() is also called (via a different pathway) when a command sent by the error handler itself times out. I haven't traced through all the different paths to make sure none of them can run concurrently. But I'm willing to take your word for it. Yes, but that's not calling scsi_abort_command(), but rather invokes scsi_abort_eh_cmnd(). Sure. But either way, we end up in scsi_try_to_abort_cmd(), which calls the LLDD's abort handler. Thus leading to the possibility of aborting the same command more than once. The original idea was this: SCSI_EH_ABORT_SCHEDULED is sticky _per command_. Point is, any command abort is only ever send for a timed-out command. And the main problem for a timed-out command is that we simply _do not_ know what happened for that command. So _if_ a command timed out, _and_ we've send an abort, _and_ the command times out _again_ we'll be running into an endless loop between timeout and aborting, and never returning the command at all. So to prevent this we should set a marker on that command telling it to _not_ try to abort the command again. I disagree. We _have_ to abort the command again -- how else can we stop a running command? To prevent the loop you described, we should avoid _retrying_ the command after it is aborted the second time. The actual question is whether it's worth aborting the same command a second time. In principle any reset (like LUN reset etc) should clear the command, too. And the EH abort functionality is geared around this. If, for some reason, the transport layer / device driver requires a command abort to be send then sure, we need to accommodate for that. As James mentioned, with USB a reset does not abort a running command. Instead it waits for the command to finish. (However, this could be changed in usb-storage, if required.) As said, yes, in principle you are right. We should be aborting the command a second time, _and then_ starting the escalation. So if the above reasoning is okay then this patch should be doing the trick: diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 771c16b..0e72374 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd) /* * Retry after abort failed, escalate to next level. */ + scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED; SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_INFO, scmd, scmd %p previous abort failed\n, scmd)); (Beware of line breaks) Can you test with it? I don't understand. This doesn't solve the fundamental problem (namely that you escalate before aborting a running command). All it does is clear the SCSI_EH_ABORT_SCHEDULED flag before escalating. Which was precisely the point :-) Hmm. The comment might've been clearer. What this patch is _supposed_ to be doing is that it'll clear the SCSI_EH_ABORT_SCHEDULED flag it it has been set. Which means this will be the second time scsi_abort_command() has been called for the same command. IE the first abort went out, did its thing, but now the same command has timed out again. So this flag gets cleared, and scsi_abort_command() returns FAILED, and _no_ asynchronous abort is being scheduled. scsi_times_out() will then proceed to call scsi_eh_scmd_add(). But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag the SCSI_EH_CANCEL_CMD flag will continue to be set, and the command will be aborted with the main SCSI EH routine. It looks to me as if it should do what you desire, namely abort the command asynchronously the first time, and invoking the SCSI EH the second time. Am I wrong? Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On Tue, 1 Apr 2014, Hannes Reinecke wrote: So if the above reasoning is okay then this patch should be doing the trick: diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 771c16b..0e72374 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd) /* * Retry after abort failed, escalate to next level. */ + scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED; SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_INFO, scmd, scmd %p previous abort failed\n, scmd)); (Beware of line breaks) Can you test with it? I don't understand. This doesn't solve the fundamental problem (namely that you escalate before aborting a running command). All it does is clear the SCSI_EH_ABORT_SCHEDULED flag before escalating. Which was precisely the point :-) Hmm. The comment might've been clearer. What this patch is _supposed_ to be doing is that it'll clear the SCSI_EH_ABORT_SCHEDULED flag it it has been set. Which means this will be the second time scsi_abort_command() has been called for the same command. IE the first abort went out, did its thing, but now the same command has timed out again. So this flag gets cleared, and scsi_abort_command() returns FAILED, and _no_ asynchronous abort is being scheduled. scsi_times_out() will then proceed to call scsi_eh_scmd_add(). But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag the SCSI_EH_CANCEL_CMD flag will continue to be set, and the command will be aborted with the main SCSI EH routine. It looks to me as if it should do what you desire, namely abort the command asynchronously the first time, and invoking the SCSI EH the second time. Am I wrong? I don't know -- I'll have to try it out. Currently I'm busy with a bunch of other stuff, so it will take some time. Looking through the code, I have to wonder why scsi_times_out() modifies scmd-result. Won't this value get overwritten by the LLDD when the command eventually terminates? Even worse, what happens in the event of a race where the command terminates normally just before scsi_times_out() changes scmd-result? Alan Stern -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On 03/28/2014 08:29 PM, Alan Stern wrote: On Fri, 28 Mar 2014, James Bottomley wrote: This is a set of three patches we agreed to a while ago to eliminate a USB deadlock. I did rewrite the first patch, if it could be reviewed and tested. I tested all three patches under the same conditions as before, and they all worked correctly. In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't entirely clear. This boils down to two questions, which I don't know the answers to: What should happen in scmd_eh_abort_handler() if scsi_host_eh_past_deadline() returns true and thereby prevents scsi_try_to_abort_cmd() from being called? The flag wouldn't get cleared then. Ah. Correct. But that's due to the first patch being incorrect. Cf my response to the original first patch. What should happen if some other pathway manages to call scsi_try_to_abort_cmd() while scmd-abort_work is still sitting on the work queue? The command would be aborted and the flag would be cleared, but the queued work would remain. Can this ever happen? Not that I could see. A command abort is only ever triggered by the request timeout from the block layer. And the timer is _not_ rearmed once the timeout function (here: scsi_times_out()) is called. Hence I fail to see how it can be called concurrently. Maybe scmd_eh_abort_handler() should check the flag before doing anything. Is there any sort of sychronization to prevent the same incarnation of a command from being aborted twice (or by two different threads at the same time)? If there is, it isn't obvious. See above. scsi_times_out() will only ever called once. What can happen, though, is that _theoretically_ the LLDD might decide to call -done() on a timed out command when scsi_eh_abort_handler() is still pending. (Also, what's going on at the start of scsi_abort_command()? Contrary to what one might expect, the first part of the function _cancels_ a scheduled abort. And it does so without clearing the SCSI_EH_ABORT_SCHEDULED flag.) The original idea was this: SCSI_EH_ABORT_SCHEDULED is sticky _per command_. Point is, any command abort is only ever send for a timed-out command. And the main problem for a timed-out command is that we simply _do not_ know what happened for that command. So _if_ a command timed out, _and_ we've send an abort, _and_ the command times out _again_ we'll be running into an endless loop between timeout and aborting, and never returning the command at all. So to prevent this we should set a marker on that command telling it to _not_ try to abort the command again. Which is what 'SCSI_EH_ABORT_SCHEDULED' was meant for: - A command times out, sets 'SCSI_EH_ABORT_SCHEDULED' - abort _succeeds_ - The same command times out a second time, notifies that SCSI_EH_ABORT_SCHEDULED is set, and doesn't call scsi_eh_abort_command() but rather escalates directly. _However_, I do feel that we've gotten the wrong track with all of this. In you original mail you stated: Basically, usb-storage deadlocks when the SCSI error handler invokes the eh_device_reset_handler callback while a command is running. The command has timed out and will never complete normally, because the device's firmware has crashed. But usb-storage's device-reset routine waits for the current command to finish, which brings everything to a standstill. Question now to you as the USB god: A command abort is only _ever_ send after a command timeout. And the main idea of the command abort is to remove any context the LLDD or the target might have. So by the time the command abort returns SUCCESS we _expect_ the firmware to have cleared that command. If for some reason the firmware isn't capable of doing so, it should return FAILED. So: - Has the command timeout fired? - If so, why hasn't it returned FAILED? - If it had returned SUCCESS, why did the device_reset_handler wait for the command to complete? Cheers, Hannes -- Dr. Hannes Reinecke zSeries Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On Mon, 31 Mar 2014, Hannes Reinecke wrote: On 03/28/2014 08:29 PM, Alan Stern wrote: On Fri, 28 Mar 2014, James Bottomley wrote: This is a set of three patches we agreed to a while ago to eliminate a USB deadlock. I did rewrite the first patch, if it could be reviewed and tested. I tested all three patches under the same conditions as before, and they all worked correctly. In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't entirely clear. This boils down to two questions, which I don't know the answers to: What should happen in scmd_eh_abort_handler() if scsi_host_eh_past_deadline() returns true and thereby prevents scsi_try_to_abort_cmd() from being called? The flag wouldn't get cleared then. Ah. Correct. But that's due to the first patch being incorrect. Cf my response to the original first patch. See my response to your response. :-) What should happen if some other pathway manages to call scsi_try_to_abort_cmd() while scmd-abort_work is still sitting on the work queue? The command would be aborted and the flag would be cleared, but the queued work would remain. Can this ever happen? Not that I could see. A command abort is only ever triggered by the request timeout from the block layer. And the timer is _not_ rearmed once the timeout function (here: scsi_times_out()) is called. Hence I fail to see how it can be called concurrently. scsi_try_to_abort_cmd() is also called (via a different pathway) when a command sent by the error handler itself times out. I haven't traced through all the different paths to make sure none of them can run concurrently. But I'm willing to take your word for it. Maybe scmd_eh_abort_handler() should check the flag before doing anything. Is there any sort of sychronization to prevent the same incarnation of a command from being aborted twice (or by two different threads at the same time)? If there is, it isn't obvious. See above. scsi_times_out() will only ever called once. What can happen, though, is that _theoretically_ the LLDD might decide to call -done() on a timed out command when scsi_eh_abort_handler() is still pending. That's okay. We can expect the LLDD to have sufficient locking to handle that sort of thing without confusion (usb-storage does, for example). (Also, what's going on at the start of scsi_abort_command()? Contrary to what one might expect, the first part of the function _cancels_ a scheduled abort. And it does so without clearing the SCSI_EH_ABORT_SCHEDULED flag.) The original idea was this: SCSI_EH_ABORT_SCHEDULED is sticky _per command_. Point is, any command abort is only ever send for a timed-out command. And the main problem for a timed-out command is that we simply _do not_ know what happened for that command. So _if_ a command timed out, _and_ we've send an abort, _and_ the command times out _again_ we'll be running into an endless loop between timeout and aborting, and never returning the command at all. So to prevent this we should set a marker on that command telling it to _not_ try to abort the command again. I disagree. We _have_ to abort the command again -- how else can we stop a running command? To prevent the loop you described, we should avoid _retrying_ the command after it is aborted the second time. Which is what 'SCSI_EH_ABORT_SCHEDULED' was meant for: - A command times out, sets 'SCSI_EH_ABORT_SCHEDULED' - abort _succeeds_ - The same command times out a second time, notifies that SCSI_EH_ABORT_SCHEDULED is set, and doesn't call scsi_eh_abort_command() but rather escalates directly. The proper time to escalate is after the command is aborted again, not while the command is still running. The only situation where you might want to escalate while a command is still running would be if you were unable to abort the command. (Hmmm, maybe that's not true for SCSI devices in general. It is true for USB mass-storage, however. Perhaps the reset handlers in usb-storage should be changed so that they will automatically abort a running command before starting the reset.) _However_, I do feel that we've gotten the wrong track with all of this. In you original mail you stated: Basically, usb-storage deadlocks when the SCSI error handler invokes the eh_device_reset_handler callback while a command is running. The command has timed out and will never complete normally, because the device's firmware has crashed. But usb-storage's device-reset routine waits for the current command to finish, which brings everything to a standstill. Question now to you as the USB god: A command abort is only _ever_ send after a command timeout. And the main idea of the command abort is to remove any context the LLDD or the target might have. So by the time the command abort returns SUCCESS we _expect_ the firmware to have cleared that
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On 03/31/2014 03:33 PM, Alan Stern wrote: On Mon, 31 Mar 2014, Hannes Reinecke wrote: On 03/28/2014 08:29 PM, Alan Stern wrote: On Fri, 28 Mar 2014, James Bottomley wrote: This is a set of three patches we agreed to a while ago to eliminate a USB deadlock. I did rewrite the first patch, if it could be reviewed and tested. I tested all three patches under the same conditions as before, and they all worked correctly. In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't entirely clear. This boils down to two questions, which I don't know the answers to: What should happen in scmd_eh_abort_handler() if scsi_host_eh_past_deadline() returns true and thereby prevents scsi_try_to_abort_cmd() from being called? The flag wouldn't get cleared then. Ah. Correct. But that's due to the first patch being incorrect. Cf my response to the original first patch. See my response to your response. :-) Okay, So I probably should refrain from issueing a response to your response to my response lest infinite recursion happens :-) What should happen if some other pathway manages to call scsi_try_to_abort_cmd() while scmd-abort_work is still sitting on the work queue? The command would be aborted and the flag would be cleared, but the queued work would remain. Can this ever happen? Not that I could see. A command abort is only ever triggered by the request timeout from the block layer. And the timer is _not_ rearmed once the timeout function (here: scsi_times_out()) is called. Hence I fail to see how it can be called concurrently. scsi_try_to_abort_cmd() is also called (via a different pathway) when a command sent by the error handler itself times out. I haven't traced through all the different paths to make sure none of them can run concurrently. But I'm willing to take your word for it. Yes, but that's not calling scsi_abort_command(), but rather invokes scsi_abort_eh_cmnd(). Maybe scmd_eh_abort_handler() should check the flag before doing anything. Is there any sort of sychronization to prevent the same incarnation of a command from being aborted twice (or by two different threads at the same time)? If there is, it isn't obvious. See above. scsi_times_out() will only ever called once. What can happen, though, is that _theoretically_ the LLDD might decide to call -done() on a timed out command when scsi_eh_abort_handler() is still pending. That's okay. We can expect the LLDD to have sufficient locking to handle that sort of thing without confusion (usb-storage does, for example). (Also, what's going on at the start of scsi_abort_command()? Contrary to what one might expect, the first part of the function _cancels_ a scheduled abort. And it does so without clearing the SCSI_EH_ABORT_SCHEDULED flag.) The original idea was this: SCSI_EH_ABORT_SCHEDULED is sticky _per command_. Point is, any command abort is only ever send for a timed-out command. And the main problem for a timed-out command is that we simply _do not_ know what happened for that command. So _if_ a command timed out, _and_ we've send an abort, _and_ the command times out _again_ we'll be running into an endless loop between timeout and aborting, and never returning the command at all. So to prevent this we should set a marker on that command telling it to _not_ try to abort the command again. I disagree. We _have_ to abort the command again -- how else can we stop a running command? To prevent the loop you described, we should avoid _retrying_ the command after it is aborted the second time. The actual question is whether it's worth aborting the same command a second time. In principle any reset (like LUN reset etc) should clear the command, too. And the EH abort functionality is geared around this. If, for some reason, the transport layer / device driver requires a command abort to be send then sure, we need to accommodate for that. Which is what 'SCSI_EH_ABORT_SCHEDULED' was meant for: - A command times out, sets 'SCSI_EH_ABORT_SCHEDULED' - abort _succeeds_ - The same command times out a second time, notifies that SCSI_EH_ABORT_SCHEDULED is set, and doesn't call scsi_eh_abort_command() but rather escalates directly. The proper time to escalate is after the command is aborted again, not while the command is still running. The only situation where you might want to escalate while a command is still running would be if you were unable to abort the command. (Hmmm, maybe that's not true for SCSI devices in general. It is true for USB mass-storage, however. Perhaps the reset handlers in usb-storage should be changed so that they will automatically abort a running command before starting the reset.) As said, yes, in principle you are right. We should be aborting the command a second time, _and then_ starting the escalation. So if the above reasoning is okay then this patch should
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
[lets split the thread] On Mon, 2014-03-31 at 16:37 +0200, Hannes Reinecke wrote: On 03/31/2014 03:33 PM, Alan Stern wrote: On Mon, 31 Mar 2014, Hannes Reinecke wrote: On 03/28/2014 08:29 PM, Alan Stern wrote: On Fri, 28 Mar 2014, James Bottomley wrote: Maybe scmd_eh_abort_handler() should check the flag before doing anything. Is there any sort of sychronization to prevent the same incarnation of a command from being aborted twice (or by two different threads at the same time)? If there is, it isn't obvious. See above. scsi_times_out() will only ever called once. What can happen, though, is that _theoretically_ the LLDD might decide to call -done() on a timed out command when scsi_eh_abort_handler() is still pending. That's okay. We can expect the LLDD to have sufficient locking to handle that sort of thing without confusion (usb-storage does, for example). (Also, what's going on at the start of scsi_abort_command()? Contrary to what one might expect, the first part of the function _cancels_ a scheduled abort. And it does so without clearing the SCSI_EH_ABORT_SCHEDULED flag.) The original idea was this: SCSI_EH_ABORT_SCHEDULED is sticky _per command_. Point is, any command abort is only ever send for a timed-out command. And the main problem for a timed-out command is that we simply _do not_ know what happened for that command. So _if_ a command timed out, _and_ we've send an abort, _and_ the command times out _again_ we'll be running into an endless loop between timeout and aborting, and never returning the command at all. So to prevent this we should set a marker on that command telling it to _not_ try to abort the command again. I disagree. We _have_ to abort the command again -- how else can we stop a running command? To prevent the loop you described, we should avoid _retrying_ the command after it is aborted the second time. The actual question is whether it's worth aborting the same command a second time. In principle any reset (like LUN reset etc) should clear the command, too. And the EH abort functionality is geared around this. If, for some reason, the transport layer / device driver requires a command abort to be send then sure, we need to accommodate for that. We already discussed this (that was my first response too). USB needs all outstanding commands aborted before proceeding to reset. I'm starting to think the actual way to fix this is to reset the abort scheduled only if we send something else, so this might be a better fix. It doesn't matter if we finish a command with abort scheduled because the command then gets freed and the flags cleaned. We can take our time with this because the other two patches, which I can send separately fix the current deadlock (we no longer send an unaborted request sense before the reset), and it can go via rc fixes. James --- diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 771c16b..3cfd2bf 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -1007,6 +1007,12 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd, const unsigned long stall_for = msecs_to_jiffies(100); int rtn; + /* +* We're sending another command we'll need to abort, so reset the +* abort scheduled flag on the real command before we save its state +*/ + scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED; + retry: scsi_eh_prep_cmnd(scmd, ses, cmnd, cmnd_size, sense_bytes); shost-eh_action = done; -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On Mon, 31 Mar 2014, Hannes Reinecke wrote: Ah. Correct. But that's due to the first patch being incorrect. Cf my response to the original first patch. See my response to your response. :-) Okay, So I probably should refrain from issueing a response to your response to my response lest infinite recursion happens :-) Indeed. What should happen if some other pathway manages to call scsi_try_to_abort_cmd() while scmd-abort_work is still sitting on the work queue? The command would be aborted and the flag would be cleared, but the queued work would remain. Can this ever happen? Not that I could see. A command abort is only ever triggered by the request timeout from the block layer. And the timer is _not_ rearmed once the timeout function (here: scsi_times_out()) is called. Hence I fail to see how it can be called concurrently. scsi_try_to_abort_cmd() is also called (via a different pathway) when a command sent by the error handler itself times out. I haven't traced through all the different paths to make sure none of them can run concurrently. But I'm willing to take your word for it. Yes, but that's not calling scsi_abort_command(), but rather invokes scsi_abort_eh_cmnd(). Sure. But either way, we end up in scsi_try_to_abort_cmd(), which calls the LLDD's abort handler. Thus leading to the possibility of aborting the same command more than once. The original idea was this: SCSI_EH_ABORT_SCHEDULED is sticky _per command_. Point is, any command abort is only ever send for a timed-out command. And the main problem for a timed-out command is that we simply _do not_ know what happened for that command. So _if_ a command timed out, _and_ we've send an abort, _and_ the command times out _again_ we'll be running into an endless loop between timeout and aborting, and never returning the command at all. So to prevent this we should set a marker on that command telling it to _not_ try to abort the command again. I disagree. We _have_ to abort the command again -- how else can we stop a running command? To prevent the loop you described, we should avoid _retrying_ the command after it is aborted the second time. The actual question is whether it's worth aborting the same command a second time. In principle any reset (like LUN reset etc) should clear the command, too. And the EH abort functionality is geared around this. If, for some reason, the transport layer / device driver requires a command abort to be send then sure, we need to accommodate for that. As James mentioned, with USB a reset does not abort a running command. Instead it waits for the command to finish. (However, this could be changed in usb-storage, if required.) As said, yes, in principle you are right. We should be aborting the command a second time, _and then_ starting the escalation. So if the above reasoning is okay then this patch should be doing the trick: diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 771c16b..0e72374 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd) /* * Retry after abort failed, escalate to next level. */ + scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED; SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_INFO, scmd, scmd %p previous abort failed\n, scmd)); (Beware of line breaks) Can you test with it? I don't understand. This doesn't solve the fundamental problem (namely that you escalate before aborting a running command). All it does is clear the SCSI_EH_ABORT_SCHEDULED flag before escalating. What's needed is something like a 2-bit abort counter. scsi_try_to_abort_cmd() should increment the counter each time it runs, and if scmd_eh_abort_handler() sees that the counter is too high, it should avoid its retry pathway. _Then_ you can move on to something more drastic. Alan Stern -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On Mon, 31 Mar 2014, James Bottomley wrote: The actual question is whether it's worth aborting the same command a second time. In principle any reset (like LUN reset etc) should clear the command, too. And the EH abort functionality is geared around this. If, for some reason, the transport layer / device driver requires a command abort to be send then sure, we need to accommodate for that. We already discussed this (that was my first response too). USB needs all outstanding commands aborted before proceeding to reset. I'm starting to think the actual way to fix this is to reset the abort scheduled only if we send something else, so this might be a better fix. It doesn't matter if we finish a command with abort scheduled because the command then gets freed and the flags cleaned. We can take our time with this because the other two patches, which I can send separately fix the current deadlock (we no longer send an unaborted request sense before the reset), and it can go via rc fixes. James --- diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 771c16b..3cfd2bf 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -1007,6 +1007,12 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd, const unsigned long stall_for = msecs_to_jiffies(100); int rtn; + /* + * We're sending another command we'll need to abort, so reset the + * abort scheduled flag on the real command before we save its state + */ + scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED; + retry: scsi_eh_prep_cmnd(scmd, ses, cmnd, cmnd_size, sense_bytes); shost-eh_action = done; I don't know if Hannes will like this, but I don't think it is right. This is not the retry pathway that's causing problems; that pathway begins where scmd_eh_abort_handler() calls scsi_queue_insert(). Now, that call is already guarded by this test: if (... (++scmd-retries = scmd-allowed)) { which would seem to prevent the infinite loop that Hannes was concerned about. So maybe the 1/3 patch posted a couple of days ago is the best solution after all. Alan Stern -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] Fix USB deadlock caused by SCSI error handling
This is a set of three patches we agreed to a while ago to eliminate a USB deadlock. I did rewrite the first patch, if it could be reviewed and tested. Thanks, James --- Alan Stern (1): [SCSI] Fix command result state propagation James Bottomley (2): [SCSI] Fix abort state memory problem [SCSI] Fix spurious request sense in error handling drivers/scsi/scsi_error.c | 19 --- drivers/scsi/scsi_lib.c | 1 + 2 files changed, 17 insertions(+), 3 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling
On Fri, 28 Mar 2014, James Bottomley wrote: This is a set of three patches we agreed to a while ago to eliminate a USB deadlock. I did rewrite the first patch, if it could be reviewed and tested. I tested all three patches under the same conditions as before, and they all worked correctly. In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't entirely clear. This boils down to two questions, which I don't know the answers to: What should happen in scmd_eh_abort_handler() if scsi_host_eh_past_deadline() returns true and thereby prevents scsi_try_to_abort_cmd() from being called? The flag wouldn't get cleared then. What should happen if some other pathway manages to call scsi_try_to_abort_cmd() while scmd-abort_work is still sitting on the work queue? The command would be aborted and the flag would be cleared, but the queued work would remain. Can this ever happen? Maybe scmd_eh_abort_handler() should check the flag before doing anything. Is there any sort of sychronization to prevent the same incarnation of a command from being aborted twice (or by two different threads at the same time)? If there is, it isn't obvious. (Also, what's going on at the start of scsi_abort_command()? Contrary to what one might expect, the first part of the function _cancels_ a scheduled abort. And it does so without clearing the SCSI_EH_ABORT_SCHEDULED flag.) Maybe in all these scenarios, the command is doomed to fail anyway so these questions don't make any real difference. I don't know... Alan Stern -- To unsubscribe from this list: send the line unsubscribe linux-usb in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html