Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-11 Thread Andreas Reis
I did set the scsi_logging_level as you wrote, but it still resulted 
into exactly the same dmesg (resp. sudo journalctl -b 0) output with the 
entire patchset (including the scmd-eh_eflags = 0;). Except for the 
reverted do nothings, for course.


However, I did miss to mention the following which is printed once with 
or without the scsi_logging_level change:


[16254.735372] sd 9:0:0:0: Device offlined - not ready after error recovery
[16254.735393] sd 9:0:0:0: rejecting I/O to offline device
[16254.735397] sd 9:0:0:0: [sdf] killing request
[16254.735488] sd 9:0:0:0: [sdf] Unhandled error code
[16254.735489] sd 9:0:0:0: [sdf]
[16254.735490] Result: hostbyte=0x01 driverbyte=0x00
[16254.735492] sd 9:0:0:0: [sdf] CDB:
[16254.735493] cdb[0]=0x2a: 2a 00 03 1c 10 f8 00 00 f0 00
[16254.735498] end_request: I/O error, dev sdf, sector 52171000

Another:

[19389.907015] sd 11:0:0:0: Device offlined - not ready after error recovery
[19389.907037] sd 11:0:0:0: [sdf] Unhandled error code
[19389.907044] sd 11:0:0:0: [sdf]
[19389.907049] Result: hostbyte=0x05 driverbyte=0x00
[19389.907055] sd 11:0:0:0: [sdf] CDB:
[19389.907060] cdb[0]=0x28: 28 00 03 13 0d 28 00 00 08 00
[19389.907083] end_request: I/O error, dev sdf, sector 51580200
[19389.907121] sd 11:0:0:0: rejecting I/O to offline device
[19389.907126] sd 11:0:0:0: killing request

In case it matters, the stick's partition stays readable with regard to 
any data on it that has been cached.


On 10.04.2014 13:37, Hannes Reinecke wrote:

Andreas, can you test with the entire patch series and enable
'scsi_logging_level -s -E 5' prior to running the tests?

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread Andreas Reis
That patch appears to work in preventing the crashes, judged on one 
repeated appearance of the bug.


dmesg had the usual
[  215.229903] usb 4-2: usb_disable_lpm called, do nothing
[  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using xhci_hcd
[  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called 
with disabled ep 880427b829c0
[  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called 
with disabled ep 880427b82a08

[  215.350621] usb 4-2: usb_enable_lpm called, do nothing

repeated five times, followed by one
[  282.795801] sd 8:0:0:0: Device offlined - not ready after error recovery

and then as often as something tried to read from it:
[  295.585472] sd 8:0:0:0: rejecting I/O to offline device

The stick could then be properly un- and remounted (the latter if it had 
been physically replugged) without issue — for the bug to reoccur after 
one to three minutes. I tried this three times, no dmesg difference 
except the ep addresses varied on two of that.


Andreas Reis

On 09.04.2014 20:02, Alan Stern wrote:

On Wed, 9 Apr 2014, Hannes Reinecke wrote:


I finally got a chance to try it out.  It does seem to do what we want.
I didn't track the flow of control in complete detail, but the command
definitely got aborted both times it was issued.


Good, so it is as I thought. James, can we include this patch instead of
your prior solution?


First, we should have the original bug reporter try it out.

Andreas, the patch in question can be found here:

http://marc.info/?l=linux-usbm=13962706597w=2

Can you try this in place of the 1/3 patch posted by James?  It should
have the same effect, of preventing your system from crashing when the
READ command fails.

Alan Stern



--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread Andreas Reis
Only your 0/3 patch to which Alan linked, along with two other patches 
by Mathias Nyman (disable usb3 on intel hosts and disable all lpm 
related control transfers, one of which is the source of the do 
nothings).


I'll revert the latter two and apply the rest of the set. Which I'm 
guessing currently consists of said 0/3 patch —

http://www.spinics.net/lists/linux-scsi/msg73502.html
— plus 2/3 and 3/3?

Or should I just omit 0/3 and try whichever of the two in 1/3 works 
best? Rather confusing ATM.


Anyway, for whatever reason the bug is happening rather frequently now. 
I've spotted the following occurring after the Device offlined line 
two times now:


[  206.901385] sd 11:0:0:0: [sdg] Unhandled error code
[  206.901394] sd 11:0:0:0: [sdg]
[  206.901397] Result: hostbyte=0x01 driverbyte=0x00
[  206.901400] sd 11:0:0:0: [sdg] CDB:
[  206.901403] cdb[0]=0x2a: 2a 00 02 25 1b 50 00 00 08 00
[  206.901419] end_request: I/O error, dev sdg, sector 35986256

The second time had sd 12:0:0:0, cdb[0]=0x28: 28 00 03 94 77 20 00 00 
08 00 and a different sector.


Andreas Reis

On 10.04.2014 13:37, Hannes Reinecke wrote:

On 04/10/2014 12:58 PM, Andreas Reis wrote:

That patch appears to work in preventing the crashes, judged on one
repeated appearance of the bug.

dmesg had the usual
[  215.229903] usb 4-2: usb_disable_lpm called, do nothing
[  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
xhci_hcd
[  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
with disabled ep 880427b829c0
[  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
with disabled ep 880427b82a08
[  215.350621] usb 4-2: usb_enable_lpm called, do nothing

repeated five times, followed by one
[  282.795801] sd 8:0:0:0: Device offlined - not ready after error
recovery

and then as often as something tried to read from it:
[  295.585472] sd 8:0:0:0: rejecting I/O to offline device

The stick could then be properly un- and remounted (the latter if it
had been physically replugged) without issue — for the bug to
reoccur after one to three minutes. I tried this three times, no
dmesg difference except the ep addresses varied on two of that.


Was this just that patch you've tested with or the entire patch series?

If the latter, Alan, is this the expected outcome?
I would've thought the error recover should _not_ run into
offlining devices here, but rather the device should be recovered
eventually.

Andreas, can you test with the entire patch series and enable
'scsi_logging_level -s -E 5' prior to running the tests?

THX.

Cheers,

Hannes



--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread Alan Stern
On Thu, 10 Apr 2014, Hannes Reinecke wrote:

 On 04/10/2014 12:58 PM, Andreas Reis wrote:
  That patch appears to work in preventing the crashes, judged on one
  repeated appearance of the bug.
  
  dmesg had the usual
  [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
  [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
  xhci_hcd
  [  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
  with disabled ep 880427b829c0
  [  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
  with disabled ep 880427b82a08
  [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
  
  repeated five times, followed by one
  [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
  recovery
  
  and then as often as something tried to read from it:
  [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
  
  The stick could then be properly un- and remounted (the latter if it
  had been physically replugged) without issue � for the bug to
  reoccur after one to three minutes. I tried this three times, no
  dmesg difference except the ep addresses varied on two of that.
  
 Was this just that patch you've tested with or the entire patch series?
 
 If the latter, Alan, is this the expected outcome?

Yes, it is.  The same thing should happen with the entire patch series.

 I would've thought the error recover should _not_ run into
 offlining devices here, but rather the device should be recovered
 eventually.

The command times out, it is aborted, and the command is retried.  The
same thing happens, and we repeat five times.  Eventually the SCSI core
gives up and declares the device to be offline.

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread Hannes Reinecke

On 04/10/2014 05:31 PM, Alan Stern wrote:

On Thu, 10 Apr 2014, Hannes Reinecke wrote:


On 04/10/2014 12:58 PM, Andreas Reis wrote:

That patch appears to work in preventing the crashes, judged on one
repeated appearance of the bug.

dmesg had the usual
[  215.229903] usb 4-2: usb_disable_lpm called, do nothing
[  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
xhci_hcd
[  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
with disabled ep 880427b829c0
[  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
with disabled ep 880427b82a08
[  215.350621] usb 4-2: usb_enable_lpm called, do nothing

repeated five times, followed by one
[  282.795801] sd 8:0:0:0: Device offlined - not ready after error
recovery

and then as often as something tried to read from it:
[  295.585472] sd 8:0:0:0: rejecting I/O to offline device

The stick could then be properly un- and remounted (the latter if it
had been physically replugged) without issue � for the bug to
reoccur after one to three minutes. I tried this three times, no
dmesg difference except the ep addresses varied on two of that.


Was this just that patch you've tested with or the entire patch series?

If the latter, Alan, is this the expected outcome?


Yes, it is.  The same thing should happen with the entire patch series.


I would've thought the error recover should _not_ run into
offlining devices here, but rather the device should be recovered
eventually.


The command times out, it is aborted, and the command is retried.  The
same thing happens, and we repeat five times.  Eventually the SCSI core
gives up and declares the device to be offline.


Hmm. Ok. If you are fine with it who am I to argue here.
James, shall I resent the patch series?

Cheers,

Hannes

--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread James Bottomley
On Thu, 2014-04-10 at 19:52 +0200, Hannes Reinecke wrote:
 On 04/10/2014 05:31 PM, Alan Stern wrote:
  On Thu, 10 Apr 2014, Hannes Reinecke wrote:
 
  On 04/10/2014 12:58 PM, Andreas Reis wrote:
  That patch appears to work in preventing the crashes, judged on one
  repeated appearance of the bug.
 
  dmesg had the usual
  [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
  [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
  xhci_hcd
  [  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
  with disabled ep 880427b829c0
  [  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
  with disabled ep 880427b82a08
  [  215.350621] usb 4-2: usb_enable_lpm called, do nothing
 
  repeated five times, followed by one
  [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
  recovery
 
  and then as often as something tried to read from it:
  [  295.585472] sd 8:0:0:0: rejecting I/O to offline device
 
  The stick could then be properly un- and remounted (the latter if it
  had been physically replugged) without issue � for the bug to
  reoccur after one to three minutes. I tried this three times, no
  dmesg difference except the ep addresses varied on two of that.
 
  Was this just that patch you've tested with or the entire patch series?
 
  If the latter, Alan, is this the expected outcome?
 
  Yes, it is.  The same thing should happen with the entire patch series.
 
  I would've thought the error recover should _not_ run into
  offlining devices here, but rather the device should be recovered
  eventually.
 
  The command times out, it is aborted, and the command is retried.  The
  same thing happens, and we repeat five times.  Eventually the SCSI core
  gives up and declares the device to be offline.
 
 Hmm. Ok. If you are fine with it who am I to argue here.
 James, shall I resent the patch series?

You mean the one patch?  No, it's OK, I have it.

It's still not complete, though, as I've said a couple of times.  The
problem is that we have abort memory on any eh command as well, which
this doesn't fix.

The scenario is abort command, set flag, abort completes, send TUR, TUR
doesn't return, so we now try to abort the TUR, but scsi_abort_eh_cmnd()
will skip the abort because the flag is set and move straight to reset.

The fix is this, I can just add it as well.

James

---

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 771c16b..7516e2c 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -920,6 +920,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct 
scsi_eh_save *ses,
ses-prot_op = scmd-prot_op;
 
scmd-prot_op = SCSI_PROT_NORMAL;
+   scmd-eh_eflags = 0;
scmd-cmnd = ses-eh_cmnd;
memset(scmd-cmnd, 0, BLK_MAX_CDB);
memset(scmd-sdb, 0, sizeof(scmd-sdb));


--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-10 Thread Hannes Reinecke
On 04/10/2014 10:36 PM, James Bottomley wrote:
 On Thu, 2014-04-10 at 19:52 +0200, Hannes Reinecke wrote:
 On 04/10/2014 05:31 PM, Alan Stern wrote:
 On Thu, 10 Apr 2014, Hannes Reinecke wrote:

 On 04/10/2014 12:58 PM, Andreas Reis wrote:
 That patch appears to work in preventing the crashes, judged on one
 repeated appearance of the bug.

 dmesg had the usual
 [  215.229903] usb 4-2: usb_disable_lpm called, do nothing
 [  215.336941] usb 4-2: reset SuperSpeed USB device number 3 using
 xhci_hcd
 [  215.350296] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
 with disabled ep 880427b829c0
 [  215.350305] xhci_hcd :00:14.0: xHCI xhci_drop_endpoint called
 with disabled ep 880427b82a08
 [  215.350621] usb 4-2: usb_enable_lpm called, do nothing

 repeated five times, followed by one
 [  282.795801] sd 8:0:0:0: Device offlined - not ready after error
 recovery

 and then as often as something tried to read from it:
 [  295.585472] sd 8:0:0:0: rejecting I/O to offline device

 The stick could then be properly un- and remounted (the latter if it
 had been physically replugged) without issue � for the bug to
 reoccur after one to three minutes. I tried this three times, no
 dmesg difference except the ep addresses varied on two of that.

 Was this just that patch you've tested with or the entire patch series?

 If the latter, Alan, is this the expected outcome?

 Yes, it is.  The same thing should happen with the entire patch series.

 I would've thought the error recover should _not_ run into
 offlining devices here, but rather the device should be recovered
 eventually.

 The command times out, it is aborted, and the command is retried.  The
 same thing happens, and we repeat five times.  Eventually the SCSI core
 gives up and declares the device to be offline.

 Hmm. Ok. If you are fine with it who am I to argue here.
 James, shall I resent the patch series?
 
 You mean the one patch?  No, it's OK, I have it.
 
 It's still not complete, though, as I've said a couple of times.  The
 problem is that we have abort memory on any eh command as well, which
 this doesn't fix.
 
 The scenario is abort command, set flag, abort completes, send TUR, TUR
 doesn't return, so we now try to abort the TUR, but scsi_abort_eh_cmnd()
 will skip the abort because the flag is set and move straight to reset.
 
 The fix is this, I can just add it as well.
 
 James
 
 ---
 
 diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
 index 771c16b..7516e2c 100644
 --- a/drivers/scsi/scsi_error.c
 +++ b/drivers/scsi/scsi_error.c
 @@ -920,6 +920,7 @@ void scsi_eh_prep_cmnd(struct scsi_cmnd *scmd, struct 
 scsi_eh_save *ses,
   ses-prot_op = scmd-prot_op;
  
   scmd-prot_op = SCSI_PROT_NORMAL;
 + scmd-eh_eflags = 0;
   scmd-cmnd = ses-eh_cmnd;
   memset(scmd-cmnd, 0, BLK_MAX_CDB);
   memset(scmd-sdb, 0, sizeof(scmd-sdb));
 
 
Oh yes, that is correct.

Acked-by: Hannes Reinecke h...@suse.de

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-09 Thread Hannes Reinecke

On 04/07/2014 05:26 PM, Alan Stern wrote:

On Wed, 2 Apr 2014, Hannes Reinecke wrote:


On 04/01/2014 11:28 PM, Alan Stern wrote:

On Tue, 1 Apr 2014, Hannes Reinecke wrote:


So if the above reasoning is okay then this patch should be doing
the trick:

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 771c16b..0e72374 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
 /*
  * Retry after abort failed, escalate to next level.
  */
+   scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED;
 SCSI_LOG_ERROR_RECOVERY(3,
 scmd_printk(KERN_INFO, scmd,
 scmd %p previous abort
failed\n, scmd));

(Beware of line
breaks)

Can you test with it?


I don't understand.  This doesn't solve the fundamental problem (namely
that you escalate before aborting a running command).  All it does is
clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.


Which was precisely the point :-)

Hmm. The comment might've been clearer.

What this patch is _supposed_ to be doing is that it'll clear the
SCSI_EH_ABORT_SCHEDULED flag it it has been set.
Which means this will be the second time scsi_abort_command() has
been called for the same command.
IE the first abort went out, did its thing, but now the same command
has timed out again.

So this flag gets cleared, and scsi_abort_command() returns FAILED,
and _no_ asynchronous abort is being scheduled.
scsi_times_out() will then proceed to call scsi_eh_scmd_add().
But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag
the SCSI_EH_CANCEL_CMD flag will continue to be set,
and the command will be aborted with the main SCSI EH routine.

It looks to me as if it should do what you desire, namely abort the
command asynchronously the first time, and invoking the SCSI EH the
second time.

Am I wrong?


I don't know -- I'll have to try it out.  Currently I'm busy with a
bunch of other stuff, so it will take some time.


I finally got a chance to try it out.  It does seem to do what we want.
I didn't track the flow of control in complete detail, but the command
definitely got aborted both times it was issued.

Good, so it is as I thought. James, can we include this patch instead of 
your prior solution?


Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-09 Thread Alan Stern
On Wed, 9 Apr 2014, Hannes Reinecke wrote:

  I finally got a chance to try it out.  It does seem to do what we want.
  I didn't track the flow of control in complete detail, but the command
  definitely got aborted both times it was issued.
 
 Good, so it is as I thought. James, can we include this patch instead of 
 your prior solution?

First, we should have the original bug reporter try it out.

Andreas, the patch in question can be found here:

http://marc.info/?l=linux-usbm=13962706597w=2

Can you try this in place of the 1/3 patch posted by James?  It should 
have the same effect, of preventing your system from crashing when the 
READ command fails.

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-07 Thread Alan Stern
On Wed, 2 Apr 2014, Hannes Reinecke wrote:

 On 04/01/2014 11:28 PM, Alan Stern wrote:
  On Tue, 1 Apr 2014, Hannes Reinecke wrote:
  
  So if the above reasoning is okay then this patch should be doing
  the trick:
 
  diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
  index 771c16b..0e72374 100644
  --- a/drivers/scsi/scsi_error.c
  +++ b/drivers/scsi/scsi_error.c
  @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
  /*
   * Retry after abort failed, escalate to next level.
   */
  +   scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED;
  SCSI_LOG_ERROR_RECOVERY(3,
  scmd_printk(KERN_INFO, scmd,
  scmd %p previous abort
  failed\n, scmd));
 
  (Beware of line
  breaks)
 
  Can you test with it?
 
  I don't understand.  This doesn't solve the fundamental problem (namely 
  that you escalate before aborting a running command).  All it does is 
  clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.
 
  Which was precisely the point :-)
 
  Hmm. The comment might've been clearer.
 
  What this patch is _supposed_ to be doing is that it'll clear the
  SCSI_EH_ABORT_SCHEDULED flag it it has been set.
  Which means this will be the second time scsi_abort_command() has
  been called for the same command.
  IE the first abort went out, did its thing, but now the same command
  has timed out again.
 
  So this flag gets cleared, and scsi_abort_command() returns FAILED,
  and _no_ asynchronous abort is being scheduled.
  scsi_times_out() will then proceed to call scsi_eh_scmd_add().
  But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag
  the SCSI_EH_CANCEL_CMD flag will continue to be set,
  and the command will be aborted with the main SCSI EH routine.
 
  It looks to me as if it should do what you desire, namely abort the
  command asynchronously the first time, and invoking the SCSI EH the
  second time.
 
  Am I wrong?
  
  I don't know -- I'll have to try it out.  Currently I'm busy with a 
  bunch of other stuff, so it will take some time.

I finally got a chance to try it out.  It does seem to do what we want.  
I didn't track the flow of control in complete detail, but the command 
definitely got aborted both times it was issued.

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-02 Thread Hannes Reinecke
On 04/01/2014 11:28 PM, Alan Stern wrote:
 On Tue, 1 Apr 2014, Hannes Reinecke wrote:
 
 So if the above reasoning is okay then this patch should be doing
 the trick:

 diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
 index 771c16b..0e72374 100644
 --- a/drivers/scsi/scsi_error.c
 +++ b/drivers/scsi/scsi_error.c
 @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
 /*
  * Retry after abort failed, escalate to next level.
  */
 +   scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED;
 SCSI_LOG_ERROR_RECOVERY(3,
 scmd_printk(KERN_INFO, scmd,
 scmd %p previous abort
 failed\n, scmd));

 (Beware of line
 breaks)

 Can you test with it?

 I don't understand.  This doesn't solve the fundamental problem (namely 
 that you escalate before aborting a running command).  All it does is 
 clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.

 Which was precisely the point :-)

 Hmm. The comment might've been clearer.

 What this patch is _supposed_ to be doing is that it'll clear the
 SCSI_EH_ABORT_SCHEDULED flag it it has been set.
 Which means this will be the second time scsi_abort_command() has
 been called for the same command.
 IE the first abort went out, did its thing, but now the same command
 has timed out again.

 So this flag gets cleared, and scsi_abort_command() returns FAILED,
 and _no_ asynchronous abort is being scheduled.
 scsi_times_out() will then proceed to call scsi_eh_scmd_add().
 But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag
 the SCSI_EH_CANCEL_CMD flag will continue to be set,
 and the command will be aborted with the main SCSI EH routine.

 It looks to me as if it should do what you desire, namely abort the
 command asynchronously the first time, and invoking the SCSI EH the
 second time.

 Am I wrong?
 
 I don't know -- I'll have to try it out.  Currently I'm busy with a 
 bunch of other stuff, so it will take some time.
 
 Looking through the code, I have to wonder why scsi_times_out()  
 modifies scmd-result.  Won't this value get overwritten by the LLDD
 when the command eventually terminates?
 
Yes. However, the 'DID_TIME_OUT' is just a marker that the internal
timeout code triggered.
If the LLDD overwrites it with a 'real' error code everything's fine.

 Even worse, what happens in the event of a race where the command 
 terminates normally just before scsi_times_out() changes scmd-result?
 
_That_ is the least of our worries.
_If_ the LLDD completes the command while scsi_times_out() is
running the whole thing is going down the drain anyway, as the
command will be terminated by the LLDD, and we can only hope that we
didn't mess up our reference counting. Otherwise we'd have EH
running on a stale command, which is going to be a fun to watch.

But looking closer, it might be that the line modifying the result
in scsi_times_out() is indeed pointless, seeing that it's being set
in scsi_eh_abort_handler(), too.
I'll be checking if we can simply remove that line.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-01 Thread Hannes Reinecke
On 03/31/2014 05:03 PM, James Bottomley wrote:
 [lets split the thread]
 On Mon, 2014-03-31 at 16:37 +0200, Hannes Reinecke wrote:
 On 03/31/2014 03:33 PM, Alan Stern wrote:
 On Mon, 31 Mar 2014, Hannes Reinecke wrote:
 On 03/28/2014 08:29 PM, Alan Stern wrote:
 On Fri, 28 Mar 2014, James Bottomley wrote:
 Maybe scmd_eh_abort_handler() should check the flag before doing
 anything.  Is there any sort of sychronization to prevent the same
 incarnation of a command from being aborted twice (or by two different
 threads at the same time)?  If there is, it isn't obvious.

 See above. scsi_times_out() will only ever called once.
 What can happen, though, is that _theoretically_ the LLDD might
 decide to call -done() on a timed out command when
 scsi_eh_abort_handler() is still pending.

 That's okay.  We can expect the LLDD to have sufficient locking to
 handle that sort of thing without confusion (usb-storage does, for
 example).

 (Also, what's going on at the start of scsi_abort_command()?  Contrary
 to what one might expect, the first part of the function _cancels_ a
 scheduled abort.  And it does so without clearing the
 SCSI_EH_ABORT_SCHEDULED flag.)

 The original idea was this:

 SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
 Point is, any command abort is only ever send for a timed-out
 command. And the main problem for a timed-out command is that we
 simply _do not_ know what happened for that command. So _if_ a
 command timed out, _and_ we've send an abort, _and_ the command
 times out _again_ we'll be running into an endless loop between
 timeout and aborting, and never returning the command at all.

 So to prevent this we should set a marker on that command telling it
 to _not_ try to abort the command again.

 I disagree.  We _have_ to abort the command again -- how else can we
 stop a running command?  To prevent the loop you described, we should
 avoid _retrying_ the command after it is aborted the second time.

 The actual question is whether it's worth aborting the same command
 a second time.
 In principle any reset (like LUN reset etc) should clear the
 command, too.
 And the EH abort functionality is geared around this.
 If, for some reason, the transport layer / device driver
 requires a command abort to be send then sure, we need
 to accommodate for that.
 
 We already discussed this (that was my first response too).  USB needs
 all outstanding commands aborted before proceeding to reset.  I'm
 starting to think the actual way to fix this is to reset the abort
 scheduled only if we send something else, so this might be a better fix.
 
 It doesn't matter if we finish a command with abort scheduled because
 the command then gets freed and the flags cleaned.
 
 We can take our time with this because the other two patches, which I
 can send separately fix the current deadlock (we no longer send an
 unaborted request sense before the reset), and it can go via rc fixes.
 
Yes, agreed.

The USB case seems to be a bit more tricky, and at least I need some
more time to fully understand the details and implications of this.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-01 Thread Hannes Reinecke
On 04/01/2014 12:29 AM, Alan Stern wrote:
 On Mon, 31 Mar 2014, Hannes Reinecke wrote:
 
 Ah. Correct. But that's due to the first patch being incorrect.
 Cf my response to the original first patch.

 See my response to your response.  :-)

 Okay, So I probably should refrain from issueing a response to
 your response to my response lest infinite recursion happens :-)
 
 Indeed.
 
   What should happen if some other pathway manages to call
   scsi_try_to_abort_cmd() while scmd-abort_work is still
   sitting on the work queue?  The command would be aborted
   and the flag would be cleared, but the queued work would
   remain.  Can this ever happen?

 Not that I could see.
 A command abort is only ever triggered by the request timeout from
 the block layer. And the timer is _not_ rearmed once the timeout
 function (here: scsi_times_out()) is called.
 Hence I fail to see how it can be called concurrently.

 scsi_try_to_abort_cmd() is also called (via a different pathway) when a 
 command sent by the error handler itself times out.  I haven't traced 
 through all the different paths to make sure none of them can run 
 concurrently.  But I'm willing to take your word for it.

 Yes, but that's not calling scsi_abort_command(), but rather invokes
 scsi_abort_eh_cmnd().
 
 Sure.  But either way, we end up in scsi_try_to_abort_cmd(), which
 calls the LLDD's abort handler.  Thus leading to the possibility of
 aborting the same command more than once.
 
 The original idea was this:

 SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
 Point is, any command abort is only ever send for a timed-out
 command. And the main problem for a timed-out command is that we
 simply _do not_ know what happened for that command. So _if_ a
 command timed out, _and_ we've send an abort, _and_ the command
 times out _again_ we'll be running into an endless loop between
 timeout and aborting, and never returning the command at all.

 So to prevent this we should set a marker on that command telling it
 to _not_ try to abort the command again.

 I disagree.  We _have_ to abort the command again -- how else can we
 stop a running command?  To prevent the loop you described, we should
 avoid _retrying_ the command after it is aborted the second time.

 The actual question is whether it's worth aborting the same command
 a second time.
 In principle any reset (like LUN reset etc) should clear the
 command, too.
 And the EH abort functionality is geared around this.
 If, for some reason, the transport layer / device driver
 requires a command abort to be send then sure, we need
 to accommodate for that.
 
 As James mentioned, with USB a reset does not abort a running command.  
 Instead it waits for the command to finish.  (However, this could be
 changed in usb-storage, if required.)
 
 As said, yes, in principle you are right. We should be aborting the
 command a second time, _and then_ starting the escalation.

 So if the above reasoning is okay then this patch should be doing
 the trick:

 diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
 index 771c16b..0e72374 100644
 --- a/drivers/scsi/scsi_error.c
 +++ b/drivers/scsi/scsi_error.c
 @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
 /*
  * Retry after abort failed, escalate to next level.
  */
 +   scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED;
 SCSI_LOG_ERROR_RECOVERY(3,
 scmd_printk(KERN_INFO, scmd,
 scmd %p previous abort
 failed\n, scmd));

 (Beware of line
 breaks)

 Can you test with it?
 
 I don't understand.  This doesn't solve the fundamental problem (namely 
 that you escalate before aborting a running command).  All it does is 
 clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.
 
Which was precisely the point :-)

Hmm. The comment might've been clearer.

What this patch is _supposed_ to be doing is that it'll clear the
SCSI_EH_ABORT_SCHEDULED flag it it has been set.
Which means this will be the second time scsi_abort_command() has
been called for the same command.
IE the first abort went out, did its thing, but now the same command
has timed out again.

So this flag gets cleared, and scsi_abort_command() returns FAILED,
and _no_ asynchronous abort is being scheduled.
scsi_times_out() will then proceed to call scsi_eh_scmd_add().
But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag
the SCSI_EH_CANCEL_CMD flag will continue to be set,
and the command will be aborted with the main SCSI EH routine.

It looks to me as if it should do what you desire, namely abort the
command asynchronously the first time, and invoking the SCSI EH the
second time.

Am I wrong?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To 

Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-04-01 Thread Alan Stern
On Tue, 1 Apr 2014, Hannes Reinecke wrote:

  So if the above reasoning is okay then this patch should be doing
  the trick:
 
  diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
  index 771c16b..0e72374 100644
  --- a/drivers/scsi/scsi_error.c
  +++ b/drivers/scsi/scsi_error.c
  @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
  /*
   * Retry after abort failed, escalate to next level.
   */
  +   scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED;
  SCSI_LOG_ERROR_RECOVERY(3,
  scmd_printk(KERN_INFO, scmd,
  scmd %p previous abort
  failed\n, scmd));
 
  (Beware of line
  breaks)
 
  Can you test with it?
  
  I don't understand.  This doesn't solve the fundamental problem (namely 
  that you escalate before aborting a running command).  All it does is 
  clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.
  
 Which was precisely the point :-)
 
 Hmm. The comment might've been clearer.
 
 What this patch is _supposed_ to be doing is that it'll clear the
 SCSI_EH_ABORT_SCHEDULED flag it it has been set.
 Which means this will be the second time scsi_abort_command() has
 been called for the same command.
 IE the first abort went out, did its thing, but now the same command
 has timed out again.
 
 So this flag gets cleared, and scsi_abort_command() returns FAILED,
 and _no_ asynchronous abort is being scheduled.
 scsi_times_out() will then proceed to call scsi_eh_scmd_add().
 But as we've cleared the SCSI_EH_ABORT_SCHEDULED flag
 the SCSI_EH_CANCEL_CMD flag will continue to be set,
 and the command will be aborted with the main SCSI EH routine.
 
 It looks to me as if it should do what you desire, namely abort the
 command asynchronously the first time, and invoking the SCSI EH the
 second time.
 
 Am I wrong?

I don't know -- I'll have to try it out.  Currently I'm busy with a 
bunch of other stuff, so it will take some time.

Looking through the code, I have to wonder why scsi_times_out()  
modifies scmd-result.  Won't this value get overwritten by the LLDD
when the command eventually terminates?

Even worse, what happens in the event of a race where the command 
terminates normally just before scsi_times_out() changes scmd-result?

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-03-31 Thread Hannes Reinecke
On 03/28/2014 08:29 PM, Alan Stern wrote:
 On Fri, 28 Mar 2014, James Bottomley wrote:
 
 This is a set of three patches we agreed to a while ago to eliminate a
 USB deadlock.  I did rewrite the first patch, if it could be reviewed
 and tested.
 
 I tested all three patches under the same conditions as before, and 
 they all worked correctly.
 
 In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't
 entirely clear.  This boils down to two questions, which I don't 
 know the answers to:
 
   What should happen in scmd_eh_abort_handler() if
   scsi_host_eh_past_deadline() returns true and thereby
   prevents scsi_try_to_abort_cmd() from being called?
   The flag wouldn't get cleared then.
 
Ah. Correct. But that's due to the first patch being incorrect.
Cf my response to the original first patch.

   What should happen if some other pathway manages to call
   scsi_try_to_abort_cmd() while scmd-abort_work is still
   sitting on the work queue?  The command would be aborted
   and the flag would be cleared, but the queued work would
   remain.  Can this ever happen?
 
Not that I could see.
A command abort is only ever triggered by the request timeout from
the block layer. And the timer is _not_ rearmed once the timeout
function (here: scsi_times_out()) is called.
Hence I fail to see how it can be called concurrently.

 Maybe scmd_eh_abort_handler() should check the flag before doing
 anything.  Is there any sort of sychronization to prevent the same
 incarnation of a command from being aborted twice (or by two different
 threads at the same time)?  If there is, it isn't obvious.
 
See above. scsi_times_out() will only ever called once.
What can happen, though, is that _theoretically_ the LLDD might
decide to call -done() on a timed out command when
scsi_eh_abort_handler() is still pending.


 (Also, what's going on at the start of scsi_abort_command()?  Contrary
 to what one might expect, the first part of the function _cancels_ a
 scheduled abort.  And it does so without clearing the
 SCSI_EH_ABORT_SCHEDULED flag.)
 
The original idea was this:

SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
Point is, any command abort is only ever send for a timed-out
command. And the main problem for a timed-out command is that we
simply _do not_ know what happened for that command. So _if_ a
command timed out, _and_ we've send an abort, _and_ the command
times out _again_ we'll be running into an endless loop between
timeout and aborting, and never returning the command at all.

So to prevent this we should set a marker on that command telling it
to _not_ try to abort the command again.
Which is what 'SCSI_EH_ABORT_SCHEDULED' was meant for:

- A command times out, sets 'SCSI_EH_ABORT_SCHEDULED'
- abort _succeeds_
- The same command times out a second time, notifies
  that SCSI_EH_ABORT_SCHEDULED is set, and doesn't call
  scsi_eh_abort_command() but rather escalates directly.

_However_, I do feel that we've gotten the wrong track with all of
this. In you original mail you stated:

 Basically, usb-storage deadlocks when the SCSI error handler
 invokes the eh_device_reset_handler callback while a command
 is running.  The command has timed out and will never complete
 normally, because the device's firmware has crashed.  But
 usb-storage's device-reset routine waits for the current command
 to finish, which brings everything to a standstill.

Question now to you as the USB god:

A command abort is only _ever_ send after a command timeout.
And the main idea of the command abort is to remove any context
the LLDD or the target might have. So by the time the command
abort returns SUCCESS we _expect_ the firmware to have cleared that
command. If for some reason the firmware isn't capable of doing so,
it should return FAILED.

So:
- Has the command timeout fired?
- If so, why hasn't it returned FAILED?
- If it had returned SUCCESS, why did the device_reset_handler
  wait for the command to complete?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   zSeries  Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-03-31 Thread Alan Stern
On Mon, 31 Mar 2014, Hannes Reinecke wrote:

 On 03/28/2014 08:29 PM, Alan Stern wrote:
  On Fri, 28 Mar 2014, James Bottomley wrote:
  
  This is a set of three patches we agreed to a while ago to eliminate a
  USB deadlock.  I did rewrite the first patch, if it could be reviewed
  and tested.
  
  I tested all three patches under the same conditions as before, and 
  they all worked correctly.
  
  In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't
  entirely clear.  This boils down to two questions, which I don't 
  know the answers to:
  
  What should happen in scmd_eh_abort_handler() if
  scsi_host_eh_past_deadline() returns true and thereby
  prevents scsi_try_to_abort_cmd() from being called?
  The flag wouldn't get cleared then.
  
 Ah. Correct. But that's due to the first patch being incorrect.
 Cf my response to the original first patch.

See my response to your response.  :-)

  What should happen if some other pathway manages to call
  scsi_try_to_abort_cmd() while scmd-abort_work is still
  sitting on the work queue?  The command would be aborted
  and the flag would be cleared, but the queued work would
  remain.  Can this ever happen?
  
 Not that I could see.
 A command abort is only ever triggered by the request timeout from
 the block layer. And the timer is _not_ rearmed once the timeout
 function (here: scsi_times_out()) is called.
 Hence I fail to see how it can be called concurrently.

scsi_try_to_abort_cmd() is also called (via a different pathway) when a 
command sent by the error handler itself times out.  I haven't traced 
through all the different paths to make sure none of them can run 
concurrently.  But I'm willing to take your word for it.

  Maybe scmd_eh_abort_handler() should check the flag before doing
  anything.  Is there any sort of sychronization to prevent the same
  incarnation of a command from being aborted twice (or by two different
  threads at the same time)?  If there is, it isn't obvious.
  
 See above. scsi_times_out() will only ever called once.
 What can happen, though, is that _theoretically_ the LLDD might
 decide to call -done() on a timed out command when
 scsi_eh_abort_handler() is still pending.

That's okay.  We can expect the LLDD to have sufficient locking to
handle that sort of thing without confusion (usb-storage does, for
example).

  (Also, what's going on at the start of scsi_abort_command()?  Contrary
  to what one might expect, the first part of the function _cancels_ a
  scheduled abort.  And it does so without clearing the
  SCSI_EH_ABORT_SCHEDULED flag.)
  
 The original idea was this:
 
 SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
 Point is, any command abort is only ever send for a timed-out
 command. And the main problem for a timed-out command is that we
 simply _do not_ know what happened for that command. So _if_ a
 command timed out, _and_ we've send an abort, _and_ the command
 times out _again_ we'll be running into an endless loop between
 timeout and aborting, and never returning the command at all.
 
 So to prevent this we should set a marker on that command telling it
 to _not_ try to abort the command again.

I disagree.  We _have_ to abort the command again -- how else can we
stop a running command?  To prevent the loop you described, we should
avoid _retrying_ the command after it is aborted the second time.

 Which is what 'SCSI_EH_ABORT_SCHEDULED' was meant for:
 
 - A command times out, sets 'SCSI_EH_ABORT_SCHEDULED'
 - abort _succeeds_
 - The same command times out a second time, notifies
   that SCSI_EH_ABORT_SCHEDULED is set, and doesn't call
   scsi_eh_abort_command() but rather escalates directly.

The proper time to escalate is after the command is aborted again, not
while the command is still running.  The only situation where you might
want to escalate while a command is still running would be if you were
unable to abort the command.

(Hmmm, maybe that's not true for SCSI devices in general.  It is true 
for USB mass-storage, however.  Perhaps the reset handlers in 
usb-storage should be changed so that they will automatically abort a 
running command before starting the reset.)

 _However_, I do feel that we've gotten the wrong track with all of
 this. In you original mail you stated:
 
  Basically, usb-storage deadlocks when the SCSI error handler
  invokes the eh_device_reset_handler callback while a command
  is running.  The command has timed out and will never complete
  normally, because the device's firmware has crashed.  But
  usb-storage's device-reset routine waits for the current command
  to finish, which brings everything to a standstill.
 
 Question now to you as the USB god:
 
 A command abort is only _ever_ send after a command timeout.
 And the main idea of the command abort is to remove any context
 the LLDD or the target might have. So by the time the command
 abort returns SUCCESS we _expect_ the firmware to have cleared that
 

Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-03-31 Thread Hannes Reinecke
On 03/31/2014 03:33 PM, Alan Stern wrote:
 On Mon, 31 Mar 2014, Hannes Reinecke wrote:
 
 On 03/28/2014 08:29 PM, Alan Stern wrote:
 On Fri, 28 Mar 2014, James Bottomley wrote:

 This is a set of three patches we agreed to a while ago to eliminate a
 USB deadlock.  I did rewrite the first patch, if it could be reviewed
 and tested.

 I tested all three patches under the same conditions as before, and 
 they all worked correctly.

 In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't
 entirely clear.  This boils down to two questions, which I don't 
 know the answers to:

 What should happen in scmd_eh_abort_handler() if
 scsi_host_eh_past_deadline() returns true and thereby
 prevents scsi_try_to_abort_cmd() from being called?
 The flag wouldn't get cleared then.

 Ah. Correct. But that's due to the first patch being incorrect.
 Cf my response to the original first patch.
 
 See my response to your response.  :-)
 
Okay, So I probably should refrain from issueing a response to
your response to my response lest infinite recursion happens :-)

 What should happen if some other pathway manages to call
 scsi_try_to_abort_cmd() while scmd-abort_work is still
 sitting on the work queue?  The command would be aborted
 and the flag would be cleared, but the queued work would
 remain.  Can this ever happen?

 Not that I could see.
 A command abort is only ever triggered by the request timeout from
 the block layer. And the timer is _not_ rearmed once the timeout
 function (here: scsi_times_out()) is called.
 Hence I fail to see how it can be called concurrently.
 
 scsi_try_to_abort_cmd() is also called (via a different pathway) when a 
 command sent by the error handler itself times out.  I haven't traced 
 through all the different paths to make sure none of them can run 
 concurrently.  But I'm willing to take your word for it.
 
Yes, but that's not calling scsi_abort_command(), but rather invokes
scsi_abort_eh_cmnd().

 Maybe scmd_eh_abort_handler() should check the flag before doing
 anything.  Is there any sort of sychronization to prevent the same
 incarnation of a command from being aborted twice (or by two different
 threads at the same time)?  If there is, it isn't obvious.

 See above. scsi_times_out() will only ever called once.
 What can happen, though, is that _theoretically_ the LLDD might
 decide to call -done() on a timed out command when
 scsi_eh_abort_handler() is still pending.
 
 That's okay.  We can expect the LLDD to have sufficient locking to
 handle that sort of thing without confusion (usb-storage does, for
 example).
 
 (Also, what's going on at the start of scsi_abort_command()?  Contrary
 to what one might expect, the first part of the function _cancels_ a
 scheduled abort.  And it does so without clearing the
 SCSI_EH_ABORT_SCHEDULED flag.)

 The original idea was this:

 SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
 Point is, any command abort is only ever send for a timed-out
 command. And the main problem for a timed-out command is that we
 simply _do not_ know what happened for that command. So _if_ a
 command timed out, _and_ we've send an abort, _and_ the command
 times out _again_ we'll be running into an endless loop between
 timeout and aborting, and never returning the command at all.

 So to prevent this we should set a marker on that command telling it
 to _not_ try to abort the command again.
 
 I disagree.  We _have_ to abort the command again -- how else can we
 stop a running command?  To prevent the loop you described, we should
 avoid _retrying_ the command after it is aborted the second time.
 
The actual question is whether it's worth aborting the same command
a second time.
In principle any reset (like LUN reset etc) should clear the
command, too.
And the EH abort functionality is geared around this.
If, for some reason, the transport layer / device driver
requires a command abort to be send then sure, we need
to accommodate for that.

 Which is what 'SCSI_EH_ABORT_SCHEDULED' was meant for:

 - A command times out, sets 'SCSI_EH_ABORT_SCHEDULED'
 - abort _succeeds_
 - The same command times out a second time, notifies
   that SCSI_EH_ABORT_SCHEDULED is set, and doesn't call
   scsi_eh_abort_command() but rather escalates directly.
 
 The proper time to escalate is after the command is aborted again, not
 while the command is still running.  The only situation where you might
 want to escalate while a command is still running would be if you were
 unable to abort the command.
 
 (Hmmm, maybe that's not true for SCSI devices in general.  It is true 
 for USB mass-storage, however.  Perhaps the reset handlers in 
 usb-storage should be changed so that they will automatically abort a 
 running command before starting the reset.)
 
As said, yes, in principle you are right. We should be aborting the
command a second time, _and then_ starting the escalation.

So if the above reasoning is okay then this patch should 

Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-03-31 Thread James Bottomley
[lets split the thread]
On Mon, 2014-03-31 at 16:37 +0200, Hannes Reinecke wrote:
 On 03/31/2014 03:33 PM, Alan Stern wrote:
  On Mon, 31 Mar 2014, Hannes Reinecke wrote:
  On 03/28/2014 08:29 PM, Alan Stern wrote:
  On Fri, 28 Mar 2014, James Bottomley wrote:
  Maybe scmd_eh_abort_handler() should check the flag before doing
  anything.  Is there any sort of sychronization to prevent the same
  incarnation of a command from being aborted twice (or by two different
  threads at the same time)?  If there is, it isn't obvious.
 
  See above. scsi_times_out() will only ever called once.
  What can happen, though, is that _theoretically_ the LLDD might
  decide to call -done() on a timed out command when
  scsi_eh_abort_handler() is still pending.
  
  That's okay.  We can expect the LLDD to have sufficient locking to
  handle that sort of thing without confusion (usb-storage does, for
  example).
  
  (Also, what's going on at the start of scsi_abort_command()?  Contrary
  to what one might expect, the first part of the function _cancels_ a
  scheduled abort.  And it does so without clearing the
  SCSI_EH_ABORT_SCHEDULED flag.)
 
  The original idea was this:
 
  SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
  Point is, any command abort is only ever send for a timed-out
  command. And the main problem for a timed-out command is that we
  simply _do not_ know what happened for that command. So _if_ a
  command timed out, _and_ we've send an abort, _and_ the command
  times out _again_ we'll be running into an endless loop between
  timeout and aborting, and never returning the command at all.
 
  So to prevent this we should set a marker on that command telling it
  to _not_ try to abort the command again.
  
  I disagree.  We _have_ to abort the command again -- how else can we
  stop a running command?  To prevent the loop you described, we should
  avoid _retrying_ the command after it is aborted the second time.
  
 The actual question is whether it's worth aborting the same command
 a second time.
 In principle any reset (like LUN reset etc) should clear the
 command, too.
 And the EH abort functionality is geared around this.
 If, for some reason, the transport layer / device driver
 requires a command abort to be send then sure, we need
 to accommodate for that.

We already discussed this (that was my first response too).  USB needs
all outstanding commands aborted before proceeding to reset.  I'm
starting to think the actual way to fix this is to reset the abort
scheduled only if we send something else, so this might be a better fix.

It doesn't matter if we finish a command with abort scheduled because
the command then gets freed and the flags cleaned.

We can take our time with this because the other two patches, which I
can send separately fix the current deadlock (we no longer send an
unaborted request sense before the reset), and it can go via rc fixes.

James

---

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 771c16b..3cfd2bf 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1007,6 +1007,12 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, 
unsigned char *cmnd,
const unsigned long stall_for = msecs_to_jiffies(100);
int rtn;
 
+   /*
+* We're sending another command we'll need to abort, so reset the
+* abort scheduled flag on the real command before we save its state
+*/
+   scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED;
+
 retry:
scsi_eh_prep_cmnd(scmd, ses, cmnd, cmnd_size, sense_bytes);
shost-eh_action = done;


--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-03-31 Thread Alan Stern
On Mon, 31 Mar 2014, Hannes Reinecke wrote:

  Ah. Correct. But that's due to the first patch being incorrect.
  Cf my response to the original first patch.
  
  See my response to your response.  :-)
  
 Okay, So I probably should refrain from issueing a response to
 your response to my response lest infinite recursion happens :-)

Indeed.

What should happen if some other pathway manages to call
scsi_try_to_abort_cmd() while scmd-abort_work is still
sitting on the work queue?  The command would be aborted
and the flag would be cleared, but the queued work would
remain.  Can this ever happen?
 
  Not that I could see.
  A command abort is only ever triggered by the request timeout from
  the block layer. And the timer is _not_ rearmed once the timeout
  function (here: scsi_times_out()) is called.
  Hence I fail to see how it can be called concurrently.
  
  scsi_try_to_abort_cmd() is also called (via a different pathway) when a 
  command sent by the error handler itself times out.  I haven't traced 
  through all the different paths to make sure none of them can run 
  concurrently.  But I'm willing to take your word for it.
  
 Yes, but that's not calling scsi_abort_command(), but rather invokes
 scsi_abort_eh_cmnd().

Sure.  But either way, we end up in scsi_try_to_abort_cmd(), which
calls the LLDD's abort handler.  Thus leading to the possibility of
aborting the same command more than once.

  The original idea was this:
 
  SCSI_EH_ABORT_SCHEDULED is sticky _per command_.
  Point is, any command abort is only ever send for a timed-out
  command. And the main problem for a timed-out command is that we
  simply _do not_ know what happened for that command. So _if_ a
  command timed out, _and_ we've send an abort, _and_ the command
  times out _again_ we'll be running into an endless loop between
  timeout and aborting, and never returning the command at all.
 
  So to prevent this we should set a marker on that command telling it
  to _not_ try to abort the command again.
  
  I disagree.  We _have_ to abort the command again -- how else can we
  stop a running command?  To prevent the loop you described, we should
  avoid _retrying_ the command after it is aborted the second time.
  
 The actual question is whether it's worth aborting the same command
 a second time.
 In principle any reset (like LUN reset etc) should clear the
 command, too.
 And the EH abort functionality is geared around this.
 If, for some reason, the transport layer / device driver
 requires a command abort to be send then sure, we need
 to accommodate for that.

As James mentioned, with USB a reset does not abort a running command.  
Instead it waits for the command to finish.  (However, this could be
changed in usb-storage, if required.)

 As said, yes, in principle you are right. We should be aborting the
 command a second time, _and then_ starting the escalation.
 
 So if the above reasoning is okay then this patch should be doing
 the trick:
 
 diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
 index 771c16b..0e72374 100644
 --- a/drivers/scsi/scsi_error.c
 +++ b/drivers/scsi/scsi_error.c
 @@ -189,6 +189,7 @@ scsi_abort_command(struct scsi_cmnd *scmd)
 /*
  * Retry after abort failed, escalate to next level.
  */
 +   scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED;
 SCSI_LOG_ERROR_RECOVERY(3,
 scmd_printk(KERN_INFO, scmd,
 scmd %p previous abort
 failed\n, scmd));
 
 (Beware of line
 breaks)
 
 Can you test with it?

I don't understand.  This doesn't solve the fundamental problem (namely 
that you escalate before aborting a running command).  All it does is 
clear the SCSI_EH_ABORT_SCHEDULED flag before escalating.

What's needed is something like a 2-bit abort counter.  
scsi_try_to_abort_cmd() should increment the counter each time it runs, 
and if scmd_eh_abort_handler() sees that the counter is too high, it 
should avoid its retry pathway.  _Then_ you can move on to something 
more drastic.

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-03-31 Thread Alan Stern
On Mon, 31 Mar 2014, James Bottomley wrote:

  The actual question is whether it's worth aborting the same command
  a second time.
  In principle any reset (like LUN reset etc) should clear the
  command, too.
  And the EH abort functionality is geared around this.
  If, for some reason, the transport layer / device driver
  requires a command abort to be send then sure, we need
  to accommodate for that.
 
 We already discussed this (that was my first response too).  USB needs
 all outstanding commands aborted before proceeding to reset.  I'm
 starting to think the actual way to fix this is to reset the abort
 scheduled only if we send something else, so this might be a better fix.
 
 It doesn't matter if we finish a command with abort scheduled because
 the command then gets freed and the flags cleaned.
 
 We can take our time with this because the other two patches, which I
 can send separately fix the current deadlock (we no longer send an
 unaborted request sense before the reset), and it can go via rc fixes.
 
 James
 
 ---
 
 diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
 index 771c16b..3cfd2bf 100644
 --- a/drivers/scsi/scsi_error.c
 +++ b/drivers/scsi/scsi_error.c
 @@ -1007,6 +1007,12 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, 
 unsigned char *cmnd,
   const unsigned long stall_for = msecs_to_jiffies(100);
   int rtn;
  
 + /*
 +  * We're sending another command we'll need to abort, so reset the
 +  * abort scheduled flag on the real command before we save its state
 +  */
 + scmd-eh_eflags = ~SCSI_EH_ABORT_SCHEDULED;
 +
  retry:
   scsi_eh_prep_cmnd(scmd, ses, cmnd, cmnd_size, sense_bytes);
   shost-eh_action = done;

I don't know if Hannes will like this, but I don't think it is right.  
This is not the retry pathway that's causing problems; that pathway 
begins where scmd_eh_abort_handler() calls scsi_queue_insert().

Now, that call is already guarded by this test:

if (...  (++scmd-retries = scmd-allowed)) {

which would seem to prevent the infinite loop that Hannes was concerned 
about.  So maybe the 1/3 patch posted a couple of days ago is the best 
solution after all.

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-03-28 Thread James Bottomley
This is a set of three patches we agreed to a while ago to eliminate a
USB deadlock.  I did rewrite the first patch, if it could be reviewed
and tested.

Thanks,

James

---

Alan Stern (1):
  [SCSI] Fix command result state propagation

James Bottomley (2):
  [SCSI] Fix abort state memory problem
  [SCSI] Fix spurious request sense in error handling

 drivers/scsi/scsi_error.c | 19 ---
 drivers/scsi/scsi_lib.c   |  1 +
 2 files changed, 17 insertions(+), 3 deletions(-)


--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Fix USB deadlock caused by SCSI error handling

2014-03-28 Thread Alan Stern
On Fri, 28 Mar 2014, James Bottomley wrote:

 This is a set of three patches we agreed to a while ago to eliminate a
 USB deadlock.  I did rewrite the first patch, if it could be reviewed
 and tested.

I tested all three patches under the same conditions as before, and 
they all worked correctly.

In the revised patch 1, the meaning of SCSI_EH_ABORT_SCHEDULED isn't
entirely clear.  This boils down to two questions, which I don't 
know the answers to:

What should happen in scmd_eh_abort_handler() if
scsi_host_eh_past_deadline() returns true and thereby
prevents scsi_try_to_abort_cmd() from being called?
The flag wouldn't get cleared then.

What should happen if some other pathway manages to call
scsi_try_to_abort_cmd() while scmd-abort_work is still
sitting on the work queue?  The command would be aborted
and the flag would be cleared, but the queued work would
remain.  Can this ever happen?

Maybe scmd_eh_abort_handler() should check the flag before doing
anything.  Is there any sort of sychronization to prevent the same
incarnation of a command from being aborted twice (or by two different
threads at the same time)?  If there is, it isn't obvious.

(Also, what's going on at the start of scsi_abort_command()?  Contrary
to what one might expect, the first part of the function _cancels_ a
scheduled abort.  And it does so without clearing the
SCSI_EH_ABORT_SCHEDULED flag.)

Maybe in all these scenarios, the command is doomed to fail anyway so 
these questions don't make any real difference.  I don't know...

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html