subject:"\[PATCH 3.12 033\/118\] usb\: xhci\: Link TRB must not occur within a USB payload burst"

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-22 Thread David Laight

From: walt
> On 01/21/2014 01:51 AM, David Laight wrote:
> > From: Sarah Sharp
> >> On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote:
> > ...
> >>> A guess...
> >>>
> >>> In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
> >>> of xhci_td_remainder().
> >>
> David, I tried the one-liner below, which changed nothing AFAICS, but
> then I'm not sure it's the change you intended:
...
>   /* Set the TRB length, TD size, and interrupter fields. */
> - if (xhci->hci_version < 0x100) {
> + if (xhci->hci_version > 0x100) {
>   remainder = xhci_td_remainder(
>   urb->transfer_buffer_length -
>   running_total);

So my wild guess wasn't right.
Can't win them all.

David

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-22 Thread David Laight

From: walt
 On 01/21/2014 01:51 AM, David Laight wrote:
  From: Sarah Sharp
  On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote:
  ...
  A guess...

  In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
  of xhci_td_remainder().

 David, I tried the one-liner below, which changed nothing AFAICS, but
 then I'm not sure it's the change you intended:
...
   /* Set the TRB length, TD size, and interrupter fields. */
 - if (xhci-hci_version  0x100) {
 + if (xhci-hci_version  0x100) {
   remainder = xhci_td_remainder(
   urb-transfer_buffer_length -
   running_total);

So my wild guess wasn't right.
Can't win them all.

David

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-21 Thread walt

On 01/21/2014 01:51 AM, David Laight wrote:
> From: Sarah Sharp
>> On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote:
> ...
>>> A guess...
>>>
>>> In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
>>> of xhci_td_remainder().
>>
>> Why?  Walt has a 0.96 xHCI host controller, and the format for how to
>> calculate the TD remainder changed between the 0.96 and the 1.0 spec.
>> That's why we have xhci_v1_0_td_remainder() and xhci_td_remainder().
> 
> I just wonder how many of those differences are just differences in the
> specification, rather than differences in the hardware implementation.
> In some cases it might be that the old hardware just ignored the value.
> 
> I know that the xhci hardware on my ivy bridge cpu does look at that
> value (at least checking for zero), since things failed in subtle ways
> when I got it wrong.
> 
> In this case it was just something easy to change that might be worth
> trying.  I didn't necessarily expect it to make a positive difference.

David, I tried the one-liner below, which changed nothing AFAICS, but
then I'm not sure it's the change you intended:

--- xhci-ring.c.orig2014-01-21 13:28:36.396278813 -0800
+++ xhci-ring.c 2014-01-21 13:35:11.410312814 -0800
@@ -3335,7 +3335,7 @@
}
 
/* Set the TRB length, TD size, and interrupter fields. */
-   if (xhci->hci_version < 0x100) {
+   if (xhci->hci_version > 0x100) {
remainder = xhci_td_remainder(
urb->transfer_buffer_length -
running_total);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-21 Thread David Laight

From: Sarah Sharp
> On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote:
...
> > A guess...
> >
> > In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
> > of xhci_td_remainder().
> 
> Why?  Walt has a 0.96 xHCI host controller, and the format for how to
> calculate the TD remainder changed between the 0.96 and the 1.0 spec.
> That's why we have xhci_v1_0_td_remainder() and xhci_td_remainder().

I just wonder how many of those differences are just differences in the
specification, rather than differences in the hardware implementation.
In some cases it might be that the old hardware just ignored the value.

I know that the xhci hardware on my ivy bridge cpu does look at that
value (at least checking for zero), since things failed in subtle ways
when I got it wrong.

In this case it was just something easy to change that might be worth
trying.  I didn't necessarily expect it to make a positive difference.

David

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-21 Thread David Laight

From: Sarah Sharp
 On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote:
...
  A guess...
 
  In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
  of xhci_td_remainder().
 
 Why?  Walt has a 0.96 xHCI host controller, and the format for how to
 calculate the TD remainder changed between the 0.96 and the 1.0 spec.
 That's why we have xhci_v1_0_td_remainder() and xhci_td_remainder().

I just wonder how many of those differences are just differences in the
specification, rather than differences in the hardware implementation.
In some cases it might be that the old hardware just ignored the value.

I know that the xhci hardware on my ivy bridge cpu does look at that
value (at least checking for zero), since things failed in subtle ways
when I got it wrong.

In this case it was just something easy to change that might be worth
trying.  I didn't necessarily expect it to make a positive difference.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-21 Thread walt

On 01/21/2014 01:51 AM, David Laight wrote:
 From: Sarah Sharp
 On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote:
 ...
 A guess...

 In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
 of xhci_td_remainder().

 Why?  Walt has a 0.96 xHCI host controller, and the format for how to
 calculate the TD remainder changed between the 0.96 and the 1.0 spec.
 That's why we have xhci_v1_0_td_remainder() and xhci_td_remainder().
 
 I just wonder how many of those differences are just differences in the
 specification, rather than differences in the hardware implementation.
 In some cases it might be that the old hardware just ignored the value.
 
 I know that the xhci hardware on my ivy bridge cpu does look at that
 value (at least checking for zero), since things failed in subtle ways
 when I got it wrong.
 
 In this case it was just something easy to change that might be worth
 trying.  I didn't necessarily expect it to make a positive difference.

David, I tried the one-liner below, which changed nothing AFAICS, but
then I'm not sure it's the change you intended:

--- xhci-ring.c.orig2014-01-21 13:28:36.396278813 -0800
+++ xhci-ring.c 2014-01-21 13:35:11.410312814 -0800
@@ -3335,7 +3335,7 @@
}
 
/* Set the TRB length, TD size, and interrupter fields. */
-   if (xhci-hci_version  0x100) {
+   if (xhci-hci_version  0x100) {
remainder = xhci_td_remainder(
urb-transfer_buffer_length -
running_total);

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-20 Thread Sarah Sharp

On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote:
> From: walt 
> > On 01/17/2014 06:34 AM, David Laight wrote:
> > 
> > > Can you try the patch I posted that stops the ownership on LINK TRBs
> > > being changed before that on the linked-to TRB?
> > 
> > Please disregard my earlier post about the patch not applying cleanly.
> > That was the usual html corruption, so I found the original on the usb
> > list and it was okay.
> > 
> > Sadly, the patch didn't fix the ASMedia lockup behavior, however :(
> > 
> > I did notice that the lockup occurred only when copying *to* the usb3
> > drive, and not when copying from it.  I think that may be new behavior
> > but I can't swear to it.
> 
> Consistent with another report that says that ethernet worked provided
> that TSO was disabled (ie no sg tx).
> (Without the patch to delay he ownership change on link trbs it didn't
> work at all.)

Please be more clear.  What do you mean by these statements?  That
someone privately reported that your earlier patch [1] did not help
them, but applying your new patch [2] on top of the old patch did?

[1] http://marc.info/?l=linux-usb=138418996717941=2
[2] http://marc.info/?l=linux-usb=138996538403468=2

In general, will you please Cc me and the USB list when replying to
privately reported bugs/confirmations that patches work?  Or if the
confirmation was reported, please provide a link to the mailing list
discussion or bugzilla entry.  We need to keep bug and fix confirmations
publicly archived.  Please keep me on Cc since I filter mail based on
that.

> A guess...
> 
> In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
> of xhci_td_remainder().

Why?  Walt has a 0.96 xHCI host controller, and the format for how to
calculate the TD remainder changed between the 0.96 and the 1.0 spec.
That's why we have xhci_v1_0_td_remainder() and xhci_td_remainder().

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-20 Thread David Laight

From: walt 
> On 01/17/2014 06:34 AM, David Laight wrote:
> 
> > Can you try the patch I posted that stops the ownership on LINK TRBs
> > being changed before that on the linked-to TRB?
> 
> Please disregard my earlier post about the patch not applying cleanly.
> That was the usual html corruption, so I found the original on the usb
> list and it was okay.
> 
> Sadly, the patch didn't fix the ASMedia lockup behavior, however :(
> 
> I did notice that the lockup occurred only when copying *to* the usb3
> drive, and not when copying from it.  I think that may be new behavior
> but I can't swear to it.

Consistent with another report that says that ethernet worked provided
that TSO was disabled (ie no sg tx).
(Without the patch to delay he ownership change on link trbs it didn't
work at all.)

A guess...

In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
of xhci_td_remainder().

You can try that on top of either of my other patches.

David

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-20 Thread David Laight

From: 
> On 01/17/2014 06:34 AM, David Laight wrote:
> 
> > Can you try the patch I posted that stops the ownership on LINK TRBs
> > being changed before that on the linked-to TRB?
> 
> Sadly, the patch didn't fix the ASMedia lockup behavior, however :(
> 
> I did notice that the lockup occurred only when copying *to* the usb3
> drive, and not when copying from it.  I think that may be new behavior
> but I can't swear to it.

If the behaviour has changed then the fix is on the right lines.
If reads used to lock up, and don't with the patch then that is
an improvement.
It might be that the tx issue is actually a different 'bug'.

> Just to confirm, here are the first few lines of the patch I used.
> Please let me know if it's the wrong patch:
> ...
> +
> +   field4 = (field4  ~TRB_CYCLE) | ring-cycle_state;
> +   if (trb == ring-enqueue_first-generic)
> +   field4 ^= TRB_CYCLE;
> +
> trb-field[0] = cpu_to_le32(field1);

That looks like my earlier 'test' patch.
The fuller patch is http://article.gmane.org/gmane.linux.usb.general/101784

There shouldn't be any difference in the way the chip is driven, the
later patch just rips out a load of code that is no longer needed.
Mostly it simplifies the way the ownership of ring entries is passed
to the hardware.

David

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-20 Thread Sarah Sharp

On Mon, Jan 20, 2014 at 11:21:14AM +, David Laight wrote:
 From: walt 
  On 01/17/2014 06:34 AM, David Laight wrote:
  
   Can you try the patch I posted that stops the ownership on LINK TRBs
   being changed before that on the linked-to TRB?
  
  Please disregard my earlier post about the patch not applying cleanly.
  That was the usual html corruption, so I found the original on the usb
  list and it was okay.
  
  Sadly, the patch didn't fix the ASMedia lockup behavior, however :(
  
  I did notice that the lockup occurred only when copying *to* the usb3
  drive, and not when copying from it.  I think that may be new behavior
  but I can't swear to it.
 
 Consistent with another report that says that ethernet worked provided
 that TSO was disabled (ie no sg tx).
 (Without the patch to delay he ownership change on link trbs it didn't
 work at all.)

Please be more clear.  What do you mean by these statements?  That
someone privately reported that your earlier patch [1] did not help
them, but applying your new patch [2] on top of the old patch did?

[1] http://marc.info/?l=linux-usbm=138418996717941w=2
[2] http://marc.info/?l=linux-usbm=138996538403468w=2

In general, will you please Cc me and the USB list when replying to
privately reported bugs/confirmations that patches work?  Or if the
confirmation was reported, please provide a link to the mailing list
discussion or bugzilla entry.  We need to keep bug and fix confirmations
publicly archived.  Please keep me on Cc since I filter mail based on
that.

 A guess...
 
 In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
 of xhci_td_remainder().

Why?  Walt has a 0.96 xHCI host controller, and the format for how to
calculate the TD remainder changed between the 0.96 and the 1.0 spec.
That's why we have xhci_v1_0_td_remainder() and xhci_td_remainder().

Sarah Sharp
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-20 Thread David Laight

From: 
 On 01/17/2014 06:34 AM, David Laight wrote:
 
  Can you try the patch I posted that stops the ownership on LINK TRBs
  being changed before that on the linked-to TRB?
 
 Sadly, the patch didn't fix the ASMedia lockup behavior, however :(
 
 I did notice that the lockup occurred only when copying *to* the usb3
 drive, and not when copying from it.  I think that may be new behavior
 but I can't swear to it.

If the behaviour has changed then the fix is on the right lines.
If reads used to lock up, and don't with the patch then that is
an improvement.
It might be that the tx issue is actually a different 'bug'.

 Just to confirm, here are the first few lines of the patch I used.
 Please let me know if it's the wrong patch:
 ...
 +
 +   field4 = (field4 amp; ~TRB_CYCLE) | ring-gt;cycle_state;
 +   if (trb == amp;ring-gt;enqueue_first-gt;generic)
 +   field4 ^= TRB_CYCLE;
 +
 trb-gt;field[0] = cpu_to_le32(field1);

That looks like my earlier 'test' patch.
The fuller patch is http://article.gmane.org/gmane.linux.usb.general/101784

There shouldn't be any difference in the way the chip is driven, the
later patch just rips out a load of code that is no longer needed.
Mostly it simplifies the way the ownership of ring entries is passed
to the hardware.

David

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-20 Thread David Laight

From: walt 
 On 01/17/2014 06:34 AM, David Laight wrote:
 
  Can you try the patch I posted that stops the ownership on LINK TRBs
  being changed before that on the linked-to TRB?
 
 Please disregard my earlier post about the patch not applying cleanly.
 That was the usual html corruption, so I found the original on the usb
 list and it was okay.
 
 Sadly, the patch didn't fix the ASMedia lockup behavior, however :(
 
 I did notice that the lockup occurred only when copying *to* the usb3
 drive, and not when copying from it.  I think that may be new behavior
 but I can't swear to it.

Consistent with another report that says that ethernet worked provided
that TSO was disabled (ie no sg tx).
(Without the patch to delay he ownership change on link trbs it didn't
work at all.)

A guess...

In queue_bulk_sg_tx() try calling xhci_v1_0_td_remainder() instead
of xhci_td_remainder().

You can try that on top of either of my other patches.

David

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-18 Thread walt

On 01/17/2014 06:34 AM, David Laight wrote:

> Can you try the patch I posted that stops the ownership on LINK TRBs
> being changed before that on the linked-to TRB?

Please disregard my earlier post about the patch not applying cleanly.
That was the usual html corruption, so I found the original on the usb
list and it was okay.

Sadly, the patch didn't fix the ASMedia lockup behavior, however :(

I did notice that the lockup occurred only when copying *to* the usb3
drive, and not when copying from it.  I think that may be new behavior
but I can't swear to it.

Just to confirm, here are the first few lines of the patch I used.
Please let me know if it's the wrong patch:

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 53c2e29..589d336 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2928,6 +2928,11 @@ static void queue_trb(struct xhci_hcd *xhci, struct 
xhci_ring *ring,
struct xhci_generic_trb *trb;

trb = ring-enqueue-generic;
+
+   field4 = (field4  ~TRB_CYCLE) | ring-cycle_state;
+   if (trb == ring-enqueue_first-generic)
+   field4 ^= TRB_CYCLE;
+
trb-field[0] = cpu_to_le32(field1);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-18 Thread walt

On 01/17/2014 06:34 AM, David Laight wrote:
> From: walt
>> Oy, Sarah! ;)  I put the ASMedia adapter in my older amd64 machine, and, 
>> well,
>> the stupid thing Just Works(TM) with kernel 3.12.7!  (Yes, with the same disk
>> docking station, too.)
>>
>> I can't believe the adapter works perfectly in a different computer.  Have 
>> you
>> seen this kind of thing before?
> 
> Could be a horrid timing race between the cpu and xchi controller.
> If the cpu manages to write a NOP or LINK TRB for a following transfer
> before the controller polls the next entry (after raising the IRQ)
> then the controller might process the LINK and then get confused
> when it can't process the linked-to TRB.
> This might not sound likely, but PCIe has significant latency.

 
> Can you try the patch I posted that stops the ownership on LINK TRBs
> being changed before that on the linked-to TRB?

David, the patch doesn't apply cleanly to any 3.12.x sources I can find,
including the latest git from Linus.  Half of the hunks fail.  Could you
create a fresh patch against vanilla 3.12.8 so I can test it? Or point me
to the correct kernel sources, perhaps.

Thanks.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-18 Thread walt

On 01/17/2014 06:34 AM, David Laight wrote:
 From: walt
 Oy, Sarah! ;)  I put the ASMedia adapter in my older amd64 machine, and, 
 well,
 the stupid thing Just Works(TM) with kernel 3.12.7!  (Yes, with the same disk
 docking station, too.)

 I can't believe the adapter works perfectly in a different computer.  Have 
 you
 seen this kind of thing before?
 
 Could be a horrid timing race between the cpu and xchi controller.
 If the cpu manages to write a NOP or LINK TRB for a following transfer
 before the controller polls the next entry (after raising the IRQ)
 then the controller might process the LINK and then get confused
 when it can't process the linked-to TRB.
 This might not sound likely, but PCIe has significant latency.

 
 Can you try the patch I posted that stops the ownership on LINK TRBs
 being changed before that on the linked-to TRB?

David, the patch doesn't apply cleanly to any 3.12.x sources I can find,
including the latest git from Linus.  Half of the hunks fail.  Could you
create a fresh patch against vanilla 3.12.8 so I can test it? Or point me
to the correct kernel sources, perhaps.

Thanks.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-18 Thread walt

On 01/17/2014 06:34 AM, David Laight wrote:

 Can you try the patch I posted that stops the ownership on LINK TRBs
 being changed before that on the linked-to TRB?

Please disregard my earlier post about the patch not applying cleanly.
That was the usual html corruption, so I found the original on the usb
list and it was okay.

Sadly, the patch didn't fix the ASMedia lockup behavior, however :(

I did notice that the lockup occurred only when copying *to* the usb3
drive, and not when copying from it.  I think that may be new behavior
but I can't swear to it.

Just to confirm, here are the first few lines of the patch I used.
Please let me know if it's the wrong patch:


diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 53c2e29..589d336 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2928,6 +2928,11 @@ static void queue_trb(struct xhci_hcd *xhci, struct 
xhci_ring *ring,
struct xhci_generic_trb *trb;
 
trb = amp;ring-gt;enqueue-gt;generic;
+
+   field4 = (field4 amp; ~TRB_CYCLE) | ring-gt;cycle_state;
+   if (trb == amp;ring-gt;enqueue_first-gt;generic)
+   field4 ^= TRB_CYCLE;
+
trb-gt;field[0] = cpu_to_le32(field1);

big snip
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-17 Thread David Laight

From: walt
> Oy, Sarah! ;)  I put the ASMedia adapter in my older amd64 machine, and, well,
> the stupid thing Just Works(TM) with kernel 3.12.7!  (Yes, with the same disk
> docking station, too.)
> 
> I can't believe the adapter works perfectly in a different computer.  Have you
> seen this kind of thing before?

Could be a horrid timing race between the cpu and xchi controller.
If the cpu manages to write a NOP or LINK TRB for a following transfer
before the controller polls the next entry (after raising the IRQ)
then the controller might process the LINK and then get confused
when it can't process the linked-to TRB.
This might not sound likely, but PCIe has significant latency.

> At the moment I have two machines using your xhci driver and both work 
> perfectly,
> so I thank you again :)
> 
> I'm not sure where to go with this next.  I could put the adapter back in the
> other machine again if you have more patches to test.

Can you try the patch I posted that stops the ownership on LINK TRBs
being changed before that on the linked-to TRB?

I got a private mail from someone indicating that my earlier 'minimal'
patch helped an ASMedia controller talking to the asx189_178a ethernet
hardware.

David

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-17 Thread David Laight

From: walt
 Oy, Sarah! ;)  I put the ASMedia adapter in my older amd64 machine, and, well,
 the stupid thing Just Works(TM) with kernel 3.12.7!  (Yes, with the same disk
 docking station, too.)
 
 I can't believe the adapter works perfectly in a different computer.  Have you
 seen this kind of thing before?

Could be a horrid timing race between the cpu and xchi controller.
If the cpu manages to write a NOP or LINK TRB for a following transfer
before the controller polls the next entry (after raising the IRQ)
then the controller might process the LINK and then get confused
when it can't process the linked-to TRB.
This might not sound likely, but PCIe has significant latency.

 At the moment I have two machines using your xhci driver and both work 
 perfectly,
 so I thank you again :)
 
 I'm not sure where to go with this next.  I could put the adapter back in the
 other machine again if you have more patches to test.

Can you try the patch I posted that stops the ownership on LINK TRBs
being changed before that on the linked-to TRB?

I got a private mail from someone indicating that my earlier 'minimal'
patch helped an ASMedia controller talking to the asx189_178a ethernet
hardware.

David

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-16 Thread Sarah Sharp

On Tue, Jan 14, 2014 at 01:27:25PM -0800, walt wrote:
> On 01/14/2014 09:20 AM, Sarah Sharp wrote:
> > On Mon, Jan 13, 2014 at 03:39:07PM -0800, walt wrote:
> 
> >> Sarah, I just fixed my xhci bug for US$19.99 :)
> >>
> >> #lspci | tail -1
> >> 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
> >> (rev 03)
> >>
> >> This new NEC usb3 controller does everything the ASMedia controller should 
> >> have
> >> done from the start.
> 
>  
> > I just got a similar report from someone with a Fresco Logic host
> > controller, so I may need you to test a work-around patch for the
> > ASMedia host at some point.

Hmm, the Fresco Logic host issue seems unrelated.

> Oy, Sarah! ;)  I put the ASMedia adapter in my older amd64 machine, and, well,
> the stupid thing Just Works(TM) with kernel 3.12.7!  (Yes, with the same disk
> docking station, too.)

Ugh.  Well, I suppose we can chalk it up to hardware failure?  I think
you're the only one to report a verified issue with the Link TRB patch.

You are sure you're running a vanilla 3.12.7 kernel, right?

> I can't believe the adapter works perfectly in a different computer.  Have you
> seen this kind of thing before?

No, at least not this particular host-dying-only-on-one-machine failure
mode.

> At the moment I have two machines using your xhci driver and both work 
> perfectly,
> so I thank you again :)
> 
> I'm not sure where to go with this next.  I could put the adapter back in the
> other machine again if you have more patches to test.

I think any patches I was going to send are moot with this new
information.  The issue with your PCI add-in card only happens in
combination with a specific motherboard, so I don't think it makes sense
to disable the no-op TRBs for that host.

If lots of other people start reporting the same issue with the ASMedia
0.96 host, and reverting commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e
"usb: xhci: Link TRB must not occur within a USB payload burst" helps
them, then I'll reconsider that decision.

Thank you so much for your patience while debugging this issue, and
being willing to try all sorts of kernels and patches.  I'm glad we
finally figured out what the issue was, and you have working xHCI hosts
now.

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-16 Thread Sarah Sharp

On Tue, Jan 14, 2014 at 01:27:25PM -0800, walt wrote:
 On 01/14/2014 09:20 AM, Sarah Sharp wrote:
  On Mon, Jan 13, 2014 at 03:39:07PM -0800, walt wrote:
 
  Sarah, I just fixed my xhci bug for US$19.99 :)
 
  #lspci | tail -1
  04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
  (rev 03)
 
  This new NEC usb3 controller does everything the ASMedia controller should 
  have
  done from the start.
 
  
  I just got a similar report from someone with a Fresco Logic host
  controller, so I may need you to test a work-around patch for the
  ASMedia host at some point.

Hmm, the Fresco Logic host issue seems unrelated.

 Oy, Sarah! ;)  I put the ASMedia adapter in my older amd64 machine, and, well,
 the stupid thing Just Works(TM) with kernel 3.12.7!  (Yes, with the same disk
 docking station, too.)

Ugh.  Well, I suppose we can chalk it up to hardware failure?  I think
you're the only one to report a verified issue with the Link TRB patch.

You are sure you're running a vanilla 3.12.7 kernel, right?

 I can't believe the adapter works perfectly in a different computer.  Have you
 seen this kind of thing before?

No, at least not this particular host-dying-only-on-one-machine failure
mode.

 At the moment I have two machines using your xhci driver and both work 
 perfectly,
 so I thank you again :)
 
 I'm not sure where to go with this next.  I could put the adapter back in the
 other machine again if you have more patches to test.

I think any patches I was going to send are moot with this new
information.  The issue with your PCI add-in card only happens in
combination with a specific motherboard, so I don't think it makes sense
to disable the no-op TRBs for that host.

If lots of other people start reporting the same issue with the ASMedia
0.96 host, and reverting commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e
usb: xhci: Link TRB must not occur within a USB payload burst helps
them, then I'll reconsider that decision.

Thank you so much for your patience while debugging this issue, and
being willing to try all sorts of kernels and patches.  I'm glad we
finally figured out what the issue was, and you have working xHCI hosts
now.

Sarah Sharp
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-14 Thread walt

On 01/14/2014 09:20 AM, Sarah Sharp wrote:
> On Mon, Jan 13, 2014 at 03:39:07PM -0800, walt wrote:

>> Sarah, I just fixed my xhci bug for US$19.99 :)
>>
>> #lspci | tail -1
>> 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
>> (rev 03)
>>
>> This new NEC usb3 controller does everything the ASMedia controller should 
>> have
>> done from the start.

> I just got a similar report from someone with a Fresco Logic host
> controller, so I may need you to test a work-around patch for the
> ASMedia host at some point.

Oy, Sarah! ;)  I put the ASMedia adapter in my older amd64 machine, and, well,
the stupid thing Just Works(TM) with kernel 3.12.7!  (Yes, with the same disk
docking station, too.)

I can't believe the adapter works perfectly in a different computer.  Have you
seen this kind of thing before?

At the moment I have two machines using your xhci driver and both work 
perfectly,
so I thank you again :)

I'm not sure where to go with this next.  I could put the adapter back in the
other machine again if you have more patches to test.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-14 Thread Sarah Sharp

On Mon, Jan 13, 2014 at 03:39:07PM -0800, walt wrote:
> On 01/09/2014 03:50 PM, Sarah Sharp wrote:
> 
> >>> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
> >>
> >> I've wondered if my xhci problems might be caused by hardware quirks, and
> >> wondering why I seem to be the only one who has this problem.
> >>
> >> Maybe I could "take one for the team" by buying new hardware toys that I
> >> don't really need but I could use to test the xhci driver?  (I do enjoy
> >> buying new toys, necessary, or, um, maybe not :)
> > 
> > It would be appreciated if you could see if your device causes other
> > host controllers to fail.  Who am I to keep a geek from new toys? ;)
> 
> Sarah, I just fixed my xhci bug for US$19.99 :)
> 
> #lspci | tail -1
> 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
> (rev 03)
> 
> This new NEC usb3 controller does everything the ASMedia controller should 
> have
> done from the start.  I can even power-up the outboard usb3 disk docking 
> station
> after the computer is running and still everything Just Works :)
> 
> I appreciate all the debugging work you've done to fix the ASMedia problem but
> I think it's time to quit now.  If hundreds of irate linux users complain 
> about
> the same bug then I'll be happy to test more patches.

I just got a similar report from someone with a Fresco Logic host
controller, so I may need you to test a work-around patch for the
ASMedia host at some point.

I'm considering disabling the effect of David's patch for those two host
controllers.  That will mean USB storage works fine, but USB ethernet
may fail.

I had considered just disabling scatter-gather for the hosts, but we can
still run into the ethernet issue if we need to break a TRB at a 64KB
boundary.  So disabling scatter-gather would make USB ethernet work
_most of the time_, but fail intermittently, and USB storage performance
would be impacted.  Since USB ethernet will fail in either case, I would
rather keep USB storage performance and sacrifice USB ethernet on those
particular hosts.

So please keep the ASMedia host around for testing, if possible.

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-14 Thread David Laight

From: walt
> On 01/09/2014 03:50 PM, Sarah Sharp wrote:
> 
> >>> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
> >>
> >> I've wondered if my xhci problems might be caused by hardware quirks, and
> >> wondering why I seem to be the only one who has this problem.
> >>
> >> Maybe I could "take one for the team" by buying new hardware toys that I
> >> don't really need but I could use to test the xhci driver?  (I do enjoy
> >> buying new toys, necessary, or, um, maybe not :)
> >
> > It would be appreciated if you could see if your device causes other
> > host controllers to fail.  Who am I to keep a geek from new toys? ;)
> 
> Sarah, I just fixed my xhci bug for US$19.99 :)
> 
> #lspci | tail -1
> 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
> (rev 03)

Do you know which version of the xhci spec it conforms to?
(Will probably be 0.96 or 1.00).

Of course, ASMedia will probably change the silicon they are using without
changing anything on the packaging.

David

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-14 Thread David Laight

From: walt
 On 01/09/2014 03:50 PM, Sarah Sharp wrote:
 
  On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
 
  I've wondered if my xhci problems might be caused by hardware quirks, and
  wondering why I seem to be the only one who has this problem.
 
  Maybe I could take one for the team by buying new hardware toys that I
  don't really need but I could use to test the xhci driver?  (I do enjoy
  buying new toys, necessary, or, um, maybe not :)
 
  It would be appreciated if you could see if your device causes other
  host controllers to fail.  Who am I to keep a geek from new toys? ;)
 
 Sarah, I just fixed my xhci bug for US$19.99 :)
 
 #lspci | tail -1
 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
 (rev 03)

Do you know which version of the xhci spec it conforms to?
(Will probably be 0.96 or 1.00).

Of course, ASMedia will probably change the silicon they are using without
changing anything on the packaging.

David

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-14 Thread Sarah Sharp

On Mon, Jan 13, 2014 at 03:39:07PM -0800, walt wrote:
 On 01/09/2014 03:50 PM, Sarah Sharp wrote:
 
  On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
 
  I've wondered if my xhci problems might be caused by hardware quirks, and
  wondering why I seem to be the only one who has this problem.
 
  Maybe I could take one for the team by buying new hardware toys that I
  don't really need but I could use to test the xhci driver?  (I do enjoy
  buying new toys, necessary, or, um, maybe not :)
  
  It would be appreciated if you could see if your device causes other
  host controllers to fail.  Who am I to keep a geek from new toys? ;)
 
 Sarah, I just fixed my xhci bug for US$19.99 :)
 
 #lspci | tail -1
 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
 (rev 03)
 
 This new NEC usb3 controller does everything the ASMedia controller should 
 have
 done from the start.  I can even power-up the outboard usb3 disk docking 
 station
 after the computer is running and still everything Just Works :)
 
 I appreciate all the debugging work you've done to fix the ASMedia problem but
 I think it's time to quit now.  If hundreds of irate linux users complain 
 about
 the same bug then I'll be happy to test more patches.

I just got a similar report from someone with a Fresco Logic host
controller, so I may need you to test a work-around patch for the
ASMedia host at some point.

I'm considering disabling the effect of David's patch for those two host
controllers.  That will mean USB storage works fine, but USB ethernet
may fail.

I had considered just disabling scatter-gather for the hosts, but we can
still run into the ethernet issue if we need to break a TRB at a 64KB
boundary.  So disabling scatter-gather would make USB ethernet work
_most of the time_, but fail intermittently, and USB storage performance
would be impacted.  Since USB ethernet will fail in either case, I would
rather keep USB storage performance and sacrifice USB ethernet on those
particular hosts.

So please keep the ASMedia host around for testing, if possible.

Sarah Sharp
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-14 Thread walt

On 01/14/2014 09:20 AM, Sarah Sharp wrote:
 On Mon, Jan 13, 2014 at 03:39:07PM -0800, walt wrote:

 Sarah, I just fixed my xhci bug for US$19.99 :)

 #lspci | tail -1
 04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller 
 (rev 03)

 This new NEC usb3 controller does everything the ASMedia controller should 
 have
 done from the start.

 
 I just got a similar report from someone with a Fresco Logic host
 controller, so I may need you to test a work-around patch for the
 ASMedia host at some point.

Oy, Sarah! ;)  I put the ASMedia adapter in my older amd64 machine, and, well,
the stupid thing Just Works(TM) with kernel 3.12.7!  (Yes, with the same disk
docking station, too.)

I can't believe the adapter works perfectly in a different computer.  Have you
seen this kind of thing before?

At the moment I have two machines using your xhci driver and both work 
perfectly,
so I thank you again :)

I'm not sure where to go with this next.  I could put the adapter back in the
other machine again if you have more patches to test.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-13 Thread walt

On 01/09/2014 03:50 PM, Sarah Sharp wrote:

>>> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
>>
>> I've wondered if my xhci problems might be caused by hardware quirks, and
>> wondering why I seem to be the only one who has this problem.
>>
>> Maybe I could "take one for the team" by buying new hardware toys that I
>> don't really need but I could use to test the xhci driver?  (I do enjoy
>> buying new toys, necessary, or, um, maybe not :)
> 
> It would be appreciated if you could see if your device causes other
> host controllers to fail.  Who am I to keep a geek from new toys? ;)

Sarah, I just fixed my xhci bug for US$19.99 :)

#lspci | tail -1
04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 
03)

This new NEC usb3 controller does everything the ASMedia controller should have
done from the start.  I can even power-up the outboard usb3 disk docking station
after the computer is running and still everything Just Works :)

I appreciate all the debugging work you've done to fix the ASMedia problem but
I think it's time to quit now.  If hundreds of irate linux users complain about
the same bug then I'll be happy to test more patches.

Until then... :)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst [NEW HARDWARE]

2014-01-13 Thread walt

On 01/09/2014 03:50 PM, Sarah Sharp wrote:

 On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:

 I've wondered if my xhci problems might be caused by hardware quirks, and
 wondering why I seem to be the only one who has this problem.

 Maybe I could take one for the team by buying new hardware toys that I
 don't really need but I could use to test the xhci driver?  (I do enjoy
 buying new toys, necessary, or, um, maybe not :)
 
 It would be appreciated if you could see if your device causes other
 host controllers to fail.  Who am I to keep a geek from new toys? ;)

Sarah, I just fixed my xhci bug for US$19.99 :)

#lspci | tail -1
04:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 
03)

This new NEC usb3 controller does everything the ASMedia controller should have
done from the start.  I can even power-up the outboard usb3 disk docking station
after the computer is running and still everything Just Works :)

I appreciate all the debugging work you've done to fix the ASMedia problem but
I think it's time to quit now.  If hundreds of irate linux users complain about
the same bug then I'll be happy to test more patches.

Until then... :)

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-10 Thread Alan Stern

On Thu, 9 Jan 2014, Sarah Sharp wrote:

> > I can't see anything obvious either.
> > However there is no response to the 'stop endpoint' command.
> > Section 4.6.9 (page 107 of rev1.0) states that the controller will complete
> > any USB IN or OUT transaction before raising the command completion event.
> > Possibly it is too 'stuck' to complete the transaction?
> 
> The host has to stop processing the transaction, it can't "wait" for the
> transaction to finish.  "The Stop Endpoint Command is expected to stop
> endpoint activity as soon as possible, which may mean that it stops in
> the middle of a TRB."

Just to clarify for Walt: There's a difference between a transaction 
and a transfer.  Transfers can take an indefinitely long time to 
complete, because the device doesn't have to accept or send any data 
until it is ready.

By contrast, transactions have sharply bounded lifetimes.  A
transaction consists of some maximum number of packets (usually 3,
sometimes a little more) with upper limits on the time intervals
between them.

A TRB lies somewhere between a transfer and a transaction.  A single 
TRB can encompass a single transaction or multiple transactions, and a 
single transfer can involve more than one TRB.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-10 Thread David Laight

From: walt
> > In the meantime, try this patch, which is something of a long shot.
> 
> No difference.  But I notice the code enables the TRB quirk only if the
> xhci_version is specifically 0x95.  My debug messages claim that "xHCI
> doesn't need link TRB QUIRK" so I'm wondering if adding my asmedia device
> to the quirks list really doesn't change anything unless it's xhci 0.95.

I think the intention was that the quirk would be applied for your
hardware even though it claims to be version 0.96.
That was gust a hopeful guess that the same change would help.

> Does lspci provide that information?  I'm not seeing anything obvious.
I've only seen it in the diagnostic prints (that aren't normally output).

David

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-10 Thread walt

On 01/09/2014 03:50 PM, Sarah Sharp wrote:

>>> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
>>
>> The aftermarket usb3 adapter card and the usb3 outboard hard-drive docking
>> station are the only two usb3 devices I have.
>>
>> I've wondered if my xhci problems might be caused by hardware quirks, and
>> wondering why I seem to be the only one who has this problem.
>>
>> Maybe I could "take one for the team" by buying new hardware toys that I
>> don't really need but I could use to test the xhci driver?  (I do enjoy
>> buying new toys, necessary, or, um, maybe not :)
> 
> It would be appreciated if you could see if your device causes other
> host controllers to fail.  Who am I to keep a geek from new toys? ;)

Thanks :)  Just to clarify, when you say 'your device' do you mean the
outboard disk docking station?  So I need to buy other host controllers,
not new docking stations?

> 
> In the meantime, try this patch, which is something of a long shot.

No difference.  But I notice the code enables the TRB quirk only if the
xhci_version is specifically 0x95.  My debug messages claim that "xHCI
doesn't need link TRB QUIRK" so I'm wondering if adding my asmedia device
to the quirks list really doesn't change anything unless it's xhci 0.95.

Does lspci provide that information?  I'm not seeing anything obvious.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-10 Thread walt

On 01/09/2014 03:50 PM, Sarah Sharp wrote:

 On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:

 The aftermarket usb3 adapter card and the usb3 outboard hard-drive docking
 station are the only two usb3 devices I have.

 I've wondered if my xhci problems might be caused by hardware quirks, and
 wondering why I seem to be the only one who has this problem.

 Maybe I could take one for the team by buying new hardware toys that I
 don't really need but I could use to test the xhci driver?  (I do enjoy
 buying new toys, necessary, or, um, maybe not :)
 
 It would be appreciated if you could see if your device causes other
 host controllers to fail.  Who am I to keep a geek from new toys? ;)

Thanks :)  Just to clarify, when you say 'your device' do you mean the
outboard disk docking station?  So I need to buy other host controllers,
not new docking stations?

 
 In the meantime, try this patch, which is something of a long shot.

No difference.  But I notice the code enables the TRB quirk only if the
xhci_version is specifically 0x95.  My debug messages claim that xHCI
doesn't need link TRB QUIRK so I'm wondering if adding my asmedia device
to the quirks list really doesn't change anything unless it's xhci 0.95.

Does lspci provide that information?  I'm not seeing anything obvious.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-10 Thread David Laight

From: walt
  In the meantime, try this patch, which is something of a long shot.
 
 No difference.  But I notice the code enables the TRB quirk only if the
 xhci_version is specifically 0x95.  My debug messages claim that xHCI
 doesn't need link TRB QUIRK so I'm wondering if adding my asmedia device
 to the quirks list really doesn't change anything unless it's xhci 0.95.

I think the intention was that the quirk would be applied for your
hardware even though it claims to be version 0.96.
That was gust a hopeful guess that the same change would help.

 Does lspci provide that information?  I'm not seeing anything obvious.
I've only seen it in the diagnostic prints (that aren't normally output).

David

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-10 Thread Alan Stern

On Thu, 9 Jan 2014, Sarah Sharp wrote:

  I can't see anything obvious either.
  However there is no response to the 'stop endpoint' command.
  Section 4.6.9 (page 107 of rev1.0) states that the controller will complete
  any USB IN or OUT transaction before raising the command completion event.
  Possibly it is too 'stuck' to complete the transaction?
 
 The host has to stop processing the transaction, it can't wait for the
 transaction to finish.  The Stop Endpoint Command is expected to stop
 endpoint activity as soon as possible, which may mean that it stops in
 the middle of a TRB.

Just to clarify for Walt: There's a difference between a transaction 
and a transfer.  Transfers can take an indefinitely long time to 
complete, because the device doesn't have to accept or send any data 
until it is ready.

By contrast, transactions have sharply bounded lifetimes.  A
transaction consists of some maximum number of packets (usually 3,
sometimes a little more) with upper limits on the time intervals
between them.

A TRB lies somewhere between a transfer and a transaction.  A single 
TRB can encompass a single transaction or multiple transactions, and a 
single transfer can involve more than one TRB.

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-09 Thread Sarah Sharp

[Walt, please use reply-all to keep the list in the loop, thanks.]

On Wed, Jan 08, 2014 at 04:09:14PM +, David Laight wrote:
> > From: Sarah Sharp
> > On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
> > > On 01/07/2014 01:21 PM, Sarah Sharp wrote:
> > >
> > > > Can you please try the attached patch, on top of the previous three
> > > > patches, and send me dmesg?
> > >
> > > Hi Sarah, I just now finished running 0001-More-debugging.patch for the
> > > first time.  The previous dmesg didn't include that patch, but this one
> > > does.
> > >
> > > I read through this dmesg but I nodded off somewhere around line 500.
> > > I hope you can stay awake :)
> > 
> > Well, it has all the info I need, but the results don't make me too
> > happy.  Everything I've checked seems consistent, and I don't know why
> > the host stopped.  The link TRBs are intact, the dequeue pointer for the
> > endpoint was pointing to the transfer that timed out and it had the
> > cycle bit set correctly, etc.  Perhaps the no-op TRBs are really the
> > issue.
> > 
> > I'll have to take a look at the log again tomorrow.  I posted the dmesg
> > on pastebin if David wants to check it out as well:
> > http://pastebin.com/a4AUpsL1
> 
> I can't see anything obvious either.
> However there is no response to the 'stop endpoint' command.
> Section 4.6.9 (page 107 of rev1.0) states that the controller will complete
> any USB IN or OUT transaction before raising the command completion event.
> Possibly it is too 'stuck' to complete the transaction?

The host has to stop processing the transaction, it can't "wait" for the
transaction to finish.  "The Stop Endpoint Command is expected to stop
endpoint activity as soon as possible, which may mean that it stops in
the middle of a TRB."

Usually when hosts get into this kind of mode, something has seriously
gone wrong, like bus error when it issues a bad memory access.

> The endpoint status is also still '1' (running).
> This also means that the 'TR dequeue pointer' is undefined - so the
> controller could easily be processing a later TRB.
> This field might even still contain the ring base address written by
> the driver much earlier.
> 
> This might mean that something 'catastrophic' has happened earlier.
> Maybe the controller isn't actually seeing any doorbell writes at all.
> Maybe the base addresses it has for the rings have all got corrupted.
> At least this looks like amd64 - so there aren't memory coherency issues.
> 
> Some hacks that might help isolate the problem:
> 1) Request an interrupt from the last nop data TRB.
> 2) Put a command nop (decimal 23) TRB into the command ring before
>the 'stop endpoint'.
> 3) Comment out the code that adds the nop data TRBs.
> The first two might need code adding to handle the responses.
> 
> Do we know the actual xhci device?
> I think it reports version 0x96.
> (Sarah - it might be useful if that version were in one of the trace
> messages that is output by default.)

You mean print the PCI device and vendor ID?  Perhaps Subsystem vendor
as well?

On Tue, Jan 07, 2014 at 05:26:37PM -0800, walt wrote:
> On 01/07/2014 04:47 PM, Sarah Sharp wrote:
>  
> > Can you send me the output of `sudo lspci -vvv -n`?  Maybe we can just
> > turn off scatter-gather for your host controller until we get a proper
> > fix in that uses link TRBs instead of no-op TRBs.
> 
> The aftermarket usb3 adapter card and the usb3 outboard hard-drive docking
> station are the only two usb3 devices I have.
> 
> I've wondered if my xhci problems might be caused by hardware quirks, and
> wondering why I seem to be the only one who has this problem.
> 
> Maybe I could "take one for the team" by buying new hardware toys that I
> don't really need but I could use to test the xhci driver?  (I do enjoy
> buying new toys, necessary, or, um, maybe not :)

It would be appreciated if you could see if your device causes other
host controllers to fail.  Who am I to keep a geek from new toys? ;)

In the meantime, try this patch, which is something of a long shot.

Sarah Sharp
>From 19e2ab85ac2cc0d84f56247dcf29bdce14bd70d5 Mon Sep 17 00:00:00 2001
From: Sarah Sharp 
Date: Thu, 9 Jan 2014 15:46:04 -0800
Subject: [PATCH] xhci: Enable Link TRB quirk for 0.96 ASMedia host.

A recent bug fix commit causes an ASMedia host to stop responding to
commands.  See if it needs the link TRB quirk.  This was generally only
necessary for 0.95 hosts, but maybe this 0.96 host needs it.

Signed-off-by: Sarah Sharp 
---
 drivers/usb/host/xhci-pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
index 3c898c12a06b..8196ac2289e4 100644
--- a/drivers/usb/host/xhci-pci.c
+++ b/drivers/usb/host/xhci-pci.c
@@ -92,6 +92,8 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
 		xhci->quirks |= XHCI_TRUST_TX_LENGTH;
 	}
 
+	if (pdev->vendor == PCI_VENDOR_ID_ASMEDIA && pdev->device == 1042)
+		xhci->quirks |= XHCI_LINK_TRB_QUIRK;

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-09 Thread walt

On 01/09/2014 02:05 AM, David Laight wrote:
>> From: walt
> ...
>> I'm still wondering if I'm suffering from hardware quirks.  From the
>> first day I installed my usb3 adapter card and the usb3 disk docking
>> station I've noticed some quirky behavior.
> 
> Ah - this isn't an 'on chip' usb3 adapter.
> Some kind of PCIe card ?

This one:

http://www.sabrent.com/category/controller-cards/PCIX-USB3/
 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-09 Thread David Laight

> From: walt
...
> I'm still wondering if I'm suffering from hardware quirks.  From the
> first day I installed my usb3 adapter card and the usb3 disk docking
> station I've noticed some quirky behavior.

Ah - this isn't an 'on chip' usb3 adapter.
Some kind of PCIe card ?

> e.g. I boot the machine with the docking station powered-off, and then
> later I power it on, the usb3 disk is not detected at all -- until I
> reboot the machine with the docking station still powered on.

That will be something completely different to failures when running.

David

N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf��^jǫy�m��@A�a���
0��h���i

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-09 Thread David Laight

 From: walt
...
 I'm still wondering if I'm suffering from hardware quirks.  From the
 first day I installed my usb3 adapter card and the usb3 disk docking
 station I've noticed some quirky behavior.

Ah - this isn't an 'on chip' usb3 adapter.
Some kind of PCIe card ?

 e.g. I boot the machine with the docking station powered-off, and then
 later I power it on, the usb3 disk is not detected at all -- until I
 reboot the machine with the docking station still powered on.

That will be something completely different to failures when running.

David

N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf��^jǫy�m��@A�a���
0��h���i

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-09 Thread walt

On 01/09/2014 02:05 AM, David Laight wrote:
 From: walt
 ...
 I'm still wondering if I'm suffering from hardware quirks.  From the
 first day I installed my usb3 adapter card and the usb3 disk docking
 station I've noticed some quirky behavior.
 
 Ah - this isn't an 'on chip' usb3 adapter.
 Some kind of PCIe card ?

This one:

http://www.sabrent.com/category/controller-cards/PCIX-USB3/
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-09 Thread Sarah Sharp

[Walt, please use reply-all to keep the list in the loop, thanks.]

On Wed, Jan 08, 2014 at 04:09:14PM +, David Laight wrote:
  From: Sarah Sharp
  On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
   On 01/07/2014 01:21 PM, Sarah Sharp wrote:
  
Can you please try the attached patch, on top of the previous three
patches, and send me dmesg?
  
   Hi Sarah, I just now finished running 0001-More-debugging.patch for the
   first time.  The previous dmesg didn't include that patch, but this one
   does.
  
   I read through this dmesg but I nodded off somewhere around line 500.
   I hope you can stay awake :)
  
  Well, it has all the info I need, but the results don't make me too
  happy.  Everything I've checked seems consistent, and I don't know why
  the host stopped.  The link TRBs are intact, the dequeue pointer for the
  endpoint was pointing to the transfer that timed out and it had the
  cycle bit set correctly, etc.  Perhaps the no-op TRBs are really the
  issue.
  
  I'll have to take a look at the log again tomorrow.  I posted the dmesg
  on pastebin if David wants to check it out as well:
  http://pastebin.com/a4AUpsL1
 
 I can't see anything obvious either.
 However there is no response to the 'stop endpoint' command.
 Section 4.6.9 (page 107 of rev1.0) states that the controller will complete
 any USB IN or OUT transaction before raising the command completion event.
 Possibly it is too 'stuck' to complete the transaction?

The host has to stop processing the transaction, it can't wait for the
transaction to finish.  The Stop Endpoint Command is expected to stop
endpoint activity as soon as possible, which may mean that it stops in
the middle of a TRB.

Usually when hosts get into this kind of mode, something has seriously
gone wrong, like bus error when it issues a bad memory access.

 The endpoint status is also still '1' (running).
 This also means that the 'TR dequeue pointer' is undefined - so the
 controller could easily be processing a later TRB.
 This field might even still contain the ring base address written by
 the driver much earlier.
 
 This might mean that something 'catastrophic' has happened earlier.
 Maybe the controller isn't actually seeing any doorbell writes at all.
 Maybe the base addresses it has for the rings have all got corrupted.
 At least this looks like amd64 - so there aren't memory coherency issues.
 
 Some hacks that might help isolate the problem:
 1) Request an interrupt from the last nop data TRB.
 2) Put a command nop (decimal 23) TRB into the command ring before
the 'stop endpoint'.
 3) Comment out the code that adds the nop data TRBs.
 The first two might need code adding to handle the responses.
 
 Do we know the actual xhci device?
 I think it reports version 0x96.
 (Sarah - it might be useful if that version were in one of the trace
 messages that is output by default.)

You mean print the PCI device and vendor ID?  Perhaps Subsystem vendor
as well?

On Tue, Jan 07, 2014 at 05:26:37PM -0800, walt wrote:
 On 01/07/2014 04:47 PM, Sarah Sharp wrote:
  
  Can you send me the output of `sudo lspci -vvv -n`?  Maybe we can just
  turn off scatter-gather for your host controller until we get a proper
  fix in that uses link TRBs instead of no-op TRBs.
 
 The aftermarket usb3 adapter card and the usb3 outboard hard-drive docking
 station are the only two usb3 devices I have.
 
 I've wondered if my xhci problems might be caused by hardware quirks, and
 wondering why I seem to be the only one who has this problem.
 
 Maybe I could take one for the team by buying new hardware toys that I
 don't really need but I could use to test the xhci driver?  (I do enjoy
 buying new toys, necessary, or, um, maybe not :)

It would be appreciated if you could see if your device causes other
host controllers to fail.  Who am I to keep a geek from new toys? ;)

In the meantime, try this patch, which is something of a long shot.

Sarah Sharp
From 19e2ab85ac2cc0d84f56247dcf29bdce14bd70d5 Mon Sep 17 00:00:00 2001
From: Sarah Sharp sarah.a.sh...@linux.intel.com
Date: Thu, 9 Jan 2014 15:46:04 -0800
Subject: [PATCH] xhci: Enable Link TRB quirk for 0.96 ASMedia host.

A recent bug fix commit causes an ASMedia host to stop responding to
commands.  See if it needs the link TRB quirk.  This was generally only
necessary for 0.95 hosts, but maybe this 0.96 host needs it.

Signed-off-by: Sarah Sharp sarah.a.sh...@linux.intel.com
---
 drivers/usb/host/xhci-pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
index 3c898c12a06b..8196ac2289e4 100644
--- a/drivers/usb/host/xhci-pci.c
+++ b/drivers/usb/host/xhci-pci.c
@@ -92,6 +92,8 @@ static void xhci_pci_quirks(struct device *dev, struct xhci_hcd *xhci)
 		xhci-quirks |= XHCI_TRUST_TX_LENGTH;
 	}
 
+	if (pdev-vendor == PCI_VENDOR_ID_ASMEDIA  pdev-device == 1042)
+		xhci-quirks |= XHCI_LINK_TRB_QUIRK;
 	if (pdev-vendor == PCI_VENDOR_ID_NEC)
 		xhci-quirks |=

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-08 Thread walt

On 01/03/2014 03:29 PM, Sarah Sharp wrote:

> I'll let you know when I have some diagnostic patches ready.

Hi Sarah.  I see today gregkh committed the patches you've already sent
me, so I assume someone (other than me) has tested those patches and
discovered some benefit from them?

I'm still wondering if I'm suffering from hardware quirks.  From the
first day I installed my usb3 adapter card and the usb3 disk docking
station I've noticed some quirky behavior.

e.g. I boot the machine with the docking station powered-off, and then
later I power it on, the usb3 disk is not detected at all -- until I
reboot the machine with the docking station still powered on.

Minor stuff, yes, but maybe relevant?  I dunno.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-08 Thread David Laight

> From: Alan Stern
> On Wed, 8 Jan 2014, David Laight wrote:
> 
> > > From: Alan Stern
> > >
> > > This may be a foolish question, but why is xhci-hcd using no-op TRBs in
> > > the first place?
> >
> > Because it can't write in a link TRB because other parts of the
> > code use link TRBs to detect the end of the ring.
> >
> > The problem is that it can't put a link TRB in the middle of
> > a chain of data fragments unless it is at a 'suitable' offset
> > from the start of the data TD. Given arbitrary input fragmentation
> > this means that you can't put a link TRB in the middle of a TD.
> > (The documented alignment might be as high as 16kB.)
> >
> > If the rest of the code used a 'ring end pointer' then a link TRB
> > could be used instead.
> 
> I see.  Sounds like a poor design decision in hindsight.  Can it be
> changed?

Anything can be changed :-)
But it is a bit pervasive.
Padding out with nops isolated the change.

David



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-08 Thread Alan Stern

On Wed, 8 Jan 2014, David Laight wrote:

> > From: Alan Stern
> > 
> > This may be a foolish question, but why is xhci-hcd using no-op TRBs in
> > the first place?
> 
> Because it can't write in a link TRB because other parts of the
> code use link TRBs to detect the end of the ring.
> 
> The problem is that it can't put a link TRB in the middle of
> a chain of data fragments unless it is at a 'suitable' offset
> from the start of the data TD. Given arbitrary input fragmentation
> this means that you can't put a link TRB in the middle of a TD.
> (The documented alignment might be as high as 16kB.)
> 
> If the rest of the code used a 'ring end pointer' then a link TRB
> could be used instead.

I see.  Sounds like a poor design decision in hindsight.  Can it be
changed?

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-08 Thread David Laight

> From: Alan Stern
> 
> This may be a foolish question, but why is xhci-hcd using no-op TRBs in
> the first place?

Because it can't write in a link TRB because other parts of the
code use link TRBs to detect the end of the ring.

The problem is that it can't put a link TRB in the middle of
a chain of data fragments unless it is at a 'suitable' offset
from the start of the data TD. Given arbitrary input fragmentation
this means that you can't put a link TRB in the middle of a TD.
(The documented alignment might be as high as 16kB.)

If the rest of the code used a 'ring end pointer' then a link TRB
could be used instead.

David



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-08 Thread Alan Stern

On Tue, 7 Jan 2014, Sarah Sharp wrote:

> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
> > On 01/07/2014 01:21 PM, Sarah Sharp wrote:
> > 
> > > Can you please try the attached patch, on top of the previous three
> > > patches, and send me dmesg?
> > 
> > Hi Sarah, I just now finished running 0001-More-debugging.patch for the
> > first time.  The previous dmesg didn't include that patch, but this one
> > does.
> > 
> > I read through this dmesg but I nodded off somewhere around line 500.
> > I hope you can stay awake :)
> 
> Well, it has all the info I need, but the results don't make me too
> happy.  Everything I've checked seems consistent, and I don't know why
> the host stopped.  The link TRBs are intact, the dequeue pointer for the
> endpoint was pointing to the transfer that timed out and it had the
> cycle bit set correctly, etc.  Perhaps the no-op TRBs are really the
> issue.

This may be a foolish question, but why is xhci-hcd using no-op TRBs in 
the first place?

It makes sense that they would be needed if you have to unlink an URB 
that isn't the first one on the endpoint ring.  But usb-storage never 
does that; whenever it unlinks an URB, it always unlinks the earliest 
entry in the endpoint's queue.

After unlinking the first URB on the ring, you don't need to fill in
its TRBs with no-ops.  Instead, when you are ready to start the ring
againk, just tell the host controller to move the dequeue pointer up to
the start of the next surviving URB.  (You'll also have to adjust the
CYCLE bits of the TRBs that get skipped over, but that's trivial.)

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-08 Thread David Laight

> From: Sarah Sharp
> On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
> > On 01/07/2014 01:21 PM, Sarah Sharp wrote:
> >
> > > Can you please try the attached patch, on top of the previous three
> > > patches, and send me dmesg?
> >
> > Hi Sarah, I just now finished running 0001-More-debugging.patch for the
> > first time.  The previous dmesg didn't include that patch, but this one
> > does.
> >
> > I read through this dmesg but I nodded off somewhere around line 500.
> > I hope you can stay awake :)
> 
> Well, it has all the info I need, but the results don't make me too
> happy.  Everything I've checked seems consistent, and I don't know why
> the host stopped.  The link TRBs are intact, the dequeue pointer for the
> endpoint was pointing to the transfer that timed out and it had the
> cycle bit set correctly, etc.  Perhaps the no-op TRBs are really the
> issue.
> 
> I'll have to take a look at the log again tomorrow.  I posted the dmesg
> on pastebin if David wants to check it out as well:
> http://pastebin.com/a4AUpsL1

I can't see anything obvious either.
However there is no response to the 'stop endpoint' command.
Section 4.6.9 (page 107 of rev1.0) states that the controller will complete
any USB IN or OUT transaction before raising the command completion event.
Possibly it is too 'stuck' to complete the transaction?

The endpoint status is also still '1' (running).
This also means that the 'TR dequeue pointer' is undefined - so the
controller could easily be processing a later TRB.
This field might even still contain the ring base address written by
the driver much earlier.

This might mean that something 'catastrophic' has happened earlier.
Maybe the controller isn't actually seeing any doorbell writes at all.
Maybe the base addresses it has for the rings have all got corrupted.
At least this looks like amd64 - so there aren't memory coherency issues.

Some hacks that might help isolate the problem:
1) Request an interrupt from the last nop data TRB.
2) Put a command nop (decimal 23) TRB into the command ring before
   the 'stop endpoint'.
3) Comment out the code that adds the nop data TRBs.
The first two might need code adding to handle the responses.

Do we know the actual xhci device?
I think it reports version 0x96.
(Sarah - it might be useful if that version were in one of the trace
messages that is output by default.)

David

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-08 Thread David Laight

 From: Sarah Sharp
 On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
  On 01/07/2014 01:21 PM, Sarah Sharp wrote:
 
   Can you please try the attached patch, on top of the previous three
   patches, and send me dmesg?
 
  Hi Sarah, I just now finished running 0001-More-debugging.patch for the
  first time.  The previous dmesg didn't include that patch, but this one
  does.
 
  I read through this dmesg but I nodded off somewhere around line 500.
  I hope you can stay awake :)
 
 Well, it has all the info I need, but the results don't make me too
 happy.  Everything I've checked seems consistent, and I don't know why
 the host stopped.  The link TRBs are intact, the dequeue pointer for the
 endpoint was pointing to the transfer that timed out and it had the
 cycle bit set correctly, etc.  Perhaps the no-op TRBs are really the
 issue.
 
 I'll have to take a look at the log again tomorrow.  I posted the dmesg
 on pastebin if David wants to check it out as well:
 http://pastebin.com/a4AUpsL1

I can't see anything obvious either.
However there is no response to the 'stop endpoint' command.
Section 4.6.9 (page 107 of rev1.0) states that the controller will complete
any USB IN or OUT transaction before raising the command completion event.
Possibly it is too 'stuck' to complete the transaction?

The endpoint status is also still '1' (running).
This also means that the 'TR dequeue pointer' is undefined - so the
controller could easily be processing a later TRB.
This field might even still contain the ring base address written by
the driver much earlier.

This might mean that something 'catastrophic' has happened earlier.
Maybe the controller isn't actually seeing any doorbell writes at all.
Maybe the base addresses it has for the rings have all got corrupted.
At least this looks like amd64 - so there aren't memory coherency issues.

Some hacks that might help isolate the problem:
1) Request an interrupt from the last nop data TRB.
2) Put a command nop (decimal 23) TRB into the command ring before
   the 'stop endpoint'.
3) Comment out the code that adds the nop data TRBs.
The first two might need code adding to handle the responses.

Do we know the actual xhci device?
I think it reports version 0x96.
(Sarah - it might be useful if that version were in one of the trace
messages that is output by default.)

David


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-08 Thread Alan Stern

On Tue, 7 Jan 2014, Sarah Sharp wrote:

 On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
  On 01/07/2014 01:21 PM, Sarah Sharp wrote:
  
   Can you please try the attached patch, on top of the previous three
   patches, and send me dmesg?
  
  Hi Sarah, I just now finished running 0001-More-debugging.patch for the
  first time.  The previous dmesg didn't include that patch, but this one
  does.
  
  I read through this dmesg but I nodded off somewhere around line 500.
  I hope you can stay awake :)
 
 Well, it has all the info I need, but the results don't make me too
 happy.  Everything I've checked seems consistent, and I don't know why
 the host stopped.  The link TRBs are intact, the dequeue pointer for the
 endpoint was pointing to the transfer that timed out and it had the
 cycle bit set correctly, etc.  Perhaps the no-op TRBs are really the
 issue.

This may be a foolish question, but why is xhci-hcd using no-op TRBs in 
the first place?

It makes sense that they would be needed if you have to unlink an URB 
that isn't the first one on the endpoint ring.  But usb-storage never 
does that; whenever it unlinks an URB, it always unlinks the earliest 
entry in the endpoint's queue.

After unlinking the first URB on the ring, you don't need to fill in
its TRBs with no-ops.  Instead, when you are ready to start the ring
againk, just tell the host controller to move the dequeue pointer up to
the start of the next surviving URB.  (You'll also have to adjust the
CYCLE bits of the TRBs that get skipped over, but that's trivial.)

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-08 Thread David Laight

 From: Alan Stern
 
 This may be a foolish question, but why is xhci-hcd using no-op TRBs in
 the first place?

Because it can't write in a link TRB because other parts of the
code use link TRBs to detect the end of the ring.

The problem is that it can't put a link TRB in the middle of
a chain of data fragments unless it is at a 'suitable' offset
from the start of the data TD. Given arbitrary input fragmentation
this means that you can't put a link TRB in the middle of a TD.
(The documented alignment might be as high as 16kB.)

If the rest of the code used a 'ring end pointer' then a link TRB
could be used instead.

David



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-08 Thread Alan Stern

On Wed, 8 Jan 2014, David Laight wrote:

  From: Alan Stern
  
  This may be a foolish question, but why is xhci-hcd using no-op TRBs in
  the first place?
 
 Because it can't write in a link TRB because other parts of the
 code use link TRBs to detect the end of the ring.
 
 The problem is that it can't put a link TRB in the middle of
 a chain of data fragments unless it is at a 'suitable' offset
 from the start of the data TD. Given arbitrary input fragmentation
 this means that you can't put a link TRB in the middle of a TD.
 (The documented alignment might be as high as 16kB.)
 
 If the rest of the code used a 'ring end pointer' then a link TRB
 could be used instead.

I see.  Sounds like a poor design decision in hindsight.  Can it be
changed?

Alan Stern

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-08 Thread David Laight

 From: Alan Stern
 On Wed, 8 Jan 2014, David Laight wrote:

   From: Alan Stern

   This may be a foolish question, but why is xhci-hcd using no-op TRBs in
   the first place?

  Because it can't write in a link TRB because other parts of the
  code use link TRBs to detect the end of the ring.

  The problem is that it can't put a link TRB in the middle of
  a chain of data fragments unless it is at a 'suitable' offset
  from the start of the data TD. Given arbitrary input fragmentation
  this means that you can't put a link TRB in the middle of a TD.
  (The documented alignment might be as high as 16kB.)

  If the rest of the code used a 'ring end pointer' then a link TRB
  could be used instead.

 I see.  Sounds like a poor design decision in hindsight.  Can it be
 changed?

Anything can be changed :-)
But it is a bit pervasive.
Padding out with nops isolated the change.

David

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-08 Thread walt

On 01/03/2014 03:29 PM, Sarah Sharp wrote:

 I'll let you know when I have some diagnostic patches ready.

Hi Sarah.  I see today gregkh committed the patches you've already sent
me, so I assume someone (other than me) has tested those patches and
discovered some benefit from them?

I'm still wondering if I'm suffering from hardware quirks.  From the
first day I installed my usb3 adapter card and the usb3 disk docking
station I've noticed some quirky behavior.

e.g. I boot the machine with the docking station powered-off, and then
later I power it on, the usb3 disk is not detected at all -- until I
reboot the machine with the docking station still powered on.

Minor stuff, yes, but maybe relevant?  I dunno.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread Sarah Sharp

On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
> On 01/07/2014 01:21 PM, Sarah Sharp wrote:
> 
> > Can you please try the attached patch, on top of the previous three
> > patches, and send me dmesg?
> 
> Hi Sarah, I just now finished running 0001-More-debugging.patch for the
> first time.  The previous dmesg didn't include that patch, but this one
> does.
> 
> I read through this dmesg but I nodded off somewhere around line 500.
> I hope you can stay awake :)

Well, it has all the info I need, but the results don't make me too
happy.  Everything I've checked seems consistent, and I don't know why
the host stopped.  The link TRBs are intact, the dequeue pointer for the
endpoint was pointing to the transfer that timed out and it had the
cycle bit set correctly, etc.  Perhaps the no-op TRBs are really the
issue.

I'll have to take a look at the log again tomorrow.  I posted the dmesg
on pastebin if David wants to check it out as well:
http://pastebin.com/a4AUpsL1

Can you send me the output of `sudo lspci -vvv -n`?  Maybe we can just
turn off scatter-gather for your host controller until we get a proper
fix in that uses link TRBs instead of no-op TRBs.

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread Sarah Sharp

On Tue, Jan 07, 2014 at 12:00:00PM -0800, walt wrote:
> Okay, I used log_buf_len to make dmesg bigger and now I think I have
> the whole thing.  It's attached.

Walt, can you make sure the patch I sent you was applied?  The output
doesn't look like it is.

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread Sarah Sharp

On Tue, Jan 07, 2014 at 01:58:32PM +, David Laight wrote:
> The dmesg contains:
> 
> [  538.728064] EXT4-fs warning (device dm-0): ext4_end_bio:316: I/O error 
> writing to inode 23330865 (offset 0 size 8388608 starting block 812628)
> 
> An 8MB transfer will need at least 128 ring entries (TRB) even if the request
> is a single contiguous memory block.
> 
> Are you using the patch that increases the ring size from 64 to 256?

It's likely that the block layer is breaking up the EXT4 write into
several transfers, since usb-storage limits overall transfer size to
120 KB.  In any case, I added more debugging in the last patch to print
the number of TRBs necessary.  That way we can verify the patch to limit
the number of scatter-gather list entries is working.

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread Sarah Sharp

On Tue, Jan 07, 2014 at 05:29:48AM -0800, walt wrote:
> On 01/06/2014 04:31 PM, Sarah Sharp wrote:
> > Hi Walt,
> > 
> > I have a couple of patches for you to test.
> 
> > Please only apply the first patch (which is diagnostic only), trigger
> > your issue, and send me the resulting dmesg.  Then try applying the
> > other two patches, and see if the issue goes away.  (I suspect it won't
> > but I can't be sure.)
> 
> Thanks Sarah.  dmesg0 is from the diagnostic patch only.  dmesg1 has all
> three patches applied.  Some of the messages in dmesg1 fell off the end of
> the kernel buffer, so I may need to make the buffer larger next time but
> I'll need a reminder of how to do it.

Set CONFIG_LOG_BUF_SHIFT to 21.

> As you suspected, the patches didn't fix the problem, sorry.

Yep, I thought so.  I did glean one bit of information from the logs: it
seems that your host does handle no-op TRBs, at least for a while.
However, after a bigger chunk of TRBs, it goes off into la-la-land.

Assuming one of the rings is comprised of two segments:
0xbb711000 (start)
0xbb7113f0 (end)
0xbb711400 (start)
0xbb7117f0 (end)

The log show no-ops were inserted at:
0xbb7207d0
0xbb7206a0
0xbb720be0
0xbb720be0
0xbb720bd0
0xbb7207e0
0xbb711370 = 8 no-ops
0xbb7117c0 = 3
0xbb7113b0 = 4
0xbb7113a0 = 5
0xbb7117d0 = 2
0xbb711340 = 11
0xbb711770 = 8
0xbb711230 = 28
0xbb7117e0 = 1
0xbb7117b0 = 4
0xbb7113d0 = 2
0xbb7117b0 = 4
0xbb711340 = 11
0xbb711690 = 22

So the host was able to process 28 no-op TRBs, but failed on 22 no-ops
later.  The event ring debugging shows the last event was for
0xbb711680, which is the last TRB before the first no-op inserted before
the host died.  There's no Stop Endpoint Command completion, and it
looks like the command was correctly put on the command ring, so it
seems the host is actually hanging for some reason.

Unfortunately, I made a mistake in the debugging patch I sent
you, so it didn't print out the endpoint rings when the host died.  I
need that info, to see whether the link TRB was still intact, or if we
over-wrote it and caused the host to go fetch some invalid memory.

Can you please try the attached patch, on top of the previous three
patches, and send me dmesg?

> I find that I can tell in advance whether the copy is going to succeed,
> just by watching the light flicker on the usb3 drive.  When the flicker
> is absolutely regular, with no variation whatever, I can tell in 10 or
> 15 seconds that the copy will fail.
> 
> At the same time the light on the main drive goes dark after 10 seconds,
> implying that the usb3 drive stops receiving any data from the main drive
> after 10 seconds, yet the light on the usb3 drive continues to flicker as
> if writing data -- even after the cp officially fails.  The light on the
> usb3 drive never stops flickering until I reboot the machine or unplug
> the usb cable.

Interesting.  Without a USB analyzer, we can't really tell what's
happening.  However, one hypothesis could be that the blinking light is
triggered by an active SCSI command (read/write, etc).

There are three phases of the command: setup, data, and status.  I think
your device is getting the setup phase, and the host is dying before it
sends the data phase.  If the light blinks when it gets a setup phase,
and turns off when the devices sends a status phase, that would explain
its behavior.

But that's just a hypothesis, I have no idea whether it's correct.

Sarah Sharp
>From d085fb5b9630e935d7954fe5947b48402e43bdc1 Mon Sep 17 00:00:00 2001
From: Sarah Sharp 
Date: Tue, 7 Jan 2014 12:39:47 -0800
Subject: [PATCH] More debugging.

Signed-off-by: Sarah Sharp 
---
 drivers/usb/host/xhci-ring.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 2afaf15009e8..228ab8cf868e 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -993,6 +993,9 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
 	for (i = 0; i < MAX_HC_SLOTS; i++) {
 		if (!xhci->devs[i])
 			continue;
+
+		xhci_dbg(xhci, "Slot %d output context\n", i);
+		xhci_dbg_ctx(xhci, xhci->devs[i]->out_ctx, 30);
 		for (j = 0; j < 31; j++) {
 			temp_ep = >devs[i]->eps[j];
 			ring = temp_ep->ring;
@@ -1001,6 +1004,10 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
 			xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
 	"Killing URBs for slot ID %u, "
 	"ep index %u", i, j);
+			xhci_dbg(xhci, "Dev %i Ep 0x%x:\n", i,
+	xhci_get_endpoint_address(j));
+			xhci_debug_ring(xhci, ring);
+			xhci_dbg_ring_ptrs(xhci, ring);
 			while (!list_empty(>td_list)) {
 cur_td = list_first_entry(>td_list,
 		struct xhci_td,
@@ -1011,12 +1018,6 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
 xhci_giveback_urb_in_irq(xhci, cur_td,
 		-ESHUTDOWN, "killed");
 			}
-			if (!list_empty(_ep->cancelled_td_list)) {
-xhci_dbg(xhci, "Dev %i Ep 0x%x:\n", i,
-

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread walt

Okay, I used log_buf_len to make dmesg bigger and now I think I have
the whole thing.  It's attached.


dmesg2.gz
Description: application/gzip

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread walt

On 01/07/2014 05:58 AM, David Laight wrote:
> The dmesg contains:
> 
> [  538.728064] EXT4-fs warning (device dm-0): ext4_end_bio:316: I/O error 
> writing to inode 23330865 (offset 0 size 8388608 starting block 812628)
> 
> An 8MB transfer will need at least 128 ring entries (TRB) even if the request
> is a single contiguous memory block.
> 
> Are you using the patch that increases the ring size from 64 to 256?

Yes, 256.  I'll work on getting more debugging output.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread David Laight

The dmesg contains:

[  538.728064] EXT4-fs warning (device dm-0): ext4_end_bio:316: I/O error 
writing to inode 23330865 (offset 0 size 8388608 starting block 812628)

An 8MB transfer will need at least 128 ring entries (TRB) even if the request
is a single contiguous memory block.

Are you using the patch that increases the ring size from 64 to 256?

David

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread David Laight

> From: walt
...
> Thanks Sarah.  dmesg0 is from the diagnostic patch only.  dmesg1 has all
> three patches applied.  Some of the messages in dmesg1 fell off the end of
> the kernel buffer, so I may need to make the buffer larger next time but
> I'll need a reminder of how to do it.
> 
> As you suspected, the patches didn't fix the problem, sorry.

I think Sarah will want the traces of the transfer being setup (ie just
before the error). Some 'normal' completing transfers are also useful.

You might also find the full trace output in one of the files in /var/log.

David

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread walt

On 01/06/2014 04:31 PM, Sarah Sharp wrote:
> On Fri, Jan 03, 2014 at 03:29:29PM -0800, Sarah Sharp wrote:

>>  With the dmesg, I can finally see what happened:
>>
>> [  188.703059] xhci_hcd :03:00.0: Cancel URB 8800b7d2e0c0, dev 1, ep 
>> 0x2, starting at offset 0xbb7b9000
>> [  188.703072] xhci_hcd :03:00.0: // Ding dong!
>> [  193.711022] xhci_hcd :03:00.0: xHCI host not responding to stop 
>> endpoint command.
>> [  193.711029] xhci_hcd :03:00.0: Assuming host is dying, halting host.
>> [  193.711046] xhci_hcd :03:00.0: // Halt the HC
>> [  193.711060] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 0
>> [  193.711066] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 2
>> [  193.711078] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 3
>> [  193.711096] xhci_hcd :03:00.0: Calling usb_hc_died()
>> [  193.711103] xhci_hcd :03:00.0: HC died; cleaning up
>> [  193.76] xhci_hcd :03:00.0: xHCI host controller is dead.
>>
>> It seems that the xHCI driver tried to stop the endpoint ring in order
>> to cancel a SCSI transfer, and the driver never got a response for that.
>>
>> The offset is rather suspicious (0xbb7b9000), and it probably means the
>> driver attempted to cancel a transfer that had been moved to the
>> beginning of the ring segment, with no-op TRBs before the link TRB.
>>
>> I suspect David's patch triggers a bug in the command cancellation code.
>> There's also the unlikely possibility that the no-op TRBs did indeed
>> cause the host to hang.  Either way, I'll have to look into it.
>>
>> I'll let you know when I have some diagnostic patches ready.
> 
> Hi Walt,
> 
> I have a couple of patches for you to test.

> Please only apply the first patch (which is diagnostic only), trigger
> your issue, and send me the resulting dmesg.  Then try applying the
> other two patches, and see if the issue goes away.  (I suspect it won't
> but I can't be sure.)

Thanks Sarah.  dmesg0 is from the diagnostic patch only.  dmesg1 has all
three patches applied.  Some of the messages in dmesg1 fell off the end of
the kernel buffer, so I may need to make the buffer larger next time but
I'll need a reminder of how to do it.

As you suspected, the patches didn't fix the problem, sorry.

I find that I can tell in advance whether the copy is going to succeed,
just by watching the light flicker on the usb3 drive.  When the flicker
is absolutely regular, with no variation whatever, I can tell in 10 or
15 seconds that the copy will fail.

At the same time the light on the main drive goes dark after 10 seconds,
implying that the usb3 drive stops receiving any data from the main drive
after 10 seconds, yet the light on the usb3 drive continues to flicker as
if writing data -- even after the cp officially fails.  The light on the
usb3 drive never stops flickering until I reboot the machine or unplug
the usb cable.

dmesg0.gz
Description: application/gzip

dmesg1.gz
Description: application/gzip

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread walt

On 01/06/2014 04:31 PM, Sarah Sharp wrote:
 On Fri, Jan 03, 2014 at 03:29:29PM -0800, Sarah Sharp wrote:

  With the dmesg, I can finally see what happened:

 [  188.703059] xhci_hcd :03:00.0: Cancel URB 8800b7d2e0c0, dev 1, ep 
 0x2, starting at offset 0xbb7b9000
 [  188.703072] xhci_hcd :03:00.0: // Ding dong!
 [  193.711022] xhci_hcd :03:00.0: xHCI host not responding to stop 
 endpoint command.
 [  193.711029] xhci_hcd :03:00.0: Assuming host is dying, halting host.
 [  193.711046] xhci_hcd :03:00.0: // Halt the HC
 [  193.711060] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 0
 [  193.711066] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 2
 [  193.711078] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 3
 [  193.711096] xhci_hcd :03:00.0: Calling usb_hc_died()
 [  193.711103] xhci_hcd :03:00.0: HC died; cleaning up
 [  193.76] xhci_hcd :03:00.0: xHCI host controller is dead.

 It seems that the xHCI driver tried to stop the endpoint ring in order
 to cancel a SCSI transfer, and the driver never got a response for that.

 The offset is rather suspicious (0xbb7b9000), and it probably means the
 driver attempted to cancel a transfer that had been moved to the
 beginning of the ring segment, with no-op TRBs before the link TRB.

 I suspect David's patch triggers a bug in the command cancellation code.
 There's also the unlikely possibility that the no-op TRBs did indeed
 cause the host to hang.  Either way, I'll have to look into it.

 I'll let you know when I have some diagnostic patches ready.
 
 Hi Walt,
 
 I have a couple of patches for you to test.

 Please only apply the first patch (which is diagnostic only), trigger
 your issue, and send me the resulting dmesg.  Then try applying the
 other two patches, and see if the issue goes away.  (I suspect it won't
 but I can't be sure.)

Thanks Sarah.  dmesg0 is from the diagnostic patch only.  dmesg1 has all
three patches applied.  Some of the messages in dmesg1 fell off the end of
the kernel buffer, so I may need to make the buffer larger next time but
I'll need a reminder of how to do it.

As you suspected, the patches didn't fix the problem, sorry.

I find that I can tell in advance whether the copy is going to succeed,
just by watching the light flicker on the usb3 drive.  When the flicker
is absolutely regular, with no variation whatever, I can tell in 10 or
15 seconds that the copy will fail.

At the same time the light on the main drive goes dark after 10 seconds,
implying that the usb3 drive stops receiving any data from the main drive
after 10 seconds, yet the light on the usb3 drive continues to flicker as
if writing data -- even after the cp officially fails.  The light on the
usb3 drive never stops flickering until I reboot the machine or unplug
the usb cable.



dmesg0.gz
Description: application/gzip


dmesg1.gz
Description: application/gzip

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread David Laight

 From: walt
...
 Thanks Sarah.  dmesg0 is from the diagnostic patch only.  dmesg1 has all
 three patches applied.  Some of the messages in dmesg1 fell off the end of
 the kernel buffer, so I may need to make the buffer larger next time but
 I'll need a reminder of how to do it.
 
 As you suspected, the patches didn't fix the problem, sorry.

I think Sarah will want the traces of the transfer being setup (ie just
before the error). Some 'normal' completing transfers are also useful.

You might also find the full trace output in one of the files in /var/log.

David

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread David Laight

The dmesg contains:

[  538.728064] EXT4-fs warning (device dm-0): ext4_end_bio:316: I/O error 
writing to inode 23330865 (offset 0 size 8388608 starting block 812628)

An 8MB transfer will need at least 128 ring entries (TRB) even if the request
is a single contiguous memory block.

Are you using the patch that increases the ring size from 64 to 256?

David

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread walt

On 01/07/2014 05:58 AM, David Laight wrote:
 The dmesg contains:
 
 [  538.728064] EXT4-fs warning (device dm-0): ext4_end_bio:316: I/O error 
 writing to inode 23330865 (offset 0 size 8388608 starting block 812628)
 
 An 8MB transfer will need at least 128 ring entries (TRB) even if the request
 is a single contiguous memory block.
 
 Are you using the patch that increases the ring size from 64 to 256?

Yes, 256.  I'll work on getting more debugging output.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread walt

Okay, I used log_buf_len to make dmesg bigger and now I think I have
the whole thing.  It's attached.


dmesg2.gz
Description: application/gzip

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread Sarah Sharp

On Tue, Jan 07, 2014 at 05:29:48AM -0800, walt wrote:
 On 01/06/2014 04:31 PM, Sarah Sharp wrote:
  Hi Walt,
  
  I have a couple of patches for you to test.
 
  Please only apply the first patch (which is diagnostic only), trigger
  your issue, and send me the resulting dmesg.  Then try applying the
  other two patches, and see if the issue goes away.  (I suspect it won't
  but I can't be sure.)
 
 Thanks Sarah.  dmesg0 is from the diagnostic patch only.  dmesg1 has all
 three patches applied.  Some of the messages in dmesg1 fell off the end of
 the kernel buffer, so I may need to make the buffer larger next time but
 I'll need a reminder of how to do it.

Set CONFIG_LOG_BUF_SHIFT to 21.

 As you suspected, the patches didn't fix the problem, sorry.

Yep, I thought so.  I did glean one bit of information from the logs: it
seems that your host does handle no-op TRBs, at least for a while.
However, after a bigger chunk of TRBs, it goes off into la-la-land.

Assuming one of the rings is comprised of two segments:
0xbb711000 (start)
0xbb7113f0 (end)
0xbb711400 (start)
0xbb7117f0 (end)

The log show no-ops were inserted at:
0xbb7207d0
0xbb7206a0
0xbb720be0
0xbb720be0
0xbb720bd0
0xbb7207e0
0xbb711370 = 8 no-ops
0xbb7117c0 = 3
0xbb7113b0 = 4
0xbb7113a0 = 5
0xbb7117d0 = 2
0xbb711340 = 11
0xbb711770 = 8
0xbb711230 = 28
0xbb7117e0 = 1
0xbb7117b0 = 4
0xbb7113d0 = 2
0xbb7117b0 = 4
0xbb711340 = 11
0xbb711690 = 22

So the host was able to process 28 no-op TRBs, but failed on 22 no-ops
later.  The event ring debugging shows the last event was for
0xbb711680, which is the last TRB before the first no-op inserted before
the host died.  There's no Stop Endpoint Command completion, and it
looks like the command was correctly put on the command ring, so it
seems the host is actually hanging for some reason.

Unfortunately, I made a mistake in the debugging patch I sent
you, so it didn't print out the endpoint rings when the host died.  I
need that info, to see whether the link TRB was still intact, or if we
over-wrote it and caused the host to go fetch some invalid memory.

Can you please try the attached patch, on top of the previous three
patches, and send me dmesg?

 I find that I can tell in advance whether the copy is going to succeed,
 just by watching the light flicker on the usb3 drive.  When the flicker
 is absolutely regular, with no variation whatever, I can tell in 10 or
 15 seconds that the copy will fail.
 
 At the same time the light on the main drive goes dark after 10 seconds,
 implying that the usb3 drive stops receiving any data from the main drive
 after 10 seconds, yet the light on the usb3 drive continues to flicker as
 if writing data -- even after the cp officially fails.  The light on the
 usb3 drive never stops flickering until I reboot the machine or unplug
 the usb cable.

Interesting.  Without a USB analyzer, we can't really tell what's
happening.  However, one hypothesis could be that the blinking light is
triggered by an active SCSI command (read/write, etc).

There are three phases of the command: setup, data, and status.  I think
your device is getting the setup phase, and the host is dying before it
sends the data phase.  If the light blinks when it gets a setup phase,
and turns off when the devices sends a status phase, that would explain
its behavior.

But that's just a hypothesis, I have no idea whether it's correct.

Sarah Sharp
From d085fb5b9630e935d7954fe5947b48402e43bdc1 Mon Sep 17 00:00:00 2001
From: Sarah Sharp sarah.a.sh...@linux.intel.com
Date: Tue, 7 Jan 2014 12:39:47 -0800
Subject: [PATCH] More debugging.

Signed-off-by: Sarah Sharp sarah.a.sh...@linux.intel.com
---
 drivers/usb/host/xhci-ring.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 2afaf15009e8..228ab8cf868e 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -993,6 +993,9 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
 	for (i = 0; i  MAX_HC_SLOTS; i++) {
 		if (!xhci-devs[i])
 			continue;
+
+		xhci_dbg(xhci, Slot %d output context\n, i);
+		xhci_dbg_ctx(xhci, xhci-devs[i]-out_ctx, 30);
 		for (j = 0; j  31; j++) {
 			temp_ep = xhci-devs[i]-eps[j];
 			ring = temp_ep-ring;
@@ -1001,6 +1004,10 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
 			xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
 	Killing URBs for slot ID %u, 
 	ep index %u, i, j);
+			xhci_dbg(xhci, Dev %i Ep 0x%x:\n, i,
+	xhci_get_endpoint_address(j));
+			xhci_debug_ring(xhci, ring);
+			xhci_dbg_ring_ptrs(xhci, ring);
 			while (!list_empty(ring-td_list)) {
 cur_td = list_first_entry(ring-td_list,
 		struct xhci_td,
@@ -1011,12 +1018,6 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
 xhci_giveback_urb_in_irq(xhci, cur_td,
 		-ESHUTDOWN, killed);
 			}
-			if (!list_empty(temp_ep-cancelled_td_list)) {
-xhci_dbg(xhci, Dev %i Ep 0x%x:\n, i,
-

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread Sarah Sharp

On Tue, Jan 07, 2014 at 01:58:32PM +, David Laight wrote:
 The dmesg contains:
 
 [  538.728064] EXT4-fs warning (device dm-0): ext4_end_bio:316: I/O error 
 writing to inode 23330865 (offset 0 size 8388608 starting block 812628)
 
 An 8MB transfer will need at least 128 ring entries (TRB) even if the request
 is a single contiguous memory block.
 
 Are you using the patch that increases the ring size from 64 to 256?

It's likely that the block layer is breaking up the EXT4 write into
several transfers, since usb-storage limits overall transfer size to
120 KB.  In any case, I added more debugging in the last patch to print
the number of TRBs necessary.  That way we can verify the patch to limit
the number of scatter-gather list entries is working.

Sarah Sharp
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread Sarah Sharp

On Tue, Jan 07, 2014 at 12:00:00PM -0800, walt wrote:
 Okay, I used log_buf_len to make dmesg bigger and now I think I have
 the whole thing.  It's attached.

Walt, can you make sure the patch I sent you was applied?  The output
doesn't look like it is.

Sarah Sharp
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-07 Thread Sarah Sharp

On Tue, Jan 07, 2014 at 03:57:00PM -0800, walt wrote:
 On 01/07/2014 01:21 PM, Sarah Sharp wrote:
 
  Can you please try the attached patch, on top of the previous three
  patches, and send me dmesg?
 
 Hi Sarah, I just now finished running 0001-More-debugging.patch for the
 first time.  The previous dmesg didn't include that patch, but this one
 does.
 
 I read through this dmesg but I nodded off somewhere around line 500.
 I hope you can stay awake :)

Well, it has all the info I need, but the results don't make me too
happy.  Everything I've checked seems consistent, and I don't know why
the host stopped.  The link TRBs are intact, the dequeue pointer for the
endpoint was pointing to the transfer that timed out and it had the
cycle bit set correctly, etc.  Perhaps the no-op TRBs are really the
issue.

I'll have to take a look at the log again tomorrow.  I posted the dmesg
on pastebin if David wants to check it out as well:
http://pastebin.com/a4AUpsL1

Can you send me the output of `sudo lspci -vvv -n`?  Maybe we can just
turn off scatter-gather for your host controller until we get a proper
fix in that uses link TRBs instead of no-op TRBs.

Sarah Sharp
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-06 Thread Sarah Sharp

On Fri, Jan 03, 2014 at 03:29:29PM -0800, Sarah Sharp wrote:
> On Fri, Jan 03, 2014 at 01:21:18PM -0800, walt wrote:
> > I'm so sorry Sarah, that was another mistake.  The mistake is so stupid I'm 
> > not
> > going to publish it here :(
> > 
> > Once I finally ran the kernel with debugging actually compiled in, dmesg 
> > contains
> > xhci debugging messages.  Wow :)
> > 
> > It's a big file so I zipped and attached it, which I hope is acceptable in 
> > lkml.
> 
> Yep, that's fine.  Sticking it in pastebin (or up on your server) is
> also fine, if it gets really big.
> 
> > BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said 
> > before
> > doesn't fix the problem.  The cp stopped around 7GB just as before.
> > 
> > Sorry for the noise...
> 
> No worries! :)  With the dmesg, I can finally see what happened:
> 
> [  188.703059] xhci_hcd :03:00.0: Cancel URB 8800b7d2e0c0, dev 1, ep 
> 0x2, starting at offset 0xbb7b9000
> [  188.703072] xhci_hcd :03:00.0: // Ding dong!
> [  193.711022] xhci_hcd :03:00.0: xHCI host not responding to stop 
> endpoint command.
> [  193.711029] xhci_hcd :03:00.0: Assuming host is dying, halting host.
> [  193.711046] xhci_hcd :03:00.0: // Halt the HC
> [  193.711060] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 0
> [  193.711066] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 2
> [  193.711078] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 3
> [  193.711096] xhci_hcd :03:00.0: Calling usb_hc_died()
> [  193.711103] xhci_hcd :03:00.0: HC died; cleaning up
> [  193.76] xhci_hcd :03:00.0: xHCI host controller is dead.
> 
> It seems that the xHCI driver tried to stop the endpoint ring in order
> to cancel a SCSI transfer, and the driver never got a response for that.
> 
> The offset is rather suspicious (0xbb7b9000), and it probably means the
> driver attempted to cancel a transfer that had been moved to the
> beginning of the ring segment, with no-op TRBs before the link TRB.
> 
> I suspect David's patch triggers a bug in the command cancellation code.
> There's also the unlikely possibility that the no-op TRBs did indeed
> cause the host to hang.  Either way, I'll have to look into it.
> 
> I'll let you know when I have some diagnostic patches ready.

Hi Walt,

I have a couple of patches for you to test.  You can either apply the
attached three patches, or you can pull down a kernel with:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci.git -b 
3.12-td-fragment-failure

Please only apply the first patch (which is diagnostic only), trigger
your issue, and send me the resulting dmesg.  Then try applying the
other two patches, and see if the issue goes away.  (I suspect it won't
but I can't be sure.)

Sarah Sharp
>From 0261dcd2711c010d786dcd940803a44e1bc19512 Mon Sep 17 00:00:00 2001
From: Sarah Sharp 
Date: Mon, 6 Jan 2014 16:06:27 -0800
Subject: [PATCH 1/3] TD fragment debugging

Signed-off-by: Sarah Sharp 
---
 drivers/usb/host/xhci-ring.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 55fc0c39b7e1..d05f61dc8359 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -982,6 +982,14 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
 		 * doesn't touch the memory.
 		 */
 	}
+
+	xhci_dbg(xhci, "Command ring:\n");
+	xhci_debug_ring(xhci, xhci->cmd_ring);
+	xhci_dbg_ring_ptrs(xhci, xhci->cmd_ring);
+	xhci_dbg(xhci, "Event ring:\n");
+	xhci_debug_ring(xhci, xhci->event_ring);
+	xhci_dbg_ring_ptrs(xhci, xhci->event_ring);
+
 	for (i = 0; i < MAX_HC_SLOTS; i++) {
 		if (!xhci->devs[i])
 			continue;
@@ -1003,6 +1011,12 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
 xhci_giveback_urb_in_irq(xhci, cur_td,
 		-ESHUTDOWN, "killed");
 			}
+			if (!list_empty(_ep->cancelled_td_list)) {
+xhci_dbg(xhci, "Dev %i Ep 0x%x:\n", i,
+		xhci_get_endpoint_address(j));
+xhci_debug_ring(xhci, ring);
+xhci_dbg_ring_ptrs(xhci, ring);
+			}
 			while (!list_empty(_ep->cancelled_td_list)) {
 cur_td = list_first_entry(
 		_ep->cancelled_td_list,
@@ -2966,6 +2980,10 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,
 		num_trbs, TRBS_PER_SEGMENT - 1);
 return -ENOMEM;
 			}
+			xhci_dbg(xhci, "Insert no-op TRBs at 0x%llx\n",
+	(unsigned long long)
+	xhci_trb_virt_to_dma(ep_ring->enq_seg,
+		ep_ring->enqueue));
 
 			nop_cmd = cpu_to_le32(TRB_TYPE(TRB_TR_NOOP) |
 	ep_ring->cycle_state);
-- 
1.8.3.3

>From 380071d6fa2430c7141faefc8acfc0909c75a0ed Mon Sep 17 00:00:00 2001
From: Ben Hutchings 
Date: Mon, 6 Jan 2014 03:16:32 +
Subject: [PATCH 2/3] xhci: Avoid infinite loop when sg urb requires too many
 trbs

Currently prepare_ring() returns -ENOMEM if the urb won't fit into a
single ring segment.  usb_sg_wait() treats this error as a temporary
condition and will

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-06 Thread David Laight

> From: walt
...
> /* Accept arbitrarily long scatter-gather lists */
> -   hcd->self.sg_tablesize = ~0;
> +   hcd->self.sg_tablesize = 31;

Even if that reduces the number of fragments passed to the xhci driver
it may not be enough to limit the actual number of fragments that
need to be placed in the ring itself.
The xhci driver has to split every fragment on any 64k address boundary.

One possibility is to scan long SG lists to see it they are actually
problematic. If all the fragments are suitably aligned let them through
(without any nops).

My gut feeling is that problems only arise when the ring end isn't at
a 1k boundary in the data.

So provided all the fragments are multiples of 1k (after splitting on 64k
boundaries) the transfer will be processed correctly.
Alternatively, if the fragments are all longer than 1k (after the 64k split),
the one that crosses the ring end can be split in two.

A quick 'fix' would be to assume that anything with too many fragments is
probably ok - maybe check the first fragment is suitably aligned.
That would recover the old behaviour for usb disk transfers with a lot
of fragments - yes it is a hack...

David

RE: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-06 Thread David Laight

 From: walt
...
 /* Accept arbitrarily long scatter-gather lists */
 -   hcd-self.sg_tablesize = ~0;
 +   hcd-self.sg_tablesize = 31;

Even if that reduces the number of fragments passed to the xhci driver
it may not be enough to limit the actual number of fragments that
need to be placed in the ring itself.
The xhci driver has to split every fragment on any 64k address boundary.

One possibility is to scan long SG lists to see it they are actually
problematic. If all the fragments are suitably aligned let them through
(without any nops).

My gut feeling is that problems only arise when the ring end isn't at
a 1k boundary in the data.

So provided all the fragments are multiples of 1k (after splitting on 64k
boundaries) the transfer will be processed correctly.
Alternatively, if the fragments are all longer than 1k (after the 64k split),
the one that crosses the ring end can be split in two.

A quick 'fix' would be to assume that anything with too many fragments is
probably ok - maybe check the first fragment is suitably aligned.
That would recover the old behaviour for usb disk transfers with a lot
of fragments - yes it is a hack...

David

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-06 Thread Sarah Sharp

On Fri, Jan 03, 2014 at 03:29:29PM -0800, Sarah Sharp wrote:
 On Fri, Jan 03, 2014 at 01:21:18PM -0800, walt wrote:
  I'm so sorry Sarah, that was another mistake.  The mistake is so stupid I'm 
  not
  going to publish it here :(
  
  Once I finally ran the kernel with debugging actually compiled in, dmesg 
  contains
  xhci debugging messages.  Wow :)
  
  It's a big file so I zipped and attached it, which I hope is acceptable in 
  lkml.
 
 Yep, that's fine.  Sticking it in pastebin (or up on your server) is
 also fine, if it gets really big.
 
  BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said 
  before
  doesn't fix the problem.  The cp stopped around 7GB just as before.
  
  Sorry for the noise...
 
 No worries! :)  With the dmesg, I can finally see what happened:
 
 [  188.703059] xhci_hcd :03:00.0: Cancel URB 8800b7d2e0c0, dev 1, ep 
 0x2, starting at offset 0xbb7b9000
 [  188.703072] xhci_hcd :03:00.0: // Ding dong!
 [  193.711022] xhci_hcd :03:00.0: xHCI host not responding to stop 
 endpoint command.
 [  193.711029] xhci_hcd :03:00.0: Assuming host is dying, halting host.
 [  193.711046] xhci_hcd :03:00.0: // Halt the HC
 [  193.711060] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 0
 [  193.711066] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 2
 [  193.711078] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 3
 [  193.711096] xhci_hcd :03:00.0: Calling usb_hc_died()
 [  193.711103] xhci_hcd :03:00.0: HC died; cleaning up
 [  193.76] xhci_hcd :03:00.0: xHCI host controller is dead.
 
 It seems that the xHCI driver tried to stop the endpoint ring in order
 to cancel a SCSI transfer, and the driver never got a response for that.
 
 The offset is rather suspicious (0xbb7b9000), and it probably means the
 driver attempted to cancel a transfer that had been moved to the
 beginning of the ring segment, with no-op TRBs before the link TRB.
 
 I suspect David's patch triggers a bug in the command cancellation code.
 There's also the unlikely possibility that the no-op TRBs did indeed
 cause the host to hang.  Either way, I'll have to look into it.
 
 I'll let you know when I have some diagnostic patches ready.

Hi Walt,

I have a couple of patches for you to test.  You can either apply the
attached three patches, or you can pull down a kernel with:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci.git -b 
3.12-td-fragment-failure

Please only apply the first patch (which is diagnostic only), trigger
your issue, and send me the resulting dmesg.  Then try applying the
other two patches, and see if the issue goes away.  (I suspect it won't
but I can't be sure.)

Sarah Sharp
From 0261dcd2711c010d786dcd940803a44e1bc19512 Mon Sep 17 00:00:00 2001
From: Sarah Sharp sarah.a.sh...@linux.intel.com
Date: Mon, 6 Jan 2014 16:06:27 -0800
Subject: [PATCH 1/3] TD fragment debugging

Signed-off-by: Sarah Sharp sarah.a.sh...@linux.intel.com
---
 drivers/usb/host/xhci-ring.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 55fc0c39b7e1..d05f61dc8359 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -982,6 +982,14 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
 		 * doesn't touch the memory.
 		 */
 	}
+
+	xhci_dbg(xhci, Command ring:\n);
+	xhci_debug_ring(xhci, xhci-cmd_ring);
+	xhci_dbg_ring_ptrs(xhci, xhci-cmd_ring);
+	xhci_dbg(xhci, Event ring:\n);
+	xhci_debug_ring(xhci, xhci-event_ring);
+	xhci_dbg_ring_ptrs(xhci, xhci-event_ring);
+
 	for (i = 0; i  MAX_HC_SLOTS; i++) {
 		if (!xhci-devs[i])
 			continue;
@@ -1003,6 +1011,12 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
 xhci_giveback_urb_in_irq(xhci, cur_td,
 		-ESHUTDOWN, killed);
 			}
+			if (!list_empty(temp_ep-cancelled_td_list)) {
+xhci_dbg(xhci, Dev %i Ep 0x%x:\n, i,
+		xhci_get_endpoint_address(j));
+xhci_debug_ring(xhci, ring);
+xhci_dbg_ring_ptrs(xhci, ring);
+			}
 			while (!list_empty(temp_ep-cancelled_td_list)) {
 cur_td = list_first_entry(
 		temp_ep-cancelled_td_list,
@@ -2966,6 +2980,10 @@ static int prepare_ring(struct xhci_hcd *xhci, struct xhci_ring *ep_ring,
 		num_trbs, TRBS_PER_SEGMENT - 1);
 return -ENOMEM;
 			}
+			xhci_dbg(xhci, Insert no-op TRBs at 0x%llx\n,
+	(unsigned long long)
+	xhci_trb_virt_to_dma(ep_ring-enq_seg,
+		ep_ring-enqueue));
 
 			nop_cmd = cpu_to_le32(TRB_TYPE(TRB_TR_NOOP) |
 	ep_ring-cycle_state);
-- 
1.8.3.3

From 380071d6fa2430c7141faefc8acfc0909c75a0ed Mon Sep 17 00:00:00 2001
From: Ben Hutchings b...@decadent.org.uk
Date: Mon, 6 Jan 2014 03:16:32 +
Subject: [PATCH 2/3] xhci: Avoid infinite loop when sg urb requires too many
 trbs

Currently prepare_ring() returns -ENOMEM if the urb won't fit into a
single ring segment.  usb_sg_wait() treats this error as a temporary
condition and will

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-04 Thread Mark Lord

On 14-01-03 10:40 AM, walt wrote:
> On 01/02/2014 11:15 AM, Sarah Sharp wrote:
>> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
>>> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
 3.12-stable review patch.  If anyone has any objections, please let me 
 know.

 --

 From: David Laight 

 commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.

 Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link 
 TRB
 can only occur at a boundary between underlying USB frames (512 bytes for
 high speed devices).

 If this isn't done the USB frames aren't formatted correctly and, for 
 example,
 the USB3 ethernet ax88179_178a card will stop sending...
>>>
>>>
>>> Unfortunately this patch causes a regression when copying large files to my
>>> outboard USB3 drive. (Nothing at all to do with networking.)
> 
>> Do you have CONFIG_USB_DEBUG turned on for 3.13?  If so, you should see
>> dmesg output from this statement shortly before your drive fails:
>>
>> if (num_trbs >= TRBS_PER_SEGMENT) {
>>  xhci_err(xhci, "Too many fragments %d, max %d\n",
>>  num_trbs, TRBS_PER_SEGMENT - 1);
>>  return -ENOMEM;
>> }
> 
> Well, the answers depend on whether the usb3 drive uses logical volumes or not
> (lvm2), which I can't explain.  What I've described so far is with lvm2.
> 
> When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the
> console prints two or three lines stating that the ext4 journal has quit and
> the drive is remounted ro.  That particular drive stays wedged until the next
> reboot, but no other ill effects to the system.
> 
> OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock,
> (no logical volumes) the copy failure becomes catastrophic, with kernel panic
> messages, leaving the system unresponsive and needing a hard reset to recover.
> 
> I also tried your other suggestion:
> 
> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> index 4265b48..1a6a43d 100644
> --- a/drivers/usb/host/xhci.c
> +++ b/drivers/usb/host/xhci.c
> @@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, 
> xhci_get_quirks_t get_quirks)
> int retval;
>  
> /* Accept arbitrarily long scatter-gather lists */
> -   hcd->self.sg_tablesize = ~0;
> +   hcd->self.sg_tablesize = 31;
>  
> /* support to build packet from discontinuous buffers */
> hcd->self.no_sg_constraint = 1;
> 
> Sadly it didn't fix the problem.  Did I get the patch right?


That sounds almost as if the old version is still being loaded/run,
possibly from the initramfs image?

-- 
Mark Lord
Real-Time Remedies Inc.
ml...@pobox.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-04 Thread Mark Lord

On 14-01-03 10:40 AM, walt wrote:
 On 01/02/2014 11:15 AM, Sarah Sharp wrote:
 On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
 On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
 3.12-stable review patch.  If anyone has any objections, please let me 
 know.

 --

 From: David Laight david.lai...@aculab.com

 commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.

 Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link 
 TRB
 can only occur at a boundary between underlying USB frames (512 bytes for
 high speed devices).

 If this isn't done the USB frames aren't formatted correctly and, for 
 example,
 the USB3 ethernet ax88179_178a card will stop sending...


 Unfortunately this patch causes a regression when copying large files to my
 outboard USB3 drive. (Nothing at all to do with networking.)
 
 Do you have CONFIG_USB_DEBUG turned on for 3.13?  If so, you should see
 dmesg output from this statement shortly before your drive fails:

 if (num_trbs = TRBS_PER_SEGMENT) {
  xhci_err(xhci, Too many fragments %d, max %d\n,
  num_trbs, TRBS_PER_SEGMENT - 1);
  return -ENOMEM;
 }
 
 Well, the answers depend on whether the usb3 drive uses logical volumes or not
 (lvm2), which I can't explain.  What I've described so far is with lvm2.
 
 When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the
 console prints two or three lines stating that the ext4 journal has quit and
 the drive is remounted ro.  That particular drive stays wedged until the next
 reboot, but no other ill effects to the system.
 
 OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock,
 (no logical volumes) the copy failure becomes catastrophic, with kernel panic
 messages, leaving the system unresponsive and needing a hard reset to recover.
 
 I also tried your other suggestion:
 
 diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
 index 4265b48..1a6a43d 100644
 --- a/drivers/usb/host/xhci.c
 +++ b/drivers/usb/host/xhci.c
 @@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, 
 xhci_get_quirks_t get_quirks)
 int retval;
  
 /* Accept arbitrarily long scatter-gather lists */
 -   hcd-self.sg_tablesize = ~0;
 +   hcd-self.sg_tablesize = 31;
  
 /* support to build packet from discontinuous buffers */
 hcd-self.no_sg_constraint = 1;
 
 Sadly it didn't fix the problem.  Did I get the patch right?


That sounds almost as if the old version is still being loaded/run,
possibly from the initramfs image?

-- 
Mark Lord
Real-Time Remedies Inc.
ml...@pobox.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-03 Thread Sarah Sharp

On Fri, Jan 03, 2014 at 01:21:18PM -0800, walt wrote:
> I'm so sorry Sarah, that was another mistake.  The mistake is so stupid I'm 
> not
> going to publish it here :(
> 
> Once I finally ran the kernel with debugging actually compiled in, dmesg 
> contains
> xhci debugging messages.  Wow :)
> 
> It's a big file so I zipped and attached it, which I hope is acceptable in 
> lkml.

Yep, that's fine.  Sticking it in pastebin (or up on your server) is
also fine, if it gets really big.

> BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said 
> before
> doesn't fix the problem.  The cp stopped around 7GB just as before.
> 
> Sorry for the noise...

No worries! :)  With the dmesg, I can finally see what happened:

[  188.703059] xhci_hcd :03:00.0: Cancel URB 8800b7d2e0c0, dev 1, ep 
0x2, starting at offset 0xbb7b9000
[  188.703072] xhci_hcd :03:00.0: // Ding dong!
[  193.711022] xhci_hcd :03:00.0: xHCI host not responding to stop endpoint 
command.
[  193.711029] xhci_hcd :03:00.0: Assuming host is dying, halting host.
[  193.711046] xhci_hcd :03:00.0: // Halt the HC
[  193.711060] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 0
[  193.711066] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 2
[  193.711078] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 3
[  193.711096] xhci_hcd :03:00.0: Calling usb_hc_died()
[  193.711103] xhci_hcd :03:00.0: HC died; cleaning up
[  193.76] xhci_hcd :03:00.0: xHCI host controller is dead.

It seems that the xHCI driver tried to stop the endpoint ring in order
to cancel a SCSI transfer, and the driver never got a response for that.

The offset is rather suspicious (0xbb7b9000), and it probably means the
driver attempted to cancel a transfer that had been moved to the
beginning of the ring segment, with no-op TRBs before the link TRB.

I suspect David's patch triggers a bug in the command cancellation code.
There's also the unlikely possibility that the no-op TRBs did indeed
cause the host to hang.  Either way, I'll have to look into it.

I'll let you know when I have some diagnostic patches ready.

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-03 Thread walt

On 01/03/2014 11:54 AM, Sarah Sharp wrote:
> On Fri, Jan 03, 2014 at 07:40:33AM -0800, walt wrote:
>> On 01/02/2014 11:15 AM, Sarah Sharp wrote:
>>> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
 On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
> 3.12-stable review patch.  If anyone has any objections, please let me 
> know.
>
> --
>
> From: David Laight 
>
> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
>
> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link 
> TRB
> can only occur at a boundary between underlying USB frames (512 bytes for
> high speed devices).
>
> If this isn't done the USB frames aren't formatted correctly and, for 
> example,
> the USB3 ethernet ax88179_178a card will stop sending...


 Unfortunately this patch causes a regression when copying large files to my
 outboard USB3 drive. (Nothing at all to do with networking.)
>>
>>> Do you have CONFIG_USB_DEBUG turned on for 3.13?  If so, you should see
>>> dmesg output from this statement shortly before your drive fails:
>>>
>>> if (num_trbs >= TRBS_PER_SEGMENT) {
>>> xhci_err(xhci, "Too many fragments %d, max %d\n",
>>> num_trbs, TRBS_PER_SEGMENT - 1);
>>> return -ENOMEM;
>>> }
>>
>> Well, the answers depend on whether the usb3 drive uses logical volumes or 
>> not
>> (lvm2), which I can't explain.  What I've described so far is with lvm2.
>>
>> When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect

I'm so sorry Sarah, that was another mistake.  The mistake is so stupid I'm not
going to publish it here :(

Once I finally ran the kernel with debugging actually compiled in, dmesg 
contains
xhci debugging messages.  Wow :)

It's a big file so I zipped and attached it, which I hope is acceptable in lkml.

BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said before
doesn't fix the problem.  The cp stopped around 7GB just as before.

Sorry for the noise...


xhci.dmesg.gz
Description: application/gzip

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-03 Thread Sarah Sharp

On Fri, Jan 03, 2014 at 07:40:33AM -0800, walt wrote:
> On 01/02/2014 11:15 AM, Sarah Sharp wrote:
> > On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
> >> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
> >>> 3.12-stable review patch.  If anyone has any objections, please let me 
> >>> know.
> >>>
> >>> --
> >>>
> >>> From: David Laight 
> >>>
> >>> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
> >>>
> >>> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link 
> >>> TRB
> >>> can only occur at a boundary between underlying USB frames (512 bytes for
> >>> high speed devices).
> >>>
> >>> If this isn't done the USB frames aren't formatted correctly and, for 
> >>> example,
> >>> the USB3 ethernet ax88179_178a card will stop sending...
> >>
> >>
> >> Unfortunately this patch causes a regression when copying large files to my
> >> outboard USB3 drive. (Nothing at all to do with networking.)
> 
> > Do you have CONFIG_USB_DEBUG turned on for 3.13?  If so, you should see
> > dmesg output from this statement shortly before your drive fails:
> > 
> > if (num_trbs >= TRBS_PER_SEGMENT) {
> > xhci_err(xhci, "Too many fragments %d, max %d\n",
> > num_trbs, TRBS_PER_SEGMENT - 1);
> > return -ENOMEM;
> > }
> 
> Well, the answers depend on whether the usb3 drive uses logical volumes or not
> (lvm2), which I can't explain.  What I've described so far is with lvm2.
> 
> When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the
> console prints two or three lines stating that the ext4 journal has quit and
> the drive is remounted ro.  That particular drive stays wedged until the next
> reboot, but no other ill effects to the system.

Odd. In 3.12 xHCI has dynamic debugging, and turning on CONFIG_USB_DEBUG
should turn on debugging by default, so it's confusing that you didn't
see any messages.

Can I see your .config from /boot/?  Also, did you try capturing dmesg
with `tail -f /var/log/kern.log` or just dmesg?  Perhaps you need to run
`sudo dmesg -n 7`?

> OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock,
> (no logical volumes) the copy failure becomes catastrophic, with kernel panic
> messages, leaving the system unresponsive and needing a hard reset to recover.
> 
> I also tried your other suggestion:
> 
> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> index 4265b48..1a6a43d 100644
> --- a/drivers/usb/host/xhci.c
> +++ b/drivers/usb/host/xhci.c
> @@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, 
> xhci_get_quirks_t get_quirks)
> int retval;
>  
> /* Accept arbitrarily long scatter-gather lists */
> -   hcd->self.sg_tablesize = ~0;
> +   hcd->self.sg_tablesize = 31;
>  
> /* support to build packet from discontinuous buffers */
> hcd->self.no_sg_constraint = 1;
> 
> Sadly it didn't fix the problem.  Did I get the patch right?

Yes, you did.  So perhaps the patch triggers a different bug.  I can't
tell until I see xHCI debugging output.

> Thanks for your help, and I'm happy to try more ideas, as always.

Thanks for your patience. :)

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-03 Thread walt

On 01/02/2014 11:15 AM, Sarah Sharp wrote:
> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
>> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
>>> 3.12-stable review patch.  If anyone has any objections, please let me know.
>>>
>>> --
>>>
>>> From: David Laight 
>>>
>>> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
>>>
>>> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
>>> can only occur at a boundary between underlying USB frames (512 bytes for
>>> high speed devices).
>>>
>>> If this isn't done the USB frames aren't formatted correctly and, for 
>>> example,
>>> the USB3 ethernet ax88179_178a card will stop sending...
>>
>>
>> Unfortunately this patch causes a regression when copying large files to my
>> outboard USB3 drive. (Nothing at all to do with networking.)

> Do you have CONFIG_USB_DEBUG turned on for 3.13?  If so, you should see
> dmesg output from this statement shortly before your drive fails:
> 
> if (num_trbs >= TRBS_PER_SEGMENT) {
>   xhci_err(xhci, "Too many fragments %d, max %d\n",
>   num_trbs, TRBS_PER_SEGMENT - 1);
>   return -ENOMEM;
> }

Well, the answers depend on whether the usb3 drive uses logical volumes or not
(lvm2), which I can't explain.  What I've described so far is with lvm2.

When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the
console prints two or three lines stating that the ext4 journal has quit and
the drive is remounted ro.  That particular drive stays wedged until the next
reboot, but no other ill effects to the system.

OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock,
(no logical volumes) the copy failure becomes catastrophic, with kernel panic
messages, leaving the system unresponsive and needing a hard reset to recover.

I also tried your other suggestion:

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 4265b48..1a6a43d 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t 
get_quirks)
int retval;

/* Accept arbitrarily long scatter-gather lists */
-   hcd->self.sg_tablesize = ~0;
+   hcd->self.sg_tablesize = 31;

/* support to build packet from discontinuous buffers */
hcd->self.no_sg_constraint = 1;

Sadly it didn't fix the problem.  Did I get the patch right?

Thanks for your help, and I'm happy to try more ideas, as always.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-03 Thread walt

On 01/02/2014 11:15 AM, Sarah Sharp wrote:
 On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
 On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
 3.12-stable review patch.  If anyone has any objections, please let me know.

 --

 From: David Laight david.lai...@aculab.com

 commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.

 Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
 can only occur at a boundary between underlying USB frames (512 bytes for
 high speed devices).

 If this isn't done the USB frames aren't formatted correctly and, for 
 example,
 the USB3 ethernet ax88179_178a card will stop sending...


 Unfortunately this patch causes a regression when copying large files to my
 outboard USB3 drive. (Nothing at all to do with networking.)

 Do you have CONFIG_USB_DEBUG turned on for 3.13?  If so, you should see
 dmesg output from this statement shortly before your drive fails:
 
 if (num_trbs = TRBS_PER_SEGMENT) {
   xhci_err(xhci, Too many fragments %d, max %d\n,
   num_trbs, TRBS_PER_SEGMENT - 1);
   return -ENOMEM;
 }

Well, the answers depend on whether the usb3 drive uses logical volumes or not
(lvm2), which I can't explain.  What I've described so far is with lvm2.

When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the
console prints two or three lines stating that the ext4 journal has quit and
the drive is remounted ro.  That particular drive stays wedged until the next
reboot, but no other ill effects to the system.

OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock,
(no logical volumes) the copy failure becomes catastrophic, with kernel panic
messages, leaving the system unresponsive and needing a hard reset to recover.

I also tried your other suggestion:

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 4265b48..1a6a43d 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t 
get_quirks)
int retval;
 
/* Accept arbitrarily long scatter-gather lists */
-   hcd-self.sg_tablesize = ~0;
+   hcd-self.sg_tablesize = 31;
 
/* support to build packet from discontinuous buffers */
hcd-self.no_sg_constraint = 1;

Sadly it didn't fix the problem.  Did I get the patch right?

Thanks for your help, and I'm happy to try more ideas, as always.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-03 Thread Sarah Sharp

On Fri, Jan 03, 2014 at 07:40:33AM -0800, walt wrote:
 On 01/02/2014 11:15 AM, Sarah Sharp wrote:
  On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
  On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
  3.12-stable review patch.  If anyone has any objections, please let me 
  know.
 
  --
 
  From: David Laight david.lai...@aculab.com
 
  commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
 
  Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link 
  TRB
  can only occur at a boundary between underlying USB frames (512 bytes for
  high speed devices).
 
  If this isn't done the USB frames aren't formatted correctly and, for 
  example,
  the USB3 ethernet ax88179_178a card will stop sending...
 
 
  Unfortunately this patch causes a regression when copying large files to my
  outboard USB3 drive. (Nothing at all to do with networking.)
 
  Do you have CONFIG_USB_DEBUG turned on for 3.13?  If so, you should see
  dmesg output from this statement shortly before your drive fails:
  
  if (num_trbs = TRBS_PER_SEGMENT) {
  xhci_err(xhci, Too many fragments %d, max %d\n,
  num_trbs, TRBS_PER_SEGMENT - 1);
  return -ENOMEM;
  }
 
 Well, the answers depend on whether the usb3 drive uses logical volumes or not
 (lvm2), which I can't explain.  What I've described so far is with lvm2.
 
 When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect -- the
 console prints two or three lines stating that the ext4 journal has quit and
 the drive is remounted ro.  That particular drive stays wedged until the next
 reboot, but no other ill effects to the system.

Odd. In 3.12 xHCI has dynamic debugging, and turning on CONFIG_USB_DEBUG
should turn on debugging by default, so it's confusing that you didn't
see any messages.

Can I see your .config from /boot/?  Also, did you try capturing dmesg
with `tail -f /var/log/kern.log` or just dmesg?  Perhaps you need to run
`sudo dmesg -n 7`?

 OTOH, when I put a disk with just an ordinary ext4 partition in the usb3 dock,
 (no logical volumes) the copy failure becomes catastrophic, with kernel panic
 messages, leaving the system unresponsive and needing a hard reset to recover.
 
 I also tried your other suggestion:
 
 diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
 index 4265b48..1a6a43d 100644
 --- a/drivers/usb/host/xhci.c
 +++ b/drivers/usb/host/xhci.c
 @@ -4714,7 +4714,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, 
 xhci_get_quirks_t get_quirks)
 int retval;
  
 /* Accept arbitrarily long scatter-gather lists */
 -   hcd-self.sg_tablesize = ~0;
 +   hcd-self.sg_tablesize = 31;
  
 /* support to build packet from discontinuous buffers */
 hcd-self.no_sg_constraint = 1;
 
 Sadly it didn't fix the problem.  Did I get the patch right?

Yes, you did.  So perhaps the patch triggers a different bug.  I can't
tell until I see xHCI debugging output.

 Thanks for your help, and I'm happy to try more ideas, as always.

Thanks for your patience. :)

Sarah Sharp
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-03 Thread walt

On 01/03/2014 11:54 AM, Sarah Sharp wrote:
 On Fri, Jan 03, 2014 at 07:40:33AM -0800, walt wrote:
 On 01/02/2014 11:15 AM, Sarah Sharp wrote:
 On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
 On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
 3.12-stable review patch.  If anyone has any objections, please let me 
 know.

 --

 From: David Laight david.lai...@aculab.com

 commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.

 Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link 
 TRB
 can only occur at a boundary between underlying USB frames (512 bytes for
 high speed devices).

 If this isn't done the USB frames aren't formatted correctly and, for 
 example,
 the USB3 ethernet ax88179_178a card will stop sending...


 Unfortunately this patch causes a regression when copying large files to my
 outboard USB3 drive. (Nothing at all to do with networking.)

 Do you have CONFIG_USB_DEBUG turned on for 3.13?  If so, you should see
 dmesg output from this statement shortly before your drive fails:

 if (num_trbs = TRBS_PER_SEGMENT) {
 xhci_err(xhci, Too many fragments %d, max %d\n,
 num_trbs, TRBS_PER_SEGMENT - 1);
 return -ENOMEM;
 }

 Well, the answers depend on whether the usb3 drive uses logical volumes or 
 not
 (lvm2), which I can't explain.  What I've described so far is with lvm2.

 When using lvm2 on the usb3 drive, turning on USB_DEBUG has *no* effect

I'm so sorry Sarah, that was another mistake.  The mistake is so stupid I'm not
going to publish it here :(

Once I finally ran the kernel with debugging actually compiled in, dmesg 
contains
xhci debugging messages.  Wow :)

It's a big file so I zipped and attached it, which I hope is acceptable in lkml.

BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said before
doesn't fix the problem.  The cp stopped around 7GB just as before.

Sorry for the noise...


xhci.dmesg.gz
Description: application/gzip

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-03 Thread Sarah Sharp

On Fri, Jan 03, 2014 at 01:21:18PM -0800, walt wrote:
 I'm so sorry Sarah, that was another mistake.  The mistake is so stupid I'm 
 not
 going to publish it here :(
 
 Once I finally ran the kernel with debugging actually compiled in, dmesg 
 contains
 xhci debugging messages.  Wow :)
 
 It's a big file so I zipped and attached it, which I hope is acceptable in 
 lkml.

Yep, that's fine.  Sticking it in pastebin (or up on your server) is
also fine, if it gets really big.

 BTW, this dmesg is from a kernel with sg_tablesize = 31, which as I said 
 before
 doesn't fix the problem.  The cp stopped around 7GB just as before.
 
 Sorry for the noise...

No worries! :)  With the dmesg, I can finally see what happened:

[  188.703059] xhci_hcd :03:00.0: Cancel URB 8800b7d2e0c0, dev 1, ep 
0x2, starting at offset 0xbb7b9000
[  188.703072] xhci_hcd :03:00.0: // Ding dong!
[  193.711022] xhci_hcd :03:00.0: xHCI host not responding to stop endpoint 
command.
[  193.711029] xhci_hcd :03:00.0: Assuming host is dying, halting host.
[  193.711046] xhci_hcd :03:00.0: // Halt the HC
[  193.711060] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 0
[  193.711066] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 2
[  193.711078] xhci_hcd :03:00.0: Killing URBs for slot ID 1, ep index 3
[  193.711096] xhci_hcd :03:00.0: Calling usb_hc_died()
[  193.711103] xhci_hcd :03:00.0: HC died; cleaning up
[  193.76] xhci_hcd :03:00.0: xHCI host controller is dead.

It seems that the xHCI driver tried to stop the endpoint ring in order
to cancel a SCSI transfer, and the driver never got a response for that.

The offset is rather suspicious (0xbb7b9000), and it probably means the
driver attempted to cancel a transfer that had been moved to the
beginning of the ring segment, with no-op TRBs before the link TRB.

I suspect David's patch triggers a bug in the command cancellation code.
There's also the unlikely possibility that the no-op TRBs did indeed
cause the host to hang.  Either way, I'll have to look into it.

I'll let you know when I have some diagnostic patches ready.

Sarah Sharp
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-02 Thread Sarah Sharp

On Thu, Jan 02, 2014 at 04:01:29PM -0500, Mark Lord wrote:
> On 14-01-02 02:15 PM, Sarah Sharp wrote:
> > On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
> ..
> >> Unfortunately this patch causes a regression when copying large files to my
> >> outboard USB3 drive. (Nothing at all to do with networking.)
> >>
> >> When I try to copy a large (20GB) file to the USB3 drive, the copy dies 
> >> after
> >> about 7GB, the ext4 journal aborts and the drive is remounted read-only.
> >>
> >> This bug is 100% reproducible (always pretty close to 7GB) and reverting 
> >> this
> >> patch completely fixes the problem.
> > 
> > Ok, I had feared that would be a consequence of this patch.  I think the
> > problem is that the usb-storage driver submitted an URB with more
> > scatter-gather entries than would fit on the ring segment, the xHCI
> > driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
> > up on the SCSI command.
> 
> Is there not a block layer / scheduler tunable for max sg entries or 
> something?

There is a USB host controller tunable for max number of sg entries
that's passed up to the SCSI or block layer.  We discussed changing it,
but there wasn't a good consensus on what to change it to:

http://marc.info/?l=linux-usb=138496358223904=2
http://marc.info/?l=linux-netdev=138496706007262=2

In the end, we thought we didn't need to limit the sglist size because
Paul thought usb-storage limits the overall transfer size to 120K, which
should fit in 31 TRBs:

http://marc.info/?l=linux-usb=138498190419312=2

Walt, could you see if limiting the sglist size helps, by setting
hcd->self.sg_tablesize in xhci.c to 31?

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-02 Thread James Bottomley

On Thu, 2014-01-02 at 16:01 -0500, Mark Lord wrote:
> On 14-01-02 02:15 PM, Sarah Sharp wrote:
> > On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
> ..
> >> Unfortunately this patch causes a regression when copying large files to my
> >> outboard USB3 drive. (Nothing at all to do with networking.)
> >>
> >> When I try to copy a large (20GB) file to the USB3 drive, the copy dies 
> >> after
> >> about 7GB, the ext4 journal aborts and the drive is remounted read-only.
> >>
> >> This bug is 100% reproducible (always pretty close to 7GB) and reverting 
> >> this
> >> patch completely fixes the problem.
> > 
> > Ok, I had feared that would be a consequence of this patch.  I think the
> > problem is that the usb-storage driver submitted an URB with more
> > scatter-gather entries than would fit on the ring segment, the xHCI
> > driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
> > up on the SCSI command.
> 
> 
> Is there not a block layer / scheduler tunable for max sg entries or 
> something?

Yes, it's blk_queue_max_segments()  You can also add an indirect limit
from userspace using the queue/max_sectors_kb parameter (it doesn't
limit the number of sg elements directly, but it does limit the overall
size of the transfer, and since each segment is 4k or larger except at
the beginning and end, that limits the total number of sg elements as
well).

James


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-02 Thread Mark Lord

On 14-01-02 02:15 PM, Sarah Sharp wrote:
> On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
..
>> Unfortunately this patch causes a regression when copying large files to my
>> outboard USB3 drive. (Nothing at all to do with networking.)
>>
>> When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
>> about 7GB, the ext4 journal aborts and the drive is remounted read-only.
>>
>> This bug is 100% reproducible (always pretty close to 7GB) and reverting this
>> patch completely fixes the problem.
> 
> Ok, I had feared that would be a consequence of this patch.  I think the
> problem is that the usb-storage driver submitted an URB with more
> scatter-gather entries than would fit on the ring segment, the xHCI
> driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
> up on the SCSI command.


Is there not a block layer / scheduler tunable for max sg entries or something?
-- 
Mark Lord
Real-Time Remedies Inc.
ml...@pobox.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-02 Thread Sarah Sharp

On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
> > 3.12-stable review patch.  If anyone has any objections, please let me know.
> > 
> > --
> > 
> > From: David Laight 
> > 
> > commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
> > 
> > Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
> > can only occur at a boundary between underlying USB frames (512 bytes for
> > high speed devices).
> > 
> > If this isn't done the USB frames aren't formatted correctly and, for 
> > example,
> > the USB3 ethernet ax88179_178a card will stop sending...
> 
> 
> Unfortunately this patch causes a regression when copying large files to my
> outboard USB3 drive. (Nothing at all to do with networking.)
> 
> When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
> about 7GB, the ext4 journal aborts and the drive is remounted read-only.
> 
> This bug is 100% reproducible (always pretty close to 7GB) and reverting this
> patch completely fixes the problem.

Ok, I had feared that would be a consequence of this patch.  I think the
problem is that the usb-storage driver submitted an URB with more
scatter-gather entries than would fit on the ring segment, the xHCI
driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
up on the SCSI command.

Do you have CONFIG_USB_DEBUG turned on for 3.13?  If so, you should see
dmesg output from this statement shortly before your drive fails:

if (num_trbs >= TRBS_PER_SEGMENT) {
xhci_err(xhci, "Too many fragments %d, max %d\n",
num_trbs, TRBS_PER_SEGMENT - 1);
return -ENOMEM;
}

> (Note to Sarah:  I recently emailed you about this problem, and I *wrongly*
> said that reverting the patch doesn't help.  That was a mistake, sorry.)
> 
> I'm happy to try any debugging suggestions/tricks.

Unfortunately a real fix for this is going to take a bit.  I have a
couple different solutions to the bug the patch solved, but they're much
more invasive than the original patch and will take a couple weeks to
implement and thoroughly test.

If David's patch is just reverted, USB ethernet on 3.12 and later breaks
under xHCI.  The networking folks added scatter-gather support in 3.12.
Those patches could be reverted, but I suspect David Miller will not be
happy with that solution, since the real problem is the xHCI driver
itself, and EHCI scatter-gather works fine.

I think the short term solution is to simply turn off scatter-gather all
together under xHCI until this gets fixed.  It could mean a big
performance hit for USB storage devices, but that means we get correct
behavior for both USB ethernet and USB storage.

> BTW, please tell me if I've cc'd too many people.

Nope, you're fine.  I've Cc'ed the USB and SCSI mailing lists as well.

Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-02 Thread Sarah Sharp

On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
 On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
  3.12-stable review patch.  If anyone has any objections, please let me know.
  
  --
  
  From: David Laight david.lai...@aculab.com
  
  commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
  
  Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
  can only occur at a boundary between underlying USB frames (512 bytes for
  high speed devices).
  
  If this isn't done the USB frames aren't formatted correctly and, for 
  example,
  the USB3 ethernet ax88179_178a card will stop sending...
 
 
 Unfortunately this patch causes a regression when copying large files to my
 outboard USB3 drive. (Nothing at all to do with networking.)
 
 When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
 about 7GB, the ext4 journal aborts and the drive is remounted read-only.
 
 This bug is 100% reproducible (always pretty close to 7GB) and reverting this
 patch completely fixes the problem.

Ok, I had feared that would be a consequence of this patch.  I think the
problem is that the usb-storage driver submitted an URB with more
scatter-gather entries than would fit on the ring segment, the xHCI
driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
up on the SCSI command.

Do you have CONFIG_USB_DEBUG turned on for 3.13?  If so, you should see
dmesg output from this statement shortly before your drive fails:

if (num_trbs = TRBS_PER_SEGMENT) {
xhci_err(xhci, Too many fragments %d, max %d\n,
num_trbs, TRBS_PER_SEGMENT - 1);
return -ENOMEM;
}

 (Note to Sarah:  I recently emailed you about this problem, and I *wrongly*
 said that reverting the patch doesn't help.  That was a mistake, sorry.)
 
 I'm happy to try any debugging suggestions/tricks.

Unfortunately a real fix for this is going to take a bit.  I have a
couple different solutions to the bug the patch solved, but they're much
more invasive than the original patch and will take a couple weeks to
implement and thoroughly test.

If David's patch is just reverted, USB ethernet on 3.12 and later breaks
under xHCI.  The networking folks added scatter-gather support in 3.12.
Those patches could be reverted, but I suspect David Miller will not be
happy with that solution, since the real problem is the xHCI driver
itself, and EHCI scatter-gather works fine.

I think the short term solution is to simply turn off scatter-gather all
together under xHCI until this gets fixed.  It could mean a big
performance hit for USB storage devices, but that means we get correct
behavior for both USB ethernet and USB storage.

 BTW, please tell me if I've cc'd too many people.

Nope, you're fine.  I've Cc'ed the USB and SCSI mailing lists as well.

Sarah Sharp
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-02 Thread Mark Lord

On 14-01-02 02:15 PM, Sarah Sharp wrote:
 On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
..
 Unfortunately this patch causes a regression when copying large files to my
 outboard USB3 drive. (Nothing at all to do with networking.)

 When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
 about 7GB, the ext4 journal aborts and the drive is remounted read-only.

 This bug is 100% reproducible (always pretty close to 7GB) and reverting this
 patch completely fixes the problem.
 
 Ok, I had feared that would be a consequence of this patch.  I think the
 problem is that the usb-storage driver submitted an URB with more
 scatter-gather entries than would fit on the ring segment, the xHCI
 driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
 up on the SCSI command.


Is there not a block layer / scheduler tunable for max sg entries or something?
-- 
Mark Lord
Real-Time Remedies Inc.
ml...@pobox.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-02 Thread James Bottomley

On Thu, 2014-01-02 at 16:01 -0500, Mark Lord wrote:
 On 14-01-02 02:15 PM, Sarah Sharp wrote:
  On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
 ..
  Unfortunately this patch causes a regression when copying large files to my
  outboard USB3 drive. (Nothing at all to do with networking.)
 
  When I try to copy a large (20GB) file to the USB3 drive, the copy dies 
  after
  about 7GB, the ext4 journal aborts and the drive is remounted read-only.
 
  This bug is 100% reproducible (always pretty close to 7GB) and reverting 
  this
  patch completely fixes the problem.
  
  Ok, I had feared that would be a consequence of this patch.  I think the
  problem is that the usb-storage driver submitted an URB with more
  scatter-gather entries than would fit on the ring segment, the xHCI
  driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
  up on the SCSI command.
 
 
 Is there not a block layer / scheduler tunable for max sg entries or 
 something?

Yes, it's blk_queue_max_segments()  You can also add an indirect limit
from userspace using the queue/max_sectors_kb parameter (it doesn't
limit the number of sg elements directly, but it does limit the overall
size of the transfer, and since each segment is 4k or larger except at
the beginning and end, that limits the total number of sg elements as
well).

James


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2014-01-02 Thread Sarah Sharp

On Thu, Jan 02, 2014 at 04:01:29PM -0500, Mark Lord wrote:
 On 14-01-02 02:15 PM, Sarah Sharp wrote:
  On Tue, Dec 31, 2013 at 12:40:16PM -0800, walt wrote:
 ..
  Unfortunately this patch causes a regression when copying large files to my
  outboard USB3 drive. (Nothing at all to do with networking.)
 
  When I try to copy a large (20GB) file to the USB3 drive, the copy dies 
  after
  about 7GB, the ext4 journal aborts and the drive is remounted read-only.
 
  This bug is 100% reproducible (always pretty close to 7GB) and reverting 
  this
  patch completely fixes the problem.
  
  Ok, I had feared that would be a consequence of this patch.  I think the
  problem is that the usb-storage driver submitted an URB with more
  scatter-gather entries than would fit on the ring segment, the xHCI
  driver rejected the URB with -ENOMEM, and the SCSI core eventually gave
  up on the SCSI command.
 
 Is there not a block layer / scheduler tunable for max sg entries or 
 something?

There is a USB host controller tunable for max number of sg entries
that's passed up to the SCSI or block layer.  We discussed changing it,
but there wasn't a good consensus on what to change it to:

http://marc.info/?l=linux-usbm=138496358223904w=2
http://marc.info/?l=linux-netdevm=138496706007262w=2

In the end, we thought we didn't need to limit the sglist size because
Paul thought usb-storage limits the overall transfer size to 120K, which
should fit in 31 TRBs:

http://marc.info/?l=linux-usbm=138498190419312w=2

Walt, could you see if limiting the sglist size helps, by setting
hcd-self.sg_tablesize in xhci.c to 31?

Sarah Sharp
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2013-12-31 Thread Mark Lord

On 13-12-31 03:40 PM, walt wrote:
> On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
>> 3.12-stable review patch.  If anyone has any objections, please let me know.
>>
>> --
>>
>> From: David Laight 
>>
>> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
>>
>> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
>> can only occur at a boundary between underlying USB frames (512 bytes for
>> high speed devices).
>>
>> If this isn't done the USB frames aren't formatted correctly and, for 
>> example,
>> the USB3 ethernet ax88179_178a card will stop sending...
> 
> 
> Unfortunately this patch causes a regression when copying large files to my
> outboard USB3 drive. (Nothing at all to do with networking.)
> 
> When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
> about 7GB, the ext4 journal aborts and the drive is remounted read-only.
> 
> This bug is 100% reproducible (always pretty close to 7GB) and reverting this
> patch completely fixes the problem.
> 
> (Note to Sarah:  I recently emailed you about this problem, and I *wrongly*
> said that reverting the patch doesn't help.  That was a mistake, sorry.)
..


I have also had USB3 mass-storage issues with kernels since this patch.
Dunno if the patch itself is the problem, but just too much to do with USB3
is very flakey in the most recent 3.12.* kernels, as well as the 3.13-rc* ones.
So I have gone back to running older kernels and patching them.

Cheers
-- 
Mark Lord
Real-Time Remedies Inc.
ml...@pobox.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2013-12-31 Thread walt

On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
> 3.12-stable review patch.  If anyone has any objections, please let me know.
> 
> --
> 
> From: David Laight 
> 
> commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
> 
> Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
> can only occur at a boundary between underlying USB frames (512 bytes for
> high speed devices).
> 
> If this isn't done the USB frames aren't formatted correctly and, for example,
> the USB3 ethernet ax88179_178a card will stop sending...

Unfortunately this patch causes a regression when copying large files to my
outboard USB3 drive. (Nothing at all to do with networking.)

When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
about 7GB, the ext4 journal aborts and the drive is remounted read-only.

This bug is 100% reproducible (always pretty close to 7GB) and reverting this
patch completely fixes the problem.

(Note to Sarah:  I recently emailed you about this problem, and I *wrongly*
said that reverting the patch doesn't help.  That was a mistake, sorry.)

I'm happy to try any debugging suggestions/tricks.

BTW, please tell me if I've cc'd too many people.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2013-12-31 Thread walt

On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
 3.12-stable review patch.  If anyone has any objections, please let me know.
 
 --
 
 From: David Laight david.lai...@aculab.com
 
 commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.
 
 Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
 can only occur at a boundary between underlying USB frames (512 bytes for
 high speed devices).
 
 If this isn't done the USB frames aren't formatted correctly and, for example,
 the USB3 ethernet ax88179_178a card will stop sending...


Unfortunately this patch causes a regression when copying large files to my
outboard USB3 drive. (Nothing at all to do with networking.)

When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
about 7GB, the ext4 journal aborts and the drive is remounted read-only.

This bug is 100% reproducible (always pretty close to 7GB) and reverting this
patch completely fixes the problem.

(Note to Sarah:  I recently emailed you about this problem, and I *wrongly*
said that reverting the patch doesn't help.  That was a mistake, sorry.)

I'm happy to try any debugging suggestions/tricks.

BTW, please tell me if I've cc'd too many people.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2013-12-31 Thread Mark Lord

On 13-12-31 03:40 PM, walt wrote:
 On 12/18/2013 01:11 PM, Greg Kroah-Hartman wrote:
 3.12-stable review patch.  If anyone has any objections, please let me know.

 --

 From: David Laight david.lai...@aculab.com

 commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.

 Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
 can only occur at a boundary between underlying USB frames (512 bytes for
 high speed devices).

 If this isn't done the USB frames aren't formatted correctly and, for 
 example,
 the USB3 ethernet ax88179_178a card will stop sending...
 
 
 Unfortunately this patch causes a regression when copying large files to my
 outboard USB3 drive. (Nothing at all to do with networking.)
 
 When I try to copy a large (20GB) file to the USB3 drive, the copy dies after
 about 7GB, the ext4 journal aborts and the drive is remounted read-only.
 
 This bug is 100% reproducible (always pretty close to 7GB) and reverting this
 patch completely fixes the problem.
 
 (Note to Sarah:  I recently emailed you about this problem, and I *wrongly*
 said that reverting the patch doesn't help.  That was a mistake, sorry.)
..


I have also had USB3 mass-storage issues with kernels since this patch.
Dunno if the patch itself is the problem, but just too much to do with USB3
is very flakey in the most recent 3.12.* kernels, as well as the 3.13-rc* ones.
So I have gone back to running older kernels and patching them.

Cheers
-- 
Mark Lord
Real-Time Remedies Inc.
ml...@pobox.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2013-12-18 Thread Greg Kroah-Hartman

3.12-stable review patch.  If anyone has any objections, please let me know.

--

From: David Laight 

commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.

Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
can only occur at a boundary between underlying USB frames (512 bytes for
high speed devices).

If this isn't done the USB frames aren't formatted correctly and, for example,
the USB3 ethernet ax88179_178a card will stop sending (while still receiving)
when running a netperf tcp transmit test with (say) and 8k buffer.

This should be a candidate for stable, the ax88179_178a driver defaults to
gso and tso enabled so it passes a lot of fragmented skb to the USB stack.

Notes from Sarah:

Discussion: http://marc.info/?l=linux-usb=138384509604981=2

This patch fixes a long-standing xHCI driver bug that was revealed by a
change in 3.12 in the usb-net driver.  Commit
638c5115a794981441246fa8fa5d95c1875af5ba "USBNET: support DMA SG" added
support to use bulk endpoint scatter-gather (urb->sg).  Only the USB
ethernet drivers trigger this bug, because the mass storage driver sends
sg list entries in page-sized chunks.

This patch only fixes the issue for bulk endpoint scatter-gather.  The
problem will still occur for periodic endpoints, because hosts will
interpret no-op transfers as a request to skip a service interval, which
is not what we want.

Luckily, the USB core isn't set up for scatter-gather on isochronous
endpoints, and no USB drivers use scatter-gather for interrupt
endpoints.  Document this known limitation so that developers won't try
to use urb->sg for interrupt endpoints until this issue is fixed.  The
more comprehensive fix would be to allow link TRBs in the middle of the
endpoint ring and revert this patch, but that fix would touch too much
code to be allowed in for stable.

This patch should be backported to kernels as old as 3.12, that contain
the commit 638c5115a794981441246fa8fa5d95c1875af5ba "USBNET: support DMA
SG".  Without this patch, the USB network device gets wedged, and stops
sending packets.  Mark Lord confirms this patch fixes the regression:

http://marc.info/?l=linux-netdev=138487107625966=2

Signed-off-by: David Laight 
Signed-off-by: Sarah Sharp 
Tested-by: Mark Lord 
Signed-off-by: Greg Kroah-Hartman 

---
 drivers/usb/host/xhci-ring.c |   54 +--
 include/linux/usb.h  |2 +
 2 files changed, 54 insertions(+), 2 deletions(-)

--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2929,8 +2929,58 @@ static int prepare_ring(struct xhci_hcd
}
 
while (1) {
-   if (room_on_ring(xhci, ep_ring, num_trbs))
-   break;
+   if (room_on_ring(xhci, ep_ring, num_trbs)) {
+   union xhci_trb *trb = ep_ring->enqueue;
+   unsigned int usable = ep_ring->enq_seg->trbs +
+   TRBS_PER_SEGMENT - 1 - trb;
+   u32 nop_cmd;
+
+   /*
+* Section 4.11.7.1 TD Fragments states that a link
+* TRB must only occur at the boundary between
+* data bursts (eg 512 bytes for 480M).
+* While it is possible to split a large fragment
+* we don't know the size yet.
+* Simplest solution is to fill the trb before the
+* LINK with nop commands.
+*/
+   if (num_trbs == 1 || num_trbs <= usable || usable == 0)
+   break;
+
+   if (ep_ring->type != TYPE_BULK)
+   /*
+* While isoc transfers might have a buffer that
+* crosses a 64k boundary it is unlikely.
+* Since we can't add NOPs without generating
+* gaps in the traffic just hope it never
+* happens at the end of the ring.
+* This could be fixed by writing a LINK TRB
+* instead of the first NOP - however the
+* TRB_TYPE_LINK_LE32() calls would all need
+* changing to check the ring length.
+*/
+   break;
+
+   if (num_trbs >= TRBS_PER_SEGMENT) {
+   xhci_err(xhci, "Too many fragments %d, max 
%d\n",
+   num_trbs, TRBS_PER_SEGMENT - 1);
+   return -ENOMEM;
+   }
+
+   nop_cmd = cpu_to_le32(TRB_TYPE(TRB_TR_NOOP) |
+   ep_ring->cycle_state);
+

[PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

2013-12-18 Thread Greg Kroah-Hartman

3.12-stable review patch.  If anyone has any objections, please let me know.

--

From: David Laight david.lai...@aculab.com

commit 35773dac5f862cb1c82ea151eba3e2f6de51ec3e upstream.

Section 4.11.7.1 of rev 1.0 of the xhci specification states that a link TRB
can only occur at a boundary between underlying USB frames (512 bytes for
high speed devices).

If this isn't done the USB frames aren't formatted correctly and, for example,
the USB3 ethernet ax88179_178a card will stop sending (while still receiving)
when running a netperf tcp transmit test with (say) and 8k buffer.

This should be a candidate for stable, the ax88179_178a driver defaults to
gso and tso enabled so it passes a lot of fragmented skb to the USB stack.

Notes from Sarah:

Discussion: http://marc.info/?l=linux-usbm=138384509604981w=2

This patch fixes a long-standing xHCI driver bug that was revealed by a
change in 3.12 in the usb-net driver.  Commit
638c5115a794981441246fa8fa5d95c1875af5ba USBNET: support DMA SG added
support to use bulk endpoint scatter-gather (urb-sg).  Only the USB
ethernet drivers trigger this bug, because the mass storage driver sends
sg list entries in page-sized chunks.

This patch only fixes the issue for bulk endpoint scatter-gather.  The
problem will still occur for periodic endpoints, because hosts will
interpret no-op transfers as a request to skip a service interval, which
is not what we want.

Luckily, the USB core isn't set up for scatter-gather on isochronous
endpoints, and no USB drivers use scatter-gather for interrupt
endpoints.  Document this known limitation so that developers won't try
to use urb-sg for interrupt endpoints until this issue is fixed.  The
more comprehensive fix would be to allow link TRBs in the middle of the
endpoint ring and revert this patch, but that fix would touch too much
code to be allowed in for stable.

This patch should be backported to kernels as old as 3.12, that contain
the commit 638c5115a794981441246fa8fa5d95c1875af5ba USBNET: support DMA
SG.  Without this patch, the USB network device gets wedged, and stops
sending packets.  Mark Lord confirms this patch fixes the regression:

http://marc.info/?l=linux-netdevm=138487107625966w=2

Signed-off-by: David Laight david.lai...@aculab.com
Signed-off-by: Sarah Sharp sarah.a.sh...@linux.intel.com
Tested-by: Mark Lord ml...@pobox.com
Signed-off-by: Greg Kroah-Hartman gre...@linuxfoundation.org

---
 drivers/usb/host/xhci-ring.c |   54 +--
 include/linux/usb.h  |2 +
 2 files changed, 54 insertions(+), 2 deletions(-)

--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2929,8 +2929,58 @@ static int prepare_ring(struct xhci_hcd
}
 
while (1) {
-   if (room_on_ring(xhci, ep_ring, num_trbs))
-   break;
+   if (room_on_ring(xhci, ep_ring, num_trbs)) {
+   union xhci_trb *trb = ep_ring-enqueue;
+   unsigned int usable = ep_ring-enq_seg-trbs +
+   TRBS_PER_SEGMENT - 1 - trb;
+   u32 nop_cmd;
+
+   /*
+* Section 4.11.7.1 TD Fragments states that a link
+* TRB must only occur at the boundary between
+* data bursts (eg 512 bytes for 480M).
+* While it is possible to split a large fragment
+* we don't know the size yet.
+* Simplest solution is to fill the trb before the
+* LINK with nop commands.
+*/
+   if (num_trbs == 1 || num_trbs = usable || usable == 0)
+   break;
+
+   if (ep_ring-type != TYPE_BULK)
+   /*
+* While isoc transfers might have a buffer that
+* crosses a 64k boundary it is unlikely.
+* Since we can't add NOPs without generating
+* gaps in the traffic just hope it never
+* happens at the end of the ring.
+* This could be fixed by writing a LINK TRB
+* instead of the first NOP - however the
+* TRB_TYPE_LINK_LE32() calls would all need
+* changing to check the ring length.
+*/
+   break;
+
+   if (num_trbs = TRBS_PER_SEGMENT) {
+   xhci_err(xhci, Too many fragments %d, max 
%d\n,
+   num_trbs, TRBS_PER_SEGMENT - 1);
+   return -ENOMEM;
+   }
+
+   nop_cmd =

98 matches

Mail list logo