Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-12-21 Thread Michael Reutman
On Fri, Dec 18, 2015 at 5:47 PM, Michael Reutman
 wrote:
> First, apologies for the delay. This fell off my radar and I'm just
> getting back around to running your last patch.
>
> I'm kicking off our test application to run over the weekend with the
> last patch provided. I'll let you and everyone else know how it went
> on Monday morning.

I can confirm at this point that the patch provided has allowed the
original test application to run without issue for roughly 63 hours. A
definite improvement from when we first started looking into this
problem and it would get "stuck" in 15 seconds or less. :)

Alan, I'd say your patch has fixed the issue. Thanks for your
assistance. Is there anything else you need from me to wrap up this
issue?

Michael Reutman
Software Engineer
Epiq Solutions
847-598-0218 x68
www.epiqsolutions.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-12-21 Thread Alan Stern
On Mon, 21 Dec 2015, Michael Reutman wrote:

> On Fri, Dec 18, 2015 at 5:47 PM, Michael Reutman
>  wrote:
> > First, apologies for the delay. This fell off my radar and I'm just
> > getting back around to running your last patch.
> >
> > I'm kicking off our test application to run over the weekend with the
> > last patch provided. I'll let you and everyone else know how it went
> > on Monday morning.
> 
> I can confirm at this point that the patch provided has allowed the
> original test application to run without issue for roughly 63 hours. A
> definite improvement from when we first started looking into this
> problem and it would get "stuck" in 15 seconds or less. :)
> 
> Alan, I'd say your patch has fixed the issue. Thanks for your
> assistance. Is there anything else you need from me to wrap up this
> issue?

Nothing more; thanks for testing it.  I will submit the changes 
shortly.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-12-18 Thread Michael Reutman
First, apologies for the delay. This fell off my radar and I'm just
getting back around to running your last patch.

I'm kicking off our test application to run over the weekend with the
last patch provided. I'll let you and everyone else know how it went
on Monday morning.
Michael Reutman
Software Engineer
Epiq Solutions
847-598-0218 x68
www.epiqsolutions.com


On Wed, Dec 9, 2015 at 1:45 PM, Michael Reutman
 wrote:
> On Mon, Dec 7, 2015 at 1:17 PM, Alan Stern  wrote:
>> Michael:
>>
>> Here at last is the final version of the patch, or something very close
>> to it.  This should be applied on top of the other EHCI patches, but
>> not the quick-and-dirty fix, which it replaces.
>>
>> Please test it and let me know the results.
>>
>> Alan Stern
>>
>
> Thanks Alan. I'll try and get this tested within the next day or two
> and relay the results back to you.
>
> Michael Reutman
> Software Engineer
> Epiq Solutions
> 847-598-0218 x68
> www.epiqsolutions.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-12-09 Thread Michael Reutman
On Mon, Dec 7, 2015 at 1:17 PM, Alan Stern  wrote:
> Michael:
>
> Here at last is the final version of the patch, or something very close
> to it.  This should be applied on top of the other EHCI patches, but
> not the quick-and-dirty fix, which it replaces.
>
> Please test it and let me know the results.
>
> Alan Stern
>

Thanks Alan. I'll try and get this tested within the next day or two
and relay the results back to you.

Michael Reutman
Software Engineer
Epiq Solutions
847-598-0218 x68
www.epiqsolutions.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-12-07 Thread Alan Stern
Michael:

Here at last is the final version of the patch, or something very close 
to it.  This should be applied on top of the other EHCI patches, but 
not the quick-and-dirty fix, which it replaces.

Please test it and let me know the results.

Alan Stern



Index: usb-4.3/drivers/usb/host/ehci.h
===
--- usb-4.3.orig/drivers/usb/host/ehci.h
+++ usb-4.3/drivers/usb/host/ehci.h
@@ -110,6 +110,7 @@ enum ehci_hrtimer_event {
EHCI_HRTIMER_POLL_DEAD, /* Wait for dead controller to stop */
EHCI_HRTIMER_UNLINK_INTR,   /* Wait for interrupt QH unlink */
EHCI_HRTIMER_FREE_ITDS, /* Wait for unused iTDs and siTDs */
+   EHCI_HRTIMER_ACTIVE_UNLINK, /* Wait while unlinking an active QH */
EHCI_HRTIMER_START_UNLINK_INTR, /* Unlink empty interrupt QHs */
EHCI_HRTIMER_ASYNC_UNLINKS, /* Unlink empty async QHs */
EHCI_HRTIMER_IAA_WATCHDOG,  /* Handle lost IAA interrupts */
@@ -156,6 +157,8 @@ struct ehci_hcd {   /* one per controlle
struct list_headasync_idle;
unsignedasync_unlink_cycle;
unsignedasync_count;/* async activity count */
+   __hc32  old_current;/* Test for QH becoming */
+   __hc32  old_token;  /*  inactive during unlink */
 
/* periodic schedule support */
 #defineDEFAULT_I_TDPS  1024/* some HCs can do less 
*/
@@ -432,13 +435,19 @@ struct ehci_qh {
u8  xacterrs;   /* XactErr retry counter */
 #defineQH_XACTERR_MAX  32  /* XactErr retry limit 
*/
 
+   u8  unlink_reason;
+#define QH_UNLINK_HALTED   0x01/* Halt flag is set */
+#define QH_UNLINK_SHORT_READ   0x02/* Recover from a short read */
+#define QH_UNLINK_DUMMY_OVERLAY0x04/* QH overlayed the 
dummy TD */
+#define QH_UNLINK_SHUTDOWN 0x08/* The HC isn't running */
+#define QH_UNLINK_END_OF_QUEUE 0x10/* Reached end of the queue */
+#define QH_UNLINK_REQUESTED0x20/* Disable, reset, or dequeue */
+
u8  gap_uf; /* uframes split/csplit gap */
 
unsignedis_out:1;   /* bulk or intr OUT */
unsignedclearing_tt:1;  /* Clear-TT-Buf in progress */
unsigneddequeue_during_giveback:1;
-   unsignedexception:1;/* got a fault, or an unlink
-  was requested */
unsignedshould_be_inactive:1;
 };
 
Index: usb-4.3/drivers/usb/host/ehci-hcd.c
===
--- usb-4.3.orig/drivers/usb/host/ehci-hcd.c
+++ usb-4.3/drivers/usb/host/ehci-hcd.c
@@ -306,6 +306,7 @@ static void ehci_quiesce (struct ehci_hc
 
 /*-*/
 
+static void end_iaa_cycle(struct ehci_hcd *ehci);
 static void end_unlink_async(struct ehci_hcd *ehci);
 static void unlink_empty_async(struct ehci_hcd *ehci);
 static void unlink_empty_async_suspended(struct ehci_hcd *ehci);
@@ -565,6 +566,9 @@ static int ehci_init(struct usb_hcd *hcd
/* Accept arbitrarily long scatter-gather lists */
if (!(hcd->driver->flags & HCD_LOCAL_MEM))
hcd->self.sg_tablesize = ~0;
+
+   /* Prepare for unlinking active QHs */
+   ehci->old_current = ~0;
return 0;
 }
 
@@ -758,7 +762,7 @@ static irqreturn_t ehci_irq (struct usb_
ehci_dbg(ehci, "IAA with IAAD still set?\n");
if (ehci->iaa_in_progress)
COUNT(ehci->stats.iaa);
-   end_unlink_async(ehci);
+   end_iaa_cycle(ehci);
}
 
/* remote wakeup [4.3.1] */
@@ -911,7 +915,7 @@ static int ehci_urb_dequeue(struct usb_h
 */
} else {
qh = (struct ehci_qh *) urb->hcpriv;
-   qh->exception = 1;
+   qh->unlink_reason |= QH_UNLINK_REQUESTED;
switch (qh->qh_state) {
case QH_STATE_LINKED:
if (usb_pipetype(urb->pipe) == PIPE_INTERRUPT)
@@ -972,7 +976,7 @@ rescan:
goto done;
}
 
-   qh->exception = 1;
+   qh->unlink_reason |= QH_UNLINK_REQUESTED;
switch (qh->qh_state) {
case QH_STATE_LINKED:
WARN_ON(!list_empty(>qtd_list));
@@ -1042,7 +1046,7 @@ ehci_endpoint_reset(struct usb_hcd *hcd,
 * re-linking will call qh_refresh().
 */
usb_settoggle(qh->ps.udev, epnum, is_out, 0);
-   qh->exception = 1;
+   qh->unlink_reason |= QH_UNLINK_REQUESTED;
  

Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-11-19 Thread Michael Reutman
On Wed, Nov 18, 2015 at 10:23 AM, Alan Stern  wrote:
> For the last patch.  And yes, do set usb_snoop_max.

Dmesg output with usb_snoop and usb_snoop_max is attached below.

> I hope so.  There is one thing I'm still undecided about: Should this
> workaround be applied only to AMD/ATI controllers or should it apply to
> everything?  It does have a small probability of slowing down transfers
> under certain circumstances, but on the other hand, it's quite possible
> that other controller types will have the same sort of weakness as the
> AMD ones.

We've tested this application on a handful of other platforms and have
really only had issues with this desktop and our tablet device which
is stuck on the 2.6 kernel. Looking at the output of that tablet when
it gets "stuck" (this was posted back on the original libusb post)
makes it also seem like it may have some other problem than the one we
have been debugging here. Not sure if we can provide any other input
here that would help in making that decision.

Michael Reutman
Software Engineer
Epiq Solutions
847-598-0218 x68
www.epiqsolutions.com
[  834.390818] usb 5-4: opened by process 2868: main
[  834.390842] usb 5-4: usbdev_do_ioctl: SUBMITURB
[  834.390845] usb 5-4: usbfs: process 2868 (main) did not claim interface 0 
before use
[  834.390867] usb 5-4: urb submit 880328793a40
[  834.390869] usb 5-4: userurb 009b9480, ep2 bulk-in, length 16384
[  834.390902] usb 5-4: usbdev_do_ioctl: SUBMITURB
[  834.390907] usb 5-4: urb submit 880328793080
[  834.390908] usb 5-4: userurb 009b94c0, ep2 bulk-in, length 16384
[  834.390911] usb 5-4: usbdev_do_ioctl: SUBMITURB
[  834.390916] usb 5-4: urb submit 8803287926c0
[  834.390917] usb 5-4: userurb 009b9500, ep2 bulk-in, length 16384
[  834.390920] usb 5-4: usbdev_do_ioctl: SUBMITURB
[  834.390924] usb 5-4: urb submit 880328793800
[  834.390925] usb 5-4: userurb 009b9540, ep2 bulk-in, length 16384
[  834.391349] usb 5-4: urb complete 880328793a40
[  834.391350] usb 5-4: userurb 009b9480, ep2 bulk-in, actual_length 
16384 status 0
[  834.391352] data: 01 51 0e 0b 00 00 00 00 87 c7  
  .Q
[  834.391366] usb 5-4: usbdev_do_ioctl: REAPURBNDELAY
[  834.391369] usb 5-4: proc_reapurbnonblock: REAP 009b9480
[  834.391718] usb 5-4: urb complete 880328793080
[  834.391720] usb 5-4: userurb 009b94c0, ep2 bulk-in, actual_length 
16384 status 0
[  834.391721] data: 4d 8d 71 24 00 00 00 00 8b 8d  
  M.q$..
[  834.392218] usb 5-4: urb complete 8803287926c0
[  834.392219] usb 5-4: userurb 009b9500, ep2 bulk-in, actual_length 
16384 status 0
[  834.392221] data: 35 9d 71 24 00 00 00 00 63 e0  
  5.q$c.
[  834.392718] usb 5-4: urb complete 880328793800
[  834.392720] usb 5-4: userurb 009b9540, ep2 bulk-in, actual_length 
16384 status 0
[  834.392721] data: 1d ad 71 24 00 00 00 00 3b 33  
  ..q$;3
[  834.393655] usb 5-4: usbdev_do_ioctl: REAPURBNDELAY
[  834.393658] usb 5-4: proc_reapurbnonblock: REAP 009b94c0
[  834.393999] usb 5-4: usbdev_do_ioctl: REAPURBNDELAY
[  834.394001] usb 5-4: proc_reapurbnonblock: REAP 009b9500
[  834.394023] usb 5-4: usbdev_do_ioctl: DISCARDURB 009b9540
[  834.394152] usb 5-4: usbdev_do_ioctl: REAPURBNDELAY
[  834.394153] usb 5-4: proc_reapurbnonblock: REAP 009b9540
[  834.394716] usb 5-4: usbdev_do_ioctl: REAPURBNDELAY
[  834.404826] usb 5-4: usbdev_do_ioctl: SUBMITURB
[  834.404832] usb 5-4: urb submit 880328792a80
[  834.404834] usb 5-4: userurb 009b9540, ep2 bulk-in, length 16384
[  834.404856] usb 5-4: usbdev_do_ioctl: SUBMITURB
[  834.404859] usb 5-4: urb submit 880328792540
[  834.404860] usb 5-4: userurb 009b9500, ep2 bulk-in, length 16384
[  834.404866] usb 5-4: usbdev_do_ioctl: SUBMITURB
[  834.404869] usb 5-4: urb submit 8803287935c0
[  834.404870] usb 5-4: userurb 009b94c0, ep2 bulk-in, length 16384
[  834.404876] usb 5-4: usbdev_do_ioctl: SUBMITURB
[  834.404879] usb 5-4: urb submit 880328792780
[  834.404881] usb 5-4: userurb 009b9480, ep2 bulk-in, length 16384
[  834.405478] usb 5-4: urb complete 880328792a80
[  834.405480] usb 5-4: userurb 009b9540, ep2 bulk-in, actual_length 
16384 status 0
[  834.405481] data: 05 bd 71 24 00 00 00 00 13 86  
  ..q$..
[  834.405499] usb 5-4: usbdev_do_ioctl: REAPURBNDELAY
[  834.405501] usb 5-4: proc_reapurbnonblock: REAP 009b9540
[  834.405852] usb 5-4: urb complete 880328792540
[  834.405853] usb 5-4: userurb 009b9500, ep2 bulk-in, actual_length 
16384 status 0
[  834.405855] data: 

Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-11-18 Thread Michael Reutman
On Tue, Nov 17, 2015 at 12:21 PM, Alan Stern  wrote:
> On Tue, 17 Nov 2015, Michael Reutman wrote:
>> Ran 1 million operations of launch+cancel usb transfers last night
>> without getting into the stuck state. I'll bump up the iterations to
>> 10 million or so and run it again tonight just to be certain, but I
>> think the last patch has the fix needed for this particular hardware.
>
> Okay, that sounds good.  If everything works out, I'll write a patch
> that does all this properly and ask you to test it.

I ended up setting the usb test to run until the user decided to
cancel it and ran it overnight for a total of 18 hours. Appears to
have avoided the "stuck" state!

> PS: Assuming it does work out, I would appreciate seeing the usb_snoop
> output for a run containing just 20 iterations or so.

Do you want me to grab usb_snoop for the last patch you provided or
with the formal patch? Also, do you want usb_snoop_max enabled as
well?

Glad to be reaching a definite conclusion to this problem.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-11-18 Thread Alan Stern
On Wed, 18 Nov 2015, Michael Reutman wrote:

> On Tue, Nov 17, 2015 at 12:21 PM, Alan Stern  
> wrote:
> > On Tue, 17 Nov 2015, Michael Reutman wrote:
> >> Ran 1 million operations of launch+cancel usb transfers last night
> >> without getting into the stuck state. I'll bump up the iterations to
> >> 10 million or so and run it again tonight just to be certain, but I
> >> think the last patch has the fix needed for this particular hardware.
> >
> > Okay, that sounds good.  If everything works out, I'll write a patch
> > that does all this properly and ask you to test it.
> 
> I ended up setting the usb test to run until the user decided to
> cancel it and ran it overnight for a total of 18 hours. Appears to
> have avoided the "stuck" state!

Great!

> > PS: Assuming it does work out, I would appreciate seeing the usb_snoop
> > output for a run containing just 20 iterations or so.
> 
> Do you want me to grab usb_snoop for the last patch you provided or
> with the formal patch? Also, do you want usb_snoop_max enabled as
> well?

For the last patch.  And yes, do set usb_snoop_max.

> Glad to be reaching a definite conclusion to this problem.

I hope so.  There is one thing I'm still undecided about: Should this 
workaround be applied only to AMD/ATI controllers or should it apply to 
everything?  It does have a small probability of slowing down transfers 
under certain circumstances, but on the other hand, it's quite possible 
that other controller types will have the same sort of weakness as the 
AMD ones.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-11-17 Thread Michael Reutman
On Thu, Nov 12, 2015 at 10:46 AM, Michael Reutman
 wrote:
> On Mon, Nov 9, 2015 at 9:21 AM, Alan Stern  wrote:
>> Here's a revised quick-and-dirty workaround.  It replaces the earlier
>> one.  (And like the earlier one, it is meant to apply on top of
>> the usbfs-snoop patch, but I don't think that makes any difference
>> now.)
>>
>> The idea is to wait until the QH has remain unchanged for at least 2 ms
>> before deciding that the host controller has stopped using it.  If this
>> doesn't fix the problem then I will be completely out of ideas and will
>> have to ask for help from someone at AMD.
>>
>> When testing this, don't bother with the usbfs-snoop stuff.  With luck
>> it won't matter, because the test program won't hang.
>
> Apologies for the delay, been working on another project recently that
> has taken up most of my time.
>
> I applied the patch and have ran our application for 20 to 30 minutes
> thus far without issue. Later on tonight I'll set it up for an
> overnight test to ensure that it doesn't get stuck somewhere further
> down the line.
>
> Hopefully this is the fix!

Ran 1 million operations of launch+cancel usb transfers last night
without getting into the stuck state. I'll bump up the iterations to
10 million or so and run it again tonight just to be certain, but I
think the last patch has the fix needed for this particular hardware.

Michael Reutman
Software Engineer
Epiq Solutions
847-598-0218 x68
www.epiqsolutions.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-11-17 Thread Alan Stern
On Tue, 17 Nov 2015, Michael Reutman wrote:

> > Apologies for the delay, been working on another project recently that
> > has taken up most of my time.
> >
> > I applied the patch and have ran our application for 20 to 30 minutes
> > thus far without issue. Later on tonight I'll set it up for an
> > overnight test to ensure that it doesn't get stuck somewhere further
> > down the line.
> >
> > Hopefully this is the fix!
> 
> Ran 1 million operations of launch+cancel usb transfers last night
> without getting into the stuck state. I'll bump up the iterations to
> 10 million or so and run it again tonight just to be certain, but I
> think the last patch has the fix needed for this particular hardware.

Okay, that sounds good.  If everything works out, I'll write a patch 
that does all this properly and ask you to test it.

Alan Stern

PS: Assuming it does work out, I would appreciate seeing the usb_snoop 
output for a run containing just 20 iterations or so.

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-11-12 Thread Michael Reutman
On Mon, Nov 9, 2015 at 9:21 AM, Alan Stern  wrote:
> Here's a revised quick-and-dirty workaround.  It replaces the earlier
> one.  (And like the earlier one, it is meant to apply on top of
> the usbfs-snoop patch, but I don't think that makes any difference
> now.)
>
> The idea is to wait until the QH has remain unchanged for at least 2 ms
> before deciding that the host controller has stopped using it.  If this
> doesn't fix the problem then I will be completely out of ideas and will
> have to ask for help from someone at AMD.
>
> When testing this, don't bother with the usbfs-snoop stuff.  With luck
> it won't matter, because the test program won't hang.

Apologies for the delay, been working on another project recently that
has taken up most of my time.

I applied the patch and have ran our application for 20 to 30 minutes
thus far without issue. Later on tonight I'll set it up for an
overnight test to ensure that it doesn't get stuck somewhere further
down the line.

Hopefully this is the fix!

Michael Reutman
Software Engineer
Epiq Solutions
847-598-0218 x68
www.epiqsolutions.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-11-09 Thread Alan Stern
On Mon, 2 Nov 2015, Michael Reutman wrote:

> On Wed, Oct 28, 2015 at 10:48 AM, Alan Stern  
> wrote:
> > The patch below is meant to apply on top of the previous patch.  It's a
> > quick-and-dirty workaround which ought to prevent the problem from
> > occurring.  Let's see if it does.
> 
> Unfortunately, I have some bad news. It appears that the last patch
> just pushes the issue further on down in time. Ran the test fresh
> today and usb got "stuck" in around 15 to 20 minutes. This has
> definitely improved things from before where it took less than a
> minute to get everything out of sorts, so maybe we are on the right
> track?

Here's a revised quick-and-dirty workaround.  It replaces the earlier 
one.  (And like the earlier one, it is meant to apply on top of 
the usbfs-snoop patch, but I don't think that makes any difference 
now.)

The idea is to wait until the QH has remain unchanged for at least 2 ms 
before deciding that the host controller has stopped using it.  If this 
doesn't fix the problem then I will be completely out of ideas and will 
have to ask for help from someone at AMD.

When testing this, don't bother with the usbfs-snoop stuff.  With luck 
it won't matter, because the test program won't hang.

Alan Stern




Index: usb-4.3/drivers/usb/host/ehci-q.c
===
--- usb-4.3.orig/drivers/usb/host/ehci-q.c
+++ usb-4.3/drivers/usb/host/ehci-q.c
@@ -1341,8 +1341,22 @@ static void end_unlink_async(struct ehci
}
 
/* Otherwise process only the first waiting QH (NVIDIA bug?) */
-   else
+   else {
+   __hc32  qh_current, qh_token, old_current, old_token;
+
list_move_tail(>unlink_node, >async_idle);
+   old_current = qh->hw->hw_current;
+   old_token = qh->hw->hw_token;
+   for (;;) {
+   udelay(2000);
+   qh_current = READ_ONCE(qh->hw->hw_current);
+   qh_token = READ_ONCE(qh->hw->hw_token);
+   if (qh_current == old_current && qh_token == old_token)
+   break;
+   old_current = qh_current;
+   old_token = qh_token;
+   }
+   }
 
/* Start a new IAA cycle if any QHs are waiting for it */
if (!list_empty(>async_unlink))

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-11-03 Thread Alan Stern
On Tue, 3 Nov 2015, Peter Chen wrote:

> On Mon, Nov 02, 2015 at 11:24:54AM -0500, Alan Stern wrote:
> > On Mon, 2 Nov 2015, Peter Chen wrote:
> > 
> > > Hi Alan,
> > > 
> > > After a discussion with IC engineer, a brief idea like below:
> > > 
> > > - Delete the QH from the asynclist directly, Issue the first IAA
> > >   routine, this is we do now.
> > > - During the first IAA interrupt, judge if asynclistaddr is next address 
> > > of
> > >   the QH we want to remove, if it does not, issue the second IAA
> > >   routine, else, we consider the controller will not process the removed
> > >   QH any more, so it is safe to remove.
> > > - If the controller still handles the QH previous than the one we want to
> > >   remove , issue IAA again.
> > 
> > I don't understand this completely.  It sounds like you're saying:
> > 
> > Find qh2 such that qh2->qh_next == 
> > 
> > Keep doing IAA interrupts until (asynclistaddr != qh2->qh_dma &&
> > asynclistaddr != qh->qh_dma).
> > 
> > Is that what you mean?  It doesn't sound right.  Consider the case 
> > where the async list contains only two QHs: qh2 and qh.  Then 
> > asynclistaddr will always point to one or the other.
> > 
> > What about this instead?
> > 
> > Keep doing IAA interrupts until (asynclistaddr != qh->qh_dma).
> > 
> > Then do one more IAA interrupt.
> > 
> 
> That's may not enough, it does not consider the QH which we want to
> remove is doing transfer now.

Whoops, you're right.  I forgot that asynclistaddr is the address of 
the next QH to execute, not the address of the currently executing QH.

> We need to know clearly what things the IAA can make sure for us, if
> these things can prove the controller thinks the QH has removed.

I'm convinced that for these AMD/ATI controllers, IAA tells us
practically nothing.  If it worked properly then this bug would not
have occurred.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-11-02 Thread Alan Stern
On Mon, 2 Nov 2015, Peter Chen wrote:

> Hi Alan,
> 
> After a discussion with IC engineer, a brief idea like below:
> 
> - Delete the QH from the asynclist directly, Issue the first IAA
>   routine, this is we do now.
> - During the first IAA interrupt, judge if asynclistaddr is next address of
>   the QH we want to remove, if it does not, issue the second IAA
>   routine, else, we consider the controller will not process the removed
>   QH any more, so it is safe to remove.
> - If the controller still handles the QH previous than the one we want to
>   remove , issue IAA again.

I don't understand this completely.  It sounds like you're saying:

Find qh2 such that qh2->qh_next == 

Keep doing IAA interrupts until (asynclistaddr != qh2->qh_dma &&
asynclistaddr != qh->qh_dma).

Is that what you mean?  It doesn't sound right.  Consider the case 
where the async list contains only two QHs: qh2 and qh.  Then 
asynclistaddr will always point to one or the other.

What about this instead?

Keep doing IAA interrupts until (asynclistaddr != qh->qh_dma).

Then do one more IAA interrupt.

That seems safer.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-11-02 Thread Alan Stern
On Mon, 2 Nov 2015, Michael Reutman wrote:

> On Wed, Oct 28, 2015 at 10:48 AM, Alan Stern  
> wrote:
> > The patch below is meant to apply on top of the previous patch.  It's a
> > quick-and-dirty workaround which ought to prevent the problem from
> > occurring.  Let's see if it does.
> 
> Unfortunately, I have some bad news. It appears that the last patch
> just pushes the issue further on down in time. Ran the test fresh
> today and usb got "stuck" in around 15 to 20 minutes. This has
> definitely improved things from before where it took less than a
> minute to get everything out of sorts, so maybe we are on the right
> track?

Okay, I was afraid of that.  I'll have something else for you to try in 
a few days; things are a little busy here right now.

> I captured dmesg and async output as previous and have attached them below.
> 
> Also, should I still be enabling usbfs_snoop and usbfs_snoop_max
> still? I tried it with and without those enabled just to see if they
> made a difference, but they appear to end in the same state in
> approximately the same amount of time.

At this point they don't really matter any more.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-11-02 Thread Peter Chen
On Fri, Oct 30, 2015 at 11:15:05AM -0400, Alan Stern wrote:
> On Fri, 30 Oct 2015, Peter Chen wrote:
> 
> > After reading spec at 4.8.2:
> > 
> > Software must not remove an active queue head from the schedule.
> > Software should first deactivate all active qTDs, wait for the
> > queue head to go inactive, then remove the queue head from the
> > asynchronous list.
> > 
> > But current software removes an active queue head first, then
> > deactivates qTDs for the unlinked queue head. What's possible
> > harm if we deactivates qTDs first, then unlink the queue head
> > directly without waiting.
> 
> Right, that's the relevant part of the spec.
> 
> > > [In practice it's not feasible to wait for an active QH to become
> > > inactive before removing it, for several reasons.  For one, the QH may
> > > _never_ become inactive (if the endpoint NAKs indefinitely). 
> > 
> > Yes, 
> > 
> > > For
> > > another, the procedure given in the spec (deactivate the qTDs on the
> > > queue) is racy, since the controller can perform a new overlay or
> > > writeback at any time.]
> > > 
> > 
> > But that queue head will be unlinked soon, software (or controller) will
> > not use a new overlay or a new writeback, isn't it?
> 
> You don't know that.  Until the controller sees that the QH has been
> unlinked, it can do a new overlay or writeback at any time.  Don't
> forget that the controller keeps an on-chip cached copy of the QH.  So
> we could have a race like this:
> 
>   Initially the QH is marked active, both in RAM and in the
>   controller's cache.  The next qTD in the queue is marked
>   active.
> 
>   The driver sees that the QH in RAM is active.  It decides to
>   follow the recommendation in the spec.
> 
>   But before the driver can do anything, the controller finishes
>   the current transfer.  It marks the QH inactive in its cache
>   and writes the cached QH back to RAM.
> 
>   Next, the controller seeing that the QH is inactive, reads the
>   next qTD into its cache and performs an overlay.  This marks 
>   the cached version of the QH as active again.
> 
>   The driver marks the qTDs inactive and waits for the QH to become
>   inactive.  Since the QH in RAM is already marked inactive, the
>   driver doesn't have to wait long.  It removes the QH from the
>   async list.
> 
>   The controller writes the cached QH back to RAM, so now the QH 
>   in RAM is active again.  But it has already been removed from
>   the list!
> 
> Other races of the same general sort are also possible.
> 

Hi Alan,

After a discussion with IC engineer, a brief idea like below:

- Delete the QH from the asynclist directly, Issue the first IAA
  routine, this is we do now.
- During the first IAA interrupt, judge if asynclistaddr is next address of
  the QH we want to remove, if it does not, issue the second IAA
  routine, else, we consider the controller will not process the removed
  QH any more, so it is safe to remove.
- If the controller still handles the QH previous than the one we want to
  remove , issue IAA again.

-- 

Best Regards,
Peter Chen
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-10-30 Thread Peter Chen
On Thu, Oct 29, 2015 at 10:37:28AM -0400, Alan Stern wrote:

>

After reading spec at 4.8.2:

Software must not remove an active queue head from the schedule.
Software should first deactivate all active qTDs, wait for the
queue head to go inactive, then remove the queue head from the
asynchronous list.

But current software removes an active queue head first, then
deactivates qTDs for the unlinked queue head. What's possible
harm if we deactivates qTDs first, then unlink the queue head
directly without waiting.

> [In practice it's not feasible to wait for an active QH to become
> inactive before removing it, for several reasons.  For one, the QH may
> _never_ become inactive (if the endpoint NAKs indefinitely). 

Yes, 

> For
> another, the procedure given in the spec (deactivate the qTDs on the
> queue) is racy, since the controller can perform a new overlay or
> writeback at any time.]
> 

But that queue head will be unlinked soon, software (or controller) will
not use a new overlay or a new writeback, isn't it?

-- 

Best Regards,
Peter Chen
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-10-30 Thread Alan Stern
On Fri, 30 Oct 2015, Peter Chen wrote:

> After reading spec at 4.8.2:
> 
> Software must not remove an active queue head from the schedule.
> Software should first deactivate all active qTDs, wait for the
> queue head to go inactive, then remove the queue head from the
> asynchronous list.
> 
> But current software removes an active queue head first, then
> deactivates qTDs for the unlinked queue head. What's possible
> harm if we deactivates qTDs first, then unlink the queue head
> directly without waiting.

Right, that's the relevant part of the spec.

> > [In practice it's not feasible to wait for an active QH to become
> > inactive before removing it, for several reasons.  For one, the QH may
> > _never_ become inactive (if the endpoint NAKs indefinitely). 
> 
> Yes, 
> 
> > For
> > another, the procedure given in the spec (deactivate the qTDs on the
> > queue) is racy, since the controller can perform a new overlay or
> > writeback at any time.]
> > 
> 
> But that queue head will be unlinked soon, software (or controller) will
> not use a new overlay or a new writeback, isn't it?

You don't know that.  Until the controller sees that the QH has been
unlinked, it can do a new overlay or writeback at any time.  Don't
forget that the controller keeps an on-chip cached copy of the QH.  So
we could have a race like this:

Initially the QH is marked active, both in RAM and in the
controller's cache.  The next qTD in the queue is marked
active.

The driver sees that the QH in RAM is active.  It decides to
follow the recommendation in the spec.

But before the driver can do anything, the controller finishes
the current transfer.  It marks the QH inactive in its cache
and writes the cached QH back to RAM.

Next, the controller seeing that the QH is inactive, reads the
next qTD into its cache and performs an overlay.  This marks 
the cached version of the QH as active again.

The driver marks the qTDs inactive and waits for the QH to become
inactive.  Since the QH in RAM is already marked inactive, the
driver doesn't have to wait long.  It removes the QH from the
async list.

The controller writes the cached QH back to RAM, so now the QH 
in RAM is active again.  But it has already been removed from
the list!

Other races of the same general sort are also possible.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-10-29 Thread Alan Stern
On Thu, 29 Oct 2015, Peter Chen wrote:

> > > [In practice it's not feasible to wait for an active QH to become
> > > inactive before removing it, for several reasons.  For one, the QH may
> > > _never_ become inactive (if the endpoint NAKs indefinitely).  For
> > > another, the procedure given in the spec (deactivate the qTDs on the
> > > queue) is racy, since the controller can perform a new overlay or
> > > writeback at any time.]
> > > 
> 
> Alan, one question, what will happen if we never remove an active QH from
> async list?

You mean, what would happen if the driver simply left the QH in the 
async list instead of removing it?  There would be two consequences.

First, the controller would continue to perform DMA reads of the QH.  
These unnecessary reads would consume both PCI (or whatever bus the
controller is on) and memory bandwidth.

Second, when you unlinked an active URB, the controller would not stop 
the transfer.  It would continue even after the URB was unlinked, until 
the the transfer was complete, an error occurred, or the controller was 
shut down.  Basically, it would be impossible to unlink an active URB.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-10-29 Thread Peter Chen
On Thu, Oct 22, 2015 at 05:14:41PM -0400, Alan Stern wrote:
> [Resend with Hans's correct email address this time...]
> 
> On Thu, 22 Oct 2015, Alan Stern wrote:
> 
> > Hans and everyone else:
> > 
> > This continues the discussion of a problem originally posted to the 
> > libusb-devel mailing list
> > (see  if 
> > you're curious).
> > 
> > The EHCI controller in question is an AMD/ATI SB7x0/SB8x0/SB9x0, as 
> > found on the RX780/RX790 motherboard.  I haven't seen this problem 
> > occur with Intel hardware.
> > 
> > The problem arises when an active bulk-in QH is removed from the async
> > schedule.  The current qTD is cancelled, and it is the last qTD on the
> > QH's queue.  At the time the QH is removed from the async list, the
> > overlay region shows that only a fraction of the qTD has been completed
> > (maybe 4 KB transferred out of 16 KB total).
> > 
> > 10 ms later, four new qTDs are added to the QH and it gets added back
> > to the async schedule.  Although I don't know this for certain, I
> > believe the second of these qTDs is stored at the same address as the
> > one that was cancelled.  That's what naturally would happen if the
> > memory pool satisfies an allocation from the most recently freed area.
> > 
> > Anyway, a short time later, it sometimes happen that the controller
> > gets stuck.  The Active bit in the QH's overlay region is clear, and
> > the Current and Next qTD pointers both point to the second qTD in the
> > queue, which obviously is why the controller is not making any forward
> > progress.  The first qTD's Active bit is still set and its Bytes To
> > Transfer is still set to 16 KB.  The second qTD's Active bit is off and
> > its Bytes To Transfer is 0.  In spite of this, neither qTD's data
> > buffer has been overwritten.
> > 
> > Although it's hard to tell exactly what went wrong, my guess is that
> > the after the QH was removed from the async schedule, the controller
> > continued to process it until all 16 KB had been transferred.  (This
> > would have taken no more than 0.5 ms.)  Then at some point, the QH
> > overlay and the now-completed qTD were written back -- that would
> > explain why the second qTD in the queue shows up as not Active and with
> > no bytes remaining to transfer.
> > 
> > On the other hand, that qTD wasn't reused until 10 ms after the QH was
> > removed from the schedule, and it was completely reinitialized before
> > reuse.  The write-back must have occurred later than this; I have no
> > idea why.  I also don't know why the write-back of the QH's overlay
> > region didn't overwrite the Next qTD pointer.
> > 
> > 
> > This is clearly a complicated problem.  It's possible that we're simply
> > dealing with defective hardware, but I tend to doubt it.  It seems more
> > likely that the problem is caused by improperly removing the active QH
> > from the async schedule.  The driver does not follow the instructions
> > given in section 4.8.2 of the EHCI spec, which says that software
> > should not remove active QHs.
> > 
> > [In practice it's not feasible to wait for an active QH to become
> > inactive before removing it, for several reasons.  For one, the QH may
> > _never_ become inactive (if the endpoint NAKs indefinitely).  For
> > another, the procedure given in the spec (deactivate the qTDs on the
> > queue) is racy, since the controller can perform a new overlay or
> > writeback at any time.]
> > 

Alan, one question, what will happen if we never remove an active QH from
async list?

-- 

Best Regards,
Peter Chen
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-10-28 Thread Michael Reutman
On Tue, Oct 27, 2015 at 6:35 PM, Michael Reutman
 wrote:
> On Mon, Oct 26, 2015 at 11:11 AM, Alan Stern  
> wrote:
>> Michael:
>>
>> In the absence of any suggestions, we might as well try to verify my
>> idea about what's going wrong.  The patch below adds a little more
>> debugging information; please try repeating the last test with this
>> patch in place of the previous one.
>
> Sorry for the delayed response, I promise I will run your patch
> tomorrow and post the results immediately after. Got tied up with
> another project.

Alan: I ran the test again with your latest patch and collected the
output of ehci async and dmesg. You should find their output attached
to this email. Let me know if you need anything else.

Michael Reutman
Software Engineer
Epiq Solutions
847-598-0218 x68
www.epiqsolutions.com
[ 1470.108244] audit: type=1400 audit(1446042463.754:50): apparmor="STATUS" 
operation="profile_replace" name="/usr/lib/cups/backend/cups-pdf" pid=2890 
comm="apparmor_parser"
[ 1470.108250] audit: type=1400 audit(1446042463.754:51): apparmor="STATUS" 
operation="profile_replace" name="/usr/sbin/cupsd" pid=2890 
comm="apparmor_parser"
[ 1470.108524] audit: type=1400 audit(1446042463.754:52): apparmor="STATUS" 
operation="profile_replace" name="/usr/sbin/cupsd" pid=2890 
comm="apparmor_parser"
[ 1520.574776] usb 5-4: opened by process 3041: main
[ 1520.574800] usb 5-4: usbdev_do_ioctl: SUBMITURB
[ 1520.574802] usb 5-4: usbfs: process 3041 (main) did not claim interface 0 
before use
[ 1520.574824] usb 5-4: urb submit 8800c3a503c0
[ 1520.574826] usb 5-4: userurb 019bf480, ep2 bulk-in, length 16384
[ 1520.574859] usb 5-4: usbdev_do_ioctl: SUBMITURB
[ 1520.574863] usb 5-4: urb submit 8800c3a506c0
[ 1520.574864] usb 5-4: userurb 019bf4c0, ep2 bulk-in, length 16384
[ 1520.574869] usb 5-4: usbdev_do_ioctl: SUBMITURB
[ 1520.574873] usb 5-4: urb submit 8800c3a51200
[ 1520.574874] usb 5-4: userurb 019bf500, ep2 bulk-in, length 16384
[ 1520.574878] usb 5-4: usbdev_do_ioctl: SUBMITURB
[ 1520.574882] usb 5-4: urb submit 8800c3a51d40
[ 1520.574883] usb 5-4: userurb 019bf540, ep2 bulk-in, length 16384
[ 1520.575251] usb 5-4: urb complete 8800c3a503c0
[ 1520.575252] usb 5-4: userurb 019bf480, ep2 bulk-in, actual_length 
16384 status 0
[ 1520.575255] data: 49 a2 88 39 00 00 00 00 9c bf  
  I..9..
[ 1520.575266] usb 5-4: usbdev_do_ioctl: REAPURBNDELAY
[ 1520.575270] usb 5-4: proc_reapurbnonblock: REAP 019bf480
[ 1520.575572] usb 5-4: usbdev_do_ioctl: REAPURBNDELAY
[ 1520.575621] usb 5-4: urb complete 8800c3a506c0
[ 1520.575623] usb 5-4: userurb 019bf4c0, ep2 bulk-in, actual_length 
16384 status 0
[ 1520.575624] data: 9e b7 e0 65 00 00 00 00 f6 6b  
  ...e.k
[ 1520.575628] usb 5-4: usbdev_do_ioctl: REAPURBNDELAY
[ 1520.575629] usb 5-4: proc_reapurbnonblock: REAP 019bf4c0
[ 1520.576123] usb 5-4: urb complete 8800c3a51200
[ 1520.576125] usb 5-4: userurb 019bf500, ep2 bulk-in, actual_length 
16384 status 0
[ 1520.576126] data: 86 c7 e0 65 00 00 00 00 ce be  
  ...e..
[ 1520.576188] usb 5-4: usbdev_do_ioctl: REAPURBNDELAY
[ 1520.576190] usb 5-4: proc_reapurbnonblock: REAP 019bf500
[ 1520.576202] usb 5-4: usbdev_do_ioctl: DISCARDURB 019bf500
[ 1520.576208] usb 5-4: usbdev_do_ioctl: DISCARDURB 019bf540
[ 1520.576374] usb 5-4: urb complete 8800c3a51d40
[ 1520.576375] usb 5-4: userurb 019bf540, ep2 bulk-in, actual_length 
4096 status -2
[ 1520.576377] data: 6e d7 e0 65 00 00 00 00 a6 11  
  n..e..
[ 1520.576386] usb 5-4: usbdev_do_ioctl: REAPURBNDELAY
[ 1520.576387] usb 5-4: proc_reapurbnonblock: REAP 019bf540
[ 1520.576696] usb 5-4: usbdev_do_ioctl: REAPURBNDELAY
[ 1520.586806] usb 5-4: usbdev_do_ioctl: SUBMITURB
[ 1520.586813] usb 5-4: urb submit 880325afc480
[ 1520.586814] usb 5-4: userurb 019bf540, ep2 bulk-in, length 16384
[ 1520.586829] usb 5-4: usbdev_do_ioctl: SUBMITURB
[ 1520.586832] usb 5-4: urb submit 880325afd200
[ 1520.586833] usb 5-4: userurb 019bf500, ep2 bulk-in, length 16384
[ 1520.586837] usb 5-4: usbdev_do_ioctl: SUBMITURB
[ 1520.586839] usb 5-4: urb submit 880325afc6c0
[ 1520.586840] usb 5-4: userurb 019bf4c0, ep2 bulk-in, length 16384
[ 1520.586844] usb 5-4: usbdev_do_ioctl: SUBMITURB
[ 1520.586846] usb 5-4: urb submit 880325afc300
[ 1520.586847] usb 5-4: userurb 019bf480, ep2 bulk-in, length 16384
[ 1520.587508] usb 5-4: urb complete 880325afc480
[ 1520.587511] usb 5-4: userurb 019bf540, ep2 bulk-in, actual_length 
16384 status 0
[ 1520.587513] data: 68 db e0 65 00 00 00 00 5c 26

Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-10-28 Thread Alan Stern
On Wed, 28 Oct 2015, Michael Reutman wrote:

> Alan: I ran the test again with your latest patch and collected the
> output of ehci async and dmesg. You should find their output attached
> to this email. Let me know if you need anything else.

A smoking gun!  Notice near the end of your dmesg log this line:

[ 1520.794032] ehci-pci :00:16.2: qh 88032718ca80 should be inactive!

It appears just before the hang occurred and nowhere else.  That means 
my suspicion was correct.

The patch below is meant to apply on top of the previous patch.  It's a 
quick-and-dirty workaround which ought to prevent the problem from 
occurring.  Let's see if it does.

Alan Stern



Index: usb-4.3/drivers/usb/host/ehci-q.c
===
--- usb-4.3.orig/drivers/usb/host/ehci-q.c
+++ usb-4.3/drivers/usb/host/ehci-q.c
@@ -1341,8 +1341,11 @@ static void end_unlink_async(struct ehci
}
 
/* Otherwise process only the first waiting QH (NVIDIA bug?) */
-   else
+   else {
list_move_tail(>unlink_node, >async_idle);
+   if (!list_empty(>qtd_list))
+   udelay(2000);
+   }
 
/* Start a new IAA cycle if any QHs are waiting for it */
if (!list_empty(>async_unlink))

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-10-28 Thread Michael Reutman
On Wed, Oct 28, 2015 at 10:48 AM, Alan Stern  wrote:
> A smoking gun!  Notice near the end of your dmesg log this line:
>
> [ 1520.794032] ehci-pci :00:16.2: qh 88032718ca80 should be inactive!
>
> It appears just before the hang occurred and nowhere else.  That means
> my suspicion was correct.
>
> The patch below is meant to apply on top of the previous patch.  It's a
> quick-and-dirty workaround which ought to prevent the problem from
> occurring.  Let's see if it does.
>
> Alan Stern

Wow, this patch has made a definite improvement! Normally, usb would
get "stuck" within 5-15 seconds of starting the application. So far,
it has ran for roughly 15 minutes without any apparent fault.

I'll be unable to access this computer for a few days, but I'll leave
it up and running while I'm gone to see if it can survive long term.

Michael Reutman
Software Engineer
Epiq Solutions
847-598-0218 x68
www.epiqsolutions.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-10-27 Thread Michael Reutman
On Mon, Oct 26, 2015 at 11:11 AM, Alan Stern  wrote:
> Michael:
>
> In the absence of any suggestions, we might as well try to verify my
> idea about what's going wrong.  The patch below adds a little more
> debugging information; please try repeating the last test with this
> patch in place of the previous one.

Sorry for the delayed response, I promise I will run your patch
tomorrow and post the results immediately after. Got tied up with
another project.

Michael Reutman
Software Engineer
Epiq Solutions
847-598-0218 x68
www.epiqsolutions.com
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-10-26 Thread Alan Stern
Michael:

In the absence of any suggestions, we might as well try to verify my 
idea about what's going wrong.  The patch below adds a little more 
debugging information; please try repeating the last test with this 
patch in place of the previous one.

Alan Stern



Index: usb-4.3/drivers/usb/core/devio.c
===
--- usb-4.3.orig/drivers/usb/core/devio.c
+++ usb-4.3/drivers/usb/core/devio.c
@@ -100,6 +100,11 @@ static bool usbfs_snoop;
 module_param(usbfs_snoop, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(usbfs_snoop, "true to log all usbfs traffic");
 
+static unsigned usbfs_snoop_max = 65536;
+module_param(usbfs_snoop_max, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(usbfs_snoop_max,
+   "maximum number of bytes to print while snooping");
+
 #define snoop(dev, format, arg...) \
do {\
if (usbfs_snoop)\
@@ -392,6 +397,7 @@ static void snoop_urb(struct usb_device
ep, t, d, length, timeout_or_status);
}
 
+   data_len = min(data_len, usbfs_snoop_max);
if (data && data_len > 0) {
print_hex_dump(KERN_DEBUG, "data: ", DUMP_PREFIX_NONE, 32, 1,
data, data_len, 1);
@@ -402,7 +408,8 @@ static void snoop_urb_data(struct urb *u
 {
int i, size;
 
-   if (!usbfs_snoop)
+   len = min(len, usbfs_snoop_max);
+   if (!usbfs_snoop || len == 0)
return;
 
if (urb->num_sgs == 0) {
@@ -510,7 +517,9 @@ static void async_completed(struct urb *
cred = get_cred(as->cred);
secid = as->secid;
}
-   snoop(>dev->dev, "urb complete\n");
+   if (urb->actual_length == 0 && urb->transfer_buffer_length == 16384)
+   urb->actual_length = 16;
+   snoop(>dev->dev, "urb complete %p\n", as->urb);
snoop_urb(urb->dev, as->userurb, urb->pipe, urb->actual_length,
as->status, COMPLETE, NULL, 0);
if ((urb->transfer_flags & URB_DIR_MASK) == URB_DIR_IN)
@@ -1372,9 +1381,9 @@ static int proc_do_submiturb(struct usb_
uurb->type = USBDEVFS_URB_TYPE_INTERRUPT;
goto interrupt_urb;
}
-   num_sgs = DIV_ROUND_UP(uurb->buffer_length, USB_SG_SIZE);
-   if (num_sgs == 1 || num_sgs > ps->dev->bus->sg_tablesize)
-   num_sgs = 0;
+// num_sgs = DIV_ROUND_UP(uurb->buffer_length, USB_SG_SIZE);
+// if (num_sgs == 1 || num_sgs > ps->dev->bus->sg_tablesize)
+// num_sgs = 0;
if (ep->streams)
stream_id = uurb->stream_id;
break;
@@ -1490,14 +1499,15 @@ static int proc_do_submiturb(struct usb_
ret = -EFAULT;
goto error;
}
-   } else if (uurb->type == USBDEVFS_URB_TYPE_ISO) {
+// } else if (uurb->type == USBDEVFS_URB_TYPE_ISO) {
+   } else {
/*
 * Isochronous input data may end up being
 * discontiguous if some of the packets are short.
 * Clear the buffer so that the gaps don't leak
 * kernel data to userspace.
 */
-   memset(as->urb->transfer_buffer, 0,
+   memset(as->urb->transfer_buffer, 0x55,
uurb->buffer_length);
}
}
@@ -1554,6 +1564,7 @@ static int proc_do_submiturb(struct usb_
as->pid = get_pid(task_pid(current));
as->cred = get_current_cred();
security_task_getsecid(current, >secid);
+   snoop(>dev->dev, "urb submit %p\n", as->urb);
snoop_urb(ps->dev, as->userurb, as->urb->pipe,
as->urb->transfer_buffer_length, 0, SUBMIT,
NULL, 0);
@@ -1709,8 +1720,12 @@ static struct async *reap_as(struct usb_
 static int proc_reapurb(struct usb_dev_state *ps, void __user *arg)
 {
struct async *as = reap_as(ps);
+
if (as) {
-   int retval = processcompl(as, (void __user * __user *)arg);
+   int retval;
+
+   snoop(>dev->dev, "%s: REAP %p\n", __func__, as->userurb);
+   retval = processcompl(as, (void __user * __user *)arg);
free_async(as);
return retval;
}
@@ -1726,6 +1741,7 @@ static int proc_reapurbnonblock(struct u
 
as = async_getcompleted(ps);
if (as) {
+   snoop(>dev->dev, "%s: REAP %p\n", __func__, as->userurb);
retval = processcompl(as, (void __user * __user *)arg);
free_async(as);
} else {
@@ -1852,8 

Re: Help needed for EHCI problem: removing an active bulk-in QH

2015-10-22 Thread Alan Stern
[Resend with Hans's correct email address this time...]

On Thu, 22 Oct 2015, Alan Stern wrote:

> Hans and everyone else:
> 
> This continues the discussion of a problem originally posted to the 
> libusb-devel mailing list
> (see  if 
> you're curious).
> 
> The EHCI controller in question is an AMD/ATI SB7x0/SB8x0/SB9x0, as 
> found on the RX780/RX790 motherboard.  I haven't seen this problem 
> occur with Intel hardware.
> 
> The problem arises when an active bulk-in QH is removed from the async
> schedule.  The current qTD is cancelled, and it is the last qTD on the
> QH's queue.  At the time the QH is removed from the async list, the
> overlay region shows that only a fraction of the qTD has been completed
> (maybe 4 KB transferred out of 16 KB total).
> 
> 10 ms later, four new qTDs are added to the QH and it gets added back
> to the async schedule.  Although I don't know this for certain, I
> believe the second of these qTDs is stored at the same address as the
> one that was cancelled.  That's what naturally would happen if the
> memory pool satisfies an allocation from the most recently freed area.
> 
> Anyway, a short time later, it sometimes happen that the controller
> gets stuck.  The Active bit in the QH's overlay region is clear, and
> the Current and Next qTD pointers both point to the second qTD in the
> queue, which obviously is why the controller is not making any forward
> progress.  The first qTD's Active bit is still set and its Bytes To
> Transfer is still set to 16 KB.  The second qTD's Active bit is off and
> its Bytes To Transfer is 0.  In spite of this, neither qTD's data
> buffer has been overwritten.
> 
> Although it's hard to tell exactly what went wrong, my guess is that
> the after the QH was removed from the async schedule, the controller
> continued to process it until all 16 KB had been transferred.  (This
> would have taken no more than 0.5 ms.)  Then at some point, the QH
> overlay and the now-completed qTD were written back -- that would
> explain why the second qTD in the queue shows up as not Active and with
> no bytes remaining to transfer.
> 
> On the other hand, that qTD wasn't reused until 10 ms after the QH was
> removed from the schedule, and it was completely reinitialized before
> reuse.  The write-back must have occurred later than this; I have no
> idea why.  I also don't know why the write-back of the QH's overlay
> region didn't overwrite the Next qTD pointer.
> 
> 
> This is clearly a complicated problem.  It's possible that we're simply
> dealing with defective hardware, but I tend to doubt it.  It seems more
> likely that the problem is caused by improperly removing the active QH
> from the async schedule.  The driver does not follow the instructions
> given in section 4.8.2 of the EHCI spec, which says that software
> should not remove active QHs.
> 
> [In practice it's not feasible to wait for an active QH to become
> inactive before removing it, for several reasons.  For one, the QH may
> _never_ become inactive (if the endpoint NAKs indefinitely).  For
> another, the procedure given in the spec (deactivate the qTDs on the
> queue) is racy, since the controller can perform a new overlay or
> writeback at any time.]
> 
> In an attempt to cope with potential problems, the Linux EHCI driver
> goes through _two_ Interrupt on Async Advance (IAA) cycles after taking
> a QH off the async list before considering it to be fully gone from the
> schedule.  (I have observed situations where the QH overlay region was
> written back _after_ the first IAA interrupt.)  But it seems that this
> isn't enough.
> 
> As far as I can see, the only alternative is to stop the async schedule
> whenever an active QH has to be removed.  But that would impose a
> significant penalty on any other async transfers, so I really don't
> want to do it.
> 
> Hans, can you describe how the BSD EHCI driver handles this issue?  
> 
> Any ideas for fixing this or suggestions for additional debugging would 
> be welcome.
> 
> Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Help needed for EHCI problem: removing an active bulk-in QH

2015-10-22 Thread Alan Stern
Hans and everyone else:

This continues the discussion of a problem originally posted to the 
libusb-devel mailing list
(see  if 
you're curious).

The EHCI controller in question is an AMD/ATI SB7x0/SB8x0/SB9x0, as 
found on the RX780/RX790 motherboard.  I haven't seen this problem 
occur with Intel hardware.

The problem arises when an active bulk-in QH is removed from the async
schedule.  The current qTD is cancelled, and it is the last qTD on the
QH's queue.  At the time the QH is removed from the async list, the
overlay region shows that only a fraction of the qTD has been completed
(maybe 4 KB transferred out of 16 KB total).

10 ms later, four new qTDs are added to the QH and it gets added back
to the async schedule.  Although I don't know this for certain, I
believe the second of these qTDs is stored at the same address as the
one that was cancelled.  That's what naturally would happen if the
memory pool satisfies an allocation from the most recently freed area.

Anyway, a short time later, it sometimes happen that the controller
gets stuck.  The Active bit in the QH's overlay region is clear, and
the Current and Next qTD pointers both point to the second qTD in the
queue, which obviously is why the controller is not making any forward
progress.  The first qTD's Active bit is still set and its Bytes To
Transfer is still set to 16 KB.  The second qTD's Active bit is off and
its Bytes To Transfer is 0.  In spite of this, neither qTD's data
buffer has been overwritten.

Although it's hard to tell exactly what went wrong, my guess is that
the after the QH was removed from the async schedule, the controller
continued to process it until all 16 KB had been transferred.  (This
would have taken no more than 0.5 ms.)  Then at some point, the QH
overlay and the now-completed qTD were written back -- that would
explain why the second qTD in the queue shows up as not Active and with
no bytes remaining to transfer.

On the other hand, that qTD wasn't reused until 10 ms after the QH was
removed from the schedule, and it was completely reinitialized before
reuse.  The write-back must have occurred later than this; I have no
idea why.  I also don't know why the write-back of the QH's overlay
region didn't overwrite the Next qTD pointer.


This is clearly a complicated problem.  It's possible that we're simply
dealing with defective hardware, but I tend to doubt it.  It seems more
likely that the problem is caused by improperly removing the active QH
from the async schedule.  The driver does not follow the instructions
given in section 4.8.2 of the EHCI spec, which says that software
should not remove active QHs.

[In practice it's not feasible to wait for an active QH to become
inactive before removing it, for several reasons.  For one, the QH may
_never_ become inactive (if the endpoint NAKs indefinitely).  For
another, the procedure given in the spec (deactivate the qTDs on the
queue) is racy, since the controller can perform a new overlay or
writeback at any time.]

In an attempt to cope with potential problems, the Linux EHCI driver
goes through _two_ Interrupt on Async Advance (IAA) cycles after taking
a QH off the async list before considering it to be fully gone from the
schedule.  (I have observed situations where the QH overlay region was
written back _after_ the first IAA interrupt.)  But it seems that this
isn't enough.

As far as I can see, the only alternative is to stop the async schedule
whenever an active QH has to be removed.  But that would impose a
significant penalty on any other async transfers, so I really don't
want to do it.

Hans, can you describe how the BSD EHCI driver handles this issue?  

Any ideas for fixing this or suggestions for additional debugging would 
be welcome.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html