Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-03-09 Thread Martin Pieuchot
On 26/02/15(Thu) 01:51, Martin Pieuchot wrote:
> On 22/02/15(Sun) 00:54, Martin Pieuchot wrote:
> > On 20/02/15(Fri) 23:15, Stefan Sperling wrote:
> > > In the following configuration I can pretty easily trigger endless
> > > screenfulls of scrolling "ehci_idone: ex=%p is done!" messages,
> > > where %p is to a constant pointer value (same in each message).
> > > 
> > > [ehci host]<--[usb extension cable (hub 1)]<--[usb keyboard (hub 
> > > 2)]<--[usb mouse]
> > > 
> > > So there's a USB extension cable (hub 1) that I plug a keyboard into.
> > > The keyboard in turn has a built-in hub that has a mouse plugged into it.
> > > The problem happens almost every time I pull the keyboard out of hub 1.
> > > 
> > > I've also seen it happen without any intervention on my part (In fact
> > > I was in a different city and couldn't use the box until I got home
> > > several hours later and hit the reset switch... that's no fun)
> > > 
> > > So if a transfer is cancelled (e.g. as a result of pulling the plug), 
> > > then:
> > > 
> > > - usbd_abort_pipe wants to abort a related transfer
> > > - ehci_abort_xfer schedules and waits for ehci_softintr, expecting
> > >   the softintr routine to deal with the cancelled transfer:
> > > 
> > >   /*
> > >* Step 3: Make sure the soft interrupt routine has run. This
> > >* should remove any completed items off the queue.
> > >* The hardware has no reference to completed items (TDs).
> > >* It's safe to remove them at any time.
> > >*/
> > >   s = splusb();
> > >   sc->sc_softwake = 1;
> > >   usb_schedsoftintr(&sc->sc_bus);
> > >   tsleep(&sc->sc_softwake, PZERO, "ehciab", 0);
> > > 
> > > - ehci_softintr gets scheduled
> > > - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
> > >   invokes ehci_check_intr on each
> > > - ehci_check_intr eventually ends up calling ehci_idone
> > > - ehci_idone does nothing for cancelled transfers... ?!?
> > > 
> > >   if (xfer->status == USBD_CANCELLED ||
> > >   xfer->status == USBD_TIMEOUT) {
> > >   DPRINTF(("ehci_idone: aborted xfer=%p\n", xfer));
> > >   return;
> > >   }
> > > 
> > > - something else happens
> > > 
> > > - ehci_abort_xfer awakes from tsleep and sets ex->isdone, since it
> > >   expects the softinterrupt routine to have dealt with the xfer
> > > 
> > > - something else happens
> > > 
> > > - the host controller sends an INTERR interrupt
> > > - ehci_intr1 schedules ehci_softintr
> > > - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
> > >   invokes ehci_check_intr on each
> > > - the cancelled xfer is still in the intrhead TAILQ and ends up in 
> > > ehci_idone
> > > - ehci_idone looks for the isdone flag which is now set, then it
> > >   complains and does nothing
> > > 
> > > - the host controller sends an INTERR interrupt
> > > ... same story again, we get an endless loop
> > > 
> > > This diff breaks the chain of events and fixes the endless loop for me.
> > > I can't reproduce the problem anymore by pulling the keyboard out.
> > > I don't quite understand how this prevents the flood of INTERR interrupts
> > > but it seems to work.
> > > 
> > > I assume there are nasty tentacles in USB land which I'm unfamiliar with.
> > > Is there any reason this could be a bad idea?
> > 
> > Stefan that's a really good analysis.  I think the diff might not be
> > completely correct though.
> > 
> > So basically you're removing the transfer from the active list.  That's
> > generally done after the USB callback has been executed, in your case in
> > ehci_device_intr_done().
> > 
> > For interrupt transfers (pipe->repeat is 1) the transfer is kept on the
> > list while the descriptors are freed/reallocated.  That should be safe
> > since we should be reusing the sames.  
> > 
> > So I don't know if we are missing a spl protection of if there's an xfer 
> > leak but I'm afraid that with your diff usb_transfer_complete() might
> > not be called for the failing xfer.
> > 
> > That's easy to check, look if the ehcixfer pool counter increase when
> > you detach your device.
> > 
> > I'm afraid I cannot help more as I am currently traveling :)
> 
> Here's an alternative diff that removes xfers from the list of
> currently pending transfers as soon as they are finished or
> cancelled.  This should also prevent the race you analysed.
> 
> With this diff ehci_idone() should no longer be called on a
> CANCELLED or TIMEOUT xfer which mean the 'isdone' check can
> be removed.

I reverted the previous diff because it did not properly abort transfer
that were on the pipe's list but not on the interrupt list.

Diff below should take care of that and add an extra check for
suspend/resume.  I'm looking for test reports and oks.

I tried to keep the diff small, cleanups can be done afterward.

Index: ehci.c
===
RCS file: /cvs/src/sys/dev/usb/ehci.c,v
retrieving revision 1.176
diff -u -p -r1.176 ehci.c
--- ehci.c  6 Mar 2015 

Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-27 Thread Stefan Sperling
On Fri, Feb 27, 2015 at 06:15:32PM +0100, Martin Pieuchot wrote:
> On 26/02/15(Thu) 22:45, Stefan Sperling wrote:
> > Your diff also stops the vomit of ehci_idone messages for me when
> > I unplug devices or hubs.
> 
> Does that mean you're ok with the diff?

Yes.



Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-27 Thread Martin Pieuchot
On 26/02/15(Thu) 22:45, Stefan Sperling wrote:
> On Thu, Feb 26, 2015 at 01:51:29AM +0100, Martin Pieuchot wrote:
> > On 22/02/15(Sun) 00:54, Martin Pieuchot wrote:
> > > On 20/02/15(Fri) 23:15, Stefan Sperling wrote:
> > > > In the following configuration I can pretty easily trigger endless
> > > > screenfulls of scrolling "ehci_idone: ex=%p is done!" messages,
> > > > where %p is to a constant pointer value (same in each message).
> > > > 
> > > > [ehci host]<--[usb extension cable (hub 1)]<--[usb keyboard (hub 
> > > > 2)]<--[usb mouse]
> > > > 
> > > > So there's a USB extension cable (hub 1) that I plug a keyboard into.
> > > > The keyboard in turn has a built-in hub that has a mouse plugged into 
> > > > it.
> > > > The problem happens almost every time I pull the keyboard out of hub 1.
> > > > 
> > > > I've also seen it happen without any intervention on my part (In fact
> > > > I was in a different city and couldn't use the box until I got home
> > > > several hours later and hit the reset switch... that's no fun)
> > > > 
> > > > So if a transfer is cancelled (e.g. as a result of pulling the plug), 
> > > > then:
> > > > 
> > > > - usbd_abort_pipe wants to abort a related transfer
> > > > - ehci_abort_xfer schedules and waits for ehci_softintr, expecting
> > > >   the softintr routine to deal with the cancelled transfer:
> > > > 
> > > > /*
> > > >  * Step 3: Make sure the soft interrupt routine has run. This
> > > >  * should remove any completed items off the queue.
> > > >  * The hardware has no reference to completed items (TDs).
> > > >  * It's safe to remove them at any time.
> > > >  */
> > > > s = splusb();
> > > > sc->sc_softwake = 1;
> > > > usb_schedsoftintr(&sc->sc_bus);
> > > > tsleep(&sc->sc_softwake, PZERO, "ehciab", 0);
> > > > 
> > > > - ehci_softintr gets scheduled
> > > > - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
> > > >   invokes ehci_check_intr on each
> > > > - ehci_check_intr eventually ends up calling ehci_idone
> > > > - ehci_idone does nothing for cancelled transfers... ?!?
> > > > 
> > > > if (xfer->status == USBD_CANCELLED ||
> > > > xfer->status == USBD_TIMEOUT) {
> > > > DPRINTF(("ehci_idone: aborted xfer=%p\n", xfer));
> > > > return;
> > > > }
> > > > 
> > > > - something else happens
> > > > 
> > > > - ehci_abort_xfer awakes from tsleep and sets ex->isdone, since it
> > > >   expects the softinterrupt routine to have dealt with the xfer
> > > > 
> > > > - something else happens
> > > > 
> > > > - the host controller sends an INTERR interrupt
> > > > - ehci_intr1 schedules ehci_softintr
> > > > - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
> > > >   invokes ehci_check_intr on each
> > > > - the cancelled xfer is still in the intrhead TAILQ and ends up in 
> > > > ehci_idone
> > > > - ehci_idone looks for the isdone flag which is now set, then it
> > > >   complains and does nothing
> > > > 
> > > > - the host controller sends an INTERR interrupt
> > > > ... same story again, we get an endless loop
> > > > 
> > > > This diff breaks the chain of events and fixes the endless loop for me.
> > > > I can't reproduce the problem anymore by pulling the keyboard out.
> > > > I don't quite understand how this prevents the flood of INTERR 
> > > > interrupts
> > > > but it seems to work.
> > > > 
> > > > I assume there are nasty tentacles in USB land which I'm unfamiliar 
> > > > with.
> > > > Is there any reason this could be a bad idea?
> > > 
> > > Stefan that's a really good analysis.  I think the diff might not be
> > > completely correct though.
> > > 
> > > So basically you're removing the transfer from the active list.  That's
> > > generally done after the USB callback has been executed, in your case in
> > > ehci_device_intr_done().
> > > 
> > > For interrupt transfers (pipe->repeat is 1) the transfer is kept on the
> > > list while the descriptors are freed/reallocated.  That should be safe
> > > since we should be reusing the sames.  
> > > 
> > > So I don't know if we are missing a spl protection of if there's an xfer 
> > > leak but I'm afraid that with your diff usb_transfer_complete() might
> > > not be called for the failing xfer.
> > > 
> > > That's easy to check, look if the ehcixfer pool counter increase when
> > > you detach your device.
> > > 
> > > I'm afraid I cannot help more as I am currently traveling :)
> > 
> > Here's an alternative diff that removes xfers from the list of
> > currently pending transfers as soon as they are finished or
> > cancelled.  This should also prevent the race you analysed.
> > 
> > With this diff ehci_idone() should no longer be called on a
> > CANCELLED or TIMEOUT xfer which mean the 'isdone' check can
> > be removed.
> > 
> > Does it work for you?
> 
> It does, yes. Thanks.
> Your diff also stops the vomit of ehci_idone messages for me when
> I unp

Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-27 Thread Martin Pieuchot
On 27/02/15(Fri) 12:07, Stefan Sperling wrote:
> On Thu, Feb 26, 2015 at 10:45:01PM +0100, Stefan Sperling wrote:
> > There are other issues which are not fixed by this diff and which
> > my diff did not fix either.
> 
> > But these are bugs we can investigate another day. Happy travels!
> 
> I suppose there might well be power supply issues when chaining too many
> passive hubs and devices off a single USB port on the host controller.
> I've now put an active USB hub (separate power source) at the end of
> my long-cable hub and so far things are looking good. I'll report back
> if my problems appear in this configuration, too.
> 
> My setup changed from
> 
> [ehci host]<--[usb extension cable hub]<--[usb keyboard (+hub)]<--[usb mouse]
> 
> to this:
>  [power adapter]
>|
>V
> [ehci host]<--[usb extension cable hub]<--[active hub]<--[usb keyboard (+hub)]
>   ^
>   |
>   +--[usb mouse]

Well there's an obvious "bug" to fix in our stack regarding HUB power
sources, I don't know if that's the one you're hitting but we can try :)

When a device is attached the stack gets its power usage from the
configuration descriptor.  But the logic to check if the hub this device
is attached to has enough power budget is dumb and does not work well
with multiple devices below the same hub.

Instead of making a guess about how much power can be drained by a port
we should really keep track of the power available per hub. 

If you're curious grep for 'powersrc' in sys/dev/usb.

I have this on my TODO list because SuperSpeed devices do things
slightly differently.  I'll try to cook a diff, but it will be for after
5.7.



Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-27 Thread Stefan Sperling
On Thu, Feb 26, 2015 at 10:45:01PM +0100, Stefan Sperling wrote:
> There are other issues which are not fixed by this diff and which
> my diff did not fix either.

> But these are bugs we can investigate another day. Happy travels!

I suppose there might well be power supply issues when chaining too many
passive hubs and devices off a single USB port on the host controller.
I've now put an active USB hub (separate power source) at the end of
my long-cable hub and so far things are looking good. I'll report back
if my problems appear in this configuration, too.

My setup changed from

[ehci host]<--[usb extension cable hub]<--[usb keyboard (+hub)]<--[usb mouse]

to this:
 [power adapter]
   |
   V
[ehci host]<--[usb extension cable hub]<--[active hub]<--[usb keyboard (+hub)]
  ^
  |
  +--[usb mouse]



Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-26 Thread Stefan Sperling
On Thu, Feb 26, 2015 at 01:51:29AM +0100, Martin Pieuchot wrote:
> On 22/02/15(Sun) 00:54, Martin Pieuchot wrote:
> > On 20/02/15(Fri) 23:15, Stefan Sperling wrote:
> > > In the following configuration I can pretty easily trigger endless
> > > screenfulls of scrolling "ehci_idone: ex=%p is done!" messages,
> > > where %p is to a constant pointer value (same in each message).
> > > 
> > > [ehci host]<--[usb extension cable (hub 1)]<--[usb keyboard (hub 
> > > 2)]<--[usb mouse]
> > > 
> > > So there's a USB extension cable (hub 1) that I plug a keyboard into.
> > > The keyboard in turn has a built-in hub that has a mouse plugged into it.
> > > The problem happens almost every time I pull the keyboard out of hub 1.
> > > 
> > > I've also seen it happen without any intervention on my part (In fact
> > > I was in a different city and couldn't use the box until I got home
> > > several hours later and hit the reset switch... that's no fun)
> > > 
> > > So if a transfer is cancelled (e.g. as a result of pulling the plug), 
> > > then:
> > > 
> > > - usbd_abort_pipe wants to abort a related transfer
> > > - ehci_abort_xfer schedules and waits for ehci_softintr, expecting
> > >   the softintr routine to deal with the cancelled transfer:
> > > 
> > >   /*
> > >* Step 3: Make sure the soft interrupt routine has run. This
> > >* should remove any completed items off the queue.
> > >* The hardware has no reference to completed items (TDs).
> > >* It's safe to remove them at any time.
> > >*/
> > >   s = splusb();
> > >   sc->sc_softwake = 1;
> > >   usb_schedsoftintr(&sc->sc_bus);
> > >   tsleep(&sc->sc_softwake, PZERO, "ehciab", 0);
> > > 
> > > - ehci_softintr gets scheduled
> > > - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
> > >   invokes ehci_check_intr on each
> > > - ehci_check_intr eventually ends up calling ehci_idone
> > > - ehci_idone does nothing for cancelled transfers... ?!?
> > > 
> > >   if (xfer->status == USBD_CANCELLED ||
> > >   xfer->status == USBD_TIMEOUT) {
> > >   DPRINTF(("ehci_idone: aborted xfer=%p\n", xfer));
> > >   return;
> > >   }
> > > 
> > > - something else happens
> > > 
> > > - ehci_abort_xfer awakes from tsleep and sets ex->isdone, since it
> > >   expects the softinterrupt routine to have dealt with the xfer
> > > 
> > > - something else happens
> > > 
> > > - the host controller sends an INTERR interrupt
> > > - ehci_intr1 schedules ehci_softintr
> > > - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
> > >   invokes ehci_check_intr on each
> > > - the cancelled xfer is still in the intrhead TAILQ and ends up in 
> > > ehci_idone
> > > - ehci_idone looks for the isdone flag which is now set, then it
> > >   complains and does nothing
> > > 
> > > - the host controller sends an INTERR interrupt
> > > ... same story again, we get an endless loop
> > > 
> > > This diff breaks the chain of events and fixes the endless loop for me.
> > > I can't reproduce the problem anymore by pulling the keyboard out.
> > > I don't quite understand how this prevents the flood of INTERR interrupts
> > > but it seems to work.
> > > 
> > > I assume there are nasty tentacles in USB land which I'm unfamiliar with.
> > > Is there any reason this could be a bad idea?
> > 
> > Stefan that's a really good analysis.  I think the diff might not be
> > completely correct though.
> > 
> > So basically you're removing the transfer from the active list.  That's
> > generally done after the USB callback has been executed, in your case in
> > ehci_device_intr_done().
> > 
> > For interrupt transfers (pipe->repeat is 1) the transfer is kept on the
> > list while the descriptors are freed/reallocated.  That should be safe
> > since we should be reusing the sames.  
> > 
> > So I don't know if we are missing a spl protection of if there's an xfer 
> > leak but I'm afraid that with your diff usb_transfer_complete() might
> > not be called for the failing xfer.
> > 
> > That's easy to check, look if the ehcixfer pool counter increase when
> > you detach your device.
> > 
> > I'm afraid I cannot help more as I am currently traveling :)
> 
> Here's an alternative diff that removes xfers from the list of
> currently pending transfers as soon as they are finished or
> cancelled.  This should also prevent the race you analysed.
> 
> With this diff ehci_idone() should no longer be called on a
> CANCELLED or TIMEOUT xfer which mean the 'isdone' check can
> be removed.
> 
> Does it work for you?

It does, yes. Thanks.
Your diff also stops the vomit of ehci_idone messages for me when
I unplug devices or hubs.

There are other issues which are not fixed by this diff and which
my diff did not fix either.

Sometimes my USB keyboard seems to go insane and keeps pressing a key
repeatedly even though I'm not pressing it. For example, when I typed
  make install
today the keyboard kept repeating the letter l so the line looked
like this:
  make 
install

Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-25 Thread Martin Pieuchot
On 22/02/15(Sun) 00:54, Martin Pieuchot wrote:
> On 20/02/15(Fri) 23:15, Stefan Sperling wrote:
> > In the following configuration I can pretty easily trigger endless
> > screenfulls of scrolling "ehci_idone: ex=%p is done!" messages,
> > where %p is to a constant pointer value (same in each message).
> > 
> > [ehci host]<--[usb extension cable (hub 1)]<--[usb keyboard (hub 2)]<--[usb 
> > mouse]
> > 
> > So there's a USB extension cable (hub 1) that I plug a keyboard into.
> > The keyboard in turn has a built-in hub that has a mouse plugged into it.
> > The problem happens almost every time I pull the keyboard out of hub 1.
> > 
> > I've also seen it happen without any intervention on my part (In fact
> > I was in a different city and couldn't use the box until I got home
> > several hours later and hit the reset switch... that's no fun)
> > 
> > So if a transfer is cancelled (e.g. as a result of pulling the plug), then:
> > 
> > - usbd_abort_pipe wants to abort a related transfer
> > - ehci_abort_xfer schedules and waits for ehci_softintr, expecting
> >   the softintr routine to deal with the cancelled transfer:
> > 
> > /*
> >  * Step 3: Make sure the soft interrupt routine has run. This
> >  * should remove any completed items off the queue.
> >  * The hardware has no reference to completed items (TDs).
> >  * It's safe to remove them at any time.
> >  */
> > s = splusb();
> > sc->sc_softwake = 1;
> > usb_schedsoftintr(&sc->sc_bus);
> > tsleep(&sc->sc_softwake, PZERO, "ehciab", 0);
> > 
> > - ehci_softintr gets scheduled
> > - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
> >   invokes ehci_check_intr on each
> > - ehci_check_intr eventually ends up calling ehci_idone
> > - ehci_idone does nothing for cancelled transfers... ?!?
> > 
> > if (xfer->status == USBD_CANCELLED ||
> > xfer->status == USBD_TIMEOUT) {
> > DPRINTF(("ehci_idone: aborted xfer=%p\n", xfer));
> > return;
> > }
> > 
> > - something else happens
> > 
> > - ehci_abort_xfer awakes from tsleep and sets ex->isdone, since it
> >   expects the softinterrupt routine to have dealt with the xfer
> > 
> > - something else happens
> > 
> > - the host controller sends an INTERR interrupt
> > - ehci_intr1 schedules ehci_softintr
> > - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
> >   invokes ehci_check_intr on each
> > - the cancelled xfer is still in the intrhead TAILQ and ends up in 
> > ehci_idone
> > - ehci_idone looks for the isdone flag which is now set, then it
> >   complains and does nothing
> > 
> > - the host controller sends an INTERR interrupt
> > ... same story again, we get an endless loop
> > 
> > This diff breaks the chain of events and fixes the endless loop for me.
> > I can't reproduce the problem anymore by pulling the keyboard out.
> > I don't quite understand how this prevents the flood of INTERR interrupts
> > but it seems to work.
> > 
> > I assume there are nasty tentacles in USB land which I'm unfamiliar with.
> > Is there any reason this could be a bad idea?
> 
> Stefan that's a really good analysis.  I think the diff might not be
> completely correct though.
> 
> So basically you're removing the transfer from the active list.  That's
> generally done after the USB callback has been executed, in your case in
> ehci_device_intr_done().
> 
> For interrupt transfers (pipe->repeat is 1) the transfer is kept on the
> list while the descriptors are freed/reallocated.  That should be safe
> since we should be reusing the sames.  
> 
> So I don't know if we are missing a spl protection of if there's an xfer 
> leak but I'm afraid that with your diff usb_transfer_complete() might
> not be called for the failing xfer.
> 
> That's easy to check, look if the ehcixfer pool counter increase when
> you detach your device.
> 
> I'm afraid I cannot help more as I am currently traveling :)

Here's an alternative diff that removes xfers from the list of
currently pending transfers as soon as they are finished or
cancelled.  This should also prevent the race you analysed.

With this diff ehci_idone() should no longer be called on a
CANCELLED or TIMEOUT xfer which mean the 'isdone' check can
be removed.

Does it work for you?

Index: ehci.c
===
RCS file: /cvs/src/sys/dev/usb/ehci.c,v
retrieving revision 1.174
diff -u -p -r1.174 ehci.c
--- ehci.c  9 Feb 2015 22:14:43 -   1.174
+++ ehci.c  25 Feb 2015 06:04:49 -
@@ -206,11 +206,7 @@ void   ehci_dump_exfer(struct ehci_xfer *
 #define ehci_add_intr_list(sc, ex) \
TAILQ_INSERT_TAIL(&(sc)->sc_intrhead, (ex), inext);
 #define ehci_del_intr_list(sc, ex) \
-   do { \
-   TAILQ_REMOVE(&sc->sc_intrhead, (ex), inext); \
-   (ex)->inext.tqe_prev = NULL; \
-   } while (0)
-#define ehci_active_intr_list(ex) ((ex)->inext.tqe_prev != NULL)
+   TAILQ_REMOVE(&(sc)

Re: fix stuck cancelled ehci xfers (ehci_idone: ex=%p is done!)

2015-02-21 Thread Martin Pieuchot
On 20/02/15(Fri) 23:15, Stefan Sperling wrote:
> In the following configuration I can pretty easily trigger endless
> screenfulls of scrolling "ehci_idone: ex=%p is done!" messages,
> where %p is to a constant pointer value (same in each message).
> 
> [ehci host]<--[usb extension cable (hub 1)]<--[usb keyboard (hub 2)]<--[usb 
> mouse]
> 
> So there's a USB extension cable (hub 1) that I plug a keyboard into.
> The keyboard in turn has a built-in hub that has a mouse plugged into it.
> The problem happens almost every time I pull the keyboard out of hub 1.
> 
> I've also seen it happen without any intervention on my part (In fact
> I was in a different city and couldn't use the box until I got home
> several hours later and hit the reset switch... that's no fun)
> 
> So if a transfer is cancelled (e.g. as a result of pulling the plug), then:
> 
> - usbd_abort_pipe wants to abort a related transfer
> - ehci_abort_xfer schedules and waits for ehci_softintr, expecting
>   the softintr routine to deal with the cancelled transfer:
> 
>   /*
>* Step 3: Make sure the soft interrupt routine has run. This
>* should remove any completed items off the queue.
>* The hardware has no reference to completed items (TDs).
>* It's safe to remove them at any time.
>*/
>   s = splusb();
>   sc->sc_softwake = 1;
>   usb_schedsoftintr(&sc->sc_bus);
>   tsleep(&sc->sc_softwake, PZERO, "ehciab", 0);
> 
> - ehci_softintr gets scheduled
> - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
>   invokes ehci_check_intr on each
> - ehci_check_intr eventually ends up calling ehci_idone
> - ehci_idone does nothing for cancelled transfers... ?!?
> 
>   if (xfer->status == USBD_CANCELLED ||
>   xfer->status == USBD_TIMEOUT) {
>   DPRINTF(("ehci_idone: aborted xfer=%p\n", xfer));
>   return;
>   }
> 
> - something else happens
> 
> - ehci_abort_xfer awakes from tsleep and sets ex->isdone, since it
>   expects the softinterrupt routine to have dealt with the xfer
> 
> - something else happens
> 
> - the host controller sends an INTERR interrupt
> - ehci_intr1 schedules ehci_softintr
> - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
>   invokes ehci_check_intr on each
> - the cancelled xfer is still in the intrhead TAILQ and ends up in ehci_idone
> - ehci_idone looks for the isdone flag which is now set, then it
>   complains and does nothing
> 
> - the host controller sends an INTERR interrupt
> ... same story again, we get an endless loop
> 
> This diff breaks the chain of events and fixes the endless loop for me.
> I can't reproduce the problem anymore by pulling the keyboard out.
> I don't quite understand how this prevents the flood of INTERR interrupts
> but it seems to work.
> 
> I assume there are nasty tentacles in USB land which I'm unfamiliar with.
> Is there any reason this could be a bad idea?

Stefan that's a really good analysis.  I think the diff might not be
completely correct though.

So basically you're removing the transfer from the active list.  That's
generally done after the USB callback has been executed, in your case in
ehci_device_intr_done().

For interrupt transfers (pipe->repeat is 1) the transfer is kept on the
list while the descriptors are freed/reallocated.  That should be safe
since we should be reusing the sames.  

So I don't know if we are missing a spl protection of if there's an xfer 
leak but I'm afraid that with your diff usb_transfer_complete() might
not be called for the failing xfer.

That's easy to check, look if the ehcixfer pool counter increase when
you detach your device.

I'm afraid I cannot help more as I am currently traveling :)

> 
> Index: ehci.c
> ===
> RCS file: /cvs/src/sys/dev/usb/ehci.c,v
> retrieving revision 1.174
> diff -u -p -r1.174 ehci.c
> --- ehci.c9 Feb 2015 22:14:43 -   1.174
> +++ ehci.c20 Feb 2015 21:32:40 -
> @@ -811,6 +811,7 @@ done:
>  void
>  ehci_idone(struct usbd_xfer *xfer)
>  {
> + struct ehci_softc *sc = (struct ehci_softc *)xfer->device->bus;
>   struct ehci_xfer *ex = (struct ehci_xfer *)xfer;
>  #ifdef EHCI_DEBUG
>   struct ehci_pipe *epipe = (struct ehci_pipe *)xfer->pipe;
> @@ -839,6 +840,8 @@ ehci_idone(struct usbd_xfer *xfer)
>  #endif
>   if (xfer->status == USBD_CANCELLED ||
>   xfer->status == USBD_TIMEOUT) {
> + if (ehci_active_intr_list(ex))
> + ehci_del_intr_list(sc, ex);
>   DPRINTF(("ehci_idone: aborted xfer=%p\n", xfer));
>   return;
>   }
>