On Thu, Feb 26, 2015 at 01:51:29AM +0100, Martin Pieuchot wrote:
> On 22/02/15(Sun) 00:54, Martin Pieuchot wrote:
> > On 20/02/15(Fri) 23:15, Stefan Sperling wrote:
> > > In the following configuration I can pretty easily trigger endless
> > > screenfulls of scrolling "ehci_idone: ex=%p is done!" messages,
> > > where %p is to a constant pointer value (same in each message).
> > >
> > > [ehci host]<--[usb extension cable (hub 1)]<--[usb keyboard (hub
> > > 2)]<--[usb mouse]
> > >
> > > So there's a USB extension cable (hub 1) that I plug a keyboard into.
> > > The keyboard in turn has a built-in hub that has a mouse plugged into it.
> > > The problem happens almost every time I pull the keyboard out of hub 1.
> > >
> > > I've also seen it happen without any intervention on my part (In fact
> > > I was in a different city and couldn't use the box until I got home
> > > several hours later and hit the reset switch... that's no fun)
> > >
> > > So if a transfer is cancelled (e.g. as a result of pulling the plug),
> > > then:
> > >
> > > - usbd_abort_pipe wants to abort a related transfer
> > > - ehci_abort_xfer schedules and waits for ehci_softintr, expecting
> > > the softintr routine to deal with the cancelled transfer:
> > >
> > > /*
> > > * Step 3: Make sure the soft interrupt routine has run. This
> > > * should remove any completed items off the queue.
> > > * The hardware has no reference to completed items (TDs).
> > > * It's safe to remove them at any time.
> > > */
> > > s = splusb();
> > > sc->sc_softwake = 1;
> > > usb_schedsoftintr(&sc->sc_bus);
> > > tsleep(&sc->sc_softwake, PZERO, "ehciab", 0);
> > >
> > > - ehci_softintr gets scheduled
> > > - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
> > > invokes ehci_check_intr on each
> > > - ehci_check_intr eventually ends up calling ehci_idone
> > > - ehci_idone does nothing for cancelled transfers... ?!?
> > >
> > > if (xfer->status == USBD_CANCELLED ||
> > > xfer->status == USBD_TIMEOUT) {
> > > DPRINTF(("ehci_idone: aborted xfer=%p\n", xfer));
> > > return;
> > > }
> > >
> > > - something else happens
> > >
> > > - ehci_abort_xfer awakes from tsleep and sets ex->isdone, since it
> > > expects the softinterrupt routine to have dealt with the xfer
> > >
> > > - something else happens
> > >
> > > - the host controller sends an INTERR interrupt
> > > - ehci_intr1 schedules ehci_softintr
> > > - ehci_softintr loops over xfers on the sc_intrhead TAILQ and
> > > invokes ehci_check_intr on each
> > > - the cancelled xfer is still in the intrhead TAILQ and ends up in
> > > ehci_idone
> > > - ehci_idone looks for the isdone flag which is now set, then it
> > > complains and does nothing
> > >
> > > - the host controller sends an INTERR interrupt
> > > ... same story again, we get an endless loop
> > >
> > > This diff breaks the chain of events and fixes the endless loop for me.
> > > I can't reproduce the problem anymore by pulling the keyboard out.
> > > I don't quite understand how this prevents the flood of INTERR interrupts
> > > but it seems to work.
> > >
> > > I assume there are nasty tentacles in USB land which I'm unfamiliar with.
> > > Is there any reason this could be a bad idea?
> >
> > Stefan that's a really good analysis. I think the diff might not be
> > completely correct though.
> >
> > So basically you're removing the transfer from the active list. That's
> > generally done after the USB callback has been executed, in your case in
> > ehci_device_intr_done().
> >
> > For interrupt transfers (pipe->repeat is 1) the transfer is kept on the
> > list while the descriptors are freed/reallocated. That should be safe
> > since we should be reusing the sames.
> >
> > So I don't know if we are missing a spl protection of if there's an xfer
> > leak but I'm afraid that with your diff usb_transfer_complete() might
> > not be called for the failing xfer.
> >
> > That's easy to check, look if the ehcixfer pool counter increase when
> > you detach your device.
> >
> > I'm afraid I cannot help more as I am currently traveling :)
>
> Here's an alternative diff that removes xfers from the list of
> currently pending transfers as soon as they are finished or
> cancelled. This should also prevent the race you analysed.
>
> With this diff ehci_idone() should no longer be called on a
> CANCELLED or TIMEOUT xfer which mean the 'isdone' check can
> be removed.
>
> Does it work for you?
It does, yes. Thanks.
Your diff also stops the vomit of ehci_idone messages for me when
I unplug devices or hubs.
There are other issues which are not fixed by this diff and which
my diff did not fix either.
Sometimes my USB keyboard seems to go insane and keeps pressing a key
repeatedly even though I'm not pressing it. For example, when I typed
make install
today the keyboard kept repeating the letter l so the line looked
like this:
make
installllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
I've seen this problem with 2 different USB keyboards.
Re-plugging the keyboard or the hub it is connected to now restores normal
behaviour -- before this discussion unplugging usually locked the machine in
ehci_idone and I had to reset the system, so we've made a big step forward.
Sometimes I see a similar(?) issue with my USB mouse behind hubs, where the
mouse pointer movement becomes jerky (doesn't feel as fluent as usual).
Sometimes I see messages like these when I plug devices into another
4-port USB 2.0 hub (uhub8) (the same black hub I brought to your house when I
visited last year) which I plug into my long-cable hub (uhub7) (which only
has a single USB port at the far end, unfortunately):
uhub7 at uhub0
port 1 "Terminus Technology USB 2.0 Hub [MTT]" rev 2.00/1.00 addr 2
uhub8 at uhub7
port 4 "Genesys Logic USB2.0 Hub" rev 2.00/77.64 addr 3
uhub8: port 3, set config 0 at addr 5 failed
uhub8: device problem, disabling port 3
uhub8: device problem, disabling port 1
And the devices won't work. Replugging the long-cable hub (uhub7) fixes that.
I've also seen this with devices plugged into the same uhub8:
uhidev0 at uhub8
port 4 configuration 1 interface 0 "Logitech product 0xc050" rev 2.00/27.20
addr 4
uhidev0: no report descriptor
Again no working device.
But these are bugs we can investigate another day. Happy travels!
Maybe one of the hubs I'm using is broken?
>
> Index: ehci.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/usb/ehci.c,v
> retrieving revision 1.174
> diff -u -p -r1.174 ehci.c
> --- ehci.c 9 Feb 2015 22:14:43 -0000 1.174
> +++ ehci.c 25 Feb 2015 06:04:49 -0000
> @@ -206,11 +206,7 @@ void ehci_dump_exfer(struct ehci_xfer *
> #define ehci_add_intr_list(sc, ex) \
> TAILQ_INSERT_TAIL(&(sc)->sc_intrhead, (ex), inext);
> #define ehci_del_intr_list(sc, ex) \
> - do { \
> - TAILQ_REMOVE(&sc->sc_intrhead, (ex), inext); \
> - (ex)->inext.tqe_prev = NULL; \
> - } while (0)
> -#define ehci_active_intr_list(ex) ((ex)->inext.tqe_prev != NULL)
> + TAILQ_REMOVE(&(sc)->sc_intrhead, (ex), inext);
>
> struct usbd_bus_methods ehci_bus_methods = {
> .open_pipe = ehci_open,
> @@ -753,6 +749,7 @@ ehci_check_qh_intr(struct ehci_softc *sc
> }
> done:
> DPRINTFN(12, ("ehci_check_intr: ex=%p done\n", ex));
> + ehci_del_intr_list(sc, ex);
> timeout_del(&xfer->timeout_handle);
> usb_rem_task(xfer->pipe->device, &xfer->abort_task);
> ehci_idone(xfer);
> @@ -803,6 +800,7 @@ ehci_check_itd_intr(struct ehci_softc *s
> return;
> done:
> DPRINTFN(12, ("ehci_check_itd_intr: ex=%p done\n", ex));
> + ehci_del_intr_list(sc, ex);
> timeout_del(&xfer->timeout_handle);
> usb_rem_task(xfer->pipe->device, &xfer->abort_task);
> ehci_idone(xfer);
> @@ -2722,6 +2720,7 @@ ehci_abort_xfer(struct usbd_xfer *xfer,
> /* If we're dying, just do the software part. */
> s = splusb();
> xfer->status = status; /* make software ignore it */
> + ehci_del_intr_list(sc, ex);
> timeout_del(&xfer->timeout_handle);
> usb_rem_task(xfer->device, &xfer->abort_task);
> #ifdef DIAGNOSTIC
> @@ -2759,6 +2758,7 @@ ehci_abort_xfer(struct usbd_xfer *xfer,
> s = splusb();
> ex->ehci_xfer_flags |= EHCI_XFER_ABORTING;
> xfer->status = status; /* make software ignore it */
> + ehci_del_intr_list(sc, ex);
> timeout_del(&xfer->timeout_handle);
> usb_rem_task(xfer->device, &xfer->abort_task);
> splx(s);
> @@ -2830,6 +2830,7 @@ ehci_abort_isoc_xfer(struct usbd_xfer *x
> if (sc->sc_bus.dying) {
> s = splusb();
> xfer->status = status;
> + ehci_del_intr_list(sc, ex);
> timeout_del(&xfer->timeout_handle);
> usb_rem_task(xfer->device, &xfer->abort_task);
> usb_transfer_complete(xfer);
> @@ -2852,9 +2853,10 @@ ehci_abort_isoc_xfer(struct usbd_xfer *x
> tsleep(&ex->ehci_xfer_flags, PZERO, "ehciiaw", 0);
> return;
> }
> - ex->ehci_xfer_flags |= EHCI_XFER_ABORTING;
>
> + ex->ehci_xfer_flags |= EHCI_XFER_ABORTING;
> xfer->status = status;
> + ehci_del_intr_list(sc, ex);
> timeout_del(&xfer->timeout_handle);
> usb_rem_task(xfer->device, &xfer->abort_task);
>
> @@ -2998,8 +3000,7 @@ ehci_device_ctrl_done(struct usbd_xfer *
> }
> #endif
>
> - if (xfer->status != USBD_NOMEM && ehci_active_intr_list(ex)) {
> - ehci_del_intr_list(sc, ex); /* remove from active list */
> + if (xfer->status != USBD_NOMEM) {
> ehci_free_sqtd_chain(sc, ex);
> }
>
> @@ -3283,8 +3284,7 @@ ehci_device_bulk_done(struct usbd_xfer *
> DPRINTFN(10,("ehci_bulk_done: xfer=%p, actlen=%d\n",
> xfer, xfer->actlen));
>
> - if (xfer->status != USBD_NOMEM && ehci_active_intr_list(ex)) {
> - ehci_del_intr_list(sc, ex); /* remove from active list */
> + if (xfer->status != USBD_NOMEM) {
> ehci_free_sqtd_chain(sc, ex);
> usb_syncmem(&xfer->dmabuf, 0, xfer->length,
> usbd_xfer_isread(xfer) ?
> @@ -3480,11 +3480,10 @@ ehci_device_intr_done(struct usbd_xfer *
> timeout_set(&xfer->timeout_handle, ehci_timeout, xfer);
> timeout_add_msec(&xfer->timeout_handle, xfer->timeout);
> }
> - splx(s);
> -
> + ehci_add_intr_list(sc, ex);
> xfer->status = USBD_IN_PROGRESS;
> - } else if (xfer->status != USBD_NOMEM && ehci_active_intr_list(ex)) {
> - ehci_del_intr_list(sc, ex); /* remove from active list */
> + splx(s);
> + } else if (xfer->status != USBD_NOMEM) {
> ehci_free_sqtd_chain(sc, ex);
> usb_syncmem(&xfer->dmabuf, 0, xfer->length,
> usbd_xfer_isread(xfer) ?
> @@ -3789,8 +3788,7 @@ ehci_device_isoc_done(struct usbd_xfer *
>
> s = splusb();
> epipe->u.isoc.cur_xfers--;
> - if (xfer->status != USBD_NOMEM && ehci_active_intr_list(ex)) {
> - ehci_del_intr_list(sc, ex);
> + if (xfer->status != USBD_NOMEM) {
> ehci_rem_free_itd_chain(sc, ex);
> }
> splx(s);