On 26/02/15(Thu) 22:45, Stefan Sperling wrote: > On Thu, Feb 26, 2015 at 01:51:29AM +0100, Martin Pieuchot wrote: > > On 22/02/15(Sun) 00:54, Martin Pieuchot wrote: > > > On 20/02/15(Fri) 23:15, Stefan Sperling wrote: > > > > In the following configuration I can pretty easily trigger endless > > > > screenfulls of scrolling "ehci_idone: ex=%p is done!" messages, > > > > where %p is to a constant pointer value (same in each message). > > > > > > > > [ehci host]<--[usb extension cable (hub 1)]<--[usb keyboard (hub > > > > 2)]<--[usb mouse] > > > > > > > > So there's a USB extension cable (hub 1) that I plug a keyboard into. > > > > The keyboard in turn has a built-in hub that has a mouse plugged into > > > > it. > > > > The problem happens almost every time I pull the keyboard out of hub 1. > > > > > > > > I've also seen it happen without any intervention on my part (In fact > > > > I was in a different city and couldn't use the box until I got home > > > > several hours later and hit the reset switch... that's no fun) > > > > > > > > So if a transfer is cancelled (e.g. as a result of pulling the plug), > > > > then: > > > > > > > > - usbd_abort_pipe wants to abort a related transfer > > > > - ehci_abort_xfer schedules and waits for ehci_softintr, expecting > > > > the softintr routine to deal with the cancelled transfer: > > > > > > > > /* > > > > * Step 3: Make sure the soft interrupt routine has run. This > > > > * should remove any completed items off the queue. > > > > * The hardware has no reference to completed items (TDs). > > > > * It's safe to remove them at any time. > > > > */ > > > > s = splusb(); > > > > sc->sc_softwake = 1; > > > > usb_schedsoftintr(&sc->sc_bus); > > > > tsleep(&sc->sc_softwake, PZERO, "ehciab", 0); > > > > > > > > - ehci_softintr gets scheduled > > > > - ehci_softintr loops over xfers on the sc_intrhead TAILQ and > > > > invokes ehci_check_intr on each > > > > - ehci_check_intr eventually ends up calling ehci_idone > > > > - ehci_idone does nothing for cancelled transfers... ?!? > > > > > > > > if (xfer->status == USBD_CANCELLED || > > > > xfer->status == USBD_TIMEOUT) { > > > > DPRINTF(("ehci_idone: aborted xfer=%p\n", xfer)); > > > > return; > > > > } > > > > > > > > - something else happens > > > > > > > > - ehci_abort_xfer awakes from tsleep and sets ex->isdone, since it > > > > expects the softinterrupt routine to have dealt with the xfer > > > > > > > > - something else happens > > > > > > > > - the host controller sends an INTERR interrupt > > > > - ehci_intr1 schedules ehci_softintr > > > > - ehci_softintr loops over xfers on the sc_intrhead TAILQ and > > > > invokes ehci_check_intr on each > > > > - the cancelled xfer is still in the intrhead TAILQ and ends up in > > > > ehci_idone > > > > - ehci_idone looks for the isdone flag which is now set, then it > > > > complains and does nothing > > > > > > > > - the host controller sends an INTERR interrupt > > > > ... same story again, we get an endless loop > > > > > > > > This diff breaks the chain of events and fixes the endless loop for me. > > > > I can't reproduce the problem anymore by pulling the keyboard out. > > > > I don't quite understand how this prevents the flood of INTERR > > > > interrupts > > > > but it seems to work. > > > > > > > > I assume there are nasty tentacles in USB land which I'm unfamiliar > > > > with. > > > > Is there any reason this could be a bad idea? > > > > > > Stefan that's a really good analysis. I think the diff might not be > > > completely correct though. > > > > > > So basically you're removing the transfer from the active list. That's > > > generally done after the USB callback has been executed, in your case in > > > ehci_device_intr_done(). > > > > > > For interrupt transfers (pipe->repeat is 1) the transfer is kept on the > > > list while the descriptors are freed/reallocated. That should be safe > > > since we should be reusing the sames. > > > > > > So I don't know if we are missing a spl protection of if there's an xfer > > > leak but I'm afraid that with your diff usb_transfer_complete() might > > > not be called for the failing xfer. > > > > > > That's easy to check, look if the ehcixfer pool counter increase when > > > you detach your device. > > > > > > I'm afraid I cannot help more as I am currently traveling :) > > > > Here's an alternative diff that removes xfers from the list of > > currently pending transfers as soon as they are finished or > > cancelled. This should also prevent the race you analysed. > > > > With this diff ehci_idone() should no longer be called on a > > CANCELLED or TIMEOUT xfer which mean the 'isdone' check can > > be removed. > > > > Does it work for you? > > It does, yes. Thanks. > Your diff also stops the vomit of ehci_idone messages for me when > I unplug devices or hubs.
Does that mean you're ok with the diff? > There are other issues which are not fixed by this diff and which > my diff did not fix either. > > Sometimes my USB keyboard seems to go insane and keeps pressing a key > repeatedly even though I'm not pressing it. For example, when I typed > make install > today the keyboard kept repeating the letter l so the line looked > like this: > make > installllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll > > I've seen this problem with 2 different USB keyboards. > Re-plugging the keyboard or the hub it is connected to now restores normal > behaviour -- before this discussion unplugging usually locked the machine in > ehci_idone and I had to reset the system, so we've made a big step forward. > > Sometimes I see a similar(?) issue with my USB mouse behind hubs, where the > mouse pointer movement becomes jerky (doesn't feel as fluent as usual). Do you see interrupts when such thing happen or do you think it's a software bug? > Sometimes I see messages like these when I plug devices into another > 4-port USB 2.0 hub (uhub8) (the same black hub I brought to your house when I > visited last year) which I plug into my long-cable hub (uhub7) (which only > has a single USB port at the far end, unfortunately): > uhub7 at uhub0 > port 1 "Terminus Technology USB 2.0 Hub [MTT]" rev 2.00/1.00 addr 2 > uhub8 at uhub7 > port 4 "Genesys Logic USB2.0 Hub" rev 2.00/77.64 addr 3 > uhub8: port 3, set config 0 at addr 5 failed > uhub8: device problem, disabling port 3 > uhub8: device problem, disabling port 1 > And the devices won't work. Replugging the long-cable hub (uhub7) fixes that. > > I've also seen this with devices plugged into the same uhub8: > uhidev0 at uhub8 > port 4 configuration 1 interface 0 "Logitech product 0xc050" rev 2.00/27.20 > addr 4 > uhidev0: no report descriptor > Again no working device. > > But these are bugs we can investigate another day. Happy travels! > > Maybe one of the hubs I'm using is broken? I doubt it. When you see such weird problems, please do a "lsusb -v" on your hubs and look at the various ports status. You might give get a clue of what's going on.