On Sun, Oct 29, 2023 at 2:38 AM Hector Martin <mar...@marcan.st> wrote: > > This series is the first of a few bundles of USB fixes we have been > carrying downstream on the Asahi U-Boot branch for a few months. > > Patches #1-#6 fix a series of related robustness issues. Certain > conditions related to endpoint stalls revealed a chain of bugs > throughout the stack that caused U-Boot to completely fail when > encountering some errors (and errors are a fact of life with USB). > > Example scenario: > - The device stalls a bulk endpoint due to an error > - The upper layer driver tries to use the endpoint again > - There is no endpoint stall clear wired up in the API, so for starters > this is doomed to fail (fix: #4) > - xHCI knows the endpoint is halted, but tries to queue the TRB anyway, > which can't work (fix: #5) > - Since the endpoint is halted nothing happens, so the transfer times > out. > - The timeout handling tries to abort the transfer > - abort_td() tries to stop the endpoint, but since it is already halted, > this results in a context state error. As the transfer never started, > there is no completion event, so xhci_wait_for_event() is waiting for > the wrong event type, and logs an error and returns NULL. (fix: #2) > - abort_td() crashes due to failing to guard against the NULL event > (fix: #1) > - Even after fixing all that, abort_td() would not handle the context > state error properly and BUG() (fix: #3). This also fixes a race > condition documented in the xHCI spec that could occur even in the > absence of all the other bugs. > > Patch #6 addresses a related robustness issue where > xhci_wait_for_event() panics on event timeouts other than for transfers. > While this is clearly an unexpected condition and indicates a bug > elsewhere, it's no reason to outright crash. USB is notoriously > impossible to get 100% right, and we can't afford to be breaking users' > systems at any sign of trouble. Error conditions should always be dealt > with as gracefully as possible (even if that results in a completely > broken USB controller, that is much preferable to aborting the boot > process entirely, especially on devices with non-USB storage and > keyboards where USB support is effectively optional for most users). > Since after patch #1 we now guard all callers to xhci_wait_for_event() > with at least trivial NULL checks, it's okay to return and continue in > case of timeouts. > > Patch #7 addresses an unrelated bug I ran into while debugging all this, > and patch #8 adds extra debug logs to make finding future issues less > painful. > > I believe this should fix this Fedora bug too, which is either an > instance of the above sequence of events, or (more likely, given the > difficulty reproducing) the race condition documented in xHCI 4.6.9: > > https://bugzilla.redhat.com/show_bug.cgi?id=2244305 > > Signed-off-by: Hector Martin <mar...@marcan.st> > --- > Changes in v2: > - Squashed in a trivial fix for patch #1 > - Removed spurious blank line > - Added a longer description to the cover letter > - Link to v1: > https://lore.kernel.org/r/20231027-usb-fixes-1-v1-0-1c879bbcd...@marcan.st > > --- > Hector Martin (8): > usb: xhci: Guard all calls to xhci_wait_for_event > usb: xhci: Better error handling in abort_td() > usb: xhci: Allow context state errors when halting an endpoint > usb: xhci: Recover from halted bulk endpoints > usb: xhci: Fail on attempt to queue TRBs to a halted endpoint > usb: xhci: Do not panic on event timeouts > usb: xhci: Fix DMA address calculation in queue_trb > usb: xhci: Add more debugging > > drivers/usb/host/xhci-ring.c | 99 > ++++++++++++++++++++++++++++++++++++-------- > drivers/usb/host/xhci.c | 9 ++++ > include/usb/xhci.h | 2 + > 3 files changed, 92 insertions(+), 18 deletions(-) > --- > base-commit: 7c0d668103abae3ec14cd96d882ba20134bb4529 > change-id: 20231027-usb-fixes-1-83bfc7013012 >
Series LGTM. Reviewed-by: Neal Gompa <n...@gompa.dev> -- 真実はいつも一つ!/ Always, there's only one truth!