You are correct; it seems like this issue arises from the fact that
stalled no-op trbs are ignored and the controller remains stuck in the
Halted state because the Reset Endpoint command never gets run.

>From what I understand, this issue is not caused by losing track of
where the endpoint ring stopped, but by a race condition between
killing URBs and stall errors. As such, the STALL_ERROR handling
happens right after the endpoint ring is stopped. Since the trb is
no-op'ed in xhci_handle_cmd_stop_ep(), the endpoint recovery code
never gets run.

The cdc-acm driver triggers the "perpetually halted endpoint" behavior
via (what I believe are) the following operations:
1. The cdc-acm driver submits and queues 16 bulk IN URBs.
2. A full-speed device times out and fails to respond to its IN token
(including retries), causing the CSPLIT transaction with its upstream
hub to return a STALL. The first URB thus completes with -EPIPE.
3. Upon completion of the first URB, the cdc-acm driver kills the
other 15 URBs, attempts to clear the halt condition, and resubmits all
16 URBs.
4. Every time a URB is killed, the endpoint ring is restarted and URBs
that are not killed yet are executed. If we are unlucky, the URB's TRB
is gets cancelled right after it has been executed but before
xhci_irq() gets to run, causing this condition.

My patch is simply a workaround for this specific case; it may be
worth considering if there are any other tasks that need to be done in
the no-op trb case.

Best,
Gavin

On Mon, May 29, 2017 at 7:06 AM, Mathias Nyman
<mathias.ny...@linux.intel.com> wrote:
> On 26.05.2017 22:12, gavi...@thegavinli.com wrote:
>>
>> From: Gavin Li <g...@thegavinli.com>
>>
>> If a stalling TRB is cancelled and NOOP'ed in xhci_handle_cmd_stop_ep(),
>> finish_td() never gets called to reset the halted endpoint and the
>> endpoint remains indefinitely stalled. This patch ensures that
>> xhci_cleanup_halted_endpoint() is called after a TRB completion if the
>> endpoint is halted, irregardless of the status of any pending TRBs.
>>
>> Signed-off-by: Gavin Li <g...@thegavinli.com>
>
>
> Interesting, I'm trying to understand the details of what's going on here.
>
> Does the event ring first have a stopped transfer event, then a stop
> endpoint
> command completion event, and finally a transfer event with STALL_ERROR
> completion
> code pointing to a no-op trb?
>
> If this is the case do yo have any idea if the STALL_ERROR transfer event
> was issued while the ring was stopping, or only after ring was restarted
> again?
>
> current code relies on the the stopped transfer event to know where on the
> endpoint ring hardware stopped, sometimes this might not be reliable.
>
> I have a change for that, just pushed to my for-usb-next branch in my tree
> at:
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
>
> Your patch basically adds a check for STALL_ERROR for no-op trbs, it reveals
> that the driver currently skips handling all endpoint state related transfer
> events if that TRB was canceled. This needs to investigated more
>
> Thanks
> -Mathias
>
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to