Re: [v6] usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks()

2018-03-27 Thread Alan Stern
On Mon, 26 Mar 2018, Jonathan Liu wrote:

> > I am experiencing a USB function call hang from userspace with OCHI
> > (full speed USB device) after updating from Linux 4.14.15 to 4.14.24
> > and noticed this commit.
> >
> > Here is the Linux 4.14.24 kernel stack trace (extracted from SysRq+w
> > and amended with addr2line):
> > [] (__schedule) from [] (schedule+0x50/0xb4)
> > kernel/sched/core.c:2792
> > [] (schedule) from []
> > (usb_kill_urb.part.3+0x78/0xa8) include/asm-generic/preempt.h:59
> > [] (usb_kill_urb.part.3) from []
> > (usbdev_ioctl+0x1288/0x1cf0) drivers/usb/core/urb.c:690
> > [] (usbdev_ioctl) from []
> > (do_vfs_ioctl+0x9c/0x8ec) drivers/usb/core/devio.c:1835
> > [] (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
> > fs/ioctl.c:47
> > [] (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x54)
> > include/linux/file.h:39
> >
> > Afterwards the kernel is unresponsive to disconnect/connect of the
> > full speed USB device but I can connect/disconnect a high speed USB
> > device to the same port and communicate to it without problem since it
> > uses EHCI (OHCI is companion controller). If I try to connect the full
> > speed USB device again it is still unresponsive. The userspace
> > application is still hanging after all this.
> >
> > Could this commit be causing the issue?

It sounds like the URB gets cancelled but is never given back.  But 
although the commit you mentioned does affect this logic, I don't see 
the problem could arise.

> There does seem to be something wrong with this commit. I did some
> more testing with Linux 4.16.0-rc6 on sun7i-a20-olinuxino-lime.
> I connected a sun7i-a20-pcduino3b running Linux 4.15.12 that has OTG
> port with gadget zero kernel module (g_zero) loaded to the USB port
> and did some testing with usbtest from tools/usb/testusb.c in kernel
> repository.
> 
> I had to do the following change to avoid getting flooded with
> "usbtest 1-1:3.0: unlink retry" kernel messages after test 11 of
> testusb starts:
> --- a/drivers/usb/misc/usbtest.c
> +++ b/drivers/usb/misc/usbtest.c
> @@ -1473,7 +1473,7 @@ static int unlink1(struct usbtest_dev *dev, int
> pipe, int
> size, int async)
>  * resubmission, but since we're testing 
> unlink
>  * paths, we can't.
>  */
> -   ERROR(dev, "unlink retry\n");
> + /*ERROR(dev, "unlink retry\n");*/

That's a bizarre message in any case.  The condition under which it
gets printed is _not_ an error, just a race.  I would expect it to show
up every now and then just based on random chance.  But if it gets
printed over and over again then something is definitely wrong.

> If I revert the commit, I can run "./testusb -a" to completion without
> any issues with Linux 4.16.0-rc6 on sun7i-a20-olinuxino-lime.
> I don't get a kernel stall detected when testing on x86_64 but I do
> still get at least one "unlink retry" error message from kernel in
> dmesg which can be reproduced using "./testusb -a" or more frequently
> with "./testusb -a -t 11 -c 5000".

A few of those "unlink retry" messages during a test with 5000 
iterations does not mean anything is wrong; don't worry about it.

It would be great if you could debug this farther.  I'm not sure what
to suggest, however.  Clearly we need to know what's going on in
ohci-q.c's finish_unlinks() routine; unfortunately this routine
probably gets called thousands of times before the problem shows up.

Anyway, here's a patch you can try out on the sun7i-a20-olinuxino-lime.  
I have no idea whether it will help pinpoint the source of the problem.

Alan Stern



Index: usb-4.x/drivers/usb/host/ohci-hcd.c
===
--- usb-4.x.orig/drivers/usb/host/ohci-hcd.c
+++ usb-4.x/drivers/usb/host/ohci-hcd.c
@@ -327,14 +327,21 @@ static int ohci_urb_dequeue(struct usb_h
 * some upcoming INTR_SF to call finish_unlinks()
 */
urb_priv = urb->hcpriv;
-   if (urb_priv->ed->state == ED_OPER)
+   if (urb_priv->ed->state == ED_OPER) {
+   ohci_info(ohci, "start dequeue URB %p ed %p\n", urb,
+   urb_priv->ed);
start_ed_unlink(ohci, urb_priv->ed);
-
+   } else
+   ohci_info(ohci, "idle dequeue URB %p ed %p\n", urb,
+   urb_priv->ed);
if (ohci->rh_state != OHCI_RH_RUNNING) {
/* With HC dead, we can clean up right away */
ohci_work(ohci);
}
}
+   else
+   ohci_info(ohci, "failed dequeue URB %p\n", urb);
+
spin_unlock_irqrestore (>lock, flags);
return rc;
 }
Index: usb-4.x/drivers/usb/host/ohci-q.c
===
--- 

Re: [v6] usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks()

2018-03-25 Thread Jonathan Liu
Hi,

On 25 March 2018 at 12:21, Jonathan Liu  wrote:
> On 8 February 2018 at 14:55, Jeffy Chen  wrote:
>> From: AMAN DEEP 
>>
>> There is a race condition between finish_unlinks->finish_urb() function
>> and usb_kill_urb() in ohci controller case. The finish_urb calls
>> spin_unlock(>lock) before usb_hcd_giveback_urb() function call,
>> then if during this time, usb_kill_urb is called for another endpoint,
>> then new ed will be added to ed_rm_list at beginning for unlink, and
>> ed_rm_list will point to newly added.
>>
>> When finish_urb() is completed in finish_unlinks() and ed->td_list
>> becomes empty as in below code (in finish_unlinks() function):
>>
>> if (list_empty(>td_list)) {
>> *last = ed->ed_next;
>> ed->ed_next = NULL;
>> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
>> *last = ed->ed_next;
>> ed->ed_next = NULL;
>> ed_schedule(ohci, ed);
>> }
>>
>> The *last = ed->ed_next will make ed_rm_list to point to ed->ed_next
>> and previously added ed by usb_kill_urb will be left unreferenced by
>> ed_rm_list. This causes usb_kill_urb() hang forever waiting for
>> finish_unlink to remove added ed from ed_rm_list.
>>
>> The main reason for hang in this race condtion is addition and removal
>> of ed from ed_rm_list in the beginning during usb_kill_urb and later
>> last* is modified in finish_unlinks().
>>
>> As suggested by Alan Stern, the solution for proper handling of
>> ohci->ed_rm_list is to remove ed from the ed_rm_list before finishing
>> any URBs. Then at the end, we can add ed back to the list if necessary.
>>
>> This properly handle the updated ohci->ed_rm_list in usb_kill_urb().
>>
>> Fixes:977dcfdc6031("USB:OHCI:don't lose track of EDs when a controller dies")
>> Acked-by: Alan Stern 
>> CC: 
>> Signed-off-by: Aman Deep 
>> Signed-off-by: Jeffy Chen 
>> ---
>>
>> Changes in v6:
>> This is a resend of Aman Deep's v5 patch [0], which solved the hang we
>> hit [1]. (Thanks Aman :)
>>
>> The v5 has some format issues, so i slightly adjust the commit message.
>>
>> [0] https://www.spinics.net/lists/linux-usb/msg129010.html
>> [1] https://bugs.chromium.org/p/chromium/issues/detail?id=803749
>>
>>  drivers/usb/host/ohci-q.c | 17 ++---
>>  1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/usb/host/ohci-q.c b/drivers/usb/host/ohci-q.c
>> index b2ec8c399363..4ccb85a67bb3 100644
>> --- a/drivers/usb/host/ohci-q.c
>> +++ b/drivers/usb/host/ohci-q.c
>> @@ -1019,6 +1019,8 @@ static void finish_unlinks(struct ohci_hcd *ohci)
>>  * have modified this list.  normally it's just prepending
>>  * entries (which we'd ignore), but paranoia won't hurt.
>>  */
>> +   *last = ed->ed_next;
>> +   ed->ed_next = NULL;
>> modified = 0;
>>
>> /* unlink urbs as requested, but rescan the list after
>> @@ -1077,21 +1079,22 @@ static void finish_unlinks(struct ohci_hcd *ohci)
>> goto rescan_this;
>>
>> /*
>> -* If no TDs are queued, take ED off the ed_rm_list.
>> +* If no TDs are queued, ED is now idle.
>>  * Otherwise, if the HC is running, reschedule.
>> -* If not, leave it on the list for further dequeues.
>> +* If the HC isn't running, add ED back to the
>> +* start of the list for later processing.
>>  */
>> if (list_empty(>td_list)) {
>> -   *last = ed->ed_next;
>> -   ed->ed_next = NULL;
>> ed->state = ED_IDLE;
>> list_del(>in_use_list);
>> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
>> -   *last = ed->ed_next;
>> -   ed->ed_next = NULL;
>> ed_schedule(ohci, ed);
>> } else {
>> -   last = >ed_next;
>> +   ed->ed_next = ohci->ed_rm_list;
>> +   ohci->ed_rm_list = ed;
>> +   /* Don't loop on the same ED */
>> +   if (last == >ed_rm_list)
>> +   last = >ed_next;
>> }
>>
>> if (modified)
>
> I am experiencing a USB function call hang from userspace with OCHI
> (full speed USB device) after updating from Linux 4.14.15 to 4.14.24
> and noticed this commit.
>
> Here is the Linux 4.14.24 kernel stack trace (extracted from SysRq+w
> and amended with addr2line):
> [] (__schedule) from [] (schedule+0x50/0xb4)
> kernel/sched/core.c:2792
> [] (schedule) from []
> (usb_kill_urb.part.3+0x78/0xa8) 

Re: [v6] usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks()

2018-03-24 Thread Jonathan Liu
Hi,

On 25 March 2018 at 12:21, Jonathan Liu  wrote:
> Hi,
>
> On 8 February 2018 at 14:55, Jeffy Chen  wrote:
>> From: AMAN DEEP 
>>
>> There is a race condition between finish_unlinks->finish_urb() function
>> and usb_kill_urb() in ohci controller case. The finish_urb calls
>> spin_unlock(>lock) before usb_hcd_giveback_urb() function call,
>> then if during this time, usb_kill_urb is called for another endpoint,
>> then new ed will be added to ed_rm_list at beginning for unlink, and
>> ed_rm_list will point to newly added.
>>
>> When finish_urb() is completed in finish_unlinks() and ed->td_list
>> becomes empty as in below code (in finish_unlinks() function):
>>
>> if (list_empty(>td_list)) {
>> *last = ed->ed_next;
>> ed->ed_next = NULL;
>> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
>> *last = ed->ed_next;
>> ed->ed_next = NULL;
>> ed_schedule(ohci, ed);
>> }
>>
>> The *last = ed->ed_next will make ed_rm_list to point to ed->ed_next
>> and previously added ed by usb_kill_urb will be left unreferenced by
>> ed_rm_list. This causes usb_kill_urb() hang forever waiting for
>> finish_unlink to remove added ed from ed_rm_list.
>>
>> The main reason for hang in this race condtion is addition and removal
>> of ed from ed_rm_list in the beginning during usb_kill_urb and later
>> last* is modified in finish_unlinks().
>>
>> As suggested by Alan Stern, the solution for proper handling of
>> ohci->ed_rm_list is to remove ed from the ed_rm_list before finishing
>> any URBs. Then at the end, we can add ed back to the list if necessary.
>>
>> This properly handle the updated ohci->ed_rm_list in usb_kill_urb().
>>
>> Fixes:977dcfdc6031("USB:OHCI:don't lose track of EDs when a controller dies")
>> Acked-by: Alan Stern 
>> CC: 
>> Signed-off-by: Aman Deep 
>> Signed-off-by: Jeffy Chen 
>> ---
>>
>> Changes in v6:
>> This is a resend of Aman Deep's v5 patch [0], which solved the hang we
>> hit [1]. (Thanks Aman :)
>>
>> The v5 has some format issues, so i slightly adjust the commit message.
>>
>> [0] https://www.spinics.net/lists/linux-usb/msg129010.html
>> [1] https://bugs.chromium.org/p/chromium/issues/detail?id=803749
>>
>>  drivers/usb/host/ohci-q.c | 17 ++---
>>  1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/usb/host/ohci-q.c b/drivers/usb/host/ohci-q.c
>> index b2ec8c399363..4ccb85a67bb3 100644
>> --- a/drivers/usb/host/ohci-q.c
>> +++ b/drivers/usb/host/ohci-q.c
>> @@ -1019,6 +1019,8 @@ static void finish_unlinks(struct ohci_hcd *ohci)
>>  * have modified this list.  normally it's just prepending
>>  * entries (which we'd ignore), but paranoia won't hurt.
>>  */
>> +   *last = ed->ed_next;
>> +   ed->ed_next = NULL;
>> modified = 0;
>>
>> /* unlink urbs as requested, but rescan the list after
>> @@ -1077,21 +1079,22 @@ static void finish_unlinks(struct ohci_hcd *ohci)
>> goto rescan_this;
>>
>> /*
>> -* If no TDs are queued, take ED off the ed_rm_list.
>> +* If no TDs are queued, ED is now idle.
>>  * Otherwise, if the HC is running, reschedule.
>> -* If not, leave it on the list for further dequeues.
>> +* If the HC isn't running, add ED back to the
>> +* start of the list for later processing.
>>  */
>> if (list_empty(>td_list)) {
>> -   *last = ed->ed_next;
>> -   ed->ed_next = NULL;
>> ed->state = ED_IDLE;
>> list_del(>in_use_list);
>> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
>> -   *last = ed->ed_next;
>> -   ed->ed_next = NULL;
>> ed_schedule(ohci, ed);
>> } else {
>> -   last = >ed_next;
>> +   ed->ed_next = ohci->ed_rm_list;
>> +   ohci->ed_rm_list = ed;
>> +   /* Don't loop on the same ED */
>> +   if (last == >ed_rm_list)
>> +   last = >ed_next;
>> }
>>
>> if (modified)
>
> I am experiencing a USB function call hang from userspace with OCHI
> (full speed USB device) after updating from Linux 4.14.15 to 4.14.24
> and noticed this commit.
>
> Here is the Linux 4.14.24 kernel stack trace (extracted from SysRq+w
> and amended with addr2line):
> [] (__schedule) from [] (schedule+0x50/0xb4)
> kernel/sched/core.c:2792
> [] (schedule) from []
> (usb_kill_urb.part.3+0x78/0xa8) 

Re: [v6] usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks()

2018-03-24 Thread Jonathan Liu
Hi,

On 8 February 2018 at 14:55, Jeffy Chen  wrote:
> From: AMAN DEEP 
>
> There is a race condition between finish_unlinks->finish_urb() function
> and usb_kill_urb() in ohci controller case. The finish_urb calls
> spin_unlock(>lock) before usb_hcd_giveback_urb() function call,
> then if during this time, usb_kill_urb is called for another endpoint,
> then new ed will be added to ed_rm_list at beginning for unlink, and
> ed_rm_list will point to newly added.
>
> When finish_urb() is completed in finish_unlinks() and ed->td_list
> becomes empty as in below code (in finish_unlinks() function):
>
> if (list_empty(>td_list)) {
> *last = ed->ed_next;
> ed->ed_next = NULL;
> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
> *last = ed->ed_next;
> ed->ed_next = NULL;
> ed_schedule(ohci, ed);
> }
>
> The *last = ed->ed_next will make ed_rm_list to point to ed->ed_next
> and previously added ed by usb_kill_urb will be left unreferenced by
> ed_rm_list. This causes usb_kill_urb() hang forever waiting for
> finish_unlink to remove added ed from ed_rm_list.
>
> The main reason for hang in this race condtion is addition and removal
> of ed from ed_rm_list in the beginning during usb_kill_urb and later
> last* is modified in finish_unlinks().
>
> As suggested by Alan Stern, the solution for proper handling of
> ohci->ed_rm_list is to remove ed from the ed_rm_list before finishing
> any URBs. Then at the end, we can add ed back to the list if necessary.
>
> This properly handle the updated ohci->ed_rm_list in usb_kill_urb().
>
> Fixes:977dcfdc6031("USB:OHCI:don't lose track of EDs when a controller dies")
> Acked-by: Alan Stern 
> CC: 
> Signed-off-by: Aman Deep 
> Signed-off-by: Jeffy Chen 
> ---
>
> Changes in v6:
> This is a resend of Aman Deep's v5 patch [0], which solved the hang we
> hit [1]. (Thanks Aman :)
>
> The v5 has some format issues, so i slightly adjust the commit message.
>
> [0] https://www.spinics.net/lists/linux-usb/msg129010.html
> [1] https://bugs.chromium.org/p/chromium/issues/detail?id=803749
>
>  drivers/usb/host/ohci-q.c | 17 ++---
>  1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/usb/host/ohci-q.c b/drivers/usb/host/ohci-q.c
> index b2ec8c399363..4ccb85a67bb3 100644
> --- a/drivers/usb/host/ohci-q.c
> +++ b/drivers/usb/host/ohci-q.c
> @@ -1019,6 +1019,8 @@ static void finish_unlinks(struct ohci_hcd *ohci)
>  * have modified this list.  normally it's just prepending
>  * entries (which we'd ignore), but paranoia won't hurt.
>  */
> +   *last = ed->ed_next;
> +   ed->ed_next = NULL;
> modified = 0;
>
> /* unlink urbs as requested, but rescan the list after
> @@ -1077,21 +1079,22 @@ static void finish_unlinks(struct ohci_hcd *ohci)
> goto rescan_this;
>
> /*
> -* If no TDs are queued, take ED off the ed_rm_list.
> +* If no TDs are queued, ED is now idle.
>  * Otherwise, if the HC is running, reschedule.
> -* If not, leave it on the list for further dequeues.
> +* If the HC isn't running, add ED back to the
> +* start of the list for later processing.
>  */
> if (list_empty(>td_list)) {
> -   *last = ed->ed_next;
> -   ed->ed_next = NULL;
> ed->state = ED_IDLE;
> list_del(>in_use_list);
> } else if (ohci->rh_state == OHCI_RH_RUNNING) {
> -   *last = ed->ed_next;
> -   ed->ed_next = NULL;
> ed_schedule(ohci, ed);
> } else {
> -   last = >ed_next;
> +   ed->ed_next = ohci->ed_rm_list;
> +   ohci->ed_rm_list = ed;
> +   /* Don't loop on the same ED */
> +   if (last == >ed_rm_list)
> +   last = >ed_next;
> }
>
> if (modified)

I am experiencing a USB function call hang from userspace with OCHI
(full speed USB device) after updating from Linux 4.14.15 to 4.14.24
and noticed this commit.

Here is the Linux 4.14.24 kernel stack trace (extracted from SysRq+w
and amended with addr2line):
[] (__schedule) from [] (schedule+0x50/0xb4)
kernel/sched/core.c:2792
[] (schedule) from []
(usb_kill_urb.part.3+0x78/0xa8) include/asm-generic/preempt.h:59
[] (usb_kill_urb.part.3) from []
(usbdev_ioctl+0x1288/0x1cf0) drivers/usb/core/urb.c:690
[] (usbdev_ioctl) from []
(do_vfs_ioctl+0x9c/0x8ec) drivers/usb/core/devio.c:1835