Re: [PATCH] USB: core: Avoid race of async_completed() w/ usbdev_release()

2017-08-11 Thread Alan Stern
On Thu, 10 Aug 2017, Douglas Anderson wrote:

> While running reboot tests w/ a specific set of USB devices (and
> slub_debug enabled), I found that once every few hours my device would
> be crashed with a stack that looked like this:
> 
> [   14.012445] BUG: spinlock bad magic on CPU#0, modprobe/2091
> [   14.012460]  lock: 0xffc0cb055978, .magic: ffc0, .owner: cryption 
> contexts: %lu/%lu
> [   14.012460] /1025536097, .owner_cpu: 0
> [   14.012466] CPU: 0 PID: 2091 Comm: modprobe Not tainted 4.4.79 #352
> [   14.012468] Hardware name: Google Kevin (DT)
> [   14.012471] Call trace:
> [   14.012483] [<>] dump_backtrace+0x0/0x160
> [   14.012487] [<>] show_stack+0x20/0x28
> [   14.012494] [<>] dump_stack+0xb4/0xf0
> [   14.012500] [<>] spin_dump+0x8c/0x98
> [   14.012504] [<>] spin_bug+0x30/0x3c
> [   14.012508] [<>] do_raw_spin_lock+0x40/0x164
> [   14.012515] [<>] _raw_spin_lock_irqsave+0x64/0x74
> [   14.012521] [<>] __wake_up+0x2c/0x60
> [   14.012528] [<>] async_completed+0x2d0/0x300
> [   14.012534] [<>] __usb_hcd_giveback_urb+0xc4/0x138
> [   14.012538] [<>] usb_hcd_giveback_urb+0x54/0xf0
> [   14.012544] [<>] xhci_irq+0x1314/0x1348
> [   14.012548] [<>] usb_hcd_irq+0x40/0x50
> [   14.012553] [<>] handle_irq_event_percpu+0x1b4/0x3f0
> [   14.012556] [<>] handle_irq_event+0x4c/0x7c
> [   14.012561] [<>] handle_fasteoi_irq+0x158/0x1c8
> [   14.012564] [<>] generic_handle_irq+0x30/0x44
> [   14.012568] [<>] __handle_domain_irq+0x90/0xbc
> [   14.012572] [<>] gic_handle_irq+0xcc/0x18c
> 
> Investigation using kgdb() found that the wait queue that was passed
> into wake_up() had been freed (it was filled with slub_debug poison).
> 
> I analyzed and instrumented the code and reproduced.  My current
> belief is that this is happening:
> 
> 1. async_completed() is called (from IRQ).  Moves "as" onto the
>completed list.
> 2. On another CPU, proc_reapurbnonblock_compat() calls
>async_getcompleted().  Blocks on spinlock.
> 3. async_completed() releases the lock; keeps running; gets blocked
>midway through wake_up().
> 4. proc_reapurbnonblock_compat() => async_getcompleted() gets the
>lock; removes "as" from completed list and frees it.
> 5. usbdev_release() is called.  Frees "ps".
> 6. async_completed() finally continues running wake_up().  ...but
>wake_up() has a pointer to the freed "ps".

That looks like a good analysis.

> The instrumentation that led me to believe this was based on adding
> some trace_printk() calls in a select few functions and then using
> kdb's "ftdump" at crash time.  The trace follows (NOTE: in the trace
> below I cheated a little bit and added a udelay(1000) in
> async_completed() after releasing the spinlock because I wanted it to
> trigger quicker):
> 
> <...>-2104   0d.h2 13759034us!: async_completed at start: as=ffc0cc638200
> mtpd-20553 13759356us : async_getcompleted before spin_lock_irqsave
> mtpd-20553d..1 13759362us : async_getcompleted after list_del_init: 
> as=ffc0cc638200
> mtpd-20553 13759371us+: proc_reapurbnonblock_compat: 
> free_async(ffc0cc638200)
> mtpd-20553 13759422us+: async_getcompleted before spin_lock_irqsave
> mtpd-20553 13759479us : usbdev_release at start: ps=ffc0cc042080
> mtpd-20553 13759487us : async_getcompleted before spin_lock_irqsave
> mtpd-20553 13759497us!: usbdev_release after kfree(ps): 
> ps=ffc0cc042080
> <...>-2104   0d.h2 13760294us : async_completed before wake_up(): 
> as=ffc0cc638200
> 
> To fix this problem we can just move the wake_up() under the ps->lock.
> There should be no issues there that I'm aware of.
> 
> Signed-off-by: Douglas Anderson 
> ---
> NOTE: All testing of this patch was done on the Chromium OS kernel
> version 4.4.  The machine tested only barely boots upstream Linux and
> it seemed unlikely I'd be able to reproduce the exact failure there,
> so I didn't try.  The issue seems to clearly apply to upstream based
> on code analysis, though.
> 
>  drivers/usb/core/devio.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
> index ebe27595c4af..0ff0feddfd1f 100644
> --- a/drivers/usb/core/devio.c
> +++ b/drivers/usb/core/devio.c
> @@ -623,6 +623,8 @@ static void async_completed(struct urb *urb)
>   if (as->status < 0 && as->bulk_addr && as->status != -ECONNRESET &&
>   as->status != -ENOENT)
>   cancel_bulk_urbs(ps, as->bulk_addr);
> +
> + wake_up(>wait);
>   spin_unlock(>lock);
>  
>   if (signr) {
> @@ -630,8 +632,6 @@ static void async_completed(struct urb *urb)
>   put_pid(pid);
>   put_cred(cred);
>   }
> -
> - wake_up(>wait);
>  }
>  
>  static void destroy_async(struct usb_dev_state *ps, struct list_head *list)

Acked-by: Alan Stern 

Re: [PATCH] USB: core: Avoid race of async_completed() w/ usbdev_release()

2017-08-11 Thread Alan Stern
On Thu, 10 Aug 2017, Douglas Anderson wrote:

> While running reboot tests w/ a specific set of USB devices (and
> slub_debug enabled), I found that once every few hours my device would
> be crashed with a stack that looked like this:
> 
> [   14.012445] BUG: spinlock bad magic on CPU#0, modprobe/2091
> [   14.012460]  lock: 0xffc0cb055978, .magic: ffc0, .owner: cryption 
> contexts: %lu/%lu
> [   14.012460] /1025536097, .owner_cpu: 0
> [   14.012466] CPU: 0 PID: 2091 Comm: modprobe Not tainted 4.4.79 #352
> [   14.012468] Hardware name: Google Kevin (DT)
> [   14.012471] Call trace:
> [   14.012483] [<>] dump_backtrace+0x0/0x160
> [   14.012487] [<>] show_stack+0x20/0x28
> [   14.012494] [<>] dump_stack+0xb4/0xf0
> [   14.012500] [<>] spin_dump+0x8c/0x98
> [   14.012504] [<>] spin_bug+0x30/0x3c
> [   14.012508] [<>] do_raw_spin_lock+0x40/0x164
> [   14.012515] [<>] _raw_spin_lock_irqsave+0x64/0x74
> [   14.012521] [<>] __wake_up+0x2c/0x60
> [   14.012528] [<>] async_completed+0x2d0/0x300
> [   14.012534] [<>] __usb_hcd_giveback_urb+0xc4/0x138
> [   14.012538] [<>] usb_hcd_giveback_urb+0x54/0xf0
> [   14.012544] [<>] xhci_irq+0x1314/0x1348
> [   14.012548] [<>] usb_hcd_irq+0x40/0x50
> [   14.012553] [<>] handle_irq_event_percpu+0x1b4/0x3f0
> [   14.012556] [<>] handle_irq_event+0x4c/0x7c
> [   14.012561] [<>] handle_fasteoi_irq+0x158/0x1c8
> [   14.012564] [<>] generic_handle_irq+0x30/0x44
> [   14.012568] [<>] __handle_domain_irq+0x90/0xbc
> [   14.012572] [<>] gic_handle_irq+0xcc/0x18c
> 
> Investigation using kgdb() found that the wait queue that was passed
> into wake_up() had been freed (it was filled with slub_debug poison).
> 
> I analyzed and instrumented the code and reproduced.  My current
> belief is that this is happening:
> 
> 1. async_completed() is called (from IRQ).  Moves "as" onto the
>completed list.
> 2. On another CPU, proc_reapurbnonblock_compat() calls
>async_getcompleted().  Blocks on spinlock.
> 3. async_completed() releases the lock; keeps running; gets blocked
>midway through wake_up().
> 4. proc_reapurbnonblock_compat() => async_getcompleted() gets the
>lock; removes "as" from completed list and frees it.
> 5. usbdev_release() is called.  Frees "ps".
> 6. async_completed() finally continues running wake_up().  ...but
>wake_up() has a pointer to the freed "ps".

That looks like a good analysis.

> The instrumentation that led me to believe this was based on adding
> some trace_printk() calls in a select few functions and then using
> kdb's "ftdump" at crash time.  The trace follows (NOTE: in the trace
> below I cheated a little bit and added a udelay(1000) in
> async_completed() after releasing the spinlock because I wanted it to
> trigger quicker):
> 
> <...>-2104   0d.h2 13759034us!: async_completed at start: as=ffc0cc638200
> mtpd-20553 13759356us : async_getcompleted before spin_lock_irqsave
> mtpd-20553d..1 13759362us : async_getcompleted after list_del_init: 
> as=ffc0cc638200
> mtpd-20553 13759371us+: proc_reapurbnonblock_compat: 
> free_async(ffc0cc638200)
> mtpd-20553 13759422us+: async_getcompleted before spin_lock_irqsave
> mtpd-20553 13759479us : usbdev_release at start: ps=ffc0cc042080
> mtpd-20553 13759487us : async_getcompleted before spin_lock_irqsave
> mtpd-20553 13759497us!: usbdev_release after kfree(ps): 
> ps=ffc0cc042080
> <...>-2104   0d.h2 13760294us : async_completed before wake_up(): 
> as=ffc0cc638200
> 
> To fix this problem we can just move the wake_up() under the ps->lock.
> There should be no issues there that I'm aware of.
> 
> Signed-off-by: Douglas Anderson 
> ---
> NOTE: All testing of this patch was done on the Chromium OS kernel
> version 4.4.  The machine tested only barely boots upstream Linux and
> it seemed unlikely I'd be able to reproduce the exact failure there,
> so I didn't try.  The issue seems to clearly apply to upstream based
> on code analysis, though.
> 
>  drivers/usb/core/devio.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
> index ebe27595c4af..0ff0feddfd1f 100644
> --- a/drivers/usb/core/devio.c
> +++ b/drivers/usb/core/devio.c
> @@ -623,6 +623,8 @@ static void async_completed(struct urb *urb)
>   if (as->status < 0 && as->bulk_addr && as->status != -ECONNRESET &&
>   as->status != -ENOENT)
>   cancel_bulk_urbs(ps, as->bulk_addr);
> +
> + wake_up(>wait);
>   spin_unlock(>lock);
>  
>   if (signr) {
> @@ -630,8 +632,6 @@ static void async_completed(struct urb *urb)
>   put_pid(pid);
>   put_cred(cred);
>   }
> -
> - wake_up(>wait);
>  }
>  
>  static void destroy_async(struct usb_dev_state *ps, struct list_head *list)

Acked-by: Alan Stern 



[PATCH] USB: core: Avoid race of async_completed() w/ usbdev_release()

2017-08-10 Thread Douglas Anderson
While running reboot tests w/ a specific set of USB devices (and
slub_debug enabled), I found that once every few hours my device would
be crashed with a stack that looked like this:

[   14.012445] BUG: spinlock bad magic on CPU#0, modprobe/2091
[   14.012460]  lock: 0xffc0cb055978, .magic: ffc0, .owner: cryption 
contexts: %lu/%lu
[   14.012460] /1025536097, .owner_cpu: 0
[   14.012466] CPU: 0 PID: 2091 Comm: modprobe Not tainted 4.4.79 #352
[   14.012468] Hardware name: Google Kevin (DT)
[   14.012471] Call trace:
[   14.012483] [<>] dump_backtrace+0x0/0x160
[   14.012487] [<>] show_stack+0x20/0x28
[   14.012494] [<>] dump_stack+0xb4/0xf0
[   14.012500] [<>] spin_dump+0x8c/0x98
[   14.012504] [<>] spin_bug+0x30/0x3c
[   14.012508] [<>] do_raw_spin_lock+0x40/0x164
[   14.012515] [<>] _raw_spin_lock_irqsave+0x64/0x74
[   14.012521] [<>] __wake_up+0x2c/0x60
[   14.012528] [<>] async_completed+0x2d0/0x300
[   14.012534] [<>] __usb_hcd_giveback_urb+0xc4/0x138
[   14.012538] [<>] usb_hcd_giveback_urb+0x54/0xf0
[   14.012544] [<>] xhci_irq+0x1314/0x1348
[   14.012548] [<>] usb_hcd_irq+0x40/0x50
[   14.012553] [<>] handle_irq_event_percpu+0x1b4/0x3f0
[   14.012556] [<>] handle_irq_event+0x4c/0x7c
[   14.012561] [<>] handle_fasteoi_irq+0x158/0x1c8
[   14.012564] [<>] generic_handle_irq+0x30/0x44
[   14.012568] [<>] __handle_domain_irq+0x90/0xbc
[   14.012572] [<>] gic_handle_irq+0xcc/0x18c

Investigation using kgdb() found that the wait queue that was passed
into wake_up() had been freed (it was filled with slub_debug poison).

I analyzed and instrumented the code and reproduced.  My current
belief is that this is happening:

1. async_completed() is called (from IRQ).  Moves "as" onto the
   completed list.
2. On another CPU, proc_reapurbnonblock_compat() calls
   async_getcompleted().  Blocks on spinlock.
3. async_completed() releases the lock; keeps running; gets blocked
   midway through wake_up().
4. proc_reapurbnonblock_compat() => async_getcompleted() gets the
   lock; removes "as" from completed list and frees it.
5. usbdev_release() is called.  Frees "ps".
6. async_completed() finally continues running wake_up().  ...but
   wake_up() has a pointer to the freed "ps".

The instrumentation that led me to believe this was based on adding
some trace_printk() calls in a select few functions and then using
kdb's "ftdump" at crash time.  The trace follows (NOTE: in the trace
below I cheated a little bit and added a udelay(1000) in
async_completed() after releasing the spinlock because I wanted it to
trigger quicker):

<...>-2104   0d.h2 13759034us!: async_completed at start: as=ffc0cc638200
mtpd-20553 13759356us : async_getcompleted before spin_lock_irqsave
mtpd-20553d..1 13759362us : async_getcompleted after list_del_init: 
as=ffc0cc638200
mtpd-20553 13759371us+: proc_reapurbnonblock_compat: 
free_async(ffc0cc638200)
mtpd-20553 13759422us+: async_getcompleted before spin_lock_irqsave
mtpd-20553 13759479us : usbdev_release at start: ps=ffc0cc042080
mtpd-20553 13759487us : async_getcompleted before spin_lock_irqsave
mtpd-20553 13759497us!: usbdev_release after kfree(ps): 
ps=ffc0cc042080
<...>-2104   0d.h2 13760294us : async_completed before wake_up(): 
as=ffc0cc638200

To fix this problem we can just move the wake_up() under the ps->lock.
There should be no issues there that I'm aware of.

Signed-off-by: Douglas Anderson 
---
NOTE: All testing of this patch was done on the Chromium OS kernel
version 4.4.  The machine tested only barely boots upstream Linux and
it seemed unlikely I'd be able to reproduce the exact failure there,
so I didn't try.  The issue seems to clearly apply to upstream based
on code analysis, though.

 drivers/usb/core/devio.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index ebe27595c4af..0ff0feddfd1f 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -623,6 +623,8 @@ static void async_completed(struct urb *urb)
if (as->status < 0 && as->bulk_addr && as->status != -ECONNRESET &&
as->status != -ENOENT)
cancel_bulk_urbs(ps, as->bulk_addr);
+
+   wake_up(>wait);
spin_unlock(>lock);
 
if (signr) {
@@ -630,8 +632,6 @@ static void async_completed(struct urb *urb)
put_pid(pid);
put_cred(cred);
}
-
-   wake_up(>wait);
 }
 
 static void destroy_async(struct usb_dev_state *ps, struct list_head *list)
-- 
2.14.0.434.g98096fd7a8-goog



[PATCH] USB: core: Avoid race of async_completed() w/ usbdev_release()

2017-08-10 Thread Douglas Anderson
While running reboot tests w/ a specific set of USB devices (and
slub_debug enabled), I found that once every few hours my device would
be crashed with a stack that looked like this:

[   14.012445] BUG: spinlock bad magic on CPU#0, modprobe/2091
[   14.012460]  lock: 0xffc0cb055978, .magic: ffc0, .owner: cryption 
contexts: %lu/%lu
[   14.012460] /1025536097, .owner_cpu: 0
[   14.012466] CPU: 0 PID: 2091 Comm: modprobe Not tainted 4.4.79 #352
[   14.012468] Hardware name: Google Kevin (DT)
[   14.012471] Call trace:
[   14.012483] [<>] dump_backtrace+0x0/0x160
[   14.012487] [<>] show_stack+0x20/0x28
[   14.012494] [<>] dump_stack+0xb4/0xf0
[   14.012500] [<>] spin_dump+0x8c/0x98
[   14.012504] [<>] spin_bug+0x30/0x3c
[   14.012508] [<>] do_raw_spin_lock+0x40/0x164
[   14.012515] [<>] _raw_spin_lock_irqsave+0x64/0x74
[   14.012521] [<>] __wake_up+0x2c/0x60
[   14.012528] [<>] async_completed+0x2d0/0x300
[   14.012534] [<>] __usb_hcd_giveback_urb+0xc4/0x138
[   14.012538] [<>] usb_hcd_giveback_urb+0x54/0xf0
[   14.012544] [<>] xhci_irq+0x1314/0x1348
[   14.012548] [<>] usb_hcd_irq+0x40/0x50
[   14.012553] [<>] handle_irq_event_percpu+0x1b4/0x3f0
[   14.012556] [<>] handle_irq_event+0x4c/0x7c
[   14.012561] [<>] handle_fasteoi_irq+0x158/0x1c8
[   14.012564] [<>] generic_handle_irq+0x30/0x44
[   14.012568] [<>] __handle_domain_irq+0x90/0xbc
[   14.012572] [<>] gic_handle_irq+0xcc/0x18c

Investigation using kgdb() found that the wait queue that was passed
into wake_up() had been freed (it was filled with slub_debug poison).

I analyzed and instrumented the code and reproduced.  My current
belief is that this is happening:

1. async_completed() is called (from IRQ).  Moves "as" onto the
   completed list.
2. On another CPU, proc_reapurbnonblock_compat() calls
   async_getcompleted().  Blocks on spinlock.
3. async_completed() releases the lock; keeps running; gets blocked
   midway through wake_up().
4. proc_reapurbnonblock_compat() => async_getcompleted() gets the
   lock; removes "as" from completed list and frees it.
5. usbdev_release() is called.  Frees "ps".
6. async_completed() finally continues running wake_up().  ...but
   wake_up() has a pointer to the freed "ps".

The instrumentation that led me to believe this was based on adding
some trace_printk() calls in a select few functions and then using
kdb's "ftdump" at crash time.  The trace follows (NOTE: in the trace
below I cheated a little bit and added a udelay(1000) in
async_completed() after releasing the spinlock because I wanted it to
trigger quicker):

<...>-2104   0d.h2 13759034us!: async_completed at start: as=ffc0cc638200
mtpd-20553 13759356us : async_getcompleted before spin_lock_irqsave
mtpd-20553d..1 13759362us : async_getcompleted after list_del_init: 
as=ffc0cc638200
mtpd-20553 13759371us+: proc_reapurbnonblock_compat: 
free_async(ffc0cc638200)
mtpd-20553 13759422us+: async_getcompleted before spin_lock_irqsave
mtpd-20553 13759479us : usbdev_release at start: ps=ffc0cc042080
mtpd-20553 13759487us : async_getcompleted before spin_lock_irqsave
mtpd-20553 13759497us!: usbdev_release after kfree(ps): 
ps=ffc0cc042080
<...>-2104   0d.h2 13760294us : async_completed before wake_up(): 
as=ffc0cc638200

To fix this problem we can just move the wake_up() under the ps->lock.
There should be no issues there that I'm aware of.

Signed-off-by: Douglas Anderson 
---
NOTE: All testing of this patch was done on the Chromium OS kernel
version 4.4.  The machine tested only barely boots upstream Linux and
it seemed unlikely I'd be able to reproduce the exact failure there,
so I didn't try.  The issue seems to clearly apply to upstream based
on code analysis, though.

 drivers/usb/core/devio.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index ebe27595c4af..0ff0feddfd1f 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -623,6 +623,8 @@ static void async_completed(struct urb *urb)
if (as->status < 0 && as->bulk_addr && as->status != -ECONNRESET &&
as->status != -ENOENT)
cancel_bulk_urbs(ps, as->bulk_addr);
+
+   wake_up(>wait);
spin_unlock(>lock);
 
if (signr) {
@@ -630,8 +632,6 @@ static void async_completed(struct urb *urb)
put_pid(pid);
put_cred(cred);
}
-
-   wake_up(>wait);
 }
 
 static void destroy_async(struct usb_dev_state *ps, struct list_head *list)
-- 
2.14.0.434.g98096fd7a8-goog