[PATCH v2] r8152: fix lockup when runtime PM is enabled

2015-12-08 Thread Peter Wu
When an interface is brought up which was previously suspended (via
runtime PM), it would hang. This happens because napi_disable is called
before napi_enable.

Solve this by avoiding napi_enable in the resume during open function
(netif_running is true when open is called, IFF_UP is set after a
successful open; netif_running is false when close is called, but IFF_UP
is then still set).

While at it, remove WORK_ENABLE check from rtl8152_open (introduced with
the original change) because it cannot happen:

 - After this patch, runtime resume will not set it during rtl8152_open.
 - When link is up, rtl8152_open is not called.
 - When link is down during system/auto suspend/resume, it is not set.

Fixes: 41cec84cf285 ("r8152: don't enable napi before rx ready")
Link: https://lkml.kernel.org/r/20151205105912.GA1766@al
Signed-off-by: Peter Wu <pe...@lekensteyn.nl>
---
 v2: moved rtl_runtime_suspend_enable from close to rtl8152_suspend

Tested with the scenario from
https://lkml.kernel.org/r/20151208111008.GA18728@al
(added debug prints to verify the suspend/resume moments)
---
 drivers/net/usb/r8152.c | 21 +++--
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index d9427ca..2e32c41 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -3067,17 +3067,6 @@ static int rtl8152_open(struct net_device *netdev)
 
mutex_lock(>control);
 
-   /* The WORK_ENABLE may be set when autoresume occurs */
-   if (test_bit(WORK_ENABLE, >flags)) {
-   clear_bit(WORK_ENABLE, >flags);
-   usb_kill_urb(tp->intr_urb);
-   cancel_delayed_work_sync(>schedule);
-
-   /* disable the tx/rx, if the workqueue has enabled them. */
-   if (netif_carrier_ok(netdev))
-   tp->rtl_ops.disable(tp);
-   }
-
tp->rtl_ops.up(tp);
 
rtl8152_set_speed(tp, AUTONEG_ENABLE,
@@ -3124,12 +3113,6 @@ static int rtl8152_close(struct net_device *netdev)
} else {
mutex_lock(>control);
 
-   /* The autosuspend may have been enabled and wouldn't
-* be disable when autoresume occurs, because the
-* netif_running() would be false.
-*/
-   rtl_runtime_suspend_enable(tp, false);
-
tp->rtl_ops.down(tp);
 
mutex_unlock(>control);
@@ -3512,7 +3495,7 @@ static int rtl8152_resume(struct usb_interface *intf)
netif_device_attach(tp->netdev);
}
 
-   if (netif_running(tp->netdev)) {
+   if (netif_running(tp->netdev) && tp->netdev->flags & IFF_UP) {
if (test_bit(SELECTIVE_SUSPEND, >flags)) {
rtl_runtime_suspend_enable(tp, false);
clear_bit(SELECTIVE_SUSPEND, >flags);
@@ -3532,6 +3515,8 @@ static int rtl8152_resume(struct usb_interface *intf)
}
usb_submit_urb(tp->intr_urb, GFP_KERNEL);
} else if (test_bit(SELECTIVE_SUSPEND, >flags)) {
+   if (tp->netdev->flags & IFF_UP)
+   rtl_runtime_suspend_enable(tp, false);
clear_bit(SELECTIVE_SUSPEND, >flags);
}
 
-- 
2.6.3

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] r8152: fix lockup when runtime PM is enabled

2015-12-08 Thread Peter Wu
On Tue, Dec 08, 2015 at 03:18:59AM +, Hayes Wang wrote:
> Peter Wu
> > Sent: Tuesday, December 08, 2015 12:59 AM
> [...]
> > +   if (tp->netdev->flags & IFF_UP) {
> 
> Maybe you could just replace the checking of netif_running(tp->netdev)
> with this.

Simply replacing netif_running by IFF_UP does not work, it hangs during
close when the device is suspended. This patch is correct, but I have a
v2 patch that moves rtl_runtime_suspend_enable from close to suspend.

This is the evaluated scenario (run = netif_running, up = IFF_UP set):

# suspended before open
suspend (run=0, up=0)
open(run=1)
resume  (run=1, up=0)   <-- fixed by patch
(open ends)

# while up
suspend (run=1, up=1)
resume  (run=1, up=1)   <-- no issue, values match
suspend (run=1, up=1)

# close while suspended
close   (run=0, up=1)
resume  (run=0, up=1)   <-- fixed in patch v2
(close cont and ends)   <-- rtl_runtime_suspend_enable removed in v2
suspend (run=0, up=0)

# while down
resume  (run=0, up=0)
suspend (run=0, up=0)

# open while suspended, open fails
open(run=1)
resume  (run=1, up=0)   <-- fixed by patch
(open fails)
suspend (run=0, up=0)

> Excuse me. I have a question. Before the open() is finished, the
> netif_running() would be true, but the IFF_UP wouldn't be set. Is it
> right?

That is right, this happens behind the scenes:

# open
netif_running = true
open()
if open succeeded
set IFF_UP
else
netif_running = false

# close
netif_running = false
close()
clear IFF_UP
-- 
Kind regards,
Peter Wu
https://lekensteyn.nl
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] r8152: fix lockup when runtime PM is enabled

2015-12-08 Thread Peter Wu
On Tue, Dec 08, 2015 at 12:39:12PM +, Hayes Wang wrote:
> Peter Wu [mailto:pe...@lekensteyn.nl]
> > Sent: Tuesday, December 08, 2015 7:18 PM
> > 
> > When an interface is brought up which was previously suspended (via
> > runtime PM), it would hang. This happens because napi_disable is called
> > before napi_enable.
> > 
> > Solve this by avoiding napi_enable in the resume during open function
> > (netif_running is true when open is called, IFF_UP is set after a
> > successful open; netif_running is false when close is called, but IFF_UP
> > is then still set).
> > 
> > While at it, remove WORK_ENABLE check from rtl8152_open (introduced with
> > the original change) because it cannot happen:
> > 
> >  - After this patch, runtime resume will not set it during rtl8152_open.
> >  - When link is up, rtl8152_open is not called.
> >  - When link is down during system/auto suspend/resume, it is not set.
> > 
> > Fixes: 41cec84cf285 ("r8152: don't enable napi before rx ready")
> > Link: https://lkml.kernel.org/r/20151205105912.GA1766@al
> > Signed-off-by: Peter Wu <pe...@lekensteyn.nl>
> > ---
> >  v2: moved rtl_runtime_suspend_enable from close to rtl8152_suspend
> 
> Acked-by: Hayes Wang <hayesw...@realtek.com>
> 
> Best Regards,
> Hayes

I found another problem with runtime PM. When a device is suspended via
autosuspend and a system suspend takes place, there is no network I/O
after resume. Triggering a renegotiation (ethtool -r eth1) brings back
network activity.

Can you have a look? I guess that reset_resume needs different
treatment, but don't know how to do it properly as suspend is not called
before system reset (because the device is apparently already in
suspended state).

I think that this new issue can be addressed in a different patch, this
patch solves a more severe lockup. Opinions?
-- 
Kind regards,
Peter Wu
https://lekensteyn.nl
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] r8152: fix lockup when runtime PM is enabled

2015-12-07 Thread Peter Wu
When an interface is brought up which was previously suspended (via
runtime PM), it would hang. This happens because napi_disable is called
before napi_enable.

Solve this by avoiding napi_disable before the device is fully up.

While at it, remove WORK_ENABLE check from rtl8152_open (introduced with
the original change) because it cannot happen:

 - After this patch, runtime resume will not set it during rtl8152_open.
 - When link is up, rtl8152_open is not called.
 - When link is down during system/auto suspend/resume, it is not set.

Fixes: 41cec84cf285 ("r8152: don't enable napi before rx ready")
Link: https://lkml.kernel.org/r/20151205105912.GA1766@al
Signed-off-by: Peter Wu <pe...@lekensteyn.nl>
---
 drivers/net/usb/r8152.c | 23 +++
 1 file changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index d9427ca..b8b083e 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -3067,17 +3067,6 @@ static int rtl8152_open(struct net_device *netdev)
 
mutex_lock(>control);
 
-   /* The WORK_ENABLE may be set when autoresume occurs */
-   if (test_bit(WORK_ENABLE, >flags)) {
-   clear_bit(WORK_ENABLE, >flags);
-   usb_kill_urb(tp->intr_urb);
-   cancel_delayed_work_sync(>schedule);
-
-   /* disable the tx/rx, if the workqueue has enabled them. */
-   if (netif_carrier_ok(netdev))
-   tp->rtl_ops.disable(tp);
-   }
-
tp->rtl_ops.up(tp);
 
rtl8152_set_speed(tp, AUTONEG_ENABLE,
@@ -3516,11 +3505,13 @@ static int rtl8152_resume(struct usb_interface *intf)
if (test_bit(SELECTIVE_SUSPEND, >flags)) {
rtl_runtime_suspend_enable(tp, false);
clear_bit(SELECTIVE_SUSPEND, >flags);
-   napi_disable(>napi);
-   set_bit(WORK_ENABLE, >flags);
-   if (netif_carrier_ok(tp->netdev))
-   rtl_start_rx(tp);
-   napi_enable(>napi);
+   if (tp->netdev->flags & IFF_UP) {
+   napi_disable(>napi);
+   set_bit(WORK_ENABLE, >flags);
+   if (netif_carrier_ok(tp->netdev))
+   rtl_start_rx(tp);
+   napi_enable(>napi);
+   }
} else {
tp->rtl_ops.up(tp);
rtl8152_set_speed(tp, AUTONEG_ENABLE,
-- 
2.6.3

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: (4.3.0) r8152: deadlock related to runtime suspend?

2015-12-07 Thread Peter Wu
On Mon, Dec 07, 2015 at 05:11:50PM +0800, Lu Baolu wrote:
> Hi Peter,
> 
> Have you ever tried disabling auto-pm? Did things go smoothly if auto-pm is 
> disabled?
> 
> I always disable usb auto-pm in below way.
> 
> # echo on | tee /sys/bus/usb/devices/*/power/control
> # echo on > /sys/bus/pci/devices//power/control
> 
> Thanks,
> Baolu

Hi Baolu,

The deadlock does not seem to occur with auto-PM disabled, but that is a
workaround for the issue. The hang can always be reproduced under this
test:

 - Start a QEMU VM, passing through the USB adapter
 - This VM boots to a busybox shell with no other services running or
   udev magic (to reduce interference).
 - Enable runtime PM for all devices by default (see script below)
 - From the console, invoke "ip link set eth1 up" (eth0 is a virtio
   adapter).

# somewhere in /init after mounting filesystems
echo /sbin/hotplug > /proc/sys/kernel/hotplug
echo auto | tee  /sys/bus/pci/devices/*/power/control \
/sys/bus/usb/devices/*/power/control >/dev/null

#!/bin/sh
# /sbin/hotplug
path="/sys/$DEVPATH/power/control"
[ -e "$path" ] || return
newval=auto
read status < "$path"
if [ "x$status" != "x$newval" ]; then
echo "$DEVPATH: $status -> $newval" >/dev/kmsg
echo $newval > "$path"
fi

With "auto", the ip command hangs (a trace can be found on the bottom of
this mail). With "on", it does not.

If I keep a loop spinning that invokes `ethtool eth1`, the command
returns immediately without issues (presumably because the device is not
suspended through runtime PM).

Under some circumstances I get a lockdep warning (when trying to bring
an interface down if I remember correctly). Its trace can be found on
the bottom of this mail.

I'll keep testing. For the lockdep warning, my initial guess is that
calling schedule_delayed_work_sync under tp->lock is a bad idea because
scheduled work can execute and try to claim tp->lock too.

Maybe there are two different lockup cases here, I'll keep testing.

Kind regards,
Peter

> On 12/05/2015 06:59 PM, Peter Wu wrote:
> > Hi,
> >
> > I rarely use a Realtek USB 3.0 Gigabit Ethernet adapter (vid/pid
> > 0bda:8153), but when I did last night, it resulted in a lockup of
> > processes doing networking ("ip link", "ping", "ethtool", ...).
> >
> > A (few) minute(s) before that event, I noticed that there was no network
> > connectivity (ping hung) which was somehow solved by invoking "ethtool
> > eth1" (triggering runtime pm wakeup?). This same trick did not work at
> > the next event. Invoking "ethtool eth1", "ip link", etc. hung completely
> > and interrupt (^C) did not work at all.
> >
> > Since that did not work, I pulled the USB adapter and re-inserted it,
> > hoping it would reset things. That did not work at all, there was a
> > "usb disconnect" message, but no further driver messages.
> >
> > Fast forward an hour, and it has become a disaster. I have terminated
> > and killed many programs via SysRq but am still unable to get a stable
> > system that does not hang on network I/O. Even the suspend process
> > fails so in the end I attempted to shutdown the system. After half an
> > hour after getting the poweroff message, I issued SysRq + B to reboot
> > (since SysRq + O did not shut down either).
> >
> > Attached are logs with various various backtraces from SysRq and failed
> > suspend. Let me know if you need more information!
> >
> > By the way, often I have to rmmod xhci and re-insert it, otherwise
> > plugging it in does not result in a detection. A USB 2.0 port does not
> > have this problem (runtime PM is enabled for all devices). This is the
> > USB 3.0 port:
> >
> > 02:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0
> > Host Controller [1033:0194] (rev 03)
> 

-- 

lockdep splat from the bare machine:

==
[ INFO: possible circular locking dependency detected ]
4.3.0-custom #1 Tainted: G   O   
---
kworker/0:1/38 is trying to acquire lock:
 (>control){+.+.+.}, at: [] rtl8152_resume+0x24/0x130 
[r8152]

but task is already holding lock:
 ((&(>schedule)->work)){+.+.+.}, at: [] 
process_one_work+0x15c/0x660

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 ((&(>schedule)->work)){+.+.+.}:

[] lock_acquire+0xc3/0x1d0
   [] flush_work+0x3d/0x290

Re: (4.3.0) r8152: deadlock related to runtime suspend?

2015-12-07 Thread Peter Wu
On Mon, Dec 07, 2015 at 07:08:50PM +0800, Lu Baolu wrote:
> 
> 
> On 12/07/2015 05:37 PM, Peter Wu wrote:
> > On Mon, Dec 07, 2015 at 05:11:50PM +0800, Lu Baolu wrote:
> >> Hi Peter,
> >>
> >> Have you ever tried disabling auto-pm? Did things go smoothly if auto-pm 
> >> is disabled?
> >>
> >> I always disable usb auto-pm in below way.
> >>
> >> # echo on | tee /sys/bus/usb/devices/*/power/control
> >> # echo on > /sys/bus/pci/devices//power/control
> >>
> >> Thanks,
> >> Baolu
> > Hi Baolu,
> >
> > The deadlock does not seem to occur with auto-PM disabled, but that is a
> > workaround for the issue. The hang can always be reproduced under this
> > test:
> >
> >  - Start a QEMU VM, passing through the USB adapter
> 
> I would suggest you to start with bare metal.
> 
> When you pass through the host controller to a guest VM, you
> probably use IOMMU unit to let hardware access the memory
> directly, but things like pci configure space access, interrupt and
> IO port access still rely on QEMU. This introduces a lot of complexities.

It is a USB device, not a PCI device, so such issues do not apply here
I think.

I have found a possible reason for this lockup. The resume code may
execute napi_disable while napi_enable was not called before. This
autoresume thing happens in the open function which explains why all
other rtnl users are blocked.

Is this a sane analysis?

Kind regards,
Peter

> Thanks,
> Baolu
> 
> >  - This VM boots to a busybox shell with no other services running or
> >udev magic (to reduce interference).
> >  - Enable runtime PM for all devices by default (see script below)
> >  - From the console, invoke "ip link set eth1 up" (eth0 is a virtio
> >adapter).
> >
> > # somewhere in /init after mounting filesystems
> > echo /sbin/hotplug > /proc/sys/kernel/hotplug
> > echo auto | tee  /sys/bus/pci/devices/*/power/control \
> > /sys/bus/usb/devices/*/power/control >/dev/null
> >
> > #!/bin/sh
> > # /sbin/hotplug
> > path="/sys/$DEVPATH/power/control"
> > [ -e "$path" ] || return
> > newval=auto
> > read status < "$path"
> > if [ "x$status" != "x$newval" ]; then
> > echo "$DEVPATH: $status -> $newval" >/dev/kmsg
> > echo $newval > "$path"
> > fi
> >
> > With "auto", the ip command hangs (a trace can be found on the bottom of
> > this mail). With "on", it does not.
> >
> > If I keep a loop spinning that invokes `ethtool eth1`, the command
> > returns immediately without issues (presumably because the device is not
> > suspended through runtime PM).
> >
> > Under some circumstances I get a lockdep warning (when trying to bring
> > an interface down if I remember correctly). Its trace can be found on
> > the bottom of this mail.
> >
> > I'll keep testing. For the lockdep warning, my initial guess is that
> > calling schedule_delayed_work_sync under tp->lock is a bad idea because
> > scheduled work can execute and try to claim tp->lock too.
> >
> > Maybe there are two different lockup cases here, I'll keep testing.
> >
> > Kind regards,
> > Peter
> >
> >> On 12/05/2015 06:59 PM, Peter Wu wrote:
> >>> Hi,
> >>>
> >>> I rarely use a Realtek USB 3.0 Gigabit Ethernet adapter (vid/pid
> >>> 0bda:8153), but when I did last night, it resulted in a lockup of
> >>> processes doing networking ("ip link", "ping", "ethtool", ...).
> >>>
> >>> A (few) minute(s) before that event, I noticed that there was no network
> >>> connectivity (ping hung) which was somehow solved by invoking "ethtool
> >>> eth1" (triggering runtime pm wakeup?). This same trick did not work at
> >>> the next event. Invoking "ethtool eth1", "ip link", etc. hung completely
> >>> and interrupt (^C) did not work at all.
> >>>
> >>> Since that did not work, I pulled the USB adapter and re-inserted it,
> >>> hoping it would reset things. That did not work at all, there was a
> >>> "usb disconnect" message, but no further driver messages.
> >>>
> >>> Fast forward an hour, and it has become a disaster. I have terminated
> >>> and killed many programs via SysRq but am still unable to get a stable
> >>> s

(4.3.0) r8152: deadlock related to runtime suspend?

2015-12-05 Thread Peter Wu
Hi,

I rarely use a Realtek USB 3.0 Gigabit Ethernet adapter (vid/pid
0bda:8153), but when I did last night, it resulted in a lockup of
processes doing networking ("ip link", "ping", "ethtool", ...).

A (few) minute(s) before that event, I noticed that there was no network
connectivity (ping hung) which was somehow solved by invoking "ethtool
eth1" (triggering runtime pm wakeup?). This same trick did not work at
the next event. Invoking "ethtool eth1", "ip link", etc. hung completely
and interrupt (^C) did not work at all.

Since that did not work, I pulled the USB adapter and re-inserted it,
hoping it would reset things. That did not work at all, there was a
"usb disconnect" message, but no further driver messages.

Fast forward an hour, and it has become a disaster. I have terminated
and killed many programs via SysRq but am still unable to get a stable
system that does not hang on network I/O. Even the suspend process
fails so in the end I attempted to shutdown the system. After half an
hour after getting the poweroff message, I issued SysRq + B to reboot
(since SysRq + O did not shut down either).

Attached are logs with various various backtraces from SysRq and failed
suspend. Let me know if you need more information!

By the way, often I have to rmmod xhci and re-insert it, otherwise
plugging it in does not result in a detection. A USB 2.0 port does not
have this problem (runtime PM is enabled for all devices). This is the
USB 3.0 port:

02:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0
Host Controller [1033:0194] (rev 03)
-- 
Kind regards,
Peter Wu
https://lekensteyn.nl
Dec 05 00:32:58 al kernel: usb 3-1: USB disconnect, device number 4
Dec 05 00:34:43 al NetworkManager[441]:   Could not get scan request 
result: Timeout was reached
Dec 05 00:35:45 al NetworkManager[441]:   Could not get scan request 
result: Timeout was reached
Dec 05 00:36:13 al kernel: INFO: task wpa_supplicant:573 blocked for more than 
120 seconds.
Dec 05 00:36:13 al kernel:   Tainted: G   O4.3.0-1-ARCH #1
Dec 05 00:36:13 al kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
Dec 05 00:36:13 al kernel: wpa_supplicant  D 88023bc956c0 0   573  
1 0x
Dec 05 00:36:13 al kernel:  8800aca2fa48 0086 880232600c80 
88022d101900
Dec 05 00:36:13 al kernel:  8800aca3 818b4504 88022d101900 

Dec 05 00:36:13 al kernel:  818b4508 8800aca2fa60 8157a09a 
818b4500
Dec 05 00:36:13 al kernel: Call Trace:
Dec 05 00:36:13 al kernel:  [] schedule+0x3a/0x90
Dec 05 00:36:13 al kernel:  [] 
schedule_preempt_disabled+0x15/0x20
Dec 05 00:36:13 al kernel:  [] 
__mutex_lock_slowpath+0xce/0x140
Dec 05 00:36:13 al kernel:  [] mutex_lock+0x17/0x30
Dec 05 00:36:13 al kernel:  [] rtnl_lock+0x15/0x20
Dec 05 00:36:13 al kernel:  [] nl80211_pre_doit+0xeb/0x180 
[cfg80211]
Dec 05 00:36:13 al kernel:  [] genl_family_rcv_msg+0x1c8/0x3c0
Dec 05 00:36:13 al kernel:  [] ? 
genl_family_rcv_msg+0x3c0/0x3c0
Dec 05 00:36:13 al kernel:  [] genl_rcv_msg+0x76/0xb0
Dec 05 00:36:13 al kernel:  [] netlink_rcv_skb+0xa4/0xc0
Dec 05 00:36:13 al kernel:  [] genl_rcv+0x28/0x40
Dec 05 00:36:13 al kernel:  [] netlink_unicast+0x11a/0x1a0
Dec 05 00:36:13 al kernel:  [] netlink_sendmsg+0x4f8/0x610
Dec 05 00:36:13 al kernel:  [] sock_sendmsg+0x17/0x30
Dec 05 00:36:13 al kernel:  [] ___sys_sendmsg+0x295/0x2b0
Dec 05 00:36:13 al kernel:  [] ? 
should_numa_migrate_memory+0x54/0x130
Dec 05 00:36:13 al kernel:  [] ? handle_mm_fault+0xb38/0x18a0
Dec 05 00:36:13 al kernel:  [] __sys_sendmsg+0x51/0x90
Dec 05 00:36:13 al kernel:  [] SyS_sendmsg+0x12/0x20
Dec 05 00:36:13 al kernel:  [] 
entry_SYSCALL_64_fastpath+0x12/0x71
Dec 05 00:36:13 al kernel: INFO: task Qt bearer threa:8292 blocked for more 
than 120 seconds.
Dec 05 00:36:13 al kernel:   Tainted: G   O4.3.0-1-ARCH #1
Dec 05 00:36:13 al kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
Dec 05 00:36:13 al kernel: Qt bearer threa D 88023bd156c0 0  8292   
8222 0x
Dec 05 00:36:13 al kernel:  8800ac9bfc88 0086 880232601900 
880214323200
Dec 05 00:36:13 al kernel:  8800ac9c 818b4504 880214323200 

Dec 05 00:36:13 al kernel:  818b4508 8800ac9bfca0 8157a09a 
818b4500
Dec 05 00:36:13 al kernel: Call Trace:
Dec 05 00:36:13 al kernel:  [] schedule+0x3a/0x90
Dec 05 00:36:13 al kernel:  [] 
schedule_preempt_disabled+0x15/0x20
Dec 05 00:36:13 al kernel:  [] 
__mutex_lock_slowpath+0xce/0x140
Dec 05 00:36:13 al kernel:  [] mutex_lock+0x17/0x30
Dec 05 00:36:13 al kernel:  [] rtnetlink_rcv+0x19/0x30
Dec 05 00:36:13 al kernel:  [] netlink_unicast+0x11a/0x1a0
Dec 05 00:36:13 al kernel:  [] netlink_sendmsg+0x4f8/0x610
Dec 05 00:36:13 al kernel: 

[v3.13,v3.15] XHCI breaks suspend, causes overheating

2014-06-10 Thread Peter Wu
Hi,

I do not often suspend my laptop with a device inserted on the USB 3.0
port, but when I did last night, it trigged an immediate wake up. Not
only that, but after resume, some kworkers were hogging CPU. Another
problem is that the laptop overheats in some cases (see end of mail).

Some details:
Distribution: Arch Linux x86_64
USB peripheral: Samsung S3 (GT-i9300) phone
USB HCD: 02:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0 Host 
Controller [1033:0194] (rev 03)
Kernel: v3.15-rc8-105-gc593e89 (also confirmed for v3.13.6, other kernels have 
not been tested)

I managed to get the following backtrace with SysRq + l:

NMI backtrace for cpu 0
CPU: 0 PID: 11849 Comm: kworker/0:2 Tainted: G   O  
3.15.0-rc8-custom-00058-gd2cfd31 #1
Hardware name: CLEVO CO.B7130   
/B7130   , BIOS 6.00 08/27/2010
Workqueue: pm pm_runtime_work
..
[815f8bbf] ? _raw_spin_unlock_irqrestore+0x5f/0x80
[a019b54b] xhci_hub_control+0xbb/0x1450 [xhci_hcd]
[81475cfc] ? usb_hcd_submit_urb+0x2dc/0xa80
[81475de7] usb_hcd_submit_urb+0x3c7/0xa80
[8109f535] ? mark_held_locks+0x75/0xa0
[810a4b91] ? __raw_spin_lock_init+0x21/0x60
[8109d1f1] ? lockdep_init_map+0x61/0x530
[814776e2] usb_submit_urb+0x372/0x600
[814780df] usb_start_wait_urb+0x5f/0x150
[810821ed] ? get_parent_ip+0xd/0x50
[81478295] usb_control_msg+0xc5/0x110
[8146b7cc] hub_port_status+0x7c/0x120
[8146d599] hub_activate+0xc9/0x620
[8109f6bd] ? trace_hardirqs_on_caller+0x15d/0x200
[8146dc01] hub_resume+0x31/0x70
[8147b626] usb_resume_interface.isra.5+0xb6/0x120
[8147b788] usb_suspend_both+0xf8/0x1f0
[8147cb83] usb_runtime_suspend+0x33/0x80
[8147cb50] ? usb_probe_interface+0x330/0x330
[814174f2] __rpm_callback+0x32/0x70
[81417556] rpm_callback+0x26/0xa0
[814180d1] rpm_suspend+0x111/0x660
[81418e3d] __pm_runtime_suspend+0x5d/0x90
[8147cbf0] ? usb_runtime_resume+0x20/0x20
[8147cc1a] usb_runtime_idle+0x2a/0x40
[814174f2] __rpm_callback+0x32/0x70
[8141898a] rpm_idle+0x27a/0x310
[8141947f] pm_runtime_work+0xbf/0xd0
[8106bb54] process_one_work+0x1d4/0x560
[8106baf4] ? process_one_work+0x174/0x560
[8106c93c] worker_thread+0x12c/0x3d0
[8106c810] ? manage_workers.isra.23+0x2e0/0x2e0
[810733d6] kthread+0xf6/0x110
[810732e0] ? kthread_create_on_node+0x200/0x200
[8160152c] ret_from_fork+0x7c/0xb0
[810732e0] ? kthread_create_on_node+0x200/0x200

/proc/../stack was not usable, it just showed start_worker (iirc) on
top. `perf report` mentions kworker and khubd as top users with
call traces corresponding to the above backtrace. CPU usage increased a
bit, but the machine was kind of usable:

root 11856  1.2  0.0  0 0 ?S00:51   0:10 
[kworker/0:3]
peter  555  1.8  1.1 3085412 87532 ?   Sl   Jun09   1:48 kwin
root 3  1.9  0.0  0 0 ?SJun09   1:54 
[ksoftirqd/0]
root14  1.9  0.0  0 0 ?SJun09   1:53 
[ksoftirqd/1]
peter  561  2.1  2.4 3373080 193416 ?  Sl   Jun09   2:06 kdeinit4: 
plasma-desktop [kdeinit]
root 11707  2.4  0.0  0 0 ?S00:36   0:40 
[kworker/1:3]
root   354  2.5  1.0 312392 82996 tty7 Ssl+ Jun09   2:27 /usr/bin/X 
:0 vt7 -nolisten tcp -auth /var/run/xauth/A:0-dfgwjb
root39  4.5  0.0  0 0 ?SJun09   4:24 [khubd]
root 11629 12.9  0.0  0 0 ?S00:33   4:02 
[kworker/1:2]
root 12076 14.6  0.0  0 0 ?S01:02   0:17 
[kworker/1:0]
root 11849 16.5  0.0  0 0 ?S00:44   3:25 
[kworker/0:2]

In an attempt to get more details, I executed:

echo workqueue:workqueue_queue_work  tracing/set_event
cat /sys/kernel/debug/tracing/trace_pipe | pv  /tmp/trace.txt

When the bug is triggered, I see 100k events/s with lines such as:

khubd-39[000] d..3  7342.846004: workqueue_queue_work: work 
struct=8800ae0f0ab8 function=pm_runtime_work workqueue=8802326bde00 
req_cpu=256 cpu=0

Attempts to suspend the laptop again failed consistently. After some
tries, the laptop began to beep (presumably thermal limits). I have
never heard such beeps before, the laptop has never been overheating.



Today, I tried to reproduce this bug again. The reproduction steps are:

 1. Start the ADB server (`adb shell`). There is no device connected
yet.
 2. Connect phone to USB 3.0 port (USB debugging is enabled on the
Android phone).
 3. Try to suspend.
 4. If it wakes up again, try to suspend again. Be 

[3.11-rc5..] NULL pointer dereference, lockup after XHCI device dropped out

2013-08-17 Thread Peter Wu
Hi,

I hit a BUG followed by a lockup when trying to wipe a new external USB 3.0
1TB 2.5 Western Digital My Passport (WDBBEP0010BBK-EESN) HDD.
The following command is used (with sudo):

dd if=/dev/zero of=/dev/sdd bs=1M

In another shell, I kept sending a USR1 signal to watch progress (maybe
not related, but mentioned for completeness):

while killall -USR1 dd; do sleep 30; done

While doing this over SSH, another KDE session was open with someone
playing the Java game Minecraft.

I have so far reproduced this lockup three times. The first time, the
lock-up occured after:

84838+0 records in
84838+0 records out
88959090688 bytes (89 GB) copied, 789.707 s, 113 MB/s

I have no logs from the first lockup. The second time (after a reboot):

388514+1 records in
388514+1 records out
407386685440 bytes (407 GB) copied, 3881.25 s, 105 MB/s

The second log part (not third lockup log) below is from this session.
I omitted the remaining lines that are printed when the CPU got hung.

The third BUG occured 22 minutes (156 GB) after rebooting and running
dd and has similar log entries to the previous lock-up. This kernel used
the same kernel sources, but with XHCI debugging enabled.  No
differences were noticable in the console output, other than the missing
xHCI xhci_drop_endpoint called with disabled ep message in the new
logs. See the first console output below.

I have no idea why the USB 3.0 HDD would drop out, but that should
not trigger a NULL pointer dereference anyhow.

The desktop was running kernel v3.11-rc5-35-gf1d6e17 plus three r8169
patches that are unrelated to this issue and has 24 GB RAM and a i5-2320
CPU. I do not know how to easily reproduce this bug, the current
reproduction method described above takes time.

One possibly related bug is https://bugzilla.kernel.org/show_bug.cgi?id=51641,
but I do not have a panic. Any ideas? If you want me to test something
or open a new bug, let me know.

Regards,
Peter
--

=== third lockup log ===
[   17.724296] usb 6-2: new SuperSpeed USB device number 2 using xhci_hcd
[   17.746522] usb 6-2: Parent hub missing LPM exit latency info.  Power 
management will be impacted.
[   17.759786] usb 6-2: New USB device found, idVendor=1058, idProduct=0748
[   17.766494] usb 6-2: New USB device strings: Mfr=1, Product=2, SerialNumber=5
[   17.773628] usb 6-2: Product: My Passport 0748
[   17.778071] usb 6-2: Manufacturer: Western Digital
[   17.782868] usb 6-2: SerialNumber: 57584231414E35383439
[   17.790595] usb-storage 6-2:1.0: USB Mass Storage device detected
[   17.796995] scsi9 : usb-storage 6-2:1.0
[   18.800604] scsi 9:0:0:0: Direct-Access WD   My Passport 0748 1022 
PQ: 0 ANSI: 6
[   18.808994] scsi 9:0:0:1: Enclosure WD   SES Device   1022 
PQ: 0 ANSI: 6
[   18.817618] sd 9:0:0:0: Attached scsi generic sg4 type 0
[   18.819416] sd 9:0:0:0: [sdd] Spinning up disk...
[   18.827943] scsi 9:0:0:1: Attached scsi generic sg5 type 13
[   19.823185] ..ready
[   20.828276] sd 9:0:0:0: [sdd] 1953458176 512-byte logical blocks: (1.00 
TB/931 GiB)
[   20.836548] sd 9:0:0:0: [sdd] Write Protect is off
[   20.841352] sd 9:0:0:0: [sdd] Mode Sense: 47 00 10 08
[   20.847172] sd 9:0:0:0: [sdd] No Caching mode page present
[   20.852674] sd 9:0:0:0: [sdd] Assuming drive cache: write through
[   20.860714] sd 9:0:0:0: [sdd] No Caching mode page present
[   20.866208] sd 9:0:0:0: [sdd] Assuming drive cache: write through
[   20.885996]  sdd: unknown partition table
[   20.891706] sd 9:0:0:0: [sdd] No Caching mode page present
[   20.897214] sd 9:0:0:0: [sdd] Assuming drive cache: write through
[   20.903310] sd 9:0:0:0: [sdd] Attached SCSI disk
[ 1355.971539] usb 6-2: USB disconnect, device number 2
[ 1385.834230] sd 9:0:0:0: Device offlined - not ready after error recovery
[ 1385.842002] sd 9:0:0:0: [sdd] Unhandled error code
[ 1385.846811] sd 9:0:0:0: [sdd]  
[ 1385.849980] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[ 1385.856015] sd 9:0:0:0: [sdd] CDB: 
[ 1385.859518] Write(10): 2a 00 10 4f eb 00 00 00 f0 00
[ 1385.864748] end_request: I/O error, dev sdd, sector 273672960
[ 1385.870509] quiet_error: 6 callbacks suppressed
[ 1385.875047] Buffer I/O error on device sdd, logical block 34209120
[ 1385.881240] lost page write due to I/O error on sdd
[ 1385.886144] Buffer I/O error on device sdd, logical block 34209121
[ 1385.892326] lost page write due to I/O error on sdd
[ 1385.897221] Buffer I/O error on device sdd, logical block 34209122
[ 1385.903405] lost page write due to I/O error on sdd
[ 1385.908290] Buffer I/O error on device sdd, logical block 34209123
[ 1385.914478] lost page write due to I/O error on sdd
[ 1385.919372] Buffer I/O error on device sdd, logical block 34209124
[ 1385.925557] lost page write due to I/O error on sdd
[ 1385.930448] Buffer I/O error on device sdd, logical block 34209125
[ 1385.936630] lost page write due to I/O error on sdd
[ 1385.941530] Buffer I/O error on device sdd, logical 

Re: [PATCH 1/2] Revert Revert HID: Fix logitech-dj: missing Unifying device issue

2013-08-13 Thread Peter Wu
On Tuesday 13 August 2013 08:13:17 Peter Hurley wrote:
 On 08/12/2013 05:54 PM, Peter Wu wrote:
  On Thursday 18 July 2013 16:28:01 Peter Hurley wrote:
  Before we revert to using the workaround, I'd like to suggest that
  this new hidden problem may be an interaction with the xhci_hcd host
  controller driver only.
  
  Looking at the related bug, the OP indicates the machine only has
  USB3 ports. Additionally, comments #7, #100, and #104 of the original
  bug report add additional information that would seem to confirm
  this suspicion.
  
  Let me add I have this USB device running on the uhci_hcd driver
  with or without this workaround on v3.10.
  
  This problem does not seem specific to xhci, uhci seems also effected.
 
 If true, it would certainly help to have a bug report confirming uhci
 failure from a bare-metal system which contained:
 1) kernel version
 2) complete dmesg output
 3) lsusb -v output
 4) lsmod output
 5) usbmon capture from a plug attempt

I was too fast in drawing a conclusion, besides the kernel I also upgraded 
some other packages. Today the issue also showed up in 3.9.9 + updated 
packages.

When checking the dmesg, the issue solved by this patch did not occur (the 
enumeration was successful).

  Today I
  upgraded a system (running Arch Linux) from kernel 3.9.9 to 3.10.5. After
  a
  reboot to 3.10.5, things broke. The setup:
  
  - There are two USB receivers plugged into USB 1.1 ports (different buses
  according to lsusb, uhci), each receiver is paired to a K360 keyboard.
  - One of the receivers are passed to a QEMU guest with -usbdevice
  host:$busid. $devid. This keyboard is working (probably because QEMU
  performed a reset). - Since 3.10.5, the keyboard that is *not* passed to
  the QEMU guest is not functioning on reboot.
  
  After closing the QEMU guest, the USB bus gets reset(?) after which the
  other keyboard suddenly gets detected. I had only booted 3.10.5 twice
  before rolling back to 3.9.9, both boots triggered the issue. Do I need
  to provide a usbmon, lsusb, dmesg and/ or other details from 3.10.5?
 
 Do both keyboards work on bare metal? Seems like this problem might be
 specific to qemu (or kvm) and you may get more insight on those lists.

I haven't tested that, the system automatically boots into openbox + QEMU. 
Previously, both keyboards worked on bare metal, so I think it still works.

  Note that there are other Arch Linux users who have reported issues[1][2]
 
 Unfortunately, not even one user in the referenced reports identified
 the usb hub the receiver was plugged into.

I've asked it now.

  since upgrading to 3.10.z. Triggering a re-enumeration by writing the
  magic
  HID++ message[3] makes the paired devices appear again (as reported in
  forums[1], I haven't tried this on the affected UHCI machine).
  
  While the underlying bug is fixed, can this patch be forwarded to stable?
  I see that 3.10.6 has been released, but still without this patch.
 
 This is still a workaround and not really a fix for the underlying bug.

I meant to say, while the underlying bug is *being* fixed. Anyway, can this 
patch be applied to 3.10?

Sorry for the confusion with uhci, looking further it seems that the wrong USB 
receiver is being passed to QEMU. It's not a kernel issue, perhaps I can blame 
libusbx.

Regards,
Peter
--
To unsubscribe from this list: send the line unsubscribe linux-usb in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html