Re: (4.3.0) r8152: deadlock related to runtime suspend?

2015-12-07 Thread Peter Wu
On Mon, Dec 07, 2015 at 07:08:50PM +0800, Lu Baolu wrote:
> 
> 
> On 12/07/2015 05:37 PM, Peter Wu wrote:
> > On Mon, Dec 07, 2015 at 05:11:50PM +0800, Lu Baolu wrote:
> >> Hi Peter,
> >>
> >> Have you ever tried disabling auto-pm? Did things go smoothly if auto-pm 
> >> is disabled?
> >>
> >> I always disable usb auto-pm in below way.
> >>
> >> # echo on | tee /sys/bus/usb/devices/*/power/control
> >> # echo on > /sys/bus/pci/devices//power/control
> >>
> >> Thanks,
> >> Baolu
> > Hi Baolu,
> >
> > The deadlock does not seem to occur with auto-PM disabled, but that is a
> > workaround for the issue. The hang can always be reproduced under this
> > test:
> >
> >  - Start a QEMU VM, passing through the USB adapter
> 
> I would suggest you to start with bare metal.
> 
> When you pass through the host controller to a guest VM, you
> probably use IOMMU unit to let hardware access the memory
> directly, but things like pci configure space access, interrupt and
> IO port access still rely on QEMU. This introduces a lot of complexities.

It is a USB device, not a PCI device, so such issues do not apply here
I think.

I have found a possible reason for this lockup. The resume code may
execute napi_disable while napi_enable was not called before. This
autoresume thing happens in the open function which explains why all
other rtnl users are blocked.

Is this a sane analysis?

Kind regards,
Peter

> Thanks,
> Baolu
> 
> >  - This VM boots to a busybox shell with no other services running or
> >udev magic (to reduce interference).
> >  - Enable runtime PM for all devices by default (see script below)
> >  - From the console, invoke "ip link set eth1 up" (eth0 is a virtio
> >adapter).
> >
> > # somewhere in /init after mounting filesystems
> > echo /sbin/hotplug > /proc/sys/kernel/hotplug
> > echo auto | tee  /sys/bus/pci/devices/*/power/control \
> > /sys/bus/usb/devices/*/power/control >/dev/null
> >
> > #!/bin/sh
> > # /sbin/hotplug
> > path="/sys/$DEVPATH/power/control"
> > [ -e "$path" ] || return
> > newval=auto
> > read status < "$path"
> > if [ "x$status" != "x$newval" ]; then
> > echo "$DEVPATH: $status -> $newval" >/dev/kmsg
> > echo $newval > "$path"
> > fi
> >
> > With "auto", the ip command hangs (a trace can be found on the bottom of
> > this mail). With "on", it does not.
> >
> > If I keep a loop spinning that invokes `ethtool eth1`, the command
> > returns immediately without issues (presumably because the device is not
> > suspended through runtime PM).
> >
> > Under some circumstances I get a lockdep warning (when trying to bring
> > an interface down if I remember correctly). Its trace can be found on
> > the bottom of this mail.
> >
> > I'll keep testing. For the lockdep warning, my initial guess is that
> > calling schedule_delayed_work_sync under tp->lock is a bad idea because
> > scheduled work can execute and try to claim tp->lock too.
> >
> > Maybe there are two different lockup cases here, I'll keep testing.
> >
> > Kind regards,
> > Peter
> >
> >> On 12/05/2015 06:59 PM, Peter Wu wrote:
> >>> Hi,
> >>>
> >>> I rarely use a Realtek USB 3.0 Gigabit Ethernet adapter (vid/pid
> >>> 0bda:8153), but when I did last night, it resulted in a lockup of
> >>> processes doing networking ("ip link", "ping", "ethtool", ...).
> >>>
> >>> A (few) minute(s) before that event, I noticed that there was no network
> >>> connectivity (ping hung) which was somehow solved by invoking "ethtool
> >>> eth1" (triggering runtime pm wakeup?). This same trick did not work at
> >>> the next event. Invoking "ethtool eth1", "ip link", etc. hung completely
> >>> and interrupt (^C) did not work at all.
> >>>
> >>> Since that did not work, I pulled the USB adapter and re-inserted it,
> >>> hoping it would reset things. That did not work at all, there was a
> >>> "usb disconnect" message, but no further driver messages.
> >>>
> >>> Fast forward an hour, and it has become a disaster. I have terminated
> >>> and killed many programs via SysRq but am still unable to get a stable
> >>> system that does not hang on network I/O. Even the suspend process
> >>> fails so in the end I attempted to shutdown the system. After half an
> >>> hour after getting the poweroff message, I issued SysRq + B to reboot
> >>> (since SysRq + O did not shut down either).
> >>>
> >>> Attached are logs with various various backtraces from SysRq and failed
> >>> suspend. Let me know if you need more information!
> >>>
> >>> By the way, often I have to rmmod xhci and re-insert it, otherwise
> >>> plugging it in does not result in a detection. A USB 2.0 port does not
> >>> have this problem (runtime PM is enabled for all devices). This is the
> >>> USB 3.0 port:
> >>>
> >>> 02:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0
> >>> Host Controller [1033:0194] (rev 03)
> 

-- 
Kind regards,
Peter Wu
https://lekensteyn.nl
--
To unsubscribe from this list: send the li

Re: (4.3.0) r8152: deadlock related to runtime suspend?

2015-12-07 Thread Lu Baolu


On 12/07/2015 05:37 PM, Peter Wu wrote:
> On Mon, Dec 07, 2015 at 05:11:50PM +0800, Lu Baolu wrote:
>> Hi Peter,
>>
>> Have you ever tried disabling auto-pm? Did things go smoothly if auto-pm is 
>> disabled?
>>
>> I always disable usb auto-pm in below way.
>>
>> # echo on | tee /sys/bus/usb/devices/*/power/control
>> # echo on > /sys/bus/pci/devices//power/control
>>
>> Thanks,
>> Baolu
> Hi Baolu,
>
> The deadlock does not seem to occur with auto-PM disabled, but that is a
> workaround for the issue. The hang can always be reproduced under this
> test:
>
>  - Start a QEMU VM, passing through the USB adapter

I would suggest you to start with bare metal.

When you pass through the host controller to a guest VM, you
probably use IOMMU unit to let hardware access the memory
directly, but things like pci configure space access, interrupt and
IO port access still rely on QEMU. This introduces a lot of complexities.

Thanks,
Baolu

>  - This VM boots to a busybox shell with no other services running or
>udev magic (to reduce interference).
>  - Enable runtime PM for all devices by default (see script below)
>  - From the console, invoke "ip link set eth1 up" (eth0 is a virtio
>adapter).
>
> # somewhere in /init after mounting filesystems
> echo /sbin/hotplug > /proc/sys/kernel/hotplug
> echo auto | tee  /sys/bus/pci/devices/*/power/control \
> /sys/bus/usb/devices/*/power/control >/dev/null
>
> #!/bin/sh
> # /sbin/hotplug
> path="/sys/$DEVPATH/power/control"
> [ -e "$path" ] || return
> newval=auto
> read status < "$path"
> if [ "x$status" != "x$newval" ]; then
> echo "$DEVPATH: $status -> $newval" >/dev/kmsg
> echo $newval > "$path"
> fi
>
> With "auto", the ip command hangs (a trace can be found on the bottom of
> this mail). With "on", it does not.
>
> If I keep a loop spinning that invokes `ethtool eth1`, the command
> returns immediately without issues (presumably because the device is not
> suspended through runtime PM).
>
> Under some circumstances I get a lockdep warning (when trying to bring
> an interface down if I remember correctly). Its trace can be found on
> the bottom of this mail.
>
> I'll keep testing. For the lockdep warning, my initial guess is that
> calling schedule_delayed_work_sync under tp->lock is a bad idea because
> scheduled work can execute and try to claim tp->lock too.
>
> Maybe there are two different lockup cases here, I'll keep testing.
>
> Kind regards,
> Peter
>
>> On 12/05/2015 06:59 PM, Peter Wu wrote:
>>> Hi,
>>>
>>> I rarely use a Realtek USB 3.0 Gigabit Ethernet adapter (vid/pid
>>> 0bda:8153), but when I did last night, it resulted in a lockup of
>>> processes doing networking ("ip link", "ping", "ethtool", ...).
>>>
>>> A (few) minute(s) before that event, I noticed that there was no network
>>> connectivity (ping hung) which was somehow solved by invoking "ethtool
>>> eth1" (triggering runtime pm wakeup?). This same trick did not work at
>>> the next event. Invoking "ethtool eth1", "ip link", etc. hung completely
>>> and interrupt (^C) did not work at all.
>>>
>>> Since that did not work, I pulled the USB adapter and re-inserted it,
>>> hoping it would reset things. That did not work at all, there was a
>>> "usb disconnect" message, but no further driver messages.
>>>
>>> Fast forward an hour, and it has become a disaster. I have terminated
>>> and killed many programs via SysRq but am still unable to get a stable
>>> system that does not hang on network I/O. Even the suspend process
>>> fails so in the end I attempted to shutdown the system. After half an
>>> hour after getting the poweroff message, I issued SysRq + B to reboot
>>> (since SysRq + O did not shut down either).
>>>
>>> Attached are logs with various various backtraces from SysRq and failed
>>> suspend. Let me know if you need more information!
>>>
>>> By the way, often I have to rmmod xhci and re-insert it, otherwise
>>> plugging it in does not result in a detection. A USB 2.0 port does not
>>> have this problem (runtime PM is enabled for all devices). This is the
>>> USB 3.0 port:
>>>
>>> 02:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0
>>> Host Controller [1033:0194] (rev 03)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: (4.3.0) r8152: deadlock related to runtime suspend?

2015-12-07 Thread Peter Wu
On Mon, Dec 07, 2015 at 05:11:50PM +0800, Lu Baolu wrote:
> Hi Peter,
> 
> Have you ever tried disabling auto-pm? Did things go smoothly if auto-pm is 
> disabled?
> 
> I always disable usb auto-pm in below way.
> 
> # echo on | tee /sys/bus/usb/devices/*/power/control
> # echo on > /sys/bus/pci/devices//power/control
> 
> Thanks,
> Baolu

Hi Baolu,

The deadlock does not seem to occur with auto-PM disabled, but that is a
workaround for the issue. The hang can always be reproduced under this
test:

 - Start a QEMU VM, passing through the USB adapter
 - This VM boots to a busybox shell with no other services running or
   udev magic (to reduce interference).
 - Enable runtime PM for all devices by default (see script below)
 - From the console, invoke "ip link set eth1 up" (eth0 is a virtio
   adapter).

# somewhere in /init after mounting filesystems
echo /sbin/hotplug > /proc/sys/kernel/hotplug
echo auto | tee  /sys/bus/pci/devices/*/power/control \
/sys/bus/usb/devices/*/power/control >/dev/null

#!/bin/sh
# /sbin/hotplug
path="/sys/$DEVPATH/power/control"
[ -e "$path" ] || return
newval=auto
read status < "$path"
if [ "x$status" != "x$newval" ]; then
echo "$DEVPATH: $status -> $newval" >/dev/kmsg
echo $newval > "$path"
fi

With "auto", the ip command hangs (a trace can be found on the bottom of
this mail). With "on", it does not.

If I keep a loop spinning that invokes `ethtool eth1`, the command
returns immediately without issues (presumably because the device is not
suspended through runtime PM).

Under some circumstances I get a lockdep warning (when trying to bring
an interface down if I remember correctly). Its trace can be found on
the bottom of this mail.

I'll keep testing. For the lockdep warning, my initial guess is that
calling schedule_delayed_work_sync under tp->lock is a bad idea because
scheduled work can execute and try to claim tp->lock too.

Maybe there are two different lockup cases here, I'll keep testing.

Kind regards,
Peter

> On 12/05/2015 06:59 PM, Peter Wu wrote:
> > Hi,
> >
> > I rarely use a Realtek USB 3.0 Gigabit Ethernet adapter (vid/pid
> > 0bda:8153), but when I did last night, it resulted in a lockup of
> > processes doing networking ("ip link", "ping", "ethtool", ...).
> >
> > A (few) minute(s) before that event, I noticed that there was no network
> > connectivity (ping hung) which was somehow solved by invoking "ethtool
> > eth1" (triggering runtime pm wakeup?). This same trick did not work at
> > the next event. Invoking "ethtool eth1", "ip link", etc. hung completely
> > and interrupt (^C) did not work at all.
> >
> > Since that did not work, I pulled the USB adapter and re-inserted it,
> > hoping it would reset things. That did not work at all, there was a
> > "usb disconnect" message, but no further driver messages.
> >
> > Fast forward an hour, and it has become a disaster. I have terminated
> > and killed many programs via SysRq but am still unable to get a stable
> > system that does not hang on network I/O. Even the suspend process
> > fails so in the end I attempted to shutdown the system. After half an
> > hour after getting the poweroff message, I issued SysRq + B to reboot
> > (since SysRq + O did not shut down either).
> >
> > Attached are logs with various various backtraces from SysRq and failed
> > suspend. Let me know if you need more information!
> >
> > By the way, often I have to rmmod xhci and re-insert it, otherwise
> > plugging it in does not result in a detection. A USB 2.0 port does not
> > have this problem (runtime PM is enabled for all devices). This is the
> > USB 3.0 port:
> >
> > 02:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0
> > Host Controller [1033:0194] (rev 03)
> 

-- 

lockdep splat from the bare machine:

==
[ INFO: possible circular locking dependency detected ]
4.3.0-custom #1 Tainted: G   O   
---
kworker/0:1/38 is trying to acquire lock:
 (&tp->control){+.+.+.}, at: [] rtl8152_resume+0x24/0x130 
[r8152]

but task is already holding lock:
 ((&(&tp->schedule)->work)){+.+.+.}, at: [] 
process_one_work+0x15c/0x660

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 ((&(&tp->schedule)->work)){+.+.+.}:

[] lock_acquire+0xc3/0x1d0
   [] flush_work+0x3d/0x290
   [] __cancel_work_timer+0xfe/0x1c0
   [] cancel_delayed_work_sync+0x13/0x20
   [] rtl8152_set_speed+0x2a/0x260 [r8152]
   [] rtl8152_open+0x396/0x4f0 [r8152]
   [] __dev_open+0xaf/0x120
   [] __dev_change_flags+0x9d/0x160
   [] dev_change_flags+0x29/0x70
   [] do_setlink+0x5ba/0xb00
   [] rtnl_newlink+0x5a9/0x8a0
   [] rtnetlink_rcv_msg+0x84/0x210
   [] netlink_rcv_skb+0x97/0xb0

Re: (4.3.0) r8152: deadlock related to runtime suspend?

2015-12-07 Thread Lu Baolu
Hi Peter,

Have you ever tried disabling auto-pm? Did things go smoothly if auto-pm is 
disabled?

I always disable usb auto-pm in below way.

# echo on | tee /sys/bus/usb/devices/*/power/control
# echo on > /sys/bus/pci/devices//power/control

Thanks,
Baolu

On 12/05/2015 06:59 PM, Peter Wu wrote:
> Hi,
>
> I rarely use a Realtek USB 3.0 Gigabit Ethernet adapter (vid/pid
> 0bda:8153), but when I did last night, it resulted in a lockup of
> processes doing networking ("ip link", "ping", "ethtool", ...).
>
> A (few) minute(s) before that event, I noticed that there was no network
> connectivity (ping hung) which was somehow solved by invoking "ethtool
> eth1" (triggering runtime pm wakeup?). This same trick did not work at
> the next event. Invoking "ethtool eth1", "ip link", etc. hung completely
> and interrupt (^C) did not work at all.
>
> Since that did not work, I pulled the USB adapter and re-inserted it,
> hoping it would reset things. That did not work at all, there was a
> "usb disconnect" message, but no further driver messages.
>
> Fast forward an hour, and it has become a disaster. I have terminated
> and killed many programs via SysRq but am still unable to get a stable
> system that does not hang on network I/O. Even the suspend process
> fails so in the end I attempted to shutdown the system. After half an
> hour after getting the poweroff message, I issued SysRq + B to reboot
> (since SysRq + O did not shut down either).
>
> Attached are logs with various various backtraces from SysRq and failed
> suspend. Let me know if you need more information!
>
> By the way, often I have to rmmod xhci and re-insert it, otherwise
> plugging it in does not result in a detection. A USB 2.0 port does not
> have this problem (runtime PM is enabled for all devices). This is the
> USB 3.0 port:
>
> 02:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0
> Host Controller [1033:0194] (rev 03)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


(4.3.0) r8152: deadlock related to runtime suspend?

2015-12-05 Thread Peter Wu
Hi,

I rarely use a Realtek USB 3.0 Gigabit Ethernet adapter (vid/pid
0bda:8153), but when I did last night, it resulted in a lockup of
processes doing networking ("ip link", "ping", "ethtool", ...).

A (few) minute(s) before that event, I noticed that there was no network
connectivity (ping hung) which was somehow solved by invoking "ethtool
eth1" (triggering runtime pm wakeup?). This same trick did not work at
the next event. Invoking "ethtool eth1", "ip link", etc. hung completely
and interrupt (^C) did not work at all.

Since that did not work, I pulled the USB adapter and re-inserted it,
hoping it would reset things. That did not work at all, there was a
"usb disconnect" message, but no further driver messages.

Fast forward an hour, and it has become a disaster. I have terminated
and killed many programs via SysRq but am still unable to get a stable
system that does not hang on network I/O. Even the suspend process
fails so in the end I attempted to shutdown the system. After half an
hour after getting the poweroff message, I issued SysRq + B to reboot
(since SysRq + O did not shut down either).

Attached are logs with various various backtraces from SysRq and failed
suspend. Let me know if you need more information!

By the way, often I have to rmmod xhci and re-insert it, otherwise
plugging it in does not result in a detection. A USB 2.0 port does not
have this problem (runtime PM is enabled for all devices). This is the
USB 3.0 port:

02:00.0 USB controller [0c03]: NEC Corporation uPD720200 USB 3.0
Host Controller [1033:0194] (rev 03)
-- 
Kind regards,
Peter Wu
https://lekensteyn.nl
Dec 05 00:32:58 al kernel: usb 3-1: USB disconnect, device number 4
Dec 05 00:34:43 al NetworkManager[441]:   Could not get scan request 
result: Timeout was reached
Dec 05 00:35:45 al NetworkManager[441]:   Could not get scan request 
result: Timeout was reached
Dec 05 00:36:13 al kernel: INFO: task wpa_supplicant:573 blocked for more than 
120 seconds.
Dec 05 00:36:13 al kernel:   Tainted: G   O4.3.0-1-ARCH #1
Dec 05 00:36:13 al kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
Dec 05 00:36:13 al kernel: wpa_supplicant  D 88023bc956c0 0   573  
1 0x
Dec 05 00:36:13 al kernel:  8800aca2fa48 0086 880232600c80 
88022d101900
Dec 05 00:36:13 al kernel:  8800aca3 818b4504 88022d101900 

Dec 05 00:36:13 al kernel:  818b4508 8800aca2fa60 8157a09a 
818b4500
Dec 05 00:36:13 al kernel: Call Trace:
Dec 05 00:36:13 al kernel:  [] schedule+0x3a/0x90
Dec 05 00:36:13 al kernel:  [] 
schedule_preempt_disabled+0x15/0x20
Dec 05 00:36:13 al kernel:  [] 
__mutex_lock_slowpath+0xce/0x140
Dec 05 00:36:13 al kernel:  [] mutex_lock+0x17/0x30
Dec 05 00:36:13 al kernel:  [] rtnl_lock+0x15/0x20
Dec 05 00:36:13 al kernel:  [] nl80211_pre_doit+0xeb/0x180 
[cfg80211]
Dec 05 00:36:13 al kernel:  [] genl_family_rcv_msg+0x1c8/0x3c0
Dec 05 00:36:13 al kernel:  [] ? 
genl_family_rcv_msg+0x3c0/0x3c0
Dec 05 00:36:13 al kernel:  [] genl_rcv_msg+0x76/0xb0
Dec 05 00:36:13 al kernel:  [] netlink_rcv_skb+0xa4/0xc0
Dec 05 00:36:13 al kernel:  [] genl_rcv+0x28/0x40
Dec 05 00:36:13 al kernel:  [] netlink_unicast+0x11a/0x1a0
Dec 05 00:36:13 al kernel:  [] netlink_sendmsg+0x4f8/0x610
Dec 05 00:36:13 al kernel:  [] sock_sendmsg+0x17/0x30
Dec 05 00:36:13 al kernel:  [] ___sys_sendmsg+0x295/0x2b0
Dec 05 00:36:13 al kernel:  [] ? 
should_numa_migrate_memory+0x54/0x130
Dec 05 00:36:13 al kernel:  [] ? handle_mm_fault+0xb38/0x18a0
Dec 05 00:36:13 al kernel:  [] __sys_sendmsg+0x51/0x90
Dec 05 00:36:13 al kernel:  [] SyS_sendmsg+0x12/0x20
Dec 05 00:36:13 al kernel:  [] 
entry_SYSCALL_64_fastpath+0x12/0x71
Dec 05 00:36:13 al kernel: INFO: task Qt bearer threa:8292 blocked for more 
than 120 seconds.
Dec 05 00:36:13 al kernel:   Tainted: G   O4.3.0-1-ARCH #1
Dec 05 00:36:13 al kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
Dec 05 00:36:13 al kernel: Qt bearer threa D 88023bd156c0 0  8292   
8222 0x
Dec 05 00:36:13 al kernel:  8800ac9bfc88 0086 880232601900 
880214323200
Dec 05 00:36:13 al kernel:  8800ac9c 818b4504 880214323200 

Dec 05 00:36:13 al kernel:  818b4508 8800ac9bfca0 8157a09a 
818b4500
Dec 05 00:36:13 al kernel: Call Trace:
Dec 05 00:36:13 al kernel:  [] schedule+0x3a/0x90
Dec 05 00:36:13 al kernel:  [] 
schedule_preempt_disabled+0x15/0x20
Dec 05 00:36:13 al kernel:  [] 
__mutex_lock_slowpath+0xce/0x140
Dec 05 00:36:13 al kernel:  [] mutex_lock+0x17/0x30
Dec 05 00:36:13 al kernel:  [] rtnetlink_rcv+0x19/0x30
Dec 05 00:36:13 al kernel:  [] netlink_unicast+0x11a/0x1a0
Dec 05 00:36:13 al kernel:  [] netlink_sendmsg+0x4f8/0x610
Dec 05 00:36:13 al kernel:  [] ? netlink_insert+0x1ce/0x350
Dec 05 00:36:13 al kernel:  [] sock_sendmsg+0x17/0x30
Dec 05 00:3