Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On 08/25/2016 05:09 AM, Mathias Nyman wrote:
> On 24.08.2016 17:14, Alan Stern wrote:
>> On Wed, 24 Aug 2016, Mathias Nyman wrote:
>>> The sleep() worked as it delayed freeing the primary hcd, changing the order to first release usb3 hcd and then usb2 hcd.
>>
>> Does this mean that the patch you already posted is the proper fix?
>
> Not sure, still just a step in the right direction.
>
> "ab2a4bf USB: don't free bandwidth_mutex too early" seems to solve the other DELL XPS mass storage remove bug.
>
> This Nexus 5 case still shows a locking issue after my patch. It ran a 4.7.2 kernel which should have your fix included.
>
> Jose Marino, could you try the patch on top of latest 4.8-rc3, just to make sure it's not some old issue?
>
> About the locking issue, looks like we end up waiting forever for the device lock mutex:
>
> [ 240.854069] INFO: task kworker/u16:31:970 blocked for more than 120 seconds.
> [ 240.854070] Tainted: G W O 4.7.2-1-jose #1
> [ 240.854074] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [ 240.854076] Call Trace:
> [ 240.854077] [] schedule+0x3c/0x90
> [ 240.854078] [] schedule_preempt_disabled+0x15/0x20
> [ 240.854080] [] __mutex_lock_slowpath+0xce/0x140
> [ 240.854082] [] mutex_lock+0x17/0x30
> [ 240.854085] [] usb_disconnect+0x51/0x2a0 [usbcore]
> [ 240.854087] [] usb_remove_hcd+0xc7/0x240 [usbcore]
> [ 240.854090] [] usb_hcd_pci_remove+0x6f/0x140 [usbcore]
> [ 240.854091] [] xhci_pci_remove+0x55/0x70 [xhci_pci]
>
> (Just one sample, there were many blocked tasks)
>
> -Mathias

I tried the patch on top of 4.8-rc3. Reboot, suspend/resume, plug in phone and tell it to tether. Again the tether connection did not work: the dhcp client doesn't get a response and times out. Then I unplug the phone and everything seems to be handled nicely. No oops or panics or hung tasks. I attach the dmesg.

Even though the patch seems to fix the trouble when unplugging, it still does nothing about the tether connection not working after a suspend/resume cycle. Do you think this is a different bug altogether? Any ideas how to troubleshoot that one?

Attachment: dmesg-Nyman-4.8rc3.log.gz (application/gzip)
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On 24.08.2016 17:14, Alan Stern wrote:
> On Wed, 24 Aug 2016, Mathias Nyman wrote:
>> The sleep() worked as it delayed freeing the primary hcd, changing the order to first release usb3 hcd and then usb2 hcd.
>
> Does this mean that the patch you already posted is the proper fix?

Not sure, still just a step in the right direction.

"ab2a4bf USB: don't free bandwidth_mutex too early" seems to solve the other DELL XPS mass storage remove bug.

This Nexus 5 case still shows a locking issue after my patch. It ran a 4.7.2 kernel which should have your fix included.

Jose Marino, could you try the patch on top of latest 4.8-rc3, just to make sure it's not some old issue?

About the locking issue, looks like we end up waiting forever for the device lock mutex:

[ 240.854069] INFO: task kworker/u16:31:970 blocked for more than 120 seconds.
[ 240.854070] Tainted: G W O 4.7.2-1-jose #1
[ 240.854074] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 240.854076] Call Trace:
[ 240.854077] [] schedule+0x3c/0x90
[ 240.854078] [] schedule_preempt_disabled+0x15/0x20
[ 240.854080] [] __mutex_lock_slowpath+0xce/0x140
[ 240.854082] [] mutex_lock+0x17/0x30
[ 240.854085] [] usb_disconnect+0x51/0x2a0 [usbcore]
[ 240.854087] [] usb_remove_hcd+0xc7/0x240 [usbcore]
[ 240.854090] [] usb_hcd_pci_remove+0x6f/0x140 [usbcore]
[ 240.854091] [] xhci_pci_remove+0x55/0x70 [xhci_pci]

(Just one sample, there were many blocked tasks)

-Mathias
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On Wed, 24 Aug 2016, Mathias Nyman wrote:
>>> Reference counting is supposed to keep everything important from being deallocated until all the releases are finished. In particular, if the hcd structure was already deallocated when usb_put_hcd() was called then there is a refcounting bug somewhere.
>>
>> That seems like a possible cause, I'll look into the refcounting.
>>
>> This helps a lot. I was getting stuck with that bug.
>> Thanks
>
> Just to update on this,
> It appears to be caused by the same address and bandwidth mutex free bug that you already fixed in
> ab2a4bf USB: don't free bandwidth_mutex too early

Yeah, that's the commit which resets the peer->primary_hcd pointer when a shared hcd is released.

> The sleep() worked as it delayed freeing the primary hcd, changing the order to first release usb3 hcd and then usb2 hcd.

Does this mean that the patch you already posted is the proper fix?

Alan Stern
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
>> Reference counting is supposed to keep everything important from being deallocated until all the releases are finished. In particular, if the hcd structure was already deallocated when usb_put_hcd() was called then there is a refcounting bug somewhere.
>
> That seems like a possible cause, I'll look into the refcounting.
>
> This helps a lot. I was getting stuck with that bug.
> Thanks

Just to update on this,
It appears to be caused by the same address and bandwidth mutex free bug that you already fixed in
ab2a4bf USB: don't free bandwidth_mutex too early

The sleep() worked as it delayed freeing the primary hcd, changing the order to first release usb3 hcd and then usb2 hcd.

-Mathias
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On 23.08.2016 19:14, Jose Marino wrote:
> On 08/23/2016 06:36 AM, Mathias Nyman wrote:
>> On 23.08.2016 14:26, Mathias Nyman wrote:
>>> On 23.08.2016 13:54, Mathias Nyman wrote:
>>>> On 23.08.2016 02:21, Jose Marino wrote:
>>>>> I'm using my phone (Nexus 5X running Android) to tether a USB connection to my laptop (XPS 15 9550). I plug the phone through the USB-C connection and in the phone I select USB tethering. Initially things look normal: a usb0 network interface appears in the laptop and it tries to get an IP with dhcp. However, I observe two different behaviors depending on whether it's a fresh boot, or I have suspend/resumed the laptop. In a fresh boot everything works fine, I get an IP and the connection works as expected. If I unplug the phone, everything also works as expected.
>>>>>
>>>>> However, after a suspend/resume cycle, I plug the phone in but the laptop never connects to it. The usb0 interface still appears, but the dhcp daemon is unable to get any response and finally times out. The fun part happens when I unplug the phone. I consistently get a kernel panic.
>>>>
>>>> ...
>>>> Anyways, I'll look at that panic in more detail as well
>
> The patch did not apply on top of 4.7.2. I applied this patch instead, which I hope is equivalent:
>
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index d7d5025..20b1b18 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -840,6 +840,10 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
>          spin_lock_irqsave(&xhci->lock, flags);
>
>          ep->stop_cmds_pending--;
> +        if (xhci->xhc_state & XHCI_STATE_REMOVING) {
> +                spin_unlock_irqrestore(&xhci->lock, flags);
> +                return;
> +        }
>          if (xhci->xhc_state & XHCI_STATE_DYING) {
>                  xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
>                                  "Stop EP timer ran, but another timer marked "
> @@ -893,7 +897,7 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
>          spin_unlock_irqrestore(&xhci->lock, flags);
>          xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
>                          "Calling usb_hc_died()");
> -        usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
> +        usb_hc_died(xhci_to_hcd(xhci));
>          xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
>                          "xHCI host controller is dead.");
>  }
>
> So, I apply the patch, reboot, suspend/resume, plug in phone and tell it to tether. The dhcp client is still unable to communicate and times out. However, the patch seems to have avoided the NULL dereference. The computer did not panic although my X session stopped responding. I went to virtual console and recorded a dmesg (find attached).

Patch looks correct, it only solves the NULL pointer issue.

Thanks for testing. The new log again points to a locking issue, I'll take a look.

-Mathias
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On 23.08.2016 18:10, Alan Stern wrote:
> On Tue, 23 Aug 2016, Mathias Nyman wrote:
>> The Dell XPS 9550 has an additional xhci controller for handling the type-C port. This controller is hotplug removed from the PCI bus when the last USB type-c device is disconnected.
>>
>> xhci driver, and usb core it seems is not really designed with this in mind. xhci driver will suddenly start reading from PCI.
>>
>> I've been looking at issues related to this. Currently there is at least one similar case with mass storage where we see the device release function being called for the mass storage interface device _after_ we freed all memory related to both xhci hcd's. bug for that is here:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=120241
>>
>> usb devices with their children should be synchronously removed before hcd's are freed, but seems that is not the case, at least not for the device release function for the interface device.
>
> Be careful what words you use. USB devices and their children are indeed synchronously _removed_ before hcd's are freed. However, they may not be _released_ until later.
>
> What you encountered in that bug report was probably usb_release_dev() calling bus_to_hcd() and usb_put_hcd(). This action is part of a release, and it's allowed to happen long after everything has been removed (i.e., unregistered).

Yes, it's the release function which is called later, I did mix up the concepts of remove and release.

> Reference counting is supposed to keep everything important from being deallocated until all the releases are finished. In particular, if the hcd structure was already deallocated when usb_put_hcd() was called then there is a refcounting bug somewhere.

That seems like a possible cause, I'll look into the refcounting.

This helps a lot. I was getting stuck with that bug.
Thanks

-Mathias
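The remove/release distinction discussed above can be illustrated with a small userspace model. This is only a sketch, not kernel code: `struct obj_model` and the `obj_*` helpers are hypothetical names made up for illustration, and a plain integer stands in for the kernel's reference counting. It shows an object being removed (unregistered) while a child still holds a reference, with the actual free happening only when the last reference is dropped.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for an hcd (not the kernel struct). */
struct obj_model {
    int refs;         /* number of outstanding references */
    int registered;   /* 1 while the object is visible to the system */
    int *freed_flag;  /* set to 1 when the final put frees the object */
};

struct obj_model *obj_create(int *freed_flag)
{
    struct obj_model *o = malloc(sizeof(*o));
    if (o == NULL)
        return NULL;
    o->refs = 1;            /* the creator holds the first reference */
    o->registered = 1;
    o->freed_flag = freed_flag;
    return o;
}

void obj_get(struct obj_model *o)
{
    o->refs++;
}

/* Dropping the last reference is what actually frees the memory. */
void obj_put(struct obj_model *o)
{
    if (--o->refs == 0) {
        *o->freed_flag = 1;
        free(o);
    }
}

/* "Remove" only unregisters; it must not free anything by itself. */
void obj_remove(struct obj_model *o)
{
    o->registered = 0;
}

/* Returns 0 if removal and release happen in the order described above. */
int refcount_demo(void)
{
    int hcd_freed = 0;
    struct obj_model *hcd = obj_create(&hcd_freed);
    if (hcd == NULL)
        return -1;

    obj_get(hcd);            /* a child device pins its hcd */
    obj_remove(hcd);         /* hcd is removed (unregistered) */
    assert(hcd->registered == 0);
    obj_put(hcd);            /* the driver drops its own reference */
    if (hcd_freed)
        return 1;            /* would be a refcounting bug: child still holds a ref */

    obj_put(hcd);            /* child's release runs later and drops the last ref */
    return hcd_freed ? 0 : 2;
}
```

In this model, freeing the object while the child's reference is still outstanding corresponds to the "refcounting bug somewhere" case: the late release would then touch freed memory.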
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On 23.08.2016 17:53, Alan Stern wrote:
> On Tue, 23 Aug 2016, Mathias Nyman wrote:
>> Or then this happens:
>> (I'll call the hcds usb2_hcd and usb3_hcd to keep track of them, usb2_hcd is the primary_hcd)
>>
>> to begin with:
>
> Actually, to begin with neither usb2_hcd nor usb3_hcd exists. Then usb2_hcd is registered, at which point we have:
>
>     usb2_hcd->primary_hcd = NULL
>     usb2_hcd->shared_hcd = NULL

True

> Then usb3_hcd is registered, at which point we have:
>
>> usb2_hcd->primary_hcd = usb2_hcd
>> usb2_hcd->shared_hcd = usb3_hcd
>>
>> usb3_hcd->primary_hcd = usb2_hcd
>> usb3_hcd->shared_hcd = usb2_hcd
>>
>> usb3_host is removed first:
>> xhci_pci_remove(struct pci_dev *dev)
>>     usb_remove_hcd(xhci->shared_hcd); // remove usb3_hcd
>>     usb_put_hcd(xhci->shared_hcd)
>>         hcd_release(..)
>>             if (hcd->shared_hcd) { //true
>>                 struct usb_hcd *peer = hcd->shared_hcd; //peer is now usb2_hcd
>>                 peer->shared_hcd = NULL; //sets usb2_hcd->shared_hcd to NULL
>>                 peer->primary_hcd = NULL; // sets usb2_hcd->primary_hcd to NULL. Why do we do this??
>
> We do this because then the state is exactly the same as it was after usb2_hcd was registered but before usb3_hcd was registered.
>
> So what happened here is very much like what would happen if something went wrong during probing, after the primary hcd was registered and before the secondary hcd was registered.

Ok, thanks, that makes sense. xhci driver shouldn't assume usb2_hcd->primary_hcd exists. It was unnecessary anyway as xhci_to_hcd() returns the primary hcd

-Mathias
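The pointer states walked through above can be replayed in a small userspace model. This is a sketch under stated assumptions, not kernel code: `struct hcd_link_model` and the `link_*` functions are hypothetical names invented here, and only the two peer pointers from the discussion are modelled. It shows that releasing the shared (usb3) hcd returns the primary (usb2) hcd to the state it had when it was registered alone, which is why `primary_hcd` cannot be assumed non-NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct usb_hcd; only the two peer pointers. */
struct hcd_link_model {
    struct hcd_link_model *primary_hcd;
    struct hcd_link_model *shared_hcd;
};

/* State after only the primary (usb2) hcd has been registered. */
void link_register_primary(struct hcd_link_model *usb2)
{
    usb2->primary_hcd = NULL;
    usb2->shared_hcd = NULL;
}

/* Registering the shared (usb3) hcd links the pair in both directions. */
void link_register_shared(struct hcd_link_model *usb2,
                          struct hcd_link_model *usb3)
{
    usb2->primary_hcd = usb2;
    usb2->shared_hcd = usb3;
    usb3->primary_hcd = usb2;
    usb3->shared_hcd = usb2;
}

/* Mirrors the hcd_release() fragment quoted in the thread: unlink the peer. */
void link_release(struct hcd_link_model *hcd)
{
    if (hcd->shared_hcd) {
        struct hcd_link_model *peer = hcd->shared_hcd;
        peer->shared_hcd = NULL;
        peer->primary_hcd = NULL;
    }
}

/* Returns 0 if releasing usb3 leaves usb2 in its "registered alone" state. */
int hcd_link_demo(void)
{
    struct hcd_link_model usb2, usb3;

    link_register_primary(&usb2);
    link_register_shared(&usb2, &usb3);
    assert(usb2.primary_hcd == &usb2);   /* fully linked at this point */

    link_release(&usb3);                 /* usb3 hcd is released first */

    /* Anything that now follows usb2.primary_hcd gets NULL. */
    if (usb2.primary_hcd != NULL || usb2.shared_hcd != NULL)
        return 1;
    return 0;
}
```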
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On 23.08.2016 16:13, Greg KH wrote:
> On Tue, Aug 23, 2016 at 01:54:05PM +0300, Mathias Nyman wrote:
>> On 23.08.2016 02:21, Jose Marino wrote:
>>> I'm using my phone (Nexus 5X running Android) to tether a USB connection to my laptop (XPS 15 9550). I plug the phone through the USB-C connection and in the phone I select USB tethering. Initially things look normal: a usb0 network interface appears in the laptop and it tries to get an IP with dhcp. However, I observe two different behaviors depending on whether it's a fresh boot, or I have suspend/resumed the laptop. In a fresh boot everything works fine, I get an IP and the connection works as expected. If I unplug the phone, everything also works as expected.
>>>
>>> However, after a suspend/resume cycle, I plug the phone in but the laptop never connects to it. The usb0 interface still appears, but the dhcp daemon is unable to get any response and finally times out. The fun part happens when I unplug the phone. I consistently get a kernel panic.
>>>
>>> I managed to get some logs of the oops+panic from pstore. Find them attached. In this particular situation this is what I did:
>>> - Boot laptop (archlinux with kernel 4.7.2)
>>> - Suspend/resume
>>> - Plug Nexus 5X
>>> - After a few seconds unplug Nexus 5X
>>>
>>> I filed a bug report about this:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=153551
>>
>> The Dell XPS 9550 has an additional xhci controller for handling the type-C port. This controller is hotplug removed from the PCI bus when the last USB type-c device is disconnected.
>>
>> xhci driver, and usb core it seems is not really designed with this in mind.
>
> The USB core can handle this just fine.
>
>> xhci driver will suddenly start reading from PCI.
>
> Which means the device is gone, and you need to handle it properly. We fixed up ehci and ohci for this years ago (they were on hotplug busses). For every PCI read, you need to verify that the data is correct, that's the way that any PCI driver needs to work in a "modern" system.
>
> This is an XHCI issue, not a USB core issue :)

Yes, reading and reacting properly to it is purely xhci.

>> I've been looking at issues related to this. Currently there is at least one similar case with mass storage where we see the device release function being called for the mass storage interface device _after_ we freed all memory related to both xhci hcd's. bug for that is here:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=120241
>>
>> usb devices with their children should be synchronously removed before hcd's are freed, but seems that is not the case, at least not for the device release function for the interface device.
>
> Once you notice that your PCI device is gone, you need to start tearing things down as soon as possible. Or just stop things and wait for the PCI core to come around and remove you from the system. That's probably much more simple and I think is what was done for EHCI.

That part should be doable, but the part where the interface device release is called after hcd is freed still puzzles me. As Alan suggested I need to check if the reference counting is correct.

>> A horrible workaround to hide this issue was to sleep for a second or two before freeing the hcd memory, this lets some pending work finish before hcds disappear. (more info in that bug report)
>
> Yeah, that's not a good idea :)

Just an intermediate step to find which way to continue debugging, not a solution.

-Mathias
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On 08/23/2016 06:36 AM, Mathias Nyman wrote:
> On 23.08.2016 14:26, Mathias Nyman wrote:
>> On 23.08.2016 13:54, Mathias Nyman wrote:
>>> On 23.08.2016 02:21, Jose Marino wrote:
>>>> I'm using my phone (Nexus 5X running Android) to tether a USB connection to my laptop (XPS 15 9550). I plug the phone through the USB-C connection and in the phone I select USB tethering. Initially things look normal: a usb0 network interface appears in the laptop and it tries to get an IP with dhcp. However, I observe two different behaviors depending on whether it's a fresh boot, or I have suspend/resumed the laptop. In a fresh boot everything works fine, I get an IP and the connection works as expected. If I unplug the phone, everything also works as expected.
>>>>
>>>> However, after a suspend/resume cycle, I plug the phone in but the laptop never connects to it. The usb0 interface still appears, but the dhcp daemon is unable to get any response and finally times out. The fun part happens when I unplug the phone. I consistently get a kernel panic.
>>>
>>> ...
>>> Anyways, I'll look at that panic in more detail as well
>>
>> <6>[ 178.693631] xhci_hcd :3e:00.0: USB bus 4 deregistered
>> <6>[ 178.693642] xhci_hcd :3e:00.0: remove, state 1
>> <6>[ 178.693648] usb usb3: USB disconnect, device number 1
>> <4>[ 183.634994] xhci_hcd :3e:00.0: xHCI host not responding to stop endpoint command.
>> <4>[ 183.635001] xhci_hcd :3e:00.0: Assuming host is dying, halting host.
>> <4>[ 183.635019] xhci_hcd :3e:00.0: Host not halted after 16000 microseconds.
>> <4>[ 183.635022] xhci_hcd :3e:00.0: Non-responsive xHCI host is not halting.
>> <4>[ 183.635025] xhci_hcd :3e:00.0: Completing active URBs anyway.
>> <1>[ 183.635116] BUG: unable to handle kernel NULL pointer dereference at (null)
>> <1>[ 183.635402] IP: [] usb_hc_died+0x16/0xc0 [usbcore]
>>
>> Looks like the 5 second command timeout timer for stop endpoint commands causes this.
>>
>> the timer (stop_cmd_timer) will call xhci_stop_endpoint_command_watchdog() which calls usb_hc_died(xhci_to_hcd(xhci)->primary_hcd) but hcd are probably freed and pointers set to null already -> NULL pointer dereference.
>
> The timer should be synchronously deleted when the device is freed, unless xhci_free_dev() returns early. So either hub_free_dev() is not called for this device at hcd removal, or xhci_free_dev returns early.
>
> Or then this happens:
> (I'll call the hcds usb2_hcd and usb3_hcd to keep track of them, usb2_hcd is the primary_hcd)
>
> to begin with:
> usb2_hcd->primary_hcd = usb2_hcd
> usb2_hcd->shared_hcd = usb3_hcd
>
> usb3_hcd->primary_hcd = usb2_hcd
> usb3_hcd->shared_hcd = usb2_hcd
>
> usb3_host is removed first:
> xhci_pci_remove(struct pci_dev *dev)
>     usb_remove_hcd(xhci->shared_hcd); // remove usb3_hcd
>     usb_put_hcd(xhci->shared_hcd)
>         hcd_release(..)
>             if (hcd->shared_hcd) { //true
>                 struct usb_hcd *peer = hcd->shared_hcd; //peer is now usb2_hcd
>                 peer->shared_hcd = NULL; //sets usb2_hcd->shared_hcd to NULL
>                 peer->primary_hcd = NULL; // sets usb2_hcd->primary_hcd to NULL. Why do we do this??
>
> stop_cmd_timer triggers before the usb2_hcd is removed:
> -> xhci_stop_endpoint_command_watchdog()
>     usb_hc_died(xhci_to_hcd(xhci)->primary_hcd) // xhci_to_hcd will get usb2_hcd, usb2_hcd->primary_hcd is set to NULL here.
>
> does something like this help?
>
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index fd9fd12..797137e 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -850,6 +850,10 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
>          spin_lock_irqsave(&xhci->lock, flags);
>
>          ep->stop_cmds_pending--;
> +        if (xhci->xhc_state & XHCI_STATE_REMOVING) {
> +                spin_unlock_irqrestore(&xhci->lock, flags);
> +                return;
> +        }
>          if (xhci->xhc_state & XHCI_STATE_DYING) {
>                  xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
>                                  "Stop EP timer ran, but another timer marked "
> @@ -903,7 +907,7 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
>          spin_unlock_irqrestore(&xhci->lock, flags);
>          xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
>                          "Calling usb_hc_died()");
> -        usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
> +        usb_hc_died(xhci_to_hcd(xhci));
>          xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
>                          "xHCI host controller is dead.");
>  }

The patch did not apply on top of 4.7.2. I applied this patch instead, which I hope is equivalent:

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index d7d5025..20b1b18 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -840,6 +840,10 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
         spin_lock_irqsave(&xhci->lock, flags);

         ep->stop_cmds_pending--;
+        if (xhci->xhc_state & XHCI_STATE_REMOVING) {
+
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On 08/23/2016 01:57 AM, Oliver Neukum wrote:
> On Mon, 2016-08-22 at 17:21 -0600, Jose Marino wrote:
>> I'm using my phone (Nexus 5X running Android) to tether a USB connection to my laptop (XPS 15 9550). I plug the phone through the USB-C connection and in the phone I select USB tethering. Initially things look normal: a usb0 network interface appears in the laptop and it tries to get an IP with dhcp. However, I observe two different behaviors depending on whether it's a fresh boot, or I have suspend/resumed the laptop. In a fresh boot everything works fine, I get an IP and the connection works as expected. If I unplug the phone, everything also works as expected.
>>
>> However, after a suspend/resume cycle, I plug the phone in but the laptop never connects to it. The usb0 interface still appears, but the dhcp daemon is unable to get any response and finally times out. The fun part happens when I unplug the phone. I consistently get a kernel panic.
>
> The HC has crashed thoroughly. Do other devices work after S3?

Everything works fine after S3. I regularly use a usb-c ethernet adapter and even though it's a bit flaky, I suspend/resume regularly without any kernel panics or oopses.

About the phone, the kernel panic only happens when I enable tethering on the phone. Anything else seems to work fine: plug in to charge, download pictures or access its file system.

> Regards
> Oliver
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On Tue, 23 Aug 2016, Mathias Nyman wrote:
> The Dell XPS 9550 has an additional xhci controller for handling the type-C port. This controller is hotplug removed from the PCI bus when the last USB type-c device is disconnected.
>
> xhci driver, and usb core it seems is not really designed with this in mind. xhci driver will suddenly start reading from PCI.
>
> I've been looking at issues related to this. Currently there is at least one similar case with mass storage where we see the device release function being called for the mass storage interface device _after_ we freed all memory related to both xhci hcd's. bug for that is here:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=120241
>
> usb devices with their children should be synchronously removed before hcd's are freed, but seems that is not the case, at least not for the device release function for the interface device.

Be careful what words you use. USB devices and their children are indeed synchronously _removed_ before hcd's are freed. However, they may not be _released_ until later.

What you encountered in that bug report was probably usb_release_dev() calling bus_to_hcd() and usb_put_hcd(). This action is part of a release, and it's allowed to happen long after everything has been removed (i.e., unregistered).

Reference counting is supposed to keep everything important from being deallocated until all the releases are finished. In particular, if the hcd structure was already deallocated when usb_put_hcd() was called then there is a refcounting bug somewhere.

> A horrible workaround to hide this issue was to sleep for a second or two before freeing the hcd memory, this lets some pending work finish before hcds disappear. (more info in that bug report)

The driver should synchronously cancel the work instead of sleeping.

Alan Stern
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On Tue, 23 Aug 2016, Mathias Nyman wrote:
> Or then this happens:
> (I'll call the hcds usb2_hcd and usb3_hcd to keep track of them, usb2_hcd is the primary_hcd)
>
> to begin with:

Actually, to begin with neither usb2_hcd nor usb3_hcd exists. Then usb2_hcd is registered, at which point we have:

    usb2_hcd->primary_hcd = NULL
    usb2_hcd->shared_hcd = NULL

Then usb3_hcd is registered, at which point we have:

> usb2_hcd->primary_hcd = usb2_hcd
> usb2_hcd->shared_hcd = usb3_hcd
>
> usb3_hcd->primary_hcd = usb2_hcd
> usb3_hcd->shared_hcd = usb2_hcd
>
> usb3_host is removed first:
> xhci_pci_remove(struct pci_dev *dev)
>     usb_remove_hcd(xhci->shared_hcd); // remove usb3_hcd
>     usb_put_hcd(xhci->shared_hcd)
>         hcd_release(..)
>             if (hcd->shared_hcd) { //true
>                 struct usb_hcd *peer = hcd->shared_hcd; //peer is now usb2_hcd
>                 peer->shared_hcd = NULL; //sets usb2_hcd->shared_hcd to NULL
>                 peer->primary_hcd = NULL; // sets usb2_hcd->primary_hcd to NULL. Why do we do this??

We do this because then the state is exactly the same as it was after usb2_hcd was registered but before usb3_hcd was registered.

So what happened here is very much like what would happen if something went wrong during probing, after the primary hcd was registered and before the secondary hcd was registered.

Alan Stern
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On Tue, Aug 23, 2016 at 01:54:05PM +0300, Mathias Nyman wrote:
> On 23.08.2016 02:21, Jose Marino wrote:
>> I'm using my phone (Nexus 5X running Android) to tether a USB connection to my laptop (XPS 15 9550). I plug the phone through the USB-C connection and in the phone I select USB tethering. Initially things look normal: a usb0 network interface appears in the laptop and it tries to get an IP with dhcp. However, I observe two different behaviors depending on whether it's a fresh boot, or I have suspend/resumed the laptop. In a fresh boot everything works fine, I get an IP and the connection works as expected. If I unplug the phone, everything also works as expected.
>>
>> However, after a suspend/resume cycle, I plug the phone in but the laptop never connects to it. The usb0 interface still appears, but the dhcp daemon is unable to get any response and finally times out. The fun part happens when I unplug the phone. I consistently get a kernel panic.
>>
>> I managed to get some logs of the oops+panic from pstore. Find them attached. In this particular situation this is what I did:
>> - Boot laptop (archlinux with kernel 4.7.2)
>> - Suspend/resume
>> - Plug Nexus 5X
>> - After a few seconds unplug Nexus 5X
>>
>> I filed a bug report about this:
>> https://bugzilla.kernel.org/show_bug.cgi?id=153551
>
> The Dell XPS 9550 has an additional xhci controller for handling the type-C port. This controller is hotplug removed from the PCI bus when the last USB type-c device is disconnected.
>
> xhci driver, and usb core it seems is not really designed with this in mind.

The USB core can handle this just fine.

> xhci driver will suddenly start reading from PCI.

Which means the device is gone, and you need to handle it properly. We fixed up ehci and ohci for this years ago (they were on hotplug busses). For every PCI read, you need to verify that the data is correct, that's the way that any PCI driver needs to work in a "modern" system.

This is an XHCI issue, not a USB core issue :)

> I've been looking at issues related to this. Currently there is at least one similar case with mass storage where we see the device release function being called for the mass storage interface device _after_ we freed all memory related to both xhci hcd's. bug for that is here:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=120241
>
> usb devices with their children should be synchronously removed before hcd's are freed, but seems that is not the case, at least not for the device release function for the interface device.

Once you notice that your PCI device is gone, you need to start tearing things down as soon as possible. Or just stop things and wait for the PCI core to come around and remove you from the system. That's probably much more simple and I think is what was done for EHCI.

> A horrible workaround to hide this issue was to sleep for a second or two before freeing the hcd memory, this lets some pending work finish before hcds disappear. (more info in that bug report)

Yeah, that's not a good idea :)

thanks,

greg k-h
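The advice to verify every PCI read can be illustrated with a small userspace model. This is a sketch, not driver code: `struct pci_dev_model`, `model_readl()` and `model_read_status()` are hypothetical names, and the model assumes the common behaviour that a read from a device no longer on the bus returns all ones (0xffffffff). The point is only that the caller checks the value before trusting it and stops using the device once the check fails.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a hot-removable PCI device with one status register. */
struct pci_dev_model {
    int present;          /* 1 while the device is still on the bus */
    uint32_t status_reg;  /* value the register holds while present */
    int marked_dead;      /* set once the driver notices the device is gone */
};

/* Assumption: a read from a removed device yields all ones. */
uint32_t model_readl(const struct pci_dev_model *dev)
{
    return dev->present ? dev->status_reg : UINT32_MAX;
}

/*
 * Read the status register and validate it.
 * Returns 0 and stores the value on success, -1 if the device looks gone.
 */
int model_read_status(struct pci_dev_model *dev, uint32_t *out)
{
    uint32_t val = model_readl(dev);

    if (val == UINT32_MAX) {
        dev->marked_dead = 1;   /* stop issuing commands, start teardown */
        return -1;
    }
    *out = val;
    return 0;
}

/* Returns 0 if the removed device is detected and the stale value is kept. */
int pci_gone_demo(void)
{
    struct pci_dev_model dev = { 1, 0x00000001u, 0 };
    uint32_t status = 0;

    if (model_read_status(&dev, &status) != 0 || status != 1u)
        return 1;

    dev.present = 0;            /* simulate hot-unplug of the controller */
    if (model_read_status(&dev, &status) != -1)
        return 2;
    assert(status == 1u);       /* the bogus all-ones value was not stored */

    return dev.marked_dead ? 0 : 3;
}
```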
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On 23.08.2016 14:26, Mathias Nyman wrote: On 23.08.2016 13:54, Mathias Nyman wrote: On 23.08.2016 02:21, Jose Marino wrote: I'm using my phone (Nexus 5X running Android) to tether a USB connection to my laptop (XPS 15 9550). I plug the phone through the USB-C connection and in the phone I select USB tethering. Initially things look normal: a usb0 network interface appears in the laptop and it tries to get an IP with dhcp. However, I observe two different behaviors depending on whether it's a fresh boot, or I have suspend/resumed the laptop. In a fresh boot everything works fine, I get an IP and the connection works as expected. If I unplug the phone, everything also works as expected. However, after a suspend/resume cycle, I plug the phone in but the laptop never connects to it. The usb0 interface still appears, but the dhcp daemon is unable to get any response and finally times out. The fun part happens when I unplug the phone. I consistently get a kernel panic. ... Anyways, I'll look at that panic in more detail as well <6>[ 178.693631] xhci_hcd :3e:00.0: USB bus 4 deregistered <6>[ 178.693642] xhci_hcd :3e:00.0: remove, state 1 <6>[ 178.693648] usb usb3: USB disconnect, device number 1 <4>[ 183.634994] xhci_hcd :3e:00.0: xHCI host not responding to stop endpoint command. <4>[ 183.635001] xhci_hcd :3e:00.0: Assuming host is dying, halting host. <4>[ 183.635019] xhci_hcd :3e:00.0: Host not halted after 16000 microseconds. <4>[ 183.635022] xhci_hcd :3e:00.0: Non-responsive xHCI host is not halting. <4>[ 183.635025] xhci_hcd :3e:00.0: Completing active URBs anyway. <1>[ 183.635116] BUG: unable to handle kernel NULL pointer dereference at (null) <1>[ 183.635402] IP: [] usb_hc_died+0x16/0xc0 [usbcore] Looks like the 5 second command timeout timer for stop endpoint commands causes this. 
the timer (stop_cmd_timer) will call xhci_stop_endpoint_command_watchdog() which calls usb_hc_died(xhci_to_hcd(xhci)->primary_hcd) but hcd are probably freed and pointers set to null already -> NULL pointer dereference. The timer should be synchronously deleted when the device is freed, unless xhci_free_dev() returns early. So either hub_free_dev() is not called for this device at hcd removal, or xhci_free_dev returns early. Or then this happens: (I'll call the hcds usb2_hcd and usb3_hcd to keep track of them, usb2_hcd is the primary_hcd) to begin with: usb2_hcd->primary_hcd = usb2_hcd usb2_hcd->shared_hcd = usb3_hcd usb3_hcd->primary_hcd = usb2_hcd usb3_hcd->shared_hcd = usb2_hcd usb3_host is removed first: xhci_pci_remove(struct pci_dev *dev) usb_remove_hcd(xhci->shared_hcd); // remove usb3_hcd usb_put_hcd(xhci->shared_hcd) hcd_release(..) if (hcd->shared_hcd) { //true struct usb_hcd *peer = hcd->shared_hcd; //peer is now usb2_hcd peer->shared_hcd = NULL; //sets usb2_hcd->shared_hcd to NULL peer->primary_hcd = NULL;// sets usb2_hcd->primary_hcd to NULL. Why do we do this?? stop_cmd_timer triggers before the usb2_hcd is removed: -> xhci_stop_endpoint_command_watchdog() usb_hc_died(xhci_to_hcd(xhci)->primary_hcd) // xhci_to_hcd will get usb2_hcd, usb2_hcd->primary_hcd is set to NULL here. does something like this help? 
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index fd9fd12..797137e 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -850,6 +850,10 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
 	spin_lock_irqsave(&xhci->lock, flags);

 	ep->stop_cmds_pending--;
+	if (xhci->xhc_state & XHCI_STATE_REMOVING) {
+		spin_unlock_irqrestore(&xhci->lock, flags);
+		return;
+	}
 	if (xhci->xhc_state & XHCI_STATE_DYING) {
 		xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
 				"Stop EP timer ran, but another timer marked "
@@ -903,7 +907,7 @@ void xhci_stop_endpoint_command_watchdog(unsigned long arg)
 	spin_unlock_irqrestore(&xhci->lock, flags);
 	xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
 			"Calling usb_hc_died()");
-	usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
+	usb_hc_died(xhci_to_hcd(xhci));
 	xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
 			"xHCI host controller is dead.");
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
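The pointer bookkeeping discussed in this message can be tried out in user space. What follows is only a rough model, not kernel code: struct hcd_model, pair_hcds(), release_hcd_model(), hc_died_model() and the two watchdog_* helpers are hypothetical names made up for illustration, and hc_died_model() returns false at the point where the real usb_hc_died() would dereference the NULL pointer. It assumes hcd_release() clears both peer pointers unconditionally, as in the snippet quoted in the mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct usb_hcd: only the two peer pointers
 * and a flag recording whether the "died" handler ran on it. */
struct hcd_model {
	struct hcd_model *primary_hcd;
	struct hcd_model *shared_hcd;
	bool died;
};

/* Initial wiring as listed in the mail; usb2 is the primary hcd. */
static void pair_hcds(struct hcd_model *usb2, struct hcd_model *usb3)
{
	assert(usb2 != NULL && usb3 != NULL);
	usb2->primary_hcd = usb2;
	usb2->shared_hcd = usb3;
	usb3->primary_hcd = usb2;
	usb3->shared_hcd = usb2;
	usb2->died = false;
	usb3->died = false;
}

/* Models the quoted hcd_release() fragment: releasing one hcd clears
 * both peer pointers in the hcd it was paired with. */
static void release_hcd_model(struct hcd_model *hcd)
{
	if (hcd->shared_hcd) {
		struct hcd_model *peer = hcd->shared_hcd;

		peer->shared_hcd = NULL;
		peer->primary_hcd = NULL;
	}
}

/* Stand-in for usb_hc_died(). Returns false for NULL; the real
 * function has no such check, which is where the oops comes from. */
static bool hc_died_model(struct hcd_model *hcd)
{
	if (hcd == NULL)
		return false;
	hcd->died = true;
	return true;
}

/* Watchdog before the patch: goes through ->primary_hcd. */
static bool watchdog_unpatched(struct hcd_model *main_hcd)
{
	return hc_died_model(main_hcd->primary_hcd);
}

/* Watchdog after the patch: bails out while removing, otherwise
 * passes the hcd itself instead of its ->primary_hcd. */
static bool watchdog_patched(struct hcd_model *main_hcd, bool removing)
{
	if (removing)
		return false;
	return hc_died_model(main_hcd);
}
```

Pairing two models, releasing the usb3 one, and then calling watchdog_unpatched() on the usb2 one hits the NULL case; watchdog_patched() with removing set to false still reaches the hcd. The model says nothing about whether the real timer can still fire at that point, which is the open question in the thread.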
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On 23.08.2016 13:54, Mathias Nyman wrote:
On 23.08.2016 02:21, Jose Marino wrote:

I'm using my phone (Nexus 5X running Android) to tether a USB connection to my laptop (XPS 15 9550). I plug the phone through the USB-C connection and in the phone I select USB tethering. Initially things look normal: a usb0 network interface appears in the laptop and it tries to get an IP with dhcp. However, I observe two different behaviors depending on whether it's a fresh boot, or I have suspend/resumed the laptop. In a fresh boot everything works fine, I get an IP and the connection works as expected. If I unplug the phone, everything also works as expected.

However, after a suspend/resume cycle, I plug the phone in but the laptop never connects to it. The usb0 interface still appears, but the dhcp daemon is unable to get any response and finally times out. The fun part happens when I unplug the phone. I consistently get a kernel panic.

...

Anyways, I'll look at that panic in more detail as well

<6>[ 178.693631] xhci_hcd :3e:00.0: USB bus 4 deregistered
<6>[ 178.693642] xhci_hcd :3e:00.0: remove, state 1
<6>[ 178.693648] usb usb3: USB disconnect, device number 1
<4>[ 183.634994] xhci_hcd :3e:00.0: xHCI host not responding to stop endpoint command.
<4>[ 183.635001] xhci_hcd :3e:00.0: Assuming host is dying, halting host.
<4>[ 183.635019] xhci_hcd :3e:00.0: Host not halted after 16000 microseconds.
<4>[ 183.635022] xhci_hcd :3e:00.0: Non-responsive xHCI host is not halting.
<4>[ 183.635025] xhci_hcd :3e:00.0: Completing active URBs anyway.
<1>[ 183.635116] BUG: unable to handle kernel NULL pointer dereference at (null)
<1>[ 183.635402] IP: [] usb_hc_died+0x16/0xc0 [usbcore]

Looks like the 5 second command timeout timer for stop endpoint commands causes this.
the timer (stop_cmd_timer) will call xhci_stop_endpoint_command_watchdog() which calls usb_hc_died(xhci_to_hcd(xhci)->primary_hcd) but hcd are probably freed and pointers set to null already -> NULL pointer dereference.
The timer should be synchronously deleted when the device is freed, unless xhci_free_dev() returns early. So either hub_free_dev() is not called for this device at hcd removal, or xhci_free_dev returns early.

hub_free_dev()
  hcd->driver->free_dev(hcd, udev);
    xhci_free_dev()
      (possible early return here)
      for (i = 0; i < 31; ++i) {
        virt_dev->eps[i].ep_state &= ~EP_HALT_PENDING;
        del_timer_sync(&virt_dev->eps[i].stop_cmd_timer);

-Mathias
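The early-return concern in this message can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the driver's code: struct ep_model, struct virt_dev_model, free_dev_model() and pending_timers() are hypothetical names, a boolean stands in for a pending stop_cmd_timer, and clearing it stands in for del_timer_sync(). Only the loop bound of 31 endpoints is taken from the quoted xhci_free_dev() fragment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_EPS 31 /* loop bound from the quoted xhci_free_dev() fragment */

/* Hypothetical per-endpoint state: is the stop-command timer armed? */
struct ep_model {
	bool timer_pending;
};

struct virt_dev_model {
	struct ep_model eps[NUM_EPS];
};

/* Models the free path. If early_return is set (the "possible early
 * return here" case), no timer is cancelled. Otherwise every pending
 * timer is cleared, standing in for del_timer_sync().
 * Returns the number of timers cancelled. */
static int free_dev_model(struct virt_dev_model *vdev, bool early_return)
{
	int cancelled = 0;

	assert(vdev != NULL);
	if (early_return)
		return 0;

	for (int i = 0; i < NUM_EPS; ++i) {
		if (vdev->eps[i].timer_pending) {
			vdev->eps[i].timer_pending = false;
			cancelled++;
		}
	}
	return cancelled;
}

/* Counts timers that could still fire after the free path has run. */
static int pending_timers(const struct virt_dev_model *vdev)
{
	int pending = 0;

	for (int i = 0; i < NUM_EPS; ++i)
		if (vdev->eps[i].timer_pending)
			pending++;
	return pending;
}
```

With one timer armed, calling free_dev_model() with early_return set leaves pending_timers() at 1, so a watchdog could still run later against state that has since been torn down. Calling it without the early return brings the count to 0.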
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On 23.08.2016 02:21, Jose Marino wrote:

I'm using my phone (Nexus 5X running Android) to tether a USB connection to my laptop (XPS 15 9550). I plug the phone through the USB-C connection and in the phone I select USB tethering. Initially things look normal: a usb0 network interface appears in the laptop and it tries to get an IP with dhcp. However, I observe two different behaviors depending on whether it's a fresh boot, or I have suspend/resumed the laptop. In a fresh boot everything works fine, I get an IP and the connection works as expected. If I unplug the phone, everything also works as expected.

However, after a suspend/resume cycle, I plug the phone in but the laptop never connects to it. The usb0 interface still appears, but the dhcp daemon is unable to get any response and finally times out. The fun part happens when I unplug the phone. I consistently get a kernel panic.

I managed to get some logs of the oops+panic from pstore. Find them attached. In this particular situation this is what I did:
- Boot laptop (archlinux with kernel 4.7.2)
- Suspend/resume
- Plug Nexus 5X
- After a few seconds unplug Nexus 5X

I filed a bug report about this:
https://bugzilla.kernel.org/show_bug.cgi?id=153551

The Dell XPS 9550 has an additional xhci controller for handling the type-C port. This controller is hotplug removed from the PCI bus when the last USB type-c device is disconnected.

xhci driver, and usb core it seems is not really designed with this in mind. xhci driver will suddenly start reading from PCI.

I've been looking at issues related to this. Currently there is at least one similar case with mass storage where we see the device release function being called for the mass storage interface device _after_ we freed all memory related to both xhci hcd's.
bug for that is here:
https://bugzilla.kernel.org/show_bug.cgi?id=120241

usb devices with their children should be synchronously removed before hcd's are freed, but seems that is not the case, at least not for the device release function for the interface device.

A horrible workaround to hide this issue was to sleep for a second or two before freeing the hcd memory, this lets some pending work finish before hcds disappear. (more info in that bug report)

Also this commit in 4.8-rc3 solves a related xhci PCI hotplug issue:
f1f6d9a xhci: don't dereference a xhci member after removing xhci

Anyways, I'll look at that panic in more detail as well

-Mathias
Re: Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
On Mon, 2016-08-22 at 17:21 -0600, Jose Marino wrote:
> I'm using my phone (Nexus 5X running Android) to tether a USB connection
> to my laptop (XPS 15 9550). I plug the phone through the USB-C
> connection and in the phone I select USB tethering. Initially things
> look normal: a usb0 network interface appears in the laptop and it tries
> to get an IP with dhcp. However, I observe two different behaviors
> depending on whether it's a fresh boot, or I have suspend/resumed the
> laptop. In a fresh boot everything works fine, I get an IP and the
> connection works as expected. If I unplug the phone, everything also
> works as expected.
>
> However, after a suspend/resume cycle, I plug the phone in but the
> laptop never connects to it. The usb0 interface still appears, but the
> dhcp daemon is unable to get any response and finally times out. The fun
> part happens when I unplug the phone. I consistently get a kernel panic.

The HC has crashed thoroughly. Do other devices work after S3?

Regards
Oliver
Bug 153551: Kernel panic on Nexus 5X USB unplug while tethering
I'm using my phone (Nexus 5X running Android) to tether a USB connection to my laptop (XPS 15 9550). I plug the phone through the USB-C connection and in the phone I select USB tethering. Initially things look normal: a usb0 network interface appears in the laptop and it tries to get an IP with dhcp. However, I observe two different behaviors depending on whether it's a fresh boot, or I have suspend/resumed the laptop. In a fresh boot everything works fine, I get an IP and the connection works as expected. If I unplug the phone, everything also works as expected.

However, after a suspend/resume cycle, I plug the phone in but the laptop never connects to it. The usb0 interface still appears, but the dhcp daemon is unable to get any response and finally times out. The fun part happens when I unplug the phone. I consistently get a kernel panic.

I managed to get some logs of the oops+panic from pstore. Find them attached. In this particular situation this is what I did:
- Boot laptop (archlinux with kernel 4.7.2)
- Suspend/resume
- Plug Nexus 5X
- After a few seconds unplug Nexus 5X

I filed a bug report about this:
https://bugzilla.kernel.org/show_bug.cgi?id=153551

<6>[3.027914] input: Integrated_Webcam_HD as /devices/pci:00/:00:14.0/usb1/1-12/1-12:1.0/input/input12
<6>[3.027991] usbcore: registered new interface driver uvcvideo
<6>[3.028084] USB Video Class driver (1.1.1)
<6>[3.029655] input: DLL06E4:01 06CB:7A13 Touchpad as /devices/pci:00/:00:15.1/i2c_designware.1/i2c-8/i2c-DLL06E4:01/0018:06CB:7A13.0001/input/input13
<6>[3.029808] hid-multitouch 0018:06CB:7A13.0001: input,hidraw1: I2C HID v1.00 Mouse [DLL06E4:01 06CB:7A13] on i2c-DLL06E4:01
<6>[3.031975] input: ELAN Touchscreen as /devices/pci:00/:00:14.0/usb1/1-9/1-9:1.0/0003:04F3:21D5.0005/input/input15
<6>[3.032193] hid-multitouch 0003:04F3:21D5.0005: input,hiddev0,hidraw2: USB HID v1.10 Device [ELAN Touchscreen] on usb-:00:14.0-9/input0
<6>[3.033936] mousedev: PS/2 mouse device common for all mice
<6>[3.080671] fbcon: inteldrmfb (fb0) is primary device
<3>[3.118560] brcmfmac: brcmf_c_preinit_dcmds: Firmware version = wl0: Nov 10 2015 06:38:10 version 7.35.177.61 (r598657) FWID 01-ea662a8c
<6>[3.139895] input: Logitech K750 as /devices/pci:00/:00:14.0/usb1/1-1/1-1:1.2/0003:046D:C52B.0004/0003:046D:4002.0007/input/input17
<6>[3.140146] logitech-hidpp-device 0003:046D:4002.0007: input,hidraw3: USB HID v1.11 Keyboard [Logitech K750] on usb-:00:14.0-1:2
<3>[3.141558] psmouse serio1: synaptics: Unable to query device.
<6>[3.341751] clocksource: Switched to clocksource tsc
<6>[3.355021] [drm] RC6 on
<3>[3.379315] brcmfmac: brcmf_cfg80211_reg_notifier: not a ISO3166 code (0x30 0x30)
<6>[3.511184] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
<4>[4.346538] [ cut here ]
<4>[4.346609] WARNING: CPU: 0 PID: 357 at drivers/gpu/drm/i915/intel_pm.c:3675 skl_update_other_pipe_wm+0x151/0x160 [i915]
<4>[4.346612] WARN_ON(!wm_changed)
<4>[4.346660] Modules linked in: hid_logitech_hidpp(+) joydev mousedev hid_multitouch hid_logitech_dj uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev usbhid media ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_recent bbswitch(O) xt_conntrack nf_conntrack iptable_filter mei_wdt(+) iTCO_wdt iTCO_vendor_support i2c_designware_platform i2c_designware_core mxm_wmi dell_wmi intel_rapl x86_pkg_temp_thermal intel_powerclamp dell_laptop coretemp dell_smbios dcdbas kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul nls_iso8859_1 crc32c_intel nls_cp437 ghash_clmulni_intel vfat aesni_intel fat aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd intel_cstate intel_rapl_perf pcspkr evdev efi_pstore psmouse input_leds led_class i915 serio_raw efivars brcmfmac brcmutil cfg80211 rtsx_pci_ms i2c_algo_bit i2c_i801 memstick drm_kms_helper mei_me drm mei intel_gtt syscopyarea sysfillrect sysimgblt idma64 fb_sys_fops shpchp processor_thermal_device intel_lpss_pci thermal intel_soc_dts_iosf fan i2c_hid hid battery hci_uart btbcm btqca btintel bluetooth int3403_thermal dell_smo8800 wmi pinctrl_sunrisepoint rfkill video pinctrl_intel intel_hid intel_lpss_acpi intel_lpss sparse_keymap int3402_thermal ac int340x_thermal_zone int3400_thermal acpi_thermal_rel acpi_pad button acpi_als kfifo_buf tpm_tis industrialio tpm sch_fq_codel vboxnetflt(O) vboxnetadp(O) pci_stub vboxpci(O) vboxdrv(O) ip_tables x_tables ext4 crc16 jbd2 mbcache rtsx_pci_sdmmc mmc_core atkbd libps2 ahci libahci xhci_pci xhci_hcd libata rtsx_pci usbcore scsi_mod usb_common i8042 serio nvme nvme_core
<4>[4.346726] CPU: 0 PID: 357 Comm: Xorg.wrap Tainted: G O 4.7.2-1-jose #2
<4>[4.346727] Hardware name: Dell Inc. XPS 15 9550/0N7TVV, BIOS 01.02.00 04/07/2016
<4>[4.346732]