Re: usb HC busted?
Hi Mathias,

On Fri, Jul 20, 2018 at 01:54:21PM +0100, Sudip Mukherjee wrote:
> Hi Mathias,
>
> On Fri, Jul 20, 2018 at 02:10:58PM +0300, Mathias Nyman wrote:
> > On 19.07.2018 20:32, Sudip Mukherjee wrote:
> > > Hi Mathias,
> > >
> > > On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote:
> > > > > > As first aid I could try to implement checks that make sure the
> > > > > > flushed URBs
> > > > > > trb pointers really are on the current endpoint ring, and also add
> > > > > > some warning
> > > > > > if we are we are dropping endpoints with URBs still queued.
> > > > >
> > > > > Yes, please. I think your first-aid will be a much better option than
> > > > > the hacky patch I am using atm.
> > > > >
> >
> > So poison is overwritten at e5acda58 with almost its own address, (reading
> > backwards) e5 ac da 60, twice.
> > looks like something (32bit?)is pointing to itself twice, maybe a linked
> > list node next and prev pointer
> > being set to point to itself as last item was removed from list.
> >
> > The cancelled_td_list is part of struct xhci_virt_ep, so that should be
> > fine.
> > But td_list is part of struct xhci_ring, which was freed. and we removed
> > the URBs tds from the td_list when
> > flushing the ring after ring was freed
> >
> > I changed the patch (attached) to make sure it doesn't touch the td_list
> > when canceling a URB after
> > ring is freed.
> >
> > How about this one, any improvements?
>
> Yes, it worked. :D
>
> So, cycle-1 = no change, just to make sure I can still reproduce the error.
> cycle-2 and cycle-3 with your patch, and there was no problem,
> slub debug was also happy.
> I am starting an autotest with this patch now, and I will have almost
> 50 cycles tested by tomorrow morning.

I can confirm that your bandaid patch has worked. A total of 67 cycles
have been tested so far and there was no error. It's continuing to test
over the weekend.

Thank you very much for this one. :)

I guess you will start with the proper fix, that you and Alan had been
discussing, after you are fully back to work.

--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
On Fri, 20 Jul 2018, Mathias Nyman wrote:

> >> But we need to fix this properly as well.
> >> xhci needs to be more in sync with usb core in usb_set_interface(),
> >> currently xhci
> >> has the altssetting up and running when usb core hasn't event started
> >> flushing endpoints.
> >
> > Absolutely. The core tries to be compatible with host controller
> > drivers that either allocate bandwidth as it is requested or else
> > allocate bandwidth all at once when an altsetting is installed.
> >
> > xhci-hcd falls into the second category. However, this approach
> > requires the bandwidth verification for the new altsetting to be
> > performed before the old altsetting has been disabled, and the xHCI
> > hardware can't do this.
> >
> > We may need to change the core so that the old endpoints are disabled
> > before the bandwidth check is done, instead of after. Of course, this
> > leads to an awkward situation if the check fails -- we'd probably have
> > to go back and re-install the old altsetting.
>
> That would help xhci a lot.
>
> If we want to avoid the awkward altsetting re-install after bandwidth failure
> then adding a extra endpoint flush before checking the bandwidth would
> already help a lot.
>
> The endpoint disabling can then be remain after bandwidth checking.
> Does that work for other host controllers?

As far as I know, the other host controller drivers don't really care
how this is done. xHCI is the only technology where the hardware has
to verify the bandwidth requirements. (Maybe some other SuperSpeed
controller design also cares, but if so then this change is unlikely to
hurt.)

Alan Stern
Re: usb HC busted?
Hi Mathias,

On Fri, Jul 20, 2018 at 02:10:58PM +0300, Mathias Nyman wrote:
> On 19.07.2018 20:32, Sudip Mukherjee wrote:
> > Hi Mathias,
> >
> > On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote:
> > > > > As first aid I could try to implement checks that make sure the
> > > > > flushed URBs
> > > > > trb pointers really are on the current endpoint ring, and also add
> > > > > some warning
> > > > > if we are we are dropping endpoints with URBs still queued.
> > > >
> > > > Yes, please. I think your first-aid will be a much better option than
> > > > the hacky patch I am using atm.
> > > >
>
> So poison is overwritten at e5acda58 with almost its own address, (reading
> backwards) e5 ac da 60, twice.
> looks like something (32bit?)is pointing to itself twice, maybe a linked list
> node next and prev pointer
> being set to point to itself as last item was removed from list.
>
> The cancelled_td_list is part of struct xhci_virt_ep, so that should be fine.
> But td_list is part of struct xhci_ring, which was freed. and we removed the
> URBs tds from the td_list when
> flushing the ring after ring was freed
>
> I changed the patch (attached) to make sure it doesn't touch the td_list when
> canceling a URB after
> ring is freed.
>
> How about this one, any improvements?

Yes, it worked. :D

So, cycle-1 = no change, just to make sure I can still reproduce the error.
cycle-2 and cycle-3 with your patch, and there was no problem,
slub debug was also happy.
I am starting an autotest with this patch now, and I will have almost
50 cycles tested by tomorrow morning.

--
Regards
Sudip
Re: usb HC busted?
On 19.07.2018 17:57, Alan Stern wrote:
> On Thu, 19 Jul 2018, Mathias Nyman wrote:
>
>> xhci driver will set up all the endpoints for the new altsetting already in
>> usb_hcd_alloc_bandwidth().
>>
>> New endpoints will be ready and rings running after this. I don't know the
>> exact history behind this, but I assume it is because xhci does all of the
>> steps to drop/add, disable/enable endpoints and check bandwidth in a single
>> configure endpoint command, that will return errors if there is not enough
>> bandwidth.
>
> That's right; Sarah and I spent some time going over this while she was
> working on it. But it looks like the approach isn't adequate.
>
>> This command is issued in hcd->driver->check_bandwidth()
>> This means that xhci doesn't really do much in hcd->driver->endpoint_disable
>> or hcd->driver->endpoint_enable
>>
>> It also means that xhci driver assumes rings are empty when
>> hcd->driver->check_bandwidth is called. It will bluntly free dropped rings.
>> If there are URBs left on a endpoint ring that was dropped+added
>> (freed+reallocated) then those URBs will contain pointers to freed ring,
>> causing issues when usb_hcd_flush_endpoint() cancels those URBs.
>>
>> usb_set_interface()
>>   usb_hcd_alloc_bandwidth()
>>     hcd->driver->drop_endpoint()
>>     hcd->driver->add_endpoint() // allocates new rings
>>     hcd->driver->check_bandwidth() // issues configure endpoint command, free rings.
>>   usb_disable_interface(iface, true)
>>     usb_disable_endpoint()
>>       usb_hcd_flush_endpoint() // will access freed ring if URBs found!!
>>       usb_hcd_disable_endpoint()
>>         hcd->driver->endpoint_disable() // xhci does nothing
>>   usb_enable_interface(iface, true)
>>     usb_enable_endpoint(ep_addrss, true) // not really doing much on xhci side.
>>
>> As first aid I could try to implement checks that make sure the flushed URBs
>> trb pointers really are on the current endpoint ring, and also add some
>> warning if we are we are dropping endpoints with URBs still queued.
>>
>> But we need to fix this properly as well.
>> xhci needs to be more in sync with usb core in usb_set_interface(), currently
>> xhci has the altssetting up and running when usb core hasn't event started
>> flushing endpoints.
>
> Absolutely. The core tries to be compatible with host controller
> drivers that either allocate bandwidth as it is requested or else
> allocate bandwidth all at once when an altsetting is installed.
>
> xhci-hcd falls into the second category. However, this approach
> requires the bandwidth verification for the new altsetting to be
> performed before the old altsetting has been disabled, and the xHCI
> hardware can't do this.
>
> We may need to change the core so that the old endpoints are disabled
> before the bandwidth check is done, instead of after. Of course, this
> leads to an awkward situation if the check fails -- we'd probably have
> to go back and re-install the old altsetting.

That would help xhci a lot.

If we want to avoid the awkward altsetting re-install after a bandwidth
failure, then adding an extra endpoint flush before checking the bandwidth
would already help a lot.

The endpoint disabling can then remain after the bandwidth check.
Does that work for other host controllers?

-Mathias
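The ordering problem discussed above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: none of the `model_*` names exist in the kernel, the "ring" is a plain tag rather than real ring memory, and nothing is actually freed. It just counts how many queued URBs still refer to a ring that has already been replaced when the flush finally runs.

```c
#include <stddef.h>

#define MODEL_MAX_URBS 8

/* Hypothetical stand-ins, not kernel structures. */
struct model_ring { int id; };

struct model_urb { const struct model_ring *ring; };

struct model_ep {
	const struct model_ring *ring;          /* ring currently installed */
	struct model_urb urbs[MODEL_MAX_URBS];  /* URBs still queued */
	size_t num_urbs;
};

/* Queue one URB; it remembers the ring it was queued on. 0 on success, -1 if full. */
static int model_submit(struct model_ep *ep)
{
	if (ep->num_urbs == MODEL_MAX_URBS)
		return -1;
	ep->urbs[ep->num_urbs++].ring = ep->ring;
	return 0;
}

/*
 * Cancel every queued URB and return how many of them referred to a ring
 * that is no longer the installed one (the dangerous case in this thread).
 */
static size_t model_flush(struct model_ep *ep)
{
	size_t stale = 0;

	for (size_t i = 0; i < ep->num_urbs; i++)
		if (ep->urbs[i].ring != ep->ring)
			stale++;
	ep->num_urbs = 0;
	return stale;
}

/* Stand-in for drop + add followed by the configure endpoint command. */
static void model_replace_ring(struct model_ep *ep,
			       const struct model_ring *new_ring)
{
	ep->ring = new_ring;
}
```

In this model, calling model_replace_ring() before model_flush() reports every queued URB as stale, while flushing first (the extra flush suggested above) leaves nothing stale by the time the ring is replaced.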
Re: usb HC busted?
On 19.07.2018 20:32, Sudip Mukherjee wrote:
> Hi Mathias,
>
> On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote:
>>>> As first aid I could try to implement checks that make sure the flushed URBs
>>>> trb pointers really are on the current endpoint ring, and also add some warning
>>>> if we are we are dropping endpoints with URBs still queued.
>>>
>>> Yes, please. I think your first-aid will be a much better option than
>>> the hacky patch I am using atm.
>>>
>>
>> Attached a patch that checks canceled URB td/trb pointers.
>> I haven't tested it at all (well compiles and boots, but new code never exercised)
>>
>> Does it work for you?
>
> No, not exactly. :(
>
> I can see your message getting printed.
> [ 249.518394] xhci_hcd :00:14.0: Canceled URB td not found on endpoint ring
> [ 249.518431] xhci_hcd :00:14.0: Canceled URB td not found on endpoint ring
>
> But I can see the message from slub debug again:
>
> [ 348.279986] =
> [ 348.279993] BUG kmalloc-96 (Tainted: G U O ): Poison overwritten
> [ 348.279995] -
> [ 348.279997] Disabling lock debugging due to kernel taint
> [ 348.28] INFO: 0xe5acda60-0xe5acda67. First byte 0x60 instead of 0x6b
> [ 348.280012] INFO: Allocated in xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd] age=129264 cpu=0 pid=33
> ...
> [ 348.280095] INFO: Freed in xhci_ring_free+0xa7/0xc6 [xhci_hcd] age=98722 cpu=0 pid=33
> ...
> [ 348.280158] INFO: Slab 0xf46e0fe0 objects=29 used=29 fp=0x (null) flags=0x40008100
> [ 348.280160] INFO: Object 0xe5acda48 @offset=6728 fp=0xe5acd700
> [ 348.280164] Redzone e5acda40: bb bb bb bb bb bb bb bb
> [ 348.280167] Object e5acda48: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
> [ 348.280169] Object e5acda58: 6b 6b 6b 6b 6b 6b 6b 6b 60 da ac e5 60 da ac e5 `...`...

So the poison is overwritten at e5acda58 with almost its own address,
(reading backwards) e5 ac da 60, twice.
It looks like something (32bit?) is pointing to itself twice, maybe a
linked list node's next and prev pointers being set to point to itself as
the last item was removed from the list.

The cancelled_td_list is part of struct xhci_virt_ep, so that should be fine.
But td_list is part of struct xhci_ring, which was freed, and we removed the
URB's tds from the td_list when flushing the ring after the ring was freed.

I changed the patch (attached) to make sure it doesn't touch the td_list when
canceling a URB after the ring is freed.

How about this one, any improvements?

-Mathias

From ee48d9f9c2d82058489dcdc38faa34a3cbdb08d1 Mon Sep 17 00:00:00 2001
From: Mathias Nyman
Date: Thu, 19 Jul 2018 18:06:18 +0300
Subject: [PATCH v2] xhci: when dequeing a URB make sure it exists on the
 current endpoint ring.

If the endpoint ring has been reallocated since the URB was enqueued,
then the URB may contain TD and TRB pointers to an already freed ring.
If this is the case then manually return the URB without touching any of
the freed ring structure data. Don't try to stop the ring. It would be
useless.

This can happen if the endpoint is not flushed before it is dropped and
re-added, which is the case in usb_set_interface() as xhci does things
in an odd order.

Signed-off-by: Mathias Nyman
---
 drivers/usb/host/xhci.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 711da33..7093341 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -37,6 +37,21 @@ static unsigned int quirks;
 module_param(quirks, uint, S_IRUGO);
 MODULE_PARM_DESC(quirks, "Bit flags for quirks to be enabled as default");
 
+static bool td_on_ring(struct xhci_td *td, struct xhci_ring *ring)
+{
+	struct xhci_segment *seg = ring->first_seg;
+
+	if (!td || !td->start_seg)
+		return false;
+	do {
+		if (seg == td->start_seg)
+			return true;
+		seg = seg->next;
+	} while (seg && seg != ring->first_seg);
+
+	return false;
+}
+
 /* TODO: copied from ehci-hcd.c - can this be refactored? */
 /*
  * xhci_handshake - spin reading hc until handshake completes or fails
@@ -1467,6 +1482,21 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 		goto done;
 	}
 
+	/*
+	 * check ring is not re-allocated since URB was enqueued. If it is, then
+	 * make sure none of the ring related pointers in this URB private data
+	 * are touched, such as td_list, otherwise we overwrite freed data
+	 */
+	if (!td_on_ring(&urb_priv->td[0], ep_ring)) {
+		xhci_err(xhci, "Canceled URB td not found on endpoint ring");
+		for (i = urb_priv->num_tds_done; i < urb_priv->num_tds; i++) {
+			td = &urb_priv->td[i];
+			if (!list_empty(&td->cancelled_td_list))
+				list_del_init(&td->cancelled_td_list);
+		}
+		goto err_giveback;
+	}
+
 	if (xhci->xhc_state & XHCI_STATE_HALTED) {
 		xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
 				"HC halted, freeing TD manually.");
--
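The "pointing to itself twice" reading of the dump can be checked with a small userspace sketch. It assumes a simplified re-implementation of the kernel's circular list (the `*_node` helpers and `le32_from_bytes` are illustrative names, not kernel API): removing the only entry from a list rewrites the list head so that both its next and prev point back at the head, and on a 32-bit little-endian machine the bytes `60 da ac e5` decode to 0xe5acda60.

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal userspace imitation of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail_node(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Same effect as list_del_init(): unlink the entry, then point it at itself.
 * Note that the unlink step WRITES to the neighbours, here the list head.
 */
static void list_del_init_node(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	init_list_head(entry);
}

/* Decode four dump bytes as a little-endian 32-bit value. */
static uint32_t le32_from_bytes(const uint8_t b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

If the head lives inside an already freed struct xhci_ring, that unlink write lands in freed (poisoned) memory, which would match two consecutive self-referencing 32-bit words at the head's address.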
Re: usb HC busted?
Hi Mathias,

On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote:
> > > As first aid I could try to implement checks that make sure the flushed
> > > URBs
> > > trb pointers really are on the current endpoint ring, and also add some
> > > warning
> > > if we are we are dropping endpoints with URBs still queued.
> >
> > Yes, please. I think your first-aid will be a much better option than
> > the hacky patch I am using atm.
> >
>
> Attached a patch that checks canceled URB td/trb pointers.
> I haven't tested it at all (well compiles and boots, but new code never
> exercised)
>
> Does it work for you?

No, not exactly. :(

I can see your message getting printed.
[ 249.518394] xhci_hcd :00:14.0: Canceled URB td not found on endpoint ring
[ 249.518431] xhci_hcd :00:14.0: Canceled URB td not found on endpoint ring

But I can see the message from slub debug again:

[ 348.279986] =
[ 348.279993] BUG kmalloc-96 (Tainted: G U O ): Poison overwritten
[ 348.279995] -
[ 348.279997] Disabling lock debugging due to kernel taint
[ 348.28] INFO: 0xe5acda60-0xe5acda67. First byte 0x60 instead of 0x6b
[ 348.280012] INFO: Allocated in xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd] age=129264 cpu=0 pid=33
[ 348.280019] ___slab_alloc.constprop.24+0x1fc/0x292
[ 348.280023] __slab_alloc.isra.18.constprop.23+0x1c/0x25
[ 348.280026] kmem_cache_alloc_trace+0x78/0x141
[ 348.280032] xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd]
[ 348.280038] xhci_endpoint_init+0x25f/0x30a [xhci_hcd]
[ 348.280044] xhci_add_endpoint+0x126/0x149 [xhci_hcd]
[ 348.280057] usb_hcd_alloc_bandwidth+0x26a/0x2a0 [usbcore]
[ 348.280067] usb_set_interface+0xeb/0x25d [usbcore]
[ 348.280071] btusb_work+0xeb/0x324 [btusb]
[ 348.280076] process_one_work+0x163/0x2b2
[ 348.280080] worker_thread+0x1a9/0x25c
[ 348.280083] kthread+0xf8/0xfd
[ 348.280087] ret_from_fork+0x2e/0x38
[ 348.280095] INFO: Freed in xhci_ring_free+0xa7/0xc6 [xhci_hcd] age=98722 cpu=0 pid=33
[ 348.280098] __slab_free+0x4b/0x27a
[ 348.280100] kfree+0x12e/0x155
[ 348.280106] xhci_ring_free+0xa7/0xc6 [xhci_hcd]
[ 348.280112] xhci_free_endpoint_ring+0x16/0x20 [xhci_hcd]
[ 348.280118] xhci_check_bandwidth+0x1c2/0x211 [xhci_hcd]
[ 348.280129] usb_hcd_alloc_bandwidth+0x205/0x2a0 [usbcore]
[ 348.280139] usb_set_interface+0xeb/0x25d [usbcore]
[ 348.280142] btusb_work+0x228/0x324 [btusb]
[ 348.280145] process_one_work+0x163/0x2b2
[ 348.280148] worker_thread+0x1a9/0x25c
[ 348.280151] kthread+0xf8/0xfd
[ 348.280154] ret_from_fork+0x2e/0x38
[ 348.280158] INFO: Slab 0xf46e0fe0 objects=29 used=29 fp=0x (null) flags=0x40008100
[ 348.280160] INFO: Object 0xe5acda48 @offset=6728 fp=0xe5acd700
[ 348.280164] Redzone e5acda40: bb bb bb bb bb bb bb bb
[ 348.280167] Object e5acda48: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[ 348.280169] Object e5acda58: 6b 6b 6b 6b 6b 6b 6b 6b 60 da ac e5 60 da ac e5 `...`...
[ 348.280171] Object e5acda68: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[ 348.280174] Object e5acda78: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[ 348.280176] Object e5acda88: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[ 348.280179] Object e5acda98: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkk.
[ 348.280181] Redzone e5acdaa8: bb bb bb bb
[ 348.280183] Padding e5acdb50: 5a 5a 5a 5a 5a 5a 5a 5a
[ 348.280188] CPU: 0 PID: 133 Comm: weston Tainted: GBU O 4.14.55-20180712+ #2
[ 348.280190] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017
[ 348.280192] Call Trace:
[ 348.280199] dump_stack+0x47/0x5b
[ 348.280202] print_trailer+0x12b/0x133
[ 348.280206] check_bytes_and_report+0x6c/0xae
[ 348.280210] check_object+0x10a/0x1db
[ 348.280214] alloc_debug_processing+0x79/0x123
[ 348.280218] ___slab_alloc.constprop.24+0x1fc/0x292
[ 348.280224] ? drm_mode_atomic_ioctl+0x374/0x75e
[ 348.280227] ? drm_mode_atomic_ioctl+0x374/0x75e
[ 348.280231] ? drm_mode_object_get+0x28/0x3a
[ 348.280235] ? __radix_tree_lookup+0x27/0x7e
[ 348.280238] ? drm_mode_object_get+0x28/0x3a
[ 348.280242] ? drm_mode_object_put+0x28/0x4c
[ 348.280246] __slab_alloc.isra.18.constprop.23+0x1c/0x25
[ 348.280249] ? __slab_alloc.isra.18.constprop.23+0x1c/0x25
[ 348.280253] kmem_cache_alloc_trace+0x78/0x141
[ 348.280257] ? drm_mode_atomic_ioctl+0x374/0x75e
[ 348.280261] drm_mode_atomic_ioctl+0x374/0x75e
[ 348.280267] ? drm_atomic_set_property+0x442/0x442
[ 348.280272] drm_ioctl_kernel+0x52/0x88
[ 348.280275] drm_ioctl+0x1fc/0x2c1
[ 348.280279] ? drm_atomic_set_property+0x442/0x442
[
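As a rough illustration of how a "Poison overwritten" report like the one above comes about, the following userspace sketch fills a freed object with the poison pattern seen in the dump (0x6b, with 0xa5 as the final byte) and later scans it for the first and last byte that no longer match. The function names are hypothetical and the real SLUB checks are more involved; only the two byte values are taken from the log.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POISON_FREE 0x6b	/* fill byte seen throughout the freed object */
#define POISON_END  0xa5	/* last byte of the poisoned object */

/* Poison a freed object of `size` bytes (size must be at least 1). */
static void poison_object(uint8_t *obj, size_t size)
{
	memset(obj, POISON_FREE, size - 1);
	obj[size - 1] = POISON_END;
}

/*
 * Scan a poisoned object. Returns 1 and stores the first and last damaged
 * offsets if any byte differs from the expected pattern, 0 if it is intact.
 */
static int find_poison_damage(const uint8_t *obj, size_t size,
			      size_t *first, size_t *last)
{
	int damaged = 0;

	for (size_t i = 0; i < size; i++) {
		uint8_t expect = (i == size - 1) ? POISON_END : POISON_FREE;

		if (obj[i] != expect) {
			if (!damaged)
				*first = i;
			*last = i;
			damaged = 1;
		}
	}
	return damaged;
}
```

Applied to a 96-byte object starting at 0xe5acda48 with the eight bytes from the dump written at offset 0x18, the scan reports offsets 0x18 through 0x1f, i.e. the 0xe5acda60-0xe5acda67 range named in the INFO line.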
Re: usb HC busted?
>> As first aid I could try to implement checks that make sure the flushed URBs
>> trb pointers really are on the current endpoint ring, and also add some warning
>> if we are we are dropping endpoints with URBs still queued.
>
> Yes, please. I think your first-aid will be a much better option than
> the hacky patch I am using atm.
>

Attached is a patch that checks canceled URB td/trb pointers.
I haven't tested it at all (well, it compiles and boots, but the new code
was never exercised).

Does it work for you?

>>
>> But we need to fix this properly as well.
>> xhci needs to be more in sync with usb core in usb_set_interface(), currently xhci
>> has the altssetting up and running when usb core hasn't event started flushing endpoints.
>
> I am able to reproduce this on almost all cycles, so I can always test
> the fix for you after you are fully back from your holiday.

Nice, thanks

-Mathias

From a7d4af3129a91811c95ea642f6c916b1c1ca6d46 Mon Sep 17 00:00:00 2001
From: Mathias Nyman
Date: Thu, 19 Jul 2018 18:06:18 +0300
Subject: [PATCH] xhci: when dequeing a URB make sure it exists on the current
 endpoint ring.

If the endpoint ring has been reallocated since the URB was enqueued,
then the URB may contain TD and TRB pointers to an already freed ring.
If this is the case then manually return the URB, and don't try to stop
the ring. It would be useless.

This can happen if the endpoint is not flushed before it is dropped and
re-added, which is the case in usb_set_interface() as xhci does things
in an odd order.

Signed-off-by: Mathias Nyman
---
 drivers/usb/host/xhci.c | 43 ---
 1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 711da33..5bedab7 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -37,6 +37,21 @@ static unsigned int quirks;
 module_param(quirks, uint, S_IRUGO);
 MODULE_PARM_DESC(quirks, "Bit flags for quirks to be enabled as default");
 
+static bool td_on_ring(struct xhci_td *td, struct xhci_ring *ring)
+{
+	struct xhci_segment *seg = ring->first_seg;
+
+	if (!td || !td->start_seg)
+		return false;
+	do {
+		if (seg == td->start_seg)
+			return true;
+		seg = seg->next;
+	} while (seg && seg != ring->first_seg);
+
+	return false;
+}
+
 /* TODO: copied from ehci-hcd.c - can this be refactored? */
 /*
  * xhci_handshake - spin reading hc until handshake completes or fails
@@ -1467,19 +1482,16 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 		goto done;
 	}
 
+	/* check ring is not re-allocated since URB was enqueued */
+	if (!td_on_ring(&urb_priv->td[0], ep_ring)) {
+		xhci_err(xhci, "Canceled URB td not found on endpoint ring");
+		goto err_unlink_giveback;
+	}
+
 	if (xhci->xhc_state & XHCI_STATE_HALTED) {
 		xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
-				"HC halted, freeing TD manually.");
-		for (i = urb_priv->num_tds_done;
-		     i < urb_priv->num_tds;
-		     i++) {
-			td = &urb_priv->td[i];
-			if (!list_empty(&td->td_list))
-				list_del_init(&td->td_list);
-			if (!list_empty(&td->cancelled_td_list))
-				list_del_init(&td->cancelled_td_list);
-		}
-		goto err_giveback;
+			       "HC halted, freeing TD manually.");
+		goto err_unlink_giveback;
 	}
 
 	i = urb_priv->num_tds_done;
@@ -1519,6 +1531,15 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 	spin_unlock_irqrestore(&xhci->lock, flags);
 	return ret;
 
+err_unlink_giveback:
+	for (i = urb_priv->num_tds_done; i < urb_priv->num_tds; i++) {
+		td = &urb_priv->td[i];
+		if (!list_empty(&td->td_list))
+			list_del_init(&td->td_list);
+		if (!list_empty(&td->cancelled_td_list))
+			list_del_init(&td->cancelled_td_list);
+	}
+
 err_giveback:
 	if (urb_priv)
 		xhci_urb_free_priv(urb_priv);
--
2.7.4
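The ring-membership check in the patch can be tried outside the kernel with stand-in types. The sketch below mirrors the loop from td_on_ring() but uses hypothetical `sk_*` structures that carry only the fields the check needs, and it additionally guards against a NULL ring, which the patch itself does not do.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for xhci_segment, xhci_ring and xhci_td; only the fields used. */
struct sk_seg  { struct sk_seg *next; };
struct sk_ring { struct sk_seg *first_seg; };
struct sk_td   { struct sk_seg *start_seg; };

/*
 * Walk the circular segment list once and report whether the TD's start
 * segment is one of the ring's current segments.
 */
static bool sk_td_on_ring(const struct sk_td *td, const struct sk_ring *ring)
{
	const struct sk_seg *seg;

	if (!td || !td->start_seg || !ring || !ring->first_seg)
		return false;

	seg = ring->first_seg;
	do {
		if (seg == td->start_seg)
			return true;
		seg = seg->next;
	} while (seg && seg != ring->first_seg);

	return false;
}
```

One caveat, stated as a general property of pointer comparison rather than anything reported in this thread: the check compares addresses only, so a new segment that happens to be allocated at the address of an old one would be treated as a match.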
Re: usb HC busted?
On Thu, 19 Jul 2018, Mathias Nyman wrote:

> xhci driver will set up all the endpoints for the new altsetting already in
> usb_hcd_alloc_bandwidth().
>
> New endpoints will be ready and rings running after this. I don't know the
> exact
> history behind this, but I assume it is because xhci does all of the steps to
> drop/add, disable/enable endpoints and check bandwidth in a single configure
> endpoint command, that will return errors if there is not enough bandwidth.

That's right; Sarah and I spent some time going over this while she was
working on it. But it looks like the approach isn't adequate.

> This command is issued in hcd->driver->check_bandwidth()
> This means that xhci doesn't really do much in hcd->driver->endpoint_disable
> or
> hcd->driver->endpoint_enable
>
> It also means that xhci driver assumes rings are empty when
> hcd->driver->check_bandwidth is called. It will bluntly free dropped rings.
> If there are URBs left on a endpoint ring that was dropped+added
> (freed+reallocated) then those URBs will contain pointers to freed ring,
> causing issues when usb_hcd_flush_endpoint() cancels those URBs.
>
> usb_set_interface()
>   usb_hcd_alloc_bandwidth()
>     hcd->driver->drop_endpoint()
>     hcd->driver->add_endpoint() // allocates new rings
>     hcd->driver->check_bandwidth() // issues configure endpoint command,
> free rings.
>   usb_disable_interface(iface, true)
>     usb_disable_endpoint()
>       usb_hcd_flush_endpoint() // will access freed ring if URBs found!!
>       usb_hcd_disable_endpoint()
>         hcd->driver->endpoint_disable() // xhci does nothing
>   usb_enable_interface(iface, true)
>     usb_enable_endpoint(ep_addrss, true) // not really doing much on xhci
> side.
>
> As first aid I could try to implement checks that make sure the flushed URBs
> trb pointers really are on the current endpoint ring, and also add some
> warning
> if we are we are dropping endpoints with URBs still queued.
>
> But we need to fix this properly as well.
> xhci needs to be more in sync with usb core in usb_set_interface(), currently
> xhci
> has the altssetting up and running when usb core hasn't event started
> flushing endpoints.

Absolutely. The core tries to be compatible with host controller
drivers that either allocate bandwidth as it is requested or else
allocate bandwidth all at once when an altsetting is installed.

xhci-hcd falls into the second category. However, this approach
requires the bandwidth verification for the new altsetting to be
performed before the old altsetting has been disabled, and the xHCI
hardware can't do this.

We may need to change the core so that the old endpoints are disabled
before the bandwidth check is done, instead of after. Of course, this
leads to an awkward situation if the check fails -- we'd probably have
to go back and re-install the old altsetting.

Alan Stern
Re: usb HC busted?
Hi Mathias,

On Thu, Jul 19, 2018 at 01:59:01PM +0300, Mathias Nyman wrote:
> On 17.07.2018 18:10, Sudip Mukherjee wrote:
> > Hi Alan, Greg,
> >
> > On Tue, Jul 17, 2018 at 03:49:18PM +0100, Sudip Mukherjee wrote:
> > > On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:
> > > > Hi Alan,
> > > >
> > > > On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> > > > > On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> > > > >
> > > > > > I did some more debugging. Tested with a KASAN enabled kernel and
> > > > > > that
> > > > > > shows the problem. The report is attached.
> > > > > >
> >
> > And, my hacky patch worked as I prevented it from calling
> > usb_disable_interface() in this particular case.
> >
>
> Back for a few days, looking at this

I hope you had a good holiday. :)

>
> xhci driver will set up all the endpoints for the new altsetting already in
> usb_hcd_alloc_bandwidth().
>
>
> As first aid I could try to implement checks that make sure the flushed URBs
> trb pointers really are on the current endpoint ring, and also add some
> warning
> if we are we are dropping endpoints with URBs still queued.

Yes, please. I think your first-aid will be a much better option than
the hacky patch I am using atm.

>
> But we need to fix this properly as well.
> xhci needs to be more in sync with usb core in usb_set_interface(), currently
> xhci
> has the altssetting up and running when usb core hasn't event started
> flushing endpoints.

I am able to reproduce this on almost all cycles, so I can always test
the fix for you after you are fully back from your holiday.

--
Regards
Sudip
Re: usb HC busted?
On 17.07.2018 18:10, Sudip Mukherjee wrote:
> Hi Alan, Greg,
>
> On Tue, Jul 17, 2018 at 03:49:18PM +0100, Sudip Mukherjee wrote:
>> On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:
>>> Hi Alan,
>>>
>>> On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
>>>> On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
>>>>
>>>>> I did some more debugging. Tested with a KASAN enabled kernel and that
>>>>> shows the problem. The report is attached.
>>>>>
>>>>> To my understanding:
>>>>>
>>>>> btusb_work() is calling usb_set_interface() with alternate = 0. which
>>>>> again calls usb_hcd_alloc_bandwidth() and that frees the rings by
>>>>> xhci_free_endpoint_ring().
>>>>
>>>> That doesn't sound like the right thing to do. The rings shouldn't be
>>>> freed until xhci_endpoint_disable() is called.
>>>>
>>>> On the other hand, there doesn't appear to be any
>>>> xhci_endpoint_disable() routine, although a comment refers to it.
>>>> Maybe this is the real problem?
>>>
>>> one of your old mail might help :)
>>>
>>> https://www.spinics.net/lists/linux-usb/msg98123.html
>>
>> Wrote too soon.
>>
>> Is it the one you are looking for -
>> usb_disable_endpoint() is in drivers/usb/core/message.c
>
> I think now I understand what the problem is.
> usb_set_interface() calls usb_disable_interface() which again calls
> usb_disable_endpoint(). This usb_disable_endpoint() gets the pointer
> to 'ep', marks it as NULL and sends the pointer to usb_hcd_flush_endpoint().
> After flushing the endpoints usb_disable_endpoint() calls
> usb_hcd_disable_endpoint() which tries to do:
>	if (hcd->driver->endpoint_disable)
>		hcd->driver->endpoint_disable(hcd, ep);
> but there is no endpoint_disable() callback in xhci, so the endpoint is
> never marked as disabled. So, next time usb_hcd_flush_endpoint() is
> called I get this corruption.
> And this is exactly where I used to see the problem happening.
>
> And, my hacky patch worked as I prevented it from calling
> usb_disable_interface() in this particular case.
>

Back for a few days, looking at this.

The xhci driver will set up all the endpoints for the new altsetting already
in usb_hcd_alloc_bandwidth().

New endpoints will be ready and rings running after this. I don't know the
exact history behind this, but I assume it is because xhci does all of the
steps to drop/add, disable/enable endpoints and check bandwidth in a single
configure endpoint command, which will return errors if there is not enough
bandwidth. This command is issued in hcd->driver->check_bandwidth().
This means that xhci doesn't really do much in hcd->driver->endpoint_disable
or hcd->driver->endpoint_enable.

It also means that the xhci driver assumes rings are empty when
hcd->driver->check_bandwidth is called. It will bluntly free dropped rings.
If there are URBs left on an endpoint ring that was dropped+added
(freed+reallocated) then those URBs will contain pointers to the freed ring,
causing issues when usb_hcd_flush_endpoint() cancels those URBs.

usb_set_interface()
  usb_hcd_alloc_bandwidth()
    hcd->driver->drop_endpoint()
    hcd->driver->add_endpoint() // allocates new rings
    hcd->driver->check_bandwidth() // issues configure endpoint command, free rings.
  usb_disable_interface(iface, true)
    usb_disable_endpoint()
      usb_hcd_flush_endpoint() // will access freed ring if URBs found!!
      usb_hcd_disable_endpoint()
        hcd->driver->endpoint_disable() // xhci does nothing
  usb_enable_interface(iface, true)
    usb_enable_endpoint(ep_addrss, true) // not really doing much on xhci side.

As first aid I could try to implement checks that make sure the flushed URBs'
trb pointers really are on the current endpoint ring, and also add some
warning if we are dropping endpoints with URBs still queued.

But we need to fix this properly as well.
xhci needs to be more in sync with usb core in usb_set_interface(); currently
xhci has the altsetting up and running when usb core hasn't even started
flushing endpoints.

-Mathias
Re: usb HC busted?
On Tue, Jul 17, 2018 at 04:59:01PM +0100, Sudip Mukherjee wrote: > On Tue, Jul 17, 2018 at 05:52:59PM +0200, Greg KH wrote: > > On Tue, Jul 17, 2018 at 10:31:38AM -0400, Alan Stern wrote: > > > On Tue, 17 Jul 2018, Greg KH wrote: > > > > > > > > From: Sudip Mukherjee > > > > > Date: Tue, 10 Jul 2018 09:50:00 +0100 > > > > > Subject: [PATCH] hacky solution to mem-corruption > > > > > > > > > > Signed-off-by: Sudip Mukherjee > > > > > --- > > > > > > > No, neither of these is right. It's possible to use > > > usb_set_interface() as a kind of "soft" reset. Even when the new > > > altsetting is specified to be the same as the current one, we still > > > have to tell the lower-layer drivers and hardware about it. > > > > You are right, it's a hacky soft reset, I was just trying to figure out > > what the bluetooth driver was trying to do. I wouldn't expect it to be > > calling that function a lot, but I guess it does :( > > usb_set_interface() is being called two times from bluetooth event. But > I am now adding more debugs to see why your patch did not work. So, a very simple debug to see the sequence of functions being called. I have attached the patch I used. 
In a good case: [ 124.287991] sudip: xhci_urb_dequeue [ 124.287997] sudip: xhci_queue_stop_endpoint cmd=ee032950 [ 124.288016] sudip: handle_cmd_completion cmd=ee032950 [ 124.288173] sudip: xhci_urb_dequeue [ 124.288176] sudip: xhci_queue_stop_endpoint cmd=ee032950 [ 124.288189] sudip: handle_cmd_completion cmd=ee032950 [ 124.290647] sudip: usb_hcd_flush_endpoint [ 124.290652] sudip: usb_hcd_flush_endpoint But in a bad case: [ 186.786900] sudip: xhci_urb_dequeue [ 186.786905] sudip: xhci_queue_stop_endpoint cmd=ebe47cb0 [ 186.786923] sudip: handle_cmd_completion cmd=ebe47cb0 [ 186.789040] sudip: xhci_urb_dequeue [ 186.789047] sudip: xhci_queue_stop_endpoint cmd=ebe47cb0 [ 186.789069] sudip: handle_cmd_completion cmd=ebe47cb0 [ 186.790082] sudip: usb_hcd_flush_endpoint [ 186.790094] sudip: xhci_urb_dequeue [ 186.790097] sudip: xhci_queue_stop_endpoint cmd=ebe47290 [ 186.790150] sudip: handle_cmd_completion cmd=ebe47290 [ 186.790202] sudip: usb_hcd_flush_endpoint So, when usb_hcd_flush_endpoint() is called by usb_disable_endpoint() it finds urbs still on the urb_list of the ep. And in the process of unlinking them, it again sends the command to stop the endpoint, although that endpoint has already been stopped. So Greg's patch did not work as the memory got corrupted on the first call to usb_set_interface(), whereas that patch was preventing the second call to usb_set_interface(). 
--
Regards
Sudip

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 467bedeb542a..8d28f120ec0a 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1885,6 +1885,7 @@ void usb_hcd_flush_endpoint(struct usb_device *udev,
 	might_sleep();
 	hcd = bus_to_hcd(udev->bus);
 
+	pr_err("sudip: %s\n", __func__);
 	/* No more submits can occur */
 	spin_lock_irq(&hcd_urb_list_lock);
 rescan:
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 6996235e34a9..4f80791fdfc5 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1450,6 +1450,7 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
 	case TRB_STOP_RING:
 		WARN_ON(slot_id != TRB_TO_SLOT_ID(
 				le32_to_cpu(cmd_trb->generic.field[3])));
+		pr_err("sudip: %s cmd=%p\n", __func__, cmd);
 		xhci_handle_cmd_stop_ep(xhci, slot_id, cmd_trb, event);
 		break;
 	case TRB_SET_DEQ:
@@ -4009,6 +4010,7 @@ int xhci_queue_stop_endpoint(struct xhci_hcd *xhci, struct xhci_command *cmd,
 	u32 type = TRB_TYPE(TRB_STOP_RING);
 	u32 trb_suspend = SUSPEND_PORT_FOR_TRB(suspend);
 
+	pr_err("sudip: %s cmd=%p\n", __func__, cmd);
 	return queue_command(xhci, cmd, 0, 0, 0,
 			trb_slot_id | trb_ep_index | type | trb_suspend, false);
 }
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index db1de6113db2..3832128107ff 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -1516,6 +1516,7 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 		ep->stop_cmd_timer.expires = jiffies +
 			XHCI_STOP_EP_CMD_TIMEOUT * HZ;
 		add_timer(&ep->stop_cmd_timer);
+		pr_err("sudip: %s\n", __func__);
 		xhci_queue_stop_endpoint(xhci, command, urb->dev->slot_id,
 					 ep_index, 0);
 		xhci_ring_cmd_db(xhci);
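The difference between the two traces above can be checked mechanically. What follows is only a userspace sketch, not kernel code: the `toy_*` names are hypothetical, and the two event arrays are hand transcriptions of the good and bad logs. It counts Stop Endpoint commands that get queued after the first usb_hcd_flush_endpoint() line, which is the pattern the message identifies as the problem.

```c
/* Toy userspace model of the debug traces; NOT kernel code.
 * Every toy_* / TOY_* name is hypothetical and exists only here. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_event {
	TOY_URB_DEQUEUE,     /* log line "xhci_urb_dequeue" */
	TOY_QUEUE_STOP_EP,   /* log line "xhci_queue_stop_endpoint" */
	TOY_CMD_COMPLETION,  /* log line "handle_cmd_completion" */
	TOY_FLUSH_ENDPOINT   /* log line "usb_hcd_flush_endpoint" */
};

/* Count Stop Endpoint commands queued after the first flush was logged. */
size_t toy_stops_after_first_flush(const enum toy_event *ev, size_t n)
{
	bool flushed = false;
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (ev[i] == TOY_FLUSH_ENDPOINT)
			flushed = true;
		else if (flushed && ev[i] == TOY_QUEUE_STOP_EP)
			count++;
	}
	return count;
}

/* Good case: both flushes come after all dequeues have completed. */
const enum toy_event toy_good_case[] = {
	TOY_URB_DEQUEUE, TOY_QUEUE_STOP_EP, TOY_CMD_COMPLETION,
	TOY_URB_DEQUEUE, TOY_QUEUE_STOP_EP, TOY_CMD_COMPLETION,
	TOY_FLUSH_ENDPOINT, TOY_FLUSH_ENDPOINT
};

/* Bad case: a third dequeue and stop command appear after the first flush. */
const enum toy_event toy_bad_case[] = {
	TOY_URB_DEQUEUE, TOY_QUEUE_STOP_EP, TOY_CMD_COMPLETION,
	TOY_URB_DEQUEUE, TOY_QUEUE_STOP_EP, TOY_CMD_COMPLETION,
	TOY_FLUSH_ENDPOINT,
	TOY_URB_DEQUEUE, TOY_QUEUE_STOP_EP, TOY_CMD_COMPLETION,
	TOY_FLUSH_ENDPOINT
};
```

Calling toy_stops_after_first_flush() on each array with its element count separates the two cases. The sketch says nothing about why the extra URB is still queued; it only flags the ordering.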
Re: usb HC busted?
On Tue, Jul 17, 2018 at 05:52:59PM +0200, Greg KH wrote:
> On Tue, Jul 17, 2018 at 10:31:38AM -0400, Alan Stern wrote:
> > On Tue, 17 Jul 2018, Greg KH wrote:
> >
> > > > From: Sudip Mukherjee
> > > > Date: Tue, 10 Jul 2018 09:50:00 +0100
> > > > Subject: [PATCH] hacky solution to mem-corruption
> > > >
> > > > Signed-off-by: Sudip Mukherjee
> > > > ---
> >
> > No, neither of these is right. It's possible to use
> > usb_set_interface() as a kind of "soft" reset. Even when the new
> > altsetting is specified to be the same as the current one, we still
> > have to tell the lower-layer drivers and hardware about it.
>
> You are right, it's a hacky soft reset, I was just trying to figure out
> what the bluetooth driver was trying to do. I wouldn't expect it to be
> calling that function a lot, but I guess it does :(

usb_set_interface() is being called two times from bluetooth event. But
I am now adding more debugs to see why your patch did not work.

--
Regards
Sudip
Re: usb HC busted?
On Tue, Jul 17, 2018 at 10:31:38AM -0400, Alan Stern wrote:
> On Tue, 17 Jul 2018, Greg KH wrote:
>
> > > From: Sudip Mukherjee
> > > Date: Tue, 10 Jul 2018 09:50:00 +0100
> > > Subject: [PATCH] hacky solution to mem-corruption
> > >
> > > Signed-off-by: Sudip Mukherjee
> > > ---
> > > drivers/usb/core/message.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> > > index 7cd4ec33dbf4..7fdf7a27611d 100644
> > > --- a/drivers/usb/core/message.c
> > > +++ b/drivers/usb/core/message.c
> > > @@ -1398,7 +1398,8 @@ int usb_set_interface(struct usb_device *dev, int
> > > interface, int alternate)
> > >  		remove_intf_ep_devs(iface);
> > >  		usb_remove_sysfs_intf_files(iface);
> > >  	}
> > > -	usb_disable_interface(dev, iface, true);
> > > +	if (!(iface->cur_altsetting && alt))
> > > +		usb_disable_interface(dev, iface, true);
> > >
> > >
> > This feels like a "correct" patch anyway, why would a driver keep
> > calling set_interface to an interface that it was already set to?
> >
> > But can't we check for this higher up in the function? This hack will
> > just not disable an interface but it will do all of the other stuff
> > being asked for. Does the patch below also solve this for you? It's
> > not a good solution of course, but it might work around the problem a
> > bit better.
> >
> > thanks,
> >
> > greg k-h
> >
> > diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> > index 1a15392326fc..0f718f1a1ca3 100644
> > --- a/drivers/usb/core/message.c
> > +++ b/drivers/usb/core/message.c
> > @@ -1376,6 +1376,14 @@ int usb_set_interface(struct usb_device *dev, int
> > interface, int alternate)
> >  		return -EINVAL;
> >  	}
> >
> > +	if (iface->cur_altsetting == alt) {
> > +		/*
> > +		 * foolish bluetooth stack, don't try to set a setting you are
> > +		 * already set to...
> > +		 */
> > +		return 0;
> > +	}
> > +
> >  	/* Make sure we have enough bandwidth for this alternate interface.
> >  	 * Remove the current alt setting and add the new alt setting.
> >  	 */
>
> No, neither of these is right. It's possible to use
> usb_set_interface() as a kind of "soft" reset. Even when the new
> altsetting is specified to be the same as the current one, we still
> have to tell the lower-layer drivers and hardware about it.

You are right, it's a hacky soft reset, I was just trying to figure out
what the bluetooth driver was trying to do. I wouldn't expect it to be
calling that function a lot, but I guess it does :(

thanks,

greg k-h
Re: usb HC busted?
Hi Alan, Greg,

On Tue, Jul 17, 2018 at 03:49:18PM +0100, Sudip Mukherjee wrote:
> On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:
> > Hi Alan,
> >
> > On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> > > On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> > >
> > > > I did some more debugging. Tested with a KASAN enabled kernel and that
> > > > shows the problem. The report is attached.
> > > >
> > > > To my understanding:
> > > >
> > > > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > > > xhci_free_endpoint_ring().
> > >
> > > That doesn't sound like the right thing to do. The rings shouldn't be
> > > freed until xhci_endpoint_disable() is called.
> > >
> > > On the other hand, there doesn't appear to be any
> > > xhci_endpoint_disable() routine, although a comment refers to it.
> > > Maybe this is the real problem?
> >
> > one of your old mail might help :)
> >
> > https://www.spinics.net/lists/linux-usb/msg98123.html
>
> Wrote too soon.
>
> Is it the one you are looking for -
> usb_disable_endpoint() is in drivers/usb/core/message.c

I think now I understand what the problem is.
usb_set_interface() calls usb_disable_interface(), which in turn calls
usb_disable_endpoint(). This usb_disable_endpoint() gets the pointer
to 'ep', marks it as NULL and sends the pointer to usb_hcd_flush_endpoint().
After flushing the endpoints usb_disable_endpoint() calls
usb_hcd_disable_endpoint() which tries to do:

	if (hcd->driver->endpoint_disable)
		hcd->driver->endpoint_disable(hcd, ep);

but there is no endpoint_disable() callback in xhci, so the endpoint is
never marked as disabled. So, the next time usb_hcd_flush_endpoint() is
called I get this corruption.
And this is exactly where I used to see the problem happening.

And my hacky patch worked because I prevented it from calling
usb_disable_interface() in this particular case.

Greg - answering your question here.
My hacky patch was based on the fact that usb_hcd_alloc_bandwidth() is
calling hcd->driver->drop_endpoint() and hcd->driver->add_endpoint()
if (cur_alt && new_alt). So, I prevented usb_disable_interface() from
being called for that same condition. And that worked because the call
to usb_hcd_flush_endpoint() was not executed. I know it is not correct
and I might have memory leaks because of this, but I have the system
working till we get the actual fix.

--
Regards
Sudip
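The optional-callback check quoted in the message above can be illustrated in isolation. This is a minimal userspace sketch and not the kernel's data structures: `toy_ep`, `toy_hc_driver` and the helper functions are hypothetical names chosen for illustration. It only shows the general point that when a driver leaves an optional hook unset, the core skips it without complaint and whatever state the hook was meant to update stays as it was.

```c
/* Toy userspace model of an optional driver hook; NOT kernel code.
 * All toy_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_ep {
	bool enabled;  /* the host controller driver's view of the endpoint */
};

struct toy_hc_driver {
	/* Optional hook; NULL models a driver that does not provide it. */
	void (*endpoint_disable)(struct toy_ep *ep);
};

/* What a driver-side hook might do if it existed (assumption). */
void toy_mark_disabled(struct toy_ep *ep)
{
	ep->enabled = false;
}

/* Same shape as the quoted check: call the hook only when it is set. */
void toy_core_disable_endpoint(const struct toy_hc_driver *drv,
			       struct toy_ep *ep)
{
	if (drv->endpoint_disable)
		drv->endpoint_disable(ep);
}

/* Returns true if the endpoint still looks enabled after the core call. */
bool toy_still_enabled_after_disable(const struct toy_hc_driver *drv)
{
	struct toy_ep ep = { .enabled = true };

	toy_core_disable_endpoint(drv, &ep);
	return ep.enabled;
}
```

With `endpoint_disable` left NULL the helper reports the endpoint as still enabled; with `toy_mark_disabled` plugged in it does not. Whether adding such a hook is the right fix for xhci is exactly what the thread is still discussing.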
Re: usb HC busted?
On Tue, 17 Jul 2018, Sudip Mukherjee wrote:

> On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:
> > Hi Alan,
> >
> > On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> > > On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> > >
> > > > I did some more debugging. Tested with a KASAN enabled kernel and that
> > > > shows the problem. The report is attached.
> > > >
> > > > To my understanding:
> > > >
> > > > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > > > xhci_free_endpoint_ring().
> > >
> > > That doesn't sound like the right thing to do. The rings shouldn't be
> > > freed until xhci_endpoint_disable() is called.
> > >
> > > On the other hand, there doesn't appear to be any
> > > xhci_endpoint_disable() routine, although a comment refers to it.
> > > Maybe this is the real problem?
> >
> > one of your old mail might help :)
> >
> > https://www.spinics.net/lists/linux-usb/msg98123.html

That message seems to say the same thing as what I just wrote, more or
less.

> Wrote too soon.
>
> Is it the one you are looking for -
> usb_disable_endpoint() is in drivers/usb/core/message.c

No, I'm talking about xhci_endpoint_disable(), which would be called by
usb_hcd_disable_endpoint() if it existed. Of course,
usb_hcd_disable_endpoint() is called by usb_disable_endpoint().

Alan Stern
Re: usb HC busted?
On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:
> Hi Alan,
>
> On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> > On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> >
> > > I did some more debugging. Tested with a KASAN enabled kernel and that
> > > shows the problem. The report is attached.
> > >
> > > To my understanding:
> > >
> > > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > > xhci_free_endpoint_ring().
> >
> > That doesn't sound like the right thing to do. The rings shouldn't be
> > freed until xhci_endpoint_disable() is called.
> >
> > On the other hand, there doesn't appear to be any
> > xhci_endpoint_disable() routine, although a comment refers to it.
> > Maybe this is the real problem?
>
> one of your old mail might help :)
>
> https://www.spinics.net/lists/linux-usb/msg98123.html

Wrote too soon.

Is it the one you are looking for -
usb_disable_endpoint() is in drivers/usb/core/message.c

--
Regards
Sudip
Re: usb HC busted?
Hi Alan,

On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
>
> > I did some more debugging. Tested with a KASAN enabled kernel and that
> > shows the problem. The report is attached.
> >
> > To my understanding:
> >
> > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > xhci_free_endpoint_ring().
>
> That doesn't sound like the right thing to do. The rings shouldn't be
> freed until xhci_endpoint_disable() is called.
>
> On the other hand, there doesn't appear to be any
> xhci_endpoint_disable() routine, although a comment refers to it.
> Maybe this is the real problem?

one of your old mail might help :)

https://www.spinics.net/lists/linux-usb/msg98123.html

--
Regards
Sudip
Re: usb HC busted?
On Tue, 17 Jul 2018, Greg KH wrote:

> > From: Sudip Mukherjee
> > Date: Tue, 10 Jul 2018 09:50:00 +0100
> > Subject: [PATCH] hacky solution to mem-corruption
> >
> > Signed-off-by: Sudip Mukherjee
> > ---
> > drivers/usb/core/message.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> > index 7cd4ec33dbf4..7fdf7a27611d 100644
> > --- a/drivers/usb/core/message.c
> > +++ b/drivers/usb/core/message.c
> > @@ -1398,7 +1398,8 @@ int usb_set_interface(struct usb_device *dev, int
> > interface, int alternate)
> >  		remove_intf_ep_devs(iface);
> >  		usb_remove_sysfs_intf_files(iface);
> >  	}
> > -	usb_disable_interface(dev, iface, true);
> > +	if (!(iface->cur_altsetting && alt))
> > +		usb_disable_interface(dev, iface, true);
> >
>
> This feels like a "correct" patch anyway, why would a driver keep
> calling set_interface to an interface that it was already set to?
>
> But can't we check for this higher up in the function? This hack will
> just not disable an interface but it will do all of the other stuff
> being asked for. Does the patch below also solve this for you? It's
> not a good solution of course, but it might work around the problem a
> bit better.
>
> thanks,
>
> greg k-h
>
> diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> index 1a15392326fc..0f718f1a1ca3 100644
> --- a/drivers/usb/core/message.c
> +++ b/drivers/usb/core/message.c
> @@ -1376,6 +1376,14 @@ int usb_set_interface(struct usb_device *dev, int
> interface, int alternate)
>  		return -EINVAL;
>  	}
>
> +	if (iface->cur_altsetting == alt) {
> +		/*
> +		 * foolish bluetooth stack, don't try to set a setting you are
> +		 * already set to...
> +		 */
> +		return 0;
> +	}
> +
>  	/* Make sure we have enough bandwidth for this alternate interface.
>  	 * Remove the current alt setting and add the new alt setting.
>  	 */

No, neither of these is right. It's possible to use
usb_set_interface() as a kind of "soft" reset. Even when the new
altsetting is specified to be the same as the current one, we still
have to tell the lower-layer drivers and hardware about it.

Alan Stern
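The objection above can be shown with a toy model. This is a hedged userspace sketch, not the USB core: `toy_iface` and both `toy_set_interface_*` functions are hypothetical, and the counter stands in for "telling the lower-layer drivers and hardware". It contrasts an early return when the requested altsetting equals the current one with a version that always notifies the lower layer.

```c
/* Toy userspace model of the "soft reset" argument; NOT kernel code.
 * All toy_* names are hypothetical. */
#include <assert.h>

struct toy_iface {
	int cur_alt;            /* currently selected altsetting */
	int lower_layer_calls;  /* times the lower layer was told about it */
};

/* Variant with the early return from the proposed workaround. */
int toy_set_interface_shortcut(struct toy_iface *iface, int alt)
{
	if (iface->cur_alt == alt)
		return 0;  /* the lower layer never hears about this request */

	iface->lower_layer_calls++;
	iface->cur_alt = alt;
	return 0;
}

/* Variant that always notifies, so a same-altsetting call acts as a reset. */
int toy_set_interface_always(struct toy_iface *iface, int alt)
{
	iface->lower_layer_calls++;
	iface->cur_alt = alt;
	return 0;
}
```

Requesting altsetting 0 on an interface that is already at 0 leaves `lower_layer_calls` at 0 with the shortcut and raises it to 1 without it, which is the behaviour difference the reply objects to.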
Re: usb HC busted?
On Tue, 17 Jul 2018, Sudip Mukherjee wrote:

> I did some more debugging. Tested with a KASAN enabled kernel and that
> shows the problem. The report is attached.
>
> To my understanding:
>
> btusb_work() is calling usb_set_interface() with alternate = 0. which
> again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> xhci_free_endpoint_ring().

That doesn't sound like the right thing to do. The rings shouldn't be
freed until xhci_endpoint_disable() is called.

On the other hand, there doesn't appear to be any
xhci_endpoint_disable() routine, although a comment refers to it.
Maybe this is the real problem?

Alan Stern

> But then usb_set_interface() continues and
> calls usb_disable_interface() -> usb_hcd_flush_endpoint()->unlink1()->
> xhci_urb_dequeue() which at the end gives the command to stop endpoint.
>
> In all the cycles I have tested I see that only in the fail case
> handle_cmd_completion() gets called, but in the cycles where the error
> is not there handle_cmd_completion() is not called with that command.
>
> I am not sure what is happening, and you are the best person to understand
> what is happening. :)
>
> But for now (untill you are back from holiday and suggest a proper solution),
> I made a hacky patch (attached) which is working and I donot get any
> corruption after that. Both KASAN and slub debug are also happy.
>
> So, now waiting for you to analyze what is going on and suggest a proper
> fix.
>
> Thanks in advance.
>
> --
> Regards
> Sudip
Re: usb HC busted?
On Tue, Jul 17, 2018 at 02:20:00PM +0100, Sudip Mukherjee wrote: > Hi Greg, > > On Tue, Jul 17, 2018 at 02:04:11PM +0200, Greg KH wrote: > > On Tue, Jul 17, 2018 at 12:41:04PM +0100, Sudip Mukherjee wrote: > > > Hi Mathias, > > > > > > On Sat, Jun 30, 2018 at 10:07:04PM +0100, Sudip Mukherjee wrote: > > > > Hi Mathias, > > > > > > > > On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote: > > > > > On 27.06.2018 14:59, Sudip Mukherjee wrote: > > > > > > > > Can you share a bit more details on the platform you are using, > > > > > > > > and what types of test you are running. > > > > > > > > > > > > > > Then to track what is going on, I added the slub debugging and :( > > > > I have attached part of dmesg for you to check. > > > > Will appreciate your help in finding out the problem. > > > > > > I did some more debugging. Tested with a KASAN enabled kernel and that > > > shows the problem. The report is attached. > > > > > > To my understanding: > > > > > > btusb_work() is calling usb_set_interface() with alternate = 0. which > > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by > > > xhci_free_endpoint_ring(). But then usb_set_interface() continues and > > > calls usb_disable_interface() -> usb_hcd_flush_endpoint()->unlink1()-> > > > xhci_urb_dequeue() which at the end gives the command to stop endpoint. > > > > > > In all the cycles I have tested I see that only in the fail case > > > handle_cmd_completion() gets called, but in the cycles where the error > > > is not there handle_cmd_completion() is not called with that command. > > > > > > I am not sure what is happening, and you are the best person to understand > > > what is happening. :) > > > > > > But for now (untill you are back from holiday and suggest a proper > > > solution), > > > I made a hacky patch (attached) which is working and I donot get any > > > corruption after that. Both KASAN and slub debug are also happy. 
> > > > > > So, now waiting for you to analyze what is going on and suggest a proper > > > fix. > > > > > > Thanks in advance. > > > > > > -- > > > Regards > > > Sudip > > > > > [ 236.814156] > > > == > > > [ 236.814187] BUG: KASAN: use-after-free in > > > xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd] > > > [ 236.814193] Read of size 8 at addr 8800789329c8 by task weston/138 > > > > > > [ 236.814203] CPU: 0 PID: 138 Comm: weston Tainted: G U W O > > > 4.14.47-20180606+ #7 > > > [ 236.814206] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017 > > > [ 236.814209] Call Trace: > > > [ 236.814214] > > > [ 236.814226] dump_stack+0x46/0x59 > > > [ 236.814238] print_address_description+0x6b/0x23b > > > [ 236.814255] ? xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd] > > > [ 236.814262] kasan_report+0x220/0x246 > > > [ 236.814278] xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd] > > > [ 236.814294] trb_in_td+0x3b/0x1cd [xhci_hcd] > > > [ 236.814311] handle_cmd_completion+0x1181/0x2c9b [xhci_hcd] > > > [ 236.814329] ? xhci_queue_new_dequeue_state+0x5d9/0x5d9 [xhci_hcd] > > > [ 236.814337] ? drm_handle_vblank+0x4ec/0x590 > > > [ 236.814352] xhci_irq+0x529/0x3294 [xhci_hcd] > > > [ 236.814362] ? __accumulate_pelt_segments+0x24/0x33 > > > [ 236.814378] ? finish_td.isra.40+0x223/0x223 [xhci_hcd] > > > [ 236.814384] ? __accumulate_pelt_segments+0x24/0x33 > > > [ 236.814390] ? __accumulate_pelt_segments+0x24/0x33 > > > [ 236.814405] ? xhci_irq+0x3294/0x3294 [xhci_hcd] > > > [ 236.814412] __handle_irq_event_percpu+0x149/0x3db > > > [ 236.814421] handle_irq_event_percpu+0x65/0x109 > > > [ 236.814428] ? __handle_irq_event_percpu+0x3db/0x3db > > > [ 236.814436] ? 
ttwu_do_wakeup.isra.18+0x3a2/0x3ce > > > [ 236.814442] handle_irq_event+0xa8/0x10a > > > [ 236.814449] handle_edge_irq+0x4b2/0x538 > > > [ 236.814458] handle_irq+0x3e/0x45 > > > [ 236.814465] do_IRQ+0x5c/0x126 > > > [ 236.814474] common_interrupt+0x7a/0x7a > > > [ 236.814478] > > > [ 236.814483] RIP: 0023:0xf79d3d82 > > > [ 236.814486] RSP: 002b:ffc588e8 EFLAGS: 00200282 ORIG_RAX: > > > ffdc > > > [ 236.814493] RAX: RBX: f7bebd5c RCX: > > > > > > [ 236.814496] RDX: 08d4197c RSI: RDI: > > > f746c020 > > > [ 236.814499] RBP: ffc588e8 R08: R09: > > > > > > [ 236.814503] R10: R11: 00200206 R12: > > > > > > [ 236.814506] R13: R14: R15: > > > > > > > > > [ 236.814513] Allocated by task 2082: > > > [ 236.814521] kasan_kmalloc.part.1+0x51/0xc7 > > > [ 236.814526] kmem_cache_alloc_trace+0x178/0x187 > > > [ 236.814540] xhci_segment_alloc.isra.11+0x9d/0x3bf [xhci_hcd] > > > [ 236.814553] xhci_alloc_segments_for_ring+0x9e/0x176 [xhci_hcd] > > > [ 236.814566]
Re: usb HC busted?
Hi Greg, On Tue, Jul 17, 2018 at 02:04:11PM +0200, Greg KH wrote: > On Tue, Jul 17, 2018 at 12:41:04PM +0100, Sudip Mukherjee wrote: > > Hi Mathias, > > > > On Sat, Jun 30, 2018 at 10:07:04PM +0100, Sudip Mukherjee wrote: > > > Hi Mathias, > > > > > > On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote: > > > > On 27.06.2018 14:59, Sudip Mukherjee wrote: > > > > > > > Can you share a bit more details on the platform you are using, > > > > > > > and what types of test you are running. > > > > > > > > > > > Then to track what is going on, I added the slub debugging and :( > > > I have attached part of dmesg for you to check. > > > Will appreciate your help in finding out the problem. > > > > I did some more debugging. Tested with a KASAN enabled kernel and that > > shows the problem. The report is attached. > > > > To my understanding: > > > > btusb_work() is calling usb_set_interface() with alternate = 0. which > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by > > xhci_free_endpoint_ring(). But then usb_set_interface() continues and > > calls usb_disable_interface() -> usb_hcd_flush_endpoint()->unlink1()-> > > xhci_urb_dequeue() which at the end gives the command to stop endpoint. > > > > In all the cycles I have tested I see that only in the fail case > > handle_cmd_completion() gets called, but in the cycles where the error > > is not there handle_cmd_completion() is not called with that command. > > > > I am not sure what is happening, and you are the best person to understand > > what is happening. :) > > > > But for now (untill you are back from holiday and suggest a proper > > solution), > > I made a hacky patch (attached) which is working and I donot get any > > corruption after that. Both KASAN and slub debug are also happy. > > > > So, now waiting for you to analyze what is going on and suggest a proper > > fix. > > > > Thanks in advance. 
> > > > -- > > Regards > > Sudip > > > [ 236.814156] > > == > > [ 236.814187] BUG: KASAN: use-after-free in xhci_trb_virt_to_dma+0x2e/0x74 > > [xhci_hcd] > > [ 236.814193] Read of size 8 at addr 8800789329c8 by task weston/138 > > > > [ 236.814203] CPU: 0 PID: 138 Comm: weston Tainted: G U W O > > 4.14.47-20180606+ #7 > > [ 236.814206] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017 > > [ 236.814209] Call Trace: > > [ 236.814214] > > [ 236.814226] dump_stack+0x46/0x59 > > [ 236.814238] print_address_description+0x6b/0x23b > > [ 236.814255] ? xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd] > > [ 236.814262] kasan_report+0x220/0x246 > > [ 236.814278] xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd] > > [ 236.814294] trb_in_td+0x3b/0x1cd [xhci_hcd] > > [ 236.814311] handle_cmd_completion+0x1181/0x2c9b [xhci_hcd] > > [ 236.814329] ? xhci_queue_new_dequeue_state+0x5d9/0x5d9 [xhci_hcd] > > [ 236.814337] ? drm_handle_vblank+0x4ec/0x590 > > [ 236.814352] xhci_irq+0x529/0x3294 [xhci_hcd] > > [ 236.814362] ? __accumulate_pelt_segments+0x24/0x33 > > [ 236.814378] ? finish_td.isra.40+0x223/0x223 [xhci_hcd] > > [ 236.814384] ? __accumulate_pelt_segments+0x24/0x33 > > [ 236.814390] ? __accumulate_pelt_segments+0x24/0x33 > > [ 236.814405] ? xhci_irq+0x3294/0x3294 [xhci_hcd] > > [ 236.814412] __handle_irq_event_percpu+0x149/0x3db > > [ 236.814421] handle_irq_event_percpu+0x65/0x109 > > [ 236.814428] ? __handle_irq_event_percpu+0x3db/0x3db > > [ 236.814436] ? 
ttwu_do_wakeup.isra.18+0x3a2/0x3ce > > [ 236.814442] handle_irq_event+0xa8/0x10a > > [ 236.814449] handle_edge_irq+0x4b2/0x538 > > [ 236.814458] handle_irq+0x3e/0x45 > > [ 236.814465] do_IRQ+0x5c/0x126 > > [ 236.814474] common_interrupt+0x7a/0x7a > > [ 236.814478] > > [ 236.814483] RIP: 0023:0xf79d3d82 > > [ 236.814486] RSP: 002b:ffc588e8 EFLAGS: 00200282 ORIG_RAX: > > ffdc > > [ 236.814493] RAX: RBX: f7bebd5c RCX: > > > > [ 236.814496] RDX: 08d4197c RSI: RDI: > > f746c020 > > [ 236.814499] RBP: ffc588e8 R08: R09: > > > > [ 236.814503] R10: R11: 00200206 R12: > > > > [ 236.814506] R13: R14: R15: > > > > > > [ 236.814513] Allocated by task 2082: > > [ 236.814521] kasan_kmalloc.part.1+0x51/0xc7 > > [ 236.814526] kmem_cache_alloc_trace+0x178/0x187 > > [ 236.814540] xhci_segment_alloc.isra.11+0x9d/0x3bf [xhci_hcd] > > [ 236.814553] xhci_alloc_segments_for_ring+0x9e/0x176 [xhci_hcd] > > [ 236.814566] xhci_ring_alloc.constprop.16+0x197/0x4ba [xhci_hcd] > > [ 236.814579] xhci_endpoint_init+0x77a/0x9ba [xhci_hcd] > > [ 236.814592] xhci_add_endpoint+0x3bc/0x43b [xhci_hcd] > > [ 236.814615] usb_hcd_alloc_bandwidth+0x7ef/0x857 [usbcore] > > [ 236.814637] usb_set_interface+0x294/0x681 [usbcore] >
Re: usb HC busted?
On Tue, Jul 17, 2018 at 12:41:04PM +0100, Sudip Mukherjee wrote: > Hi Mathias, > > On Sat, Jun 30, 2018 at 10:07:04PM +0100, Sudip Mukherjee wrote: > > Hi Mathias, > > > > On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote: > > > On 27.06.2018 14:59, Sudip Mukherjee wrote: > > > > > > Can you share a bit more details on the platform you are using, and > > > > > > what types of test you are running. > > > > > > > > Then to track what is going on, I added the slub debugging and :( > > I have attached part of dmesg for you to check. > > Will appreciate your help in finding out the problem. > > I did some more debugging. Tested with a KASAN enabled kernel and that > shows the problem. The report is attached. > > To my understanding: > > btusb_work() is calling usb_set_interface() with alternate = 0. which > again calls usb_hcd_alloc_bandwidth() and that frees the rings by > xhci_free_endpoint_ring(). But then usb_set_interface() continues and > calls usb_disable_interface() -> usb_hcd_flush_endpoint()->unlink1()-> > xhci_urb_dequeue() which at the end gives the command to stop endpoint. > > In all the cycles I have tested I see that only in the fail case > handle_cmd_completion() gets called, but in the cycles where the error > is not there handle_cmd_completion() is not called with that command. > > I am not sure what is happening, and you are the best person to understand > what is happening. :) > > But for now (untill you are back from holiday and suggest a proper solution), > I made a hacky patch (attached) which is working and I donot get any > corruption after that. Both KASAN and slub debug are also happy. > > So, now waiting for you to analyze what is going on and suggest a proper > fix. > > Thanks in advance. 
> > -- > Regards > Sudip > [ 236.814156] > == > [ 236.814187] BUG: KASAN: use-after-free in xhci_trb_virt_to_dma+0x2e/0x74 > [xhci_hcd] > [ 236.814193] Read of size 8 at addr 8800789329c8 by task weston/138 > > [ 236.814203] CPU: 0 PID: 138 Comm: weston Tainted: G U W O > 4.14.47-20180606+ #7 > [ 236.814206] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017 > [ 236.814209] Call Trace: > [ 236.814214] > [ 236.814226] dump_stack+0x46/0x59 > [ 236.814238] print_address_description+0x6b/0x23b > [ 236.814255] ? xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd] > [ 236.814262] kasan_report+0x220/0x246 > [ 236.814278] xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd] > [ 236.814294] trb_in_td+0x3b/0x1cd [xhci_hcd] > [ 236.814311] handle_cmd_completion+0x1181/0x2c9b [xhci_hcd] > [ 236.814329] ? xhci_queue_new_dequeue_state+0x5d9/0x5d9 [xhci_hcd] > [ 236.814337] ? drm_handle_vblank+0x4ec/0x590 > [ 236.814352] xhci_irq+0x529/0x3294 [xhci_hcd] > [ 236.814362] ? __accumulate_pelt_segments+0x24/0x33 > [ 236.814378] ? finish_td.isra.40+0x223/0x223 [xhci_hcd] > [ 236.814384] ? __accumulate_pelt_segments+0x24/0x33 > [ 236.814390] ? __accumulate_pelt_segments+0x24/0x33 > [ 236.814405] ? xhci_irq+0x3294/0x3294 [xhci_hcd] > [ 236.814412] __handle_irq_event_percpu+0x149/0x3db > [ 236.814421] handle_irq_event_percpu+0x65/0x109 > [ 236.814428] ? __handle_irq_event_percpu+0x3db/0x3db > [ 236.814436] ? 
ttwu_do_wakeup.isra.18+0x3a2/0x3ce > [ 236.814442] handle_irq_event+0xa8/0x10a > [ 236.814449] handle_edge_irq+0x4b2/0x538 > [ 236.814458] handle_irq+0x3e/0x45 > [ 236.814465] do_IRQ+0x5c/0x126 > [ 236.814474] common_interrupt+0x7a/0x7a > [ 236.814478] > [ 236.814483] RIP: 0023:0xf79d3d82 > [ 236.814486] RSP: 002b:ffc588e8 EFLAGS: 00200282 ORIG_RAX: > ffdc > [ 236.814493] RAX: RBX: f7bebd5c RCX: > > [ 236.814496] RDX: 08d4197c RSI: RDI: > f746c020 > [ 236.814499] RBP: ffc588e8 R08: R09: > > [ 236.814503] R10: R11: 00200206 R12: > > [ 236.814506] R13: R14: R15: > > > [ 236.814513] Allocated by task 2082: > [ 236.814521] kasan_kmalloc.part.1+0x51/0xc7 > [ 236.814526] kmem_cache_alloc_trace+0x178/0x187 > [ 236.814540] xhci_segment_alloc.isra.11+0x9d/0x3bf [xhci_hcd] > [ 236.814553] xhci_alloc_segments_for_ring+0x9e/0x176 [xhci_hcd] > [ 236.814566] xhci_ring_alloc.constprop.16+0x197/0x4ba [xhci_hcd] > [ 236.814579] xhci_endpoint_init+0x77a/0x9ba [xhci_hcd] > [ 236.814592] xhci_add_endpoint+0x3bc/0x43b [xhci_hcd] > [ 236.814615] usb_hcd_alloc_bandwidth+0x7ef/0x857 [usbcore] > [ 236.814637] usb_set_interface+0x294/0x681 [usbcore] > [ 236.814645] btusb_work+0x2e6/0x981 [btusb] > [ 236.814651] process_one_work+0x579/0x9e9 > [ 236.814656] worker_thread+0x68f/0x804 > [ 236.814662] kthread+0x31c/0x32b > [ 236.814668] ret_from_fork+0x35/0x40 > > [ 236.814672] Freed by task 1533: > [ 236.814678]
Re: usb HC busted?
Hi Mathias,

On Sat, Jun 30, 2018 at 10:07:04PM +0100, Sudip Mukherjee wrote:
> Hi Mathias,
>
> On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote:
> > On 27.06.2018 14:59, Sudip Mukherjee wrote:
> > > > > Can you share a bit more details on the platform you are using, and
> > > > > what types of test you are running.
> > > >
>
> Then to track what is going on, I added the slub debugging and :(
> I have attached part of dmesg for you to check.
> Will appreciate your help in finding out the problem.

I did some more debugging. Tested with a KASAN enabled kernel and that
shows the problem. The report is attached.

To my understanding:

btusb_work() is calling usb_set_interface() with alternate = 0, which
in turn calls usb_hcd_alloc_bandwidth(), and that frees the rings by
xhci_free_endpoint_ring(). But then usb_set_interface() continues and
calls usb_disable_interface() -> usb_hcd_flush_endpoint() -> unlink1() ->
xhci_urb_dequeue(), which at the end gives the command to stop the endpoint.

In all the cycles I have tested I see that only in the fail case
handle_cmd_completion() gets called, but in the cycles where the error
is not there handle_cmd_completion() is not called with that command.

I am not sure what is happening, and you are the best person to understand
what is happening. :)

But for now (until you are back from holiday and suggest a proper solution),
I made a hacky patch (attached) which is working and I do not get any
corruption after that. Both KASAN and slub debug are also happy.

So, now waiting for you to analyze what is going on and suggest a proper
fix.

Thanks in advance.

--
Regards
Sudip

[ 236.814156] ==
[ 236.814187] BUG: KASAN: use-after-free in xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
[ 236.814193] Read of size 8 at addr 8800789329c8 by task weston/138

[ 236.814203] CPU: 0 PID: 138 Comm: weston Tainted: G U W O 4.14.47-20180606+ #7
[ 236.814206] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017
[ 236.814209] Call Trace:
[ 236.814214]
[ 236.814226] dump_stack+0x46/0x59
[ 236.814238] print_address_description+0x6b/0x23b
[ 236.814255] ? xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
[ 236.814262] kasan_report+0x220/0x246
[ 236.814278] xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
[ 236.814294] trb_in_td+0x3b/0x1cd [xhci_hcd]
[ 236.814311] handle_cmd_completion+0x1181/0x2c9b [xhci_hcd]
[ 236.814329] ? xhci_queue_new_dequeue_state+0x5d9/0x5d9 [xhci_hcd]
[ 236.814337] ? drm_handle_vblank+0x4ec/0x590
[ 236.814352] xhci_irq+0x529/0x3294 [xhci_hcd]
[ 236.814362] ? __accumulate_pelt_segments+0x24/0x33
[ 236.814378] ? finish_td.isra.40+0x223/0x223 [xhci_hcd]
[ 236.814384] ? __accumulate_pelt_segments+0x24/0x33
[ 236.814390] ? __accumulate_pelt_segments+0x24/0x33
[ 236.814405] ? xhci_irq+0x3294/0x3294 [xhci_hcd]
[ 236.814412] __handle_irq_event_percpu+0x149/0x3db
[ 236.814421] handle_irq_event_percpu+0x65/0x109
[ 236.814428] ? __handle_irq_event_percpu+0x3db/0x3db
[ 236.814436] ? ttwu_do_wakeup.isra.18+0x3a2/0x3ce
[ 236.814442] handle_irq_event+0xa8/0x10a
[ 236.814449] handle_edge_irq+0x4b2/0x538
[ 236.814458] handle_irq+0x3e/0x45
[ 236.814465] do_IRQ+0x5c/0x126
[ 236.814474] common_interrupt+0x7a/0x7a
[ 236.814478]
[ 236.814483] RIP: 0023:0xf79d3d82
[ 236.814486] RSP: 002b:ffc588e8 EFLAGS: 00200282 ORIG_RAX: ffdc
[ 236.814493] RAX: RBX: f7bebd5c RCX:
[ 236.814496] RDX: 08d4197c RSI: RDI: f746c020
[ 236.814499] RBP: ffc588e8 R08: R09:
[ 236.814503] R10: R11: 00200206 R12:
[ 236.814506] R13: R14: R15:

[ 236.814513] Allocated by task 2082:
[ 236.814521] kasan_kmalloc.part.1+0x51/0xc7
[ 236.814526] kmem_cache_alloc_trace+0x178/0x187
[ 236.814540] xhci_segment_alloc.isra.11+0x9d/0x3bf [xhci_hcd]
[ 236.814553] xhci_alloc_segments_for_ring+0x9e/0x176 [xhci_hcd]
[ 236.814566] xhci_ring_alloc.constprop.16+0x197/0x4ba [xhci_hcd]
[ 236.814579] xhci_endpoint_init+0x77a/0x9ba [xhci_hcd]
[ 236.814592] xhci_add_endpoint+0x3bc/0x43b [xhci_hcd]
[ 236.814615] usb_hcd_alloc_bandwidth+0x7ef/0x857 [usbcore]
[ 236.814637] usb_set_interface+0x294/0x681 [usbcore]
[ 236.814645] btusb_work+0x2e6/0x981 [btusb]
[ 236.814651] process_one_work+0x579/0x9e9
[ 236.814656] worker_thread+0x68f/0x804
[ 236.814662] kthread+0x31c/0x32b
[ 236.814668] ret_from_fork+0x35/0x40

[ 236.814672] Freed by task 1533:
[ 236.814678] kasan_slab_free+0xb3/0x15e
[ 236.814683] kfree+0x103/0x1a9
[ 236.814696] xhci_ring_free+0x205/0x286 [xhci_hcd]
[ 236.814709] xhci_free_endpoint_ring+0x4d/0x83 [xhci_hcd]
[ 236.814722] xhci_check_bandwidth+0x57b/0x65a [xhci_hcd]
[ 236.814743] usb_hcd_alloc_bandwidth+0x665/0x857 [usbcore]
[
Re: usb HC busted?
Hi Mathias, On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote: > On 27.06.2018 14:59, Sudip Mukherjee wrote: > > > > Can you share a bit more details on the platform you are using, and > > > > what types of test you are running. > > > > > > It is a board based on "Intel(R) Atom(TM) CPU E3840 @ 1.91GHz". > > > The usb device in question is a bluetooth device: > > > > > > > There is however freeing of the same dma address: > > <...>-28448 [003] 492.025808: xhci_ring_free: ISOC f1ffb700: enq > 0x2d31bcc0(0x2d31b000) deq > 0x2d31b000(0x2d31b000) segs 2 stream 0 free_trbs 305 bounce > 17 cycle 0 > <...>-28448 [003] 492.025818: xhci_ring_mem_detail: MATTU xhci segment > free seg->dma @ 0x2d31b000 > <...>-28448 [003] 492.025823: xhci_ring_mem_detail: MATTU xhci segment > free seg->dma @ 0x2d31b000 > <...>-28448 [003] 492.025826: xhci_ring_free: ISOC f1f9b380: enq > 0x2d31b140(0x2d31b000) deq > 0x2d31b000(0x2d31b000) segs 2 stream 0 free_trbs 489 bounce > 17 cycle 1 > <...>-28448 [003] 492.025828: xhci_ring_mem_detail: MATTU xhci segment > free seg->dma @ 0x2d31b000 > <...>-28448 [003] 492.025830: xhci_ring_mem_detail: MATTU xhci segment > free seg->dma @ 0x2d31b000 > > I'd guess it's still the same cause, maybe trace is not complete?

It is either multiple frees of the same address, multiple allocations of the same address, or a combination of both. To track the multiple allocations I added some extra debugging, and it seems that they only happen when someone accesses that memory and sets the first 4 bytes (which hold the offset data) to 0. I have not yet checked under what conditions it tries to free the same address more than once.

Then, to track what is going on, I added the slub debugging and :(
I have attached part of dmesg for you to check.
Will appreciate your help in finding out the problem.
-- Regards Sudip [ 383.096204] = [ 383.096212] BUG kmalloc-96 (Tainted: G U O ): Poison overwritten [ 383.096213] - [ 383.096215] Disabling lock debugging due to kernel taint [ 383.096218] INFO: 0xdccd1b78-0xdccd1b7f. First byte 0x78 instead of 0x6b [ 383.096232] INFO: Allocated in xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd] age=227516 cpu=2 pid=21 [ 383.096240] ___slab_alloc.constprop.24+0x1fc/0x292 [ 383.096243] __slab_alloc.isra.18.constprop.23+0x1c/0x25 [ 383.096246] kmem_cache_alloc_trace+0x78/0x141 [ 383.096252] xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd] [ 383.096259] xhci_endpoint_init+0x25f/0x30a [xhci_hcd] [ 383.096265] xhci_add_endpoint+0x126/0x149 [xhci_hcd] [ 383.096276] usb_hcd_alloc_bandwidth+0x26a/0x2a0 [usbcore] [ 383.096287] usb_set_interface+0xeb/0x25d [usbcore] [ 383.096292] btusb_work+0xeb/0x324 [btusb] [ 383.096296] process_one_work+0x163/0x2b2 [ 383.096299] worker_thread+0x1a9/0x25c [ 383.096301] kthread+0xf8/0xfd [ 383.096306] ret_from_fork+0x2e/0x38 [ 383.096314] INFO: Freed in xhci_ring_free+0xa7/0xc6 [xhci_hcd] age=197020 cpu=0 pid=324 [ 383.096317] __slab_free+0x4b/0x27a [ 383.096319] kfree+0x12e/0x155 [ 383.096325] xhci_ring_free+0xa7/0xc6 [xhci_hcd] [ 383.096331] xhci_free_endpoint_ring+0x16/0x20 [xhci_hcd] [ 383.096338] xhci_check_bandwidth+0x1bf/0x20e [xhci_hcd] [ 383.096348] usb_hcd_alloc_bandwidth+0x205/0x2a0 [usbcore] [ 383.096358] usb_set_interface+0xeb/0x25d [usbcore] [ 383.096361] btusb_work+0x228/0x324 [btusb] [ 383.096364] process_one_work+0x163/0x2b2 [ 383.096367] worker_thread+0x1a9/0x25c [ 383.096370] kthread+0xf8/0xfd [ 383.096373] ret_from_fork+0x2e/0x38 [ 383.096376] INFO: Slab 0xf457e080 objects=29 used=29 fp=0x (null) flags=0x40008100 [ 383.096379] INFO: Object 0xdccd1b60 @offset=7008 fp=0xdccd0350 [ 383.096383] Redzone dccd1b58: bb bb bb bb bb bb bb bb [ 383.096386] Object dccd1b60: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 383.096388] Object dccd1b70: 6b 6b 6b 6b 6b 6b 6b 6b 78 1b cd dc 78 1b 
cd dc x...x... [ 383.096390] Object dccd1b80: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 383.096393] Object dccd1b90: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 383.096395] Object dccd1ba0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b [ 383.096397] Object dccd1bb0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkk. [ 383.096400] Redzone dccd1bc0: bb bb bb bb [ 383.096402] Padding dccd1c68: 5a 5a 5a 5a 5a 5a 5a 5a [ 383.096407] CPU: 2 PID: 133 Comm: weston Tainted: GBU O
Re: usb HC busted?
On 27.06.2018 14:59, Sudip Mukherjee wrote: Can you share a bit more details on the platform you are using, and what types of test you are running. It is a board based on "Intel(R) Atom(TM) CPU E3840 @ 1.91GHz". The usb device in question is a bluetooth device: Bus 001 Device 012: ID 8087:07dc Intel Corp. And the problem that we are seeing is with phone calls via bluetooth. Does my test above trigger the case? (show "MATTU dmatest match!") I have kept it for tonight, will see the results tomorrow morning. And I am using that same device in the usb script to change "authrized". No, your test did not trigger the error. :( But, my last night's test (with an added debug to get some extra trace for addresses) showed the same error of - "Looking for event-dma", but looking at the ftrace, I could not see it getting same address from dma_pool_zalloc(). Can you please have a look at the dmesg and ftrace at: https://drive.google.com/open?id=1nMy_qVxOQzcZNYa9bw7az9WiS2MZzdKo There is however freeing of the same dma address: <...>-28448 [003] 492.025808: xhci_ring_free: ISOC f1ffb700: enq 0x2d31bcc0(0x2d31b000) deq 0x2d31b000(0x2d31b000) segs 2 stream 0 free_trbs 305 bounce 17 cycle 0 <...>-28448 [003] 492.025818: xhci_ring_mem_detail: MATTU xhci segment free seg->dma @ 0x2d31b000 <...>-28448 [003] 492.025823: xhci_ring_mem_detail: MATTU xhci segment free seg->dma @ 0x2d31b000 <...>-28448 [003] 492.025826: xhci_ring_free: ISOC f1f9b380: enq 0x2d31b140(0x2d31b000) deq 0x2d31b000(0x2d31b000) segs 2 stream 0 free_trbs 489 bounce 17 cycle 1 <...>-28448 [003] 492.025828: xhci_ring_mem_detail: MATTU xhci segment free seg->dma @ 0x2d31b000 <...>-28448 [003] 492.025830: xhci_ring_mem_detail: MATTU xhci segment free seg->dma @ 0x2d31b000 I'd guess it's still the same cause, maybe trace is not complete? -Mathias ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
On Wed, Jun 27, 2018 at 12:59:48PM +0100, Sudip Mukherjee wrote: > Hi Mathias, > > On Mon, Jun 25, 2018 at 05:15:00PM +0100, Sudip Mukherjee wrote: > > Hi Mathias, > > > > On Thu, Jun 21, 2018 at 02:01:30PM +0300, Mathias Nyman wrote: > > > On 21.06.2018 03:53, Sudip Mukherjee wrote: > > > > Hi Mathias, Andy, > > > > > > > > On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote: > > > > > On 06.06.2018 19:45, Sudip Mukherjee wrote: > > > > > > > Can you share a bit more details on the platform you are using, and what > > > types of test you are running. > > > > Sorry for the delayed reply, I was in Tokyo for the OSS. > > > > It is a board based on "Intel(R) Atom(TM) CPU E3840 @ 1.91GHz". > > The usb device in question is a bluetooth device: > > > > Bus 001 Device 012: ID 8087:07dc Intel Corp. > > > > > And the problem that we are seeing is with phone calls via bluetooth. > > > > > Does my test above trigger the case? (show "MATTU dmatest match!") > > > > I have kept it for tonight, will see the results tomorrow morning. > > And I am using that same device in the usb script to change "authrized". > > No, your test did not trigger the error. :( > > But, my last night's test (with an added debug to get some extra trace for > addresses) showed the same error of - > "Looking for event-dma", but looking at the ftrace, I could not see it > getting same address from dma_pool_zalloc(). 
> > Can you please have a look at the dmesg and ftrace at: > https://drive.google.com/open?id=1nMy_qVxOQzcZNYa9bw7az9WiS2MZzdKo And to add to my previous mail, in another cycle where I do see the same problem and my extra debugs give the following: <...>-23974 [002] 495.991276: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d21c000 <...>-23974 [002] 495.991285: xhci_ring_mem_detail: SUDIP page details dma=0x2d21c000, vaddr=ed21c000, inuse=1, offset=0 <...>-23974 [002] 495.991289: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d21c000 <...>-23974 [002] 495.991292: xhci_ring_mem_detail: SUDIP page details dma=0x2d21c000, vaddr=ed21c000, inuse=2, offset=0 <...>-23974 [002] 495.991295: xhci_ring_alloc: ISOC f0b62900: enq 0x2d21c000(0x2d21c000) deq 0x2d21c000(0x2d21c000) segs 2 stream 0 free_trbs 509 bounce 17 cycle 1 <...>-23974 [002] 495.991298: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d21c000 <...>-23974 [002] 495.991301: xhci_ring_mem_detail: SUDIP page details dma=0x2d21c000, vaddr=ed21c000, inuse=3, offset=0 <...>-23974 [002] 495.991304: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d21c000 <...>-23974 [002] 495.991306: xhci_ring_mem_detail: SUDIP page details dma=0x2d21c000, vaddr=ed21c000, inuse=4, offset=0 I am totally lost now. Are we looking at two different issues? This log shows same addresses, my previous mail and log did not show the same addresses. :( -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi Mathias, On Mon, Jun 25, 2018 at 05:15:00PM +0100, Sudip Mukherjee wrote: > Hi Mathias, > > On Thu, Jun 21, 2018 at 02:01:30PM +0300, Mathias Nyman wrote: > > On 21.06.2018 03:53, Sudip Mukherjee wrote: > > > Hi Mathias, Andy, > > > > > > On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote: > > > > On 06.06.2018 19:45, Sudip Mukherjee wrote: > > > > Can you share a bit more details on the platform you are using, and what > > types of test you are running. > > Sorry for the delayed reply, I was in Tokyo for the OSS. > > It is a board based on "Intel(R) Atom(TM) CPU E3840 @ 1.91GHz". > The usb device in question is a bluetooth device: > > Bus 001 Device 012: ID 8087:07dc Intel Corp. > > And the problem that we are seeing is with phone calls via bluetooth. > > > Does my test above trigger the case? (show "MATTU dmatest match!") > > I have kept it for tonight, will see the results tomorrow morning. > And I am using that same device in the usb script to change "authrized". No, your test did not trigger the error. :( But, my last night's test (with an added debug to get some extra trace for addresses) showed the same error of - "Looking for event-dma", but looking at the ftrace, I could not see it getting same address from dma_pool_zalloc(). Can you please have a look at the dmesg and ftrace at: https://drive.google.com/open?id=1nMy_qVxOQzcZNYa9bw7az9WiS2MZzdKo -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi Mathias, On Thu, Jun 21, 2018 at 02:01:30PM +0300, Mathias Nyman wrote: > On 21.06.2018 03:53, Sudip Mukherjee wrote: > > Hi Mathias, Andy, > > > > On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote: > > > On 06.06.2018 19:45, Sudip Mukherjee wrote: > > git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git dmapool-test > https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=dmapool-test > > Tested by just leaving the following running for a few days: > > while true; do echo 0 > authorized; sleep 3; echo 1 > authorized; sleep 3; > done; > For some usb device (for example: /sys/bus/usb/devices/1-8) > > Then grep logs for "MATTU dmatest match! " > > Can you share a bit more details on the platform you are using, and what > types of test you are running.

Sorry for the delayed reply, I was in Tokyo for the OSS.

It is a board based on "Intel(R) Atom(TM) CPU E3840 @ 1.91GHz".
The usb device in question is a bluetooth device:

Bus 001 Device 012: ID 8087:07dc Intel Corp.
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass          224 Wireless
  bDeviceSubClass         1 Radio Frequency
  bDeviceProtocol         1 Bluetooth
  bMaxPacketSize0        64
  idVendor           0x8087 Intel Corp.
  idProduct          0x07dc
  bcdDevice            0.01
  iManufacturer           0
  iProduct                0
  iSerial                 0
  bNumConfigurations      1

And the problem that we are seeing is with phone calls via bluetooth.

> Does my test above trigger the case? (show "MATTU dmatest match!")

I have kept it running for tonight, will see the results tomorrow morning. And I am using that same device in the usb script to change "authorized".

But looking at the code for dma_pool_alloc(), it seems 'dma' can have the same value again only if "*(int *)(page->vaddr + offset)" gets a value of 0 in pool_initialise_page(). But I can't think of any way it can be 0. I have also added some more debugging in the kernel to see what might be going wrong there.
-- Regards Sudip
Re: usb HC busted?
On 21.06.2018 03:53, Sudip Mukherjee wrote: Hi Mathias, Andy, On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote: On 06.06.2018 19:45, Sudip Mukherjee wrote: Hi Andy, And we meet again. :) On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote: On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote: On 04.06.2018 18:28, Sudip Mukherjee wrote: On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote: Odd and unlikely, but to me this looks like some issue in allocating dma memory from pool using dma_pool_zalloc() Here's the story: Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel. All tracing points to dma_pool_zalloc() returning the same dma address block on consecutive calls. In the failing case dma_pool_zalloc() is called 3 - 6us apart. <...>-26362 [002] 1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000 <...>-26362 [002] 1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000 <...>-26362 [002] 1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000 dma_pool_zalloc() is called from xhci_segment_alloc() in drivers/usb/host/xhci-mem.c see: https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci- mem.c#L52 prints above are custom traces added right after dma_pool_zalloc() For better understanding it would be good to have dma_pool_free() calls debugged as well. Sudip has a full (394M unpacked) trace at: https://drive.google.com/open?id=1h-3r-1lfjg8oblBGkzdRIq8z3ZNgGZx- But then it gets stuck, for the whole ring2 dma_pool_zalloc() just returns the same dma address as the last segment for ring1:0x2d92b000. Last part of trace snippet is just another ring being freed. A gentle ping on this. Any idea on what the problem might be and any possible fix? I tried to reproduce it by quickly hacking xhci to allocate and free 50 segments each time we normally allocate one segment from dmapool. 
I let it run for 3 days on a Atom based platform, but could not reproduce it. xhci testhack can be found here: git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git dmapool-test https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=dmapool-test Tested by just leaving the following running for a few days: while true; do echo 0 > authorized; sleep 3; echo 1 > authorized; sleep 3; done; For some usb device (for example: /sys/bus/usb/devices/1-8) Then grep logs for "MATTU dmatest match! " Can you share a bit more details on the platform you are using, and what types of test you are running. Does my test above trigger the case? (show "MATTU dmatest match!") -Mathias ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
Hi Mathias, Andy, On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote: > On 06.06.2018 19:45, Sudip Mukherjee wrote: > > Hi Andy, > > > > And we meet again. :) > > > > On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote: > > > On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote: > > > > On 04.06.2018 18:28, Sudip Mukherjee wrote: > > > > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote: > > > > > > > > > > > > > Odd and unlikely, but to me this looks like some issue in allocating > > > > dma memory > > > > from pool using dma_pool_zalloc() > > > > > > > > > > > > Here's the story: > > > > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel. > > > > All tracing points to dma_pool_zalloc() returning the same dma address > > > > block on > > > > consecutive calls. > > > > > > > > In the failing case dma_pool_zalloc() is called 3 - 6us apart. > > > > > > > > <...>-26362 [002] 1186.756739: xhci_ring_mem_detail: MATTU > > > > xhci_segment_alloc dma @ 0x2d92b000 > > > > <...>-26362 [002] 1186.756745: xhci_ring_mem_detail: MATTU > > > > xhci_segment_alloc dma @ 0x2d92b000 > > > > <...>-26362 [002] 1186.756748: xhci_ring_mem_detail: MATTU > > > > xhci_segment_alloc dma @ 0x2d92b000 > > > > > > > > dma_pool_zalloc() is called from xhci_segment_alloc() in > > > > drivers/usb/host/xhci-mem.c > > > > see: > > > > https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci- > > > > mem.c#L52 > > > > > > > > prints above are custom traces added right after dma_pool_zalloc() > > > > > > For better understanding it would be good to have dma_pool_free() calls > > > debugged as well. > > > > Sudip has a full (394M unpacked) trace at: > https://drive.google.com/open?id=1h-3r-1lfjg8oblBGkzdRIq8z3ZNgGZx- > > But then it gets stuck, for the whole ring2 dma_pool_zalloc() just returns > the same dma address as the last segment for > ring1:0x2d92b000. Last part of trace snippet is just another ring being freed. 
A gentle ping on this. Any idea on what the problem might be and any possible fix? -- regards Sudip
Re: usb HC busted?
Hi All, On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote: > On 06.06.2018 19:45, Sudip Mukherjee wrote: > > Hi Andy, > > > > And we meet again. :) > > > > On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote: > > > On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote: > > > > On 04.06.2018 18:28, Sudip Mukherjee wrote: > > > > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote: > > > > > > > > > > > > > Odd and unlikely, but to me this looks like some issue in allocating > > > > dma memory > > > > from pool using dma_pool_zalloc() > > > > > > > > Adding people with DMA knowledge to cc, maybe someone knows what is > > > > going on. > > > > > > > > Here's the story: > > > > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel. > > > > All tracing points to dma_pool_zalloc() returning the same dma address > > > > block on > > > > consecutive calls. We have started testing with v4.14.47 now and we are seeing the issue with it also. :( -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
On 06.06.2018 19:45, Sudip Mukherjee wrote: Hi Andy, And we meet again. :) On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote: On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote: On 04.06.2018 18:28, Sudip Mukherjee wrote: On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote: Odd and unlikely, but to me this looks like some issue in allocating dma memory from pool using dma_pool_zalloc() Adding people with DMA knowledge to cc, maybe someone knows what is going on. Here's the story: Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel. All tracing points to dma_pool_zalloc() returning the same dma address block on consecutive calls. In the failing case dma_pool_zalloc() is called 3 - 6us apart. <...>-26362 [002] 1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000 <...>-26362 [002] 1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000 <...>-26362 [002] 1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000 dma_pool_zalloc() is called from xhci_segment_alloc() in drivers/usb/host/xhci-mem.c see: https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci- mem.c#L52 prints above are custom traces added right after dma_pool_zalloc() For better understanding it would be good to have dma_pool_free() calls debugged as well. So, I am adding another trace event for dma_pool_free() and continuing with the test. Is there anything else that I should be adding as debug? The patch traced both dma_pool_zalloc() and dma_pool_free() calls from xhci, no need to retry. 
Sudip has a full (394M unpacked) trace at: https://drive.google.com/open?id=1h-3r-1lfjg8oblBGkzdRIq8z3ZNgGZx- Interesting part is: <...>-26362 [002] 1186.756728: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d34d000 <...>-26362 [002] 1186.756735: xhci_ring_mem_detail: MATTU xhci segment alloc seg->dma @ 0x2d34d000 <...>-26362 [002] 1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000 <...>-26362 [002] 1186.756740: xhci_ring_mem_detail: MATTU xhci segment alloc seg->dma @ 0x2d92b000 <...>-26362 [002] 1186.756743: xhci_ring_alloc: ISOC eefa0580: enq 0x2d34d000(0x2d34d000) deq 0x2d34d000(0x2d34d000) segs 2 stream 0 free_trbs 509 bounce 17 cycle 1 <...>-26362 [002] 1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000 <...>-26362 [002] 1186.756746: xhci_ring_mem_detail: MATTU xhci segment alloc seg->dma @ 0x2d92b000 <...>-26362 [002] 1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000 <...>-26362 [002] 1186.756751: xhci_ring_mem_detail: MATTU xhci segment alloc seg->dma @ 0x2d92b000 <...>-26362 [002] 1186.756752: xhci_ring_alloc: ISOC f19d7c80: enq 0x2d92b000(0x2d92b000) deq 0x2d92b000(0x2d92b000) segs 2 stream 0 free_trbs 509 bounce 17 cycle 1 <...>-26362 [002] d..1 1186.756761: xhci_queue_trb: CMD: Configure Endpoint Command: ctx 2ce96000 slot 7 flags d:C <...>-26362 [002] d..1 1186.756762: xhci_inc_enq: CMD ed930b80: enq 0x2d93adb0(0x2d93a000) deq 0x2d93ada0(0x2d93a000) segs 1 stream 0 free_trbs 253 bounce 0 \ cycle 1 <...>-26362 [002] 1186.757066: xhci_dbg_context_change: Successful Endpoint Configure command <...>-26362 [002] 1186.757072: xhci_ring_free: ISOC eefd9380: enq 0x2c482000(0x2c482000) deq 0x2c482000(0x2c482000) segs 2 stream 0 free_trbs 509 bounce0 cycle 1 <...>-26362 [002] 1186.757075: xhci_ring_mem_detail: MATTU xhci segment free seg->dma @ ee2d23c8 <...>-26362 [002] 1186.757078: xhci_ring_mem_detail: MATTU xhci segment free seg->dma @ c7a93488 <...>-26362 [002] 
1186.757080: xhci_ring_free: ISOC eef0d800: enq 0x2c50a000(0x2c50a000) deq 0x2c50a000(0x2c50a000) segs 2 stream 0 free_trbs 509 bounce0 cycle 1 What is shown is the allocation of two ISOC transfer rings, each ring has 2 segments (two dma_pool_zalloc() calls per ring) First ring looks normal, ring1 get dma memory at 0x2d34d000 for first ring segment, and dma memory at 0x2d92b000 for second segment. But then it gets stuck, for the whole ring2 dma_pool_zalloc() just returns the same dma address as the last segment for ring1:0x2d92b000. Last part of trace snippet is just another ring being freed. Full testpatch looked like this: diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c index e5ace89..7d343ad 100644 --- a/drivers/usb/host/xhci-mem.c +++ b/drivers/usb/host/xhci-mem.c @@ -44,10 +44,15 @@ static struct xhci_segment *xhci_segment_alloc(struct xhci_hcd *xhci, return NULL; } + xhci_dbg_trace(xhci, trace_xhci_ring_mem_detail, +
Re: usb HC busted?
Hi Andy, And we meet again. :) On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote: > On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote: > > On 04.06.2018 18:28, Sudip Mukherjee wrote: > > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote: > > > > > > > Odd and unlikely, but to me this looks like some issue in allocating > > dma memory > > from pool using dma_pool_zalloc() > > > > Adding people with DMA knowledge to cc, maybe someone knows what is > > going on. > > > > Here's the story: > > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel. > > All tracing points to dma_pool_zalloc() returning the same dma address > > block on > > consecutive calls. > > > > In the failing case dma_pool_zalloc() is called 3 - 6us apart. > > > > <...>-26362 [002] 1186.756739: xhci_ring_mem_detail: MATTU > > xhci_segment_alloc dma @ 0x2d92b000 > > <...>-26362 [002] 1186.756745: xhci_ring_mem_detail: MATTU > > xhci_segment_alloc dma @ 0x2d92b000 > > <...>-26362 [002] 1186.756748: xhci_ring_mem_detail: MATTU > > xhci_segment_alloc dma @ 0x2d92b000 > > > > dma_pool_zalloc() is called from xhci_segment_alloc() in > > drivers/usb/host/xhci-mem.c > > see: > > https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci- > > mem.c#L52 > > > > prints above are custom traces added right after dma_pool_zalloc() > > For better understanding it would be good to have dma_pool_free() calls > debugged as well. So, I am adding another trace event for dma_pool_free() and continuing with the test. Is there anything else that I should be adding as debug? -- Regards Sudip ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
On Wed, Jun 06, 2018 at 05:12:21PM +0300, Mathias Nyman wrote: > On 04.06.2018 18:28, Sudip Mukherjee wrote: > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote: > > > > > > > Will request you to have a look at it. > > > > Odd and unlikely, but to me this looks like some issue in allocating dma > memory > from pool using dma_pool_zalloc() > > Adding people with DMA knowledge to cc, maybe someone knows what is going on. Thanks Mathias. -- Regards Sudip
Re: usb HC busted?
On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote: > On 04.06.2018 18:28, Sudip Mukherjee wrote: > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote: > > > > Odd and unlikely, but to me this looks like some issue in allocating > dma memory > from pool using dma_pool_zalloc() > > Adding people with DMA knowledge to cc, maybe someone knows what is > going on. > > Here's the story: > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel. > All tracing points to dma_pool_zalloc() returning the same dma address > block on > consecutive calls. > > In the failing case dma_pool_zalloc() is called 3 - 6us apart. > > <...>-26362 [002] 1186.756739: xhci_ring_mem_detail: MATTU > xhci_segment_alloc dma @ 0x2d92b000 > <...>-26362 [002] 1186.756745: xhci_ring_mem_detail: MATTU > xhci_segment_alloc dma @ 0x2d92b000 > <...>-26362 [002] 1186.756748: xhci_ring_mem_detail: MATTU > xhci_segment_alloc dma @ 0x2d92b000 > > dma_pool_zalloc() is called from xhci_segment_alloc() in > drivers/usb/host/xhci-mem.c > see: > https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci- > mem.c#L52 > > prints above are custom traces added right after dma_pool_zalloc() For better understanding it would be good to have dma_pool_free() calls debugged as well. Is it possible that something in parallel just fast enough to free the allocated resource from pool? -- Andy Shevchenko Intel Finland Oy ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: usb HC busted?
On 04.06.2018 18:28, Sudip Mukherjee wrote:
On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:

Logs show two rings having the same TRB segment dma address; this will completely mess up the transfer:

While allocating rings, the enqueue pointers for the two rings are the same:

461.859315: xhci_ring_alloc: ISOC efa4e580: enq 0x33386000(0x33386000) deq 0x33386000(0x33386000) segs 2 stream 0 ...bs
461.859320: xhci_ring_alloc: ISOC f0ce1f00: enq 0x33386000(0x33386000) deq 0x33386000(0x33386000) segs 2 stream 0 ...

So something goes really wrong when allocating or setting up the rings in one of these functions:

To verify and rule out dma_pool_zalloc(), could you apply the attached patch and reproduce with new logs?

I spoke too soon in yesterday's mail. We were able to reproduce it on the automated tests. The log and the trace are at:
https://drive.google.com/open?id=1h-3r-1lfjg8oblBGkzdRIq8z3ZNgGZx-

Will request you to have a look at it.

Odd and unlikely, but to me this looks like some issue in allocating dma memory from pool using dma_pool_zalloc()

Adding people with DMA knowledge to cc, maybe someone knows what is going on.

Here's the story:
Sudip sees usb issues on an Intel Atom based board with 4.14.2 kernel.
All tracing points to dma_pool_zalloc() returning the same dma address block on consecutive calls.

In the failing case dma_pool_zalloc() is called 3 - 6us apart.

<...>-26362 [002] 1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
<...>-26362 [002] 1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
<...>-26362 [002] 1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000

dma_pool_zalloc() is called from xhci_segment_alloc() in drivers/usb/host/xhci-mem.c
see:
https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci-mem.c#L52

prints above are custom traces added right after dma_pool_zalloc()

@@ -44,10 +44,15 @@ static struct xhci_segment *xhci_segment_alloc(struct xhci_hcd *xhci,
 		return NULL;
 	}
 
+	xhci_dbg_trace(xhci, trace_xhci_ring_mem_detail,
+			"MATTU xhci_segment_alloc dma @ %pad", &dma);
+

Any idea what's going on? dma_pool_alloc() has a comment that it drops &pool->lock if it needs to allocate a page, can it be related?

Thanks
-Mathias