Re: usb HC busted?

2018-07-21 Thread Sudip Mukherjee
Hi Mathias,

On Fri, Jul 20, 2018 at 01:54:21PM +0100, Sudip Mukherjee wrote:
> Hi Mathias,
> 
> On Fri, Jul 20, 2018 at 02:10:58PM +0300, Mathias Nyman wrote:
> > On 19.07.2018 20:32, Sudip Mukherjee wrote:
> > > Hi Mathias,
> > > 
> > > On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote:
> > > > > > > As first aid I could try to implement checks that make sure the
> > > > > > > flushed URBs' trb pointers really are on the current endpoint
> > > > > > > ring, and also add some warning if we are dropping endpoints
> > > > > > > with URBs still queued.
> > > > > 
> > > > > Yes, please. I think your first-aid will be a much better option than
> > > > > the hacky patch I am using atm.
> > > > > 
> > > > 
> 
> > So poison is overwritten at e5acda58 with almost its own address
> > (reading backwards: e5 ac da 60), twice.
> > Looks like something (32-bit?) is pointing to itself twice, maybe a
> > linked-list node's next and prev pointers being set to point to itself
> > as the last item was removed from the list.
> >
> > The cancelled_td_list is part of struct xhci_virt_ep, so that should be
> > fine. But td_list is part of struct xhci_ring, which was freed, and we
> > removed the URB's TDs from the td_list when flushing the ring after the
> > ring was freed.
> >
> > I changed the patch (attached) to make sure it doesn't touch the td_list
> > when canceling a URB after the ring is freed.
> > 
> > How about this one, any improvements?
> 
> Yes, it worked. :D
> 
> So, cycle-1 = no change, just to make sure I can still reproduce the error.
> cycle-2 and cycle-3 with your patch, and there was no problem,
> slub debug was also happy.
> I am starting an autotest with this patch now, and I will have almost
> 50 cycles tested by tomorrow morning.

I can confirm that your band-aid patch has worked. A total of 67 cycles
have been tested so far with no errors. It's continuing to test over the
weekend.
Thank you very much for this one. :)

I guess you will start with the proper fix, that you and Alan had been
discussing, after you are fully back to work.

--
Regards
Sudip
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-07-20 Thread Alan Stern
On Fri, 20 Jul 2018, Mathias Nyman wrote:

> >> But we need to fix this properly as well.
> >> xhci needs to be more in sync with usb core in usb_set_interface();
> >> currently xhci has the altsetting up and running when usb core hasn't
> >> even started flushing endpoints.
> > 
> > Absolutely.  The core tries to be compatible with host controller
> > drivers that either allocate bandwidth as it is requested or else
> > allocate bandwidth all at once when an altsetting is installed.
> > 
> > xhci-hcd falls into the second category.  However, this approach
> > requires the bandwidth verification for the new altsetting to be
> > performed before the old altsetting has been disabled, and the xHCI
> > hardware can't do this.
> > 
> > We may need to change the core so that the old endpoints are disabled
> > before the bandwidth check is done, instead of after.  Of course, this
> > leads to an awkward situation if the check fails -- we'd probably have
> > to go back and re-install the old altsetting.
> 
> That would help xhci a lot.
> 
> If we want to avoid the awkward altsetting re-install after bandwidth failure
> then adding an extra endpoint flush before checking the bandwidth would
> already help a lot.
>
> The endpoint disabling can then remain after bandwidth checking.
> Does that work for other host controllers?

As far as I know, the other host controller drivers don't really care 
how this is done.  xHCI is the only technology where the hardware has 
to verify the bandwidth requirements.  (Maybe some other SuperSpeed 
controller design also cares, but if so then this change is unlikely to 
hurt.)

Alan Stern



Re: usb HC busted?

2018-07-20 Thread Sudip Mukherjee
Hi Mathias,

On Fri, Jul 20, 2018 at 02:10:58PM +0300, Mathias Nyman wrote:
> On 19.07.2018 20:32, Sudip Mukherjee wrote:
> > Hi Mathias,
> > 
> > On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote:
> > > > > > As first aid I could try to implement checks that make sure the
> > > > > > flushed URBs' trb pointers really are on the current endpoint
> > > > > > ring, and also add some warning if we are dropping endpoints
> > > > > > with URBs still queued.
> > > > 
> > > > Yes, please. I think your first-aid will be a much better option than
> > > > the hacky patch I am using atm.
> > > > 
> > > 

> So poison is overwritten at e5acda58 with almost its own address
> (reading backwards: e5 ac da 60), twice.
> Looks like something (32-bit?) is pointing to itself twice, maybe a
> linked-list node's next and prev pointers being set to point to itself
> as the last item was removed from the list.
>
> The cancelled_td_list is part of struct xhci_virt_ep, so that should be fine.
> But td_list is part of struct xhci_ring, which was freed, and we removed the
> URB's TDs from the td_list when flushing the ring after the ring was freed.
>
> I changed the patch (attached) to make sure it doesn't touch the td_list when
> canceling a URB after the ring is freed.
> 
> How about this one, any improvements?

Yes, it worked. :D

So, cycle-1 = no change, just to make sure I can still reproduce the error.
cycle-2 and cycle-3 with your patch, and there was no problem,
slub debug was also happy.
I am starting an autotest with this patch now, and I will have almost
50 cycles tested by tomorrow morning.

--
Regards
Sudip


Re: usb HC busted?

2018-07-20 Thread Mathias Nyman

On 19.07.2018 17:57, Alan Stern wrote:

On Thu, 19 Jul 2018, Mathias Nyman wrote:


xhci driver will set up all the endpoints for the new altsetting already in
usb_hcd_alloc_bandwidth().

New endpoints will be ready and rings running after this. I don't know the exact
history behind this, but I assume it is because xhci does all of the steps to
drop/add, disable/enable endpoints and check bandwidth in a single configure
endpoint command, that will return errors if there is not enough bandwidth.


That's right; Sarah and I spent some time going over this while she was
working on it.  But it looks like the approach isn't adequate.


This command is issued in hcd->driver->check_bandwidth()
This means that xhci doesn't really do much in hcd->driver->endpoint_disable or
hcd->driver->endpoint_enable

It also means that xhci driver assumes rings are empty when
hcd->driver->check_bandwidth is called. It will bluntly free dropped rings.
If there are URBs left on an endpoint ring that was dropped+added
(freed+reallocated) then those URBs will contain pointers to the freed ring,
causing issues when usb_hcd_flush_endpoint() cancels those URBs.

usb_set_interface()
  usb_hcd_alloc_bandwidth()
    hcd->driver->drop_endpoint()
    hcd->driver->add_endpoint() // allocates new rings
    hcd->driver->check_bandwidth() // issues configure endpoint command, free rings.
  usb_disable_interface(iface, true)
    usb_disable_endpoint()
      usb_hcd_flush_endpoint() // will access freed ring if URBs found!!
      usb_hcd_disable_endpoint()
        hcd->driver->endpoint_disable()  // xhci does nothing
  usb_enable_interface(iface, true)
    usb_enable_endpoint(ep_addrss, true) // not really doing much on xhci side.

As first aid I could try to implement checks that make sure the flushed URBs'
trb pointers really are on the current endpoint ring, and also add some warning
if we are dropping endpoints with URBs still queued.

But we need to fix this properly as well.
xhci needs to be more in sync with usb core in usb_set_interface(); currently
xhci has the altsetting up and running when usb core hasn't even started
flushing endpoints.


Absolutely.  The core tries to be compatible with host controller
drivers that either allocate bandwidth as it is requested or else
allocate bandwidth all at once when an altsetting is installed.

xhci-hcd falls into the second category.  However, this approach
requires the bandwidth verification for the new altsetting to be
performed before the old altsetting has been disabled, and the xHCI
hardware can't do this.

We may need to change the core so that the old endpoints are disabled
before the bandwidth check is done, instead of after.  Of course, this
leads to an awkward situation if the check fails -- we'd probably have
to go back and re-install the old altsetting.


That would help xhci a lot.

If we want to avoid the awkward altsetting re-install after bandwidth failure
then adding an extra endpoint flush before checking the bandwidth would already
help a lot.

The endpoint disabling can then remain after bandwidth checking.
Does that work for other host controllers?

-Mathias


Re: usb HC busted?

2018-07-20 Thread Mathias Nyman

On 19.07.2018 20:32, Sudip Mukherjee wrote:

Hi Mathias,

On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote:

As first aid I could try to implement checks that make sure the flushed URBs'
trb pointers really are on the current endpoint ring, and also add some warning
if we are dropping endpoints with URBs still queued.


Yes, please. I think your first-aid will be a much better option than
the hacky patch I am using atm.



Attached a patch that checks canceled URB td/trb pointers.
I haven't tested it at all (well compiles and boots, but new code never 
exercised)

Does it work for you?


No, not exactly. :(

I can see your message getting printed.
[  249.518394] xhci_hcd 0000:00:14.0: Canceled URB td not found on endpoint ring
[  249.518431] xhci_hcd 0000:00:14.0: Canceled URB td not found on endpoint ring

But I can see the message from slub debug again:

[  348.279986] 
=
[  348.279993] BUG kmalloc-96 (Tainted: G U O   ): Poison overwritten
[  348.279995] 
-

[  348.279997] Disabling lock debugging due to kernel taint
[  348.280000] INFO: 0xe5acda60-0xe5acda67. First byte 0x60 instead of 0x6b
[  348.280012] INFO: Allocated in xhci_ring_alloc.constprop.14+0x31/0x125 
[xhci_hcd] age=129264 cpu=0 pid=33

...

[  348.280095] INFO: Freed in xhci_ring_free+0xa7/0xc6 [xhci_hcd] age=98722 
cpu=0 pid=33

...

[  348.280158] INFO: Slab 0xf46e0fe0 objects=29 used=29 fp=0x  (null) 
flags=0x40008100
[  348.280160] INFO: Object 0xe5acda48 @offset=6728 fp=0xe5acd700

[  348.280164] Redzone e5acda40: bb bb bb bb bb bb bb bb
  
[  348.280167] Object e5acda48: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  348.280169] Object e5acda58: 6b 6b 6b 6b 6b 6b 6b 6b 60 da ac e5 60 da ac e5 
 `...`...
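
For readers less familiar with the SLUB report format: a freed object is
filled with 0x6b and its last byte is set to 0xa5, and the INFO line names
the first and last bytes that no longer match. The following is a userspace
sketch only, with hypothetical helper names (find_poison_damage,
fill_dumped_object) that are not kernel functions. It rebuilds the 96-byte
object from the dump and locates the damaged range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_FREE 0x6b	/* fill byte for a freed object */
#define POISON_END  0xa5	/* marker in the object's last byte */
#define OBJ_SIZE    96		/* kmalloc-96, as in the report */

/*
 * Hypothetical helper, not kernel code: find the first and last offsets
 * at which a freed object no longer holds its poison pattern.
 * Returns 1 if the poison was overwritten, 0 if it is intact.
 */
static int find_poison_damage(const uint8_t *obj, size_t size,
			      size_t *first, size_t *last)
{
	int damaged = 0;

	assert(obj && first && last && size > 0);
	for (size_t i = 0; i < size; i++) {
		uint8_t expected = (i == size - 1) ? POISON_END : POISON_FREE;

		if (obj[i] == expected)
			continue;
		if (!damaged)
			*first = i;
		*last = i;
		damaged = 1;
	}
	return damaged;
}

/*
 * Rebuild the object as dumped: all poison, except eight bytes at
 * offset 0x18 that hold 0xe5acda60 twice in little-endian byte order.
 */
static void fill_dumped_object(uint8_t *obj)
{
	static const uint8_t overwrite[8] = {
		0x60, 0xda, 0xac, 0xe5, 0x60, 0xda, 0xac, 0xe5
	};

	for (size_t i = 0; i < OBJ_SIZE; i++)
		obj[i] = POISON_FREE;
	obj[OBJ_SIZE - 1] = POISON_END;
	for (size_t i = 0; i < sizeof(overwrite); i++)
		obj[0x18 + i] = overwrite[i];
}
```

With the object at 0xe5acda48, the damaged offsets 0x18 through 0x1f
correspond to 0xe5acda60-0xe5acda67, the range in the INFO line.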


So poison is overwritten at e5acda58 with almost its own address
(reading backwards: e5 ac da 60), twice.
Looks like something (32-bit?) is pointing to itself twice, maybe a
linked-list node's next and prev pointers being set to point to itself
as the last item was removed from the list.

The cancelled_td_list is part of struct xhci_virt_ep, so that should be fine.
But td_list is part of struct xhci_ring, which was freed, and we removed the
URB's TDs from the td_list when flushing the ring after the ring was freed.
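
The self-pointing pattern can be reproduced outside the kernel. Below is a
minimal userspace sketch, assuming a re-implementation of the list helpers
and a made-up struct ring whose td_list sits at offset 0x18 (chosen only to
match the dump, not taken from the xhci headers). The ring is "freed" by
filling it with 0x6b rather than by calling free(), so the sketch itself
stays well-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Hypothetical layouts for illustration only. */
struct ring {
	char other[0x18];
	struct list_head td_list;
};

struct td {
	struct list_head td_list;
};

/*
 * Returns 1 if unlinking the last TD through stale pointers left the
 * list head inside the "freed" ring pointing at itself twice, while
 * the rest of the ring still holds poison.
 */
static int stale_unlink_overwrites_poison(void)
{
	struct ring ring;
	struct td td;

	assert(offsetof(struct ring, td_list) == 0x18);

	INIT_LIST_HEAD(&ring.td_list);
	list_add_tail(&td.td_list, &ring.td_list);

	/* Simulate kfree() plus poisoning; the TD still points into it. */
	memset(&ring, 0x6b, sizeof(ring));

	/* Cancelling the URB unlinks its TD via td.next and td.prev. */
	list_del_init(&td.td_list);

	return ring.td_list.next == &ring.td_list &&
	       ring.td_list.prev == &ring.td_list &&
	       (unsigned char)ring.other[0] == 0x6b;
}
```

Because the TD was the only entry, both its next and prev pointed at the
list head, so the unlink writes the head's own address into both of the
head's fields, which is the kind of pattern seen in the dump.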

I changed the patch (attached) to make sure it doesn't touch the td_list when
canceling a URB after the ring is freed.

How about this one, any improvements?

-Mathias  

 
From ee48d9f9c2d82058489dcdc38faa34a3cbdb08d1 Mon Sep 17 00:00:00 2001
From: Mathias Nyman
Date: Thu, 19 Jul 2018 18:06:18 +0300
Subject: [PATCH v2] xhci: when dequeuing a URB make sure it exists on the
 current endpoint ring.

If the endpoint ring has been reallocated since the URB was enqueued,
then the URB may contain TD and TRB pointers to an already freed ring.
If this is the case then manually return the URB without touching any of the
freed ring structure data.

Don't try to stop the ring. It would be useless.

This can happen if the endpoint is not flushed before it is dropped and
re-added, which is the case in usb_set_interface() as xhci does
things in an odd order.

Signed-off-by: Mathias Nyman 
---
 drivers/usb/host/xhci.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 711da33..7093341 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -37,6 +37,21 @@ static unsigned int quirks;
 module_param(quirks, uint, S_IRUGO);
 MODULE_PARM_DESC(quirks, "Bit flags for quirks to be enabled as default");
 
+static bool td_on_ring(struct xhci_td *td, struct xhci_ring *ring)
+{
+	struct xhci_segment *seg = ring->first_seg;
+
+	if (!td || !td->start_seg)
+		return false;
+	do {
+		if (seg == td->start_seg)
+			return true;
+		seg = seg->next;
+	} while (seg && seg != ring->first_seg);
+
+	return false;
+}
+
 /* TODO: copied from ehci-hcd.c - can this be refactored? */
 /*
  * xhci_handshake - spin reading hc until handshake completes or fails
@@ -1467,6 +1482,21 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 		goto done;
 	}
 
+	/*
+	 * check ring is not re-allocated since URB was enqueued. If it is, then
+	 * make sure none of the ring related pointers in this URB private data
+	 * are touched, such as td_list, otherwise we overwrite freed data
+	 */
+	if (!td_on_ring(&urb_priv->td[0], ep_ring)) {
+		xhci_err(xhci, "Canceled URB td not found on endpoint ring");
+		for (i = urb_priv->num_tds_done; i < urb_priv->num_tds; i++) {
+			td = &urb_priv->td[i];
+			if (!list_empty(&td->cancelled_td_list))
+				list_del_init(&td->cancelled_td_list);
+		}
+		goto err_giveback;
+	}
+
 	if (xhci->xhc_state & XHCI_STATE_HALTED) {
 		xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
 				"HC halted, freeing TD manually.");
-- 

Re: usb HC busted?

2018-07-19 Thread Sudip Mukherjee
Hi Mathias,

On Thu, Jul 19, 2018 at 06:42:19PM +0300, Mathias Nyman wrote:
> > > > As first aid I could try to implement checks that make sure the
> > > > flushed URBs' trb pointers really are on the current endpoint ring,
> > > > and also add some warning if we are dropping endpoints with URBs
> > > > still queued.
> > 
> > Yes, please. I think your first-aid will be a much better option than
> > the hacky patch I am using atm.
> > 
> 
> Attached a patch that checks canceled URB td/trb pointers.
> I haven't tested it at all (well compiles and boots, but new code never 
> exercised)
> 
> Does it work for you?

No, not exactly. :(

I can see your message getting printed.
[  249.518394] xhci_hcd 0000:00:14.0: Canceled URB td not found on endpoint ring
[  249.518431] xhci_hcd 0000:00:14.0: Canceled URB td not found on endpoint ring

But I can see the message from slub debug again:

[  348.279986] 
=
[  348.279993] BUG kmalloc-96 (Tainted: G U O   ): Poison overwritten
[  348.279995] 
-

[  348.279997] Disabling lock debugging due to kernel taint
[  348.280000] INFO: 0xe5acda60-0xe5acda67. First byte 0x60 instead of 0x6b
[  348.280012] INFO: Allocated in xhci_ring_alloc.constprop.14+0x31/0x125 
[xhci_hcd] age=129264 cpu=0 pid=33
[  348.280019]  ___slab_alloc.constprop.24+0x1fc/0x292
[  348.280023]  __slab_alloc.isra.18.constprop.23+0x1c/0x25
[  348.280026]  kmem_cache_alloc_trace+0x78/0x141
[  348.280032]  xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd]
[  348.280038]  xhci_endpoint_init+0x25f/0x30a [xhci_hcd]
[  348.280044]  xhci_add_endpoint+0x126/0x149 [xhci_hcd]
[  348.280057]  usb_hcd_alloc_bandwidth+0x26a/0x2a0 [usbcore]
[  348.280067]  usb_set_interface+0xeb/0x25d [usbcore]
[  348.280071]  btusb_work+0xeb/0x324 [btusb]
[  348.280076]  process_one_work+0x163/0x2b2
[  348.280080]  worker_thread+0x1a9/0x25c
[  348.280083]  kthread+0xf8/0xfd
[  348.280087]  ret_from_fork+0x2e/0x38
[  348.280095] INFO: Freed in xhci_ring_free+0xa7/0xc6 [xhci_hcd] age=98722 
cpu=0 pid=33
[  348.280098]  __slab_free+0x4b/0x27a
[  348.280100]  kfree+0x12e/0x155
[  348.280106]  xhci_ring_free+0xa7/0xc6 [xhci_hcd]
[  348.280112]  xhci_free_endpoint_ring+0x16/0x20 [xhci_hcd]
[  348.280118]  xhci_check_bandwidth+0x1c2/0x211 [xhci_hcd]
[  348.280129]  usb_hcd_alloc_bandwidth+0x205/0x2a0 [usbcore]
[  348.280139]  usb_set_interface+0xeb/0x25d [usbcore]
[  348.280142]  btusb_work+0x228/0x324 [btusb]
[  348.280145]  process_one_work+0x163/0x2b2
[  348.280148]  worker_thread+0x1a9/0x25c
[  348.280151]  kthread+0xf8/0xfd
[  348.280154]  ret_from_fork+0x2e/0x38
[  348.280158] INFO: Slab 0xf46e0fe0 objects=29 used=29 fp=0x  (null) 
flags=0x40008100
[  348.280160] INFO: Object 0xe5acda48 @offset=6728 fp=0xe5acd700

[  348.280164] Redzone e5acda40: bb bb bb bb bb bb bb bb
  
[  348.280167] Object e5acda48: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  348.280169] Object e5acda58: 6b 6b 6b 6b 6b 6b 6b 6b 60 da ac e5 60 da ac e5 
 `...`...
[  348.280171] Object e5acda68: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  348.280174] Object e5acda78: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  348.280176] Object e5acda88: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
 
[  348.280179] Object e5acda98: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 
 kkk.
[  348.280181] Redzone e5acdaa8: bb bb bb bb
  
[  348.280183] Padding e5acdb50: 5a 5a 5a 5a 5a 5a 5a 5a
  
[  348.280188] CPU: 0 PID: 133 Comm: weston Tainted: GBU O
4.14.55-20180712+ #2
[  348.280190] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017
[  348.280192] Call Trace:
[  348.280199]  dump_stack+0x47/0x5b
[  348.280202]  print_trailer+0x12b/0x133
[  348.280206]  check_bytes_and_report+0x6c/0xae
[  348.280210]  check_object+0x10a/0x1db
[  348.280214]  alloc_debug_processing+0x79/0x123
[  348.280218]  ___slab_alloc.constprop.24+0x1fc/0x292
[  348.280224]  ? drm_mode_atomic_ioctl+0x374/0x75e
[  348.280227]  ? drm_mode_atomic_ioctl+0x374/0x75e
[  348.280231]  ? drm_mode_object_get+0x28/0x3a
[  348.280235]  ? __radix_tree_lookup+0x27/0x7e
[  348.280238]  ? drm_mode_object_get+0x28/0x3a
[  348.280242]  ? drm_mode_object_put+0x28/0x4c
[  348.280246]  __slab_alloc.isra.18.constprop.23+0x1c/0x25
[  348.280249]  ? __slab_alloc.isra.18.constprop.23+0x1c/0x25
[  348.280253]  kmem_cache_alloc_trace+0x78/0x141
[  348.280257]  ? drm_mode_atomic_ioctl+0x374/0x75e
[  348.280261]  drm_mode_atomic_ioctl+0x374/0x75e
[  348.280267]  ? drm_atomic_set_property+0x442/0x442
[  348.280272]  drm_ioctl_kernel+0x52/0x88
[  348.280275]  drm_ioctl+0x1fc/0x2c1
[  348.280279]  ? drm_atomic_set_property+0x442/0x442
[  

Re: usb HC busted?

2018-07-19 Thread Mathias Nyman

As first aid I could try to implement checks that make sure the flushed URBs'
trb pointers really are on the current endpoint ring, and also add some warning
if we are dropping endpoints with URBs still queued.


Yes, please. I think your first-aid will be a much better option than
the hacky patch I am using atm.



Attached a patch that checks canceled URB td/trb pointers.
I haven't tested it at all (well compiles and boots, but new code never 
exercised)

Does it work for you?



But we need to fix this properly as well.
xhci needs to be more in sync with usb core in usb_set_interface(); currently
xhci has the altsetting up and running when usb core hasn't even started
flushing endpoints.


I am able to reproduce this on almost all cycles, so I can always test
the fix for you after you are fully back from your holiday.


Nice, thanks

-Mathias
From a7d4af3129a91811c95ea642f6c916b1c1ca6d46 Mon Sep 17 00:00:00 2001
From: Mathias Nyman
Date: Thu, 19 Jul 2018 18:06:18 +0300
Subject: [PATCH] xhci: when dequeuing a URB make sure it exists on the current
 endpoint ring.

If the endpoint ring has been reallocated since the URB was enqueued,
then the URB may contain TD and TRB pointers to an already freed ring.
If this is the case then manually return the URB, and don't try to stop
the ring. It would be useless.

This can happen if the endpoint is not flushed before it is dropped and
re-added, which is the case in usb_set_interface() as xhci does
things in an odd order.

Signed-off-by: Mathias Nyman 
---
 drivers/usb/host/xhci.c | 43 ---
 1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 711da33..5bedab7 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -37,6 +37,21 @@ static unsigned int quirks;
 module_param(quirks, uint, S_IRUGO);
 MODULE_PARM_DESC(quirks, "Bit flags for quirks to be enabled as default");
 
+static bool td_on_ring(struct xhci_td *td, struct xhci_ring *ring)
+{
+	struct xhci_segment *seg = ring->first_seg;
+
+	if (!td || !td->start_seg)
+		return false;
+	do {
+		if (seg == td->start_seg)
+			return true;
+		seg = seg->next;
+	} while (seg && seg != ring->first_seg);
+
+	return false;
+}
+
 /* TODO: copied from ehci-hcd.c - can this be refactored? */
 /*
  * xhci_handshake - spin reading hc until handshake completes or fails
@@ -1467,19 +1482,16 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 		goto done;
 	}
 
+	/* check ring is not re-allocated since URB was enqueued */
+	if (!td_on_ring(&urb_priv->td[0], ep_ring)) {
+		xhci_err(xhci, "Canceled URB td not found on endpoint ring");
+		goto err_unlink_giveback;
+	}
+
 	if (xhci->xhc_state & XHCI_STATE_HALTED) {
 		xhci_dbg_trace(xhci, trace_xhci_dbg_cancel_urb,
-				"HC halted, freeing TD manually.");
-		for (i = urb_priv->num_tds_done;
-		     i < urb_priv->num_tds;
-		     i++) {
-			td = &urb_priv->td[i];
-			if (!list_empty(&td->td_list))
-				list_del_init(&td->td_list);
-			if (!list_empty(&td->cancelled_td_list))
-				list_del_init(&td->cancelled_td_list);
-		}
-		goto err_giveback;
+			   "HC halted, freeing TD manually.");
+		goto err_unlink_giveback;
 	}
 
 	i = urb_priv->num_tds_done;
@@ -1519,6 +1531,15 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 	spin_unlock_irqrestore(&xhci->lock, flags);
 	return ret;
 
+err_unlink_giveback:
+	for (i = urb_priv->num_tds_done; i < urb_priv->num_tds; i++) {
+		td = &urb_priv->td[i];
+		if (!list_empty(&td->td_list))
+			list_del_init(&td->td_list);
+		if (!list_empty(&td->cancelled_td_list))
+			list_del_init(&td->cancelled_td_list);
+	}
+
 err_giveback:
 	if (urb_priv)
 		xhci_urb_free_priv(urb_priv);
-- 
2.7.4


Re: usb HC busted?

2018-07-19 Thread Alan Stern
On Thu, 19 Jul 2018, Mathias Nyman wrote:

> xhci driver will set up all the endpoints for the new altsetting already in
> usb_hcd_alloc_bandwidth().
> 
> New endpoints will be ready and rings running after this. I don't know the 
> exact
> history behind this, but I assume it is because xhci does all of the steps to
> drop/add, disable/enable endpoints and check bandwidth in a single configure
> endpoint command, that will return errors if there is not enough bandwidth.

That's right; Sarah and I spent some time going over this while she was 
working on it.  But it looks like the approach isn't adequate.

> This command is issued in hcd->driver->check_bandwidth()
> This means that xhci doesn't really do much in hcd->driver->endpoint_disable 
> or
> hcd->driver->endpoint_enable
> 
> It also means that xhci driver assumes rings are empty when
> hcd->driver->check_bandwidth is called. It will bluntly free dropped rings.
> If there are URBs left on an endpoint ring that was dropped+added
> (freed+reallocated) then those URBs will contain pointers to the freed ring,
> causing issues when usb_hcd_flush_endpoint() cancels those URBs.
> 
> usb_set_interface()
>   usb_hcd_alloc_bandwidth()
>     hcd->driver->drop_endpoint()
>     hcd->driver->add_endpoint() // allocates new rings
>     hcd->driver->check_bandwidth() // issues configure endpoint command, free rings.
>   usb_disable_interface(iface, true)
>     usb_disable_endpoint()
>       usb_hcd_flush_endpoint() // will access freed ring if URBs found!!
>       usb_hcd_disable_endpoint()
>         hcd->driver->endpoint_disable()  // xhci does nothing
>   usb_enable_interface(iface, true)
>     usb_enable_endpoint(ep_addrss, true) // not really doing much on xhci side.
> 
> As first aid I could try to implement checks that make sure the flushed URBs'
> trb pointers really are on the current endpoint ring, and also add some
> warning if we are dropping endpoints with URBs still queued.
>
> But we need to fix this properly as well.
> xhci needs to be more in sync with usb core in usb_set_interface(); currently
> xhci has the altsetting up and running when usb core hasn't even started
> flushing endpoints.

Absolutely.  The core tries to be compatible with host controller
drivers that either allocate bandwidth as it is requested or else
allocate bandwidth all at once when an altsetting is installed.  

xhci-hcd falls into the second category.  However, this approach
requires the bandwidth verification for the new altsetting to be
performed before the old altsetting has been disabled, and the xHCI
hardware can't do this.

We may need to change the core so that the old endpoints are disabled 
before the bandwidth check is done, instead of after.  Of course, this 
leads to an awkward situation if the check fails -- we'd probably have 
to go back and re-install the old altsetting.
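
The control flow of that reordering, including the awkward rollback, might
look roughly like the userspace model below. It is only a sketch under
stated assumptions: struct hc_model, alt_cost[], disable_alt(),
install_alt() and set_interface_model() are all hypothetical names, and the
bandwidth numbers are made up. It is not usbcore code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a host controller's bandwidth bookkeeping. */
struct hc_model {
	int active_alt;		/* altsetting currently installed */
	int free_bandwidth;	/* bandwidth units still available */
};

/* Made-up bandwidth cost of altsettings 0, 1 and 2. */
static const int alt_cost[] = { 0, 10, 50 };

/* Disable the old endpoints first and give their bandwidth back. */
static void disable_alt(struct hc_model *hc)
{
	hc->free_bandwidth += alt_cost[hc->active_alt];
	hc->active_alt = -1;
}

/* Bandwidth check and install in one step; fails if it does not fit. */
static bool install_alt(struct hc_model *hc, int alt)
{
	if (alt_cost[alt] > hc->free_bandwidth)
		return false;
	hc->free_bandwidth -= alt_cost[alt];
	hc->active_alt = alt;
	return true;
}

/*
 * Returns 0 if the new altsetting was installed, -1 if it did not fit
 * and the old altsetting had to be re-installed.
 */
static int set_interface_model(struct hc_model *hc, int new_alt)
{
	int old_alt = hc->active_alt;
	bool restored;

	disable_alt(hc);	/* old endpoints go before the check */
	if (install_alt(hc, new_alt))
		return 0;

	/* The awkward path: put the old altsetting back. */
	restored = install_alt(hc, old_alt);
	assert(restored);	/* it fit before, so it fits again here */
	(void)restored;
	return -1;
}
```

In this model the rollback always succeeds because nothing else competes
for bandwidth between the two steps; real hardware gives no such guarantee,
which is part of what makes the failure path awkward.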

Alan Stern



Re: usb HC busted?

2018-07-19 Thread Sudip Mukherjee
Hi Mathias,

On Thu, Jul 19, 2018 at 01:59:01PM +0300, Mathias Nyman wrote:
> On 17.07.2018 18:10, Sudip Mukherjee wrote:
> > Hi Alan, Greg,
> > 
> > On Tue, Jul 17, 2018 at 03:49:18PM +0100, Sudip Mukherjee wrote:
> > > On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:
> > > > Hi Alan,
> > > > 
> > > > On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> > > > > On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> > > > > 
> > > > > > I did some more debugging. Tested with a KASAN enabled kernel and 
> > > > > > that
> > > > > > shows the problem. The report is attached.
> > > > > > 

> > 
> > And, my hacky patch worked as I prevented it from calling
> > usb_disable_interface() in this particular case.
> > 
> 
> Back for a few days, looking at this

I hope you had a good holiday. :)

> 
> xhci driver will set up all the endpoints for the new altsetting already in
> usb_hcd_alloc_bandwidth().
> 

> 
> As first aid I could try to implement checks that make sure the flushed URBs'
> trb pointers really are on the current endpoint ring, and also add some
> warning if we are dropping endpoints with URBs still queued.

Yes, please. I think your first-aid will be a much better option than
the hacky patch I am using atm.

> 
> But we need to fix this properly as well.
> xhci needs to be more in sync with usb core in usb_set_interface(); currently
> xhci has the altsetting up and running when usb core hasn't even started
> flushing endpoints.

I am able to reproduce this on almost all cycles, so I can always test
the fix for you after you are fully back from your holiday.

--
Regards
Sudip


Re: usb HC busted?

2018-07-19 Thread Mathias Nyman

On 17.07.2018 18:10, Sudip Mukherjee wrote:

Hi Alan, Greg,

On Tue, Jul 17, 2018 at 03:49:18PM +0100, Sudip Mukherjee wrote:

On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:

Hi Alan,

On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:

On Tue, 17 Jul 2018, Sudip Mukherjee wrote:


I did some more debugging. Tested with a KASAN enabled kernel and that
shows the problem. The report is attached.

To my understanding:

btusb_work() is calling usb_set_interface() with alternate = 0. which
again calls usb_hcd_alloc_bandwidth() and that frees the rings by
xhci_free_endpoint_ring().


That doesn't sound like the right thing to do.  The rings shouldn't be
freed until xhci_endpoint_disable() is called.

On the other hand, there doesn't appear to be any
xhci_endpoint_disable() routine, although a comment refers to it.
Maybe this is the real problem?


one of your old mail might help :)

https://www.spinics.net/lists/linux-usb/msg98123.html


Wrote too soon.

Is it the one you are looking for -
usb_disable_endpoint() is in drivers/usb/core/message.c


I think now I understand what the problem is.
usb_set_interface() calls usb_disable_interface() which again calls
usb_disable_endpoint(). This usb_disable_endpoint() gets the pointer
to 'ep', marks it as NULL and sends the pointer to usb_hcd_flush_endpoint().
After flushing the endpoints usb_disable_endpoint() calls
usb_hcd_disable_endpoint() which tries to do:
if (hcd->driver->endpoint_disable)
hcd->driver->endpoint_disable(hcd, ep);
but there is no endpoint_disable() callback in xhci, so the endpoint is
never marked as disabled. So, next time usb_hcd_flush_endpoint() is
called I get this corruption.
And this is exactly where I used to see the problem happening.

And, my hacky patch worked as I prevented it from calling
usb_disable_interface() in this particular case.



Back for a few days, looking at this

xhci driver will set up all the endpoints for the new altsetting already in
usb_hcd_alloc_bandwidth().

New endpoints will be ready and rings running after this. I don't know the exact
history behind this, but I assume it is because xhci does all of the steps to
drop/add, disable/enable endpoints and check bandwidth in a single configure
endpoint command, that will return errors if there is not enough bandwidth.
This command is issued in hcd->driver->check_bandwidth()
This means that xhci doesn't really do much in hcd->driver->endpoint_disable or
hcd->driver->endpoint_enable

It also means that xhci driver assumes rings are empty when
hcd->driver->check_bandwidth is called. It will bluntly free dropped rings.
If there are URBs left on an endpoint ring that was dropped+added
(freed+reallocated) then those URBs will contain pointers to the freed ring,
causing issues when usb_hcd_flush_endpoint() cancels those URBs.

usb_set_interface()
  usb_hcd_alloc_bandwidth()
    hcd->driver->drop_endpoint()
    hcd->driver->add_endpoint() // allocates new rings
    hcd->driver->check_bandwidth() // issues configure endpoint command, free rings.
  usb_disable_interface(iface, true)
    usb_disable_endpoint()
      usb_hcd_flush_endpoint() // will access freed ring if URBs found!!
      usb_hcd_disable_endpoint()
        hcd->driver->endpoint_disable()  // xhci does nothing
  usb_enable_interface(iface, true)
    usb_enable_endpoint(ep_addrss, true) // not really doing much on xhci side.
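
The effect of this ordering can be modelled in a few lines of userspace C.
This is a sketch under assumptions, not driver code: struct ep_model and the
model_* functions are hypothetical, and a generation counter stands in for
"the ring was freed and reallocated", which the real driver does not track
this way. It compares the current order with a variant that flushes before
the bandwidth check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one endpoint with at most one pending URB. */
struct ep_model {
	unsigned int ring_generation;	/* bumped on each free + realloc */
	bool urb_queued;		/* is a URB still pending? */
	unsigned int urb_generation;	/* ring the URB's pointers refer to */
};

static void model_enqueue(struct ep_model *ep)
{
	assert(ep);
	ep->urb_generation = ep->ring_generation;
	ep->urb_queued = true;
}

/* Stands in for check_bandwidth(): old ring freed, new ring live. */
static void model_check_bandwidth(struct ep_model *ep)
{
	ep->ring_generation++;
}

/*
 * Stands in for usb_hcd_flush_endpoint(): cancels the pending URB and
 * reports whether that URB still referred to a freed ring.
 */
static bool model_flush(struct ep_model *ep)
{
	bool stale = ep->urb_queued &&
		     ep->urb_generation != ep->ring_generation;

	ep->urb_queued = false;
	return stale;
}

/* Order described above: bandwidth check first, flush afterwards. */
static bool current_order_touches_freed_ring(void)
{
	struct ep_model ep = { 0 };

	model_enqueue(&ep);
	model_check_bandwidth(&ep);
	return model_flush(&ep);
}

/* Variant: flush once before the check, then flush again as today. */
static bool flush_first_touches_freed_ring(void)
{
	struct ep_model ep = { 0 };

	model_enqueue(&ep);
	model_flush(&ep);
	model_check_bandwidth(&ep);
	return model_flush(&ep);
}
```

In the model, only the current order leaves a URB whose pointers refer to
the previous ring by the time the flush runs.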

As first aid I could try to implement checks that make sure the flushed URBs'
trb pointers really are on the current endpoint ring, and also add some warning
if we are dropping endpoints with URBs still queued.

But we need to fix this properly as well.
xhci needs to be more in sync with usb core in usb_set_interface(); currently
xhci has the altsetting up and running when usb core hasn't even started
flushing endpoints.

-Mathias


Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
On Tue, Jul 17, 2018 at 04:59:01PM +0100, Sudip Mukherjee wrote:
> On Tue, Jul 17, 2018 at 05:52:59PM +0200, Greg KH wrote:
> > On Tue, Jul 17, 2018 at 10:31:38AM -0400, Alan Stern wrote:
> > > On Tue, 17 Jul 2018, Greg KH wrote:
> > > 
> > > > > From: Sudip Mukherjee 
> > > > > Date: Tue, 10 Jul 2018 09:50:00 +0100
> > > > > Subject: [PATCH] hacky solution to mem-corruption
> > > > > 
> > > > > Signed-off-by: Sudip Mukherjee 
> > > > > ---
> 
> > > 
> > > No, neither of these is right.  It's possible to use 
> > > usb_set_interface() as a kind of "soft" reset.  Even when the new 
> > > altsetting is specified to be the same as the current one, we still 
> > > have to tell the lower-layer drivers and hardware about it.
> > 
> > You are right, it's a hacky soft reset, I was just trying to figure out
> > what the bluetooth driver was trying to do.  I wouldn't expect it to be
> > calling that function a lot, but I guess it does :(
> 
> usb_set_interface() is being called two times from bluetooth event. But
> I am now adding more debugs to see why your patch did not work.

So, a very simple debug to see the sequence of functions being called.
I have attached the patch I used.

In a good case:
[  124.287991] sudip: xhci_urb_dequeue
[  124.287997] sudip: xhci_queue_stop_endpoint cmd=ee032950
[  124.288016] sudip: handle_cmd_completion cmd=ee032950
[  124.288173] sudip: xhci_urb_dequeue
[  124.288176] sudip: xhci_queue_stop_endpoint cmd=ee032950
[  124.288189] sudip: handle_cmd_completion cmd=ee032950
[  124.290647] sudip: usb_hcd_flush_endpoint
[  124.290652] sudip: usb_hcd_flush_endpoint

But in a bad case:
[  186.786900] sudip: xhci_urb_dequeue
[  186.786905] sudip: xhci_queue_stop_endpoint cmd=ebe47cb0
[  186.786923] sudip: handle_cmd_completion cmd=ebe47cb0
[  186.789040] sudip: xhci_urb_dequeue
[  186.789047] sudip: xhci_queue_stop_endpoint cmd=ebe47cb0
[  186.789069] sudip: handle_cmd_completion cmd=ebe47cb0
[  186.790082] sudip: usb_hcd_flush_endpoint
[  186.790094] sudip: xhci_urb_dequeue
[  186.790097] sudip: xhci_queue_stop_endpoint cmd=ebe47290
[  186.790150] sudip: handle_cmd_completion cmd=ebe47290
[  186.790202] sudip: usb_hcd_flush_endpoint

So, when usb_hcd_flush_endpoint() is called by usb_disable_endpoint() it
finds urbs still on the urb_list of the ep. And in the process of unlinking
them, it again sends the command to stop the endpoint, although that endpoint
has already been stopped.
So Greg's patch did not work as the memory got corrupted on the first call
to usb_set_interface(), whereas that patch was preventing the second call
to usb_set_interface().
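
The difference between the two traces above can be spotted mechanically. The helper below is a rough sketch, not part of any patch in this thread: flush_found_urbs() is a made-up name, and it assumes the log has already been cut down to the lines of a single usb_set_interface() cycle, using the "sudip:" debug strings from the attached patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if an xhci_urb_dequeue line follows a usb_hcd_flush_endpoint
 * line, i.e. the flush still found URBs on the endpoint and unlinked them.
 * Assumes lines[] holds the debug output of exactly one cycle.
 */
bool flush_found_urbs(const char *const *lines, size_t n)
{
	bool seen_flush = false;

	for (size_t i = 0; i < n; i++) {
		if (strstr(lines[i], "sudip: usb_hcd_flush_endpoint"))
			seen_flush = true;
		else if (seen_flush &&
			 strstr(lines[i], "sudip: xhci_urb_dequeue"))
			return true;
	}
	return false;
}
```

For the good trace this returns false (both flush calls come after the last dequeue); for the bad trace it returns true because of the dequeue at 186.790094.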

--
Regards
Sudip
diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 467bedeb542a..8d28f120ec0a 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1885,6 +1885,7 @@ void usb_hcd_flush_endpoint(struct usb_device *udev,
might_sleep();
hcd = bus_to_hcd(udev->bus);
 
+   pr_err("sudip: %s\n", __func__);
/* No more submits can occur */
spin_lock_irq(&hcd_urb_list_lock);
 rescan:
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 6996235e34a9..4f80791fdfc5 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1450,6 +1450,7 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
case TRB_STOP_RING:
WARN_ON(slot_id != TRB_TO_SLOT_ID(
le32_to_cpu(cmd_trb->generic.field[3])));
+   pr_err("sudip: %s cmd=%p\n", __func__, cmd);
xhci_handle_cmd_stop_ep(xhci, slot_id, cmd_trb, event);
break;
case TRB_SET_DEQ:
@@ -4009,6 +4010,7 @@ int xhci_queue_stop_endpoint(struct xhci_hcd *xhci, struct xhci_command *cmd,
u32 type = TRB_TYPE(TRB_STOP_RING);
u32 trb_suspend = SUSPEND_PORT_FOR_TRB(suspend);
 
+   pr_err("sudip: %s cmd=%p\n", __func__, cmd);
return queue_command(xhci, cmd, 0, 0, 0,
trb_slot_id | trb_ep_index | type | trb_suspend, false);
 }
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index db1de6113db2..3832128107ff 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -1516,6 +1516,7 @@ static int xhci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
ep->stop_cmd_timer.expires = jiffies +
XHCI_STOP_EP_CMD_TIMEOUT * HZ;
add_timer(&ep->stop_cmd_timer);
+   pr_err("sudip: %s\n", __func__);
xhci_queue_stop_endpoint(xhci, command, urb->dev->slot_id,
 ep_index, 0);
xhci_ring_cmd_db(xhci);

Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
On Tue, Jul 17, 2018 at 05:52:59PM +0200, Greg KH wrote:
> On Tue, Jul 17, 2018 at 10:31:38AM -0400, Alan Stern wrote:
> > On Tue, 17 Jul 2018, Greg KH wrote:
> > 
> > > > From: Sudip Mukherjee 
> > > > Date: Tue, 10 Jul 2018 09:50:00 +0100
> > > > Subject: [PATCH] hacky solution to mem-corruption
> > > > 
> > > > Signed-off-by: Sudip Mukherjee 
> > > > ---

> > 
> > No, neither of these is right.  It's possible to use 
> > usb_set_interface() as a kind of "soft" reset.  Even when the new 
> > altsetting is specified to be the same as the current one, we still 
> > have to tell the lower-layer drivers and hardware about it.
> 
> You are right, it's a hacky soft reset, I was just trying to figure out
> what the bluetooth driver was trying to do.  I wouldn't expect it to be
> calling that function a lot, but I guess it does :(

usb_set_interface() is being called two times from bluetooth event. But
I am now adding more debugs to see why your patch did not work.

--
Regards
Sudip


Re: usb HC busted?

2018-07-17 Thread Greg KH
On Tue, Jul 17, 2018 at 10:31:38AM -0400, Alan Stern wrote:
> On Tue, 17 Jul 2018, Greg KH wrote:
> 
> > > From: Sudip Mukherjee 
> > > Date: Tue, 10 Jul 2018 09:50:00 +0100
> > > Subject: [PATCH] hacky solution to mem-corruption
> > > 
> > > Signed-off-by: Sudip Mukherjee 
> > > ---
> > >  drivers/usb/core/message.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> > > index 7cd4ec33dbf4..7fdf7a27611d 100644
> > > --- a/drivers/usb/core/message.c
> > > +++ b/drivers/usb/core/message.c
> > > @@ -1398,7 +1398,8 @@ int usb_set_interface(struct usb_device *dev, int 
> > > interface, int alternate)
> > >   remove_intf_ep_devs(iface);
> > >   usb_remove_sysfs_intf_files(iface);
> > >   }
> > > - usb_disable_interface(dev, iface, true);
> > > + if (!(iface->cur_altsetting && alt))
> > > + usb_disable_interface(dev, iface, true);
> > 
> > 
> > 
> > This feels like a "correct" patch anyway, why would a driver keep
> > calling set_interface to an interface that it was already set to?
> > 
> > But can't we check for this higher up in the function?  This hack will
> > just not disable an interface but it will do all of the other stuff
> > being asked for.  Does the patch below also solve this for you?  It's
> > not a good solution of course, but it might work around the problem a
> > bit better.
> > 
> > thanks,
> > 
> > greg k-h
> > 
> > diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> > index 1a15392326fc..0f718f1a1ca3 100644
> > --- a/drivers/usb/core/message.c
> > +++ b/drivers/usb/core/message.c
> > @@ -1376,6 +1376,14 @@ int usb_set_interface(struct usb_device *dev, int 
> > interface, int alternate)
> > return -EINVAL;
> > }
> >  
> > +   if (iface->cur_altsetting == alt) {
> > +   /*
> > +* foolish bluetooth stack, don't try to set a setting you are
> > +* already set to...
> > +*/
> > +   return 0;
> > +   }
> > +
> > /* Make sure we have enough bandwidth for this alternate interface.
> >  * Remove the current alt setting and add the new alt setting.
> >  */
> 
> No, neither of these is right.  It's possible to use 
> usb_set_interface() as a kind of "soft" reset.  Even when the new 
> altsetting is specified to be the same as the current one, we still 
> have to tell the lower-layer drivers and hardware about it.

You are right, it's a hacky soft reset, I was just trying to figure out
what the bluetooth driver was trying to do.  I wouldn't expect it to be
calling that function a lot, but I guess it does :(

thanks,

greg k-h


Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
Hi Alan, Greg,

On Tue, Jul 17, 2018 at 03:49:18PM +0100, Sudip Mukherjee wrote:
> On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:
> > Hi Alan,
> > 
> > On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> > > On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> > > 
> > > > I did some more debugging. Tested with a KASAN enabled kernel and that
> > > > shows the problem. The report is attached.
> > > > 
> > > > To my understanding:
> > > > 
> > > > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > > > xhci_free_endpoint_ring().
> > > 
> > > That doesn't sound like the right thing to do.  The rings shouldn't be 
> > > freed until xhci_endpoint_disable() is called.  
> > > 
> > > On the other hand, there doesn't appear to be any 
> > > xhci_endpoint_disable() routine, although a comment refers to it.  
> > > Maybe this is the real problem?
> > 
> > one of your old mail might help :)
> > 
> > https://www.spinics.net/lists/linux-usb/msg98123.html
> 
> Wrote too soon.
> 
> Is it the one you are looking for -
> usb_disable_endpoint() is in drivers/usb/core/message.c

I think now I understand what the problem is.
usb_set_interface() calls usb_disable_interface() which again calls
usb_disable_endpoint(). This usb_disable_endpoint() gets the pointer
to 'ep', marks it as NULL and sends the pointer to usb_hcd_flush_endpoint().
After flushing the endpoints usb_disable_endpoint() calls
usb_hcd_disable_endpoint() which tries to do:
if (hcd->driver->endpoint_disable)
hcd->driver->endpoint_disable(hcd, ep);
but there is no endpoint_disable() callback in xhci, so the endpoint is
never marked as disabled. So, next time usb_hcd_flush_endpoint() is
called I get this corruption. 
And this is exactly where I used to see the problem happening.

And, my hacky patch worked as I prevented it from calling
usb_disable_interface() in this particular case.

Greg - answering your question here. My hacky patch was based on the
fact that usb_hcd_alloc_bandwidth() is calling hcd->driver->drop_endpoint()
and hcd->driver->add_endpoint() if (cur_alt && new_alt). So, I prevented
usb_disable_interface() from being called for that same condition. And that
worked as the call to usb_hcd_flush_endpoint() was not executed.
I know it is not correct and I might have memory leaks because of this, but
I have the system working till we get the actual fix.
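
To make the difference between the two workarounds in this thread concrete, here is a small sketch of just their conditions. The struct and function names are made up for illustration; only the two boolean expressions are taken from the quoted patches.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an altsetting; only pointer identity matters. */
struct sketch_altsetting {
	int number;
};

/* Hacky patch: usb_disable_interface() is called only if this is true. */
bool hack_calls_disable(const struct sketch_altsetting *cur,
			const struct sketch_altsetting *alt)
{
	return !(cur && alt);
}

/* Greg's patch: usb_set_interface() returns early if this is true. */
bool early_return_taken(const struct sketch_altsetting *cur,
			const struct sketch_altsetting *alt)
{
	return cur == alt;
}
```

With two valid altsettings the first condition is always false, so the hacky patch skips usb_disable_interface() even when the altsetting really changes, while the second condition only fires when the requested altsetting is the one already in use.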

--
Regards
Sudip



Re: usb HC busted?

2018-07-17 Thread Alan Stern
On Tue, 17 Jul 2018, Sudip Mukherjee wrote:

> On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:
> > Hi Alan,
> > 
> > On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> > > On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> > > 
> > > > I did some more debugging. Tested with a KASAN enabled kernel and that
> > > > shows the problem. The report is attached.
> > > > 
> > > > To my understanding:
> > > > 
> > > > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > > > xhci_free_endpoint_ring().
> > > 
> > > That doesn't sound like the right thing to do.  The rings shouldn't be 
> > > freed until xhci_endpoint_disable() is called.  
> > > 
> > > On the other hand, there doesn't appear to be any 
> > > xhci_endpoint_disable() routine, although a comment refers to it.  
> > > Maybe this is the real problem?
> > 
> > one of your old mail might help :)
> > 
> > https://www.spinics.net/lists/linux-usb/msg98123.html

That message seems to say the same thing as what I just wrote, more or 
less.

> Wrote too soon.
> 
> Is it the one you are looking for -
> usb_disable_endpoint() is in drivers/usb/core/message.c

No, I'm talking about xhci_endpoint_disable(), which would be called by 
usb_hcd_disable_endpoint() if it existed.  Of course, 
usb_hcd_disable_endpoint() is called by usb_disable_endpoint().
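
The optional-callback pattern being discussed can be modelled as below. This is a toy sketch with hypothetical sketch_* names, not the real hc_driver interface; a NULL hook stands in for a host controller driver that does not implement endpoint_disable().

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_ep {
	bool hc_disabled;	/* set by the host controller driver, if at all */
};

struct sketch_hc_driver {
	/* Optional hook; NULL models a driver that does not implement it. */
	void (*endpoint_disable)(struct sketch_ep *ep);
};

/* Same shape as the guarded call in usb_hcd_disable_endpoint(). */
void sketch_hcd_disable_endpoint(const struct sketch_hc_driver *drv,
				 struct sketch_ep *ep)
{
	if (drv->endpoint_disable)
		drv->endpoint_disable(ep);
}

/* Example hook a driver could provide. */
void sketch_mark_disabled(struct sketch_ep *ep)
{
	ep->hc_disabled = true;
}
```

With the hook left NULL the core call silently does nothing, so any per-endpoint cleanup the driver needs has to happen somewhere else.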

Alan Stern



Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
On Tue, Jul 17, 2018 at 03:40:22PM +0100, Sudip Mukherjee wrote:
> Hi Alan,
> 
> On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> > On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> > 
> > > I did some more debugging. Tested with a KASAN enabled kernel and that
> > > shows the problem. The report is attached.
> > > 
> > > To my understanding:
> > > 
> > > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > > xhci_free_endpoint_ring().
> > 
> > That doesn't sound like the right thing to do.  The rings shouldn't be 
> > freed until xhci_endpoint_disable() is called.  
> > 
> > On the other hand, there doesn't appear to be any 
> > xhci_endpoint_disable() routine, although a comment refers to it.  
> > Maybe this is the real problem?
> 
> one of your old mail might help :)
> 
> https://www.spinics.net/lists/linux-usb/msg98123.html

Wrote too soon.

Is it the one you are looking for -
usb_disable_endpoint() is in drivers/usb/core/message.c

--
Regards
Sudip


Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
Hi Alan,

On Tue, Jul 17, 2018 at 10:28:14AM -0400, Alan Stern wrote:
> On Tue, 17 Jul 2018, Sudip Mukherjee wrote:
> 
> > I did some more debugging. Tested with a KASAN enabled kernel and that
> > shows the problem. The report is attached.
> > 
> > To my understanding:
> > 
> > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > xhci_free_endpoint_ring().
> 
> That doesn't sound like the right thing to do.  The rings shouldn't be 
> freed until xhci_endpoint_disable() is called.  
> 
> On the other hand, there doesn't appear to be any 
> xhci_endpoint_disable() routine, although a comment refers to it.  
> Maybe this is the real problem?

one of your old mail might help :)

https://www.spinics.net/lists/linux-usb/msg98123.html

--
Regards
Sudip


Re: usb HC busted?

2018-07-17 Thread Alan Stern
On Tue, 17 Jul 2018, Greg KH wrote:

> > From: Sudip Mukherjee 
> > Date: Tue, 10 Jul 2018 09:50:00 +0100
> > Subject: [PATCH] hacky solution to mem-corruption
> > 
> > Signed-off-by: Sudip Mukherjee 
> > ---
> >  drivers/usb/core/message.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> > index 7cd4ec33dbf4..7fdf7a27611d 100644
> > --- a/drivers/usb/core/message.c
> > +++ b/drivers/usb/core/message.c
> > @@ -1398,7 +1398,8 @@ int usb_set_interface(struct usb_device *dev, int 
> > interface, int alternate)
> > remove_intf_ep_devs(iface);
> > usb_remove_sysfs_intf_files(iface);
> > }
> > -   usb_disable_interface(dev, iface, true);
> > +   if (!(iface->cur_altsetting && alt))
> > +   usb_disable_interface(dev, iface, true);
> 
> 
> 
> This feels like a "correct" patch anyway, why would a driver keep
> calling set_interface to an interface that it was already set to?
> 
> But can't we check for this higher up in the function?  This hack will
> just not disable an interface but it will do all of the other stuff
> being asked for.  Does the patch below also solve this for you?  It's
> not a good solution of course, but it might work around the problem a
> bit better.
> 
> thanks,
> 
> greg k-h
> 
> diff --git a/drivers/usb/core/message.c b/drivers/usb/core/message.c
> index 1a15392326fc..0f718f1a1ca3 100644
> --- a/drivers/usb/core/message.c
> +++ b/drivers/usb/core/message.c
> @@ -1376,6 +1376,14 @@ int usb_set_interface(struct usb_device *dev, int 
> interface, int alternate)
>   return -EINVAL;
>   }
>  
> + if (iface->cur_altsetting == alt) {
> + /*
> +  * foolish bluetooth stack, don't try to set a setting you are
> +  * already set to...
> +  */
> + return 0;
> + }
> +
>   /* Make sure we have enough bandwidth for this alternate interface.
>* Remove the current alt setting and add the new alt setting.
>*/

No, neither of these is right.  It's possible to use 
usb_set_interface() as a kind of "soft" reset.  Even when the new 
altsetting is specified to be the same as the current one, we still 
have to tell the lower-layer drivers and hardware about it.

Alan Stern



Re: usb HC busted?

2018-07-17 Thread Alan Stern
On Tue, 17 Jul 2018, Sudip Mukherjee wrote:

> I did some more debugging. Tested with a KASAN enabled kernel and that
> shows the problem. The report is attached.
> 
> To my understanding:
> 
> btusb_work() is calling usb_set_interface() with alternate = 0. which
> again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> xhci_free_endpoint_ring().

That doesn't sound like the right thing to do.  The rings shouldn't be 
freed until xhci_endpoint_disable() is called.  

On the other hand, there doesn't appear to be any 
xhci_endpoint_disable() routine, although a comment refers to it.  
Maybe this is the real problem?

Alan Stern

> But then usb_set_interface() continues and
> calls usb_disable_interface() -> usb_hcd_flush_endpoint()->unlink1()->
> xhci_urb_dequeue() which at the end gives the command to stop endpoint.
> 
> In all the cycles I have tested I see that only in the fail case
> handle_cmd_completion() gets called, but in the cycles where the error
> is not there handle_cmd_completion() is not called with that command.
> 
> I am not sure what is happening, and you are the best person to understand
> what is happening. :)
> 
> But for now (until you are back from holiday and suggest a proper solution),
> I made a hacky patch (attached) which is working and I do not get any
> corruption after that. Both KASAN and slub debug are also happy.
> 
> So, now waiting for you to analyze what is going on and suggest a proper
> fix.
> 
> Thanks in advance.
> 
> --
> Regards
> Sudip
> 



Re: usb HC busted?

2018-07-17 Thread Greg KH
On Tue, Jul 17, 2018 at 02:20:00PM +0100, Sudip Mukherjee wrote:
> Hi Greg,
> 
> On Tue, Jul 17, 2018 at 02:04:11PM +0200, Greg KH wrote:
> > On Tue, Jul 17, 2018 at 12:41:04PM +0100, Sudip Mukherjee wrote:
> > > Hi Mathias,
> > > 
> > > On Sat, Jun 30, 2018 at 10:07:04PM +0100, Sudip Mukherjee wrote:
> > > > Hi Mathias,
> > > > 
> > > > On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote:
> > > > > On 27.06.2018 14:59, Sudip Mukherjee wrote:
> > > > > > > > Can you share a bit more details on the platform you are using, 
> > > > > > > > and what types of test you are running.
> > > > > > > 
> > > 
> > > > Then to track what is going on, I added the slub debugging and :(
> > > > I have attached part of dmesg for you to check.
> > > > Will appreciate your help in finding out the problem.
> > > 
> > > I did some more debugging. Tested with a KASAN enabled kernel and that
> > > shows the problem. The report is attached.
> > > 
> > > To my understanding:
> > > 
> > > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > > xhci_free_endpoint_ring(). But then usb_set_interface() continues and
> > > calls usb_disable_interface() -> usb_hcd_flush_endpoint()->unlink1()->
> > > xhci_urb_dequeue() which at the end gives the command to stop endpoint.
> > > 
> > > In all the cycles I have tested I see that only in the fail case
> > > handle_cmd_completion() gets called, but in the cycles where the error
> > > is not there handle_cmd_completion() is not called with that command.
> > > 
> > > I am not sure what is happening, and you are the best person to understand
> > > what is happening. :)
> > > 
> > > > But for now (until you are back from holiday and suggest a proper 
> > > > solution),
> > > > I made a hacky patch (attached) which is working and I do not get any
> > > corruption after that. Both KASAN and slub debug are also happy.
> > > 
> > > So, now waiting for you to analyze what is going on and suggest a proper
> > > fix.
> > > 
> > > Thanks in advance.
> > > 
> > > --
> > > Regards
> > > Sudip
> > 
> > > [  236.814156] 
> > > ==
> > > [  236.814187] BUG: KASAN: use-after-free in 
> > > xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
> > > [  236.814193] Read of size 8 at addr 8800789329c8 by task weston/138
> > > 
> > > [  236.814203] CPU: 0 PID: 138 Comm: weston Tainted: G U  W  O
> > > 4.14.47-20180606+ #7
> > > [  236.814206] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017
> > > [  236.814209] Call Trace:
> > > [  236.814214]  
> > > [  236.814226]  dump_stack+0x46/0x59
> > > [  236.814238]  print_address_description+0x6b/0x23b
> > > [  236.814255]  ? xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
> > > [  236.814262]  kasan_report+0x220/0x246
> > > [  236.814278]  xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
> > > [  236.814294]  trb_in_td+0x3b/0x1cd [xhci_hcd]
> > > [  236.814311]  handle_cmd_completion+0x1181/0x2c9b [xhci_hcd]
> > > [  236.814329]  ? xhci_queue_new_dequeue_state+0x5d9/0x5d9 [xhci_hcd]
> > > [  236.814337]  ? drm_handle_vblank+0x4ec/0x590
> > > [  236.814352]  xhci_irq+0x529/0x3294 [xhci_hcd]
> > > [  236.814362]  ? __accumulate_pelt_segments+0x24/0x33
> > > [  236.814378]  ? finish_td.isra.40+0x223/0x223 [xhci_hcd]
> > > [  236.814384]  ? __accumulate_pelt_segments+0x24/0x33
> > > [  236.814390]  ? __accumulate_pelt_segments+0x24/0x33
> > > [  236.814405]  ? xhci_irq+0x3294/0x3294 [xhci_hcd]
> > > [  236.814412]  __handle_irq_event_percpu+0x149/0x3db
> > > [  236.814421]  handle_irq_event_percpu+0x65/0x109
> > > [  236.814428]  ? __handle_irq_event_percpu+0x3db/0x3db
> > > [  236.814436]  ? ttwu_do_wakeup.isra.18+0x3a2/0x3ce
> > > [  236.814442]  handle_irq_event+0xa8/0x10a
> > > [  236.814449]  handle_edge_irq+0x4b2/0x538
> > > [  236.814458]  handle_irq+0x3e/0x45
> > > [  236.814465]  do_IRQ+0x5c/0x126
> > > [  236.814474]  common_interrupt+0x7a/0x7a
> > > [  236.814478]  
> > > [  236.814483] RIP: 0023:0xf79d3d82
> > > [  236.814486] RSP: 002b:ffc588e8 EFLAGS: 00200282 ORIG_RAX: 
> > > ffdc
> > > [  236.814493] RAX:  RBX: f7bebd5c RCX: 
> > > 
> > > [  236.814496] RDX: 08d4197c RSI:  RDI: 
> > > f746c020
> > > [  236.814499] RBP: ffc588e8 R08:  R09: 
> > > 
> > > [  236.814503] R10:  R11: 00200206 R12: 
> > > 
> > > [  236.814506] R13:  R14:  R15: 
> > > 
> > > 
> > > [  236.814513] Allocated by task 2082:
> > > [  236.814521]  kasan_kmalloc.part.1+0x51/0xc7
> > > [  236.814526]  kmem_cache_alloc_trace+0x178/0x187
> > > [  236.814540]  xhci_segment_alloc.isra.11+0x9d/0x3bf [xhci_hcd]
> > > [  236.814553]  xhci_alloc_segments_for_ring+0x9e/0x176 [xhci_hcd]
> > > [  236.814566]  

Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
Hi Greg,

On Tue, Jul 17, 2018 at 02:04:11PM +0200, Greg KH wrote:
> On Tue, Jul 17, 2018 at 12:41:04PM +0100, Sudip Mukherjee wrote:
> > Hi Mathias,
> > 
> > On Sat, Jun 30, 2018 at 10:07:04PM +0100, Sudip Mukherjee wrote:
> > > Hi Mathias,
> > > 
> > > On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote:
> > > > On 27.06.2018 14:59, Sudip Mukherjee wrote:
> > > > > > > Can you share a bit more details on the platform you are using, 
> > > > > > > and what types of test you are running.
> > > > > > 
> > 
> > > Then to track what is going on, I added the slub debugging and :(
> > > I have attached part of dmesg for you to check.
> > > Will appreciate your help in finding out the problem.
> > 
> > I did some more debugging. Tested with a KASAN enabled kernel and that
> > shows the problem. The report is attached.
> > 
> > To my understanding:
> > 
> > btusb_work() is calling usb_set_interface() with alternate = 0. which
> > again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> > xhci_free_endpoint_ring(). But then usb_set_interface() continues and
> > calls usb_disable_interface() -> usb_hcd_flush_endpoint()->unlink1()->
> > xhci_urb_dequeue() which at the end gives the command to stop endpoint.
> > 
> > In all the cycles I have tested I see that only in the fail case
> > handle_cmd_completion() gets called, but in the cycles where the error
> > is not there handle_cmd_completion() is not called with that command.
> > 
> > I am not sure what is happening, and you are the best person to understand
> > what is happening. :)
> > 
> > > But for now (until you are back from holiday and suggest a proper 
> > > solution),
> > > I made a hacky patch (attached) which is working and I do not get any
> > corruption after that. Both KASAN and slub debug are also happy.
> > 
> > So, now waiting for you to analyze what is going on and suggest a proper
> > fix.
> > 
> > Thanks in advance.
> > 
> > --
> > Regards
> > Sudip
> 
> > [  236.814156] 
> > ==
> > [  236.814187] BUG: KASAN: use-after-free in xhci_trb_virt_to_dma+0x2e/0x74 
> > [xhci_hcd]
> > [  236.814193] Read of size 8 at addr 8800789329c8 by task weston/138
> > 
> > [  236.814203] CPU: 0 PID: 138 Comm: weston Tainted: G U  W  O
> > 4.14.47-20180606+ #7
> > [  236.814206] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017
> > [  236.814209] Call Trace:
> > [  236.814214]  
> > [  236.814226]  dump_stack+0x46/0x59
> > [  236.814238]  print_address_description+0x6b/0x23b
> > [  236.814255]  ? xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
> > [  236.814262]  kasan_report+0x220/0x246
> > [  236.814278]  xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
> > [  236.814294]  trb_in_td+0x3b/0x1cd [xhci_hcd]
> > [  236.814311]  handle_cmd_completion+0x1181/0x2c9b [xhci_hcd]
> > [  236.814329]  ? xhci_queue_new_dequeue_state+0x5d9/0x5d9 [xhci_hcd]
> > [  236.814337]  ? drm_handle_vblank+0x4ec/0x590
> > [  236.814352]  xhci_irq+0x529/0x3294 [xhci_hcd]
> > [  236.814362]  ? __accumulate_pelt_segments+0x24/0x33
> > [  236.814378]  ? finish_td.isra.40+0x223/0x223 [xhci_hcd]
> > [  236.814384]  ? __accumulate_pelt_segments+0x24/0x33
> > [  236.814390]  ? __accumulate_pelt_segments+0x24/0x33
> > [  236.814405]  ? xhci_irq+0x3294/0x3294 [xhci_hcd]
> > [  236.814412]  __handle_irq_event_percpu+0x149/0x3db
> > [  236.814421]  handle_irq_event_percpu+0x65/0x109
> > [  236.814428]  ? __handle_irq_event_percpu+0x3db/0x3db
> > [  236.814436]  ? ttwu_do_wakeup.isra.18+0x3a2/0x3ce
> > [  236.814442]  handle_irq_event+0xa8/0x10a
> > [  236.814449]  handle_edge_irq+0x4b2/0x538
> > [  236.814458]  handle_irq+0x3e/0x45
> > [  236.814465]  do_IRQ+0x5c/0x126
> > [  236.814474]  common_interrupt+0x7a/0x7a
> > [  236.814478]  
> > [  236.814483] RIP: 0023:0xf79d3d82
> > [  236.814486] RSP: 002b:ffc588e8 EFLAGS: 00200282 ORIG_RAX: 
> > ffdc
> > [  236.814493] RAX:  RBX: f7bebd5c RCX: 
> > 
> > [  236.814496] RDX: 08d4197c RSI:  RDI: 
> > f746c020
> > [  236.814499] RBP: ffc588e8 R08:  R09: 
> > 
> > [  236.814503] R10:  R11: 00200206 R12: 
> > 
> > [  236.814506] R13:  R14:  R15: 
> > 
> > 
> > [  236.814513] Allocated by task 2082:
> > [  236.814521]  kasan_kmalloc.part.1+0x51/0xc7
> > [  236.814526]  kmem_cache_alloc_trace+0x178/0x187
> > [  236.814540]  xhci_segment_alloc.isra.11+0x9d/0x3bf [xhci_hcd]
> > [  236.814553]  xhci_alloc_segments_for_ring+0x9e/0x176 [xhci_hcd]
> > [  236.814566]  xhci_ring_alloc.constprop.16+0x197/0x4ba [xhci_hcd]
> > [  236.814579]  xhci_endpoint_init+0x77a/0x9ba [xhci_hcd]
> > [  236.814592]  xhci_add_endpoint+0x3bc/0x43b [xhci_hcd]
> > [  236.814615]  usb_hcd_alloc_bandwidth+0x7ef/0x857 [usbcore]
> > [  236.814637]  usb_set_interface+0x294/0x681 [usbcore]
> 

Re: usb HC busted?

2018-07-17 Thread Greg KH
On Tue, Jul 17, 2018 at 12:41:04PM +0100, Sudip Mukherjee wrote:
> Hi Mathias,
> 
> On Sat, Jun 30, 2018 at 10:07:04PM +0100, Sudip Mukherjee wrote:
> > Hi Mathias,
> > 
> > On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote:
> > > On 27.06.2018 14:59, Sudip Mukherjee wrote:
> > > > > > Can you share a bit more details on the platform you are using, and 
> > > > > > what types of test you are running.
> > > > > 
> 
> > Then to track what is going on, I added the slub debugging and :(
> > I have attached part of dmesg for you to check.
> > Will appreciate your help in finding out the problem.
> 
> I did some more debugging. Tested with a KASAN enabled kernel and that
> shows the problem. The report is attached.
> 
> To my understanding:
> 
> btusb_work() is calling usb_set_interface() with alternate = 0. which
> again calls usb_hcd_alloc_bandwidth() and that frees the rings by
> xhci_free_endpoint_ring(). But then usb_set_interface() continues and
> calls usb_disable_interface() -> usb_hcd_flush_endpoint()->unlink1()->
> xhci_urb_dequeue() which at the end gives the command to stop endpoint.
> 
> In all the cycles I have tested I see that only in the fail case
> handle_cmd_completion() gets called, but in the cycles where the error
> is not there handle_cmd_completion() is not called with that command.
> 
> I am not sure what is happening, and you are the best person to understand
> what is happening. :)
> 
> But for now (until you are back from holiday and suggest a proper solution),
> I made a hacky patch (attached) which is working and I do not get any
> corruption after that. Both KASAN and slub debug are also happy.
> 
> So, now waiting for you to analyze what is going on and suggest a proper
> fix.
> 
> Thanks in advance.
> 
> --
> Regards
> Sudip

> [  236.814156] 
> ==
> [  236.814187] BUG: KASAN: use-after-free in xhci_trb_virt_to_dma+0x2e/0x74 
> [xhci_hcd]
> [  236.814193] Read of size 8 at addr 8800789329c8 by task weston/138
> 
> [  236.814203] CPU: 0 PID: 138 Comm: weston Tainted: G U  W  O
> 4.14.47-20180606+ #7
> [  236.814206] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017
> [  236.814209] Call Trace:
> [  236.814214]  
> [  236.814226]  dump_stack+0x46/0x59
> [  236.814238]  print_address_description+0x6b/0x23b
> [  236.814255]  ? xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
> [  236.814262]  kasan_report+0x220/0x246
> [  236.814278]  xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
> [  236.814294]  trb_in_td+0x3b/0x1cd [xhci_hcd]
> [  236.814311]  handle_cmd_completion+0x1181/0x2c9b [xhci_hcd]
> [  236.814329]  ? xhci_queue_new_dequeue_state+0x5d9/0x5d9 [xhci_hcd]
> [  236.814337]  ? drm_handle_vblank+0x4ec/0x590
> [  236.814352]  xhci_irq+0x529/0x3294 [xhci_hcd]
> [  236.814362]  ? __accumulate_pelt_segments+0x24/0x33
> [  236.814378]  ? finish_td.isra.40+0x223/0x223 [xhci_hcd]
> [  236.814384]  ? __accumulate_pelt_segments+0x24/0x33
> [  236.814390]  ? __accumulate_pelt_segments+0x24/0x33
> [  236.814405]  ? xhci_irq+0x3294/0x3294 [xhci_hcd]
> [  236.814412]  __handle_irq_event_percpu+0x149/0x3db
> [  236.814421]  handle_irq_event_percpu+0x65/0x109
> [  236.814428]  ? __handle_irq_event_percpu+0x3db/0x3db
> [  236.814436]  ? ttwu_do_wakeup.isra.18+0x3a2/0x3ce
> [  236.814442]  handle_irq_event+0xa8/0x10a
> [  236.814449]  handle_edge_irq+0x4b2/0x538
> [  236.814458]  handle_irq+0x3e/0x45
> [  236.814465]  do_IRQ+0x5c/0x126
> [  236.814474]  common_interrupt+0x7a/0x7a
> [  236.814478]  
> [  236.814483] RIP: 0023:0xf79d3d82
> [  236.814486] RSP: 002b:ffc588e8 EFLAGS: 00200282 ORIG_RAX: 
> ffdc
> [  236.814493] RAX:  RBX: f7bebd5c RCX: 
> 
> [  236.814496] RDX: 08d4197c RSI:  RDI: 
> f746c020
> [  236.814499] RBP: ffc588e8 R08:  R09: 
> 
> [  236.814503] R10:  R11: 00200206 R12: 
> 
> [  236.814506] R13:  R14:  R15: 
> 
> 
> [  236.814513] Allocated by task 2082:
> [  236.814521]  kasan_kmalloc.part.1+0x51/0xc7
> [  236.814526]  kmem_cache_alloc_trace+0x178/0x187
> [  236.814540]  xhci_segment_alloc.isra.11+0x9d/0x3bf [xhci_hcd]
> [  236.814553]  xhci_alloc_segments_for_ring+0x9e/0x176 [xhci_hcd]
> [  236.814566]  xhci_ring_alloc.constprop.16+0x197/0x4ba [xhci_hcd]
> [  236.814579]  xhci_endpoint_init+0x77a/0x9ba [xhci_hcd]
> [  236.814592]  xhci_add_endpoint+0x3bc/0x43b [xhci_hcd]
> [  236.814615]  usb_hcd_alloc_bandwidth+0x7ef/0x857 [usbcore]
> [  236.814637]  usb_set_interface+0x294/0x681 [usbcore]
> [  236.814645]  btusb_work+0x2e6/0x981 [btusb]
> [  236.814651]  process_one_work+0x579/0x9e9
> [  236.814656]  worker_thread+0x68f/0x804
> [  236.814662]  kthread+0x31c/0x32b
> [  236.814668]  ret_from_fork+0x35/0x40
> 
> [  236.814672] Freed by task 1533:
> [  236.814678]  

Re: usb HC busted?

2018-07-17 Thread Sudip Mukherjee
Hi Mathias,

On Sat, Jun 30, 2018 at 10:07:04PM +0100, Sudip Mukherjee wrote:
> Hi Mathias,
> 
> On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote:
> > On 27.06.2018 14:59, Sudip Mukherjee wrote:
> > > > > Can you share a bit more details on the platform you are using, and 
> > > > > what types of test you are running.
> > > > 

> Then to track what is going on, I added the slub debugging and :(
> I have attached part of dmesg for you to check.
> Will appreciate your help in finding out the problem.

I did some more debugging. Tested with a KASAN enabled kernel and that
shows the problem. The report is attached.

To my understanding:

btusb_work() calls usb_set_interface() with alternate = 0, which in turn
calls usb_hcd_alloc_bandwidth(), and that frees the rings via
xhci_free_endpoint_ring(). But then usb_set_interface() continues and
calls usb_disable_interface() -> usb_hcd_flush_endpoint() -> unlink1() ->
xhci_urb_dequeue(), which at the end queues the Stop Endpoint command.

In all the cycles I have tested, handle_cmd_completion() gets called for
that command only in the failing case; in the cycles where the error does
not show up, handle_cmd_completion() is not called for that command.

I am not sure what is happening, and you are the best person to understand
what is happening. :)

But for now (until you are back from holiday and suggest a proper solution),
I made a hacky patch (attached) which is working, and I do not get any
corruption after that. Both KASAN and slub debug are also happy.

So, now waiting for you to analyze what is going on and suggest a proper
fix.

Thanks in advance.

--
Regards
Sudip
[  236.814156] ==
[  236.814187] BUG: KASAN: use-after-free in xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
[  236.814193] Read of size 8 at addr 8800789329c8 by task weston/138

[  236.814203] CPU: 0 PID: 138 Comm: weston Tainted: G U  W  O 4.14.47-20180606+ #7
[  236.814206] Hardware name: xxx, BIOS 2017.01-00087-g43e04de 08/30/2017
[  236.814209] Call Trace:
[  236.814214]  
[  236.814226]  dump_stack+0x46/0x59
[  236.814238]  print_address_description+0x6b/0x23b
[  236.814255]  ? xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
[  236.814262]  kasan_report+0x220/0x246
[  236.814278]  xhci_trb_virt_to_dma+0x2e/0x74 [xhci_hcd]
[  236.814294]  trb_in_td+0x3b/0x1cd [xhci_hcd]
[  236.814311]  handle_cmd_completion+0x1181/0x2c9b [xhci_hcd]
[  236.814329]  ? xhci_queue_new_dequeue_state+0x5d9/0x5d9 [xhci_hcd]
[  236.814337]  ? drm_handle_vblank+0x4ec/0x590
[  236.814352]  xhci_irq+0x529/0x3294 [xhci_hcd]
[  236.814362]  ? __accumulate_pelt_segments+0x24/0x33
[  236.814378]  ? finish_td.isra.40+0x223/0x223 [xhci_hcd]
[  236.814384]  ? __accumulate_pelt_segments+0x24/0x33
[  236.814390]  ? __accumulate_pelt_segments+0x24/0x33
[  236.814405]  ? xhci_irq+0x3294/0x3294 [xhci_hcd]
[  236.814412]  __handle_irq_event_percpu+0x149/0x3db
[  236.814421]  handle_irq_event_percpu+0x65/0x109
[  236.814428]  ? __handle_irq_event_percpu+0x3db/0x3db
[  236.814436]  ? ttwu_do_wakeup.isra.18+0x3a2/0x3ce
[  236.814442]  handle_irq_event+0xa8/0x10a
[  236.814449]  handle_edge_irq+0x4b2/0x538
[  236.814458]  handle_irq+0x3e/0x45
[  236.814465]  do_IRQ+0x5c/0x126
[  236.814474]  common_interrupt+0x7a/0x7a
[  236.814478]  
[  236.814483] RIP: 0023:0xf79d3d82
[  236.814486] RSP: 002b:ffc588e8 EFLAGS: 00200282 ORIG_RAX: ffdc
[  236.814493] RAX:  RBX: f7bebd5c RCX: 
[  236.814496] RDX: 08d4197c RSI:  RDI: f746c020
[  236.814499] RBP: ffc588e8 R08:  R09: 
[  236.814503] R10:  R11: 00200206 R12: 
[  236.814506] R13:  R14:  R15: 

[  236.814513] Allocated by task 2082:
[  236.814521]  kasan_kmalloc.part.1+0x51/0xc7
[  236.814526]  kmem_cache_alloc_trace+0x178/0x187
[  236.814540]  xhci_segment_alloc.isra.11+0x9d/0x3bf [xhci_hcd]
[  236.814553]  xhci_alloc_segments_for_ring+0x9e/0x176 [xhci_hcd]
[  236.814566]  xhci_ring_alloc.constprop.16+0x197/0x4ba [xhci_hcd]
[  236.814579]  xhci_endpoint_init+0x77a/0x9ba [xhci_hcd]
[  236.814592]  xhci_add_endpoint+0x3bc/0x43b [xhci_hcd]
[  236.814615]  usb_hcd_alloc_bandwidth+0x7ef/0x857 [usbcore]
[  236.814637]  usb_set_interface+0x294/0x681 [usbcore]
[  236.814645]  btusb_work+0x2e6/0x981 [btusb]
[  236.814651]  process_one_work+0x579/0x9e9
[  236.814656]  worker_thread+0x68f/0x804
[  236.814662]  kthread+0x31c/0x32b
[  236.814668]  ret_from_fork+0x35/0x40

[  236.814672] Freed by task 1533:
[  236.814678]  kasan_slab_free+0xb3/0x15e
[  236.814683]  kfree+0x103/0x1a9
[  236.814696]  xhci_ring_free+0x205/0x286 [xhci_hcd]
[  236.814709]  xhci_free_endpoint_ring+0x4d/0x83 [xhci_hcd]
[  236.814722]  xhci_check_bandwidth+0x57b/0x65a [xhci_hcd]
[  236.814743]  usb_hcd_alloc_bandwidth+0x665/0x857 [usbcore]
[  

Re: usb HC busted?

2018-06-30 Thread Sudip Mukherjee
Hi Mathias,

On Fri, Jun 29, 2018 at 02:41:13PM +0300, Mathias Nyman wrote:
> On 27.06.2018 14:59, Sudip Mukherjee wrote:
> > > > Can you share a bit more details on the platform you are using, and 
> > > > what types of test you are running.
> > > 
> > > It is a board based on "Intel(R) Atom(TM) CPU  E3840  @ 1.91GHz".
> > > The usb device in question is a bluetooth device:
> > > 

> > 
> 
> There is however freeing of the same dma address:
> 
> <...>-28448 [003]    492.025808: xhci_ring_free: ISOC f1ffb700: enq 
> 0x2d31bcc0(0x2d31b000) deq 
> 0x2d31b000(0x2d31b000) segs 2 stream 0 free_trbs 305 bounce 
> 17 cycle 0
> <...>-28448 [003]    492.025818: xhci_ring_mem_detail: MATTU xhci segment 
> free seg->dma @ 0x2d31b000
> <...>-28448 [003]    492.025823: xhci_ring_mem_detail: MATTU xhci segment 
> free seg->dma @ 0x2d31b000
> <...>-28448 [003]    492.025826: xhci_ring_free: ISOC f1f9b380: enq 
> 0x2d31b140(0x2d31b000) deq 
> 0x2d31b000(0x2d31b000) segs 2 stream 0 free_trbs 489 bounce 
> 17 cycle 1
> <...>-28448 [003]    492.025828: xhci_ring_mem_detail: MATTU xhci segment 
> free seg->dma @ 0x2d31b000
> <...>-28448 [003]    492.025830: xhci_ring_mem_detail: MATTU xhci segment 
> free seg->dma @ 0x2d31b000
> 
> I'd guess it's still the same cause, maybe trace is not complete?

It is either multiple freeing of the same address, or multiple allocation
of the same address, or a combination of both. To track the multiple
allocation I added some extra debugging, and it seems that the multiple
allocation only happens when someone accesses that memory and sets the
first 4 bytes (which hold the offset data) to 0. I have not yet checked
under what condition it tries to free the same address more than once.
Then to track what is going on, I added the slub debugging and :(
I have attached part of dmesg for you to check.
Will appreciate your help in finding out the problem.


--
Regards
Sudip
[  383.096204] =
[  383.096212] BUG kmalloc-96 (Tainted: G U O   ): Poison overwritten
[  383.096213] -

[  383.096215] Disabling lock debugging due to kernel taint
[  383.096218] INFO: 0xdccd1b78-0xdccd1b7f. First byte 0x78 instead of 0x6b
[  383.096232] INFO: Allocated in xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd] age=227516 cpu=2 pid=21
[  383.096240]  ___slab_alloc.constprop.24+0x1fc/0x292
[  383.096243]  __slab_alloc.isra.18.constprop.23+0x1c/0x25
[  383.096246]  kmem_cache_alloc_trace+0x78/0x141
[  383.096252]  xhci_ring_alloc.constprop.14+0x31/0x125 [xhci_hcd]
[  383.096259]  xhci_endpoint_init+0x25f/0x30a [xhci_hcd]
[  383.096265]  xhci_add_endpoint+0x126/0x149 [xhci_hcd]
[  383.096276]  usb_hcd_alloc_bandwidth+0x26a/0x2a0 [usbcore]
[  383.096287]  usb_set_interface+0xeb/0x25d [usbcore]
[  383.096292]  btusb_work+0xeb/0x324 [btusb]
[  383.096296]  process_one_work+0x163/0x2b2
[  383.096299]  worker_thread+0x1a9/0x25c
[  383.096301]  kthread+0xf8/0xfd
[  383.096306]  ret_from_fork+0x2e/0x38
[  383.096314] INFO: Freed in xhci_ring_free+0xa7/0xc6 [xhci_hcd] age=197020 cpu=0 pid=324
[  383.096317]  __slab_free+0x4b/0x27a
[  383.096319]  kfree+0x12e/0x155
[  383.096325]  xhci_ring_free+0xa7/0xc6 [xhci_hcd]
[  383.096331]  xhci_free_endpoint_ring+0x16/0x20 [xhci_hcd]
[  383.096338]  xhci_check_bandwidth+0x1bf/0x20e [xhci_hcd]
[  383.096348]  usb_hcd_alloc_bandwidth+0x205/0x2a0 [usbcore]
[  383.096358]  usb_set_interface+0xeb/0x25d [usbcore]
[  383.096361]  btusb_work+0x228/0x324 [btusb]
[  383.096364]  process_one_work+0x163/0x2b2
[  383.096367]  worker_thread+0x1a9/0x25c
[  383.096370]  kthread+0xf8/0xfd
[  383.096373]  ret_from_fork+0x2e/0x38
[  383.096376] INFO: Slab 0xf457e080 objects=29 used=29 fp=0x  (null) flags=0x40008100
[  383.096379] INFO: Object 0xdccd1b60 @offset=7008 fp=0xdccd0350

[  383.096383] Redzone dccd1b58: bb bb bb bb bb bb bb bb
[  383.096386] Object dccd1b60: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[  383.096388] Object dccd1b70: 6b 6b 6b 6b 6b 6b 6b 6b 78 1b cd dc 78 1b cd dc
[  383.096390] Object dccd1b80: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[  383.096393] Object dccd1b90: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[  383.096395] Object dccd1ba0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
[  383.096397] Object dccd1bb0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5
[  383.096400] Redzone dccd1bc0: bb bb bb bb
[  383.096402] Padding dccd1c68: 5a 5a 5a 5a 5a 5a 5a 5a
[  383.096407] CPU: 2 PID: 133 Comm: weston Tainted: GBU O
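
A small sketch of how the overwritten bytes in the report above can be read. It assumes a 32-bit little-endian kernel (the object addresses in this report are 32 bits wide) and the SLUB free poison byte 0x6b; the helper names are made up for illustration and are not part of any kernel tool.

```python
import struct

POISON_FREE_WORD = 0x6B6B6B6B  # four bytes of the SLUB free poison (0x6b)

def parse_object_line(line):
    """Split 'Object dccd1b70: 6b 6b ...' into (start address, raw bytes)."""
    head, _, rest = line.partition(":")
    addr = int(head.split()[-1], 16)
    data = bytes(int(tok, 16) for tok in rest.split())
    return addr, data

def overwritten_words(addr, data):
    """Return (address, value) for every 32-bit word that is not free poison."""
    words = struct.unpack("<%dI" % (len(data) // 4), data)
    return [(addr + 4 * i, w) for i, w in enumerate(words)
            if w != POISON_FREE_WORD]

def looks_like_empty_list_head(words):
    """True if two adjacent pointers (next, prev) both point at the first one."""
    if len(words) != 2:
        return False
    (a0, v0), (a1, v1) = words
    return a1 == a0 + 4 and v0 == a0 and v1 == a0

# Hex column of the one object line that is not pure poison.
LINE = "Object dccd1b70: 6b 6b 6b 6b 6b 6b 6b 6b 78 1b cd dc 78 1b cd dc"

addr, data = parse_object_line(LINE)
words = overwritten_words(addr, data)
for a, v in words:
    print("0x%08x holds 0x%08x" % (a, v))
print("empty list_head pattern:", looks_like_empty_list_head(words))
```

For this line the two non-poison words sit at 0xdccd1b78 and 0xdccd1b7c and both hold 0xdccd1b78, which is what a struct list_head looks like after INIT_LIST_HEAD() or list_del_init() with 32-bit pointers. That is only one possible reading of the dump.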

Re: usb HC busted?

2018-06-29 Thread Mathias Nyman

On 27.06.2018 14:59, Sudip Mukherjee wrote:
> > > Can you share a bit more details on the platform you are using, and what
> > > types of test you are running.
> >
> > It is a board based on "Intel(R) Atom(TM) CPU  E3840  @ 1.91GHz".
> > The usb device in question is a bluetooth device:
> >
> > Bus 001 Device 012: ID 8087:07dc Intel Corp.
> >
> > And the problem that we are seeing is with phone calls via bluetooth.
> >
> > > Does my test above trigger the case? (show "MATTU dmatest match!")
> >
> > I have kept it for tonight, will see the results tomorrow morning.
> > And I am using that same device in the usb script to change "authrized".
>
> No, your test did not trigger the error. :(
>
> But, my last night's test (with an added debug to get some extra trace for
> addresses) showed the same error of -
> "Looking for event-dma", but looking at the ftrace, I could not see it getting
> same address from dma_pool_zalloc().
>
> Can you please have a look at the dmesg and ftrace at:
> https://drive.google.com/open?id=1nMy_qVxOQzcZNYa9bw7az9WiS2MZzdKo

There is however freeing of the same dma address:

<...>-28448 [003]    492.025808: xhci_ring_free: ISOC f1ffb700: enq 0x2d31bcc0(0x2d31b000) deq 0x2d31b000(0x2d31b000) segs 2 stream 0 free_trbs 305 bounce 17 cycle 0
<...>-28448 [003]    492.025818: xhci_ring_mem_detail: MATTU xhci segment free seg->dma @ 0x2d31b000
<...>-28448 [003]    492.025823: xhci_ring_mem_detail: MATTU xhci segment free seg->dma @ 0x2d31b000
<...>-28448 [003]    492.025826: xhci_ring_free: ISOC f1f9b380: enq 0x2d31b140(0x2d31b000) deq 0x2d31b000(0x2d31b000) segs 2 stream 0 free_trbs 489 bounce 17 cycle 1
<...>-28448 [003]    492.025828: xhci_ring_mem_detail: MATTU xhci segment free seg->dma @ 0x2d31b000
<...>-28448 [003]    492.025830: xhci_ring_mem_detail: MATTU xhci segment free seg->dma @ 0x2d31b000

I'd guess it's still the same cause, maybe trace is not complete?
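
A rough way to pull repeated frees out of a large trace, as a sketch only. It assumes the custom "MATTU xhci segment free seg->dma @ <addr>" print format shown above; the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches the custom free print; \s+ also tolerates lines that a mail
# client or pager has wrapped in the middle of the message.
FREE_RE = re.compile(
    r"MATTU\s+xhci\s+segment\s+free\s+seg->dma\s+@\s+((?:0x)?[0-9a-f]+)")

def repeated_frees(trace_text):
    """Return {dma address: count} for addresses freed more than once."""
    counts = Counter(int(m.group(1), 16)
                     for m in FREE_RE.finditer(trace_text))
    return {addr: n for addr, n in counts.items() if n > 1}

# The four free events from the trace above, with wrapped lines.
SAMPLE = """\
<...>-28448 [003]    492.025818: xhci_ring_mem_detail: MATTU xhci segment free
seg->dma @ 0x2d31b000
<...>-28448 [003]    492.025823: xhci_ring_mem_detail: MATTU xhci segment free
seg->dma @ 0x2d31b000
<...>-28448 [003]    492.025828: xhci_ring_mem_detail: MATTU xhci segment free
seg->dma @ 0x2d31b000
<...>-28448 [003]    492.025830: xhci_ring_mem_detail: MATTU xhci segment free
seg->dma @ 0x2d31b000
"""

for addr, n in repeated_frees(SAMPLE).items():
    print("0x%08x freed %d times" % (addr, n))
```

A count above one is only a hint: an address can be freed, handed out again and freed again legitimately, so the allocation events in between still need to be checked.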

-Mathias
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: usb HC busted?

2018-06-27 Thread Sudip Mukherjee


On Wed, Jun 27, 2018 at 12:59:48PM +0100, Sudip Mukherjee wrote:
> Hi Mathias,
> 
> On Mon, Jun 25, 2018 at 05:15:00PM +0100, Sudip Mukherjee wrote:
> > Hi Mathias,
> > 
> > On Thu, Jun 21, 2018 at 02:01:30PM +0300, Mathias Nyman wrote:
> > > On 21.06.2018 03:53, Sudip Mukherjee wrote:
> > > > Hi Mathias, Andy,
> > > > 
> > > > On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote:
> > > > > On 06.06.2018 19:45, Sudip Mukherjee wrote:
> 
> > > 
> > > Can you share a bit more details on the platform you are using, and what 
> > > types of test you are running.
> > 
> > Sorry for the delayed reply, I was in Tokyo for the OSS.
> > 
> > It is a board based on "Intel(R) Atom(TM) CPU  E3840  @ 1.91GHz".
> > The usb device in question is a bluetooth device:
> > 
> > Bus 001 Device 012: ID 8087:07dc Intel Corp.
> 
> > 
> > And the problem that we are seeing is with phone calls via bluetooth.
> > 
> > > Does my test above trigger the case? (show "MATTU dmatest match!")
> > 
> > I have kept it for tonight, will see the results tomorrow morning.
> > And I am using that same device in the usb script to change "authrized".
> 
> No, your test did not trigger the error. :(
> 
> But, my last night's test (with an added debug to get some extra trace for 
> addresses) showed the same error of -
> "Looking for event-dma", but looking at the ftrace, I could not see it 
> getting same address from dma_pool_zalloc().
> 
> Can you please have a look at the dmesg and ftrace at:
> https://drive.google.com/open?id=1nMy_qVxOQzcZNYa9bw7az9WiS2MZzdKo

And to add to my previous mail, in another cycle where I do see the
same problem, my extra debug prints give the following:

   <...>-23974 [002]    495.991276: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d21c000
   <...>-23974 [002]    495.991285: xhci_ring_mem_detail: SUDIP page details dma=0x2d21c000, vaddr=ed21c000, inuse=1, offset=0
   <...>-23974 [002]    495.991289: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d21c000
   <...>-23974 [002]    495.991292: xhci_ring_mem_detail: SUDIP page details dma=0x2d21c000, vaddr=ed21c000, inuse=2, offset=0
   <...>-23974 [002]    495.991295: xhci_ring_alloc: ISOC f0b62900: enq 0x2d21c000(0x2d21c000) deq 0x2d21c000(0x2d21c000) segs 2 stream 0 free_trbs 509 bounce 17 cycle 1
   <...>-23974 [002]    495.991298: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d21c000
   <...>-23974 [002]    495.991301: xhci_ring_mem_detail: SUDIP page details dma=0x2d21c000, vaddr=ed21c000, inuse=3, offset=0
   <...>-23974 [002]    495.991304: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d21c000
   <...>-23974 [002]    495.991306: xhci_ring_mem_detail: SUDIP page details dma=0x2d21c000, vaddr=ed21c000, inuse=4, offset=0


I am totally lost now. Are we looking at two different issues?
This log shows same addresses, my previous mail and log did not show
the same addresses. :(


--
Regards
Sudip


Re: usb HC busted?

2018-06-27 Thread Sudip Mukherjee
Hi Mathias,

On Mon, Jun 25, 2018 at 05:15:00PM +0100, Sudip Mukherjee wrote:
> Hi Mathias,
> 
> On Thu, Jun 21, 2018 at 02:01:30PM +0300, Mathias Nyman wrote:
> > On 21.06.2018 03:53, Sudip Mukherjee wrote:
> > > Hi Mathias, Andy,
> > > 
> > > On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote:
> > > > On 06.06.2018 19:45, Sudip Mukherjee wrote:

> > 
> > Can you share a bit more details on the platform you are using, and what 
> > types of test you are running.
> 
> Sorry for the delayed reply, I was in Tokyo for the OSS.
> 
> It is a board based on "Intel(R) Atom(TM) CPU  E3840  @ 1.91GHz".
> The usb device in question is a bluetooth device:
> 
> Bus 001 Device 012: ID 8087:07dc Intel Corp.

> 
> And the problem that we are seeing is with phone calls via bluetooth.
> 
> > Does my test above trigger the case? (show "MATTU dmatest match!")
> 
> I have kept it for tonight, will see the results tomorrow morning.
> And I am using that same device in the usb script to change "authrized".

No, your test did not trigger the error. :(

But, my last night's test (with an added debug to get some extra trace for 
addresses) showed the same error of -
"Looking for event-dma", but looking at the ftrace, I could not see it getting 
same address from dma_pool_zalloc().

Can you please have a look at the dmesg and ftrace at:
https://drive.google.com/open?id=1nMy_qVxOQzcZNYa9bw7az9WiS2MZzdKo


--
Regards
Sudip


Re: usb HC busted?

2018-06-25 Thread Sudip Mukherjee
Hi Mathias,

On Thu, Jun 21, 2018 at 02:01:30PM +0300, Mathias Nyman wrote:
> On 21.06.2018 03:53, Sudip Mukherjee wrote:
> > Hi Mathias, Andy,
> > 
> > On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote:
> > > On 06.06.2018 19:45, Sudip Mukherjee wrote:

> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git dmapool-test
> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=dmapool-test
> 
> Tested by just leaving the following running for a few days:
> 
> while true; do echo 0 > authorized; sleep 3; echo 1 > authorized; sleep 3; 
> done;
> For some usb device (for example: /sys/bus/usb/devices/1-8)
> 
> Then grep logs for "MATTU dmatest match! "
> 
> Can you share a bit more details on the platform you are using, and what 
> types of test you are running.

Sorry for the delayed reply, I was in Tokyo for the OSS.

It is a board based on "Intel(R) Atom(TM) CPU  E3840  @ 1.91GHz".
The usb device in question is a bluetooth device:

Bus 001 Device 012: ID 8087:07dc Intel Corp.
Device Descriptor:
  bLength18
  bDescriptorType 1
  bcdUSB   2.00
  bDeviceClass  224 Wireless
  bDeviceSubClass 1 Radio Frequency
  bDeviceProtocol 1 Bluetooth
  bMaxPacketSize064
  idVendor   0x8087 Intel Corp.
  idProduct  0x07dc
  bcdDevice0.01
  iManufacturer   0
  iProduct0
  iSerial 0
  bNumConfigurations  1

And the problem that we are seeing is with phone calls via bluetooth.

> Does my test above trigger the case? (show "MATTU dmatest match!")

I have kept it for tonight, will see the results tomorrow morning.
And I am using that same device in the usb script to change "authorized".

But looking at the code for dma_pool_alloc(), it seems 'dma' can have the
same value again only if "*(int *)(page->vaddr + offset)" gets a value
of 0 in pool_initialise_page(). But I can't think of any way it
can be 0. I have also added some more debug prints in the kernel to see what
might be going wrong there.
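
The reasoning above can be tried out on a toy model. This is a sketch, not the kernel code: ToyPage is a made-up Python class that imitates how the dma pool threads its free list through the first bytes of each free block. It assumes one 4096-byte block per page (as I believe is the case for the xHCI segment pool) and ignores boundary handling, locking and the allocation of additional pages.

```python
import struct

class ToyPage:
    """Toy model of one dma pool page with an in-band free list."""

    def __init__(self, allocation=4096, size=4096):
        self.mem = bytearray(allocation)
        self.allocation = allocation
        self.size = size
        self.offset = 0   # offset of the first free block
        self.in_use = 0
        # Like pool_initialise_page(): every free block stores the
        # offset of the next free block in its first 4 bytes.
        off = 0
        while off < allocation:
            nxt = off + size
            struct.pack_into("<i", self.mem, off, nxt)
            off = nxt

    def alloc(self):
        """Zeroing allocation; returns a block offset, or None when full."""
        if self.offset >= self.allocation:
            return None   # the real pool would move on to another page
        off = self.offset
        (self.offset,) = struct.unpack_from("<i", self.mem, off)
        self.in_use += 1
        self.mem[off:off + self.size] = bytes(self.size)
        return off

    def free(self, off):
        """Push the block back on the free list."""
        struct.pack_into("<i", self.mem, off, self.offset)
        self.offset = off
        self.in_use -= 1

# Healthy page: the single block is handed out once, then the page is full.
healthy = ToyPage()
print(healthy.alloc(), healthy.alloc())

# Corrupted page: something writes 0 over the link word of a freed block.
corrupted = ToyPage()
blk = corrupted.alloc()
corrupted.free(blk)
corrupted.mem[blk:blk + 4] = bytes(4)   # stray write after free
stuck = [corrupted.alloc() for _ in range(4)]
print(stuck, corrupted.in_use)
```

In this model, once the link word of a freed block reads 0, every later allocation returns the same offset while in_use keeps counting up, because the zeroing allocation leaves the link word at 0 each time.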

--
Regards
Sudip


Re: usb HC busted?

2018-06-21 Thread Mathias Nyman

On 21.06.2018 03:53, Sudip Mukherjee wrote:
> Hi Mathias, Andy,
>
> On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote:
> > On 06.06.2018 19:45, Sudip Mukherjee wrote:
> > > Hi Andy,
> > >
> > > And we meet again. :)
> > >
> > > On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote:
> > > > On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote:
> > > > > On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > > > > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > >
> > > > > Odd and unlikely, but to me this looks like some issue in allocating
> > > > > dma memory from pool using dma_pool_zalloc()
> > > > >
> > > > > Here's the story:
> > > > > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel.
> > > > > All tracing points to dma_pool_zalloc() returning the same dma address
> > > > > block on consecutive calls.
> > > > >
> > > > > In the failing case dma_pool_zalloc() is called 3 - 6us apart.
> > > > >
> > > > > <...>-26362 [002]   1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
> > > > > <...>-26362 [002]   1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
> > > > > <...>-26362 [002]   1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
> > > > >
> > > > > dma_pool_zalloc() is called from xhci_segment_alloc() in
> > > > > drivers/usb/host/xhci-mem.c
> > > > > see:
> > > > > https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci-mem.c#L52
> > > > >
> > > > > prints above are custom traces added right after dma_pool_zalloc()
> > > >
> > > > For better understanding it would be good to have dma_pool_free() calls
> > > > debugged as well.
> >
> > Sudip has a full (394M unpacked) trace at:
> > https://drive.google.com/open?id=1h-3r-1lfjg8oblBGkzdRIq8z3ZNgGZx-
> >
> > But then it gets stuck, for the whole ring2 dma_pool_zalloc() just returns
> > the same dma address as the last segment for
> > ring1:0x2d92b000. Last part of trace snippet is just another ring being freed.
>
> A gentle ping on this. Any idea on what the problem might be and any
> possible fix?



I tried to reproduce it by quickly hacking xhci to allocate and free 50 
segments each time
we normally allocate one segment from dmapool.
I let it run for 3 days on a Atom based platform, but could not reproduce it.

xhci testhack can be found here:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git dmapool-test
https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=dmapool-test

Tested by just leaving the following running for a few days:

while true; do echo 0 > authorized; sleep 3; echo 1 > authorized; sleep 3; done;
For some usb device (for example: /sys/bus/usb/devices/1-8)

Then grep logs for "MATTU dmatest match! "

Can you share a bit more details on the platform you are using, and what types 
of test you are running.
Does my test above trigger the case? (show "MATTU dmatest match!")

-Mathias



Re: usb HC busted?

2018-06-20 Thread Sudip Mukherjee
Hi Mathias, Andy,

On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote:
> On 06.06.2018 19:45, Sudip Mukherjee wrote:
> > Hi Andy,
> > 
> > And we meet again. :)
> > 
> > On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote:
> > > On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote:
> > > > On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > > > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > > > > 
> > > 
> > > > Odd and unlikely, but to me this looks like some issue in allocating
> > > > dma memory
> > > > from pool using dma_pool_zalloc()
> > > > 
> > > > 
> > > > Here's the story:
> > > > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel.
> > > > All tracing points to dma_pool_zalloc() returning the same dma address
> > > > block on
> > > > consecutive calls.
> > > > 
> > > > In the failing case dma_pool_zalloc() is called 3 - 6us apart.
> > > > 
> > > > <...>-26362 [002]   1186.756739: xhci_ring_mem_detail: MATTU
> > > > xhci_segment_alloc dma @ 0x2d92b000
> > > > <...>-26362 [002]   1186.756745: xhci_ring_mem_detail: MATTU
> > > > xhci_segment_alloc dma @ 0x2d92b000
> > > > <...>-26362 [002]   1186.756748: xhci_ring_mem_detail: MATTU
> > > > xhci_segment_alloc dma @ 0x2d92b000
> > > > 
> > > > dma_pool_zalloc() is called from xhci_segment_alloc() in
> > > > drivers/usb/host/xhci-mem.c
> > > > see:
> > > > https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci-
> > > > mem.c#L52
> > > > 
> > > > prints above are custom traces added right after dma_pool_zalloc()
> > > 
> > > For better understanding it would be good to have dma_pool_free() calls
> > > debugged as well.
> > 
> 
> Sudip has a full (394M unpacked) trace at:
> https://drive.google.com/open?id=1h-3r-1lfjg8oblBGkzdRIq8z3ZNgGZx-
> 



> But then it gets stuck, for the whole ring2 dma_pool_zalloc() just returns 
> the same dma address as the last segment for
> ring1:0x2d92b000. Last part of trace snippet is just another ring being freed.

A gentle ping on this. Any idea on what the problem might be and any
possible fix?

--
regards
Sudip


Re: usb HC busted?

2018-06-08 Thread Sudip Mukherjee
Hi All,

On Thu, Jun 07, 2018 at 10:40:03AM +0300, Mathias Nyman wrote:
> On 06.06.2018 19:45, Sudip Mukherjee wrote:
> > Hi Andy,
> > 
> > And we meet again. :)
> > 
> > On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote:
> > > On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote:
> > > > On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > > > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > > > > 
> > > 
> > > > Odd and unlikely, but to me this looks like some issue in allocating
> > > > dma memory
> > > > from pool using dma_pool_zalloc()
> > > > 
> > > > Adding people with DMA knowledge to cc, maybe someone knows what is
> > > > going on.
> > > > 
> > > > Here's the story:
> > > > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel.
> > > > All tracing points to dma_pool_zalloc() returning the same dma address
> > > > block on
> > > > consecutive calls.

We have started testing with v4.14.47 now and we are seeing the issue
with it also. :(

--
Regards
Sudip



Re: usb HC busted?

2018-06-07 Thread Mathias Nyman

On 06.06.2018 19:45, Sudip Mukherjee wrote:
> Hi Andy,
>
> And we meet again. :)
>
> On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote:
> > On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote:
> > > On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> >
> > > Odd and unlikely, but to me this looks like some issue in allocating
> > > dma memory from pool using dma_pool_zalloc()
> > >
> > > Adding people with DMA knowledge to cc, maybe someone knows what is
> > > going on.
> > >
> > > Here's the story:
> > > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel.
> > > All tracing points to dma_pool_zalloc() returning the same dma address
> > > block on consecutive calls.
> > >
> > > In the failing case dma_pool_zalloc() is called 3 - 6us apart.
> > >
> > > <...>-26362 [002]   1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
> > > <...>-26362 [002]   1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
> > > <...>-26362 [002]   1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
> > >
> > > dma_pool_zalloc() is called from xhci_segment_alloc() in
> > > drivers/usb/host/xhci-mem.c
> > > see:
> > > https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci-mem.c#L52
> > >
> > > prints above are custom traces added right after dma_pool_zalloc()
> >
> > For better understanding it would be good to have dma_pool_free() calls
> > debugged as well.
>
> So, I am adding another trace event for dma_pool_free() and continuing
> with the test. Is there anything else that I should be adding as debug?



The patch traced both dma_pool_zalloc() and dma_pool_free() calls from xhci,
no need to retry.

Sudip has a full (394M unpacked) trace at:
https://drive.google.com/open?id=1h-3r-1lfjg8oblBGkzdRIq8z3ZNgGZx-

Interesting part is:

<...>-26362 [002]   1186.756728: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d34d000
<...>-26362 [002]   1186.756735: xhci_ring_mem_detail: MATTU xhci segment alloc seg->dma @ 0x2d34d000
<...>-26362 [002]   1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
<...>-26362 [002]   1186.756740: xhci_ring_mem_detail: MATTU xhci segment alloc seg->dma @ 0x2d92b000
<...>-26362 [002]   1186.756743: xhci_ring_alloc: ISOC eefa0580: enq 0x2d34d000(0x2d34d000) deq 0x2d34d000(0x2d34d000) segs 2 stream 0 free_trbs 509 bounce 17 cycle 1
<...>-26362 [002]   1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
<...>-26362 [002]   1186.756746: xhci_ring_mem_detail: MATTU xhci segment alloc seg->dma @ 0x2d92b000
<...>-26362 [002]   1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
<...>-26362 [002]   1186.756751: xhci_ring_mem_detail: MATTU xhci segment alloc seg->dma @ 0x2d92b000
<...>-26362 [002]   1186.756752: xhci_ring_alloc: ISOC f19d7c80: enq 0x2d92b000(0x2d92b000) deq 0x2d92b000(0x2d92b000) segs 2 stream 0 free_trbs 509 bounce 17 cycle 1
<...>-26362 [002] d..1  1186.756761: xhci_queue_trb: CMD: Configure Endpoint Command: ctx 2ce96000 slot 7 flags d:C
<...>-26362 [002] d..1  1186.756762: xhci_inc_enq: CMD ed930b80: enq 0x2d93adb0(0x2d93a000) deq 0x2d93ada0(0x2d93a000) segs 1 stream 0 free_trbs 253 bounce 0 cycle 1
<...>-26362 [002]   1186.757066: xhci_dbg_context_change: Successful Endpoint Configure command
<...>-26362 [002]   1186.757072: xhci_ring_free: ISOC eefd9380: enq 0x2c482000(0x2c482000) deq 0x2c482000(0x2c482000) segs 2 stream 0 free_trbs 509 bounce 0 cycle 1
<...>-26362 [002]   1186.757075: xhci_ring_mem_detail: MATTU xhci segment free seg->dma @ ee2d23c8
<...>-26362 [002]   1186.757078: xhci_ring_mem_detail: MATTU xhci segment free seg->dma @ c7a93488
<...>-26362 [002]   1186.757080: xhci_ring_free: ISOC eef0d800: enq 0x2c50a000(0x2c50a000) deq 0x2c50a000(0x2c50a000) segs 2 stream 0 free_trbs 509 bounce 0 cycle 1

What is shown is the allocation of two ISOC transfer rings; each ring has 2
segments (two dma_pool_zalloc() calls per ring).
The first ring looks normal: ring1 gets dma memory at 0x2d34d000 for its first
segment and at 0x2d92b000 for its second segment.

But then it gets stuck: for the whole of ring2, dma_pool_zalloc() just returns
the same dma address as the last segment of ring1, 0x2d92b000. The last part
of the trace snippet is just another ring being freed.
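
For the full trace, a sketch like the one below can flag every allocation that returns an address which is still live, meaning it was allocated earlier and not freed since. It only assumes the custom "MATTU xhci_segment_alloc dma @" and "MATTU xhci segment free seg->dma @" print formats; the function name is made up, and the sample lines are shortened from the trace above.

```python
import re

# One alternative per event type. The second "xhci segment alloc seg->dma"
# print is deliberately not matched, so each allocation counts once.
EVENT_RE = re.compile(
    r"MATTU\s+xhci_segment_alloc\s+dma\s+@\s+(?P<alloc>(?:0x)?[0-9a-f]+)"
    r"|MATTU\s+xhci\s+segment\s+free\s+seg->dma\s+@\s+(?P<free>(?:0x)?[0-9a-f]+)")

def double_allocations(trace_text):
    """Return addresses handed out again while still marked as allocated."""
    live = set()
    dupes = []
    for m in EVENT_RE.finditer(trace_text):
        if m.group("alloc"):
            addr = int(m.group("alloc"), 16)
            if addr in live:
                dupes.append(addr)
            live.add(addr)
        else:
            live.discard(int(m.group("free"), 16))
    return dupes

SAMPLE = """\
1186.756728: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d34d000
1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
"""

for addr in double_allocations(SAMPLE):
    print("0x%08x allocated again while still in use" % addr)
```

Tracking frees as well avoids false alarms when a segment is released and legitimately reused later in the same trace.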

Full testpatch looked like this:

diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
index e5ace89..7d343ad 100644
--- a/drivers/usb/host/xhci-mem.c
+++ b/drivers/usb/host/xhci-mem.c
@@ -44,10 +44,15 @@ static struct xhci_segment *xhci_segment_alloc(struct 
xhci_hcd *xhci,
return NULL;
}
 
+	xhci_dbg_trace(xhci,  trace_xhci_ring_mem_detail,

+ 

Re: usb HC busted?

2018-06-06 Thread Sudip Mukherjee
Hi Andy,

And we meet again. :)

On Wed, Jun 06, 2018 at 06:36:35PM +0300, Andy Shevchenko wrote:
> On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote:
> > On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > > 
> 
> > Odd and unlikely, but to me this looks like some issue in allocating
> > dma memory
> > from pool using dma_pool_zalloc()
> > 
> > Adding people with DMA knowledge to cc, maybe someone knows what is
> > going on.
> > 
> > Here's the story:
> > Sudip sees usb issues on a Intel Atom based board with 4.14.2 kernel.
> > All tracing points to dma_pool_zalloc() returning the same dma address
> > block on
> > consecutive calls.
> > 
> > In the failing case dma_pool_zalloc() is called 3 - 6us apart.
> > 
> > <...>-26362 [002]   1186.756739: xhci_ring_mem_detail: MATTU
> > xhci_segment_alloc dma @ 0x2d92b000
> > <...>-26362 [002]   1186.756745: xhci_ring_mem_detail: MATTU
> > xhci_segment_alloc dma @ 0x2d92b000
> > <...>-26362 [002]   1186.756748: xhci_ring_mem_detail: MATTU
> > xhci_segment_alloc dma @ 0x2d92b000
> > 
> > dma_pool_zalloc() is called from xhci_segment_alloc() in
> > drivers/usb/host/xhci-mem.c
> > see:
> > https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci-
> > mem.c#L52
> > 
> > prints above are custom traces added right after dma_pool_zalloc()
> 
> For better understanding it would be good to have dma_pool_free() calls
> debugged as well.

So, I am adding another trace event for dma_pool_free() and continuing
with the test. Is there anything else that I should be adding as debug?

--
Regards
Sudip


Re: usb HC busted?

2018-06-06 Thread Sudip Mukherjee
On Wed, Jun 06, 2018 at 05:12:21PM +0300, Mathias Nyman wrote:
> On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > 

> > 
> > Will request you to have a look at it.
> > 
> 
> Odd and unlikely, but to me this looks like some issue in allocating dma 
> memory
> from pool using dma_pool_zalloc()
> 
> Adding people with DMA knowledge to cc, maybe someone knows what is going on.

Thanks Mathias.

--
Regards
Sudip


Re: usb HC busted?

2018-06-06 Thread Andy Shevchenko
On Wed, 2018-06-06 at 17:12 +0300, Mathias Nyman wrote:
> On 04.06.2018 18:28, Sudip Mukherjee wrote:
> > On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > > 

> Odd and unlikely, but to me this looks like some issue in allocating
> dma memory from pool using dma_pool_zalloc()
> 
> Adding people with DMA knowledge to cc, maybe someone knows what is
> going on.
> 
> Here's the story:
> Sudip sees usb issues on an Intel Atom based board with 4.14.2 kernel.
> All tracing points to dma_pool_zalloc() returning the same dma address
> block on consecutive calls.
> 
> In the failing case dma_pool_zalloc() is called 3 - 6us apart.
> 
> <...>-26362 [002]   1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
> <...>-26362 [002]   1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
> <...>-26362 [002]   1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
> 
> dma_pool_zalloc() is called from xhci_segment_alloc() in
> drivers/usb/host/xhci-mem.c
> see:
> https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci-mem.c#L52
> 
> prints above are custom traces added right after dma_pool_zalloc()

For better understanding it would be good to have dma_pool_free() calls
debugged as well.

Is it possible that something in parallel is just fast enough to free the
allocated resource from the pool?
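
One way to answer that from the traces, once both dma_pool_zalloc() and
dma_pool_free() events are logged, is to replay them through a small tracker
that flags any address handed out while it is still live. What follows is only
a user-space sketch under stated assumptions: the names live_set,
track_alloc() and track_free() and the MAX_LIVE cap are hypothetical and are
not part of the kernel or of any patch in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LIVE 64	/* hypothetical cap, enough for a short trace */

enum track_result { TRACK_OK, TRACK_DUPLICATE_ALLOC, TRACK_UNKNOWN_FREE };

/* DMA addresses that were allocated and not yet freed. */
struct live_set {
	uint64_t addr[MAX_LIVE];
	size_t count;
};

static void live_init(struct live_set *s)
{
	s->count = 0;
}

/* Linear search; stores the slot in *idx when the address is live. */
static bool live_find(const struct live_set *s, uint64_t dma, size_t *idx)
{
	for (size_t i = 0; i < s->count; i++) {
		if (s->addr[i] == dma) {
			if (idx)
				*idx = i;
			return true;
		}
	}
	return false;
}

/* Replay one allocation event from the trace. */
static enum track_result track_alloc(struct live_set *s, uint64_t dma)
{
	if (live_find(s, dma, NULL))
		return TRACK_DUPLICATE_ALLOC;	/* handed out twice, no free between */
	assert(s->count < MAX_LIVE);
	s->addr[s->count++] = dma;
	return TRACK_OK;
}

/* Replay one free event from the trace. */
static enum track_result track_free(struct live_set *s, uint64_t dma)
{
	size_t i;

	if (!live_find(s, dma, &i))
		return TRACK_UNKNOWN_FREE;	/* free of an address never seen */
	s->addr[i] = s->addr[--s->count];	/* swap-remove, order is irrelevant */
	return TRACK_OK;
}
```

Replaying the three allocations of 0x2d92b000 from the trace with no free
between them would return TRACK_DUPLICATE_ALLOC on the second and third call.
If a free of that address shows up in between, the reuse is legitimate and the
tracker stays quiet.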

-- 
Andy Shevchenko 
Intel Finland Oy


Re: usb HC busted?

2018-06-06 Thread Mathias Nyman

On 04.06.2018 18:28, Sudip Mukherjee wrote:
> On Thu, May 24, 2018 at 04:35:34PM +0300, Mathias Nyman wrote:
> > Logs show two rings having the same TRB segment dma address, this will
> > completely mess up the transfer:
> >
> > While allocating rings the enqueue pointers for the two rings are the same:
> >
> > 461.859315: xhci_ring_alloc: ISOC efa4e580: enq 0x33386000(0x33386000) deq 0x33386000(0x33386000) segs 2 stream 0 ...bs
> > 461.859320: xhci_ring_alloc: ISOC f0ce1f00: enq 0x33386000(0x33386000) deq 0x33386000(0x33386000) segs 2 stream 0 ...
> >
> > So something goes really wrong when allocating or setting up the rings in
> > one of these functions:
> >
> > To verify and rule out dma_pool_zalloc(), could you apply the attached patch
> > and reproduce with new logs?
>
> I spoke too soon in yesterday's mail. We were able to reproduce it
> on the automated tests. The log and the trace are at:
> https://drive.google.com/open?id=1h-3r-1lfjg8oblBGkzdRIq8z3ZNgGZx-
>
> Will request you to have a look at it.

Odd and unlikely, but to me this looks like some issue in allocating dma memory
from pool using dma_pool_zalloc()

Adding people with DMA knowledge to cc, maybe someone knows what is going on.

Here's the story:
Sudip sees usb issues on an Intel Atom based board with 4.14.2 kernel.
All tracing points to dma_pool_zalloc() returning the same dma address block on
consecutive calls.

In the failing case dma_pool_zalloc() is called 3 - 6us apart.

<...>-26362 [002]   1186.756739: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
<...>-26362 [002]   1186.756745: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
<...>-26362 [002]   1186.756748: xhci_ring_mem_detail: MATTU xhci_segment_alloc dma @ 0x2d92b000
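
A repeat like the one in these three events can also be picked out of a long
log mechanically. The following is a rough user-space sketch, assuming each
trace event sits on a single line as in the raw trace output. The helper names
parse_alloc_dma() and count_back_to_back_repeats() are hypothetical; only the
marker text comes from the custom trace above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Marker text printed by the custom trace added to xhci_segment_alloc(). */
static const char MARKER[] = "xhci_segment_alloc dma @ ";

/* Extract the DMA address from one trace line; false if the line has none. */
static bool parse_alloc_dma(const char *line, uint64_t *dma)
{
	const char *p;
	char *end;

	assert(line != NULL && dma != NULL);
	p = strstr(line, MARKER);
	if (!p)
		return false;
	p += sizeof(MARKER) - 1;
	*dma = strtoull(p, &end, 16);	/* base 16 accepts the "0x" prefix */
	return end != p;
}

/* Count allocations that return the same address as the previous one. */
static size_t count_back_to_back_repeats(const char *const lines[], size_t n)
{
	uint64_t prev = 0, cur;
	bool have_prev = false;
	size_t repeats = 0;

	for (size_t i = 0; i < n; i++) {
		if (!parse_alloc_dma(lines[i], &cur))
			continue;	/* unrelated trace line */
		if (have_prev && cur == prev)
			repeats++;
		prev = cur;
		have_prev = true;
	}
	return repeats;
}
```

For the three events above this counts two repeats. A repeat on its own does
not prove a bug, since a free between two allocations would make the reuse
valid, so the count is only meaningful alongside a trace of the frees.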

dma_pool_zalloc() is called from xhci_segment_alloc() in 
drivers/usb/host/xhci-mem.c
see:
https://elixir.bootlin.com/linux/v4.14.2/source/drivers/usb/host/xhci-mem.c#L52

prints above are custom traces added right after dma_pool_zalloc()

@@ -44,10 +44,15 @@ static struct xhci_segment *xhci_segment_alloc(struct xhci_hcd *xhci,
 		return NULL;
 	}
 
+	xhci_dbg_trace(xhci,  trace_xhci_ring_mem_detail,
+		       "MATTU xhci_segment_alloc dma @ %pad", );
+

Any idea what's going on?
dma_pool_alloc() has a comment that it drops the pool lock if it needs to
allocate a page, can it be related?

Thanks
-Mathias
