On Wed, Apr 01, 2020 at 12:04:25PM +0200, Patrick Wildt wrote:
> On Wed, Apr 01, 2020 at 09:40:06AM +0200, Patrick Wildt wrote:
> > On Wed, Apr 01, 2020 at 09:22:07AM +0200, Patrick Wildt wrote:
> > > On Wed, Apr 01, 2020 at 04:47:10PM +1100, Jonathan Gray wrote:
> > > > On Wed, Apr 01, 2020 at 12:58:23PM +1100, Jonathan Gray wrote:
> > > > > On Wed, Mar 18, 2020 at 01:41:06PM +0100, Patrick Wildt wrote:
> > > > > > On Wed, Mar 18, 2020 at 11:22:40AM +0100, Patrick Wildt wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I've spent a few days investigating why USB ethernet adapters are
> > > > > > > so
> > > > > > > horribly slow on my ARMs. Using dt(4) I realized that it was
> > > > > > > spending
> > > > > > > most of its time in memcpy. But, why? As it turns out, all USB
> > > > > > > data
> > > > > > > buffers are mapped COHERENT, which on some/most ARMs means
> > > > > > > uncached.
> > > > > > > Using cached data buffers makes the performance rise from 20
> > > > > > > mbit/s to
> > > > > > > 200 mbit/s. Quite a difference.
> > > > > > >
> > > > > > > sys/dev/usb/usb_mem.c:
> > > > > > > error = bus_dmamem_map(tag, p->segs, p->nsegs, p->size,
> > > > > > > &p->kaddr,
> > > > > > > BUS_DMA_NOWAIT|BUS_DMA_COHERENT);
> > > > > > >
> > > > > > > On x86, COHERENT is essentially a no-op. On ARM, it depends on
> > > > > > > the SoC.
> > > > > > > Some SoCs have cache-coherent USB controllers, some don't. Mine
> > > > > > > does
> > > > > > > not, so mapping it COHERENT means uncached and thus slow.
> > > > > > >
> > > > > > > Why do we do that? Well, when the code was imported in 99, it was
> > > > > > > already there. Since then we have gained infrastructure for DMA
> > > > > > > syncs in the USB stack, which I think are proper.
> > > > > > >
> > > > > > > sys/dev/usb/usbdi.c - usbd_transfer() (before transfer)
> > > > > > >
> > > > > > > if (!usbd_xfer_isread(xfer)) {
> > > > > > > if ((xfer->flags & USBD_NO_COPY) == 0)
> > > > > > > memcpy(KERNADDR(&xfer->dmabuf, 0), xfer->buffer,
> > > > > > > xfer->length);
> > > > > > > usb_syncmem(&xfer->dmabuf, 0, xfer->length,
> > > > > > > BUS_DMASYNC_PREWRITE);
> > > > > > > } else
> > > > > > > usb_syncmem(&xfer->dmabuf, 0, xfer->length,
> > > > > > > BUS_DMASYNC_PREREAD);
> > > > > > > err = pipe->methods->transfer(xfer);
> > > > > > >
> > > > > > > sys/dev/usb/usbdi.c - usb_transfer_complete() (after transfer)
> > > > > > >
> > > > > > > if (xfer->actlen != 0) {
> > > > > > > if (usbd_xfer_isread(xfer)) {
> > > > > > > usb_syncmem(&xfer->dmabuf, 0, xfer->actlen,
> > > > > > > BUS_DMASYNC_POSTREAD);
> > > > > > > if (!(xfer->flags & USBD_NO_COPY))
> > > > > > > memcpy(xfer->buffer,
> > > > > > > KERNADDR(&xfer->dmabuf, 0),
> > > > > > > xfer->actlen);
> > > > > > > } else
> > > > > > > usb_syncmem(&xfer->dmabuf, 0, xfer->actlen,
> > > > > > > BUS_DMASYNC_POSTWRITE);
> > > > > > > }
> > > > > > >
> > > > > > > We cannot just remove COHERENT, since some drivers, like ehci(4),
> > > > > > > use
> > > > > > > the same backend to allocate their rings. And I can't vouch for
> > > > > > > those
> > > > > > > drivers' sanity.
> > > > > > >
> > > > > > > As a first step, I would like to go ahead with another solution,
> > > > > > > which
> > > > > > > is based on a diff from Marius Strobl, who added those syncs in
> > > > > > > the
> > > > > > > first place. Essentially it splits the memory handling into
> > > > > > > cacheable
> > > > > > > and non-cacheable blocks. The USB data transfers and everyone
> > > > > > > who uses
> > > > > > > usbd_alloc_buffer() then use cacheable buffers, while code like
> > > > > > > ehci(4)
> > > > > > > still don't. This is a bit of a safer approach imho, since we
> > > > > > > don't
> > > > > > > hurt the controller drivers, but speed up the data buffers.
> > > > > > >
> > > > > > > Once we have verified that there are no regressions, we can adjust
> > > > > > > ehci(4) and the like, add proper syncs, make sure they still work
> > > > > > > as
> > > > > > > well as before, and maybe then back this out again.
> > > > > > >
> > > > > > > Keep note that this is all a no-op on X86, but all the other
> > > > > > > archs will
> > > > > > > profit from this.
> > > > > > >
> > > > > > > ok?
> > > > > > >
> > > > > > > Patrick
> > > > > >
> > > > > > Update diff with inverted logic. kettenis@ argues that we should
> > > > > > invert the logic, and those who need COHERENT memory should ask
> > > > > > for that explicitly, since for bus_dmamem_map() it also needs to
> > > > > > be passed explicitly. This also points out all those users that
> > > > > > use usb_allocmem() internally, where we might want to have a look
> > > > > > if COHERENT is actually needed or not, or if it can be refactored
> > > > > > in another way.
> > > > >
> > > > > These commits broke usb on imx.6 with cubox:
> > > > >
> > > > > imxehci0 at simplebus3
> > > > > usb0 at imxehci0: USB revision 2.0
> > > > > usb0: root hub problem
> > > > > imxehci1 at simplebus3
> > > > > usb1 at imxehci1: USB revision 2.0
> > > > > usb1: root hub problem
> > > > > "usbmisc" at simplebus3 not configured
> > > >
> > > > pandaboard with omap4 (another cortex a9) also has broken usb with
> > > > the latest snapshot:
> > > >
> > > > omehci0 at simplebus0
> > > > usb0 at omehci0: USB revision 2.0
> > > > usb0: root hub problem
> > >
> > > I think I know what it is. When we enqueue a request for the root
> > > hub, the buffer, for which the USB subsystem allocates a DMA buffer,
> > > is filled not by an external device, but by our driver.
> > >
> > > Now on completion of that request, since it's doing a READ of the
> > > root hub, it will do a DMA sync with POSTREAD. Since the ehci code
> > > hasn't flushed the buffer do memory, but simply did a memcpy to the
> > > buffer, the POSTREAD will essentially drop whatever ehci's memcpy
> > > did.
> > >
> > > Can you check if that makes a difference? It essentially forces the
> > > root hub code to flush the data to the caches, so that the POSTREAD
> > > can successfully flush the cache and read the data from memory. I
> > > wonder if there's a better way of doing this, but I kinda doubt it.
> > >
> > > Patrick
> > >
> > > diff --git a/sys/dev/usb/ehci.c b/sys/dev/usb/ehci.c
> > > index a352d83eaf4..d81901d3762 100644
> > > --- a/sys/dev/usb/ehci.c
> > > +++ b/sys/dev/usb/ehci.c
> > > @@ -2154,6 +2154,9 @@ ehci_root_ctrl_start(struct usbd_xfer *xfer)
> > > err = USBD_IOERROR;
> > > goto ret;
> > > }
> > > + if (totlen != 0)
> > > + usb_syncmem(&xfer->dmabuf, 0, totlen,
> > > + BUS_DMASYNC_PREWRITE);
> > > xfer->actlen = totlen;
> > > err = USBD_NORMAL_COMPLETION;
> > > ret:
> > >
> >
> > If that diff does indeed work, I think a better fix would be to move the
> > DMA syncs from the USB layer into the driver layer, so that for the root
> > methods no sync has to be done, but all the individual device methods do
> > proper syncs. Preparing that diff would take some more minutes though.
> > In the meantime we could use USB_DMA_COHERENT for all requests without
> > an extra allocated buffer, which includes all the control requests.
> > Then move the syncs to the drivers, and at last remove the flag again.
>
> Since kettenis@ wondered how that should look like, I took the time to
> look at eHCI and here's how I approached it. No diff yet, I'll come up
> with that soon.
>
> ======
> Before Start of Transfer: We need to make sure that the buffers are
> synced before handing them to the hardware.
>
> Check device_*_start to sync &xfer->dmabuf. Easiest way is to check,
> where DMAADDR(&xfer->dmabuf) used. That's where we tell the hardware
> where our transfer buffer is. So those indicate places where we should
> sync before telling the HW to go.
>
> * ehci_pcd(): No device transfer, just memset, sync would be wrong. OK
> * ehci_root_ctrl_start(): No device transfer, sync would be wrong. OK
> * ehci_alloc_sqtd_chain(): chain is allocated based on the buffer, we
> already call usb_syncmem() for the buffer. OK
> * ehci_device_intr_done(): called _after_ the transfer completes and
> after we memcpy the data from a successful transfer. on repeat, where
> it enqueues the transfer again, it calls usb_syncmem() for POST. I
> think that one is nonsense, and should have been done before calling
> usb_transfer_complete(). Then it calls ehci_alloc_sqtd_chain(), which
> does a proper PRE-sync. NOT OK
> * ehci_alloc_itd_chain(): Similar to alloc_sqtd_chain(), but there's no
> sync. NOT OK
> * ehci_alloc_sitd_chain(): Similar to alloc_sqtd_chain(), but there's no
> sync. NOT OK
> ======
> Before Transfer Completion: We need to make sure buffers are synced
> before handing them back to the stack.
>
> Check usb_transfer_complete() callers to have synced &xfer->dmabuf. But
> not those for error handling, only those were we supposedly have buffers
> with valid data.
>
> * ehci_isoc_idone(): only the descriptors are synced, probably needs a
> sync on the buffer. NOT OK
> * ehci_idone(): only the descriptors are synced, probably needs a sync on
> the buffer, especially since there's the following comment. NOT OK
>
> /* XXX transfer_complete memcpys out transfer data (for in endpoints)
> * during this call, before methods->done is called: dma sync required
> * beforehand? */
> usb_transfer_complete(xfer);
>
> * ehci_root_ctrl_start: No device transfer, same code path as in the
> other part. OK.
> * all other functions: error handling, no successfull completion, no sync
> needed. OK.
> ====
>
> So, to summarize:
>
> * ehci_alloc_itd_chain() and ehci_alloc_sitd_chain() both should sync
> so that the buffers handed to the hardware are fine.
> * ehci_isoc_idone() and ehci_idone() both should sync the buffers
> modified by the hardware, before handing them to the stack.
> * ehci_device_intr_done() doesn't need to call usb_syncmem again,
> after we have fixed ehci_idone() to do the sync.
>
> With those changes, removing USB_DMA_COHERENT from the usb_allocmem()
> again, as well as removing the usb_syncmem() calls in usbdi.c, should
> make the Cubox-i work. I'll come up with a diff and see if I can find
> some Cubox-i to test it with.
>
> Though we'll still need to check ohci, uhci, xhci and dwc2 before we
> can actually rip that out again.
>
> Patrick
And that's the diff I'd try based on the analysis.
Keep note that the reason the isochronous completion uses xfer->length
instead of xfer->actlen is, that isochronous transfers don't use
continuous buffers. So it's either going through each successful frame
and doing a sync per frame, or simply syncing the whole buffer.
diff --git a/sys/dev/usb/ehci.c b/sys/dev/usb/ehci.c
index a352d83eaf4..e70eea42aa7 100644
--- a/sys/dev/usb/ehci.c
+++ b/sys/dev/usb/ehci.c
@@ -856,6 +856,10 @@ ehci_isoc_idone(struct usbd_xfer *xfer)
#endif
xfer->actlen = actlen;
xfer->status = USBD_NORMAL_COMPLETION;
+
+ usb_syncmem(&xfer->dmabuf, 0, xfer->length,
+ usbd_xfer_isread(xfer) ?
+ BUS_DMASYNC_POSTREAD : BUS_DMASYNC_POSTWRITE);
usb_transfer_complete(xfer);
}
@@ -911,9 +915,9 @@ ehci_idone(struct usbd_xfer *xfer)
} else
xfer->status = USBD_NORMAL_COMPLETION;
- /* XXX transfer_complete memcpys out transfer data (for in endpoints)
- * during this call, before methods->done is called: dma sync required
- * beforehand? */
+ usb_syncmem(&xfer->dmabuf, 0, xfer->actlen,
+ usbd_xfer_isread(xfer) ?
+ BUS_DMASYNC_POSTREAD : BUS_DMASYNC_POSTWRITE);
usb_transfer_complete(xfer);
DPRINTFN(/*12*/2, ("ehci_idone: ex=%p done\n", ex));
}
@@ -3208,9 +3212,6 @@ ehci_device_intr_done(struct usbd_xfer *xfer)
if (xfer->pipe->repeat) {
ehci_free_sqtd_chain(sc, ex);
- usb_syncmem(&xfer->dmabuf, 0, xfer->length,
- usbd_xfer_isread(xfer) ?
- BUS_DMASYNC_POSTREAD : BUS_DMASYNC_POSTWRITE);
sqh = epipe->sqh;
err = ehci_alloc_sqtd_chain(sc, xfer->length, xfer, &data,
&dataend);
@@ -3408,6 +3409,9 @@ ehci_alloc_itd_chain(struct ehci_softc *sc, struct
usbd_xfer *xfer)
if (nframes == 0)
return (1);
+ usb_syncmem(&xfer->dmabuf, 0, xfer->length,
+ usbd_xfer_isread(xfer) ?
+ BUS_DMASYNC_PREREAD : BUS_DMASYNC_PREWRITE);
for (i = 0; i < nframes; i++) {
uint32_t froffs = offs;
@@ -3522,6 +3526,9 @@ ehci_alloc_sitd_chain(struct ehci_softc *sc, struct
usbd_xfer *xfer)
if (usbd_xfer_isread(xfer))
endp |= EHCI_SITD_SET_DIR(1);
+ usb_syncmem(&xfer->dmabuf, 0, xfer->length,
+ usbd_xfer_isread(xfer) ?
+ BUS_DMASYNC_PREREAD : BUS_DMASYNC_PREWRITE);
for (i = 0; i < nframes; i++) {
uint32_t addr = DMAADDR(&xfer->dmabuf, offs);
uint32_t page = EHCI_PAGE(addr + xfer->frlengths[i] - 1);