RE: Max. libusb bulk send/receive values lowered with xhci?

2014-01-27 Thread David Laight
From: Jérôme Carretero
> I was happily using big (10MB) buffers before, and with recent kernels,
> when using USB3, I had to reduce the size of my buffers a lot.
> By the way, I couldn't find any information on a maximum size for the
> bulk transfers using libusb, maybe you know about that also ?
> 
> So, using v3.13, this is what I get from the kernel when doing a bulk read
> of 4 MiB:
> 
> [  506.856282] xhci_hcd 0000:00:14.0: Too many fragments 256, max 63
...
> I saw your 3.12-td-fragment-failure branch and tried it; there,
> sometimes the transfers don't work, with:
> 
> xhci_hcd 0000:00:14.0: WARN Event TRB for slot 10 ep 4 with no TDs queued?
> python2: page allocation failure: order:10, mode:0x1040d0

I've had a quick look and the reason for the allocation failure is fairly
obvious.

The libusb ioctl is handled by proc_do_submiturb(); it will use scatter-gather
for long requests, but always chops things up into 16k fragments.
So (as in the trace above) a 4MB transfer requires 256 fragments.

If the number of segments exceeds the advertised sg_tablesize (which
is now 128) then it falls back on using a single fragment.
For a 4MB buffer this is 1024 contiguous pages - not surprisingly it
sometimes fails (it really ought to sleep - but that is another issue).

Possibly proc_do_submiturb() should use longer fragments [1] in order to
get below the sg_tablesize limit.
However this is still doomed to fail.
A single 16MB buffer crosses at least 255 64kB boundaries, so the xhci
driver will need 256 or 257 TRBs to describe the buffer.

The only way for xhci to accept these transfers is to apply the patch
I posted last week that checks for aligned buffers and skips the 'pad
with NOPs' code if they are aligned, and then set sg_tablesize to ~0.

The 'struct usb_bus' currently contains 2 fields associated with scatter-
gather:
- no_sg_constraint:1 is set by xhci and checked by usbnet/ax88179_178a
  before it uses 'randomly aligned' fragments.
- sg_tablesize is supposed to be the limit on the number of sg fragments.
  ehci, ohci and uhci either set 0 or ~0.
  xhci currently sets TRBS_PER_SEGMENT/2 == 128 (previously 32, older ~0).
  Some code only checks for non-zero.

It would be better if the former were changed to be a limit on the
number of 'unconstrained' fragments; since that limit is somewhat
different (in xhci) from the limit on the number of aligned fragments.
Alternatively both could be treated as booleans, and we just hope
that any fragment limits aren't exceeded.

[1] I don't know if it is best to try to allocate 2^n pages, falling
back on smaller sizes (if they'll meet the fragment count limit)
rather than allocating equal sized fragments.
The code should probably also be willing to allocate more fragments
if it can't allocate even 16k blocks.
However processing variable-sized fragment lists requires that the
sg[].length field not be modified by the dma_map code - I don't
know if that is generally true?

David




RE: Max. libusb bulk send/receive values lowered with xhci?

2014-01-23 Thread David Laight
From: Jérôme Carretero
> Hi Sarah,
> 
> I was happily using big (10MB) buffers before, and with recent kernels,
> when using USB3, I had to reduce the size of my buffers a lot.
> By the way, I couldn't find any information on a maximum size for the
> bulk transfers using libusb, maybe you know about that also ?
> 
> So, using v3.13, this is what I get from the kernel when doing a bulk read
> of 4 MiB:
> 
> [  506.856282] xhci_hcd 0000:00:14.0: Too many fragments 256, max 63
> [  506.856288] usb 4-5: usbfs: usb_submit_urb returned -12
...

> I saw your 3.12-td-fragment-failure branch and tried it; there,
> sometimes the transfers don't work, with:
> 
> xhci_hcd 0000:00:14.0: WARN Event TRB for slot 10 ep 4 with no TDs queued?

That shouldn't happen; OTOH the xhci code is too complicated
and it doesn't actually surprise me.
Without some specific traces in the normal paths it is probably
impossible to work out what went wrong.

> python2: page allocation failure: order:10, mode:0x1040d0

That is just a failure to allocate a 4MB block of kernel memory.
Trying to allocate a block that large is rather doomed.

It looks like the code is allocating a contiguous buffer (virtual and
physical) for the request, and then somewhere it is being split
into separate address:length pairs for each 4k physical page.

Given the names of the fields of 'struct scatterlist' I suspect the
original use required a separate entry for each page.
I've not looked at the dma mapping code (nop on x86) to see if it
does actually map multiple pages for a single sg entry.

In any case the xhci driver doesn't need separate fragments for
each page - so could usefully detect adjacent fragments and use
a single TD for them.
(Which wouldn't help if the ioctl code allocated fragmented buffers.)

I might try to cook up a patch that will help aligned transfers.

David

