Re: PCIe Access - achieve bursts without DMA

2014-02-03 Thread Michael Moese
On Fri, Jan 31, 2014 at 03:18:30PM -0800, David Hawkins wrote:
 1. Peripheral board DMA (board-to-board)
 2. Peripheral board DMA to host memory.
 3. Host (root complex) DMA.
 
 As far as verification of your custom peripheral board FPGA IP is
 concerned, if I were a customer, and you had data for (1) and (2),
 I'd be pretty happy (and couldn't care less about (3), since it's so
 system dependent).

Usually I would totally agree with you and try to implement the benchmark
using DMA transfers. Unfortunately, we have some boards and IP cores that
do not support DMA transfers, or the target system must not use DMA by
requirement, and as I have no influence on these, I had to investigate
how to improve my throughput.
I submitted an RFC patch earlier today, which allowed me to perform
PCIe read bursts on IO memory, achieving 18 MB/s instead of the 3 MB/s
I got when using non-cached reads. However, I had to ioremap() my
memory, like Gabriel said, using a write-through configuration.
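
To put the two figures side by side, here is a rough back-of-the-envelope
sketch. It assumes (this is not stated in the thread) that each non-cached
read moves one 4-byte word and that each write-through read fetches one
32-byte cache line; ns_per_transfer() is a hypothetical helper name.

```c
#include <stddef.h>

/*
 * Time spent per bus transfer, in nanoseconds, for a given payload size
 * and a measured throughput in MB/s (1 MB = 1e6 bytes).
 * Assumption: every transfer carries exactly payload_bytes.
 */
static double ns_per_transfer(double payload_bytes, double mb_per_s)
{
    double bytes_per_ns = mb_per_s * 1e6 / 1e9;

    return payload_bytes / bytes_per_ns;
}
```

Under those assumptions, ns_per_transfer(4, 3) is roughly 1333 ns per word
and ns_per_transfer(32, 18) is roughly 1778 ns per cache line. If the
assumptions hold, the time per round trip barely changes and the gain comes
almost entirely from moving more bytes per request.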

 Since it's an FPGA-based IP, I'd also expect to see a PCIe simulation
 with Bus Functional Models showing what the optimal performance of
 your IP was, and then how it nicely matches with the measurements
 in (1). If you do not have a PCIe logic analyzer, both Xilinx and
 Altera have Chipscope/SignalTap logic analyzers that can be used
 for tracing traffic at the TLP layer inside the FPGA.

Of course our IP developers do simulation and analysis; we have PCI
and PCIe analyzers and all the other equipment one might need. However,
we've seen that not only on PowerPC but also on x86, performing real
bursts is not intuitive.


Thank you for your help - we might be satisfied with the achieved 
18 MB/s.


Michael
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: PCIe Access - achieve bursts without DMA

2014-02-03 Thread David Laight
From: Michael Moese
 Thank you for your help - we might be satisfied with the achieved
 18 MB/s.

We achieved about twice that using the PEX dma controller.
I found the following comment I wrote:

/* Long transfer requests are cut into smaller DMA requests.
 * Each PCIe request can contain a maximum of 128 bytes, but the
 * dma engine can have multiple PCIe requests outstanding and this
 * speeds things up somewhat (50ns/byte with 128, 24ns/byte with 1024).
 * 1k is somewhere near the point of diminishing returns. */

Those times would include a system call.
The transfers were done through a simple driver that converted pread()
and pwrite() requests into accesses to the board's memory.
The non-DMA versions are just copy_to/from_user() directly between
the PCIe and user buffers.
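
As a rough illustration of the comment above, the splitting of a long
transfer into 1 KiB requests and the conversion of the quoted ns/byte
figures into MB/s could look like the sketch below. This is not the actual
driver code: DMA_CHUNK_BYTES, dma_submit_fn and split_transfer() are
hypothetical names, and the callback only stands in for programming a real
DMA engine.

```c
#include <stddef.h>

/* Assumption: 1 KiB per DMA request, the "diminishing returns" point above. */
#define DMA_CHUNK_BYTES 1024u

/* 1 byte per ns is 1000 MB/s (1 MB = 1e6 bytes), so MB/s = 1000 / (ns/byte). */
static double mb_per_s(double ns_per_byte)
{
    return 1000.0 / ns_per_byte;
}

/* Hypothetical callback that would queue one DMA request; returns 0 on success. */
typedef int (*dma_submit_fn)(size_t offset, size_t len, void *ctx);

/*
 * Cut [offset, offset + len) into requests of at most DMA_CHUNK_BYTES.
 * Returns the number of requests issued; stops early if submit() fails.
 */
static size_t split_transfer(size_t offset, size_t len,
                             dma_submit_fn submit, void *ctx)
{
    size_t requests = 0;

    while (len > 0) {
        size_t n = len < DMA_CHUNK_BYTES ? len : DMA_CHUNK_BYTES;

        if (submit && submit(offset, n, ctx) != 0)
            break;
        offset += n;
        len -= n;
        requests++;
    }
    return requests;
}
```

With these definitions, 50 ns/byte corresponds to 20 MB/s and 24 ns/byte to
a little under 42 MB/s, which matches the "about twice" 18 MB/s mentioned
above.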

Your 3MB/s for single word transfers is similar to what we saw.
Cycle times that make an ISA bus look fast.

David





Re: PCIe Access - achieve bursts without DMA

2014-02-03 Thread Michael Moese
On Mon, Feb 03, 2014 at 10:17:43AM +, David Laight wrote:

 We achieved about twice that using the PEX dma controller.

 Your 3MB/s for single word transfers is similar to what we saw.
 Cycle times that make an ISA bus look fast.

Indeed, this is really poor performance. I know we could achieve much
more using DMA, but we have several products where we simply
don't have DMA available - this requires searching for other paths.

My ioremap_wt() could help in these situations, at least increasing
performance for non-DMA operation to a not-that-bad level.

I don't know if other devices could benefit from this, but we certainly
have several IPs that would. Those are not upstream yet; we're
still working on this.

Michael




RE: PCIe Access - achieve bursts without DMA

2014-02-03 Thread David Laight
From: Michael Moese 
 On Mon, Feb 03, 2014 at 10:17:43AM +, David Laight wrote:
 
  We achieved about twice that using the PEX dma controller.
 
  Your 3MB/s for single word transfers is similar to what we saw.
  Cycle times that make an ISA bus look fast.
 
 Indeed, this is really poor performance. I know we could achieve much
 more using DMA, but we have several products where we simply
 don't have DMA available - this requires searching for other paths.

I got the host (ppc) to do a DMA, not the card. (This does need a
DMA controller that is adequately integrated with the PCIe logic.)
So it doesn't require any hardware changes.
I did have to design the software to minimise the number of single
memory transfers.

 My ioremap_wt() could help in these situations, at least increasing
 performance for non-DMA operation to a not-that-bad level.

I needed to do writes as well as reads - so I think I would have
needed to map PCIe space fully cached (rather than write-through).
The speed of back to back writes is better than reads (even if they don't
get combined) because the requests get 'posted' and overlap on the
PCIe bus.

Managing cached accesses does get tricky - you need to make sure that
both sides never have to write to the same cache line.
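
One way to keep the two writers apart is to give each side its own whole
cache lines in the shared window. The layout below is only a sketch:
struct shared_window and its field names are hypothetical, and the 32-byte
line size is an assumption that should be checked against the core's
manual.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte data cache lines; verify for the actual core. */
#define CACHE_LINE_SIZE 32

/* Hypothetical shared window: each writer owns complete cache lines. */
struct shared_window {
    /* Written by the host only. */
    _Alignas(CACHE_LINE_SIZE) uint32_t host_to_card[CACHE_LINE_SIZE / 4];
    /* Written by the card only. */
    _Alignas(CACHE_LINE_SIZE) uint32_t card_to_host[CACHE_LINE_SIZE / 4];
};

/* Fail the build if the card's area does not start on a line boundary. */
_Static_assert(offsetof(struct shared_window, card_to_host) % CACHE_LINE_SIZE == 0,
               "card_to_host must start on a cache line boundary");

/* Returns nonzero if two byte offsets fall into the same cache line. */
static int same_cache_line(size_t off_a, size_t off_b)
{
    return off_a / CACHE_LINE_SIZE == off_b / CACHE_LINE_SIZE;
}
```

The compile-time check makes a later change to the structure fail loudly
instead of silently putting both writers into one line.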

 I don't know if other devices could benefit from this, but we certainly
 have several IPs that would. Those are not upstream yet; we're
 still working on this.
 
 Michael
 





Re: PCIe Access - achieve bursts without DMA

2014-02-03 Thread David Hawkins

Hi Michael,


On Fri, Jan 31, 2014 at 03:18:30PM -0800, David Hawkins wrote:

1. Peripheral board DMA (board-to-board)
2. Peripheral board DMA to host memory.
3. Host (root complex) DMA.

As far as verification of your custom peripheral board FPGA IP is
concerned, if I were a customer, and you had data for (1) and (2),
I'd be pretty happy (and couldn't care less about (3), since it's so
system dependent).


Usually I would totally agree with you and try to implement the benchmark
using DMA transfers. Unfortunately, we have some boards and IP cores that
do not support DMA transfers, or the target system must not use DMA by
requirement, and as I have no influence on these, I had to investigate
how to improve my throughput.


Ah, I see, that does make your life difficult then.


I submitted an RFC patch earlier today, which allowed me to perform
PCIe read bursts on IO memory, achieving 18 MB/s instead of the 3 MB/s
I got when using non-cached reads. However, I had to ioremap() my
memory, like Gabriel said, using a write-through configuration.


That sounds like a reasonable compromise.

Cheers,
Dave


Re: PCIe Access - achieve bursts without DMA

2014-01-31 Thread Gabriel Paubert
On Thu, Jan 30, 2014 at 12:20:21PM +, Moese, Michael wrote:
 Hello PPC-developers,
 I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
 located inside our FPGA. On x86-based systems I was able to achieve bursts for
 both read and write access. On PPC32, using an e500v2, I had no success at all
 so far.
 I tried using ioremap_wc(), like I did on x86, for writing, and it only results
 in my writes just being single requests, one after another.

I believe that on PPC, write-combine is directly mapped to nocache. I can't
remember if there is a writethrough option for ioremap (but adding it would
probably be relatively easy).

 For reads, I noticed I could not ioremap_cache() on PPC, so I used simple
 ioremap() here.

You might be able to use ioremap_cache() together with the direct cache control
instructions (dcbf/dcbi) to achieve your goals. This becomes similar to handling
machines with no hardware cache coherency. You have to know the hardware cache
line size to make this work.
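
The line-size dependence can be sketched as a loop that walks a buffer one
cache line at a time. Everything here is illustrative: the 32-byte line
size is an assumption, for_each_cache_line() is a hypothetical helper, and
the callback only stands in for whatever actually issues dcbf or dcbi on
the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte data cache lines; verify for the actual core. */
#define CACHE_LINE_SIZE 32u

/* Hypothetical per-line operation, e.g. a wrapper around dcbf or dcbi. */
typedef void (*line_op_fn)(uintptr_t line_addr, void *ctx);

/*
 * Call op() once for every cache line that overlaps [addr, addr + len).
 * The start address is rounded down to a line boundary so that a
 * partially covered first line is included. Returns the number of lines.
 */
static size_t for_each_cache_line(uintptr_t addr, size_t len,
                                  line_op_fn op, void *ctx)
{
    uintptr_t line = addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t end = addr + len;
    size_t count = 0;

    if (len == 0)
        return 0;

    for (; line < end; line += CACHE_LINE_SIZE) {
        if (op)
            op(line, ctx);
        count++;
    }
    return count;
}
```

Note that an unaligned 32-byte region still touches two lines, which is
why device buffers are easier to handle when they start and end on line
boundaries.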

This said, it might be better to mark the memory as guarded and non-coherent 
(WIMG=), I don't know what ioremap_cache does for the MG bits and don't
have the time to look it up right now.

 I used several ways to read from the device, from simple readl(),
 memcpy_fromio() and memcpy() to cacheable_memcpy() - with no improvements. Even
 just issuing a batch of prefetch() calls for all the memory to read did not
 result in read bursts.

If the device data you want to read is supposed to be cacheable (which basically
means that the data does not change unexpectedly under you, i.e., is not as
volatile as a typical device I/O register), you don't want to use readl(), which
adds some synchronization to the read.

Prefetch only works on writeback memory, maybe writethrough; expecting it to
work on cache-inhibited memory is contradictory.

Regards,
Gabriel


Re: PCIe Access - achieve bursts without DMA

2014-01-31 Thread Benjamin Herrenschmidt
On Thu, 2014-01-30 at 12:20 +, Moese, Michael wrote:
 Hello PPC-developers,
 I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
 located inside our FPGA. On x86-based systems I was able to achieve bursts for
 both read and write access. On PPC32, using an e500v2, I had no success at all
 so far.
 I tried using ioremap_wc(), like I did on x86, for writing, and it only results
 in my writes just being single requests, one after another.

Hrm, ioremap_wc will give you a mapping without the G (guard) bit.
Whether that results in some store gathering or not on IOs depends on a
specific HW implementation, you'll have to check with the FSP folks on
that one, there could also be a chicken switch (HID bit or similar)
needed to enable that (there was on some earlier ppc32 chips).

Another thing you can try is to use FP register load/stores.

 For reads, I noticed I could not ioremap_cache() on PPC, so I used simple
 ioremap() here.
 I used several ways to read from the device, from simple readl(),
 memcpy_fromio() and memcpy() to cacheable_memcpy() - with no improvements. Even
 just issuing a batch of prefetch() calls for all the memory to read did not
 result in read bursts.
 
 I only get really poor results: writing is possible at around 40 MiByte/s,
 whereas I can read at only about 3 MiByte/s.
 After hours of studying the reference manual from Freescale, looking into other
 code and searching the web, I'm close to resignation.
 
 Maybe one of you has some more directions for me. I'd appreciate every hint
 that leads me to my problem's solution - maybe I just missed something or lack
 knowledge about this architecture in general.
 
 Thanks for your reading.
 
 
 Michael


Re: PCIe Access - achieve bursts without DMA

2014-01-31 Thread David Hawkins

Hi Michael,


I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
located inside our FPGA. On x86-based systems I was able to achieve bursts for
both read and write access. On PPC32, using an e500v2, I had no success at all
so far.


Whenever I want to benchmark PCI/PCIe performance I do the
following tests:

1. Peripheral board DMA (board-to-board)

   Use two of your FPGA boards in a chassis and DMA between them.

   In a PCI system, you can put the cards on the same bus segment, then on
   either side of a bridge, and see how that affects things. In your case,
   the PCIe traffic will all be via the root-complex/switch, so
   you should get the same performance regardless of which PCIe slot
   you use.

   This is likely the best you can do as far as bursts go.

2. Peripheral board DMA to host memory.

   In this case I typically insmod a simple driver on the host that
   gives me a page of memory, and then DMA into and out of that
   memory, using the DMA controller on the peripheral.

3. Host (root complex) DMA.

   If your host has a DMA controller, then program it per (2).

As far as verification of your custom peripheral board FPGA IP is
concerned, if I were a customer, and you had data for (1) and (2),
I'd be pretty happy (and couldn't care less about (3), since it's so
system dependent).

Since it's an FPGA-based IP, I'd also expect to see a PCIe simulation
with Bus Functional Models showing what the optimal performance of
your IP was, and then how it nicely matches with the measurements
in (1). If you do not have a PCIe logic analyzer, both Xilinx and
Altera have Chipscope/SignalTap logic analyzers that can be used
for tracing traffic at the TLP layer inside the FPGA.

Just some thoughts ...

Cheers,
Dave



PCIe Access - achieve bursts without DMA

2014-01-30 Thread Moese, Michael
Hello PPC-developers,
I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
located inside our FPGA. On x86-based systems I was able to achieve bursts for
both read and write access. On PPC32, using an e500v2, I had no success at all
so far.
I tried using ioremap_wc(), like I did on x86, for writing, and it only results
in my writes just being single requests, one after another.
For reads, I noticed I could not ioremap_cache() on PPC, so I used simple
ioremap() here.
I used several ways to read from the device, from simple readl(),
memcpy_fromio() and memcpy() to cacheable_memcpy() - with no improvements. Even
just issuing a batch of prefetch() calls for all the memory to read did not
result in read bursts.

I only get really poor results: writing is possible at around 40 MiByte/s,
whereas I can read at only about 3 MiByte/s.
After hours of studying the reference manual from Freescale, looking into other
code and searching the web, I'm close to resignation.

Maybe one of you has some more directions for me. I'd appreciate every hint
that leads me to my problem's solution - maybe I just missed something or lack
knowledge about this architecture in general.

Thanks for your reading.


Michael


RE: PCIe Access - achieve bursts without DMA

2014-01-30 Thread David Laight
From: Moese, Michael
 Hello PPC-developers,
 I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores
 located inside our FPGA. On x86-based systems I was able to achieve bursts for
 both read and write access. On PPC32, using an e500v2, I had no success at all
 so far.

I'm not sure that you can.
I had to write a simple driver for the PCIe CSB bridge dma on a 83xx ppc.
I think that might be the one in the e500v2.

I don't know how fast 'normal' PCIe slaves are, but we were accessing
an Altera fpga and the latency is less than pedestrian.
I think an ISA bus can run faster!
With moderate length transfers, the throughput was more than adequate.

David


