Re: PCIe Access - achieve bursts without DMA
On Fri, Jan 31, 2014 at 03:18:30PM -0800, David Hawkins wrote:
> 1. Peripheral board DMA (board-to-board)
> 2. Peripheral board DMA to host memory.
> 3. Host (root complex) DMA.
>
> As far as verification of your custom peripheral board FPGA IP is concerned, if I was a customer, and you had data for (1) and (2), I'd be pretty happy (and could care less about (2), since it's so system dependent).

Usually I would totally agree with you and try to implement the benchmark using DMA transfers. Unfortunately, we have some boards and IP cores that do not support DMA transfers, or the target system is required not to use them, and as I have no influence on these, I had to investigate how to improve my throughput.

I've submitted an RFC patch earlier today, which allowed me to perform PCIe read bursts on IO memory, achieving 18 MB/s instead of the 3 MB/s I got when using non-cached reads. However, I had to ioremap() my memory, like Gabriel said, using write-thru configuration.

> Since it's an FPGA-based IP, I'd also expect to see a PCIe simulation with Bus Functional Models showing what the optimal performance of your IP was, and then how it nicely matches with the measurements in (1). If you do not have a PCIe logic analyzer, both Xilinx and Altera have Chipscope/SignalTap logic analyzers that can be used for tracing traffic at the TLP layer inside the FPGA.

Of course our IP developers do simulation and analysis, and we have PCI and PCIe analyzers and all the other equipment one might need. However, we've seen that not only on PowerPC but also on x86, performing real bursts is not intuitive.

Thank you for your help - we might be satisfied with the achieved 18 MB/s.

Michael
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: PCIe Access - achieve bursts without DMA
From: Michael Moese
> Thank you for your help - we might be satisfied with the achieved 18 MB/s.

We achieved about twice that using the PEX dma controller. I found the following comment I wrote:

/* Long transfer requests are cut into smaller DMA requests.
 * Each PCIe request can contain a maximum of 128 bytes, but the
 * dma engine can have multiple PCIe requests outstanding and this
 * speeds things up somewhat (50ns/byte with 128, 24ns/byte with 1024).
 * 1k is somewhere near the point of diminishing returns. */

Those times would include a system call. The transfers were done through a simple driver that converted pread() and pwrite() requests into accesses to the board's memory. The non-dma versions are just copy_to/from_user() directly between the PCIe and user buffers.

Your 3MB/s for single word transfers is similar to what we saw. Cycle times that make an ISA bus look fast.

David
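For readers following along, here is a minimal sketch of the chunking described in the quoted comment, plus the arithmetic that turns the ns/byte figures into throughput. It is not David's driver. `split_transfer()`, the `dma_submit_fn` callback, `count_bytes()` and the `DMA_CHUNK_BYTES` constant are hypothetical names introduced only for illustration, and the 1024-byte chunk size is simply the value the comment calls "near the point of diminishing returns".

```c
#include <stddef.h>

/* Assumed chunk size, taken from the "1k" figure in the quoted comment. */
#define DMA_CHUNK_BYTES 1024u

/* Hypothetical callback standing in for "program one DMA request". */
typedef int (*dma_submit_fn)(size_t offset, size_t len, void *ctx);

/* Cut a long transfer into requests of at most DMA_CHUNK_BYTES.
 * Returns the number of requests issued, or -1 if a submit failed. */
static long split_transfer(size_t offset, size_t total,
                           dma_submit_fn submit, void *ctx)
{
    long issued = 0;

    while (total > 0) {
        size_t len = total < DMA_CHUNK_BYTES ? total : DMA_CHUNK_BYTES;

        if (submit(offset, len, ctx) != 0)
            return -1;
        offset += len;
        total -= len;
        issued++;
    }
    return issued;
}

/* Example stub: only adds up the bytes it was asked to move. */
static int count_bytes(size_t offset, size_t len, void *ctx)
{
    (void)offset;
    *(size_t *)ctx += len;
    return 0;
}

/* Convert a per-byte cost in nanoseconds to MB/s (1 MB = 10^6 bytes):
 * 10^9 ns/s divided by ns/byte gives bytes/s, then divide by 10^6. */
static double ns_per_byte_to_mb_per_s(double ns_per_byte)
{
    return 1000.0 / ns_per_byte;
}
```

With these helpers, 50 ns/byte works out to 20 MB/s and 24 ns/byte to roughly 41.7 MB/s, which fits the "about twice" the 18 MB/s mentioned above.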
Re: PCIe Access - achieve bursts without DMA
On Mon, Feb 03, 2014 at 10:17:43AM +, David Laight wrote:
> We achieved about twice that using the PEX dma controller.
>
> Your 3MB/s for single word transfers is similar to what we saw. Cycle times that make an ISA bus look fast.

Indeed, this is really poor performance. I know we could achieve much better performance using DMA, but we have several products where we simply don't have DMA available - this requires searching for other paths.

My ioremap_wt() could help in these situations, at least increasing performance for non-DMA operation to a not-that-bad level. I don't know if other devices could benefit from this, but we certainly have several IPs that would. Those have not been upstreamed yet; we're still working on this.

Michael
RE: PCIe Access - achieve bursts without DMA
From: Michael Moese
> On Mon, Feb 03, 2014 at 10:17:43AM +, David Laight wrote:
> > We achieved about twice that using the PEX dma controller.
> >
> > Your 3MB/s for single word transfers is similar to what we saw. Cycle times that make an ISA bus look fast.
>
> Indeed, this is really poor performance. I know we could achieve much better performance using DMA, but we have several products where we simply don't have DMA available - this requires searching for other paths.

I got the host (ppc) to do a dma, not the card. (This does need a dma controller that is adequately integrated with the PCIe logic.) So it doesn't require any hardware changes. I did have to design the software to minimise the number of single memory transfers.

> My ioremap_wt() could help in these situations, at least increasing performance for non-DMA operation to a not-that-bad level.

I needed to do writes as well as reads - so I think I would have needed to map PCIe space fully cached (rather than write-through). The speed of back to back writes is better than reads (even if they don't get combined) because the requests get 'posted' and overlap on the PCIe bus. Managing cached accesses does get tricky - you need to make sure that both sides never have to write to the same cache line.

> I don't know if other devices could benefit from this, but we certainly have several IPs that would. Those have not been upstreamed yet; we're still working on this.
>
> Michael
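The "never write to the same cache line" rule can be sketched as a data layout. This is only an illustration, not code from either driver: `struct shared_window`, its field names and `same_cache_line()` are hypothetical, and the 32-byte line size is an assumption that should be checked against the core's reference manual.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache lines; verify for the actual core. */
#define CACHE_LINE_BYTES 32u

/* Hypothetical shared region: each side owns whole cache lines, so
 * a write-back from one side cannot clobber the other side's data. */
struct shared_window {
    /* Written only by the host, read by the device. */
    _Alignas(CACHE_LINE_BYTES)
    uint32_t host_to_dev[CACHE_LINE_BYTES / sizeof(uint32_t)];
    /* Written only by the device, read by the host. */
    _Alignas(CACHE_LINE_BYTES)
    uint32_t dev_to_host[CACHE_LINE_BYTES / sizeof(uint32_t)];
};

/* Fail the build if the two areas could ever share a line. */
_Static_assert(offsetof(struct shared_window, dev_to_host)
               % CACHE_LINE_BYTES == 0,
               "device area must start on a cache line boundary");
_Static_assert(sizeof(struct shared_window) % CACHE_LINE_BYTES == 0,
               "window must cover whole cache lines");

/* Return nonzero if two byte offsets fall in the same cache line. */
static int same_cache_line(size_t a, size_t b)
{
    return a / CACHE_LINE_BYTES == b / CACHE_LINE_BYTES;
}
```

The static assertions turn the rule into a compile-time check, so a later change to the layout that breaks the separation is caught before it reaches hardware.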
Re: PCIe Access - achieve bursts without DMA
Hi Michael,

> On Fri, Jan 31, 2014 at 03:18:30PM -0800, David Hawkins wrote:
> > 1. Peripheral board DMA (board-to-board)
> > 2. Peripheral board DMA to host memory.
> > 3. Host (root complex) DMA.
> >
> > As far as verification of your custom peripheral board FPGA IP is concerned, if I was a customer, and you had data for (1) and (2), I'd be pretty happy (and could care less about (2), since it's so system dependent).
>
> Usually I would totally agree with you and try to implement the benchmark using DMA transfers. Unfortunately, we have some boards and IP cores that do not support DMA transfers, or the target system is required not to use them, and as I have no influence on these, I had to investigate how to improve my throughput.

Ah, I see, that does make your life difficult then.

> I've submitted an RFC patch earlier today, which allowed me to perform PCIe read bursts on IO memory, achieving 18 MB/s instead of the 3 MB/s I got when using non-cached reads. However, I had to ioremap() my memory, like Gabriel said, using write-thru configuration.

That sounds like a reasonable compromise.

Cheers,
Dave
Re: PCIe Access - achieve bursts without DMA
On Thu, Jan 30, 2014 at 12:20:21PM +, Moese, Michael wrote:
> Hello PPC-developers,
>
> I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores located inside our FPGA. On x86-based systems I was able to achieve bursts for both read and write access. On PPC32, using an e500v2, I had no success at all so far.
>
> I tried using ioremap_wc(), like I did on x86, for writing, and it only results in my writes just being single requests, one after another.

I believe that on PPC, write-combine is directly mapped to nocache. I can't remember if there is a writethrough option for ioremap (but adding it would probably be relatively easy).

> For reads, I noticed I could not ioremap_cache() on PPC, so I used simple ioremap() here.

You might be able to use ioremap_cache and direct cache control instructions (dcbf/dcbi) to achieve your goals. This becomes similar to handling machines with no hardware cache coherency. You have to know the hardware cache line size to make this work.

This said, it might be better to mark the memory as guarded and non-coherent (WIMG=). I don't know what ioremap_cache does for the MG bits and don't have the time to look it up right now.

> I used several ways to read from the device, from simple readl(), memcpy_fromio(), memcpy() to cacheable_memcpy() - with no improvements. Even just issuing a batch of prefetch() calls for all the memory to read did not result in read bursts.

If the device data you want to read is supposed to be cacheable (which basically means that the data does not change unexpectedly under you, i.e., is not as volatile as a typical device I/O register), you don't want to use readl(), which adds some synchronization to the read.

Prefetch only works on writeback memory, maybe writethrough; expecting it to work on cache-inhibited memory is contradictory.

Regards,
Gabriel
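The remark about needing the cache line size can be made concrete with a small sketch of the address arithmetic. It only walks the cache lines that overlap a buffer and calls a hook for each one; in a real driver that hook would issue a dcbf or dcbi. `for_each_cache_line()`, `cache_line_op` and `record_line()` are hypothetical names, and the 32-byte line size is an assumption to verify against the core manual.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache lines; verify for the actual core. */
#define CACHE_LINE_BYTES 32u

/* Hypothetical hook: a driver would issue one dcbf or dcbi for the
 * line at `line_addr` here. */
typedef void (*cache_line_op)(uintptr_t line_addr, void *ctx);

/* Apply `op` once to every cache line overlapping [addr, addr + len).
 * Returns the number of lines touched. Address wrap-around at the top
 * of the address space is not handled in this sketch. */
static size_t for_each_cache_line(uintptr_t addr, size_t len,
                                  cache_line_op op, void *ctx)
{
    uintptr_t line, end;
    size_t n = 0;

    if (len == 0)
        return 0;

    /* Round the start down to its cache line boundary. */
    line = addr & ~(uintptr_t)(CACHE_LINE_BYTES - 1);
    end = addr + len;               /* one past the last byte */

    for (; line < end; line += CACHE_LINE_BYTES) {
        op(line, ctx);
        n++;
    }
    return n;
}

/* Example stub: records the address of the last line visited. */
static void record_line(uintptr_t line_addr, void *ctx)
{
    *(uintptr_t *)ctx = line_addr;
}
```

Note that an unaligned two-byte buffer straddling a boundary still touches two lines, which is why the start address is rounded down rather than the length rounded up.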
Re: PCIe Access - achieve bursts without DMA
On Thu, 2014-01-30 at 12:20 +, Moese, Michael wrote:
> Hello PPC-developers,
>
> I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores located inside our FPGA. On x86-based systems I was able to achieve bursts for both read and write access. On PPC32, using an e500v2, I had no success at all so far.
>
> I tried using ioremap_wc(), like I did on x86, for writing, and it only results in my writes just being single requests, one after another.

Hrm, ioremap_wc will give you a mapping without the G (guard) bit. Whether that results in some store gathering or not on IOs depends on the specific HW implementation; you'll have to check with the FSP folks on that one. There could also be a chicken switch (HID bit or similar) needed to enable that (there was on some earlier ppc32 chips).

Another thing you can try is to use FP register load/stores.

> For reads, I noticed I could not ioremap_cache() on PPC, so I used simple ioremap() here. I used several ways to read from the device, from simple readl(), memcpy_fromio(), memcpy() to cacheable_memcpy() - with no improvements. Even just issuing a batch of prefetch() calls for all the memory to read did not result in read bursts.
>
> I only get really poor results: writing is possible at around 40 MiByte/s, whereas I can read at only about 3 MiByte/s.
>
> After hours of studying the reference manual from Freescale, looking into other code and searching the web, I'm close to resignation. Maybe one of you has some more directions for me. I'd appreciate every hint that leads me to my problem's solution - maybe I just missed something or lack knowledge about this architecture in general.
>
> Thanks for your reading.
>
> Michael
Re: PCIe Access - achieve bursts without DMA
Hi Michael,

> I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores located inside our FPGA. On x86-based systems I was able to achieve bursts for both read and write access. On PPC32, using an e500v2, I had no success at all so far.

Whenever I want to benchmark PCI/PCIe performance I do the following tests:

1. Peripheral board DMA (board-to-board)

   Use two of your FPGA boards in a chassis and DMA between them. In a PCI system, you can put the cards on the same bus segment and then between a bridge and see how that affects things. In your case, the PCIe traffic will all be via the root-complex/switch, so you should get the same performance regardless of which PCIe slot you use. This is likely the best you can do as far as bursts go.

2. Peripheral board DMA to host memory.

   In this case I typically insmod a simple driver on the host that gives me a page of memory, and then DMA into and out of that memory, using the DMA controller on the peripheral.

3. Host (root complex) DMA.

   If your host has a DMA controller, then program it per (2).

As far as verification of your custom peripheral board FPGA IP is concerned, if I was a customer, and you had data for (1) and (2), I'd be pretty happy (and could care less about (2), since it's so system dependent).

Since it's an FPGA-based IP, I'd also expect to see a PCIe simulation with Bus Functional Models showing what the optimal performance of your IP was, and then how it nicely matches with the measurements in (1). If you do not have a PCIe logic analyzer, both Xilinx and Altera have Chipscope/SignalTap logic analyzers that can be used for tracing traffic at the TLP layer inside the FPGA.

Just some thoughts ...

Cheers,
Dave
PCIe Access - achieve bursts without DMA
Hello PPC-developers,

I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores located inside our FPGA. On x86-based systems I was able to achieve bursts for both read and write access. On PPC32, using an e500v2, I had no success at all so far.

I tried using ioremap_wc(), like I did on x86, for writing, and it only results in my writes just being single requests, one after another.

For reads, I noticed I could not ioremap_cache() on PPC, so I used simple ioremap() here. I used several ways to read from the device, from simple readl(), memcpy_fromio(), memcpy() to cacheable_memcpy() - with no improvements. Even just issuing a batch of prefetch() calls for all the memory to read did not result in read bursts.

I only get really poor results: writing is possible at around 40 MiByte/s, whereas I can read at only about 3 MiByte/s.

After hours of studying the reference manual from Freescale, looking into other code and searching the web, I'm close to resignation. Maybe one of you has some more directions for me. I'd appreciate every hint that leads me to my problem's solution - maybe I just missed something or lack knowledge about this architecture in general.

Thanks for your reading.

Michael
RE: PCIe Access - achieve bursts without DMA
From Moese, Michael
> Hello PPC-developers,
>
> I'm currently trying to benchmark access speeds to our PCIe-connected IP-cores located inside our FPGA. On x86-based systems I was able to achieve bursts for both read and write access. On PPC32, using an e500v2, I had no success at all so far.

I'm not sure that you can. I had to write a simple driver for the PCIe CSB bridge dma on an 83xx ppc. I think that might be the one in the e500v2.

I don't know how fast 'normal' PCIe slaves are, but we were accessing an Altera fpga and the latency is less than pedestrian. I think an ISA bus can run faster! With moderate length transfers, the throughput was more than adequate.

David