We have some new Power8 nodes with dual-port FDR HCAs. I have not tested
same-node Verbs throughput. Using Linux’s Cross Memory Attach (CMA), I can get
30 GB/s for 2 MB messages between two cores and then it drops off to ~12 GB/s.
The PCIe Gen3 x16 slots should max out at ~15 GB/s. I agree that when ...
In that case we should find a way to eliminate this behavior. I will
take a look later this week and see if there is a workable solution.
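
For anyone who wants to poke at the single-pair numbers, a minimal sketch
of the kind of process_vm_readv() copy loop involved is below. This is
illustrative only, not the benchmark behind the figures above; the
fork-based setup, sizes, and iteration count are arbitrary, and core
pinning is left to something like taskset.

/* Illustrative only: a single-pair CMA copy loop using process_vm_readv().
 * The fork-based setup, sizes, and iteration count are arbitrary choices;
 * pin to specific cores externally (e.g. with taskset). */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <sys/time.h>

#define MSG_SIZE   (2UL << 20)   /* 2 MB, the message size quoted above */
#define ITERATIONS 1000

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    /* Allocate before fork() so both processes see the same virtual address. */
    char *src = malloc(MSG_SIZE);
    memset(src, 1, MSG_SIZE);

    pid_t child = fork();
    if (child == 0) {            /* child just keeps the pages mapped */
        pause();
        _exit(0);
    }

    char *dst = malloc(MSG_SIZE);
    struct iovec local  = { .iov_base = dst, .iov_len = MSG_SIZE };
    struct iovec remote = { .iov_base = src, .iov_len = MSG_SIZE };

    double t0 = now();
    for (int i = 0; i < ITERATIONS; i++) {
        if (process_vm_readv(child, &local, 1, &remote, 1, 0) != (ssize_t)MSG_SIZE) {
            perror("process_vm_readv");
            break;
        }
    }
    double elapsed = now() - t0;

    printf("CMA: %.2f GB/s\n", (double)MSG_SIZE * ITERATIONS / elapsed / 1e9);

    kill(child, SIGTERM);
    waitpid(child, NULL, 0);
    return 0;
}
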
-Nathan
On Wed, Mar 11, 2015 at 11:41:00AM -0600, Howard Pritchard wrote:
My experience with DMA engines located on the other side of a PCIe x16
gen3 bus from the CPUs is that for a couple of ranks doing large
transfers between each other on a node, using the DMA engine looks good.
But once there are multiple ranks exchanging data (like up to 32 ranks on a
dual socket h...
Definitely a side-effect, though it could be beneficial in some cases, as
the RDMA engine in the HCA may be faster than using memcpy above a
certain message size. I don't know how best to fix this, as I need all
RDMA-capable BTLs to be listed for RMA. I thought about adding another
list to track BTLs that ...
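
To make the second-list idea concrete, here is a rough sketch in
hypothetical types. None of these names or structures exist in Open MPI;
they only illustrate the selection logic: every RDMA-capable BTL stays on
the RMA list so RMA keeps working, while a separate list holds the BTLs
preferred for same-node peers.

/* Hypothetical sketch only: these types and names are not Open MPI code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const char *name;            /* e.g. "openib", "vader" */
    bool        rdma_capable;
    bool        shared_memory;   /* CMA/XPMEM style on-node transport */
} btl_t;

typedef struct {
    btl_t  *rma_btls;            /* everything RMA needs to stay correct */
    size_t  rma_count;
    btl_t  *local_btls;          /* preferred transports for same-node peers */
    size_t  local_count;
} btl_lists_t;

/* Same-node peers come from the local list first; everything else (and any
 * local peer with no shared-memory transport) falls back to the RMA list. */
static btl_t *select_btl(btl_lists_t *lists, bool peer_is_local)
{
    if (peer_is_local) {
        for (size_t i = 0; i < lists->local_count; i++) {
            if (lists->local_btls[i].shared_memory) {
                return &lists->local_btls[i];
            }
        }
    }
    for (size_t i = 0; i < lists->rma_count; i++) {
        if (lists->rma_btls[i].rdma_capable) {
            return &lists->rma_btls[i];
        }
    }
    return NULL;
}

int main(void)
{
    btl_t all[]   = { { "openib", true, false }, { "vader", true, true } };
    btl_t local[] = { { "vader",  true, true  } };
    btl_lists_t lists = { all, 2, local, 1 };

    printf("same-node peer -> %s\n", select_btl(&lists, true)->name);
    printf("remote peer    -> %s\n", select_btl(&lists, false)->name);
    return 0;
}
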
This message is mostly for Nathan, but I figured I would go with the wider
distribution. I have noticed some different behaviour that I assume started
with this change.
https://github.com/open-mpi/ompi/commit/4bf7a207e90997e75ba1c60d9d191d9d96402d04
I am noticing that the openib BTL will also be ...
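
For reference, a minimal same-node one-sided test along these lines should
exercise the path in question. This is an illustrative sketch, not the test
where I first noticed the behaviour; the message size, iteration count, and
lock/flush pattern are arbitrary.

/* Illustrative sketch: two ranks on one node, rank 0 does MPI_Put into
 * rank 1's window.  Sizes and synchronization choices are arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_SIZE   (2 << 20)   /* 2 MB */
#define ITERATIONS 1000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks on one node\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    char *base;
    MPI_Win win;
    MPI_Win_allocate(MSG_SIZE, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    char *src = malloc(MSG_SIZE);
    memset(src, 1, MSG_SIZE);

    if (rank == 0) {
        /* Passive-target epoch on rank 1; each flush forces remote completion. */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERATIONS; i++) {
            MPI_Put(src, MSG_SIZE, MPI_BYTE, 1, 0, MSG_SIZE, MPI_BYTE, win);
            MPI_Win_flush(1, win);
        }
        double elapsed = MPI_Wtime() - t0;
        MPI_Win_unlock(1, win);
        printf("MPI_Put: %.2f GB/s\n",
               (double)MSG_SIZE * ITERATIONS / elapsed / 1e9);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    free(src);
    MPI_Finalize();
    return 0;
}

Running it once with the default BTL selection and once with something like
"mpirun -n 2 --mca btl self,vader ./a.out" should make it visible which path
ends up handling the on-node transfer.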