On Jul 6, 2013, at 4:59 PM, Michael Thomadakis <drmichaelt7...@gmail.com> wrote:

> When you stack runs on Sandy Bridge nodes attached to HCAs over PCIe gen 3, do 
> you pay any special attention to the memory buffers according to which 
> socket/memory controller their physical memory belongs to?
> 
> For instance, if the HCA is attached to the PCIe gen 3 lanes of Socket 1, do 
> you do anything special when the read/write buffers map to physical memory 
> belonging to Socket 2? Or do you avoid using buffers mapping to memory that 
> belongs to (is accessible via) the other socket?

It is not *necessary* to ensure that buffers are NUMA-local to the PCI device 
they are being written to, but it certainly results in lower latency to 
read/write to PCI devices (regardless of flavor) that are attached to an MPI 
process' local NUMA node.  The Hardware Locality (hwloc) tool "lstopo" can 
print a pretty picture of your server to show you where your PCI buses are 
connected.
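
For example, here's a rough, untested sketch of how you could use hwloc's C API 
programmatically to allocate a buffer bound to the NUMA node closest to a PCI 
device.  It assumes the hwloc 2.x function names (the I/O-filtering call is 
different in older releases), picks the first PCI device it finds, and uses an 
arbitrary 1 MB buffer size -- adapt it to match your actual HCA:

/* Rough sketch: find a PCI device with hwloc and allocate a buffer bound
   to the NUMA node(s) closest to it.  Assumes the hwloc 2.x C API; most
   error checking is omitted for brevity. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    /* Keep I/O objects (PCI devices) in the topology tree. */
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_ALL);
    hwloc_topology_load(topo);

    /* Take the first PCI device for illustration; real code would match
       your HCA (e.g., by vendor/device ID). */
    hwloc_obj_t pcidev = hwloc_get_next_pcidev(topo, NULL);
    if (NULL == pcidev) {
        fprintf(stderr, "No PCI devices found (hwloc built without I/O support?)\n");
        hwloc_topology_destroy(topo);
        return 1;
    }

    /* The closest non-I/O ancestor tells you which NUMA node(s) the
       device hangs off of. */
    hwloc_obj_t local = hwloc_get_non_io_ancestor_obj(topo, pcidev);

    /* Allocate a buffer bound to that locality. */
    size_t len = 1 << 20;
    void *buf = hwloc_alloc_membind(topo, len, local->nodeset,
                                    HWLOC_MEMBIND_BIND,
                                    HWLOC_MEMBIND_BYNODESET);
    if (NULL != buf) {
        printf("Allocated %zu bytes near PCI %02x:%02x.%x\n",
               len, pcidev->attr->pcidev.bus,
               pcidev->attr->pcidev.dev, pcidev->attr->pcidev.func);
        hwloc_free(topo, buf, len);
    }

    hwloc_topology_destroy(topo);
    return 0;
}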

For TCP, Open MPI will use all TCP devices that it finds by default (because it 
is assumed that latency is so high that NUMA locality doesn't matter).  The 
openib (OpenFabrics) transport will use the "closest" HCA ports that it can 
find for each MPI process.

Our upcoming Cisco ultra-low-latency BTL defaults to using the closest Cisco 
VIC ports that it can find for short messages (i.e., to minimize latency), but 
uses all available VICs for long messages (i.e., to maximize bandwidth).

> Has this situation improved with Ivy Bridge systems or Haswell?

It's the same overall architecture (i.e., NUMA).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

