Hi,
I was comparing a simple Send-Recv program to another program that does two
memcpy's to/from shared memory. I used 2 processes and array sizes
from 10^6 to 10^8 doubles on IA64. With the --mca btl sm,self
options I get almost twice the bandwidth compared to the two memcpy's. I
Hi,
I am trying to implement the following collectives in MPI
shared memory: Alltoall, Broadcast, and Reduce, with zero-copy
optimizations. So for Reduce, my compiler allocates all the send
buffers in shared memory (mmap anonymous), and allocates only one
receive buffer, again in shared memory. Then all the