Hi, and thanks for the feedback everyone.
George Bosilca wrote:
Brian is completely right. Here is a more detailed description of this
problem.
[....]
On the other hand, I hope that not many users write such applications.
This is the best way to completely kill the performance of any MPI
implementation: overloading one process with messages. This is
exactly what MPI_Reduce and MPI_Gather do: one process gets the
final result and all other processes only have to send some data. This
behavior only arises when the gather or the reduce uses a very flat
tree, and only for short messages. Because the messages are short, there
is no handshake between the sender and the receiver, which makes
all messages unexpected, and the flat tree guarantees that there will
be a lot of small messages. If you add a barrier every now and then
(every 100 iterations, say), this problem will never happen.
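To make the argument above concrete, here is a toy model in plain Python
(no MPI involved). It only counts how many short, eagerly sent messages
are waiting at the root of a flat tree, with and without a periodic
barrier. The function name, the "drain_per_step" parameter and the
one-message-per-sender-per-iteration timing are assumptions made for
illustration; this is not a description of how Open MPI schedules or
buffers messages internally.

```python
def peak_unexpected(senders, iters, drain_per_step, barrier_every=None):
    """Toy model: peak number of unmatched short messages at the root.

    Assumptions (hypothetical, for illustration only):
      * flat tree: every sender sends one eager message per iteration
      * the root can match at most `drain_per_step` messages per iteration
      * a barrier lets nobody continue until the root has caught up
    """
    queued = 0  # messages waiting in the root's unexpected queue
    peak = 0
    for it in range(1, iters + 1):
        queued += senders                       # all senders fire without a handshake
        peak = max(peak, queued)
        queued -= min(queued, drain_per_step)   # root matches what it can
        if barrier_every and it % barrier_every == 0:
            queued = 0                          # root finished its receives before the barrier
    return peak


if __name__ == "__main__":
    print("no barrier:     ", peak_unexpected(15, 3000, 5))
    print("barrier per 100:", peak_unexpected(15, 3000, 5, barrier_every=100))
```

In this model the backlog grows with the iteration count when there is
no barrier, and is capped by the barrier interval when there is one.
How large the backlog may get before a real implementation fails is a
separate question that the model does not answer.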
I have done some more testing. Of the tested parameters, I'm observing
this behaviour in MPI_Reduce with group sizes from 16 to 44 and message
sizes from 1 to 32768 integers. For MPI_Gather, I'm observing crashes
with group sizes from 16 to 44 and from 1 to 4096 integers (per node).
In other words, it actually happens with other tree configurations and
larger packet sizes :-/
By the way, I'm also observing crashes with MPI_Bcast (groups of
size 4 to 44, with the root process (rank 0) broadcasting integer arrays
of size 16384 and 32768). It looks like the root process is crashing.
Can a sender crash because it runs out of buffer space as well?
---------- snip --------------
/home/johnm/local/ompi/bin/mpirun -hostfile lamhosts.all.r360 -np 4
./ompi-crash 16384 1 3000
{ 'groupsize' : 4, 'count' : 16384, 'bytes' : 65536, 'bufbytes' :
262144, 'iters' : 3000, 'bmno' : 1
[compute-0-0][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 16366 on node compute-0-0 exited
on signal 15 (Terminated).
3 additional processes aborted (not shown)
---------- snip --------------
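For reading the parameter line in the output above, the sizes can be
checked with a little arithmetic. The sketch below assumes 4-byte
integers and assumes that 'bufbytes' is 'bytes' multiplied by the group
size; both assumptions match this one log line but are not confirmed by
anything else in the output, and the helper name is made up.

```python
INT_BYTES = 4  # assumption: 4-byte C int


def message_sizes(groupsize, count):
    """Return the per-process payload and the assumed total buffer size in bytes."""
    nbytes = count * INT_BYTES
    return {"bytes": nbytes, "bufbytes": nbytes * groupsize}


if __name__ == "__main__":
    # Values taken from the mpirun command line: -np 4, count 16384
    print(message_sizes(4, 16384))
```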
One more thing: running a lot of collectives in a loop and computing the
total time is not the correct way to evaluate the cost of any
collective communication, simply because you will favor all algorithms
based on pipelining. There is plenty of literature about this topic.
george.
As I said in the original e-mail: I had only thrown them in for a bit of
sanity checking. I expected funny numbers, but not that OpenMPI would
crash.
The original idea was just to make a quick comparison of Allreduce,
Allgather and Alltoall in LAM and OpenMPI. The opportunity for
pipelining the operations there is rather small since they can't get
much out of phase with each other.
Regards,
--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/