Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather

John Markus Bjørndalen Thu, 28 Feb 2008 14:45:28 -0500

Hi, and thanks for the feedback everyone.

George Bosilca wrote:

Brian is completely right. Here is a more detailed description of this problem.

[....]

On the other side, I hope that not many users write such applications. This is the best way to completely kill the performance of any MPI implementation, by overloading one process with messages. This is exactly what MPI_Reduce and MPI_Gather do: one process will get the final result and all other processes only have to send some data. This behavior only arises when the gather or the reduce uses a very flat tree, and only for short messages. Because of the short messages there is no handshake between the sender and the receiver, which will make all messages unexpected, and the flat tree guarantees that there will be a lot of small messages. If you add a barrier every now and then (100 iterations) this problem will never happen.
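To make the argument above concrete, here is a toy model of the backlog at the root. It is only a sketch under stated assumptions and says nothing about Open MPI's internals: every iteration each non-root rank eagerly sends one short message, the root can match at most `drain_per_iter` of them, and a barrier lets the root drain everything that is queued. The function name and all parameters are hypothetical.

```c
/* Toy model (an assumption, not Open MPI internals): in every iteration each
 * of the (nprocs - 1) non-root ranks eagerly sends one short message, and the
 * root can match at most drain_per_iter of them. If sync_every > 0, a barrier
 * after every sync_every-th iteration lets the root drain the queue
 * completely before any rank continues.
 * Returns the peak number of unexpected messages queued at the root. */
long peak_unexpected(long nprocs, long iters, long drain_per_iter,
                     long sync_every)
{
    long queued = 0;
    long peak = 0;

    for (long i = 1; i <= iters; i++) {
        queued += nprocs - 1;          /* eager sends arrive unmatched */
        if (queued > peak)
            peak = queued;

        queued -= drain_per_iter;      /* root matches what it can */
        if (queued < 0)
            queued = 0;

        if (sync_every > 0 && i % sync_every == 0)
            queued = 0;                /* barrier: root catches up */
    }
    return peak;
}
```

In this model, with 16 ranks, 3000 iterations and a root that matches 10 messages per iteration, the backlog keeps growing for the whole run without a barrier, while a barrier every 100 iterations caps it at what can accumulate within 100 iterations.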

I have done some more testing. Of the tested parameters, I'm observing this behaviour with group sizes from 16-44, and from 1 to 32768 integers in MPI_Reduce. For MPI_Gather, I'm observing crashes with group sizes 16-44 and from 1 to 4096 integers (per node).

In other words, it actually happens with other tree configurations and larger packet sizes :-/

By the way, I'm also observing crashes with MPI_Bcast (groups of size 4-44 with the root process (rank 0) broadcasting integer arrays of size 16384 and 32768). It looks like the root process is crashing. Can a sender crash because it runs out of buffer space as well?


---------- snip --------------

/home/johnm/local/ompi/bin/mpirun -hostfile lamhosts.all.r360 -np 4 ./ompi-crash 16384 1 3000
{ 'groupsize' : 4, 'count' : 16384, 'bytes' : 65536, 'bufbytes' : 262144, 'iters' : 3000, 'bmno' : 1
[compute-0-0][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 16366 on node compute-0-0 exited on signal 15 (Terminated).

3 additional processes aborted (not shown)
---------- snip --------------
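For anyone reading the numbers in that output, the following sketch reproduces the sizes and decodes the error code. It assumes 4-byte integers, and it assumes that 'bufbytes' is one payload per rank in the group; the second point is only inferred from 262144 = 4 * 65536. The helper names are hypothetical and are not taken from the benchmark source. On Linux, errno 104 is ECONNRESET, i.e. the TCP peer reset the connection, which is what the surviving side typically sees after the process at the other end has gone away.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bytes sent by one rank, assuming 4-byte integers (int32_t). */
size_t payload_bytes(size_t count)
{
    return count * sizeof(int32_t);
}

/* Assumption: 'bufbytes' holds one payload per rank in the group. */
size_t root_buffer_bytes(size_t groupsize, size_t count)
{
    return groupsize * payload_bytes(count);
}

/* Print the platform's description of an errno value, e.g. 104. */
void explain_errno(int err)
{
    printf("errno=%d: %s\n", err, strerror(err));
}
```

Calling `payload_bytes(16384)` and `root_buffer_bytes(4, 16384)` gives the 'bytes' and 'bufbytes' values shown in the output, and `explain_errno(104)` prints the local system's text for that error.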

One more thing: doing a lot of collectives in a loop and computing the total time is not the correct way to evaluate the cost of any collective communication, simply because you will favor all algorithms based on pipelining. There is plenty of literature about this topic.
  george.

As I said in the original e-mail: I had only thrown them in for a bit of sanity checking. I expected funny numbers, but not that OpenMPI would crash.

The original idea was just to make a quick comparison of Allreduce, Allgather and Alltoall in LAM and OpenMPI. The opportunity for pipelining the operations there is rather small since they can't get much out of phase with each other.



Regards,

--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/
