Re: [OMPI users] OpenMPI 1.2.5 race condition / core dump with MPI_Reduce and MPI_Gather

George Bosilca Thu, 28 Feb 2008 17:08:57 -0500


On Feb 28, 2008, at 2:45 PM, John Markus Bjørndalen wrote:

Hi, and thanks for the feedback everyone.

George Bosilca wrote:

Brian is completely right. Here is a more detailed description of this
problem.

[....]

On the other hand, I hope that not many users write such applications.

This is the best way to completely kill the performance of any MPI
implementation: overloading one process with messages. This is
exactly what MPI_Reduce and MPI_Gather do: one process gets the
final result and all other processes only have to send some data. This
behavior only arises when the gather or the reduce uses a very flat
tree, and only for short messages. Because the messages are short, there
is no handshake between the sender and the receiver, which makes
all messages unexpected, and the flat tree guarantees that there will
be a lot of small messages. If you add a barrier every now and then
(every 100 iterations, say), this problem will never happen.
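
The effect of the periodic barrier can be illustrated with a toy model. This is only a sketch, not Open MPI code: the function name, the per-iteration "drain rate" of the root, and the example numbers are all assumptions made up for illustration. The one property it relies on is that a barrier cannot complete until the root has entered it, and the root only enters it after matching every message from the preceding iterations.

```python
def peak_unexpected(senders, iterations, root_drain_per_step, barrier_every=None):
    """Toy model: peak number of unexpected messages queued at the root.

    Hypothetical model, not Open MPI internals. Each iteration every
    non-root rank eagerly sends one short message; the root matches at
    most `root_drain_per_step` of them per iteration.
    """
    queued = 0
    peak = 0
    for i in range(1, iterations + 1):
        queued += senders                  # eager sends arrive before being matched
        peak = max(peak, queued)
        queued -= min(queued, root_drain_per_step)  # root matches what it can
        if barrier_every and i % barrier_every == 0:
            # The senders cannot leave the barrier until the root reaches it,
            # i.e. until the root has consumed the whole backlog.
            queued = 0
    return peak


if __name__ == "__main__":
    # Assumed numbers: 15 senders, a root that keeps up with 10 messages per step.
    print(peak_unexpected(15, 3000, 10))                     # backlog grows with the iteration count
    print(peak_unexpected(15, 3000, 10, barrier_every=100))  # backlog bounded by the barrier interval
```

Without the barrier the backlog in this model grows linearly with the number of iterations; with it, the backlog is capped by what can accumulate within one barrier interval.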

I have done some more testing. Of the tested parameters, I'm observing
this behaviour with group sizes from 16-44, and from 1 to 32768 integers
in MPI_Reduce. For MPI_Gather, I'm observing crashes with group sizes
16-44 and from 1 to 4096 integers (per node).

In other words, it actually happens with other tree configurations and
larger packet sizes :-/

This is the limit for the rendezvous protocol over TCP, and it is the upper limit at which this problem will arise. I strongly doubt that it is possible to create the same problem with messages larger than the eager size of your BTL ...

By the way, I'm also observing crashes with MPI_Bcast (groups of
size 4-44 with the root process (rank 0) broadcasting integer arrays of
size 16384 and 32768). It looks like the root process is crashing. Can
a sender crash because it runs out of buffer space as well?

I don't think the root crashed. I guess that one of the other nodes crashed, the root got a bad socket (which is what the first error message seems to indicate), and was then terminated. As the output is not synchronized between the nodes, one cannot rely on its order or its contents. Moreover, mpirun reports that the root was killed with signal 15, which is how we clean up the remaining processes when we detect that something really bad (like a seg fault) happened in the parallel application.



---------- snip --------------
/home/johnm/local/ompi/bin/mpirun -hostfile lamhosts.all.r360 -np 4 ./ompi-crash  16384 1 3000
{  'groupsize' : 4, 'count' : 16384, 'bytes' : 65536, 'bufbytes' : 262144, 'iters' : 3000, 'bmno' : 1
[compute-0-0][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104

mpirun noticed that job rank 0 with PID 16366 on node compute-0-0 exited on signal 15 (Terminated).
3 additional processes aborted (not shown)
---------- snip --------------


One more thing: doing a lot of collectives in a loop and computing the
total time is not the correct way to evaluate the cost of any
collective communication, simply because you will favor all algorithms
based on pipelining. There is plenty of literature about this topic.

 george.

As I said in the original e-mail: I had only thrown them in for a bit of
sanity checking. I expected funny numbers, but not that OpenMPI would
crash.

The original idea was just to make a quick comparison of Allreduce,
Allgather and Alltoall in LAM and OpenMPI. The opportunity for
pipelining the operations there is rather small since they can't get
much out of phase with each other.

There are many differences between the rooted and non-rooted collectives. All errors that you reported so far are related to rooted collectives, which makes sense. I didn't state that it is normal that Open MPI do not behave [sic]. I wonder if you can get such errors with non-rooted collectives (such as allreduce, allgather and alltoall), or with messages larger than the eager size?

If you type "ompi_info --param btl tcp", you will see the eager size for the TCP BTL. Everything smaller than this size will be sent eagerly, has the opportunity to become unexpected on the receiver side, and can lead to this problem. As a quick test, you can add "--mca btl_tcp_eager_limit 2048" to your mpirun command line, and this problem will not happen for sizes over 2K. This was the original solution for the flow control problem. If you know your application will generate thousands of unexpected messages, then you should set the eager limit to zero.
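
To see which of the tested message sizes fall on the eager side of a given limit, a small helper like the one below may be handy. It is a rough sketch under stated assumptions: integers are taken to be 4 bytes (consistent with 16384 integers being reported as 65536 bytes in the output above), the comparison is a plain "smaller than the limit" as described here, and any header bytes the BTL adds are ignored. The function name is made up for illustration; the limit itself has to come from your own ompi_info output or --mca setting.

```python
INT_BYTES = 4  # assumption: 4-byte integers, matching 16384 ints -> 65536 bytes in the log


def protocol_for(count, eager_limit):
    """Classify a message of `count` integers as 'eager' or 'rendezvous'.

    Hypothetical helper: payload strictly smaller than `eager_limit`
    (in bytes) is treated as eager; protocol headers are ignored.
    """
    nbytes = count * INT_BYTES
    return "eager" if nbytes < eager_limit else "rendezvous"


if __name__ == "__main__":
    limit = 2048  # the value suggested for the quick test
    for count in (1, 511, 512, 4096, 16384, 32768):
        print(count, count * INT_BYTES, protocol_for(count, limit))
```

With the limit set to 2048, anything from 512 integers upward would go through the rendezvous handshake in this model, so only the very small messages can still pile up as unexpected.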


  Thanks,
    george.




Regards,

--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/


