On Feb 28, 2008, at 2:45 PM, John Markus Bjørndalen wrote:
Hi, and thanks for the feedback everyone.

George Bosilca wrote:

Brian is completely right. Here is a more detailed description of this problem.

[....]

On the other side, I hope that not many users write such applications. This is the best way to completely kill the performance of any MPI implementation, by overloading one process with messages. This is exactly what MPI_Reduce and MPI_Gather do: one process will get the final result and all other processes only have to send some data. This behavior only arises when the gather or the reduce uses a very flat tree, and only for short messages. Because of the short messages there is no handshake between the sender and the receiver, which will make all messages unexpected, and the flat tree guarantees that there will be a lot of small messages. If you add a barrier every now and then (100 iterations) this problem will never happen.

I have done some more testing. Of the tested parameters, I'm observing this behaviour with group sizes from 16-44, and from 1 to 32768 integers in MPI_Reduce. For MPI_Gather, I'm observing crashes with group sizes 16-44 and from 1 to 4096 integers (per node). In other words, it actually happens with other tree configurations and larger packet sizes :-/
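The effect of an occasional barrier on a flooded root can be sketched with a toy queue model. This is plain Python, not MPI code, and it is only an illustration: the function name `peak_unexpected`, the fixed per-step drain rate, and the assumption that every sender posts exactly one eager message per step are all invented here and do not describe Open MPI internals.

```python
def peak_unexpected(senders, iterations, drain_per_step, barrier_every=None):
    """Toy model: return the largest backlog of unexpected messages at the root.

    Each step, every sender fires one eager message at the root, and the
    root matches at most `drain_per_step` queued messages. With
    `barrier_every=k`, the senders stall after every k-th step until the
    root has emptied its queue (a stand-in for a barrier).
    """
    queued = 0
    peak = 0
    for step in range(1, iterations + 1):
        queued += senders                         # eager sends arrive unannounced
        peak = max(peak, queued)
        queued = max(0, queued - drain_per_step)  # root catches up a little
        if barrier_every and step % barrier_every == 0:
            queued = 0                            # barrier: root drains everything
    return peak


if __name__ == "__main__":
    print(peak_unexpected(15, 3000, 10))                     # no barrier
    print(peak_unexpected(15, 3000, 10, barrier_every=100))  # barrier every 100 steps
```

With 15 senders, 3000 iterations and a root that matches 10 messages per step, the modelled backlog peaks at 15010 messages without a barrier and at 510 with a barrier every 100 steps. The absolute numbers mean nothing; the point is that the backlog grows with the run length in one case and stays bounded in the other.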
This is the limit for the rendez-vous protocol over TCP, and it is the upper limit at which this problem can arise. I strongly doubt that it is possible to create the same problem with messages larger than the eager size of your BTL ...
By the way, I'm also observing crashes with MPI_Broadcast (groups of size 4-44 with the root process (rank 0) broadcasting integer arrays of size 16384 and 32768). It looks like the root process is crashing. Can a sender crash because it runs out of buffer space as well?
I don't think the root crashed. I guess that one of the other nodes crashed, the root got a bad socket (which is what the first error message seems to indicate), and got terminated. As the output is not synchronized between the nodes, one cannot rely on its order or its contents. Moreover, mpirun reports that the root was killed with signal 15, which is how we clean up the remaining processes when we detect that something really bad (like a seg fault) happened in the parallel application.
---------- snip --------------
/home/johnm/local/ompi/bin/mpirun -hostfile lamhosts.all.r360 -np 4 ./ompi-crash 16384 1 3000
{ 'groupsize' : 4, 'count' : 16384, 'bytes' : 65536, 'bufbytes' : 262144, 'iters' : 3000, 'bmno' : 1
[compute-0-0][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=104
mpirun noticed that job rank 0 with PID 16366 on node compute-0-0 exited on signal 15 (Terminated).
3 additional processes aborted (not shown)
---------- snip --------------

One more thing, doing a lot of collectives in a loop and computing the total time is not the correct way to evaluate the cost of any collective communication, simply because you will favor all algorithms based on pipelining. There is plenty of literature about this topic.

george.

As I said in the original e-mail: I had only thrown them in for a bit of sanity checking. I expected funny numbers, but not that OpenMPI would crash. The original idea was just to make a quick comparison of Allreduce, Allgather and Alltoall in LAM and OpenMPI. The opportunity for pipelining the operations there is rather small since they can't get much out of phase with each other.
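The remark about looped timings favoring pipelined algorithms can be illustrated with an idealized pipeline model. This is a sketch under stated assumptions, not a description of any Open MPI algorithm: the function names are hypothetical, every stage is assumed to take the same time, and a new operation is assumed to enter the pipeline as soon as the first stage is free.

```python
def single_op_latency(stages, step_time):
    """Time for one isolated operation to traverse all pipeline stages."""
    return stages * step_time


def looped_average(stages, step_time, repetitions):
    """Average time per operation when `repetitions` operations run back to back.

    The first operation needs `stages` steps to complete; every following
    operation finishes one step after the previous one.
    """
    total = (stages + repetitions - 1) * step_time
    return total / repetitions


if __name__ == "__main__":
    print(single_op_latency(8, 1.0))     # cost of one operation on its own
    print(looped_average(8, 1.0, 3000))  # what a timing loop would report
```

In this model an 8-stage operation costs 8 time units on its own, while a loop of 3000 repetitions reports an average close to 1 unit per operation, so the loop measures throughput rather than the latency of a single collective.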
There are many differences between the rooted and non-rooted collectives. All errors that you reported so far are related to rooted collectives, which makes sense. I didn't state that it is normal for Open MPI to misbehave like this. I wonder if you can get such errors with non-rooted collectives (such as allreduce, allgather and alltoall), or with messages larger than the eager size?
If you type "ompi_info --param btl tcp", you will see what the eager size is for the TCP BTL. Everything smaller than this size will be sent eagerly, can become unexpected on the receiver side, and can lead to this problem. As a quick test, you can add "--mca btl_tcp_eager_limit 2048" to your mpirun command line, and this problem will not happen for sizes over 2K. This was the original solution for the flow control problem. If you know your application will generate thousands of unexpected messages, then you should set the eager limit to zero.
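To make the eager limit concrete, here is a small sketch that classifies a message by size. It is a simplification with labelled assumptions: `protocol_for` is a hypothetical helper, the 4-byte integer size is inferred from the log above (count 16384 reported as 65536 bytes), protocol headers are ignored, and the strict "smaller than" comparison follows the wording above rather than the Open MPI source.

```python
INT_BYTES = 4  # assumption: 4-byte int, consistent with 16384 ints -> 65536 bytes


def protocol_for(count, eager_limit_bytes):
    """Return 'eager' if `count` ints fit under the eager limit, else 'rendezvous'."""
    size = count * INT_BYTES
    # Messages below the limit are sent without a handshake and may
    # arrive unexpected; larger ones wait for the receiver.
    return "eager" if size < eager_limit_bytes else "rendezvous"


if __name__ == "__main__":
    for count in (1, 256, 1024, 16384):
        print(count, protocol_for(count, 2048))
```

With the limit set to 2048 bytes, messages of 1 and 256 integers are still sent eagerly, while 1024 and 16384 integers go through the handshake. With a limit of zero, every message is classified as rendezvous, which matches the advice for applications that generate thousands of unexpected messages.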
Thanks, george.
Regards,
--
// John Markus Bjørndalen
// http://www.cs.uit.no/~johnm/