Hi, I have a problem with sending/receiving large buffers when using openmpi (version 1.3.3), e.g.,

    MPI_Allreduce(sbuf, rbuf, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

with count=180000000. (This problem does not appear to be unique to
Allreduce; it occurs with Reduce and Bcast as well, maybe more.)

Initially I thought the maximum value for count would be 2^31-1 because
count is an int. However, when using MPICH2 I already get a segfault when
count=2^31/8, so I suspect that it transfers bytes instead of doubles
internally and that the byte count wraps around at that value. This I can
deal with (it is not nice, but I can wrap all calls such that as soon as
count > 268435456 several calls are made).

However, with openmpi I just cannot figure out what the largest permitted
value is: in most cases the MPI calls hang for count > 176763240, but this
is not completely reproducible. It appears to depend on the history, i.e.,
on which other MPI routines have been called before.

From looking at the code, as far as I understand it, the MPICH2 problem
should not appear for openmpi: the allreduce call is split up into several
calls anyway. See the loop

    for (phase = 0; phase < num_phases; phase ++) { ... }

in coll_tuned_allreduce.c. In fact that loop is executed just fine. The
"hang" occurs when ompi_coll_tuned_sendrecv is called (line 839 of
coll_tuned_allreduce.c). Here is the call of that function:

    (gdb) s
    ompi_coll_tuned_sendrecv_actual (sendbuf=0x2aab2d539410, scount=90000000,
        sdatatype=0x602530, dest=1, stag=-12, recvbuf=0x2aab02694010,
        rcount=90000000, rdatatype=0x602530, source=1, rtag=-12,
        comm=0x602730, status=0x0) at coll_tuned_util.c:41

and the program just hangs as soon as ompi_request_wait_all (line 55 of
coll_tuned_util.c) is executed.

Any ideas how to fix this?

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid Site Lead
IT Services                                phone: 778 782-4691
Simon Fraser University                    fax:   778 782-4242
Burnaby, British Columbia                  email: sieg...@sfu.ca
Canada  V5A 1S6
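
For reference, a minimal sketch of the wrapping workaround mentioned above for the MPICH2 limit: split one large call into several calls whose count stays below a chosen maximum. The names chunked_call, chunk_fn, MAX_CHUNK, allreduce_sum, fake_reduce and the USE_MPI macro are made up for illustration, and the value of MAX_CHUNK is an assumption (one less than 2^31/8). This does not address the openmpi hang, for which the safe threshold is not known.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#ifdef USE_MPI          /* hypothetical switch: compile with -DUSE_MPI */
#include <mpi.h>
#endif

/* Assumption: largest count passed to a single call. 2^31/8 - 1 keeps the
   byte count of a chunk of doubles below 2^31. */
#define MAX_CHUNK ((size_t)268435455)

/* Performs one call on a chunk of doubles; returns 0 on success. */
typedef int (*chunk_fn)(const double *sbuf, double *rbuf, int count, void *ctx);

/* Split [0, count) into pieces of at most max_chunk elements and
   invoke fn once per piece. Stops at the first non-zero return code. */
static int chunked_call(const double *sbuf, double *rbuf, size_t count,
                        size_t max_chunk, chunk_fn fn, void *ctx)
{
    size_t off = 0;

    assert(max_chunk > 0 && max_chunk <= (size_t)INT_MAX);
    while (off < count) {
        size_t n = count - off;
        if (n > max_chunk)
            n = max_chunk;
        int rc = fn(sbuf + off, rbuf + off, (int)n, ctx);
        if (rc != 0)
            return rc;
        off += n;
    }
    return 0;
}

#ifdef USE_MPI
/* Real callback: ctx points to the communicator. The cast on sbuf is for
   MPI-2 headers, where the send buffer is declared as plain void *. */
static int allreduce_sum(const double *sbuf, double *rbuf, int count, void *ctx)
{
    return MPI_Allreduce((void *)sbuf, rbuf, count, MPI_DOUBLE, MPI_SUM,
                         *(MPI_Comm *)ctx);
}
#endif

/* Stand-in callback for trying the splitting logic without MPI:
   copies the chunk and records how it was called. */
struct call_log { int calls; size_t total; };

static int fake_reduce(const double *sbuf, double *rbuf, int count, void *ctx)
{
    struct call_log *log = ctx;

    memcpy(rbuf, sbuf, (size_t)count * sizeof *sbuf);
    log->calls++;
    log->total += (size_t)count;
    return 0;
}
```

With MPI available, the original call would then become something like chunked_call(sbuf, rbuf, count, MAX_CHUNK, allreduce_sum, &comm), where comm is an MPI_Comm variable holding MPI_COMM_WORLD. Because MPI_SUM acts element by element, reducing the buffer in pieces gives the same result as one large call; the same splitting would work for Reduce and Bcast with a different callback.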