Holy crimminey, I'm totally lost in your Fortran syntax.  :-)

What you describe might be a bug in our MPI_IN_PLACE handling for 
MPI_ALLREDUCE. 

Could you possibly make a small test case that a) we can run, and b) uses 
straightforward Fortran? (avoid using terms like "assumed shape" and "assumed 
size" and ...any other Fortran stuff that confuses simple C programmers like us 
:-) )
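
For example, a self-contained sketch along these lines would be perfect (untested, 
and just my guess at the shape of test that would help: a plain contiguous buffer, 
MPI_IN_PLACE, and many iterations over MPI_COMM_WORLD):

    program inplace_test
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000
      real    :: buf(n)
      integer :: ierr, rank, iter

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      buf = real(rank)
      ! sum in place across all ranks, repeated many times
      do iter = 1, 100
         call mpi_allreduce(MPI_IN_PLACE, buf, n, MPI_REAL, MPI_SUM, &
                            MPI_COMM_WORLD, ierr)
      end do
      if (rank == 0) print *, 'done, buf(1) = ', buf(1)
      call mpi_finalize(ierr)
    end program inplace_test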

What version of Open MPI is this?


On Sep 8, 2011, at 5:59 PM, Greg Fischer wrote:

> Note also that coding the mpi_allreduce as:
> 
>    call mpi_allreduce(MPI_IN_PLACE, phim(0,1,1,1,grp), &
>                       phim_size*im*jm*kmloc(coords(2)+1), &
>                       mpi_real, mpi_sum, ang_com, ierr)
> 
> results in the same freezing behavior in the 60th iteration.  (I don't recall 
> why the arrays were being passed, possibly just a mistake.)
> 
> 
> On Thu, Sep 8, 2011 at 4:17 PM, Greg Fischer <greg.a.fisc...@gmail.com> wrote:
> I am seeing mpi_allreduce operations freeze execution of my code on some 
> moderately-sized problems.  The freeze does not manifest itself in every 
> problem.  In addition, it is in a portion of the code that is repeated many 
> times.  In the case discussed below, the freeze appears in the 60th 
> iteration.
> 
> The current test case that I'm looking at is a 64-processor job.  This 
> particular mpi_allreduce call is executed by all 64 processors, with each 
> communicator in the call containing a total of 4 processors.  When I add 
> print statements before and after the offending line, I see that all 64 
> processors successfully make it to the mpi_allreduce call, but only 32 
> successfully exit.  Stack traces on the other 32 yield something along the 
> lines of the trace listed at the bottom of this message.  The call itself 
> looks like:
> 
>  call mpi_allreduce(MPI_IN_PLACE, &
>                     phim(0:(phim_size-1),1:im,1:jm,1:kmloc(coords(2)+1),grp), &
>                     phim_size*im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
> 
> These messages are sized to remain under the 32-bit integer size limitation 
> for the "count" parameter.  The intent is to perform the allreduce operation 
> on a contiguous block of the array.  Previously, I had been passing an 
> assumed-shape array (i.e., phim(:,:,:,:,grp)), but found some documentation 
> indicating that was potentially dangerous.  Making the change from assumed- 
> to explicit-shape arrays doesn't solve the problem.  However, if I declare 
> an additional array and use separate send and receive buffers:
> 
>  call mpi_allreduce(phim_local,phim_global,phim_size*im*jm*kmloc(coords(2)+1), &
>                     mpi_real,mpi_sum,ang_com,ierr)
>  phim(:,:,:,:,grp) = phim_global
> 
> then the problem goes away and everything works normally.  Does anyone have 
> any insight into what may be happening here?  I'm using "include 'mpif.h'" 
> rather than the f90 module; could that potentially explain this?
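
(Side note: spelled out, the separate-buffer workaround described above looks 
roughly like the sketch below; the declarations, allocations, and bounds are 
assumptions for illustration, not the code from the original post.)

    ! sketch only: temporaries shaped like the slice of phim being reduced
    real, allocatable :: phim_local(:,:,:,:), phim_global(:,:,:,:)
    integer :: n

    n = phim_size*im*jm*kmloc(coords(2)+1)
    allocate(phim_local (phim_size, im, jm, kmloc(coords(2)+1)))
    allocate(phim_global(phim_size, im, jm, kmloc(coords(2)+1)))

    ! copy the contiguous slice out, reduce with separate buffers, copy back
    phim_local = phim(0:(phim_size-1), 1:im, 1:jm, 1:kmloc(coords(2)+1), grp)
    call mpi_allreduce(phim_local, phim_global, n, mpi_real, mpi_sum, ang_com, ierr)
    phim(0:(phim_size-1), 1:im, 1:jm, 1:kmloc(coords(2)+1), grp) = phim_global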
> 
> Thanks,
> Greg
> 
> Stack trace(s) for thread: 1
> -----------------
> [0] (1 processes)
> -----------------
> main() at ?:?
>   solver() at solver.f90:31
>     solver_q_down() at solver_q_down.f90:52
>       iter() at iter.f90:56
>         mcalc() at mcalc.f90:38
>           pmpi_allreduce__() at ?:?
>             PMPI_Allreduce() at ?:?
>               ompi_coll_tuned_allreduce_intra_dec_fixed() at ?:?
>                 ompi_coll_tuned_allreduce_intra_ring_segmented() at ?:?
>                   ompi_coll_tuned_sendrecv_actual() at ?:?
>                     ompi_request_default_wait_all() at ?:?
>                       opal_progress() at ?:?
> Stack trace(s) for thread: 2
> -----------------
> [0] (1 processes)
> -----------------
> start_thread() at ?:?
>   btl_openib_async_thread() at ?:?
>     poll() at ?:?
> Stack trace(s) for thread: 3
> -----------------
> [0] (1 processes)
> -----------------
> start_thread() at ?:?
>   service_thread_start() at ?:?
>     select() at ?:?
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

