Re: [OMPI users] MPI_Reduce performance
Gabriele,

Can you clarify: those timings are what is reported for the reduction call specifically, not the total execution time? If so, then the difference is, to a first approximation, the time you spend sitting idly by, doing absolutely nothing, waiting at the barrier.

Ciao
Terry

--
Dr. Terry Frankcombe
Research School of Chemistry, Australian National University
Ph: (+61) 0417 163 509  Skype: terry.frankcombe
Re: [OMPI users] MPI_Reduce performance
On 8 Sep 2010, at 10:21, Gabriele Fatigati wrote:

> So, in my opinion, it is better to put MPI_Barrier before any MPI_Reduce
> to mitigate the "asynchronous" behaviour of MPI_Reduce in OpenMPI. I
> suspect the same for other collective communications. Can someone
> explain to me why MPI_Reduce has this strange behaviour?

There are many cases where adding an explicit barrier before a call to reduce would be superfluous, so the standard rightly says that it isn't needed and need not be performed. As you've seen, though, there are also cases where it can help.

I'd be interested to know the effect if you only added a barrier before MPI_Reduce occasionally, perhaps every one or two hundred iterations; this can also have a beneficial effect, as a barrier on every iteration adds significant overhead.

This is a textbook example of where the new asynchronous barrier could help: in theory it should be able to keep processes in sync without any additional overhead in the case that they are already well synchronised.

Ashley.

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Re: [OMPI users] MPI_Reduce performance
Doing Reduce without a Barrier first allows one process to call Reduce and exit immediately, without waiting for the other processes to call Reduce. This allows some processes to advance faster than others. I suspect the 2671-second result is the difference between the fastest and the slowest process. Having the Barrier reduces the time difference because it forces the faster processes to wait.

On Wed, Sep 8, 2010 at 3:21 AM, Gabriele Fatigati wrote:

> Dear OpenMPI users,
>
> I'm using OpenMPI 1.3.3 on an Infiniband 4x interconnection network. My
> parallel application makes intensive use of MPI_Reduce over a
> communicator created with MPI_Comm_split.
>
> I've noted strange behaviour during execution. My code is instrumented
> with Scalasca 1.3 to report subroutine execution times. The first
> execution shows the elapsed time with 128 processors (job_communicator
> is created with MPI_Comm_split; in both cases it is composed of the
> same ranks as MPI_COMM_WORLD):
>
> MPI_Reduce(..., job_communicator)
>
> The elapsed time is 2671 sec.
>
> The second run uses MPI_Barrier before MPI_Reduce:
>
> MPI_Barrier(job_communicator)
> MPI_Reduce(..., job_communicator)
>
> The elapsed time of Barrier+Reduce is 2167 sec (about 8 minutes less).
>
> So, in my opinion, it is better to put MPI_Barrier before any MPI_Reduce
> to mitigate the "asynchronous" behaviour of MPI_Reduce in OpenMPI. I
> suspect the same for other collective communications. Can someone
> explain to me why MPI_Reduce has this strange behaviour?
>
> Thanks in advance.

--
David Zhang
University of California, San Diego
[OMPI users] MPI_Reduce performance
Dear OpenMPI users,

I'm using OpenMPI 1.3.3 on an Infiniband 4x interconnection network. My parallel application makes intensive use of MPI_Reduce over a communicator created with MPI_Comm_split.

I've noted strange behaviour during execution. My code is instrumented with Scalasca 1.3 to report subroutine execution times. The first execution shows the elapsed time with 128 processors (job_communicator is created with MPI_Comm_split; in both cases it is composed of the same ranks as MPI_COMM_WORLD):

MPI_Reduce(..., job_communicator)

The elapsed time is 2671 sec.

The second run uses MPI_Barrier before MPI_Reduce:

MPI_Barrier(job_communicator)
MPI_Reduce(..., job_communicator)

The elapsed time of Barrier+Reduce is 2167 sec (about 8 minutes less).

So, in my opinion, it is better to put MPI_Barrier before any MPI_Reduce to mitigate the "asynchronous" behaviour of MPI_Reduce in OpenMPI. I suspect the same for other collective communications. Can someone explain to me why MPI_Reduce has this strange behaviour?

Thanks in advance.

--
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Technologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it  Tel: +39 051 6171722
g.fatigati [AT] cineca.it