Hello All

Gabriele's question, Ashley's recipe, and Dick Treumann's cautionary words may all be part of a larger load-balance picture, or perhaps not?

Would Ashley's recipe of sporadic barriers be a silver bullet for
load-imbalance problems, regardless of which collectives or
even point-to-point calls are in use?

I have in mind, for instance, our big climate models.
Some of them work in MPMD mode, where several executables
representing atmosphere, ocean, etc., have their own
communicators but interact with each other indirectly,
coordinated by a flux coupler (within yet another communicator).
The coupler receives, merges, and sends data across the other
(physical) components.
The components don't talk to each other:
the coupler is the broker, the master.
I'd guess this structure is used in other fields and areas of
application as well.
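
For concreteness, here is a rough sketch of the broker pattern I have
in mind.  It is purely illustrative: the inter-communicators, tags,
field sizes, and merge_fields() routine are all made up, not taken
from any real coupler.

#include <mpi.h>

#define NFIELD 1024   /* illustrative field size */

/* Inter-communicators to atmosphere, ocean, ...; assumed to have been
 * set up earlier (e.g. with MPI_Intercomm_create).  Hypothetical. */
extern MPI_Comm comm_to_component[];
extern int      n_components;

/* Hypothetical merge of one incoming field into the running result. */
void merge_fields(double merged[], const double incoming[], int n);

void coupler_step(void)
{
    double incoming[NFIELD], merged[NFIELD] = {0};

    /* The coupler cannot proceed until the slowest component has sent
     * its field -- this is where the "sit and wait" happens. */
    for (int c = 0; c < n_components; c++) {
        MPI_Recv(incoming, NFIELD, MPI_DOUBLE, 0 /* remote rank */, 0,
                 comm_to_component[c], MPI_STATUS_IGNORE);
        merge_fields(merged, incoming, NFIELD);
    }

    /* Send the merged fluxes back to every component. */
    for (int c = 0; c < n_components; c++)
        MPI_Send(merged, NFIELD, MPI_DOUBLE, 0 /* remote rank */, 1,
                 comm_to_component[c]);
}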

More often than not, some components lag behind (regardless of how
much you tune the number of processors assigned to each component),
slowing down the whole scheme.
The coupler must sit and wait for that late component,
the other components must sit and wait for the coupler,
and the (vicious) "positive feedback" cycle that
Ashley mentioned goes on and on.

Would sporadic barriers in the flux coupler "shake up" these delays?

Ashley: How did you arrive at the magic number of 25 iterations for
the sporadic barriers?
Would it be application- and communication-pattern dependent?

Many thanks,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Richard Treumann wrote:

Ashley's observation may apply to an application that iterates on
many-to-one communication patterns.  If the only collective used is
MPI_Reduce, some non-root tasks can get ahead and keep pushing
iteration results at tasks that are nearer the root.  This could
overload them and cause some extra slowdown.  In most parallel
applications there is some web of interdependency across tasks between
iterations that keeps them roughly in step.  I find it hard to believe
that there are many programs that need semantically redundant
MPI_Barriers.

For example -

In a program that does neighbor communication, no task can get very far
ahead of its neighbors.  It is possible for a task at one corner to be
a few steps ahead of a task at the opposite corner, but only a few
steps.  In that case, though, the distant task is not affected by the
one that is out ahead anyway; it is only affected by its immediate
neighbors.
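
For concreteness, a minimal sketch of that kind of neighbor exchange
(a 1-D decomposition with hypothetical array sizes; not taken from any
particular code) shows why no task can run ahead: the MPI_Sendrecv
calls cannot complete until the immediate neighbors reach them too.

#include <mpi.h>

#define N_LOCAL 1000   /* interior points per rank (illustrative) */

/* u[0] and u[N_LOCAL+1] are ghost cells; u[1..N_LOCAL] is the interior. */
void halo_exchange(double u[N_LOCAL + 2], MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Send rightmost interior point right, receive left ghost cell. */
    MPI_Sendrecv(&u[N_LOCAL], 1, MPI_DOUBLE, right, 0,
                 &u[0],       1, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);

    /* Send leftmost interior point left, receive right ghost cell. */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  1,
                 &u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 1,
                 comm, MPI_STATUS_IGNORE);
}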

In a program that does an MPI_Bcast from root and an MPI_Reduce to root
in each iteration, no task gets far ahead, because a task that finishes
the Bcast early just waits longer at the Reduce.

A program that makes a call to a non-rooted collective every iteration
will stay in pretty tight sync.

Think carefully before tossing in either MPI_Barrier or some
non-blocking barrier.  Unless MPI_Bcast or MPI_Reduce is the only
collective you call, your problem is likely not progress skew.


Dick Treumann - MPI Team IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363



From:   Ashley Pittman <ash...@pittman.co.uk>
To:     Open MPI Users <us...@open-mpi.org>
Date:   09/09/2010 03:53 AM
Subject:        Re: [OMPI users] MPI_Reduce performance
Sent by:        users-boun...@open-mpi.org


On 9 Sep 2010, at 08:31, Terry Frankcombe wrote:

 > On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote:
 >> As people have said, these time values are to be expected. All they
 >> reflect is the time difference spent in reduce waiting for the slowest
 >> process to catch up to everyone else. The barrier removes that factor
 >> by forcing all processes to start from the same place.
 >>
 >>
 >> No mystery here - just a reflection of the fact that your processes
 >> arrive at the MPI_Reduce calls at different times.
 >
 >
 > Yes, however, it seems Gabriele is saying the total execution time
 > *drops* by ~500 s when the barrier is put *in*.  (Is that the right way
 > around, Gabriele?)
 >
 > That's harder to explain as a sync issue.

Not really: you need some way of keeping processes in sync, or else the
slow ones get slower while the fast ones stay fast.  If you have an
unbalanced algorithm you can end up swamping certain ranks, and once
they get behind they get even slower and performance goes off a cliff
edge.

Adding sporadic barriers keeps everything in sync and running nicely.
If things are performing well, the barrier only slows you down, but if
there is a problem it brings all processes back together and breaks the
positive-feedback cycle; this is why you often only need a
synchronisation point every so often.  I'm also a huge fan of
asynchronous barriers, as a full sync is a blunt and slow operation:
with asynchronous barriers you can allow small differences in timing
while preventing them from getting too large, with very little overhead
in the common case where processes are already in sync.  I'm thinking
specifically of starting a sync-barrier on iteration N, waiting for it
on iteration N+25, and immediately starting another one, again waiting
for it 25 steps later.
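
A minimal sketch of that pattern, written against MPI-3's non-blocking
MPI_Ibarrier (standardised after this thread, so it stands in here for
whatever asynchronous barrier was actually meant; the interval of 25 is
the one mentioned above and is presumably application-dependent):

#include <mpi.h>

#define SYNC_INTERVAL 25   /* iterations between barrier completions */

void compute_one_iteration(int iter);   /* application work; assumed */

void main_loop(MPI_Comm comm, int n_iters)
{
    MPI_Request barrier_req;

    /* Start the first asynchronous barrier immediately. */
    MPI_Ibarrier(comm, &barrier_req);

    for (int iter = 0; iter < n_iters; iter++) {
        compute_one_iteration(iter);

        /* Every SYNC_INTERVAL iterations, wait for the barrier started
         * SYNC_INTERVAL iterations ago, then start the next one.  Ranks
         * can drift by up to SYNC_INTERVAL iterations, but no further. */
        if ((iter + 1) % SYNC_INTERVAL == 0) {
            MPI_Wait(&barrier_req, MPI_STATUS_IGNORE);
            MPI_Ibarrier(comm, &barrier_req);
        }
    }

    /* Complete the last outstanding barrier before leaving. */
    MPI_Wait(&barrier_req, MPI_STATUS_IGNORE);
}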

Ashley.

--

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk/

