Re: [OMPI users] MPI_Reduce performance
On Sep 9, 2010, at 4:31 PM, Ashley Pittman wrote:

>> What is the exact semantics of an asynchronous barrier,
>
> I'm not sure of the exact semantics, but once you've got your head around the concept it's fairly simple to understand how to use it: you call MPI_Ibarrier() and it gives you a handle you can test with MPI_Test() or block on with MPI_Wait(). The tricky part comes in how many times you can call MPI_Ibarrier() on a communicator without waiting for the previous barriers to complete, but I haven't been following the discussion on this one to know the specifics.

FWIW, here's a brief writeup of MPI_Ibarrier:
http://blogs.cisco.com/ciscotalk/performance/comments/ibarrier/

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] MPI_Reduce performance
The difference is that MPI_Barrier, or even MPI_Ibarrier, is a crude tool for the job. It is likely to leave resources idle that could be doing productive work. I agree it is part of the tool kit, but if this kind of problem is significant enough that a textbook should cover it, then I would advocate that people be taught where to look for the problem and how to choose among the tools. The first step is to look for ways to improve load balance. The second step is to look for ways to impede as few processes as possible, and only when they must be impeded.

I am also suggesting that there is a middle class of application that improves with a semantically redundant barrier: applications with a significant design flaw that it would be better to find and fix than to mask.

The MPI standard provides flow control to prevent eager-message flooding, but it does not provide flow control for descriptor flooding. Descriptor flooding may even bring down a job, and the standard does not oblige the MPI implementation to prevent that. This problem was understood from the beginning, but it cannot be solved within the semantic rules for non-blocking Send/Recv. The options were to give the problem to the application writer or to say "A non-blocking operation will return promptly unless there is a resource shortage."

Dick

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

Eugene Loh wrote on 09/10/2010 10:27:02 AM:

> Richard Treumann wrote:
> > Hi Ashley
> >
> > I understand the problem with descriptor flooding can be serious in an application with unidirectional data dependency. Perhaps we have a different perception of how common that is.
>
> Ashley speculated it was a "significant minority." I don't know what that means, but it seems like it is a minority (most computations have causal relationships among the processes holding unbounded imbalances in check) and yet we end up seeing these exceptions.
>
> > I think that adding some flow control to the application is a better solution than a semantically redundant barrier.
>
> It seems to me there is no difference. Flow control, at this level, is just semantically redundant synchronization. A barrier is just a special case of that.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] MPI_Reduce performance
Richard Treumann wrote:
> Hi Ashley
>
> I understand the problem with descriptor flooding can be serious in an application with unidirectional data dependency. Perhaps we have a different perception of how common that is.

Ashley speculated it was a "significant minority." I don't know what that means, but it seems like it is a minority (most computations have causal relationships among the processes holding unbounded imbalances in check) and yet we end up seeing these exceptions.

> I think that adding some flow control to the application is a better solution than a semantically redundant barrier.

It seems to me there is no difference. Flow control, at this level, is just semantically redundant synchronization. A barrier is just a special case of that.
Re: [OMPI users] MPI_Reduce performance
Hi Ashley

I understand the problem with descriptor flooding can be serious in an application with unidirectional data dependency. Perhaps we have a different perception of how common that is. It seems to me that such programs would be very rare, but if they are more common than I imagine, then discussion of how to modulate them is worthwhile.

In many cases, I think that adding some flow control to the application is a better solution than a semantically redundant barrier. (A barrier that is there only to affect performance, not correctness, is what I mean by semantically redundant.)

For example, a Master/Worker application could have each worker break after every 4th send to the master and post an MPI_Recv for an OK_to_continue token. If the token had been sent, this would delay the worker a few microseconds. If it had not been sent, the worker would be kept waiting. The master would keep track of how many messages from each worker it had absorbed and, on message 3 from a particular worker, send an OK_to_continue token to that worker. The master would keep sending OK_to_continue tokens every 4th recv from then on (7, 11, 15, ...). The descriptor queues would all remain short, and only a worker that the master could not keep up with would ever lose a chance to keep working. By sending the OK_to_continue token a bit early, the application would ensure that when there was no backlog, every worker would find a token when it looked for it, and there would be no significant loss of compute time.

Even with a non-blocking barrier and a 10-step lag between Ibarrier and Wait, if some worker is slow for 12 steps, the fast workers end up being kept in a non-productive MPI_Wait.
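[Editor's note: Dick's OK_to_continue protocol can be sketched as a toy simulation. This is a pure-Python model, not MPI code; the batch size of 4, the master's consumption rate, and all names are illustrative assumptions drawn from his description of one worker.]

```python
from collections import deque

def simulate(steps, master_period, use_tokens=True):
    """Toy model of the OK_to_continue flow control for one worker.

    The worker sends one result per step, but after every 4th send it
    must hold a token before sending again.  The master absorbs one
    message every `master_period` steps and issues a token on message 3
    from the worker and every 4th message after that (7, 11, 15, ...).
    Returns the largest backlog ever queued at the master.
    """
    queue = deque()        # results sent but not yet absorbed
    tokens = 0             # OK_to_continue tokens held by the worker
    sent = absorbed = 0
    max_backlog = 0
    for t in range(steps):
        # Worker side: stall at every 4-send boundary until a token arrives.
        if use_tokens and sent > 0 and sent % 4 == 0 and tokens == 0:
            pass                       # blocked in MPI_Recv for the token
        else:
            if use_tokens and sent > 0 and sent % 4 == 0:
                tokens -= 1            # consume the token covering this batch
            queue.append(sent)
            sent += 1
        # Master side: absorb one message every master_period steps.
        if t % master_period == 0 and queue:
            queue.popleft()
            absorbed += 1
            if absorbed % 4 == 3:      # absorbed message 3, 7, 11, ...
                tokens += 1
        max_backlog = max(max_backlog, len(queue))
    return max_backlog
```

With the token protocol the backlog at the master stays bounded at a handful of messages even when the master is half the worker's speed; without it, the backlog grows linearly with the number of steps, which is exactly the descriptor flooding being discussed.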
Dick

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

Ashley Pittman wrote on 09/09/2010 05:34:15 PM:

> On 9 Sep 2010, at 21:40, Richard Treumann wrote:
> > Ashley
> >
> > Can you provide an example of a situation in which these semantically redundant barriers help?
>
> I'm not making the case for semantically redundant barriers, I'm making a case for implicit synchronisation in every iteration of an application. Many applications have this already by nature of the data flow required; anything that calls MPI_Allgather or MPI_Allreduce is the easiest to verify, but there are many other ways of achieving the same thing. My point is about the subset of programs which don't have this attribute and are therefore susceptible to synchronisation problems. It's my experience that for low iteration counts these codes can run fine, but once they hit a problem they go over a cliff edge performance-wise and there is no way back from there until the end of the job. The email from Gabriele would appear to be a case that demonstrates this problem, but I've seen it many times before.
>
> Using your previous email as an example, I would describe adding barriers to a program as a way of artificially reducing the "elasticity" of the program to ensure balanced use of resources.
>
> > I may be missing something but my statement for the text book would be
> >
> > "If adding a barrier to your MPI program makes it run faster, there is almost certainly a flaw in it that is better solved another way."
> >
> > The only exception I can think of is some sort of one-direction data dependency with messages small enough to go eagerly. A program that calls MPI_Reduce with a small message and the same root every iteration and calls no other collective would be an example.
> >
> > In that case, fast tasks at leaf positions would run free and a slow task near the root could pile up early arrivals and end up with some additional slowing. Unless it was driven into paging I cannot imagine the slowdown would be significant though.
>
> I've diagnosed problems where the cause was a receive queue of tens of thousands of messages; in this case each and every receive performs slowly unless the descriptor is near the front of the queue, so the concern is not purely about memory usage at individual processes, although that can also be a factor.
>
> Ashley,
>
> --
> Ashley Pittman, Bath, UK.
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
Re: [OMPI users] MPI_Reduce performance
Eugene, you did not take into account the dispersion/dephasing between different processes. As the cluster size and the number of instances of the parallel process increase, the dispersion increases as well, making different instances kind of out of sync - not really out of sync, but just because of the different speeds of execution on different nodes, delays, etc. If you account for this, you get the result I mentioned.

Alex

----- Original Message -----
From: Eugene Loh
To: Open MPI Users
Sent: Thursday, September 09, 2010 11:32 PM
Subject: Re: [OMPI users] MPI_Reduce performance

> Alex A. Granovsky wrote:
> > Isn't it evident from the theory of random processes and probability theory that, in the limit of an infinitely large cluster and parallel process, the probability of deadlock with the current implementation is unfortunately quite a finite quantity and in the limit approaches unity, regardless of any particular details of the program?
>
> No, not at all. Consider simulating a physical volume. Each process is assigned to some small subvolume. It updates conditions locally, but on the surface of its simulation subvolume it needs information from "nearby" processes. It cannot proceed along the surface until it has that neighboring information. Its neighbors, in turn, cannot proceed until their neighbors have reached some point. Two distant processes can be quite out of step with one another, but only by some bounded amount. At some point, a leading process has to wait for information from a laggard to propagate to it. All processes proceed together, in some loose lock-step fashion. Many applications behave in this fashion.
>
> Actually, in many applications, the synchronization is tightened in that "physics" is made to propagate faster than neighbor-by-neighbor. As the number of processes increases, the laggard might seem relatively slower in comparison, but that isn't deadlock.
>
> As the size of the cluster increases, the chances of a system component failure increase, but that also is a different matter.
Re: [OMPI users] MPI_Reduce performance
Ashley

Can you provide an example of a situation in which these semantically redundant barriers help?

I may be missing something, but my statement for the textbook would be:

"If adding a barrier to your MPI program makes it run faster, there is almost certainly a flaw in it that is better solved another way."

The only exception I can think of is some sort of one-direction data dependency with messages small enough to go eagerly. A program that calls MPI_Reduce with a small message and the same root every iteration, and calls no other collective, would be an example.

In that case, fast tasks at leaf positions would run free, and a slow task near the root could pile up early arrivals and end up with some additional slowing. Unless it was driven into paging, I cannot imagine the slowdown would be significant, though. Even that should not be a problem for an MPI implementation that backs off on eager send before it floods the early-arrival buffers.

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363
Re: [OMPI users] MPI_Reduce performance
On 9 Sep 2010, at 21:10, jody wrote:
> Hi
> @Ashley:
> What is the exact semantics of an asynchronous barrier,

I'm not sure of the exact semantics, but once you've got your head around the concept it's fairly simple to understand how to use it: you call MPI_Ibarrier() and it gives you a handle you can test with MPI_Test() or block on with MPI_Wait(). The tricky part comes in how many times you can call MPI_Ibarrier() on a communicator without waiting for the previous barriers to complete, but I haven't been following the discussion on this one to know the specifics.

> and is it part of the MPI specs?

It will be a part of the next revision of the standard, I believe. It's been a long time coming, and there is at least one implementation out there already, but I can't comment on its usability today. To be clear, it's something I've long advocated and have implemented and played around with in the past; however, it's not yet available to users today. I believe it will be shortly, and as you'll have read, my belief is that it's going to be a very useful addition to the MPI offering.

Ashley,

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
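[Editor's note: the split-phase pattern Ashley describes (enter now, complete later) can be illustrated with a toy single-process model. This is not MPI: the class and method names below are invented for illustration; in real code the calls would be MPI_Ibarrier/MPI_Test/MPI_Wait on a communicator.]

```python
class SplitPhaseBarrier:
    """Toy model of a non-blocking barrier for a fixed group size.

    ibarrier() returns immediately with a handle; the handle only tests
    as complete once every participant has entered the barrier.  A
    single-threaded illustration of the semantics, not real MPI.
    """
    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.entered = 0

    def ibarrier(self):
        """Enter the barrier without blocking; return a completion handle."""
        self.entered += 1
        return Handle(self)

class Handle:
    def __init__(self, barrier):
        self.barrier = barrier

    def test(self):
        """Non-blocking completion check, like MPI_Test."""
        return self.barrier.entered >= self.barrier.nprocs

# One "rank" enters early and can keep computing while the others catch up.
bar = SplitPhaseBarrier(3)
h = bar.ibarrier()        # rank 0 enters; does not block
print(h.test())           # False: only 1 of 3 ranks has entered
bar.ibarrier()            # rank 1 enters
bar.ibarrier()            # rank 2 enters
print(h.test())           # True: the early handle now completes
```

The point of the split-phase form is exactly what the toy shows: the gap between entering the barrier and its completion is time the rank can spend computing rather than blocking.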
Re: [OMPI users] MPI_Reduce performance
Hi
@Ashley:
What is the exact semantics of an asynchronous barrier, and is it part of the MPI specs?

Thanks
Jody

On Thu, Sep 9, 2010 at 9:34 PM, Ashley Pittman wrote:
>
> On 9 Sep 2010, at 17:00, Gus Correa wrote:
>
>> Hello All
>>
>> Gabrielle's question, Ashley's recipe, and Dick Treumann's cautionary words may be part of a larger context of load balance, or not?
>>
>> Would Ashley's recipe of sporadic barriers be a silver bullet to improve load imbalance problems, regardless of which collectives or even point-to-point calls are in use?
>
> No, it only holds where there is no data dependency between some of the ranks. In particular, if there are any non-rooted collectives in an iteration of your code then it cannot make any difference at all; likewise, if you have a reduce followed by a barrier using the same root, for example, then you already have global synchronisation each iteration and it won't help. My feeling is that it applies to a significant minority of problems; certainly the phrase "adding barriers can make codes faster" should be textbook stuff if it isn't already.
>
>> Would sporadic barriers in the flux coupler "shake up" these delays?
>
> I don't fully understand your description, but it sounds like it might set the program back to a clean slate, which would give you per-iteration delays only rather than cumulative or worse delays.
>
>> Ashley: How did you get to the magic number of 25 iterations for the sporadic barriers?
>
> Experience and finger in the air. The major factors in picking this number are the likelihood of a positive feedback cycle of delays happening, the delay these delays add, and the cost of a barrier itself. Having too low a value will slightly reduce performance; having too high a value can drastically reduce performance.
>
> As a further item (because I like them), the asynchronous barrier is even better again if used properly: in the good case it doesn't cause any process to block, ever, so the cost is only that of the CPU cycles the code takes itself; in the bad case, where it has to delay a rank, this tends to have a positive impact on performance.
>
>> Would it be application/communicator pattern dependent?
>
> Absolutely.
>
> Ashley,
>
> --
> Ashley Pittman, Bath, UK.
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
Re: [OMPI users] MPI_Reduce performance
On 9 Sep 2010, at 17:00, Gus Correa wrote:
> Hello All
>
> Gabrielle's question, Ashley's recipe, and Dick Treumann's cautionary words may be part of a larger context of load balance, or not?
>
> Would Ashley's recipe of sporadic barriers be a silver bullet to improve load imbalance problems, regardless of which collectives or even point-to-point calls are in use?

No, it only holds where there is no data dependency between some of the ranks. In particular, if there are any non-rooted collectives in an iteration of your code then it cannot make any difference at all; likewise, if you have a reduce followed by a barrier using the same root, for example, then you already have global synchronisation each iteration and it won't help. My feeling is that it applies to a significant minority of problems; certainly the phrase "adding barriers can make codes faster" should be textbook stuff if it isn't already.

> Would sporadic barriers in the flux coupler "shake up" these delays?

I don't fully understand your description, but it sounds like it might set the program back to a clean slate, which would give you per-iteration delays only rather than cumulative or worse delays.

> Ashley: How did you get to the magic number of 25 iterations for the sporadic barriers?

Experience and finger in the air. The major factors in picking this number are the likelihood of a positive feedback cycle of delays happening, the delay these delays add, and the cost of a barrier itself. Having too low a value will slightly reduce performance; having too high a value can drastically reduce performance.

As a further item (because I like them), the asynchronous barrier is even better again if used properly: in the good case it doesn't cause any process to block, ever, so the cost is only that of the CPU cycles the code takes itself; in the bad case, where it has to delay a rank, this tends to have a positive impact on performance.

> Would it be application/communicator pattern dependent?

Absolutely.

Ashley,

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Re: [OMPI users] MPI_Reduce performance
Alex A. Granovsky wrote:
> Isn't it evident from the theory of random processes and probability theory that, in the limit of an infinitely large cluster and parallel process, the probability of deadlock with the current implementation is unfortunately quite a finite quantity and in the limit approaches unity, regardless of any particular details of the program?

No, not at all. Consider simulating a physical volume. Each process is assigned to some small subvolume. It updates conditions locally, but on the surface of its simulation subvolume it needs information from "nearby" processes. It cannot proceed along the surface until it has that neighboring information. Its neighbors, in turn, cannot proceed until their neighbors have reached some point. Two distant processes can be quite out of step with one another, but only by some bounded amount. At some point, a leading process has to wait for information from a laggard to propagate to it. All processes proceed together, in some loose lock-step fashion. Many applications behave in this fashion.

Actually, in many applications, the synchronization is tightened in that "physics" is made to propagate faster than neighbor-by-neighbor. As the number of processes increases, the laggard might seem relatively slower in comparison, but that isn't deadlock.

As the size of the cluster increases, the chances of a system component failure increase, but that also is a different matter.
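[Editor's note: Eugene's bounded-skew argument is easy to check with a toy model. The following is a pure-Python sketch, not MPI: ranks in a 1-D chain may complete step n only after both neighbours have completed step n-1, and one rank runs at half speed. The half-speed laggard and all names are illustrative assumptions.]

```python
def max_lead(nprocs, ticks, slow_rank=0, slow_factor=2):
    """1-D halo-exchange toy.  Rank i may complete step n only after both
    neighbours have completed step n-1; slow_rank advances only on every
    slow_factor-th tick.  Returns the largest lead any rank ever has over
    slow_rank: bounded by its chain distance, never unbounded.
    """
    step = [0] * nprocs
    lead = 0
    for t in range(ticks):
        new = step[:]
        for i in range(nprocs):
            if i == slow_rank and t % slow_factor:
                continue                      # the laggard skips this tick
            left = step[i - 1] if i > 0 else step[i]
            right = step[i + 1] if i < nprocs - 1 else step[i]
            if left >= step[i] and right >= step[i]:
                new[i] = step[i] + 1          # neighbour data has arrived
        step = new
        lead = max(lead, max(step) - step[slow_rank])
    return lead
```

However long the run, the lead stays small and bounded by the chain distance; with no laggard the ranks stay in perfect lock-step. This is the "loose lock-step" Eugene describes, in contrast with the unbounded lag a one-way producer/consumer pattern allows.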
Re: [OMPI users] MPI_Reduce performance
Isn't it evident from the theory of random processes and probability theory that, in the limit of an infinitely large cluster and parallel process, the probability of deadlock with the current implementation is unfortunately quite a finite quantity and in the limit approaches unity, regardless of any particular details of the program?

Just my two cents.
Alex Granovsky

----- Original Message -----
From: Richard Treumann
To: Open MPI Users
Sent: Thursday, September 09, 2010 10:10 PM
Subject: Re: [OMPI users] MPI_Reduce performance

> I was pointing out that most programs have some degree of elastic synchronization built in. Tasks (or groups or components in a coupled model) seldom only produce data; they also consume what other tasks produce, and that limits the potential skew.
>
> If step n for a task (or group or coupled component) depends on data produced by step n-1 in another task (or group or coupled component), then no task can be farther ahead of the task it depends on than one step. If there are 2 tasks that each need the other's step n-1 result to compute step n, then they can never get farther than one step out of synch.
>
> If there were a rank-ordered loop of 8 tasks, so each one needs the output of the prior step on task ((me-1) mod tasks) to compute, then you can get more skew, because if task 5 gets stalled in step 3:
>
> task 6 will finish step 3 and send results to 7 but stall on recv for step 4 (lacking the end-of-step-3 send by task 5)
> task 7 will finish step 4 and send results to 0 but stall on recv for step 5
> task 0 will finish step 5 and send results to 1 but stall on recv for step 6
> etc.
>
> In a 2D or 3D grid, the dependency is tighter, so the possible skew is less, but it is still significant on a huge grid. In a program with frequent calls to MPI_Allreduce on COMM_WORLD, the skew is very limited. The available skew gets harder to predict as the interdependencies grow more complex.
>
> I call this "elasticity" because the amount of stretch varies but, like a bungee cord or a waistband, only goes so far. Every parallel program has some degree of elasticity built into the way its parts interact.
>
> I assume a coupler has some elasticity too. That is, ocean and atmosphere each model Monday and report in to the coupler, but neither can model Tuesday until they get some of the Monday results generated by the other. (I am pretending granularity is day by day.) Wouldn't the right level of synchronization among components result automatically from the data dependencies among them?
>
> Dick Treumann - MPI Team
> IBM Systems & Technology Group
> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846 Fax (845) 433-8363
>
> Eugene Loh <eugene@oracle.com> wrote on 09/09/2010 12:40 PM:
>
> > Gus Correa wrote:
> > > More often than not some components lag behind (regardless of how much you tune the number of processors assigned to each component), slowing down the whole scheme. The coupler must sit and wait for that late component, the other components must sit and wait for the coupler, and the (vicious) "positive feedback" cycle that Ashley mentioned goes on and on.
> >
> > I think "sit and wait" is the "typical" scenario that Dick mentions. Someone lags, so someone else has to wait. In contrast, the "feedback" cycle Ashley mentions is where someone lags and someone else keeps racing ahead, pumping even more data at the laggard, forcing the laggard ever further behind.
Re: [OMPI users] MPI_Reduce performance
I was pointing out that most programs have some degree of elastic synchronization built in. Tasks (or groups or components in a coupled model) seldom only produce data; they also consume what other tasks produce, and that limits the potential skew.

If step n for a task (or group or coupled component) depends on data produced by step n-1 in another task (or group or coupled component), then no task can be farther ahead of the task it depends on than one step. If there are 2 tasks that each need the other's step n-1 result to compute step n, then they can never get farther than one step out of synch.

If there were a rank-ordered loop of 8 tasks, so each one needs the output of the prior step on task ((me-1) mod tasks) to compute, then you can get more skew, because if task 5 gets stalled in step 3:

task 6 will finish step 3 and send results to 7 but stall on recv for step 4 (lacking the end-of-step-3 send by task 5)
task 7 will finish step 4 and send results to 0 but stall on recv for step 5
task 0 will finish step 5 and send results to 1 but stall on recv for step 6
etc.

In a 2D or 3D grid, the dependency is tighter, so the possible skew is less, but it is still significant on a huge grid. In a program with frequent calls to MPI_Allreduce on COMM_WORLD, the skew is very limited. The available skew gets harder to predict as the interdependencies grow more complex.

I call this "elasticity" because the amount of stretch varies but, like a bungee cord or a waistband, only goes so far. Every parallel program has some degree of elasticity built into the way its parts interact.

I assume a coupler has some elasticity too. That is, ocean and atmosphere each model Monday and report in to the coupler, but neither can model Tuesday until they get some of the Monday results generated by the other. (I am pretending granularity is day by day.) Wouldn't the right level of synchronization among components result automatically from the data dependencies among them?

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363

Eugene Loh <eugene@oracle.com> wrote on 09/09/2010 12:40 PM:

> Gus Correa wrote:
> > More often than not some components lag behind (regardless of how much you tune the number of processors assigned to each component), slowing down the whole scheme. The coupler must sit and wait for that late component, the other components must sit and wait for the coupler, and the (vicious) "positive feedback" cycle that Ashley mentioned goes on and on.
>
> I think "sit and wait" is the "typical" scenario that Dick mentions. Someone lags, so someone else has to wait. In contrast, the "feedback" cycle Ashley mentions is where someone lags and someone else keeps racing ahead, pumping even more data at the laggard, forcing the laggard ever further behind.
Re: [OMPI users] MPI_Reduce performance
Gus Correa wrote:
> More often than not some components lag behind (regardless of how much you tune the number of processors assigned to each component), slowing down the whole scheme. The coupler must sit and wait for that late component, the other components must sit and wait for the coupler, and the (vicious) "positive feedback" cycle that Ashley mentioned goes on and on.

I think "sit and wait" is the "typical" scenario that Dick mentions. Someone lags, so someone else has to wait. In contrast, the "feedback" cycle Ashley mentions is where someone lags and someone else keeps racing ahead, pumping even more data at the laggard, forcing the laggard ever further behind.
Re: [OMPI users] MPI_Reduce performance
Hello All

Gabrielle's question, Ashley's recipe, and Dick Treumann's cautionary words may be part of a larger context of load balance, or not?

Would Ashley's recipe of sporadic barriers be a silver bullet to improve load imbalance problems, regardless of which collectives or even point-to-point calls are in use?

I have in mind, for instance, our big climate models. Some of them work in MPMD mode, where several executables representing atmosphere, ocean, etc., have their own communicators, but interact with each other indirectly, coordinated by a flux coupler (within yet another communicator). The coupler receives, merges, and sends data across the other (physical) components. The components don't talk to each other: the coupler is the broker, the master. This structure may be used in other fields and areas of application, I'd guess.

More often than not some components lag behind (regardless of how much you tune the number of processors assigned to each component), slowing down the whole scheme. The coupler must sit and wait for that late component, the other components must sit and wait for the coupler, and the (vicious) "positive feedback" cycle that Ashley mentioned goes on and on.

Would sporadic barriers in the flux coupler "shake up" these delays?

Ashley: How did you get to the magic number of 25 iterations for the sporadic barriers? Would it be application/communicator pattern dependent?

Many thanks,
Gus Correa

Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA

Richard Treumann wrote:
> Ashley's observation may apply to an application that iterates on many-to-one communication patterns. If the only collective used is MPI_Reduce, some non-root tasks can get ahead and keep pushing iteration results at tasks that are nearer the root. This could overload them and cause some extra slowdown.
>
> In most parallel applications, there is some web of interdependency across tasks between iterations that keeps them roughly in step. I find it hard to believe that there are many programs that need semantically redundant MPI_Barriers.
>
> For example:
>
> In a program that does neighbor communication, no task can get very far ahead of its neighbors. It is possible for a task at one corner to be a few steps ahead of one at the opposite corner, but only a few steps. In this case, though, the distant neighbor is not being affected by the task that is out ahead anyway. It is only affected by its immediate neighbors.
>
> In a program that does an MPI_Bcast from root and an MPI_Reduce to root in each iteration, no task gets far ahead, because the task that finished the Bcast early just waits longer at the Reduce.
>
> A program that makes a call to a non-rooted collective every iteration will stay in pretty tight synch.
>
> Think carefully before tossing in either MPI_Barrier or some non-blocking barrier. Unless MPI_Bcast or MPI_Reduce is the only collective you call, your problem is likely not progress skew.
>
> Dick Treumann - MPI Team
> IBM Systems & Technology Group
> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846 Fax (845) 433-8363
>
> Ashley Pittman <ash...@pittman.co.uk> wrote on 09/09/2010 03:53 AM:
>
> > On 9 Sep 2010, at 08:31, Terry Frankcombe wrote:
> > > On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote:
> > > > As people have said, these time values are to be expected. All they reflect is the time difference spent in reduce waiting for the slowest process to catch up to everyone else. The barrier removes that factor by forcing all processes to start from the same place.
> > > >
> > > > No mystery here - just a reflection of the fact that your processes arrive at the MPI_Reduce calls at different times.
> > >
> > > Yes, however, it seems Gabriele is saying the total execution time *drops* by ~500 s when the barrier is put *in*. (Is that the right way around, Gabriele?)
> > >
> > > That's harder to explain as a sync issue.
> >
> > Not really, you need some way of keeping processes in sync or else the slow ones get slower and the fast ones stay fast. If you have an unbalanced algorithm then you can end up swamping certain ranks, and when they get behind they get even slower and performance goes off a cliff edge. Adding sporadic barriers keeps everything in sync and running nicely; if things are performing well then the barrier only slows things down, but if there is a problem it'll bring all processes back together and destroy the positive feedback cycle.
Re: [OMPI users] MPI_Reduce performance
Ashley's observation may apply to an application that iterates on many-to-one communication patterns. If the only collective used is MPI_Reduce, some non-root tasks can get ahead and keep pushing iteration results at tasks that are nearer the root. This could overload them and cause some extra slowdown. In most parallel applications, there is some web of interdependency across tasks between iterations that keeps them roughly in step. I find it hard to believe that there are many programs that need semantically redundant MPI_Barriers. For example: In a program that does neighbor communication, no task can get very far ahead of its neighbors. It is possible for a task at one corner to be a few steps ahead of one at the opposite corner, but only a few steps. In this case, though, the distant neighbor is not being affected by that task that is out ahead anyway. It is only affected by its immediate neighbors. In a program that does an MPI_Bcast from root and an MPI_Reduce to root in each iteration, no task gets far ahead because the tasks that finish the Bcast early just wait longer at the Reduce. A program that makes a call to a non-rooted collective every iteration will stay in pretty tight sync. Think carefully before tossing in either MPI_Barrier or some non-blocking barrier. Unless MPI_Bcast or MPI_Reduce is the only collective you call, your problem is likely not progress skew. Dick Treumann - MPI Team IBM Systems & Technology Group Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 Tele (845) 433-7846 Fax (845) 433-8363 From: Ashley Pittman <ash...@pittman.co.uk> To: Open MPI Users <us...@open-mpi.org> Date: 09/09/2010 03:53 AM Subject: Re: [OMPI users] MPI_Reduce performance Sent by: users-boun...@open-mpi.org On 9 Sep 2010, at 08:31, Terry Frankcombe wrote: > On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote: >> As people have said, these time values are to be expected. 
All they >> reflect is the time difference spent in reduce waiting for the slowest >> process to catch up to everyone else. The barrier removes that factor >> by forcing all processes to start from the same place. >> >> >> No mystery here - just a reflection of the fact that your processes >> arrive at the MPI_Reduce calls at different times. > > > Yes, however, it seems Gabriele is saying the total execution time > *drops* by ~500 s when the barrier is put *in*. (Is that the right way > around, Gabriele?) > > That's harder to explain as a sync issue. Not really, you need some way of keeping processes in sync or else the slow ones get slower and the fast ones stay fast. If you have an unbalanced algorithm then you can end up swamping certain ranks, and when they get behind they get even slower and performance goes off a cliff edge. Adding sporadic barriers keeps everything in sync and running nicely; if things are performing well then the barrier only slows things down, but if there is a problem it'll bring all processes back together and destroy the positive feedback cycle. This is why you often only need a synchronisation point every so often. I'm also a huge fan of asynchronous barriers, as a full sync is a blunt and slow operation; using asynchronous barriers you can allow small differences in timing but prevent them from getting too large, with very little overhead in the common case where processes are synced already. I'm thinking specifically of starting a sync-barrier on iteration N, waiting for it on N+25 and immediately starting another one, again waiting for it 25 steps later. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
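The swamping/feedback dynamic debated above can be illustrated with a toy model. This is plain Python, not MPI: each rank's iteration time carries some invented rank-dependent jitter, and a periodic "barrier" snaps every rank to the slowest rank's clock. The jitter model and the 25-iteration window are assumptions for illustration only:

```python
import random

def simulate(n_ranks, n_iters, barrier_every=None, seed=0):
    """Toy model: each rank's iteration costs 1.0 plus rank-dependent
    jitter.  A barrier every `barrier_every` iterations makes everyone
    wait for the slowest rank, bounding the spread ("skew")."""
    rng = random.Random(seed)
    clock = [0.0] * n_ranks
    max_skew = 0.0
    for it in range(1, n_iters + 1):
        for r in range(n_ranks):
            clock[r] += 1.0 + rng.uniform(0.0, 0.1) * (r + 1)
        if barrier_every and it % barrier_every == 0:
            clock = [max(clock)] * n_ranks  # everyone waits for the slowest
        max_skew = max(max_skew, max(clock) - min(clock))
    return max_skew

free = simulate(8, 1000)                      # no synchronisation
synced = simulate(8, 1000, barrier_every=25)  # sporadic barriers
print(free > synced)  # True: periodic sync bounds the spread
```

Without any synchronisation the skew grows all run; with a barrier every 25 iterations it stays within roughly one window's worth of drift, which is the point Ashley is making about sporadic barriers.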
Re: [OMPI users] MPI_Reduce performance
On Sep 9, 2010, at 1:46 AM, Ashley Pittman wrote: > > On 9 Sep 2010, at 08:31, Terry Frankcombe wrote: > >> On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote: >>> As people have said, these time values are to be expected. All they >>> reflect is the time difference spent in reduce waiting for the slowest >>> process to catch up to everyone else. The barrier removes that factor >>> by forcing all processes to start from the same place. >>> >>> >>> No mystery here - just a reflection of the fact that your processes >>> arrive at the MPI_Reduce calls at different times. >> >> >> Yes, however, it seems Gabriele is saying the total execution time >> *drops* by ~500 s when the barrier is put *in*. (Is that the right way >> around, Gabriele?) >> >> That's harder to explain as a sync issue. > > Not really, you need some way of keeping processes in sync or else the slow > ones get slower and the fast ones stay fast. If you have an unbalanced > algorithm then you can end up swamping certain ranks, and when they get behind > they get even slower and performance goes off a cliff edge. > > Adding sporadic barriers keeps everything in sync and running nicely; if > things are performing well then the barrier only slows things down, but if > there is a problem it'll bring all processes back together and destroy the > positive feedback cycle. This is why you often only need a synchronisation > point every so often. Precisely. And that is why we added the "sync" collective, which inserts a barrier every so often, since we don't have async barriers at this time. It also helps clean up memory growth due to unanticipated messages arriving during unbalanced algorithms. See ompi_info --param coll all to see how to enable it. 
> I'm also a huge fan of asynchronous barriers as a full sync is a blunt and > slow operation, using asynchronous barriers you can allow small differences in > timing but prevent them from getting too large with very little overhead in > the common case where processes are synced already. I'm thinking > specifically of starting a sync-barrier on iteration N, waiting for it on > N+25 and immediately starting another one, again waiting for it 25 steps > later. > > Ashley. > > -- > > Ashley Pittman, Bath, UK. > > Padb - A parallel job inspection tool for cluster computing > http://padb.pittman.org.uk > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
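For reference, the "sync" collective component Ralph mentions is controlled through MCA parameters. The parameter names below are assumptions based on what the component exposes in Open MPI releases of this era; verify them against the ompi_info output for your own installation:

```shell
# List the sync component's parameters (a subset of "--param coll all")
ompi_info --param coll sync

# Example: raise the component's priority so it is selected, and insert
# a barrier before every 1000th collective (names assumed; check ompi_info)
mpirun -np 128 --mca coll_sync_priority 100 \
       --mca coll_sync_barrier_before 1000 ./my_app
```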
Re: [OMPI users] MPI_Reduce performance
On 9 Sep 2010, at 08:31, Terry Frankcombe wrote: > On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote: >> As people have said, these time values are to be expected. All they >> reflect is the time difference spent in reduce waiting for the slowest >> process to catch up to everyone else. The barrier removes that factor >> by forcing all processes to start from the same place. >> >> >> No mystery here - just a reflection of the fact that your processes >> arrive at the MPI_Reduce calls at different times. > > > Yes, however, it seems Gabriele is saying the total execution time > *drops* by ~500 s when the barrier is put *in*. (Is that the right way > around, Gabriele?) > > That's harder to explain as a sync issue. Not really, you need some way of keeping processes in sync or else the slow ones get slower and the fast ones stay fast. If you have an unbalanced algorithm then you can end up swamping certain ranks, and when they get behind they get even slower and performance goes off a cliff edge. Adding sporadic barriers keeps everything in sync and running nicely; if things are performing well then the barrier only slows things down, but if there is a problem it'll bring all processes back together and destroy the positive feedback cycle. This is why you often only need a synchronisation point every so often. I'm also a huge fan of asynchronous barriers, as a full sync is a blunt and slow operation; using asynchronous barriers you can allow small differences in timing but prevent them from getting too large, with very little overhead in the common case where processes are synced already. I'm thinking specifically of starting a sync-barrier on iteration N, waiting for it on N+25 and immediately starting another one, again waiting for it 25 steps later. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
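The rolling pattern Ashley describes (start a barrier at iteration N, wait for it at N+25, immediately start the next) can be sketched as follows. This is MPI-3-style pseudocode: MPI_Ibarrier was still only a proposal when this thread was written, and the window of 25 iterations is illustrative, not prescriptive:

```c
MPI_Request req;
int iter;

MPI_Ibarrier(comm, &req);                    /* start the first barrier */
for (iter = 1; iter <= niters; iter++) {
    do_iteration(iter);                      /* application work */
    if (iter % 25 == 0) {
        /* Completing here guarantees every rank has passed iteration
           iter - 25, so no rank can be more than ~25 iterations behind. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Ibarrier(comm, &req);            /* start the next window */
    }
}
MPI_Wait(&req, MPI_STATUS_IGNORE);           /* drain the last barrier */
```

When ranks are already in step, the wait completes immediately, which is why the overhead is near zero in the common case.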
Re: [OMPI users] MPI_Reduce performance
Yes Terry, that's right. 2010/9/9 Terry Frankcombe > On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote: > > As people have said, these time values are to be expected. All they > > reflect is the time difference spent in reduce waiting for the slowest > > process to catch up to everyone else. The barrier removes that factor > > by forcing all processes to start from the same place. > > > > > > No mystery here - just a reflection of the fact that your processes > > arrive at the MPI_Reduce calls at different times. > > > Yes, however, it seems Gabriele is saying the total execution time > *drops* by ~500 s when the barrier is put *in*. (Is that the right way > around, Gabriele?) > > That's harder to explain as a sync issue. > > > > > On Sep 9, 2010, at 1:14 AM, Gabriele Fatigati wrote: > > > > > More in depth, > > > > > > > > > total execution time without Barrier is about 1 sec. > > > > > > > > > Total execution time with Barrier+Reduce is 9453, with 128 procs. > > > > > > 2010/9/9 Terry Frankcombe > > > Gabriele, > > > > > > Can you clarify... those timings are what is reported for > > > the reduction > > > call specifically, not the total execution time? > > > > > > If so, then the difference is, to a first approximation, the > > > time you > > > spend sitting idly by doing absolutely nothing waiting at > > > the barrier. > > > > > > Ciao > > > Terry > > > > > > > > > -- > > > Dr. Terry Frankcombe > > > Research School of Chemistry, Australian National University > > > Ph: (+61) 0417 163 509 Skype: terry.frankcombe > > > > > > > > > ___ > > > users mailing list > > > us...@open-mpi.org > > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > > > > > > > > > > > > > -- > > > Ing. 
Gabriele Fatigati > > > > > > Parallel programmer > > > > > > CINECA Systems & Tecnologies Department > > > > > > Supercomputing Group > > > > > > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy > > > > > > www.cineca.itTel: +39 051 6171722 > > > > > > g.fatigati [AT] cineca.it > > > > > > ___ > > > users mailing list > > > us...@open-mpi.org > > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > -- Ing. Gabriele Fatigati Parallel programmer CINECA Systems & Tecnologies Department Supercomputing Group Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy www.cineca.itTel: +39 051 6171722 g.fatigati [AT] cineca.it
Re: [OMPI users] MPI_Reduce performance
Mm, I don't understand. The experiments on my application show that intensive use of Barrier+Reduce is faster than a single Reduce. 2010/9/9 Ralph Castain > As people have said, these time values are to be expected. All they reflect > is the time difference spent in reduce waiting for the slowest process to > catch up to everyone else. The barrier removes that factor by forcing all > processes to start from the same place. > > No mystery here - just a reflection of the fact that your processes arrive > at the MPI_Reduce calls at different times. > > > On Sep 9, 2010, at 1:14 AM, Gabriele Fatigati wrote: > > More in depth, > > total execution time without Barrier is about 1 sec. > > Total execution time with Barrier+Reduce is 9453, with 128 procs. > > 2010/9/9 Terry Frankcombe > >> Gabriele, >> >> Can you clarify... those timings are what is reported for the reduction >> call specifically, not the total execution time? >> >> If so, then the difference is, to a first approximation, the time you >> spend sitting idly by doing absolutely nothing waiting at the barrier. >> >> Ciao >> Terry >> >> >> -- >> Dr. Terry Frankcombe >> Research School of Chemistry, Australian National University >> Ph: (+61) 0417 163 509 Skype: terry.frankcombe >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users >> >> > > > -- > Ing. Gabriele Fatigati > > Parallel programmer > > CINECA Systems & Technologies Department > > Supercomputing Group > > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy > > www.cineca.it Tel: +39 051 6171722 > > g.fatigati [AT] cineca.it > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- Ing. 
Gabriele Fatigati Parallel programmer CINECA Systems & Technologies Department Supercomputing Group Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy www.cineca.it Tel: +39 051 6171722 g.fatigati [AT] cineca.it
Re: [OMPI users] MPI_Reduce performance
On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote: > As people have said, these time values are to be expected. All they > reflect is the time difference spent in reduce waiting for the slowest > process to catch up to everyone else. The barrier removes that factor > by forcing all processes to start from the same place. > > > No mystery here - just a reflection of the fact that your processes > arrive at the MPI_Reduce calls at different times. Yes, however, it seems Gabriele is saying the total execution time *drops* by ~500 s when the barrier is put *in*. (Is that the right way around, Gabriele?) That's harder to explain as a sync issue. > On Sep 9, 2010, at 1:14 AM, Gabriele Fatigati wrote: > > > More in depth, > > > > > > total execution time without Barrier is about 1 sec. > > > > > > Total execution time with Barrier+Reduce is 9453, with 128 procs. > > > > 2010/9/9 Terry Frankcombe> > Gabriele, > > > > Can you clarify... those timings are what is reported for > > the reduction > > call specifically, not the total execution time? > > > > If so, then the difference is, to a first approximation, the > > time you > > spend sitting idly by doing absolutely nothing waiting at > > the barrier. > > > > Ciao > > Terry > > > > > > -- > > Dr. Terry Frankcombe > > Research School of Chemistry, Australian National University > > Ph: (+61) 0417 163 509Skype: terry.frankcombe > > > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > > > > > > > -- > > Ing. 
Gabriele Fatigati > > > > Parallel programmer > > > > CINECA Systems & Tecnologies Department > > > > Supercomputing Group > > > > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy > > > > www.cineca.itTel: +39 051 6171722 > > > > g.fatigati [AT] cineca.it > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] MPI_Reduce performance
As people have said, these time values are to be expected. All they reflect is the time difference spent in reduce waiting for the slowest process to catch up to everyone else. The barrier removes that factor by forcing all processes to start from the same place. No mystery here - just a reflection of the fact that your processes arrive at the MPI_Reduce calls at different times. On Sep 9, 2010, at 1:14 AM, Gabriele Fatigati wrote: > More in depth, > > total execution time without Barrier is about 1 sec. > > Total execution time with Barrier+Reduce is 9453, with 128 procs. > > 2010/9/9 Terry Frankcombe> Gabriele, > > Can you clarify... those timings are what is reported for the reduction > call specifically, not the total execution time? > > If so, then the difference is, to a first approximation, the time you > spend sitting idly by doing absolutely nothing waiting at the barrier. > > Ciao > Terry > > > -- > Dr. Terry Frankcombe > Research School of Chemistry, Australian National University > Ph: (+61) 0417 163 509Skype: terry.frankcombe > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > -- > Ing. Gabriele Fatigati > > Parallel programmer > > CINECA Systems & Tecnologies Department > > Supercomputing Group > > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy > > www.cineca.itTel: +39 051 6171722 > > g.fatigati [AT] cineca.it > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] MPI_Reduce performance
In more depth: total execution time without Barrier is about 1 sec. Total execution time with Barrier+Reduce is 9453, with 128 procs. 2010/9/9 Terry Frankcombe > Gabriele, > > Can you clarify... those timings are what is reported for the reduction > call specifically, not the total execution time? > > If so, then the difference is, to a first approximation, the time you > spend sitting idly by doing absolutely nothing waiting at the barrier. > > Ciao > Terry > > > -- > Dr. Terry Frankcombe > Research School of Chemistry, Australian National University > Ph: (+61) 0417 163 509 Skype: terry.frankcombe > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > -- Ing. Gabriele Fatigati Parallel programmer CINECA Systems & Technologies Department Supercomputing Group Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy www.cineca.it Tel: +39 051 6171722 g.fatigati [AT] cineca.it
Re: [OMPI users] MPI_Reduce performance
Hi Terry, this time is spent in MPI_Reduce; it isn't total execution time. 2010/9/9 Terry Frankcombe > Gabriele, > > Can you clarify... those timings are what is reported for the reduction > call specifically, not the total execution time? > > If so, then the difference is, to a first approximation, the time you > spend sitting idly by doing absolutely nothing waiting at the barrier. > > Ciao > Terry > > > -- > Dr. Terry Frankcombe > Research School of Chemistry, Australian National University > Ph: (+61) 0417 163 509 Skype: terry.frankcombe > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > -- Ing. Gabriele Fatigati Parallel programmer CINECA Systems & Technologies Department Supercomputing Group Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy www.cineca.it Tel: +39 051 6171722 g.fatigati [AT] cineca.it
Re: [OMPI users] MPI_Reduce performance
Gabriele, Can you clarify... those timings are what is reported for the reduction call specifically, not the total execution time? If so, then the difference is, to a first approximation, the time you spend sitting idly by doing absolutely nothing waiting at the barrier. Ciao Terry -- Dr. Terry Frankcombe Research School of Chemistry, Australian National University Ph: (+61) 0417 163 509 Skype: terry.frankcombe
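Terry's point, that the reported time is attributed to the reduction call itself, can be made concrete with a toy accounting model. This is plain Python, and the arrival times and the 1-second reduction cost are invented for illustration:

```python
# Hypothetical times at which four ranks arrive at MPI_Reduce.
arrivals = [0.0, 2.0, 5.0, 9.0]
c = 1.0  # actual cost of the reduction once everyone has arrived

# Without a barrier: each rank's "time in MPI_Reduce" includes waiting
# for the last arrival (the reduce completes at max(arrivals) + c).
reduce_only = [max(arrivals) + c - t for t in arrivals]

# With a barrier first: the waiting is charged to MPI_Barrier instead,
# and MPI_Reduce itself appears cheap.
barrier_time = [max(arrivals) - t for t in arrivals]
with_barrier_reduce = [c] * len(arrivals)

print(sum(reduce_only))          # 24.0 seconds charged to MPI_Reduce
print(sum(with_barrier_reduce))  # 4.0  -- same job, the wait just moved
```

Either way every rank finishes at the same wall-clock time; the profiler simply attributes the idle time to a different call, which is why the huge "MPI_Reduce time" without a barrier is not a mystery.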
Re: [OMPI users] MPI_Reduce performance
On 8 Sep 2010, at 10:21, Gabriele Fatigati wrote: > So, in my opinion, it is better to put MPI_Barrier before any MPI_Reduce to > mitigate the "asynchronous" behaviour of MPI_Reduce in OpenMPI. I suspect the > same for other collective communications. Can someone explain why > MPI_Reduce has this strange behaviour? There are many cases where adding an explicit barrier before a call to reduce would be superfluous, so the standard rightly says that it isn't needed and need not be performed. As you've seen, though, there are also cases where it can help. I'd be interested to know the effect if you only added a barrier before MPI_Reduce occasionally, perhaps every one or two hundred iterations; this can also have a beneficial effect, as a barrier every iteration adds significant overhead. This is a textbook example of where the new asynchronous barrier could help: in theory it should have the effect of being able to keep processes in sync without any additional overhead in the case that they are already well synchronised. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
Re: [OMPI users] MPI_Reduce performance
Doing Reduce without Barrier first allows one process to call Reduce and exit immediately, without waiting for other processes to call Reduce. Therefore, this allows one process to advance faster than other processes. I suspect the 2671-second result is the difference between the fastest and slowest process. Having Barrier reduces the time difference because it forces the faster processes to go slower. On Wed, Sep 8, 2010 at 3:21 AM, Gabriele Fatigati wrote: > Dear OpenMPI users, > > i'm using OpenMPI 1.3.3 on an Infiniband 4x interconnection network. My > parallel application uses intensive MPI_Reduce communication over a > communicator created with MPI_Comm_split. > > I've noted strange behaviour during execution. My code is instrumented with > Scalasca 1.3 to report subroutine execution time. First execution shows > elapsed time with 128 processors ( job_communicator is created with > MPI_Comm_split). In both cases it is composed of the same ranks as > MPI_COMM_WORLD: > > MPI_Reduce(.,job_communicator) > > The elapsed time is 2671 sec. > > Second run uses MPI_BARRIER before MPI_Reduce: > > MPI_Barrier(job_communicator..) > MPI_Reduce(.,job_communicator) > > The elapsed time of Barrier+Reduce is 2167 sec (about 8 minutes less). > > So, in my opinion, it is better to put MPI_Barrier before any MPI_Reduce to > mitigate the "asynchronous" behaviour of MPI_Reduce in OpenMPI. I suspect the > same for other collective communications. Can someone explain why > MPI_Reduce has this strange behaviour? > > Thanks in advance. > > > > > -- > Ing. Gabriele Fatigati > > Parallel programmer > > CINECA Systems & Technologies Department > > Supercomputing Group > > Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy > > www.cineca.it Tel: +39 051 6171722 > > g.fatigati [AT] cineca.it > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > -- David Zhang University of California, San Diego