Thanks, dear. Jonathan's point seems almost perfect; I perceive the same.
On Fri, Nov 6, 2009 at 6:17 PM, Tom Rosmond <rosm...@reachone.com> wrote:

> AMJAD
>
> On your first question, the answer is probably, if everything else is
> done correctly. The first test is to not try to do the overlapping
> communication and computation, but do them sequentially and make sure
> the answers are correct. Have you done this test? Debugging your
> original approach will be challenging, and having a control solution
> will be a big help.

I followed the path of sequential, then parallel blocking, and then
parallel non-blocking. My serial solution is the control solution.

> On your second question, if I understand it correctly, is that it is
> always better to minimize the number of messages. In problems like this
> communication costs are dominated by latency, so bundling the data into
> the fewest possible messages will ALWAYS be better.

That's good. But the point Jonathan made, "If you really do hide most of
the communication cost with your non-blocking communications, then it may
not matter too much," is the one I want to be sure about.

> T. Rosmond
>
> On Fri, 2009-11-06 at 17:44 -0500, amjad ali wrote:
> > Hi all,
> >
> > I need some help from those who have experience in
> > debugging/profiling/tuning parallel scientific codes, especially for
> > PDEs/CFD.
> >
> > I have parallelized a Fortran CFD code to run on an Ethernet-based
> > Linux cluster. Regarding MPI communication, what I do is this:
> >
> > Suppose the grid/mesh is decomposed over n processors, such that each
> > processor has a number of elements that share a side/face with elements
> > on other processors. I start non-blocking MPI communication for the
> > partition-boundary faces (faces shared between any two processors), and
> > then compute values on the internal/non-shared faces. When this
> > computation is complete, I call WAITALL to ensure the MPI communication
> > has finished. Then I do the computation on the partition-boundary
> > (shared) faces. In this way I try to hide the communication behind
> > computation. Is this correct?
> >
> > IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
> > elements) with another processor B, then it sends/recvs 50 different
> > messages. So in general, if a processor shares X faces with any number
> > of other processors, it sends/recvs X messages. Does this approach have
> > "very much reduced" performance compared with the alternative, in which
> > processor A sends/recvs a single bundled message (containing all the
> > 50-faces data) to/from processor B? That is, in general a processor
> > would send/recv only as many messages as it has neighbouring
> > processors, sending a single bundle/pack to each neighbour.
> > Is there "quite a large difference" between these two approaches?
> >
> > THANK YOU VERY MUCH.
> > AMJAD.
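
For concreteness, the overlap pattern I described in my original mail is
roughly the sketch below (Fortran + MPI). The routine and array names
(nbrs, sendbuf, recvbuf, compute_interior_faces, compute_boundary_faces)
are just placeholders for illustration, not the actual names in my code;
here sendbuf is assumed to be already filled with the boundary-face data.

    ! Rough sketch of the overlap pattern: post non-blocking exchanges for
    ! the partition-boundary faces, compute the interior faces while the
    ! messages are in flight, then WAITALL and compute the shared faces.
    subroutine exchange_and_compute(n_nbrs, nbrs, nface, sendbuf, recvbuf)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: n_nbrs            ! number of neighbouring processors
      integer, intent(in) :: nbrs(n_nbrs)      ! their ranks
      integer, intent(in) :: nface             ! max faces shared with one neighbour
      double precision, intent(inout) :: sendbuf(nface, n_nbrs)
      double precision, intent(inout) :: recvbuf(nface, n_nbrs)
      integer :: reqs(2*n_nbrs), stats(MPI_STATUS_SIZE, 2*n_nbrs)
      integer :: i, ierr

      ! 1. Post all receives and sends for the partition-boundary faces.
      do i = 1, n_nbrs
         call MPI_Irecv(recvbuf(1,i), nface, MPI_DOUBLE_PRECISION, nbrs(i), 0, &
                        MPI_COMM_WORLD, reqs(i), ierr)
      end do
      do i = 1, n_nbrs
         call MPI_Isend(sendbuf(1,i), nface, MPI_DOUBLE_PRECISION, nbrs(i), 0, &
                        MPI_COMM_WORLD, reqs(n_nbrs+i), ierr)
      end do

      ! 2. Compute on the internal (non-shared) faces while messages are in flight.
      call compute_interior_faces()                        ! placeholder

      ! 3. Wait for all exchanges, then compute on the shared faces.
      call MPI_Waitall(2*n_nbrs, reqs, stats, ierr)
      call compute_boundary_faces(recvbuf)                 ! placeholder
    end subroutine exchange_and_compute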
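
And the bundling Tom recommends would replace my current one-message-per-face
sends with something like the sketch below: gather the values of all faces
shared with one neighbour into a single buffer and send it as one message, so
the latency is paid once per neighbour instead of once per face. Again, the
names (face_of_nbr, face_val, buf) are placeholders, not my actual ones.

    ! Rough sketch of bundling: one message per neighbour instead of one per
    ! face. The caller must keep buf intact until the matching WAITALL completes.
    subroutine send_bundled(nbr, n_shared, face_of_nbr, face_val, nface_tot, buf, req)
      implicit none
      include 'mpif.h'
      integer, intent(in)  :: nbr                      ! rank of the neighbouring processor
      integer, intent(in)  :: n_shared                 ! number of faces shared with it
      integer, intent(in)  :: face_of_nbr(n_shared)    ! local indices of those faces
      integer, intent(in)  :: nface_tot                ! total number of local faces
      double precision, intent(in)    :: face_val(nface_tot)
      double precision, intent(inout) :: buf(n_shared) ! persistent send buffer
      integer, intent(out) :: req
      integer :: f, ierr

      ! Instead of n_shared separate MPI_Isend calls (one per face, paying the
      ! network latency n_shared times), gather the shared-face values ...
      do f = 1, n_shared
         buf(f) = face_val(face_of_nbr(f))
      end do
      ! ... and send them in a single bundled message to this neighbour.
      call MPI_Isend(buf, n_shared, MPI_DOUBLE_PRECISION, nbr, 0, &
                     MPI_COMM_WORLD, req, ierr)
    end subroutine send_bundled

On the receive side the neighbour would post a single MPI_Irecv of length
n_shared and unpack the faces after the WAITALL.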