Thanks, Tom.

Jonathan's suggestion seems almost perfect; I see it the same way.

On Fri, Nov 6, 2009 at 6:17 PM, Tom Rosmond <rosm...@reachone.com> wrote:

> AMJAD
>
> On your first question, the answer is probably yes, if everything else is
> done correctly.  The first test is not to try to overlap communication
> and computation, but to do them sequentially and make sure the answers
> are correct. Have you done this test?  Debugging your original approach
> will be challenging, and having a control solution will be a big help.
>

I followed the path of sequential, then parallel blocking, and then
parallel non-blocking.
My serial solution is the control solution.
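
For reference, the non-blocking stage follows the pattern I described in my
original post. Here is a minimal sketch only, with placeholder names
(owner_rank, face_tag, sendval, recvval), not my actual code; it posts one
send/recv per shared face, as I currently do:

  ! Sketch of the overlap pattern (placeholder names): post one
  ! non-blocking send/recv per partition-boundary face, compute the
  ! interior faces while messages are in flight, then wait and finish
  ! the boundary faces.
  subroutine exchange_overlap(nshared, owner_rank, face_tag, sendval, recvval)
    use mpi
    implicit none
    integer, intent(in)  :: nshared               ! shared (boundary) faces
    integer, intent(in)  :: owner_rank(nshared)   ! neighbour rank per face
    integer, intent(in)  :: face_tag(nshared)     ! unique tag per shared face
    real(8), intent(in)  :: sendval(nshared)      ! my value on each shared face
    real(8), intent(out) :: recvval(nshared)      ! neighbour's value, received
    integer :: req(2*nshared), ierr, i

    do i = 1, nshared                             ! one message per face
       call MPI_Irecv(recvval(i), 1, MPI_DOUBLE_PRECISION, owner_rank(i), &
                      face_tag(i), MPI_COMM_WORLD, req(i), ierr)
       call MPI_Isend(sendval(i), 1, MPI_DOUBLE_PRECISION, owner_rank(i), &
                      face_tag(i), MPI_COMM_WORLD, req(nshared + i), ierr)
    end do

    ! compute internal/non-shared faces here: needs no remote data

    call MPI_Waitall(2*nshared, req, MPI_STATUSES_IGNORE, ierr)

    ! compute partition-boundary faces here: halo data is now complete
  end subroutine exchange_overlap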


>
> On your second question, if I understand it correctly: it is always
> better to minimize the number of messages.  In problems like this,
> communication costs are dominated by latency, so bundling the data into
> the fewest possible messages will ALWAYS be better.
>

That's good.
But what Jonathan pointed out:

If you really do hide most of the communications cost with your non-blocking
communications, then it may not matter too much.

is the point I want to be sure about.
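
Just to make sure I understand the bundled alternative correctly, it would
look roughly like the sketch below (all names hypothetical; one gather loop
and one MPI_Isend per neighbouring rank instead of one per face, so the
latency cost is paid once per neighbour):

  ! Sketch of the bundled alternative (hypothetical names): gather the
  ! data of every face shared with one neighbour into a single buffer
  ! and send one message per neighbour.  The buffer is SAVEd so it
  ! stays valid until the matching MPI_Waitall completes.
  subroutine pack_and_isend(nnbr, nbr, nface, face_id, face_value, req)
    use mpi
    implicit none
    integer, intent(in)  :: nnbr               ! number of neighbouring ranks
    integer, intent(in)  :: nbr(nnbr)          ! their ranks
    integer, intent(in)  :: nface(nnbr)        ! faces shared with each neighbour
    integer, intent(in)  :: face_id(:, :)      ! local face ids, per neighbour
    real(8), intent(in)  :: face_value(:)      ! face data to exchange
    integer, intent(out) :: req(nnbr)          ! send requests, to be waited on
    real(8), allocatable, save :: sendbuf(:, :)
    integer :: n, k, ierr

    if (.not. allocated(sendbuf)) allocate(sendbuf(maxval(nface), nnbr))

    do n = 1, nnbr
       do k = 1, nface(n)
          sendbuf(k, n) = face_value(face_id(k, n))   ! gather for neighbour n
       end do
       call MPI_Isend(sendbuf(:, n), nface(n), MPI_DOUBLE_PRECISION, &
                      nbr(n), 0, MPI_COMM_WORLD, req(n), ierr)
    end do
  end subroutine pack_and_isend

The matching receives would be posted the same way, one MPI_Irecv per
neighbour into a per-neighbour receive buffer, followed by one MPI_Waitall.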


> T. Rosmond
>
>
>
> On Fri, 2009-11-06 at 17:44 -0500, amjad ali wrote:
> > Hi all,
> >
> > I need/request some help from those who have some experience in
> > debugging/profiling/tuning parallel scientific codes, especially for
> > PDEs/CFD.
> >
> > I have parallelized a Fortran CFD code to run on an
> > Ethernet-based Linux cluster. Regarding MPI communication, what I do is
> > the following:
> >
> > Suppose that the grid/mesh is decomposed over n processors,
> > such that each processor has a number of elements that share a
> > side/face with elements on other processors. I start non-blocking
> > MPI communication at the partition boundary faces (faces shared
> > between any two processors), and then start computing values
> > on the internal/non-shared faces. When I complete this computation, I
> > call WAITALL to ensure MPI communication completion. Then I do the
> > computation on the partition boundary faces (the shared ones). This way I
> > try to hide the communication behind computation. Is it correct?
> >
> > IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
> > elements) with another processor B, then it sends/recvs 50 different
> > messages. So in general, if a processor has X faces shared with any
> > number of other processors, it sends/recvs that many messages. Does
> > this approach have "very much reduced" performance in comparison to the
> > alternative in which processor A sends/recvs a single bundled message
> > (containing all 50 faces' data) to/from processor B? That would mean
> > that, in general, a processor sends/recvs only as many messages as it
> > has neighbouring processors, with a single bundle/pack of data going to
> > each neighbouring processor.
> > Is there much of a difference between these two approaches?
> >
> > THANK YOU VERY MUCH.
> > AMJAD.
> >
> >
