Hi, Amjad:
[...]
What I do is start non-blocking MPI communication at the partition
boundary faces (faces shared between any two processors), and then start
computing values on the internal/non-shared faces. When this computation
is complete, I call WAITALL to ensure the MPI communication has finished.
Then I do the computation on the partition boundary faces (the shared
ones). This way I try to hide the communication behind computation. Is
this correct?
As long as your numerical method allows you to do this (that is, you
definitely don't need those boundary values to compute the internal
values), then yes, this approach can hide some of the communication
costs very effectively. The way I'd program this if I were doing it
from scratch would be incremental: first get the usual blocking approach
working (no one computes anything until all the faces are exchanged);
then break the computation step into internal and boundary pieces and
make sure it still works; then change the messaging to
isends/irecvs/waitalls and make sure it still works; and only then
interleave the communication with the computation.
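Something like the sketch below is the shape of that final, interleaved
step. It's only a rough sketch in C: the 1-D ring of ranks is a stand-in
for your real face-sharing graph, and NFACE, NINTERIOR, and the
"interior" and "boundary" updates are placeholders I've made up for the
example.

/* Rough sketch of overlapping the halo exchange with interior work.
 * The 1-D ring of ranks below is a stand-in for the real face-sharing
 * graph; NFACE/NINTERIOR and the updates are made-up placeholders. */
#include <mpi.h>
#include <stdio.h>

#define NFACE     4      /* values shared with each neighbour */
#define NINTERIOR 1000   /* stand-in for the internal faces    */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    double sendbuf[2][NFACE], recvbuf[2][NFACE], interior[NINTERIOR];
    for (int i = 0; i < NFACE; i++)
        sendbuf[0][i] = sendbuf[1][i] = rank;   /* boundary data to ship */

    MPI_Request req[4];

    /* 1. post the non-blocking receives and sends first ... */
    MPI_Irecv(recvbuf[0], NFACE, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recvbuf[1], NFACE, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(sendbuf[0], NFACE, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(sendbuf[1], NFACE, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    /* 2. ... then do the work that needs no remote data ... */
    double isum = 0.0;
    for (int i = 0; i < NINTERIOR; i++) {
        interior[i] = 2.0 * i;                  /* placeholder interior update */
        isum += interior[i];
    }

    /* 3. ... wait for all the messages to complete, ... */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* 4. ... and only now touch the shared boundary faces. */
    double bsum = 0.0;
    for (int i = 0; i < NFACE; i++)
        bsum += recvbuf[0][i] + recvbuf[1][i];  /* placeholder boundary update */

    printf("rank %d: interior %g, boundary %g\n", rank, isum, bsum);
    MPI_Finalize();
    return 0;
}

Posting the receives before the sends gives MPI somewhere to put the
incoming data as soon as it arrives, which helps the overlap.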
IMPORTANT: Secondly, if processor A shares 50 faces (on 50 or fewer
elements) with another processor B, then it sends/recvs 50 different
messages. So in general, if a processor has X faces shared with any
number of other processors, it sends/recvs that many messages. Is this
much worse for performance than the alternative, where processor A
sends/recvs a single bundled message (containing the data for all 50
faces) to processor B? In that case a processor would only send/recv as
many messages as it has neighbouring processors, one bundle/pack per
neighbour.
Is there much of a difference between these two approaches?
The messages for your individual element faces are likely quite small.
It is quite generally the case that bundling many small messages into
larger messages can significantly improve performance, as you avoid
incurring the repeated latency cost of sending many messages. As
always, though, the answer is "it depends", and the only way to know is
to try it both ways. If you really do hide most of the communication
cost with your non-blocking communications, then it may not matter too
much. In addition, if you don't know beforehand how much data you need
to send/receive, then you'll need a handshaking step, which introduces
more synchronization and may actually hurt performance, or you'll have
to use MPI-2 one-sided communications. On the other hand,
if this shared boundary doesn't change through the simulation, you could
just figure out at start-up time how big the messages will be between
neighbours and use that as the basis for the usual two-sided messages.
My experience is that there's an excellent chance you'll improve the
performance by packing the little messages into fewer larger messages.
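For what it's worth, here is a rough sketch of the packed version in C.
The neighbour list and the 50-faces-per-neighbour count are invented for
the example (again a simple ring); in your code they would come from the
mesh partition and stay fixed, so you can work out the counts and size
the buffers once at start-up, and then do one Isend/Irecv pair per
neighbour every time step instead of one per face.

/* Rough sketch of one packed message per neighbouring rank.
 * The neighbour list and face counts are invented for the example;
 * in a real code they come from the (fixed) mesh partition, so the
 * buffer sizes can be computed once at start-up. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NFACE_PER_NBR 50   /* e.g. 50 shared faces with each neighbour */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* start-up: neighbour ranks and per-neighbour message sizes */
    int nnbr = 2;
    int nbr[2]   = { (rank - 1 + size) % size, (rank + 1) % size };
    int count[2] = { NFACE_PER_NBR, NFACE_PER_NBR };

    double *sendbuf[2], *recvbuf[2];
    MPI_Request req[4];
    for (int n = 0; n < nnbr; n++) {
        sendbuf[n] = malloc(count[n] * sizeof(double));
        recvbuf[n] = malloc(count[n] * sizeof(double));
    }

    /* every time step: pack all faces for a neighbour into one buffer,
       then one Isend/Irecv pair per neighbour instead of one per face */
    for (int n = 0; n < nnbr; n++)
        for (int i = 0; i < count[n]; i++)
            sendbuf[n][i] = rank + 0.001 * i;   /* gather face values */

    for (int n = 0; n < nnbr; n++) {
        MPI_Irecv(recvbuf[n], count[n], MPI_DOUBLE, nbr[n], 0,
                  MPI_COMM_WORLD, &req[2*n]);
        MPI_Isend(sendbuf[n], count[n], MPI_DOUBLE, nbr[n], 0,
                  MPI_COMM_WORLD, &req[2*n + 1]);
    }
    MPI_Waitall(2*nnbr, req, MPI_STATUSES_IGNORE);

    /* unpack: scatter recvbuf[n][i] back onto the matching local faces */
    printf("rank %d: first face value from neighbour %d is %g\n",
           rank, nbr[0], recvbuf[0][0]);

    for (int n = 0; n < nnbr; n++) { free(sendbuf[n]); free(recvbuf[n]); }
    MPI_Finalize();
    return 0;
}

The packing and unpacking loops cost a little extra copying, but that is
usually far cheaper than the per-message latency you save.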
Jonathan
--
Jonathan Dursi <ljdu...@scinet.utoronto.ca>