On Mar 26, 2020, at 5:36 AM, Raut, S Biplab <biplab.r...@amd.com> wrote:
> 
> I am doing pairwise send-recv and not all-to-all since not all the data is 
> required by all the ranks.
> And I am doing blocking send and recv calls since there are multiple 
> iterations of such message chunks to be sent with synchronization.
> 
> I understand your recommendation in the mail below; however, I still see a 
> benefit for my application-level algorithm in doing pairwise send-recv chunks 
> where each chunk is within the eager limit.
> Since the input and output buffers are the same within the process, I can 
> avoid certain buffering at each sender rank by doing successive send calls 
> within the eager limit to receiver ranks and then having recv calls.

But if the buffers are small enough to fall within the eager limit, there's 
very little to gain by avoiding an A/B (double) buffering scheme.  Sure, it's 2x the 
memory, but it's 2 times a small number (measured in KB).  Assuming you have GB 
of RAM, it's hard to believe that this would make a meaningful difference.  
Indeed, one way to think of the eager limit is: "it's small enough that the 
cost of a memcpy doesn't matter."
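
For illustration, here's a minimal sketch of what such an A/B scheme could look 
like (the chunk size, tag, and the pack_chunk() helper are hypothetical, not 
from your code; the point is just that the second staging buffer only costs a 
few KB):

    #include <mpi.h>

    #define CHUNK_BYTES 8192                  /* hypothetical chunk size */

    /* Hypothetical helper: packs chunk i out of the shared in/out buffer. */
    void pack_chunk(char *dst, int i);

    /* A/B-buffered sender: two small staging buffers, each under the eager
     * limit, so packing chunk i+1 never overwrites chunk i while it is
     * still in flight. */
    void send_chunks(int peer, int tag, int num_chunks)
    {
        char buf_a[CHUNK_BYTES], buf_b[CHUNK_BYTES];
        char *cur = buf_a, *next = buf_b;

        pack_chunk(cur, 0);
        for (int i = 0; i < num_chunks; ++i) {
            MPI_Request req;
            MPI_Isend(cur, CHUNK_BYTES, MPI_CHAR, peer, tag,
                      MPI_COMM_WORLD, &req);
            if (i + 1 < num_chunks)
                pack_chunk(next, i + 1);   /* overlap packing with the send */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            char *tmp = cur; cur = next; next = tmp;   /* swap A and B */
        }
    }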

I'm not sure I understand your comments about preventing copying.  MPI will 
always do the most efficient thing to send the message, regardless of whether 
it is under the eager limit or not.  I also don't quite grok your comments 
about "application buffering" and message buffering required by the eager 
protocol.

The short version of this is: you shouldn't worry about any of this.  Rely on 
the underlying MPI to do the most efficient thing possible, and you should use 
a communication algorithm that makes sense for your application.  In most 
cases, you'll be good.
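
For example, a plain pairwise exchange can post one full-size MPI_Sendrecv per 
partner and let the library pick eager vs. rendezvous internally.  Here's a 
rough sketch (the ring-offset partner schedule is only illustrative; you would 
restrict it to the partners that actually need data):

    #include <mpi.h>

    /* Each rank exchanges one message per step with a different partner.
     * MPI_Sendrecv avoids the deadlock risk of hand-ordering blocking
     * MPI_Send/MPI_Recv pairs, and the library chooses the protocol. */
    void pairwise_exchange(const double *sendbuf, double *recvbuf,
                           int count, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int step = 1; step < size; ++step) {
            int send_to   = (rank + step) % size;
            int recv_from = (rank - step + size) % size;
            MPI_Sendrecv(&sendbuf[send_to * count], count, MPI_DOUBLE,
                         send_to, 0,
                         &recvbuf[recv_from * count], count, MPI_DOUBLE,
                         recv_from, 0,
                         comm, MPI_STATUS_IGNORE);
        }
    }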

If you start trying to tune for a specific environment, platform, and MPI 
implementation, the number of variables grows exponentially.  And if you change 
any one parameter in the whole setup, your optimizations may get lost.  Also, 
if you add a bunch of infrastructure in your app to try to exactly match your 
environment+platform+implementation (e.g., manual segmenting to fit your 
overall message into the eager limit), you may just be adding additional 
overhead that effectively nullifies any optimization you might get (especially 
if the optimization is very small).  Indeed, the methods used for shared memory 
are similar to, but different from, the methods used for networks.  And there's a 
wide variety of network capabilities; some can be more efficient than others 
(depending on a zillion factors).

If you're using shared memory, ensure that your Linux kernel has good shared 
memory support (e.g., support for CMA), and let MPI optimize the message 
transfers for you.
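
If you want a quick sanity check that CMA is actually there, one option is a 
tiny probe of the underlying syscall (this is just a hand-rolled check, not an 
Open MPI tool, and assumes glibc 2.15 or later for process_vm_readv()):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <errno.h>
    #include <unistd.h>
    #include <sys/uio.h>

    /* CMA == cross-memory attach, i.e. the process_vm_readv()/writev()
     * syscalls.  If the kernel lacks them, the call fails with ENOSYS. */
    int main(void)
    {
        char src[16] = "hello", dst[16];
        struct iovec local  = { .iov_base = dst, .iov_len = sizeof(dst) };
        struct iovec remote = { .iov_base = src, .iov_len = sizeof(src) };

        ssize_t n = process_vm_readv(getpid(), &local, 1, &remote, 1, 0);
        if (n < 0 && errno == ENOSYS)
            printf("CMA not available in this kernel\n");
        else if (n < 0)
            perror("process_vm_readv");
        else
            printf("CMA looks available (%zd bytes copied)\n", n);
        return 0;
    }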

-- 
Jeff Squyres
jsquy...@cisco.com
