On Mar 26, 2020, at 5:36 AM, Raut, S Biplab <biplab.r...@amd.com> wrote:
>
> I am doing pairwise send-recv and not all-to-all, since not all the data is required by all the ranks. And I am doing blocking send and recv calls, since there are multiple iterations of such message chunks to be sent with synchronization.
>
> I understand your recommendation in the below mail; however, I still see benefit for my application-level algorithm in doing pairwise send-recv chunks where each chunk is within the eager limit. Since the input and output buffer is the same within the process, I can avoid certain buffering at each sender rank by doing successive send calls within the eager limit to the receiver ranks and then have recv calls.
But if the buffers are small enough to fall within the eager limit, there's very little benefit to not having an A/B buffering scheme. Sure, it's 2x the memory, but it's 2 times a small number (measured in KB). Assuming you have GB of RAM, it's hard to believe that this would make a meaningful difference. Indeed, one way to think of the eager limit is: "it's small enough that the cost of a memcpy doesn't matter."

I'm not sure I understand your comments about preventing copying. MPI will always do the most efficient thing to send the message, regardless of whether it is under the eager limit or not. I also don't quite grok your comments about "application buffering" and the message buffering required by the eager protocol.

The short version of this is: you shouldn't worry about any of this. Rely on the underlying MPI to do the most efficient thing possible, and use a communication algorithm that makes sense for your application. In most cases, you'll be good.

If you start trying to tune for a specific environment, platform, and MPI implementation, the number of variables grows exponentially. And if you change any one parameter in the whole setup, your optimizations may get lost. Also, if you add a bunch of infrastructure in your app to try to exactly match your environment+platform+implementation (e.g., manual segmenting to fit your overall message into the eager limit), you may just be adding additional overhead that effectively nullifies any optimization you might get (especially if the optimization is very small).

Indeed, the methods used for shared memory are similar to, but different from, the methods used for networks. And there's a wide variety of network capabilities; some can be more efficient than others (depending on a zillion factors). If you're using shared memory, ensure that your Linux kernel has good shared memory support (e.g., support for CMA), and let MPI optimize the message transfers for you.

-- 
Jeff Squyres
jsquy...@cisco.com
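[Editor's illustration] A minimal sketch of the kind of A/B (double-) buffering pairwise exchange discussed above, using blocking MPI_Sendrecv. The chunk size, iteration count, and partner choice (CHUNK_BYTES, NUM_ITERS, rank ^ 1) are illustrative assumptions, not values from this thread; the point is simply that the second small buffer lets a rank send and receive without routing both through the same memory in one step.

```c
/* Sketch only: pairwise exchange with two small buffers ("A/B buffering").
 * CHUNK_BYTES is assumed to be below the eager limit; NUM_ITERS and the
 * partner rank (rank ^ 1) are purely illustrative. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_BYTES 4096   /* assumed to be under the eager limit */
#define NUM_ITERS   8      /* illustrative iteration count */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Two small buffers: "2x the memory, but 2 times a small number". */
    char *buf[2];
    buf[0] = malloc(CHUNK_BYTES);
    buf[1] = malloc(CHUNK_BYTES);
    memset(buf[0], rank, CHUNK_BYTES);

    /* Simple pairwise partner: exchange with the neighboring rank. */
    int partner = rank ^ 1;

    for (int it = 0; it < NUM_ITERS && partner < size; ++it) {
        int cur = it & 1;      /* buffer holding this iteration's outgoing chunk */
        int nxt = 1 - cur;     /* buffer that receives the incoming chunk       */

        /* Blocking pairwise exchange; MPI chooses the transfer method
         * (eager protocol, shared memory, CMA, ...) underneath.  The
         * received chunk lands in the other buffer, so send and receive
         * never contend for the same memory in one step. */
        MPI_Sendrecv(buf[cur], CHUNK_BYTES, MPI_BYTE, partner, it,
                     buf[nxt], CHUNK_BYTES, MPI_BYTE, partner, it,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... the application would process buf[nxt] before the next iteration ... */
    }

    free(buf[0]);
    free(buf[1]);
    MPI_Finalize();
    return 0;
}
```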