Re: [OMPI devel] Rationale behind memcpy chunk size (in smsc/xpmem)

2022-08-04 Thread Giorgos Katevainis via devel
Hi,
Thanks for the reply.

This all sounds logical. My current leading theory is also that it has to do 
with the memcpy implementation, in conjunction with all the other factors 
(architecture, CPU and its details, copy size, etc.).

I tested on additional architectures, and there the pattern was reversed, with 
the non-chunked memcpy doing better at large sizes. I also observed the pattern 
change on some systems when going from the point-to-point tests to ones with 
collectives.

So it looks like, as expected, there's probably no single concrete answer: it 
depends on a bunch of different factors and their combination, with perhaps no 
single best setting for all scenarios. To further satisfy my curiosity, I might 
experiment with different/custom memory copy implementations, and with 
different compilers (e.g. icc).
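Something along the lines of this standalone sketch, perhaps -- purely 
illustrative, with arbitrary sizes and iteration counts, and without the XPMEM 
attach path:

    /* Hypothetical microbenchmark: one large memcpy vs. chunked memcpys.
     * Build e.g.: cc -O2 memcpy_bench.c -o memcpy_bench */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define TOTAL (2UL * 1024 * 1024)  /* 2 MB "message" */
    #define CHUNK (1UL * 1024 * 1024)  /* 1 MB chunk */
    #define ITERS 1000

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        char *src = malloc(TOTAL), *dst = malloc(TOTAL);
        memset(src, 1, TOTAL);  /* touch all pages up front (warm-up) */
        memset(dst, 0, TOTAL);

        double t0 = now_sec();
        for (int i = 0; i < ITERS; i++)
            memcpy(dst, src, TOTAL);  /* single large copy */
        double t1 = now_sec();

        for (int i = 0; i < ITERS; i++)
            for (size_t off = 0; off < TOTAL; off += CHUNK)
                memcpy(dst + off, src + off, CHUNK);  /* chunked copy */
        double t2 = now_sec();

        printf("single: %.3f s  chunked: %.3f s\n", t1 - t0, t2 - t1);
        free(src);
        free(dst);
        return 0;
    }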

George

PS. Yes, XPMEM maps the pages lazily, at page-fault time, the first time the 
attachment is touched (during the benchmark's warm-up run(s)). After that, it 
shouldn't be a factor.

On Wed, 2022-08-03 at 13:30, Jeff Squyres (jsquyres) wrote:
> [quoted text trimmed -- full message below]



Re: [OMPI devel] Rationale behind memcpy chunk size (in smsc/xpmem)

2022-08-03 Thread Jeff Squyres (jsquyres) via devel
Sorry for the delay in replies -- it's summer / vacation season, and I think we 
(as a community) are a little behind in answering some of these emails.  :-(

It's hard to say for any given machine, but a bunch of different hardware 
factors can come into play, such as:

- L1, L2, L3 cache sizes
- Cache contention
- Memory controller connectivity and locality

I.e., exactly which hardware resources are the memcpy()'s in question using, 
and how do they interact with each other?  How much overhead is produced, 
and/or how much contention ensues when multiple requests are in flight 
simultaneously?  For example, it may be counter-intuitive, but sometimes 
injecting a small amount of delay in a software pipeline can allow hardware 
resources to not become overwhelmed, and therefore the overall execution 
becomes more efficient, and therefore consume less wall-clock execution time.  
Hence, doing 2 x 1MB memcpy()'s (to effect a 2MB MPI_Send) may actually be 
overall more efficient, even though the individual parts of the transaction are 
less efficient.  This is a complete guess, and may have nothing to do with your 
system, but it's one of many possibilities.

Another possible factor: the specific memcpy() implementation is highly 
relevant.  It's been a few years since I've paid close attention to memcpy(), 
but at one time, there was significant variation in the quality of memcpy() 
implementations between different compilers and/or versions of libc.  I don't 
know if this is still a factor, or whether memcpy() is pretty well optimized in 
most situations these days.  Additionally, alignment can be an issue (although 
for message sizes of 2MB, I'm guessing your buffer is page-aligned, and this 
probably isn't an issue).
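
If you did want to rule alignment out explicitly, a minimal sketch would be to 
allocate the benchmark buffers page-aligned yourself with standard POSIX calls:

    /* Sketch: allocate a page-aligned 2 MB buffer for benchmarking. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);  /* typically 4096 */
        void *buf = NULL;
        if (posix_memalign(&buf, (size_t)page, 2UL * 1024 * 1024) != 0)
            return 1;  /* allocation failed */
        printf("buffer at %p (aligned to %ld bytes)\n", buf, page);
        free(buf);
        return 0;
    }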

All that being said, I'm not intimately familiar with the internals of XPMEM, 
so I don't know what userspace/kernel space mechanisms will come into play for 
mapping the shared memory (e.g., is it lazily mapping the shared memory?).

Also, you're probably doing this already, but these kinds of things are worth 
mentioning.  Make sure your performance benchmarks are testing the right 
things: do warmup transfers, make sure you're not swapping, make sure all the 
processes and memory are pinned properly, make sure you're on an 
otherwise-quiet machine, ... etc.  All the Usual Benchmarking Things.
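
For example, something like this (a sketch -- "./pingpong" stands in for 
whatever benchmark you're actually running, and option spellings can differ 
between Open MPI releases):

    # Run 2 ranks pinned to cores, and print the bindings to verify pinning:
    mpirun -n 2 --bind-to core --report-bindings ./pingpong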

--
Jeff Squyres
jsquy...@cisco.com


From: devel on behalf of Giorgos Katevainis via devel
Sent: Thursday, July 28, 2022 9:33 AM
To: Open MPI Developers
Cc: Giorgos Katevainis
Subject: [OMPI devel] Rationale behind memcpy chunk size (in smsc/xpmem)

Hello all,

I've come across the "memcpy_chunk_size" MCA parameter in smsc/xpmem, which 
effectively causes
memory copies to take place in chunks (used in mca_smsc_xpmem_memmove()). The 
comment reads:

"Maximum size to copy with a single call to memcpy. On some systems a smaller 
or larger number may
provide better performance (default: 256k)"
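
Assuming the fully qualified parameter name follows the usual 
<framework>_<component>_<param> convention, adjusting it looks something like 
this (with "./pingpong" as a stand-in for the actual benchmark):

    # Set the chunk size to 1 MB (1048576 bytes) for this run:
    mpirun --mca smsc_xpmem_memcpy_chunk_size 1048576 -n 2 ./pingpong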

And I have indeed observed a performance difference by adjusting it! E.g. in a 
simple point-to-point test, 2 MB messages do significantly better with the 
parameter set to 1 MB than to 2 MB. But... why? I could imagine a larger memcpy 
being more efficient, but what would cause many small ones to end up quicker 
than a single large one? Might it have something to do with memcpy intrinsics 
and different implementations for different sizes?
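
For concreteness, as I understand it the copy logic boils down to something 
like the following simplified sketch; the real mca_smsc_xpmem_memmove() also 
has to deal with mapping the remote range through XPMEM, which is omitted here:

    /* Simplified sketch of the chunked-copy idea; not the actual Open MPI
     * source. */
    #include <string.h>

    static void chunked_copy(void *dst, const void *src, size_t size,
                             size_t chunk /* e.g. the 256k default */)
    {
        char *d = dst;
        const char *s = src;
        while (size > 0) {
            size_t n = size < chunk ? size : chunk;
            memcpy(d, s, n);  /* each call is bounded by the chunk size */
            d += n;
            s += n;
            size -= n;
        }
    }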

If someone knows what's going on under the hood and/or could direct me to any 
relevant resources, I
would greatly appreciate it!

George