Re: [OMPI users] vectorized reductions

2021-07-20 Thread Gilles Gouaillardet via users
You are welcome to provide any data showing that the current
implementation (intrinsics, AVX512) is not the most efficient, and you
are free to issue a Pull Request suggesting a better one.


The op/avx component has pretty much nothing to do with scalability:
only one node is required to measure the performance, and the
test/datatype/reduce_local test can be used as a measurement.
/* several core counts should be used in order to fully evaluate the
infamous AVX512 frequency downscaling */
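For illustration only (this is not the bundled test/datatype/reduce_local
program; the buffer size and iteration count below are arbitrary), a
minimal single-process timing loop around MPI_Reduce_local should exercise
the same reduction kernels that the op framework selects:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 1 << 20;   /* elements per reduction */
    const int iters = 100;       /* timing repetitions */
    MPI_Init(&argc, &argv);

    float *in    = malloc(count * sizeof(float));
    float *inout = malloc(count * sizeof(float));
    for (int i = 0; i < count; i++) { in[i] = 1.0f; inout[i] = 2.0f; }

    double t0 = MPI_Wtime();
    for (int it = 0; it < iters; it++)
        MPI_Reduce_local(in, inout, count, MPI_FLOAT, MPI_SUM);
    double t1 = MPI_Wtime();

    printf("%.3f us per MPI_Reduce_local of %d floats\n",
           (t1 - t0) * 1e6 / iters, count);

    free(in); free(inout);
    MPI_Finalize();
    return 0;
}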


The benefits of op/avx (including AVX512) have been reported, for
example at
https://github.com/open-mpi/ompi/issues/8334#issuecomment-759864154



FWIW, George added SVE support in https://github.com/bosilca/ompi/pull/14,
and I added support for NEON and SVE in
https://github.com/ggouaillardet/ompi/tree/topic/op_arm


None of these have been merged, but you are free to evaluate them and 
report the performance numbers.




On 7/20/2021 11:00 PM, Dave Love via users wrote:

> Gilles Gouaillardet via users  writes:
>
>> One motivation is packaging: a single Open MPI implementation has to be
>> built that can run on older x86 processors (supporting only SSE) and the
>> latest ones (supporting AVX512).
>
> I take dispatch on micro-architecture for granted, but it doesn't
> require an assembler/intrinsics implementation.  See the level-1
> routines in recent BLIS, for example (an instance where GCC was supposed
> to fail).  That works for all relevant architectures, though I don't
> think the aarch64 and ppc64le dispatch was ever included.  Presumably
> it's less prone to errors than low-level code.
>
>> The op/avx component will select at
>> runtime the most efficient implementation for vectorized reductions.
>
> It will select the micro-architecture with the most features, which may
> or may not be the most efficient.  Is the avx512 version actually faster
> than avx2?
>
> Anyway, if this is important at scale, which I can't test, please at
> least vectorize op_base_functions.c for aarch64 and ppc64le.  With GCC,
> and probably other compilers -- at least clang, I think -- it doesn't
> even need changes to cc flags.  With GCC and recent glibc, target clones
> cover micro-arches with practically no effort.  Otherwise you probably
> need similar infrastructure to what's there now, but not to devote the
> effort to using intrinsics as far as I can see.




Re: [OMPI users] vectorized reductions

2021-07-20 Thread Dave Love via users
Gilles Gouaillardet via users  writes:

> One motivation is packaging: a single Open MPI implementation has to be
> built that can run on older x86 processors (supporting only SSE) and the
> latest ones (supporting AVX512).

I take dispatch on micro-architecture for granted, but it doesn't
require an assembler/intrinsics implementation.  See the level-1
routines in recent BLIS, for example (an instance where GCC was supposed
to fail).  That works for all relevant architectures, though I don't
think the aarch64 and ppc64le dispatch was ever included.  Presumably
it's less prone to errors than low-level code.

> The op/avx component will select at
> runtime the most efficient implementation for vectorized reductions.

It will select the micro-architecture with the most features, which may
or may not be the most efficient.  Is the avx512 version actually faster
than avx2?

Anyway, if this is important at scale, which I can't test, please at
least vectorize op_base_functions.c for aarch64 and ppc64le.  With GCC,
and probably other compilers -- at least clang, I think -- it doesn't
even need changes to cc flags.  With GCC and recent glibc, target clones
cover micro-arches with practically no effort.  Otherwise you probably
need similar infrastructure to what's there now, but not to devote the
effort to using intrinsics as far as I can see.
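By way of illustration (a sketch only, assuming GCC 6 or later and a glibc
with IFUNC support; the function name and the target list are made up, not
Open MPI code), target clones look like this:

#include <stddef.h>

/* GCC emits one clone per listed target plus a default, and generates an
   IFUNC resolver that picks the best clone at load time based on the CPU. */
__attribute__((target_clones("avx512f", "avx2", "sse4.1", "default")))
void sum_float(float *restrict inout, const float *restrict in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        inout[i] += in[i];
}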


Re: [OMPI users] vectorized reductions

2021-07-19 Thread Gilles Gouaillardet via users
One motivation is packaging: a single Open MPI implementation has to be
built that can run on older x86 processors (supporting only SSE) and the
latest ones (supporting AVX512). The op/avx component will select at
runtime the most efficient implementation for vectorized reductions.
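For illustration only (this is a sketch of feature-based runtime dispatch,
not the actual op/avx selection code, and the function names are made up),
the idea is:

#include <stddef.h>

void sum_int_avx512(int *inout, const int *in, size_t n);
void sum_int_avx2(int *inout, const int *in, size_t n);
void sum_int_scalar(int *inout, const int *in, size_t n);

typedef void (*sum_fn)(int *, const int *, size_t);

/* Pick the widest implementation the running CPU supports
   (GCC/clang builtin; checked once, e.g. at component init). */
sum_fn select_sum(void)
{
    if (__builtin_cpu_supports("avx512f")) return sum_int_avx512;
    if (__builtin_cpu_supports("avx2"))    return sum_int_avx2;
    return sum_int_scalar;
}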

On Mon, Jul 19, 2021 at 11:11 PM Dave Love via users <
users@lists.open-mpi.org> wrote:

> I meant to ask a while ago about vectorized reductions after I saw a
> paper that I can't now find.  I didn't understand what was behind it.
>
> Can someone explain why you need to hand-code the avx implementations of
> the reduction operations now used on x86_64?  As far as I remember, the
> paper didn't justify the effort past alluding to a compiler being unable
> to vectorize reductions.  I wonder which compiler(s); the recent ones
> I'm familiar with certainly can if you allow them (or don't stop them --
> icc, sigh).  I've been assured before that GCC can't, but that's
> probably due to using the default correct FP compilation and/or not
> restricting function arguments.  So I wonder what's the problem just
> using C and a tolerably recent GCC if necessary -- is there something
> else behind this?
>
> Since only x86 is supported, I had a go on ppc64le and with minimal
> effort saw GCC vectorizing more of the base implementation functions
> than are included in the avx version.  Similarly for x86
> micro-architectures.  (I'd need convincing that avx512 is worth the
> frequency reduction.)  It would doubtless be the same on aarch64, say,
> but I only have the POWER.
>
> Thanks for any info.
>


[OMPI users] vectorized reductions

2021-07-19 Thread Dave Love via users
I meant to ask a while ago about vectorized reductions after I saw a
paper that I can't now find.  I didn't understand what was behind it.

Can someone explain why you need to hand-code the avx implementations of
the reduction operations now used on x86_64?  As far as I remember, the
paper didn't justify the effort past alluding to a compiler being unable
to vectorize reductions.  I wonder which compiler(s); the recent ones
I'm familiar with certainly can if you allow them (or don't stop them --
icc, sigh).  I've been assured before that GCC can't, but that's
probably due to using the default correct FP compilation and/or not
restricting function arguments.  So I wonder what's the problem just
using C and a tolerably recent GCC if necessary -- is there something
else behind this?
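To make that concrete (a sketch only; the function names are made up and
this is not the Open MPI source), the element-wise loops behind the
reduction ops look roughly like the first function below, and the two
usual blockers are pointer aliasing and strict FP semantics:

#include <stddef.h>

/* Without restrict the compiler must assume in and inout may overlap, so
   it either emits a runtime alias check or skips vectorization; with
   restrict, GCC vectorizes this at -O3 for whatever -march is selected. */
void op_sum_float(float *restrict inout, const float *restrict in, size_t n)
{
    for (size_t i = 0; i < n; i++)
        inout[i] += in[i];
}

/* A scalar accumulation, by contrast, needs reassociation: GCC only
   vectorizes it with -ffast-math (or an OpenMP simd reduction pragma),
   because strict IEEE ordering forbids reordering the additions. */
float sum_all(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}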

Since only x86 is supported, I had a go on ppc64le and with minimal
effort saw GCC vectorizing more of the base implementation functions
than are included in the avx version.  Similarly for x86
micro-architectures.  (I'd need convincing that avx512 is worth the
frequency reduction.)  It would doubtless be the same on aarch64, say,
but I only have the POWER.

Thanks for any info.