Re: [OMPI users] vectorized reductions
You are welcome to provide any data showing that the current implementation (intrinsics, AVX512) is not the most efficient, and you are free to issue a Pull Request to suggest a better one.

The op/avx component has pretty much nothing to do with scalability: only one node is required to measure the performance, and the test/datatype/reduce_local test can be used for the measurement. /* several core counts should be used in order to fully evaluate the infamous AVX512 frequency downscaling */

The benefits of op/avx (including AVX512) have been reported, for example at
https://github.com/open-mpi/ompi/issues/8334#issuecomment-759864154

FWIW, George added SVE support in https://github.com/bosilca/ompi/pull/14, and I added support for NEON and SVE in https://github.com/ggouaillardet/ompi/tree/topic/op_arm

None of these have been merged, but you are free to evaluate them and report the performance numbers.

On 7/20/2021 11:00 PM, Dave Love via users wrote:
> Gilles Gouaillardet via users writes:
>
>> One motivation is packaging: a single Open MPI implementation has to be
>> built, that can run on older x86 processors (supporting only SSE) and the
>> latest ones (supporting AVX512).
>
> I take dispatch on micro-architecture for granted, but it doesn't require
> an assembler/intrinsics implementation. See the level-1 routines in recent
> BLIS, for example (an instance where GCC was supposed to fail). That works
> for all relevant architectures, though I don't think the aarch64 and
> ppc64le dispatch was ever included. Presumably it's less prone to errors
> than low-level code.
>
>> The op/avx component will select at runtime the most efficient
>> implementation for vectorized reductions.
>
> It will select the micro-architecture with the most features, which may or
> may not be the most efficient. Is the avx512 version actually faster than
> avx2?
>
> Anyway, if this is important at scale, which I can't test, please at least
> vectorize op_base_functions.c for aarch64 and ppc64le. With GCC, and
> probably other compilers -- at least clang, I think -- it doesn't even
> need changes to cc flags. With GCC and recent glibc, target clones cover
> micro-arches with practically no effort. Otherwise you probably need
> similar infrastructure to what's there now, but not to devote the effort
> to using intrinsics as far as I can see.
Re: [OMPI users] vectorized reductions
Gilles Gouaillardet via users writes:

> One motivation is packaging: a single Open MPI implementation has to be
> built, that can run on older x86 processors (supporting only SSE) and the
> latest ones (supporting AVX512).

I take dispatch on micro-architecture for granted, but it doesn't require an assembler/intrinsics implementation. See the level-1 routines in recent BLIS, for example (an instance where GCC was supposed to fail). That works for all relevant architectures, though I don't think the aarch64 and ppc64le dispatch was ever included. Presumably it's less prone to errors than low-level code.

> The op/avx component will select at
> runtime the most efficient implementation for vectorized reductions.

It will select the micro-architecture with the most features, which may or may not be the most efficient. Is the avx512 version actually faster than avx2?

Anyway, if this is important at scale, which I can't test, please at least vectorize op_base_functions.c for aarch64 and ppc64le. With GCC, and probably other compilers -- at least clang, I think -- it doesn't even need changes to cc flags. With GCC and recent glibc, target clones cover micro-arches with practically no effort. Otherwise you probably need similar infrastructure to what's there now, but not to devote the effort to using intrinsics as far as I can see.
Re: [OMPI users] vectorized reductions
One motivation is packaging: a single Open MPI implementation has to be built, that can run on older x86 processors (supporting only SSE) and the latest ones (supporting AVX512). The op/avx component will select at runtime the most efficient implementation for vectorized reductions.

On Mon, Jul 19, 2021 at 11:11 PM Dave Love via users <users@lists.open-mpi.org> wrote:

> I meant to ask a while ago about vectorized reductions after I saw a
> paper that I can't now find. I didn't understand what was behind it.
>
> Can someone explain why you need to hand-code the avx implementations of
> the reduction operations now used on x86_64? As far as I remember, the
> paper didn't justify the effort past alluding to a compiler being unable
> to vectorize reductions. I wonder which compiler(s); the recent ones
> I'm familiar with certainly can if you allow them (or don't stop them --
> icc, sigh). I've been assured before that GCC can't, but that's
> probably due to using the default correct FP compilation and/or not
> restricting function arguments. So I wonder what's the problem just
> using C and a tolerably recent GCC if necessary -- is there something
> else behind this?
>
> Since only x86 is supported, I had a go on ppc64le and with minimal
> effort saw GCC vectorizing more of the base implementation functions
> than are included in the avx version. Similarly for x86
> micro-architectures. (I'd need convincing that avx512 is worth the
> frequency reduction.) It would doubtless be the same on aarch64, say,
> but I only have the POWER.
>
> Thanks for any info.
[OMPI users] vectorized reductions
I meant to ask a while ago about vectorized reductions after I saw a paper that I can't now find. I didn't understand what was behind it.

Can someone explain why you need to hand-code the avx implementations of the reduction operations now used on x86_64? As far as I remember, the paper didn't justify the effort past alluding to a compiler being unable to vectorize reductions. I wonder which compiler(s); the recent ones I'm familiar with certainly can if you allow them (or don't stop them -- icc, sigh). I've been assured before that GCC can't, but that's probably due to using the default correct FP compilation and/or not restricting function arguments. So I wonder what's the problem just using C and a tolerably recent GCC if necessary -- is there something else behind this?

Since only x86 is supported, I had a go on ppc64le and with minimal effort saw GCC vectorizing more of the base implementation functions than are included in the avx version. Similarly for x86 micro-architectures. (I'd need convincing that avx512 is worth the frequency reduction.) It would doubtless be the same on aarch64, say, but I only have the POWER.

Thanks for any info.