Configuring OpenMPI 4.1.0 with GCC 10.2.0 on
Intel(R) Xeon(R) CPU E5-2620 v3, a Haswell processor
that supports AVX2 but not AVX512, resulted in

checking for AVX512 support (no additional flags)... no
checking for AVX512 support (with -march=skylake-avx512)... yes

in "configure" output, and in config.log

MCA_BUILD_ompi_op_has_avx512_support_FALSE='#'
MCA_BUILD_ompi_op_has_avx512_support_TRUE=''

Consequently AVX512 intrinsic functions were erroneously
deployed, resulting in OpenMPI failure.

The relevant test code was in essence

cat > conftest.c << EOF
#include <immintrin.h>

int main()
{
        __m512 vA, vB;

        _mm512_add_ps(vA, vB);

        return 0;
}
EOF

The problem with this is that the result of the function
is never used, so at any optimization level higher than O0
the compiler eliminates the call as "dead code" (DCE).
To wit,

gcc -O3 -march=skylake-avx512 -S conftest.c

yields

        .file   "conftest.c"
        .text
        .section        .text.startup,"ax",@progbits
        .p2align 4
        .globl  main
        .type   main, @function
main:
.LFB5345:
        .cfi_startproc
        xorl    %eax, %eax
        ret
        .cfi_endproc
.LFE5345:
        .size   main, .-main
        .ident  "GCC: (GNU) 10.2.0"
        .section        .note.GNU-stack,"",@progbits

Compare this with the result of

gcc -O0 -march=skylake-avx512 -S conftest.c

in which AVX512 code IS generated:

        .file   "conftest.c"
        .text
        .globl  main
        .type   main, @function
main:
.LFB4092:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        andq    $-64, %rsp
        subq    $136, %rsp
        vmovaps 72(%rsp), %zmm0
        vmovaps %zmm0, -56(%rsp)
        vmovaps 8(%rsp), %zmm0
        vmovaps %zmm0, -120(%rsp)
        movl    $0, %eax
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE4092:
        .size   main, .-main
        .ident  "GCC: (GNU) 10.2.0"
        .section        .note.GNU-stack,"",@progbits

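Note the use of the 512-bit ZMM register %zmm0: ZMM
registers are used only by AVX512 instructions.  At O3 no
such instruction remains in the executable, so it runs
successfully on any x86-64 processor.  Hence at O3 the
test program does not detect the lack of AVX512 support
by the host processor.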

An easy remedy would be to declare the operands as
"volatile" and thereby force the compiler to retain
AVX512 instructions in the executable:

cat > conftest.c << EOF
#include <immintrin.h>

int main()
{
        volatile __m512 vA, vB;

        _mm512_add_ps(vA, vB);

        return 0;
}
EOF

Compiled at O3, the resulting executable dumps core as it
should when run on my Haswell processor, returning nonzero
exit status ($?), which would inform "configure" that the
processor does not have AVX512 capability.
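
As a complement to the crash test, the running processor
can also be queried directly.  What follows is only a sketch
under my own assumptions, not what OpenMPI's "configure"
does: it relies on the GCC builtins __builtin_cpu_init() and
__builtin_cpu_supports(), and the helper name
host_has_avx512f is hypothetical.

```c
/*
 * Hypothetical helper: returns 1 if the processor executing this
 * code reports AVX512 Foundation (AVX512F) support, 0 otherwise.
 * Assumes a GCC-compatible compiler; on anything that is not x86
 * it conservatively returns 0.
 */
static int host_has_avx512f(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
        /* Populate the compiler's CPU feature data before querying it. */
        __builtin_cpu_init();

        /* The builtin returns nonzero when the feature is available. */
        return __builtin_cpu_supports("avx512f") ? 1 : 0;
#else
        return 0;
#endif
}
```

A test program could return host_has_avx512f() ? 0 : 1 from
main(), giving "configure" an exit status without executing
any AVX512 instruction, so there is nothing for the optimizer
to remove.  Note that this covers only the AVX512F subset.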

Finally please note that this error could affect the
detection of support for other instruction sets on other
families of processors: compiler optimization must be
inhibited for such tests to be reliable!

Max
---
Max R. Dechantsreiter
President
Performance Jones L.L.C.
m...@performancejones.com
Skype: PerformanceJones (UTC+01:00)
+1 414 446-3100 (telephone/voicemail)
http://www.linkedin.com/in/benchmarking
