https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101472

            Bug ID: 101472
           Summary: AVX-512 wrong code for consecutive masked scatters
           Product: gcc
           Version: 11.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dlustig at nvidia dot com
  Target Milestone: ---

Created attachment 51162
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51162&action=edit
Test case

$ cat two_scatters.c
#include <immintrin.h>

void two_scatters(void* base_addr, __mmask8 k1, __mmask8 k2, __m512i vindex,
                  __m256i a) {
    _mm512_mask_i64scatter_epi32(base_addr, k1, vindex, a, 1);
    _mm512_mask_i64scatter_epi32(base_addr, k2, vindex, a, 1);
}

$ g++-11 -S -O3 -march=skylake-avx512 -Wall -Wextra -fno-strict-aliasing
-fwrapv -fno-aggressive-loop-optimizations two_scatters.c -o -
...
_Z12two_scattersPvhhDv8_xDv4_x:
        kmovb   %edx, %k2
        vpscatterqd     %ymm1, (%rdi,%zmm0,1){%k2}
        ret
...

Only one vpscatterqd instruction is generated, although I would expect two.
The optimizer appears to treat the first store as redundant with the second
because the addresses match, and so eliminates it.  However, the two scatters
use different masks (k1 and k2), so the first may write lanes that the second
does not, and it is not actually redundant.
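
To illustrate why the masks matter, here is a minimal scalar sketch in C of
the masked-scatter semantics described above.  It is only a model, not the
intrinsic itself and not GCC's internal representation; model_mask_scatter,
lanes_written, the index values, and the buffer are hypothetical names and
data chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model (an assumption for illustration) of an 8-lane masked scatter
   of 32-bit elements with 64-bit byte offsets and scale 1: lane i is stored
   only when bit i of the mask is set. */
static void model_mask_scatter(unsigned char *base, size_t size, uint8_t mask,
                               const int64_t index[8], const int32_t value[8])
{
    for (int lane = 0; lane < 8; ++lane) {
        if (mask & (1u << lane)) {
            /* Keep the model within the caller's buffer. */
            assert(index[lane] >= 0 &&
                   (size_t)index[lane] + sizeof(int32_t) <= size);
            memcpy(base + index[lane], &value[lane], sizeof(int32_t));
        }
    }
}

/* Runs the two scatters (or only the second one when emit_first is 0) on a
   zeroed buffer and returns how many 32-bit slots ended up written. */
int lanes_written(uint8_t k1, uint8_t k2, int emit_first)
{
    static const int64_t index[8] = {0, 4, 8, 12, 16, 20, 24, 28};
    static const int32_t value[8] = {1, 2, 3, 4, 5, 6, 7, 8}; /* all nonzero */
    int32_t buf[8] = {0};

    if (emit_first)
        model_mask_scatter((unsigned char *)buf, sizeof buf, k1, index, value);
    model_mask_scatter((unsigned char *)buf, sizeof buf, k2, index, value);

    int count = 0;
    for (int i = 0; i < 8; ++i)
        count += (buf[i] != 0);
    return count;
}
```

In this model, with k1 = 0x0F and k2 = 0xF0, lanes_written(0x0F, 0xF0, 1)
covers all eight slots, while lanes_written(0x0F, 0xF0, 0) covers only the
upper four, so dropping the first scatter changes the result.  The two calls
agree only when every bit set in k1 is also set in k2.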

Perturbing the example, for instance by passing two different base addresses
or by inserting an asm("nop"); between the two calls, causes both scatters to
be emitted:
https://godbolt.org/z/3b8v86on4

Stand-alone executable test case attached as well.

GCC version info:

$ gcc-11 -v
Using built-in specs.
COLLECT_GCC=gcc-11
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
11.1.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-11
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--disable-werror --disable-cet --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-11-YRKbe7/gcc-11-11.1.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-YRKbe7/gcc-11-11.1.0/debian/tmp-gcn/usr
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.1.0 (Ubuntu 11.1.0-1ubuntu1~18.04.1)
