https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91201

            Bug ID: 91201
           Summary: [7~9 Regression] SIMD not generated for horizontal sum
                    of bytes in array
           Product: gcc
           Version: 9.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bisqwit at iki dot fi
  Target Milestone: ---

For this code:

    typedef unsigned long long E;
    const unsigned D = 2;
    E bytes[D];
    unsigned char sum() 
    {
        E b[D]{};
        //#pragma omp simd
        for(unsigned n=0; n<D; ++n)
        {
            E temp = bytes[n];
            temp += (temp >> 32);
            temp += (temp >> 16);
            temp += (temp >> 8);
            b[n] = temp;
        }
        E result = 0;
        //#pragma omp simd
        for(unsigned n=0; n<D; ++n) result += b[n];
        return result;
    }

GCC 6.4 generates the following neat assembly code, but all versions since GCC
7 (including GCC 9.1) fail to use SIMD instructions at all.

        vmovdqa xmm0, XMMWORD PTR bytes[rip]
        vpsrlq  xmm1, xmm0, 32
        vpaddq  xmm1, xmm1, xmm0
        vpsrlq  xmm0, xmm1, 16
        vpaddq  xmm1, xmm0, xmm1
        vpsrlq  xmm0, xmm1, 8
        vpaddq  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 8
        vpaddq  xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        ret

GCC versions from 7.0 up to and including 9.1 generate this code instead:

        mov     rcx, QWORD PTR bytes[rip]
        mov     rdx, QWORD PTR bytes[rip+8]
        mov     rax, rcx
        shr     rax, 32
        add     rcx, rax
        mov     rax, rcx
        shr     rax, 16
        add     rcx, rax
        mov     rax, rdx
        shr     rax, 32
        add     rdx, rax
        mov     rax, rdx
        shr     rax, 16
        add     rdx, rax
        mov     rax, rcx
        shr     rax, 8
        add     rcx, rdx
        add     rcx, rax
        shr     rdx, 8
        lea     rax, [rcx+rdx]
        ret

Tested with the compiler options -Ofast -std=c++17 -pedantic -Wall -Wextra -W
-march=skylake. I also tried haswell, broadwell and znver1 for the -march option.

If I change the D constant to a larger one, such as 4 or 8, then SIMD
instructions begin to appear. Interestingly, with D=4 it uses the stack as a
temporary, but with D=8 it manages without (in both AVX and non-AVX code).

If I uncomment the two OpenMP pragmas, then SIMD code will manifest, so it is
clear that the compiler _can_ generate the optimal code, but for some reason
chooses not to.
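
For reference, here is a sketch of the variant with both pragmas enabled. The
function name sum_simd and the reduction(+:result) clause on the second loop
are assumptions added here and are not part of the original testcase. It is
also an assumption that the pragmas are activated with -fopenmp-simd (or
-fopenmp), since the report does not say which flag was used.

```cpp
typedef unsigned long long E;
const unsigned D = 2;
E bytes[D];

// Hypothetical name; same body as sum() in the testcase, pragmas enabled.
unsigned char sum_simd()
{
    E b[D]{};
    #pragma omp simd
    for (unsigned n = 0; n < D; ++n)
    {
        E temp = bytes[n];
        temp += (temp >> 32);
        temp += (temp >> 16);
        temp += (temp >> 8);
        b[n] = temp;
    }
    E result = 0;
    // The reduction clause is an addition; the testcase uses a bare
    // "#pragma omp simd" here.
    #pragma omp simd reduction(+:result)
    for (unsigned n = 0; n < D; ++n) result += b[n];
    // Only the low byte is returned, as in the original.
    return static_cast<unsigned char>(result);
}
```

This could be built with something like g++ -Ofast -march=skylake
-fopenmp-simd. Without an OpenMP flag the pragmas are ignored and the function
still returns the same value.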

By the way, the testcase computes a horizontal sum of all bytes in an array.
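
To make the per-element step easier to follow, the sketch below isolates the
shift-and-add fold from the loop body and sets it next to a plain byte-by-byte
sum. Both function names (fold_word, byte_sum) are hypothetical and do not
appear in the testcase. The low byte of the fold agrees with the byte-by-byte
sum as long as no byte lane carries into its neighbour, which is guaranteed
when the eight bytes add up to at most 255.

```cpp
#include <cstdint>

// Per-element step copied from the loop body of the testcase.
inline std::uint64_t fold_word(std::uint64_t temp)
{
    temp += temp >> 32;   // add the upper four bytes onto the lower four
    temp += temp >> 16;   // add bytes 2..3 onto bytes 0..1
    temp += temp >> 8;    // add byte 1 onto byte 0
    return temp;          // the caller keeps only the low byte
}

// Hypothetical reference: add the eight bytes one at a time, modulo 256.
inline unsigned char byte_sum(std::uint64_t word)
{
    unsigned total = 0;
    for (int i = 0; i < 8; ++i) total += (word >> (8 * i)) & 0xFFu;
    return static_cast<unsigned char>(total);
}
```

For example, for the word 0x0102030405060708 both fold_word (low byte) and
byte_sum give 36.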

Compiler Explorer link for quick testing: https://godbolt.org/z/azkXiL
