https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91201
Bug ID: 91201
Summary: [7~9 Regression] SIMD not generated for horizontal sum of bytes in array
Product: gcc
Version: 9.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: bisqwit at iki dot fi
Target Milestone: ---

For this code:

    typedef unsigned long long E;
    const unsigned D = 2;
    E bytes[D];
    unsigned char sum()
    {
        E b[D]{};
        //#pragma omp simd
        for(unsigned n=0; n<D; ++n)
        {
            E temp = bytes[n];
            temp += (temp >> 32);
            temp += (temp >> 16);
            temp += (temp >> 8);
            b[n] = temp;
        }
        E result = 0;
        //#pragma omp simd
        for(unsigned n=0; n<D; ++n) result += b[n];
        return result;
    }

GCC 6.4 generates the following neat assembler code, but all versions since GCC 7 (including GCC 9.1) fail to utilize SIMD instructions at all.

    vmovdqa xmm0, XMMWORD PTR bytes[rip]
    vpsrlq  xmm1, xmm0, 32
    vpaddq  xmm1, xmm1, xmm0
    vpsrlq  xmm0, xmm1, 16
    vpaddq  xmm1, xmm0, xmm1
    vpsrlq  xmm0, xmm1, 8
    vpaddq  xmm0, xmm0, xmm1
    vpsrldq xmm1, xmm0, 8
    vpaddq  xmm0, xmm0, xmm1
    vmovq   rax, xmm0
    ret

The code that GCC versions 7.0 through 9.1 generate is:

    mov     rcx, QWORD PTR bytes[rip]
    mov     rdx, QWORD PTR bytes[rip+8]
    mov     rax, rcx
    shr     rax, 32
    add     rcx, rax
    mov     rax, rcx
    shr     rax, 16
    add     rcx, rax
    mov     rax, rdx
    shr     rax, 32
    add     rdx, rax
    mov     rax, rdx
    shr     rax, 16
    add     rdx, rax
    mov     rax, rcx
    shr     rax, 8
    add     rcx, rdx
    add     rcx, rax
    shr     rdx, 8
    lea     rax, [rcx+rdx]
    ret

Tested using compiler options -Ofast -std=c++17 -pedantic -Wall -Wextra -W -march=skylake. I also tried haswell, broadwell and znver1 for the -march option.

If I change the D constant to a larger one, such as 4 or 8, then SIMD instructions begin to appear. Interestingly, with D=4 it uses the stack as a temporary, but with D=8 it manages without (in both AVX and non-AVX code).

If I uncomment the two OpenMP pragmas, then SIMD code is generated, so it is clear that the compiler _can_ generate the optimal code, but for some reason chooses not to.

The testcase is a horizontal sum of all bytes in an array, by the way.

Compiler Explorer link for quick testing: https://godbolt.org/z/azkXiL
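
For reference, here is a minimal sketch of the variant with both pragmas enabled, which according to the description above does get vectorized. The pragmas are used exactly as they appear (commented out) in the testcase. The -fopenmp-simd option named in the comment is an assumption, since the report does not say which OpenMP option was used. Without such an option the pragmas are simply ignored.

```cpp
// Testcase from the report with the two OpenMP pragmas uncommented.
// Assumed build line (the OpenMP option is not stated in the report):
//   g++ -Ofast -std=c++17 -march=skylake -fopenmp-simd -S test.cc
typedef unsigned long long E;
const unsigned D = 2;
E bytes[D];

unsigned char sum()
{
    E b[D]{};
    #pragma omp simd
    for (unsigned n = 0; n < D; ++n)
    {
        // Fold the upper parts of each 64-bit word down towards its low byte.
        E temp = bytes[n];
        temp += (temp >> 32);
        temp += (temp >> 16);
        temp += (temp >> 8);
        b[n] = temp;
    }
    E result = 0;
    #pragma omp simd
    for (unsigned n = 0; n < D; ++n)
        result += b[n];
    return result;   // truncated to the low byte by the return type
}
```

As a quick sanity check, setting every byte of bytes[0] to 1 and every byte of bytes[1] to 2 makes sum() return 24.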