[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #9 from Witold Baryluk ---
Indeed, passing -fno-tree-pre makes the first example vectorize. In mesh_simple.c this corresponds to ONTHEFLY_CONSTANTS being defined and USE_LOOP_CONSTANTS not being defined. SIMPLIFIED can be defined or not; it now vectorizes in both cases.

Targeting -march=knm. This is with #define OCTAVES 12, a compile-time constant, so the compiler fully unrolls the innermost loop.

Without -fno-tree-pre:

1230 :
    1230: 41 57                   push   %r15
    1232: 62 a1 7d 40 ef c0       vpxord %zmm16,%zmm16,%zmm16
    1238: 49 ba 53 ec 85 1a fe    movabs $0xc4ceb9fe1a85ec53,%r10
    123f: b9 ce c4
    1242: 41 56                   push   %r14
    1244: c5 7a 10 0d f8 0d 00    vmovss 0xdf8(%rip),%xmm9        # 2044 <_IO_stdin_used+0x44>
    124b: 00
    124c: 62 31 7c 48 28 d0       vmovaps %zmm16,%zmm10
    1252: 41 55                   push   %r13
    1254: c5 7a 10 3d ec 0d 00    vmovss 0xdec(%rip),%xmm15        # 2048 <_IO_stdin_used+0x48>
    125b: 00
    125c: 62 a1 7c 48 28 d0       vmovaps %zmm16,%zmm18
    1262: 41 54                   push   %r12
    1264: c5 7a 10 35 e0 0d 00    vmovss 0xde0(%rip),%xmm14        # 204c <_IO_stdin_used+0x4c>
    126b: 00
    126c: 49 b9 cd 8c 55 ed d7    movabs $0xff51afd7ed558ccd,%r9
    1273: af 51 ff
    1276: 55                      push   %rbp
    1277: c5 7a 10 2d d1 0d 00    vmovss 0xdd1(%rip),%xmm13        # 2050 <_IO_stdin_used+0x50>
    127e: 00
    127f: 49 be 68 66 ac 6a bf    movabs $0xfa8d7ebf6aac6668,%r14
    1286: 7e 8d fa
    1289: 53                      push   %rbx
    128a: c5 7a 10 25 c2 0d 00    vmovss 0xdc2(%rip),%xmm12        # 2054 <_IO_stdin_used+0x54>
    1291: 00
    1292: 48 89 7c 24 f8          mov    %rdi,-0x8(%rsp)
    1297: c7 44 24 f0 00 00 00    movl   $0x0,-0x10(%rsp)
    129e: 00
    129f: c7 44 24 f4 00 00 00    movl   $0x0,-0xc(%rsp)
    12a6: 00
    12a7: c5 7a 10 1d a9 0d 00    vmovss 0xda9(%rip),%xmm11        # 2058 <_IO_stdin_used+0x58>
    12ae: 00
    12af: 62 e1 7e 08 10 0d a3    vmovss 0xda3(%rip),%xmm17        # 205c <_IO_stdin_used+0x5c>
    12b6: 0d 00 00
    12b9: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
    12c0: 48 8b 6c 24 f8          mov    -0x8(%rsp),%rbp
    12c5: 31 f6                   xor    %esi,%esi
    12c7: 31 db                   xor    %ebx,%ebx
    12c9: 62 31 7c 48 28 c2       vmovaps %zmm18,%zmm8
    12cf: 90                      nop
    12d0: 8b 54 24 f0             mov    -0x10(%rsp),%edx
    12d4: 45 31 e4                xor    %r12d,%r12d
    12d7: 62 b1 7c 48 28 f8       vmovaps %zmm16,%zmm7
    12dd: 62 c1 7c 48 28 d9       vmovaps %zmm9,%zmm19
    12e3: c5 32 11 cc             vmovss %xmm9,%xmm9,%xmm4
    12e7: eb 26                   jmp    130f
    12e9: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
    12f0: c5 ba 59 c4             vmulss %xmm4,%xmm8,%xmm0
    12f4: 62 f3 7d 08 0a c0 09    vrndscaless $0x9,%xmm0,%xmm0,%xmm0
    12fb: c5 fa 2c f0             vcvttss2si %xmm0,%esi
    12ff: c4 c1 5a 59 c2          vmulss %xmm10,%xmm4,%xmm0
    1304: 62 f3 7d 08 0a c0 09    vrndscaless $0x9,%xmm0,%xmm0,%xmm0
    130b: c5 fa 2c d0             vcvttss2si %xmm0,%edx
    130f: 4c 89 e1                mov    %r12,%rcx
    1312: 62 c1 7c 48 28 e8       vmovaps %zmm8,%zmm21
    1318: 48 c1 e9 21             shr    $0x21,%rcx
    131c: 62 e1 7c 48 28 e4       vmovaps %zmm4,%zmm20
    1322: c5 d2 2a ea             vcvtsi2ss %edx,%xmm5,%xmm5
    1326: 4c 31 e1                xor    %r12,%rcx
    1329: 49 0f af ca             imul   %r10,%rcx
    132d: 48 63 d2                movslq %edx,%rdx
    1330: c5 e2 2a de             vcvtsi2ss %esi,%xmm3,%xmm3
    1334: 4f 8d 24 0c             lea    (%r12,%r9,1),%r12
    1338: 48 69 d2 53 42 41 4e    imul   $0x4e414253,%rdx,%rdx
    133f: 62 c2 55 08 9b e2       vfmsub132ss %xmm10,%xmm5,%xmm20
    1345: c4 c1 52 58 e9          vaddss %xmm9,%xmm5,%xmm5
    134a: 48 8d 01                lea    (%rcx),%rax
    134d: 48 c1 e8 21             shr    $0x21,%rax
    1351: 62 e2 65 08 9b ec       vfmsub132ss %xmm4,%xmm3,%xmm21
    1357: 48 31 c1                xor    %rax,%rcx
    135a: 4c 8d ba 53 42 41 4e    lea    0x4e414253(%rdx),%r15
    1361: 48 89 cf                mov    %rcx,%rdi
    1364: 48 89 c8                mov    %rcx,%rax
    1367: 48 81 f7 70 46 ab 58    xor    $0x58ab4670,%rdi
    136e: c4 c1 62 58 d9          vaddss %xmm9,%xmm3,%xmm3
    1373: 48 c1 e8 21             shr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-10-17
                 CC|                            |rguenth at gcc dot gnu.org
             Blocks|                            |53947
     Ever confirmed|0                           |1

--- Comment #8 from Richard Biener ---
You can try -fno-tree-pre. For the original issue you mention, the problem is that PRE computes the first iteration's values at compile time, which effectively rotates the loop, and the vectorizer is not happy with that.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #7 from Witold Baryluk ---
Online examples: https://gcc.godbolt.org/z/Nyjty3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #6 from Witold Baryluk ---
I also tested clang with LLVM 10~svn374655 and it does vectorize the loop properly, even when both the frequency and amplitude variables are updated in every iteration.

It still doesn't inline the calls to sinf, even if I set -fno-math-errno and the other flags from -ffast-math. My guess is that this is because there is no hardware support for a vectorized sinf, and no vectorized software implementation of sinf is available either.

If I provide my own version of sinf using a simple Taylor expansion, clang fully vectorizes the code:

  401320: 62 e1 7d 58 fe 3d 56    vpaddd 0xd56(%rip){1to16},%zmm0,%zmm23        # 402080 <_IO_stdin_used+0x80>
  401327: 0d 00 00
  40132a: 62 61 7c 48 5b c0       vcvtdq2ps %zmm0,%zmm24
  401330: 62 a1 7c 48 5b ff       vcvtdq2ps %zmm23,%zmm23
  401336: 62 f1 7c 48 10 4c 24    vmovups 0x140(%rsp),%zmm1
  40133d: 05
  40133e: 62 61 3c 40 59 d1       vmulps %zmm1,%zmm24,%zmm26
  401344: 62 61 44 40 59 f9       vmulps %zmm1,%zmm23,%zmm31
  40134a: 62 f1 7c 48 10 4c 24    vmovups 0x100(%rsp),%zmm1
  401351: 04
  401352: 62 61 3c 40 59 d9       vmulps %zmm1,%zmm24,%zmm27
  401358: 62 f1 44 40 59 c9       vmulps %zmm1,%zmm23,%zmm1
  40135e: 62 01 2c 40 59 ca       vmulps %zmm26,%zmm26,%zmm25
  401364: 62 f1 7c 48 10 54 24    vmovups 0x80(%rsp),%zmm2
  40136b: 02
  40136c: 62 61 3c 40 59 e2       vmulps %zmm2,%zmm24,%zmm28
  401372: 62 f1 44 40 59 d2       vmulps %zmm2,%zmm23,%zmm2
  401378: 62 02 25 40 ac ca       vfnmadd213ps %zmm26,%zmm27,%zmm25
  40137e: 62 f1 7c 48 10 5c 24    vmovups 0x40(%rsp),%zmm3
  401385: 01
  401386: 62 61 3c 40 59 eb       vmulps %zmm3,%zmm24,%zmm29
  40138c: 62 f1 44 40 59 db       vmulps %zmm3,%zmm23,%zmm3
  401392: 62 01 1c 40 59 d4       vmulps %zmm28,%zmm28,%zmm26
  401398: 62 01 04 40 59 df       vmulps %zmm31,%zmm31,%zmm27
  40139e: 62 02 15 40 ac d4       vfnmadd213ps %zmm28,%zmm29,%zmm26
  4013a4: 62 f1 7c 48 10 6c 24    vmovups -0x40(%rsp),%zmm5
  4013ab: ff
  4013ac: 62 f1 3c 40 59 e5       vmulps %zmm5,%zmm24,%zmm4
  4013b2: 62 f1 44 40 59 ed       vmulps %zmm5,%zmm23,%zmm5
  4013b8: 62 61 6c 48 59 e2       vmulps %zmm2,%zmm2,%zmm28
  4013be: 62 f1 7c 48 10 7c 24    vmovups -0x80(%rsp),%zmm7
  4013c5: fe
  4013c6: 62 f1 3c 40 59 f7       vmulps %zmm7,%zmm24,%zmm6
  4013cc: 62 f1 44 40 59 ff       vmulps %zmm7,%zmm23,%zmm7
  4013d2: 62 61 5c 48 59 ec       vmulps %zmm4,%zmm4,%zmm29
  4013d8: 62 61 54 48 59 f5       vmulps %zmm5,%zmm5,%zmm30
  4013de: 62 62 4d 48 ac ec       vfnmadd213ps %zmm4,%zmm6,%zmm29
  4013e4: 62 d1 3c 40 59 e3       vmulps %zmm11,%zmm24,%zmm4
  4013ea: 62 d1 44 40 59 f3       vmulps %zmm11,%zmm23,%zmm6
  4013f0: 62 02 75 48 ac df       vfnmadd213ps %zmm31,%zmm1,%zmm27
  4013f6: 62 d1 3c 40 59 cc       vmulps %zmm12,%zmm24,%zmm1
  4013fc: 62 41 44 40 59 fc       vmulps %zmm12,%zmm23,%zmm31
  401402: 62 71 5c 48 59 c4       vmulps %zmm4,%zmm4,%zmm8
  401408: 62 62 65 48 ac e2       vfnmadd213ps %zmm2,%zmm3,%zmm28
  40140e: 62 72 75 48 ac c4       vfnmadd213ps %zmm4,%zmm1,%zmm8
  401414: 62 d1 3c 40 59 ce       vmulps %zmm14,%zmm24,%zmm1
  40141a: 62 d1 44 40 59 d6       vmulps %zmm14,%zmm23,%zmm2
  401420: 62 62 45 48 ac f5       vfnmadd213ps %zmm5,%zmm7,%zmm30
  401426: 62 d1 3c 40 59 df       vmulps %zmm15,%zmm24,%zmm3
  40142c: 62 d1 44 40 59 e7       vmulps %zmm15,%zmm23,%zmm4
  401432: 62 f1 74 48 59 e9       vmulps %zmm1,%zmm1,%zmm5
  401438: 62 f1 4c 48 59 fe       vmulps %zmm6,%zmm6,%zmm7
  40143e: 62 71 6c 48 59 ca       vmulps %zmm2,%zmm2,%zmm9
  401444: 62 f2 65 48 ac e9       vfnmadd213ps %zmm1,%zmm3,%zmm5
  40144a: 62 b1 3c 40 59 c9       vmulps %zmm17,%zmm24,%zmm1
  401450: 62 f2 05 40 ac fe       vfnmadd213ps %zmm6,%zmm31,%zmm7
  401456: 62 b1 44 40 59 d9       vmulps %zmm17,%zmm23,%zmm3
  40145c: 62 b1 3c 40 59 f2       vmulps %zmm18,%zmm24,%zmm6
  401462: 62 21 44 40 59 fa       vmulps %zmm18,%zmm23,%zmm31
  401468: 62 72 5d 48 ac ca       vfnmadd213ps %zmm2,%zmm4,%zmm9
  40146e: 62 f1 74 48 59 d1       vmulps %zmm1,%zmm1,%zmm2
  401474: 62 f1 64 48 59 e3       vmulps %zmm3,%zmm3,%zmm4
  40147a: 62 f2 4d 48 ac d1       vfnmadd213ps %zmm1,%zmm6,%zmm2
  401480: 62 f2 05 40 ac e3       vfnmadd213ps %zmm3,%zmm31,%zmm4
  401486: 62 b1 3c 40 59 cc       vmulps %zmm20,%zmm24,%zmm1
  40148c: 62 b1 3c 40 59 dd       vmulps %zmm21,%zmm24,%zmm3
  401492: 62 f1 74 48 59 f1       vmulps %zmm1,%zmm1,%zmm6
  401498: 62 21 44 40 59 fc       vmulps %zmm20,%zmm23,%zmm31
  40149e: 62 f2 65 48 ac f1       vfnmadd213ps
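
The replacement sinf used for the clang test is not shown in this thread. As a rough idea of what a "simple Taylor expansion" could look like, here is a minimal sketch; the name taylor_sinf, the choice of four terms, and the absence of range reduction are all assumptions, not the code that was actually compiled.

```c
/* Hypothetical stand-in for sinf: the first four terms of the Taylor
   series around 0, x - x^3/3! + x^5/5! - x^7/7!, in Horner form.
   There is no range reduction, so accuracy is only reasonable for
   small |x| (roughly |x| <= 1). */
static inline float taylor_sinf(float x) {
    const float x2 = x * x;
    return x * (1.0f - x2 * (1.0f / 6.0f
                     - x2 * (1.0f / 120.0f
                     - x2 * (1.0f / 5040.0f))));
}
```

A body like this consists only of multiplications and subtractions with no branches or library calls, which fits the vmulps/vfnmadd213ps pattern in the disassembly. For larger arguments, such as x * frequency after several octaves, a real implementation would need range reduction first.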
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #5 from Witold Baryluk ---
As a bonus:

static float perlin1d(float x) {
  float accum = 0.0f;
  for (int i = 0; i < 8; i++) {
    accum += powf(0.781f, i) * sinf(x * powf(2.131f, i));
  }
  return accum;
}

GCC claims this is vectorized, but it really isn't: the generated code still has non-inlined, non-lowered calls to sinf and expf_finite.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #4 from Witold Baryluk ---
If I reduce the minimized test case even further:

only frequency update: VECTORIZED:

static float perlin1d(float x) {
  float accum = 0.0f;
  float amplitude = 1.0f;
  float frequency = 1.0f;
  for (int i = 0; i < 8; i++) {
    accum += amplitude * sinf(x * frequency);
    frequency *= 2.131f;
  }
  return accum;
}

__attribute__((noinline))
static void fill_data(int width, float * __restrict__ height_data, float scale) {
  for (int i = 0; i < width; i++) {
    height_data[i] = perlin1d(i);
  }
}

only amplitude update: VECTORIZED:

static float perlin1d(float x) {
  float accum = 0.0f;
  float amplitude = 1.0f;
  float frequency = 1.0f;
  for (int i = 0; i < 8; i++) {
    accum += amplitude * sinf(x * frequency);
    amplitude *= 0.781f;
  }
  return accum;
}

__attribute__((noinline))
static void fill_data(int width, float * __restrict__ height_data, float scale) {
  for (int i = 0; i < width; i++) {
    height_data[i] = perlin1d(i);
  }
}

both frequency and amplitude update: NOT VECTORIZED:

static float perlin1d(float x) {
  float accum = 0.0f;
  float amplitude = 1.0f;
  float frequency = 1.0f;
  for (int i = 0; i < 8; i++) {
    accum += amplitude * sinf(x * frequency);
    amplitude *= 0.781f;
    frequency *= 2.131f;
  }
  return accum;
}

__attribute__((noinline))
static void fill_data(int width, float * __restrict__ height_data, float scale) {
  for (int i = 0; i < width; i++) {
    height_data[i] = perlin1d(i);
  }
}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #3 from Witold Baryluk ---
If only the frequency is updated in the inner loop:

    frequency *= 2.131f;

function fill_data is vectorized:

mesh_minimal.c:34:3: optimized: loop vectorized using 64 byte vectors
mesh_minimal.c:33:13: note: vectorized 1 loops in function.

However, if amplitude is also updated in the inner loop:

    amplitude *= 0.781f;

function fill_data is NOT vectorized:

mesh_minimal.c:34:3: missed: couldn't vectorize loop
mesh_minimal.c:34:3: missed: not vectorized: latch block not empty.
mesh_minimal.c:33:13: note: vectorized 0 loops in function.

Here is the code for reference:

/* line 20 */
static float perlin1d(float x) {
  float accum = 0.0;
  float frequency = 1.0;
  float amplitude = 1.0;
  for (int i = 0; i < 8; i++) {
    accum += amplitude * (sinf(x * frequency + (float)i));
    frequency *= 2.131f;
    amplitude *= 0.781f;
  }
  return accum;
}

__attribute__((noinline))
/* line 33 */
static void fill_data(int width, float * __restrict__ height_data, float scale) {
/* line 34 */
  for (int i = 0; i < width; i++) {
    height_data[i] = perlin1d(i);
  }
}
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #2 from Witold Baryluk ---
Added a minimized test case that has only one outer loop, with f and h removed in favor of simple inlined replacements.

Example diagnostic:

$ gcc -std=c17 -march=knm -O3 -ffast-math -fassociative-math -ftree-vectorizer-verbose=2 -fopt-info-vec-all -ggdb -Wall mesh_minimal.c -o mesh_minimal_knm -lm
mesh_minimal.c:34:3: missed: couldn't vectorize loop
mesh_minimal.c:34:3: missed: not vectorized: latch block not empty.
mesh_minimal.c:33:13: note: vectorized 0 loops in function.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #1 from Witold Baryluk ---
Created attachment 47052
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47052&action=edit
Minimized test case