[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators

2019-10-17 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #9 from Witold Baryluk  ---
Indeed, passing -fno-tree-pre makes the first example vectorize.

In mesh_simple.c this corresponds to ONTHEFLY_CONSTANTS being defined and
USE_LOOP_CONSTANTS not being defined. SIMPLIFIED can be defined or not; the
loop now vectorizes in both cases.

Targeting -march=knm.

This is with #define OCTAVES 12, a compile-time constant, so the compiler fully
unrolls the innermost loop.

Without -fno-tree-pre:

1230 :
1230:   41 57                 push   %r15
1232:   62 a1 7d 40 ef c0     vpxord %zmm16,%zmm16,%zmm16
1238:   49 ba 53 ec 85 1a fe  movabs $0xc4ceb9fe1a85ec53,%r10
123f:   b9 ce c4
1242:   41 56                 push   %r14
1244:   c5 7a 10 0d f8 0d 00  vmovss 0xdf8(%rip),%xmm9  # 2044 <_IO_stdin_used+0x44>
124b:   00
124c:   62 31 7c 48 28 d0     vmovaps %zmm16,%zmm10
1252:   41 55                 push   %r13
1254:   c5 7a 10 3d ec 0d 00  vmovss 0xdec(%rip),%xmm15  # 2048 <_IO_stdin_used+0x48>
125b:   00
125c:   62 a1 7c 48 28 d0     vmovaps %zmm16,%zmm18
1262:   41 54                 push   %r12
1264:   c5 7a 10 35 e0 0d 00  vmovss 0xde0(%rip),%xmm14  # 204c <_IO_stdin_used+0x4c>
126b:   00
126c:   49 b9 cd 8c 55 ed d7  movabs $0xff51afd7ed558ccd,%r9
1273:   af 51 ff
1276:   55                    push   %rbp
1277:   c5 7a 10 2d d1 0d 00  vmovss 0xdd1(%rip),%xmm13  # 2050 <_IO_stdin_used+0x50>
127e:   00
127f:   49 be 68 66 ac 6a bf  movabs $0xfa8d7ebf6aac6668,%r14
1286:   7e 8d fa
1289:   53                    push   %rbx
128a:   c5 7a 10 25 c2 0d 00  vmovss 0xdc2(%rip),%xmm12  # 2054 <_IO_stdin_used+0x54>
1291:   00
1292:   48 89 7c 24 f8        mov    %rdi,-0x8(%rsp)
1297:   c7 44 24 f0 00 00 00  movl   $0x0,-0x10(%rsp)
129e:   00
129f:   c7 44 24 f4 00 00 00  movl   $0x0,-0xc(%rsp)
12a6:   00
12a7:   c5 7a 10 1d a9 0d 00  vmovss 0xda9(%rip),%xmm11  # 2058 <_IO_stdin_used+0x58>
12ae:   00
12af:   62 e1 7e 08 10 0d a3  vmovss 0xda3(%rip),%xmm17  # 205c <_IO_stdin_used+0x5c>
12b6:   0d 00 00
12b9:   0f 1f 80 00 00 00 00  nopl   0x0(%rax)
12c0:   48 8b 6c 24 f8        mov    -0x8(%rsp),%rbp
12c5:   31 f6                 xor    %esi,%esi
12c7:   31 db                 xor    %ebx,%ebx
12c9:   62 31 7c 48 28 c2     vmovaps %zmm18,%zmm8
12cf:   90                    nop
12d0:   8b 54 24 f0           mov    -0x10(%rsp),%edx
12d4:   45 31 e4              xor    %r12d,%r12d
12d7:   62 b1 7c 48 28 f8     vmovaps %zmm16,%zmm7
12dd:   62 c1 7c 48 28 d9     vmovaps %zmm9,%zmm19
12e3:   c5 32 11 cc           vmovss %xmm9,%xmm9,%xmm4
12e7:   eb 26                 jmp    130f
12e9:   0f 1f 80 00 00 00 00  nopl   0x0(%rax)
12f0:   c5 ba 59 c4           vmulss %xmm4,%xmm8,%xmm0
12f4:   62 f3 7d 08 0a c0 09  vrndscaless $0x9,%xmm0,%xmm0,%xmm0
12fb:   c5 fa 2c f0           vcvttss2si %xmm0,%esi
12ff:   c4 c1 5a 59 c2        vmulss %xmm10,%xmm4,%xmm0
1304:   62 f3 7d 08 0a c0 09  vrndscaless $0x9,%xmm0,%xmm0,%xmm0
130b:   c5 fa 2c d0           vcvttss2si %xmm0,%edx
130f:   4c 89 e1              mov    %r12,%rcx
1312:   62 c1 7c 48 28 e8     vmovaps %zmm8,%zmm21
1318:   48 c1 e9 21           shr    $0x21,%rcx
131c:   62 e1 7c 48 28 e4     vmovaps %zmm4,%zmm20
1322:   c5 d2 2a ea           vcvtsi2ss %edx,%xmm5,%xmm5
1326:   4c 31 e1              xor    %r12,%rcx
1329:   49 0f af ca           imul   %r10,%rcx
132d:   48 63 d2              movslq %edx,%rdx
1330:   c5 e2 2a de           vcvtsi2ss %esi,%xmm3,%xmm3
1334:   4f 8d 24 0c           lea    (%r12,%r9,1),%r12
1338:   48 69 d2 53 42 41 4e  imul   $0x4e414253,%rdx,%rdx
133f:   62 c2 55 08 9b e2     vfmsub132ss %xmm10,%xmm5,%xmm20
1345:   c4 c1 52 58 e9        vaddss %xmm9,%xmm5,%xmm5
134a:   48 8d 01              lea    (%rcx),%rax
134d:   48 c1 e8 21           shr    $0x21,%rax
1351:   62 e2 65 08 9b ec     vfmsub132ss %xmm4,%xmm3,%xmm21
1357:   48 31 c1              xor    %rax,%rcx
135a:   4c 8d ba 53 42 41 4e  lea    0x4e414253(%rdx),%r15
1361:   48 89 cf              mov    %rcx,%rdi
1364:   48 89 c8              mov    %rcx,%rax
1367:   48 81 f7 70 46 ab 58  xor    $0x58ab4670,%rdi
136e:   c4 c1 62 58 d9        vaddss %xmm9,%xmm3,%xmm3
1373:   48 c1 e8 21           shr

2019-10-17 Thread rguenth at gcc dot gnu.org

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-10-17
                 CC|                            |rguenth at gcc dot gnu.org
             Blocks|                            |53947
     Ever confirmed|0                           |1

--- Comment #8 from Richard Biener  ---
You can try -fno-tree-pre. For the original issue you mention, the problem is
that PRE works out at compile time the values computed in the first iteration,
which effectively rotates the loop, and the vectorizer is not happy with that.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

2019-10-16 Thread witold.baryluk+gcc at gmail dot com

--- Comment #7 from Witold Baryluk  ---
Online examples: https://gcc.godbolt.org/z/Nyjty3

2019-10-16 Thread witold.baryluk+gcc at gmail dot com

--- Comment #6 from Witold Baryluk  ---
I also tested clang with LLVM 10~svn374655, and it does vectorize the loop
properly, even when both the frequency and amplitude variables are updated in
every iteration.

It still doesn't inline the calls to sinf, even if I set -fno-math-errno and
the other flags from -ffast-math. My guess is that this is because there is no
hardware support for a vectorized sinf, and no vectorized software
implementation of sinf either. If I provide my own version of sinf using a
simple Taylor expansion, clang fully vectorizes the code:



  401320:   62 e1 7d 58 fe 3d 56  vpaddd 0xd56(%rip){1to16},%zmm0,%zmm23  # 402080 <_IO_stdin_used+0x80>
  401327:   0d 00 00
  40132a:   62 61 7c 48 5b c0     vcvtdq2ps %zmm0,%zmm24
  401330:   62 a1 7c 48 5b ff     vcvtdq2ps %zmm23,%zmm23
  401336:   62 f1 7c 48 10 4c 24  vmovups 0x140(%rsp),%zmm1
  40133d:   05
  40133e:   62 61 3c 40 59 d1     vmulps %zmm1,%zmm24,%zmm26
  401344:   62 61 44 40 59 f9     vmulps %zmm1,%zmm23,%zmm31
  40134a:   62 f1 7c 48 10 4c 24  vmovups 0x100(%rsp),%zmm1
  401351:   04
  401352:   62 61 3c 40 59 d9     vmulps %zmm1,%zmm24,%zmm27
  401358:   62 f1 44 40 59 c9     vmulps %zmm1,%zmm23,%zmm1
  40135e:   62 01 2c 40 59 ca     vmulps %zmm26,%zmm26,%zmm25
  401364:   62 f1 7c 48 10 54 24  vmovups 0x80(%rsp),%zmm2
  40136b:   02
  40136c:   62 61 3c 40 59 e2     vmulps %zmm2,%zmm24,%zmm28
  401372:   62 f1 44 40 59 d2     vmulps %zmm2,%zmm23,%zmm2
  401378:   62 02 25 40 ac ca     vfnmadd213ps %zmm26,%zmm27,%zmm25
  40137e:   62 f1 7c 48 10 5c 24  vmovups 0x40(%rsp),%zmm3
  401385:   01
  401386:   62 61 3c 40 59 eb     vmulps %zmm3,%zmm24,%zmm29
  40138c:   62 f1 44 40 59 db     vmulps %zmm3,%zmm23,%zmm3
  401392:   62 01 1c 40 59 d4     vmulps %zmm28,%zmm28,%zmm26
  401398:   62 01 04 40 59 df     vmulps %zmm31,%zmm31,%zmm27
  40139e:   62 02 15 40 ac d4     vfnmadd213ps %zmm28,%zmm29,%zmm26
  4013a4:   62 f1 7c 48 10 6c 24  vmovups -0x40(%rsp),%zmm5
  4013ab:   ff
  4013ac:   62 f1 3c 40 59 e5     vmulps %zmm5,%zmm24,%zmm4
  4013b2:   62 f1 44 40 59 ed     vmulps %zmm5,%zmm23,%zmm5
  4013b8:   62 61 6c 48 59 e2     vmulps %zmm2,%zmm2,%zmm28
  4013be:   62 f1 7c 48 10 7c 24  vmovups -0x80(%rsp),%zmm7
  4013c5:   fe
  4013c6:   62 f1 3c 40 59 f7     vmulps %zmm7,%zmm24,%zmm6
  4013cc:   62 f1 44 40 59 ff     vmulps %zmm7,%zmm23,%zmm7
  4013d2:   62 61 5c 48 59 ec     vmulps %zmm4,%zmm4,%zmm29
  4013d8:   62 61 54 48 59 f5     vmulps %zmm5,%zmm5,%zmm30
  4013de:   62 62 4d 48 ac ec     vfnmadd213ps %zmm4,%zmm6,%zmm29
  4013e4:   62 d1 3c 40 59 e3     vmulps %zmm11,%zmm24,%zmm4
  4013ea:   62 d1 44 40 59 f3     vmulps %zmm11,%zmm23,%zmm6
  4013f0:   62 02 75 48 ac df     vfnmadd213ps %zmm31,%zmm1,%zmm27
  4013f6:   62 d1 3c 40 59 cc     vmulps %zmm12,%zmm24,%zmm1
  4013fc:   62 41 44 40 59 fc     vmulps %zmm12,%zmm23,%zmm31
  401402:   62 71 5c 48 59 c4     vmulps %zmm4,%zmm4,%zmm8
  401408:   62 62 65 48 ac e2     vfnmadd213ps %zmm2,%zmm3,%zmm28
  40140e:   62 72 75 48 ac c4     vfnmadd213ps %zmm4,%zmm1,%zmm8
  401414:   62 d1 3c 40 59 ce     vmulps %zmm14,%zmm24,%zmm1
  40141a:   62 d1 44 40 59 d6     vmulps %zmm14,%zmm23,%zmm2
  401420:   62 62 45 48 ac f5     vfnmadd213ps %zmm5,%zmm7,%zmm30
  401426:   62 d1 3c 40 59 df     vmulps %zmm15,%zmm24,%zmm3
  40142c:   62 d1 44 40 59 e7     vmulps %zmm15,%zmm23,%zmm4
  401432:   62 f1 74 48 59 e9     vmulps %zmm1,%zmm1,%zmm5
  401438:   62 f1 4c 48 59 fe     vmulps %zmm6,%zmm6,%zmm7
  40143e:   62 71 6c 48 59 ca     vmulps %zmm2,%zmm2,%zmm9
  401444:   62 f2 65 48 ac e9     vfnmadd213ps %zmm1,%zmm3,%zmm5
  40144a:   62 b1 3c 40 59 c9     vmulps %zmm17,%zmm24,%zmm1
  401450:   62 f2 05 40 ac fe     vfnmadd213ps %zmm6,%zmm31,%zmm7
  401456:   62 b1 44 40 59 d9     vmulps %zmm17,%zmm23,%zmm3
  40145c:   62 b1 3c 40 59 f2     vmulps %zmm18,%zmm24,%zmm6
  401462:   62 21 44 40 59 fa     vmulps %zmm18,%zmm23,%zmm31
  401468:   62 72 5d 48 ac ca     vfnmadd213ps %zmm2,%zmm4,%zmm9
  40146e:   62 f1 74 48 59 d1     vmulps %zmm1,%zmm1,%zmm2
  401474:   62 f1 64 48 59 e3     vmulps %zmm3,%zmm3,%zmm4
  40147a:   62 f2 4d 48 ac d1     vfnmadd213ps %zmm1,%zmm6,%zmm2
  401480:   62 f2 05 40 ac e3     vfnmadd213ps %zmm3,%zmm31,%zmm4
  401486:   62 b1 3c 40 59 cc     vmulps %zmm20,%zmm24,%zmm1
  40148c:   62 b1 3c 40 59 dd     vmulps %zmm21,%zmm24,%zmm3
  401492:   62 f1 74 48 59 f1     vmulps %zmm1,%zmm1,%zmm6
  401498:   62 21 44 40 59 fc     vmulps %zmm20,%zmm23,%zmm31
  40149e:   62 f2 65 48 ac f1     vfnmadd213ps
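
The hand-written sinf mentioned in this comment is not shown in the thread. A minimal sketch of what such a replacement could look like follows, assuming a four-term Taylor series around zero with no range reduction; the name taylor_sinf and the number of terms are assumptions made for illustration, not the reporter's code.

```c
/* Hypothetical stand-in for a custom sinf: a four-term Taylor polynomial
   x - x^3/3! + x^5/5! - x^7/7!, evaluated in Horner form.
   There is no range reduction, so it is only accurate for small |x|. */
static inline float taylor_sinf(float x) {
  const float x2 = x * x;
  return x * (1.0f
              + x2 * (-1.0f / 6.0f
                      + x2 * (1.0f / 120.0f
                              + x2 * (-1.0f / 5040.0f))));
}
```

Because the body is straight-line multiply and add code with no library call, a compiler is free to vectorize a loop that calls it. The approximation degrades quickly as |x| grows, so it serves as a vectorization experiment rather than a drop-in replacement for sinf.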

2019-10-16 Thread witold.baryluk+gcc at gmail dot com

--- Comment #5 from Witold Baryluk  ---
As a bonus:


static float perlin1d(float x) {
  float accum = 0.0f;
  for (int i = 0; i < 8; i++) {
    accum += powf(0.781f, i) * sinf(x * powf(2.131f, i));
  }
  return accum;
}


is reported as vectorized but really isn't: it still contains calls to sinf
and expf_finite that are neither inlined nor lowered.

2019-10-16 Thread witold.baryluk+gcc at gmail dot com

--- Comment #4 from Witold Baryluk  ---
If I reduce the minimized test case even further:

only frequency update: VECTORIZED:

static float perlin1d(float x) {
  float accum = 0.0f;
  float amplitude = 1.0f;
  float frequency = 1.0f;
  for (int i = 0; i < 8; i++) {
    accum += amplitude * sinf(x * frequency);
    frequency *= 2.131f;
  }
  return accum;
}

__attribute__((noinline))
static void fill_data(int width, float * __restrict__ height_data, float scale)
{
  for (int i = 0; i < width; i++) {
    height_data[i] = perlin1d(i);
  }
}


only amplitude update: VECTORIZED:

static float perlin1d(float x) {
  float accum = 0.0f;
  float amplitude = 1.0f;
  float frequency = 1.0f;
  for (int i = 0; i < 8; i++) {
    accum += amplitude * sinf(x * frequency);
    amplitude *= 0.781f;
  }
  return accum;
}

__attribute__((noinline))
static void fill_data(int width, float * __restrict__ height_data, float scale)
{
  for (int i = 0; i < width; i++) {
    height_data[i] = perlin1d(i);
  }
}

both frequency and amplitude update: NOT VECTORIZED:

static float perlin1d(float x) {
  float accum = 0.0f;
  float amplitude = 1.0f;
  float frequency = 1.0f;
  for (int i = 0; i < 8; i++) {
    accum += amplitude * sinf(x * frequency);
    amplitude *= 0.781f;
    frequency *= 2.131f;
  }
  return accum;
}

__attribute__((noinline))
static void fill_data(int width, float * __restrict__ height_data, float scale)
{
  for (int i = 0; i < width; i++) {
    height_data[i] = perlin1d(i);
  }
}

2019-10-16 Thread witold.baryluk+gcc at gmail dot com

--- Comment #3 from Witold Baryluk  ---
If only the frequency is updated in the inner loop:

frequency *= 2.131f;

function fill_data is vectorized:

mesh_minimal.c:34:3: optimized: loop vectorized using 64 byte vectors
mesh_minimal.c:33:13: note: vectorized 1 loops in function.


However, if amplitude is updated in the inner loop:

amplitude *= 0.781f;

function fill_data is NOT vectorized.

mesh_minimal.c:34:3: missed: couldn't vectorize loop
mesh_minimal.c:34:3: missed: not vectorized: latch block not empty.
mesh_minimal.c:33:13: note: vectorized 0 loops in function.


Here is the code for reference:


/* line 20 */ static float perlin1d(float x) {
  float accum = 0.0;
  float frequency = 1.0;
  float amplitude = 1.0;
  for (int i = 0; i < 8; i++) {
    accum += amplitude * (sinf(x * frequency + (float)i));
    frequency *= 2.131f;
    amplitude *= 0.781f;
  }
  return accum;
}

__attribute__((noinline))
/* line 33 */ static void fill_data(int width, float * __restrict__ height_data, float scale) {
  /* line 34 */ for (int i = 0; i < width; i++) {
    height_data[i] = perlin1d(i);
  }
}

2019-10-16 Thread witold.baryluk+gcc at gmail dot com

--- Comment #2 from Witold Baryluk  ---
I added a minimized test case that has only one outer loop; f and h are
removed and replaced with simple inlined code.

Example diagnostic:

$ gcc -std=c17 -march=knm -O3 -ffast-math -fassociative-math \
    -ftree-vectorizer-verbose=2 -fopt-info-vec-all -ggdb -Wall \
    mesh_minimal.c -o mesh_minimal_knm -lm

mesh_minimal.c:34:3: missed: couldn't vectorize loop
mesh_minimal.c:34:3: missed: not vectorized: latch block not empty.
mesh_minimal.c:33:13: note: vectorized 0 loops in function.

2019-10-16 Thread witold.baryluk+gcc at gmail dot com

--- Comment #1 from Witold Baryluk  ---
Created attachment 47052
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47052&action=edit
Minimized test case