[Bug tree-optimization/88492] SLP optimization generates ugly code

2024-02-26 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

--- Comment #9 from Andrew Pinski  ---
I noticed that once I add V4QI and V2HI support to the aarch64 backend, this
code gets even worse.

[Bug tree-optimization/88492] SLP optimization generates ugly code

2022-01-04 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

Andrew Pinski  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412

--- Comment #8 from Andrew Pinski  ---
Similar to PR 99412.

[Bug tree-optimization/88492] SLP optimization generates ugly code

2021-04-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

--- Comment #7 from Tamar Christina  ---
(In reply to ptomsich from comment #6)
> With the current master, the test case generates (with -mcpu=neoverse-n1):

> which contrasts with LLVM13 (with -mcpu=neoverse-n1):
> 
> test_slp:   // @test_slp
>   .cfi_startproc
> // %bb.0:   // %entry
>   ldr q0, [x0]
>   movi    v1.16b, #1
>   movi    v2.2d, #
>   udot    v2.4s, v0.16b, v1.16b
>   addv    s0, v2.4s
>   fmov    w0, s0
>   ret
> .Lfunc_end0:
>   .size   test_slp, .Lfunc_end0-test_slp
> 
> or (LLVM13 w/o the mcpu-option):
> 
>   .type   test_slp,@function
> test_slp:   // @test_slp
>   .cfi_startproc
> // %bb.0:   // %entry
>   ldr q0, [x0]
>   ushll2  v1.8h, v0.16b, #0
>   ushll   v0.8h, v0.8b, #0
>   uaddl2  v2.4s, v0.8h, v1.8h
>   uaddl   v0.4s, v0.4h, v1.4h
>   add v0.4s, v0.4s, v2.4s
>   addv    s0, v0.4s
>   fmov    w0, s0
>   ret
> .Lfunc_end0:
>   .size   test_slp, .Lfunc_end0-test_slp

It's definitely a neat trick, but correct me if I'm wrong: it's only possible
because addition is commutative.

Clang has simply reordered the loads, because the loop is simple enough to
reduce to just

for( int i = 0; i < 4; i++, b += 4 )
{
    tmp[i][0] = b[0];
    tmp[i][1] = b[1];
    tmp[i][2] = b[2];
    tmp[i][3] = b[3];
}

which GCC also handles fine.

As Richi mentioned before:

> I know the "real" code this testcase is from has actual operations
> in place of the b[N] reads; for the above, vectorization looks somewhat
> pointless given we end up decomposing the result again.

It seems too narrow a focus to optimize for this particular example, as
the real code does "other" things.

For instance, both GCC and Clang fall apart with

int test_slp( unsigned char *b )
{
    unsigned int tmp[4][4];
    int sum = 0;
    for( int i = 0; i < 4; i++, b += 4 )
    {
        tmp[i][0] = b[0] - b[4];
        tmp[i][2] = b[1] + b[5];
        tmp[i][1] = b[2] - b[6];
        tmp[i][3] = b[3] + b[7];
    }
    for( int i = 0; i < 4; i++ )
    {
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}

which has about the same access pattern as the real code.

If you change the operations, you'll notice that for other examples like

int test_slp( unsigned char *b )
{
    unsigned int tmp[4][4];
    int sum = 0;
    for( int i = 0; i < 4; i++, b += 4 )
    {
        tmp[i][0] = b[0] - b[4];
        tmp[i][2] = b[1] - b[5];
        tmp[i][1] = b[2] - b[6];
        tmp[i][3] = b[3] - b[7];
    }
    for( int i = 0; i < 4; i++ )
    {
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}

GCC handles this better (but we are let down by register allocation).

To me it seems quite unlikely that actual code would be written like that, but
I guess a case could be made for trying to reassociate the loads as well.
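
To make that concrete, here is a hand-written sketch (mine, not output from
either compiler) of what reassociating the loads could buy for the mixed
add/sub example above: since the tail loop sums every element of tmp, the
lane order of the stores is irrelevant, so the body can be rewritten with
in-order stores that map directly onto lane-wise vector operations over two
contiguous loads.

/* Hand-rewritten sketch, not compiler output: same values as the mixed
   add/sub example, but stored in order.  This is legal only because the
   tail loop sums all 16 elements, so the lane permutation cannot change
   the result.  */
int test_slp_reassoc( unsigned char *b )   /* hypothetical name */
{
    unsigned int tmp[4][4];
    int sum = 0;
    for( int i = 0; i < 4; i++, b += 4 )
    {
        tmp[i][0] = b[0] - b[4];
        tmp[i][1] = b[1] + b[5];   /* was tmp[i][2] */
        tmp[i][2] = b[2] - b[6];   /* was tmp[i][1] */
        tmp[i][3] = b[3] + b[7];
    }
    for( int i = 0; i < 4; i++ )
    {
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}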

[Bug tree-optimization/88492] SLP optimization generates ugly code

2021-04-14 Thread ptomsich at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

ptomsich at gcc dot gnu.org changed:

   What|Removed |Added

 CC||ptomsich at gcc dot gnu.org

--- Comment #6 from ptomsich at gcc dot gnu.org ---
With the current master, the test case generates (with -mcpu=neoverse-n1):

.arch armv8.2-a+crc+fp16+rcpc+dotprod+profile
.file   "pr88492.c"
.text
.align  2
.p2align 5,,15
.global test_slp
.type   test_slp, %function
test_slp:
.LFB0:
.cfi_startproc
ldr q2, [x0]
adrp    x1, .LC0
ldr q16, [x1, #:lo12:.LC0]
uxtl    v4.8h, v2.8b
uxtl2   v2.8h, v2.16b
uxtl    v0.4s, v4.4h
uxtl    v6.4s, v2.4h
uxtl2   v4.4s, v4.8h
uxtl2   v2.4s, v2.8h
mov v1.16b, v0.16b
mov v7.16b, v6.16b
mov v5.16b, v4.16b
mov v3.16b, v2.16b
tbl v0.16b, {v0.16b - v1.16b}, v16.16b
tbl v6.16b, {v6.16b - v7.16b}, v16.16b
tbl v4.16b, {v4.16b - v5.16b}, v16.16b
tbl v2.16b, {v2.16b - v3.16b}, v16.16b
add v0.4s, v0.4s, v4.4s
add v6.4s, v6.4s, v2.4s
add v0.4s, v0.4s, v6.4s
addv    s0, v0.4s
fmov    w0, s0
ret
.cfi_endproc
.LFE0:
.size   test_slp, .-test_slp

which contrasts with LLVM13 (with -mcpu=neoverse-n1):

test_slp:   // @test_slp
.cfi_startproc
// %bb.0:   // %entry
ldr q0, [x0]
movi    v1.16b, #1
movi    v2.2d, #
udot    v2.4s, v0.16b, v1.16b
addv    s0, v2.4s
fmov    w0, s0
ret
.Lfunc_end0:
.size   test_slp, .Lfunc_end0-test_slp

or (LLVM13 w/o the mcpu-option):

.type   test_slp,@function
test_slp:   // @test_slp
.cfi_startproc
// %bb.0:   // %entry
ldr q0, [x0]
ushll2  v1.8h, v0.16b, #0
ushll   v0.8h, v0.8b, #0
uaddl2  v2.4s, v0.8h, v1.8h
uaddl   v0.4s, v0.4h, v1.4h
add v0.4s, v0.4s, v2.4s
addv    s0, v0.4s
fmov    w0, s0
ret
.Lfunc_end0:
.size   test_slp, .Lfunc_end0-test_slp
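
For reference, since the kernel only permutes the 16 loaded bytes into tmp
before summing all of them (which is what the tbl/vpshufb shuffles in the
GCC output reflect), the whole function is equivalent to a plain sum of the
16 input bytes.  That is exactly what the udot-against-all-ones sequence
above computes: four partial sums of four bytes each, folded by addv.  A
scalar sketch of that equivalence (illustration only, not part of the test
case; the function name is made up):

/* Scalar equivalent of the whole kernel: summing every element of tmp is
   the same as summing the 16 loaded bytes, which udot with an all-ones
   multiplicand computes as four 4-byte partial sums that addv then folds
   into the final result.  */
int test_slp_equiv( const unsigned char *b )
{
    int sum = 0;
    for( int i = 0; i < 16; i++ )
        sum += b[i];
    return sum;
}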

[Bug tree-optimization/88492] SLP optimization generates ugly code

2019-07-12 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

--- Comment #5 from Richard Biener  ---
Yeah.  Again, on x86 with -mavx2 we now have right after late FRE:

   [local count: 214748371]:
  vect__1.6_64 = MEM  [(unsigned char *)b_24(D)];
  vect__1.7_63 = VEC_PERM_EXPR ;
  vect__2.9_62 = [vec_unpack_lo_expr] vect__1.7_63;
  vect__2.9_59 = [vec_unpack_hi_expr] vect__1.7_63;
  vect__2.8_57 = [vec_unpack_lo_expr] vect__2.9_62;
  vect__2.8_56 = [vec_unpack_hi_expr] vect__2.9_62;
  vect__2.8_55 = [vec_unpack_lo_expr] vect__2.9_59;
  vect__2.8_54 = [vec_unpack_hi_expr] vect__2.9_59;
  MEM  [(unsigned int *)] = vect__2.8_57;
  MEM  [(unsigned int *) + 16B] = vect__2.8_56;
  MEM  [(unsigned int *) + 32B] = vect__2.8_55;
  MEM  [(unsigned int *) + 48B] = vect__2.8_54;
  vectp_b.4_65 = b_24(D) + 16;
  _8 = BIT_FIELD_REF ;
  _22 = BIT_FIELD_REF ;
  _30 = _8 + _22;
  _14 = BIT_FIELD_REF ;
  _81 = BIT_FIELD_REF ;
  _43 = _14 + _30;
  _45 = _43 + _81;
  sum_34 = (int) _45;
  _58 = BIT_FIELD_REF ;
  _38 = BIT_FIELD_REF ;
  _72 = _38 + _58;
  _68 = BIT_FIELD_REF ;
  _53 = BIT_FIELD_REF ;
  _29 = _68 + _72;
  _31 = _29 + _53;
  _7 = _31 + _45;
  sum_61 = (int) _7;
  _47 = BIT_FIELD_REF ;
  _88 = BIT_FIELD_REF ;
  _44 = _47 + _88;
  _74 = _44 + _90;
  _73 = _74 + _92;
  _83 = _7 + _73;
  sum_84 = (int) _83;
  _94 = BIT_FIELD_REF ;
  _96 = BIT_FIELD_REF ;
  _71 = _94 + _96;
  _98 = BIT_FIELD_REF ;
  _100 = BIT_FIELD_REF ;
  _70 = _71 + _98;
  _46 = _70 + _100;
  _18 = _46 + _83;
  sum_27 = (int) _18;
  tmp ={v} {CLOBBER};
  return sum_27;

I know the "real" code this testcase is from has actual operations
in place of the b[N] reads; for the above, vectorization looks somewhat
pointless given we end up decomposing the result again.

So the appropriate fix would of course be to vectorize the reduction
loop (but that hits the sign-changing reduction issue).
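
To spell out the sign-changing reduction: the tail loop accumulates
unsigned int values from tmp into a signed int, so every reduction step
carries an implicit conversion inside the reduction chain.  A minimal
sketch of that shape (assuming the tail loop matches the one shown in
comment #7; the function name is made up):

/* Minimal sketch of a sign-changing reduction: the summands are
   unsigned int but the accumulator is signed int, so each += implies a
   conversion inside the reduction chain; this is the "sign-changing
   reduction issue" referred to above.  */
int reduce_tail( const unsigned int tmp[4][4] )
{
    int sum = 0;                          /* signed accumulator */
    for( int i = 0; i < 4; i++ )
        sum += tmp[0][i] + tmp[1][i]      /* unsigned arithmetic ...    */
             + tmp[2][i] + tmp[3][i];     /* ... converted to int on += */
    return sum;
}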

[Bug tree-optimization/88492] SLP optimization generates ugly code

2019-07-12 Thread hliu at amperecomputing dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

Hao Liu  changed:

   What|Removed |Added

 CC||hliu at amperecomputing dot com

--- Comment #4 from Hao Liu  ---
It seems Richard Biener's patch (r272843) can remove the redundant loads/stores.
The ChangeLog entry for r272843 reads as follows:
> 2019-07-01  Richard Biener  
>
>       * tree-ssa-sccvn.c (class pass_fre): Add may_iterate
>       pass parameter.
>       (pass_fre::execute): Honor it.
>       * passes.def: Adjust pass_fre invocations to allow iterating,
>       add non-iterating pass_fre before late threading/dom.
>
>       * gcc.dg/tree-ssa/pr77445-2.c: Adjust.

Testing Jiangning's case with "gcc -O3", the following code is generated:

  test_slp:
  .LFB0:
.cfi_startproc
adrp    x1, .LC0
ldr q0, [x0]
ldr q1, [x1, #:lo12:.LC0]
tbl v0.16b, {v0.16b}, v1.16b
uxtl    v1.8h, v0.8b
uxtl2   v0.8h, v0.16b
uxtl    v4.4s, v1.4h
uxtl    v2.4s, v0.4h
uxtl2   v0.4s, v0.8h
uxtl2   v1.4s, v1.8h
dup s21, v4.s[0]
dup s22, v2.s[1]
dup s3, v0.s[1]
dup s6, v1.s[0]
dup s23, v4.s[1]
dup s16, v2.s[0]
add v3.2s, v3.2s, v22.2s
dup s20, v0.s[0]
dup s17, v1.s[1]
dup s5, v0.s[2]
fmov    w0, s3
add v3.2s, v6.2s, v21.2s
dup s19, v2.s[2]
add v17.2s, v17.2s, v23.2s
dup s7, v4.s[2]
fmov    w1, s3
add v3.2s, v16.2s, v20.2s
dup s18, v1.s[2]
fmov    w3, s17
dup s2, v2.s[3]
fmov    w2, s3
add v3.2s, v5.2s, v19.2s
dup s0, v0.s[3]
dup s4, v4.s[3]
add w0, w0, w3
dup s1, v1.s[3]
fmov    w3, s3
add v3.2s, v7.2s, v18.2s
add v0.2s, v2.2s, v0.2s
add w1, w1, w2
add w0, w0, w1
fmov    w2, s3
add w3, w3, w2
fmov    w2, s0
add v0.2s, v1.2s, v4.2s
add w0, w0, w3
fmov    w1, s0
add w1, w2, w1
add w0, w0, w1
ret

Although SLP still generates SIMD code, it looks much better than the previous
code with the memory loads/stores.  Performance is expected to be better, as
there are no redundant loads/stores.
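
For context on what the extra FRE pass buys here (illustration only, not
code from the PR): FRE value-numbers loads against earlier stores, so a
value written to a stack temporary and read back can be rewritten to use
the stored value directly; once the reads are gone, the stores become dead
and the temporary can be eliminated.  A minimal example of that kind of
redundancy (made-up function name):

/* Minimal illustration of the redundancy FRE removes: a store to a stack
   temporary followed by a reload.  FRE replaces the reload with the
   stored value, after which dead store elimination drops the temporary.
   In the PR the same forwarding happens from the vector stores into the
   scalar element loads of tmp[].  */
int forwarded( int x )
{
    int tmp[1];
    tmp[0] = x + 1;    /* store to a stack temporary */
    return tmp[0];     /* reload; folded to x + 1 by FRE */
}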

[Bug tree-optimization/88492] SLP optimization generates ugly code

2019-04-09 Thread tnfchris at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

Tamar Christina  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
 CC||tnfchris at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina  ---
I'll be taking a look at this one as part of GCC 10 as well.

[Bug tree-optimization/88492] SLP optimization generates ugly code

2018-12-14 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 CC||rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener  ---
IIRC we have a duplicate for this.  The issue is the SLP vectorizer doesn't
handle reductions (not implemented) and thus the vector results need
to be decomposed for the scalar reduction tail.  On x86 with -mavx2 we get

vmovdqu (%rdi), %xmm0
vpshufb .LC0(%rip), %xmm0, %xmm0
vpmovzxbw   %xmm0, %xmm1
vpsrldq $8, %xmm0, %xmm0
vpmovzxwd   %xmm1, %xmm2
vpsrldq $8, %xmm1, %xmm1
vpmovzxbw   %xmm0, %xmm0
vpmovzxwd   %xmm1, %xmm1
vmovaps %xmm2, -72(%rsp)
movl    -68(%rsp), %eax
vmovaps %xmm1, -56(%rsp)
vpmovzxwd   %xmm0, %xmm1
vpsrldq $8, %xmm0, %xmm0
addl    -52(%rsp), %eax
vpmovzxwd   %xmm0, %xmm0
vmovaps %xmm1, -40(%rsp)
movl    -56(%rsp), %edx
addl    -36(%rsp), %eax
vmovaps %xmm0, -24(%rsp)
addl    -72(%rsp), %edx
addl    -20(%rsp), %eax
addl    -40(%rsp), %edx
addl    -24(%rsp), %edx
addl    %edx, %eax
movl    -48(%rsp), %edx
addl    -64(%rsp), %edx
addl    -32(%rsp), %edx
addl    -16(%rsp), %edx
addl    %edx, %eax
movl    -44(%rsp), %edx
addl    -60(%rsp), %edx
addl    -28(%rsp), %edx
addl    -12(%rsp), %edx
addl    %edx, %eax
ret

the main issue of course is that we fail to elide the stack temporary.
Re-running FRE after loop opts might help here, but of course
SLP vectorization handling the reduction would be best (though the
tail loop is structured badly, not matching up with the head one).

Whether vectorizing this specific testcase's head loop is profitable
or not is questionable on its own, of course (but you can easily make
it so and still get similarly ugly code in the tail).
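
For readers without the attachment, the test case presumably has the
following head/tail shape (a reconstruction from the variants shown in
comment #7; the real head loop may store into permuted lanes): the head
loop writes tmp one row per iteration and gets SLP-vectorized, while the
tail loop reads one column per iteration, so the vector results must be
taken apart again (here: spilled to the stack temporary) to feed the
scalar reduction.

/* Presumed shape of the test case, reconstructed from comment #7; the
   real head loop may permute the stores.  The head loop is row-wise,
   the tail loop column-wise, hence the decomposition of the vector
   results for the scalar reduction.  */
int test_slp( unsigned char *b )
{
    unsigned int tmp[4][4];
    int sum = 0;
    for( int i = 0; i < 4; i++, b += 4 )   /* head loop: row-wise stores */
    {
        tmp[i][0] = b[0];
        tmp[i][1] = b[1];
        tmp[i][2] = b[2];
        tmp[i][3] = b[3];
    }
    for( int i = 0; i < 4; i++ )           /* tail loop: column-wise sums */
    {
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}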

[Bug tree-optimization/88492] SLP optimization generates ugly code

2018-12-14 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

ktkachov at gcc dot gnu.org changed:

   What|Removed |Added

 Target||aarch64
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2018-12-14
 CC||ktkachov at gcc dot gnu.org
 Blocks||53947
 Ever confirmed|0   |1

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed.  Don't know if the vectoriser can do anything better here, but if
not, the cost models should be disabling it.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations