[Bug tree-optimization/88492] SLP optimization generates ugly code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

--- Comment #9 from Andrew Pinski ---
I noticed that once I add V4QI and V2HI support to the aarch64 backend, this
code gets even worse.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

Andrew Pinski changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           See Also|                              |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99412

--- Comment #8 from Andrew Pinski ---
Similar to PR 99412.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

--- Comment #7 from Tamar Christina ---
(In reply to ptomsich from comment #6)
> With the current master, the test case generates (with -mcpu=neoverse-n1):
> which contrasts with LLVM13 (with -mcpu=neoverse-n1):
>
> test_slp:                               // @test_slp
>         .cfi_startproc
> // %bb.0:                               // %entry
>         ldr     q0, [x0]
>         movi    v1.16b, #1
>         movi    v2.2d, #
>         udot    v2.4s, v0.16b, v1.16b
>         addv    s0, v2.4s
>         fmov    w0, s0
>         ret
> .Lfunc_end0:
>         .size   test_slp, .Lfunc_end0-test_slp
>
> or (LLVM13 w/o the mcpu-option):
>
>         .type   test_slp,@function
> test_slp:                               // @test_slp
>         .cfi_startproc
> // %bb.0:                               // %entry
>         ldr     q0, [x0]
>         ushll2  v1.8h, v0.16b, #0
>         ushll   v0.8h, v0.8b, #0
>         uaddl2  v2.4s, v0.8h, v1.8h
>         uaddl   v0.4s, v0.4h, v1.4h
>         add     v0.4s, v0.4s, v2.4s
>         addv    s0, v0.4s
>         fmov    w0, s0
>         ret
> .Lfunc_end0:
>         .size   test_slp, .Lfunc_end0-test_slp

It's definitely a neat trick, but correct me if I'm wrong: it's only possible
because addition is commutative.

Clang has simply reordered the loads, because the loop is very simple, just

  for( int i = 0; i < 4; i++, b += 4 )
  {
    tmp[i][0] = b[0];
    tmp[i][1] = b[1];
    tmp[i][2] = b[2];
    tmp[i][3] = b[3];
  }

which GCC also handles fine.

As Richi mentioned before

> I know the "real" code this testcase is from has actual operations
> in place of the b[N] reads; for the above, vectorization looks somewhat
> pointless given we end up decomposing the result again.

It seems a bit too narrow a focus to optimize for this particular example, as
the real code does "other" things.  i.e. both GCC and Clang fall apart with

int test_slp( unsigned char *b )
{
    unsigned int tmp[4][4];
    int sum = 0;
    for( int i = 0; i < 4; i++, b += 4 )
    {
        tmp[i][0] = b[0] - b[4];
        tmp[i][2] = b[1] + b[5];
        tmp[i][1] = b[2] - b[6];
        tmp[i][3] = b[3] + b[7];
    }
    for( int i = 0; i < 4; i++ )
    {
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}

which has about the same access pattern as the real code.

If you change the operations you'll notice that for other examples like

int test_slp( unsigned char *b )
{
    unsigned int tmp[4][4];
    int sum = 0;
    for( int i = 0; i < 4; i++, b += 4 )
    {
        tmp[i][0] = b[0] - b[4];
        tmp[i][2] = b[1] - b[5];
        tmp[i][1] = b[2] - b[6];
        tmp[i][3] = b[3] - b[7];
    }
    for( int i = 0; i < 4; i++ )
    {
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}

GCC handles this better (but we are let down by register allocation).

To me it seems quite unlikely that actual code would be written like that, but
I guess there could be a case to be made for trying to reassociate the loads
as well.
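To make the commutativity point concrete: for the original kernel (plain b[N]
reads) the whole computation collapses to a sum of the 16 input bytes, which is
exactly what the udot/addv sequence computes.  A minimal scalar sketch of that
equivalence (editorial illustration only; the helper name is made up):

/* Because addition is commutative and associative here, storing b[0..15]
   into tmp[4][4] and then summing every tmp entry is the same as summing
   the 16 bytes directly - the form the udot sequence implements.  */
unsigned int byte_sum_16( const unsigned char *b )
{
    unsigned int sum = 0;
    for( int i = 0; i < 16; i++ )
        sum += b[i];
    return sum;
}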
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

ptomsich at gcc dot gnu.org changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
                 CC|                              |ptomsich at gcc dot gnu.org

--- Comment #6 from ptomsich at gcc dot gnu.org ---
With the current master, the test case generates (with -mcpu=neoverse-n1):

        .arch armv8.2-a+crc+fp16+rcpc+dotprod+profile
        .file   "pr88492.c"
        .text
        .align  2
        .p2align 5,,15
        .global test_slp
        .type   test_slp, %function
test_slp:
.LFB0:
        .cfi_startproc
        ldr     q2, [x0]
        adrp    x1, .LC0
        ldr     q16, [x1, #:lo12:.LC0]
        uxtl    v4.8h, v2.8b
        uxtl2   v2.8h, v2.16b
        uxtl    v0.4s, v4.4h
        uxtl    v6.4s, v2.4h
        uxtl2   v4.4s, v4.8h
        uxtl2   v2.4s, v2.8h
        mov     v1.16b, v0.16b
        mov     v7.16b, v6.16b
        mov     v5.16b, v4.16b
        mov     v3.16b, v2.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v16.16b
        tbl     v6.16b, {v6.16b - v7.16b}, v16.16b
        tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
        tbl     v2.16b, {v2.16b - v3.16b}, v16.16b
        add     v0.4s, v0.4s, v4.4s
        add     v6.4s, v6.4s, v2.4s
        add     v0.4s, v0.4s, v6.4s
        addv    s0, v0.4s
        fmov    w0, s0
        ret
        .cfi_endproc
.LFE0:
        .size   test_slp, .-test_slp

which contrasts with LLVM13 (with -mcpu=neoverse-n1):

test_slp:                               // @test_slp
        .cfi_startproc
// %bb.0:                               // %entry
        ldr     q0, [x0]
        movi    v1.16b, #1
        movi    v2.2d, #
        udot    v2.4s, v0.16b, v1.16b
        addv    s0, v2.4s
        fmov    w0, s0
        ret
.Lfunc_end0:
        .size   test_slp, .Lfunc_end0-test_slp

or (LLVM13 w/o the mcpu-option):

        .type   test_slp,@function
test_slp:                               // @test_slp
        .cfi_startproc
// %bb.0:                               // %entry
        ldr     q0, [x0]
        ushll2  v1.8h, v0.16b, #0
        ushll   v0.8h, v0.8b, #0
        uaddl2  v2.4s, v0.8h, v1.8h
        uaddl   v0.4s, v0.4h, v1.4h
        add     v0.4s, v0.4s, v2.4s
        addv    s0, v0.4s
        fmov    w0, s0
        ret
.Lfunc_end0:
        .size   test_slp, .Lfunc_end0-test_slp
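For reference, the -mcpu=neoverse-n1 LLVM sequence corresponds roughly to the
following ACLE intrinsics (an illustrative sketch only, assuming a target with
the dot-product extension; not taken from either compiler's output):

#include <arm_neon.h>

/* Sketch of the udot-based reduction: dot each group of four bytes against
   a vector of ones into four 32-bit lanes, then reduce the lanes.
   Needs the dot-product extension (e.g. -march=armv8.2-a+dotprod).  */
unsigned int sum16_udot( const unsigned char *b )
{
    uint8x16_t in   = vld1q_u8( b );          /* ldr  q0, [x0]               */
    uint8x16_t ones = vdupq_n_u8( 1 );        /* movi v1.16b, #1             */
    uint32x4_t acc  = vdupq_n_u32( 0 );       /* movi v2.2d, #...            */
    acc = vdotq_u32( acc, in, ones );         /* udot v2.4s, v0.16b, v1.16b  */
    return vaddvq_u32( acc );                 /* addv s0, v2.4s; fmov w0, s0 */
}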
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

--- Comment #5 from Richard Biener ---
Yeah.  Again, on x86 with -mavx2 we now have right after late FRE:

   [local count: 214748371]:
  vect__1.6_64 = MEM [(unsigned char *)b_24(D)];
  vect__1.7_63 = VEC_PERM_EXPR ;
  vect__2.9_62 = [vec_unpack_lo_expr] vect__1.7_63;
  vect__2.9_59 = [vec_unpack_hi_expr] vect__1.7_63;
  vect__2.8_57 = [vec_unpack_lo_expr] vect__2.9_62;
  vect__2.8_56 = [vec_unpack_hi_expr] vect__2.9_62;
  vect__2.8_55 = [vec_unpack_lo_expr] vect__2.9_59;
  vect__2.8_54 = [vec_unpack_hi_expr] vect__2.9_59;
  MEM [(unsigned int *)] = vect__2.8_57;
  MEM [(unsigned int *) + 16B] = vect__2.8_56;
  MEM [(unsigned int *) + 32B] = vect__2.8_55;
  MEM [(unsigned int *) + 48B] = vect__2.8_54;
  vectp_b.4_65 = b_24(D) + 16;
  _8 = BIT_FIELD_REF ;
  _22 = BIT_FIELD_REF ;
  _30 = _8 + _22;
  _14 = BIT_FIELD_REF ;
  _81 = BIT_FIELD_REF ;
  _43 = _14 + _30;
  _45 = _43 + _81;
  sum_34 = (int) _45;
  _58 = BIT_FIELD_REF ;
  _38 = BIT_FIELD_REF ;
  _72 = _38 + _58;
  _68 = BIT_FIELD_REF ;
  _53 = BIT_FIELD_REF ;
  _29 = _68 + _72;
  _31 = _29 + _53;
  _7 = _31 + _45;
  sum_61 = (int) _7;
  _47 = BIT_FIELD_REF ;
  _88 = BIT_FIELD_REF ;
  _44 = _47 + _88;
  _74 = _44 + _90;
  _73 = _74 + _92;
  _83 = _7 + _73;
  sum_84 = (int) _83;
  _94 = BIT_FIELD_REF ;
  _96 = BIT_FIELD_REF ;
  _71 = _94 + _96;
  _98 = BIT_FIELD_REF ;
  _100 = BIT_FIELD_REF ;
  _70 = _71 + _98;
  _46 = _70 + _100;
  _18 = _46 + _83;
  sum_27 = (int) _18;
  tmp ={v} {CLOBBER};
  return sum_27;

I know the "real" code this testcase is from has actual operations in place of
the b[N] reads; for the above, vectorization looks somewhat pointless given we
end up decomposing the result again.

So the appropriate fix would of course be to vectorize the reduction loop (but
that hits the sign-changing reduction issue).
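The sign-changing reduction comes from the tail loop accumulating the unsigned
int tmp values into a signed int sum, so every reduction step carries an (int)
conversion (visible as the sum_NN = (int) _NN statements above).  A sketch of
an equivalent tail that keeps the reduction in one signedness (an editorial
illustration, assuming the usual wrapping unsigned-to-int conversion; the
helper name is invented):

/* Accumulate in unsigned and convert once at the end: the same result as
   the original tail on the targets discussed here, but the reduction
   itself no longer changes sign at each step.  */
static int reduce_tail( const unsigned int tmp[4][4] )
{
    unsigned int usum = 0;
    for( int i = 0; i < 4; i++ )
        usum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    return (int) usum;
}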
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

Hao Liu changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
                 CC|                              |hliu at amperecomputing dot com

--- Comment #4 from Hao Liu ---
It seems Richard Biener's patch (r272843) can remove the redundant load/store.
The ChangeLog entry for r272843 reads:

> 2019-07-01  Richard Biener
>
>         * tree-ssa-sccvn.c (class pass_fre): Add may_iterate
>         pass parameter.
>         (pass_fre::execute): Honor it.
>         * passes.def: Adjust pass_fre invocations to allow iterating,
>         add non-iterating pass_fre before late threading/dom.
>
>         * gcc.dg/tree-ssa/pr77445-2.c: Adjust.

Tested Jiangning's case with "gcc -O3"; the following code is generated:

test_slp:
.LFB0:
        .cfi_startproc
        adrp    x1, .LC0
        ldr     q0, [x0]
        ldr     q1, [x1, #:lo12:.LC0]
        tbl     v0.16b, {v0.16b}, v1.16b
        uxtl    v1.8h, v0.8b
        uxtl2   v0.8h, v0.16b
        uxtl    v4.4s, v1.4h
        uxtl    v2.4s, v0.4h
        uxtl2   v0.4s, v0.8h
        uxtl2   v1.4s, v1.8h
        dup     s21, v4.s[0]
        dup     s22, v2.s[1]
        dup     s3, v0.s[1]
        dup     s6, v1.s[0]
        dup     s23, v4.s[1]
        dup     s16, v2.s[0]
        add     v3.2s, v3.2s, v22.2s
        dup     s20, v0.s[0]
        dup     s17, v1.s[1]
        dup     s5, v0.s[2]
        fmov    w0, s3
        add     v3.2s, v6.2s, v21.2s
        dup     s19, v2.s[2]
        add     v17.2s, v17.2s, v23.2s
        dup     s7, v4.s[2]
        fmov    w1, s3
        add     v3.2s, v16.2s, v20.2s
        dup     s18, v1.s[2]
        fmov    w3, s17
        dup     s2, v2.s[3]
        fmov    w2, s3
        add     v3.2s, v5.2s, v19.2s
        dup     s0, v0.s[3]
        dup     s4, v4.s[3]
        add     w0, w0, w3
        dup     s1, v1.s[3]
        fmov    w3, s3
        add     v3.2s, v7.2s, v18.2s
        add     v0.2s, v2.2s, v0.2s
        add     w1, w1, w2
        add     w0, w0, w1
        fmov    w2, s3
        add     w3, w3, w2
        fmov    w2, s0
        add     v0.2s, v1.2s, v4.2s
        add     w0, w0, w3
        fmov    w1, s0
        add     w1, w2, w1
        add     w0, w0, w1
        ret

Although SLP still generates SIMD code, it looks much better than the previous
code with memory loads/stores.  Performance is expected to be better as there
is no redundant load/store.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

Tamar Christina changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
                 CC|                              |tnfchris at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org |tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina ---
I'll be taking a look at this one as a part of GCC 10 as well.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

Richard Biener changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
           Keywords|                              |missed-optimization
                 CC|                              |rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener ---
IIRC we have a duplicate for this.  The issue is that the SLP vectorizer
doesn't handle reductions (not implemented), and thus the vector results need
to be decomposed for the scalar reduction tail.  On x86 with -mavx2 we get

        vmovdqu (%rdi), %xmm0
        vpshufb .LC0(%rip), %xmm0, %xmm0
        vpmovzxbw       %xmm0, %xmm1
        vpsrldq $8, %xmm0, %xmm0
        vpmovzxwd       %xmm1, %xmm2
        vpsrldq $8, %xmm1, %xmm1
        vpmovzxbw       %xmm0, %xmm0
        vpmovzxwd       %xmm1, %xmm1
        vmovaps %xmm2, -72(%rsp)
        movl    -68(%rsp), %eax
        vmovaps %xmm1, -56(%rsp)
        vpmovzxwd       %xmm0, %xmm1
        vpsrldq $8, %xmm0, %xmm0
        addl    -52(%rsp), %eax
        vpmovzxwd       %xmm0, %xmm0
        vmovaps %xmm1, -40(%rsp)
        movl    -56(%rsp), %edx
        addl    -36(%rsp), %eax
        vmovaps %xmm0, -24(%rsp)
        addl    -72(%rsp), %edx
        addl    -20(%rsp), %eax
        addl    -40(%rsp), %edx
        addl    -24(%rsp), %edx
        addl    %edx, %eax
        movl    -48(%rsp), %edx
        addl    -64(%rsp), %edx
        addl    -32(%rsp), %edx
        addl    -16(%rsp), %edx
        addl    %edx, %eax
        movl    -44(%rsp), %edx
        addl    -60(%rsp), %edx
        addl    -28(%rsp), %edx
        addl    -12(%rsp), %edx
        addl    %edx, %eax
        ret

The main issue of course is that we fail to elide the stack temporary.
Re-running FRE after loop opts might help here, but of course SLP
vectorization handling the reduction would be best (though the tail loop is
structured badly, not matching up with the head one).

Whether vectorizing this specific testcase's head loop is profitable or not is
questionable on its own of course (but you can easily make it so and still get
similarly ugly code in the tail).
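For reference, the testcase behind the assembly above is along the lines of the
following (reconstructed from the loops quoted in comment #7, so treat the
exact form as an approximation of Jiangning's original rather than a verbatim
copy):

int test_slp( unsigned char *b )
{
    unsigned int tmp[4][4];
    int sum = 0;
    /* head loop: SLP-vectorized, widening 16 bytes to unsigned int */
    for( int i = 0; i < 4; i++, b += 4 )
    {
        tmp[i][0] = b[0];
        tmp[i][1] = b[1];
        tmp[i][2] = b[2];
        tmp[i][3] = b[3];
    }
    /* tail loop: scalar reduction that forces the vectors to be decomposed */
    for( int i = 0; i < 4; i++ )
    {
        sum += tmp[0][i] + tmp[1][i] + tmp[2][i] + tmp[3][i];
    }
    return sum;
}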
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88492

ktkachov at gcc dot gnu.org changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Target|                              |aarch64
             Status|UNCONFIRMED                   |NEW
   Last reconfirmed|                              |2018-12-14
                 CC|                              |ktkachov at gcc dot gnu.org
             Blocks|                              |53947
     Ever confirmed|0                             |1

--- Comment #1 from ktkachov at gcc dot gnu.org ---
Confirmed.  Don't know if the vectoriser can do anything better here, but if
not, the cost models should be disabling it.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations