[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.3                         |9.4

--- Comment #24 from Jakub Jelinek ---
GCC 9.3.0 has been released, adjusting target milestone.
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

Jeffrey A. Law changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
                 CC|                            |law at redhat dot com
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 --- Comment #23 from Andreas Schwab --- gcc.target/aarch64/pr93565.c fails with -mabi=ilp32.
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #22 from Segher Boessenkool ---
                    T0        T2        T3        T4
alpha          6049096  100.020%  100.018%  100.001%
arc            4019384  100.000%   99.989%   99.989%
arm           14177962   99.999%   99.999%  100.000%
arm64         12968466   99.938%   99.888%  100.000%
c6x            2346077  100.000%  100.001%  100.001%
csky           3332454  100.000%  100.000%  100.000%
h8300          1165256   99.999%   99.999%  100.000%
i386          11227764  100.001%  100.001%  100.000%
ia64          18088488  100.003%  100.007%  100.003%
m68k           3716871  100.000%  100.000%  100.000%
microblaze     4935181  100.000%   99.995%   99.995%
mips           8407681  100.000%  100.000%  100.000%
mips64         6979344   99.987%   99.981%   99.981%
nds32          4471023  100.000%   99.994%   99.994%
nios2          3643253  100.000%   99.999%   99.999%
openrisc       4182200  100.000%   99.995%   99.995%
parisc         7710095  100.001%  100.001%  100.000%
parisc64       8676725  100.003%  100.002%   99.999%
powerpc       10603859  100.000%  100.000%  100.001%
powerpc64     17552718  100.007%  100.005%   99.999%
powerpc64le   17552718  100.007%  100.005%   99.999%
riscv32        1546172  100.000%   99.999%   99.999%
riscv64        6623170  100.010%  100.005%  100.001%
s390          13103095   99.995%   99.993%   99.999%
sh             3216555   99.999%   99.992%   99.993%
shnommu        1611176   99.999%   99.999%  100.000%
sparc              436  100.000%   99.997%   99.997%
sparc64        6751939  100.000%   99.997%   99.997%
x86_64        19681173  100.000%  100.000%  100.000%
xtensa               0         0         0         0

T0 is orig, T2 is only sign_extend, T3 is sign_extend and no same sources, T4 is only no same source (SET_SRC).

The diffs look less than they are, this is just size, and with 2-2 combines size does not change (on many targets).

For powerpc, *all* the changes these patches make hurt code quality (they change two parallel insns to two sequential ones).

I think combine should just do what it already does, and you should add some peepholes, or maybe some new pass?
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #21 from Segher Boessenkool ---
(In reply to Andrew Pinski from comment #20)
> (In reply to Segher Boessenkool from comment #18)
> > Created attachment 47841 [details]
> > Patch to treat sign_extend as is_just_move
>
> Do you think zero_extend should maybe be treated as such too?

Maybe?

> What about truncate (MIPS64 uses truncate a lot as moves)?

Also maybe.

Test runs take a little over three hours (vs. less than two hours in GCC 8 times). I'll experiment with those things, but first the bigger issue (parallel of two identical SETs, just with different dest).
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #20 from Andrew Pinski ---
(In reply to Segher Boessenkool from comment #18)
> Created attachment 47841 [details]
> Patch to treat sign_extend as is_just_move

Do you think zero_extend should maybe be treated as such too?
What about truncate (MIPS64 uses truncate a lot as moves)?
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #19 from Segher Boessenkool ---
With that above patch, I get (T0 is original, T2 is with patch, these are file sizes of a Linux build, mostly defconfig):

                    T0        T2
alpha          6049096  100.020%
arc            4019384  100.000%
arm           14177962   99.999%
arm64         12968466   99.938%
c6x            2346077  100.000%
csky           3332454  100.000%
h8300          1165256   99.999%
i386          11227764  100.001%
ia64          18088488  100.003%
m68k           3716871  100.000%
microblaze     4935181  100.000%
mips           8407681  100.000%
mips64         6979344   99.987%
nds32          4471023  100.000%
nios2          3643253  100.000%
openrisc       4182200  100.000%
parisc         7710095  100.001%
parisc64       8676725  100.003%
powerpc       10603859  100.000%
powerpc64     17552718  100.007%
powerpc64le   17552718  100.007%
riscv32        1546172  100.000%
riscv64        6623170  100.010%
s390          13103095   99.995%
sh             3216555   99.999%
shnommu        1611176   99.999%
sparc              436  100.000%
sparc64        6751939  100.000%
x86_64        19681173  100.000%
xtensa               0         0

I think I'll commit this, but let's look at the original problem first as well.
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #18 from Segher Boessenkool ---
Created attachment 47841
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47841&action=edit
Patch to treat sign_extend as is_just_move
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 --- Comment #17 from Segher Boessenkool --- That above commit is just a spec special, it doesn't solve anything else, imnsho.
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 --- Comment #16 from Segher Boessenkool --- It is not the same cost. It reduces the path length.
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #15 from CVS Commits ---
The master branch has been updated by Wilco Dijkstra :

https://gcc.gnu.org/g:5bfc8303ffe2d86e938d45f13cd99a39469dac4f

commit r10-6598-g5bfc8303ffe2d86e938d45f13cd99a39469dac4f
Author: Wilco Dijkstra
Date:   Wed Feb 12 18:23:21 2020 +

    [AArch64] Set ctz rtx_cost (PR93565)

    Combine sometimes behaves oddly and duplicates ctz to remove an
    unnecessary sign extension.  Avoid this by setting the cost for ctz
    to be higher than that of a simple ALU instruction.  Deepsjeng
    performance improves by ~0.6%.

    gcc/
        PR rtl-optimization/93565
        * config/aarch64/aarch64.c (aarch64_rtx_costs): Add CTZ costs.

    testsuite/
        PR rtl-optimization/93565
        * gcc.target/aarch64/pr93565.c: New test.
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #14 from Richard Earnshaw ---
With the simpler test case we see

Breakpoint 1, try_combine (i3=0x764d33c0, i2=0x764d3380, i1=0x0, i0=0x0, new_direct_jump_p=0x7fffd850, last_combined_insn=0x764d33c0)
    at /home/rearnsha/gnusrc/gcc-cross/master/gcc/combine.c:2671
2671    {
(nil)
(nil)
(insn 7 4 8 2 (set (reg/v:SI 96 [ a ])
        (and:SI (reg:SI 104)
            (const_int 14 [0xe]))) "/tmp/t2.c":3:7 535 {andsi3}
     (expr_list:REG_DEAD (reg:SI 104)
        (nil)))
(insn 8 7 10 2 (set (reg:DI 99 [ a ])
        (sign_extend:DI (reg/v:SI 96 [ a ]))) "/tmp/t2.c":4:13 106 {*extendsidi2_aarch64}
     (nil))

And then the resulting insn that we try is

(parallel [
        (set (reg:DI 99 [ a ])
            (and:DI (subreg:DI (reg:SI 104) 0)
                (const_int 14 [0xe])))
        (set (reg/v:SI 96 [ a ])
            (and:SI (reg:SI 104)
                (const_int 14 [0xe])))
    ])

This insn doesn't match, and so we try to break it into two set insns and try those individually. But that gives us back insn 7 again and then a new insn based on the (now extended) lifetime of r104.

It seems to me that if we are doing this sort of transformation, then it's only likely to be profitable if the cost of the really new insn is strictly cheaper than what we had before. Being the same cost is not enough in this case.
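
The difference between the two acceptance rules can be sketched in a few lines of C. This is only an illustrative model: the function names are hypothetical and it is not the code in combine.c. The inputs are the costs from the dump in comment #10 (original 4 + 4 = 8, replacement 4 + 4 = 8).

```c
#include <stdbool.h>

/* Hypothetical model of the rule the dump in comment #10 is consistent
   with: a replacement is allowed as long as it is not more expensive.  */
bool accept_if_not_worse (int old_cost, int new_cost)
{
  return new_cost <= old_cost;
}

/* Hypothetical model of the rule suggested in comment #14: a replacement
   is allowed only when it is strictly cheaper.  */
bool accept_if_strictly_cheaper (int old_cost, int new_cost)
{
  return new_cost < old_cost;
}
```

With costs 8 and 8 the first rule accepts the 2-2 combination and the second rejects it. As comment #16 points out, equal insn cost does not capture everything (the accepted form shortens the dependency path), so this sketch should not be read as settling which rule is better.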
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #13 from Segher Boessenkool ---
nonzero_bits is not reliable. We also cannot really do what you propose here, all of this is done for *every* combination.

We currently generate

(set (reg/v:SI 96 [ a ])
    (and:SI (reg:SI 104)
        (const_int 14 [0xe])))
(set (reg:DI 99 [ a ])
    (and:DI (subreg:DI (reg:SI 104) 0)
        (const_int 14 [0xe])))

If we can somehow see the first one is just the lowpart subreg of the second, we can handle it the same as the first case.
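
The lowpart relationship between those two SETs can be illustrated at the C level. This is a sketch with hypothetical helper names, not GCC code; the paradoxical subreg is modelled as any 64-bit value whose low 32 bits equal r104, since its upper bits are undefined.

```c
#include <stdint.h>

/* Models (and:SI (reg:SI 104) (const_int 14)).  */
uint32_t and_si (uint32_t r104)
{
  return r104 & 14u;
}

/* Models (and:DI (subreg:DI (reg:SI 104) 0) (const_int 14)).  The
   argument stands for the widened register, upper bits arbitrary.  */
uint64_t and_di (uint64_t r104_wide)
{
  return r104_wide & 14u;
}

/* Models taking the lowpart (low 32 bits) of a DImode value.  */
uint32_t lowpart_si (uint64_t x)
{
  return (uint32_t) x;
}
```

For every widened input, `lowpart_si (and_di (w))` equals `and_si ((uint32_t) w)`, which is the property that would let the SImode SET be expressed as a lowpart of the DImode one.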
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #12 from Jakub Jelinek ---
(In reply to Segher Boessenkool from comment #11)
> (The original problem I have an idea for -- don't generate a parallel of
> two SETs with equal SET_SRC -- but that doesn't handle the new case).

For the new case, nonzero_bits should find out that the sign_extension is the same thing as zero_extension and it would be best to just do a single and e.g. on a paradoxical subreg of the source and a pseudo copy.
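
Why the two extensions coincide here can be seen from a small C sketch (helper names are hypothetical, not GCC functions): after masking with 14 only bits 1 to 3 can be set, so the sign bit of the 32-bit value is always zero.

```c
#include <stdint.h>

/* Sign-extend the masked 32-bit value to 64 bits, as insn 8 does.  */
int64_t masked_sign_extend (int32_t y)
{
  int32_t a = y & 14;
  return (int64_t) a;
}

/* Zero-extend the same masked value to 64 bits.  */
int64_t masked_zero_extend (int32_t y)
{
  uint32_t a = (uint32_t) y & 14u;
  return (int64_t) (uint64_t) a;
}
```

Both functions return the same value for every `y`, including negative inputs, because the mask clears bit 31 before the extension happens.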
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565 --- Comment #11 from Segher Boessenkool --- (The original problem I have an idea for -- don't generate a parallel of two SETs with equal SET_SRC -- but that doesn't handle the new case).
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #10 from Segher Boessenkool ---
One of the first things combine tries is

Trying 7 -> 8:
    7: r96:SI=r104:SI&0xe
      REG_DEAD r104:SI
    8: r99:DI=sign_extend(r96:SI)
...
Successfully matched this instruction:
(set (reg/v:SI 96 [ a ])
    (and:SI (reg:SI 104)
        (const_int 14 [0xe])))
Successfully matched this instruction:
(set (reg:DI 99 [ a ])
    (and:DI (subreg:DI (reg:SI 104) 0)
        (const_int 14 [0xe])))
allowing combination of insns 7 and 8
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
modifying insn i2     7: r96:SI=r104:SI&0xe
deferring rescan insn with uid = 7.
modifying insn i3     8: r99:DI=r104:SI#0&0xe
      REG_DEAD r104:SI
deferring rescan insn with uid = 8.

Since combine is a greedy optimisation, what it ends up with depends on the order it tries things in. Any local minimum it finds can prevent it from finding a more global minimum. In that sense, this is not a regression.

How do you propose we could generate better code for this case? Without regressing everything else.
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2020-02-11
                 CC|                            |jakub at gcc dot gnu.org
   Target Milestone|---                         |9.3
     Ever confirmed|0                           |1

--- Comment #9 from Jakub Jelinek ---
The #c8 testcase on x86_64-linux -O2 regressed with r9-2064-gc4c5ad1d6d1e1e1fe7a1c2b3bb097cc269dc7306
[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

Wilco changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |segher at kernel dot crashing.org
            Summary|Combine duplicates count    |[9/10 regression] Combine
                   |trailing zero instructions  |duplicates instructions

--- Comment #8 from Wilco ---
Here is a much simpler example:

void f (int *p, int y)
{
  int a = y & 14;
  *p = a | p[a];
}

Trunk and GCC9.1 for x64:

        mov     eax, esi
        and     esi, 14
        and     eax, 14
        or      eax, DWORD PTR [rdi+rsi*4]
        mov     DWORD PTR [rdi], eax
        ret

and AArch64:

        and     x2, x1, 14
        and     w1, w1, 14
        ldr     w2, [x0, x2, lsl 2]
        orr     w1, w2, w1
        str     w1, [x0]
        ret

However GCC8.2 does:

        and     w1, w1, 14
        ldr     w2, [x0, w1, sxtw 2]
        orr     w2, w2, w1
        str     w2, [x0]
        ret

So it is a 9 regression...
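
To make the duplication easier to see at the source level, here is a rough C rendering of the two code shapes. `f` is the testcase from comment #8; `f_duplicated` is a hypothetical hand-written equivalent of the GCC 9 output, in which the AND is computed twice (once in 64 bits for the index, once in 32 bits for the OR). It only illustrates the shape of the generated code and is not compiler output.

```c
/* Testcase from comment #8: one AND, reused as index and as operand.  */
void f (int *p, int y)
{
  int a = y & 14;
  *p = a | p[a];
}

/* Hypothetical model of the GCC 9 code: the AND is done twice.  */
void f_duplicated (int *p, int y)
{
  long idx = (long) (y & 14);   /* 64-bit copy, used only for addressing */
  int a = y & 14;               /* 32-bit copy, used for the OR */
  *p = a | p[idx];
}
```

Both functions store the same value; the difference is only the extra AND instruction and the longer lifetime of the incoming `y`.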