[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-03-12 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

Jakub Jelinek  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|9.3                         |9.4

--- Comment #24 from Jakub Jelinek  ---
GCC 9.3.0 has been released, adjusting target milestone.

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-25 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

Jeffrey A. Law  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
                 CC|                            |law at redhat dot com

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-15 Thread sch...@linux-m68k.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #23 from Andreas Schwab  ---
gcc.target/aarch64/pr93565.c fails with -mabi=ilp32.

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-14 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #22 from Segher Boessenkool  ---
                   T0        T2        T3        T4
      alpha   6049096  100.020%  100.018%  100.001%
        arc   4019384  100.000%   99.989%   99.989%
        arm  14177962   99.999%   99.999%  100.000%
      arm64  12968466   99.938%   99.888%  100.000%
        c6x   2346077  100.000%  100.001%  100.001%
       csky   3332454  100.000%  100.000%  100.000%
      h8300   1165256   99.999%   99.999%  100.000%
       i386  11227764  100.001%  100.001%  100.000%
       ia64  18088488  100.003%  100.007%  100.003%
       m68k   3716871  100.000%  100.000%  100.000%
 microblaze   4935181  100.000%   99.995%   99.995%
       mips   8407681  100.000%  100.000%  100.000%
     mips64   6979344   99.987%   99.981%   99.981%
      nds32   4471023  100.000%   99.994%   99.994%
      nios2   3643253  100.000%   99.999%   99.999%
   openrisc   4182200  100.000%   99.995%   99.995%
     parisc   7710095  100.001%  100.001%  100.000%
   parisc64   8676725  100.003%  100.002%   99.999%
    powerpc  10603859  100.000%  100.000%  100.001%
  powerpc64  17552718  100.007%  100.005%   99.999%
powerpc64le  17552718  100.007%  100.005%   99.999%
    riscv32   1546172  100.000%   99.999%   99.999%
    riscv64   6623170  100.010%  100.005%  100.001%
       s390  13103095   99.995%   99.993%   99.999%
         sh   3216555   99.999%   99.992%   99.993%
    shnommu   1611176   99.999%   99.999%  100.000%
      sparc       436  100.000%   99.997%   99.997%
    sparc64   6751939  100.000%   99.997%   99.997%
     x86_64  19681173  100.000%  100.000%  100.000%
     xtensa         0         0         0         0

T0 is orig, T2 is only sign_extend, T3 is sign_extend and no same sources,
T4 is only no same source (SET_SRC).

The differences look smaller than they are: this is just size, and with
2-2 combines size does not change (on many targets).  For powerpc, *all*
the changes these patches make hurt code quality (they change two parallel
insns to two sequential ones).

I think combine should just do what it already does, and you should add
some peepholes, or maybe some new pass?

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-13 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #21 from Segher Boessenkool  ---
(In reply to Andrew Pinski from comment #20)
> (In reply to Segher Boessenkool from comment #18)
> > Created attachment 47841 [details]
> > Patch to treat sign_extend as is_just_move
> 
> Do you think zero_extend should maybe be treated as such too?

Maybe?

> What about truncate (MIPS64 uses truncate a lot as moves)?

Also maybe.

Test runs take a little over three hours (vs. less than two hours
in GCC 8 times).  I'll experiment with those things, but first the
bigger issue (parallel of two identical SETs, just with different
dest).

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-13 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #20 from Andrew Pinski  ---
(In reply to Segher Boessenkool from comment #18)
> Created attachment 47841 [details]
> Patch to treat sign_extend as is_just_move

Do you think zero_extend should maybe be treated as such too?  What about
truncate (MIPS64 uses truncate a lot as moves)?

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-13 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #19 from Segher Boessenkool  ---
With that above patch, I get (T0 is original, T2 is with patch, these are
file sizes of a Linux build, mostly defconfig):

                   T0        T2
      alpha   6049096  100.020%
        arc   4019384  100.000%
        arm  14177962   99.999%
      arm64  12968466   99.938%
        c6x   2346077  100.000%
       csky   3332454  100.000%
      h8300   1165256   99.999%
       i386  11227764  100.001%
       ia64  18088488  100.003%
       m68k   3716871  100.000%
 microblaze   4935181  100.000%
       mips   8407681  100.000%
     mips64   6979344   99.987%
      nds32   4471023  100.000%
      nios2   3643253  100.000%
   openrisc   4182200  100.000%
     parisc   7710095  100.001%
   parisc64   8676725  100.003%
    powerpc  10603859  100.000%
  powerpc64  17552718  100.007%
powerpc64le  17552718  100.007%
    riscv32   1546172  100.000%
    riscv64   6623170  100.010%
       s390  13103095   99.995%
         sh   3216555   99.999%
    shnommu   1611176   99.999%
      sparc       436  100.000%
    sparc64   6751939  100.000%
     x86_64  19681173  100.000%
     xtensa         0         0

I think I'll commit this, but let's look at the original problem first as well.

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-13 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #18 from Segher Boessenkool  ---
Created attachment 47841
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47841&action=edit
Patch to treat sign_extend as is_just_move

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-12 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #17 from Segher Boessenkool  ---
That above commit is just a spec special, it doesn't solve anything else,
imnsho.

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-12 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #16 from Segher Boessenkool  ---
It is not the same cost.  It reduces the path length.

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-12 Thread cvs-commit at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #15 from CVS Commits  ---
The master branch has been updated by Wilco Dijkstra :

https://gcc.gnu.org/g:5bfc8303ffe2d86e938d45f13cd99a39469dac4f

commit r10-6598-g5bfc8303ffe2d86e938d45f13cd99a39469dac4f
Author: Wilco Dijkstra 
Date:   Wed Feb 12 18:23:21 2020 +

[AArch64] Set ctz rtx_cost (PR93565)

Combine sometimes behaves oddly and duplicates ctz to remove an unnecessary
sign extension.  Avoid this by setting the cost for ctz to be higher than
that of a simple ALU instruction.  Deepsjeng performance improves by ~0.6%.

gcc/
PR rtl-optimization/93565
* config/aarch64/aarch64.c (aarch64_rtx_costs): Add CTZ costs.

testsuite/
PR rtl-optimization/93565
* gcc.target/aarch64/pr93565.c: New test.

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-12 Thread rearnsha at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #14 from Richard Earnshaw  ---
With the simpler test case we see

Breakpoint 1, try_combine (i3=0x764d33c0, i2=0x764d3380, i1=0x0,
    i0=0x0, new_direct_jump_p=0x7fffd850,
    last_combined_insn=0x764d33c0)
    at /home/rearnsha/gnusrc/gcc-cross/master/gcc/combine.c:2671
2671    {
(nil)
(nil)
(insn 7 4 8 2 (set (reg/v:SI 96 [ a ])
        (and:SI (reg:SI 104)
            (const_int 14 [0xe]))) "/tmp/t2.c":3:7 535 {andsi3}
     (expr_list:REG_DEAD (reg:SI 104)
        (nil)))
(insn 8 7 10 2 (set (reg:DI 99 [ a ])
        (sign_extend:DI (reg/v:SI 96 [ a ]))) "/tmp/t2.c":4:13 106 {*extendsidi2_aarch64}
     (nil))


And then the resulting insn that we try is

(parallel [
        (set (reg:DI 99 [ a ])
            (and:DI (subreg:DI (reg:SI 104) 0)
                (const_int 14 [0xe])))
        (set (reg/v:SI 96 [ a ])
            (and:SI (reg:SI 104)
                (const_int 14 [0xe])))
    ])

This insn doesn't match, and so we try to break it into two SET insns and try
those individually.  But that gives us back insn 7 again and then a new insn
based on the (now extended) lifetime of r104.  It seems to me that if we are
doing this sort of transformation, then it's only likely to be profitable if
the cost of the really new insn is strictly cheaper than what we had before.
Being the same cost is not enough in this case.
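
The difference between the two acceptance rules can be sketched in C.  This
is only an illustration: the function names are hypothetical and are not
taken from combine.c.  The first rule is merely consistent with the dump in
comment #10, where a replacement costing 8 is allowed against an original
costing 8; the second is the stricter rule suggested above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helpers for illustration; not functions from combine.c.  */

/* Accept the replacement when it is not more expensive than the
   original.  A tie (8 vs. 8) is accepted.  */
bool
allow_if_not_worse (int old_cost, int new_cost)
{
  return new_cost <= old_cost;
}

/* Accept the replacement only when it is strictly cheaper.
   A tie (8 vs. 8) is rejected.  */
bool
allow_if_strictly_cheaper (int old_cost, int new_cost)
{
  return new_cost < old_cost;
}
```

With the costs from the dump (4 + 4 = 8 before and after), the first rule
accepts the combination and the second rejects it.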

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-11 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #13 from Segher Boessenkool  ---
nonzero_bits is not reliable.  We also cannot really do what you propose
here, all of this is done for *every* combination.

We currently generate

(set (reg/v:SI 96 [ a ])
    (and:SI (reg:SI 104)
        (const_int 14 [0xe])))
(set (reg:DI 99 [ a ])
    (and:DI (subreg:DI (reg:SI 104) 0)
        (const_int 14 [0xe])))

If we can somehow see the first one is just the lowpart subreg of the second,
we can handle it the same as the first case.
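
A rough C model of the two SETs above may help show why the first is the
lowpart of the second.  It is a sketch under assumptions: the function names
are invented for illustration, and the `junk` parameter stands in for the
unspecified upper half of the paradoxical subreg.

```c
#include <assert.h>
#include <stdint.h>

/* Models (set (reg:DI 99) (and:DI (subreg:DI (reg:SI 104) 0) 14)).
   The upper 32 bits of the wide value are arbitrary; the mask
   clears them.  */
uint64_t
and_di (uint32_t r104, uint32_t junk)
{
  uint64_t wide = ((uint64_t) junk << 32) | r104;
  return wide & 14;
}

/* Models (set (reg:SI 96) (and:SI (reg:SI 104) 14)).  */
uint32_t
and_si (uint32_t r104)
{
  return r104 & 14;
}

/* Nonzero when the SImode result equals the low 32 bits (the lowpart)
   of the DImode result.  */
int
si_is_lowpart_of_di (uint32_t r104, uint32_t junk)
{
  return (uint32_t) and_di (r104, junk) == and_si (r104);
}
```

In this model the check holds for any input, because the mask 14 only keeps
bits that both modes share.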

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-11 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #12 from Jakub Jelinek  ---
(In reply to Segher Boessenkool from comment #11)
> (The original problem I have an idea for -- don't generate a parallel of
> two SETs with equal SET_SRC -- but that doesn't handle the new case).

For the new case, nonzero_bits should find out that the sign_extension is the
same thing as zero_extension and it would be best to just do a single and e.g.
on a paradoxical subreg of the source and a pseudo copy.
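
The claim that sign extension and zero extension coincide here can be
checked with a small C sketch.  The function names are made up for this
illustration; the only fact used is that `y & 14` can never have bit 31 set.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the masked value to 64 bits.  Bits 4..31 of `a` are
   zero, so `a` is never negative.  */
int64_t
sign_extend_masked (int32_t y)
{
  int32_t a = y & 14;
  return (int64_t) a;
}

/* Zero-extend the same masked value to 64 bits.  */
uint64_t
zero_extend_masked (int32_t y)
{
  uint32_t a = (uint32_t) y & 14u;
  return (uint64_t) a;
}
```

For every `y`, both functions return the same number in the range 0..14.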

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-11 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #11 from Segher Boessenkool  ---
(The original problem I have an idea for -- don't generate a parallel of
two SETs with equal SET_SRC -- but that doesn't handle the new case).

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-11 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

--- Comment #10 from Segher Boessenkool  ---
One of the first things combine tries is

Trying 7 -> 8:
7: r96:SI=r104:SI&0xe
  REG_DEAD r104:SI
8: r99:DI=sign_extend(r96:SI)
...
Successfully matched this instruction:
(set (reg/v:SI 96 [ a ])
    (and:SI (reg:SI 104)
        (const_int 14 [0xe])))
Successfully matched this instruction:
(set (reg:DI 99 [ a ])
    (and:DI (subreg:DI (reg:SI 104) 0)
        (const_int 14 [0xe])))
allowing combination of insns 7 and 8
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
modifying insn i2 7: r96:SI=r104:SI&0xe
deferring rescan insn with uid = 7.
modifying insn i3 8: r99:DI=r104:SI#0&0xe
  REG_DEAD r104:SI
deferring rescan insn with uid = 8.

Since combine is a greedy optimisation, what it ends up with depends on the
order it tries things in.  Any local minimum it finds can prevent it from
finding a more global minimum.  In that sense, this is not a regression.

How do you propose we could generate better code for this case?  Without
regressing everything else.

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-11 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

Jakub Jelinek  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2020-02-11
                 CC|                            |jakub at gcc dot gnu.org
   Target Milestone|---                         |9.3
     Ever confirmed|0                           |1

--- Comment #9 from Jakub Jelinek  ---
The #c8 testcase on x86_64-linux -O2 regressed with
r9-2064-gc4c5ad1d6d1e1e1fe7a1c2b3bb097cc269dc7306

[Bug rtl-optimization/93565] [9/10 regression] Combine duplicates instructions

2020-02-11 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93565

Wilco  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |segher at kernel dot crashing.org
            Summary|Combine duplicates count    |[9/10 regression] Combine
                   |trailing zero instructions  |duplicates instructions

--- Comment #8 from Wilco  ---
Here is a much simpler example:

void f (int *p, int y)
{
  int a = y & 14;
  *p = a | p[a];
}

Trunk and GCC9.1 for x64:
        mov     eax, esi
        and     esi, 14
        and     eax, 14
        or      eax, DWORD PTR [rdi+rsi*4]
        mov     DWORD PTR [rdi], eax
        ret

and AArch64:
        and     x2, x1, 14
        and     w1, w1, 14
        ldr     w2, [x0, x2, lsl 2]
        orr     w1, w2, w1
        str     w1, [x0]
        ret

However GCC8.2 does:
        and     w1, w1, 14
        ldr     w2, [x0, w1, sxtw 2]
        orr     w2, w2, w1
        str     w2, [x0]
        ret

So it is a 9 regression...
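
For reference, a minimal driver for the test case above, as a sketch only.
`f` is reproduced unchanged; `run_f` and its buffer contents are hypothetical
and not part of the bug report.  It shows that `a` is used twice, once as a
32-bit value and once as an array index, which is where the second `and`
comes from in the regressed code.

```c
#include <assert.h>

/* Test case from comment #8, reproduced unchanged.  */
void f (int *p, int y)
{
  int a = y & 14;
  *p = a | p[a];
}

/* Hypothetical driver: fill a buffer with recognisable values, call f,
   and return what f stored in element 0.  The index `a` is at most 14,
   so 15 elements are enough.  */
int run_f (int y)
{
  int buf[15];
  for (int i = 0; i < 15; i++)
    buf[i] = 0x100 * i;
  f (buf, y);
  return buf[0];
}
```

For example, `run_f (6)` reads `buf[6]` (0x600) and stores `6 | 0x600`.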