[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #16 from Hongtao Liu ---
> 4952   /* See if a MEM has already been loaded with a widening operation;
> 4953      if it has, we can use a subreg of that.  Many CISC machines
> 4954      also have such operations, but this is only likely to be
> 4955      beneficial on these machines.  */

Oh, it's the pre-reload cse_insn, not post-reload gcse.
--- Comment #15 from Hongtao Liu ---
> I don't see this as problematic. IIRC, there was a discussion in the past
> that a couple (two?) memory accesses from the same location close to each
> other can be faster (so, -O2, not -Os) than preloading the value to the
> register first.

At least for memory in a vector mode, it's better to preload the value into a
register first.

> In contrast, the example from Comment #11 already has the correct value
> in %eax, so there is no need to reload it again from memory, even in a
> narrower mode.

So the problem is why cse can't handle the same memory in a narrower mode;
maybe it's because of the zero_extend in the first load. It looks like cse can
handle the simple wider-mode memory case:

4952   /* See if a MEM has already been loaded with a widening operation;
4953      if it has, we can use a subreg of that.  Many CISC machines
4954      also have such operations, but this is only likely to be
4955      beneficial on these machines.  */
--- Comment #14 from Andrew Pinski ---
(In reply to Uroš Bizjak from comment #13)
> (In reply to Hongtao Liu from comment #12)
> > short a;
> > short c;
> > short d;
> > void
> > foo (short b, short f)
> > {
> >   c = b + a;
> >   d = f + a;
> > }
> >
> > foo(short, short):
> >         addw    a(%rip), %di
> >         addw    a(%rip), %si
> >         movw    %di, c(%rip)
> >         movw    %si, d(%rip)
> >         ret
> >
> > this one is bad since gcc10.1 and there's no subreg. The problem is if the
> > operand is used by more than 1 insn, and they all support a separate m
> > constraint, mem_cost is quite small (just 1, reg move cost is 2), and this
> > makes RA more inclined to propagate memory across insns. I guess RA
> > assumes the separate m means the insn only supports memory_operand?
>
> I don't see this as problematic. IIRC, there was a discussion in the past
> that a couple (two?) memory accesses from the same location close to each
> other can be faster (so, -O2, not -Os) than preloading the value to the
> register first.

Someone just filed a similar issue to the above testcase (the one in comment
#12) as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114688 :).
--- Comment #13 from Uroš Bizjak ---
(In reply to Hongtao Liu from comment #12)
> short a;
> short c;
> short d;
> void
> foo (short b, short f)
> {
>   c = b + a;
>   d = f + a;
> }
>
> foo(short, short):
>         addw    a(%rip), %di
>         addw    a(%rip), %si
>         movw    %di, c(%rip)
>         movw    %si, d(%rip)
>         ret
>
> this one is bad since gcc10.1 and there's no subreg. The problem is if the
> operand is used by more than 1 insn, and they all support a separate m
> constraint, mem_cost is quite small (just 1, reg move cost is 2), and this
> makes RA more inclined to propagate memory across insns. I guess RA assumes
> the separate m means the insn only supports memory_operand?

I don't see this as problematic. IIRC, there was a discussion in the past
that a couple (two?) memory accesses from the same location close to each
other can be faster (so, -O2, not -Os) than preloading the value to the
register first.

In contrast, the example from Comment #11 already has the correct value in
%eax, so there is no need to reload it again from memory, even in a narrower
mode.
--- Comment #12 from Hongtao Liu ---
short a;
short c;
short d;
void
foo (short b, short f)
{
  c = b + a;
  d = f + a;
}

foo(short, short):
        addw    a(%rip), %di
        addw    a(%rip), %si
        movw    %di, c(%rip)
        movw    %si, d(%rip)
        ret

This one has been bad since gcc 10.1, and there's no subreg involved. The
problem is that if the operand is used by more than one insn, and they all
support a separate m constraint, mem_cost is quite small (just 1; the reg move
cost is 2), and this makes the RA more inclined to propagate memory across
insns. I guess the RA assumes the separate m means the insn only supports
memory_operand?

 961               if (op_class == NO_REGS)
 962                 /* Although we don't need insn to reload from
 963                    memory, still accessing memory is usually more
 964                    expensive than a register.  */
 965                 pp->mem_cost = frequency;
 966               else
--- Comment #11 from Hongtao Liu ---
unsigned v;
long long v2;
char
foo ()
{
  v2 = v;
  return v;
}

This is related to *movqi_internal, and the codegen has been worse since
gcc 8.1:

foo:
        movl    v(%rip), %eax
        movq    %rax, v2(%rip)
        movzbl  v(%rip), %eax
        ret
--- Comment #10 from Uroš Bizjak ---
(In reply to Hongtao Liu from comment #5)
> > My experience is memory cost for the operand with rm or separate r, m is
> > different which impacts RA decision.
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html
>
> Change operands[1] alternative 2 from m -> rm, then RA makes perfect
> decision.

Yes, I can confirm this oddity:

        movl    v1(%rip), %edx  # 5     [c=6 l=6]  *zero_extendsidi2/3
        movq    %rdx, v2(%rip)  # 16    [c=4 l=7]  *movdi_internal/5
        movq    %rdx, %rax      # 18    [c=4 l=3]  *movdi_internal/3
        ret                     # 21    [c=0 l=1]  simple_return_internal

But even here there is room for improvement: the last move could be
eliminated by allocating %eax in the first instruction.
--- Comment #9 from Hongtao Liu ---
> > It looks like the different modes of the memory read confuse LRA into not
> > CSEing the read.
>
> IMO, if the preloaded value is later accessed in different modes, LRA
> should leave it. Alternatively, LRA should CSE memory accesses in different
> modes.

(insn 7 6 12 2 (set (reg:HI 101 [ _5 ])
        (subreg:HI (reg:SI 98 [ v1.0_1 ]) 0)) "test.c":6:12 86 {*movhi_internal}
     (expr_list:REG_DEAD (reg:SI 98 [ v1.0_1 ])
        (nil)))

Maybe we should reduce the cost of a simple move instruction (with a subreg?)
when calculating total_cost, since it will probably be eliminated by later
RTL optimization.
--- Comment #8 from Uroš Bizjak ---
BTW: The reason for the original change:

 (define_insn "*movhi_internal"
-  [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r ,r ,m ,*k,*k ,*r,*m,*k,?r,?v,*v,*v,*m")
-       (match_operand:HI 1 "general_operand"      "r ,rn,rm,rn,*r,*km,*k,*k,CBC,v, r, v, m, v"))]
+  [(set (match_operand:HI 0 "nonimmediate_operand"
+    "=r,r,r,m ,*k,*k ,r ,m ,*k ,?r,?*v,*v,*v,*v,m")
+       (match_operand:HI 1 "general_operand"
+    "r ,n,m,rn,r ,*km,*k,*k,CBC,*v,r ,C ,*v,m ,*v"))]

was that (r,r) overrides (r,rn) and (r,rm), so the latter two can be changed
(without introducing any side effect) to (r,n) and (r,m), since (reg,reg) is
always matched by the (r,r) alternative. That the two changed alternatives
are now treated differently is confusing, to say the least.
--- Comment #7 from Uroš Bizjak ---
(In reply to Hongtao Liu from comment #5)
> > My experience is memory cost for the operand with rm or separate r, m is
> > different which impacts RA decision.
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html
>
> Change operands[1] alternative 2 from m -> rm, then RA makes perfect
> decision.

Oh, you are also the author of the above patch ;) Can you please take the
issue from here and perhaps review other x86 patterns for suboptimal
constraints? I was always under the impression that rm and a separate "r,m"
are treated in the same way...
--- Comment #6 from Uroš Bizjak ---
LRA starts with this:

    5: r98:SI=[`v1']
      REG_EQUIV [`v1']
    6: [`v2']=zero_extend(r98:SI)
    7: r101:HI=r98:SI#0
      REG_DEAD r98:SI
   12: ax:HI=r101:HI
      REG_DEAD r101:HI
   13: use ax:HI

then decides that:

         Removing equiv init insn 5 (freq=1000)
    5: r98:SI=[`v1']
      REG_EQUIV [`v1']

and substitutes all follow-up uses of r98 with a memory access. In insn 6 we
have:

(mem/c:SI (symbol_ref:DI ("v1")))

while in insn 7 we have:

(mem/c:HI (symbol_ref:DI ("v1")))

It looks like the different modes of the memory read confuse LRA into not
CSEing the read. IMO, if the preloaded value is later accessed in different
modes, LRA should leave the preload alone. Alternatively, LRA should CSE
memory accesses in different modes.

Cc the LRA expert ... oh, he already is in the loop.
--- Comment #5 from Hongtao Liu ---
> My experience is memory cost for the operand with rm or separate r, m is
> different which impacts RA decision.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html

Change operands[1] alternative 2 from m -> rm, then the RA makes the perfect
decision.
Hongtao Liu changed:

           What    |Removed                     |Added
------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #4 from Hongtao Liu ---
(In reply to Uroš Bizjak from comment #3)
> (In reply to Jakub Jelinek from comment #2)
> > This changed with r12-5584-gca5667e867252db3c8642ee90f55427149cd92b6
>
> Strange, if I revert the constraints to the previous setting with:
>
> --cut here--
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 10ae3113ae8..262dd25a8e0 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -2870,9 +2870,9 @@ (define_peephole2
>
>  (define_insn "*movhi_internal"
>    [(set (match_operand:HI 0 "nonimmediate_operand"
> -    "=r,r,r,m ,*k,*k ,r ,m ,*k ,?r,?*v,*Yv,*v,*v,jm,m")
> +    "=r,r,r,m ,*k,*k ,*r ,*m ,*k ,?r,?v,*Yv,*v,*v,*jm,*m")
>        (match_operand:HI 1 "general_operand"
> -    "r ,n,m,rn,r ,*km,*k,*k,CBC,*v,r ,C ,*v,m ,*x,*v"))]
> +    "r ,n,m,rn,*r ,*km,*k,*k,CBC,v,r ,C ,v,m ,x,v"))]
>    "!(MEM_P (operands[0]) && MEM_P (operands[1]))
>     && ix86_hardreg_mov_ok (operands[0], operands[1])"
>  {
> --cut here--
>
> I still get:
>
>         movl    v1(%rip), %eax  # 6  [c=6 l=6]  *zero_extendsidi2/3
>         movq    %rax, v2(%rip)  # 16 [c=4 l=7]  *movdi_internal/5
>         movzwl  v1(%rip), %eax  # 7  [c=5 l=7]  *movhi_internal/2

My experience is that the memory cost for an operand with rm versus separate
r, m alternatives is different, which impacts the RA decision.

https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html
--- Comment #3 from Uroš Bizjak ---
(In reply to Jakub Jelinek from comment #2)
> This changed with r12-5584-gca5667e867252db3c8642ee90f55427149cd92b6

Strange, if I revert the constraints to the previous setting with:

--cut here--
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 10ae3113ae8..262dd25a8e0 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -2870,9 +2870,9 @@ (define_peephole2

 (define_insn "*movhi_internal"
   [(set (match_operand:HI 0 "nonimmediate_operand"
-    "=r,r,r,m ,*k,*k ,r ,m ,*k ,?r,?*v,*Yv,*v,*v,jm,m")
+    "=r,r,r,m ,*k,*k ,*r ,*m ,*k ,?r,?v,*Yv,*v,*v,*jm,*m")
       (match_operand:HI 1 "general_operand"
-    "r ,n,m,rn,r ,*km,*k,*k,CBC,*v,r ,C ,*v,m ,*x,*v"))]
+    "r ,n,m,rn,*r ,*km,*k,*k,CBC,v,r ,C ,v,m ,x,v"))]
   "!(MEM_P (operands[0]) && MEM_P (operands[1]))
    && ix86_hardreg_mov_ok (operands[0], operands[1])"
 {
--cut here--

I still get:

        movl    v1(%rip), %eax  # 6     [c=6 l=6]  *zero_extendsidi2/3
        movq    %rax, v2(%rip)  # 16    [c=4 l=7]  *movdi_internal/5
        movzwl  v1(%rip), %eax  # 7     [c=5 l=7]  *movhi_internal/2
Jakub Jelinek changed:

           What    |Removed |Added
----------------------------------------------------
                 CC|        |jakub at gcc dot gnu.org,
                   |        |uros at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek ---
This changed with r12-5584-gca5667e867252db3c8642ee90f55427149cd92b6
Jeffrey A. Law changed:

           What    |Removed |Added
----------------------------------------------------
                 CC|        |law at gcc dot gnu.org
           Priority|P3      |P2