[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #16 from Hongtao Liu  ---

> 
> 4952  /* See if a MEM has already been loaded with a widening operation;
> 4953 if it has, we can use a subreg of that.  Many CISC machines
> 4954 also have such operations, but this is only likely to be
> 4955 beneficial on these machines.  */

Oh, it's the pre-reload cse_insn, not the post-reload gcse.
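
For reference, a hand-written RTL sketch (illustrative only, not dump output)
of the shape that cse.cc code looks for:

  ;; insn A: the MEM is first loaded with a widening operation:
  (set (reg:SI 98) (zero_extend:SI (mem:HI (symbol_ref "a"))))
  ;; insn B: a later read of the same MEM in the narrow mode ...
  (set (reg:HI 99) (mem:HI (symbol_ref "a")))
  ;; ... which cse_insn can rewrite as a subreg of the wider register:
  (set (reg:HI 99) (subreg:HI (reg:SI 98) 0))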

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #15 from Hongtao Liu  ---
> I don't see this as problematic. IIRC, there was a discussion in the past
> that a couple (two?) of memory accesses from the same location close to
> each other can be faster (so, at -O2, not -Os) than preloading the value
> into a register first.

At least for memory in a vector mode, it is better to preload the value into
a register first.

> In contrast, the example from comment #11 already has the correct value in
> %eax, so there is no need to reload it again from memory, even in a
> narrower mode.

So the problem is why cse can't handle the same memory in a narrower mode;
maybe that's because there is a zero_extend in the first load. It looks like
cse can handle a simple wider-mode memory access.

4952  /* See if a MEM has already been loaded with a widening operation;
4953 if it has, we can use a subreg of that.  Many CISC machines
4954 also have such operations, but this is only likely to be
4955 beneficial on these machines.  */
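
For contrast, a hand-written variant of the comment-#11 testcase (illustrative
only, not from the PR) where the first access is a plain wider-mode load with
no zero_extend, i.e. the simple case cse reportedly handles:

unsigned int v;
unsigned int w;
char
bar (void)
{
  w = v;     /* plain wider-mode (SImode) load of v */
  return v;  /* narrower read of the same location */
}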

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #14 from Andrew Pinski  ---
(In reply to Uroš Bizjak from comment #13)
> (In reply to Hongtao Liu from comment #12)
> > short a;
> > short c;
> > short d;
> > void
> > foo (short b, short f)
> > {
> >   c = b + a;
> >   d = f + a;
> > }
> > 
> > foo(short, short):
> > addw    a(%rip), %di
> > addw    a(%rip), %si
> > movw    %di, c(%rip)
> > movw    %si, d(%rip)
> > ret
> > 
> > This one has been bad since gcc 10.1, and there's no subreg involved. The
> > problem is that if the operand is used by more than one insn, and they all
> > support a separate m constraint, mem_cost is quite small (just 1; the reg
> > move cost is 2), and this makes RA more inclined to propagate the memory
> > across insns. I guess RA assumes the separate m means the insn only
> > supports memory_operand?
> 
> I don't see this as problematic. IIRC, there was a discussion in the past
> that a couple (two?) of memory accesses from the same location close to
> each other can be faster (so, at -O2, not -Os) than preloading the value
> into a register first.

Someone just filed a similar issue to the above testcase (the one in comment
#12) as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114688 :).

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #13 from Uroš Bizjak  ---
(In reply to Hongtao Liu from comment #12)
> short a;
> short c;
> short d;
> void
> foo (short b, short f)
> {
>   c = b + a;
>   d = f + a;
> }
> 
> foo(short, short):
> addw    a(%rip), %di
> addw    a(%rip), %si
> movw    %di, c(%rip)
> movw    %si, d(%rip)
> ret
> 
> This one has been bad since gcc 10.1, and there's no subreg involved. The
> problem is that if the operand is used by more than one insn, and they all
> support a separate m constraint, mem_cost is quite small (just 1; the reg
> move cost is 2), and this makes RA more inclined to propagate the memory
> across insns. I guess RA assumes the separate m means the insn only
> supports memory_operand?

I don't see this as problematic. IIRC, there was a discussion in the past that
a couple (two?) of memory accesses from the same location close to each other
can be faster (so, at -O2, not -Os) than preloading the value into a register
first.

In contrast, the example from comment #11 already has the correct value in
%eax, so there is no need to reload it again from memory, even in a narrower
mode.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-11 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #12 from Hongtao Liu  ---
short a;
short c;
short d;
void
foo (short b, short f)
{
  c = b + a;
  d = f + a;
}

foo(short, short):
addw    a(%rip), %di
addw    a(%rip), %si
movw    %di, c(%rip)
movw    %si, d(%rip)
ret

This one has been bad since gcc 10.1, and there's no subreg involved. The
problem is that if the operand is used by more than one insn, and they all
support a separate m constraint, mem_cost is quite small (just 1; the reg move
cost is 2), and this makes RA more inclined to propagate the memory across
insns. I guess RA assumes the separate m means the insn only supports
memory_operand?

 961  if (op_class == NO_REGS)
 962/* Although we don't need insn to reload from
 963   memory, still accessing memory is usually more
 964   expensive than a register.  */
 965pp->mem_cost = frequency;
 966  else
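
A rough sketch of the comparison this produces for the testcase above
(illustrative numbers only, ignoring frequency scaling):

  keep a in memory:      2 uses * mem_cost 1             = 2
  preload a into a reg:  1 load + reg-move cost 2 + ...  >= 3

so the memory variant looks cheaper, and RA propagates the memory reference
into both addw insns.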

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #11 from Hongtao Liu  ---
unsigned v;
long long v2;
char foo ()
{
v2 = v;
return v;
}

This is related to *movqi_internal, and the codegen has been worse since gcc 8.1:

foo:
movl    v(%rip), %eax
movq    %rax, v2(%rip)
movzbl  v(%rip), %eax
ret
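
Presumably the expected code is just the following (hand-written; movl
already zero-extends v into %rax for the v2 store, and the low byte of %eax
can serve as the char return value):

foo:
        movl    v(%rip), %eax
        movq    %rax, v2(%rip)
        ret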

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #10 from Uroš Bizjak  ---
(In reply to Hongtao Liu from comment #5)
> > My experience is that the memory cost for an operand with rm or separate
> > r, m is different, which impacts the RA decision.
> > 
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html
> 
> Change operands[1] alternative 2 from m -> rm, and then RA makes the
> perfect decision.

Yes, I can confirm this oddity:

movl    v1(%rip), %edx  # 5  [c=6 l=6]  *zero_extendsidi2/3
movq    %rdx, v2(%rip)  # 16 [c=4 l=7]  *movdi_internal/5
movq    %rdx, %rax      # 18 [c=4 l=3]  *movdi_internal/3
ret                     # 21 [c=0 l=1]  simple_return_internal

But even here there is room for improvement: the last move could be eliminated
by allocating %eax instead of %edx in the first instruction.
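
I.e., presumably something like this (hand-written sketch, not compiler
output):

        movl    v1(%rip), %eax  # load v1, zero-extending into %rax
        movq    %rax, v2(%rip)  # store the zero-extended value to v2
        ret                     # the return value is already in %ax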

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #9 from Hongtao Liu  ---

> 
> It looks like different modes of memory reads confuse LRA into not CSEing
> the read.
> 
> IMO, if the preloaded value is later accessed in different modes, LRA should
> leave it. Alternatively, LRA should CSE memory accesses in different modes.

(insn 7 6 12 2 (set (reg:HI 101 [ _5 ])
        (subreg:HI (reg:SI 98 [ v1.0_1 ]) 0)) "test.c":6:12 86 {*movhi_internal}
     (expr_list:REG_DEAD (reg:SI 98 [ v1.0_1 ])
        (nil)))

Maybe we should reduce the cost of a simple move instruction (with a subreg?)
when calculating total_cost, since such a move will probably be eliminated by
later RTL optimizations.
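
Purely as a hypothetical sketch of that idea (the predicate and the discount
below are made up for illustration, not actual GCC code):

  /* Hypothetical: discount the recorded cost of a plain reg-to-reg move
     whose source is a subreg, since later RTL passes will probably
     eliminate such a move anyway.  */
  if (is_simple_subreg_move (insn))      /* made-up predicate */
    total_cost -= subreg_move_discount;  /* made-up discount */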

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #8 from Uroš Bizjak  ---
BTW: The reason for the original change:

 (define_insn "*movhi_internal"
-  [(set (match_operand:HI 0 "nonimmediate_operand" "=r,r ,r ,m ,*k,*k ,*r,*m,*k,?r,?v,*v,*v,*m")
-   (match_operand:HI 1 "general_operand"  "r ,rn,rm,rn,*r,*km,*k,*k,CBC,v, r, v, m, v"))]
+  [(set (match_operand:HI 0 "nonimmediate_operand"
+"=r,r,r,m ,*k,*k ,r ,m ,*k ,?r,?*v,*v,*v,*v,m")
+   (match_operand:HI 1 "general_operand"
+"r ,n,m,rn,r ,*km,*k,*k,CBC,*v,r  ,C ,*v,m ,*v"))]

was that (r,r) overrides (r,rn) and (r,rm), so the latter two can be changed
(without introducing any side effect) to (r,n) and (r,m), since a (reg,reg)
operand pair is always matched by the (r,r) alternative. The different
treatment of the two changed alternatives is confusing, to say the least.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #7 from Uroš Bizjak  ---
(In reply to Hongtao Liu from comment #5)
> > My experience is that the memory cost for an operand with rm or separate
> > r, m is different, which impacts the RA decision.
> > 
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html
> 
> Change operands[1] alternative 2 from m -> rm, and then RA makes the
> perfect decision.

Oh, you are also the author of the above patch ;)

Can you please take the issue from here and perhaps review other x86 patterns
for suboptimal constraints? I was always under the impression that rm and
separate "r,m" are treated in the same way...

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #6 from Uroš Bizjak  ---
LRA starts with this:

    5: r98:SI=[`v1']
      REG_EQUIV [`v1']
    6: [`v2']=zero_extend(r98:SI)
    7: r101:HI=r98:SI#0
      REG_DEAD r98:SI
   12: ax:HI=r101:HI
      REG_DEAD r101:HI
   13: use ax:HI

then decides that:

  Removing equiv init insn 5 (freq=1000)
    5: r98:SI=[`v1']
      REG_EQUIV [`v1']

and substitutes all subsequent uses of r98 with a memory access. In insn 6, we
have:

(mem/c:SI (symbol_ref:DI ("v1")))

while in insn 7 we have:

(mem/c:HI (symbol_ref:DI ("v1")))

It looks like different modes of memory reads confuse LRA into not CSEing the
read.

IMO, if the preloaded value is later accessed in different modes, LRA should
leave it. Alternatively, LRA should CSE memory accesses in different modes.

Cc LRA expert ... oh, he already is in the loop.

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #5 from Hongtao Liu  ---
> My experience is that the memory cost for an operand with rm or separate
> r, m is different, which impacts the RA decision.
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html

Change operands[1] alternative 2 from m -> rm, and then RA makes the perfect
decision.
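
I.e., presumably a change along these lines (a sketch reconstructed from the
constraint strings quoted in comment #3 below; the actual patch may differ):

 (define_insn "*movhi_internal"
   [(set (match_operand:HI 0 "nonimmediate_operand"
 "=r,r,r,m ,*k,*k ,r ,m ,*k ,?r,?*v,*Yv,*v,*v,jm,m")
        (match_operand:HI 1 "general_operand"
-"r ,n,m,rn,r ,*km,*k,*k,CBC,*v,r  ,C  ,*v,m ,*x,*v"))]
+"r ,n,rm,rn,r ,*km,*k,*k,CBC,*v,r  ,C  ,*v,m ,*x,*v"))]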

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

Hongtao Liu  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #4 from Hongtao Liu  ---
(In reply to Uroš Bizjak from comment #3)
> (In reply to Jakub Jelinek from comment #2)
> > This changed with r12-5584-gca5667e867252db3c8642ee90f55427149cd92b6
> 
> Strange, if I revert the constraints to the previous setting with: 
> 
> --cut here--
> diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
> index 10ae3113ae8..262dd25a8e0 100644
> --- a/gcc/config/i386/i386.md
> +++ b/gcc/config/i386/i386.md
> @@ -2870,9 +2870,9 @@ (define_peephole2
>  
>  (define_insn "*movhi_internal"
>[(set (match_operand:HI 0 "nonimmediate_operand"
> -"=r,r,r,m ,*k,*k ,r ,m ,*k ,?r,?*v,*Yv,*v,*v,jm,m")
> +"=r,r,r,m ,*k,*k ,*r ,*m ,*k ,?r,?v,*Yv,*v,*v,*jm,*m")
> (match_operand:HI 1 "general_operand"
> -"r ,n,m,rn,r ,*km,*k,*k,CBC,*v,r  ,C  ,*v,m ,*x,*v"))]
> +"r ,n,m,rn,*r ,*km,*k,*k,CBC,v,r  ,C  ,v,m ,x,v"))]
>"!(MEM_P (operands[0]) && MEM_P (operands[1]))
> && ix86_hardreg_mov_ok (operands[0], operands[1])"
>  {
> --cut here--
> 
> I still get:
> 
> movl    v1(%rip), %eax  # 6  [c=6 l=6]  *zero_extendsidi2/3
> movq    %rax, v2(%rip)  # 16 [c=4 l=7]  *movdi_internal/5
> movzwl  v1(%rip), %eax  # 7  [c=5 l=7]  *movhi_internal/2

My experience is that the memory cost for an operand with rm or separate r, m
is different, which impacts the RA decision.

https://gcc.gnu.org/pipermail/gcc-patches/2022-May/595573.html

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-10 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

--- Comment #3 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #2)
> This changed with r12-5584-gca5667e867252db3c8642ee90f55427149cd92b6

Strange, if I revert the constraints to the previous setting with: 

--cut here--
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 10ae3113ae8..262dd25a8e0 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -2870,9 +2870,9 @@ (define_peephole2

 (define_insn "*movhi_internal"
   [(set (match_operand:HI 0 "nonimmediate_operand"
-"=r,r,r,m ,*k,*k ,r ,m ,*k ,?r,?*v,*Yv,*v,*v,jm,m")
+"=r,r,r,m ,*k,*k ,*r ,*m ,*k ,?r,?v,*Yv,*v,*v,*jm,*m")
(match_operand:HI 1 "general_operand"
-"r ,n,m,rn,r ,*km,*k,*k,CBC,*v,r  ,C  ,*v,m ,*x,*v"))]
+"r ,n,m,rn,*r ,*km,*k,*k,CBC,v,r  ,C  ,v,m ,x,v"))]
   "!(MEM_P (operands[0]) && MEM_P (operands[1]))
&& ix86_hardreg_mov_ok (operands[0], operands[1])"
 {
--cut here--

I still get:

movl    v1(%rip), %eax  # 6  [c=6 l=6]  *zero_extendsidi2/3
movq    %rax, v2(%rip)  # 16 [c=4 l=7]  *movdi_internal/5
movzwl  v1(%rip), %eax  # 7  [c=5 l=7]  *movhi_internal/2

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-08 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

Jakub Jelinek  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek  ---
This changed with r12-5584-gca5667e867252db3c8642ee90f55427149cd92b6

[Bug target/114591] [12/13/14 Regression] register allocators introduce an extra load operation since gcc-12

2024-04-04 Thread law at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114591

Jeffrey A. Law  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at gcc dot gnu.org
           Priority|P3                          |P2