[Bug target/17108] Store with update not generated for a simple loop

2019-04-17 segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=17108

--- Comment #9 from Segher Boessenkool  ---
Author: segher
Date: Wed Apr 17 09:45:57 2019
New Revision: 270407

URL: https://gcc.gnu.org/viewcvs?rev=270407&root=gcc&view=rev
Log:
rs6000: Improve the load/store-with-update patterns (PR17108)

Many of these patterns only worked in 32-bit mode, and some only worked
in 64-bit mode.  This patch makes these use Pmode, fixing the PR.  On
the other hand, the stack updates have to use the same mode for the
stack pointer as for the value stored, so let's simplify that a bit.

Many of these patterns pass the wrong mode to
avoiding_indexed_address_p (it should be the mode of the datum
accessed, not the mode of the pointer).

Finally, I merge some patterns into one (using iterators).


PR target/17108
* config/rs6000/rs6000.c (rs6000_split_multireg_move): Adjust pattern
name.
(rs6000_emit_allocate_stack_1): Simplify condition.  Adjust pattern
name.
* config/rs6000/rs6000.md (bits): Add entries for SF and DF.
(*movdi_update1): Use Pmode.
(movdi__update): Fix argument to avoiding_indexed_address_p.
(movdi__update_stack): Rename to ...
(movdi_update_stack): ... this.  Fix comment.  Change condition. Don't
use Pmode.
(*movsi_update1): Use Pmode.
(*movsi_update2): Use Pmode.
(movsi_update): Rename to ...
(movsi__update): ... this.  Use Pmode.
(movsi_update_stack): Fix condition.
(*movhi_update1): Use Pmode.  Fix argument to
avoiding_indexed_address_p.
(*movhi_update2): Ditto.
(*movhi_update3): Ditto.
(*movhi_update4): Ditto.
(*movqi_update1): Ditto.
(*movqi_update2): Ditto.
(*movqi_update3): Ditto.
(*movsf_update1, *movdf_update1): Merge, rename to...
(*mov_update1): This.  Use Pmode.  Fix argument to
avoiding_indexed_address_p.  Add "size" attribute.
(*movsf_update2, *movdf_update2): Merge, rename to...
(*mov_update2): This.  Ditto.
(*movsf_update3): Use Pmode.  Fix argument to
avoiding_indexed_address_p.
(*movsf_update4): Ditto.
(allocate_stack): Simplify condition.  Adjust pattern names.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/rs6000/rs6000.c
trunk/gcc/config/rs6000/rs6000.md

[Bug target/17108] Store with update not generated for a simple loop

2019-04-12 segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=17108

Segher Boessenkool  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |segher at gcc dot gnu.org

--- Comment #8 from Segher Boessenkool  ---
We currently generate the following (for -O2 -m64; -O3 unrolls it completely, see comment 7):

	li 9,8
	mtctr 9
	.p2align 4,,15
.L2:
	stfs 1,0(3)
	addi 3,3,4
	bdnz .L2
	blr

and for -m32 we get:

	li 9,8
	addi 3,3,-4
	mtctr 9
	.p2align 4,,15
.L2:
	stfsu 1,4(3)
	bdnz .L2
	blr




The difference is partly due to the selected -mcpu=, but that does not fully
explain it.
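To make the two listings above concrete, here is a small C model of just the instructions involved, following the Power ISA definitions: stfs stores to (RA|0) + D and leaves RA alone, stfsu stores to (RA) + D and then writes that address back into RA, and addi computes (RA|0) + SI. The struct machine type, the helper names and the base address used in the check are hypothetical; byte order, faults and everything else in the ISA are ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MEM_SIZE = 64 };

/* Hypothetical toy machine: byte-addressed memory plus 32 GPRs.
   Addresses are plain offsets into mem[], so a biased value such as
   "r3 - 4" is just a number and never becomes an out-of-bounds pointer. */
struct machine {
  uint8_t mem[MEM_SIZE];
  int64_t gpr[32];
};

/* (RA|0): register 0 reads as the constant 0 in these forms. */
static int64_t ra_or_zero(const struct machine *m, int ra)
{
  return ra == 0 ? 0 : m->gpr[ra];
}

/* Store a 4-byte float at effective address ea (host byte order). */
static void store_single(struct machine *m, int64_t ea, float v)
{
  assert(ea >= 0 && ea + 4 <= MEM_SIZE);
  memcpy(&m->mem[ea], &v, sizeof v);
}

/* stfs FRS,D(RA): EA = (RA|0) + D; RA is not modified. */
static void stfs(struct machine *m, float frs, int64_t d, int ra)
{
  store_single(m, ra_or_zero(m, ra) + d, frs);
}

/* stfsu FRS,D(RA): EA = (RA) + D; store; then RA = EA. */
static void stfsu(struct machine *m, float frs, int64_t d, int ra)
{
  int64_t ea = m->gpr[ra] + d;
  store_single(m, ea, frs);
  m->gpr[ra] = ea;
}

/* addi RT,RA,SI: RT = (RA|0) + SI. */
static void addi(struct machine *m, int rt, int ra, int64_t si)
{
  m->gpr[rt] = ra_or_zero(m, ra) + si;
}

/* -m64 listing: stfs 1,0(3) / addi 3,3,4, eight times (CTR = 8). */
static void loop_m64(struct machine *m, float f1)
{
  for (int ctr = 8; ctr > 0; ctr--) {
    stfs(m, f1, 0, 3);
    addi(m, 3, 3, 4);
  }
}

/* -m32 listing: addi 3,3,-4 once, then stfsu 1,4(3) eight times. */
static void loop_m32(struct machine *m, float f1)
{
  addi(m, 3, 3, -4);
  for (int ctr = 8; ctr > 0; ctr--)
    stfsu(m, f1, 4, 3);
}
```

Both loops leave the same eight values in memory. The -m32 loop needs one instruction per element instead of two, at the price of biasing r3 by -4 before the loop, so it ends with r3 at base + 28 rather than base + 32.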

The GIMPLE passes (probably ivopts) have decided to do a pre_inc here; all the
differences are at the RTL level.  The one exception is -mcpu=power9, where the
GIMPLE passes did not choose a pre_inc.

In a case where it works as expected (-O2 -m32 -mcpu=power4), the auto_inc_dec
pass does not help, because of rtx_cost issues:

starting bb 3
   11: [r122:SI]=r127:SF
   11: [r122:SI]=r127:SF
found mem(11) *(r[122]+0)
   10: r122:SI=r122:SI+0x4
   10: r122:SI=r122:SI+0x4
found pre inc(10) r[122]+=4
   11: [r122:SI]=r127:SF
found mem(11) *(r[122]+0)
trying SIMPLE_PRE_INC
cost failure old=16 new=408

(I have a patch for that).



But then combine comes along and does:

Trying 10 -> 11:
   10: r122:SI=r122:SI+0x4
   11: [r122:SI]=r127:SF
Successfully matched this instruction:
(parallel [
        (set (mem:SF (plus:SI (reg:SI 122 [ ivtmp.10 ])
                    (const_int 4 [0x4])) [1 MEM[base: _17, offset: 0B]+0 S4 A32])
            (reg/v:SF 127 [ d ]))
        (set (reg:SI 122 [ ivtmp.10 ])
            (plus:SI (reg:SI 122 [ ivtmp.10 ])
                (const_int 4 [0x4])))
    ])
allowing combination of insns 10 and 11
original costs 4 + 4 = 8
replacement cost 4
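Both dumps end in a cost comparison: auto_inc_dec rejects the pre-increment because the new form costs 408 against 16, while combine accepts the merged store-with-update because 4 is no more than 4 + 4 = 8. A minimal C sketch of that decision, assuming a rule of the form "keep the change only if the new cost does not exceed the old cost"; the function names are hypothetical and this is not GCC's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified profitability rule (an assumption for illustration):
   keep a transformation only if the replacement is not more expensive
   than the instructions it replaces. */
static bool profitable(int old_cost, int new_cost)
{
  return new_cost <= old_cost;
}

/* Total cost of the insns being replaced, as in combine's
   "original costs 4 + 4 = 8" line. */
static int sum_costs(const int *costs, int n)
{
  assert(n >= 0);
  int total = 0;
  for (int i = 0; i < n; i++)
    total += costs[i];
  return total;
}
```

With the numbers from the dumps, the auto_inc_dec attempt fails (408 > 16) and the combine attempt succeeds (4 <= 8), so in this case the store-with-update only appears once combine runs, and only if a matching pattern exists.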



-m64, however, says:

Trying 10 -> 11:
   10: r122:DI=r122:DI+0x4
   11: [r122:DI]=r127:SF
Failed to match this instruction:
(parallel [
        (set (mem:SF (plus:DI (reg:DI 122 [ ivtmp.11 ])
                    (const_int 4 [0x4])) [1 MEM[base: _17, offset: 0B]+0 S4 A32])
            (reg/v:SF 127 [ d ]))
        (set (reg:DI 122 [ ivtmp.11 ])
            (plus:DI (reg:DI 122 [ ivtmp.11 ])
                (const_int 4 [0x4])))
    ])



Oh dear, we do not have the float load/store-with-update instructions for -m64.
On all modern 64-bit CPUs these are cracked, so they execute the same as a
separate addi and store, but the separate instructions cost code space.  And if
we do not want them, we should make them more expensive, not just pretend the
insns do not exist :-)
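The point above is that the update forms should be steered by cost rather than hidden. A hypothetical C sketch of that idea follows; the struct, the field values (stfsu as one instruction cracked into two internal operations, stfs plus addi as two instructions) and the cost formula are illustrative assumptions, not the actual rs6000 rtx_cost logic.

```c
#include <stdbool.h>

/* Hypothetical summary of an instruction sequence. */
struct seq {
  const char *name;
  int insns;     /* instructions in the binary, 4 bytes each */
  int ops;       /* internal operations after cracking */
  int penalty;   /* hypothetical per-CPU extra cost */
};

/* Hypothetical cost: bytes when optimizing for size, internal
   operations plus any penalty when optimizing for speed. */
static int seq_cost(const struct seq *s, bool for_size)
{
  return for_size ? 4 * s->insns : s->ops + s->penalty;
}

/* Pick the cheaper sequence; on a tie, prefer the smaller one. */
static const struct seq *choose(const struct seq *a, const struct seq *b,
                                bool for_size)
{
  int ca = seq_cost(a, for_size);
  int cb = seq_cost(b, for_size);
  if (ca != cb)
    return ca < cb ? a : b;
  return a->insns <= b->insns ? a : b;
}
```

With no penalty, the update form wins on size and ties on speed (the tie is broken by size). Adding a penalty moves speed-optimized code to the split form while size-optimized code keeps the shorter encoding, and the pattern never has to disappear.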

[Bug target/17108] Store with update not generated for a simple loop

2012-08-13 PHHargrove at lbl dot gov
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17108

Paul H. Hargrove  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |PHHargrove at lbl dot gov

--- Comment #7 from Paul H. Hargrove  2012-08-13 23:19:45 UTC ---
FWIW:
With GCC 4.8.0 20120809 the test code is fully unrolled at -O3:
.L.foo:
	stfs 1,0(3)
	stfs 1,4(3)
	stfs 1,8(3)
	stfs 1,12(3)
	stfs 1,16(3)
	stfs 1,20(3)
	stfs 1,24(3)
	stfs 1,28(3)
	blr

However, at -O2 the code shown in comment #6 is still generated (apart from
different register allocation).  This is true even when "-mupdate" is passed to
explicitly enable the store-with-update instructions.
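The test case itself is not quoted in this part of the thread. Judging from the listings (a function labeled foo, an eight-iteration CTR loop storing f1 to consecutive 4-byte slots starting at r3, and a variable d in the RTL dumps), it is presumably close to the following reconstruction; the parameter names and exact form are assumptions.

```c
/* Hypothetical reconstruction of the test case: store the float d into
   eight consecutive elements of p.  The trip count 8 matches the CTR
   value (li 9,8 / mtctr 9) in the listings. */
void foo(float *p, float d)
{
  for (int i = 0; i < 8; i++)
    p[i] = d;
}
```

Compiling something like this at -O2 with -m32 and with -m64 and comparing the assembly is one way to check whether stfsu is being generated.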


[Bug target/17108] Store with update not generated for a simple loop

2008-01-25 pinskia at gcc dot gnu dot org


--- Comment #6 from pinskia at gcc dot gnu dot org  2008-01-25 22:35 ---
No, this is still not fixed.  On the trunk as of yesterday we now get:
foo:
	li 0,8
	li 9,0
	mtctr 0
	.p2align 3,,7
.L2:
	stfsx 1,3,9
	addi 9,9,4
	bdnz .L2
	blr


In fact, I think this is worse than what was done before, but oh well.  We
still don't generate the store-with-update instruction.
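For comparison with the stfsu form, here is a self-contained C model of this loop. stfsx stores to (RA|0) + (RB) without updating either register, so the loop needs a second register (r9) and an extra addi on every iteration. The model, its names and its instruction counter are hypothetical; the counter covers only the modeled instructions, leaving out the CTR setup (li and mtctr) and the bdnz, which are the same in both versions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MEM_SIZE = 64 };

/* Hypothetical toy machine: byte memory, GPRs, and a counter of
   executed (modeled) instructions. */
struct machine {
  uint8_t mem[MEM_SIZE];
  int64_t gpr[32];
  int executed;
};

/* (RA|0): register 0 reads as the constant 0. */
static int64_t ra_or_zero(const struct machine *m, int ra)
{
  return ra == 0 ? 0 : m->gpr[ra];
}

static void store_single(struct machine *m, int64_t ea, float v)
{
  assert(ea >= 0 && ea + 4 <= MEM_SIZE);
  memcpy(&m->mem[ea], &v, sizeof v);
}

/* stfsx FRS,RA,RB: EA = (RA|0) + (RB); neither register is updated. */
static void stfsx(struct machine *m, float frs, int ra, int rb)
{
  store_single(m, ra_or_zero(m, ra) + m->gpr[rb], frs);
  m->executed++;
}

/* addi RT,RA,SI: RT = (RA|0) + SI; with RA = 0 this is "li RT,SI". */
static void addi(struct machine *m, int rt, int ra, int64_t si)
{
  m->gpr[rt] = ra_or_zero(m, ra) + si;
  m->executed++;
}

/* stfsu FRS,D(RA): EA = (RA) + D; store; then RA = EA. */
static void stfsu(struct machine *m, float frs, int64_t d, int ra)
{
  int64_t ea = m->gpr[ra] + d;
  store_single(m, ea, frs);
  m->gpr[ra] = ea;
  m->executed++;
}

/* The loop above: li 9,0, then stfsx 1,3,9 / addi 9,9,4 eight times. */
static void loop_indexed(struct machine *m, float f1)
{
  addi(m, 9, 0, 0);
  for (int ctr = 8; ctr > 0; ctr--) {
    stfsx(m, f1, 3, 9);
    addi(m, 9, 9, 4);
  }
}

/* The desired loop: addi 3,3,-4, then stfsu 1,4(3) eight times. */
static void loop_update(struct machine *m, float f1)
{
  addi(m, 3, 3, -4);
  for (int ctr = 8; ctr > 0; ctr--)
    stfsu(m, f1, 4, 3);
}
```

With eight elements, this model counts 17 instructions for the indexed loop and 9 for the update form, and the indexed loop also occupies r9 while leaving r3 unchanged.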


-- 

pinskia at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW
   Last reconfirmed|2007-07-02 21:40:02         |2008-01-25 22:35:16
               date|                            |
            Summary|Missed opportunity for      |Store with update not
                   |strength reduction          |generated for a simple loop


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17108