Re: [PATCH][v3] GIMPLE store merging pass

2016-09-13 Thread Kyrill Tkachov


On 08/09/16 16:25, Bernhard Reutner-Fischer wrote:

On 8 September 2016 at 10:31, Kyrill Tkachov
 wrote:

On 07/09/16 20:03, Bernhard Reutner-Fischer wrote:

On September 6, 2016 5:14:47 PM GMT+02:00, Kyrill Tkachov
 wrote:

Thanks, fixed all the above in my tree (will be retesting).


What about debug statements? ISTM you should skip those.
(Isn't visited reset before entry of a pass?)


Yes, I'll skip debug statements. Comments in gimple.h say that the visited
property is undefined at pass boundaries, so I'd rather initialize it here.

Right.



Maybe I missed the bikeshedding about the name but I'd have used
-fmerge-stores instead.


Wouldn't be hard to change. I can change it at any point if there's a
consensus.

Did you consider any relation to any of
https://gcc.gnu.org/PR22141
https://gcc.gnu.org/PR23684
https://gcc.gnu.org/PR47059
https://gcc.gnu.org/PR54422
and their dups
or https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg77311.html
(the -fmerge-bitfields suggestion from imgtec; maybe the testcases are
of interest)


Thanks for the pointers. I was not aware of PR23684 and have extended the patch
to handle that case as well.
The current version I'm hoping to get in only handles constant stores.
There are tricks we can do for non-constant contiguous stores as well and I hope
to implement some of them in the future, but for now I'm concentrating on
getting the initial version up to scratch.
I am now re-testing and re-benchmarking the patch after your feedback and some
other improvements I've added since the last version, and hope to send out
an updated version soon.
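
For reference, here is a minimal sketch (not taken from the patch; the names
are invented) of the kind of adjacent constant stores the pass targets, in
the spirit of the gcc.dg/store_merging_* tests:

struct bar
{
  int a;
  char b;
  char c;
  char d;
  char e;
};

void
foo (struct bar *p)
{
  /* Four adjacent byte-sized constant stores ...  */
  p->b = 0;
  p->c = 0;
  p->d = 0;
  p->e = 0;
  /* ... which the pass should be able to merge into a single 32-bit
     store of zero, subject to alignment and the target word size.  */
}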

Thanks,
Kyrill


thanks,




Re: [PATCH][v3] GIMPLE store merging pass

2016-09-08 Thread Bernhard Reutner-Fischer
On 8 September 2016 at 10:31, Kyrill Tkachov
 wrote:
>
> On 07/09/16 20:03, Bernhard Reutner-Fischer wrote:
>>
>> On September 6, 2016 5:14:47 PM GMT+02:00, Kyrill Tkachov
>>  wrote:

> Thanks, fixed all the above in my tree (will be retesting).
>

>> What about debug statements? ISTM you should skip those.
>> (Isn't visited reset before entry of a pass?)
>
>
> Yes, I'll skip debug statements. Comments in gimple.h say that the visited
> property is undefined at pass boundaries, so I'd rather initialize it here.

Right.
>
>
>> Maybe I missed the bikeshedding about the name but I'd have used
>> -fmerge-stores instead.
>
>
> Wouldn't be hard to change. I can change it at any point if there's a
> consensus.

Did you consider any relation to any of
https://gcc.gnu.org/PR22141
https://gcc.gnu.org/PR23684
https://gcc.gnu.org/PR47059
https://gcc.gnu.org/PR54422
and their dups
or https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg77311.html
(the -fmerge-bitfields suggestion from imgtec; maybe the testcases are
of interest)

thanks,


Re: [PATCH][v3] GIMPLE store merging pass

2016-09-08 Thread Kyrill Tkachov


On 07/09/16 20:03, Bernhard Reutner-Fischer wrote:

On September 6, 2016 5:14:47 PM GMT+02:00, Kyrill Tkachov 
 wrote:

Hi all,

s/contigous/contiguous/
s/ where where/ where/

+struct merged_store_group
+{
+  HOST_WIDE_INT start;
+  HOST_WIDE_INT width;
+  unsigned char *val;
+  unsigned int align;
+  auto_vec<store_immediate_info *> stores;
+  /* We record the first and last original statements in the sequence because
+ because we'll need their vuse/vdef and replacement position.  */
+  gimple *last_stmt;

s/ because because/ because/

Why aren't these two HWIs unsigned, likewise in store_immediate_info and in 
most other spots in the patch?

+ fprintf (dump_file, "Afer writing ");
s/Afer /After/

/access if prohibitively slow/s/ if /is /

I'd get rid of successful_p in imm_store_chain_info::output_merged_stores.


Thanks, fixed all the above in my tree (will be retesting).



+unsigned int
+pass_store_merging::execute (function *fun)
+{
+  basic_block bb;
+  hash_set<gimple *> orig_stmts;
+
+  FOR_EACH_BB_FN (bb, fun)
+{
+  gimple_stmt_iterator gsi;
+  HOST_WIDE_INT num_statements = 0;
+  /* Record the original statements so that we can keep track of
+statements emitted in this pass and not re-process new
+statements.  */
+  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+   {
+ gimple_set_visited (gsi_stmt (gsi), false);
+ num_statements++;
+   }
+
+  if (num_statements < 2)
+   continue;

What about debug statements? ISTM you should skip those.
(Isn't visited reset before entry of a pass?)


Yes, I'll skip debug statements. Comments in gimple.h say that the visited
property is undefined at pass boundaries, so I'd rather initialize it here.



Maybe I missed the bikeshedding about the name but I'd have used -fmerge-stores 
instead.


Wouldn't be hard to change. I can change it at any point if there's a consensus.

Thanks for the feedback.
Kyrill


Thanks,

The v3 of this patch addresses feedback I received on the version
posted at [1].
The merged store buffer is now represented as a char array that we
splat values onto with
native_encode_expr and native_interpret_expr. This allows us to merge
anything that native_encode_expr
accepts, including floating point values and short vectors. So this
version extends the functionality
of the previous one in that it handles floating point values as well.

The first phase of the algorithm that detects the contiguous stores is
also slightly refactored according
to feedback to read more fluently.

Richi, I experimented with merging up to MOVE_MAX bytes rather than
word size but I got worse results on aarch64.
MOVE_MAX there is 16 (because it has load/store register pair
instructions) but the 128-bit immediates that we ended
synthesising were too complex. Perhaps the TImode immediate store RTL
expansions could be improved, but for now
I've left the maximum merge size to be BITS_PER_WORD.

I've disabled the pass for PDP-endian targets as the merging code
proved to be quite fiddly to get right for different
endiannesses and I didn't feel comfortable writing logic for
BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN targets without serious
testing capabilities. I hope that's ok (I note the bswap pass also
doesn't try to do anything on such targets).

Tested on arm, aarch64, x86_64 and on big-endian arm and aarch64.

How does this version look?
Thanks,
Kyrill

[1] https://gcc.gnu.org/ml/gcc-patches/2016-08/msg01512.html

2016-09-06  Kyrylo Tkachov  

 PR middle-end/22141
 * Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
 * common.opt (fstore-merging): New Optimization option.
 * opts.c (default_options_table): Add entry for
 OPT_ftree_store_merging.
 * params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
 * passes.def: Insert pass_tree_store_merging.
 * tree-pass.h (make_pass_store_merging): Declare extern
 prototype.
 * gimple-ssa-store-merging.c: New file.
 * doc/invoke.texi (Optimization Options): Document
 -fstore-merging.

2016-09-06  Kyrylo Tkachov  
 Jakub Jelinek  

 PR middle-end/22141
 * gcc.c-torture/execute/pr22141-1.c: New test.
 * gcc.c-torture/execute/pr22141-2.c: Likewise.
 * gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
 * gcc.target/aarch64/ldp_stp_4.c: Likewise.
 * gcc.dg/store_merging_1.c: New test.
 * gcc.dg/store_merging_2.c: Likewise.
 * gcc.dg/store_merging_3.c: Likewise.
 * gcc.dg/store_merging_4.c: Likewise.
 * gcc.dg/store_merging_5.c: Likewise.
 * gcc.dg/store_merging_6.c: Likewise.
 * gcc.target/i386/pr22141.c: Likewise.
 * gcc.target/i386/pr34012.c: Add -fno-store-merging to dg-options.






Re: [PATCH][v3] GIMPLE store merging pass

2016-09-07 Thread Jakub Jelinek
On Wed, Sep 07, 2016 at 10:19:11AM +0200, Richard Biener wrote:
> > If you want a 64-bit store, you'd need to merge the two, and that would be
> > even more expensive.  It is a matter of say:
> > movl $0x12345678, (%rsp)
> > movl $0x09abcdef, 4(%rsp)
> > vs.
> > movabsq $0x09abcdef12345678, %rax
> > movq %rax, (%rsp)
> > vs.
> > movl $0x09abcdef, %eax
> > salq $32, %rax
> > orq $0x12345678, %rax
> > movq %rax, (%rsp)
> 
> vs.
> 
> movq $LC0, (%rsp)

You don't want to store the address, so you'd use
movq .LC0, %rax
movq %rax, (%rsp)

> I think the important part to notice is that it should be straight forward
> for a target / the expander to split a large store from an immediate
> into any of the above but very hard to do the opposite.  Thus from a
> GIMPLE side "canonicalizing" to large stores (that are eventually
> supported and well-aligned) seems best to me.

I bet many programs assume that say 64-bit aligned store in the source is
atomic in 64-bit apps, without using __atomic_store (..., __ATOMIC_RELAXED);
So such a change would break that.
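
To make that assumption explicit, source code has to spell the store out; an
illustrative fragment (not part of the patch):

#include <stdint.h>

void
publish (uint64_t *p, uint64_t v)
{
  /* A relaxed atomic store must remain one 64-bit access; the compiler
     may neither split it nor merge it with neighbouring stores.  */
  __atomic_store_n (p, v, __ATOMIC_RELAXED);
}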

Jakub


Re: [PATCH][v3] GIMPLE store merging pass

2016-09-07 Thread Bernhard Reutner-Fischer
On September 6, 2016 5:14:47 PM GMT+02:00, Kyrill Tkachov 
 wrote:
>Hi all,

s/contigous/contiguous/
s/ where where/ where/

+struct merged_store_group
+{
+  HOST_WIDE_INT start;
+  HOST_WIDE_INT width;
+  unsigned char *val;
+  unsigned int align;
+  auto_vec<store_immediate_info *> stores;
+  /* We record the first and last original statements in the sequence because
+ because we'll need their vuse/vdef and replacement position.  */
+  gimple *last_stmt;

s/ because because/ because/

Why aren't these two HWIs unsigned, likewise in store_immediate_info and in 
most other spots in the patch?

+ fprintf (dump_file, "Afer writing ");
s/Afer /After/

/access if prohibitively slow/s/ if /is /

I'd get rid of successful_p in imm_store_chain_info::output_merged_stores.


+unsigned int
+pass_store_merging::execute (function *fun)
+{
+  basic_block bb;
+  hash_set<gimple *> orig_stmts;
+
+  FOR_EACH_BB_FN (bb, fun)
+{
+  gimple_stmt_iterator gsi;
+  HOST_WIDE_INT num_statements = 0;
+  /* Record the original statements so that we can keep track of
+statements emitted in this pass and not re-process new
+statements.  */
+  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+   {
+ gimple_set_visited (gsi_stmt (gsi), false);
+ num_statements++;
+   }
+
+  if (num_statements < 2)
+   continue;

What about debug statements? ISTM you should skip those.
(Isn't visited reset before entry of a pass?)

Maybe I missed the bikeshedding about the name but I'd have used -fmerge-stores 
instead.

Thanks,
>
>The v3 of this patch addresses feedback I received on the version
>posted at [1].
>The merged store buffer is now represented as a char array that we
>splat values onto with
>native_encode_expr and native_interpret_expr. This allows us to merge
>anything that native_encode_expr
>accepts, including floating point values and short vectors. So this
>version extends the functionality
>of the previous one in that it handles floating point values as well.
>
>The first phase of the algorithm that detects the contiguous stores is
>also slightly refactored according
>to feedback to read more fluently.
>
>Richi, I experimented with merging up to MOVE_MAX bytes rather than
>word size but I got worse results on aarch64.
>MOVE_MAX there is 16 (because it has load/store register pair
>instructions) but the 128-bit immediates that we ended
>synthesising were too complex. Perhaps the TImode immediate store RTL
>expansions could be improved, but for now
>I've left the maximum merge size to be BITS_PER_WORD.
>
>I've disabled the pass for PDP-endian targets as the merging code
>proved to be quite fiddly to get right for different
>endiannesses and I didn't feel comfortable writing logic for
>BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN targets without serious
>testing capabilities. I hope that's ok (I note the bswap pass also
>doesn't try to do anything on such targets).
>
>Tested on arm, aarch64, x86_64 and on big-endian arm and aarch64.
>
>How does this version look?
>Thanks,
>Kyrill
>
>[1] https://gcc.gnu.org/ml/gcc-patches/2016-08/msg01512.html
>
>2016-09-06  Kyrylo Tkachov  
>
> PR middle-end/22141
> * Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
> * common.opt (fstore-merging): New Optimization option.
> * opts.c (default_options_table): Add entry for
> OPT_ftree_store_merging.
> * params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
> * passes.def: Insert pass_tree_store_merging.
> * tree-pass.h (make_pass_store_merging): Declare extern
> prototype.
> * gimple-ssa-store-merging.c: New file.
> * doc/invoke.texi (Optimization Options): Document
> -fstore-merging.
>
>2016-09-06  Kyrylo Tkachov  
> Jakub Jelinek  
>
> PR middle-end/22141
> * gcc.c-torture/execute/pr22141-1.c: New test.
> * gcc.c-torture/execute/pr22141-2.c: Likewise.
> * gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
> * gcc.target/aarch64/ldp_stp_4.c: Likewise.
> * gcc.dg/store_merging_1.c: New test.
> * gcc.dg/store_merging_2.c: Likewise.
> * gcc.dg/store_merging_3.c: Likewise.
> * gcc.dg/store_merging_4.c: Likewise.
> * gcc.dg/store_merging_5.c: Likewise.
> * gcc.dg/store_merging_6.c: Likewise.
> * gcc.target/i386/pr22141.c: Likewise.
> * gcc.target/i386/pr34012.c: Add -fno-store-merging to dg-options.




Re: [PATCH][v3] GIMPLE store merging pass

2016-09-07 Thread Bernd Schmidt



On 09/07/2016 10:19 AM, Richard Biener wrote:

On Tue, 6 Sep 2016, Jakub Jelinek wrote:



If you want a 64-bit store, you'd need to merge the two, and that would be
even more expensive.  It is a matter of say:
movl $0x12345678, (%rsp)
movl $0x09abcdef, 4(%rsp)
vs.
movabsq $0x09abcdef12345678, %rax
movq %rax, (%rsp)
vs.
movl $0x09abcdef, %eax
salq $32, %rax
orq $0x12345678, %rax
movq %rax, (%rsp)


vs.

movq $LC0, (%rsp)

?


Not the same. That moves the address of $LC0.


Bernd


Re: [PATCH][v3] GIMPLE store merging pass

2016-09-07 Thread Jeff Law

On 09/07/2016 02:19 AM, Richard Biener wrote:

On Tue, 6 Sep 2016, Jakub Jelinek wrote:


On Tue, Sep 06, 2016 at 04:59:23PM +0100, Kyrill Tkachov wrote:

On 06/09/16 16:32, Jakub Jelinek wrote:

On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:

The v3 of this patch addresses feedback I received on the version posted at [1].
The merged store buffer is now represented as a char array that we splat values 
onto with
native_encode_expr and native_interpret_expr. This allows us to merge anything 
that native_encode_expr
accepts, including floating point values and short vectors. So this version 
extends the functionality
of the previous one in that it handles floating point values as well.

The first phase of the algorithm that detects the contiguous stores is also 
slightly refactored according
to feedback to read more fluently.

Richi, I experimented with merging up to MOVE_MAX bytes rather than word size 
but I got worse results on aarch64.
MOVE_MAX there is 16 (because it has load/store register pair instructions) but 
the 128-bit immediates that we ended
synthesising were too complex. Perhaps the TImode immediate store RTL 
expansions could be improved, but for now
I've left the maximum merge size to be BITS_PER_WORD.

At least from playing with this kind of things in the RTL PR22141 patch,
I remember storing 64-bit constants on x86_64 compared to storing 2 32-bit
constants usually isn't a win (not just for speed optimized blocks but also for
-Os).  For 64-bit store if the constant isn't signed 32-bit or unsigned
32-bit you need movabsq into some temporary register which has like 3 times 
worse
latency than normal store if I remember well, and then store it.


We could restrict the maximum width of the stores generated to 32 bits on 
x86_64.
I think this would need another parameter or target macro for the target to set.
Alternatively, is it a possibility for x86 to be a bit smarter in its DImode 
mov-immediate
expansion? For example break up the 64-bit movabsq immediate into two SImode 
immediates?


If you want a 64-bit store, you'd need to merge the two, and that would be
even more expensive.  It is a matter of say:
movl $0x12345678, (%rsp)
movl $0x09abcdef, 4(%rsp)
vs.
movabsq $0x09abcdef12345678, %rax
movq %rax, (%rsp)
vs.
movl $0x09abcdef, %eax
salq $32, %rax
orq $0x12345678, %rax
movq %rax, (%rsp)


vs.

movq $LC0, (%rsp)

?


etc.  Guess it needs to be benchmarked on contemporary CPUs, I'm pretty sure
the last sequence is the worst one.


I think the important part to notice is that it should be straight forward
for a target / the expander to split a large store from an immediate
into any of the above but very hard to do the opposite.  Thus from a
GIMPLE side "canonicalizing" to large stores (that are eventually
supported and well-aligned) seems best to me.

Agreed.





I'm aware of that. The patch already has logic to avoid emitting unaligned 
accesses
for SLOW_UNALIGNED_ACCESS targets. Beyond that the patch introduces the 
parameter
PARAM_STORE_MERGING_ALLOW_UNALIGNED that can be used by the user or target to
forbid generation of unaligned stores by the pass altogether. Beyond that I'm 
not sure
how to behave more intelligently here. Any ideas?


Dunno, the heuristics was the main problem with my patch.  Generally, I'd
say there is a difference between cold and hot blocks, in cold ones perhaps
unaligned stores are more appropriate (if supported at all and not way too
slow), while in hot ones less desirable.


Note that I repeatedly argue that if we can canonicalize sth to "larger"
then even if unaligned, the expander should be able to produce optimal
code again (it might not do, of course).
And agreed.  Furthermore, it's in line with our guiding principles WRT 
separation of the tree/SSA optimizers from target dependencies.


So let's push those decisions into the expanders/backend/target and 
canonicalize to the larger stores.


jeff




Re: [PATCH][v3] GIMPLE store merging pass

2016-09-07 Thread Richard Biener
On Tue, 6 Sep 2016, Jakub Jelinek wrote:

> On Tue, Sep 06, 2016 at 04:59:23PM +0100, Kyrill Tkachov wrote:
> > On 06/09/16 16:32, Jakub Jelinek wrote:
> > >On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
> > >>The v3 of this patch addresses feedback I received on the version posted 
> > >>at [1].
> > >>The merged store buffer is now represented as a char array that we splat 
> > >>values onto with
> > >>native_encode_expr and native_interpret_expr. This allows us to merge 
> > >>anything that native_encode_expr
> > >>accepts, including floating point values and short vectors. So this 
> > >>version extends the functionality
> > >>of the previous one in that it handles floating point values as well.
> > >>
> > >>The first phase of the algorithm that detects the contiguous stores is 
> > >>also slightly refactored according
> > >>to feedback to read more fluently.
> > >>
> > >>Richi, I experimented with merging up to MOVE_MAX bytes rather than word 
> > >>size but I got worse results on aarch64.
> > >>MOVE_MAX there is 16 (because it has load/store register pair 
> > >>instructions) but the 128-bit immediates that we ended
> > >>synthesising were too complex. Perhaps the TImode immediate store RTL 
> > >>expansions could be improved, but for now
> > >>I've left the maximum merge size to be BITS_PER_WORD.
> > >At least from playing with this kind of things in the RTL PR22141 patch,
> > >I remember storing 64-bit constants on x86_64 compared to storing 2 32-bit
> > >constants usually isn't a win (not just for speed optimized blocks but 
> > >also for
> > >-Os).  For 64-bit store if the constant isn't signed 32-bit or unsigned
> > >32-bit you need movabsq into some temporary register which has like 3 
> > >times worse
> > >latency than normal store if I remember well, and then store it.
> > 
> > We could restrict the maximum width of the stores generated to 32 bits on 
> > x86_64.
> > I think this would need another parameter or target macro for the target to 
> > set.
> > Alternatively, is it a possibility for x86 to be a bit smarter in its 
> > DImode mov-immediate
> > expansion? For example break up the 64-bit movabsq immediate into two 
> > SImode immediates?
> 
> If you want a 64-bit store, you'd need to merge the two, and that would be
> even more expensive.  It is a matter of say:
>   movl $0x12345678, (%rsp)
>   movl $0x09abcdef, 4(%rsp)
> vs.
>   movabsq $0x09abcdef12345678, %rax
>   movq %rax, (%rsp)
> vs.
>   movl $0x09abcdef, %eax
>   salq $32, %rax
>   orq $0x12345678, %rax
>   movq %rax, (%rsp)

vs.

movq $LC0, (%rsp)

?

> etc.  Guess it needs to be benchmarked on contemporary CPUs, I'm pretty sure
> the last sequence is the worst one.

I think the important part to notice is that it should be straight forward
for a target / the expander to split a large store from an immediate
into any of the above but very hard to do the opposite.  Thus from a
GIMPLE side "canonicalizing" to large stores (that are eventually
supported and well-aligned) seems best to me.
 
> > >What alias set is used for the accesses if there are different alias sets
> > >involved in between the merged stores?
> > 
> > As per https://gcc.gnu.org/ml/gcc/2016-06/msg00162.html the type used in 
> > those cases
> > would be ptr_type_node. See the get_type_for_merged_store function in the 
> > patch.
> 
> Richi knows this best.  I just wonder if e.g. all the stores go into fields
> of the same structure it wouldn't be better if you need to punt use that
> structure as the type rather than alias set 0.

Well, yes - if the IL always accesses a common handled component you
can use the alias set of that component.  But it's some work to do
this correctly as you can have MEM[ + 4] = 1; which stores an 'int'
to a struct with two floats.   So you do have to be careful, also when
merging overlapping stores (in which case you'll certainly have an
irregular situation).

Which is why I suggested the "easy" approach above which should handle
a lot of cases well already.
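
A purely illustrative C fragment of that situation (hypothetical names; the
cast deliberately mirrors what the IL can contain rather than well-behaved
source code):

struct two_floats { float x; float y; };

void
f (struct two_floats *p)
{
  p->x = 0.0f;
  /* In the IL this can show up as an int-typed store into the bytes of
     the second float, so a merged store covering both cannot simply
     reuse the alias set of one of the member accesses.  */
  *(int *) &p->y = 1;
}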

> > I'm aware of that. The patch already has logic to avoid emitting unaligned 
> > accesses
> > for SLOW_UNALIGNED_ACCESS targets. Beyond that the patch introduces the 
> > parameter
> > PARAM_STORE_MERGING_ALLOW_UNALIGNED that can be used by the user or target 
> > to
> > forbid generation of unaligned stores by the pass altogether. Beyond that 
> > I'm not sure
> > how to behave more intelligently here. Any ideas?
> 
> Dunno, the heuristics was the main problem with my patch.  Generally, I'd
> say there is a difference between cold and hot blocks, in cold ones perhaps
> unaligned stores are more appropriate (if supported at all and not way too
> slow), while in hot ones less desirable.

Note that I repeatedly argue that if we can canonicalize sth to "larger"
then even if unaligned, the expander should be able to produce optimal
code again (it might not do, of course).

Hope to look at the patch in detail soon.


Re: [PATCH][v3] GIMPLE store merging pass

2016-09-06 Thread Kyrill Tkachov


On 06/09/16 17:21, Jakub Jelinek wrote:

On Tue, Sep 06, 2016 at 04:59:23PM +0100, Kyrill Tkachov wrote:

On 06/09/16 16:32, Jakub Jelinek wrote:

On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:

The v3 of this patch addresses feedback I received on the version posted at [1].
The merged store buffer is now represented as a char array that we splat values 
onto with
native_encode_expr and native_interpret_expr. This allows us to merge anything 
that native_encode_expr
accepts, including floating point values and short vectors. So this version 
extends the functionality
of the previous one in that it handles floating point values as well.

The first phase of the algorithm that detects the contiguous stores is also 
slightly refactored according
to feedback to read more fluently.

Richi, I experimented with merging up to MOVE_MAX bytes rather than word size 
but I got worse results on aarch64.
MOVE_MAX there is 16 (because it has load/store register pair instructions) but 
the 128-bit immediates that we ended
synthesising were too complex. Perhaps the TImode immediate store RTL 
expansions could be improved, but for now
I've left the maximum merge size to be BITS_PER_WORD.

At least from playing with this kind of things in the RTL PR22141 patch,
I remember storing 64-bit constants on x86_64 compared to storing 2 32-bit
constants usually isn't a win (not just for speed optimized blocks but also for
-Os).  For 64-bit store if the constant isn't signed 32-bit or unsigned
32-bit you need movabsq into some temporary register which has like 3 times 
worse
latency than normal store if I remember well, and then store it.

We could restrict the maximum width of the stores generated to 32 bits on 
x86_64.
I think this would need another parameter or target macro for the target to set.
Alternatively, is it a possibility for x86 to be a bit smarter in its DImode 
mov-immediate
expansion? For example break up the 64-bit movabsq immediate into two SImode 
immediates?

If you want a 64-bit store, you'd need to merge the two, and that would be
even more expensive.  It is a matter of say:
movl $0x12345678, (%rsp)
movl $0x09abcdef, 4(%rsp)
vs.
movabsq $0x09abcdef12345678, %rax
movq %rax, (%rsp)
vs.
movl $0x09abcdef, %eax
salq $32, %rax
orq $0x12345678, %rax
movq %rax, (%rsp)
etc.  Guess it needs to be benchmarked on contemporary CPUs, I'm pretty sure
the last sequence is the worst one.


I'd think the backend could expand to the 1st form given
a DImode move of 0x09abcdef12345678 to a memory destination?
Or restrict the maximum merged store size to 32 bits.
Or use rtx costs to iteratively reduce the size of the stores until
we reach the cheap narrower constants. I suspect the problem there would
be to weigh the benefit of using a narrower cheap constant against the
cost of adding more stores. At the GIMPLE level we don't really want to
be calling RTX costs on anything other than constants.




What alias set is used for the accesses if there are different alias sets
involved in between the merged stores?

As per https://gcc.gnu.org/ml/gcc/2016-06/msg00162.html the type used in those 
cases
would be ptr_type_node. See the get_type_for_merged_store function in the patch.

Richi knows this best.  I just wonder if e.g. all the stores go into fields
of the same structure it wouldn't be better if you need to punt use that
structure as the type rather than alias set 0.


I'm aware of that. The patch already has logic to avoid emitting unaligned 
accesses
for SLOW_UNALIGNED_ACCESS targets. Beyond that the patch introduces the 
parameter
PARAM_STORE_MERGING_ALLOW_UNALIGNED that can be used by the user or target to
forbid generation of unaligned stores by the pass altogether. Beyond that I'm 
not sure
how to behave more intelligently here. Any ideas?

Dunno, the heuristics was the main problem with my patch.  Generally, I'd
say there is a difference between cold and hot blocks, in cold ones perhaps
unaligned stores are more appropriate (if supported at all and not way too
slow), while in hot ones less desirable.


Could do. I'll look into it, thanks.


And, do you have some SPEC2k and/or SPEC2k6 numbers, for
  e.g. x86_64/i686/arm/aarch64/powerpc64le?

I did some benchmarking on aarch64 in the initial submission at
https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00942.html
aarch64 showed some improvement and no regressions on x86_64.
I'll be rerunning the numbers on aarch64/x86_64/arm as the patch has expanded
in scope since then (handling more bitfields, floating point constants).
I just wanted to get this version out before the Cauldron for comments.


The RTL PR22141 changes weren't added mainly because it slowed down SPEC2k*
on powerpc.

Unfortunately I don't have access to SPEC on powerpc. Any help with 
testing/benchmarking
there would be very much appreciated.

I don't have SPEC set up on any arch, perhaps you can ask some of the IBM
folks to benchmark it for you (that is what I've done back when writing the
patch)?

Re: [PATCH][v3] GIMPLE store merging pass

2016-09-06 Thread Jakub Jelinek
On Tue, Sep 06, 2016 at 04:59:23PM +0100, Kyrill Tkachov wrote:
> On 06/09/16 16:32, Jakub Jelinek wrote:
> >On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
> >>The v3 of this patch addresses feedback I received on the version posted at 
> >>[1].
> >>The merged store buffer is now represented as a char array that we splat 
> >>values onto with
> >>native_encode_expr and native_interpret_expr. This allows us to merge 
> >>anything that native_encode_expr
> >>accepts, including floating point values and short vectors. So this version 
> >>extends the functionality
> >>of the previous one in that it handles floating point values as well.
> >>
> >>The first phase of the algorithm that detects the contiguous stores is also 
> >>slightly refactored according
> >>to feedback to read more fluently.
> >>
> >>Richi, I experimented with merging up to MOVE_MAX bytes rather than word 
> >>size but I got worse results on aarch64.
> >>MOVE_MAX there is 16 (because it has load/store register pair instructions) 
> >>but the 128-bit immediates that we ended
> >>synthesising were too complex. Perhaps the TImode immediate store RTL 
> >>expansions could be improved, but for now
> >>I've left the maximum merge size to be BITS_PER_WORD.
> >At least from playing with this kind of things in the RTL PR22141 patch,
> >I remember storing 64-bit constants on x86_64 compared to storing 2 32-bit
> >constants usually isn't a win (not just for speed optimized blocks but also 
> >for
> >-Os).  For 64-bit store if the constant isn't signed 32-bit or unsigned
> >32-bit you need movabsq into some temporary register which has like 3 times 
> >worse
> >latency than normal store if I remember well, and then store it.
> 
> We could restrict the maximum width of the stores generated to 32 bits on 
> x86_64.
> I think this would need another parameter or target macro for the target to 
> set.
> Alternatively, is it a possibility for x86 to be a bit smarter in its DImode 
> mov-immediate
> expansion? For example break up the 64-bit movabsq immediate into two SImode 
> immediates?

If you want a 64-bit store, you'd need to merge the two, and that would be
even more expensive.  It is a matter of say:
movl $0x12345678, (%rsp)
movl $0x09abcdef, 4(%rsp)
vs.
movabsq $0x09abcdef12345678, %rax
movq %rax, (%rsp)
vs.
movl $0x09abcdef, %eax
salq $32, %rax
orq $0x12345678, %rax
movq %rax, (%rsp)
etc.  Guess it needs to be benchmarked on contemporary CPUs, I'm pretty sure
the last sequence is the worst one.
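
For concreteness, a hypothetical source pattern behind the alternatives above
(the thread uses stack slots, a pointed-to struct is used here; the constants
are the same):

struct pair
{
  unsigned int lo;
  unsigned int hi;
};

void
init (struct pair *p)
{
  p->lo = 0x12345678;   /* movl $0x12345678, (%rdi)   */
  p->hi = 0x09abcdef;   /* movl $0x09abcdef, 4(%rdi)  */
}

Merged, this becomes one 64-bit store of 0x09abcdef12345678 (little endian),
which needs the movabsq + movq pair because the constant does not fit a
sign-extended 32-bit immediate.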

> >What alias set is used for the accesses if there are different alias sets
> >involved in between the merged stores?
> 
> As per https://gcc.gnu.org/ml/gcc/2016-06/msg00162.html the type used in 
> those cases
> would be ptr_type_node. See the get_type_for_merged_store function in the 
> patch.

Richi knows this best.  I just wonder if e.g. all the stores go into fields
of the same structure it wouldn't be better if you need to punt use that
structure as the type rather than alias set 0.

> I'm aware of that. The patch already has logic to avoid emitting unaligned 
> accesses
> for SLOW_UNALIGNED_ACCESS targets. Beyond that the patch introduces the 
> parameter
> PARAM_STORE_MERGING_ALLOW_UNALIGNED that can be used by the user or target to
> forbid generation of unaligned stores by the pass altogether. Beyond that I'm 
> not sure
> how to behave more intelligently here. Any ideas?

Dunno, the heuristics was the main problem with my patch.  Generally, I'd
say there is a difference between cold and hot blocks, in cold ones perhaps
unaligned stores are more appropriate (if supported at all and not way too
slow), while in hot ones less desirable.
> 
> >And, do you have some SPEC2k and/or SPEC2k6 numbers, for
> >  e.g. x86_64/i686/arm/aarch64/powerpc64le?
> 
> I did some benchmarking on aarch64 in the initial submission at
> https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00942.html
> aarch64 showed some improvement and no regressions on x86_64.
> I'll be rerunning the numbers on aarch64/x86_64/arm as the patch has expanded
> in scope since then (handling more bitfields, floating point constants).
> I just wanted to get this version out before the Cauldron for comments.
> 
> >The RTL PR22141 changes weren't added mainly because it slowed down SPEC2k*
> >on powerpc.
> 
> Unfortunately I don't have access to SPEC on powerpc. Any help with 
> testing/benchmarking
> there would be very much appreciated.

I don't have SPEC set up on any arch, perhaps you can ask some of the IBM
folks to benchmark it for you (that is what I've done back when writing the
patch)?

BTW, do you handle bitfield stores too?  Once you don't support just
constant stores, for bitfields but perhaps also for other adjacent
operations not just copying, but e.g. also bitwise operations (&,|,^,~)
on adjacent memory locations might be useful, or for bitfields even other
operations (e.g. addition if there are enough bits 

Re: [PATCH][v3] GIMPLE store merging pass

2016-09-06 Thread Kyrill Tkachov

Hi Jakub,

On 06/09/16 16:32, Jakub Jelinek wrote:

On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:

The v3 of this patch addresses feedback I received on the version posted at [1].
The merged store buffer is now represented as a char array that we splat values 
onto with
native_encode_expr and native_interpret_expr. This allows us to merge anything 
that native_encode_expr
accepts, including floating point values and short vectors. So this version 
extends the functionality
of the previous one in that it handles floating point values as well.

The first phase of the algorithm that detects the contiguous stores is also 
slightly refactored according
to feedback to read more fluently.

Richi, I experimented with merging up to MOVE_MAX bytes rather than word size 
but I got worse results on aarch64.
MOVE_MAX there is 16 (because it has load/store register pair instructions) but 
the 128-bit immediates that we ended
synthesising were too complex. Perhaps the TImode immediate store RTL 
expansions could be improved, but for now
I've left the maximum merge size to be BITS_PER_WORD.

At least from playing with this kind of things in the RTL PR22141 patch,
I remember storing 64-bit constants on x86_64 compared to storing 2 32-bit
constants usually isn't a win (not just for speed optimized blocks but also for
-Os).  For 64-bit store if the constant isn't signed 32-bit or unsigned
32-bit you need movabsq into some temporary register which has like 3 times 
worse
latency than normal store if I remember well, and then store it.


We could restrict the maximum width of the stores generated to 32 bits on 
x86_64.
I think this would need another parameter or target macro for the target to set.
Alternatively, is it a possibility for x86 to be a bit smarter in its DImode 
mov-immediate
expansion? For example break up the 64-bit movabsq immediate into two SImode 
immediates?


   If it can
be CSEd and the same constant used multiple times in adjacent code perhaps.


Perhaps. From glancing at SPEC2006 generated code for aarch64 I didn't spot too 
many opportunities
for that though.


Various other targets have different costs for different constants,
so it would be nice if the pass considered that (computed RTX costs of those
constants and used that in some heuristics).


Could do. That could avoid creating too expensive immediates.


What alias set is used for the accesses if there are different alias sets
involved in between the merged stores?


As per https://gcc.gnu.org/ml/gcc/2016-06/msg00162.html the type used in those 
cases
would be ptr_type_node. See the get_type_for_merged_store function in the patch.


Also alignment can matter, even on non-strict alignment targets (speed vs.
-Os for that).


I'm aware of that. The patch already has logic to avoid emitting unaligned 
accesses
for SLOW_UNALIGNED_ACCESS targets. Beyond that the patch introduces the 
parameter
PARAM_STORE_MERGING_ALLOW_UNALIGNED that can be used by the user or target to
forbid generation of unaligned stores by the pass altogether. Beyond that I'm 
not sure
how to behave more intelligently here. Any ideas?
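
As an illustration of the case that parameter gates (assuming the usual
params.def mapping, i.e. --param store-merging-allow-unaligned on the command
line):

void
g (char *p)
{
  /* p has no known alignment, so a merged 32-bit store here would be an
     unaligned access; the parameter lets the user or target forbid
     emitting it and keep the four byte stores instead.  */
  p[0] = 1;
  p[1] = 2;
  p[2] = 3;
  p[3] = 4;
}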


And, do you have some SPEC2k and/or SPEC2k6 numbers, for
  e.g. x86_64/i686/arm/aarch64/powerpc64le?


I did some benchmarking on aarch64 in the initial submission at
https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00942.html
aarch64 showed some improvement and no regressions on x86_64.
I'll be rerunning the numbers on aarch64/x86_64/arm as the patch has expanded
in scope since then (handling more bitfields, floating point constants).
I just wanted to get this version out before the Cauldron for comments.


The RTL PR22141 changes weren't added mainly because it slowed down SPEC2k*
on powerpc.


Unfortunately I don't have access to SPEC on powerpc. Any help with 
testing/benchmarking
there would be very much appreciated.


Also, do you only handle constants or also the case where there is partial
or complete copying from some other memory, where it could be turned into
larger chunk loads + stores or __builtin_memcpy?


At the moment just constants. I hope in the future to extend it to perform more 
tricks
involving contiguous stores.


I've disabled the pass for PDP-endian targets as the merging code proved to be 
quite fiddly to get right for different
endiannesses and I didn't feel comfortable writing logic for BYTES_BIG_ENDIAN 
!= WORDS_BIG_ENDIAN targets without serious
testing capabilities. I hope that's ok (I note the bswap pass also doesn't try 
to do anything on such targets).

I think that is fine, it isn't the only pass that punts in this case.


Thanks for the comments and ideas,
Kyrill


Jakub





Re: [PATCH][v3] GIMPLE store merging pass

2016-09-06 Thread Jakub Jelinek
On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
> The v3 of this patch addresses feedback I received on the version posted at 
> [1].
> The merged store buffer is now represented as a char array that we splat 
> values onto with
> native_encode_expr and native_interpret_expr. This allows us to merge 
> anything that native_encode_expr
> accepts, including floating point values and short vectors. So this version 
> extends the functionality
> of the previous one in that it handles floating point values as well.
> 
> The first phase of the algorithm that detects the contiguous stores is also 
> slightly refactored according
> to feedback to read more fluently.
> 
> Richi, I experimented with merging up to MOVE_MAX bytes rather than word size 
> but I got worse results on aarch64.
> MOVE_MAX there is 16 (because it has load/store register pair instructions) 
> but the 128-bit immediates that we ended
> synthesising were too complex. Perhaps the TImode immediate store RTL 
> expansions could be improved, but for now
> I've left the maximum merge size to be BITS_PER_WORD.

At least from playing with this kind of things in the RTL PR22141 patch,
I remember storing 64-bit constants on x86_64 compared to storing 2 32-bit
constants usually isn't a win (not just for speed optimized blocks but also for
-Os).  For 64-bit store if the constant isn't signed 32-bit or unsigned
32-bit you need movabsq into some temporary register which has like 3 times 
worse
latency than normal store if I remember well, and then store it.  If it can
be CSEd and the same constant used multiple times in adjacent code perhaps.
Various other targets have different costs for different constants,
so it would be nice if the pass considered that (computed RTX costs of those
constants and used that in some heuristics).
What alias set is used for the accesses if there are different alias sets
involved in between the merged stores?
Also alignment can matter, even on non-strict alignment targets (speed vs.
-Os for that).
And, do you have some SPEC2k and/or SPEC2k6 numbers, for
 e.g. x86_64/i686/arm/aarch64/powerpc64le?
The RTL PR22141 changes weren't added mainly because it slowed down SPEC2k*
on powerpc.
Also, do you only handle constants or also the case where there is partial
or complete copying from some other memory, where it could be turned into
larger chunk loads + stores or __builtin_memcpy?

> I've disabled the pass for PDP-endian targets as the merging code proved to 
> be quite fiddly to get right for different
> endiannesses and I didn't feel comfortable writing logic for BYTES_BIG_ENDIAN 
> != WORDS_BIG_ENDIAN targets without serious
> testing capabilities. I hope that's ok (I note the bswap pass also doesn't 
> try to do anything on such targets).

I think that is fine, it isn't the only pass that punts in this case.

Jakub


[PATCH][v3] GIMPLE store merging pass

2016-09-06 Thread Kyrill Tkachov

Hi all,

The v3 of this patch addresses feedback I received on the version posted at [1].
The merged store buffer is now represented as a char array that we splat values 
onto with
native_encode_expr and native_interpret_expr. This allows us to merge anything 
that native_encode_expr
accepts, including floating point values and short vectors. So this version 
extends the functionality
of the previous one in that it handles floating point values as well.
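
A rough sketch of that buffer mechanism, using GCC-internal code (not lifted
from the patch; it assumes the declarations from tree.h and fold-const.h):

  /* A word-sized byte buffer representing the merged store value.  */
  unsigned char buf[8];
  memset (buf, 0, sizeof buf);

  /* Splat a 32-bit integer constant at byte offset 0 and a float
     constant at byte offset 4, both in target byte order.  */
  tree c1 = build_int_cst (integer_type_node, 0x12345678);
  tree c2 = build_real (float_type_node, dconst1);
  native_encode_expr (c1, buf, 4);
  native_encode_expr (c2, buf + 4, 4);

  /* Re-interpret the filled buffer as a single wider constant that one
     64-bit store can then materialise.  */
  tree wide_type = build_nonstandard_integer_type (64, 1);
  tree merged = native_interpret_expr (wide_type, buf, 8);

Anything native_encode_expr can serialise into bytes can be placed in the
buffer this way, which is why floats and short vectors come along for free.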

The first phase of the algorithm that detects the contiguous stores is also 
slightly refactored according
to feedback to read more fluently.

Richi, I experimented with merging up to MOVE_MAX bytes rather than word size 
but I got worse results on aarch64.
MOVE_MAX there is 16 (because it has load/store register pair instructions) but 
the 128-bit immediates that we ended
synthesising were too complex. Perhaps the TImode immediate store RTL 
expansions could be improved, but for now
I've left the maximum merge size to be BITS_PER_WORD.

I've disabled the pass for PDP-endian targets as the merging code proved to be 
quite fiddly to get right for different
endiannesses and I didn't feel comfortable writing logic for BYTES_BIG_ENDIAN 
!= WORDS_BIG_ENDIAN targets without serious
testing capabilities. I hope that's ok (I note the bswap pass also doesn't try 
to do anything on such targets).

Tested on arm, aarch64, x86_64 and on big-endian arm and aarch64.

How does this version look?
Thanks,
Kyrill

[1] https://gcc.gnu.org/ml/gcc-patches/2016-08/msg01512.html

2016-09-06  Kyrylo Tkachov  

PR middle-end/22141
* Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
* common.opt (fstore-merging): New Optimization option.
* opts.c (default_options_table): Add entry for
OPT_ftree_store_merging.
* params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
* passes.def: Insert pass_tree_store_merging.
* tree-pass.h (make_pass_store_merging): Declare extern
prototype.
* gimple-ssa-store-merging.c: New file.
* doc/invoke.texi (Optimization Options): Document
-fstore-merging.

2016-09-06  Kyrylo Tkachov  
Jakub Jelinek  

PR middle-end/22141
* gcc.c-torture/execute/pr22141-1.c: New test.
* gcc.c-torture/execute/pr22141-2.c: Likewise.
* gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
* gcc.target/aarch64/ldp_stp_4.c: Likewise.
* gcc.dg/store_merging_1.c: New test.
* gcc.dg/store_merging_2.c: Likewise.
* gcc.dg/store_merging_3.c: Likewise.
* gcc.dg/store_merging_4.c: Likewise.
* gcc.dg/store_merging_5.c: Likewise.
* gcc.dg/store_merging_6.c: Likewise.
* gcc.target/i386/pr22141.c: Likewise.
* gcc.target/i386/pr34012.c: Add -fno-store-merging to dg-options.
diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 5e3cbe5d736e2cf9429df989d9d95708b1243011..ca2415b79de60c9be44871777d57bcc8d5c07487 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1295,6 +1295,7 @@ OBJS = \
 	gimple-ssa-isolate-paths.o \
 	gimple-ssa-nonnull-compare.o \
 	gimple-ssa-split-paths.o \
+	gimple-ssa-store-merging.o \
 	gimple-ssa-strength-reduction.o \
 	gimple-streamer-in.o \
 	gimple-streamer-out.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 2004fbc7d0d7b1ca65d06cee0f7398f2c9e3fc64..d951ae7cd6fe8fdae17e17b8f5be7c636ea1986d 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1452,6 +1452,10 @@ fstrict-volatile-bitfields
 Common Report Var(flag_strict_volatile_bitfields) Init(-1) Optimization
 Force bitfield accesses to match their type width.
 
+fstore-merging
+Common Var(flag_store_merging) Optimization
+Use the tree store merging pass.
+
 fguess-branch-probability
 Common Report Var(flag_guess_branch_prob) Optimization
 Enable guessing of branch probabilities.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 87da1f1c12b718fa63c9b89fdd8f85fbc6b54cb0..2e2a9fdd92733beb6948efc7a5c4b3fc79da73a9 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -401,7 +401,7 @@ Objective-C and Objective-C++ Dialects}.
 -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
 -fsplit-paths @gol
 -fsplit-wide-types -fssa-backprop -fssa-phiopt @gol
--fstdarg-opt -fstrict-aliasing @gol
+-fstdarg-opt -fstore-merging -fstrict-aliasing @gol
 -fstrict-overflow -fthread-jumps -ftracer -ftree-bit-ccp @gol
 -ftree-builtin-call-dce -ftree-ccp -ftree-ch @gol
 -ftree-coalesce-vars -ftree-copy-prop -ftree-dce -ftree-dominator-opts @gol
@@ -412,8 +412,8 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-loop-vectorize @gol
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
--ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
--ftree-vectorize -ftree-vrp -funconstrained-commons @gol
+-ftree-switch-conversion -ftree-tail-merge @gol
+-ftree-ter -ftree-vectorize -ftree-vrp -funconstrained-commons @gol
 -funit-at-a-time