Re: [PATCH][v6] GIMPLE store merging pass

2016-10-24 Thread Richard Biener
On Mon, 24 Oct 2016, Kyrill Tkachov wrote:

> 
> On 21/10/16 19:37, Richard Biener wrote:
> > On October 21, 2016 3:29:15 PM GMT+02:00, Kyrill Tkachov
> >  wrote:
> > > Hi Richard,
> > > 
> > > On 21/10/16 13:37, Richard Biener wrote:
> > > > On Tue, 18 Oct 2016, Kyrill Tkachov wrote:
> > > > 
> > > > > Hi Richard,
> > > > > 
> > > > > This patch is a merge of [1] and [2] and implements the manual
> > > merging of
> > > > > bitfields
> > > > > as outlined in [1] but actually makes it work on BYTES_BIG_ENDIAN
> > > too.
> > > > > It caused me a lot of headeache because the bit offset is counted
> > >from the
> > > > > most significant bit
> > > > > in the byte, even though BITS_BIG_ENDIAN was 0 (BITS_BIG_ENDIAN
> > > looks
> > > > > irrelevant for store merging
> > > > > anyway as it's just used to described RTL extract operations).
> > > > > I've included ASCII diagrams of the steps in the merging algorithm.
> > > > Heh, thanks.
> > > > 
> > > > > Bootstrapped and tested on arm-none-linux-gnueabihf,
> > > aarch64-none-linux-gnu,
> > > > > x86_64-unknown-linux-gnu.
> > > > > Also tested on aarch64_be-none-elf.
> > > > > 
> > > > > How does this version look now?
> > > > Mostly good.  For
> > > > 
> > > > +bool
> > > > +pass_store_merging::terminate_all_aliasing_chains (tree dest, tree
> > > base,
> > > > +  gimple *stmt)
> > > > +{
> > > > ...
> > > > +  /* Check if the assignment destination (BASE) is part of a store
> > > chain.
> > > > + This is to catch non-constant stores to destinations that may
> > > be
> > > > part
> > > > + of a chain.  */
> > > > +  if (base)
> > > > +{
> > > > +  chain_info = m_stores.get (base);
> > > > +  if (chain_info)
> > > > +   {
> > > > + struct store_immediate_info *info;
> > > > + unsigned int i;
> > > > + FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
> > > > +   {
> > > > + if (refs_may_alias_p (info->dest, dest))
> > > > +   {
> > > > 
> > > > I suppose the chain is not yet sorted in any way?
> > > > 
> > > > At least for 'dest' which do not have a known constant offset we
> > > > could do
> > > > 
> > > >  if (base)
> > > >terminate_and_release_chain (base);
> > > Do you mean when get_inner_reference returns non-NULL for POFFSET?
> > Yes.
> > 
> > > Or do you think we should try to look into dest in this function?
> > > 
> > > > to speed things up?  IIRC we do not terminate chains early in
> > > > this phase when we have enough stores to form a group, so
> > > > writing a testcase that triggers quadraticness would be as simple
> > > > as having
> > > > 
> > > > char a[100];
> > > > 
> > > > void foo ()
> > > > {
> > > >a[0] = 1;
> > > >a[1] = 2;
> > > >
> > > >a[999] = 3;
> > > > }
> > > > 
> > > > ?
> > > > 
> > > > so I think you probably want to limit the number of stores you
> > > > ever put onto a chain and if you reach that limit, terminate
> > > > and release it?  Like just choose 16 or 64?  (and experiment
> > > > with the above kind of testcases)
> > > I was initially thinking of imposing such a limit as well but
> > > later I thought we'd want to extend the output code to be able to emit
> > > a memcpy (or memset) call for large regions, so detecting the largest
> > > possible
> > > regions would be needed.
> > But then we need a better data structure here to avoid the quadraticness.
> 
> For the testcase:
> 
> char a[100];
> 
> void foo ()
> {
>   a[0] = 1;
>   a[1] = 2;
>   
>   a[999] = 3;
> }
> 
> and similar we don't have quadratic behaviour because constant stores to a
> constant offset
> from base (i.e. stores valid for merging) get automatically inserted into the
> chain without
> doing the alias checking (we keep track of their program order so that
> overlapping stores
> are merged properly). Since pushing into a vec is not a linear operation we're
> fine.

Good.

> Once I teach it to quickly terminate chains when encountering stores to base
> with a variable
> offset (like you suggest above) there is only one case of quadraticness:
> We walk the stores in the chain (causing quadratic behaviour) when among
> the valid mergeable stores we have interspersed stores to a constant offset
> from base but that
> do not store a constant value e.g. a[5] = x;
> So the input needs to be something like:
> 
> char a[16000];
> 
> void
> foo (char x)
> {
>   a[0] = 1; a[8000 + 0] = x;
>   a[1] = 1; a[8000 + 1] = x;
>   a[2] = 1; a[8000 + 2] = x;
>   a[3] = 1; a[8000 + 3] = x;
>   a[4] = 1; a[8000 + 4] = x;
>   ...
>   a[7999] = 1; a[8000 + 7999] = x;
> }
> 
> I've reproduced a slowdown (with store merging taking a disproportional
> time in -ftime-report) with this case and I've verified that limiting the
> maximum number of statements in the chain fixes it.

OK - as said, the algorithm trivially looks quadratic in some cases
and we should 

Re: [PATCH][v6] GIMPLE store merging pass

2016-10-24 Thread Kyrill Tkachov


On 21/10/16 19:37, Richard Biener wrote:

On October 21, 2016 3:29:15 PM GMT+02:00, Kyrill Tkachov 
 wrote:

Hi Richard,

On 21/10/16 13:37, Richard Biener wrote:

On Tue, 18 Oct 2016, Kyrill Tkachov wrote:


Hi Richard,

This patch is a merge of [1] and [2] and implements the manual

merging of

bitfields
as outlined in [1] but actually makes it work on BYTES_BIG_ENDIAN

too.

It caused me a lot of headeache because the bit offset is counted

>from the

most significant bit
in the byte, even though BITS_BIG_ENDIAN was 0 (BITS_BIG_ENDIAN

looks

irrelevant for store merging
anyway as it's just used to described RTL extract operations).
I've included ASCII diagrams of the steps in the merging algorithm.

Heh, thanks.


Bootstrapped and tested on arm-none-linux-gnueabihf,

aarch64-none-linux-gnu,

x86_64-unknown-linux-gnu.
Also tested on aarch64_be-none-elf.

How does this version look now?

Mostly good.  For

+bool
+pass_store_merging::terminate_all_aliasing_chains (tree dest, tree

base,

+  gimple *stmt)
+{
...
+  /* Check if the assignment destination (BASE) is part of a store

chain.

+ This is to catch non-constant stores to destinations that may

be

part
+ of a chain.  */
+  if (base)
+{
+  chain_info = m_stores.get (base);
+  if (chain_info)
+   {
+ struct store_immediate_info *info;
+ unsigned int i;
+ FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
+   {
+ if (refs_may_alias_p (info->dest, dest))
+   {

I suppose the chain is not yet sorted in any way?

At least for 'dest' which do not have a known constant offset we
could do

 if (base)
   terminate_and_release_chain (base);

Do you mean when get_inner_reference returns non-NULL for POFFSET?

Yes.


Or do you think we should try to look into dest in this function?


to speed things up?  IIRC we do not terminate chains early in
this phase when we have enough stores to form a group, so
writing a testcase that triggers quadraticness would be as simple
as having

char a[100];

void foo ()
{
   a[0] = 1;
   a[1] = 2;
   
   a[999] = 3;
}

?

so I think you probably want to limit the number of stores you
ever put onto a chain and if you reach that limit, terminate
and release it?  Like just choose 16 or 64?  (and experiment
with the above kind of testcases)

I was initially thinking of imposing such a limit as well but
later I thought we'd want to extend the output code to be able to emit
a memcpy (or memset) call for large regions, so detecting the largest
possible
regions would be needed.

But then we need a better data structure here to avoid the quadraticness.


For the testcase:

char a[100];

void foo ()
{
  a[0] = 1;
  a[1] = 2;
  
  a[999] = 3;
}

and similar we don't have quadratic behaviour because constant stores to a 
constant offset
from base (i.e. stores valid for merging) get automatically inserted into the 
chain without
doing the alias checking (we keep track of their program order so that 
overlapping stores
are merged properly). Since pushing into a vec is not a linear operation we're 
fine.

Once I teach it to quickly terminate chains when encountering stores to base 
with a variable
offset (like you suggest above) there is only one case of quadraticness:
We walk the stores in the chain (causing quadratic behaviour) when among
the valid mergeable stores we have interspersed stores to a constant offset 
from base but that
do not store a constant value e.g. a[5] = x;
So the input needs to be something like:

char a[16000];

void
foo (char x)
{
  a[0] = 1; a[8000 + 0] = x;
  a[1] = 1; a[8000 + 1] = x;
  a[2] = 1; a[8000 + 2] = x;
  a[3] = 1; a[8000 + 3] = x;
  a[4] = 1; a[8000 + 4] = x;
  ...
  a[7999] = 1; a[8000 + 7999] = x;
}

I've reproduced a slowdown (with store merging taking a disproportional
time in -ftime-report) with this case and I've verified that limiting the
maximum number of statements in the chain fixes it.




  But that is not implemented yet (though I have

experimented
with it) so I can add a limit here. Should I just hardcode a limit or
should I make it
into a --param (MAX_STMTS_IN_STORE_MERGING_CHAIN or something)?

A param is preferred.


Thanks. I'll introduce the limit but, as mentioned above, the simple case
of inserting multiple stores is not quadratic.

Kyrill




+ bit_off = byte_off << LOG2_BITS_PER_UNIT;
+ if (!wi::neg_p (bit_off) && wi::fits_shwi_p

(bit_off))

+   {
+ bitpos += bit_off.to_shwi ();
+

I think you want bit_off += bitpos before the fits_shwi check
otherwise this add may still overflow.

+ base_addr = copy_node (base_addr);
+ TREE_OPERAND (base_addr, 1)
+   = build_zero_cst (TREE_TYPE (TREE_OPERAND (
+  base_addr,


Re: [PATCH][v6] GIMPLE store merging pass

2016-10-21 Thread Richard Biener
On October 21, 2016 3:29:15 PM GMT+02:00, Kyrill Tkachov 
 wrote:
>Hi Richard,
>
>On 21/10/16 13:37, Richard Biener wrote:
>> On Tue, 18 Oct 2016, Kyrill Tkachov wrote:
>>
>>> Hi Richard,
>>>
>>> This patch is a merge of [1] and [2] and implements the manual
>merging of
>>> bitfields
>>> as outlined in [1] but actually makes it work on BYTES_BIG_ENDIAN
>too.
>>> It caused me a lot of headeache because the bit offset is counted
>from the
>>> most significant bit
>>> in the byte, even though BITS_BIG_ENDIAN was 0 (BITS_BIG_ENDIAN
>looks
>>> irrelevant for store merging
>>> anyway as it's just used to described RTL extract operations).
>>> I've included ASCII diagrams of the steps in the merging algorithm.
>> Heh, thanks.
>>
>>> Bootstrapped and tested on arm-none-linux-gnueabihf,
>aarch64-none-linux-gnu,
>>> x86_64-unknown-linux-gnu.
>>> Also tested on aarch64_be-none-elf.
>>>
>>> How does this version look now?
>> Mostly good.  For
>>
>> +bool
>> +pass_store_merging::terminate_all_aliasing_chains (tree dest, tree
>base,
>> +  gimple *stmt)
>> +{
>> ...
>> +  /* Check if the assignment destination (BASE) is part of a store
>chain.
>> + This is to catch non-constant stores to destinations that may
>be
>> part
>> + of a chain.  */
>> +  if (base)
>> +{
>> +  chain_info = m_stores.get (base);
>> +  if (chain_info)
>> +   {
>> + struct store_immediate_info *info;
>> + unsigned int i;
>> + FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
>> +   {
>> + if (refs_may_alias_p (info->dest, dest))
>> +   {
>>
>> I suppose the chain is not yet sorted in any way?
>>
>> At least for 'dest' which do not have a known constant offset we
>> could do
>>
>> if (base)
>>   terminate_and_release_chain (base);
>
>Do you mean when get_inner_reference returns non-NULL for POFFSET?

Yes.

>Or do you think we should try to look into dest in this function?
>
>> to speed things up?  IIRC we do not terminate chains early in
>> this phase when we have enough stores to form a group, so
>> writing a testcase that triggers quadraticness would be as simple
>> as having
>>
>> char a[100];
>>
>> void foo ()
>> {
>>   a[0] = 1;
>>   a[1] = 2;
>>   
>>   a[999] = 3;
>> }
>>
>> ?
>>
>> so I think you probably want to limit the number of stores you
>> ever put onto a chain and if you reach that limit, terminate
>> and release it?  Like just choose 16 or 64?  (and experiment
>> with the above kind of testcases)
>
>I was initially thinking of imposing such a limit as well but
>later I thought we'd want to extend the output code to be able to emit
>a memcpy (or memset) call for large regions, so detecting the largest
>possible
>regions would be needed.

But then we need a better data structure here to avoid the quadraticness.

 But that is not implemented yet (though I have
>experimented
>with it) so I can add a limit here. Should I just hardcode a limit or
>should I make it
>into a --param (MAX_STMTS_IN_STORE_MERGING_CHAIN or something)?

A param is preferred.

>>
>> + bit_off = byte_off << LOG2_BITS_PER_UNIT;
>> + if (!wi::neg_p (bit_off) && wi::fits_shwi_p
>(bit_off))
>> +   {
>> + bitpos += bit_off.to_shwi ();
>> +
>>
>> I think you want bit_off += bitpos before the fits_shwi check
>> otherwise this add may still overflow.
>>
>> + base_addr = copy_node (base_addr);
>> + TREE_OPERAND (base_addr, 1)
>> +   = build_zero_cst (TREE_TYPE (TREE_OPERAND (
>> +  base_addr,
>1)));
>>
>> I'd prefer
>>
>>base_addr = build2 (MEM_REF, ...);
>>
>> here.
>
>Thanks for the feedback,
>Kyrill
>
>>
>> Thanks,
>> Richard.
>>
>>> Thanks,
>>> Kyrill
>>>
>>> [1] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00573.html
>>> [2] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00572.html
>>>
>>> 2016-10-18  Kyrylo Tkachov  
>>>
>>>  PR middle-end/22141
>>>  * Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
>>>  * common.opt (fstore-merging): New Optimization option.
>>>  * opts.c (default_options_table): Add entry for
>>>  OPT_ftree_store_merging.
>>>  * fold-const.h (can_native_encode_type_p): Declare prototype.
>>>  * fold-const.c (can_native_encode_type_p): Define.
>>>  * params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
>>>  * passes.def: Insert pass_tree_store_merging.
>>>  * tree-pass.h (make_pass_store_merging): Declare extern
>>>  prototype.
>>>  * gimple-ssa-store-merging.c: New file.
>>>  * doc/invoke.texi (Optimization Options): Document
>>>  -fstore-merging.
>>>
>>> 2016-10-18  Kyrylo Tkachov  
>>>  Jakub Jelinek  
>>>  Andrew Pinski  

Re: [PATCH][v6] GIMPLE store merging pass

2016-10-21 Thread Kyrill Tkachov

Hi Richard,

On 21/10/16 13:37, Richard Biener wrote:

On Tue, 18 Oct 2016, Kyrill Tkachov wrote:


Hi Richard,

This patch is a merge of [1] and [2] and implements the manual merging of
bitfields
as outlined in [1] but actually makes it work on BYTES_BIG_ENDIAN too.
It caused me a lot of headeache because the bit offset is counted from the
most significant bit
in the byte, even though BITS_BIG_ENDIAN was 0 (BITS_BIG_ENDIAN looks
irrelevant for store merging
anyway as it's just used to described RTL extract operations).
I've included ASCII diagrams of the steps in the merging algorithm.

Heh, thanks.


Bootstrapped and tested on arm-none-linux-gnueabihf, aarch64-none-linux-gnu,
x86_64-unknown-linux-gnu.
Also tested on aarch64_be-none-elf.

How does this version look now?

Mostly good.  For

+bool
+pass_store_merging::terminate_all_aliasing_chains (tree dest, tree base,
+  gimple *stmt)
+{
...
+  /* Check if the assignment destination (BASE) is part of a store chain.
+ This is to catch non-constant stores to destinations that may be
part
+ of a chain.  */
+  if (base)
+{
+  chain_info = m_stores.get (base);
+  if (chain_info)
+   {
+ struct store_immediate_info *info;
+ unsigned int i;
+ FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
+   {
+ if (refs_may_alias_p (info->dest, dest))
+   {

I suppose the chain is not yet sorted in any way?

At least for 'dest' which do not have a known constant offset we
could do

if (base)
  terminate_and_release_chain (base);


Do you mean when get_inner_reference returns non-NULL for POFFSET?
Or do you think we should try to look into dest in this function?


to speed things up?  IIRC we do not terminate chains early in
this phase when we have enough stores to form a group, so
writing a testcase that triggers quadraticness would be as simple
as having

char a[100];

void foo ()
{
  a[0] = 1;
  a[1] = 2;
  
  a[999] = 3;
}

?

so I think you probably want to limit the number of stores you
ever put onto a chain and if you reach that limit, terminate
and release it?  Like just choose 16 or 64?  (and experiment
with the above kind of testcases)


I was initially thinking of imposing such a limit as well but
later I thought we'd want to extend the output code to be able to emit
a memcpy (or memset) call for large regions, so detecting the largest possible
regions would be needed. But that is not implemented yet (though I have 
experimented
with it) so I can add a limit here. Should I just hardcode a limit or should I 
make it
into a --param (MAX_STMTS_IN_STORE_MERGING_CHAIN or something)?



+ bit_off = byte_off << LOG2_BITS_PER_UNIT;
+ if (!wi::neg_p (bit_off) && wi::fits_shwi_p (bit_off))
+   {
+ bitpos += bit_off.to_shwi ();
+

I think you want bit_off += bitpos before the fits_shwi check
otherwise this add may still overflow.

+ base_addr = copy_node (base_addr);
+ TREE_OPERAND (base_addr, 1)
+   = build_zero_cst (TREE_TYPE (TREE_OPERAND (
+  base_addr, 1)));

I'd prefer

   base_addr = build2 (MEM_REF, ...);

here.


Thanks for the feedback,
Kyrill



Thanks,
Richard.


Thanks,
Kyrill

[1] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00573.html
[2] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00572.html

2016-10-18  Kyrylo Tkachov  

 PR middle-end/22141
 * Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
 * common.opt (fstore-merging): New Optimization option.
 * opts.c (default_options_table): Add entry for
 OPT_ftree_store_merging.
 * fold-const.h (can_native_encode_type_p): Declare prototype.
 * fold-const.c (can_native_encode_type_p): Define.
 * params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
 * passes.def: Insert pass_tree_store_merging.
 * tree-pass.h (make_pass_store_merging): Declare extern
 prototype.
 * gimple-ssa-store-merging.c: New file.
 * doc/invoke.texi (Optimization Options): Document
 -fstore-merging.

2016-10-18  Kyrylo Tkachov  
 Jakub Jelinek  
 Andrew Pinski  

 PR middle-end/22141
 PR rtl-optimization/23684
 * gcc.c-torture/execute/pr22141-1.c: New test.
 * gcc.c-torture/execute/pr22141-2.c: Likewise.
 * gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
 * gcc.target/aarch64/ldp_stp_4.c: Likewise.
 * gcc.dg/store_merging_1.c: New test.
 * gcc.dg/store_merging_2.c: Likewise.
 * gcc.dg/store_merging_3.c: Likewise.
 * gcc.dg/store_merging_4.c: Likewise.
 * gcc.dg/store_merging_5.c: Likewise.
 * gcc.dg/store_merging_6.c: Likewise.
 * gcc.dg/store_merging_7.c: Likewise.
 * 

Re: [PATCH][v6] GIMPLE store merging pass

2016-10-21 Thread Richard Biener
On Tue, 18 Oct 2016, Kyrill Tkachov wrote:

> Hi Richard,
> 
> This patch is a merge of [1] and [2] and implements the manual merging of
> bitfields
> as outlined in [1] but actually makes it work on BYTES_BIG_ENDIAN too.
> It caused me a lot of headeache because the bit offset is counted from the
> most significant bit
> in the byte, even though BITS_BIG_ENDIAN was 0 (BITS_BIG_ENDIAN looks
> irrelevant for store merging
> anyway as it's just used to described RTL extract operations).
> I've included ASCII diagrams of the steps in the merging algorithm.

Heh, thanks.

> Bootstrapped and tested on arm-none-linux-gnueabihf, aarch64-none-linux-gnu,
> x86_64-unknown-linux-gnu.
> Also tested on aarch64_be-none-elf.
> 
> How does this version look now?

Mostly good.  For

+bool
+pass_store_merging::terminate_all_aliasing_chains (tree dest, tree base,
+  gimple *stmt)
+{
...
+  /* Check if the assignment destination (BASE) is part of a store chain.
+ This is to catch non-constant stores to destinations that may be 
part
+ of a chain.  */
+  if (base)
+{
+  chain_info = m_stores.get (base);
+  if (chain_info)
+   {
+ struct store_immediate_info *info;
+ unsigned int i;
+ FOR_EACH_VEC_ELT ((*chain_info)->m_store_info, i, info)
+   {
+ if (refs_may_alias_p (info->dest, dest))
+   {

I suppose the chain is not yet sorted in any way?

At least for 'dest' which do not have a known constant offset we
could do

   if (base)
 terminate_and_release_chain (base);

to speed things up?  IIRC we do not terminate chains early in 
this phase when we have enough stores to form a group, so
writing a testcase that triggers quadraticness would be as simple
as having

char a[100];

void foo ()
{
 a[0] = 1;
 a[1] = 2;
 
 a[999] = 3;
}

?

so I think you probably want to limit the number of stores you
ever put onto a chain and if you reach that limit, terminate
and release it?  Like just choose 16 or 64?  (and experiment
with the above kind of testcases)

+ bit_off = byte_off << LOG2_BITS_PER_UNIT;
+ if (!wi::neg_p (bit_off) && wi::fits_shwi_p (bit_off))
+   {
+ bitpos += bit_off.to_shwi ();
+

I think you want bit_off += bitpos before the fits_shwi check
otherwise this add may still overflow.

+ base_addr = copy_node (base_addr);
+ TREE_OPERAND (base_addr, 1)
+   = build_zero_cst (TREE_TYPE (TREE_OPERAND (
+  base_addr, 1)));

I'd prefer

  base_addr = build2 (MEM_REF, ...);

here.

Thanks,
Richard.

> Thanks,
> Kyrill
> 
> [1] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00573.html
> [2] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00572.html
> 
> 2016-10-18  Kyrylo Tkachov  
> 
> PR middle-end/22141
> * Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
> * common.opt (fstore-merging): New Optimization option.
> * opts.c (default_options_table): Add entry for
> OPT_ftree_store_merging.
> * fold-const.h (can_native_encode_type_p): Declare prototype.
> * fold-const.c (can_native_encode_type_p): Define.
> * params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
> * passes.def: Insert pass_tree_store_merging.
> * tree-pass.h (make_pass_store_merging): Declare extern
> prototype.
> * gimple-ssa-store-merging.c: New file.
> * doc/invoke.texi (Optimization Options): Document
> -fstore-merging.
> 
> 2016-10-18  Kyrylo Tkachov  
> Jakub Jelinek  
> Andrew Pinski  
> 
> PR middle-end/22141
> PR rtl-optimization/23684
> * gcc.c-torture/execute/pr22141-1.c: New test.
> * gcc.c-torture/execute/pr22141-2.c: Likewise.
> * gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
> * gcc.target/aarch64/ldp_stp_4.c: Likewise.
> * gcc.dg/store_merging_1.c: New test.
> * gcc.dg/store_merging_2.c: Likewise.
> * gcc.dg/store_merging_3.c: Likewise.
> * gcc.dg/store_merging_4.c: Likewise.
> * gcc.dg/store_merging_5.c: Likewise.
> * gcc.dg/store_merging_6.c: Likewise.
> * gcc.dg/store_merging_7.c: Likewise.
> * gcc.target/i386/pr22141.c: Likewise.
> * gcc.target/i386/pr34012.c: Add -fno-store-merging to dg-options.
> * g++.dg/init/new17.C: Likewise.
> 

-- 
Richard Biener 
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)


[PATCH][v6] GIMPLE store merging pass

2016-10-18 Thread Kyrill Tkachov

Hi Richard,

This patch is a merge of [1] and [2] and implements the manual merging of 
bitfields
as outlined in [1] but actually makes it work on BYTES_BIG_ENDIAN too.
It caused me a lot of headeache because the bit offset is counted from the most 
significant bit
in the byte, even though BITS_BIG_ENDIAN was 0 (BITS_BIG_ENDIAN looks 
irrelevant for store merging
anyway as it's just used to described RTL extract operations).
I've included ASCII diagrams of the steps in the merging algorithm.

Bootstrapped and tested on arm-none-linux-gnueabihf, aarch64-none-linux-gnu, 
x86_64-unknown-linux-gnu.
Also tested on aarch64_be-none-elf.

How does this version look now?

Thanks,
Kyrill

[1] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00573.html
[2] https://gcc.gnu.org/ml/gcc-patches/2016-10/msg00572.html

2016-10-18  Kyrylo Tkachov  

PR middle-end/22141
* Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
* common.opt (fstore-merging): New Optimization option.
* opts.c (default_options_table): Add entry for
OPT_ftree_store_merging.
* fold-const.h (can_native_encode_type_p): Declare prototype.
* fold-const.c (can_native_encode_type_p): Define.
* params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
* passes.def: Insert pass_tree_store_merging.
* tree-pass.h (make_pass_store_merging): Declare extern
prototype.
* gimple-ssa-store-merging.c: New file.
* doc/invoke.texi (Optimization Options): Document
-fstore-merging.

2016-10-18  Kyrylo Tkachov  
Jakub Jelinek  
Andrew Pinski  

PR middle-end/22141
PR rtl-optimization/23684
* gcc.c-torture/execute/pr22141-1.c: New test.
* gcc.c-torture/execute/pr22141-2.c: Likewise.
* gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
* gcc.target/aarch64/ldp_stp_4.c: Likewise.
* gcc.dg/store_merging_1.c: New test.
* gcc.dg/store_merging_2.c: Likewise.
* gcc.dg/store_merging_3.c: Likewise.
* gcc.dg/store_merging_4.c: Likewise.
* gcc.dg/store_merging_5.c: Likewise.
* gcc.dg/store_merging_6.c: Likewise.
* gcc.dg/store_merging_7.c: Likewise.
* gcc.target/i386/pr22141.c: Likewise.
* gcc.target/i386/pr34012.c: Add -fno-store-merging to dg-options.
* g++.dg/init/new17.C: Likewise.
diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index d6e44e4b77990b9cd48ce31f657e48c44191ade1..02632ea70483224623ac7cb2b97d18686d3a4a3c 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1296,6 +1296,7 @@ OBJS = \
 	gimple-ssa-isolate-paths.o \
 	gimple-ssa-nonnull-compare.o \
 	gimple-ssa-split-paths.o \
+	gimple-ssa-store-merging.o \
 	gimple-ssa-strength-reduction.o \
 	gimple-ssa-sprintf.o \
 	gimple-streamer-in.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 6f24f568d313bbf8f35e8809639ba4d976b57765..f5dc86b6376ded8f2f77313d75f41572eb0e77ad 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1463,6 +1463,10 @@ fstrict-volatile-bitfields
 Common Report Var(flag_strict_volatile_bitfields) Init(-1) Optimization
 Force bitfield accesses to match their type width.
 
+fstore-merging
+Common Report Var(flag_store_merging) Optimization
+Merge adjacent stores.
+
 fguess-branch-probability
 Common Report Var(flag_guess_branch_prob) Optimization
 Enable guessing of branch probabilities.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 795ad1b63a914de50113cd83a1eb7094306f0634..85cedf87cf135e5eb7d53a6ebfd308e19dec7830 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -404,7 +404,7 @@ Objective-C and Objective-C++ Dialects}.
 -fsingle-precision-constant -fsplit-ivs-in-unroller @gol
 -fsplit-paths @gol
 -fsplit-wide-types -fssa-backprop -fssa-phiopt @gol
--fstdarg-opt -fstrict-aliasing @gol
+-fstdarg-opt -fstore-merging -fstrict-aliasing @gol
 -fstrict-overflow -fthread-jumps -ftracer -ftree-bit-ccp @gol
 -ftree-builtin-call-dce -ftree-ccp -ftree-ch @gol
 -ftree-coalesce-vars -ftree-copy-prop -ftree-dce -ftree-dominator-opts @gol
@@ -415,8 +415,8 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-loop-vectorize @gol
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
--ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
--ftree-vectorize -ftree-vrp -funconstrained-commons @gol
+-ftree-switch-conversion -ftree-tail-merge @gol
+-ftree-ter -ftree-vectorize -ftree-vrp -funconstrained-commons @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-math-optimizations -funswitch-loops @gol
 -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol
@@ -6668,6 +6668,7 @@ compilation time.
 -fsplit-wide-types @gol
 -fssa-backprop @gol
 -fssa-phiopt @gol
+-fstore-merging @gol
 -ftree-bit-ccp @gol
 -ftree-ccp @gol
 -ftree-ch @gol
@@ -8010,6 +8011,13 @@ Perform scalar replacement of aggregates.  This pass replaces structure
 references with