[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

2021-08-14 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

--- Comment #6 from Andrew Pinski  ---
This is what we produce on the gimple level:

  abc = "abc";
  ab = "ab";
  a = "a";


Interestingly, no compiler I have looked at does this optimization for this
code.
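The C source is not quoted in this thread, so the following is only a sketch of a test case that would produce the three assignments above. The function names `use_strings` and `init_small_arrays` are hypothetical; in a real experiment the sink would live in another translation unit so the compiler cannot fold the arrays away.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sink: stands in for an external function so that the
   arrays must really exist in memory.  Not taken from the bug report. */
size_t use_strings(const char *abc, const char *ab, const char *a)
{
    assert(abc && ab && a);
    return strlen(abc) + strlen(ab) + strlen(a);
}

size_t init_small_arrays(void)
{
    /* Three small automatic arrays initialized from string literals. */
    char abc[] = "abc";   /* 4 bytes including the terminator */
    char ab[]  = "ab";    /* 3 bytes */
    char a[]   = "a";     /* 2 bytes */
    return use_strings(abc, ab, a);
}
```

Compiling something like this with -O3 and inspecting the stores for each array is the kind of comparison the report is about.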

[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

2017-10-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

--- Comment #5 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #4)
> As for this exact one, I'm now working on GIMPLE store merging
> improvements, but that of course won't handle this case.
> For RTL I had code to handle this at RTL DSE time, see PR22141 and
> https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html
> The problem was that the patch caused performance regressions on PowerPC and
> it was hard to find a good cost model for it.  Of course, for -Os the cost
> model would be quite simple, but although you count instructions, you were
> reporting this for -O3.

Yeah, fewer total stores, fewer instructions, and smaller code size *is* what
makes this better for performance.  An 8-byte store that doesn't cross a
cache-line boundary has nearly identical cost to a 1-byte store at least on
Intel.

x86 is robust with overlapping stores, although store-forwarding only works for
loads that get all their data from one store (and even then some CPUs have some
alignment restrictions for the load relative to the store).  Still, that
generally means that fewer wider stores are better, because most CPUs can
forward from a 4B store to a byte reload of any of those 4 bytes.
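As a rough illustration of "fewer wider stores", the sketch below computes the single 64-bit immediate that would initialize the first 8 of the 9 bytes occupied by the three arrays. It assumes the arrays are adjacent in the order a[], ab[], abc[] (as the x86 stack offsets 7, 9 and 12 elsewhere in this thread suggest), a little-endian target, and an ASCII execution character set. `pack_le64` and `merged_imm64` are hypothetical helper names, not anything in gcc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack up to 8 bytes into the value a little-endian 8-byte store would need. */
uint64_t pack_le64(const unsigned char *bytes, size_t n)
{
    assert(n <= 8);
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)bytes[i] << (8 * i);
    return v;
}

/* Memory image of "a", "ab", "abc" laid out back to back (9 bytes). */
uint64_t merged_imm64(void)
{
    static const unsigned char image[9] = {
        'a', 0,            /* a[]   */
        'a', 'b', 0,       /* ab[]  */
        'a', 'b', 'c', 0   /* abc[] */
    };
    /* One qword store covers the first 8 bytes; the final terminator
       would still need one more (byte or overlapping) store. */
    return pack_le64(image, 8);
}
```

Under those assumptions the merged immediate is 0x6362610062610061, so two stores could replace the five in the -march=haswell output.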


> Doing this at GIMPLE time is impossible, because it is extremely complex
> where exactly the variables are allocated, depends on many flags etc. (e.g.
> -fsanitize=address allocates pads in between them, some targets allocate
> them from top to bottom, others the other way around, ...),

Allocation order is fixed for a given target?  Ideally we'd allocate locals to
pack them together well to avoid wasted padding, and/or put ones used together
next to each other for possible SIMD (including non-loop XMM stuff like a pair
of `double`s or copying a group of integer locals into a struct).  (In case of
a really large local array, you want variables used together in the same page
and same cache line.)

Considering all the possibilities might be computationally infeasible though,
especially if the typical gains are small.

> -fstack-protector* might protect some but not others and thus allocate in
> different buckets, alignment could play roles etc.

Anyway, sounds like it would make more sense to look for possibilities like
this in RTL when deciding how to lay out the local variables.  For x86 it seems
gcc sorts them by size?  Changing the order of declaration changes the order of
the stores, but not the locations.
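One way to check the layout observation empirically is to record the addresses of the locals. This is only a probe sketch: `struct layout`, `probe_layout` and `print_layout` are hypothetical names, and the printed addresses depend on the target, the compiler version and flags such as -fstack-protector or -fsanitize=address.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Addresses of the three locals, kept as integers for comparison. */
struct layout { uintptr_t abc, ab, a; };

struct layout probe_layout(void)
{
    char abc[] = "abc";
    char ab[]  = "ab";
    char a[]   = "a";
    struct layout l = { (uintptr_t)abc, (uintptr_t)ab, (uintptr_t)a };
    /* Distinct live objects never share an address. */
    assert(l.abc != l.ab && l.ab != l.a && l.abc != l.a);
    return l;
}

void print_layout(void)
{
    struct layout l = probe_layout();
    printf("abc=%#" PRIxPTR " ab=%#" PRIxPTR " a=%#" PRIxPTR "\n",
           l.abc, l.ab, l.a);
}
```

Reordering the three declarations and comparing the output of `print_layout` would show whether the locations really stay fixed for a given build.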

[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

2017-10-26 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

--- Comment #4 from Jakub Jelinek  ---
(In reply to Peter Cordes from comment #2)
> Are bug reports like this useful at all?  It seems that a good fraction of
> the missed-optimization bugs I file are things that gcc doesn't really have
> the infrastructure to find.  I'm hoping it's helping to improve gcc in the
> long run, at least.  I guess I could try to learn more about gcc internals
> to find out why it misses them on my own before filing, but either way it
> seems potentially useful to document efficient asm possibilities even if
> gcc's current design makes it hard to take advantage.

I think they are useful, even if some of them are never resolved and others
take a few years; thanks for the reports.  For some of the reports we'll find
out we already have the infrastructure and can do it easily, and for others it
is possible to add the infrastructure.  But some will remain hard if we want to
keep the compiler maintainable and supporting multiple architectures: in some
cases fixing something requires detailed knowledge of the architecture very
early, at a point where we only have approximate costs tuned on large amounts
of code.

As for this exact one, I'm now working on GIMPLE store merging improvements,
but that of course won't handle this case.
For RTL I had code to handle this at RTL DSE time, see PR22141 and
https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html
The problem was that the patch caused performance regressions on PowerPC and it
was hard to find a good cost model for it.  Of course, for -Os the cost model
would be quite simple, but although you count instructions, you were reporting
this for -O3.
Doing this at GIMPLE time is impossible, because where exactly the variables
are allocated is extremely complex and depends on many flags: e.g.
-fsanitize=address allocates pads in between them, some targets allocate them
from top to bottom and others the other way around, -fstack-protector* might
protect some but not others and thus allocate them in different buckets,
alignment could play a role, etc.

[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

2017-10-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

--- Comment #3 from Peter Cordes  ---
Oh also, why is MSP430 using 3 byte-stores instead of a mov.w + mov.b for
storing ab[]?  (on the godbolt link in the initial report)


    # msp430-gcc 6.2.1.16) 6.2.1 20161212
    MOV.W   #25185, 6(R1)
    MOV.W   #99, 8(R1)      # abc[]

    MOV.B   #97, 3(R1)
    MOV.B   #98, 4(R1)
    MOV.B   #0, 5(R1)       # ab[]

    MOV.B   #97, 1(R1)
    MOV.B   #0, 2(R1)       # a[]

Even if alignment is required (IDK), either the first two or last two mov.b
instructions for ab[] could combine into a mov.w, as is done for abc[].  Is
that a target bug?
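To make the immediates in that listing concrete, here is a small sketch that emulates little-endian stores and shows both ways of pairing the ab[] bytes. The helper names are hypothetical. As far as I know MSP430 word accesses must be even-aligned, so with ab[] at 3(R1) only the second pairing (word at 4(R1)) would be usable, but treat that as an assumption to check against the target documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the low `width` bytes of `imm` to `out`, least significant first. */
void store_le(unsigned char *out, uint32_t imm, size_t width)
{
    assert(width <= 4);
    for (size_t i = 0; i < width; i++)
        out[i] = (unsigned char)(imm >> (8 * i));
}

/* Pairing 1: MOV.W for the first two bytes, MOV.B for the terminator. */
void init_ab_word_then_byte(unsigned char ab[3])
{
    store_le(ab, 25185, 2);     /* 25185 == 0x6261 -> 'a', 'b' */
    store_le(ab + 2, 0, 1);     /* terminator */
}

/* Pairing 2: MOV.B for 'a', MOV.W for the last two bytes. */
void init_ab_byte_then_word(unsigned char ab[3])
{
    store_le(ab, 97, 1);        /* 'a' */
    store_le(ab + 1, 98, 2);    /* 98 == 0x0062 -> 'b', 0 */
}
```

The same helper shows that 6513249 (0x636261) stored as 4 bytes is exactly "abc" plus its terminator, which is what the two MOV.W instructions for abc[] write between them.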

MSP430 is on Godbolt and it's not a RISC with word size > largest immediate, so
I was looking at it to see if it was just an x86 missed optimization.

Like I was saying for ARM, gcc seems to do a poor job on many RISC ISAs with
this, given the redundancy between strings.

[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

2017-10-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

--- Comment #2 from Peter Cordes  ---
(In reply to Richard Biener from comment #1)
> The issue is we have no merging of stores at the RTL level and the GIMPLE
> level doesn't know whether the variables will end up allocated next to each
> other.

Are bug reports like this useful at all?  It seems that a good fraction of the
missed-optimization bugs I file are things that gcc doesn't really have the
infrastructure to find.  I'm hoping it's helping to improve gcc in the long
run, at least.  I guess I could try to learn more about gcc internals to find
out why it misses them on my own before filing, but either way it seems
potentially useful to document efficient asm possibilities even if gcc's
current design makes it hard to take advantage.


Anyway, could GIMPLE notice that multiple small objects are being written and
hint to RTL that it would be useful to allocate them in a certain way?  (And
give RTL a merged store that RTL would have to split if it decides not to?)

Or a more conservative approach could still be an improvement.  Can RTL realize
that it can use 4-byte stores that overlap into not-yet-initialized or
otherwise dead memory?

For -march=haswell or generic we get:

    movl    $97, %edx
    movl    $25185, %eax    # avoid an LCP stall on Nehalem or earlier
    movw    %dx, 7(%rsp)
    ... lea
    movl    $6513249, 12(%rsp)
    movw    %ax, 9(%rsp)
    movb    $0, 11(%rsp)

This is pretty bad for code size.  The following would do the same thing with
no merging between objects, just by knowing when to allow overlap into other
objects:

    movl    $0x61, 7(%rsp)      # imm32 still shorter than a mov imm32 -> reg and 16-bit store
    movl    $0x6261, 9(%rsp)
    movl    $0x636261, 12(%rsp)
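The three overlapping stores can be checked with a small simulation. This is a sketch, not compiler code: it models the stack frame as a 16-byte buffer, emulates each `movl $imm32, off(%rsp)` as a 4-byte little-endian store, and uses the offsets 7, 9 and 12 from the listing. `store32` and `overlapping_stores_ok` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emulate a 4-byte little-endian store into a 16-byte frame. */
static void store32(unsigned char *frame, size_t off, uint32_t imm)
{
    assert(off <= 12);   /* the store must stay inside the 16-byte frame */
    for (size_t i = 0; i < 4; i++)
        frame[off + i] = (unsigned char)(imm >> (8 * i));
}

/* Returns 1 if the three overlapping stores leave all arrays correct. */
int overlapping_stores_ok(void)
{
    unsigned char frame[16];
    memset(frame, 0xCC, sizeof frame);   /* poison: pretend uninitialized */

    store32(frame, 7, 0x61);        /* a[] at 7..8, zeros spill into 9..10  */
    store32(frame, 9, 0x6261);      /* ab[] at 9..11, a zero spills into 12 */
    store32(frame, 12, 0x636261);   /* abc[] at 12..15, exact fit           */

    return memcmp(frame + 7, "a", 2) == 0
        && memcmp(frame + 9, "ab", 3) == 0
        && memcmp(frame + 12, "abc", 4) == 0;
}
```

Note that the order matters in this arrangement: each store spills upward into the next object, so the stores have to be issued in ascending address order (or the later objects must be otherwise dead at that point).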


(Teaching gcc that mov $imm16 is safe on Sandybridge-family is a separate bug,
I guess.  It's only other instructions with an imm16 that LCP stall, unlike on
Nehalem and earlier where mov $imm16 is a problem too.  Silvermont marks
instruction lengths in the cache to avoid LCP stalls entirely, and gcc knows
that.)

[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

2017-10-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-10-26
          Component|tree-optimization           |rtl-optimization
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener  ---
The issue is we have no merging of stores at the RTL level and the GIMPLE level
doesn't know whether the variables will end up allocated next to each other.