[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

--- Comment #6 from Andrew Pinski ---
This is what we produce on the gimple level:
  abc = "abc";
  ab = "ab";
  a = "a";

Interestingly, no compiler I have looked at does this optimization for this
code.
[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

--- Comment #5 from Peter Cordes ---
(In reply to Jakub Jelinek from comment #4)
> As for this exact one, I'm now working on GIMPLE store merging
> improvements, but that of course won't handle this case.
> For RTL I had code to handle this at RTL DSE time, see PR22141 and
> https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html
> The problem was that the patch caused performance regressions on PowerPC and
> it was hard to find a good cost model for it.  Of course, for -Os the cost
> model would be quite simple, but although you count instructions, you were
> reporting this for -O3.

Yeah, fewer total stores, fewer instructions, and smaller code size *is* what
makes this better for performance.  An 8-byte store that doesn't cross a
cache-line boundary has nearly identical cost to a 1-byte store, at least on
Intel.

x86 is robust with overlapping stores, although store-forwarding only works
for loads that get all their data from one store (and even then some CPUs have
some alignment restrictions for the load relative to the store).  Still, that
generally means that fewer wider stores are better, because most CPUs can
forward from a 4B store to a byte reload of any of those 4 bytes.

> Doing this at GIMPLE time is impossible, because it is extremely complex
> where exactly the variables are allocated, depends on many flags etc. (e.g.
> -fsanitize=address allocates pads in between them, some targets allocate
> them from top to bottom, others the other way around, ...),

Allocation order is fixed for a given target?

Ideally we'd allocate locals to pack them together well to avoid wasted
padding, and/or put ones used together next to each other for possible SIMD
(including non-loop XMM stuff like a pair of `double`s or copying a group of
integer locals into a struct).  (In case of a really large local array, you
want variables used together in the same page and same cache line.)

Considering all the possibilities might be computationally infeasible though,
especially if the typical gains are small.

> -fstack-protector* might protect some but not others and thus allocate in
> different buckets, alignment could play roles etc.

Anyway, it sounds like it would make more sense to look for possibilities like
this in RTL when deciding how to lay out the local variables.

For x86 it seems gcc sorts them by size?  Changing the order of declaration
changes the order of the stores, but not the locations.
[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

--- Comment #4 from Jakub Jelinek ---
(In reply to Peter Cordes from comment #2)
> Are bug reports like this useful at all?  It seems that a good fraction of
> the missed-optimization bugs I file are things that gcc doesn't really have
> the infrastructure to find.  I'm hoping it's helping to improve gcc in the
> long run, at least.  I guess I could try to learn more about gcc internals
> to find out why it misses them on my own before filing, but either way it
> seems potentially useful to document efficient asm possibilities even if
> gcc's current design makes it hard to take advantage.

I think they are useful, even if some of them are never resolved and some
perhaps take a few years; thanks for the reports.  For some of the reports
we'll find out we have the infrastructure and can do it easily, for others it
is possible to add infrastructure etc., but some will remain hard if we want
to keep the compiler maintainable and supporting multiple architectures.  In
some cases fixing something requires very early detailed knowledge about the
architecture, when at that point we have only approximate costs tuned from
large amounts of code, etc.

As for this exact one, I'm now working on GIMPLE store merging improvements,
but that of course won't handle this case.
For RTL I had code to handle this at RTL DSE time, see PR22141 and
https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html
The problem was that the patch caused performance regressions on PowerPC and
it was hard to find a good cost model for it.  Of course, for -Os the cost
model would be quite simple, but although you count instructions, you were
reporting this for -O3.

Doing this at GIMPLE time is impossible, because it is extremely complex where
exactly the variables are allocated, depends on many flags etc. (e.g.
-fsanitize=address allocates pads in between them, some targets allocate them
from top to bottom, others the other way around, ...), -fstack-protector*
might protect some but not others and thus allocate in different buckets,
alignment could play roles etc.
[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

--- Comment #3 from Peter Cordes ---
Oh also, why is MSP430 using 3 byte-stores instead of a mov.w + mov.b for
storing ab[]?  (on the godbolt link in the initial report)

        # msp430-gcc 6.2.1.16) 6.2.1 20161212
        MOV.W   #25185, 6(R1)
        MOV.W   #99, 8(R1)      # abc[]
        MOV.B   #97, 3(R1)
        MOV.B   #98, 4(R1)
        MOV.B   #0, 5(R1)       # ab[]
        MOV.B   #97, 1(R1)
        MOV.B   #0, 2(R1)       # a[]

Even if alignment is required (IDK), either the first two or last two mov.b
instructions for ab[] could combine into a mov.w, as is done for abc[].  Is
that a target bug?

MSP430 is on Godbolt and it's not a RISC with word size > largest immediate,
so I was looking at it to see if it was just an x86 missed optimization.

Like I was saying for ARM, gcc seems to do a poor job on many RISC ISAs with
this, given the redundancy between strings.
[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

--- Comment #2 from Peter Cordes ---
(In reply to Richard Biener from comment #1)
> The issue is we have no merging of stores at the RTL level and the GIMPLE
> level doesn't know whether the variables will end up allocated next to each
> other.

Are bug reports like this useful at all?  It seems that a good fraction of the
missed-optimization bugs I file are things that gcc doesn't really have the
infrastructure to find.  I'm hoping it's helping to improve gcc in the long
run, at least.  I guess I could try to learn more about gcc internals to find
out why it misses them on my own before filing, but either way it seems
potentially useful to document efficient asm possibilities even if gcc's
current design makes it hard to take advantage.

Anyway, could GIMPLE notice that multiple small objects are being written and
hint to RTL that it would be useful to allocate them in a certain way?  (And
give RTL a merged store that RTL would have to split if it decides not to?)

Or a more conservative approach could still be an improvement.  Can RTL
realize that it can use 4-byte stores that overlap into not-yet-initialized or
otherwise dead memory?  For -march=haswell or generic we get

        movl    $97, %edx
        movl    $25185, %eax    # avoid an LCP stall on Nehalem or earlier
        movw    %dx, 7(%rsp)
        ... lea
        movl    $6513249, 12(%rsp)
        movw    %ax, 9(%rsp)
        movb    $0, 11(%rsp)

This is pretty bad for code-size.  The following would do the same thing with
no merging between objects, just knowing when to allow overlap into other
objects:

        movl    $0x61, 7(%rsp)  # imm32 still shorter than a mov imm32 -> reg and 16-bit store
        movl    $0x6261, 9(%rsp)
        movl    $0x636261, 12(%rsp)

(Teaching gcc that mov $imm16 is safe on Sandybridge-family is a separate bug,
I guess.  It's only other instructions with an imm16 that LCP stall, unlike on
Nehalem and earlier where mov $imm16 is a problem too.  Silvermont marks
instruction lengths in the cache to avoid LCP stalls entirely, and gcc knows
that.)
[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-10-26
          Component|tree-optimization           |rtl-optimization
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener ---
The issue is we have no merging of stores at the RTL level and the GIMPLE
level doesn't know whether the variables will end up allocated next to each
other.