[Bug middle-end/89922] Loop on fixed size array is not unrolled and poorly optimized at -O2

2019-04-10 Thread JunMa at linux dot alibaba.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89922

JunMa  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |JunMa at linux dot alibaba.com

--- Comment #5 from JunMa  ---
The testcase at https://godbolt.org/z/iKi0pb is well optimized by GCC 6.5 at
-O3, but not by GCC 7 and later.
I checked the GIMPLE dumped by the optimized pass, and it is the same in both.
The difference is introduced by the rtl_cse1 pass.

2019-04-05 Thread antoshkka at gmail dot com

--- Comment #4 from Antony Polukhin  ---
> Was the testcase just an artificial one or does it appear (in this
> isolated form!) in a real application/benchmark?

I was not investigating a particular benchmark or real-world application at
first.

My guess is that the heuristic will affect cryptography (initializing big
arrays with magic constants) and math (matrix multiplication with an identity
matrix, for example).

I tried to check the validity of that guess, and the very first attempt
succeeded. Hash computation for a constant string is not well optimized:
https://godbolt.org/z/iKi0pb The heuristic could notice that the string is a
local variable and force the loop unrolling. Hash computation on a constant
variable is a common case in libstdc++ when working with unordered maps and
sets.

There's definitely some room for improvement for cases where a local variable
is used only in the loop.
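
To illustrate the kind of code meant here, a minimal sketch follows. It is not
the testcase behind the godbolt link above, which is not reproduced in this
thread. The FNV-1a hash, the function names and the string "hello" are all
assumptions chosen for illustration. The point is only that the loop has a
constant trip count and reads nothing but a local constant array.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical example: 32-bit FNV-1a over a fixed-size character array.
// The trip count N - 1 is a compile-time constant.
template <std::size_t N>
constexpr std::uint32_t fnv1a(const char (&s)[N]) {
    std::uint32_t h = 2166136261u;              // FNV offset basis
    for (std::size_t i = 0; i + 1 < N; ++i) {   // skip the terminating '\0'
        h ^= static_cast<unsigned char>(s[i]);
        h *= 16777619u;                         // FNV prime
    }
    return h;
}

// The string is a local constant that is used only inside the loop, so a
// fully unrolled loop could in principle fold to a single constant.
std::uint32_t hash_local_constant() {
    const char key[] = "hello";
    return fnv1a(key);
}
```

Whether a given GCC version folds hash_local_constant() to a constant at -O2
has to be checked in the generated assembly; the sketch makes no claim about
that.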

2019-04-05 Thread rguenther at suse dot de

--- Comment #3 from rguenther at suse dot de  ---
On Thu, 4 Apr 2019, antoshkka at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89922
> 
> --- Comment #2 from Antony Polukhin  ---
> The estimation is very close to the actual result for the loop.
> 
> But it does not take into account the instructions before the loop that are
> eliminated due to unrolling. Some heuristic like "the initialization of the
> local variable goes away for unrolled loops if the variable is rewritten in
> the loop or if the variable is not used outside the loop"

Sure, but you'd need to collect some function-scope information to feed
such heuristics, so that they do not err on the wrong side when those
constraints are not met.

Was the testcase just an artificial one or does it appear (in this
isolated form!) in a real application/benchmark?

2019-04-04 Thread antoshkka at gmail dot com

--- Comment #2 from Antony Polukhin  ---
The estimation is very close to the actual result for the loop.

But it does not take into account the instructions before the loop that are
eliminated due to unrolling. Some heuristic like "the initialization of the
local variable goes away for unrolled loops if the variable is rewritten in
the loop or if the variable is not used outside the loop"
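
As a toy model of that suggestion (not GCC code; every name below is
hypothetical), the size check could subtract the instructions outside the loop
that become dead once the loop is unrolled. With the numbers from comment #1
(loop size 7, estimated size after unrolling 9), the plain check rejects
unrolling. A credit of 2 for removed initialization stores, which is an
assumed figure, would flip the decision.

```cpp
// Toy model only: hypothetical names, not taken from the GCC sources.
struct UnrollEstimate {
    int loop_size;             // "Loop size" from the dump
    int unrolled_size;         // "Estimated size after unrolling"
    int removed_outside_loop;  // instructions before the loop that
                               // unrolling would make dead (assumed input)
};

// Allow unrolling only when the code is not estimated to grow overall.
bool would_unroll_without_growth(const UnrollEstimate& e) {
    return e.unrolled_size - e.removed_outside_loop <= e.loop_size;
}
```

For example, would_unroll_without_growth({7, 9, 0}) is false while
would_unroll_without_growth({7, 9, 2}) is true. As noted in comment #3, the
hard part is computing the third field reliably from function-scope
information.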

2019-04-03 Thread rguenth at gcc dot gnu.org

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-04-03
                 CC|                            |rguenth at gcc dot gnu.org
            Summary|Loop on fixed size array is |Loop on fixed size array is
                   |not unrolled and poorly     |not unrolled and poorly
                   |optimized                   |optimized at -O2
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener  ---
With -O3 I even see this vectorized to

_Z4testi:
.LFB0:
        .cfi_startproc
        movabsq $12884901890, %rdx
        movl    $1, (%rdi)
        movq    %rdi, %rax
        movq    %rdx, 8(%rdi)
        movl    %esi, 4(%rdi)
        movdqu  (%rdi), %xmm0
        paddd   .LC0(%rip), %xmm0
        movl    $8, 16(%rdi)
        movups  %xmm0, (%rdi)
        ret

but no jumps so you must use plain -O2?  Here unrolling is only done
if we estimate the code to not grow but the estimate is

  Loop size: 7
  Estimated size after unrolling: 9
Not unrolling loop 1: size would grow.

so you have to specify -funroll-loops where we get the desired

_Z4testi:
.LFB0:
        .cfi_startproc
        addl    $1, %esi
        movl    $1, (%rdi)
        movq    %rdi, %rax
        movabsq $25769803780, %rdx
        movl    %esi, 4(%rdi)
        movq    %rdx, 8(%rdi)
        movl    $8, 16(%rdi)
        ret

then.  For the unrolling heuristic it's hard to see the (partly)
constant initializer.
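
The original testcase is not quoted in this thread. A source consistent with
the two assembly listings above can be sketched as follows. It is a
reconstruction from the stored constants (12884901890 packs the values 2 and
3, 25769803780 packs 4 and 6) and from the mangled name _Z4testi, so the type
name `array`, the initializer and the loop body are assumptions rather than
the reporter's code.

```cpp
// Reconstructed for illustration; not the reporter's original source.
struct array {
    int data[5];
};

// Loop form: constant trip count over a partly constant local initializer.
array test(int i) {
    array a = {{1, i, 2, 3, 4}};
    for (int j = 0; j < 5; ++j) {
        a.data[j] += j;
    }
    return a;
}

// Hand-unrolled form matching the stores in the -funroll-loops output:
// 1, i + 1, 4, 6, 8.
array test_unrolled(int i) {
    return array{{1, i + 1, 4, 6, 8}};
}

// Returns true when both forms produce the same array for the argument.
bool same_result(int i) {
    const array a = test(i);
    const array b = test_unrolled(i);
    for (int j = 0; j < 5; ++j) {
        if (a.data[j] != b.data[j]) return false;
    }
    return true;
}
```

Only element 1 depends on the argument, which is why the fully unrolled code
needs a single addl and otherwise stores constants.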