[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator with -O3

2023-04-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

--- Comment #7 from Tamar Christina  ---
(In reply to Richard Biener from comment #5)
> (In reply to Tamar Christina from comment #4)
> > (In reply to Richard Biener from comment #3)
> > > The issue isn't unrolling but invariant motion.  We unroll the innermost
> > > loop, vectorize the middle loop and then unroll that as well.  That leaves
> > > us with 64 invariant loads from b[] in the outer loop which I think RTL
> > > opts never "schedule back", even with -fsched-pressure.
> > > 
> > 
> > Aside from the loads, by fully unrolling the inner loop, that means we need
> > 16 unique registers live for the destination every iteration.  That's
> > already half the SIMD register file on AArch64 gone, not counting the
> > invariant loads.
> 
> Why?  You can try -fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning

Oh, I was basing that on the output of the existing testcase when using a lower
loop count, e.g.
template void f<16, 16, 4>

But yes, those options avoid the spills, but of course without them you leave
all the loads inside the loop iteration.

I was hoping we could get closer to https://godbolt.org/z/7c5YfxE5j, which is a
lot better code, i.e. the invariants are moved inside the outer loop.  But yes,
I do understand this may be hard to do automatically.
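The godbolt output itself isn't reproduced in this thread, but the general shape being asked for can be sketched as follows. This is a hypothetical illustration, not the code behind the link: the names, and the choice to hoist per-(i,k) rather than per-i, are assumptions.

```cpp
#include <cassert>

// Sketch: keep the live working set bounded by hoisting the j-invariant
// scalar a[i*K + k] and one row pointer of b[] into locals per (i, k),
// i.e. the invariant b-loads stay inside the outer i loop instead of
// being pulled all the way out (which is what blows up register pressure).
// Names and structure are illustrative, not taken from the godbolt link.
constexpr int N = 16, M = 16, K = 16;

void gemm_hoisted(const float *__restrict a, const float *__restrict b,
                  float *__restrict c) {
  for (int i = 0; i < N; ++i) {
    for (int k = 0; k < K; ++k) {
      const float aik = a[i * K + k];  // invariant across the j loop
      const float *brow = &b[k * M];   // one b row, reloaded per (i, k)
      for (int j = 0; j < M; ++j)
        c[i * M + j] += aik * brow[j];
    }
  }
}
```

The point of the interchange is that at any moment only one row of b[] and one scalar of a[] need to be live, rather than all 64 hoisted vector loads at once.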

> 
> > The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE unrolling
> > and does it at RTL instead.
> 
> ... because on GIMPLE we only can fully unroll or not.

But is this an intrinsic limitation or just because atm we only unroll for SLP?

> 
> > At the moment a way for the user to locally control the unroll amount would
> > already be a good step. I know there's the param, but that's global and
> > typically the unroll factor would depend on the GEMM kernel.
> 
> As said it should already work to the extent that on GIMPLE we do not
> perform classical loop unrolling.

Right, but the RTL unroller produces horrible code, e.g. the addressing modes
are pretty bad.

[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator with -O3

2023-04-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov  ---
(In reply to Richard Biener from comment #5)
> 
> Why?  You can try -fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning

Note that while at -O2 just -fno-tree-loop-im is enough, at -O3 one also needs
-fno-gcse, which otherwise seems to perform unrestricted hoisting of loop
invariants at RTL level.

[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator with -O3

2023-04-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

--- Comment #5 from Richard Biener  ---
(In reply to Tamar Christina from comment #4)
> (In reply to Richard Biener from comment #3)
> > The issue isn't unrolling but invariant motion.  We unroll the innermost
> > loop, vectorize the middle loop and then unroll that as well.  That leaves
> > us with 64 invariant loads from b[] in the outer loop which I think RTL
> > opts never "schedule back", even with -fsched-pressure.
> > 
> 
> Aside from the loads, by fully unrolling the inner loop, that means we need
> 16 unique registers live for the destination every iteration.  That's
> already half the SIMD register file on AArch64 gone, not counting the
> invariant loads.

Why?  You can try -fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning

> > Estimating register pressure on GIMPLE is hard and we heavily rely on
> > "optimistic" transforms with regard to things being optimized in followup
> > passes during the GIMPLE phase.
> 
> Understood, but if we can't do it automatically, is there a way to tell the
> unroller not to fully unroll this?

Like you did ...

> The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE unrolling
> and does it at RTL instead.

... because on GIMPLE we only can fully unroll or not.

> At the moment a way for the user to locally control the unroll amount would
> already be a good step. I know there's the param, but that's global and
> typically the unroll factor would depend on the GEMM kernel.

As said it should already work to the extent that on GIMPLE we do not
perform classical loop unrolling.

[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator with -O3

2023-04-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

--- Comment #4 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> The issue isn't unrolling but invariant motion.  We unroll the innermost
> loop, vectorize the middle loop and then unroll that as well.  That leaves
> us with 64 invariant loads from b[] in the outer loop which I think RTL
> opts never "schedule back", even with -fsched-pressure.
> 

Aside from the loads, by fully unrolling the inner loop, that means we need 16
unique registers live for the destination every iteration.  That's already half
the SIMD register file on AArch64 gone, not counting the invariant loads.

> Estimating register pressure on GIMPLE is hard and we heavily rely on
> "optimistic" transforms with regard to things being optimized in followup
> passes during the GIMPLE phase.

Understood, but if we can't do it automatically, is there a way to tell the
unroller not to fully unroll this?

The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE unrolling
and does it at RTL instead.
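For reference, this is where the pragma sits in the simplified testcase from comment #1. Semantics are unchanged; only the requested unroll behavior differs, and per this discussion the pragma currently suppresses GIMPLE unrolling and defers to RTL instead of partially unrolling on GIMPLE:

```cpp
#include <cassert>
typedef float float32_t;

// '#pragma GCC unroll 8' asks the compiler to unroll the innermost loop
// at most 8 times; as noted above, today this shifts the unrolling from
// GIMPLE to the RTL unroller rather than doing a partial GIMPLE unroll.
template <int N, int M, int K>
void f(const float32_t *__restrict a, const float32_t *__restrict b,
       float32_t *c) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < M; ++j) {
#pragma GCC unroll 8
      for (int k = 0; k < K; ++k)
        c[i * N + j] += a[i * K + k] * b[k * M + j];
    }
  }
}
```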

At the moment a way for the user to locally control the unroll amount would
already be a good step. I know there's the param, but that's global and
typically the unroll factor would depend on the GEMM kernel.

[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator with -O3

2023-04-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

Richard Biener  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2023-04-24

--- Comment #3 from Richard Biener  ---
The issue isn't unrolling but invariant motion.  We unroll the innermost loop,
vectorize the middle loop and then unroll that as well.  That leaves us with
64 invariant loads from b[] in the outer loop which I think RTL opts never
"schedule back", even with -fsched-pressure.

Estimating register pressure on GIMPLE is hard and we heavily rely on
"optimistic" transforms with regard to things being optimized in followup
passes during the GIMPLE phase.
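As a sanity check on the count of 64, assuming 128-bit vectors holding 4 floats (the AArch64 Advanced SIMD case discussed above; the vector width is an illustrative assumption, not stated in the thread):

```cpp
#include <cassert>

constexpr int K = 16;              // innermost loop, fully unrolled
constexpr int M = 16;              // middle loop, vectorized then unrolled
constexpr int floats_per_vec = 4;  // assumed 128-bit vectors of float

// Each of the K scalar a-values multiplies M/4 vector loads from one
// row of b[]; all of these loads are invariant in the outer i loop.
constexpr int invariant_b_loads = K * (M / floats_per_vec);
static_assert(invariant_b_loads == 64, "matches the figure in comment #3");
```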

[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator

2023-04-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

--- Comment #2 from Andrew Pinski  ---
At -O2 we get:

size: 26-3, last_iteration: 2-2
  Loop size: 26
  Estimated size after unrolling: 245
Not unrolling loop 1: size would grow.

With -O3:

size: 20-4, last_iteration: 2-2
  Loop size: 20
  Estimated size after unrolling: 170
/app/example.cpp:8:29: optimized: loop with 16 iterations completely unrolled
(header execution count 63136016)
Exit condition of peeled iterations was eliminated.
Last iteration exit edge was proved true.
Forced exit to be taken: if (0 != 0)



Well, yes, -O3 is known to cause issues like this. I thought it was documented
that -O3 can sometimes regress performance relative to -O2, but I can't find
that documentation either.

[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator

2023-04-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

Andrew Pinski  changed:

   What|Removed |Added

   Keywords|ra  |

--- Comment #1 from Andrew Pinski  ---
Simplified testcase which shows the issue even on x86:
```
typedef float float32_t;

template <int N, int M, int K>
void f(const float32_t *__restrict a, const float32_t *__restrict b,
       float32_t *c) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < M; ++j) {
            for (int k = 0; k < K; ++k) {
                c[i*N + j] += a[i*K + k] * b[k*M + j];
            }
        }
    }
}

template void f<16, 16, 16>(const float32_t *__restrict a,
                            const float32_t *__restrict b, float32_t *c);
```

[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator

2023-04-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

Andrew Pinski  changed:

   What|Removed |Added

   Keywords||ra
   Severity|normal  |enhancement