[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

--- Comment #7 from Tamar Christina ---
(In reply to Richard Biener from comment #5)
> (In reply to Tamar Christina from comment #4)
> > (In reply to Richard Biener from comment #3)
> > > The issue isn't unrolling but invariant motion. We unroll the innermost
> > > loop, vectorize the middle loop and then unroll that as well. That
> > > leaves us with 64 invariant loads from b[] in the outer loop which I
> > > think RTL opts never "schedule back", even with -fsched-pressure.
> >
> > Aside from the loads, by fully unrolling the inner loop, that means we
> > need 16 unique registers live for the destination every iteration. That's
> > already half the SIMD register file on AArch64 gone, not counting the
> > invariant loads.
>
> Why? You can try -fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning

Oh, I was basing that on the output of the existing code using a lower loop
count, e.g. template void f<16, 16, 4>.

But yes, those options avoid the spills, though of course without them you
leave all the loads inside the loop iteration. I was hoping we could get
closer to https://godbolt.org/z/7c5YfxE5j, which is a lot better code, i.e.
with the invariants moved inside the outer loop. But yes, I do understand
this may be hard to do automatically.

> > The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE
> > unrolling and does it at RTL instead.
>
> ... because on GIMPLE we only can fully unroll or not.

But is this an intrinsic limitation, or just because at the moment we only
unroll for SLP?

> > At the moment a way for the user to locally control the unroll amount
> > would already be a good step. I know there's the param, but that's global
> > and typically the unroll factor would depend on the GEMM kernel.
>
> As said it should already work to the extent that on GIMPLE we do not
> perform classical loop unrolling.

Right, but the RTL unroller produces horrible code, e.g. the addressing modes
are pretty bad.
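For concreteness, a minimal sketch (hypothetical code, not the contents of the godbolt link) of the shape being asked for: keep a single accumulator per c[] element live in the loop nest, so the b[] loads stay next to their uses instead of being hoisted ahead of the outer loop.

```cpp
#include <cassert>

// Hypothetical sketch only: same GEMM as the comment #1 testcase, but each
// c[] element is accumulated in a local scalar.  The b[] loads then sit next
// to their uses inside the loop nest rather than being hoisted out of the
// outer loop, keeping the number of simultaneously live values small.
template <int N, int M, int K>
void f_local_acc(const float *__restrict a, const float *__restrict b,
                 float *__restrict c) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < M; ++j) {
      float acc = c[i * N + j];              // one live accumulator per (i, j)
      for (int k = 0; k < K; ++k)
        acc += a[i * K + k] * b[k * M + j];  // b[] loaded at its use
      c[i * N + j] = acc;
    }
  }
}

// Reference: the original triple loop from comment #1, for checking.
template <int N, int M, int K>
void f_ref(const float *__restrict a, const float *__restrict b, float *c) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < M; ++j)
      for (int k = 0; k < K; ++k)
        c[i * N + j] += a[i * K + k] * b[k * M + j];
}
```

Whether the compiler actually keeps `acc` in a register is of course still up to the register allocator; the point is only that nothing forces 64 b[] values live across the outer loop.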
[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

Alexander Monakov changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov ---
(In reply to Richard Biener from comment #5)
> Why? You can try -fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning

Note that while at -O2 just -fno-tree-loop-im is enough, at -O3 one also
needs -fno-gcse, which otherwise seems to perform unrestricted hoisting of
loop invariants at the RTL level.
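Spelled out as command lines (illustrative only; `example.cpp` stands for the comment #1 testcase, and the exact set of flags needed may shift between GCC versions):

```
# -O2: disabling GIMPLE loop-invariant motion alone keeps the b[] loads in place
g++ -O2 -fno-tree-loop-im example.cpp -S

# -O3: RTL GCSE re-hoists the invariants, so it must be disabled as well
g++ -O3 -fno-tree-loop-im -fno-gcse example.cpp -S
```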
[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

--- Comment #5 from Richard Biener ---
(In reply to Tamar Christina from comment #4)
> (In reply to Richard Biener from comment #3)
> > The issue isn't unrolling but invariant motion. We unroll the innermost
> > loop, vectorize the middle loop and then unroll that as well. That leaves
> > us with 64 invariant loads from b[] in the outer loop which I think RTL
> > opts never "schedule back", even with -fsched-pressure.
>
> Aside from the loads, by fully unrolling the inner loop, that means we need
> 16 unique registers live for the destination every iteration. That's
> already half the SIMD register file on AArch64 gone, not counting the
> invariant loads.

Why? You can try -fno-tree-pre -fno-tree-loop-im -fno-predictive-commoning

> > Estimating register pressure on GIMPLE is hard and we heavily rely on
> > "optimistic" transforms with regard to things being optimized in followup
> > passes during the GIMPLE phase.
>
> Understood, but if we can't do it automatically, is there a way to tell the
> unroller not to fully unroll this?

Like you did ...

> The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE
> unrolling and does it at RTL instead.

... because on GIMPLE we only can fully unroll or not.

> At the moment a way for the user to locally control the unroll amount
> would already be a good step. I know there's the param, but that's global
> and typically the unroll factor would depend on the GEMM kernel.

As said it should already work to the extent that on GIMPLE we do not
perform classical loop unrolling.
[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

--- Comment #4 from Tamar Christina ---
(In reply to Richard Biener from comment #3)
> The issue isn't unrolling but invariant motion. We unroll the innermost
> loop, vectorize the middle loop and then unroll that as well. That leaves
> us with 64 invariant loads from b[] in the outer loop which I think RTL
> opts never "schedule back", even with -fsched-pressure.

Aside from the loads, by fully unrolling the inner loop, that means we need
16 unique registers live for the destination every iteration. That's already
half the SIMD register file on AArch64 gone, not counting the invariant
loads.

> Estimating register pressure on GIMPLE is hard and we heavily rely on
> "optimistic" transforms with regard to things being optimized in followup
> passes during the GIMPLE phase.

Understood, but if we can't do it automatically, is there a way to tell the
unroller not to fully unroll this?

The #pragma GCC unroll 8 doesn't work as that seems to stop GIMPLE unrolling
and does it at RTL instead.

At the moment a way for the user to locally control the unroll amount would
already be a good step. I know there's the param, but that's global and
typically the unroll factor would depend on the GEMM kernel.
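For reference, the pragma placement being described looks roughly like this (a sketch based on the comment #1 testcase; `f_pragma` is an illustrative name, and per the discussion GCC currently suppresses GIMPLE unrolling here and defers the factor-8 unroll to RTL):

```cpp
#include <cassert>

typedef float float32_t;

// Sketch: asking GCC for a partial unroll of the inner reduction loop.
// The pragma must directly precede the loop it applies to.
template <int N, int M, int K>
void f_pragma(const float32_t *__restrict a, const float32_t *__restrict b,
              float32_t *c) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < M; ++j) {
#pragma GCC unroll 8  // request factor 8 instead of full unrolling
      for (int k = 0; k < K; ++k)
        c[i * N + j] += a[i * K + k] * b[k * M + j];
    }
  }
}
```

The pragma does not change the computed result, only how the k loop is unrolled, so correctness can be checked against the plain triple loop.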
[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

Richard Biener changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
     Ever confirmed|0           |1
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2023-04-24

--- Comment #3 from Richard Biener ---
The issue isn't unrolling but invariant motion. We unroll the innermost
loop, vectorize the middle loop and then unroll that as well. That leaves us
with 64 invariant loads from b[] in the outer loop which I think RTL opts
never "schedule back", even with -fsched-pressure.

Estimating register pressure on GIMPLE is hard and we heavily rely on
"optimistic" transforms with regard to things being optimized in followup
passes during the GIMPLE phase.
[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

--- Comment #2 from Andrew Pinski ---
At -O2 we get:

```
size: 26-3, last_iteration: 2-2
Loop size: 26
Estimated size after unrolling: 245
Not unrolling loop 1: size would grow.
```

With -O3:

```
size: 20-4, last_iteration: 2-2
Loop size: 20
Estimated size after unrolling: 170
/app/example.cpp:8:29: optimized: loop with 16 iterations completely unrolled
(header execution count 63136016)
Exit condition of peeled iterations was eliminated.
Last iteration exit edge was proved true.
Forced exit to be taken: if (0 != 0)
```

Well, yes, -O3 is known to cause issues like this. I had thought it was
documented that -O3 might sometimes cause performance regressions over -O2,
but I can't find that documentation either.
[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

Andrew Pinski changed:

           What    |Removed |Added
----------------------------------------------------------------------------
           Keywords|ra      |

--- Comment #1 from Andrew Pinski ---
Simplified testcase which shows the issue even on x86:
```
typedef float float32_t;

template <int N, int M, int K>
void f(const float32_t *__restrict a, const float32_t *__restrict b,
       float32_t *c)
{
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < M; ++j) {
      for (int k = 0; k < K; ++k) {
        c[i*N + j] += a[i*K + k] * b[k*M + j];
      }
    }
  }
}

template void f<16, 16, 16>(const float32_t *__restrict a,
                            const float32_t *__restrict b, float32_t *c);
```
[Bug tree-optimization/109587] Deeply nested loop unrolling overwhelms register allocator
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109587

Andrew Pinski changed:

           What    |Removed |Added
----------------------------------------------------------------------------
           Keywords|        |ra
           Severity|normal  |enhancement