[Bug target/117438] x86's pass_align_tight_loops may cause performance regression in nested loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

--- Comment #6 from GCC Commits ---
The master branch has been updated by hongtao Liu:

https://gcc.gnu.org/g:6350e956d1a74963a62bedabef3d4a1a3f2d4852

commit r15-5489-g6350e956d1a74963a62bedabef3d4a1a3f2d4852
Author: MayShao-oc
Date:   Thu Nov 7 10:57:02 2024 +0800

    Add microarchtecture tunable for pass_align_tight_loops [PR117438]

    Hi Hongtao:
       Add m_CASCADELAKE, and m_SKYLAKE_AVX512.
       Place X86_TUNE_ALIGN_TIGHT_LOOPS in the appropriate section.
       Bootstrapped X86_64.
       Ok for trunk?
    BR
    Mayshao

    gcc/ChangeLog:

            PR target/117438
            * config/i386/i386-features.cc (TARGET_ALIGN_TIGHT_LOOPS):
            default true in all processors except for m_ZHAOXIN,
            m_CASCADELAKE, and m_SKYLAKE_AVX512.
            * config/i386/i386.h (TARGET_ALIGN_TIGHT_LOOPS): New Macro.
            * config/i386/x86-tune.def (X86_TUNE_ALIGN_TIGHT_LOOPS): New tune
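
The ChangeLog describes a tune that is on for every processor except three. A minimal sketch of that "everything except these" mask idea follows. The bit values, the upper-case M_* names, M_SAPPHIRERAPIDS, ALIGN_TIGHT_LOOPS_MASK and align_tight_loops_p are all hypothetical stand-ins chosen for illustration; they are not GCC's actual definitions in x86-tune.def or i386.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical processor bits.  GCC's real m_* masks are defined
   elsewhere and differently; these exist only for this sketch.  */
enum
{
  M_ZHAOXIN        = 1 << 0,
  M_CASCADELAKE    = 1 << 1,
  M_SKYLAKE_AVX512 = 1 << 2,
  M_SAPPHIRERAPIDS = 1 << 3
};

/* Assumed shape of the tune: enabled everywhere except the three
   processors named in the ChangeLog.  */
static const uint32_t ALIGN_TIGHT_LOOPS_MASK
  = ~(uint32_t) (M_ZHAOXIN | M_CASCADELAKE | M_SKYLAKE_AVX512);

/* Return true if the tune applies to the processor bit TUNED_CPU.  */
static inline bool
align_tight_loops_p (uint32_t tuned_cpu)
{
  return (ALIGN_TIGHT_LOOPS_MASK & tuned_cpu) != 0;
}
```

Under these assumptions, align_tight_loops_p (M_CASCADELAKE) is false and align_tight_loops_p (M_SAPPHIRERAPIDS) is true, which matches the CLX and SPR measurements reported in comments #5 and #4.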
[Bug target/117438] x86's pass_align_tight_loops may cause performance regression in nested loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

--- Comment #5 from Hongtao Liu ---
I reproduced a 30% regression on CLX; the aligned case is more frontend-bound.
This is uarch-specific, so I will make it a uarch tune.
[Bug target/117438] x86's pass_align_tight_loops may cause performance regression in nested loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

--- Comment #4 from Hongtao Liu ---
(In reply to Mayshao-oc from comment #0)
> Created attachment 59530 [details]
> gcc -O1 loop.c
>
> Pass_align_tight_loops align the inner loop aggressively, this may cause
> significant performance regression of some nested loops. The attached loop.c
> could be compiled by gcc -O1 to reproduce the scenario.

For the testcase, on SPR, align is 25% better than no_align.
Looks like nops are not an issue on SPR.
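
For readers without the attachment, here is a minimal sketch of the loop shape under discussion: a short inner loop entered once per outer iteration. It is a hypothetical stand-in, not the attached loop.c, and nested_sum is an invented name. Whether the pass pads this particular inner loop depends on the GCC version and target tuning.

```c
#include <stdint.h>

/* Hypothetical nested loop: the inner loop is tiny and is re-entered
   on every outer iteration, so any padding placed in front of its
   header lies on the path taken each time the outer loop comes
   around.  */
uint64_t
nested_sum (uint32_t outer, uint32_t inner)
{
  uint64_t sum = 0;
  for (uint32_t i = 0; i < outer; i++)
    {
      sum += i;
      for (uint32_t j = 0; j < inner; j++)
        sum += j ^ i;
    }
  return sum;
}
```

Compiling with gcc -O1 -S and looking for a .p2align directive in front of the inner loop label shows whether the alignment was applied. Timing a large outer count against a small inner count is one way to compare the aligned and unaligned variants on a given microarchitecture.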
[Bug target/117438] x86's pass_align_tight_loops may cause performance regression in nested loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

Hongtao Liu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu ---
(In reply to Andrew Pinski from comment #1)
> >this may cause significant performance regression of some nested loops.
>
> I suspect it depends on the micro-arch for the x86 target.
>
> What are you running the test on?
>
> .p2align 6
> .L3:
>
> I notice GCC aligns only the inner loop to 64 byte boundary while clang/LLVM
> aligns each loop (inner and outer) loops to 16 byte boundary.

It aligns the inner loop to a cache line, whose size depends on the micro-arch.
Yes, we were worried about such regressions before, but didn't observe any in
SPEC or other workloads.
Maybe we can provide an option to control the optimization while keeping it on
by default (and also try to implement a heuristic to prevent regressions in
such inner loops).
[Bug target/117438] x86's pass_align_tight_loops may cause performance regression in nested loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

--- Comment #2 from Andrew Pinski ---
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651699.html
[Bug target/117438] x86's pass_align_tight_loops may cause performance regression in nested loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

--- Comment #1 from Andrew Pinski ---
>this may cause significant performance regression of some nested loops.

I suspect it depends on the micro-arch for the x86 target.

What are you running the test on?

        .p2align 6
.L3:

I notice GCC aligns only the inner loop to a 64-byte boundary, while clang/LLVM
aligns each loop (both inner and outer) to a 16-byte boundary.
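
To make the 64-byte versus 16-byte difference concrete, here is a small sketch of how much padding a .p2align N directive can require at a given address. The helper name padding_bytes and the example addresses are hypothetical; the arithmetic is just rounding up to the next 2^N boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of padding needed to advance ADDR to the next multiple of
   2^LOG2_ALIGN, i.e. the boundary that `.p2align LOG2_ALIGN` asks
   for.  Returns 0 when ADDR is already aligned.  */
static inline uint64_t
padding_bytes (uint64_t addr, unsigned log2_align)
{
  assert (log2_align < 64);
  uint64_t mask = (UINT64_C (1) << log2_align) - 1;
  return (0 - addr) & mask;
}
```

For a loop header that would otherwise land at the made-up address 0x401001, padding_bytes (0x401001, 6) gives 63 bytes of padding while padding_bytes (0x401001, 4) gives 15. The worst case is therefore about four times larger with 64-byte alignment than with 16-byte alignment.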