[Bug target/117438] x86's pass_align_tight_loops may cause performance regression in nested loops

2024-11-19 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

--- Comment #6 from GCC Commits  ---
The master branch has been updated by Hongtao Liu:

https://gcc.gnu.org/g:6350e956d1a74963a62bedabef3d4a1a3f2d4852

commit r15-5489-g6350e956d1a74963a62bedabef3d4a1a3f2d4852
Author: MayShao-oc 
Date:   Thu Nov 7 10:57:02 2024 +0800

Add microarchitecture tunable for pass_align_tight_loops [PR117438]

Hi Hongtao:
   Add m_CASCADELAKE and m_SKYLAKE_AVX512.
   Place X86_TUNE_ALIGN_TIGHT_LOOPS in the appropriate section.

   Bootstrapped on x86_64.
   OK for trunk?
BR
Mayshao
gcc/ChangeLog:

PR target/117438
* config/i386/i386-features.cc (TARGET_ALIGN_TIGHT_LOOPS): Default
to true for all processors except m_ZHAOXIN, m_CASCADELAKE, and
m_SKYLAKE_AVX512.
* config/i386/i386.h (TARGET_ALIGN_TIGHT_LOOPS): New macro.
* config/i386/x86-tune.def (X86_TUNE_ALIGN_TIGHT_LOOPS): New tune.
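
For context, the tunable described in this ChangeLog follows GCC's usual x86
pattern: a DEF_TUNE entry in x86-tune.def plus a TARGET_* accessor macro in
i386.h. The snippet below is only an illustrative sketch of that pattern, not
the literal committed diff (see r15-5489 for the authoritative change); the
exact comment text and mask spelling may differ.

/* Sketch only: modeled on GCC's existing x86 tuning entries.  */

/* config/i386/x86-tune.def: enable tight-loop alignment on every uarch
   except the ones where it was observed to regress.  */
DEF_TUNE (X86_TUNE_ALIGN_TIGHT_LOOPS, "align_tight_loops",
          ~(m_ZHAOXIN | m_CASCADELAKE | m_SKYLAKE_AVX512))

/* config/i386/i386.h: accessor consulted by pass_align_tight_loops.  */
#define TARGET_ALIGN_TIGHT_LOOPS \
  ix86_tune_features[X86_TUNE_ALIGN_TIGHT_LOOPS]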

[Bug target/117438] x86's pass_align_tight_loops may cause performance regression in nested loops

2024-11-05 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

--- Comment #5 from Hongtao Liu  ---
I can reproduce a ~30% regression on CLX; the aligned case is more
frontend-bound. It's uarch-specific, so I'll make it a uarch tunable.

[Bug target/117438] x86's pass_align_tight_loops may cause performance regression in nested loops

2024-11-04 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

--- Comment #4 from Hongtao Liu  ---
(In reply to Mayshao-oc from comment #0)
> Created attachment 59530 [details]
> gcc -O1 loop.c
> 
> Pass_align_tight_loops aligns the inner loop aggressively, which may cause
> a significant performance regression for some nested loops. The attached
> loop.c can be compiled with gcc -O1 to reproduce the scenario.

For the testcase, on SPR, align is 25% better than no_align. Looks like the
nops are not an issue on SPR.
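
(The actual attachment 59530 is not reproduced in this mail. Purely as a
hypothetical illustration of the kind of testcase under discussion, a tiny,
hot inner loop nested inside an outer loop and compiled with gcc -O1 looks
roughly like the sketch below; pass_align_tight_loops pads such an inner loop
out to a cache-line boundary, i.e. .p2align 6, inserting nops on the outer
loop's path.)

/* Hypothetical stand-in, NOT attachment 59530.  Compile with gcc -O1.  */
long
sum (const int *a, long n, long m)
{
  long s = 0;
  for (long i = 0; i < n; i++)      /* outer loop */
    for (long j = 0; j < m; j++)    /* small, hot inner loop */
      s += a[j];
  return s;
}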

[Bug target/117438] x86's pass_align_tight_loops may cause performance regression in nested loops

2024-11-04 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu  ---
(In reply to Andrew Pinski from comment #1)
> >this may cause significant performance regression of some nested loops.
> 
> I suspect it depends on the micro-arch for the x86 target.
> 
> What are you running the test on?
> 
> .p2align 6
> .L3:
> 
> I notice GCC aligns only the inner loop to a 64-byte boundary, while
> clang/LLVM aligns both loops (inner and outer) to a 16-byte boundary.

It aligns the inner loop to a cache-line boundary, which depends on the
micro-arch.

Yes, we were worried about such regressions before, but didn't observe any in
SPEC or other workloads.

Maybe we can provide an option to control the optimization while keeping it on
by default (and also try to implement some heuristic to prevent regressions
from such inner loops).

[Bug target/117438] x86's pass_align_tight_loops may cause performance regression in nested loops

2024-11-04 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

--- Comment #2 from Andrew Pinski  ---
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651699.html

[Bug target/117438] x86's pass_align_tight_loops may cause performance regression in nested loops

2024-11-04 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117438

--- Comment #1 from Andrew Pinski  ---
>this may cause significant performance regression of some nested loops.

I suspect it depends on the micro-arch for the x86 target.

What are you running the test on?

.p2align 6
.L3:

I notice GCC aligns only the inner loop to a 64-byte boundary, while clang/LLVM
aligns both loops (inner and outer) to a 16-byte boundary.