https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82362
Bug ID: 82362
Summary: [8 Regression] SPEC CPU2006 436.cactusADM ~7% performance deviation with trunk@251713
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: alexander.nesterovskiy at intel dot com
Target Milestone: ---

r251713 brings a reasonable improvement to alloca. However, the patch has a side effect: 436.cactusADM performance became unstable when compiled with
-Ofast -march=core-avx2 -mfpmath=sse -funroll-loops
The impact is more noticeable when compiled with auto-parallelization:
-ftree-parallelize-loops=N

Performance for a particular set of 7 runs (relative to the median performance of r251711):
r251711: 99.5% 99.6% 99.8% 100.0% 100.3% 100.6% 100.6%
r251713: 92.8% 92.9% 93.0% 106.7% 107.0% 107.0% 107.2%

r251711 is pretty stable, while r251713 is about 7% faster on some runs and about 7% slower on others.

There are a few dynamic arrays in the body of the Bench_StaggeredLeapfrog2 subroutine in StaggeredLeapfrog2.fppized.f. When compiled with "-fstack-arrays" (the default for "-Ofast"), these arrays are allocated by alloca. In r251713 the allocated size is rounded up to a multiple of 16 bytes with code like
"size = (size + 15) & -16".
Prior revisions differ only in the added constant:
"size = (size + 22) & -16"
which may simply waste an extra 16 bytes per array, depending on the initial "size" value.

Actual r251713 code, built with
gfortran -S -masm=intel -o StaggeredLeapfrog2.fppized_r251713.s -O3 -fstack-arrays -march=core-avx2 -mfpmath=sse -funroll-loops -ftree-parallelize-loops=8 StaggeredLeapfrog2.fppized.f
------------
lea rax, [15+r13*8]             ; size = <...> + 15
shr rax, 4                      ; zero-out
sal rax, 4                      ; lower 4 bits
sub rsp, rax
mov QWORD PTR [rbp-4984], rsp   ; Array 1
sub rsp, rax
mov QWORD PTR [rbp-4448], rsp   ; Array 2
sub rsp, rax
mov QWORD PTR [rbp-4784], rsp   ; Array 3
... and so on
------------

Aligning rsp to the cache line size (on each allocation, or even just once at the beginning) brings performance to stable high values:
------------
lea rax, [15+r13*8]
shr rax, 4
sal rax, 4
shr rsp, 6                      ; Align rsp to
shl rsp, 6                      ; 64-byte border
sub rsp, rax
mov QWORD PTR [rbp-4984], rsp
sub rsp, rax
mov QWORD PTR [rbp-4448], rsp
sub rsp, rax
mov QWORD PTR [rbp-4784], rsp
------------

Performance of the 64-byte aligned version, relative to the same median performance of r251711:
106.7% 107.0% 107.0% 107.1% 107.1% 107.2% 107.4%

Maybe what is needed here is some kind of option to force array alignment in gfortran (like "-align array64byte" for ifort)?
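
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two rounding expressions and of the 64-byte alignment step. It is only a model of the address computation, not of GCC's code or of the benchmark; the function names, the example rsp values and the 40-byte array size are assumptions made up for illustration. It shows that with 16-byte rounding alone the position of each array within a 64-byte cache line depends on the incoming rsp, while aligning rsp first makes that position the same every time. Whether this is what actually causes the run-to-run variation is not something the sketch can prove.

```python
def round_up_16_new(size):
    """Rounding seen in r251713: (size + 15) & -16."""
    return (size + 15) & -16


def round_up_16_old(size):
    """Rounding seen in prior revisions: (size + 22) & -16."""
    return (size + 22) & -16


def align_down_64(addr):
    """Model of 'shr rsp, 6 / shl rsp, 6': clear the low 6 bits."""
    return (addr >> 6) << 6


def array_offsets(rsp, nbytes, count, align_first=False):
    """Start address modulo 64 of each of `count` arrays of `nbytes` bytes,
    allocated by repeatedly subtracting the rounded size from rsp.
    (Hypothetical helper, for illustration only.)"""
    step = round_up_16_new(nbytes)
    if align_first:
        rsp = align_down_64(rsp)
    offsets = []
    for _ in range(count):
        rsp -= step
        offsets.append(rsp % 64)
    return offsets


if __name__ == "__main__":
    # Old vs. new rounding for a few sizes (all multiples of 8, as in r13*8).
    for size in (8, 16, 24, 32):
        print(size, round_up_16_new(size), round_up_16_old(size))

    # Two made-up incoming stack pointers that differ by 16 bytes.
    for rsp in (0x1000, 0x1010):
        print(hex(rsp),
              array_offsets(rsp, 40, 3),
              array_offsets(rsp, 40, 3, align_first=True))
```

For sizes that are already a multiple of 16, the old expression returns 16 bytes more than the new one; for sizes that are 8 modulo 16, both give the same result.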