[Bug c++/93008] Need a way to make inlining heuristics ignore whether a function is inline

2024-05-05 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008 --- Comment #14 from Chris Elrod --- To me, an "inline" function is one that the compiler inlines. It just happens that the `inline` keyword also means both comdat semantics, and possibly hiding the symbol to make it internal

[Bug target/110027] Misaligned vector store on detect_stack_use_after_return

2024-03-08 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027 --- Comment #9 from Chris Elrod --- > Interestingly this seems to be only reproducible on Arch Linux. Other gcc > 13.1.1 builds, Fedora for instance, seem to behave correctly. I haven't tried that reproducer on Fedora with gcc 13.2.1, which

[Bug target/114276] Trapping on aligned operations when using vector builtins + `-std=gnu++23 -fsanitize=address -fstack-protector-strong`

2024-03-07 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276 --- Comment #1 from Chris Elrod --- Created attachment 57652 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57652&action=edit assembly from adding `-S`

[Bug target/114276] New: Trapping on aligned operations when using vector builtins + `-std=gnu++23 -fsanitize=address -fstack-protector-strong`

2024-03-07 Thread elrodc at gmail dot com via Gcc-bugs
Version: 13.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: --- Created attachment 57651 --> https://gcc.gnu.org/bugzi

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-04 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #8 from Chris Elrod --- > If it's designed the way you want it to be, another issue would be like, > should we lower 512-bit vector builtins/intrinsic to ymm/xmm when > -mprefer-vector-width=256, the answer is we'd rather not.

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-04 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #6 from Chris Elrod --- Hongtao Liu, I do think that one should ideally be able to get optimal codegen when using 512-bit builtin vectors or vector intrinsics, without needing to set `-mprefer-vector-width=512` (and, currently, also

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-03 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #3 from Chris Elrod --- > I thought I hit the important cases, but my non-minimal example still gets > unnecessary register splits and stack spills, so maybe I missed places, or > perhaps there's another issue. Adding the unroll

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-03 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #2 from Chris Elrod --- https://godbolt.org/z/3648aMTz8 Perhaps a simpler diff is that you can reproduce by uncommenting the pragma, but codegen becomes good with it. template constexpr auto operator*(OuterDualUA2 a, OuterDualUA2

[Bug tree-optimization/112824] Stack spills and vector splitting with vector builtins

2023-12-02 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #1 from Chris Elrod --- Here I have added a godbolt example where I manually unroll the array, where GCC generates excellent code https://godbolt.org/z/sd4bhGW7e I'm not sure it is 100% optimal, but with an inner Dual size of `7`,

[Bug tree-optimization/112824] New: Stack spills and vector splitting with vector builtins

2023-12-02 Thread elrodc at gmail dot com via Gcc-bugs
Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: --- I am not sure which component to place this under, but selected tree-optimization as I suspect this is some sort of alias analysis failure preventing

[Bug c++/111493] [concepts] multidimensional subscript operator inside requires is broken

2023-09-20 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111493 --- Comment #2 from Chris Elrod --- Note that it also shows up in gcc-13. I put gcc-14 as the version to indicate that I confirmed it is still a problem on latest trunk. Not sure what the policy is on which version we should report.

[Bug c++/111493] New: [concepts] multidimensional subscript operator inside requires is broken

2023-09-20 Thread elrodc at gmail dot com via Gcc-bugs
: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: --- Two example programs: > #include > constexpr auto foo(const auto , int i, int j) > requires(requires(decltype(A)

[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems

2022-05-30 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929 --- Comment #32 from Chris Elrod --- Ha, I accidentally misreported my gcc version. I was already using 12.1.1. Using x86-64-v4 worked, excellent! Thanks.

[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems

2022-05-30 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929 --- Comment #30 from Chris Elrod --- > #if defined(__clang__) > #define MULTIVERSION > \ > __attribute__((target_clones("avx512dq", "avx2", "default"))) > #else > #define

[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems

2022-05-30 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929 Chris Elrod changed: What|Removed |Added CC||elrodc at gmail dot com --- Comment #29

[Bug middle-end/95899] -funroll-loops does not duplicate accumulators when calculating reductions, failing to break up dependency chains

2020-06-25 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899 --- Comment #2 from Chris Elrod --- Interesting. Compiling with: gcc -march=native -fvariable-expansion-in-unroller -Ofast -funroll-loops -S dot.c -o dot.s Yields: ``` .L4: vmovupd (%rdi,%r11), %zmm9 vmovupd 64(%rdi,%r11),

[Bug middle-end/95899] New: -funroll-loops does not duplicate accumulators when calculating reductions, failing to break up dependency chains

2020-06-25 Thread elrodc at gmail dot com
: 10.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: --- Created attachment 48784 --> https://gcc.gnu.org/bugzi

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-02-12 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #54 from Chris Elrod --- I commented elsewhere, but I built trunk a few days ago with H.J.Lu's patches (attached here) and Thomas Koenig's inlining patches. With these patches, g++ and all versions of the Fortran code produced

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #35 from Chris Elrod --- > rsqrt: > .LFB12: > .cfi_startproc > vrsqrt28ps (%rsi), %zmm0 > vmovups %zmm0, (%rdi) > vzeroupper > ret > > (huh? isn't there a NR step missing?) > I assume

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #32 from Chris Elrod --- (In reply to Marc Glisse from comment #31) > (In reply to Chris Elrod from comment #30) > > gcc calculates the rsqrt directly > > No, vrsqrt14ps is just the first step in calculating sqrt here (slightly >

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #30 from Chris Elrod --- gcc still (In reply to Marc Glisse from comment #29) > The main difference I can see is that clang computes rsqrt directly, while > gcc first computes sqrt and then computes the inverse. Also gcc seems afraid

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #28 from Chris Elrod --- Created attachment 45501 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45501&action=edit Minimum working example of the rsqrt problem. Can be compiled with: gcc -Ofast -S -march=skylake-avx512

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #27 from Chris Elrod --- g++ -mrecip=all -O3 -fno-signed-zeros -fassociative-math -freciprocal-math -fno-math-errno -ffinite-math-only -fno-trapping-math -fdump-tree-optimized -S -march=native -shared -fPIC -mprefer-vector-width=512

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #26 from Chris Elrod --- > You can try enabling -mrecip to see RSQRT in .optimized - there's > probably late 1/sqrt optimization on RTL. No luck. The full commands I used: gfortran -Ofast -mrecip -S -fdump-tree-optimized

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #24 from Chris Elrod --- The dump looks like this: vect__67.78_217 = SQRT (vect__213.77_225); vect_ui33_68.79_248 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0,

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #22 from Chris Elrod --- Okay. I did that, and the time went from about 4.25 microseconds down to 4.0 microseconds. So that is an improvement, but accounts for only a small part of the difference with the LLVM-compilers. -O3

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-21 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #20 from Chris Elrod --- To add a little more: I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in Julia. Without adding a Newton step, the answers are wrong beyond just a couple significant digits. With the

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-21 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #19 from Chris Elrod --- To add a little more: I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in Julia. Without adding a Newton step, the answers are wrong beyond just a couple significant digits. With the

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-07 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #18 from Chris Elrod --- I can confirm that the inlined packing does allow gfortran to vectorize the loop. So allowing packing to inline does seem (to me) like an optimization well worth making. However, performance seems to be

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #14 from Chris Elrod --- It's not really reproducible across runs: $ time ./gfortvectests Transpose benchmark completed in 22.7010765 SIMD benchmark completed in 1.37529969 All are equal: F All are approximately

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #12 from Chris Elrod --- Created attachment 45363 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45363&action=edit Fortran program for running benchmarks. Okay, thank you. I attached a Fortran program you can run to benchmark the

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #10 from Chris Elrod --- (In reply to Thomas Koenig from comment #9) > Hm. > > It would help if your benchmark was complete, so I could run it. > I don't suppose you happen to have and be familiar with Julia? If you (or someone

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #8 from Chris Elrod --- Created attachment 45358 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45358&action=edit gfortran compiled assembly for the transposed version of the original code. Here is the assembly for the loop body of

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #7 from Chris Elrod --- Created attachment 45357 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45357&action=edit Assembly generated by Flang compiler on the original version of the code. This is the main loop body in the Flang

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #6 from Chris Elrod --- Created attachment 45356 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45356&action=edit Code to demonstrate that transposing makes things slower. Thomas Koenig, I am well aware that Fortran is column major.

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-05 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #3 from Chris Elrod --- Created attachment 45353 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45353&action=edit g++ assembly output

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-05 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #2 from Chris Elrod --- Created attachment 45352 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45352&action=edit gfortran assembly output

[Bug fortran/88713] New: _gfortran_internal_pack@PLT prevents vectorization

2019-01-05 Thread elrodc at gmail dot com
Component: fortran Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: --- Created attachment 45350 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45350&action=edit Fortran version of vectorization test. I am attaching Fortran an

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-05 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #1 from Chris Elrod --- Created attachment 45351 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45351&action=edit C++ version of the vectorization test case.

[Bug fortran/57992] Pointless packing of contiguous arrays for simply contiguous functions results as actual arguments

2018-11-15 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57992 --- Comment #4 from Chris Elrod --- Created attachment 45016 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45016&action=edit Assembly from compiling gfortran_internal_pack_test.f90 The code takes in sets of 3-length vectors and 3x3 symmetric

[Bug fortran/57992] Pointless packing of contiguous arrays for simply contiguous functions results as actual arguments

2018-11-15 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57992 Chris Elrod changed: What|Removed |Added CC||elrodc at gmail dot com --- Comment #3

[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625 --- Comment #7 from Chris Elrod --- (In reply to Chris Elrod from comment #6) > However, for column 23 (2944/128 = 23) with -O3 and column 25 for -O2 of the > 32 columns of A Correction: it was the 16x13 version that used stack data after

[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625 --- Comment #6 from Chris Elrod --- (In reply to Richard Biener from comment #3) > If you see spilling on the manually unrolled loop register pressure is > somehow an issue. In the matmul kernel: D = A * X where D is 16x14, A is 16xN, and X is

[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625 --- Comment #5 from Chris Elrod --- Created attachment 44424 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44424&action=edit Smaller avx512 kernel that still spills into the stack This generated 18 total `vmovapd` (I think there'd ideally be 0)

[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625 --- Comment #4 from Chris Elrod --- Created attachment 44423 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44423&action=edit 8x16 * 16x6 kernel for avx2. Here is a scaled down version to reproduce most of the problem for avx2-capable

[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625 --- Comment #2 from Chris Elrod --- Created attachment 44418 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44418&action=edit Code to reproduce slow vectorization pattern and unnecessary loads & stores (Sorry if this goes to the bottom instead

[Bug rtl-optimization/86625] New: funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-21 Thread elrodc at gmail dot com
tus: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: --- I wasn't sure where to put this. I posted in the Fortran gcc mailing list initia