[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-26 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #12 from N Schaeffer --- I found the "offending" option, and it seems to be indeed a cost-model problem as Andrew Pinski said: good code is generated by: gcc -O2 -ftree-vectorize -march=skylake (since gcc 6.1) gcc -O1

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #10 from N Schaeffer --- Interestingly (and maybe surprisingly), I can get gcc to produce nearly optimal code using vbroadcastsd with the following options: -O2 -march=skylake -ftree-vectorize

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #9 from N Schaeffer --- In addition, optimizing for size with -Os leads to a non-vectorized double-loop (51 bytes) while the vectorized loop with vbroadcastsd (produced by clang -Os) leads to 40 bytes. It is thus also a missed

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #6 from N Schaeffer --- indeed, aarch64 assembly looks very good.

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #4 from N Schaeffer --- ... and thank you for your quick reply!

[Bug target/114107] poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107 --- Comment #3 from N Schaeffer --- I have not benchmarked. For 4 vmulpd doing the actual work, there are more than 40 permute/mov instructions, including 24 vpermd instructions, each with a 3-cycle latency. That is 6 vpermd per vmulpd. There

[Bug tree-optimization/114107] New: poor vectorization at -O3 when dealing with arrays of different multiplicity, good with -O2

2024-02-25 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: nathanael.schaeffer at gmail dot com Target Milestone: --- A simple loop multiplying two arrays with different multiplicity fails
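The test case itself is cut off in the snippet above. As a rough illustration only, the following sketch assumes that "different multiplicity" means one factor per group of four consecutive elements of the other array (the pattern a vbroadcastsd would serve). The names scale_groups, data and factor are hypothetical and are not taken from the report.

```c
#include <stddef.h>

/* Hypothetical sketch, not the reporter's test case:
   each factor[i] multiplies a group of 4 consecutive doubles in data. */
void scale_groups(double *data, const double *factor, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < 4; k++) {
            data[4 * i + k] *= factor[i];
        }
    }
}
```

Comparing the assembly for such a loop under -O3 and under -O2 -ftree-vectorize -march=skylake is one way to look for the difference discussed in this thread.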

[Bug tree-optimization/98563] [10/11 Regression] vectorization fails while it worked on gcc 9 and earlier

2021-01-07 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563 --- Comment #3 from N Schaeffer --- I'd like to add that when you say "vectorization of the basic block", the code generated is actually worse than non-vectorized naive code: it handles all loads and arithmetic operations in scalar mode (v*sd

[Bug tree-optimization/98563] regression: vectorization fails while it worked on gcc 9 and earlier

2021-01-06 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98563 --- Comment #1 from N Schaeffer --- I just found the -mprefer-vector-width=512 option, which forces the use of zmm registers. The reported regression however remains.

[Bug tree-optimization/98563] New: regression: vectorization fails while it worked on gcc 9 and earlier

2021-01-06 Thread nathanael.schaeffer at gmail dot com via Gcc-bugs
Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: nathanael.schaeffer at gmail dot com Target Milestone: --- I have found what seems to be a regression. The following code is not compiled to 256-bit AVX when compiled

[Bug tree-optimization/96512] wrong code generated with avx512 intrinsics in some cases

2020-09-04 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512 --- Comment #6 from N Schaeffer --- Hello, Working further on this, it seems to be a problem in the assembler step, but only on some installations. I have a system where gcc 8.3 to 9 and 10 are good (no bug), while another system where gcc

[Bug tree-optimization/96512] wrong code generated with avx512 intrinsics in some cases

2020-08-07 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512 --- Comment #3 from N Schaeffer --- (In reply to Richard Biener from comment #2) > With trunk and GCC 10 I see > > vbroadcastsd zmm0, QWORD PTR [8+r8*8] > > can you check newer GCC? GCC 8.4 is out since some time already and I do >

[Bug tree-optimization/96512] wrong code generated with avx512 intrinsics in some cases

2020-08-06 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96512 --- Comment #1 from N Schaeffer --- Created attachment 49014 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49014&action=edit even simpler bug demonstrator

[Bug tree-optimization/96512] New: wrong code generated with avx512 intrinsics in some cases

2020-08-06 Thread nathanael.schaeffer at gmail dot com
Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: nathanael.schaeffer at gmail dot com Target Milestone: --- Created attachment 49013 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49013&action=edit bug demonstrator with

[Bug target/93395] New: AVX2 missed optimization : _mm256_permute_pd() is unfortunately translated into the more expensive VPERMPD instead of the cheap VPERMILPD

2020-01-22 Thread nathanael.schaeffer at gmail dot com
: gcc Version: 9.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: nathanael.schaeffer at gmail dot com Target Milestone: --- According to Agner Fog's

[Bug tree-optimization/93334] -O3 generates useless code checking for overlapping memset ?

2020-01-21 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93334 --- Comment #5 from N Schaeffer --- Elaborating a bit on this: I can eliminate this problem by using: -O3 -fno-tree-loop-distribute-patterns -fno-tree-loop-vectorize I wonder why -fno-tree-loop-distribute-patterns is not enough ? In that

[Bug tree-optimization/93334] -O3 generates useless code checking for overlapping memset ?

2020-01-21 Thread nathanael.schaeffer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93334 --- Comment #3 from N Schaeffer --- Hi, Thanks for pointing out the issue about writing different values. This makes sense. However, since memset deals with bytes, whenever the type of array is floating point data (or anything longer than

[Bug tree-optimization/93342] New: wrong AVX mask generation with -funsafe-math-optimizations

2020-01-20 Thread nathanael.schaeffer at gmail dot com
Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: nathanael.schaeffer at gmail dot com Target Milestone: --- When trying to produce a xor mask to negate even elements in an AVX vector, gcc produces wrong code with -funsafe-math
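The reporter's code is not shown in this snippet. For readers unfamiliar with the technique, here is a portable scalar sketch of what an XOR sign mask does. It assumes IEEE-754 binary64 doubles and assumes "even elements" means indices 0 and 2 of a 4-element vector. The function name negate_even is hypothetical, and the report itself presumably used AVX intrinsics rather than a scalar loop.

```c
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for an AVX XOR with a sign mask (assumption: IEEE-754
   binary64). Flipping bit 63 of a double negates it. */
void negate_even(double v[4])
{
    /* Sign bit set for elements 0 and 2, clear for elements 1 and 3. */
    static const uint64_t mask[4] = {
        UINT64_C(1) << 63, 0, UINT64_C(1) << 63, 0
    };
    for (int i = 0; i < 4; i++) {
        uint64_t bits;
        memcpy(&bits, &v[i], sizeof bits);  /* well-defined type pun */
        bits ^= mask[i];
        memcpy(&v[i], &bits, sizeof bits);
    }
}
```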

[Bug c/93334] New: -O3 generates useless code checking for overlapping memset ?

2020-01-20 Thread nathanael.schaeffer at gmail dot com
Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: nathanael.schaeffer at gmail dot com Target Milestone: --- It seems that trying to zero out two arrays in the same loop results in poor code being generated by -O3. If I understand
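The report's own code is truncated here. A minimal sketch of the pattern described (two arrays zeroed in one loop) could look like the following. The name zero_both and the element type double are assumptions, not taken from the report.

```c
#include <stddef.h>

/* Hypothetical sketch: zero two arrays in the same loop.
   Without restrict, the compiler cannot assume a and b do not overlap. */
void zero_both(double *a, double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0;
        b[i] = 0.0;
    }
}
```

Since the two pointers may alias, a compiler that turns each store stream into a memset call may first emit a runtime overlap check, which would be consistent with the code the thread title describes.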

[Bug c++/60237] New: isnan fails with -ffast-math

2014-02-17 Thread nathanael.schaeffer at gmail dot com
++ Assignee: unassigned at gcc dot gnu.org Reporter: nathanael.schaeffer at gmail dot com With -ffast-math, isnan should return true if passed a NaN value. Otherwise, how is isnan different than (x!=x) ? isnan worked as expected with gcc 4.7, but does not with 4.8.1 and 4.8.2 How can I check if x

[Bug c++/60237] isnan fails with -ffast-math

2014-02-17 Thread nathanael.schaeffer at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60237 --- Comment #2 from N Schaeffer nathanael.schaeffer at gmail dot com --- Thank you for your answer. My program (which is a computational fluid dynamics solver) is not supposed to produce NaNs. However, when it does (which means something went

[Bug c++/60237] isnan fails with -ffast-math

2014-02-17 Thread nathanael.schaeffer at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60237 --- Comment #4 from N Schaeffer nathanael.schaeffer at gmail dot com --- int my_isnan(double x){ volatile double y=x; return y!=y; } is translated to: 0x00406cf0 +0: movsd QWORD PTR [rsp-0x8],xmm0 0x00406cf6 +6

[Bug c++/60237] isnan fails with -ffast-math

2014-02-17 Thread nathanael.schaeffer at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60237 --- Comment #6 from N Schaeffer nathanael.schaeffer at gmail dot com --- -fno-builtin-isnan is also interesting, thanks. Is there a rationale somewhere for not making isnan() find NaNs with -ffinite-math-only ?
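One commonly used workaround, not taken from this thread, is to test the bit pattern instead of relying on floating-point comparisons, which the compiler may fold away under -ffinite-math-only. The sketch below assumes IEEE-754 binary64 doubles, and the name isnan_bits is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Bit-level NaN test (assumption: IEEE-754 binary64).
   A NaN has all exponent bits set and a nonzero mantissa, so with the
   sign bit cleared its bit pattern is strictly greater than +infinity's. */
int isnan_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  /* integer view of the double */
    return (bits & UINT64_C(0x7fffffffffffffff)) > UINT64_C(0x7ff0000000000000);
}
```

Because the comparison is done on integers, it does not depend on the floating-point semantics that -ffast-math relaxes.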