[Bug c++/110619] Dangling pointer returned from constexpr function converts to nullptr

2023-08-06 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110619 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #7
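For context, a minimal sketch of the behaviour the summary describes (assumed shape, not the PR's own testcase): GCC replaces a returned address of an expired local with a null pointer, which is surprising when the function is constexpr.

constexpr const int *addr_of_local() {
    int local = 42;
    return &local;                    // dangling once the call returns
}

bool surprise() {
    const int *p = addr_of_local();   // runtime call; p is dangling
    return p == nullptr;              // the PR: the dangling value compares equal to nullptr
}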

[Bug middle-end/108441] [12 Regression] Maybe missed optimization: loading a 16-bit integer value from .rodata instead of an immediate store

2023-01-18 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108441 --- Comment #4 from Peter Cordes --- This is already fixed in current trunk; sorry, I forgot to check that before recommending that this store-coalescing bug be reported. # https://godbolt.org/z/j3MdWrcWM # GCC nightly -O3 (tune=generic) and GCC11
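A hypothetical sketch of the kind of adjacent-byte-store coalescing involved (not the exact Godbolt testcase, which is truncated above):

void set_pair(unsigned char *p) {
    p[0] = 1;    // two adjacent byte stores should coalesce into a single
    p[1] = 2;    // 16-bit immediate store, e.g. movw $0x201, (%rdi) on x86-64,
}                // rather than loading the combined value from .rodata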

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 --- Comment #27 from Peter Cordes --- (In reply to Alexander Monakov from comment #26) > Sure, the right course of action seems to be to simply document that atomic > types and built-ins are meant to be used on "common" (writeback) memory
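A sketch of the access in question, assuming x86-64 (the actual testcases are in the earlier comments):

#include <atomic>

unsigned __int128 load128(const std::atomic<unsigned __int128> *p) {
    // On Intel and AMD CPUs with AVX, an aligned 16-byte SSE load/store is atomic
    // on ordinary write-back memory, so libatomic can avoid lock cmpxchg16b here.
    return p->load(std::memory_order_acquire);
}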

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 --- Comment #25 from Peter Cordes --- (In reply to Alexander Monakov from comment #24) > > I think it's possible to get UC/WC mappings via a graphics/compute API (e.g. > OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device >

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #23

[Bug tree-optimization/106138] Inefficient code generation: logical AND of disjoint booleans from equal and bitwise AND not optimized to constant false

2022-06-30 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106138 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #3
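A hypothetical illustration of the pattern named in the summary, two booleans that can never both be true:

bool disjoint(unsigned x) {
    bool a = (x == 0);        // true only when every bit of x is clear
    bool b = (x & 1) != 0;    // true only when bit 0 is set
    return a && b;            // provably always false, but not folded to a constant
}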

[Bug target/105929] New: [AArch64] armv8.4-a allows atomic stp. 64-bit constants can use 2 32-bit halves with _Atomic or volatile

2022-06-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105929 Bug ID: 105929 Summary: [AArch64] armv8.4-a allows atomic stp. 64-bit constants can use 2 32-bit halves with _Atomic or volatile Product: gcc Version: 13.0
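A sketch of the store being discussed, assuming AArch64 with -march=armv8.4-a (not the PR's exact testcase):

#include <atomic>
#include <cstdint>

void store_const(std::atomic<uint64_t> *p) {
    // The PR's premise: armv8.4-a makes an aligned STP of two W registers a
    // single-copy-atomic 64-bit store, so the constant could be built as two
    // 32-bit halves instead of a longer mov/movk chain.
    p->store(0x0123456789abcdefULL, std::memory_order_relaxed);
}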

[Bug target/105928] New: [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)

2022-06-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928 Bug ID: 105928 Summary: [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least) Product: gcc Version: 13.0 Status: UNCONFIRMED
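A hypothetical example of the constant shape in the summary:

#include <cstdint>

uint64_t repeated_halves() {
    // Both 32-bit halves are identical: materialize the low half (mov + movk),
    // then "add x0, x0, x0, lsl 32" duplicates it into the high half, saving an
    // instruction over a full mov/movk/movk/movk sequence.
    return 0xdeadbeefdeadbeefULL;
}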

[Bug tree-optimization/105904] New: Predicated mov r0, #1 with opposite conditions could be hoisted, between 1 and 1<

2022-06-09 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105904 Bug ID: 105904 Summary: Predicated mov r0, #1 with opposite conditions could be hoisted, between 1 and 1< // using the libstdc++ header unsigned roundup(unsigned x){ return std::bit_ceil(x); }
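The snippet above appears to have lost its include line to the archive's HTML stripping; a self-contained version (assumed reconstruction) would be:

#include <bit>   // libstdc++ header providing std::bit_ceil

unsigned roundup(unsigned x) {
    return std::bit_ceil(x);   // round x up to the next power of two
}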

[Bug tree-optimization/105596] Loop counter widened to 128-bit unnecessarily

2022-05-13 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596 --- Comment #1 from Peter Cordes --- https://godbolt.org/z/aoG55T5Yq gcc -O3 -m32 has the same problem with unsigned long long total and unsigned i. Pretty much identical instruction sequences in the loop for all 3 versions, doing add/adc to
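A sketch of the -m32 variant described in this comment (assumed shape of the testcase):

unsigned long long sum(unsigned n) {
    unsigned long long total = 0;    // only the accumulator needs to be wide
    for (unsigned i = 0; i < n; ++i)
        total += i;                  // i should stay a 32-bit counter, not get widened
    return total;
}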

[Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily

2022-05-13 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596 Bug ID: 105596 Summary: Loop counter widened to 128-bit unnecessarily Product: gcc Version: 13.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal

[Bug target/65146] alignment of _Atomic structure member is not correct

2022-04-27 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146 --- Comment #25 from Peter Cordes --- (In reply to CVS Commits from comment #24) > The master branch has been updated by Jakub Jelinek : > > https://gcc.gnu.org/g:04df5e7de2f3dd652a9cddc1c9adfbdf45947ae6 > > commit

[Bug target/82261] x86: missing peephole for SHLD / SHRD

2022-04-09 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261 --- Comment #4 from Peter Cordes --- GCC will emit SHLD / SHRD as part of shifting an integer that's two registers wide. Hironori Bono proposed the following functions as a workaround for this missed optimization
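(The workaround functions themselves are truncated above.) As background, a sketch of the double-register-wide shift where GCC does use these instructions:

unsigned __int128 lshift(unsigned __int128 x, int n) {
    // Lowered to two 64-bit registers; the bits carried from the low half into
    // the high half use shld (and shrd for right shifts).
    return x << (n & 63);
}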

[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066 --- Comment #5 from Peter Cordes --- > pextrw requires sse4.1 for mem operands. You're right! I didn't double-check the asm manual for PEXTRW when writing up the initial report, and had never realized that PINSRW wasn't symmetric with it.
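For reference, a sketch of the load this report is about (assumed shape), compiled with only SSE2 available:

#include <immintrin.h>

__m128i load16(const void *p) {
    // pinsrw $0, (mem), %xmm is SSE2 (only pextrw with a memory destination needs
    // SSE4.1), so this shouldn't have to bounce through an integer register via
    // movzwl + movd.
    return _mm_loadu_si16(p);
}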

[Bug target/105079] New: _mm_storeu_si16 inefficiently uses pextrw to an integer reg (without SSE4.1)

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105079 Bug ID: 105079 Summary: _mm_storeu_si16 inefficiently uses pextrw to an integer reg (without SSE4.1) Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords:
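A sketch of the store in question (assumed shape of the testcase):

#include <immintrin.h>

void store16(void *p, __m128i v) {
    // The report: without SSE4.1 the value goes through an integer register via
    // pextrw; a plain movd of element 0 followed by a 16-bit store would be cheaper.
    _mm_storeu_si16(p, v);
}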

[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508 --- Comment #17 from Peter Cordes --- (In reply to Andrew Pinski from comment #16) > >According to Intel ( > > https://software.intel.com/sites/landingpage/IntrinsicsGuide), there are no > > alignment requirements for _mm_load_sd, _mm_store_sd
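A minimal sketch of the usage the sanitizer flags (assumed, not the PR's testcase):

#include <emmintrin.h>

__m128d load_low(const double *p) {
    // The question in this PR: Intel's intrinsics guide documents no alignment
    // requirement for _mm_load_sd, but -fsanitize=alignment complains for a
    // misaligned p because the implementation dereferences a plain double.
    return _mm_load_sd(p);
}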

[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd

2022-03-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #14

[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

2022-03-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 --- Comment #6 from Peter Cordes --- Looks good to me, thanks for taking care of this quickly, hopefully we can get this backported to the GCC11 series to limit the damage for people using these newish intrinsics. I'd love to recommend them for

[Bug target/105066] New: GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066 Bug ID: 105066 Summary: GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg Product: gcc Version: 12.0 Status:

[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

2022-03-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 --- Comment #3 from Peter Cordes --- Wait a minute, the current implementation of _mm_loadu_si32 isn't strict-aliasing or alignment safe!!! That defeats the purpose for its existence as something to use instead of _mm_cvtsi32_si128( *(int*)p
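The point being made, as a sketch; the second function is the pattern the intrinsic exists to replace:

#include <immintrin.h>

__m128i good_load(const void *p) {
    return _mm_loadu_si32(p);                    // must be safe for any alignment and any object type
}

__m128i bad_load(const void *p) {
    return _mm_cvtsi32_si128(*(const int *)p);   // plain int deref: alignment UB plus a strict-aliasing violation
}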

[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

2022-03-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #2

[Bug target/104773] New: compare with 1 not merged with subtract 1

2022-03-03 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773 Bug ID: 104773 Summary: compare with 1 not merged with subtract 1 Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal
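A hypothetical sketch of the pattern the summary names; the compare and the subtract could share one flag-setting operation:

unsigned dec_saturating(unsigned x) {
    if (x >= 1)          // cmp $1 / branch, but "sub $1" already sets the carry flag
        return x - 1;    // exactly when x was 0, so the separate compare is redundant
    return 0;
}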

[Bug libstdc++/97759] Could std::has_single_bit be faster?

2022-03-03 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #14
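The library function under discussion, next to the classic bit-trick alternative (sketch):

#include <bit>

bool single_a(unsigned x) { return std::has_single_bit(x); }   // popcount(x) == 1
bool single_b(unsigned x) { return x && !(x & (x - 1)); }      // clear lowest set bit, check for zero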

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-10-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494 --- Comment #11 from Peter Cordes --- Also, horizontal byte sums are generally best done with VPSADBW against a zero vector, even if that means some fiddling to flip to unsigned first and then undo the bias. simde_vaddlv_s8: vpxor xmm0,
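A sketch of the psadbw idea for the unsigned case (the signed input in the testcase needs the flip-to-unsigned and bias correction mentioned above):

#include <immintrin.h>

int hsum_u8(__m128i v) {
    // psadbw against zero sums the 8 bytes of each 64-bit half into that half's low 16 bits.
    __m128i sad  = _mm_sad_epu8(v, _mm_setzero_si128());
    __m128i high = _mm_srli_si128(sad, 8);              // bring the upper half's sum down
    return _mm_cvtsi128_si32(_mm_add_epi32(sad, high));
}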

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-10-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #10

[Bug tree-optimization/80570] auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract

2021-09-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570 --- Comment #3 from Peter Cordes --- (In reply to Andrew Pinski from comment #2) > Even on aarch64: > > .L2: > ldr q0, [x1], 16 > sxtl v1.2d, v0.2s > sxtl2 v0.2d, v0.4s > scvtf v1.2d, v1.2d >
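The source pattern behind that loop is roughly (assumed shape of the testcase):

void convert(double *dst, const int *src, long n) {
    for (long i = 0; i < n; ++i)
        dst[i] = src[i];   // sign-extending int -> double; half-width vector loads would
}                          // feed the convert directly instead of a full load plus extract/shuffle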

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2021-09-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103 --- Comment #9 from Peter Cordes --- Thanks for implementing my idea :) (In reply to Hongtao.liu from comment #6) > For elements located above 128 bits, it seems always better(?) to use > valign{d,q} TL;DR: I think we should still use
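A sketch of the valignd approach for a high element (assumed illustration; the PR discusses for which elements this wins):

#include <immintrin.h>

int extract_elem5(__m512i v) {
    // valignd can rotate any 32-bit element down to lane 0 in a single instruction,
    // instead of a vextracti32x4 / vpshufd sequence.
    __m512i rotated = _mm512_alignr_epi32(v, v, 5);
    return _mm512_cvtsi512_si32(rotated);
}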

[Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code

2021-09-04 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309 --- Comment #37 from Peter Cordes --- Correction, PR82666 is that the cmov on the critical path happens even at -O2 (with GCC7 and later). Not just with -O3 -fno-tree-vectorize. Anyway, that's related, but probably separate from choosing to do
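For reference, the shape of loop where this bites (assumed): the cmov feeds the loop-carried accumulator, whereas a well-predicted branch takes the add off the dependency chain.

long conditional_sum(const long *a, long n, long threshold) {
    long total = 0;
    for (long i = 0; i < n; ++i)
        if (a[i] > threshold)
            total += a[i];   // as a cmov, every iteration's select sits on total's critical path
    return total;
}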

[Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code

2021-09-04 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #36

[Bug target/15533] Missed move to partial register

2021-08-22 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #5

[Bug middle-end/82940] Suboptimal code for (a & 0x7f) | (b & 0x80) on powerpc

2021-08-22 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82940 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #6
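The expression from the summary, as a sketch; on PowerPC this can be a single rlwimi bit-field insert:

unsigned char merge_sign(unsigned char a, unsigned char b) {
    return (a & 0x7f) | (b & 0x80);   // keep bits 0-6 of a, take bit 7 from b
}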

[Bug tree-optimization/100922] CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922 --- Comment #2 from Peter Cordes --- Possibly also related: With different surrounding code, this loop can compile to asm which has two useless movz / mov register copies in the loop at -O2 (https://godbolt.org/z/PTcqzM6q7). (To set up for
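A hypothetical illustration of the kind of pattern involved (one zero-extending load whose value is consumed twice; the PR's actual loop is in the original report):

unsigned sum_squares(const unsigned char *p, unsigned n) {
    unsigned total = 0;
    for (unsigned i = 0; i < n; ++i)
        total += p[i] * p[i];   // one movzbl should feed both uses, not a second load or a register copy
    return total;
}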

[Bug tree-optimization/100922] New: CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922 Bug ID: 100922 Summary: CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy Product: gcc Version:

[Bug rtl-optimization/88770] Redundant load opt. or CSE pessimizes code

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88770 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #2

[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

2021-06-03 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636 Peter Cordes changed: Status: NEW -> RESOLVED; Resolution: ---
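The idiom this (now resolved) report is about, as a sketch: zeroing the 128-bit xmm register implicitly zeroes the full ymm/zmm, so the wider encodings are never needed.

#include <immintrin.h>

__m512i zero_vec() {
    // Should assemble to vpxor %xmm0,%xmm0,%xmm0 (128-bit VEX): writing xmm0 zeroes
    // the whole zmm0, and the xor-zeroing idiom is recognized by the CPU.
    return _mm512_setzero_si512();
}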

[Bug tree-optimization/42587] bswap not recognized for memory

2021-05-08 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42587 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #12
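A sketch of the kind of byte-wise access this is about (assumed shape); ideally it becomes one load plus bswap, or a single movbe:

unsigned load_be32(const unsigned char *p) {
    return ((unsigned)p[0] << 24) | ((unsigned)p[1] << 16)
         | ((unsigned)p[2] <<  8) |  (unsigned)p[3];
}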

[Bug middle-end/98801] Request for a conditional move built-in function

2021-01-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98801 Peter Cordes changed: Added CC: peter at cordes dot ca --- Comment #5

[Bug tree-optimization/98291] New: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert

2020-12-15 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98291 Bug ID: 98291 Summary: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert Product: gcc
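The summary describes roughly this shape (assumed): a loop manually unrolled with independent accumulators to hide FP add latency.

float dot2(const float *a, const float *b, long n) {
    float s0 = 0.f, s1 = 0.f;
    for (long i = 0; i < n; i += 2) {
        s0 += a[i]     * b[i];       // two independent scalar dependency chains; the report
        s1 += a[i + 1] * b[i + 1];   // is that auto-vectorizing this comes out worse than
    }                                // the plain scalar code
    return s0 + s1;
}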

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366 --- Comment #1 from Peter Cordes --- Forgot to include https://godbolt.org/z/q44r13

[Bug target/97366] New: [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366 Bug ID: 97366 Summary: [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics Product: gcc Version: 11.0 Status: UNCONFIRMED Keywords: