[Bug c++/110619] Dangling pointer returned from constexpr function converts in nullptr

2023-08-06 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110619 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #7

[Bug middle-end/108441] [12 Regression] Maybe missed optimization: loading an 16-bit integer value from .rodata instead of an immediate store

2023-01-18 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108441 --- Comment #4 from Peter Cordes --- This is already fixed in current trunk; sorry I forgot to check that before recommending to report this store-coalescing bug. # https://godbolt.org/z/j3MdWrcWM # GCC nightly -O3 (tune=generic) and GCC11

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 --- Comment #27 from Peter Cordes --- (In reply to Alexander Monakov from comment #26) > Sure, the right course of action seems to be to simply document that atomic > types and built-ins are meant to be used on "common" (writeback) memory

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 --- Comment #25 from Peter Cordes --- (In reply to Alexander Monakov from comment #24) > > I think it's possible to get UC/WC mappings via a graphics/compute API (e.g. > OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device >

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #23

[Bug tree-optimization/106138] Inefficient code generation: logical AND of disjoint booleans from equal and bitwise AND not optimized to constant false

2022-06-30 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106138 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #3

[Bug target/105929] New: [AArch64] armv8.4-a allows atomic stp. 64-bit constants can use 2 32-bit halves with _Atomic or volatile

2022-06-11 Thread peter at cordes dot ca via Gcc-bugs
Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: arm64-*-* void

[Bug target/105928] New: [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)

2022-06-11 Thread peter at cordes dot ca via Gcc-bugs
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: arm64-*-* void foo(unsigned long *p) { *p

[Bug tree-optimization/105904] New: Predicated mov r0, #1 with opposite conditions could be hoisted, between 1 and 1<

2022-06-09 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105904 Bug ID: 105904 Summary: Predicated mov r0, #1 with opposite conditions could be hoisted, between 1 and 1< // using the libstdc++ header unsigned roundup(unsigned x){ return std::bit_ceil(x); }

[Bug tree-optimization/105596] Loop counter widened to 128-bit unnecessarily

2022-05-13 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596 --- Comment #1 from Peter Cordes --- https://godbolt.org/z/aoG55T5Yq gcc -O3 -m32 has the same problem with unsigned long long total and unsigned i. Pretty much identical instruction sequences in the loop for all 3 versions, doing add/adc to

[Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily

2022-05-13 Thread peter at cordes dot ca via Gcc-bugs
Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- For total *= i with a u128 total and a u32 loop counter, GCC pessimizes by widening i and doing a full 128x128 => 128-
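A minimal sketch of the kind of loop this report describes, assuming GCC's unsigned __int128 extension on a 64-bit target. The function name product_u128 and the countdown loop shape are hypothetical, not taken from the report; the point is only that each step needs a 128-bit by 32-bit multiply, not a full 128x128 one.

```c
#include <stdint.h>

/* Hypothetical reduction: a 128-bit running product multiplied by a
   32-bit loop counter. The counter never needs more than 32 bits. */
unsigned __int128 product_u128(uint32_t n)
{
    unsigned __int128 total = 1;
    for (uint32_t i = n; i != 0; i--)
        total *= i;            /* u128 *= u32 */
    return total;
}
```

Comparing the generated asm for this against a version with a 64-bit counter is one way to see whether the widening described in the report happens on a given GCC version.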

[Bug target/65146] alignment of _Atomic structure member is not correct

2022-04-27 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146 --- Comment #25 from Peter Cordes --- (In reply to CVS Commits from comment #24) > The master branch has been updated by Jakub Jelinek : > > https://gcc.gnu.org/g:04df5e7de2f3dd652a9cddc1c9adfbdf45947ae6 > > commit

[Bug target/82261] x86: missing peephole for SHLD / SHRD

2022-04-09 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261 --- Comment #4 from Peter Cordes --- GCC will emit SHLD / SHRD as part of shifting an integer that's two registers wide. Hironori Bono proposed the following functions as a workaround for this missed optimization
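For reference, a generic sketch of the double-wide shift that SHLD implements, written in portable C. This is not the workaround quoted in the comment (that text is cut off above); the name shld_u64 is hypothetical, and whether a given GCC version turns it into a single SHLD is exactly what the bug is about.

```c
#include <stdint.h>

/* Shift the 128-bit value hi:lo left by n bits and return the new high
   half. Assumes 1 <= n <= 63: n == 0 or n >= 64 would produce an
   out-of-range shift count for uint64_t. */
uint64_t shld_u64(uint64_t hi, uint64_t lo, unsigned n)
{
    return (hi << n) | (lo >> (64 - n));
}
```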

[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066 --- Comment #5 from Peter Cordes --- > pextrw requires sse4.1 for mem operands. You're right! I didn't double-check the asm manual for PEXTRW when writing up the initial report, and had never realized that PINSRW wasn't symmetric with it.

[Bug target/105079] New: _mm_storeu_si16 inefficiently uses pextrw to an integer reg (without SSE4.1)

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* With PR105066 fixed, we do _mm_loadu_si16

[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508 --- Comment #17 from Peter Cordes --- (In reply to Andrew Pinski from comment #16) > >According to Intel ( > > https://software.intel.com/sites/landingpage/IntrinsicsGuide), there are no > > alignment requirements for _mm_load_sd, _mm_store_sd

[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd

2022-03-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #14

[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

2022-03-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 --- Comment #6 from Peter Cordes --- Looks good to me, thanks for taking care of this quickly, hopefully we can get this backported to the GCC11 series to limit the damage for people using these newish intrinsics. I'd love to recommend them for

[Bug target/105066] New: GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-26 Thread peter at cordes dot ca via Gcc-bugs
: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* PR99754 fixed

[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

2022-03-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 --- Comment #3 from Peter Cordes --- Wait a minute, the current implementation of _mm_loadu_si32 isn't strict-aliasing or alignment safe!!! That defeats the purpose for its existence as something to use instead of _mm_cvtsi32_si128( *(int*)p

[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

2022-03-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #2

[Bug target/104773] New: compare with 1 not merged with subtract 1

2022-03-03 Thread peter at cordes dot ca via Gcc-bugs
Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*, arm-*-* std::bit_ceil(x) involves if(x == 0 || x == 1) return 1; and 1u << (32-c
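A rough C sketch of the shape described here, assuming a 32-bit unsigned and the GCC/Clang builtin __builtin_clz. It is not the libstdc++ implementation; the name roundup_pow2 and the use of clz(x - 1) are assumptions made for illustration. The early return for 0 and 1 is the compare that the report would like merged with the x - 1 subtraction.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next power of two (x itself if it already is one).
   Assumes x <= 2^31 so the result fits in 32 bits. */
uint32_t roundup_pow2(uint32_t x)
{
    if (x == 0 || x == 1)
        return 1;                          /* also avoids __builtin_clz(0), which is undefined */
    assert(x <= UINT32_C(0x80000000));
    return UINT32_C(1) << (32 - __builtin_clz(x - 1));
}
```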

[Bug libstdc++/97759] Could std::has_single_bit be faster?

2022-03-03 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #14

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-10-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494 --- Comment #11 from Peter Cordes --- Also, horizontal byte sums are generally best done with VPSADBW against a zero vector, even if that means some fiddling to flip to unsigned first and then undo the bias. simde_vaddlv_s8: vpxor xmm0,
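The "flip to unsigned, then undo the bias" step can be illustrated in scalar C. This is only a sketch of the arithmetic, not the vector code from the comment; the name sum_s8_biased is hypothetical, and n is assumed small enough that the sums fit in 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum signed bytes through an unsigned sum. x ^ 0x80 maps the int8_t
   range [-128, 127] onto [0, 255], i.e. it adds 128 to every element,
   so the bias to remove afterwards is 128 per element. */
int32_t sum_s8_biased(const int8_t *p, size_t n)
{
    uint32_t usum = 0;
    for (size_t i = 0; i < n; i++)
        usum += (uint8_t)((uint8_t)p[i] ^ 0x80u);   /* unsigned bytes, as a SAD against zero would sum */
    return (int32_t)usum - (int32_t)(128 * n);      /* undo the bias */
}
```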

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-10-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #10

[Bug tree-optimization/80570] auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract

2021-09-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570 --- Comment #3 from Peter Cordes --- (In reply to Andrew Pinski from comment #2) > Even on aarch64: > > .L2: > ldr q0, [x1], 16 > sxtl v1.2d, v0.2s > sxtl2 v0.2d, v0.4s > scvtf v1.2d, v1.2d >

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2021-09-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103 --- Comment #9 from Peter Cordes --- Thanks for implementing my idea :) (In reply to Hongtao.liu from comment #6) > For elements located above 128bits, it seems always better(?) to use > valign{d,q} TL:DR: I think we should still use

[Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code

2021-09-04 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309 --- Comment #37 from Peter Cordes --- Correction, PR82666 is that the cmov on the critical path happens even at -O2 (with GCC7 and later). Not just with -O3 -fno-tree-vectorize. Anyway, that's related, but probably separate from choosing to do

[Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code

2021-09-04 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #36

[Bug target/15533] Missed move to partial register

2021-08-22 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #5

[Bug middle-end/82940] Suboptimal code for (a & 0x7f) | (b & 0x80) on powerpc

2021-08-22 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82940 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #6

[Bug tree-optimization/100922] CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922 --- Comment #2 from Peter Cordes --- Possibly also related: With different surrounding code, this loop can compile to asm which has two useless movz / mov register copies in the loop at -O2 (https://godbolt.org/z/PTcqzM6q7). (To set up for

[Bug tree-optimization/100922] New: CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Created attachment

[Bug rtl-optimization/88770] Redundant load opt. or CSE pessimizes code

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88770 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #2

[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

2021-06-03 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636 Peter Cordes changed: What|Removed |Added Status|NEW |RESOLVED Resolution|---

[Bug tree-optimization/42587] bswap not recognized for memory

2021-05-08 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42587 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #12

[Bug middle-end/98801] Request for a conditional move built-in function

2021-01-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98801 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #5

[Bug tree-optimization/98291] New: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert

2020-12-15 Thread peter at cordes dot ca via Gcc-bugs
Version: 11.0 Status: UNCONFIRMED Keywords: missed-optimization, ssemmx Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366 --- Comment #1 from Peter Cordes --- Forgot to include https://godbolt.org/z/q44r13

[Bug target/97366] New: [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-11 Thread peter at cordes dot ca via Gcc-bugs
-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- When you use the same _mm_load_si128 or _mm256_load_si256 result twice, sometimes GCC loads

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2020-04-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #53

[Bug target/93141] Missed optimization : Use of adc when checking overflow

2020-01-03 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93141 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #2

[Bug target/40838] gcc shouldn't assume that the stack is aligned

2019-10-31 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #91

[Bug target/89346] Unnecessary EVEX encoding

2019-10-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1

[Bug target/82459] AVX512BW instruction costs: vpmovwb is 2 uops on Skylake and not always worth using vs. vpack + vpermq lane-crossing fixup

2019-10-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 Peter Cordes changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzill

[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)

2019-10-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 --- Comment #4 from Peter Cordes --- (In reply to Andrew Pinski from comment #3) > (In reply to Peter Cordes from comment #1) > > On AArch64 (with gcc8.2), we see a similar effect, more instructions in the > > loop. And an indexed addressing

[Bug target/92246] Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246 --- Comment #1 from Peter Cordes --- And BTW, GCC *does* use vpermd (not vpermt2d) for swapt = int or long. This problem only applies to char and short. Possibly because AVX2 includes vpermd ymm. Apparently CannonLake has 1 uop vpermb

[Bug target/92246] New: Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)

2019-10-27 Thread peter at cordes dot ca
: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* typedef short

[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 Peter Cordes changed: What|Removed |Added Summary|extra sub inside vectorized |vectorized loop updating 2

[Bug tree-optimization/92244] extra sub inside vectorized loop instead of calculating end-pointer

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 --- Comment #1 from Peter Cordes --- On AArch64 (with gcc8.2), we see a similar effect, more instructions in the loop. And an indexed addressing mode. https://godbolt.org/z/6ZVWY_ # strrev_explicit -O3 -mcpu=cortex-a53 ... .L4:

[Bug tree-optimization/92244] New: extra sub inside vectorized loop instead of calculating end-pointer

2019-10-27 Thread peter at cordes dot ca
-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- We get a redundant instruction inside the vectorized loop here. But it's

[Bug tree-optimization/92243] Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243 --- Comment #1 from Peter Cordes --- Forgot to mention, this probably applies to other ISAs with GP-integer byte-reverse instructions and efficient unaligned loads.

[Bug tree-optimization/92243] New: Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available

2019-10-27 Thread peter at cordes dot ca
Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Targ

[Bug target/82887] ICE: in extract_insn, at recog.c:2287 (unrecognizable insn) with _mm512_extracti64x4_epi64

2019-10-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887 --- Comment #5 from Peter Cordes --- Reported bug 92080 for the missed CSE

[Bug tree-optimization/92080] New: Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2019-10-13 Thread peter at cordes dot ca
Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* As a workaround for PR 82887 some code (e.g. a memset

[Bug target/82887] ICE: in extract_insn, at recog.c:2287 (unrecognizable insn) with _mm512_extracti64x4_epi64

2019-10-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4

[Bug middle-end/91515] missed optimization: no tailcall for types of class MEMORY

2019-08-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91515 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1

[Bug c/91398] Possible missed optimization: Can a pointer be passed as hidden pointer in x86-64 System V ABI

2019-08-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91398 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4

[Bug tree-optimization/91026] switch expansion produces a jump table with trivial entries

2019-07-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #3

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-08 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103 --- Comment #4 from Peter Cordes --- We should not put any stock in what ICC does for GNU C native vector indexing. I think it doesn't know how to optimize that because it *always* spills/reloads even for `vec[0]` which could be a no-op. And

[Bug target/91103] New: AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-06 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* GCC9.1 and current trunk

[Bug target/90582] New: AArch64 stack-protector wastes an instruction on address-generation

2019-05-22 Thread peter at cordes dot ca
-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- void protect_me() { volatile int buf[2]; buf[1] = 3; } https://godbolt.org/z/xdlr5w

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 --- Comment #5 from Peter Cordes --- And BTW, this only helps if the SUB and JNE are consecutive, which GCC (correctly) doesn't currently optimize for with XOR. If this sub/jne is different from a normal sub/branch and won't already get

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 --- Comment #3 from Peter Cordes --- (In reply to Jakub Jelinek from comment #2) > The xor there is intentional, for security reasons we do not want the stack > canary to stay in the register afterwards, because then it could be later > spilled

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 --- Comment #1 from Peter Cordes --- https://godbolt.org/z/hHCVTc Forgot to mention, stack-protector also disables use of the red-zone for no apparent reason, so that's another missed optimization. (Perhaps rarely relevant; probably most

[Bug target/90568] New: stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-21 Thread peter at cordes dot ca
: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* cmp/jne is always at least as efficient as xor

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-04-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #22 from Peter Cordes --- Nice, that's exactly the kind of thing I suggested in bug 80571. If this covers * vsqrtss/sd (mem),%merge_into, %xmm * vpcmpeqd %same,%same, %dest # false dep on KNL / Silvermont * vcmptrueps

[Bug target/80571] AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops

2019-02-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571 --- Comment #2 from Peter Cordes --- I think hjl's patch for PR 89071 / PR 87007 fixes (most of?) this, at least for AVX. If register pressure is an issue, using a reg holding an arbitrary constant (instead of xor-zeroed) is a valid option, as

[Bug target/38959] Additional switches to disallow processor supplementary instructions

2019-02-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959 --- Comment #4 from Peter Cordes --- The __builtin_ia32_rdpmc being a pure function bug I mentioned in my previous comment is already reported and fixed (in gcc9 only): bug 87550 It was present since at least gcc 5.0

[Bug target/38959] Additional switches to disallow processor supplementary instructions

2019-02-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #3

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #15 from Peter Cordes --- (In reply to Uroš Bizjak from comment #13) > I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP > and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we >

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #6 from Peter Cordes --- Oops, these were SD not SS. Getting sleepy >.<. Still, my optimization suggestion for doing both compares in one masked SUB of +-PBCx applies equally. And I think my testing with VBLENDVPS should apply

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #5 from Peter Cordes --- IF ( xij.GT.+HALf ) xij = xij - PBCx IF ( xij.LT.-HALf ) xij = xij + PBCx For code like this, *if we can prove only one of the IF() conditions will be true*, we can implement it
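A small C sketch of the two forms, translated loosely from the Fortran in the comment. The function names are hypothetical. The single-subtraction version matches the original only under the assumption the comment states: at most one condition fires, which means the first adjustment must never make the second condition true (for example half = 0.5 and pbcx = 1.0).

```c
/* Original two-step form: the second test sees the adjusted value. */
double wrap_two_ifs(double xij, double half, double pbcx)
{
    if (xij > +half) xij = xij - pbcx;
    if (xij < -half) xij = xij + pbcx;
    return xij;
}

/* Single-subtraction form: both tests use the original value to pick
   +pbcx, -pbcx or 0, then one subtraction is done. Equivalent to
   wrap_two_ifs only when at most one condition can be true. */
double wrap_one_sub(double xij, double half, double pbcx)
{
    double adj = 0.0;
    if (xij > +half) adj = +pbcx;
    if (xij < -half) adj = -pbcx;
    return xij - adj;
}
```

The second form shortens the dependency chain through xij to one subtraction, which is the kind of saving the comment is after.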

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #4 from Peter Cordes --- I suspect dep-chains are the problem, and branching to skip work is a Good Thing when it's predictable. (In reply to Richard Biener from comment #2) > On Skylake it's better (1uop, 1 cycle latency) while on

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #10 from Peter Cordes --- (In reply to Uroš Bizjak from comment #9) > There was similar patch for sqrt [1], I think that the approach is > straightforward, and could be applied to other reg->reg scalar insns as > well, independently

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #8 from Peter Cordes --- Created attachment 45544 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45544&action=edit testloop-cvtss2sd.asm (In reply to H.J. Lu from comment #7) > I fixed assembly codes and run it on different AVX

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #6 from Peter Cordes --- (In reply to Peter Cordes from comment #5) > But whatever the effect is, it's totally unrelated to what you were *trying* > to test. :/ After adding a `ret` to each AVX function, all 5 are basically the same

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #5 from Peter Cordes --- (In reply to H.J. Lu from comment #4) > (In reply to Peter Cordes from comment #2) > > Can you show some > > asm where this performs better? > > Please try cvtsd2ss branch at: > >

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #3 from Peter Cordes --- (In reply to H.J. Lu from comment #1) I have a patch for PR 87007: > > https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html > > which inserts a vxorps at the last possible position. vxorps > will be

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #2 from Peter Cordes --- (In reply to H.J. Lu from comment #1) > But > > vxorps %xmm0, %xmm0, %xmm0 > vcvtsd2ss %xmm1, %xmm0, %xmm0 > > are faster than both. On Skylake-client (i7-6700k), I can't reproduce this

[Bug target/80586] vsqrtss with AVX should avoid a dependency on the destination register.

2019-01-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80586 Peter Cordes changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|---

[Bug target/89071] New: AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double

2019-01-26 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- float cvt(double unused, double xmm1) { return xmm1; } g++ (GCC-Explorer-Build)

[Bug target/89063] [x86] lack of support for BEXTR from BMI extension

2019-01-25 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2018-08-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 --- Comment #4 from Peter Cordes --- The VPAND instructions in the 256-bit version are a missed-optimization. I had another look at this with current trunk. Code-gen is similar to before with -march=skylake-avx512 -mprefer-vector-width=512.

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2018-08-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 --- Comment #3 from Peter Cordes --- I had another look at this with current trunk. Code-gen is similar to before with -march=skylake-avx512 -mprefer-vector-width=512. (If we improve code-gen for that choice, it will make it a win in more

[Bug rtl-optimization/86352] New: setc/movzx introduced into loop to provide a constant 0 value for a later rep stos

2018-06-28 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* The wrong-code bug 86314 also

[Bug target/80820] _mm_set_epi64x shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it.

2018-06-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 --- Comment #5 from Peter Cordes --- AVX512F with merge-masking for integer->vector broadcasts gives us a single-uop replacement for vpinsrq/d, which is 2 uops on Intel/AMD. See my answer on

[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm

2018-06-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 --- Comment #14 from Peter Cordes --- I happened to look at this old bug again recently. re: extracting high the low two 32-bit elements: (In reply to Uroš Bizjak from comment #11) > > Or without SSE4 -mtune=sandybridge (anything that excluded

[Bug tree-optimization/69615] 0 to limit signed range checks don't always use unsigned compare

2018-06-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615 --- Comment #5 from Peter Cordes --- Update: https://godbolt.org/g/ZQDY1G gcc7/8 optimizes this to and / cmp / jb, while gcc6.3 doesn't. void rangecheck_var(int64_t x, int64_t lim2) { //lim2 >>= 60; lim2 &= 0xf; // let the compiler figure

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 --- Comment #13 from Peter Cordes --- (In reply to Jakub Jelinek from comment #10) > ?? That is the task for the linker SHF_MERGE|SHF_STRINGS handling. > Why should gcc duplicate that? Because gcc would benefit from knowing if merging makes

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 --- Comment #12 from Peter Cordes --- (In reply to Jakub Jelinek from comment #10) > (In reply to Peter Cordes from comment #9) > > gcc already totally misses optimizations here where one string is a suffix > > of another. "mii" could just be a

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #9

[Bug tree-optimization/85585] switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a string

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585 --- Comment #1 from Peter Cordes --- By comparison, the no-PIE table of pointers only needs one instruction: movq CSWTCH.4(,%rdi,8), %rax So all my suggestions cost 1 extra instruction on x86 in no-PIE mode, but at a massive savings

[Bug tree-optimization/85585] New: switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a st

2018-05-01 Thread peter at cordes dot ca
Reporter: peter at cordes dot ca Target Milestone: --- Bug 84011 shows some really silly code-gen for PIC code and discussion suggested using a table of offsets instead of a table of actual pointers, so you just need one base address. A further optimization is possible when the strings

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2018-04-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274 --- Comment #2 from Peter Cordes --- The stray LEA bug seems to be fixed in current trunk (9.0.0 20180429), at least for this testcase. Gcc's stack-alignment strategy seems to be improved overall (not copying the return address when not

[Bug c++/69560] x86_64: alignof(uint64_t) produces incorrect results with -m32

2018-04-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69560 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #23

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2018-04-15 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1

[Bug target/85366] New: Failure to use both div and mod results of one IDIV in a prime-factor loop while(n%i==0) { n/=i; }

2018-04-12 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* From https

[Bug target/85038] x32: unnecessary address-size prefix when a pointer register is already zero-extended

2018-03-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038 --- Comment #1 from Peter Cordes --- Correction for AArch64: it supports addressing modes with a 64-bit base register + 32-bit index register with zero or sign extension for the 32-bit index. But not 32-bit base registers. As a hack that's
