[Bug c++/110619] Dangling pointer returned from constexpr function converts in nullptr

2023-08-06 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110619

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #7 from Peter Cordes  ---
(In reply to Andrew Pinski from comment #2)
> >but it is not nullptr.
> 
> Or is it just undefined so it could be considered a nullptr ...


Implementation-defined behaviour, according to answers on
https://stackoverflow.com/questions/76843246/why-does-the-address-of-an-out-of-scope-variable-equal-zero-with-constexpr


https://eel.is/c++draft/basic.compound#def:value,invalid_pointer

https://eel.is/c++draft/basic.stc.general#4

>  Indirection through an invalid pointer value and passing an invalid pointer 
> value to a deallocation function have undefined behavior.
> **Any other use of an invalid pointer value has implementation-defined 
> behavior.**

So this wasn't a bug, but the new behaviour is also allowed.

This commit could be reverted or kept, depending on maintainability and/or
quality-of-life for users of GCC.  Having GCC pick the opposite
implementation-defined behaviour from clang (clang's choice being GCC's
previous behaviour) is maybe a *good* thing, to help programmers catch
dependence on an invalid pointer being either null or non-null if they try
their code with both compilers.
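
A minimal sketch of the kind of code in question (my own illustration, not the
testcase from this PR):

constexpr const int *dangling() {
    int local = 42;
    return &local;              // invalid pointer value once local's lifetime ends
}

bool check() {
    const int *p = dangling();  // runtime call, not a constant evaluation
    return p == nullptr;        // "any other use" of an invalid pointer value:
                                // implementation-defined.  New GCC can fold this
                                // to true; clang (and older GCC) keep the stale
                                // address and return false.
}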

[Bug middle-end/108441] [12 Regression] Maybe missed optimization: loading an 16-bit integer value from .rodata instead of an immediate store

2023-01-18 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108441

--- Comment #4 from Peter Cordes  ---
This is already fixed in current trunk; sorry I forgot to check that before
recommending to report this store-coalescing bug.

# https://godbolt.org/z/j3MdWrcWM
# GCC nightly -O3   (tune=generic)  and GCC11
store:
        movl    $16, %eax
        movw    %ax, ldap(%rip)
        ret

In case anyone's wondering why GCC doesn't emit  movw $16, foo(%rip)  directly:
it's avoiding LCP stalls from the 16-bit immediate on Intel P6-family CPUs.

For MOV specifically, that only happens on P6-family (Nehalem and earlier), not
Sandybridge-family, so it's getting close to time to drop it from
-mtune=generic.  (-mtune= bdver* or znver* don't do it, so there is a tuning
setting controlling it)

GCC *only* seems to know about MOV, so ironically with -march=skylake, for
example, we avoid a non-existent LCP stall for mov to memory, but GCC compiles
x += 1234 into code that *will* LCP stall: addw $1234, x(%rip).
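
For concreteness, a hedged sketch of the kind of 16-bit-immediate RMW meant
here (my own example, not from the report):

/* With -march=skylake, GCC emits  addw $1234, x(%rip)  for this, which is
   exactly the length-changing-prefix pattern the MOV workaround avoids. */
unsigned short x;
void add_imm(void) { x += 1234; }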

-march=alderlake disables this tuning workaround, using movw $imm, mem.  (The
Silvermont-family E-cores in Alder Lake don't have this problem either, so
that's correct.  Agner Fog's guide didn't mention any changes in LCP stalls for
Alder Lake.)

Avoiding LCP stalls is somewhat less important on CPUs with a uop cache, since
they only happen during legacy decode.  Various things can still cause code to
run only from legacy decode even inside a loop, though, such as Skylake's
JCC-erratum microcode mitigation when code isn't assembled with the GAS option
to work around it (which GCC doesn't pass by default with -march=skylake).

If there isn't already a bug open about these tuning choices mismatching the
hardware, I can repost this part as a new bug if you'd like.


Related:
https://stackoverflow.com/questions/75154687/is-this-a-missed-optimization-in-gcc-loading-an-16-bit-integer-value-from-roda

and
https://stackoverflow.com/questions/70719114/why-does-the-short-16-bit-variable-mov-a-value-to-a-register-and-store-that-u

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #27 from Peter Cordes  ---
(In reply to Alexander Monakov from comment #26)
> Sure, the right course of action seems to be to simply document that atomic
> types and built-ins are meant to be used on "common" (writeback) memory

Agreed.  Where in the manual should this go?  Maybe a new subsection of the
chapter about __atomic builtins where we document per-ISA requirements for them
to actually work?

e.g. x86 memory-type stuff, and that ARM assumes all cores are in the same
inner-shareable cache-coherency domain, so barriers are  dmb ish  rather than
dmb sy,  and so on.
I guess we might want to avoid documenting the actual asm implementation
strategies in the main manual, because that would imply it's supported to make
assumptions based on that.

Putting it near the __atomic docs might make it easier for readers to notice
that the list of requirements exists, vs. scattering them into different pages
for different ISAs.  And we don't currently have any section in the manual
about per-ISA quirks or requirements, just about command-line options,
builtins, and attributes that are per-ISA, so there's no existing page where
this could get tacked on.

This would also be a place where we can document that __atomic ops are
address-free when they're lock-free, and thus usable on shared memory between
processes.  ISO C++ says that *should* be the case for std::atomic, but
doesn't standardize the existence of multiple processes.

To avoid undue worry, documentation about this should probably start by saying
that normal programs (running under mainstream OSes) don't have to worry about
it or do anything special.

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #25 from Peter Cordes  ---
(In reply to Alexander Monakov from comment #24)
> 
> I think it's possible to get UC/WC mappings via a graphics/compute API (e.g.
> OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device
> memory (and then CPU vendor cannot guarantee that 128b access won't tear
> because it might depend on downstream devices).


Even atomic_int doesn't work properly if you deref a pointer to WC memory.  WC
doesn't have the same ordering guarantees, so it would break acquire/release
semantics.
So we already don't support WC for this.

We do at least de-facto support atomics on UC memory because the ordering
guarantees are a superset of cacheable memory, and 8-byte atomicity for aligned
load/store is guaranteed even for non-cacheable memory types since P5 Pentium
(and on AMD).  (And lock cmpxchg16b is always atomic even on UC memory.)

But you're right that only Intel guarantees that 16-byte VMOVDQA loads/stores
would be atomic on UC memory.  So this change could break that very unwise
corner case on AMD, which only guarantees it for cacheable loads/stores, and
on Zhaoxin, which only guarantees it for WB.

But was anyone previously using 16-byte atomics on UC device memory?  Do we
actually care about supporting that?  I'd guess no and no, so it's just a
matter of documenting that somewhere.

Since GCC7 we've reported 16-byte atomics as being non-lock-free, so I *hope*
people weren't using __atomic_store_n on device memory.  The underlying
implementation was never guaranteed.
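
For reference, a hedged sketch of the 16-byte atomic access being discussed
(illustrative code, not from this PR):

#include <stdatomic.h>

unsigned __int128 load128(_Atomic unsigned __int128 *p)
{
    /* On CPUs with AVX the underlying access may now be a 16-byte vector
       load instead of lock cmpxchg16b; either way it is still reported as
       non-lock-free, and is only supported on ordinary (WB) memory. */
    return atomic_load_explicit(p, memory_order_relaxed);
}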

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #23 from Peter Cordes  ---
(In reply to Xi Ruoyao from comment #20)
> "On Zhaoxin CPUs with AVX, the VMOVDQA instruction is atomic if the accessed
> memory is Write Back, but it's not guaranteed for other memory types."

VMOVDQA is still fine, I think WB is the only memory type that's relevant for
atomics, at least on the mainstream OSes we compile for.  It's not normally
possible for user-space to allocate memory of other types.  Kernels normally
use WB memory for their shared data, too.

You're correct that WT and WP are the other two cacheable memory types, and
Zhaoxin's statement doesn't explicitly guarantee atomicity for those, unlike
Intel and AMD.

But at least on Linux, I don't think there's a way for user-space to even ask
for a page of WT or WP memory (or UC or WC).  Only WB memory is easily
available without hacking the kernel.  As far as I know, this is true on other
existing OSes.

WT = write-through: read caching, no write-allocate.  Write hits update the
line and memory.
WP = write-protect: read caching, no write-allocate.  Writes go around the
cache, evicting even on hit.
(https://stackoverflow.com/questions/65953033/whats-the-usecase-of-write-protected-pat-memory-type
quotes the Intel definitions.)

Until recently, the main work on formalizing the x86 TSO memory model had only
looked at WB memory.
A 2022 paper looked at WT, UC, and WC memory types:
https://dl.acm.org/doi/pdf/10.1145/3498683 - Extending Intel-x86 Consistency
and Persistency
Formalising the Semantics of Intel-x86 Memory Types and Non-temporal Stores
(The intro part describing memory types is quite readable, in plain English not
full of formal symbols.  They only mention WP once, but tested some litmus
tests with readers and writers using any combination of the other memory
types.)


Some commenters on my answer on when WT is ever used or useful confirmed that
mainstream OSes don't give easy access to it.
https://stackoverflow.com/questions/61129142/when-use-write-through-cache-policy-for-pages/61130838#61130838
* Linux has never merged a patch to let user-space allocate WT pages.
* The Windows kernel reportedly doesn't have a mechanism to keep track of pages
that should be WT or WP, so you won't find any.

I don't know about *BSD making it plausible for user-space to point an _Atomic
int * at a page of WT or WP memory.  I'd guess not.

I don't know if there's anywhere we can document that _Atomic objects need to
be in memory that's allocated in a "normal" way.  Probably hard to word without
accidentally disallowing something that's fine.

[Bug tree-optimization/106138] Inefficient code generation: logical AND of disjoint booleans from equal and bitwise AND not optimized to constant false

2022-06-30 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106138

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #3 from Peter Cordes  ---
Ideally, bitwise & of booleans should also be handled, not just &&.
A testcase (https://godbolt.org/z/qvosv8q7c) makes it easy to check both.

//#define LOGIC_AND
_Bool f2(char x)
{
    _Bool b1 = x == 2;   // x == 2 implies the low bit of x is 0
    _Bool b2 = x & 1;    // requires the low bit of x to be 1, so b1 & b2 is always false

#ifdef LOGIC_AND
    return b1 && b2;
#else
    return b1 & b2;
#endif
}

(Clang optimizes the && version to return false, but not the bitwise version.
GCC currently optimizes neither.)
This was originally posted on Stack Overflow
(https://stackoverflow.com/q/72802469/224132), BTW.

[Bug target/105929] New: [AArch64] armv8.4-a allows atomic stp. 64-bit constants can use 2 32-bit halves with _Atomic or volatile

2022-06-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105929

Bug ID: 105929
   Summary: [AArch64] armv8.4-a allows atomic stp. 64-bit
constants can use 2 32-bit halves with _Atomic or
volatile
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: arm64-*-*

void foo(unsigned long *p) {
*p = 0xdeadbeefdeadbeef;
}
// compiles nicely:  https://godbolt.org/z/8zf8ns14K
        mov     w1, 48879
        movk    w1, 0xdead, lsl 16
        stp     w1, w1, [x0]
        ret

But even with -Os -march=armv8.4-a   the following doesn't:
void foo_atomic(_Atomic unsigned long *p) {
__atomic_store_n(p, 0xdeadbeefdeadbeef, __ATOMIC_RELAXED);
}

        mov     x1, 48879
        movk    x1, 0xdead, lsl 16
        movk    x1, 0xbeef, lsl 32
        movk    x1, 0xdead, lsl 48
        stlr    x1, [x0]
        ret

ARMv8.4-a and later guarantees atomicity for aligned ldp/stp, according to
ARM's architecture reference manual: ARM DDI 0487H.a - ID020222, so we could
use the same asm as the non-atomic version.

> If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that access 
> fewer than 16 bytes are single-copy atomic when all of the following 
> conditions are true:
> • All bytes being accessed are within a 16-byte quantity aligned to 16 bytes.
> • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory

(FEAT_LSE2 is the same CPU feature that gives 128-bit atomicity for aligned
ldp/stp x,x,mem)

Prior to that, it apparently wasn't guaranteed that an stp of two 32-bit halves
would be a single atomic 64-bit store.  So without -march=armv8.4-a it wasn't a
missed optimization to construct the constant in a single register for _Atomic
or volatile.

But with ARMv8.4, we should use MOV/MOVK + STP.

Since there doesn't seem to be a release-store version of STP, 64-bit release
and seq_cst stores should still generate the full constant in a register,
instead of using STP + barriers.


(Without ARMv8.4-a, or with a memory-order other than relaxed, see PR105928 for
generating 64-bit constants in 3 instructions instead of 4, at least for -Os,
with add x0, x0, x0, lsl 32)

[Bug target/105928] New: [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)

2022-06-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928

Bug ID: 105928
   Summary: [AArch64] 64-bit constants with same high/low halves
can use ADD lsl 32 (-Os at least)
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: arm64-*-*

void foo(unsigned long *p) {
*p = 0xdeadbeefdeadbeef;
}

cleverly compiles to https://godbolt.org/z/b3oqao5Kz

        mov     w1, 48879
        movk    w1, 0xdead, lsl 16
        stp     w1, w1, [x0]
        ret

But producing the value in a register uses more than 3 instructions:

unsigned long constant(){
return 0xdeadbeefdeadbeef;
}

        mov     x0, 48879
        movk    x0, 0xdead, lsl 16
        movk    x0, 0xbeef, lsl 32
        movk    x0, 0xdead, lsl 48
        ret

At least with -Os, and maybe at -O2 or -O3 if it's efficient, we could be doing
a shifted ADD or ORR to broadcast a zero-extended 32-bit value to 64-bit.

        mov     x0, 48879
        movk    x0, 0xdead, lsl 16
        add     x0, x0, x0, lsl 32

Some CPUs may fuse sequences of movk, and shifted operands for ALU ops may take
extra time in some CPUs, so this might not actually be optimal for performance,
but it is smaller for -Os and -Oz.
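
A hedged C-level view of the same trick (my own sketch): the low 32 bits are
built with mov+movk, then one shifted add replicates them into the high half;
there is no carry because the low 32 bits of (lo << 32) are zero.

unsigned long broadcast32(void)
{
    unsigned long lo = 0xdeadbeefUL;   // mov + movk ..., lsl 16
    return lo + (lo << 32);            // add x0, x0, x0, lsl 32
}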

We should also be using that trick for stores to _Atomic or volatile long*,
where we currently do MOV + 3x MOVK, then an STR, with ARMv8.4-a which
guarantees atomicity.


---

ARMv8.4-a and later guarantees atomicity for ldp/stp within an aligned 16-byte
chunk, so we should use MOV/MOVK / STP there even for volatile or
__ATOMIC_RELAXED.  But presumably that's a different part of GCC's internals,
so I'll report that separately.

[Bug tree-optimization/105904] New: Predicated mov r0, #1 with opposite conditions could be hoisted, between 1 and 1<

2022-06-09 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105904

Bug ID: 105904
   Summary: Predicated mov r0, #1 with opposite conditions could
            be hoisted, between 1 and 1<

#include <bit>   // using the libstdc++ header
unsigned roundup(unsigned x){
    return std::bit_ceil(x);
}

https://godbolt.org/z/Px1fvWaex

GCC's version is somewhat clunky, including MOV r0, #1 in either "side":

roundup(unsigned int):
        cmp     r0, #1
        itttt   hi
        addhi   r3, r0, #-1
        movhi   r0, #1       @@ here
        clzhi   r3, r3
        rsbhi   r3, r3, #32
        ite     hi
        lslhi   r0, r0, r3
        movls   r0, #1       @@ here
        bx      lr

Even without spotting the other optimizations that clang finds, we can combine
to a single unconditional MOV r0, #1.  But only if we avoid setting flags, so
it requires a 4-byte encoding, not MOVS.  Still, it's one fewer instruction to
execute.

This is not totally trivial: it requires seeing that we can move it across the
conditional LSL.  So it's really a matter of folding the 1s between 1<

[Bug tree-optimization/105596] Loop counter widened to 128-bit unnecessarily

2022-05-13 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596

--- Comment #1 from Peter Cordes  ---
https://godbolt.org/z/aoG55T5Yq
gcc -O3 -m32 has the same problem with  unsigned long long total  and unsigned
i.

Pretty much identical instruction sequences in the loop for all 3 versions,
doing add/adc to increment i, for example.  (Plus a bit of spilling). 
fact_gcc_handhold still compiles without the unnecessary widening.

Perhaps this should be retitled to say the counter is widened to a "2-register
type".

IDK how easily this occurs in real-world loops with 64-bit and 32-bit integers
on 32-bit machines, but that case probably wastes more clock cycles worldwide.

[Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily

2022-05-13 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596

Bug ID: 105596
   Summary: Loop counter widened to 128-bit unnecessarily
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

For  total *= i  with a u128 total and a u32 loop counter, GCC pessimizes by
widening i and doing a full 128x128 => 128-bit multiply, and having to do a
128-bit increment and compare.

Using uint64_t i to make it a full register width doesn't help.

unsigned __int128 fact(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=2 ; i < n ; i++)
        total *= i;
    return total;
}
// 0! = 0  isn't mathematically correct, but that's not the point

https://godbolt.org/z/W4MW9b6T3  (gcc trunk 13.0.0 20220508 (experimental) and
clang 14, which makes efficient asm for all of these.)

# gcc -O3
fact:
        movl    %edi, %r9d
        xorl    %r11d, %r11d
        movq    %r9, %r10        # total = n  zext into  R11:R10
        cmpl    $2, %edi
        jbe     .L7              # if n<=2 return r11:r10
        movl    $2, %esi         # i = 2  in  RDI:RSI
        xorl    %edi, %edi
.L9:                             # do{
        movq    %r11, %rcx
        movq    %rdi, %rdx
        movq    %r10, %rax
        movq    %r9, %r8         # copy original n to destroy later
        imulq   %r10, %rdx       # 128x128 multiply with 2x imul, 1x widening mul
        imulq   %rsi, %rcx
        addq    %rdx, %rcx
        mulq    %rsi
        movq    %rdx, %r11       # update total in r11:r10
        movq    %rax, %r10
        addq    %rcx, %r11       # last partial product

        addq    $1, %rsi         # i++ as a 128-bit integer
        adcq    $0, %rdi
        xorq    %rsi, %r8        # r8 = n^i
        movq    %rdi, %rcx       # useless copy, we're already destroying r8
        orq     %r8, %rcx        # hi(i^n) | lo(i^n)
        jne     .L9              # }while(i != n);
.L7:
        movq    %r10, %rax
        movq    %r11, %rdx
        ret

So as well as creating extra work to do, it's not even doing it very
efficiently, with multiple unnecessary mov instructions.

This doesn't seem to be x86-64 specific.  It also compiles similarly for
AArch64 and MIPS64.  For some ISAs, I'm not sure if potentially-infinite loops
are making a difference, e.g. PowerPC is hard for me to read.  RV64 has three
multiply instructions in both versions.

I haven't tested a 32-bit equivalent with uint64_t total and uint32_t i.


This anti-optimization goes back to GCC4.6.  With GCC4.5 and earlier, the above
C compiles to a tight loop with the expected mul reg + imul reg,reg and 1
register loop counter: https://godbolt.org/z/6KheaqTx4  (using __uint128_t,
since unsigned __int128 wasn't supported on GCC4.4 or 4.1)

GCC 4.1 does an inefficient multiply, but one of the chunks is a freshly
xor-zeroed register.  It's still just incrementing and comparing a 32-bit loop
counter, but widening it for a 128x128-bit multiply recipe.  GCC4.4 optimizes
away the parts that are useless for the high 64 bits of (u128)i being zero.


-

A different version compiles efficiently with GCC6 and earlier, only becoming
slow like the above with GCC7 and later.

unsigned __int128 fact_downcount(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=n-1 ; i > 1 ; i--)
        total *= i;
    return total;  // 0! = 0 isn't mathematically correct
}


-

When the loop condition is possibly always-true, GCC can't prove the loop is
non-infinite, and as usual can't widen the loop counter.  In this case, that's
a good thing:

unsigned __int128 fact_gcc_handhold(unsigned n){
    unsigned __int128 total = 1;       // loop does do final n
    for (unsigned i=2 ; i <= n ; i++)  // potentially infinite loop defeats this pessimization
        total *= i;
    return total;  // fun fact:  0! = 1  is mathematically correct
}


fact_gcc_handhold:
        cmpl    $1, %edi
        jbe     .L4
        movl    $2, %ecx         # i = 2   in ECX
        movl    $1, %eax         # total = 1  in RDX:RAX
        xorl    %edx, %edx
.L3:                             # do{
        movl    %ecx, %esi       # copy i instead of just incrementing it later :/

        movq    %rdx, %r8        # save high half of total
        addl    $1, %ecx         # i++
        imulq   %rsi, %r8        # lo x hi cross product
        mulq    %rsi             # lo x lo widening
        addq    %r8, %rdx        # 128x64-bit multiply

        cmpl    %ecx, %edi
        jnb     .L3              # }while(i < n)
        ret

Allocating total in RDX:RAX is nice, putting the lo part where we need it for
mulq anyway.

[Bug target/65146] alignment of _Atomic structure member is not correct

2022-04-27 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146

--- Comment #25 from Peter Cordes  ---
(In reply to CVS Commits from comment #24)
> The master branch has been updated by Jakub Jelinek :
> 
> https://gcc.gnu.org/g:04df5e7de2f3dd652a9cddc1c9adfbdf45947ae6
> 
> commit r11-2909-g04df5e7de2f3dd652a9cddc1c9adfbdf45947ae6
> Author: Jakub Jelinek 
> Date:   Thu Aug 27 18:44:40 2020 +0200
> 
> ia32: Fix alignment of _Atomic fields [PR65146]
> 
> For _Atomic fields, lowering the alignment of long long or double etc.
> fields on ia32 is undesirable, because then one really can't perform
> atomic
> operations on those using cmpxchg8b.


Just for the record, the description of this bugfix incorrectly mentioned
cmpxchg8b being a problem.  lock cmpxchg8b is *always* atomic, even if that
means the CPU has to take a bus lock (disastrously expensive, affecting all
cores system-wide) instead of just delaying the MESI response for one line
exclusively owned in this core's private cache (aka cache lock).

The correctness problem is __atomic_load_n / __atomic_store_n compiling to
actual 8-byte pure loads / pure stores using SSE2 movq, SSE1 movlps, or x87
fild/fistp (bouncing through the stack), such as

  movq  %xmm0, (%eax)

That's where correctness depends on Intel and AMD's atomicity guarantees which
are conditional on alignment.

(And if AVX is supported, same deal for 16-byte load/store.  Although we can
and should use movaps for that, which bakes alignment checking into the
instruction.  Intel did recently document that CPUs with AVX guarantee
atomicity of 16-byte aligned loads/stores, retroactive to all CPUs with AVX. 
It's about time, but yay.)
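
A hedged illustration of the ia32 case (my example, not from the PR): with the
fix, the _Atomic 8-byte member keeps 8-byte alignment inside the struct, so a
pure 8-byte load/store of it (e.g. via movq) falls under the
alignment-conditional atomicity guarantees described above.

struct counters {
    int tag;
    _Atomic long long value;   /* must not be lowered to 4-byte alignment on ia32 */
};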

> Not sure about iamcu_alignment change, I know next to nothing about IA
> MCU,
> but unless it doesn't have cmpxchg8b instruction, it would surprise me
> if we
> don't want to do it as well.


I had to google iamcu.  Apparently it's Pentium-like, but only has soft-FP (so
I assume no MMX or SSE as well as no x87).

If that leaves it no way to do 8-byte load/store except (lock) cmpxchg8b, that
may mean there's no need for alignment, unless cache-line-split lock is still a
performance issue.  If it's guaranteed unicore as well, we can even omit the
lock prefix and cmpxchg8b will still be an atomic RMW (or load or store) wrt.
interrupts.  (And being unicore would likely mean much less system-wide
overhead for a split lock.)

[Bug target/82261] x86: missing peephole for SHLD / SHRD

2022-04-09 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261

--- Comment #4 from Peter Cordes  ---
GCC will emit SHLD / SHRD as part of shifting an integer that's two registers
wide.
Hironori Bono proposed the following functions as a workaround for this missed
optimization (https://stackoverflow.com/a/71805063/224132)

#include <stdint.h>

#ifdef __SIZEOF_INT128__
uint64_t shldq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)(((((unsigned __int128)high << 64) | (unsigned __int128)low)
                     << (count & 63)) >> 64);
}

uint64_t shrdq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)((((unsigned __int128)high << 64) | (unsigned __int128)low)
                    >> (count & 63));
}
#endif

uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low)
                     << (count & 31)) >> 32);
}

uint32_t shrd_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}

---

The uint64_t functions (using __int128) compile cleanly in 64-bit mode
(https://godbolt.org/z/1j94Gcb4o) using 64-bit operand-size shld/shrd

but the uint32_t functions compile to a total mess in 32-bit mode (GCC11.2 -O3
-m32 -mregparm=3) before eventually using shld, including a totally insane 
or  dh, 0

GCC trunk with -O3 -mregparm=3 compiles them cleanly, but without regparm it's
still a slightly different mess.

Ironically, the uint32_t functions compile to quite a few instructions in
64-bit mode, actually doing the operations as written with shifts and ORs, and
having to manually mask the shift count to &31 because it uses a 64-bit
operand-size shift which masks with &63.  32-bit operand-size SHLD would be a
win here, at least for -mtune=intel or a specific Intel uarch.

I haven't looked at whether they still compile ok after inlining into
surrounding code, or whether operations would tend to combine with other things
in preference to becoming an SHLD.

[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #5 from Peter Cordes  ---
> pextrw requires sse4.1 for mem operands.

You're right! I didn't double-check the asm manual for PEXTRW when writing up
the initial report, and had never realized that PINSRW wasn't symmetric with
it.  I was really surprised to see that in
https://www.felixcloutier.com/x86/pextrw

So we do need to care about tuning for _mm_storeu_si16(p, v) without SSE4.1
(without the option of PEXTRW to memory).  PEXTRW to an integer register is
obviously bad; we should be doing

movd  %xmm0, %eax
mov   %ax, (%rdi)

instead of an inefficient  pextrw $0, %xmm0, %eax ; movw-store

Reported as PR105079, since the cause of the load missed-opt was GCC thinking
the instruction wasn't available, rather than a wrong tuning choice like this
is.

[Bug target/105079] New: _mm_storeu_si16 inefficiently uses pextrw to an integer reg (without SSE4.1)

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105079

Bug ID: 105079
   Summary: _mm_storeu_si16 inefficiently uses pextrw to an
integer reg (without SSE4.1)
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

With PR105066 fixed, we do _mm_loadu_si16 with pinsrw from memory, because
that's available with just SSE2.  (And the cause wasn't tuning choices, it was
a typo in what insns GCC thought were available.)  Related: PR105072 re:
folding such 16-bit loads into memory source operands for PMOVZX/SXBQ.

But the famously non-orthogonal SSE2 only includes pextrw $imm, %xmm, reg.  Not
reg/mem until SSE4.1 (with a longer opcode for no apparent reason, instead of
just allowing mem addressing modes for the existing one.  But same mnemonic so
the assembler takes care of it.  https://www.felixcloutier.com/x86/pextrw)

So we do need to care about tuning for _mm_storeu_si16(p, v) without the option
of PEXTRW to memory.  Currently we do this, which is obviously bad:

        pextrw  $0, %xmm0, %eax  # 2 uops
        movw    %ax, (%rdi)

we should be doing this

        movd    %xmm0, %eax  # 1 uop
        mov     %ax, (%rdi)

https://godbolt.org/z/Ee3Ez174M

This is especially true if we don't need the integer value zero-extended into
EAX.

If we *did* also want the value zero-extended in an integer register, the extra
uop in PEXTRW (in addition to the port 0 uop like MOVD) is a port-5 shuffle to
extract an arbitrary 16-bit element, vs. a separate integer movzwl %cx, %eax
could run on any integer ALU port.  (Including port 6 on HSW/SKL, which doesn't
compete with any vector ALUs).

Mov-elimination for movzwl doesn't work on any current CPUs, only movzbl on
Intel, and movl / movq on both Intel and AMD.  So currently there's no benefit
to picking a different register like %ecx, instead of just using movzwl %ax,
%eax

When we both store and use the integer value:

int store16_and_use(void *p, __m128i v){
    _mm_storeu_si16( p, v );
    return 123 + *(unsigned short*)p;
}

https://godbolt.org/z/zq6TMo1oE current trunk GCC does this, which is not bad:

# -O3 with or without -msse4.1
        pextrw  $0, %xmm0, %eax
        movw    %ax, (%rdi)
        addl    $123, %eax
        ret

Clang13 uses MOVD + MOVZX like I was suggesting, even though it costs more code
size.  That's not necessarily better

        movd    %xmm0, %eax
        movw    %ax, (%rdi)
        movzwl  %ax, %eax
        addl    $123, %eax
        retq

In this case it's not obviously wrong to use PEXTRW to an integer reg, but it's
also fine to do it clang's way.  So however that corner case shakes out in the
process of fixing the main bug (using movd / movw without SSE4.1 when we don't
reload) is fine.

If SSE4.1 is available, the no-reload case should probably use PEXTRW to memory
instead of movd + movw.  On some CPUs, the ALU op that's part of PEXTRW has
more choice of ALU port than xmm->gp_int operations.

[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508

--- Comment #17 from Peter Cordes  ---
(In reply to Andrew Pinski from comment #16)
> >According to Intel (
> > https://software.intel.com/sites/landingpage/IntrinsicsGuide), there are no
> > alignment requirements for _mm_load_sd, _mm_store_sd and _mm_loaddup_pd. For
> > example, from _mm_load_sd:
> 
> I disagree with saying there is no alignment requirement.
> 
> The alignment requirement comes from the type of the argument (double
> const*). [...]
> Pointers themselves have an alignment requirement not just at the time of
> the load/store of them.

The intrinsics are badly designed to take pointer args with types other than
void*, despite how they're expected to work.  This is something we just need to
accept.  Starting with AVX-512, any new intrinsics take void*, but they haven't
redefined the old ones.

_mm_loadu_si128 takes a __m128i*, same as _mm_load_si128.  alignof(__m128i) ==
16, so _mm_loadu_si128 must not simply dereference it, that's what
_mm_load_si128 does.

Intel's intrinsics API requires you to do unaligned 16-byte loads by creating a
misaligned pointer and passing it to a loadu intrinsic.  (This in turn requires
that implementations supporting these intrinsics define the behaviour of
creating such a pointer without deref; in ISO C that alone would be UB.)
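
A hedged illustration of that pattern (my example, not from this PR): the
possibly-misaligned __m128i* exists only to be passed to the loadu intrinsic
and is never dereferenced directly.

#include <emmintrin.h>

__m128i load_any_alignment(const void *p)
{
    return _mm_loadu_si128((const __m128i *)p);   /* p may have any alignment */
}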

This additional unaligned-pointer behaviour that implementations must define
(at least for __m128i* and float/double*) is something I wrote about in an SO
answer:
https://stackoverflow.com/questions/52112605/is-reinterpret-casting-between-hardware-simd-vector-pointer-and-the-correspond


_mm_loadu_ps (like _mm_load_ps) takes a float*, but its entire purpose is to
not require alignment.

_mm512_loadu_ps takes a void* arg, so we can infer that earlier FP load
intrinsics really are intended to work on data with any alignment, not just
with the alignment of a float.

They're unlike a normal deref of a float* in aliasing rules, although that's
separate from creating a misaligned float* in code outside the intrinsic.  A
hypothetical low-performance portable emulation of intrinsics that ended up
dereferencing that float* arg directly would be broken for strict-aliasing as
well.

The requirement to define the behaviour of having a misaligned float* can be
blamed on Intel in 1999 (when SSE1 was new).  Later extensions like AVX
_mm256_loadu_ps just followed the same pattern of taking float* until they
finally used void* for intrinsics introduced with or after AVX-512.

The introduction of _mm_loadu_si32 and si16 is another step in the right
direction, recognizing that _mm_cvtsi32_si128( *int_ptr ) isn't strict-aliasing
safe.  When those were new, it might have been around the time Intel started
exploring replacing ICC with the LLVM-based ICX.

Anyway, the requirement to support misaligned vector and float/double pointers
implies that _mm_load_ss/sd taking float*/double* doesn't imply alignof(float)
or alignof(double).

>  So either the intrinsics definition needs to be changed to be
> correct or GCC is correct.

That's an option; I'd love it if all the load/store intrinsics were changed
across all compilers to take void*.  It's ugly and a pain to type  
_mm_loadu_si128( (const __m128i*)ptr )
as well as creating cognitive dissonance because alignof(__m128i) == 16.

I'm not sure if it could break anything to change the intrinsics to take void*
even for older ones; possibly only C++ overload resolution for insane code that
defines a _mm_loadu_ps( other_type * ) and relies on float* args picking the
intrinsic.

If we changed just GCC, without getting buy-in from other compilers, taking
void* would let people's code compile on GCC without casts from stuff like
int*, when it wouldn't compile on other compilers.

That could be considered a bad thing if people test their code with GCC and are
surprised to get reports of failure from people using compilers that follow
Intel's documentation for the intrinsic function arg types. 
(https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).  It
would basically be a case of being overly permissive for the feature / API that
people are trying to write portable code against.

[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd

2022-03-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #14 from Peter Cordes  ---
This bug is mis-categorized; it's not a sanitizer bug, it's a bug in the
implementation of _mm_load_ss / _mm_load_sd.

It currently derefs the  `float const*` arg directly, which is not
strict-aliasing or alignment safe.  alignof(float) is 4, but Intel's
documentation for this API still says "mem_addr does not need to be aligned on
any particular boundary."

_mm_load_ss (float const *__P)
{
  return _mm_set_ss (*__P);
}


As discussed on PR99754, _mm_loadu_si32(const void*) *is* strict-aliasing and
alignment safe.  But it was only added recently, and GCC11's implementation of it
is buggy (shuffling the element to the wrong place).  Before that, one safe way
to do a 32-bit SIMD load is with _mm_load_ss and _mm_castps_si128.  Or it was
supposed to be safe, but isn't!!

Clang uses a packed may_alias struct containing a float to get a safe load
done.  Another way would be casting the pointer to

typedef float aliasing_unaligned_f32 __attribute__((aligned(1),may_alias));

This is similar to what we do with __m32_u for use in aliasing-safe integer
load/store, except we define that as int with
vector_size(4),may_alias,aligned(1) for some reason.  Perhaps influenced by
__m64_u which is a vector of 2 ints.
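
A hedged sketch of how such a typedef could be used to make the load both
alignment- and aliasing-safe (illustrative only, not GCC's actual header code):

#include <xmmintrin.h>

typedef float aliasing_unaligned_f32 __attribute__((aligned(1), may_alias));

static __m128 load_ss_safe(const void *p)
{
    return _mm_set_ss(*(const aliasing_unaligned_f32 *)p);
}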

MSVC is like gcc -fno-strict-aliasing, so however it handles intrinsics,
they're always aliasing-safe.

I'm not 100% sure about what ICC formally guarantees, but in practice it
doesn't move aliasing short*  stores across a _mm_load_ss( (float*)pshort )
load.
https://godbolt.org/z/6s76v71xz  I didn't test with _mm_store_ss aliasing with
short loads, only vice versa.

So GCC is the odd one out, out of the major 4 compilers that support Intel's
intrinsics API.  All our narrow load/store intrinsics should be strict-aliasing
and alignment safe, regardless of what pointer type they accept.

Intel's early design of taking float* and double* instead of void* could be
considered poor design.  Their naming with just load/store instead of
_mm_loadu_ss / storeu is also poor design, clearly motivated by the asm
differences rather than an actual intrinsic API difference.

In x86 asm, loads/stores narrower than 16 bytes never require alignment (unless
the AC bit is set in EFLAGS).  Assuming Intel modeled their intrinsics API
after their asm, then it makes sense to have load and loadu for ps and si128,
but only load/store with an implied lack of alignment for intrinsics that wrap
instructions like movlps / movhps / movss / movsd, and movd / movq, which do
narrower memory accesses.

That of course *doesn't* make sense in C terms, where it's always potentially a
problem to dereference misaligned pointers to narrow objects, even when
compiling for x86-64:
https://stackoverflow.com/questions/47510783/why-does-unaligned-access-to-mmaped-memory-sometimes-segfault-on-amd64
has an example and links some others, showing that compilers *don't* define the
behaviour of deref of misaligned pointers.

I'm pretty certain that Intel always intended their narrow load/store
intrinsics to not have any alignment requirements, like the asm instructions
that wrap them, but weren't thinking in C terms when naming them.  And they
were sloppy in their choices of which ones to provide until decades later,
since it seems they thought that _mm_cvtsi32_si128(*x) was sufficient for a
movd load.  (That's only true on a compiler without strict-aliasing or
alignment requirements, since the deref happens on the user's plain int*.)

Anyway, hopefully this refutes the argument that _mm_load_sd should be aligned
because of the name, and clarifies what Intel might have been thinking when
naming these.

[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

2022-03-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754

--- Comment #6 from Peter Cordes  ---
Looks good to me, thanks for taking care of this quickly, hopefully we can get
this backported to the GCC11 series to limit the damage for people using these
newish intrinsics.  I'd love to recommend them for general use, except for this
GCC problem where some distros have already shipped GCC versions that compile
without error but in a 100% broken way.

Portable ways to do narrow alignment/aliasing-safe SIMD loads were sorely
lacking; there aren't good effective workarounds for this, especially for
16-bit loads.  (I still don't know how to portably / safely write code that
will compile to a memory-source PMOVZXBQ across all compilers; Intel's
intrinsics API is rather lacking in some areas and relies on compilers folding
loads into memory source operands.)


> So, isn't that a bug in the intrinsic guide instead?

Yes, __m128i _mm_loadu_si16 only really makes sense with SSE2 for PINSRW.  Even
movzx into an integer reg and then MOVD xmm, eax requires SSE2.  With only SSE1
you'd have to movzx / dword store to stack / MOVSS reload.

SSE1 makes *some* sense for _mm_loadu_si32 since it can be implemented with a
single MOVSS if MOVD isn't available.

But we already have SSE1 __m128 _mm_load_ss(const float *) for that.

Except GCC's implementation of _mm_load_ss isn't alignment and strict-aliasing
safe; it derefs the actual float *__P as _mm_set_ss (*__P).  Which I think is a
bug, although I'm not clear what semantics Intel intended for that intrinsic. 
Clang implements it as alignment/aliasing safe with a packed may_alias struct
containing a float.  MSVC always behaves like -fno-strict-aliasing, and I
*think* ICC does, too.

Perhaps best to follow the crowd and make all narrow load/store intrinsics
alignment and aliasing safe, unless that causes code-gen regressions; users
can write _mm_set_ss( *ptr ) themselves if they want to tell the compiler that
it's a normal C float object.

Was going to report this, but PR84508 is still open and already covers the
relevant ss and sd intrinsics.  That points out that Intel specifically
documents it as not requiring alignment, not mentioning aliasing.



Speaking of bouncing through a GP-integer reg, GCC unfortunately does that; it
seems to incorrectly think PINSRW xmm, mem, 0 requires -msse4.1, unlike with a
GP register source.  Reported as PR105066 along with related missed
optimizations about folding into a memory source operand for pmovzx/sx.

But that's unrelated to correctness; this bug can be closed unless we're
keeping it open until it's fixed in the GCC11 current stable series.

[Bug target/105066] New: GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

Bug ID: 105066
   Summary: GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not
SSE2?  _mm_loadu_si16 bounces through integer reg
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

PR99754 fixed the wrong-code for _mm_loadu_si16, but the resulting asm is not
efficient without -msse4.1 (as part of -march= most things).  It seems GCC
thinks that pinsrw / pextrw with a memory operand requires SSE4.1, like
pinsr/extr for b/d/q operand-size.  But actually 16-bit insr/extr only needs
SSE2.

(We're also not efficiently folding it into a memory source operand for
PMOVZXBQ, see below)

https://godbolt.org/z/dYchb6hec shows GCC trunk 12.0.1 20220321

__m128i load16(void *p){
return _mm_loadu_si16( p );
}


load16(void*): # no options, or -march=core2 or -mssse3
        movzwl  (%rdi), %eax
        pxor    %xmm1, %xmm1
        pinsrw  $0, %eax, %xmm1   # should be MOVD %eax, or PINSRW mem
        movdqa  %xmm1, %xmm0
        ret

vs. 

load16(void*):  # -msse4.1
        pxor    %xmm1, %xmm1
        pinsrw  $0, (%rdi), %xmm1
        movdqa  %xmm1, %xmm0
        ret


The second version is actually 100% fine with SSE2:
https://www.felixcloutier.com/x86/pinsrw shows that there's only a single
opcode for PINSRW xmm, r32/m16, imm8 and it requires SSE2; reg vs. mem source
is just a matter of the modr/m byte.

The same problem exists for _mm_storeu_si16 not using pextrw to memory (which
is also SSE2), instead bouncing through EAX.  (Insanely still PEXTRW instead of
MOVD).



There is a choice of strategy here, but pinsrw/extrw between eax and xmm0 is
clearly sub-optimal everywhere.  Once we factor out the dumb register
allocation that wastes a movdqa, the interesting options are:

        movzwl  (%rdi), %eax        # 1 uop on everything
        movd    %eax, %xmm0         # 1 uop on everything

vs.

        pxor    %xmm0, %xmm0        # 1 uop for the front-end, eliminated on Intel
        pinsrw  $0, (%rdi), %xmm0   # 2 uops  (load + shuffle/merge)


Similarly for extract,

        pextrw  $0, %xmm0, (%rdi)   # 2 uops on most

vs.

        movd    %xmm0, %eax         # 1 uop, only 1/clock even on Ice Lake
        movw    %ax, (%rdi)         # 1 uop

On Bulldozer-family, bouncing through an integer reg adds a lot of latency vs.
loading straight into the SIMD unit.  (2 integer cores share a SIMD/FP unit, so
movd between XMM and GP-integer is higher latency than most.)  So that would
definitely favour pinsrw/pextrw with memory.

On Ice Lake, pextrw to mem is 2/clock throughput: the SIMD shuffle can run on
p1/p5.  But MOVD r,v is still p0 only, and MOVD v,r is still p5 only.  So that
also favours pinsrw/pextrw with memory, despite the extra front-end uop for
pxor-zeroing the destination on load.

Of course, if _mm_storeu_si16 is used on a temporary that's later reloaded,
being able to optimize to a movd (and optionally movzx) is very good.  Similar
for _mm_loadu_si16 on a value we have in an integer reg, especially if we know
it's already zero-extended to 32-bit for just a movd, we'd like to be able to
do that.

---

It's also essential that these loads fold efficiently into memory source
operands for PMOVZX; pmovzxbq is one of the major use-cases for a 16-bit load.

That may be a separate bug, IDK

https://godbolt.org/z/3a9T55n3q shows _mm_cvtepu8_epi32(_mm_loadu_si32(p)) does
fold a 32-bit memory source operand nicely to pmovzxbd (%rdi), %xmm0 which can
micro-fuse into a single uop on Intel CPUs (for the 128-bit destination
version, not YMM), but it's a disaster with 16-bit loads:

__m128i pmovzxbq(void *p){
    return _mm_cvtepu8_epi64(_mm_loadu_si16(p));
}

pmovzxbq(void*):  # -O3 -msse4.1 -mtune=haswell
        pxor     %xmm0, %xmm0       # 1 uop
        pinsrw   $0, (%rdi), %xmm0  # 2 uops, one for shuffle port
        pmovzxbq %xmm0, %xmm0       # 1 uop for the same shuffle port
        ret

(_mm_cvtepu8_epi64 requires SSE4.1 so there's no interaction with the
-mno-sse4.1 implementation of the load.)

[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

2022-03-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754

--- Comment #3 from Peter Cordes  ---
Wait a minute, the current implementation of _mm_loadu_si32 isn't
strict-aliasing or alignment safe!!!   That defeats the purpose for its
existence as something to use instead of _mm_cvtsi32_si128( *(int*)p );

The current code contains a deref of a plain (int*).
It should be using something like

typedef int unaligned_aliasing_int __attribute__((aligned(1),may_alias));

[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

2022-03-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #2 from Peter Cordes  ---
Can we get this patch applied soon?  There aren't any other
strict-aliasing-safe movd load intrinsics, but this one won't be portably
usable while there are buggy GCC versions around.

Until then, code should probably use something like


inline __m128i movd(void *p){
    return _mm_castps_si128(_mm_load_ss((const float*)p));
}

(Which, believe it or not, is strict-aliasing safe even on integer data.  At
least it should be; it was the last time I tested across compilers, except
maybe on ICC.  I'd have to double-check there.)

[Bug target/104773] New: compare with 1 not merged with subtract 1

2022-03-03 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773

Bug ID: 104773
   Summary: compare with 1 not merged with subtract 1
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*, arm-*-*

std::bit_ceil(x) involves if(x == 0 || x == 1) return 1;
and 1u << (32-clz(x-1)).

The compare of course compiles to an unsigned <= 1, which can be done with a
sub instead of cmp, producing the value we need as an input for the
leading-zero count.  But GCC does *not* do this.  (Neither does clang for
x86-64).  I trimmed down the libstdc++ <bit> code into something I could
compile even when Godbolt doesn't have working headers for some ISAs:
https://godbolt.org/z/3EE7W5bna

// cut down from libstdc++ for normal integer cases; compiles the same
  template<typename _Tp>
    constexpr _Tp
    bit_ceil(_Tp __x) noexcept
    {
      constexpr auto _Nd = std::numeric_limits<_Tp>::digits;
      if (__x == 0 || __x == 1)
        return 1;
      auto __shift_exponent = _Nd - __builtin_clz((_Tp)(__x - 1u));
      // using __promoted_type = decltype(__x << 1); ... // removed check for
x<

[Bug libstdc++/97759] Could std::has_single_bit be faster?

2022-03-03 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #14 from Peter Cordes  ---
Agreed with the idea of doing the  popcount(x) == 1  peephole
replacement in the compiler, not forcing the header to figure out whether we
have efficient popcount or not.
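
For reference, the source pattern being discussed (a sketch;
std::has_single_bit is the libstdc++ function this PR is about):

#include <bit>

bool is_pow2(unsigned x)
{
    return std::has_single_bit(x);   // equivalent to std::popcount(x) == 1
}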

If we have BMI1, its BLSR (Reset Lowest Set Bit) sets CF=1 if the input was
zero, and ZF=1 if the output is zero.  Unfortunately none of the standard
jcc/setcc conditions check ZF=1 & CF=0, even with CMC to invert CF first.
(If Intel had designed it to produce ZF and CF inverted, it would be
non-intuitive for all other uses, but it would have allowed blsr / jcc to
implement if(has_single_bit(x)).)

CF=1, ZF=0  impossible: input was zero, output was non-zero
CF=1, ZF=1  input was zero
CF=0, ZF=0  input had multiple bits set
CF=0, ZF=1  input had a single bit set.

If we're going to branch on it anyway after inlining, a branchy strategy is
probably good:

singlebit_bmi2_branchy:
        xor     %eax, %eax
        blsr    %edi, %edi       # ZF=1 means we cleared the last bit, or the input was zero
        jc      .Linput_zero     # input was zero, return 0 regardless of ZF
        setz    %al
.Linput_zero:
        ret


And when we want a boolean in a register, a combination of setz and cmovc can
materialize one.  With clever choice of registers, we can even avoid giving
setcc a false dependency on a register that isn't already part of its dep chain

singlebit_bmi2_cmov:
        blsr    %edi, %eax
        setz    %al          # false dep, but it's ready if FLAGS are ready because we wrote it with BLSR
        cmovc   %edi, %eax   # return 1 only if ZF=1 (setz produces 1) and CF=0 (cmovc doesn't overwrite it with the input 0)
        ret

With xor-zeroing first, we could produce the boolean zero-extended to 32-bit,
instead of here where only the low 8 bits are actually 0 / 1.  (Which is fine
for returning a bool in all the mainstream calling conventions)

(This is the same kind of idea as ARM64 sub/tst / ccmp / cset, where ccmp can
conditionally update flags.)

An evil variation on this uses setnz / dec to invert ZF without affecting CF,
allowing JA:

        blsr    %edi, %eax
        setnz   %al          # AL = !ZF
        dec     %al          # 1-1 -> ZF=1,  0-1 -> ZF=0.  ZF=!ZF without affecting CF
        # seta  %al          # set on CF=0 and ZF=0
        ja      was_single_bit   # only actually useful for branching after inlining

dec/ja can't macro-fuse into a single uop, but on Skylake and later Intel it
doesn't cost any extra partial-FLAGS merging uops, because JA simply has both
parts of FLAGS as separate inputs.  (This is why CMOVA / CMOVBE are still 2
uops on Skylake, unlike all other forms: they need 2 integer inputs and 2 parts
of FLAGS, while others need either CF or SPAZO not both.  Interestingly, Zen1/2
have that effect but not Zen3)

I don't know how AMD handles dec / ja partial-flags shenanigans.  Intel Haswell
would I think have a flags-merging uop; older Intel doesn't support BMI1 so
P6-family is irrelevant.  https://stackoverflow.com/a/49868149/224132

I haven't benchmarked them because they have different use-cases (materializing
a boolean vs. creating a FLAGS condition to branch on, being branchless
itself), so any single benchmark would make one of them look good.  If your
data almost never (or always) has an all-zero input, the JC in the first
version will predict well.  After inlining, if the caller branches on the bool
result, you might want to just branch on both conditions separately.

I don't think this setnz/dec/ja version is ever useful.  Unlike branching
separately on ZF and CF, it's not bad if both 0 and multi-bit inputs are common
while single-bit inputs are rare.  But blsr/setz/cmovc + test/jnz is only 4
uops, same as this on Skylake. (test can macro-fuse with jnz).

The uops are all dependent on each other, so it also has the same latency (to
detect a branch miss) as popcnt / macro-fused cmp/je which is 2 uops.  The only
thing this has going for it is avoiding a port-1-only uop, I think.

It's also possible to blsr / lahf / and  ah, (1<<6) | (1<<0) / cmp  ah, 1<<6   
to directly check that ZF=1 and CF=0.  I doubt that's useful.  Or hmm, can we
branch directly on PF after AND with that 2-bit mask?  CF=1 ZF=0 is impossible,
so the only other odd-parity case is CF=0 ZF=1.  AMD and Intel can macro-fuse
test/jp.

        blsr    %edi, %eax
        lahf
        test    $(1<<6) | (1<<0), %ah   # check ZF and CF.
        jpo     was_single_bit          # ZF != CF means CF=0, ZF=1 because the other way is impossible.

Also possible of course is the straightforward 2x setcc and AND to materialize
a boolean in the bottom byte of EAX.  Good ILP, only 3 cycle latency from input
to result on Intel, but that's the same as setz/cmovc

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-10-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #11 from Peter Cordes  ---
Also, horizontal byte sums are generally best done with  VPSADBW against a zero
vector, even if that means some fiddling to flip to unsigned first and then
undo the bias.

simde_vaddlv_s8:
 vpxor    xmm0, xmm0, .LC0[rip]  # set1_epi8(0x80) flip to unsigned 0..255 range
 vpxor    xmm1, xmm1, xmm1
 vpsadbw  xmm0, xmm0, xmm1       # horizontal byte sum within each 64-bit half
 vmovd    eax, xmm0              # we only wanted the low half anyway
 sub      eax, 8 * 128           # subtract the bias we added earlier by flipping sign bits
 ret
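
A hedged arithmetic check of the bias trick (my sketch, not from the PR): XOR
with 0x80 maps a signed byte b in [-128,127] to the unsigned value b + 128, so
after summing 8 biased bytes we subtract 8 * 128 to recover the signed sum.

#include <stdint.h>

static int16_t biased_sum_reference(const int8_t v[8])
{
    unsigned sum_u = 0;
    for (int i = 0; i < 8; i++)
        sum_u += (uint8_t)(v[i] ^ 0x80);   /* what vpxor + vpsadbw compute */
    return (int16_t)(sum_u - 8 * 128);     /* undo the per-byte +128 bias */
}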

This is so much shorter we'd still be ahead if we generated the vector constant
on the fly instead of loading it.  (3 instructions: vpcmpeqd same,same / vpabsb
/ vpslld by 7.  Or pcmpeqd / psllw 8 / packsswb same,same to saturate to -128)

If we had wanted a 128-bit (16 byte) vector sum, we'd need

  ...
  vpsadbw ...

  vpshufd  xmm1, xmm0, 0xfe   # shuffle upper 64 bits to the bottom
  vpaddd   xmm0, xmm0, xmm1
  vmovd    eax, xmm0
  sub      eax, 16 * 128

Works efficiently with only SSE2.  Actually with AVX2, we should unpack the top
half with VUNPCKHQDQ to save a byte (no immediate operand), since we don't need
PSHUFD copy-and-shuffle.

Or movd / pextrw / scalar add but that's more uops: pextrw is 2 on its own.

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-10-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #10 from Peter Cordes  ---
Current trunk with -fopenmp is still not good https://godbolt.org/z/b3jjhcvTa 
Still doing two separate sign extensions and two stores / wider reload (store
forwarding stall):

-O3 -march=skylake -fopenmp
simde_vaddlv_s8:
        push    rbp
        vpmovsxbw   xmm2, xmm0
        vpsrlq  xmm0, xmm0, 32
        mov     rbp, rsp
        vpmovsxbw   xmm3, xmm0
        and     rsp, -32
        vmovq   QWORD PTR [rsp-16], xmm2
        vmovq   QWORD PTR [rsp-8], xmm3
        vmovdqa xmm4, XMMWORD PTR [rsp-16]
        ... then asm using byte-shifts

Including stuff like
   movdqa  xmm1, xmm0
   psrldq  xmm1, 4

instead of pshufd, which is an option because high garbage can be ignored.

And ARM64 goes scalar.



Current trunk *without* -fopenmp produces decent asm
https://godbolt.org/z/h1KEKPTW9

For ARM64 we've been making good asm since GCC 10.x (vs. scalar in 9.3)
simde_vaddlv_s8:
        sxtl    v0.8h, v0.8b
        addv    h0, v0.8h
        umov    w0, v0.h[0]
        ret

x86-64 gcc  -O3 -march=skylake
simde_vaddlv_s8:
vpmovsxbw   xmm1, xmm0
vpsrlq  xmm0, xmm0, 32
vpmovsxbw   xmm0, xmm0
vpaddw  xmm0, xmm1, xmm0
vpsrlq  xmm1, xmm0, 32
vpaddw  xmm0, xmm0, xmm1
vpsrlq  xmm1, xmm0, 16
vpaddw  xmm0, xmm0, xmm1
vpextrw eax, xmm0, 0
ret


That's pretty good, but  VMOVD eax, xmm0  would be more efficient than  VPEXTRW
when we don't need to avoid high garbage (because it's a return value in this
case).  VPEXTRW zero-extends into RAX, so it's not directly helpful if we need
to sign-extend to 32 or 64-bit for some reason; we'd still need a scalar movsx.

Or with BMI2, go scalar before the last shift / VPADDW step, e.g.
  ...
  vmovd  eax, xmm0
  rorx   edx, eax, 16
  add    eax, edx

[Bug tree-optimization/80570] auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract

2021-09-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570

--- Comment #3 from Peter Cordes  ---
(In reply to Andrew Pinski from comment #2)
> Even on aarch64:
> 
> .L2:
> ldr q0, [x1], 16
> sxtl    v1.2d, v0.2s
> sxtl2   v0.2d, v0.4s
> scvtf   v1.2d, v1.2d
> scvtf   v0.2d, v0.2d
> stp q1, q0, [x0]
>
> But the above is decent really.

More than decent; that's what we *should* be doing, I think.

AArch64 has versions of most instructions that read the top of a vector, unlike
x86-64 where VPMOVZX / SX can only read from the bottom half.  That's the key
difference, and what makes this strategy good on ARM, bad on x86-64.

(On 32-bit ARM, you load a q register, then read the two halves separately as
64-bit d<0..31> registers.  AArch64 changed that so there are 32x 128-bit
vector regs, and no partial regs aliasing the high half.  But they provide OP,
OP2 versions of some instructions that widen or things like that, with the "2"
version accessing a high half.  Presumably part of the motivation is to make it
easier to port ARM NEON code that depended on accessing halves of a 128-bit q
vector using its d regs.  But it's a generally reasonable design and could also
be motivated by seeing how inconvenient things get in SSE and AVX for
pmovsx/zx.) 

 Anyway, AArch64 SIMD is specifically designed to make it fully efficient to do
wide loads and then unpack both halves, as was possible in 32-bit ARM but not
in x86-64.

It's also using a store (of a pair of regs) that's twice the width of the load.
 But even if it was using a max-width load of a pair of 128-bit vectors (and
having to store two pairs) that would be good, just effectively unrolling.  But
GCC sees it as one load and two separate stores, that it just happens to be
able to combine as a pair.
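
For context, the loop shape this PR is about is roughly the following (my own
minimal reconstruction, not the exact test case):

// Auto-vectorized int -> double conversion: each vector of ints widens to
// two vectors of doubles, so both halves of the load need to be unpacked.
void int_to_double(double *dst, const int *src, long n)
{
    for (long i = 0; i < n; i++)
        dst[i] = src[i];
}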

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2021-09-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #9 from Peter Cordes  ---
Thanks for implementing my idea :)

(In reply to Hongtao.liu from comment #6)
> For elements located above 128bits, it seems always better(?) to use
> valign{d,q}

TL:DR:
 I think we should still use vextracti* / vextractf* when that can get the job
done in a single instruction, especially when the VEX-encoded vextracti/f128
can save a byte of code size for v[4].

Extracts are simpler shuffles that might have better throughput on some future
CPUs, especially the upcoming Zen4, so even without code-size savings we should
use them when possible.  Tiger Lake has a 256-bit shuffle unit on port 1 that
supports some common shuffles (like vpshufb); a future Intel might add
256->128-bit extracts to that.

It might also save a tiny bit of power, allowing on-average higher turbo
clocks.

---

On current CPUs with AVX-512, valignd is about equal to a single vextract, and
better than multiple instructions.  It doesn't really have downsides on current
Intel, since I think Intel has continued to not have int/FP bypass delays for
shuffles.

We don't know yet what AMD's Zen4 implementation of AVX-512 will look like.  If
it's like Zen1 was AVX2 (i.e. if it decodes 512-bit instructions other than
insert/extract into at least 2x 256-bit uops) a lane-crossing shuffle like
valignd probably costs more than 2 uops.  (vpermq is more than 2 uops on
Piledriver/Zen1).  But a 128-bit extract will probably cost just one uop.  (And
especially an extract of the high 256 might be very cheap and low latency, like
vextracti128 on Zen1, so we might prefer vextracti64x4 for v[8].)

So this change is good, but using a vextracti64x2 or vextracti64x4 could be a
useful peephole optimization when byte_offset % 16 == 0.  Or of course
vextracti128 when possible (x/ymm0..15, not 16..31 which are only accessible
with an EVEX-encoded instruction).

vextractf-whatever allows an FP shuffle on FP data in case some future CPU
cares about that for shuffles.

An extract is a simpler shuffle that might have better throughput on some
future CPU even with full-width execution units.  Some future Intel CPU might
add support for vextract uops to the extra shuffle unit on port 1.  (Which is
available when no 512-bit uops are in flight.)  Currently (Ice Lake / Tiger
Lake) it can only run some common shuffles like vpshufb ymm, but not including
any vextract or valign.  Of course port 1 vector ALUs are shut down when
512-bit uops are in flight, but could be relevant for __m256 vectors on these
hypothetical future CPUs.

When we can get the job done with a single vextract-something, we should use
that instead of valignd.  Otherwise use valignd.

We already check the index for low-128 special cases to use vunpckhqdq vs.
vpshufd (or vpsrldq) or similar FP shuffles.

-

On current Intel, with clean YMM/ZMM uppers (known by the CPU hardware to be
zero), an extract that only writes a 128-bit register will keep them clean
(even if it reads a ZMM), not needing a VZEROUPPER.  Since VZEROUPPER is only
needed for dirty y/zmm0..15, not with dirty zmm16..31, so a function like

float foo(float *p) {
  some vector stuff that can use high zmm regs;
  return scalar that happens to be from the middle of a vector;
}

could vextract into XMM0, but would need vzeroupper if it used valignd into
ZMM0.
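
A concrete (hypothetical) C version of that shape; the helper name, the plain
unaligned 512-bit load, and the choice of element 8 are my own illustration,
and the register allocation (low vs. high zmm) is of course up to the compiler:

#include <immintrin.h>

float middle_elem(const float *p)
{
    __m512 v  = _mm512_loadu_ps(p);             // stand-in for "some vector stuff"
    __m128 hi = _mm512_extractf32x4_ps(v, 2);   // elements 8..11 into an xmm
    return _mm_cvtss_f32(hi);                   // return element 8
}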

(Also related
https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
re reading a ZMM at all and turbo clock).

---

Having known zeros outside the low 128 bits (from writing an xmm instead of
rotating a zmm) is unlikely to matter, although for FP stuff copying fewer
elements that might be subnormal could happen to be an advantage, maybe saving
an FP assist for denormal.  We're unlikely to be able to take advantage of it
to save instructions/uops (like OR instead of blend).  But it's not worse to
use a single extract instruction instead of a single valignd.

[Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code

2021-09-04 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #37 from Peter Cordes  ---
Correction, PR82666 is that the cmov on the critical path happens even at -O2
(with GCC7 and later).  Not just with -O3 -fno-tree-vectorize.

Anyway, that's related, but probably separate from choosing to do if-conversion
or not after inlining.

[Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code

2021-09-04 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #36 from Peter Cordes  ---
Related:  a similar case of cmov being a worse choice, for a threshold
condition with an array input that happens to already be sorted:

https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2

GCC with -fprofile-generate / -fprofile-use does correctly decide to use
branches.

GCC7 and later (including current trunk) with -O3 -fno-tree-vectorize
de-optimizes by putting the CMOV on the critical path, instead of as part of
creating a zero/non-zero input for the ADD. PR82666.  If you do allow full -O3,
then vectorization is effective, though.

[Bug target/15533] Missed move to partial register

2021-08-22 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #5 from Peter Cordes  ---
The new asm is less bad, but still not good.  PR53133 is closed, but this code-gen
is a new instance of partial-register writing with xor al,al.  Also related:
PR82940 re: identifying bitfield insert patterns in the middle-end; hopefully
Andrew Pinski's planned set of patches to improve that can help back-ends do a
better job?

If we're going to read a 32-bit reg after writing an 8-bit reg (causing a
partial-register stall on Nehalem and earlier), we should be doing

  mov  a, %al   # merge into the low byte of RAX
  ret

Haswell and newer Intel don't rename the low byte partial register separately
from the full register, so they behave like AMD and other non-P6 /
non-Sandybridge CPU: dependency on the full register.  That's good for this
code; in this case the merging is necessary and we don't want the CPU to guess
that it won't be needed later.  The load+ALU-merge uops can micro-fuse into a
single uop for the front end.

 xor %al,%al still has a false dependency on the old value of RAX because it's
not a zeroing idiom; IIRC in my testing it's at least as good to do  mov $0,
%al.  Both instructions are 2 bytes long.

*
https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers
 survey of the ways partial regs are handled on Intel P6 family vs. Intel
Sandybridge vs. Haswell and later vs. non-Intel and Intel Silvermont etc.
*
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to
- details of my testing on Haswell / Skylake.



*If* we still care about  -mtune=nehalem  and other increasingly less relevant
CPUs, we should be avoiding a partial register stall for those tuning options
with something like

   movzbl   a, %edx
   and  $-256, %eax
   or   %edx, %eax

i.e. what we're already doing, but spend a 5-byte AND-immediate instead of a
2-byte xor %al,%al or mov $0, %al

(That's what clang always does, so it's missing the code-size optimization.
https://godbolt.org/z/jsE57EKcb shows a similar case of return (a&0xFFFFFF00u)
| (b&0xFFu); with two register args)

-

The penalty on Pentium-M through Nehalem is to stall for 2-3 cycles while a
merging uop is inserted.  The penalty on earlier P6 (PPro / Pentium III) is to
stall for 5-6 cycles until the partial-register write retires.

The penalty on Sandybridge (and maybe Ivy Bridge if it renames AL) is no stall,
just insert a merging uop.

On later Intel, and AMD, and Silvermont-family Intel, writing AL has a
dependency on the old RAX; it's a merge on the spot.

BTW, modern Intel does still rename AH separately, and merging does require the
front-end to issue a merging uop in a cycle by itself.  So writing AH instead
of AL would be different.

[Bug middle-end/82940] Suboptimal code for (a & 0x7f) | (b & 0x80) on powerpc

2021-08-22 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82940

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #6 from Peter Cordes  ---
For a simpler test case, GCC 4.8.5 did redundantly mask before using
bitfield-insert, but GCC 9.2.1 doesn't.


unsigned merge2(unsigned a, unsigned b){
return (a&0xFFFFFF00u) | (b&0xFFu);
}

https://godbolt.org/z/froExaPxe
# PowerPC (32-bit) GCC 4.8.5
rlwinm 4,4,0,0xff # b &= 0xFF is totally redundant
rlwimi 3,4,0,24,31
blr

# power64 GCC 9.2.1 (ATI13.0)
rlwimi 3,4,0,255    # bit-blend according to mask, rotate count=0
rldicl 3,3,0,32 # Is this zero-extension to 64-bit redundant?
blr

But ppc64 GCC does zero-extension of the result from 32 to 64-bit, which is
probably not needed unless the calling convention has different requirements
for return values than for incoming args.  (I don't know PPC well enough.)

So for at least some cases, modern GCC does ok.

Also, when the blend isn't split at a byte boundary, even GCC4.8.5 manages to
avoid redundant masking before the bitfield-insert.

unsigned merge2(unsigned a, unsigned b){
return (a & 0xFFFFFF80u) | (b & 0x7Fu);
}

rlwimi 3,4,0,25,31   # GCC4.8.5, 32-bit so no zero-extension
blr

[Bug tree-optimization/100922] CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922

--- Comment #2 from Peter Cordes  ---
Possibly also related:

With different surrounding code, this loop can compile to asm which has two
useless movz / mov register copies in the loop at -O2 
(https://godbolt.org/z/PTcqzM6q7).  (To set up for entry into the next loop in
over-complicated ways, and doing this in the loop is unnecessary.)


  while( lut[(unsigned char)*str] == 0 ){  // also catches terminating 0
str++;
  }


.L19:
movzbl  1(%rdi), %edx
addq    $1, %rdi
movzbl  %dl, %ecx
movl    %edx, %eax
cmpb    $0, -120(%rsp,%rcx)
je  .L19

from source

void remove_chars(char *restrict str, const char *restrict remove)
{
  char lut[256] = {0};
  do {
lut[(unsigned char)*remove] = -1;
  }while(*remove++);

/***   Over complicated asm in this loop */
  while( lut[(unsigned char)*str] == 0 ){  // also catches terminating 0
str++;
  }
  // str points at first char to *not* keep (or the terminating 0)
  const char *in = str;
  char *out = str;
  while (*in)
{
  char mask = lut[(unsigned char)*in];
unsigned char cin = *in, cout = *out;
*out = mask ? cout : cin;
  out += mask + 1;
  in++;
}
  *out = *in;
}

[Bug tree-optimization/100922] New: CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922

Bug ID: 100922
   Summary: CSE leads to fully redundant (back to back)
zero-extending loads of the same thing in a loop, or a
register copy
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

Created attachment 50948
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50948&action=edit
redundant_zero_extend.c

It's rarely a good idea to load the same thing twice; generally better to copy
a register.  Or to read the same register twice when a copy isn't needed.  So
the following asm should never happen, but it does with current trunk, and
similar with GCC as old as 4.5

movzbl  (%rax), %edx
movzbl  (%rax), %ecx    # no branch target between these instructions

or 

ldrb    w4, [x2]
ldrb    w3, [x2], 1 # post-indexed *x2++

(Happens at -O3.  With -O2 we have a redundant register copy, so either way
still a wasted instruction.  And there are other differences earlier in the
function with -O2 vs. -O3.)

https://godbolt.org/z/jT7WaWeK8 - minimal test case. x86-64 and AArch64 trunk
show basically identical code structure.  x86-64 gcc (Compiler-Explorer-Build)
12.0.0 20210603 and aarch64-unknown-linux-gnu-gcc (GCC) 12.0.0 20210524

void remove_chars_inplace(char *str, const unsigned char keep_lut[256])
{
  while(keep_lut[(unsigned char)*str]){ // can be an if() and still repro
str++;// keep_lut[0] is false
  }

  char *out = str;
  unsigned char c;   /* must be unsigned char for correctness. */
  do {
  c = *str++;
  unsigned char inc = keep_lut[c];  // unsigned long doesn't help
  *out = c;
  out += inc;   // inc=0 or 1 to let next char overwrite or not
} while(c);
}

x86-64 asm:

remove_chars_inplace:
jmp .L8
.L3:    # top of search loop for first char to remove
addq    $1, %rdi
.L8:    # loop entry point
movzbl  (%rdi), %eax
cmpb    $0, (%rsi,%rax)  # un-laminates and doesn't macro-fuse ...
jne .L3

cmpb    $0, (%rdi)  # 2nd loop body can be skipped if *str == 0
# should be test %al,%al  - this char was
already loaded.
leaq    1(%rdi), %rax   # even -march=znver2 fails to move this
earlier or later to allow cmp/je fusion.  (Intel won't macro-fuse cmp imm,mem /
jcc)
je  .L1

.L5: # TOP OF 2ND LOOP
movzbl  (%rax), %edx
movzbl  (%rax), %ecx # redundant load of *str
addq    $1, %rax
movzbl  (%rsi,%rdx), %edx  # inc = lut[c]
movb    %cl, (%rdi)
addq    %rdx, %rdi   # out += inc
testb   %cl, %cl
jne .L5    # }while(c != 0)
.L1:
ret

IDK if it's interesting or not that the   cmpb $0, (%rdi)  is also a redundant
load.  The first loop left *str, i.e. (%rdi), in EAX.  Putting the LEA between
cmp and je (even with -march=znver2) is a separate missed optimization. 
(unless that's working around Intel's JCC erratum)

With only -O2 instead of -O3, we get better asm in that part: it takes
advantage of having the char in AL, and jumps into the middle of the next loop
after xor-zeroing the `inc` variable.


Replacingc = *str++;  with
  c = *str;
  str++;
results in a wasted register copy with trunk, instead of a 2nd load (on x86-64
and arm64).  Still a missed opt, but less bad.  GCC7 and earlier still do an
extra load with either way of writing that.

Removing the first loop, or making its loop condition something like  *str &&
keep_lut[*str],  removes the problem entirely.  The CSE possibility is gone. 
(Same even if we use lut[*(unsigned char*)str] - type-pun the pointer to
unsigned char instead of casting the signed char value to unsigned char, on x86
where char is signed, but not on arm64 where char is unsigned.)

---

I didn't find any clear duplicates; the following are barely worth mentioning:
*  pr94442 looks like extra spilling, not just redundant loading.
*  pr97366 is due to vectors of different types, probably.
*  pr64319 needs runtime aliasing detection to avoid, unlike this.

The AArch64 version of this does seem to demo pr71942 (a useless  and x4, x2,
255 on an LDRB result) when you get it to copy a register instead of doing a
2nd load.

[Bug rtl-optimization/88770] Redundant load opt. or CSE pessimizes code

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88770

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #2 from Peter Cordes  ---
Note that mov r64, imm64 is a 10-byte instruction, and can be slow to read from
the uop-cache on Sandybridge-family.

The crap involving OR is clearly sub-optimal, but *if* you already have two
spare call-preserved registers across this call, the following is actually
smaller code-size:

movabs  rdi, 21474836483
mov rbp, rdi
movabs  rsi, 39743127552
mov rbx, rsi
call    test
mov rdi, rbp
mov rsi, rbx
call    test


This is more total uops for the back-end though (movabs is still single-uop,
but takes 2 entries in the uop cache on Sandybridge-family;
https://agner.org/optimize/).  So saving x86 machine-code size this way does
limit the ability of out-of-order exec to see farther, if the front-end isn't
the bottleneck.  And it's highly unlikely to be worth saving/restoring two regs
to enable this.  (Or to push rdi / push rsi before call, then pop after!)

Setting up the wrong value and then fixing it twice with OR is obviously
terrible and never has any advantage, but the general idea to CSE large
constants isn't totally crazy.  (But it's profitable only in such limited cases
that it might not be worth looking for, especially if it's only helpful at -Os)
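
For reference, the shape of the test case being discussed is roughly this (my
own reconstruction from the asm above; the function names are placeholders):

void test(unsigned long long a, unsigned long long b);

void caller(void)
{
    // Same two 64-bit constants passed twice; the question is whether
    // CSEing the movabs immediates into call-preserved regs is worth it.
    test(21474836483ULL, 39743127552ULL);
    test(21474836483ULL, 39743127552ULL);
}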

[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

2021-06-03 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636

Peter Cordes  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Peter Cordes  ---
This seems to be fixed for ZMM vectors in GCC8. 
https://gcc.godbolt.org/z/7351be1v4

Seems to have never been a problem for __m256, at least not for 
__m256 zero256(){ return _mm256_setzero_ps(); }
IDK what I was looking at when I originally reported; maybe just clang which
*did* used to prefer YMM-zeroing.

Some later comments suggested movdqa vs. pxor zeroing choices (and mov vs. xor
for integer), but the bug title is just AVX / AVX-512 xor-zeroing, and that
seems to be fixed.  So I think this should be closed.

[Bug tree-optimization/42587] bswap not recognized for memory

2021-05-08 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42587

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #12 from Peter Cordes  ---
(In reply to Andi Kleen from comment #11)
> Only when the first test case is fixed too

https://godbolt.org/z/7M8cx3vT1  GCC8.1 -O3 for x86-64

push    rbx
mov ebx, edi
call    acpi_ut_track_stack_ptr
mov eax, ebx
pop rbx
bswap   eax
ret


The code in the initial report optimizes to bswap with GCC8.1 and later.
Is that the test case you meant?  GCC8.1 was released on May 2, 2018, well
before your Nov comment, so maybe you meant something else.
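
For anyone else checking this: the usual shift/mask pattern that the bswap pass
is expected to recognize looks like the following (a generic example, not the
exact code from the initial report):

#include <stdint.h>

uint32_t swap32(uint32_t x)   // should compile to a single bswap (plus mov/ret)
{
    return  (x >> 24)
         | ((x >>  8) & 0x0000FF00u)
         | ((x <<  8) & 0x00FF0000u)
         |  (x << 24);
}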

[Bug middle-end/98801] Request for a conditional move built-in function

2021-01-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98801

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #5 from Peter Cordes  ---
(In reply to Richard Biener from comment #4)
> Slight complication arises because people will want to have cmoves with a
> memory destination.

Do we even want to provide this?  Most ISAs can't branchlessly conditionally
store, except via an RMW (which wouldn't be thread-safe for the no-store case
if not atomic) or something really clunky.  (Like x86  rep stos  with count=0
or 1.)

ARM predicated instructions allow branchless load or store that doesn't disturb
the memory operand (and won't even fault on a bad address).

I guess another option to emulate it could be to make a dummy local and cmov to
select a store address = dummy : real.  But that's something users can build in
the source using a non-memory conditional-select builtin that exposes the much
more widely available ALU conditional-select functionality like x86 CMOV,
AArch64 CSEL, MIPS MOVN, etc.


> That won't solve the eventual request to have cmov _from_ memory ... (if we
> leave all of the memory combining to RTL people will again complain that
> it's subject to compilers discretion).

It might be sufficient for most use-cases like defending against timing
side-channels to not really try to allow conditional loads (from maybe-invalid
pointers).



I'm not sure if the motivation for this includes trying to make code without
data-dependent branching, to defend against timing side-channels.

But if we do provide something like this, people are going to want to use it
that way.  That's one case where best-effort behaviour at the mercy of the
optimizer for a ternary (or having to manually check the asm) is not great. 
Stack Overflow has gotten a few Q from people looking for guaranteed CMOV
for reasons like that.

So I think we should be wary of exposing functionality that most ISAs don't
have.  OTOH, failing to provide a way to take advantage of functionality that
some ISAs *do* have is not great, e.g. ISO C failing to provide popcnt and
bit-scan (clz / ctz) has been a problem for C for a long time.

But for something like __builtin_clz, emulating on machines that don't have
hardware support still works.  If we're trying to support a guarantee of no
data-dependent branching, that limits the emulation possibilities or makes them
clunkier.  Especially if we want to support ARM's ability to not fault / not
access memory if the condition is false.

The ALU-select part can be emulated with AND/OR, so that's something we can
provide on any target.

Folding memory operands into a predicated load on ARM could actually introduce
data-dependent cache access, vs. an unconditional load and a predicated reg-reg
MOV.  So this becomes somewhat thorny, and some design work to figure out what
documented guarantees to provide will be necessary.  Performance use-cases
would certainly rather just have a conditional load in one instruction.
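
For what it's worth, the AND/OR emulation of an ALU select mentioned above can
be sketched portably like this (a fallback sketch, not a proposed builtin API):

#include <stdint.h>

// Branchless select: returns a if cond is non-zero, else b.
static inline uint64_t select_u64(uint64_t cond, uint64_t a, uint64_t b)
{
    uint64_t mask = 0ULL - (cond != 0);   // all-ones or all-zeros
    return (a & mask) | (b & ~mask);
}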

[Bug tree-optimization/98291] New: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert

2020-12-15 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98291

Bug ID: 98291
   Summary: multiple scalar FP accumulators auto-vectorize worse
than scalar, including vector load + merge instead of
scalar + high-half insert
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: missed-optimization, ssemmx
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

An FP reduction loop with 2 scalar accumulators auto-vectorizes into a mess,
instead of effectively mapping each scalar to an element of one vector
accumulator.  (Unless we use -ffast-math, then that happens.  clang gets it
right even without -ffast-math).

double dotprod(const double *a, const double *b, unsigned long long n)
{
  double d1 = 0.0;
  double d2 = 0.0;

  for (unsigned long long i = 0; i < n; i += 2) {
d1 += a[i] * b[i];
d2 += a[i + 1] * b[i + 1];
  }

  return (d1 + d2);
}

https://godbolt.org/z/Kq48j9

With -ffast-math the nice sane loop we expect

.L3:
movupd  (%rsi,%rax), %xmm0
movupd  (%rdi,%rax), %xmm3
addq    $1, %rdx
addq    $16, %rax
mulpd   %xmm3, %xmm0
addpd   %xmm0, %xmm1
cmpq    %rcx, %rdx
jb  .L3


without: 

...
main loop
.L4:
movupd  (%rcx,%rax), %xmm1    # 16-byte load
movupd  (%rsi,%rax), %xmm3 
movhpd  16(%rcx,%rax), %xmm1  # overwrite the high half of it!!
movhpd  16(%rsi,%rax), %xmm3
mulpd   %xmm3, %xmm1
movupd  16(%rsi,%rax), %xmm3
movlpd  8(%rsi,%rax), %xmm3
addsd   %xmm1, %xmm2
unpckhpd %xmm1, %xmm1
addsd   %xmm1, %xmm2
movupd  16(%rcx,%rax), %xmm1
movlpd  8(%rcx,%rax), %xmm1
addq    $32, %rax
mulpd   %xmm3, %xmm1
addsd   %xmm1, %xmm0
unpckhpd %xmm1, %xmm1
addsd   %xmm1, %xmm0
cmpq    %rdx, %rax
jne .L4

The overall strategy is insane, but even some of the details are insane.  e.g.
a 16-byte load into XMM1, and then overwriting the high half of that with a
different double before reading it.  That's bad enough, but you'd expect movsd
/ movhpd to manually gather 2 doubles, without introducing the possibility of a
cache-line split load for zero benefit.

Similarly, movupd / movlpd should have just loaded in the other order.  (Or
since they're contiguous, movupd  8(%rsi,%rax), %xmm3 / shufpd.)

So beyond the bad overall strategy (which is likely worse than unrolled
scalar), it might be worth checking for some of this kind of smaller-scale
insanity somewhere later to make it less bad if some other inputs can trigger
similar behaviour.

(This small-scale detecting of movupd / movhpd and using movsd / movhpd could
be a separate bug, but if it's just a symptom of something that should never
happen in the first place then it's not really its own bug at all.)
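
For comparison, a hand-written SSE2 intrinsics version of the sane strategy
(one vector accumulator whose two lanes play the roles of d1 and d2) looks like
this; my own sketch, assuming n is even like the scalar source does:

#include <emmintrin.h>

double dotprod_sse2(const double *a, const double *b, unsigned long long n)
{
    __m128d acc = _mm_setzero_pd();               // lane 0 = d1, lane 1 = d2
    for (unsigned long long i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
    }
    __m128d hi = _mm_unpackhi_pd(acc, acc);
    return _mm_cvtsd_f64(_mm_add_sd(acc, hi));    // d1 + d2
}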

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

--- Comment #1 from Peter Cordes  ---
Forgot to include https://godbolt.org/z/q44r13

[Bug target/97366] New: [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

Bug ID: 97366
   Summary: [8/9/10/11 Regression] Redundant load with SSE/AVX
vector intrinsics
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

When you use the same _mm_load_si128 or _mm256_load_si256 result twice,
sometimes GCC loads it *and* uses it as a memory source operand.

I'm not certain this is specific to x86 back-ends, please check bug tags if it
happens elsewhere.  (But it probably doesn't on 3-operand load/store RISC
machines; it looks like one operation chooses to load and then operate, the
other chooses to use the original source as a memory operand.)

#include <stdint.h>
#include <immintrin.h>
void gcc_double_load_128(int8_t *__restrict out, const int8_t *__restrict
input)
{
for (unsigned i=0 ; i<1024 ; i+=16){
        __m128i in = _mm_load_si128((__m128i*)&input[i]);
        __m128i high = _mm_srli_epi32(in, 4);
        _mm_store_si128((__m128i*)&out[i], _mm_or_si128(in,high));
}
}

gcc 8 and later with -O3 -mavx2 (including 11.0.0 20200920) produce:

gcc_double_load_128(signed char*, signed char const*):
xorl    %eax, %eax
.L6:
vmovdqa (%rsi,%rax), %xmm1 # load
vpsrld  $4, %xmm1, %xmm0
vpor    (%rsi,%rax), %xmm0, %xmm0  # reload as a memory operand
vmovdqa %xmm0, (%rdi,%rax)
addq    $16, %rax
cmpq    $1024, %rax
jne .L6
ret

GCC7.5 and earlier use  vpor %xmm1, %xmm0, %xmm0 to use the copy of the
original that was already loaded.

`-march=haswell` happens to fix this for GCC trunk, for this 128-bit version
but not for a __m256i version.

restrict doesn't make a difference, and there's no overlapping anyway.  The two
redundant loads both happen between any other stores.

Using a memory source operand for vpsrld wasn't an option: the form with a
memory source takes the *count* from  memory, not the data. 
https://www.felixcloutier.com/x86/psllw:pslld:psllq



Note that *without* AVX, the redundant load is a possible win, for code running
on Haswell and later Intel (and AMD) CPUs.  Possibly some heuristic is saving
instructions for the legacy-SSE case (in a way that's probably worse overall)
and hurting the AVX case.

GCC 7.5, -O3  without any -m options
gcc_double_load_128(signed char*, signed char const*):
xorl    %eax, %eax
.L2:
movdqa  (%rsi,%rax), %xmm0
movdqa  %xmm0, %xmm1 # this instruction avoided
psrld   $4, %xmm1
por %xmm1, %xmm0 # with a memory source reload, in GCC8 and
later
movaps  %xmm0, (%rdi,%rax)
addq    $16, %rax
cmpq    $1024, %rax
jne .L2
rep ret


Using a memory-source POR saves 1 front-end uop by avoiding a register-copy, as
long as the indexed addressing mode can stay micro-fused on Intel.  (Requires
Haswell or later for that to happen, or any AMD.)  But in practice it's
probably worse.  Load-port pressure, and space in the out-of-order scheduler,
as well as code-size, is a problem for using an extra memory-source operand in
the SSE version, with the upside being saving 1 uop for the front-end.  (And
thus in the ROB.)  mov-elimination on modern CPUs means the movdqa register
copy costs no back-end resources (ivybridge and bdver1).

I don't know if GCC trunk is using por  (%rsi,%rax), %xmm0  on purpose for that
reason, or if it's just a coincidence.
I don't think it's a good idea on most CPUs, even if alignment is guaranteed.

This is of course 100% a loss with AVX; we have to `vmovdqa/u` load for the
shift, and it can leave the original value in a register so we're not saving a
vmovdqua.  And it's a bigger loss because indexed memory-source operands
unlaminate from 3-operand instructions even on Haswell/Skylake:
https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes/31027695#31027695
so it hurts the front-end as well as wasting cycles on load ports, and taking
up space in the RS (scheduler).

The fact that -mtune=haswell fixes this for 128-bit vectors is interesting, but
it's clearly still a loss in the AVX version for all AVX CPUs.  2 memory ops /
cycle on Zen could become a bottleneck, and it's larger code size.  And
-mtune=haswell *doesn't* fix it for the -mavx2 _m256i version.

There is a possible real advantage in the SSE case, but it's very minor and
outweighed by disadvantages.  Especially for older CPUs like Nehalem that can
only do 1 load / 1 store per clock.  (Although this has so many uops in the
loop that it barely bottlenecks on that.)

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2020-04-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #53 from Peter Cordes  ---
I think we can close this as fixed at some point.  The last activity on this
bug was some patches that sound like they were supposed to fix it, and the
MCVEs from comments that I tested no longer show the problem.

GCC9.3 -O3 -march=core2 -fomit-frame-pointer only uses a `.p2align` to align
the top of the loop, not between leave and ret or between cmp/jcc.

void wait_for_enter()
{
volatile int foo = 0;  // to get a LEAVE instruction emitted at all
int u = getchar();
while (!u)
u = getchar()-13;
}

https://godbolt.org/z/RvxzZv

(Note that Godbolt normally filters .p2align so you have to either compile to
binary or not filter directives in the asm source.  Otherwise you'll never see
NOPs except in the unusual case where GCC actually emits a nop mnemonic.)

[Bug target/93141] Missed optimization : Use of adc when checking overflow

2020-01-03 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93141

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #2 from Peter Cordes  ---
gcc doesn't actually *branch* unless you use an if(), it just uses cmp/sbb to
do a 128-bit compare.  CMP is like a SUB that only sets flags.  The CF result
of SBB is used as an input for ADC.

Testcase on Godbolt: https://godbolt.org/z/64C4R-

GCC also wastes a varying number of MOV instructions beyond the minimum one to
make cmp/sbb work, depending on BMI2 MULX or not, and how the sum is written.

u128 prod = a[i] * (unsigned __int128) b[i];
#if 1
sum += prod;
//if(sum
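
The general shape of an accumulate-and-check-overflow function like this is
roughly the following (my own reconstruction, not the PR's exact test case):

typedef unsigned __int128 u128;

// Returns 1 if adding a*b overflowed the 128-bit accumulator.
int accumulate(u128 *sum, unsigned long long a, unsigned long long b)
{
    u128 prod = a * (u128)b;
    u128 old  = *sum;
    *sum = old + prod;
    return *sum < old;      // carry out of the 128-bit add
}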

[Bug target/40838] gcc shouldn't assume that the stack is aligned

2019-10-31 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #91 from Peter Cordes  ---
This bug should be closed as "resolved fixed".  The "fix" was to change the ABI
doc and break existing hand-written asm, and old binaries.  This was
intentional and resulted in some pain, but at this point it's a done deal.



My attempt at a summary of the current state of affairs for 32-bit x86 calling
conventions (on Linux and elsewhere):

Yes, the version of the i386 System V ABI used on Linux really did change
between gcc2.8 and gcc8.  Those compilers are not ABI-compatible with each
other.  This is a known fact.  Hand-written asm that makes function calls with
misaligned stack pointers is violating the (updated) ABI, and was also
knowingly broken by this change.


(Perhaps unintentionally at first, with stack alignment intended to just
provide a performance benefit, not a correctness issue.  But the resolution
ended up being to standardize on 16-byte alignment matching x86-64 System V.  
Instead of reverting to the old ABI and breaking compat with new binaries that
had started to rely on 16-byte incoming alignment, or to add significant
overhead to every function that didn't know how both its caller and callee were
compiled, i.e. most functions.  Using MOVUPS instead of MOVAPS everywhere
wouldn't work well because it would mean no folding of memory operands into ALU
instructions: without AVX's VEX encoding,  paddd xmm0, [mem] requires aligned
mem.  And existing binaries that rely on incoming 16-byte alignment weren't
doing that.)


An earlier comment also mentioned common arrays: the ABI also requires arrays
larger than 16 bytes to have 16-byte alignment.



Perhaps unnecessary pain for little real benefit: i386 on Linux has been mostly
obsolete for a long time, and the inefficient stack-args calling convention was
never changed.  It's ironic that Linux broke ABI compat for i386 in the name of
more efficient SSE-usage despite not caring to introduce anything like Windows
fastcall or vectorcall (efficient register-args calling conventions).

(GCC does have ABI-changing -mregparm=3 and -msseregparm to pass integers in
regs, and pass/return FP values in XMM registers (instead of passing on the
stack / returning in x87 st0).  But no distros have switched over to using that
calling convention for i386 binaries, AFAIK.  The Linux kernel does use regparm
for 32-bit kernel builds.)

Even more ironic, probably a lot of 32-bit code is compiled without -msse2
(because one of the main reasons for using 32-bit code is CPUs too old for
x86-64, which is about the same vintage as SSE2).  SSE usage can still happen
with runtime dispatching in binaries that are compatible with old machines
while still being able to take advantage of new ones.


But in most cases, if you want performance you use x86-64 kernel + user-space,
or maybe x32 user-space (ILP32 in 64-bit mode) to get modern calling
conventions and the benefit of twice as many registers.  x86-64 System V has
mandated 16-byte stack alignment from the start.  (I don't know the history,
but perhaps i386 code-gen started assuming / depending on it for correctness,
not just performance, by accident because of devs being used to x86-64?)

The 32-bit ABI on some other OSes, including i386 *BSD and 32-bit Windows, has
*not* changed; presumably gcc there doesn't rely on incoming stack alignment. 
(It might try to propagate 16-byte alignment for performance benefits, though.)

My understanding is that i386 MacOS still uses a version of i386 System V that
doesn't include the 16-byte stack alignment update, like other *BSDs.


(In reply to Harald van Dijk from comment #90)
> compile
> 
>   void exit(int);
>   int main(void) { exit(0); }
> 
> with GCC 2.8, compile current glibc with GCC 8, and there will be a segfault
> in glibc's __run_exit_handlers because GCC 2.8 never kept the stack
> 16-byte-aligned, but GCC 8 does now generate code which assumes it.
>
> For the moment, I've rebuilt glibc with -mincoming-stack-boundary=2 to handle 
> the problem well enough for my current needs, but it's not a complete 
> solution.

Yes, you need workarounds like this to change modern GCC's ABI back to legacy
4-byte.

Note that you might break atomicity of C11 _Atomic 8-byte objects even outside
structs by doing this, if they split across a cache line (Intel) or possibly
narrower (AMD) boundary.  But only if they were stack allocated.

[Bug target/89346] Unnecessary EVEX encoding

2019-10-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
Still present in pre10.0.0 trunk 20191022.  We pessimize vmovdqu/a in AVX2
intrinsics and autovectorization with -march=skylake-avx512 (and arch=native on
such machines)

It seems only VMOVDQU/A load/store/register-copy instructions are affected; we
get AVX2 VEX vpxor instead of AVX512VL EVEX vpxord for xor-zeroing, and
non-zeroing XOR.  (And most other instructions have the same mnemonic for VEX
and EVEX, like vpaddd.  This includes FP moves like VMOVUPS/PD)

(https://godbolt.org/z/TEvWiU for example)

The good options are: 

* use VEX whenever possible instead of AVX512VL to save code-size.  (2 or 3
byte prefix instead of 4-byte EVEX)

* Avoid the need for vzeroupper by using only x/y/zmm16..31.  (Still has a
max-turbo penalty so -mprefer-vector-width=256 is still appropriate for code
that doesn't spend a lot of time in vectorized loops.)

 This might be appropriate for very simple functions / blocks that only have a
few SIMD instructions before the next vzeroupper would be needed.  (e.g.
copying or zeroing some memory); could be competitive on code-size as well as
saving the 4-uop instruction.

 VEX instructions can't access x/y/zmm16..31 so this forces an EVEX encoding
for everything involving the vector (and rules out using AVX2 and earlier
instructions, which may be a problem for KNL without AVX512VL unless we narrow
to 128-bit in an XMM reg)



(citation for not needing vzeroupper if y/zmm0..15 aren't written explicitly:
https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
- it's even safe to do

vpxor xmm0,xmm0,xmm0
vpcmpeqb  k0, zmm0, [rdi]

without vzeroupper.  Although that will reduce max turbo *temporarily* because
it's a 512-bit uop.

Or more frequently useful: to zero some memory with vpxor xmm zeroing and YMM
stores.

[Bug target/82459] AVX512BW instruction costs: vpmovwb is 2 uops on Skylake and not always worth using vs. vpack + vpermq lane-crossing fixup

2019-10-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

Peter Cordes  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=89346
Summary|AVX512F instruction costs:  |AVX512BW instruction costs:
   |vmovdqu8 stores may be an   |vpmovwb is 2 uops on
   |extra uop, and vpmovwb is 2 |Skylake and not always
   |uops on Skylake and not |worth using vs. vpack +
   |always worth using  |vpermq lane-crossing fixup

--- Comment #5 from Peter Cordes  ---
Turns out vmovdqu8 with no masking doesn't cost an extra uop.  IACA was wrong,
and Agner Fog's results were *only* for the masked case.  The only downside of
that is the code-size cost of using EVEX load/store instructions instead of
AVX2 VEX. That's bug 89346


https://www.uops.info/table.html confirms that SKX non-masked vmovdqu8 load and
store are both single uop.  (Or the usual micro-fused store-address +
store-data).
 https://www.uops.info/html-tp/SKX/VMOVDQU8_ZMM_M512-Measurements.html
 https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_ZMM-Measurements.html

And between registers it can be eliminated if there's no masking.

But *with* masking, as a load it's a micro-fused load+ALU uop, and as a masked
store it's just a normal store uop for xmm and ymm.  But zmm masked store is 5
uops (micro-fused to 4 front-end uops)! (Unlike vmovdqu16 or 32 masked stores
which are efficient even for zmm).

https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_K_ZMM-Measurements.html

uops.info's table also shows us that IACA3.0 is wrong about vmovdqu8 as an
*unmasked* ZMM store: IACA thinks that's also 5 uops.

Retitling this bug report since that part was based on Intel's bogus data, not
real testing.

vpmovwb is still 2 uops, and current trunk gcc still uses  2x vpmovwb +
vinserti64x4 for ZMM auto-vec.  -mprefer-vector-width=512 is not the default,
but people may enable it in code that heavily uses 512-bit vectors.

YMM auto-vec is unchanged since previous comments: we do get vpackuswb +
vpermq, but an indexed addressing mode defeats micro-fusion.  And we have
redundant VPAND after shifting.

---

For icelake-client/server (AVX512VBMI) GCC is using vpermt2b, but it doesn't
fold the shifts into the 2-source byte shuffle.   (vpermt2b has 5c latency and
2c throughput on ICL, so probably its uop count is the same as uops.info
measured for CannonLake: 1*p05 + 2*p5.  Possibly 2x 1-uop vpermb with
merge-masking for the 2nd into the first would work better.)

IceLake vpmovwb ymm,zmm is still 2-cycle throughput, 4-cycle latency, so
probably still 2 uops.

[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)

2019-10-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244

--- Comment #4 from Peter Cordes  ---
(In reply to Andrew Pinski from comment #3)
> (In reply to Peter Cordes from comment #1)
> > On AArch64 (with gcc8.2), we see a similar effect, more instructions in the
> > loop.  And an indexed addressing mode.

That was an overstatement, the generic tuning I showed isn't using 2 separate
pointers or indices like we get on x86.

Your thunderx2t99 output is like that, but write-back addressing modes mean it
doesn't cost extra instructions.

> I am not shocked that IV-OPTS can chose these widly differences.
> I have not looked at the cost differences to understand why
> -mcpu=thunderx2t99 chose what close might be the best (we could use one less
> IV by replacing the first ldr by using the same IV as the last str).

I don't know ARM tuning; the x86 version is clearly worse with an extra uop
inside the loop.  And an extra instruction to copy the register before the
loop, wasting code-size if nothing else.

On Skylake for example, the loop is 10 uops and bottlenecks on front-end
throughput (4 uops / clock) if the back-end can keep up with a bit less than 1
store per clock.  (Easy if pointers are aligned and data is hot in L1d). 
Reducing it to 9 uops should help in practice.  Getting it down to 8 uops would
be really nice, but we can't do that unless we could use a shuffle that
micro-fuses with a load.  (For int elements, AVX2 VPERMD can micro-fuse a
memory source, so can SSE2 PSHUFD.  pshufb's xmm/memory operand is the control
vector which doesn't help us.  AVX512 vpermb can't micro-fuse)

[Bug target/92246] Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246

--- Comment #1 from Peter Cordes  ---
And BTW, GCC *does* use vpermd (not vpermt2d) for swapt = int or long.  This
problem only applies to char and short.  Possibly because AVX2 includes vpermd
ymm.



Apparently CannonLake has 1 uop vpermb but 2 uop vpermw, according to real
testing on real hardware by https://uops.info/.  Their automated test methods
are generally reliable.

That seems to be true for Ice Lake, too, so when AVX512VBMI is available we
should be using vpermb any time we might have used vpermw with a
compile-time-constant control vector.


(vpermw requires AVX512BW, e.g. SKX and Cascade Lake.  vpermb requires
AVX512VBMI, only Ice Lake and the mostly aborted CannonLake.)

Instlat provides some confirmation:
https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel00706E5_IceLakeY_InstLatX64.txt
 shows vpermb at 3 cycle latency, but vpermw at 4 cycle latency (presumably a
chain of 2 uops, 1c and 3c being the standard latencies that exist in recent
Intel CPUs).  InstLat doesn't document which input the dep chain goes through,
so it's not 100% confirmation of only 1 uop.  But it's likely that ICL has 1
uop vpermb given that CNL definitely does.

uops.info lists latencies separately from each input to the result, sometimes
letting us figure out that e.g. one of the inputs isn't needed until the 2nd
uop.  Seems to be the case for CannonLake vpermw: latency from one of the
inputs is only 3 cycles, the other is 4. 
https://www.uops.info/html-lat/CNL/VPERMW_YMM_YMM_YMM-Measurements.html

[Bug target/92246] New: Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246

Bug ID: 92246
   Summary: Byte or short array reverse loop auto-vectorized with
3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

typedef short swapt;
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
  swapt h = *head, t = *tail;
  *head = t;
  *tail = h;
  }
}

g++ -O3 -march=skylake-avx512
  (Compiler-Explorer-Build) 10.0.0 20191022 (experimental)

https://godbolt.org/z/LS34w9

...
.L4:
vmovdqu16   (%rdx), %ymm1
vmovdqu16   (%rax), %ymm0
vmovdqa64   %ymm1, %ymm3    # useless copy
vpermt2w    %ymm1, %ymm2, %ymm3
vmovdqu16   %ymm3, (%rax)
vpermt2w    %ymm0, %ymm2, %ymm0
addq    $32, %rax
vmovdqu16   %ymm0, (%rcx)
subq    $32, %rdx
subq    $32, %rcx   # two tail pointers, PR 92244 is unrelated to
this
cmpq    %rsi, %rax
jne .L4

vpermt2w ymm is 3 uops on SKX and CannonLake:  2p5 + p015
(https://www.uops.info/table.html)

Obviously better would be  vpermw (%rax), %ymm2, %ymm0.

vpermw apparently can't micro-fuse a load, but it's only 2 ALU uops plus
a load if we use a memory source.  SKX still bottlenecks on 2p5 for vpermw,
losing only the p015 uop, but in general fewer uops is better.

But on CannonLake it runs on p01 + p5 (plus p23 with a memory source).

uops.info doesn't have IceLake-client data yet but vpermw throughput on IceLake
is 1/clock, vs 1 / 2 clocks for vpermt2w, so this could double throughput on
CNL and ICL.

We have exactly the same problem with AVX512VBMI vpermt2b over vpermb with ICL
g++ -O3 -march=icelake-client -mprefer-vector-width=512

[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244

Peter Cordes  changed:

   What|Removed |Added

Summary|extra sub inside vectorized |vectorized loop updating 2
   |loop instead of calculating |copies of the same pointer
   |end-pointer |(for in-place reversal
   ||cross in the middle)

--- Comment #2 from Peter Cordes  ---
Forgot to update title after looking more carefully at the asm.

[Bug tree-optimization/92244] extra sub inside vectorized loop instead of calculating end-pointer

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244

--- Comment #1 from Peter Cordes  ---
On AArch64 (with gcc8.2), we see a similar effect, more instructions in the
loop.  And an indexed addressing mode.

https://godbolt.org/z/6ZVWY_


# strrev_explicit   -O3 -mcpu=cortex-a53
   ...
.L4:
ldr q1, [x4, x2]    # tail
ldr q0, [x3]        # head
tbl v1.16b, {v1.16b}, v2.16b    # byte shuffle
tbl v0.16b, {v0.16b}, v2.16b
str q1, [x3], 16    # post-increment store to head
cmp x3, x1
str q0, [x4, x2]
sub x2, x2, #16   # doesn't update flags, not SUBS
bne .L4 # }while( head != end_head )



# strrev_implicit   -O3 -mcpu=cortex-a53
...
.L19:
ldr q1, [x3]
ldr q0, [x2]
tbl v1.16b, {v1.16b}, v2.16b
tbl v0.16b, {v0.16b}, v2.16b
str q1, [x2], 16   # post-increment addressing mode 
cmp x2, x4
str q0, [x3], -16  # post-decrement addressing mode 
bne .L19   # }while( head != end_head )

[Bug tree-optimization/92244] New: extra sub inside vectorized loop instead of calculating end-pointer

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244

Bug ID: 92244
   Summary: extra sub inside vectorized loop instead of
calculating end-pointer
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

We get a redundant instruction inside the vectorized loop here.  But it's not a
separate *counter*, it's a duplicate of the tail pointer.

It goes away if we find tail with while(*tail++); instead of calculating it
from head+length.

Only happens with vectorization, not pure scalar (bug 92243 is about the fact
that -O3 fails to use bswap as a GP-integer shuffle to auto-vectorize without
x86 SSSE3).

typedef char swapt;
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
  swapt h = *head, t = *tail;
  *head = t;
  *tail = h;
  }
}
https://godbolt.org/z/wdGv4S

compiled with g++ -O3 -march=sandybridge gives us a main loop of

...
movq    %rcx, %rsi # RSI = RCX before entering the loop
addq    %rdi, %r8
.L4:
vmovdqu (%rcx), %xmm3   # tail load from RCX
addq    $16, %rax    # head
subq    $16, %rcx    # tail
subq    $16, %rsi    # 2nd tail?
vmovdqu -16(%rax), %xmm0
vpshufb %xmm2, %xmm3, %xmm1
vmovups %xmm1, -16(%rax)
vpshufb %xmm2, %xmm0, %xmm0
vmovups %xmm0, 16(%rsi) # tail store to RSI
cmpq    %r8, %rax   # } while(head != end_head)
jne .L4

RSI = RCX before and after the loop.  This is obviously pointless.
head uses the same register for loads and stores.

 Then we have bloated fully-unrolled scalar cleanup, instead of using the
shuffle control for 8-byte vectors -> movhps.  Or scalar bswap.  Ideally we'd
do something clever at the overlap like one load + shuffle + store, but we
might have to load the next vector before storing the current to make this work
at the overlap.  That would presumably require more special-casing this kind of
meet-in-the-middle loop.




The implicit-length version doesn't have this extra sub in the main loop.

void strrev_implicit(swapt *head)
{
  swapt *tail = head;
  while(*tail) ++tail;  // find the 0 terminator, like head+strlen
  --tail; // tail points to the last real char
  for( ; head < tail; ++head, --tail) {
  swapt h = *head, t = *tail;
  *head = t;
  *tail = h;
  }
}

.L22:
vmovdqu (%rcx), %xmm3
addq    $16, %rdx   # head
subq    $16, %rcx   # tail
vmovdqu -16(%rdx), %xmm0
vpshufb %xmm2, %xmm3, %xmm1
vmovups %xmm1, -16(%rdx)
vpshufb %xmm2, %xmm0, %xmm0
vmovups %xmm0, 16(%rcx)
cmpq    %rsi, %rdx  # } while(head != end_head)
jne .L22

[Bug tree-optimization/92243] Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243

--- Comment #1 from Peter Cordes  ---
Forgot to mention, this probably applies to other ISAs with GP-integer
byte-reverse instructions and efficient unaligned loads.

[Bug tree-optimization/92243] New: Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243

Bug ID: 92243
   Summary: Missing "auto-vectorization" of char array reversal
using x86 scalar bswap when SIMD pshufb isn't
available
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

We could use integer bswap to speed up an in-place byte-reverse loop by a
factor of probably 8, the same way we use SIMD shuffles.

Consider this loop which reverses an explicit-length char array:
https://godbolt.org/z/ujXq_J

typedef char swapt; // int can auto-vectorize with just SSE2
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
  swapt h = *head, t = *tail;
  *head = t;
  *tail = h;
  }
}

gcc -O3 (including current trunk) targeting x86-64 makes naive scalar
byte-at-a-time code, even though bswap r64 is available to byte-reverse a
uint64 in 1 or 2 uops (AMD and Intel, respectively).

With -mssse3, we do see auto-vectorization using SIMD pshufb (after checking
lengths and calculating how many 16-byte chunks can be done before bloated
fully-unrolled cleanup).  Doing the same thing with 64-bit integer registers
would be very much worth it (for code where a loop like this was a bottleneck).
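
A sketch of what that "scalar vectorization" could look like in source form (my
own illustration of the idea, not proposed GCC output; memcpy stands in for the
unaligned 8-byte loads/stores):

#include <stdint.h>
#include <string.h>

void strrev_bswap(char *head, long len)
{
    char *tail = head + len - 8;          // start of the last 8-byte chunk
    while (tail - head >= 8) {            // chunks don't overlap yet
        uint64_t h, t;
        memcpy(&h, head, 8);
        memcpy(&t, tail, 8);
        h = __builtin_bswap64(h);
        t = __builtin_bswap64(t);
        memcpy(head, &t, 8);
        memcpy(tail, &h, 8);
        head += 8;
        tail -= 8;
    }
    tail += 7;                            // last byte not yet swapped
    for (; head < tail; ++head, --tail) { // scalar cleanup for the <16-byte middle
        char h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}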



With `swapt = short`, vectorizing with SSE2 pshuflw / pshufhw / pshufd is
probably worth it, but GCC chooses not to do that either.  Or working in 8-byte
chunks just using movq + pshuflw, so we only have 1 shuffle per 8-byte
load/store instead of 3 per 16-byte store.  That's a good balance for modern
Intel (Haswell, Skylake, and I think IceLake), although some AMD and earlier
Intel with more integer shuffle throughput (e.g. Sandybridge) might do better
with 3x shuffles per 16-byte load/store.

[Bug target/82887] ICE: in extract_insn, at recog.c:2287 (unrecognizable insn) with _mm512_extracti64x4_epi64

2019-10-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887

--- Comment #5 from Peter Cordes  ---
Reported bug 92080 for the missed CSE

[Bug tree-optimization/92080] New: Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2019-10-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

Bug ID: 92080
   Summary: Missed CSE of _mm512_set1_epi8(c) with
_mm256_set1_epi8(c)
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

As a workaround for PR 82887 some code (e.g. a memset) uses

__m512i zmm = _mm512_set1_epi8((char)c);
__m256i ymm = _mm256_set1_epi8((char)c);

instead of 

  ymm = _mm512_castsi512_si256(zmm);

(found in the persistent-memory library
https://github.com/pmem/pmdk/blob/a6031710f7c102c6b8b6b19dc9708a3b7d43e87b/src/libpmem/x86_64/memset/memset_nt_avx512f.h#L193
)

Obviously we'd like to CSE that instead of actually broadcasting twice.  MVCE:

#include 

__m512i sinkz;
__m256i sinky;
void foo(char c) {
sinkz = _mm512_set1_epi8(c);
sinky = _mm256_set1_epi8(c);
}

https://godbolt.org/z/CeXhi8  g++ (Compiler-Explorer-Build) 10.0.0 20191012

# g++ -O3 -march=skylake-avx512  (AVX512BW + AVX512VL are the relevant ones)
foo(char):
vpbroadcastb %edi, %zmm0
vmovdqa64   %zmm0, sinkz(%rip)
vpbroadcastb %edi, %ymm0  # wasted insn
vmovdqa64   %ymm0, sinky(%rip)   # wasted EVEX prefix
vzeroupper
ret

Without AVX512VL it wastes even more instructions (vmovd + AVX2 vpbroadcastb
xmm,ymm), even though AVX512BW vpbroadcastb zmm does set the YMM register. 
(There are no CPUs with AVX512BW but not AVX512VL; if people compile that way
it's their own fault.  But this might be relevant for set1_epi32() on KNL).

Clang finds this optimization, and uses a shorter vmovdqa for the YMM store
saving another 2 bytes of code size:

vpbroadcastb %edi, %zmm0
vmovdqa64   %zmm0, sinkz(%rip)
vmovdqa %ymm0, sinky(%rip)
vzeroupper
ret

[Bug target/82887] ICE: in extract_insn, at recog.c:2287 (unrecognizable insn) with _mm512_extracti64x4_epi64

2019-10-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #4 from Peter Cordes  ---
Since some code is apparently still avoiding this because of old broken GCC
(e.g.
https://github.com/pmem/pmdk/blob/a6031710f7c102c6b8b6b19dc9708a3b7d43e87b/src/libpmem/x86_64/memset/memset_nt_avx512f.h#L193
)

Perhaps a workaround of  _mm512_castsi512_si256 would be useful?  Or does that
ICE as well?  I can't repro the bug on Godbolt so IDK.

Doing _mm512_set1_epi8(c) and a separate _mm256_set1_epi8(c) doesn't CSE with
GCC, only clang.  https://godbolt.org/z/uZ4lv-   And if you leave out 
-march=skylake-avx512 you get even worse asm from GCC.

[Bug middle-end/91515] missed optimization: no tailcall for types of class MEMORY

2019-08-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91515

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
The real missed optimization is that GCC is returning its own incoming arg
instead of returning the copy of it that create() will return in RAX.

This is what blocks tailcall optimization; it doesn't "trust" the callee to
return what it's passing as RDI.

See https://stackoverflow.com/a/57597039/224132 for my analysis (the OP asked
the same thing on SO before reporting this, but forgot to link it in the bug
report.)

The RAX return value tends to rarely be used, but probably it should be; it's
less likely to have just been reloaded recently.

RAX is more likely to be ready sooner than R12 for out-of-order exec.  Either
reloaded earlier (still in the callee somewhere if it's complex and/or
non-leaf) or never spilled/reloaded.

So we're not even gaining a benefit from saving/restoring R12 to hold our
incoming RDI.  Thus it's not worth the extra cost (in code-size and
instructions executed), IMO.  Trust the callee to return the pointer in RAX.
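
For concreteness, a sketch of the kind of wrapper this affects (my reconstruction, not the
exact testcase from this PR):

typedef struct { long a, b, c, d; } Big;  /* class MEMORY: returned via hidden pointer in RDI */
Big create(void);

Big wrap(void) {
    return create();  /* could be just "jmp create": forward our incoming RDI and trust
                         the callee to return that same pointer in RAX */
}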

[Bug c/91398] Possible missed optimization: Can a pointer be passed as hidden pointer in x86-64 System V ABI

2019-08-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91398

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #4 from Peter Cordes  ---
The bug reporter neglected to link previous discussion about this in the initial bug report.

https://stackoverflow.com/a/57377890/224132 points out that the SysV ABI
wording is 

> If the type has class MEMORY, then **the caller provides space** for the 
> return value and passes the address of this storage in  %rdi

We can argue semantics, but in my answer on the same question, I argued that
the implication is that that space won't alias any other space.  (Because the
return-value object exists in the C abstract machine, so the default assumption
should be that it exists for real in the calling convention.)



Whether it's practical to look for this optimization or not, I'm still curious
about the point that @M.M made about the semantics of  restrict  

https://stackoverflow.com/questions/57377314/what-prevents-the-usage-of-a-function-argument-as-hidden-pointer/57436765#comment101288442_57403379

Does the callee do_something() reading a global count as happening inside the
block scope of use(Vec3 *restrict out) { ... }?  The ISO C standard wording
talks about reaching the end of a block, which hasn't happened even though
`out` is not in scope inside the other function.

If so, then calling use() creates UB when *out = do_something();
executes because it writes the pointed-to memory via a restrict-pointer in the
same block where it reads it from a pointer that's not derived from out.

If so, restrict would make this optimization safe if we can prove that
do_something is "noexcept" and doesn't longjmp.
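
A sketch of the shape of the question (names borrowed from the SO thread, not an exact
testcase):

typedef struct { double x, y, z; } Vec3;
extern Vec3 some_global;          /* something do_something() might read */
Vec3 do_something(void);

void use(Vec3 *restrict out) {
    *out = do_something();  /* if the callee's read of some_global "happens inside" this block
                               and out == &some_global, restrict already makes this UB, which
                               would justify passing `out` straight through as the hidden
                               return pointer */
}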

[Bug tree-optimization/91026] switch expansion produces a jump table with trivial entries

2019-07-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #3 from Peter Cordes  ---
(In reply to Martin Liška from comment #2)
> Switch conversion bails out because it knowns that a jump table (or a bit
> test can) be used for this snippet. Then we prefer to use a jump table then
> a bit test. With -fno-jump-tables we generate the same code.
> That said, I confirm it's a small limitation.

This regression appeared in GCC9 for this test-case, and is present in GCC9.1
on Godbolt: https://godbolt.org/z/fDjTxN

bool is_vowel(char c) {
    switch (c) {
    case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
        return 1;
    default:
        return 0;
    }
}


But simplifying it

 case 'a': case 'e': case 'i':

to those 3 cases gets gcc9 and trunk to use an immediate bitmap.

With gcc8 and earlier, the x86-64 asm for the 2 versions is identical except
for the immediate used with TEST EAX, imm32.



(And BTW, there's a missed optimization here of using  mask & (1<<n)  vs.
(mask>>n) & 1.  Or better, looking for that conversion in user source code /
logic, because people often write tests that way, requiring the creation of an
actual 1 in a register.
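
To make that concrete, a sketch of the two source-level forms (hypothetical helper
functions; 0x1104111 is the 'aeiouy' bitmap relative to 'a', and idx is assumed already
range-checked to be below 32):

/* What people often write: needs an actual 1 in a register, then a variable-count shift. */
int vowel_shift_left(unsigned idx)  { return (0x1104111u & (1u << idx)) != 0; }

/* The form that avoids materializing the 1: shift the mask and test bit 0.
   On x86 the shifted-out bit could instead be consumed via CF, or via BT reg,reg. */
int vowel_shift_right(unsigned idx) { return (0x1104111u >> idx) & 1; }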

Or for ISAs with flags, have the mask already right-shifted by 1 so the bit
shifted out is the one we want.  Then CF = result with no extra test.

Also an x86 missed optimization: BT reg,reg is very efficient (single uop) on
Intel and Ryzen, and avoids needing a 3-uop-on-Intel shift-by-CL or a mov reg,1

I'll report these ideas separately if/when I get around to it.

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-08 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #4 from Peter Cordes  ---
We should not put any stock in what ICC does for GNU C native vector indexing. 
I think it doesn't know how to optimize that because it *always* spills/reloads
even for `vec[0]` which could be a no-op.  And it's always a full-width spill
(ZMM), not just the low XMM/YMM part that contains the desired element.  I
mainly mentioned ICC in my initial post to suggest the store/reload strategy in
general as an *option*.

ICC also doesn't optimize intrinsics: it pretty much always faithfully
transliterates them to asm.  e.g. v = _mm_add_epi32(v, _mm_set1_epi32(1)); 
twice compiles to two separate paddd instructions, instead of one with a
constant of set1(2).

If we want to see ICC's strided-store strategy, we'd need to write some pure C
that auto-vectorizes.



That said, store/reload is certainly a valid option when we want all the
elements, and gets *more* attractive with wider vectors, where the one extra
store amortizes over more elements.

Strided stores will typically bottleneck on cache/memory bandwidth unless the
destination lines are already hot in L1d.  But if there's other work in the
loop, we care about OoO exec of that work with the stores, so uop throughput
could be a factor.


If we're tuning for Intel Haswell/Skylake with 1 per clock shuffles but 2 loads
+ 1 store per clock throughput (if we avoid indexed addressing modes for
stores), then it's very attractive and unlikely to be a bottleneck.

There's typically spare load execution-unit cycles in a loop that's also doing
stores + other work.  You need every other uop to be (or include) a load to
bottleneck on that at 4 uops per clock, unless you have indexed stores (which
can't run on the simple store-AGU on port 7 and need to run on port 2/3, taking
a cycle from a load).   Cache-split loads do get replayed to grab the 2nd half,
so it costs extra execution-unit pressure as well as extra cache-read cycles.

Intel says Ice Lake will have 2 load + 2 store pipes, and a 2nd shuffle unit.  A
mixed strategy there might be interesting: extract the high 256 bits to memory
with vextractf32x8 and reload it, but shuffle the low 128/256 bits.  That
strategy might be good on earlier CPUs, too.  At least with movss + extractps
stores from the low XMM where we can do that directly.

AMD before Ryzen 2 has only 2 AGUs, so only 2 memory ops per clock, up to one
of which can be a store.  It's definitely worth considering extracting the high
128-bit half of a YMM and using movss then shuffles like vextractps: 2 uops on
Ryzen or AMD.


-

If the stride is small enough (so more than 1 element fits in a vector), we
should consider  shuffle + vmaskmovps  masked stores, or with AVX512 then
AVX512 masked stores.

But for larger strides, AVX512 scatter may get better in the future.  It's
currently (SKX) 43 uops for VSCATTERDPS or ...DD ZMM, so not very friendly to
surrounding code.  It sustains one per 17 clock throughput, slightly worse than
1 element stored per clock cycle.  Same throughput on KNL, but only 4 uops so
it can overlap much better with surrounding code.




For qword elements, we have efficient stores of the high or low half of an XMM.
 A MOVHPS store doesn't need a shuffle uop on most Intel CPUs.  So we only need
1 (YMM) or 3 (ZMM) shuffles to get each of the high 128-bit lanes down to an
XMM register.

Unfortunately on Ryzen, MOVHPS [mem], xmm costs a shuffle+store.  But Ryzen has
shuffle EUs on multiple ports.
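
In intrinsics terms, the qword case is just a pair of half-width stores (a minimal SSE2
sketch; MOVLPD/MOVHPD stores need no separate shuffle uop on most Intel CPUs):

#include <emmintrin.h>

/* Store the two qword/double elements of v to two unrelated (strided) addresses. */
void store_halves(double *lo_dst, double *hi_dst, __m128d v) {
    _mm_storel_pd(lo_dst, v);   /* movlpd store of element 0 */
    _mm_storeh_pd(hi_dst, v);   /* movhpd store of element 1 */
}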

[Bug target/91103] New: AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-06 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

Bug ID: 91103
   Summary: AVX512 vector element extract uses more than 1 shuffle
instruction; VALIGND can grab any element
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

GCC9.1 and current trunk aren't good at extracting high elements, whether it's
with GNU C native vector syntax, or when auto-vectorizing something that ends
with the result in the high element.

Using VALIGND we can get any element with one immediate instruction, but its
better to use AVX2 VPERMPD(immediate) when possible.  Or inside loops,
VPERMPS(vector), or VPERMT2PS(vector).  Or of course vextractf32x4 if possible
(element at the bottom of a 128-bit lane).

Or with only AVX2 available, VPERMPD(immediate) for high elements in __m256 and
__m256d vectors is still a big win.

#include <immintrin.h>
float elem12(__m512 v) {  return v[12]; }
float elem15(__m512 v) {  return v[15]; }

gcc -Ofast -march=skylake-avx512
https://godbolt.org/z/241r8p

elem15:
        vextractf32x8   ymm0, zmm0, 0x1
        vextractf128    xmm0, ymm0, 0x1     # elem12 ends here, after these 2 insns
        vshufps         xmm0, xmm0, xmm0, 255
        # no vzeroupper, I guess because the caller must have __m512 vars too (recent optimization)
        ret

But AVX512F has vextractf32x4 to extract a 128-bit lane, which would preclude
the need for AVX2 vextractf128.  That's what clang does.

Obviously inside a loop it would be *much* better to use a single lane-crossing
VPERMPS to also avoid the shufps.  Intel Skylake easily bottlenecks on shuffle
throughput.  We'd need a 15 in an XMM register as a control vector, but loading
it would be off the latency critical path.  (If we needed the scalar
zero-extended instead of garbage in high elements, we could VPERMI2PS or
VPERMT2PS with a zeroed vector and a shuffle-control.)

---

If the element we want is an even element in the low 256 bits, we can get it
with a VPERMPD-immediate.  GCC does this:

elem6(float __vector(16)):      # GCC 10 trunk
        vextractf128    xmm0, ymm0, 0x1
        vunpckhps       xmm0, xmm0, xmm0
        ret

Instead it should be AVX2   vpermpd ymm0, ymm0, 3
This bug also applies to __m256, not just __m512

https://www.felixcloutier.com/x86/vpermpd
VPERMPD is a 64-bit granularity lane-crossing shuffle.  The AVX512F immediate
version reuses the immediate for another 256-bit wide shuffle in the upper
half; only the vector-control version can bring an element from the top half of
a ZMM down to the bottom.  But if we're going to use a vector control, we might
as well use VPERMPS.

For the integer version of this bug, use VPERMQ

--

But we can do even better by using an integer VALIGND (AVX512F) shuffle on FP
data.  There unfortunately isn't an FP flavour of VALIGND, just integer.

AFAIK, Skylake-AVX512 still has no bypass-delay penalty for integer shuffles
between FP math instructions, i.e. the shuffle unit is connected to both FP and
integer forwarding networks.  Intel's optimization manual for Skylake (client)
has a bypass-latency table that shows 0 extra latency cycles for SHUF/5/1,3
reading from anything, or anything reading from it.

https://www.felixcloutier.com/x86/valignd:valignq  It's a 4 or 8-byte
granularity version of palignr, except that it's lane-crossing so the 256 and
512-bit versions are actually useful.  The immediate shift count can thus bring
*any* element down to the bottom.  (Using the same input twice makes it a
rotate).

VALIGND is good on Knight's Landing, too: unlike most 2-input shuffles, it has
1 per clock throughput.

For *any* compile-time-constant index, we can always compile v[i] to this:

extract15:
        valignd    zmm0, zmm0, zmm0, 15   # I think this is right.
   ret

The only downside I'm aware of is that some future AVX512 CPU might not run
VALIGND as efficiently as SKX and KNL.
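
In intrinsics, the same idea looks something like this (a sketch assuming AVX512F;
_mm512_alignr_epi32 maps to VALIGND):

#include <immintrin.h>

float extract15_valignd(__m512 v) {
    __m512i vi  = _mm512_castps_si512(v);           /* reinterpret, no instruction */
    __m512i rot = _mm512_alignr_epi32(vi, vi, 15);  /* valignd: rotate element 15 down to lane 0 */
    return _mm512_cvtss_f32(_mm512_castsi512_ps(rot));
}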




For vector elements narrower than 32 bits, we may need 2 shuffles even if we
consider using a shuffle-control vector.  On Skylake-AVX512,  AVX512BW  vpermw 
will get the job done, but costs 2 shuffle uops.  On CannonLake (and presumably
other future Intel), it and  AVX512VBMI vpermb are only 1 uop, so it's
definitely worth creating a shuffle-control vector if it can be reused.


Also worth considering instead of 2 shuffles: *unaligned* spill / reload like
ICC does for GNU C native vector indexing.  Store-forwarding latency is only 6
or 7 cycles I think, and it avoids any port 5 pressure.  Not generally a good
choice IMO when we can get the job done in one shuffle, but worth considering
if we need multiple elements.  If the function doesn't need the stack aligned,
an unaligned spill is generally

[Bug target/90582] New: AArch64 stack-protector wastes an instruction on address-generation

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90582

Bug ID: 90582
   Summary: AArch64 stack-protector wastes an instruction on
address-generation
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

void protect_me() {
volatile int buf[2];
buf[1] = 3;
}

https://godbolt.org/z/xdlr5w AArch64 gcc8.2 -O3 -fstack-protector-strong

protect_me:
        stp     x29, x30, [sp, -32]!
        adrp    x0, __stack_chk_guard
        add     x0, x0, :lo12:__stack_chk_guard   ### this instruction
        mov     x29, sp             # frame pointer even though -fomit-frame-pointer is part
                                    # of -O3.  Goes away with explicit -fomit-frame-pointer

        ldr     x1, [x0]            # copy the cookie
        str     x1, [sp, 24]
        mov     x1, 0               # and destroy the reg

        mov     w1, 3               # right before it's already destroyed
        str     w1, [sp, 20]        # buf[1] = 3

        ldr     x1, [sp, 24]        # canary
        ldr     x0, [x0]            # key; destroys the key pointer
        eor     x0, x1, x0
        cbnz    x0, .L5
        ldp     x29, x30, [sp], 32  # FP and LR save/restore (for some reason?)
        ret
.L5:
        # can the store of the link register go here, for backtracing?
        bl      __stack_chk_fail

A function that returns a global can embed the low 12 bits of the address into
the load instruction.  AArch64 instructions are fixed-width, so there's no
reason (AFAIK) not to do this.

f:
        adrp    x0, foo
        ldr     w0, [x0, #:lo12:foo]
        ret

I'm not an AArch64 performance expert; it's plausible that zero displacements
are worth spending an extra instruction on for addresses that are used twice,
but unlikely.

So we should be doing 

        adrp    x0, __stack_chk_guard
        ldr     x1, [x0, #:lo12:__stack_chk_guard]  # in prologue to copy cookie
        ...
        ldr     x0, [x0, #:lo12:__stack_chk_guard]  # in epilogue to check cookie

This also avoids leaving an exact pointer right to __stack_chk_guard in a
register, in case a vulnerable callee or code in the function body can be
tricked into dereferencing it and leaking the cookie.  (In non-leaf functions,
we generate the pointer in a call-preserved register like x19, so yes it will
be floating around in a register for callees).

I'd hate to suggest destroying the pointer when copying to the stack, because
that would require another adrp later.

Finding a gadget that has exactly the right offset (the low 12 bits of
__stack_chk_guard's address) is a lot less likely than finding an  ldr from
[x0].  Of course this will introduce a lot of LDR instructions with an
#:lo12:__stack_chk_guard offset, but hopefully they won't be part of useful
gadgets because they lead to writing the stack, or to EOR/CBNZ to
__stack_chk_fail



I don't see a way to optimize canary^key == 0 any further, unlike x86-64 PR
90568.  I assume EOR / CBNZ is as at least as efficient as SUBS / BNE on
all/most AArch64 microarchitectures, but someone should check.



-O3 includes -fomit-frame-pointer according to -fverbose-asm, but functions
protected with -fstack-protector-strong still get a frame pointer in x29
(costing a MOV x29, sp instruction, and save/restore with STP/LDP along with
x30.)

However, explicitly using -fomit-frame-pointer stops that from happening.  Is
that a separate bug, or am I missing something?



Without stack-protector, the function is vastly simpler

protect_me:
sub sp, sp, #16
mov w0, 3
str w0, [sp, 12]
add sp, sp, 16
ret

Does stack-protector really need to spill/reload x29/x30 (FP and LR)?  Bouncing
the return address through memory seems inefficient, even though branch
prediction does hide that latency.

Is that just so __stack_chk_fail can backtrace?  Can we move the store of the
link register into the __stack_chk_fail branch, off the fast path?

Or if we do unconditionally store x30 (the link register), at least don't
bother reloading it in a leaf function if register allocation didn't need to
clobber it.  Unlike x86-64, the return address can't be attacked with buffer
overflows if it stays safe in a register the whole function.

Obviously my test-case with a volatile array and no inputs at all is making
-fstack-protector-strong look dumb by protecting a perfectly safe function. 
IDK how common it is to have leaf functions with arrays or structs that just
use them for some computation on function args or globals and then return,
maybe after copying the array back to somewhere else.  A sort function might
use a tmp

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

--- Comment #5 from Peter Cordes  ---
And BTW, this only helps if the SUB and JNE are consecutive, which GCC
(correctly) doesn't currently optimize for with XOR.

If this sub/jne is different from a normal sub/branch and won't already get
optimized for macro-fusion, we may get even more benefit from this change by
teaching gcc to keep them adjacent.

GCC currently sometimes splits up the instructions like this:

xorq%fs:40, %rdx
movl%ebx, %eax
jne .L7

from gcc8.3 (but not 9.1 or trunk in this case) on https://godbolt.org/z/nNjQ8u


#include 
unsigned int get_random_seed() {
std::random_device rd;
return rd();
}

Even with -O3 -march=skylake.
That's not wrong because XOR can't macro-fuse, but the point of switching to
SUB is that it *can* macro-fuse into a single sub-and-branch uop on
Sandybridge-family.  So we might need to teach gcc about that.

So when you change this, please make it aware of optimizing for macro-fusion by
keeping the sub and jne back to back.  Preferably with tune=generic (because
Sandybridge-family is fairly widespread and it doesn't hurt on other CPUs), but
definitely with -mtune=intel or -mtune=sandybridge or later.

Nehalem and earlier can only macro-fuse test/cmp

The potential downside of putting it adjacent instead of 1 or 2 insns earlier
for uarches that can't macro-fuse SUB/JNE should be about zero on average. 
These branches should predict very well, and there are no in-order x86 CPUs
still being sold.  So it's mostly just going to be variations in fetch/decode
that help sometimes, hurt sometimes, like any code alignment change.

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

--- Comment #3 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #2)
> The xor there is intentional, for security reasons we do not want the stack
> canary to stay in the register afterwards, because then it could be later
> spilled or accessible to some exploit in another way.

Ok, so we can't use CMP, therefore we should use SUB, which as I showed does
help on Sandybridge-family vs. XOR.

x - x = 0   just like 
x ^ x = 0

Otherwise SUB wouldn't set ZF.

SUB is not worse than XOR on any other CPUs; there are no CPUs with better XOR
throughput than ADD/SUB.

In the canary mismatch case, leaving  attacker_value - key  in a register seems
no worse than leaving attacker_value ^ key in a register.  Either value
trivially reveals the canary value to an attacker that knows what they
overwrote the stack with, if it does somehow leak.  We jump to __stack_chk_fail
in that case, not relying on the return value on the stack, so a ROP attack
wouldn't be sufficient to leak that value anywhere.

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

--- Comment #1 from Peter Cordes  ---
https://godbolt.org/z/hHCVTc

Forgot to mention, stack-protector also disables use of the red-zone for no
apparent reason, so that's another missed optimization.  (Perhaps rarely
relevant; probably most functions that get stack protection are big enough that
they need more stack, or non-leaf.  I sidestepped that with volatile.)

[Bug target/90568] New: stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

Bug ID: 90568
   Summary: stack protector should use cmp or sub, not xor, to
allow macro-fusion on x86
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

cmp/jne is always at least as efficient as xor/jne, and more efficient on CPUs
that support macro-fusion of compare and branch.  Most support cmp/jne fusion
(including all mainstream Intel and AMD, not low-power), but none support
xor/jne fusion.

void foo() {
volatile int buf[4];
buf[1] = 2;
}

gcc trunk on Godbolt, but same code-gen all the way back to gcc4.9

foo:
        subq    $40, %rsp
        movq    %fs:40, %rax
        movq    %rax, 24(%rsp)
        xorl    %eax, %eax
        movl    $2, 4(%rsp)
        movq    24(%rsp), %rax
        xorq    %fs:40, %rax    ## This insn should be CMP
        jne     .L5
        addq    $40, %rsp
        ret
.L5:
        call    __stack_chk_fail

As far as I can tell, the actual XOR result value in RAX is not an input to
__stack_chk_fail because gcc sometimes uses a different register.

Therefore we don't need it, and can use any other way to check for equality.

If we need to avoid "leaking" the canary value in a register, we can use SUB,
otherwise CMP is even better and can macro-fuse on more CPUs.

Only Sandybridge-family can fuse SUB/JCC.  (And yes, it can fuse even with a
memory-source and a segment override prefix.  SUB %fs:40(%rsp), %rax / JNE  is
a single uop on Skylake; I checked this with perf counters in an asm loop.)

AMD can fuse any TEST or CMP/JCC, but only those instructions (so SUB is as bad
as XOR for AMD).  See Agner Fog's microarch PDF.



Linux test program (NASM) that runs  sub (mem), %reg with an FS prefix to prove
that it does macro-fuse and stays micro-fused as a single uop:


default rel
%use smartalign
alignmode p6, 64

global _start
_start:

cookie equ 12345
mov  eax, 158   ; __NR_arch_prctl
mov  edi, 0x1002; ARCH_SET_FS
lea  rsi, [buf]
syscall
   ;  wrfsbase   rsi; not enabled by the kernel
mov  qword [fs: 0x28], cookie

mov ebp, 10

align 64
.loop:
mov   eax, cookie
sub   rax, [fs: 0x28]
jne   _start
and   ecx, edx

dec ebp
jnz .loop
.end:

xor edi,edi
mov eax,231   ; __NR_exit_group
syscall   ; sys_exit_group(0)


section .bss
align 4096
buf:    resb 4096



nasm -felf64  branch-fuse-mem.asm &&
ld -o branch-fuse-mem  branch-fuse-mem.o
to make a static executable

taskset -c 3 perf stat
-etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u
-r2 ./branch-fuse-mem

On my i7-6700k

 Performance counter stats for './branch-fuse-mem' (2 runs):

240.78 msec task-clock:u  #0.999 CPUs utilized 
  ( +-  0.23% )
 2  context-switches  #0.010 K/sec 
  ( +- 20.00% )
 0  cpu-migrations#0.000 K/sec  
 3  page-faults   #0.012 K/sec  
 1,000,764,258  cycles:u  #4.156 GHz   
  ( +-  0.00% )
 2,000,000,076  branches:u# 8306.384 M/sec 
  ( +-  0.00% )
 6,000,000,088  instructions:u#6.00  insn per cycle
  ( +-  0.00% )
 4,000,109,615  uops_issued.any:u # 16613.222 M/sec
  ( +-  0.00% )
 5,000,098,334  uops_executed.thread:u# 20766.367 M/sec
  ( +-  0.00% )

  0.240935 +- 0.000546 seconds time elapsed  ( +-  0.23% )

Note 1.0 billion cycles (1 per iteration), and 4B fused-domain uops_issued.any,
i.e. 4 uops per loop iteration.

(5 uops *executed* is because one of those front-end uops has a load
micro-fused).

Changing SUB to CMP has no effect.

With SUB changed to XOR, the loop takes 1.25 cycles per iteration, and the
front-end issues 5 uops per iteration.  Other counters are the same.

Skylake's pipeline is 4-wide, like all Intel since Core2, so an extra uop for
the front-end creates a bottleneck.

--

On Intel pre Haswell, the decoders will only make at most 1 fusion per decode
group, so you may need to make the loop larger to still get fusion.  Or use
this as the loop-branch, e.g. with a  1  in memory

   sub  rax, [fs: 0x28]
   jnz  .loop

or with a 0 in memory, sub or cmp or xor will all set flags according to the
register being non-zero.  But sub or xor will introduce an extra cycle of
latency on the critical path for the loop counter.

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-04-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #4 from Peter Cordes  ---
Yes, rep scasb is abysmal, and gcc -O3's 4-byte-at-a-time scalar loop is not
very good either.

With 16-byte alignment, (which we have from calloc on x86-64 System V), we can
inline a *much* better SSE2 loop.  See
https://stackoverflow.com/a/55589634/224132 for more details and
microbenchmarks; 

On Skylake it's about 4 to 5x faster than the current 4-byte loop for large
strings, 3x faster for short strings.  For short strings (strlen=33), it's
about 1.5x faster than calling strlen.  For very large strings (too big for L2
cache), it's ~1.7x slower than glibc's AVX2 strlen.

The lack of VEX encoding for pxor and pmovmskb is just me being lazy; let gcc
emit them all with VEX if AVX is enabled.

   # at this point gcc has `s` in RDX, `i` in ECX

    pxor       %xmm0, %xmm0        # zeroed vector to compare against
    .p2align 4
.Lstrlen16:                        # do {
#ifdef __AVX__
    vpcmpeqb   (%rdx), %xmm0, %xmm1
#else
    movdqa     (%rdx), %xmm1
    pcmpeqb    %xmm0, %xmm1        # xmm1 = -1 where there was a 0 in memory
#endif

    add        $16, %rdx           # ptr++
    pmovmskb   %xmm1, %eax         # extract high bit of each byte to a 16-bit mask
    test       %eax, %eax
    jz         .Lstrlen16          # }while(mask==0);
    # RDX points at the 16-byte chunk *after* the one containing the terminator
    # EAX = bit-mask of the 0 bytes, and is known to be non-zero
    bsf        %eax, %eax          # EAX = bit-index of the lowest set bit

    # terminator is at rdx+rax - 16
    #  movb   $'A', -16(%rdx, %rax)   // for a microbench that used s[strlen(s)]='A'
    sub        %rbp, %rdx              # p -= start
    lea        -16(%rdx, %rax), %rax   # p += byte_within_vector - 16

We should actually use  REP BSF  because that's faster on AMD (tzcnt), and same
speed on Intel.
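
Roughly the same loop in intrinsics, for reference (a sketch assuming s is 16-byte
aligned, as in the calloc case; __builtin_ctz stands in for rep bsf / tzcnt):

#include <emmintrin.h>
#include <stddef.h>

static size_t strlen_sse2_aligned(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;
    unsigned mask;
    do {
        __m128i v = _mm_load_si128((const __m128i *)p);  /* aligned load can't cross a page */
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        p += 16;
    } while (mask == 0);
    /* p points past the chunk containing the terminator; mask is known non-zero */
    return (size_t)(p - s) - 16 + (size_t)__builtin_ctz(mask);
}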


Also an inline-asm implementation of it with a microbenchmark adapted from the
SO question.  (Compile with -DUSE_ASM -DREAD_ONLY to benchmark a fixed length
repeatedly)
https://godbolt.org/z/9tuVE5

It uses clock() for timing, which I didn't bother updating.  I made it possible
to run it for lots of iterations for consistent timing.  (And so the real work
portion dominates the runtime so we can use perf stat to measure it.)




If we only have 4-byte alignment, maybe check the first 4B, then do (p+4) & ~7
to either overlap that 4B again or not when we start 8B chunks.  But probably
it's good to get to 16-byte alignment and do whole SSE2 vectors, because
repeating an aligned 16-byte test that overlaps an 8-byte test costs the same
as doing another 8-byte test.  (Except on CPUs like Bobcat that split 128-bit
vectors into 64-bit halves).  The extra AND to round down to an alignment
boundary is all it takes, plus the code-size cost of peeling 1 iteration each
of 4B and 8B before a 16-byte loop.

We can use 4B / 8B with movd / movq instead of movdqa.  For pmovmskb, we can
ignore the compare-true results for the upper 8 bytes by testing the result
with `test %al,%al`, or in general with `test $0x0F, %al` to check only the low
4 bits of EAX for the 4-byte case.



The scalar bithack version can use BSF instead of CMOV binary search for the
byte with a set high bit.  That should be a win if we ever wanted to do scalar
on some x86 target especially with 8-byte registers, or on AArch64.  AArch64
can rbit / clz to emulate bsf and find the position of the first set bit.

(Without efficient SIMD compare result -> integer_mask, or efficient SIMD ->
integer at all on some ARM / AArch64 chips, SIMD compares for search loops
aren't always (ever?) a win.  IIRC, glibc strlen and memchr don't use vectors
on ARM / AArch64, just scalar bithacks.)
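
For reference, the scalar-bithack flavour mentioned above looks something like this (a
sketch of the usual has-zero SWAR trick, 8 bytes at a time, assuming little-endian and an
8-byte-aligned pointer; the ctz replaces a CMOV binary search, and AArch64 would use
rbit + clz):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

static size_t strlen_swar(const char *s) {
    const char *p = s;
    uint64_t v, has_zero;
    do {
        memcpy(&v, p, 8);   /* aligned 8-byte chunk: can't fault past the terminator */
        has_zero = (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
        p += 8;
    } while (has_zero == 0);
    /* the lowest set 0x80 bit marks the first zero byte (little-endian) */
    return (size_t)(p - s) - 8 + (size_t)(__builtin_ctzll(has_zero) >> 3);
}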

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #22 from Peter Cordes  ---
Nice, that's exactly the kind of thing I suggested in bug 80571.  If this
covers 

* vsqrtss/sd  (mem), %merge_into, %xmm
* vpcmpeqd    %same, %same, %dest    # false dep on KNL / Silvermont
* vcmptrueps  %same, %same, %ymm     # splat -1 without AVX2.  false dep on all known uarches

as well as int->FP conversions, then we could probably close that as fixed by
this as well.

bug 80571 does suggest that we could look for any cold reg, like a non-zero
constant, instead of requiring an xor-zeroed vector, so it might go slightly
beyond what this patch does.

And looking for known-to-be-ready dead regs from earlier in the same dep chain
could certainly be useful for non-AVX code-gen, allowing us to copy-and-sqrt
without introducing a dependency on anything that's not already ready.

(In reply to h...@gcc.gnu.org from comment #21)
> Author: hjl
> Date: Fri Feb 22 15:54:08 2019
> New Revision: 269119

[Bug target/80571] AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops

2019-02-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571

--- Comment #2 from Peter Cordes  ---
I think hjl's patch for PR 89071 / PR 87007 fixes (most of?) this, at least for
AVX.

If register pressure is an issue, using a reg holding a arbitrary constant
(instead of xor-zeroed) is a valid option, as this bug points out.  So I'm not
sure we should close this as a duplicate of those fixed bugs.

[Bug target/38959] Additional switches to disallow processor supplementary instructions

2019-02-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959

--- Comment #4 from Peter Cordes  ---
The __builtin_ia32_rdpmc being a pure function bug I mentioned in my previous
comment is already reported and fixed (in gcc9 only): bug 87550

It was present since at least gcc 5.0
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214

[Bug target/38959] Additional switches to disallow processor supplementary instructions

2019-02-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #3 from Peter Cordes  ---
We can maybe close this as fixed (if -march=i386 didn't exist/work at the time)
or invalid.  Or maybe we want to add some CPU-level awareness to code-gen for
__builtin_ia32_rdtsc / rdpmc / rdtscp.

The cmov / fcomi / fcomi proposed switches are already supported as part of
-march=pentium -mtune=generic or lower, e.g. -march=i386.  (The 32-bit default
is something like arch=i686 and tune=generic, with it being possible to
configure gcc so SSE2 is on by default in 32-bit code.)

Those are the important ones, because they're emitted automatically by the
compiler's back-end.  The other options would just be trying to save you from
yourself, e.g. rejecting source that contains __rdtsc() /
__builtin_ia32_rdtsc()



I'm not sure what the situation is with long NOPs.  GCC doesn't (normally?)
emit them, just using .p2align directives for the assembler.  In 32-bit mode,
GAS appears to avoid long NOPs, using either 2-byte xchg ax,ax or pseudo-nops
like   LEA esi,[esi+eiz*1+0x0] that add a cycle of latency to the dep chain
involving ESI.

Even with -march=haswell, gcc+gas fail to use more efficient long NOPs for
padding between functions.


---

I'm not sure if CPUID is ever emitted by gcc's back-end directly, only from
inline asm.  i386/cpuid.h uses inline asm.  But __get_cpuid_max() checks if
CPUID is even supported in a 386-compatible way, checking if a bit in EFLAGS is
sticky or not.  If your source code is written safely, you won't have a problem
unless possibly __builtin_cpu_init runs CPUID without checking, in programs
that use __builtin_cpu_supports() or _is().
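
For completeness, the safe usage pattern with <cpuid.h> looks like this (a usage sketch;
__get_cpuid_max() does the EFLAGS.ID toggle check, so it returns 0 on a CPU without CPUID
instead of faulting):

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned max_leaf = __get_cpuid_max(0, 0);   /* 0 => CPUID itself unsupported */
    if (max_leaf == 0) {
        puts("no CPUID");
        return 0;
    }
    unsigned a, b, c, d;
    __cpuid(1, a, b, c, d);                      /* basic feature leaf */
    printf("max basic leaf = %u, leaf-1 EDX = %#x\n", max_leaf, d);
    return 0;
}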


__builtin_ia32_rdpmc() and __rdtsc() do *not* check -march= before emitting
rdpmc and rdtsc.  Neither does __rdtscp(), which is interesting because that
instruction is new enough that some still-relevant CPUs don't support it.

__rdpmc() isn't "volatile", though, so stop-start optimizes to 0.  (I found
this bug looking for existing reports of that issue.)
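
A minimal illustration of that (a sketch, not taken from any PR):

#include <x86intrin.h>

unsigned long long rdpmc_overhead(void) {
    unsigned long long start = __rdpmc(0);
    unsigned long long stop  = __rdpmc(0);
    /* GCC 5 through 8 treat __builtin_ia32_rdpmc as a pure function, so with nothing
       in between that it could "depend" on, this can fold to a constant 0 instead of
       reading the counter twice. */
    return stop - start;
}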



Test cases:  https://godbolt.org/z/hqPdza

FCMOV and CMOV are also handled correctly, but I didn't write functions for
them.

int fcomi(double x, double y) {
    return x < y;
}

> Proposed switches:
> 
> --nocpuid  This option causes the compiler to not generate cpuid opcodes
> --nocmov   This option causes the compiler to not generate cmov opcodes
> --nofcmov  This option causes the compiler to not generate fcmov opcodes
> --nofcomi  This option causes the compiler to not generate fcomi opcodes
> --nonopl   This option causes the compiler to not generate fcomi opcodes
> --nordpmc  This option causes the compiler to not generate rdpmc opcodes
> --nordtsc  This option causes the compiler to not generate rdtsc opcodes
> 
> Possibly a general switch that is equivalent to all of the above
> 
> --nosupplementaryinstructions
> 
> Rationale
> 
> It is possible that a developer still wants to compile for a particular
> architecture (for example the i486), but does not wish to generate code with
> supplementary instructions (such as cpuid), that may be present on that
> architecture.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #15 from Peter Cordes  ---
(In reply to Uroš Bizjak from comment #13)
> I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> currently don't emit XOR clear in front of these instrucitons, when they
> operate with memory input.

They *do* have an output dependency.  It might or might not actually be a
problem and be worth clogging the front-end with extra uops to avoid, depending
on surrounding code. >.<

e.g. ROUNDSD:  DEST[127:64] remains unchanged
Thanks, Intel.  You'd think by SSE4.1 they would have learned that false
dependencies suck, and that it's extremely rare to actually take advantage of
this merge behaviour, but no.

For register-source ROUNDSD / ROUNDSS, we can use ROUNDPD / ROUNDPS which write
the full destination register and have identical performance on all CPUs that
support them.  (Except Silvermont, where roundps/pd have 5c latency vs. 4c for
roundss/sd.  Goldmont makes them equal.)  KNL has faster (V)ROUNDPS/D than
ROUNDSS/SD, maybe only because of the SSE encoding?  Agner Fog isn't clear, and
doesn't have an entry that would match vroundss/sd.

Copy-and-round is good for avoiding extra MOVAPS instructions which can make
SSE code front-end bound, and reduce the effective size of the out-of-order
window.

Preserving FP exception semantics for packed instead of scalar register-source:

* if the upper element(s) of the source is/are known 0, we can always do this
with sqrt and round, and convert: they won't produce any FP exceptions, not
even inexact.  (But not rsqrt / rcpps, of course.)
  This will be the case after a scalar load, so if we need the original value
in memory *and* the result of one of these instructions, we're all set.

* with rounding, the immediate can control masking of precision exceptions, but
not Invalid which is always raised by SRC = SNaN.  If we can rule out SNaN in
the upper elements of the input, we can use ROUNDPS / ROUNDPD

roundps/d can't produce a denormal output.  I don't think denormal inputs slow
it down on any CPUs, but worth checking for cases where we don't care about
preserving exception semantics and want to use it with potentially-arbitrary
garbage in high elements.


rsqrtps can't produce a denormal output because sqrt makes the output closer to
1.0 (reducing the magnitude of the exponent).  (And thus neither can sqrtps.) 
SQRTPS/PD is the same performance as SQRTSS/SD on new CPUs, but old CPUs that
crack 128-bit ops into 64-bit are slower: Pentium III, Pentium M, and Bobcat. 
And Jaguar for sqrt.  Also Silvermont is *MUCH* slower for SQRTPD/PS then
SD/SS, and even Goldmont Plus has slower packed SQRT, RSQRT, and RCP than
scalar.

But RCPPS can produce a denormal.  (double)1.0/FLT_MAX = 2.938736e-39, which is
smaller than FLT_MIN = 1.175494e-38
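
(Quick sanity check of that claim:)

#include <float.h>
#include <stdio.h>

int main(void) {
    /* prints roughly 2.93874e-39 and 1.17549e-38: the reciprocal is below FLT_MIN,
       so rounded to float it's a subnormal */
    printf("1/FLT_MAX = %g, FLT_MIN = %g\n", 1.0 / FLT_MAX, (double)FLT_MIN);
    return 0;
}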



So according to Agner's tables:

* ROUNDPS/PD is never slower than ROUNDSS/SD on any CPU that support them.
* SQRTPS/PD *are* slower than scalar on Silvermont through Goldmont Plus, and
Bobcat, Nano 3000, and P4 Prescott/Nocona.  By about a factor of 2, enough that
should probably care about it for tune=generic.  For ss/ps only (not double),
also K10 and Jaguar have slower sqrtps than ss.  Also in 32-bit mode, P4,
Pentium M and earlier Intel, and Atom, are much slower for packed than scalar
sqrt.
  SQRTPD is *faster* than SQRTSD on KNL.  (But hopefully we're never tuning for
KNL without AVX available.)

* RSQRT / RCP: packed is slower on Atom, Silvermont, and Goldmont (multi-uop so
a big decode stall).  Somewhat slower on Goldmont Plus (1 uop but half
throughput).  Also slower on Nano3000, and slightly slower on Pentium 4 (before
and after Prescott/Nocona), and KNL.  (But hopefully KNL can always use
VRSQRT28PS/PD or scalar)
  Pentium M and older again decode as at least 2 uops for packed, same as
Bobcat and K8.
  Same performance for packed vs. scalar on Jaguar, K10, bdver1-4, ryzen, Core2
and later, and SnB-family.

* CVTSS2SD vs. PD, and SD2SS vs. PD2PS
  packed is slower on k8, bdver1-4 (scalar avoids the shuffle uop), Nano3000,
KNL.  On Silvermont by just 1 cycle latency (so  even a MOVAPS on the critical
path would make it equal.)  Similar on Atom.  Slower on CPUs that do 128-bit
vectors as two 64-bit uops, like Bobcat, and Pentium M / K8 and older.

  packed is *faster* on K10, Goldmont/GDM Plus (same latency, 1c vs. 2c
throughput), Prescott, P4.  Much faster on Jaguar (1c vs. 8c throughput, and 1
uop vs. 2).

  same speed (but without the false dep) for SnB-family (mostly), Core 2,
Ryzen.

  Odd stuff: Agner reports:
Nehalem: ps2pd = 2 uops / 2c, ss2sd = 1 uop / 1c.  (I guess just
zero-padding the significand, no rounding required).  pd2ps and sd2ss are equal
at 2 uops / 4c latency.
SnB: cvtpd2ps is 1c higher latency than sd2ss.
IvB: ps2pd on IvB is 1c vs. 2c for ss2sd
On HSW and later things have settled down to 

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #6 from Peter Cordes  ---
Oops, these were SD not SS.  Getting sleepy >.<.  Still, my optimization
suggestion for doing both compares in one masked SUB of +-PBCx applies equally.

And I think my testing with VBLENDVPS should apply equally to VBLENDVPD.

Since this is `double`, if we're going branchless we should definitely be
vectorizing for a pair of doubles, like doing 

xij = X0(1,i) - X0(1,j)   and 
yij = X0(2,i) - X0(2,j)

together with a vmovupd, and a vector of PBCx, PBCy.

Even if we later need both x and y separately (if those FMAs in the asm are
multiplying components of one vector), we might still come out ahead from doing
the expensive input processing with PD, then it's only one `vunpckhpd` to get
the Y element ready, and that can run in parallel with any x * z stuff

Or if we can unroll by 3 SIMD vectors over contiguous memory, we can get
{X0,Y0} {Z0,X1} {Y1,Z1}.  We get twice the work for a cost of only 3 extra
unpacks, doing 2 i and j values at once.



If this was 3 floats, using a SIMD load would be tricky (maybe vmaskmovps if we
need to avoid going off the end), unless we again unroll by 3 = LCM(vec_len,
width)

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #5 from Peter Cordes  ---
   IF ( xij.GT.+HALf ) xij = xij - PBCx
   IF ( xij.LT.-HALf ) xij = xij + PBCx

For code like this, *if we can prove only one of the IF() conditions will be
true*, we can implement it more efficiently, I think, by checking the magnitude
of xij to see if a SUB is needed, and if so figuring out the sign to apply to
PBCx.

if(abs(xij) > HALF) {
xij -= PBCx XOR sign_bit( xij )
}


# xij  in  xmm0
# PBCx in  xmm7
# HALF in  xmm6
# set1( -0.0f ) in xmm5 (i.e. 1U<<31, a sign-bit mask)
    vandnps    %xmm5, %xmm0, %xmm1   # abs(xij)
    vcmpltps   %xmm1, %xmm6, %xmm1   # HALF < abs(xij)

    vandps     %xmm5, %xmm0, %xmm2   # signbit(xij)
    vxorps     %xmm7, %xmm2, %xmm2   # PBCx (xij>=0) or -PBCx (xij<0)

    vandps     %xmm2, %xmm1, %xmm1   # +-PBCx, or 0.0 if abs(xij) is between -+HALF
    vsubps     %xmm1, %xmm0, %xmm0   # xij -= PBCx, -PBCx, or 0.0

There's a good amount of ILP here, but the critical path is ANDPS + CMPPS +
ANDPS + SUBPS = 10 cycles on Skylake.

We might want to use VPAND for some of this on Haswell, to avoid a port 5
bottleneck at least on the critical path.  (Skylake runs FP booleans on any
port.  BDW and earlier restrict them to port 5 where they can't compete with
FMA, and where bypass latency is always optimal.  On SKL they can introduce
extra bypass latency if they pick p0 or p1.)



vandnps   %xmm5, %xmm0, %xmm2 # signbit(xij)
    vxorps    %xmm7, %xmm2, %xmm2 # PBCx (xij>=0) or -PBCx  (xij<0)

could be replaced with a (v)blendvps using the original xij to select between
PBCx and -PBCx.  With the SSE encoding, that saves a uop and a cycle of latency
(but only off the critical path).  And I think it would cost us a vmovaps to
set up for it.

---

I think this is better than IF-conversion of both IFs separately, but I haven't
really looked.  It should be much better for *latency*.  But it's only
equivalent if subtracting PBCx can't possibly make xij negative and the next IF
condition also true.

---

I was looking at a similar case of applying a fixup if the abs value of an
input is outside a range in
https://stackoverflow.com/questions/54364694/how-to-convert-scalar-code-of-the-double-version-of-vdts-pade-exp-fast-ex-app/54377840#54377840.
 I don't think I came up with anything there that's not already obvious or
covered by the example above, though.

Except if we had needed to square xij at some point, we could have checked  xij
* xij < HALF*HALF as the bound condition to save the ANDNPS.  But then the
mulps latency is part of the input to cmpps.

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #4 from Peter Cordes  ---
I suspect dep-chains are the problem, and branching to skip work is a Good
Thing when it's predictable.

(In reply to Richard Biener from comment #2)
> On Skylake it's better (1uop, 1 cycle latency) while on Ryzen even better.
> On Bulldozer it also isn't that bad (comparable to Skylake I guess).

SKL: AVX VBLENDVPS x,x,x,x  is 2 uops, 2c latency, ~1c throughput.  (Same for
ymm)
SKL: SSE4 BLENDVPS x,x,xmm0 is 1 uop,  1c latency, ~0.36c throughput in my
testing, or maybe 0.333c with breaking dep chains.  (IDK how Agner got 1c. 
Maybe he that was an editing mistake, and he copied the 1c from the VEX
version.)


[V](P)BLENDV(B|PS|PD) is funny: the SSE versions are 1 uop on SKL, I assume
because they only have 3 register operands (including implicit XMM0).  But the
VEX encoding has 4 operands: 1 output and 3 inputs.  I think this is too many
for 1 uop to encode, and that's why VBLENDVPS is 2 uops even on Skylake.

(The blend-control register encoded by an imm8 in the VEX version instead of
implicit xmm0, but I don't think that's what stops the decoders from making it
1 uop.  I think it's simply having 4 total operands.)

On Skylake, the uop(s) for [V]BLENDVPS/D and [V]PBLENDVB can run on any of p015
(instead of only p5 on BDW and earlier), but the 2-uop VEX version is still 2
cycle latency.  The VEX version has a bias towards port 5, but less than half
the total uops run on p5 so it's not p015 + p5.  The SSE version seems equally
distributed to all of p015.



On SKL, the optimal choice might be to use the SSE encoding, if we can deal
with a destructive destination and having the blend control in xmm0.

The SSE/AVX penalty on SKL is output dependencies for write-only SSE
instructions (like movaps or cvtps2dq) writing to an XMM register that has a
dirty upper 128.  It's a per-register thing, not like Haswell where there's it
triggers a state slow change. 
(https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake)

---

Footnote: VBLENDVPS throughput is only 1c for a big block of it back-to-back,
even though it's only 2 uops that can run on any of 3 ports.  So why isn't it
0.66c throughput?

VBLENDVPS throughput (for back-to-back vblendvps) seems to be limited by some
front-end effect.  In an unrolled loop with 20 vblendvps (with no loop-carried
dependencies), there are a negligible amount of cycles where the front-end
delivered the full 4 uops.  Most cycles only 2 are issued.

This is not a general problem for 2-uop instructions or anything: 9x bextr +
dec/jnz = 19 uops total runs at 5.00c / iter, or 3.8 uops / clock, with the
only cycle to not issue 4 uops being (I think) the group of 3 including the
loop branch.  Playing around with other 2-uop instructions, I didn't see
front-end bottlenecks.  I saw some back-end bottlenecks because other 2-uop
instructions aren't so nicely distributed over ports, but perf counts for 
idq_uops_not_delivered.cycles_fe_was_ok:u generally equaled total cycles. 
 (It counts when either the FE delivers 4 uops, or the back end was stalled and
thus not the front-end's fault.)

A 1 uop instruction following a vblendvps can issue with it in the same cycle,
so this effect is probably not horrible for normal cases where we're using
vblendvps mixed with normal instructions.

I haven't investigated further, whether this is a front-end effect (uop cache
fetch problem?) or whether it's an allocation bottleneck.  Possibly being a
4-operand instruction has something to do with it, although I don't think each
uop can have that many inputs.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #10 from Peter Cordes  ---
(In reply to Uroš Bizjak from comment #9)
> There was similar patch for sqrt [1], I think that the approach is
> straightforward, and could be applied to other reg->reg scalar insns as
> well, independently of PR87007 patch.
> 
> [1] https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00202.html

Yeah, that looks good.  So I think it's just vcvtss2sd and sd2ss, and
VROUNDSS/SD that aren't done yet.

That patch covers VSQRTSS/SD, VRCPSS, and VRSQRTSS.

It also bizarrely uses it for VMOVSS, which gcc should only emit if it actually
wants to merge (right?).  *If* this part of the patch isn't a bug

-   return "vmovss\t{%1, %0, %0|%0, %0, %1}";
+   return "vmovss\t{%d1, %0|%0, %d1}";

then even better would be vmovaps %1, %0 (which can benefit from
mov-elimination, and doesn't need a port-5-only ALU uop.)  Same for vmovsd of
course.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #8 from Peter Cordes  ---
Created attachment 45544
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45544&action=edit
testloop-cvtss2sd.asm

(In reply to H.J. Lu from comment #7)
> I fixed assembly codes and run it on different AVX machines.
> I got similar results:
> 
> ./test
> sse  : 28346518
> sse_clear: 28046302
> avx  : 28214775
> avx2 : 28251195
> avx_clear: 28092687
> 
> avx_clear:
>   vxorps  %xmm0, %xmm0, %xmm0
>   vcvtsd2ss   %xmm1, %xmm0, %xmm0
>   ret
> 
> is slightly faster.


I'm pretty sure that's a coincidence, or an unrelated microarchitectural effect
where adding any extra uop makes a difference.  Or just chance of code
alignment for the uop-cache (32-byte or maybe 64-byte boundaries).

You're still testing with the caller compiled without optimization.  The loop
is a mess of sign-extension and reloads, of course, but most importantly
keeping the loop counter in memory creates a dependency chain involving
store-forwarding latency.

Attempting a load later can make it succeed more quickly in store-forwarding
cases, on Intel Sandybridge-family, so perhaps an extra xor-zeroing uop is
reducing the average latency of the store/reloads for the loop counter (which
is probably the real bottleneck.)

https://stackoverflow.com/questions/49189685/adding-a-redundant-assignment-speeds-up-code-when-compiled-without-optimization

Loads are weird in general: the scheduler anticipates their latency and
dispatches uops that will consume their results in the cycle when it expects a
load will put the result on the forwarding network.  But if the load *isn't*
ready when expected, it may have to replay the uops that wanted that input. 
See
https://stackoverflow.com/questions/54084992/weird-performance-effects-from-nearby-dependent-stores-in-a-pointer-chasing-loop
for a detailed analysis of this effect on IvyBridge.  (Skylake doesn't have the
same restrictions on stores next to loads, but other effects can cause
replays.)

https://stackoverflow.com/questions/52351397/is-there-a-penalty-when-baseoffset-is-in-a-different-page-than-the-base/52358810#52358810
is an interesting case for pointer-chasing where the load port speculates that
it can use the base pointer for TLB lookups, instead of the base+offset. 
https://stackoverflow.com/questions/52527325/why-does-the-number-of-uops-per-iteration-increase-with-the-stride-of-streaming
shows load replays on cache misses.

So there's a huge amount of complicating factors from using a calling loop that
keeps its loop counter in memory, because SnB-family doesn't have a simple
fixed latency for store forwarding.





If I put the tests in a different order, I sometimes get results like:

./test
sse  : 26882815
sse_clear: 26207589
avx_clear: 25968108
avx  : 25920897
avx2 : 25956683

Often avx (with the false dep on the load result into XMM1) is slower than
avx_clear of avx2, but there's a ton of noise.



Adding vxorps  %xmm2, %xmm2, %xmm2  to avx.S also seems to have sped it up; now
it's the same speed as the others, even though I'm *not* breaking the
dependency chain anymore.  XMM2 is unrelated, nothing touches it.

This basically proves that your benchmark is sensitive to extra instructions,
whether they interact with vcvtsd2ss or not.


We know that in the general case, throwing in extra NOPs or xor-zeroing
instructions on unused registers does not make code faster, so we should
definitely distrust the result of this microbenchmark.




I've attached my NASM loop.  It has various commented-out loop bodies, and
notes in comments on results I found with performance counters.  I don't know
if it will be useful (because it's a bit messy), but it's what I use for
testing snippets of asm in a static binary with near-zero startup overhead.  I
just run perf stat on the whole executable and look at cycles / uops.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #6 from Peter Cordes  ---
(In reply to Peter Cordes from comment #5)
> But whatever the effect is, it's totally unrelated to what you were *trying*
> to test. :/

After adding a `ret` to each AVX function, all 5 are basically the same speed
(compiling the C with `-O2` or -O2 -march=native), with just noise making it
hard to see anything clearly.  sse_clear tends to be faster than sse in a group
of runs, but if there are differences it's more likely due to weird front-end
effects and all the loads of inputs + store/reload of the return address by
call/ret.

I did  while ./test; do :; done   to factor out CPU clock-speed ramp up and maybe
some cache warmup stuff, but it's still noisy from run to run.  Making
printf/write system calls between tests will cause TLB / branch-prediction
effects because of kernel spectre mitigation, so I guess every test is in the
same boat, running right after a system call.

Adding loads and stores into the mix makes microbenchmarking a lot harder.

Also notice that since `xmm0` and `xmm1` pointers are global, those pointers
are reloaded every time through the loop even with optimization.  I guess
you're not trying to minimize the amount of work outside of the asm functions,
to measure them as part of a messy loop.  So for the version that have a false
dependency, you're making that dependency on the result of this:

    mov        rax, QWORD PTR [rip+0x2ebd]    # reload xmm1
    vmovapd    xmm1, XMMWORD PTR [rax+rbx*1]  # index xmm1

Anyway, I think there's too much noise in the data, and lots of reason to
expect that vcvtsd2ss %xmm0, %xmm0, %xmm1 is strictly better than
VPXOR+convert, except in cases where adding an extra uop actually helps, or
where code-alignment effects matter.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #5 from Peter Cordes  ---
(In reply to H.J. Lu from comment #4)
> (In reply to Peter Cordes from comment #2)

> >  Can you show some
> > asm where this performs better?
> 
> Please try cvtsd2ss branch at:
> 
> https://github.com/hjl-tools/microbenchmark/
> 
> On Intel Core i7-6700K, I got

I have the same CPU.

> [hjl@gnu-skl-2 microbenchmark]$ make
> gcc -g -I.-c -o test.o test.c
> gcc -g   -c -o sse.o sse.S
> gcc -g   -c -o sse-clear.o sse-clear.S
> gcc -g   -c -o avx.o avx.S
> gcc -g   -c -o avx2.o avx2.S
> gcc -g   -c -o avx-clear.o avx-clear.S
> gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
> ./test
> sse  : 24533145
> sse_clear: 24286462
> avx  : 64117779
> avx2 : 62186716
> avx_clear: 58684727
> [hjl@gnu-skl-2 microbenchmark]$

You forgot the RET at the end of the AVX functions (but not the SSE ones); The
AVX functions fall through into each other, then into __libc_csu_init before
jumping around and eventually returning.  That's why they're much slower. 
Single-step through the loop in GDB...

   │0x5660   vcvtsd2ss xmm0,xmm0,xmm1
  >│0x5664   nop    WORD PTR cs:[rax+rax*1+0x0]
   │0x566e   xchg   ax,ax
   │0x5670   vcvtsd2ss xmm0,xmm1,xmm1
   │0x5674   nop    WORD PTR cs:[rax+rax*1+0x0]
   │0x567e   xchg   ax,ax
   │0x5680   vxorps xmm0,xmm0,xmm0
   │0x5684   vcvtsd2ss xmm0,xmm0,xmm1
   │0x5688   nop    DWORD PTR [rax+rax*1+0x0]
   │0x5690 <__libc_csu_init>endbr64
   │0x5694 <__libc_csu_init+4>  push   r15
   │0x5696 <__libc_csu_init+6>  movr15,rdx

And BTW, SSE vs. SSE_clear are about the same speed because your loop
bottlenecks on the store/reload latency of keeping a loop counter in memory
(because you compiled the C without optimization).  Plus, the C caller loads
write-only into XMM0 and XMM1 every iteration, breaking any loop-carried
dependency the false dep would create.

I'm not sure why it makes a measurable difference to run the extra NOPS, and 3x
vcvtsd2ss instead of 1 for avx() vs. avx_clear(), because the C caller should
still be breaking dependencies for the AVX-128 instructions.

But whatever the effect is, it's totally unrelated to what you were *trying* to
test. :/

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #3 from Peter Cordes  ---
(In reply to H.J. Lu from comment #1)
> I have a patch for PR 87007:
> 
> https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html
> 
> which inserts a vxorps at the last possible position.  vxorps
> will be executed only once in a function.

That's talking about the mem,reg case, which like I said is different.  I
reported Bug 80571 a while ago about the mem,reg case (or gp-reg for si2ss/d),
so it's great that you have a fix for that, doing one xor-zeroing and reusing
that as a merge target for a whole function / loop.

But this bug is about the reg,reg case, where I'm pretty sure there's nothing
to be gained from xor-zeroing anything.  We can fully avoid any false dep just
by choosing both source registers = src, making the destination properly
write-only.

If you *have* an xor-zeroed register, there's no apparent harm in using it as
the merge-target for a reg-reg vcvt, vsqrt, vround, or whatever, but there's no
benefit either vs. just setting both source registers the same.  So whichever
is easier to implement, but ideally we want to avoid introducing a vxorps into
functions / blocks that don't need it at all.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #2 from Peter Cordes  ---
(In reply to H.J. Lu from comment #1)
> But
> 
>   vxorps  %xmm0, %xmm0, %xmm0
>   vcvtsd2ss   %xmm1, %xmm0, %xmm0
> 
> are faster than both.

On Skylake-client (i7-6700k), I can't reproduce this result in a hand-written
asm loop.  (I was using NASM to make a static executable that runs a 100M
iteration loop so I could measure with perf).  Can you show some asm where this
performs better?

vcvtsd2ss src-reg,dst,dst is always 2 uops, regardless of the merge destination
being an xor-zeroed register.  (Either zeroed outside the loop, or inside, or
once per 4 converts with an unrolled loop.)

I can't construct a case where  vcvtsd2ss %xmm1, %xmm1, %xmm0  is worse in any
way (dependencies, uops, latency, throughput) than VXORPS + vcvtsd2ss with dst
= middle source.  I wasn't mixing it with other instructions other than VXORPS,
but I don't think anything is going to get rid of its 2nd uop, and choosing
both inputs = the same source removes any benefit from dep-breaking the output.

If adding a VXORPS helped, its probably due to some other side-effect.

Could the effect you saw have been due to code-gen changes for memory sources,
maybe  vxorps + vcvtsd2ss (mem), %xmm0, %xmm0   vs.  vmovsd + vcvtsd2ss %xmm1,
%xmm1, %xmm0?  (Those should be about equal, but memory-source SD2SS is
cheaper, no port5 uop.)



BTW, the false-dependency effect is much more obvious with SS2SD, where the
latency from src1 to output is 4 cycles, vs. 1 cycle for SD2SS.

Even without dependency-breaking, repeated

 vcvtsd2ss  %xmm1, %xmm0, %xmm0

can run at 1 per clock (same as with dep breaking), because the port-5 uop that
merges into the low 32 bits of xmm0 with 1 cycle latency is 2nd.  So latency
from xmm0 -> xmm0 for that [v]cvtsd2ss %xmm1, %xmm0 is 1 cycle.

With dep-breaking, they both still bottleneck on the port5 uop if you're doing
nothing else.

[Bug target/80586] vsqrtss with AVX should avoid a dependency on the destination register.

2019-01-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80586

Peter Cordes  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #1 from Peter Cordes  ---
Fixed for vsqrtss/sd somewhere in 9.0, but not 8.2.  
https://godbolt.org/z/0Gxf05.

The general case of one-input scalar xmm,xmm instructions like vcvtss2sd is
still all over the place, with false deps or wasted xor-zeroing.  Reported that
as bug 89071.

It seems only VSQRTsd/ss itself was fixed for this; sorry I didn't think of
checking for other one-input instructions when I reported this.

[Bug target/89071] New: AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double

2019-01-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

Bug ID: 89071
   Summary: AVX vcvtsd2ss lets us avoid PXOR dependency breaking
for scalar float<->double
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

float cvt(double unused, double xmm1) { return xmm1; }

g++ (GCC-Explorer-Build) 9.0.0 20190120 (experimental):

vxorps  %xmm0, %xmm0, %xmm0
vcvtsd2ss   %xmm1, %xmm0, %xmm0# merge into XMM0

clang7.0
vcvtsd2ss   %xmm1, %xmm1, %xmm0# both sources are from XMM1, no
false dep

gcc already uses this trick for SQRTSS/SD, but not for float<->double
conversion.  I haven't checked all the other scalar instructions, but roundss
for floor() does neither and has a false dependency.  (i.e. it chooses the
output register as the merge-target, not the actual input.)

 return floorf(x);  ->   vroundss$9, %xmm1, %xmm0, %xmm0

Some testcases:

https://godbolt.org/z/-rqUVZ


---

In SSE, one-input scalar instructions like CVT* and SQRTSS/SD have an output
dependency because of Intel's short-sighted ISA design optimizing for
Pentium-III's 64-bit SIMD: zero-extending to fill the destination XMM register
would have cost an extra uop to write the upper half of the destination.

For consistency(?), SSE2 scalar instructions (new with Pentium 4 which had
128-bit SIMD execution units / register file) have the same behaviour of
merging into the low 64 bits of the destination, even conversion between double
and float between two xmm registers, which didn't exist before SSE2. 
(Previously conversion instructions were only between float in XMM and integers
in scalar or MMX regs, or packed-integer <-> ps which filled the whole XMM reg
and thus avoided a false dependency).

(Fortunately this isn't a problem for 2-input instructions like ADDSS: the
operation already depends on both registers.)

---

The VEX encoding makes the merge-target separate from the actual destination,
so we can finally avoid false dependencies without wasting an instruction
breaking it.  (When the source is already in an XMM register).


For instructions where the source isn't an XMM register (e.g. memory or integer
reg for int->FP conversions), one zeroed register can be used as a read-only
merge target by any number of scalar AVX instructions, including in a loop. 
That's bug 80571.


(It's unfortunate that Intel didn't take the opportunity to give the AVX
versions subtly different semantics, and zero-extend into the target register. 
That would probably have enabled vcvtsd2ss to be single-uop instead of 2 on
Sandybridge-family.  IDK if they didn't think of that, or if they wanted strict
consistency with the semantics of the SSE version, or if they thought decoding
/ internals would be easier if they didn't have to omit the
merge-into-destination part of the scalar operation.  At least they made the
extra dependency an explicit input, so we can choose a register other than the
destination, but it's so rarely useful to actually merge into the low 64 or 32
of another reg that it's just long-term harmful to gimp the ISA with an extra
dependency for these instructions, especially integer->FP.)



(I suspect that most of the dep-breaking gcc does isn't gaining any speed, but
the trick is figuring out when we can omit it while being sure that we don't
couple things into one big loop-carried chain, or serialize some things that
OoO exec could otherwise benefit from hiding.  Within one function with no
calls, we might be able to prove that a false dep isn't serializing anything
important (e.g. if there's already enough ILP and something else breaks a dep
on that register between loop iterations), but in general it's hard if we can't
pick a register that was already part of the dep chain that led to the input
for this operation, and thus is harmless to introduce a dep on.)



Relevant instructions that can exist in scalar xmm,xmm form:

VROUNDSS/SD  (gcc leaves a false dep, clang gets it right)

VSQRTSS/SD  (gcc already gets this right)
VRCPSS
VRSQRTSS  haven't checked

[V]CVTSS2SD xmm,xmm  (Skylake: SRC1/output dependency is a separate 1c latency
32-bit merge uop)
  The memory-source version is still 2 uops.

[V]CVTSD2SS xmm,xmm  (Skylake: SRC1/output dependency is the main 4c conversion
uop, the extra uop is first, maybe extracting 32 bits from the src?)
 The memory-source version of [V]CVTSD2SS is only 1 uop!

So avoiding a false dep by loading with MOVSS/MOVSD and then using the reg-reg
version is a bad idea for CVTSD2SS.  It's actually much better to PXOR and then
CVTSD2SS (mem), %xmm, so clang's strategy of loading and then reg-reg
conversion is a missed-optimization.
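
A sketch of that choice for a double loaded from memory (the asm comments just restate the point above; they're not compiler output):

    float load_convert(const double *p)
    {
        return (float)*p;
        // better:  vxorps %xmm0,%xmm0,%xmm0 ; vcvtsd2ss (%rdi), %xmm0, %xmm0
        // worse:   vmovsd (%rdi), %xmm1     ; vcvtsd2ss %xmm1, %xmm1, %xmm0
    }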

I have

[Bug target/89063] [x86] lack of support for BEXTR from BMI extension

2019-01-25 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
Unfortunately Intel Haswell/Skylake implement BEXTR as 2 uops with 2c latency. 
Presumably those uops are a shift + bzhi, so 1p06 + 1p15 would explain Agner
Fog's experimental result of 2p0156 for BEXTR, with 0.5c throughput.

On AMD Excavator/Ryzen, it's 1 uop with 1c latency.  On Steamroller and
earlier, it's 2 uops but 1c latency.  (I assume that's latency from the
non-control input to the output.  So maybe one of the uops pre-processes the
control input, otherwise you'd expect 2c latency from either operand.)  Ryzen
dropped support for AMD TBM, so only Excavator (bdver4) has 1-uop bextr imm16
which would avoid the need for mov reg,imm32 with the control operand.  But
mov-imm + bextr can still be a win on Ryzen, lower latency than RORX+AND

BMI2 RORX is single-uop on all CPUs that support it.  If we already need a 2nd
uop to mask anyway, we can use RORX+AND-immediate to duplicate the
functionality and performance of BEXTR-immediate, with the smaller code-size if
the AND-mask fits in an imm8.  (5+5 vs. 6+3  or 6+4 if the AND needs a REX)

Without an immediate-source BEXTR (like AMD TBM has/had), the only advantage
mov-immediate+bextr has (on Intel) over mov-reg+shift+and is that it can deal with
wide bitfields using a count instead of an immediate AND mask.  (Especially if
it doesn't fit in 32 bits).

If you can reuse the same control-register in a loop, BEXTR is good-ish for
copy-and-extract.

PEXT is 1 uop on Intel CPUs even though the simpler-looking BEXTR is 2.  But
PEXT is extremely slow on Ryzen (7 uops, 18c lat and tput).  So for 32-bit
constants at least, mov r32,imm32 + PEXT to copy-and-extract is better than
BEXTR on Intel.  movabs imm64 is too big and can cause front-end problems
(slower to read from the uop cache, if that effect from Sandybridge is still
present on Haswell/Skylake), and has no advantage vs. RORX + AND unless the
bitfield you're extracting is wider than 32 bits.

PEXT has 3 cycle latency, though, and can only run on port 1 on SnB-family. 
(All integer uops with latency > 1 are p1-only).  It's potentially good for
throughput, but worse than RORX+AND for latency.

Unfortunately x86 bitfield instructions are pretty weak compared to ARM /
AArch64 ubfx or PowerPC rlwinm and friends, where the bit-positions are simply
specified as immediates.  Only AMD's immediate version of BEXTR (1 uop on
Excavator) matched them.  Having a bunch of different control operands for
BEXTR or PEXT in registers might be usable in a loop, but a lot more rarely
useful than immediate controls.




 :
   0:   c4 e3 fb f0 c7 2a       rorx   $0x2a,%rdi,%rax    # $(64-22)
   6:   c4 e3 fb f0 d7 35       rorx   $0x35,%rdi,%rdx    # $(64-11)
   c:   83 e7 3f                and    $0x3f,%edi
   f:   83 e0 3f                and    $0x3f,%eax
  12:   83 e2 3f                and    $0x3f,%edx
  15:   01 f8                   add    %edi,%eax          # 32-bit operand-size because we can prove it can't overflow
  17:   01 d0                   add    %edx,%eax          # missed optimization in both gcc's versions
  19:   c3                      retq

Not counting the ret, this is 7 uops for Skylake and Ryzen.  **I'm pretty sure
this is our best bet for -march=skylake, and for tune=generic -mbmi2**

The BEXTR intrinsics version is 9 uops for SKL, 7 for Ryzen, but is 2 bytes
larger.  (not counting the savings from avoiding a REX prefix on the ADD
instructions; that missed optimization applies equally to both.)  OTOH, the
critical path latency for BEXTR on Ryzen is better by 1 cycle, so we could
still consider it for -march=znver1.  Or for tune=generic -mbmi without BMI2.

The legacy mov+shr+and version is 10 uops because gcc wasted a `mov %rdi,%rax`
instruction; it *should* be 9 uops for all normal CPUs.

---

With only BMI1 but not BMI2 enabled, we should probably use the mov-imm + BEXTR
version.  It's not worse than the mov+shr+and version on SnB-family or bd/zn,
and it's better on some AMD.  And it's probably smaller code-size.

And in future if Intel designs CPUs that can handle BEXTR as a single uop with
1c latency, mov+bextr will become good-ish everywhere.


For code-size, BEXTR has a definite advantage for bitfields wider than 1 byte,
because AND $imm32, %r32 is 6 bytes long instead of 3.
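
For reference, the shift+mask pattern under discussion looks like this in C (illustrative field positions, not necessarily the exact testcase behind the disassembly above):

    #include <stdint.h>

    // Extract a 6-bit field: with BMI2 the shift can become RORX (the source
    // register stays intact) and the 0x3f mask fits in an imm8 AND.
    static inline unsigned extract6(uint64_t x, unsigned pos)
    {
        return (x >> pos) & 0x3f;
    }

    unsigned sum_fields(uint64_t x)
    {
        // three 6-bit fields extracted and summed in 32-bit registers
        return extract6(x, 0) + extract6(x, 11) + extract6(x, 22);
    }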

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2018-08-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

--- Comment #4 from Peter Cordes  ---
The VPAND instructions in the 256-bit version are a missed-optimization.

I had another look at this with current trunk.  Code-gen is similar to before
with -march=skylake-avx512 -mprefer-vector-width=512.  (If we improve code-gen
for that choice, it will make it a win in more cases.)

https://godbolt.org/g/2dfkNV

Loads are folding into the shifts now, unlike with gcc7.3.  (But they can't
micro-fuse because of the indexed addressing mode.  A pointer increment might
save 1 front-end uop even in the non-unrolled loop)

The separate integer loop counter is gone, replaced with a compare against an
end-index.

But we're still doing 2x vpmovwb + vinserti64x4 instead of vpackuswb + vpermq. 
Fewer instructions and (more importantly) 1/3 the shuffle uops.  GCC knows how
to do this for the 256-bit version, so it's apparently a failure of the
cost-model that it doesn't for the 512-bit version.  (Maybe requiring a
shuffle-control vector instead of immediate puts it off?  Or maybe it's
counting the cost of the useless vpand instructions for the pack / permq
option, even though they're not part of the shuffle-throughput bottleneck?)



We do use vpackuswb + vpermq for 256-bit, but we have redundant AND
instructions with set1_epi16(0x00FF) after a right shift already leaves the
high byte zero.

---

Even if vmovdqu8 is not slower, it's larger than AVX vmovdqu.  GCC should be
using the VEX encoding of an instruction whenever it does exactly the same
thing.  At least we didn't use vpandd or vpandq EVEX instructions.

(I haven't found any confirmation about vmovdqu8 costing an extra ALU uop as a
store with no masking.  Hopefully it's efficient.)
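
For context, the loop in question is a word-to-byte narrowing after a shift; a minimal sketch of that shape (assumed, not the exact source from the Godbolt link):

    #include <stddef.h>
    #include <stdint.h>

    void narrow(uint8_t *dst, const uint16_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = (uint8_t)(src[i] >> 8);   // the shift already zeroes the
                                               // high byte, so the vpand is redundant
    }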


[Bug rtl-optimization/86352] New: setc/movzx introduced into loop to provide a constant 0 value for a later rep stos

2018-06-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352

Bug ID: 86352
   Summary: setc/movzx introduced into loop to provide a constant
0 value for a later rep stos
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

The wrong-code bug 86314 also revealed some very weird code-gen decisions,
which the fix didn't improve.

(I think the lock bts peephole is seen pretty late, and that's one necessary
factor for this problem.  But even without it, an unnecessary data dependency
between the lock bts loop and clearing memory is silly.)

This ended up being about 5 separate bugs, but IDK which belong together or are
already reported:

* useless mov %rsi, %rcx and useless mov %rdx, %rdi
* using setc/movzx instead of xor %eax,%eax to get a constant 0; slower and
creating a data dependency
* Doing that inside the loop instead of after
* Not adjusting register allocation to allow xor / set-flags / setc
* rep stos vs. vector stores as a zeroing strategy vs. any other repeated
value.



The reproducer test-case for bug 86314  loops until it finds and claims a zero
bit in a uint64_t, then returns a Bucket() object (with a constructor that
zero-initializes it) with no data dependency on anything.
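
A sketch of that reproducer's shape (my approximation in C; bug 86314 has the real C++ code):

    #include <stdint.h>
    #include <string.h>

    struct Bucket { uint64_t words[16]; };        // 16 qwords = the rep stosq count

    struct Bucket acquireBucket(uint64_t *bitmap)
    {
        for (;;) {
            uint64_t bit = *bitmap & 1;           // only positions 0 or 1 are checked
            uint64_t mask = 1ull << bit;
            // assumption: the original uses an atomic RMW that gcc's lock-bts
            // peephole applies to, since only the claimed bit's old value is tested
            if (!(__atomic_fetch_or(bitmap, mask, __ATOMIC_SEQ_CST) & mask))
                break;                            // claimed a zero bit (CF was 0)
        }
        struct Bucket b;
        memset(&b, 0, sizeof b);                  // zero-init, no data dependency
        return b;
    }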

But gcc decides to introduce a flag -> integer 0/1 inside the acquire() loop
instead of just using  xor eax,eax  before rep stosq.  The loop can only exit
when CF = 0, so RAX = 0, so it's not a correctness problem.

The loop is branching on CF as set by BTS, so there's no need to have the 0/1
in a register at all inside the loop, and setc/movzx from a known-zero CF is
more expensive that xor-zeroing.  (Plus it gives the STOSQ a data dependency on
the LOCK BTS flag result which it wouldn't have otherwise.  The stores can't
commit until after the lock memory barrier, but they can execute.)

This is the actual code-gen from (GCC-Explorer-Build) 9.0.0 20180627
https://godbolt.org/g/XGF5tR


BucketMap::acquireBucket():
        movq    %rdi, %rdx
        movq    %rsi, %rcx        # useless, lock bts can use (%rsi)
.L2:
        movq    (%rsi), %rax
        andl    $1, %eax          # source is simplified to only check positions 0 or 1
        lock btsq  %rax, (%rcx)   # Why not (%rsi)?
        setc    %al
        movzbl  %al, %eax         # xor / bts / setc would have been possible with a different reg
        jc      .L2
        # rax = 0 because the loop can only exit when CF=0

        # should use  xor %eax,%eax  here instead

        movq    %rdx, %rdi        # Useless, RDI still == RDX
        movl    $16, %ecx
        rep stosq
        movq    %rdx, %rax        # can't be done before rep stosq: RAX needs to be 0
        ret



With -m32, where 64-bit lock bts isn't available, we have lock cmpxchg8b ending
with an OR.  So there is a zero in an integer register from that, but it's not
in EAX, so the code gen includes an extra `mov %esi, %eax`, which is not
cheaper than xor %eax,%eax especially with -march=haswell.  Sandybridge-family
has xor-zeroing as cheap as a NOP, but mov-elimination isn't always perfect and
SnB itself doesn't have it.

And of course mov still has a data dependency on the source of the zero, so it
defeats the effect of branch prediction + speculative execution breaking (control)
dependencies.  This last applies on any out-of-order x86.

I guess the lock bts peephole is seen too late to notice that it can't recycle
the 0 from the loop condition anymore, and ends up generating code to
materialize it.  But why inside the loop?

--


Even if we *did* need an integer 0/1 in a register inside the loop, we could
still use the xor / set-flags / setcc optimization: Simply use a register other
than RAX for the load / AND $1 / bts source.  And you can hoist the xor-zeroing
out of the loop.


        xor     %eax, %eax
.L2:
        movq    (%rsi), %rcx
        andl    $1, %ecx
        lock btsq  %rcx, (%rsi)
        setc    %al
        # use %rax
        jc      .L2


---

Separately:

If the initializer is non-zero, it uses SSE or AVX stores.  That makes no sense
either: if rep stosq is optimal, use  mov eax, 1 for the all-ones case.  (See
the ifdef in the Godbolt link to try it)

If it's not optimal, use xorps xmm0,xmm0 to create an all-zero vector.

I guess gcc is checking for all-zeros as a common special case, but doesn't
check for repeats of any other value, except for repeated bytes recognized as
memset.

So it makes sense that gcc uses a different strategy, but I think for only 16x
8 bytes (128 bytes) that vector stores beat rep stos on current CPUs.  (That
may change when IceLake introduces fast short-rep stos/movs.)

GCC does notice that it can reuse the same vec

[Bug target/80820] _mm_set_epi64x shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it.

2018-06-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820

--- Comment #5 from Peter Cordes  ---
AVX512F with merge-masking for integer->vector broadcasts gives us a single-uop
replacement for vpinsrq/d, which is 2 uops on Intel/AMD.

See my answer on
https://stackoverflow.com/questions/50779309/loading-an-xmm-from-gp-regs.  I
don't have access to real hardware, but according to reported uop counts, this
should be very good: 1 uop per instruction on Skylake-avx512 or KNL

    vmovq         xmm0, rax                      1 uop p5   2c latency
    vpbroadcastq  xmm0{k1}, rdx   ; k1 = 0b0010  1 uop p5   3c latency
    vpbroadcastq  ymm0{k2}, rdi   ; k2 = 0b0100  1 uop p5   3c latency
    vpbroadcastq  ymm0{k3}, rsi   ; k3 = 0b1000  1 uop p5   3c latency

xmm vs. ymm vs. zmm makes no difference to latency, according to InstLatx64

(For a full ZMM vector, maybe start a 2nd dep chain and vinsert to combine
256-bit halves.  Also means only 3 k registers instead of 7)

    vpbroadcastq  zmm0{k4}, rcx   ; k4 = 0b10000            3c latency
    ... filling up the ZMM reg


Starting with k1 = 2 = 0b0010, we can init the rest with KSHIFT:

    mov      eax, 0b0010          ; = 2
    kmovw    k1, eax
    KSHIFTLW k2, k1, 1
    KSHIFTLW k3, k1, 2

    # KSHIFTLW k4, k1, 3
    ...

KSHIFT runs only on port 5 (SKX), but so does KMOV; moving from integer
registers would just cost extra instructions to set up integer regs first.

It's actually ok if the upper bytes of the vector are filled with broadcasts,
not zeros, so we could use 0b1110 / 0b1100 etc. for the masks.  We could start
with kxnor to generate a -1 and left-shift that, but that's 2 port5 uops vs.
mov eax,2 / kmovw k1, eax being p0156 + p5.

Loading k registers from memory is not helpful: according to IACA, it costs 3
uops.  (But that includes p237, and a store-AGU uop makes no sense, so it might
be wrong.)
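
In intrinsics the same idea can be written with merge-masked set1; a sketch (whether the compiler emits exactly the uop-for-uop sequence above is up to the backend):

    #include <immintrin.h>
    #include <stdint.h>

    __m256i gather4_gpr(int64_t a, int64_t b, int64_t c, int64_t d)
    {
        __m256i v = _mm256_castsi128_si256(_mm_cvtsi64_si128(a)); // vmovq
        v = _mm256_mask_set1_epi64(v, 0x2, b);   // vpbroadcastq with k = 0b0010
        v = _mm256_mask_set1_epi64(v, 0x4, c);   // k = 0b0100
        v = _mm256_mask_set1_epi64(v, 0x8, d);   // k = 0b1000
        return v;
    }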

[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm

2018-06-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #14 from Peter Cordes  ---
I happened to look at this old bug again recently.

re: extracting the high of the low two 32-bit elements:

(In reply to Uroš Bizjak from comment #11)
> > Or without SSE4 -mtune=sandybridge (anything that excluded Nehalem and other
> > CPUs where an FP shuffle has bypass delay between integer ops)
> > 
> > movd %xmm0, %eax
> > movshdup %xmm0, %xmm0  # saves 1B of code-size vs. psrldq, I think.
> > movd %xmm0, %edx
> > 
> > Or without SSE3,
> > 
> > movd %xmm0, %eax
> > psrldq   $4,  %xmm0# 1 m-op cheaper than pshufd on K8
> > movd %xmm0, %edx
> 
> The above two proposals are not suitable for generic moves. We should not
> clobber input value, and we are not allowed to use temporary.

SSE3 movshdup broadcasts the high element within each pair of 32-bit elements
so 

   movshdup  %xmm0, %xmm1
   movd  %xmm1, %eax

saves a byte of code vs  pshufd / movd, and saves a uop on Merom and avoids a
flt->int transition.  (According to Agner Fog's tables, pshufd is flt->int domain, i.e. it
wants input in the float domain.  While movshdup ironically is only an integer
shuffle.)

Probably not worth looking for that optimization, though, because it's not
worth using universally (Nehalem has worse latency for float shuffles between
int instructions).


With just SSE2, PSHUFLW is the same size as PSHUFD and faster on Merom / K8
(slowshuffle CPUs where PSHUFD is multiple uops).  It's not slower on any
current CPUs.  I could imagine some future CPU having better throughput for
32-bit element size shuffles than 16-bit, though.  That's already the case for
wider lane-crossing shuffles (VPERMW YMM is multiple uops on Skylake-AVX512). 
This would be a definite win for tune=core2 or k8, and Pentium M, but those are
so old it's probably not worth adding extra code to look for it.

I think it's pretty future-proof, though, unless Intel or AMD add an extra
shuffle unit for element sizes of 32-bit or wider on another port.
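
The operation under discussion, as a standalone sketch (not the bug's own testcase):

    #include <emmintrin.h>
    #include <stdint.h>

    // Get the low two 32-bit elements of an XMM register into integer
    // registers, e.g. for returning a 64-bit value in edx:eax on i386.
    void extract_lo2(__m128i v, int32_t *lo, int32_t *hi)
    {
        *lo = _mm_cvtsi128_si32(v);                        // movd
        *hi = _mm_cvtsi128_si32(_mm_shuffle_epi32(v, 1));  // pshufd, or psrldq /
                                                           // movshdup / pshuflw
    }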

[Bug tree-optimization/69615] 0 to limit signed range checks don't always use unsigned compare

2018-06-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615

--- Comment #5 from Peter Cordes  ---
Update: https://godbolt.org/g/ZQDY1G

gcc7/8 optimizes this to and / cmp / jb, while gcc6.3 doesn't.

void rangecheck_var(int64_t x, int64_t lim2) {
  //lim2 >>= 60;
  lim2 &= 0xf;  // let the compiler figure out the limited range of limit
  if (x>=0 && x<lim2) ext();
}

// constant-limit version:  if (x>=0 && x<=(INT_MAX-1)) ext();   // clang and gcc use 2 branches
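
The transformation being asked for, as a sketch: once the lower bound is 0 and the limit is known non-negative, both checks collapse into one unsigned compare.

    #include <stdint.h>
    void ext(void);

    void rangecheck_unsigned(int64_t x, int64_t lim2)
    {
        lim2 &= 0xf;                          // limit is known non-negative
        if ((uint64_t)x < (uint64_t)lim2)     // same as (x >= 0 && x < lim2)
            ext();                            //  -> and / cmp / jb
    }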

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011

--- Comment #13 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #10)
> ??  That is the task for the linker SHF_MERGE|SHF_STRINGS handling.
> Why should gcc duplicate that?

Because gcc would benefit from knowing if merging makes the total block of
strings for a switch() table short enough to use a uint8_t offset[] instead of
uint16_t.

If we don't know at compile time, we'd have to be conservative and potentially
use a wider offset table.  (Although as Joseph points out
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585#c2, without more linker
support for this we could end up missing out on literal merging across
compilation units.  So perhaps a first step in applying this idea would be to
use 32-bit offsets from the start of the .rodata.str1.1 section, so we can
still let the linker merge strings and end up with them non-contiguous without
having to force the one that gets kept to be the one that's part of our block
of strings.)

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011

--- Comment #12 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #10)
> (In reply to Peter Cordes from comment #9)
> > gcc already totally misses optimizations here where one string is a suffix
> > of another.  "mii" could just be a pointer to the 3rd byte of "sgmii", but
> > we instead duplicate all the characters.  That's where major savings are
> > possible for this function.
> 
> ??  That is the task for the linker SHF_MERGE|SHF_STRINGS handling.
> Why should gcc duplicate that?

Oops, right I was only looking at gcc's asm output, didn't check an actual
linked binary.

Will the linker currently catch a case like this?

.LC_base:
.LC2: .string "mii"
.LC3: .string "gmii"

table:
.byte  .LC2 - .LC_base,  .LC3 - .LC_base

and drop .string "mii" entirely + rewrite the table to
.byte  .LC3+1 - .LC_base,  .LC3 - .LC_base

(This discussion should probably be happening on bug 85585.)

Sorry I don't know the actual mechanism by which gcc signals to the linker that
it can / can't merge.  I guess only in some sections?  Because gcc couldn't
allow it if was emitting an array like this, where dropping a string would
change the offsets for later data and break offset calculations:

const struct { char str[11]; } table[] = { {"mii"}, {"gmii"} };

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #9 from Peter Cordes  ---
(In reply to rguent...@suse.de from comment #4)
> An optimization would be to
> add an indirection by, say, only recording the constant offset
> into an "array of strings" in the table, thus effectively
> 
>   "case1\0case2\0..."[CSWITCH[i]]
> 
> which would require only a relocation to access the single string
> constant.  But it would prohibit cases of string merging within
> those strings unless we implement that as well for this optimization.

gcc already totally misses optimizations here where one string is a suffix of
another.  "mii" could just be a pointer to the 3rd byte of "sgmii", but we
instead duplicate all the characters.  That's where major savings are possible
for this function.

> Note this might be profitable unconditionally, not just with -fpie/pic
> as the CSWITCH table would be smaller (dependent on the total
> size of the merged string).

Indeed, I wrote up bug 85585 with ideas for optimizing this.  A table of byte
or uint16_t offsets into a static buffer of packed strings looks good for PIC
and for position-dependent.

To avoid any runtime relocations, all you need is the ability to get a static
address into a register (e.g. RIP-relative LEA) and do an indexed load relative
to it, just like using a normal static char[].  Then add the load result to
that address.  Runtime relocation is nice to avoid even if you don't *need* to
avoid it.

Also possible is padding each string out to a constant length and calculating
an index into that, removing a level of indirection.  (Good when strings are
similar length and/or all short, and there aren't many strings that are
duplicates or suffixes of others.)  Again you just need to get a static address
into a register, and add it to 11*enum_value.  This is all ADD + LEA (with one
of them being RIP-relative).

[Bug tree-optimization/85585] switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a string

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585

--- Comment #1 from Peter Cordes  ---
By comparison, the no-PIE table of pointers only needs one instruction:

movqCSWTCH.4(,%rdi,8), %rax

So all my suggestions cost 1 extra instruction on x86 in no-PIE mode, but at a
massive savings in data size.

clang -fPIE compiles the plain switch to the obvious / sane 2 instruction
sequence which should be our baseline for normal cases.

# clang6.0 -fPIE -O3  (switch compilers on the Godbolt link)
leaq.Lswitch.table.phy_modes(%rip), %rcx
movq(%rcx,%rax,8), %rax

Clang is willing to make a table that needs relocations for the entries.  (My
suggestions all avoid that because they're based on offsets, not a table of
pointers.  Avoiding rodata relocations that dirty a page and prevent sharing
has some non-zero value, although it's low on many architectures where memory
is cheap.)

[Bug tree-optimization/85585] New: switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a st

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585

Bug ID: 85585
   Summary: switch to select a string based on an enum can
profitably optimize away the table of pointers/offsets
into fixed-length char[] blocks.  Or use byte offsets
into a string table
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

Bug 84011 shows some really silly code-gen for PIC code and discussion
suggested using a table of offsets instead of a table of actual pointers, so
you just need one base address.

A further optimization is possible when the strings are all similar length,
and/or the longest one isn't much longer than a pointer:

Pad all strings to the same length with trailing 0 bytes, and calculate a
pointer instead of loading it from an array.  This removes the possibility of
multiple entries sharing the same suffix (which is a missed optimization gcc
wasn't already doing), but avoids needing any space for storing pointers in
memory at all.

In the case discussed in bug 84011 (Linux's phy.h const char
*phy_modes(phy_interface_t interface)), the longest strings are 11 bytes
(including the \0), and there are 23 of them.  So it takes 253 bytes of char
data to store everything (not counting the "unknown" for the default: special
case) with all strings padded to 11 bytes.



The current strings + pointer-table implementation doesn't merge string
literals where one string is a suffix of another; this is another a
missed-optimization that would save many bytes here.  (e.g. instead of .string
"mii" and .string "gmii", just have .LC4 .byte 's'; .LC3: .byte 'g'; .LC2:
.string "mii".)

That optimization plus byte or 16-bit offsets into the table would be nice and
compact, and most CPUs have efficient zero-extending narrow loads.  So for
cases where the other optimization I'm suggesting isn't good, that would
probably be best.



The current packed string-data takes 158 bytes, so with 4-byte offsets it
takes 158+23*4 = 250 bytes.  Or with 8-byte pointers/offsets, it takes 158 +
23*8 = 342 bytes.  Or with 1-byte offsets, 158 + 23*1 = 181 bytes: load with
movzbl.  (If you can't use the offset directly as an 8-byte memory source
operand for ADD to a pointer, there's no point making it 32 bits instead of 8.)

The code for *using* such a table is quite simple.  This C source compiles to
what I'm suggesting:

https://godbolt.org/g/E8J3iS

struct foo {
    char str[11];
} const table[23] = {};

const char *lookup(unsigned long idx) {
    if(idx > 23) {
        return "unknown";
        //idx=23;
    }
    return table[idx].str;
}

Multiply by 11 only takes 2 LEA instructions on x86, so for PIC code with a
RIP-relative LEA we end up with 4 ALU instructions total to get a string
address, after checking the if condition:

   # gcc7.3 -march=haswell -O3 -fPIE output:  https://godbolt.org/g/qMzaY8
        leaq    .LC0(%rip), %rax        # "unknown"
        cmpq    $23, %rdi
        ja      .L4                     # branchless is also an option
        leaq    (%rdi,%rdi,4), %rax
        leaq    table(%rip), %rdx       # RIP-relative table base address
        leaq    (%rdi,%rax,2), %rax
        addq    %rdx, %rax              # table + 11*idx
.L4:
        ret

This is even better in no-PIE mode where a static address is usable as a signed
32-bit immediate:

lookup(unsigned long):
        movl    $.LC0, %eax
        cmpq    $23, %rdi
        ja      .L4
        leaq    (%rdi,%rdi,4), %rax
        leaq    table(%rdi,%rax,2), %rax   # 3 cycle latency for 3-component LEA on SnB-family
.L4:
        ret

So this has extremely low code-size cost on x86-64, for the benefit of removing
a table load in the dependency chain from enum to string data.  It does cost
significant data size vs. a byte-offset table with suffix-merging, but it's 
better than what gcc is doing now in non-PIE (table of qword pointers), and
*much* better in PIE (insane jump table).

-

The byte-index version is equivalent to transforming the C source like this:

const char packedstrings[158] = {};
const unsigned char offsets[23] = {};
const char *lookup_byteidx(unsigned long idx) {
    if(idx>23)
        return "unknown";
    return &packedstrings[offsets[idx]];
}

        leaq    .LC0(%rip), %rax        # "unknown"
        cmpq    $23, %rdi
        ja      .L9
        leaq    offsets(%rip), %rax
        leaq    packedstrings(%rip), %rdx
        movzbl  (%rax,%rdi), %eax
        addq    %rdx, %rax
.L9:
        ret

We can save an instruction here by making the relative position of
packedstrings and offsets a compile-time constant, i.e. by effectively putting
the
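
One way to make that relative position a compile-time constant, as a sketch (assuming the idea is a single base address serving both arrays):

    static const struct {
        unsigned char offsets[23];
        char packedstrings[158];
    } strtab = {};   /* data omitted */

    const char *lookup_onebase(unsigned long idx)
    {
        if (idx > 23)
            return "unknown";
        // one RIP-relative LEA of &strtab covers both the offset load
        // and the final string address
        return &strtab.packedstrings[strtab.offsets[idx]];
    }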

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2018-04-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

--- Comment #2 from Peter Cordes  ---
The stray LEA bug seems to be fixed in current trunk (9.0.0 20180429), at least
for this testcase.  Gcc's stack-alignment strategy seems to be improved overall
(not copying the return address when not needed), so probably it's really
fixed.

It's still present in 7.3.

[Bug c++/69560] x86_64: alignof(uint64_t) produces incorrect results with -m32

2018-04-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69560

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #23 from Peter Cordes  ---
Just to recap the current situation (gcc/g++ 8.0.1 20180425):

I ported David Marillat's testcase to work as C or C++
https://godbolt.org/g/QdG2V6.  (And changed it to set global variables instead
of calling printf, so you can see the results from looking at the asm output
instead of running it).

C++11 alignof() now agrees with C11 alignof() (which didn't change) that
alignof(int64_t) is 4 when targeting the i386 System V ABI.

Previously G++'s alignof() reported 8, while gcc's C11 alignof (stdalign.h)
reported 4.  That was the only change: struct-member alignof results are
unchanged, and already matched between C11 and C++11.


4 is the minimum alignment that *any* int64_t, or pointer to int64_t, is
assumed to have when generating code for i386 SysV.  gcc / g++ are allowed to
generate code that breaks if passed a pointer to int64_t that wasn't 4-byte
aligned.  (Auto-vectorization is one case where that can happen on x86:
https://stackoverflow.com/q/47510783/224132).

They're *not* allowed to assume that it's 8-byte aligned unless they can see
the definition and know that a particular int64_t object is over-aligned, e.g.
to its natural alignment of 8, like gcc chooses to do whenever possible (i.e.
outside structs).

So in both C++ and C (and in g++/gcc after this patch), alignof(int64_t) is the
minimum that any allocator must give an int64_t for correctness (in this funky
32-bit ABI), not the recommended alignment that gcc and g++ both already used
whenever ABI struct-packing rules didn't constrain them.

It's also the guaranteed minimum that code can *assume*.  e.g. a
manually-vectorized library function might check alignof(T) == sizeof(T) before
assuming that using 16-byte aligned loads/stores can line up with element
boundaries.  (An array inside a struct { int foo; int64_t arr[10]; } would
violate this for i386 SysV).

Anyway, I think use-cases like these are why the standard is worded the way it
is, and why it makes sense for alignof() to report the guaranteed/required
minimum.  The recommended or actual alignment is useful, too, though, for other
cases, so it's nice that GNU __alignof() is also available to report that.
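
A quick sketch of the distinction when targeting i386 System V (gcc -m32):

    #include <stdint.h>
    #include <stdalign.h>

    _Static_assert(alignof(int64_t) == 4, "ABI minimum / guaranteed alignment");
    // GNU __alignof__ still reports the preferred natural alignment:
    _Static_assert(__alignof__(int64_t) == 8, "recommended alignment");

    struct S { int32_t a; int64_t b; };   // b lands at offset 4
    _Static_assert(alignof(struct S) == 4, "members are not over-aligned");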



Semi-related: gcc depends on 8-byte alignment for C11 _Atomic int64_t but still
fails to provide it inside structs on the i386 SysV ABI (Bug 65146), using the
same alignment rules as regular int64_t.

C++11 std::atomic<int64_t> is fine, getting the required natural alignment even
on i386 SysV so SSE2 movq is atomic and lock add is efficient.

This change to what alignof() reports in C++ had no effect on C at all, or on
any alignment choices made by the compiler in either C or C++.  I only mention
it as another interesting case where i386 SysV's under-alignment of 64-bit
types requiring special care, but that one will require an ABI change of some
sort to fix.

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2018-04-15 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
This LEA stuff is part of what gcc does to align the stack by 32 for spilling
AVX locals.

Gcc's stack-align sequence is over-complicated and ties up an extra register
for the whole function (add  volatile  to the local and see the -O3 code).  Or
at least it was; it seems gcc8 trunk just makes a stack frame with EBP / RBP
but references 32-byte aligned locals from aligned RSP instead of unaligned
RBP.

It used to copy the address of the return address to make a full copy of
ret-addr / saved-RBP for the aligned stack frame, which was super weird.

https://godbolt.org/g/RLJNtd.  (With an alloca or something, gcc8 does the same
crazy stack-frame stuff as gcc7, otherwise it's much cleaner, like clang)
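
A minimal function of that shape for experimenting with the volatile trick (a sketch, not the exact testcase from the report; compile with -mavx):

    #include <immintrin.h>

    __m256 addv(__m256 a, __m256 b)
    {
        // make this local volatile to force a 32-byte-aligned spill and see
        // the full stack-alignment sequence at -O3
        __m256 tmp = _mm256_add_ps(a, b);
        return tmp;
    }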



The actual bug here is that it's not fully optimized away when it turns out
that no 32-byte spills / reloads from locals are left in the function.

gcc for x86-64 sometimes has a few leftover instructions like that in more
complex functions using __m256; this is not exclusively an i386 problem, but
it's happens more easily for 32-bit it seems.

[Bug target/85366] New: Failure to use both div and mod results of one IDIV in a prime-factor loop while(n%i==0) { n/=i; }

2018-04-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85366

Bug ID: 85366
   Summary: Failure to use both div and mod results of one IDIV in
a prime-factor loop while(n%i==0) { n/=i; }
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

From
https://codereview.stackexchange.com/questions/191792/find-prime-factors-in-c/191801#191801,
simplified to use a pointer instead of returning std::vector. 
Interestingly, the version with std::vector can be more easily coaxed to use
both results of one idiv, see the Godbolt link.

void find_prime_factors_ptr(int n, int *p)
{
    // inefficient to test even numbers > 2, but that's a separate missed optimization.
    for (int i = 2; i <= n; i++) {
        while (n % i == 0) {
            *p++ = i;
            n /= i;   // reordering the loop body doesn't help
        }
    }
}

https://godbolt.org/g/ogyZW8

g++ 8.0.1 20180411 -O3 -march=haswell gives us this inner loop:

 ...
 # outer loop
        movl    %edi, %eax
        # idiv to test if inner loop should even run once, leaving n/i in eax
.L4:
        movl    %edi, %eax        # but instead we discard it
        addq    $4, %rsi
        movl    %ecx, -4(%rsi)
        cltd
        idivl   %ecx
        cltd                      # then modulo that division result to see if the next iteration should run
        movl    %eax, %edi
        idivl   %ecx              # leaves n/i in eax, ready for next iteration...
        testl   %edx, %edx
        je      .L4
 ...

So both ways to get to .L4 (fall in or loop) have n/i in EAX from an idiv
already!  The loop doesn't need to be re-structured to take advantage, gcc just
needs to keep track of what it's doing.

## Hand optimized version of the whole function:
        cmpl    $1, %edi
        jle     .L9
        movl    $2, %ecx
.L5:
        movl    %edi, %eax
        cltd
        idivl   %ecx              # eax = tmp = n/i
        testl   %edx, %edx
        jne     .L3
.L4:
        movl    %ecx, (%rsi)
        addq    $4, %rsi          # we're tuning for Haswell, no register-read stalls so increment after reading and save a byte in the addressing mode
        movl    %eax, %edi        # n = tmp
        cltd
        idivl   %ecx              # eax = tmp = n/i
        testl   %edx, %edx
        je      .L4
.L3:
        incl    %ecx
        cmpl    %edi, %ecx
        jle     .L5
.L9:
        ret


I didn't make *any* changes to the code outside the inner loop.  I ended up
just removing movl %edi, %eax / cltd / idiv %ecx.

Changing the inner loop to

    int tmp;
    while (tmp = n/i, n % i == 0) {
        *p++ = i;
        n = tmp;
    }

gives us the asm almost that good (an extra mov inside the loop), but we get a
jmp into the loop instead of peeling the while condition from before the first
iteration:


# gcc8.0.1 -O3 -march=haswell output, commented but unmodified
find_prime_factors_ptr_opt(int, int*):
        cmpl    $1, %edi
        jle     .L18
        movl    $2, %ecx
        jmp     .L19
.L16:                             # top of inner loop
        addq    $4, %rsi
        movl    %ecx, -4(%rsi)
        movl    %eax, %edi        # extra mov puts this and the next mov on the critical path
.L19:                             # inner loop entry point
        movl    %edi, %eax
        cltd
        idivl   %ecx
        testl   %edx, %edx
        je      .L16              # bottom of inner
        incl    %ecx
        cmpl    %edi, %ecx
        jle     .L19              # bottom of outer
.L18:
        ret

Saving code-size here with the dependent chain of movl %eax, %edi / movl %edi,
%eax is pretty minor even on CPUs like original Sandybridge, or Bulldozer,
without mov-elimination, because idiv's latency dominates.  But it could easily
be taken out of the inner loop by duplicating it outside the outer loop, then
moving it to the outer-only part of the loop body, like this:

        cmpl    $1, %edi
        jle     .L18
        movl    $2, %ecx
        movl    %edi, %eax        # eax = n added here
        jmp     .L19
.L16:                             # top of inner loop
        addq    $4, %rsi
        movl    %ecx, -4(%rsi)
        movl    %eax, %edi        # n = tmp  still here
.L19:                             # inner loop entry point
      # movl    %edi, %eax        # eax = n removed from here in inner/outer loop
        cltd
        idivl   %ecx
        testl   %edx, %edx
        je      .L16              # bottom of inner

        movl    %edi, %eax        # eax = n also added here, in the outer-only part
        incl    %ecx
        cmpl    %edi, %ecx
        jle     .L19              # bottom of outer
.L18:
 

[Bug target/85038] x32: unnecessary address-size prefix when a pointer register is already zero-extended

2018-03-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038

--- Comment #1 from Peter Cordes  ---
Correction for AArch64: it supports addressing modes with a 64-bit base
register + 32-bit index register with zero or sign extension for the 32-bit
index.  But not 32-bit base registers.

As a hack that's better than nothing, AArch64 could use a 32-bit pointer as the
index with a UXTW mode, using a zeroed register as the base (unless indexed
modes have any perf downside on real AArch64 chips).  But unfortunately, the
architectural zero register isn't usable as the base: that encoding means the
stack pointer for this instruction.  ldr w1,[xzr,w2,uxtw] doesn't assemble,
only x0-x30 or SP.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0801b/BABBGCAC.html


http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/LDR_reg_gen.html
describes LDR  Wt, [Xn|SP, Rm{, extend {amount}}]
where Rm can be an X or W register, and "extend" can be SXTW or UXTW for word
regs, or LSL for X regs.  (SXTX is a synonym for LSL).  Any of the modes can
use a left-shift amount, applied *after* extension to 64-bit.

See
https://community.arm.com/processors/b/blog/posts/a64-shift-and-extend-operations-operand-modifiers
for details on operand-modifiers.


gcc6.3 doesn't take advantage with -mabi=ilp32, and Godbolt doesn't have later
AArch64 gcc.

So gcc will need to know about zero-extended pointers, and the signedness of
32-bit values, to take advantage of AArch64's addressing modes for the common
case of a 32-bit index.  Teaching gcc to track signed/unsigned in RTL would
benefit x32 and AArch64 ILP32, if I understand the situation correctly.
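
For reference, the access pattern where that UXTW mode applies (a sketch):

    #include <stdint.h>

    // 64-bit base + unsigned 32-bit index: AArch64 can fold the zero-extension
    // into the addressing mode, e.g.  ldr w0, [x0, w1, uxtw #2]
    int32_t load_idx(const int32_t *base, uint32_t idx)
    {
        return base[idx];
    }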
