[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-08 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #4 from Peter Cordes  ---
We should not put any stock in what ICC does for GNU C native vector indexing. 
I think it doesn't know how to optimize that because it *always* spills/reloads
even for `vec[0]` which could be a no-op.  And it's always a full-width spill
(ZMM), not just the low XMM/YMM part that contains the desired element.  I
mainly mentioned ICC in my initial post to suggest the store/reload strategy in
general as an *option*.

ICC also doesn't optimize intrinsics: it pretty much always faithfully
transliterates them to asm.  e.g. v = _mm_add_epi32(v, _mm_set1_epi32(1));
twice compiles to two separate paddd instructions, instead of one with a
constant of set1(2).
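For example (a minimal illustration of that point, not taken from the report):

    #include <immintrin.h>
    __m128i add_twice(__m128i v) {
        v = _mm_add_epi32(v, _mm_set1_epi32(1));
        v = _mm_add_epi32(v, _mm_set1_epi32(1));
        return v;   // GCC/clang fold this into one paddd with set1(2); ICC keeps two paddd
    }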

If we want to see ICC's strided-store strategy, we'd need to write some pure C
that auto-vectorizes.



That said, store/reload is certainly a valid option when we want all the
elements, and gets *more* attractive with wider vectors, where the one extra
store amortizes over more elements.
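A minimal sketch of that store/reload strategy for a strided store of all 16
elements (dst, stride, and v are hypothetical names; assumes <immintrin.h>):

    float tmp[16] __attribute__((aligned(64)));
    _mm512_store_ps(tmp, v);              // one full-width store
    for (int i = 0; i < 16; i++)
        dst[i * stride] = tmp[i];         // scalar reload + strided store per element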

Strided stores will typically bottleneck on cache/memory bandwidth unless the
destination lines are already hot in L1d.  But if there's other work in the
loop, we care about OoO exec of that work with the stores, so uop throughput
could be a factor.


If we're tuning for Intel Haswell/Skylake with 1 per clock shuffles but 2 loads
+ 1 store per clock throughput (if we avoid indexed addressing modes for
stores), then it's very attractive and unlikely to be a bottleneck.

There's typically spare load execution-unit cycles in a loop that's also doing
stores + other work.  You need every other uop to be (or include) a load to
bottleneck on that at 4 uops per clock, unless you have indexed stores (which
can't run on the simple store-AGU on port 7 and need to run on port 2/3, taking
a cycle from a load).   Cache-split loads do get replayed to grab the 2nd half,
so it costs extra execution-unit pressure as well as extra cache-read cycles.

Intel says Ice Lake will have 2 load + 2 store pipes, and a 2nd shuffle unit.  A
mixed strategy there might be interesting: extract the high 256 bits to memory
with vextractf32x8 and reload it, but shuffle the low 128/256 bits.  That
strategy might be good on earlier CPUs, too.  At least with movss + extractps
stores from the low XMM where we can do that directly.

AMD before Ryzen 2 has only 2 AGUs, so only 2 memory ops per clock, up to one
of which can be a store.  It's definitely worth considering extracting the high
128-bit half of a YMM and then using movss plus shuffle-stores like extractps
(2 uops on Ryzen and earlier AMD).


-

If the stride is small enough (so more than 1 element fits in a vector), we
should consider  shuffle + vmaskmovps  masked stores, or AVX512 masked stores
where AVX512 is available.
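A sketch of the small-stride idea for stride 2, assuming plain AVX512F (dst and
v are hypothetical names): one lane-crossing shuffle spreads the low 8 elements
into the even lanes, then a masked store writes only those lanes.

    // scatter v[0..7] to dst[0], dst[2], ..., dst[14]
    __m512i idx = _mm512_set_epi32(0,7, 0,6, 0,5, 0,4, 0,3, 0,2, 0,1, 0,0);
    __m512  spread = _mm512_permutexvar_ps(idx, v);     // vpermps
    _mm512_mask_storeu_ps(dst, 0x5555, spread);         // even lanes only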

But for larger strides, AVX512 scatter may get better in the future.  It's
currently (SKX) 43 uops for VSCATTERDPS or ...DD ZMM, so not very friendly to
surrounding code.  It sustains one per 17 clock throughput, slightly worse than
1 element stored per clock cycle.  Same throughput on KNL, but only 4 uops so
it can overlap much better with surrounding code.
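For reference, the scatter under discussion can be written like this (dst,
stride, and v are hypothetical names; dst[i*stride] = v[i] for i = 0..15):

    __m512i lane = _mm512_set_epi32(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
    __m512i vidx = _mm512_mullo_epi32(lane, _mm512_set1_epi32(stride));
    _mm512_i32scatter_ps(dst, vidx, v, sizeof(float));  // vscatterdps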




For qword elements, we have efficient stores of the high or low half of an XMM.
 A MOVHPS store doesn't need a shuffle uop on most Intel CPUs.  So we only need
1 (YMM) or 3 (ZMM) shuffles to get each of the high 128-bit lanes down to an
XMM register.

Unfortunately on Ryzen, MOVHPS [mem], xmm costs a shuffle+store.  But Ryzen has
shuffle EUs on multiple ports.
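A sketch of the qword case, assuming AVX512DQ for vextractf64x2 (with plain
AVX512F, vextractf32x4 plus casts does the same job).  dst and stride are
hypothetical names; v is a __m512d:

    __m128d q01 = _mm512_castpd512_pd128(v);
    __m128d q23 = _mm512_extractf64x2_pd(v, 1);
    __m128d q45 = _mm512_extractf64x2_pd(v, 2);
    __m128d q67 = _mm512_extractf64x2_pd(v, 3);          // 3 shuffles total
    _mm_storel_pd(dst + 0*stride, q01);  _mm_storeh_pd(dst + 1*stride, q01);
    _mm_storel_pd(dst + 2*stride, q23);  _mm_storeh_pd(dst + 3*stride, q23);
    _mm_storel_pd(dst + 4*stride, q45);  _mm_storeh_pd(dst + 5*stride, q45);
    _mm_storel_pd(dst + 6*stride, q67);  _mm_storeh_pd(dst + 7*stride, q67);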

[Bug target/91103] New: AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-06 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

Bug ID: 91103
   Summary: AVX512 vector element extract uses more than 1 shuffle
instruction; VALIGND can grab any element
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

GCC9.1 and current trunk aren't good at extracting high elements, whether it's
with GNU C native vector syntax, or when auto-vectorizing something that ends
with the result in the high element.

Using VALIGND we can get any element with one immediate instruction, but it's
better to use AVX2 VPERMPD(immediate) when possible.  Or inside loops,
VPERMPS(vector), or VPERMT2PS(vector).  Or of course vextractf32x4 if possible
(element at the bottom of a 128-bit lane).

Or with only AVX2 available, VPERMPD(immediate) for high elements in __m256 and
__m256d vectors is still a big win.

#include <immintrin.h>
float elem12(__m512 v) {  return v[12]; }
float elem15(__m512 v) {  return v[15]; }

gcc -Ofast -march=skylake-avx512
https://godbolt.org/z/241r8p

elem15:
        vextractf32x8   ymm0, zmm0, 0x1
        vextractf128    xmm0, ymm0, 0x1    # elem12 ends here, after these 2 insns
        vshufps         xmm0, xmm0, xmm0, 255
        # no vzeroupper, I guess because the caller must have __m512 vars too (recent optimization)
        ret

But AVX512F has vextractf32x4 to extract any 128-bit lane directly, which
removes the need for the separate AVX2 vextractf128.  That's what clang does.

Obviously inside a loop it would be *much* better to use a single lane-crossing
VPERMPS to also avoid the shufps.  Intel Skylake easily bottlenecks on shuffle
throughput.  We'd need a 15 in an XMM register as a control vector, but loading
it would be off the latency critical path.  (If we needed the scalar
zero-extended instead of garbage in high elements, we could VPERMI2PS or
VPERMT2PS with a zeroed vector and a shuffle-control.)
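A sketch of that in-loop approach with intrinsics (the control vector would be
hoisted out of the loop; garbage in the upper result elements is fine for a
scalar return):

    __m512i c15 = _mm512_set1_epi32(15);                        // loop-invariant
    float f = _mm512_cvtss_f32(_mm512_permutexvar_ps(c15, v));  // one vpermps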

---

If the element we want is an even element in the low 256 bits, we can get it
with a VPERMPD-immediate.  GCC does this:

elem6(float __vector(16)):    # GCC 10 trunk
        vextractf128    xmm0, ymm0, 0x1
        vunpckhps       xmm0, xmm0, xmm0
        ret

Instead it should be the AVX2  vpermpd ymm0, ymm0, 3.
This bug also applies to __m256, not just __m512.
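A sketch of the suggested form with intrinsics (assuming the __m512 argument
from the test case; element 6 is the low half of qword 3 of the low YMM):

    __m256d lo = _mm256_castps_pd(_mm512_castps512_ps256(v));
    __m256d q3 = _mm256_permute4x64_pd(lo, 3);         // vpermpd ymm, ymm, 3
    float f = _mm256_cvtss_f32(_mm256_castpd_ps(q3));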

https://www.felixcloutier.com/x86/vpermpd
VPERMPD is a 64-bit granularity lane-crossing shuffle.  The AVX512F immediate
version reuses the immediate for another 256-bit wide shuffle in the upper
half; only the vector-control version can bring an element from the top half of
a ZMM down to the bottom.  But if we're going to use a vector control, we might
as well use VPERMPS.

For the integer version of this bug, use VPERMQ

--

But we can do even better by using an integer VALIGND (AVX512F) shuffle on FP
data.  There unfortunately isn't an FP flavour of VALIGND, just integer.

AFAIK, Skylake-AVX512 still has no bypass-delay penalty for integer shuffles
between FP math instructions, i.e. the shuffle unit is connected to both FP and
integer forwarding networks.  Intel's optimization manual for Skylake (client)
has a bypass-latency table that shows 0 extra latency cycles for SHUF/5/1,3
reading from anything, or anything reading from it.

https://www.felixcloutier.com/x86/valignd:valignq  It's a 4 or 8-byte
granularity version of palignr, except that it's lane-crossing so the 256 and
512-bit versions are actually useful.  The immediate shift count can thus bring
*any* element down to the bottom.  (Using the same input twice makes it a
rotate).

VALIGND is good on Knight's Landing, too: unlike most 2-input shuffles, it has
1 per clock throughput.

For *any* compile-time-constant index, we can always compile v[i] to this:

extract15:
        valignd zmm0, zmm0, zmm0, 15   # I think this is right.
        ret

The only downside I'm aware of is that some future AVX512 CPU might not run
VALIGND as efficiently as SKX and KNL.
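With intrinsics, the VALIGND approach looks something like this sketch
(_mm512_alignr_epi32 is the VALIGND intrinsic; the index must be a compile-time
constant):

    static inline float extract_f32(__m512 v, const int i)
    {
        __m512i t = _mm512_alignr_epi32(_mm512_castps_si512(v),
                                        _mm512_castps_si512(v), i);
        return _mm512_cvtss_f32(_mm512_castsi512_ps(t));
    }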




For vector elements narrower than 32 bits, we may need 2 shuffles even if we
consider using a shuffle-control vector.  On Skylake-AVX512,  AVX512BW  vpermw 
will get the job done, but costs 2 shuffle uops.  On CannonLake (and presumably
other future Intel), it and  AVX512VBMI vpermb are only 1 uop, so it's
definitely worth creating a shuffle-control vector if it can be reused.
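A sketch for a word element, assuming AVX512BW (the index 25 is an arbitrary
example):

    __m512i c = _mm512_set1_epi16(25);                 // reusable control vector
    __m512i t = _mm512_permutexvar_epi16(c, v);        // vpermw
    short   w = (short)_mm512_cvtsi512_si32(t);        // low 16 bits of lane 0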


Also worth considering instead of 2 shuffles: *unaligned* spill / reload like
ICC does for GNU C native vector indexing.  Store-forwarding latency is only 6
or 7 cycles I think, and it avoids any port 5 pressure.  Not generally a good
choice IMO when we can get the job done in one shuffle, but worth considering
if we need multiple elements.  If the function doesn't need the stack aligned,
an unaligned spill is generally

[Bug target/90582] New: AArch64 stack-protector wastes an instruction on address-generation

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90582

Bug ID: 90582
   Summary: AArch64 stack-protector wastes an instruction on
address-generation
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

void protect_me() {
volatile int buf[2];
buf[1] = 3;
}

https://godbolt.org/z/xdlr5w AArch64 gcc8.2 -O3 -fstack-protector-strong

protect_me:
        stp     x29, x30, [sp, -32]!
        adrp    x0, __stack_chk_guard
        add     x0, x0, :lo12:__stack_chk_guard   ### this instruction
        mov     x29, sp        # frame pointer even though -fomit-frame-pointer is
                               # part of -O3.  Goes away with explicit -fomit-frame-pointer

        ldr     x1, [x0]       # copy the cookie
        str     x1, [sp, 24]
        mov     x1, 0          # and destroy the reg

        mov     w1, 3          # right before it's already destroyed
        str     w1, [sp, 20]   # buf[1] = 3

        ldr     x1, [sp, 24]   # canary
        ldr     x0, [x0]       # key; destroys the key pointer
        eor     x0, x1, x0
        cbnz    x0, .L5
        ldp     x29, x30, [sp], 32   # FP and LR save/restore (for some reason?)
        ret
.L5:
        # can the store of the link register go here, for backtracing?
        bl      __stack_chk_fail

A function that returns a global can embed the low 12 bits of the address into
the load instruction.  AArch64 instructions are fixed-width, so there's no
reason (AFAIK) not to do this.

f:
        adrp    x0, foo
        ldr     w0, [x0, #:lo12:foo]
        ret

I'm not an AArch64 performance expert; it's plausible that zero displacements
are worth spending an extra instruction on for addresses that are used twice,
but unlikely.

So we should be doing 

        adrp    x0, __stack_chk_guard
        ldr     x1, [x0, #:lo12:__stack_chk_guard]  # in prologue to copy cookie
        ...
        ldr     x0, [x0, #:lo12:__stack_chk_guard]  # in epilogue to check cookie

This also avoids leaving an exact pointer right to __stack_chk_guard in a
register, in case a vulnerable callee or code in the function body can be
tricked into dereferencing it and leaking the cookie.  (In non-leaf functions,
we generate the pointer in a call-preserved register like x19, so yes it will
be floating around in a register for callees).

I'd hate to suggest destroying the pointer when copying to the stack, because
that would require another adrp later.

Finding a gadget that has exactly the right offset (the low 12 bits of
__stack_chk_guard's address) is a lot less likely than finding an  ldr from
[x0].  Of course this will introduce a lot of LDR instructions with an
#:lo12:__stack_chk_guard offset, but hopefully they won't be part of useful
gadgets because they lead to writing the stack, or to EOR/CBNZ to
__stack_chk_fail



I don't see a way to optimize canary^key == 0 any further, unlike x86-64 PR
90568.  I assume EOR / CBNZ is as at least as efficient as SUBS / BNE on
all/most AArch64 microarchitectures, but someone should check.



-O3 includes -fomit-frame-pointer according to -fverbose-asm, but functions
protected with -fstack-protector-strong still get a frame pointer in x29
(costing a MOV x29, sp instruction, and save/restore with STP/LDP along with
x30.)

However, explicitly using -fomit-frame-pointer stops that from happening.  Is
that a separate bug, or am I missing something?



Without stack-protector, the function is vastly simpler

protect_me:
sub sp, sp, #16
mov w0, 3
str w0, [sp, 12]
add sp, sp, 16
ret

Does stack-protector really need to spill/reload x29/x30 (FP and LR)?  Bouncing
the return address through memory seems inefficient, even though branch
prediction does hide that latency.

Is that just so __stack_chk_fail can backtrace?  Can we move the store of the
link register into the __stack_chk_fail branch, off the fast path?

Or if we do unconditionally store x30 (the link register), at least don't
bother reloading it in a leaf function if register allocation didn't need to
clobber it.  Unlike x86-64, the return address can't be attacked with buffer
overflows if it stays safe in a register the whole function.

Obviously my test-case with a volatile array and no inputs at all is making
-fstack-protector-strong look dumb by protecting a perfectly safe function. 
IDK how common it is to have leaf functions with arrays or structs that just
use them for some computation on function args or globals and then return,
maybe after copying the array back to somewhere else.  A sort function might
use a tmp

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

--- Comment #5 from Peter Cordes  ---
And BTW, this only helps if the SUB and JNE are consecutive, which GCC
(correctly) doesn't currently optimize for with XOR.

If this sub/jne is different from a normal sub/branch and won't already get
optimized for macro-fusion, we may get even more benefit from this change by
teaching gcc to keep them adjacent.

GCC currently sometimes splits up the instructions like this:

        xorq    %fs:40, %rdx
        movl    %ebx, %eax
        jne     .L7

from gcc8.3 (but not 9.1 or trunk in this case) on https://godbolt.org/z/nNjQ8u


#include <random>
unsigned int get_random_seed() {
std::random_device rd;
return rd();
}

Even with -O3 -march=skylake.
That's not wrong because XOR can't macro-fuse, but the point of switching to
SUB is that it *can* macro-fuse into a single sub-and-branch uop on
Sandybridge-family.  So we might need to teach gcc about that.

So when you change this, please make it aware of optimizing for macro-fusion by
keeping the sub and jne back to back.  Preferably with tune=generic (because
Sandybridge-family is fairly widespread and it doesn't hurt on other CPUs), but
definitely with -mtune=intel or -mtune=sandybridge or later.

Nehalem and earlier can only macro-fuse test/cmp

The potential downside of putting it adjacent instead of 1 or 2 insns earlier
for uarches that can't macro-fuse SUB/JNE should be about zero on average. 
These branches should predict very well, and there are no in-order x86 CPUs
still being sold.  So it's mostly just going to be variations in fetch/decode
that help sometimes, hurt sometimes, like any code alignment change.

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

--- Comment #3 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #2)
> The xor there is intentional, for security reasons we do not want the stack
> canary to stay in the register afterwards, because then it could be later
> spilled or accessible to some exploit in another way.

Ok, so we can't use CMP, therefore we should use SUB, which as I showed does
help on Sandybridge-family vs. XOR.

x - x = 0   just like 
x ^ x = 0

Otherwise SUB wouldn't set ZF.

SUB is not worse than XOR on any other CPUs; there are no CPUs with better XOR
throughput than ADD/SUB.

In the canary mismatch case, leaving  attacker_value - key  in a register seems
no worse than leaving attacker_value ^ key in a register.  Either value
trivially reveals the canary value to an attacker that knows what they
overwrote the stack with, if it does somehow leak.  We jump to __stack_chk_fail
in that case, not relying on the return value on the stack, so a ROP attack
wouldn't be sufficient to leak that value anywhere.

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

--- Comment #1 from Peter Cordes  ---
https://godbolt.org/z/hHCVTc

Forgot to mention, stack-protector also disables use of the red-zone for no
apparent reason, so that's another missed optimization.  (Perhaps rarely
relevant; probably most functions that get stack protection are big enough that
they need more stack, or non-leaf.  I sidestepped that with volatile.)

[Bug target/90568] New: stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

Bug ID: 90568
   Summary: stack protector should use cmp or sub, not xor, to
allow macro-fusion on x86
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

cmp/jne is always at least as efficient as xor/jne, and more efficient on CPUs
that support macro-fusion of compare and branch.  Most support cmp/jne fusion
(including all mainstream Intel and AMD, not low-power), but none support
xor/jne fusion.

void foo() {
volatile int buf[4];
buf[1] = 2;
}

gcc trunk on Godbolt, but same code-gen all the way back to gcc4.9

foo:
        subq    $40, %rsp
        movq    %fs:40, %rax
        movq    %rax, 24(%rsp)
        xorl    %eax, %eax
        movl    $2, 4(%rsp)
        movq    24(%rsp), %rax
        xorq    %fs:40, %rax    ## This insn should be CMP
        jne     .L5
        addq    $40, %rsp
        ret
.L5:
        call    __stack_chk_fail

As far as I can tell, the actual XOR result value in RAX is not an input to
__stack_chk_fail because gcc sometimes uses a different register.

Therefore we don't need it, and can use any other way to check for equality.

If we need to avoid "leaking" the canary value in a register, we can use SUB,
otherwise CMP is even better and can macro-fuse on more CPUs.

Only Sandybridge-family can fuse SUB/JCC.  (And yes, it can fuse even with a
memory-source and a segment override prefix.  SUB %fs:40(%rsp), %rax / JNE  is
a single uop on Skylake; I checked this with perf counters in an asm loop.)

AMD can fuse any TEST or CMP/JCC, but only those instructions (so SUB is as bad
as XOR for AMD).  See Agner Fog's microarch PDF.



Linux test program (NASM) that runs  sub (mem), %reg with an FS prefix to prove
that it does macro-fuse and stays micro-fused as a single uop:


default rel
%use smartalign
alignmode p6, 64

global _start
_start:

cookie equ 12345
mov  eax, 158   ; __NR_arch_prctl
mov  edi, 0x1002; ARCH_SET_FS
lea  rsi, [buf]
syscall
   ;  wrfsbase   rsi; not enabled by the kernel
mov  qword [fs: 0x28], cookie

mov ebp, 10

align 64
.loop:
mov   eax, cookie
sub   rax, [fs: 0x28]
jne   _start
and   ecx, edx

dec ebp
jnz .loop
.end:

xor edi,edi
mov eax,231   ; __NR_exit_group
syscall   ; sys_exit_group(0)


section .bss
align 4096
buf:    resb 4096



nasm -felf64  branch-fuse-mem.asm &&
ld -o branch-fuse-mem  branch-fuse-mem.o
to make a static executable

taskset -c 3 perf stat
-etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u
-r2 ./branch-fuse-mem

On my i7-6700k

 Performance counter stats for './branch-fuse-mem' (2 runs):

240.78 msec task-clock:u  #0.999 CPUs utilized 
  ( +-  0.23% )
 2  context-switches  #0.010 K/sec 
  ( +- 20.00% )
 0  cpu-migrations#0.000 K/sec  
 3  page-faults   #0.012 K/sec  
 1,000,764,258  cycles:u  #4.156 GHz   
  ( +-  0.00% )
 2,000,000,076  branches:u# 8306.384 M/sec 
  ( +-  0.00% )
 6,000,000,088  instructions:u#6.00  insn per cycle
  ( +-  0.00% )
 4,000,109,615  uops_issued.any:u # 16613.222 M/sec
  ( +-  0.00% )
 5,000,098,334  uops_executed.thread:u# 20766.367 M/sec
  ( +-  0.00% )

  0.240935 +- 0.000546 seconds time elapsed  ( +-  0.23% )

Note 1.0 billion cycles (1 per iteration), and 4B fused-domain uops_issued.any,
i.e. 4 uops per loop iteration.

(5 uops *executed* is because one of those front-end uops has a load
micro-fused).

Changing SUB to CMP has no effect.

With SUB changed to XOR, the loop takes 1.25 cycles per iteration, and the
front-end issues 5 uops per iteration.  Other counters are the same.

Skylake's pipeline is 4-wide, like all Intel since Core2, so an extra uop for
the front-end creates a bottleneck.

--

On Intel pre Haswell, the decoders will only make at most 1 fusion per decode
group, so you may need to make the loop larger to still get fusion.  Or use
this as the loop-branch, e.g. with a  1  in memory

   sub  rax, [fs: 0x28]
   jnz  .loop

or with a 0 in memory, sub or cmp or xor will all set flags according to the
register being non-zero.  But sub or xor will introduce an extra cycle of
latency on the critical path for the loop counter.

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-04-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #4 from Peter Cordes  ---
Yes, rep scasb is abysmal, and gcc -O3's 4-byte-at-a-time scalar loop is not
very good either.

With 16-byte alignment, (which we have from calloc on x86-64 System V), we can
inline a *much* better SSE2 loop.  See
https://stackoverflow.com/a/55589634/224132 for more details and
microbenchmarks; 

On Skylake it's about 4 to 5x faster than the current 4-byte loop for large
strings, 3x faster for short strings.  For short strings (strlen=33), it's
about 1.5x faster than calling strlen.  For very large strings (too big for L2
cache), it's ~1.7x slower than glibc's AVX2 strlen.

The lack of VEX encoding for pxor and pmovmskb is just me being lazy; let gcc
emit them all with VEX if AVX is enabled.

    # at this point gcc has `s` in RDX, `i` in ECX

    pxor      %xmm0, %xmm0        # zeroed vector to compare against
    .p2align 4
.Lstrlen16:                       # do {
#ifdef __AVX__
    vpcmpeqb  (%rdx), %xmm0, %xmm1
#else
    movdqa    (%rdx), %xmm1
    pcmpeqb   %xmm0, %xmm1        # xmm1 = -1 where there was a 0 in memory
#endif

    add       $16, %rdx           # ptr++
    pmovmskb  %xmm1, %eax         # extract high bit of each byte to a 16-bit mask
    test      %eax, %eax
    jz        .Lstrlen16          # }while(mask==0);
    # RDX points at the 16-byte chunk *after* the one containing the terminator
    # EAX = bit-mask of the 0 bytes, and is known to be non-zero
    bsf       %eax, %eax          # EAX = bit-index of the lowest set bit

    # terminator is at rdx+rax - 16
    #  movb   $'A', -16(%rdx, %rax)   // for a microbench that used s[strlen(s)]='A'
    sub       %rbp, %rdx             # p -= start
    lea       -16(%rdx, %rax), %rax  # p += byte_within_vector - 16

We should actually use  REP BSF  because that's faster on AMD (tzcnt), and same
speed on Intel.
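For reference, a C intrinsics sketch of the same loop (assumes s is 16-byte
aligned; an aligned 16-byte load can't cross a page, so reading the rest of the
final chunk past the terminator can't fault):

    #include <immintrin.h>
    static size_t strlen16(const char *s) {
        const __m128i zero = _mm_setzero_si128();
        const char *p = s;
        unsigned mask;
        do {                                                    // 1 vector per iteration
            __m128i v = _mm_load_si128((const __m128i *)p);
            mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));  // bit set where byte == 0
            p += 16;
        } while (mask == 0);
        return (size_t)(p - s) - 16 + __builtin_ctz(mask);      // ctz compiles to bsf/tzcnt
    }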


Also an inline-asm implementation of it with a microbenchmark adapted from the
SO question.  (Compile with -DUSE_ASM -DREAD_ONLY to benchmark a fixed length
repeatedly)
https://godbolt.org/z/9tuVE5

It uses clock() for timing, which I didn't bother updating.  I made it possible
to run it for lots of iterations for consistent timing.  (And so the real work
portion dominates the runtime so we can use perf stat to measure it.)




If we only have 4-byte alignment, maybe check the first 4B, then do (p+4) & ~7
to either overlap that 4B again or not when we start 8B chunks.  But probably
it's good to get to 16-byte alignment and do whole SSE2 vectors, because
repeating an aligned 16-byte test that overlaps an 8-byte test costs the same
as doing another 8-byte test.  (Except on CPUs like Bobcat that split 128-bit
vectors into 64-bit halves).  The extra AND to round down to an alignment
boundary is all it takes, plus the code-size cost of peeling 1 iteration each
of 4B and 8B before a 16-byte loop.

We can use 4B / 8B with movd / movq instead of movdqa.  For pmovmskb, we can
ignore the compare-true results for the upper 8 bytes by testing the result
with `test %al,%al`, or in general with `test $0x0F, %al` to check only the low
4 bits of EAX for the 4-byte case.



The scalar bithack version can use BSF instead of CMOV binary search for the
byte with a set high bit.  That should be a win if we ever wanted to do scalar
on some x86 target especially with 8-byte registers, or on AArch64.  AArch64
can rbit / clz to emulate bsf and find the position of the first set bit.

(Without efficient SIMD compare result -> integer_mask, or efficient SIMD ->
integer at all on some ARM / AArch64 chips, SIMD compares for search loops
aren't always (ever?) a win.  IIRC, glibc strlen and memchr don't use vectors
on ARM / AArch64, just scalar bithacks.)

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #22 from Peter Cordes  ---
Nice, that's exactly the kind of thing I suggested in bug 80571.  If this
covers 

* vsqrtss/sd  (mem), %merge_into, %xmm
* vpcmpeqd    %same, %same, %dest   # false dep on KNL / Silvermont
* vcmptrueps  %same, %same, %ymm    # splat -1 without AVX2.  false dep on all known uarches

as well as int->FP conversions, then we could probably close that as fixed by
this as well.

bug 80571 does suggest that we could look for any cold reg, like a non-zero
constant, instead of requiring an xor-zeroed vector, so it might go slightly
beyond what this patch does.

And looking for known-to-be-ready dead regs from earlier in the same dep chain
could certainly be useful for non-AVX code-gen, allowing us to copy-and-sqrt
without introducing a dependency on anything that's not already ready.

(In reply to h...@gcc.gnu.org from comment #21)
> Author: hjl
> Date: Fri Feb 22 15:54:08 2019
> New Revision: 269119

[Bug target/80571] AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops

2019-02-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571

--- Comment #2 from Peter Cordes  ---
I think hjl's patch for PR 89071 / PR 87007 fixes (most of?) this, at least for
AVX.

If register pressure is an issue, using a reg holding an arbitrary constant
(instead of xor-zeroed) is a valid option, as this bug points out.  So I'm not
sure we should close this as a duplicate of those fixed bugs.

[Bug target/38959] Additional switches to disallow processor supplementary instructions

2019-02-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959

--- Comment #4 from Peter Cordes  ---
The __builtin_ia32_rdpmc being a pure function bug I mentioned in my previous
comment is already reported and fixed (in gcc9 only): bug 87550

It was present since at least gcc 5.0
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214

[Bug target/38959] Additional switches to disallow processor supplementary instructions

2019-02-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #3 from Peter Cordes  ---
We can maybe close this as fixed (if -march=i386 didn't exist/work at the time)
or invalid.  Or maybe we want to add some CPU-level awareness to code-gen for
__builtin_ia32_rdtsc / rdpmc / rdtscp.

The cmov / fcomi / fcomi proposed switches are already supported as part of
-march=pentium -mtune=generic or lower, e.g. -march=i386.  (The 32-bit default
is something like arch=i686 and tune=generic, with it being possible to
configure gcc so SSE2 is on by default in 32-bit code.)

Those are the important ones, because they're emitted automatically by the
compiler's back-end.  The other options would just be trying to save you from
yourself, e.g. rejecting source that contains __rdtsc() /
__builtin_ia32_rdtsc()



I'm not sure what the situation is with long NOPs.  GCC doesn't (normally?)
emit them, just using .p2align directives for the assembler.  In 32-bit mode,
GAS appears to avoid long NOPs, using either 2-byte xchg ax,ax or pseudo-nops
like   LEA esi,[esi+eiz*1+0x0] that add a cycle of latency to the dep chain
involving ESI.

Even with -march=haswell, gcc+gas fail to use more efficient long NOPs for
padding between functions.


---

I'm not sure if CPUID is ever emitted by gcc's back-end directly, only from
inline asm.  i386/cpuid.h uses inline asm.  But __get_cpuid_max() checks if
CPUID is even supported in a 386-compatible way, checking if a bit in EFLAGS is
sticky or not.  If your source code is written safely, you won't have a problem
unless possibly __builtin_cpu_init runs CPUID without checking, in programs
that use __builtin_cpu_supports() or _is().


__builtin_ia32_rdpmc() and __rdtsc() do *not* check -march= before emitting
rdpmc and rdtsc.  Neither does __rdtscp(), which is interesting because that
instruction is new enough that some still-relevant CPUs don't support it.

__rdpmc() isn't "volatile", though, so stop-start optimizes to 0.  (I found
this bug looking for existing reports of that issue.)



Test cases:  https://godbolt.org/z/hqPdza

FCMOV and CMOV are also handled correctly, but I didn't write functions for
them.

int fcomi(double x, double y) { return x < y; }

> Proposed switches:
> 
> --nocpuid  This option causes the compiler to not generate cpuid opcodes
> --nocmov   This option causes the compiler to not generate cmov opcodes
> --nofcmov  This option causes the compiler to not generate fcmov opcodes
> --nofcomi  This option causes the compiler to not generate fcomi opcodes
> --nonopl   This option causes the compiler to not generate fcomi opcodes
> --nordpmc  This option causes the compiler to not generate rdpmc opcodes
> --nordtsc  This option causes the compiler to not generate rdtsc opcodes
> 
> Possibly a general switch that is equivalent to all of the above
> 
> --nosupplementaryinstructions
> 
> Rationale
> 
> It is possible that a developer still wants to compile for a particular
> architecture (for example the i486), but does not wish to generate code with
> supplementary instructions (such as cpuid), that may be present on that
> architecture.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #15 from Peter Cordes  ---
(In reply to Uroš Bizjak from comment #13)
> I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> currently don't emit XOR clear in front of these instrucitons, when they
> operate with memory input.

They *do* have an output dependency.  It might or might not actually be a
problem worth clogging the front-end with extra uops to avoid, depending on
surrounding code. >.<

e.g. ROUNDSD:  DEST[127:64] remains unchanged
Thanks, Intel.  You'd think by SSE4.1 they would have learned that false
dependencies suck, and that it's extremely rare to actually take advantage of
this merge behaviour, but no.

For register-source ROUNDSD / ROUNDSS, we can use ROUNDPD / ROUNDPS which write
the full destination register and have identical performance on all CPUs that
support them.  (Except Silvermont, where roundps/pd have 5c latency vs. 4c for
roundss/sd.  Goldmont makes them equal.)  KNL has faster (V)ROUNDPS/D than
ROUNDSS/SD, maybe only because of the SSE encoding?  Agner Fog isn't clear, and
doesn't have an entry that would match vroundss/sd.

Copy-and-round is good for avoiding extra MOVAPS instructions which can make
SSE code front-end bound, and reduce the effective size of the out-of-order
window.

Preserving FP exception semantics for packed instead of scalar register-source:

* if the upper element(s) of the source is/are known 0, we can always do this
with sqrt and round, and convert: they won't produce any FP exceptions, not
even inexact.  (But not rsqrt / rcpps, of course.)
  This will be the case after a scalar load, so if we need the original value
in memory *and* the result of one of these instructions, we're all set.

* with rounding, the immediate can control masking of precision exceptions, but
not Invalid which is always raised by SRC = SNaN.  If we can rule out SNaN in
the upper elements of the input, we can use ROUNDPS / ROUNDPD

roundps/d can't produce a denormal output.  I don't think denormal inputs slow
it down on any CPUs, but worth checking for cases where we don't care about
preserving exception semantics and want to use it with potentially-arbitrary
garbage in high elements.


rsqrtps can't produce a denormal output because sqrt makes the output closer to
1.0 (reducing the magnitude of the exponent).  (And thus neither can sqrtps.) 
SQRTPS/PD is the same performance as SQRTSS/SD on new CPUs, but old CPUs that
crack 128-bit ops into 64-bit are slower: Pentium III, Pentium M, and Bobcat. 
And Jaguar for sqrt.  Also Silvermont is *MUCH* slower for SQRTPD/PS than
SD/SS, and even Goldmont Plus has slower packed SQRT, RSQRT, and RCP than
scalar.

But RCPPS can produce a denormal.  (double)1.0/FLT_MAX = 2.938736e-39, which is
smaller than FLT_MIN = 1.175494e-38



So according to Agner's tables:

* ROUNDPS/PD is never slower than ROUNDSS/SD on any CPU that support them.
* SQRTPS/PD *are* slower than scalar on Silvermont through Goldmont Plus, and
Bobcat, Nano 3000, and P4 Prescott/Nocona.  By about a factor of 2, enough that
we should probably care about it for tune=generic.  For ss/ps only (not double),
also K10 and Jaguar have slower sqrtps than ss.  Also in 32-bit mode, P4,
Pentium M and earlier Intel, and Atom, are much slower for packed than scalar
sqrt.
  SQRTPD is *faster* than SQRTSD on KNL.  (But hopefully we're never tuning for
KNL without AVX available.)

* RSQRT / RCP: packed is slower on Atom, Silvermont, and Goldmont (multi-uop so
a big decode stall).  Somewhat slower on Goldmont Plus (1 uop but half
throughput).  Also slower on Nano3000, and slightly slower on Pentium 4 (before
and after Prescott/Nocona), and KNL.  (But hopefully KNL can always use
VRSQRT28PS/PD or scalar)
  Pentium M and older again decode as at least 2 uops for packed, same as
Bobcat and K8.
  Same performance for packed vs. scalar on Jaguar, K10, bdver1-4, ryzen, Core2
and later, and SnB-family.

* CVTSS2SD vs. PD, and SD2SS vs. PD2PS
  packed is slower on k8, bdver1-4 (scalar avoids the shuffle uop), Nano3000,
KNL.  On Silvermont by just 1 cycle latency (so  even a MOVAPS on the critical
path would make it equal.)  Similar on Atom.  Slower on CPUs that do 128-bit
vectors as two 64-bit uops, like Bobcat, and Pentium M / K8 and older.

  packed is *faster* on K10, Goldmont/GDM Plus (same latency, 1c vs. 2c
throughput), Prescott, P4.  Much faster on Jaguar (1c vs. 8c throughput, and 1
uop vs. 2).

  same speed (but without the false dep) for SnB-family (mostly), Core 2,
Ryzen.

  Odd stuff: Agner reports:
Nehalem: ps2pd = 2 uops / 2c, ss2sd = 1 uop / 1c.  (I guess just
zero-padding the significand, no rounding required).  pd2ps and sd2ss are equal
at 2 uops / 4c latency.
SnB: cvtpd2ps is 1c higher latency than sd2ss.
IvB: ps2pd on IvB is 1c vs. 2c for ss2sd
On HSW and later things have settled down to 

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #6 from Peter Cordes  ---
Oops, these were SD not SS.  Getting sleepy >.<.  Still, my optimization
suggestion for doing both compares in one masked SUB of +-PBCx applies equally.

And I think my testing with VBLENDVPS should apply equally to VBLENDVPD.

Since this is `double`, if we're going branchless we should definitely be
vectorizing for a pair of doubles, like doing 

xij = X0(1,i) - X0(1,j)   and 
yij = X0(2,i) - X0(2,j)

together with a vmovupd, and a vector of PBCx, PBCy.

Even if we later need both x and y separately (if those FMAs in the asm are
multiplying components of one vector), we might still come out ahead from doing
the expensive input processing with PD, then it's only one `vunpckhpd` to get
the Y element ready, and that can run in parallel with any x * z stuff

Or if we can unroll by 3 SIMD vectors over contiguous memory, we can get
{X0,Y0} {Z0,X1} {Y1,Z1}.  We get twice the work for a cost of only 3 extra
unpacks, doing 2 i and j values at once.



If this was 3 floats, using a SIMD load would be tricky (maybe vmaskmovps if we
need to avoid going off the end), unless we again unroll by 3 = LCM(vec_len,
width)

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #5 from Peter Cordes  ---
   IF ( xij.GT.+HALf ) xij = xij - PBCx
   IF ( xij.LT.-HALf ) xij = xij + PBCx

For code like this, *if we can prove only one of the IF() conditions will be
true*, we can implement it more efficiently, I think, by checking the magnitude
of xij to see if a SUB is needed, and if so figuring out the sign to apply to
PBCx.

if (abs(xij) > HALF) {
    xij -= PBCx XOR sign_bit(xij);
}


    # xij  in  xmm0
    # PBCx in  xmm7
    # HALF in  xmm6
    # set1( -0.0f ) in xmm5 (i.e. 1U<<31, a sign-bit mask)
    vandnps    %xmm0, %xmm5, %xmm1   # abs(xij)
    vcmpltps   %xmm1, %xmm6, %xmm1   # HALF < abs(xij)

    vandps     %xmm5, %xmm0, %xmm2   # signbit(xij)
    vxorps     %xmm7, %xmm2, %xmm2   # PBCx (xij>=0) or -PBCx (xij<0)

    vandps     %xmm2, %xmm1, %xmm1   # +-PBCx, or 0.0 if abs(xij) is between -+HALF
    vsubps     %xmm1, %xmm0, %xmm0   # xij -= PBCx, -PBCx, or 0.0

There's a good amount of ILP here, but the critical path is ANDPS + CMPPS +
ANDPS + SUBPS = 10 cycles on Skylake.

We might want to use VPAND for some of this on Haswell, to avoid a port 5
bottleneck at least on the critical path.  (Skylake runs FP booleans on any
port.  BDW and earlier restrict them to port 5 where they can't compete with
FMA, and where bypass latency is always optimal.  On SKL they can introduce
extra bypass latency if they pick p0 or p1.)
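For reference, the same dataflow written with SSE intrinsics (a sketch;
variable names follow the register comments above):

    __m128 signmask = _mm_set1_ps(-0.0f);                 // 1U<<31 in each lane
    __m128 absx = _mm_andnot_ps(signmask, xij);           // abs(xij)
    __m128 need = _mm_cmplt_ps(half, absx);               // HALF < abs(xij)
    __m128 spbc = _mm_xor_ps(_mm_and_ps(signmask, xij), pbcx);  // +-PBCx, sign of xij
    xij = _mm_sub_ps(xij, _mm_and_ps(need, spbc));        // subtract +-PBCx or 0.0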



    vandps     %xmm5, %xmm0, %xmm2   # signbit(xij)
    vxorps     %xmm7, %xmm2, %xmm2   # PBCx (xij>=0) or -PBCx (xij<0)

could be replaced with a (v)blendvps using the original xij to select between
PBCx and -PBCx.  With the SSE encoding, that saves a uop and a cycle of latency
(but only off the critical path).  And I think it would cost us a vmovaps to
set up for it.

---

I think this is better than IF-conversion of both IFs separately, but I haven't
really looked.  It should be much better for *latency*.  But it's only
equivalent if subtracting PBCx can't possibly make xij negative and the next IF
condition also true.

---

I was looking at a similar case of applying a fixup if the abs value of an
input is outside a range in
https://stackoverflow.com/questions/54364694/how-to-convert-scalar-code-of-the-double-version-of-vdts-pade-exp-fast-ex-app/54377840#54377840.
 I don't think I came up with anything there that's not already obvious or
covered by the example above, though.

Except if we had needed to square xij at some point, we could have checked  xij
* xij < HALF*HALF as the bound condition to save the ANDNPS.  But then the
mulps latency is part of the input to cmpps.

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #4 from Peter Cordes  ---
I suspect dep-chains are the problem, and branching to skip work is a Good
Thing when it's predictable.

(In reply to Richard Biener from comment #2)
> On Skylake it's better (1uop, 1 cycle latency) while on Ryzen even better.
> On Bulldozer it also isn't that bad (comparable to Skylake I guess).

SKL: AVX VBLENDVPS x,x,x,x  is 2 uops, 2c latency, ~1c throughput.  (Same for
ymm)
SKL: SSE4 BLENDVPS x,x,xmm0 is 1 uop,  1c latency, ~0.36c throughput in my
testing, or maybe 0.333c with breaking dep chains.  (IDK how Agner got 1c.
Maybe that was an editing mistake, and he copied the 1c from the VEX
version.)


[V](P)BLENDV(B|PS|PD) is funny: the SSE versions are 1 uop on SKL, I assume
because they only have 3 register operands (including implicit XMM0).  But the
VEX encoding has 4 operands: 1 output and 3 inputs.  I think this is too many
for 1 uop to encode, and that's why VBLENDVPS is 2 uops even on Skylake.

(The blend-control register is encoded by an imm8 in the VEX version instead of
being implicit xmm0, but I don't think that's what stops the decoders from
making it 1 uop.  I think it's simply having 4 total operands.)

On Skylake, the uop(s) for [V]BLENDVPS/D and [V]PBLENDVB can run on any of p015
(instead of only p5 on BDW and earlier), but the 2-uop VEX version is still 2
cycle latency.  The VEX version has a bias towards port 5, but less than half
the total uops run on p5 so it's not p015 + p5.  The SSE version seems equally
distributed to all of p015.



On SKL, the optimal choice might be to use the SSE encoding, if we can deal
with a destructive destination and having the blend control in xmm0.

The SSE/AVX penalty on SKL is output dependencies for write-only SSE
instructions (like movaps or cvtps2dq) writing to an XMM register that has a
dirty upper 128.  It's a per-register thing, unlike Haswell where it triggers a
slow state transition.
(https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake)

---

Footnote: VBLENDVPS throughput is only 1c for a big block of it back-to-back,
even though it's only 2 uops that can run on any of 3 ports.  So why isn't it
0.66c throughput?

VBLENDVPS throughput (for back-to-back vblendvps) seems to be limited by some
front-end effect.  In an unrolled loop with 20 vblendvps (with no loop-carried
dependencies), there are a negligible amount of cycles where the front-end
delivered the full 4 uops.  Most cycles only 2 are issued.

This is not a general problem for 2-uop instructions or anything: 9x bextr +
dec/jnz = 19 uops total runs at 5.00c / iter, or 3.8 uops / clock, with the
only cycle to not issue 4 uops being (I think) the group of 3 including the
loop branch.  Playing around with other 2 uops instructions, I didn't see
front-end bottlenecks.  I saw some back-end bottlenecks because other 2-uop
instructions aren't so nicely distributed over ports, but perf counts for 
idq_uops_not_delivered.cycles_fe_was_ok:u generally equaled total cycles. 
 (It counts when either the FE delivers 4 uops, or the back end was stalled and
thus not the front-end's fault.)

A 1 uop instruction following a vblendvps can issue with it in the same cycle,
so this effect is probably not horrible for normal cases where we're using
vblendvps mixed with normal instructions.

I haven't investigated further, whether this is a front-end effect (uop cache
fetch problem?) or whether it's an allocation bottleneck.  Possibly being a
4-operand instruction has something to do with it, although I don't think each
individual uop can have that many.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #10 from Peter Cordes  ---
(In reply to Uroš Bizjak from comment #9)
> There was similar patch for sqrt [1], I think that the approach is
> straightforward, and could be applied to other reg->reg scalar insns as
> well, independently of PR87007 patch.
> 
> [1] https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00202.html

Yeah, that looks good.  So I think it's just vcvtss2sd and sd2ss, and
VROUNDSS/SD that aren't done yet.

That patch covers VSQRTSS/SD, VRCPSS, and VRSQRTSS.

It also bizarrely uses it for VMOVSS, which gcc should only emit if it actually
wants to merge (right?).  *If* this part of the patch isn't a bug

-   return "vmovss\t{%1, %0, %0|%0, %0, %1}";
+   return "vmovss\t{%d1, %0|%0, %d1}";

then even better would be vmovaps %1, %0 (which can benefit from
mov-elimination, and doesn't need a port-5-only ALU uop.)  Same for vmovsd of
course.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #8 from Peter Cordes  ---
Created attachment 45544
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45544&action=edit
testloop-cvtss2sd.asm

(In reply to H.J. Lu from comment #7)
> I fixed assembly codes and run it on different AVX machines.
> I got similar results:
> 
> ./test
> sse  : 28346518
> sse_clear: 28046302
> avx  : 28214775
> avx2 : 28251195
> avx_clear: 28092687
> 
> avx_clear:
>   vxorps  %xmm0, %xmm0, %xmm0
>   vcvtsd2ss   %xmm1, %xmm0, %xmm0
>   ret
> 
> is slightly faster.


I'm pretty sure that's a coincidence, or an unrelated microarchitectural effect
where adding any extra uop makes a difference.  Or just chance of code
alignment for the uop-cache (32-byte or maybe 64-byte boundaries).

You're still testing with the caller compiled without optimization.  The loop
is a mess of sign-extension and reloads, of course, but most importantly
keeping the loop counter in memory creates a dependency chain involving
store-forwarding latency.

Attempting a load later can make it succeed more quickly in store-forwarding
cases, on Intel Sandybridge-family, so perhaps an extra xor-zeroing uop is
reducing the average latency of the store/reloads for the loop counter (which
is probably the real bottleneck.)

https://stackoverflow.com/questions/49189685/adding-a-redundant-assignment-speeds-up-code-when-compiled-without-optimization

Loads are weird in general: the scheduler anticipates their latency and
dispatches uops that will consume their results in the cycle when it expects a
load will put the result on the forwarding network.  But if the load *isn't*
ready when expected, it may have to replay the uops that wanted that input. 
See
https://stackoverflow.com/questions/54084992/weird-performance-effects-from-nearby-dependent-stores-in-a-pointer-chasing-loop
for a detailed analysis of this effect on IvyBridge.  (Skylake doesn't have the
same restrictions on stores next to loads, but other effects can cause
replays.)

https://stackoverflow.com/questions/52351397/is-there-a-penalty-when-baseoffset-is-in-a-different-page-than-the-base/52358810#52358810
is an interesting case for pointer-chasing where the load port speculates that
it can use the base pointer for TLB lookups, instead of the base+offset. 
https://stackoverflow.com/questions/52527325/why-does-the-number-of-uops-per-iteration-increase-with-the-stride-of-streaming
shows load replays on cache misses.

So there's a huge amount of complicating factors from using a calling loop that
keeps its loop counter in memory, because SnB-family doesn't have a simple
fixed latency for store forwarding.





If I put the tests in a different order, I sometimes get results like:

./test
sse  : 26882815
sse_clear: 26207589
avx_clear: 25968108
avx  : 25920897
avx2 : 25956683

Often avx (with the false dep on the load result into XMM1) is slower than
avx_clear or avx2, but there's a ton of noise.



Adding vxorps  %xmm2, %xmm2, %xmm2  to avx.S also seems to have sped it up; now
it's the same speed as the others, even though I'm *not* breaking the
dependency chain anymore.  XMM2 is unrelated, nothing touches it.

This basically proves that your benchmark is sensitive to extra instructions,
whether they interact with vcvtsd2ss or not.


We know that in the general case, throwing in extra NOPs or xor-zeroing
instructions on unused registers does not make code faster, so we should
definitely distrust the result of this microbenchmark.




I've attached my NASM loop.  It has various commented-out loop bodies, and
notes in comments on results I found with performance counters.  I don't know
if it will be useful (because it's a bit messy), but it's what I use for
testing snippets of asm in a static binary with near-zero startup overhead.  I
just run perf stat on the whole executable and look at cycles / uops.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #6 from Peter Cordes  ---
(In reply to Peter Cordes from comment #5)
> But whatever the effect is, it's totally unrelated to what you were *trying*
> to test. :/

After adding a `ret` to each AVX function, all 5 are basically the same speed
(compiling the C with `-O2` or -O2 -march=native), with just noise making it
hard to see anything clearly.  sse_clear tends to be faster than sse in a group
of runs, but if there are differences it's more likely due to weird front-end
effects and all the loads of inputs + store/reload of the return address by
call/ret.

I did  while ./test;  : ;done   to factor out CPU clock-speed ramp up and maybe
some cache warmup stuff, but it's still noisy from run to run.  Making
printf/write system calls between tests will cause TLB / branch-prediction
effects because of kernel spectre mitigation, so I guess every test is in the
same boat, running right after a system call.

Adding loads and stores into the mix makes microbenchmarking a lot harder.

Also notice that since `xmm0` and `xmm1` pointers are global, those pointers
are reloaded every time through the loop even with optimization.  I guess
you're not trying to minimize the amount of work outside of the asm functions,
to measure them as part of a messy loop.  So for the version that have a false
dependency, you're making that dependency on the result of this:

    mov     rax, QWORD PTR [rip+0x2ebd]    # reload xmm1
    vmovapd xmm1, XMMWORD PTR [rax+rbx*1]  # index xmm1

Anyway, I think there's too much noise in the data, and lots of reason to
expect that vcvtsd2ss %xmm0, %xmm0, %xmm1 is strictly better than
VPXOR+convert, except in cases where adding an extra uop actually helps, or
where code-alignment effects matter.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #5 from Peter Cordes  ---
(In reply to H.J. Lu from comment #4)
> (In reply to Peter Cordes from comment #2)

> >  Can you show some
> > asm where this performs better?
> 
> Please try cvtsd2ss branch at:
> 
> https://github.com/hjl-tools/microbenchmark/
> 
> On Intel Core i7-6700K, I got

I have the same CPU.

> [hjl@gnu-skl-2 microbenchmark]$ make
> gcc -g -I.-c -o test.o test.c
> gcc -g   -c -o sse.o sse.S
> gcc -g   -c -o sse-clear.o sse-clear.S
> gcc -g   -c -o avx.o avx.S
> gcc -g   -c -o avx2.o avx2.S
> gcc -g   -c -o avx-clear.o avx-clear.S
> gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
> ./test
> sse  : 24533145
> sse_clear: 24286462
> avx  : 64117779
> avx2 : 62186716
> avx_clear: 58684727
> [hjl@gnu-skl-2 microbenchmark]$

You forgot the RET at the end of the AVX functions (but not the SSE ones); The
AVX functions fall through into each other, then into __libc_csu_init before
jumping around and eventually returning.  That's why they're much slower. 
Single-step through the loop in GDB...

   │0x5660                      vcvtsd2ss xmm0,xmm0,xmm1
  >│0x5664                      nop    WORD PTR cs:[rax+rax*1+0x0]
   │0x566e                      xchg   ax,ax
   │0x5670                      vcvtsd2ss xmm0,xmm1,xmm1
   │0x5674                      nop    WORD PTR cs:[rax+rax*1+0x0]
   │0x567e                      xchg   ax,ax
   │0x5680                      vxorps xmm0,xmm0,xmm0
   │0x5684                      vcvtsd2ss xmm0,xmm0,xmm1
   │0x5688                      nop    DWORD PTR [rax+rax*1+0x0]
   │0x5690 <__libc_csu_init>    endbr64
   │0x5694 <__libc_csu_init+4>  push   r15
   │0x5696 <__libc_csu_init+6>  mov    r15,rdx

And BTW, SSE vs. SSE_clear are about the same speed because your loop
bottlenecks on the store/reload latency of keeping a loop counter in memory
(because you compiled the C without optimization).  Plus, the C caller loads
write-only into XMM0 and XMM1 every iteration, breaking any loop-carried
dependency the false dep would create.

I'm not sure why it makes a measurable difference to run the extra NOPS, and 3x
vcvtsd2ss instead of 1 for avx() vs. avx_clear(), because the C caller should
still be breaking dependencies for the AVX-128 instructions.

But whatever the effect is, it's totally unrelated to what you were *trying* to
test. :/

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #3 from Peter Cordes  ---
(In reply to H.J. Lu from comment #1)
> I have a patch for PR 87007:
> 
> https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html
> 
> which inserts a vxorps at the last possible position.  vxorps
> will be executed only once in a function.

That's talking about the mem,reg case, which like I said is different.  I
reported Bug 80571 a while ago about the mem,reg case (or gp-reg for si2ss/d),
so it's great that you have a fix for that, doing one xor-zeroing and reusing
that as a merge target for a whole function / loop.

But this bug is about the reg,reg case, where I'm pretty sure there's nothing
to be gained from xor-zeroing anything.  We can fully avoid any false dep just
by choosing both source registers = src, making the destination properly
write-only.

If you *have* an xor-zeroed register, there's no apparent harm in using it as
the merge-target for a reg-reg vcvt, vsqrt, vround, or whatever, but there's no
benefit either vs. just setting both source registers the same.  So whichever
is easier to implement, but ideally we want to avoid introducing a vxorps into
functions / blocks that don't need it at all.

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #2 from Peter Cordes  ---
(In reply to H.J. Lu from comment #1)
> But
> 
>   vxorps  %xmm0, %xmm0, %xmm0
>   vcvtsd2ss   %xmm1, %xmm0, %xmm0
> 
> are faster than both.

On Skylake-client (i7-6700k), I can't reproduce this result in a hand-written
asm loop.  (I was using NASM to make a static executable that runs a 100M
iteration loop so I could measure with perf).  Can you show some asm where this
performs better?

vcvtsd2ss src-reg,dst,dst is always 2 uops, regardless of the merge destination
being an xor-zeroed register.  (Either zeroed outside the loop, or inside, or
once per 4 converts with an unrolled loop.)

I can't construct a case where  vcvtsd2ss %xmm1, %xmm1, %xmm0  is worse in any
way (dependencies, uops, latency, throughput) than VXORPS + vcvtsd2ss with dst
= middle source.  I wasn't mixing it with other instructions other than VXORPS,
but I don't think anything is going to get rid of its 2nd uop, and choosing
both inputs = the same source removes any benefit from dep-breaking the output.

If adding a VXORPS helped, its probably due to some other side-effect.

Could the effect you saw have been due to code-gen changes for memory sources,
maybe  vxorps + vcvtsd2ss (mem), %xmm0, %xmm0   vs.  vmovsd + vcvtsd2ss %xmm1,
%xmm1, %xmm0?  (Those should be about equal, but memory-source SS2SD is
cheaper, no port5 uop.)



BTW, the false-dependency effect is much more obvious with SS2SD, where the
latency from src1 to output is 4 cycles, vs. 1 cycle for SD2SS.

Even without dependency-breaking, repeated

 vcvtsd2ss  %xmm1, %xmm0, %xmm0

can run at 1 per clock (same as with dep breaking), because the port-5 uop that
merges into the low 32 bits of xmm0 with 1 cycle latency is 2nd.  So latency
from xmm0 -> xmm0 for that [v]cvtsd2ss %xmm1, %xmm0 is 1 cycle.

With dep-breaking, they both still bottleneck on the port5 uop if you're doing
nothing else.

[Bug target/80586] vsqrtss with AVX should avoid a dependency on the destination register.

2019-01-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80586

Peter Cordes  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #1 from Peter Cordes  ---
Fixed for vsqrtss/sd somewhere in 9.0, but not 8.2.  
https://godbolt.org/z/0Gxf05.

The general case of one-input scalar xmm,xmm instructions like vcvtss2sd is
still all over the place, with false deps or wasted xor-zeroing.  Reported that
as bug 89071

It seems only VSQRTsd/ss itself was fixed for this; sorry I didn't think of
checking for other one-input instructions when I reported this.

[Bug target/89071] New: AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double

2019-01-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

Bug ID: 89071
   Summary: AVX vcvtsd2ss lets us avoid PXOR dependency breaking
for scalar float<->double
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

float cvt(double unused, double xmm1) { return xmm1; }

g++ (GCC-Explorer-Build) 9.0.0 20190120 (experimental):

vxorps  %xmm0, %xmm0, %xmm0
vcvtsd2ss   %xmm1, %xmm0, %xmm0    # merge into XMM0

clang7.0
vcvtsd2ss   %xmm1, %xmm1, %xmm0    # both sources are from XMM1, no false dep

gcc already uses this trick for SQRTSS/SD, but not for float<->double
conversion.  I haven't checked all the other scalar instructions, but roundss
for floor() does neither and has a false dependency.  (i.e. it chooses the
output register as the merge-target, not the actual input.)

 return floorf(x);  ->   vroundss   $9, %xmm1, %xmm0, %xmm0

Some testcases:

https://godbolt.org/z/-rqUVZ


---

In SSE, one-input scalar instructions like CVT* and SQRTSS/SD have an output
dependency because of Intel's short-sighted ISA design optimizing for
Pentium-III's 64-bit SIMD: zero-extending to fill the destination XMM register
would have cost an extra uop to write the upper half of the destination.

For consistency(?), SSE2 scalar instructions (new with Pentium 4 which had
128-bit SIMD execution units / register file) have the same behaviour of
merging into the low 64 bits of the destination, even conversion between double
and float between two xmm registers, which didn't exist before SSE2. 
(Previously conversion instructions were only between float in XMM and integers
in scalar or MMX regs, or packed-integer <-> ps which filled the whole XMM reg
and thus avoided a false dependency).

(Fortunately this isn't a problem for 2-input instructions like ADDSS: the
operation already depends on both registers.)

---

The VEX encoding makes the merge-target separate from the actual destination,
so we can finally avoid false dependencies without wasting an instruction
breaking it.  (When the source is already in an XMM register).


For instructions where the source isn't an XMM register (e.g. memory or integer
reg for int->FP conversions), one zeroed register can be used as a read-only
merge target by any number of scalar AVX instructions, including in a loop. 
That's bug 80571.


(It's unfortunate that Intel didn't take the opportunity to give the AVX
versions subtly different semantics, and zero-extend into the target register. 
That would probably have enabled vcvtsd2ss to be single-uop instead of 2 on
Sandybridge-family.  IDK if they didn't think of that, or if they wanted strict
consistency with the semantics of the SSE version, or if they thought decoding
/ internals would be easier if they didn't have to omit the
merge-into-destination part of the scalar operation.  At least they made the
extra dependency an explicit input, so we can choose a register other than the
destination, but it's so rarely useful to actually merge into the low 64 or 32
of another reg that it's just long-term harmful to gimp the ISA with an extra
dependency for these instructions, especially integer->FP.)



(I suspect that most of the dep-breaking gcc does isn't gaining any speed, but
the trick is figuring out when we can omit it while being sure that we don't
couple things into one big loop-carried chain, or serialize some things that
OoO exec could otherwise benefit from hiding.  Within one function with no
calls, we might be able to prove that a false dep isn't serializing anything
important (e.g. if there's already enough ILP and something else breaks a dep
on that register between loop iterations), but in general it's hard if we can't
pick a register that was already part of the dep chain that led to the input
for this operation, and thus is harmless to introduce a dep on.)



Relevant instructions that can exist in scalar xmm,xmm form:

VROUNDSS/SD  (gcc leaves a false dep, clang gets it right)

VSQRTSS/SD  (gcc already gets this right)
VRCPSS
VRSQRTSS  haven't checked

[V]CVTSS2SD xmm,xmm  (Skylake: SRC1/output dependency is a separate 1c latency
32-bit merge uop)
  The memory-source version is still 2 uops.

[V]CVTSD2SS xmm,xmm  (Skylake: SRC1/output dependency is the main 4c conversion
uop, the extra uop is first, maybe extracting 32 bits from the src?)
 The memory-source version of [V]CVTSD2SS is only 1 uop!

So avoiding a false dep by loading with MOVSS/MOVSD and then using the reg-reg
version is a bad idea for CVTSD2SS.  It's actually much better to PXOR and then
CVTSD2SS (mem), %xmm, so clang's strategy of loading and then reg-reg
conversion is a missed-optimization.

I have

[Bug target/89063] [x86] lack of support for BEXTR from BMI extension

2019-01-25 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
Unfortunately Intel Haswell/Skylake implement BEXTR as 2 uops with 2c latency. 
Presumably those uops are a shift + bzhi, so 1p06 + 1p15 would explain Agner
Fog's experimental result of 2p0156 for BEXTR, with 0.5c throughput.

On AMD Excavator/Ryzen, it's 1 uop with 1c latency.  On Steamroller and
earlier, it's 2 uops but 1c latency.  (I assume that's latency from the
non-control input to the output.  So maybe one of the uops pre-processes the
control input, otherwise you'd expect 2c latency from either operand.)  Ryzen
dropped support for AMD TBM, so only Excavator (bdver4) has 1-uop bextr imm16
which would avoid the need for mov reg,imm32 with the control operand.  But
mov-imm + bextr can still be a win on Ryzen, lower latency than RORX+AND

BMI2 RORX is single-uop on all CPUs that support it.  If we already need a 2nd
uop to mask anyway, we can use RORX+AND-immediate to duplicate the
functionality and performance of BEXTR-immediate, with the smaller code-size if
the AND-mask fits in an imm8.  (5+5 vs. 6+3  or 6+4 if the AND needs a REX)
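
For reference, a minimal GNU C sketch (function name hypothetical) of the bitfield-extract pattern under discussion; with -mbmi2 it can compile to RORX + AND, or to mov-imm + BEXTR with only BMI1:

    #include <stdint.h>
    // extract a 6-bit field starting at bit 22, matching the disassembly below
    static inline unsigned extract_6_at_22(uint64_t x) {
        return (x >> 22) & 0x3f;
    }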

Without an immediate-source BEXTR (like AMD TBM has/had), the only advantage
mov-immediate+bextr has (on Intel) over mov-reg+shift+and is that it can deal with
wide bitfields using a count instead of an immediate AND mask.  (Especially if
it doesn't fit in 32 bits).

If you can reuse the same control-register in a loop, BEXTR is good-ish for
copy-and-extract.

PEXT is 1 uop on Intel CPUs even though the simpler-looking BEXTR is 2.  But
PEXT is extremely slow on Ryzen (7 uops, 18c lat and tput).  So for 32-bit
constants at least, mov r32,imm32 + PEXT to copy-and-extract is better than
BEXTR on Intel.  movabs imm64 is too big and can cause front-end problems
(slower to read from the uop cache, if that effect from Sandybridge is still
present on Haswell/Skylake), and has no advantage vs. RORX + AND unless the
bitfield you're extracting is wider than 32 bits.

PEXT has 3 cycle latency, though, and can only run on port 1 on SnB-family. 
(All integer uops with latency > 1 are p1-only).  It's potentially good for
throughput, but worse than RORX+AND for latency.
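
A hedged sketch of the mov-imm + PEXT alternative for the same field (BMI2; function name hypothetical):

    #include <stdint.h>
    #include <immintrin.h>
    static inline unsigned extract_6_at_22_pext(uint64_t x) {
        // the 32-bit mask constant becomes a mov r32,imm32; PEXT gathers the
        // selected bits down to the bottom of the result
        return (unsigned)_pext_u64(x, 0x3fULL << 22);
    }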

Unfortunately x86 bitfield instructions are pretty weak compared to ARM /
AArch64 ubfx or PowerPC rlwinm and friends, where the bit-positions are simply
specified as immediates.  Only AMD's immediate version of BEXTR (1 uop on
Excavator) matched them.  Having a bunch of different control operands for
BEXTR or PEXT in registers might be usable in a loop, but a lot more rarely
useful than immediate controls.




 :
   0:   c4 e3 fb f0 c7 2a    rorx   $0x2a,%rdi,%rax   # $(64-22)
   6:   c4 e3 fb f0 d7 35    rorx   $0x35,%rdi,%rdx   # $(64-11)
   c:   83 e7 3f             and    $0x3f,%edi
   f:   83 e0 3f             and    $0x3f,%eax
  12:   83 e2 3f             and    $0x3f,%edx
  15:   01 f8                add    %edi,%eax         # 32-bit operand-size because we can prove it can't overflow
  17:   01 d0                add    %edx,%eax         # missed optimization in both gcc's versions.
  19:   c3                   retq

Not counting the ret, this is 7 uops for Skylake and Ryzen.  **I'm pretty sure
this is our best bet for -march=skylake, and for tune=generic -mbmi2**

The BEXTR intrinsics version is 9 uops for SKL, 7 for Ryzen, but is 2 bytes
larger.  (not counting the savings from avoiding a REX prefix on the ADD
instructions; that missed optimization applies equally to both.)  OTOH, the
critical path latency for BEXTR on Ryzen is better by 1 cycle, so we could
still consider it for -march=znver1.  Or for tune=generic -mbmi without BMI2.

The legacy mov+shr+and version is 10 uops because gcc wasted a `mov %rdi,%rax`
instruction; it *should* be 9 uops for all normal CPUs.

---

With only BMI1 but not BMI2 enabled, we should probably use the mov-imm + BEXTR
version.  It's not worse than the mov+shr+and version on SnB-family or bd/zn,
and it's better on some AMD.  And it's probably smaller code-size.

And in future if Intel designs CPUs that can handle BEXTR as a single uop with
1c latency, mov+bextr will become good-ish everywhere.


For code-size, BEXTR has a definite advantage for bitfields wider than 1 byte,
because AND $imm32, %r32 is 6 bytes long instead of 3.

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2018-08-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

--- Comment #4 from Peter Cordes  ---
The VPAND instructions in the 256-bit version are a missed-optimization.

I had another look at this with current trunk.  Code-gen is similar to before
with -march=skylake-avx512 -mprefer-vector-width=512.  (If we improve code-gen
for that choice, it will make it a win in more cases.)

https://godbolt.org/g/2dfkNV

Loads are folding into the shifts now, unlike with gcc7.3.  (But they can't
micro-fuse because of the indexed addressing mode.  A pointer increment might
save 1 front-end uop even in the non-unrolled loop)

The separate integer loop counter is gone, replaced with a compare against an
end-index.

But we're still doing 2x vpmovwb + vinserti64x4 instead of vpackuswb + vpermq. 
Fewer instructions and (more importantly) 1/3 the shuffle uops.  GCC knows how
to do this for the 256-bit version, so it's apparently a failure of the
cost-model that it doesn't for the 512-bit version.  (Maybe requiring a
shuffle-control vector instead of immediate puts it off?  Or maybe it's
counting the cost of the useless vpand instructions for the pack / permq
option, even though they're not part of the shuffle-throughput bottleneck?)



We do use vpackuswb + vpermq for 256-bit, but we have redundant AND
instructions with set1_epi16(0x00FF) after a right shift already leaves the
high byte zero.
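
For reference, a hedged AVX2 intrinsics sketch (function name hypothetical) of the pack + permute strategy described above, narrowing two vectors of 16-bit values (high bytes already zeroed) into one vector of bytes:

    #include <immintrin.h>
    static inline __m256i pack_u16_to_u8(__m256i a, __m256i b) {
        __m256i packed = _mm256_packus_epi16(a, b);      // vpackuswb: in-lane pack
        // vpermq with an immediate fixes the lane interleaving left by the in-lane pack
        return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
    }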

---

Even if vmovdqu8 is not slower, it's larger than AVX vmovdqu.  GCC should be
using the VEX encoding of an instruction whenever it does exactly the same
thing.  At least we didn't use vpandd or vpandq EVEX instructions.

(I haven't found any confirmation about vmovdqu8 costing an extra ALU uop as a
store with no masking.  Hopefully it's efficient.)


[Bug rtl-optimization/86352] New: setc/movzx introduced into loop to provide a constant 0 value for a later rep stos

2018-06-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352

Bug ID: 86352
   Summary: setc/movzx introduced into loop to provide a constant
0 value for a later rep stos
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

The wrong-code bug 86314 also revealed some very weird code-gen decisions,
which the fix didn't improve.

(I think the lock bts peephole is seen pretty late, and that's one necessary
factor for this problem.  But even without it, an unnecessary data dependency
between the lock bts loop and clearing memory is silly.)

This ended up being about 5 separate bugs, but IDK which belong together or are
already reported:

* useless mov %rsi, %rcx and useless mov %rdx, %rdi
* using setc/movzx instead of xor %eax,%eax to get a constant 0; slower and
creating a data dependency
* Doing that inside the loop instead of after
* Not adjusting register allocation to allow xor / set-flags / setc
* rep stos vs. vector stores as a zeroing strategy vs. any other repeated
value.



The reproducer test-case for bug 86314  loops until it finds and claims a zero
bit in a uint64_t, then returns a Bucket() object (with a constructor that
zero-initializes it) with no data dependency on anything.

But gcc decides to introduce a flag -> integer 0/1 inside the acquire() loop
instead of just using  xor eax,eax  before rep stosq.  The loop can only exit
when CF = 0, so RAX = 0, so it's not a correctness problem.

The loop is branching on CF as set by BTS, so there's no need to have the 0/1
in a register at all inside the loop, and setc/movzx from a known-zero CF is
more expensive than xor-zeroing.  (Plus it gives the STOSQ a data dependency on
the LOCK BTS flag result which it wouldn't have otherwise.  The stores can't
commit until after the lock memory barrier, but they can execute.)

This is the actual code-gen from (GCC-Explorer-Build) 9.0.0 20180627
https://godbolt.org/g/XGF5tR


BucketMap::acquireBucket():
        movq    %rdi, %rdx
        movq    %rsi, %rcx        # useless, lock bts can use (%rsi)
.L2:
        movq    (%rsi), %rax
        andl    $1, %eax          # source is simplified to only check positions 0 or 1
        lock btsq  %rax, (%rcx)   # Why not (%rsi)?
        setc    %al
        movzbl  %al, %eax         # xor / bts / setc would have been possible with a different reg
        jc      .L2
        # rax = 0 because the loop can only exit when CF=0

        # should use  xor %eax,%eax  here instead

        movq    %rdx, %rdi        # Useless, RDI still == RDX
        movl    $16, %ecx
        rep stosq
        movq    %rdx, %rax        # can't be done before rep stosq: RAX needs to be 0
        ret



With -m32, where 64-bit lock bts isn't available, we have lock cmpxchg8b ending
with an OR.  So there is a zero in an integer register from that, but it's not
in EAX, so the code gen includes an extra `mov %esi, %eax`, which is not
cheaper than xor %eax,%eax especially with -march=haswell.  Sandybridge-family
has xor-zeroing as cheap as a NOP, but mov-elimination isn't always perfect and
SnB itself doesn't have it.

And of course mov still has a data dependency on the source of the zero, so it
defeats the benefit of branch prediction + speculation breaking the (control)
dependency.  This last point applies on any out-of-order x86.

I guess the lock bts peephole is seen too late to notice that it can't recycle
the 0 from the loop condition anymore, and ends up generating code to
materialize it.  But why inside the loop?

--


Even if we *did* need an integer 0/1 in a register inside the loop, we could
still use the xor / set-flags / setcc optimization: Simply use a register other
than RAX for the load / AND $1 / bts source.  And you can hoist the xor-zeroing
out of the loop.


        xor     %eax, %eax
.L2:
        movq    (%rsi), %rcx
        andl    $1, %ecx
        lock btsq  %rcx, (%rsi)
        setc    %al
        # use %rax
        jc      .L2


---

Separately:

If the initializer is non-zero, it uses SSE or AVX stores.  That makes no sense
either: if rep stosq is optimal, use  mov eax, 1 for the all-ones case.  (See
the ifdef in the Godbolt link to try it)

If it's not optimal, use xorps xmm0,xmm0 to create an all-zero vector.

I guess gcc is checking for all-zeros as a common special case, but doesn't
check for repeats of any other value, except for repeated bytes recognized as
memset.

So it makes sense that gcc uses a different strategy, but I think for only 16x
8 bytes (128 bytes) that vector stores beat rep stos on current CPUs.  (That
may change when IceLake introduces fast short-rep stos/movs.)
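
A minimal sketch (assuming AVX is available; function name hypothetical) of what the vector-store strategy for this 128-byte case looks like:

    #include <immintrin.h>
    static inline void zero_128_bytes(void *p) {
        __m256i z = _mm256_setzero_si256();
        _mm256_storeu_si256((__m256i *)p + 0, z);   // 4x 32-byte stores = 128 bytes
        _mm256_storeu_si256((__m256i *)p + 1, z);
        _mm256_storeu_si256((__m256i *)p + 2, z);
        _mm256_storeu_si256((__m256i *)p + 3, z);
    }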

GCC does notice that it can reuse the same vec

[Bug target/80820] _mm_set_epi64x shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it.

2018-06-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820

--- Comment #5 from Peter Cordes  ---
AVX512F with merge-masking for integer->vector broadcasts gives us a single-uop
replacement for vpinsrq/d, which is 2 uops on Intel/AMD.

See my answer on
https://stackoverflow.com/questions/50779309/loading-an-xmm-from-gp-regs.  I
don't have access to real hardware, but according to reported uop counts, this
should be very good: 1 uop per instruction on Skylake-avx512 or KNL

vmovq         xmm0, rax                      1 uop p5   2c latency
vpbroadcastq  xmm0{k1}, rdx   ; k1 = 0b0010  1 uop p5   3c latency
vpbroadcastq  ymm0{k2}, rdi   ; k2 = 0b0100  1 uop p5   3c latency
vpbroadcastq  ymm0{k3}, rsi   ; k3 = 0b1000  1 uop p5   3c latency

xmm vs. ymm vs. zmm makes no difference to latency, according to InstLatx64

(For a full ZMM vector, maybe start a 2nd dep chain and vinsert to combine
256-bit halves.  Also means only 3 k registers instead of 7)

vpbroadcastq  zmm0{k4}, rcx   ; k4 = 0b10000             3c latency
... filling up the ZMM reg


Starting with k1 = 2 = 0b0010, we can init the rest with KSHIFT:

mov      eax, 0b0010    ; = 2
kmovw    k1, eax
KSHIFTLW k2, k1, 1
KSHIFTLW k3, k1, 2

  #  KSHIFTLW k4, k1, 3
 ...

KSHIFT runs only on port 5 (SKX), but so does KMOV; moving from integer
registers would just cost extra instructions to set up integer regs first.

It's actually ok if the upper bytes of the vector are filled with broadcasts,
not zeros, so we could use 0b1110 / 0b1100 etc. for the masks.  We could start
with kxnor to generate a -1 and left-shift that, but that's 2 port5 uops vs.
mov eax,2 / kmovw k1, eax being p0156 + p5.

Loading k registers from memory is not helpful: according to IACA, it costs 3
uops.  (But that includes p237, and a store-AGU uop makes no sense, so it might
be wrong.)
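
For reference, a hedged intrinsics rendering of the merge-masked broadcast idea above for the 256-bit case (assumes AVX512VL in addition to AVX512F for the ymm masked forms; function name hypothetical):

    #include <immintrin.h>
    static inline __m256i set_epi64_gp(long long d, long long c, long long b, long long a) {
        __m256i v = _mm256_castsi128_si256(_mm_cvtsi64_si128(a));  // vmovq xmm, r64
        v = _mm256_mask_set1_epi64(v, 0b0010, b);   // vpbroadcastq ymm{k}, r64, merge-masked
        v = _mm256_mask_set1_epi64(v, 0b0100, c);
        v = _mm256_mask_set1_epi64(v, 0b1000, d);
        return v;
    }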

[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm

2018-06-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #14 from Peter Cordes  ---
I happened to look at this old bug again recently.

re: extracting the low two 32-bit elements:

(In reply to Uroš Bizjak from comment #11)
> > Or without SSE4 -mtune=sandybridge (anything that excluded Nehalem and other
> > CPUs where an FP shuffle has bypass delay between integer ops)
> > 
> > movd %xmm0, %eax
> > movshdup %xmm0, %xmm0  # saves 1B of code-size vs. psrldq, I think.
> > movd %xmm0, %edx
> > 
> > Or without SSE3,
> > 
> > movd %xmm0, %eax
> > psrldq   $4,  %xmm0# 1 m-op cheaper than pshufd on K8
> > movd %xmm0, %edx
> 
> The above two proposals are not suitable for generic moves. We should not
> clobber input value, and we are not allowed to use temporary.

SSE3 movshdup broadcasts the high element within each pair of 32-bit elements
so 

   movshdup  %xmm0, %xmm1
   movd  %xmm1, %eax

saves a byte of code vs. pshufd / movd, saves a uop on Merom, and avoids a
flt->int bypass delay.  (According to Agner Fog's tables, pshufd is flt->int domain, i.e. it
wants input in the float domain.  While movshdup ironically is only an integer
shuffle.)

Probably not worth looking for that optimization, though, because it's not
worth using universally (Nehalem has worse latency for float shuffles between
int instructions).


With just SSE2, PSHUFLW is the same size as PSHUFD and faster on Merom / K8
(slowshuffle CPUs where PSHUFD is multiple uops).  It's not slower on any
current CPUs.  I could imagine some future CPU having better throughput for
32-bit element size shuffles than 16-bit, though.  That's already the case for
wider lane-crossing shuffles (VPERMW YMM is multiple uops on Skylake-AVX512). 
This would be a definite win for tune=core2 or k8, and Pentium M, but those are
so old it's probably not worth adding extra code to look for it.

I think it's pretty future-proof, though, unless Intel or AMD add an extra
shuffle unit for element sizes of 32-bit or wider on another port.

[Bug tree-optimization/69615] 0 to limit signed range checks don't always use unsigned compare

2018-06-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615

--- Comment #5 from Peter Cordes  ---
Update: https://godbolt.org/g/ZQDY1G

gcc7/8 optimizes this to and / cmp / jb, while gcc6.3 doesn't.

void rangecheck_var(int64_t x, int64_t lim2) {
  //lim2 >>= 60;
  lim2 &= 0xf;  // let the compiler figure out the limited range of limit
  if (x>=0 && x<lim2) ext();
}
// constant limit:  if (x>=0 && x<=(INT_MAX-1)) ext();   // clang and gcc use 2 branches
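
The transformation being asked for, sketched by hand (ext() as in the testcase, other names hypothetical): a signed 0 <= x < lim2 check can become one unsigned compare when lim2 is known non-negative:

    #include <stdint.h>
    void ext(void);
    void rangecheck_by_hand(int64_t x, int64_t lim2) {
        lim2 &= 0xf;                                // lim2 is now in [0, 15]
        if ((uint64_t)x < (uint64_t)lim2) ext();    // one cmp + jb instead of two branches
    }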

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011

--- Comment #13 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #10)
> ??  That is the task for the linker SHF_MERGE|SHF_STRINGS handling.
> Why should gcc duplicate that?

Because gcc would benefit from knowing if merging makes the total block of
strings for a switch() table short enough to use a uint8_t offset[] instead of
uint16_t.

If we don't know at compile time, we'd have to be conservative and potentially
use a wider offset table.  (Although as Joseph points out
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585#c2, without more linker
support for this we could end up missing out on literal merging across
compilation units.  So perhaps a first step in applying this idea would be to
use 32-bit offsets from the start of the .rodata.str1.1 section, so we can
still let the linker merge strings and end up with them non-contiguous without
having to force the one that gets kept to be the one that's part of our block
of strings.)

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011

--- Comment #12 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #10)
> (In reply to Peter Cordes from comment #9)
> > gcc already totally misses optimizations here where one string is a suffix
> > of another.  "mii" could just be a pointer to the 3rd byte of "sgmii", but
> > we instead duplicate all the characters.  That's where major savings are
> > possible for this function.
> 
> ??  That is the task for the linker SHF_MERGE|SHF_STRINGS handling.
> Why should gcc duplicate that?

Oops, right I was only looking at gcc's asm output, didn't check an actual
linked binary.

Will the linker currently catch a case like this?

.LC_base:
.LC2: .string "mii"
.LC3: .string "gmii"

table:
.byte  .LC2 - .LC_base,  .LC3 - .LC_base

and drop .string "mii" entirely + rewrite the table to
.byte  .LC3+1 - .LC_base,  .LC3 - .LC_base

(This discussion should probably be happening on bug 85585.)

Sorry I don't know the actual mechanism by which gcc signals to the linker that
it can / can't merge.  I guess only in some sections?  Because gcc couldn't
allow it if was emitting an array like this, where dropping a string would
change the offsets for later data and break offset calculations:

const struct { char str[11]; } table[] = { {"mii"}, {"gmii"} };

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #9 from Peter Cordes  ---
(In reply to rguent...@suse.de from comment #4)
> An optimization would be to
> add an indirection by, say, only recording the constant offset
> into an "array of strings" in the table, thus effectively
> 
>   "case1\0case2\0..."[CSWITCH[i]]
> 
> which would require only a relocation to access the single string
> constant.  But it would prohibit cases of string merging within
> those strings unless we implement that as well for this optimization.

gcc already totally misses optimizations here where one string is a suffix of
another.  "mii" could just be a pointer to the 3rd byte of "sgmii", but we
instead duplicate all the characters.  That's where major savings are possible
for this function.

> Note this might be profitable unconditionally, not just with -fpie/pic
> as the CSWITCH table would be smaller (dependent on the total
> size of the merged string).

Indeed, I wrote up bug 85585 with ideas for optimizing this.  A table of byte
or uint16_t offsets into a static buffer of packed strings looks good for PIC
and for position-dependent.

To avoid any runtime relocations, all you need is the ability to get a static
address into a register (e.g. RIP-relative LEA) and do an indexed load relative
to it, just like using a normal static char[].  Then add the load result to
that address.  Runtime relocation is nice to avoid even if you don't *need* to
avoid it.

Also possible is padding each string out to a constant length and calculating
an index into that, removing a level of indirection.  (Good when strings are
similar length and/or all short, and there aren't many strings that are
duplicates or suffixes of others.)  Again you just need to get a static address
into a register, and add it to 11*enum_value.  This is all ADD + LEA (with one
of them being RIP-relative).

[Bug tree-optimization/85585] switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a string

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585

--- Comment #1 from Peter Cordes  ---
By comparison, the no-PIE table of pointers only needs one instruction:

        movq    CSWTCH.4(,%rdi,8), %rax

So all my suggestions cost 1 extra instruction on x86 in no-PIE mode, but at a
massive savings in data size.

clang -fPIE compiles the plain switch to the obvious / sane 2 instruction
sequence which should be our baseline for normal cases.

# clang6.0 -fPIE -O3  (switch compilers on the Godbolt link)
        leaq    .Lswitch.table.phy_modes(%rip), %rcx
        movq    (%rcx,%rax,8), %rax

Clang is willing to make a table that needs relocations for the entries.  (My
suggestions all avoid that because they're based on offsets, not a table of
pointers.  Avoiding rodata relocations that dirty a page and prevent sharing
has some non-zero value, although it's low on many architectures where memory
is cheap.)

[Bug tree-optimization/85585] New: switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a st

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585

Bug ID: 85585
   Summary: switch to select a string based on an enum can
profitably optimize away the table of pointers/offsets
into fixed-length char[] blocks.  Or use byte offsets
into a string table
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

Bug 84011 shows some really silly code-gen for PIC code and discussion
suggested using a table of offsets instead of a table of actual pointers, so
you just need one base address.

A further optimization is possible when the strings are all similar length,
and/or the longest one isn't much longer than a pointer:

Pad all strings to the same length with trailing 0 bytes, and calculate a
pointer instead of loading it from an array.  This removes the possibility of
multiple entries sharing the same suffix (which is a missed optimization gcc
wasn't already doing), but avoids needing any space for storing pointers in
memory at all.

In the case discussed in bug 84011 (Linux's phy.h const char
*phy_modes(phy_interface_t interface)), the longest strings are 11 bytes
(including the \0), and there are 23 of them.  So it takes 253 bytes of char
data to store everything (not counting the "unknown" for the default: special
case) with all strings padded to 11 bytes.



The current strings + pointer-table implementation doesn't merge string
literals where one string is a suffix of another; this is another a
missed-optimization that would save many bytes here.  (e.g. instead of .string
"mii" and .string "gmii", just have .LC4 .byte 's'; .LC3: .byte 'g'; .LC2:
.string "mii".)

That optimization plus byte or 16-bit offsets into the table would be nice and
compact, and most CPUs have efficient zero-extending narrow loads.  So for
cases where the other optimization I'm suggesting isn't good, that would
probably be best.



The current packed string-data takes 158 bytes, so with 4-byte offsets it
takes 158+23*4 = 250 bytes.  Or with 8-byte pointers/offsets, it takes 158 +
23*8 = 342 bytes.  Or with 1-byte offsets, 158 + 23*1 = 181 bytes: load with
movzbl.  (If you can't use the offset directly as an 8-byte memory source
operand for ADD to a pointer, there's no point making it 32 bits instead of 8.)

The code for *using* such a table is quite simple.  This C source compiles to
what I'm suggesting:

https://godbolt.org/g/E8J3iS

struct foo {
char str[11];
} const table[23] = {};

const char *lookup(unsigned long idx) {
if(idx > 23) {
return "unknown";
//idx=23;
}
return table[idx].str;
}

Multiply by 11 only takes 2 LEA instructions on x86, so for PIC code with a
RIP-relative LEA we end up with 4 ALU instructions total to get a string
address, after checking the if condition:

   # gcc7.3 -march=haswell -O3 -fPIE output:  https://godbolt.org/g/qMzaY8
        leaq    .LC0(%rip), %rax        # "unknown"
        cmpq    $23, %rdi
        ja      .L4                     # branchless is also an option
        leaq    (%rdi,%rdi,4), %rax
        leaq    table(%rip), %rdx       # RIP-relative table base address
        leaq    (%rdi,%rax,2), %rax
        addq    %rdx, %rax              # table + 11*idx
.L4:
        ret

This is even better in no-PIE mode where a static address is usable as a signed
32-bit immediate:

lookup(unsigned long):
        movl    $.LC0, %eax
        cmpq    $23, %rdi
        ja      .L4
        leaq    (%rdi,%rdi,4), %rax
        leaq    table(%rdi,%rax,2), %rax    # 3 cycle latency for 3-component LEA on SnB-family
.L4:
        ret

So this has extremely low code-size cost on x86-64, for the benefit of removing
a table load in the dependency chain from enum to string data.  It does cost
significant data size vs. a byte-offset table with suffix-merging, but it's 
better than what gcc is doing now in non-PIE (table of qword pointers), and
*much* better in PIE (insane jump table).

-

The byte-index version is equivalent to transforming the C source like this:

const char packedstrings[158] = {};
const unsigned char offsets[23] = {};
const char *lookup_byteidx(unsigned long idx) {
if(idx>23)
return "unknown";
return &packedstrings[offsets[idx]];
}

        leaq    .LC0(%rip), %rax        # "unknown"
        cmpq    $23, %rdi
        ja      .L9
        leaq    offsets(%rip), %rax
        leaq    packedstrings(%rip), %rdx
        movzbl  (%rax,%rdi), %eax
        addq    %rdx, %rax
.L9:
        ret

We can save an instruction here by making the relative position of
packedstrings and offsets a compile-time constant, i.e. by effectively putting
the

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2018-04-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

--- Comment #2 from Peter Cordes  ---
The stray LEA bug seems to be fixed in current trunk (9.0.0 20180429), at least
for this testcase.  Gcc's stack-alignment strategy seems to be improved overall
(not copying the return address when not needed), so probably it's really
fixed.

It's still present in 7.3.

[Bug c++/69560] x86_64: alignof(uint64_t) produces incorrect results with -m32

2018-04-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69560

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #23 from Peter Cordes  ---
Just to recap the current situation (gcc/g++ 8.0.1 20180425):

I ported David Marillat's testcase to work as C or C++
https://godbolt.org/g/QdG2V6.  (And changed it to set global variables instead
of calling printf, so you can see the results from looking at the asm output
instead of running it).

C++11 alignof() now agrees with C11 alignof() (which didn't change) that
alignof(int64_t) is 4 when targeting the i386 System V ABI.

Previously G++'s alignof() reported 8, while gcc's C11 alignof (stdalign.h)
reported 4.  That was the only change: struct-member alignof results are
unchanged, and already matched between C11 and C++11.


4 is the minimum alignment that *any* int64_t, or pointer to int64_t, is
assumed to have when generating code for i386 SysV.  gcc / g++ are allowed to
generate code that breaks if passed a pointer to int64_t that wasn't 4-byte
aligned.  (Auto-vectorization is one case where that can happen on x86:
https://stackoverflow.com/q/47510783/224132).

They're *not* allowed to assume that it's 8-byte aligned unless they can see
the definition and know that a particular int64_t object is over-aligned, e.g.
to its natural alignment of 8, like gcc chooses to do whenever possible (i.e.
outside structs).

So in both C++ and C (and in g++/gcc after this patch), alignof(int64_t) is the
minimum that any allocator must give an int64_t for correctness (in this funky
32-bit ABI), not the recommended alignment that gcc and g++ both already used
whenever ABI struct-packing rules didn't constrain them.

It's also the guaranteed minimum that code can *assume*.  e.g. a
manually-vectorized library function might check alignof(T) == sizeof(T) before
assuming that using 16-byte aligned loads/stores can line up with element
boundaries.  (An array inside a struct { int foo; int64_t arr[10]; } would
violate this for i386 SysV).

Anyway, I think use-cases like these are why the standard is worded the way it
is, and why it makes sense for alignof() to report the guaranteed/required
minimum.  The recommended or actual alignment is useful, too, though, for other
cases, so it's nice that GNU __alignof() is also available to report that.
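
A small probe (not from the bug report; assumes a GNU compiler for __alignof__) showing the two queries side by side.  Per the discussion above, targeting i386 SysV it prints 4 and then 8:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdalign.h>
    int main(void) {
        printf("alignof(int64_t)     = %zu\n", (size_t)alignof(int64_t));       // required minimum
        printf("__alignof__(int64_t) = %zu\n", (size_t)__alignof__(int64_t));   // preferred/actual
        return 0;
    }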



Semi-related: gcc depends on 8-byte alignment for C11 _Atomic int64_t but still
fails to provide it inside structs on the i386 SysV ABI (Bug 65146), using the
same alignment rules as regular int64_t.

C++11 std::atomic<int64_t> is fine, getting the required natural alignment even
on i386 SysV, so SSE2 movq is atomic and lock add is efficient.

This change to what alignof() reports in C++ had no effect on C at all, or on
any alignment choices made by the compiler in either C or C++.  I only mention
it as another interesting case where i386 SysV's under-alignment of 64-bit
types requiring special care, but that one will require an ABI change of some
sort to fix.

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2018-04-15 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes  ---
This LEA stuff is part of what gcc does to align the stack by 32 for spilling
AVX locals.

Gcc's stack-align sequence is over-complicated and ties up an extra register
for the whole function (add  volatile  to the local and see the -O3 code).  Or
at least it was; it seems gcc8 trunk just makes a stack frame with EBP / RBP
but references 32-byte aligned locals from aligned RSP instead of unaligned
RBP.

It used to copy the address of the return address to make a full copy of
ret-addr / saved-RBP for the aligned stack frame, which was super weird.

https://godbolt.org/g/RLJNtd.  (With an alloca or something, gcc8 does the same
crazy stack-frame stuff as gcc7, otherwise it's much cleaner, like clang)



The actual bug here is that it's not fully optimized away when it turns out
that no 32-byte spills / reloads from locals are left in the function.

gcc for x86-64 sometimes has a few leftover instructions like that in more
complex functions using __m256; this is not exclusively an i386 problem, but
it's happens more easily for 32-bit it seems.

[Bug target/85366] New: Failure to use both div and mod results of one IDIV in a prime-factor loop while(n%i==0) { n/=i; }

2018-04-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85366

Bug ID: 85366
   Summary: Failure to use both div and mod results of one IDIV in
a prime-factor loop while(n%i==0) { n/=i; }
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

From
https://codereview.stackexchange.com/questions/191792/find-prime-factors-in-c/191801#191801,
simplified to use a pointer instead of returning std::vector. 
Interestingly, the version with std::vector can be more easily coaxed to use
both results of one idiv, see the Godbolt link.

void find_prime_factors_ptr(int n, int *p)
{
// inefficient to test even numbers > 2, but that's a separate missed
optimization.
for (int i = 2; i <= n; i++) {
while (n % i == 0) {
*p++ = i;
n /= i;   // reordering the loop body doesn't help
}
}
}

https://godbolt.org/g/ogyZW8

g++ 8.0.1 20180411 -O3 -march=haswell gives us this inner loop:

 ...
 # outer loop
        movl    %edi, %eax
        # idiv to test if inner loop should even run once, leaving n/i in eax
.L4:
        movl    %edi, %eax        # but instead we discard it
        addq    $4, %rsi
        movl    %ecx, -4(%rsi)
        cltd
        idivl   %ecx
        cltd                      # then modulo that division result to see if the next iteration should run
        movl    %eax, %edi
        idivl   %ecx              # leaves n/i in eax, ready for next iteration...
        testl   %edx, %edx
        je      .L4
 ...

So both ways to get to .L4 (fall in or loop) have n/i in EAX from an idiv
already!  The loop doesn't need to be re-structured to take advantage, gcc just
needs to keep track of what it's doing.

## Hand optimized version of the whole function:
        cmpl    $1, %edi
        jle     .L9
        movl    $2, %ecx
.L5:
        movl    %edi, %eax
        cltd
        idivl   %ecx              # eax = tmp = n/i
        testl   %edx, %edx
        jne     .L3
.L4:
        movl    %ecx, (%rsi)
        addq    $4, %rsi          # we're tuning for Haswell, no register-read stalls, so increment after reading and save a byte in the addressing mode
        movl    %eax, %edi        # n = tmp
        cltd
        idivl   %ecx              # eax = tmp = n/i
        testl   %edx, %edx
        je      .L4
.L3:
        incl    %ecx
        cmpl    %edi, %ecx
        jle     .L5
.L9:
        ret


I didn't make *any* changes to the code outside the inner loop.  I ended up
just removing movl %edi, %eax / cltd / idiv %ecx.

Changing the inner loop to

int tmp;
while (tmp = n/i, n % i == 0) {
*p++ = i;
n = tmp;
}

gives us the asm almost that good (an extra mov inside the loop), but we get a
jmp into the loop instead of peeling the while condition from before the first
iteration:


# gcc8.0.1 -O3 -march=haswell output, commented but unmodified
find_prime_factors_ptr_opt(int, int*):
        cmpl    $1, %edi
        jle     .L18
        movl    $2, %ecx
        jmp     .L19
.L16:                             # top of inner loop
        addq    $4, %rsi
        movl    %ecx, -4(%rsi)
        movl    %eax, %edi        # extra mov puts this and the next mov on the critical path
.L19:                             # inner loop entry point
        movl    %edi, %eax
        cltd
        idivl   %ecx
        testl   %edx, %edx
        je      .L16              # bottom of inner
        incl    %ecx
        cmpl    %edi, %ecx
        jle     .L19              # bottom of outer
.L18:
        ret

Saving code-size here with the dependent chain of movl %eax, %edi / movl %edi,
%eax is pretty minor even on CPUs like original Sandybridge, or Bulldozer,
without mov-elimination, because idiv's latency dominates.  But it could easily
be taken out of the inner loop by duplicating it outside the outer loop, then
moving it to the outer-only part of the loop body, like this:

        cmpl    $1, %edi
        jle     .L18
        movl    $2, %ecx
        movl    %edi, %eax        # eax = n added here
        jmp     .L19
.L16:                             # top of inner loop
        addq    $4, %rsi
        movl    %ecx, -4(%rsi)
        movl    %eax, %edi        # n = tmp  still here
.L19:                             # inner loop entry point
      # movl    %edi, %eax        # eax = n removed from here in inner/outer loop
        cltd
        idivl   %ecx
        testl   %edx, %edx
        je      .L16              # bottom of inner

        movl    %edi, %eax        # eax = n also added here, in the outer-only part
        incl    %ecx
        cmpl    %edi, %ecx
        jle     .L19              # bottom of outer
.L18:
        ret

[Bug target/85038] x32: unnecessary address-size prefix when a pointer register is already zero-extended

2018-03-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038

--- Comment #1 from Peter Cordes  ---
Correction for AArch64: it supports addressing modes with a 64-bit base
register + 32-bit index register with zero or sign extension for the 32-bit
index.  But not 32-bit base registers.

As a hack that's better than nothing, AArch64 could use a 32-bit pointer as the
index with a UXTW mode, using a zeroed register as the base (unless indexed
modes have any perf downside on real AArch64 chips).  But unfortunately, the
architectural zero register isn't usable as the base: that encoding means the
stack pointer for this instruction.  ldr w1,[xzr,w2,uxtw] doesn't assemble,
only x0-x30 or SP.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0801b/BABBGCAC.html


http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/LDR_reg_gen.html
describes LDR  Wt, [Xn|SP, Rm{, extend {amount}}]
where Rm can be an X or W register, and "extend" can be SXTW or UXTW for word
regs, or LSL for X regs.  (SXTX is a synonym for LSL).  Any of the modes can
use a left-shift amount, applied *after* extension to 64-bit.

See
https://community.arm.com/processors/b/blog/posts/a64-shift-and-extend-operations-operand-modifiers
for details on operand-modifiers.


gcc6.3 doesn't take advantage with -mabi=ilp32, and Godbolt doesn't have later
AArch64 gcc.

So gcc will need to know about zero-extended pointers, and the signedness of
32-bit values, to take advantage of AArch64's addressing modes for the common
case of a 32-bit index.  Teaching gcc to track signed/unsigned in RTL would
benefit x32 and AArch64 ILP32, if I understand the situation correctly.

[Bug target/85038] New: x32: unnecessary address-size prefix when a pointer register is already zero-extended

2018-03-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038

Bug ID: 85038
   Summary: x32: unnecessary address-size prefix when a pointer
register is already zero-extended
   Product: gcc
   Version: 8.0.1
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

Bug 82267 was fixed for RSP only.  (Or interpreted narrowly as only being about
RSP vs. ESP).

This bug is about the general case of using address-size prefixes in cases
where we could prove they're not needed.  Either because out-of-bounds is UB so
we don't care about wrap vs. going outside 4GiB, or (simpler) the
single-register case when we know the pointer is already zero-extended.  Maybe
we want separate bugs to track parts of this that can be fixed with separate
patches, but I won't consider this fixed until -mx32 emits optimal code for all
the cases listed here.

I realize this won't be any time soon, but it's still code-size (and thus
indirectly performance) that gcc is leaving on the table.  Being smarter about
using 64-bit address-size is even more useful for AArch64 -mabi=ilp32, because
it doesn't have 32-bit address-size overrrides, so it always costs an extra
instruction every time we fail to prove that 64-bit is safe.  (And AArch64
ILP32 may get more use than x32 these days).  I intended this bug to be about
x32, though.



Useless 0x67 address-size override prefixes hurt code-size and thus performance
on everything, with more serious problems on some CPUs that have trouble with
more than 3 prefixes (especially Silvermont).  See Bug 82267 for the details
which I won't repeat.


We still have tons of useless 0x67 prefixes in the default -maddress-mode=short
mode (for every memory operand other than RSP, or RIP-relative), and
-maddress-mode=long has lots of missed optimizations resulting in wasted LEA
instructions, so neither one is good.


float doublederef(float **p){
return **p;
}
 // https://godbolt.org/g/exb74t
 // gcc 8.0.1 (trunk) -O3 -mx32 -march=haswell -maddress-mode=short
        movl    (%edi), %eax
        vmovss  (%eax), %xmm0     # could/should be (%rax)
        ret

-maddress-mode=long gets that right, using (%rax), and also (%rdi) because the
ABI doc specifies that x32 passes pointers zero-extended.  mode=short still
ensures that, so failure to take advantage is still a missed-opt.

Note that clang -mx32 violates that ABI guarantee by compiling
pass_arg(unsigned long long ptr) { ext_func((void*)ptr); } to just a tailcall
(while gcc does zero-extend).  See output in the godbolt link above.  IDK if we
care about being bug-compatible with clang for that corner case for this rare
ABI, though.  A less contrived case would be a struct arg or return value
packed into a register passed on as just a pointer.


-

// arr+offset*4 is strictly within the low 32 bits because of range limits

float safe_offset(float *arr, unsigned offset){
unsigned tmp = (unsigned)arr;
arr = (void*)(tmp & -4096);  // round down to a page
offset &= 0xf;
return arr[offset];
}
   // on the above godbolt link
#mode=short
        andl    $-4096, %edi
        andl    $15, %esi
        vmovss  (%edi,%esi,4), %xmm0
        # (%rdi,%rsi,4) would have been safe, but that's maybe not worth looking for.
        # most cases have less pointer alignment than offset range

#mode=long
        andl    $-4096, %edi
        andl    $15, %esi
        leal    (%rdi,%rsi,4), %eax
        vmovss  (%eax), %xmm0     # 32-bit addrmode after using a separate LEA

So mode=long is just braindead here.  It gets the worst of both worlds, using a
separate LEA but then not taking advantage of the zero-extended pointer.  The
only way this could be worse is if the LEA operand-size were 64-bit.

Without the masking, both modes just use  vmovss (%edi,%esi,4), %xmm0, but the
extra operations defeat mode=long's attempts to recognize this case, and it
picks an LEA instead of (or as well as?!?) an address-size prefix.

---

With a 64-bit offset, and a pointer that's definitely zero-extended to 64 bits:

   // same for signed or unsigned
float ptr_and_offset_zext(float **p, unsigned long long offset){
float *arr = *p;
return arr[offset];
}

# mode=short
        movl    (%edi), %eax          # mode=long uses (%rdi) here
        vmovss  (%eax,%esi,4), %xmm0  # but still 32-bit here.
        ret

Why are we using address-size prefixes to stop a base+index from going outside
4G on out of bounds UB?  (%rax,%rsi,4) should work for a signed / unsigned
64-bit offset when the pointer is known to be zero-extended.

ISO C11 says that pointer+integer produces a result of pointer type, with UB if
the result goes outside the array.  It does *not* say that the int

[Bug libstdc++/71660] [6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2018-03-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660

--- Comment #17 from Peter Cordes  ---
(In reply to Jonathan Wakely from comment #16)
> But what we do care about is comment 2, i.e. _Atomic(T) and std::atomic
> should have the same alignment (both in an out of structs). Maybe that needs
> the C front-end to change how _Atomic works, or maybe it needs the C++
> library to change how std::atomic works, but I want to keep this bug open
> while comment 2 gives different answers for C and C++.

Right, gcc's C _Atomic ABI is still broken for long long on 32-bit x86.  It
only aligned _Atomic long long to 32 bits (inside structs), but then assumes
that 8-byte loads / stores (with x87 or SSE1/2) are atomic.

It also leads to abysmal performance for  LOCK CMPXCHG  or other RMW operations
if the atomic object is split across a cache line.

That's bug 65146, so we can close this one.  (I never got around to posting in
the google group for the ABI.  By far the best solution is giving _Atomic
long long (and other 8-byte objects) a boost to their _Alignof, up to 8 byte
alignment even inside structs.)
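
The problematic layout, as a hedged sketch: under the current i386 SysV rules the member below can land on a 4-byte boundary and even straddle a cache line, while gcc still emits plain 8-byte loads/stores for it.

    #include <stdatomic.h>
    struct counters {
        int               tag;      // 4 bytes
        _Atomic long long value;    // only 4-byte aligned inside the struct on i386 SysV
    };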

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-16 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

--- Comment #28 from Peter Cordes  ---
(In reply to Richard Biener from comment #27)
> Note that this is deliberately left as-is because the target advertises
> (cheap) support for horizontal reduction.  The vectorizer simply generates
> a single statement for the reduction epilogue:
>  [...]
> so either the target shouldn't tell the vectorizer it supports this or
> it simply needs to expand to better code.  Which means - can you open
> a separate bug for this?

Yes; I was incorrectly assuming the inefficient asm had the same cause as
before.  I agree *this* is fixed, thanks for the explanation of how gcc was
arriving at this sequence.

I'll have a look at the backend canned sequence defs and see if there are any
other sub-optimal ones, or if it was only AVX.

Having canned sequences for different target instruction sets instead of
leaving it to arch-independent code seems like it should be an improvement over
the old design.

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

--- Comment #25 from Peter Cordes  ---
We're getting a spill/reload inside the loop with AVX512:

.L2:
        vmovdqa64  (%esp), %zmm3
        vpaddd     (%eax), %zmm3, %zmm2
        addl       $64, %eax
        vmovdqa64  %zmm2, (%esp)
        cmpl       %eax, %edx
        jne        .L2

Loop finishes with the accumulator in memory *and* in ZMM2.  The copy in ZMM2
is ignored, and we get

# narrow to 32 bytes using memory indexing instead of VEXTRACTI32X8 or VEXTRACTI64X4
        vmovdqa       32(%esp), %ymm5
        vpaddd        (%esp), %ymm5, %ymm0

# braindead: vextracti128 can write a new reg instead of destroying xmm0
        vmovdqa       %xmm0, %xmm1
        vextracti128  $1, %ymm0, %xmm0
        vpaddd        %xmm0, %xmm1, %xmm0

... then a sane 128b hsum as expected, so at least that part went
right.

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

--- Comment #22 from Peter Cordes  ---
Forgot the Godbolt link with updated cmdline options:
https://godbolt.org/g/FCZAEj.

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

Peter Cordes  changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

--- Comment #21 from Peter Cordes  ---
(In reply to Richard Biener from comment #20)
> Fixed.

Unfortunately only fixed for integer, not FP.  The OpenMP and vanilla float
array sum functions from the godbolt link in the initial bug report still use
256b shuffles, including a gratuitous vperm2f128 when the upper half isn't
used, so vextractf128 would have done the same job in 1 uop on Ryzen instead of
8.

Even on Intel CPUs, they're optimized for code-size, not performance (vhaddps
instead of shuffle / vaddps).  Remember that Intel CPUs with AVX only have one
FP shuffle unit.  (Including Sandy/Ivybridge, which has 2 integer-128 shuffle
units)

float sumfloat_autovec(const float arr[]) {
  arr = __builtin_assume_aligned(arr, 64);
  float sum=0;
  for (int i=0 ; i<1024 ; i++)
sum = sum + arr[i];
  return sum;
}

# gcc 20180113 -mavx2 -ffast-math -O3
#  (tune=generic, and even with arch=znver1 no-prefer-avx128)
...
vhaddps %ymm0, %ymm0, %ymm0
vhaddps %ymm0, %ymm0, %ymm1
vperm2f128  $1, %ymm1, %ymm1, %ymm0   # why not vextract?
vaddps  %ymm1, %ymm0, %ymm0   # gratuitous 256b
vzeroupper

This bug is still present for FP code: it narrows from 256b to scalar only in
the last step.

Every VHADDPS is 2 shuffles + 1 add on Intel.  They're in-lane shuffles, but
it's still 2 uops for port5 vs. VSHUFPS + VADDPS.  (Costing an extra cycle of
latency because with only 1 shuffle port, the 2 interleave-shuffles that feed a
vertical-add uop can't run in the same cycle.)  (V)HADDPS with the same input
twice is almost never the best choice for performance.

On Ryzen it's an even bigger penalty: HADDPS xmm is 4 uops (vs. 3 on Intel). 
It's also 7c latency (vs. 3 for ADDPS).  256b VHADDPS ymm is 8 uops, one per 3
cycle throughput, and Agner Fog reports that it's "mixed domain", i.e. some
kind of penalty for ivec / fp domain crossing.  I guess the shuffles it uses
internally are ivec domain?

With multiple threads on the same core, or even with ILP with surrounding code,
uop throughput matters as well as latency, so more uops is worse even if it
didn't have latency costs.

The sequence I'd recommend (for both Intel and AMD) is:
(See also
http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026)


        vextractf128  $1, %ymm0, %xmm1
        vaddps        %xmm1, %xmm0, %xmm0    # narrow to 128b

        vmovshdup     %xmm0, %xmm1           # copy high->low in each pair
        vaddps        %xmm1, %xmm0, %xmm0

        vmovhlps      %xmm0, %xmm0, %xmm1    # duplicate high 64b
        vaddps        %xmm1, %xmm0, %xmm0

The MOVSHDUP / MOVHLPS sequence is also what you want without VEX, so you can
do a 128b hsum with 4 instructions, with no MOVAPS.

Intel: 6 uops total, 3 shuffles.  vs. 8 total, 5 shuffles

AMD Ryzen: 6 uops, 3 shuffles.  vs. 26 total uops, 20 of them shuffles.  And
much worse latency, too.
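
The same recommended sequence written with intrinsics (a sketch assuming AVX; function name hypothetical), for comparison:

    #include <immintrin.h>
    static inline float hsum256_ps(__m256 v) {
        __m128 lo   = _mm256_castps256_ps128(v);
        __m128 hi   = _mm256_extractf128_ps(v, 1);   // vextractf128
        lo          = _mm_add_ps(lo, hi);            // narrow to 128b
        __m128 shuf = _mm_movehdup_ps(lo);           // copy high->low in each pair
        lo          = _mm_add_ps(lo, shuf);
        shuf        = _mm_movehl_ps(shuf, lo);       // duplicate high 64b
        lo          = _mm_add_ss(lo, shuf);
        return _mm_cvtss_f32(lo);
    }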

Even just fixing this specific bug without fixing the rest of the sequence
would help AMD *significantly*, because vextractf128 is very cheap, and vhaddps
xmm is only half the uops of ymm.  (But the latency still sucks).

-

Even for integer, this patch didn't fix the MOVDQA + PSRLDQ that we get without
AVX.  PSHUFD or PSHUFLW to copy+shuffle is cheaper.  I guess I need to report
that bug separately, because it probably won't get fixed soon: if I understand
correctly, there's no mechanism for the back-end to tell the auto-vectorizer
what shuffles it can do efficiently!

It usually won't make too much difference, but for very small arrays (like 8
`int` elements) the hsum is a big part of the cost, although it's probably
still worth auto-vectorizing *if* you can do it efficiently.

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 80846, which changed state.

Bug 80846 Summary: auto-vectorized AVX2 horizontal sum should narrow to 128b 
right away, to be more efficient for Ryzen and Intel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes

2017-12-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837

--- Comment #6 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #4)
> But have just tried gcc 7.1.0 release and can't reproduce even there.

Matt says the Compiler Explorer backend uses upstream release tarballs like
`URL=ftp://ftp.gnu.org/gnu/gcc/gcc-${VERSION}/${TARBALL}`.  (where TARBALL is
`gcc-${VERSION}.tar.xz` for recent gcc where .xz is available).

The compiler config used is
https://github.com/mattgodbolt/compiler-explorer-image/blob/master/gcc/build/build.sh#L78:

CONFIG=""
CONFIG+=" --build=x86_64-linux-gnu"
CONFIG+=" --host=x86_64-linux-gnu"
CONFIG+=" --target=x86_64-linux-gnu"
CONFIG+=" --disable-bootstrap"
CONFIG+=" --enable-multiarch"
CONFIG+=" --with-abi=m64"
CONFIG+=" --with-multilib-list=m32,m64,mx32"
CONFIG+=" --enable-multilib"
CONFIG+=" --enable-clocale=gnu"
CONFIG+=" --enable-languages=c,c++,fortran" # used to have go, but is
incompatible with m32/mx32
CONFIG+=" --enable-ld=yes"
CONFIG+=" --enable-gold=yes"
CONFIG+=" --enable-libstdcxx-debug"
CONFIG+=" --enable-libstdcxx-time=yes"
CONFIG+=" --enable-linker-build-id" 
CONFIG+=" --enable-lto"
CONFIG+=" --enable-plugins"
CONFIG+=" --enable-threads=posix"
CONFIG+=" --with-pkgversion=GCC-Explorer-Build"
BINUTILS_VERSION=2.29.1


Does that help figure out how to build a gcc7.1.0 that can repro this?

[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes

2017-12-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837

--- Comment #5 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #4)
> Can't reproduce.  It is true that we now emit the __atomic_load_16 call, but
> that was intentional change

Yup.

>, and it can't be easily tail call, because the
> tailcall pass doesn't understand that the low 8 bytes of the 16 byte
> structure are returned the same as the whole structure

Ok, that's disappointing, but hopefully it's very rare after inlining.

> But I certainly can't reproduce any significant value masking etc., tried
> r235002 (+- gcc 6 branchpoint), r247000 (+- gcc 7 branchpoint) as well as
> current trunk.
> Unless it is something that has been broken on the 7 branch and later fixed.
> 
> But have just tried gcc 7.1.0 release and can't reproduce even there.

I can't repro it locally with gcc7.1.1 either.  This is the version info from
-fverbose-asm on the godbolt.org link (which does still repro it)

# GNU C++11 (GCC-Explorer-Build) version 7.1.0 (x86_64-linux-gnu)
#   compiled by GNU C version 5.4.0 20160609, GMP version 6.1.0, MPFR
version 3.1.4, MPC version 1.0.3, isl version isl-0.16.1-GMP

It's not present in the gcc7.2 build on Godbolt.org either.

I asked Matt Godbolt what exact version the compiler explorer site is using for
the gcc7.1.0 dropdown
(https://github.com/mattgodbolt/compiler-explorer/issues/684).  Hopefully he
can help us track down a gcc SVN revision to repro it, or confirm that it was a
misconfigured or buggy gcc or something.  Just to rule out the possibility of a
now-dormant bug lurking somewhere.

[Bug target/83285] New: non-atomic stores can reorder more aggressively with seq_cst on AArch64 than x86: missed x86 optimization?

2017-12-04 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83285

Bug ID: 83285
   Summary: non-atomic stores can reorder more aggressively with
seq_cst on AArch64 than x86: missed x86 optimization?
   Product: gcc
   Version: 6.3.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

This is either an x86-64 missed optimization or an AArch64 bug.  I *think*
x86-64 missed optimization, but it's not-a-bug on AArch64 only because any
observers that could tell the difference would have data race UB.

#include <atomic>
// int na;
// std::atomic_int sync;

void seq_cst(int &na, std::atomic_int &sync) {
na = 1;
sync = 2;
na = 3;
}
https://godbolt.org/g/bUwZaM

On x86, all 3 stores are there in the asm in source order (for mo_seq_cst, but
not for mo_release).

On AArch64, gcc6.3 does  sync=2;  na=3;  If `na` was using relaxed atomic
stores, this would be a bug (because a thread that saw `sync==2` could then see
the original value of na, not na==1 or na==3).

But for non-atomic na, reading na even after Synchronizing With the `sync=2`
(with an acquire load) would be UB, because the thread that writes sync writes
na again *after* that.  It seems that gcc's AArch64 backend is using this as
license to sink the na=1 store past the sync=2 and merge it with the na=3.

seq_cst(int&, std::atomic&, std::atomic&):
mov     w2, 2       // tmp79,
stlr    w2, [x1]    // tmp79, *sync
mov     w1, 3       // tmp78,
str     w1, [x0]    // tmp78, *na_2(D)
ret

-

If sync=2 is a release store (not seq_cst), then gcc for x86 does sink the na=1
past the release and merge.  (See the godbolt link.)  In this case it's also
allowed to hoist the na=3 store ahead of the release, because plain release is
only a one-way barrier for earlier stores.  That would be safe for
relaxed-atomic as well (unlike for non-atomic), but gcc doesn't do that.
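
For reference, the release-store variant discussed above would look something
like this (my paraphrase; see the godbolt link for the exact code):

void release(int &na, std::atomic_int &sync) {
    na = 1;
    sync.store(2, std::memory_order_release);  // x86 sinks na=1 past this and merges it with na=3
    na = 3;
}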

I'm slightly worried that this is unintentional and could maybe happen for
relaxed atomics when it would be illegal.  (On AArch64 with seq_cst or release,
and on x86 only with release.)

But hopefully this is just gcc being clever and taking advantage of the fact
that writing a non-atomic after a possible synchronization point means that the
sync point is irrelevant for programs without data race UB.

[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

2017-10-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

--- Comment #5 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #4)
> As for this exact ones, I'm now working on GIMPLE store merging
> improvements, but that of course won't handle this case.
> For RTL I had code to handle this at RTL DSE time, see PR22141 and
> https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html
> The problem was that the patch caused performance regressions on PowerPC and
> it was hard to find a good cost model for it.  Of course, for -Os the cost
> model would be quite simple, but although you count instructions, you were
> reporting this for -O3.

Yeah, fewer total stores, fewer instructions, and smaller code size *is* what
makes this better for performance.  An 8-byte store that doesn't cross a
cache-line boundary has nearly identical cost to a 1-byte store at least on
Intel.

x86 is robust with overlapping stores, although store-forwarding only works for
loads that get all their data from one store (and even then some CPUs have some
alignment restrictions for the load relative to the store).  Still, that
generally means that fewer wider stores are better, because most CPUs can
forward from a 4B store to a byte reload of any of those 4 bytes.


> Doing this at GIMPLE time is impossible, because it is extremely complex
> where exactly the variables are allocated, depends on many flags etc. (e.g.
> -fsanitize=address allocates pads in between them, some targets allocate
> them from top to bottom, others the other way around, ...),

Allocation order is fixed for a given target?  Ideally we'd allocate locals to
pack them together well to avoid wasted padding, and/or put ones used together
next to each other for possible SIMD (including non-loop XMM stuff like a pair
of `double`s or copying a group of integer locals into a struct).  (In case of
a really large local array, you want variables used together in the same page
and same cache line.)

Considering all the possibilities might be computationally infeasible though,
especially if the typical gains are small.

> -fstack-protector* might protect some but not others and thus allocate in
> different buckets, alignment could play roles etc.

Anyway, sounds like it would make more sense to look for possibilities like
this in RTL when deciding how to lay out the local variables.  For x86 it seems
gcc sorts them by size?  Changing the order of declaration changes the order of
the stores, but not the locations.

[Bug tree-optimization/82732] malloc+zeroing other than memset not optimized to calloc, so asm output is malloc+memset

2017-10-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82732

--- Comment #3 from Peter Cordes  ---
(In reply to Marc Glisse from comment #2)
> If you use size_t consistently (for size and i), then the resulting code is
> a call to calloc.

Ah interesting.

With a compile-time constant size and -O3 we get calloc as well, even with the
original types, so that's a good thing.
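
For reference, a sketch of the size_t-consistent version (my code, not
necessarily the exact godbolt test case):

#include <stdlib.h>

int *foo_sizet(size_t size)
{
    int *p = (int*)malloc(size * sizeof(int));
    for (size_t i = 0; i < size; i++)   // with size_t used consistently, gcc emits a tail-call to calloc
        p[i] = 0;
    return p;
}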

[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

2017-10-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

--- Comment #3 from Peter Cordes  ---
Oh also, why is MSP430 using 3 byte-stores instead of a mov.w + mov.b for
storing ab[]?  (on the godbolt link in the initial report)


   # msp430-gcc 6.2.1.16) 6.2.1 20161212
MOV.W   #25185, 6(R1)
MOV.W   #99, 8(R1)   # abc[]

MOV.B   #97, 3(R1)
MOV.B   #98, 4(R1)
MOV.B   #0, 5(R1)# ab[]

MOV.B   #97, 1(R1)
MOV.B   #0, 2(R1)# a[]

Even if alignment is required (IDK), either the first two or last two mov.b
instructions for ab[] could combine into a mov.w, like is done for abc[].  Is
that a target bug?

MSP430 is on Godbolt and it's not a RISC with word size > largest immediate, so
I was looking at it to see if it was just an x86 missed optimization.

Like I was saying for ARM, gcc seems to do a poor job on many RISC ISAs with
this, given the redundancy between strings.

[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

2017-10-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

--- Comment #2 from Peter Cordes  ---
(In reply to Richard Biener from comment #1)
> The issue is we have no merging of stores at the RTL level and the GIMPLE
> level doesn't know whether the variables will end up allocated next to each
> other.

Are bug reports like this useful at all?  It seems that a good fraction of the
missed-optimization bugs I file are things that gcc doesn't really have the
infrastructure to find.  I'm hoping it's helping to improve gcc in the long
run, at least.  I guess I could try to learn more about gcc internals to find
out why it misses them on my own before filing, but either way it seems
potentially useful to document efficient asm possibilities even if gcc's
current design makes it hard to take advantage.


Anyway, could GIMPLE notice that multiple small objects are being written and
hint to RTL that it would be useful to allocate them in a certain way?  (And
give RTL a merged store that RTL would have to split if it decides not to?)

Or a more conservative approach could still be an improvement.  Can RTL realize
that it can use 4-byte stores that overlap into not-yet-initialized or
otherwise dead memory?

For -march=haswell  or generic we get 

movl    $97, %edx
movl    $25185, %eax   # avoid an LCP stall on Nehalem or earlier
movw    %dx, 7(%rsp)
... lea
movl    $6513249, 12(%rsp)
movw    %ax, 9(%rsp)
movb    $0, 11(%rsp)

This is pretty bad for code-size, and this would do the same thing with no
merging between objects, just knowing when to allow overlap into other objects.

movl    $0x61, 7(%rsp)      # imm32 still shorter than a mov imm32 -> reg and 16-bit store
movl    $0x6261, 9(%rsp)
movl    $0x636261, 12(%rsp)


(Teaching gcc that mov $imm16 is safe on Sandybridge-family is a separate bug,
I guess.  It's only other instructions with an imm16 that LCP stall, unlike on
Nehalem and earlier where mov $imm16 is a problem too.  Silvermont marks
instruction lengths in the cache to avoid LCP stalls entirely, and gcc knows
that.)

[Bug tree-optimization/82732] New: malloc+zeroing other than memset not optimized to calloc, so asm output is malloc+memset

2017-10-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82732

Bug ID: 82732
   Summary: malloc+zeroing other than memset not optimized to
calloc, so asm output is malloc+memset
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

#include <stdlib.h>
#include <string.h>

int *foo(unsigned size)
{
int *p = malloc(size*sizeof(int));
//memset(p,0, size*sizeof(int));

for (unsigned i=0; i<size; i++) {
p[i]=0;
}
return p;
}

gcc -O3 -march=haswell    https://godbolt.org/g/bpGHoa

pushq   %rbx
movl    %edi, %edi   # zero-extend
movq    %rdi, %rbx   # why 64-bit operand-size here?
salq    $2, %rdi
call    malloc

movq    %rax, %rcx
testl   %ebx, %ebx   # check that size was non-zero before looping
je      .L6
leal    -1(%rbx), %eax
movq    %rcx, %rdi
xorl    %esi, %esi
leaq    4(,%rax,4), %rdx  # redo the left-shift
call    memset
movq    %rax, %rcx
.L6:
movq    %rcx, %rax   # this is dumb: either way we get here, the malloc return value is already in %rax.  memset returns it.
popq    %rbx
ret

So gcc figures out that this is malloc+memset, but I guess not until after the
pass that recognizes that as calloc.


But with the explicit memset uncommented and gcc -O3, the zeroing loop
optimizes away as well and we do get calloc:

foo:
movl    %edi, %edi
movl    $1, %esi
salq    $2, %rdi
jmp calloc

Unfortunately at -O2 we still get a loop that stores 4 bytes at a time, *after
calloc*.  I know -O2 doesn't enable all the optimizations, but I thought it
would do better than this for "manual" zeroing loops.

[Bug target/82731] New: _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2017-10-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

Bug ID: 82731
   Summary: _mm256_set_epi8(array[offset[0]], array[offset[1]],
...) byte gather makes slow code, trying to
zero-extend all the uint16_t offsets first and
spilling them.
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization, ssemmx
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

#include "immintrin.h"
#include "inttypes.h"

__m256i gather(char *array, uint16_t *offset) {

  return _mm256_set_epi8(array[offset[0]], array[offset[1]], array[offset[2]],
array[offset[3]], array[offset[4]], array[offset[5]], array[offset[6]],
array[offset[7]],
  array[offset[8]],array[offset[9]],array[offset[10]],array[offset[11]],
array[offset[12]], array[offset[13]], array[offset[14]], array[offset[15]], 
  array[offset[16]],array[offset[17]], array[offset[18]],
array[offset[19]], array[offset[20]], array[offset[21]], array[offset[22]],
array[offset[23]], 
  array[offset[24]],array[offset[25]],array[offset[26]], array[offset[27]],
array[offset[28]], array[offset[29]], array[offset[30]],array[offset[31]]);
}

https://stackoverflow.com/questions/46881656/avx2-byte-gather-with-uint16-indices-into-a-m256i

https://godbolt.org/g/LEVVwt


pushq   %rbp
movq    %rsp, %rbp
pushq   %r15
pushq   %r14
pushq   %r13
pushq   %r12
pushq   %rbx
andq    $-32, %rsp
subq    $40, %rsp
movzwl  40(%rsi), %eax
... # more movzwl
movq    %rax, 32(%rsp)   # spill
movzwl  38(%rsi), %eax   # and reuse
... # more movzwl
movzwl  46(%rsi), %r8d
movq    %rax, 24(%rsp)   # spill
movzwl  36(%rsi), %eax
movzwl  42(%rsi), %edx
movq    %rax, 16(%rsp)
movzwl  34(%rsi), %eax
...

...
vpinsrb $1, (%rdi,%r9), %xmm6, %xmm6
vpinsrb $1, (%rdi,%rcx), %xmm5, %xmm5
movq    24(%rsp), %rcx  # more reloading
vpunpcklwd  %xmm6, %xmm3, %xmm3
movzbl  (%rdi,%rcx), %edx   # and using as a gather index
movq    8(%rsp), %rcx
vpunpcklwd  %xmm5, %xmm1, %xmm1
vpunpckldq  %xmm3, %xmm2, %xmm2
vmovd   %edx, %xmm0
movzbl  (%rdi,%rcx), %edx
vpinsrb $1, (%rdi,%rbx), %xmm0, %xmm0

I think gcc is missing the point of vpinsrb, and making too many separate dep
chains which it then has to shuffle together.  vpinsrb doesn't have such high
throughput on any CPU that you'd need more than 2 or 3 dep chains to max out
its 1 or 2 per clock throughput.

But the main point here is doing all the zero-extension of offset[0..31] before
doing *any* of the loads from array[], running out of registers and spilling.

See also discussion on that SO question about byte gathers and possibilities
for VPGATHERDD being maybe worth it on Skylake.

[Bug target/82730] New: extra store/reload of an XMM for every byte extracted

2017-10-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82730

Bug ID: 82730
   Summary: extra store/reload of an XMM for every byte extracted
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization, ssemmx
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

void p128_as_u8hex(__m128i in) {
_Alignas(16) uint8_t v[16];
_mm_store_si128((__m128i*)v, in);

printf("v16.u8: %#x %#x %#x %#x | %#x %#x %#x %#x | %#x %#x %#x %#x | %#x
%#x %#x %#x\n",
   v[0], v[1],  v[2],  v[3],  v[4],  v[5],  v[6],  v[7],
   v[8], v[9], v[10], v[11], v[12], v[13], v[14], v[15]);
}

https://godbolt.org/g/yoikq9
-O3  (or -march= anything with -mno-sse4 for pextrb)

subq    $288, %rsp   # 288 bytes!!!
movl    $.LC0, %edi
movaps  %xmm0, 8(%rsp)   # store
movdqa  8(%rsp), %xmm6   # reload twice...
movdqa  8(%rsp), %xmm1
movaps  %xmm6, 184(%rsp) # spill somewhere else
movzbl  199(%rsp), %eax  # v[15]
movaps  %xmm1, 264(%rsp)
movzbl  8(%rsp), %esi    # v[0]
movaps  %xmm1, 248(%rsp)
...
pushq   %rax # v[15]

movdqa  16(%rsp), %xmm7
movaps  %xmm7, 176(%rsp)
movzbl  190(%rsp), %eax
pushq   %rax # v[14]

movdqa  24(%rsp), %xmm0
movaps  %xmm0, 168(%rsp)
movzbl  181(%rsp), %eax
pushq   %rax
...
xorl    %eax, %eax
call    printf
addq    $376, %rsp
ret

This is pretty hilariously bad, especially compared to the scalar code that
gcc6.3 produces:

subq    $32, %rsp
movq    %xmm0, %r9
movq    %xmm0, %rcx
# ok this is a bit silly vs. a scalar mov.
# very few CPUs can do parallel movq so there's a resource-conflict anyway making this no better than a GP->GP mov
movaps  %xmm0, 8(%rsp)
movq    16(%rsp), %rax    # high half
shrq    $32, %r9
shrq    $16, %rcx
movq    %xmm0, %r8
movq    %xmm0, %rdx
movzbl  %cl, %ecx
movzbl  %r8b, %esi
movzbl  %dh, %edx # using dh to save on shifts
movzbl  %r9b, %r9d
shrl    $24, %r8d
movq    %rax, %rdi
shrq    $56, %rdi
pushq   %rdi
...

Not perfect (related to bug 67072), but at least doesn't do a chain of vector
copies all over the place.



OTOH, we could vectorize the unpack and store to stack memory in 16B chunks. 
This is much more profitable for 32-bit mode, where all args are stack args,
and where a 16B vector holds 4 args instead of 2.  e.g. movzxbd or 2-step
punpck with zeros.

For printing as 32-bit or 64-bit integers, we can just store the vector to the
stack instead of getting each element out separately!  (Should I report that as
a separate missed optimization, for 

void p128_as_u32hex(__m128i in) {
//const uint32_t *v = (const uint32_t*)&in;
alignas(16) uint32_t v[4];
_mm_store_si128((__m128i*)v, in);
printf("v4.u32: %#x %#x %#x %#x\n", v[0], v[1], v[2], v[3]);
}

where we get (with gcc -O3 -m32)

pshufd  $255, %xmm0, %xmm1
movd    %xmm1, %eax
movdqa  %xmm0, %xmm1
pushl   %eax
punpckhdq   %xmm0, %xmm1
movd    %xmm1, %eax
pshufd  $85, %xmm0, %xmm1
pushl   %eax
...

instead of a single movaps store.  Or for printing as uint64_t, we get

movhps  %xmm0, 20(%esp)
pushl   24(%esp)
pushl   24(%esp)
movq    %xmm0, 28(%esp)
pushl   32(%esp)
pushl   32(%esp)

[Bug tree-optimization/82729] New: adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

2017-10-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729

Bug ID: 82729
   Summary: adjacent small objects can be initialized with a
single store (but aren't for char a[] = "a")
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

void ext(char *, char *, char *);

void foo(void) {
char abc[] = "abc";
char ab[] = "ab";
char a[] = "a";
ext(a, ab, abc);
}

gcc 8.0.0 20171024 -O3   https://godbolt.org/g/mFNUgn

foo:    # -march=bdver3 to avoid moving to 32-bit registers first
subq    $24, %rsp
leaq    12(%rsp), %rdx
leaq    9(%rsp), %rsi
leaq    7(%rsp), %rdi

# these 4 stores only need 2 instructions
movl    $6513249, 12(%rsp)
movw    $25185, 9(%rsp)
movb    $0, 11(%rsp)   # last byte of ab[]
movw    $97, 7(%rsp)


call    ext
addq    $24, %rsp
ret

-march=haswell still avoids movw $imm16, (mem), even though Haswell doesn't
have LCP stalls.  But that's not what this bug is about.

A single  push imm32  or  mov $imm32, r/m64  could store a[] and ab[], because
sign-extension will produce 4 bytes of zeros in the high half.  We only need
one of those zeros to terminate the string.  If you don't want to waste the
extra 3 bytes of padding, simply have the next store overlap it.

Or keeping the layout identical:

...
movq    $0x62610061, 7(%rsp)   # zero some of the bytes for abc[]
 #memory at 7(%rsp) = 'a', 0, 'a', 'b', 0, 0 (rsp+12), 0, 0
movl    $6513249, 12(%rsp)     # then initialize abc[]
...

x86 CPUs generally have good support for overlapping stores.  e.g.
store-forwarding still works from the movq to a load of a[] or ab[], and also
works from the movl to a load from abc[].

related: bug 82142, padding in structs stopping store merging.  But this isn't
padding, it's merging across separate objects that are / can be placed next to
each other on the stack.





On ARM, we can take advantage of redundancy between the string data as well
instead of using a string constant for ab[] and a literal pool with a pointer +
abc[].

# This is dumb:  ARM gcc 6.3.0
.L3:
.word   .LC0  # Should just store ab[] literally here
.word   6513249
.LC0:
.ascii  "ab\000"

You can also do stuff like  add r1, r1, 'c' LSL 16  to append the 'c' byte if
you have "ab" in a register.  Or if it's a common suffix instead of prefix,
left-shift and add.  Or start with the 4-byte object (including the terminating
zero) and AND out the characters.  IDK if this is a common enough pattern to
spend time searching for that much redundancy between constant initializers.

But I think on x86 it would be a good idea to zero a register instead of doing
more than 2 or 3 repeated  movq $0, (mem)   especially when the addressing mode
is RIP-relative (can't micro-fuse immediate + RIP-relative addressing mode), or
otherwise uses a 32-bit displacement.  (Code-size matters.)

[Bug target/82680] Use cmpXXss and cmpXXsd for setcc boolean compare

2017-10-24 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82680

--- Comment #2 from Peter Cordes  ---
gcc's sequence is *probably* good, as long as it uses xor / comisd / setcc and
not comisd / setcc / movzx (which gcc often likes to do for integer setcc).

(u)comisd and cmpeqsd both run on the FP add unit.  Agner Fog doesn't list the
latency.  (It's hard to measure, because you'd need to construct a round-trip
back to FP.)  XOR-zeroing is as cheap as a NOP on Intel SnB-family, but uses an
execution port on AMD, so gcc's sequence is the same front-end uops but fewer
unfused-domain uops for the execution units on SnB.  Also, the xor-zeroing is
off the critical path on all CPUs.  (But ucomisd latency is probably as high as
cmpeqsd + movd).

Hmm, AMD bdver* and Ryzen take 2 uops for comisd, so for tune=generic it's
probably worth thinking about using ICC's sequence.

ICC's sequence is especially good if you're doing something with the integer
result that can optimize away the NEG.  (e.g. use it with AND instead of a CMOV
to conditionally zero something, or AND it with another condition).  Or if
you're storing the boolean result to memory, psrld $31, %xmm0 or PAND, then
movd directly to memory without going through integer regs.
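
A minimal sketch of that branchless idea (my code, not ICC's exact output),
using the psrld $31 variant to get a 0/1 result without SETCC:

#include <immintrin.h>

int dbl_eq(double x, double y)
{
    __m128i m = _mm_castpd_si128(_mm_cmpeq_sd(_mm_set_sd(x), _mm_set_sd(y))); // all-ones if x == y
    return _mm_cvtsi128_si32(_mm_srli_epi32(m, 31));  // logical shift brings the sign bit down to 0/1
}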


comisd doesn't destroy either of its args, but cmpeqsd does (without AVX).  If
you want both x and y afterwards (e.g. if they weren't equal, or you care about
-0.0 and +0.0 being different even though they compare equal), then comisd is a
win.

So I think we need to look at the choices given some more surrounding code.

I'll hopefully look at this some more soon.

[Bug target/82668] New: could use BMI2 rorx for unpacking struct { int a,b }; from a register (SysV ABI)

2017-10-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82668

Bug ID: 82668
   Summary: could use BMI2 rorx for unpacking struct { int a,b };
from a register (SysV ABI)
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*

struct twoint {
int a, b;
};

int bar(struct twoint s) {
return s.a + s.b;
}

https://godbolt.org/g/4ygAMm

movq    %rdi, %rax
sarq    $32, %rax
addl    %edi, %eax
ret

But we could have used

rorx   $32, %rdi, %rax   # 1 uop 1c latency
addl    %edi, %eax
ret

rorxq is only 1 uop, vs. 2 for mov + sar.  It also saves a byte: replacing a
3-byte MOV + a 4-byte SAR with a 6-byte RORX.

Without BMI2, we can shorten critical path if mov isn't zero latency, from 3 to
2 cycles (and save a byte on the REX prefix for the mov):

movl    %edi, %eax
sarq    $32, %rdi
addl    %edi, %eax
ret

This would be a better choice in general, especially for tune=generic.



Also related (let me know if I should report separately, or if gcc knowing how
to use rotate to swap struct members would fix this too):

// only needs one call-preserved reg and a rotate.
long foo(int a /* edi */, int b /* esi */)
{
struct_arg ( (struct twoint){a,b});
struct_arg ( (struct twoint){b,a});
return 0;
}

gcc saves two call-preserved registers so it can save a and b separately, and
shift+OR them together each time.

pushq   %rbp
movl    %edi, %ebp
pushq   %rbx
movl    %esi, %ebx
movq    %rbx, %rdi
salq    $32, %rdi
subq    $8, %rsp
orq     %rbp, %rdi
call    struct_arg
movq    %rbp, %rdi
salq    $32, %rdi
orq     %rbx, %rdi
call    struct_arg
addq    $8, %rsp
xorl    %eax, %eax
popq    %rbx
popq    %rbp
ret


This is sub-optimal in two ways: first, on Intel SnB-family (but not silvermont
or any AMD), SHRD is efficient (1 uop, 1c latency, runs on port1 only instead
of p06 for other shifts/rotates).  SHL + SHRD may be better than mov + shl +
or.

Second, because instead of redoing the creation of the struct, we can rotate
the first one.  Even writing it as a swap of the members of a struct (instead
of creation of a new struct) doesn't help.

Anyway, I think this would be better

pushq   %rbx
shl     $32, %rdi
shrd    $32, %rsi, %rdi   # SnB-family alternative to mov+shl+or

rorx    $32, %rdi, %rbx   # arg for 2nd call
call    struct_arg
movq    %rbx, %rdi
call    struct_arg

xorl    %eax, %eax
popq    %rbx
ret

I didn't check whether I got the correct arg as the high half, but that's not
the point.
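
(In C, swapping the two 32-bit halves of the register-passed struct is just a
32-bit rotate of the 64-bit representation; gcc and clang already recognize the
shift-or rotate idiom, so a sketch like this should compile to a single rotate,
or rorx with BMI2.  Names are mine, for illustration:)

#include <stdint.h>

static inline uint64_t swap_halves(uint64_t both)
{
    return (both >> 32) | (both << 32);   // rotate by 32: {a,b} -> {b,a}
}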

[Bug target/82667] New: SSE2 redundant pcmpgtd for sign-extension of values known to be >= 0

2017-10-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82667

Bug ID: 82667
   Summary: SSE2 redundant pcmpgtd for sign-extension of values
known to be >= 0
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization, ssemmx
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

long long sumarray(const int *data)
{
data = (const int*)__builtin_assume_aligned(data, 64);
long long sum = 0;
for (int c=0 ; c<32768 ; c++)
sum += (data[c] >= 128 ? data[c] : 0);

return sum;
}

// Same function as pr 82666, see that for scalar cmov choices.

Same result with

if (data[c] >= 128)
sum += data[c];

https://godbolt.org/g/NwcPmh

gcc 8.0.0 20171022 -O3

movdqa  .LC0(%rip), %xmm5   # set1(127)
leaq    131072(%rdi), %rax
pxor    %xmm2, %xmm2        # accumulator
pxor    %xmm4, %xmm4        # for Intel CPUs we should re-materialize with pxor inside the loop instead of movdqa.  But not AMD
.L2:
movdqa  (%rdi), %xmm0
addq    $16, %rdi
movdqa  %xmm0, %xmm1
pcmpgtd %xmm5, %xmm1
pand    %xmm1, %xmm0
# so far so good: we have conditionally zeroed xmm0

movdqa  %xmm4, %xmm1
pcmpgtd %xmm0, %xmm1        # 0 > x to generate high-half for sign-extension
movdqa  %xmm0, %xmm3

punpckldq   %xmm1, %xmm3    # unpack with compare result
punpckhdq   %xmm1, %xmm0    # (instead of just zero)
paddq   %xmm3, %xmm2
paddq   %xmm0, %xmm2
cmpq    %rdi, %rax
jne     .L2
movdqa  %xmm2, %xmm0
psrldq  $8, %xmm0           # requires a wasted movdqa vs. pshufd or movhlps
paddq   %xmm2, %xmm0
movq    %xmm0, %rax
ret

There are multiple inefficiencies that I pointed out in comments, but this bug
report is about doing sign extension when we can prove that simple zero
extension is sufficient.  Negative numbers are impossible from (x>=128 ? x :
0).
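
A hedged sketch (my code, not proposed gcc output) of what the inner loop can
do once we know zero-extension is sufficient: widen by unpacking against zero
instead of against a second pcmpgtd result.

#include <immintrin.h>

long long sumarray_sketch(const int *data)   // data assumed 16B-aligned, like the original
{
    __m128i sum  = _mm_setzero_si128();
    const __m128i zero = _mm_setzero_si128();
    const __m128i c127 = _mm_set1_epi32(127);
    for (int c = 0; c < 32768; c += 4) {
        __m128i v = _mm_load_si128((const __m128i*)(data + c));
        v   = _mm_and_si128(v, _mm_cmpgt_epi32(v, c127));       // x >= 128 ? x : 0
        sum = _mm_add_epi64(sum, _mm_unpacklo_epi32(v, zero));  // zero-extend to qwords
        sum = _mm_add_epi64(sum, _mm_unpackhi_epi32(v, zero));
    }
    sum = _mm_add_epi64(sum, _mm_unpackhi_epi64(sum, sum));     // hsum of the two qwords
    return _mm_cvtsi128_si64(sum);
}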

Changing the source to do zero-extension but still a signed compare stops it
from auto-vectorizing.

int to_add = (data[c] >= 128 ? data[c] : 0);
unsigned tmp = to_add;
sum += (unsigned long long)tmp;  // zero-extension

Making everything unsigned does zero-extension as expected, but if the
comparison is signed, it either fails to auto-vectorize or it still uses
sign-extension.

e.g. this auto-vectorizes with sign-extension, but if you change the constant
to -128, it won't auto-vectorize at all (because then sign and zero extension
are no longer equivalent).

int to_add = (data[c] >= 128 ? data[c] : 0);
unsigned tmp = to_add;
unsigned long long tmp_ull = tmp;  // zero-extension
long long tmp_ll = tmp_ull;
sum += tmp_ll;

[Bug tree-optimization/82666] New: [7/8 regression]: sum += (x>128 ? x : 0) puts the cmov on the critical path (at -O2)

2017-10-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82666

Bug ID: 82666
   Summary: [7/8 regression]: sum += (x>128 ? x : 0) puts the cmov
on the critical path (at -O2)
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

long long sumarray(const int *data)
{
data = (const int*)__builtin_assume_aligned(data, 64);
long long sum = 0;
for (int c=0 ; c<32768 ; c++)
sum += (data[c] >= 128 ? data[c] : 0);

return sum;
}

The loop body is written to encourage gcc to make the loop-carried dep chain
just an ADD, with independent branchless zeroing of each input.  But
unfortunately, gcc7 and gcc8 -O2 de-optimize it back to what we get with older
gcc -O3 from

if (data[c] >= 128)  // doesn't auto-vectorize with gcc4, unlike the above
    sum += data[c];

See also
https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-then-o2.


https://godbolt.org/g/GgVp7E
gcc8.0 8.0.0 20171022  -O2 -mtune=haswell  (slow)

leaq    131072(%rdi), %rsi
xorl    %eax, %eax
.L3:
movslq  (%rdi), %rdx
movq    %rdx, %rcx
addq    %rax, %rdx  # mov+add could have been LEA
cmpl    $127, %ecx
cmovg   %rdx, %rax  # sum = (x>=128 ? sum+x : sum)
addq    $4, %rdi
cmpq    %rsi, %rdi
jne .L3
ret

This version has a 3 cycle latency loop-carried dep chain, (addq %rax, %rdx 
and cmov).  It's also 8 fused-domain uops (1 more than older gcc) but using LEA
would fix that.


gcc6.3 -O2 -mtune=haswell (last good version of gcc on Godbolt, for this test)

leaq    131072(%rdi), %rsi
xorl    %eax, %eax
xorl    %ecx, %ecx  # extra zero constant for a cmov source
.L3:
movslq  (%rdi), %rdx
cmpl    $127, %edx
cmovle  %rcx, %rdx  # rdx = 0 when x < 128
addq    $4, %rdi
addq    %rdx, %rax  # sum += ... critical path 1c latency
cmpq    %rsi, %rdi
jne .L3
ret

7 fused-domain uops in the loop (cmov is 2 with 2c latency before Broadwell). 
Should run at 1.75 cycles per iter on Haswell (or slightly slower due to an odd
number of uops in the loop buffer), bottlenecked on the front-end.  The latency
bottleneck is only 1 cycle.  (Which Ryzen might come closer to.)

Anyway, on Haswell (with -mtune=haswell), the function should be more than 1.5x
slower with gcc7/8 than with gcc6 and earlier.

Moreover, gcc should try to optimize something like this:

if (data[c] >= 128)
sum += data[c];

into conditionally zeroing a register instead of using a loop-carried cmov dep
chain.

[Bug target/82582] New: not quite optimal code for -2*x*y - 3*z: could use one less LEA for smaller code without increasing critical path latency for any input

2017-10-17 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82582

Bug ID: 82582
   Summary: not quite optimal code for -2*x*y - 3*z: could use one
less LEA for smaller code without increasing critical
path latency for any input
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

int foo32(int x, int y, int z) {
return -2*x*y - 3*z;
}

gcc8.0.0 20171015 -O3   https://godbolt.org/g/tzBuHx

imull   %esi, %edi           # x*y
leal    0(,%rdx,4), %eax     # needs a disp32 = 0
subl    %eax, %edx           # -3*z
negl    %edi                 # -(x*y)
leal    (%rdx,%rdi,2), %eax  # result

LEA runs on limited ports, and an index with no base needs a 4-byte disp32 = 0.
The critical-path latencies, assuming 2-operand imul is 3 cycles like on Intel:

x->res: imul, neg, lea = 5c
y->res: imul, neg, lea = 5c
z->res:  lea, sub, lea = 3c

This is better than gcc6.3 / gcc7.2 (which uses 3 LEA and is generally worse). 
It's also different from gcc4/gcc5 (6c from x to result, but only 2c from z to
result, so it's different but not worse or better in all cases).


clang5.0 does better: same latencies, smaller code size, and trades one LEA for
an ADD:
imull   %esi, %edi
addl    %edi, %edi
leal    (%rdx,%rdx,2), %eax
negl    %eax
subl    %edi, %eax

x->res: imul, add, sub = 5c
y->res: imul, add, sub = 5c
z->res:  lea, neg, sub = 3c



related: poor code-gen for 32-bit code with this.  I haven't checked other
32-bit architectures.

long long foo64(int x, int y, int z) {
return -2LL*x*(long long)y - 3LL*(long long)z;
}
// also on the godbolt link

gcc -m32 uses a 3-operand imul-immediate for `-2`, but some clunky shifting for
`-3`.  There's also a mull in there.

clang5.0 -m32 makes very nice code, using a one-operand imul for -3 and just
shld/add + sub/sbb (plus some mov instructions).  One-operand mul/imul is 3
uops on Intel with 2 clock throughput, but ADC is 2 uops on Intel
pre-Broadwell, so it's nice to avoid that.

related: add %esi,%esi / sbb %edi,%edi  is an interesting way to sign-extend a
32-bit input into a pair of registers while doubling it.  However, if it starts
in eax,  cltd / add %eax,%eax is much better.  (sbb same,same is only
recognized as dep-breaking on AMD Bulldozer-family and Ryzen.  On Intel it has
a false dep on the old value of the register, not just CF).

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2017-10-06 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

--- Comment #1 from Peter Cordes  ---
BTW, if we *are* using vpmovwb, it supports a memory operand.  It doesn't save
any front-end uops on Skylake-avx512, just code-size.  Unless it means less
efficient packing in the uop cache (since all uops from one instruction have to
go in the same line) it should be better to fold the stores than to use
separate store instructions.

vpmovwb %zmm0,(%rcx)
vpmovwb %zmm1, 32(%rcx)

is 6 fused-domain uops (2 * 2 p5 shuffle uops, 2 micro-fused stores), according
to IACA.

It's possible to coax gcc into emitting it with intrinsics, but only with a -1
mask:

// https://godbolt.org/g/SBZX1W
void vpmovwb(__m512i a, char *p) {
  _mm256_storeu_si256(p, _mm512_cvtepi16_epi8(a));
}
vpmovwb %zmm0, %ymm0
vmovdqu64   %ymm0, (%rdi)
ret

void vpmovwb_store(__m512i a, char *p) {
  _mm512_mask_cvtepi16_storeu_epi8(p, -1, a);
}
vpmovwb %zmm0, (%rdi)
ret

clang is the same here, not using a memory destination unless you hand-hold it
with a -1 mask.


Also note the lack of vzeroupper here, and in the auto-vectorized function,
even with an explicit -mvzeroupper.

[Bug target/82460] New: AVX512: choose between vpermi2d and vpermt2d to save mov instructions. Also, fails to optimize away shifts before shuffle

2017-10-06 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82460

Bug ID: 82460
   Summary: AVX512: choose between vpermi2d and vpermt2d to save
mov instructions.  Also, fails to optimize away shifts
before shuffle
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization, ssemmx
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

#include <immintrin.h>

// gcc  -O3 -march=skylake-avx512 -mavx512vbmi8.0.0 20171004
// https://godbolt.org/g/fVt4Kb

__m512i vpermi2d(__m512i t1, __m512i control, char *src) {
  return _mm512_permutex2var_epi32(control, t1, _mm512_loadu_si512(src));
}
vpermt2d    (%rdi), %zmm0, %zmm1
vmovdqa64   %zmm1, %zmm0
ret

  clang emits  vpermi2d  (%rdi), %zmm1, %zmm0

__m512i vpermi2b(__m512i t1, __m512i a, __m512i b) {
  return _mm512_permutex2var_epi8(a, t1, b);
}
vpermt2b    %zmm2, %zmm0, %zmm1
vmovdqa64   %zmm1, %zmm0
ret

  clang emits  vpermi2b  %zmm2, %zmm1, %zmm0


This one compiles ok, though:

__m512i vpermt2d(__m512i t1, __m512i control, char *src) {
  return _mm512_permutex2var_epi32(t1, control, _mm512_loadu_si512(src));
}
vpermt2d    (%rdi), %zmm1, %zmm0


---


But when auto-vectorizing this with AVX512VBMI (see bug 82459 for AVX512BW
missed optimizations), gcc uses vpermi2b when vpermt2b would be better:

void pack_high8_baseline(uint8_t *__restrict__ dst, const uint16_t
*__restrict__ src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
 *dst++ = *src++ >> 8;
  } while(dst < end_dst);
}


.L9:
vmovdqa64   (%rsi,%rax,2), %zmm0
vmovdqa64   64(%rsi,%rax,2), %zmm1
vmovdqa64   %zmm2, %zmm3         # copy the index
vpsrlw      $8, %zmm0, %zmm0
vpsrlw      $8, %zmm1, %zmm1
vpermi2b    %zmm1, %zmm0, %zmm3  # then destroy it
vmovdqu8    %zmm3, (%rcx,%rax)   # extra uop according to Intel: bug 82459
addq        $64, %rax
cmpq        %rax, %rdi
jne         .L9

Of course, the shifts are redundant when we have a full byte shuffle that
doesn't do any saturating:

# different shuffle control in zmm1
.L9:
vmovdqa64   (%rsi,%rax,2), %zmm0
vpermt2b    64(%rsi,%rax,2), %zmm1, %zmm0
vmovdqu64   %zmm0, (%rcx,%rax)
addq        $64, %rax
cmpq        %rax, %rdi
jne         .L9

If unrolling, use pointer increments so the shuffle can maybe avoid
un-lamination, although some multi-uop instructions don't micro-fuse in the
first place.

vpermt2w is 3 uops on Skylake-AVX512 (p0 + 2p5), so we should expect vpermt2b
to be at least that slow on the first CPUs that support it.  On a CPU where
vpermt2b is p0 + 2p5, this loop will run at about one store per 2 clocks, the
same as what you can achieve with 2x shift + vpackuswb + vpermq (bug 82459). 
But this has one fewer p0 uop.

With indexing from the end of the arrays to save the CMP, this could also be 7
fused-domain uops for the front-end (assuming no micro-fusion for the vpermt2b
+ load), but assuming the store does fuse.

[Bug target/82459] New: AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2017-10-06 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

Bug ID: 82459
   Summary: AVX512F instruction costs: vmovdqu8 stores may be an
extra uop, and vpmovwb is 2 uops on Skylake and not
always worth using
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

gcc bottlenecks on shuffle uops when auto-vectorizing this for skylake-avx512

* Perhaps the cost model is wrong for vpmovwb (it's 2 port5 uops), or gcc
doesn't consider any cheaper alternatives.  My version with 2x shift +
vpacksswb + vpermq has 3x the theoretical throughput (with hot caches).  In
general, AVX512BW lane crossing shuffles of 8 or 16-bit elements are multi-uop
on SKX, but in-lane byte/word shuffles are single-uop just like their AVX2
versions.

* Using vmovdqu8 as a store costs a port5 ALU uop even with no masking,
according to Intel (not tested).  We should always use AVX512F vmovdqu32 or 64
for unmasked loads/stores, not AVX512BW vmovdqu8 or 16.  Intel's docs indicate
that current hardware doesn't handle unmasked vmovdqu8/16 as efficiently as
32/64, and there's no downside.

* Using vinserti64x4 instead of 2 separate stores is worse because it makes the
shuffle bottleneck worse, and 2 stores wouldn't bottleneck on load/store
throughput.  (Avoiding vpmovwb makes this moot in this case, but presumably
whatever decided to shuffle + store instead of store + store will make that
mistake in other cases too.)

 SKX shuts down port 1 (except for scalar integer) when there are 512b uops in
flight, so extra loads/stores are relatively cheaper than using more ALU uops,
compared to 256b or 128b vectors where the back-end can keep up even when 3 of
the 4 uops per clock are vector-ALU (if they go to different ports).

#include <stdint.h>
#include <stddef.h>
void pack_high8_baseline(uint8_t *__restrict__ dst, const uint16_t
*__restrict__ src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
 *dst++ = *src++ >> 8;
  } while(dst < end_dst);
}

// https://godbolt.org/g/kXjEp1
gcc8 -O3 -march=skylake-avx512

.L5:  # inner loop
vmovdqa64   (%rsi,%rax,2), %zmm0
vmovdqa64   64(%rsi,%rax,2), %zmm1
vpsrlw      $8, %zmm0, %zmm0     # memory operand not folded: bug 82370
vpsrlw      $8, %zmm1, %zmm1
vpmovwb     %zmm0, %ymm0         # 2 uops each
vpmovwb     %zmm1, %ymm1
vinserti64x4  $0x1, %ymm1, %zmm0, %zmm0
vmovdqu8    %zmm0, (%rcx,%rax)   # Intel says this is worse than vmovdqu64
addq        $64, %rax
cmpq        %rax, %rdi           # using an indexed addr mode, but still doing separate add/cmp
jne         .L5

IACA says gcc's loop will run at one 64B store per 6 clocks, bottlenecked on 6
port5 uops (including the vmovdqu8.  vmovdqu64 gives one store per 5 clocks,
still bottlenecked on port5).  Using 2 stores instead of vinserti64x4 gives us
one store per 4 clocks.  (Still twice as slow as with vpacksswb + vpermq, which
produces one 512b vector per 2 shuffle uops instead of one 256b vector per 2
shuffle uops.)
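
For reference, a sketch of that 2x shift + in-lane pack + lane-fixup strategy
(my code, not the exact version from my testing; note it uses the unsigned
pack, since the >>8 results are all in 0..255):

#include <immintrin.h>

// Pack the high bytes of 64 uint16_t (two ZMM inputs) into one ZMM,
// using only 2 shuffle uops (vpackuswb + vpermq with a vector index).
static inline __m512i pack_high8_pair(__m512i a, __m512i b)
{
    a = _mm512_srli_epi16(a, 8);
    b = _mm512_srli_epi16(b, 8);
    __m512i packed = _mm512_packus_epi16(a, b);      // in-lane pack: qwords come out as a0,b0,a1,b1,...
    const __m512i fixup = _mm512_setr_epi64(0, 2, 4, 6, 1, 3, 5, 7);
    return _mm512_permutexvar_epi64(fixup, packed);  // one lane-crossing qword permute fixes the order
}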

See
https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it
for more about Intel's static analysis tool.


related: pr 82370 mentions vectorization strategies for this.

Fortunately gcc doesn't unroll the startup loop to reach an alignment boundary.
 (And BTW, aligned pointers are more important with AVX512 than AVX2, in my
testing with manual vectorization of other code on Skylake-avx512.)  Of course,
a potentially-overlapping unaligned first vector would be much better than a
scalar loop here.



Anyway, does gcc know that vpmovwb %zmm, %ymm is 2 uops for port 5, while
vpackuswb zmm,zmm,zmm in-lane 2-input shuffle is 1 uop (for port 5)?  The xmm
source version is single-uop, because it's in-lane.

Source: Intel's IACA2.3, not testing on real hardware.  SKX port-assignment
spreadsheet:
https://github.com/InstLatx64/InstLatx64/blob/master/AVX512_SKX_PortAssign_v102_PUB.ods
It's based on IACA output for uops, but throughputs and latencies are from real
hardware AIDA64 InstLatx64, with a 2nd column for Intel's published
tput/latency (which as usual doesn't always match).  vpmovwb real throughput is
one per 2 clocks, which is consistent with being 2 uops for p5.

It makes some sense from a HW-design perspective that all lane-crossing
shuffles with element size smaller than 32-bit are multi-uop.  It's cool that
in-lane AVX512 vpshufb zmm  vpacksswb zmm are single-uop, but it means it's
often better to use more instructions to do the same work in fewer total
shuffle uops.  (Any loop that involves any shuffling can *easily* bottleneck on
shuffle throughput.)

Related: AVX512 merge-masking can turn a 1-input shuffle into a 2-input shuffle

[Bug target/82370] AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.

2017-10-06 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370

--- Comment #5 from Peter Cordes  ---
I got off topic with this bug.  It was supposed to be about emitting

   vpsrlw  $8, (%rsi), %xmm1     # load folded into AVX512BW version

instead of

   vmovdqu64   (%rsi), %xmm0     # or VEX vmovdqu; that's where I got off topic
   vpsrlw  $8, %xmm0, %xmm0

[Bug tree-optimization/82432] Missed constant propagation of return values of non-inlined static functions

2017-10-04 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82432

--- Comment #1 from Peter Cordes  ---
Meant to add https://godbolt.org/g/K9CxQ6 before submitting.  And to say I
wasn't sure tree-optimization was the right component.

I did check that -flto didn't do this optimization either.

Is it worth opening a separate bug for making .clone versions of functions with
a more convenient calling convention?  Obviously that can gain performance, but
can make debugging harder.  https://stackoverflow.com/a/46549978/224132 is
right that compilers *could* do this, but there are probably good reasons why
they don't.

[Bug tree-optimization/82432] New: Missed constant propagation of return values of non-inlined static functions

2017-10-04 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82432

Bug ID: 82432
   Summary: Missed constant propagation of return values of
non-inlined static functions
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

static __attribute((noinline)) 
  int get_constant() { /* optionally stuff with side effects */
   return 42; }

movl    $42, %eax
ret

// Consider the case where this is large enough to not inline (even without an
attribute), but still returns a constant.  e.g. a success/fail status that we
can prove is always success, or just the current implementation always returns
success but the callers still check.

int call_constant() { return 10 - get_constant(); }

call    get_constant()
movl    $10, %edx
subl    %eax, %edx
movl    %edx, %eax
ret

Even though the function didn't inline so we still have to call it, its return
value is a compile-time constant.  

   call  get_constant
   mov $(10-42), %eax
   ret

would be a better way to compile this.  It potentially breaks a data dependency
chain, and saves instructions.  And enables further constprop if the caller
isn't trivial and does more with the return value.

For return values passed by hidden pointer, it avoids store-forwarding latency.
 If we want the value in memory, we can use the copy the callee put there.  If
we made a .clone version that uses a custom calling convention, we could have
the callee skip storing the return value if it's constant for all callers. 
(Hmm, checking this could cost a lot of compile time, especially with LTO.  The
simpler version is to only optimize it away for small objects that are really
constant, not just from constant propagation from one caller's args.)


One useful case is returning a std::optional<>.  Even if the .value() is
unknown, it might be known that there *is* a value, so the caller doesn't have
to check the `bool` member.

libstdc++'s optional<T> is not trivially-copyable even if T is, so it returns
via hidden pointer for optional<int>.  (libc++ does implement it that way, so
it returns packed into a register in x86-64, but clang also still checks the
return value when it doesn't inline.
https://stackoverflow.com/a/46546636/224132)

int baz() {
return 1 + get_std_optional_int().value();
}
subq    $24, %rsp
leaq    8(%rsp), %rdi
call    get_std_optional_int()
cmpb    $0, 12(%rsp)
je      .L98
movl    8(%rsp), %eax
addq    $24, %rsp
addl    $1, %eax
ret
baz() [clone .cold.49]:
.L98:
call    abort

This obviously simplifies the call site some if we don't have to check the
return value.

But we still have to provide storage space unless we make a
nonstandard-calling-convention clone of get_std_optional_int() which ideally
returns in %eax and %edx.  (Returning small objects packed less tightly into
multiple registers would probably be a win in general for non-constant return
values, if we want to start cloning static functions and discarding the
ABI-compliant definition.  Or with LTO or whole-program, as this post argues:
https://stackoverflow.com/a/46549978/224132)

[Bug target/82370] AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.

2017-10-04 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370

--- Comment #4 from Peter Cordes  ---
VPANDQ can be shorter than an equivalent VPAND, for displacements > 127 but <=
16 * 127 or 32 * 127, and that are an exact multiple of the vector width.  EVEX
with disp8 always implies a compressed displacement.  (See Intel manual vol.2
2.6.5
Compressed Displacement (disp8*N) Support in EVEX).


# worst case for EVEX: odd displacement forcing a disp32 while VEX can use
disp8
  c5 f9 db 4e 01  vpand  0x1(%rsi),%xmm0,%xmm1
  62 f1 fd 08 db 8e 01 00 00 00   vpandq 0x1(%rsi),%xmm0,%xmm1

# Best case for EVEX, where it wins by byte
# (or two vs. a 3-byte VEX + disp32, e.g. if I'd used %r10)
  c5 09 db be 00 02 00 00 vpand  0x200(%rsi),%xmm14,%xmm15
  62 71 8d 08 db 7e 20vpandq 0x200(%rsi),%xmm14,%xmm15

# But the tables turn with an odd offset, where EVEX has to use disp32
  c5 09 db be ff 01 00 00 vpand  0x1ff(%rsi),%xmm14,%xmm15
  62 71 8d 08 db be ff 01 00 00   vpandq 0x1ff(%rsi),%xmm14,%xmm15

[Bug target/82370] AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.

2017-10-03 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370

--- Comment #3 from Peter Cordes  ---
Doesn't change the performance implications, but I just realized I have the
offset-load backwards.  Instead of
vpsrlw  $8, (%rsi), %xmm1
vpand   15(%rsi), %xmm2, %xmm0

this algorithm should use
vpand   1(%rsi), %xmm2, %xmm0 # ideally with rsi 32B-aligned
vpsrlw  $8, 16(%rsi), %xmm1

Or (with k1 = 0x)
vmovdqu8    1(%rsi),  %zmm0{k1}{z}   # ALU + load micro-fused
vmovdqu8    65(%rsi), %zmm1{k1}{z}   # and probably causes CL-split penalties

Like I said, we should probably avoid vmovdqu8 for loads or stores unless we
actually use masking.  vmovdqu32 or 64 is always at least as good.  If some
future CPU has masked vmovdqu8 without needing an ALU uop, it could be good
(but probably only if it also avoids cache-line split penalties).
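
In C, the corrected-offset version of the intrinsics from the original report
would look something like this (my sketch, untested):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void pack_high8_alignhack_fixed(uint8_t *dst, const uint8_t *src, size_t bytes)
{
  uint8_t *end_dst = dst + bytes;
  do {
     __m128i lo = _mm_and_si128(_mm_loadu_si128((const __m128i*)(src + 1)),
                                _mm_set1_epi16(0x00FF));                          // high bytes of words 0..7
     __m128i hi = _mm_srli_epi16(_mm_loadu_si128((const __m128i*)(src + 16)), 8); // high bytes of words 8..15
     _mm_storeu_si128((__m128i*)dst, _mm_packus_epi16(lo, hi));
     dst += 16;
     src += 32;
  } while (dst < end_dst);
}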

https://godbolt.org/g/a1U7hf

See also https://github.com/InstLatx64/InstLatx64 for a spreadsheet of
Skylake-AVX512 uop->port assignments (but it doesn't include masked loads /
stores), and doesn't match IACA for vmovdqu8 zmm stores (which says even
without masking, the ZMM version uses an ALU uop).

[Bug target/82370] AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.

2017-10-03 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370

--- Comment #2 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #1)
> Created attachment 42296 [details]
> gcc8-pr82370.patch
> 
> If VPAND is exactly as fast as VPANDQ except for different encodings, then
> maybe we can do something like this patch, where we'd use the suffixes only
> for 512-bit vectors, or when any of the operands is %[xy]mm16+, or when
> masking.
> If VPAND is slower, then we could do it for -Os at least.

They're exactly as fast on Skylake-avx512, and no reason to ever expect them to
be slower on any future CPU.

VEX is well-designed and future-compatible because it zeroes out to VLMAX,
whatever that is on the current CPU.  A VEX VPAND can always be decoded to
exactly the same internal uop as a VPANDQ with no masking.  There's no penalty
for mixing VEX and EVEX in general,
 and no reason to expect one (https://stackoverflow.com/q/46080327/224132).

Assemblers already use the VEX encoding whenever possible for FP instructions
like  vandps  15(%rsi), %xmm2, %xmm1  so AVX512VL code will typically contain a
mix of VEX and EVEX.  Related: vpxor %xmm0,%xmm0,%xmm0  is the best way to zero
a ZMM register, saving bytes (and potentially uops on some future AMD-style
CPU).  pr 80636.


KNL doesn't even support AVX512VL, so it can only encode the ZMM version of
VPANDQ.  But according to Agner Fog's testing, VEX VPAND xmm/ymm is the same
tput and latency as EVEX VPANDD/Q.

---

BTW, I was thinking about this again: it might be even better to use
 VMOVDQU8 15(%rsi), %xmm1{k1}{z}# zero-masking load
Or not: IACA says that it uses an ALU uop (port 0/1/5) as well as a load-port
uop, so it's break-even vs. VPAND except for setting up the constant (probably
movabs $0x, %rax; kmov %rax, %k1.  Or maybe load the mask from
memory in one instruction, vs. a broadcast-load of a vector constant.)  It
might possibly save power, or it might use more.

A masked load won't fault from masked elements, so it would actually be safe to
vmovdqu8 -1(%rs1), %zmm1{k1}{z}   but performance-wise that's probably not a
win.  It probably still "counts" as crossing a cache-line boundary on most
CPUs.  And probably quite slow if it has to squash an exception for an unmasked
page based on the mask.  (At least VMASKMOVPS is like that.)

For stores, IACA says vmovdqu8 uses an extra ALU uop even with no masking.  gcc
unfortunately uses that when auto-vectorizing the pure C version:
https://godbolt.org/g/f4bJKd

[Bug target/82370] New: AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.

2017-09-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370

Bug ID: 82370
   Summary: AVX512 can use a memory operand for immediate-count
vpsrlw, but gcc doesn't.
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization, ssemmx
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>
void pack_high8_alignhack(uint8_t *restrict dst, const uint8_t *restrict src,
size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
 __m128i v0 = _mm_loadu_si128((__m128i*)src);
 __m128i v1_offset = _mm_loadu_si128(1+(__m128i*)(src-1));
 v0 = _mm_srli_epi16(v0, 8);
 __m128i v1 = _mm_and_si128(v1_offset, _mm_set1_epi16(0x00FF));
 __m128i pack = _mm_packus_epi16(v0, v1);
 _mm_storeu_si128((__m128i*)dst, pack);
 dst += 16;
 src += 32;  // 32 bytes
  } while(dst < end_dst);
}

pack_high8_alignhack:
vmovdqa64   .LC0(%rip), %xmm2        # pointless EVEX when VEX is shorter
addq        %rdi, %rdx
.L18:
vmovdqu64   (%rsi), %xmm0
vpandq      15(%rsi), %xmm2, %xmm1   # pointless EVEX vs. VPAND
addq        $16, %rdi
addq        $32, %rsi
vpsrlw      $8, %xmm0, %xmm0         # could use a memory source.
vpackuswb   %xmm1, %xmm0, %xmm0
vmovups     %xmm0, -16(%rdi)
cmpq        %rdi, %rdx
ja          .L18
ret

There's no benefit to using VPANDQ (4-byte EVEX prefix) instead of VPAND
(2-byte VEX prefix).  Same for VMOVDQA64.  We should only use the AVX512
version when we need masking, ZMM register size, or xmm/ymm16-31.

Or in this case, to use the AVX512VL+AVX512BW form that lets us fold a load
into a memory operand:  VPSRLW xmm1 {k1}{z}, xmm2/m128, imm8 
(https://hjlebbink.github.io/x86doc/html/PSRLW_PSRLD_PSRLQ.html).  IACA2.3 says
it micro-fuses, so it's definitely worth it.

Clang gets everything right and emits:

pack_high8_alignhack:
addq        %rdi, %rdx
vmovdqa     .LCPI2_0(%rip), %xmm0    # Plain AVX (VEX prefix)
.LBB2_1:
vpsrlw      $8, (%rsi), %xmm1        # load folded into AVX512BW version
vpand       15(%rsi), %xmm0, %xmm2   # AVX-128 VEX encoding.
vpackuswb   %xmm2, %xmm1, %xmm1
vmovdqu     %xmm1, (%rdi)
addq        $16, %rdi
addq        $32, %rsi
cmpq        %rdx, %rdi
jb          .LBB2_1
retq

vmovdqu is the same length as vmovups, so there's no benefit.  But AFAIK, no
downside on any CPU to always using FP stores on the results of vector-integer
ALU instructions.

(There isn't a separate mnemonic for EVEX vmovups, so the assembler uses the
VEX encoding whenever it's encodeable that way.  Or maybe for medium-size
displacements that are multiples of the vector width, it can save a byte by
using an EVEX + disp8 instead of VEX + disp32.)

[Bug target/82369] New: "optimizes" indexed addressing back into two pointer increments

2017-09-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82369

Bug ID: 82369
   Summary: "optimizes" indexed addressing back into two pointer
increments
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

gcc defeats this attempt to get it to reduce the front-end bottleneck in this
loop (simplified from a version of the loop in pr82356).

Indexing src relative to dst (i.e. accessing src through dst plus a constant
src-dst offset) is easy to do in C, and works well.  But when one pointer
advances faster than the other, it's very clunky to express in C.

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

// index src relative to dst, but use a pointer-increment for dst
// so the store still has a simple addressing mode (and can run on port7)
// gcc and clang "optimize" back to two separate pointers, but ICC13 leaves it alone.
// Saves one ADD instruction in the loop.
void pack_high8_indexed_src(uint8_t *restrict dst, const uint16_t *restrict
src, size_t bytes) {
  uintptr_t end_dst = (uintptr_t)(dst + bytes);
  uintptr_t srcu = (uintptr_t)src, dstu = (uintptr_t)dst;

  ptrdiff_t src_dst_offset = srcu - 2*dstu;
  do{
 __m128i v0 = _mm_loadu_si128((__m128i*)(dstu*2+src_dst_offset));
 __m128i v1 = _mm_loadu_si128((__m128i*)(dstu*2+src_dst_offset)+1);
 __m128i res = _mm_packus_epi16(v1,v0);

 _mm_storeu_si128((__m128i*)dstu, res);
 dstu += 16;
 //src += 16;  // 32 bytes
  } while(dstu < end_dst);
}

https://godbolt.org/g/pycLQC
gcc -O3 -mtune=skylake  de-optimizes it to this:

pack_high8_indexed_src:   # gcc and clang do this:
addq    %rdi, %rdx
.L2:
movdqu  16(%rsi), %xmm0
movdqu  (%rsi), %xmm1
addq    $16, %rdi
addq    $32, %rsi    # 2 separate pointer increments
packuswb    %xmm1, %xmm0
movups  %xmm0, -16(%rdi)
cmpq    %rdi, %rdx
ja  .L2
ret

Intel SnB-family: 7 fused-domain uops.  (The store micro-fuses, and the cmp/ja
macro-fuses).  In theory, this bottlenecks on front-end throughput (4 uops per
clock), running at 1 iter per 1.75 cycles.  The store uses a simple addressing
mode, so its store-address uop can run on port7.  If not for the front-end
bottleneck, the back-end could run this at nearly 1 per clock.

ICC13/16/17 compiles it the way I was hoping to hand-hold gcc into doing, to 6
fused-domain uops, and should run 1 iter per 1.5 clocks on SnB/HSW/SKL.  This
might also be good on Silvermont, since it's fewer instructions.

Possibly a similar benefit on K10 / BD (although AMD would benefit from using
simple array indexing, because indexed addressing modes for stores aren't worse
AFAIK.  But -mtune=bdver2 doesn't do that.)

pack_high8_indexed_src:   # ICC17
lea   (%rdi,%rdi), %rax
negq  %rax
addq  %rdi, %rdx
addq  %rax, %rsi
..B1.2:
movdqu    16(%rsi,%rdi,2), %xmm1   # src indexed via dst*2
movdqu    (%rsi,%rdi,2), %xmm0
packuswb  %xmm0, %xmm1
movdqu    %xmm1, (%rdi)            # dst with a simple addressing mode.
addq      $16, %rdi                # 16B of dst, 32B of src
cmpq      %rdx, %rdi
jb        ..B1.2
ret

A mov-load with a complex addressing mode is a single uop on all CPUs.  It
might have 1c higher latency than a simple addressing mode, but that doesn't
matter when the address math is off the critical path.

With unrolling, the actual work is only 4 fused-domain uops for 2x load + pack
+ store, so the front-end can just barely keep the back-end fed with infinite
unrolling.  For any sane unroll factor, saving 1 uop of loop overhead is a
slight win.

A store with an indexed addressing-mode can't run on port7 on Haswell/Skylake. 
With any unrolling, that would become a bottleneck.  On SnB/IvB, indexed stores
are un-laminated into 2 fused-domain uops, so simple array-indexing gets worse
with unrolling.


BTW, with an indexed store, we could count a negative index up towards zero. 
That would avoid the CMP, since the loop overhead could be just a single
macro-fused uop: add $16, %rdx / jnc.  (But only SnB-family macro-fuses
add/jcc.  AMD and Core2/Nehalem only macro-fuse test/cmp.)  But on a CPU that
doesn't macro-fuse at all, it's good.  (e.g. Silvermont / KNL).
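
A sketch of that negative-index structure in C (hypothetical function name;
assumes bytes is a nonzero multiple of 16 like the other examples here; compilers
typically turn the  i != 0  loop branch into add/jne rather than the add/jnc form
above, but the separate CMP still disappears):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

// Count an index from -bytes up toward zero so the loop branch can test the
// flags from the index increment directly.  The store becomes indexed
// (end_dst + i), which is the port-7 trade-off discussed above.
void pack_high8_negidx(uint8_t *restrict dst, const uint16_t *restrict src,
                       size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  const uint16_t *end_src = src + bytes;      // src advances twice as fast
  ptrdiff_t i = -(ptrdiff_t)bytes;
  do {
     __m128i v0 = _mm_loadu_si128((const __m128i*)(end_src + i));
     __m128i v1 = _mm_loadu_si128((const __m128i*)(end_src + i) + 1);
     __m128i res = _mm_packus_epi16(_mm_srli_epi16(v0, 8),
                                    _mm_srli_epi16(v1, 8));
     _mm_storeu_si128((__m128i*)(end_dst + i), res);
     i += 16;                                 // 16 output bytes per iteration
  } while (i != 0);                           // add / jne, no separate CMP
}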

---

BTW, with AVX, micro-fused loads are un-laminated on Haswell/Skylake.  e.g.

vmovdqu   16(%rsi,%rdi,2), %xmm0
vpackuswb (%rsi,%rdi,2), %xmm0, %xmm1
vmovdqu   %xmm1, (%rdi)

is 3 fused-domain uops in the decoders/uop cache, but it's 4 fused-domain uops
for the issue/rename stage and in the ROB.  The vpackuswb un-laminates.
https://stackoverflow.com/question

[Bug tree-optimization/82356] New: auto-vectorizing pack of 16->8 has a redundant AND after a shift

2017-09-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82356

Bug ID: 82356
   Summary: auto-vectorizing pack of 16->8 has a redundant AND
after a shift
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization, ssemmx
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

#include <stdint.h>
#include <stddef.h>
void pack_high8_baseline(uint8_t *__restrict__ dst, const uint16_t
*__restrict__ src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
 *dst++ = *src++ >> 8;
  } while(dst < end_dst);
}

https://godbolt.org/g/yoJZ3C
gcc -O3  auto-vectorizes to this inner loop:

.L5:
movdqa  (%rdx,%rax,2), %xmm0
movdqa  16(%rdx,%rax,2), %xmm1
psrlw   $8, %xmm0
pand    %xmm2, %xmm0     # Redundant with the shift
psrlw   $8, %xmm1
pand    %xmm2, %xmm1     # Redundant with the shift
packuswb    %xmm1, %xmm0
movups  %xmm0, (%rcx,%rax)
addq    $16, %rax
cmpq    %rsi, %rax
jne .L5

This is mostly good, but the PAND instructions are redundant, because psrlw by
8 already leaves the high byte of each 16-bit element zeroed.

The same extra AND is present when auto-vectorizing for AVX2 (but not AVX512,
where it uses a different strategy.)  Other than that, the AVX2 vectorization
strategy looks very good (packus ymm, then vpermq to fix the result).

If the input is 32B-aligned (or 64B aligned for AVX2), one of the load+shifts
can be *replaced* with an AND + unaligned load offset by -1.  This avoids a
bottleneck on shift throughput (at least with unrolling it does; without
unrolling we bottleneck on the front-end except on Ryzen).  It's even better
with AVX, because load+AND can fold into one instruction.

See https://stackoverflow.com/a/46477080/224132 for more details.

This C source produces the inner loop that I think should be very good across
K10, Bulldozer, Ryzen,  Nehalem, Sandybridge, HSW/SKL, Jaguar, Atom, and
Silvermont.  (With SSE2 or AVX.)  i.e. this should be great for tune=generic
after reaching a 32B boundary.

Not great on Core2 or K8 where non-cache-line-crossing movdqu costs more.

// take both args as uint8_t* so we can offset by 1 byte to replace a shift with an AND
// if src is 32B-aligned, we never have cache-line splits
void pack_high8_alignhack(uint8_t *restrict dst, const uint8_t *restrict src,
size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
 __m128i v0 = _mm_loadu_si128((__m128i*)src);  // this load should be aligned
 __m128i v1_offset = _mm_loadu_si128(1+(__m128i*)(src-1));
 v0 = _mm_srli_epi16(v0, 8);
 __m128i v1 = _mm_and_si128(v1_offset, _mm_set1_epi16(0x00FF));
 __m128i pack = _mm_packus_epi16(v0, v1);
 _mm_storeu_si128((__m128i*)dst, pack);
 dst += 16;
 src += 32;  // 32 bytes
  } while(dst < end_dst);
}

[Bug target/68924] No intrinsic for x86 `MOVQ m64, %xmm` in 32bit mode.

2017-09-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68924

--- Comment #3 from Peter Cordes  ---
(In reply to Marc Glisse from comment #2)
> Does anything bad happen if you remove the #ifdef/#endif for
> _mm_cvtsi64_si128? (2 files in the testsuite would need updating for a
> proper patch)

It's just a wrapper for

_mm_cvtsi64_si128 (long long __A) {
  return _mm_set_epi64x (0, __A);
}

and _mm_set_epi64x is already available in 32-bit mode.

I tried using _mm_set_epi64x(0, i) (https://godbolt.org/g/24AYPk), and got the
expected results (same as with _mm_loadl_epi64());

__m128i movq_test(uint64_t *p) {
  return _mm_set_epi64x( 0, *p );
}

movl    4(%esp), %eax
vmovq   (%eax), %xmm0
ret

For the test where we shift before movq, it still uses 32-bit integer
double-precision shifts, stores to the stack, then vmovq (instead of optimizing
to  vmovq / vpsllq)


For the reverse, we get:

long long extract(__m128i v) {
return ((__v2di)v)[0];
}

subl    $28, %esp
vmovq   %xmm0, 8(%esp)
movl    8(%esp), %eax
movl    12(%esp), %edx
addl    $28, %esp
ret

MOVD / PEXTRD might be better, but at least gcc does handle it.  It's all using syntax
that's available in 32-bit mode, not a special built-in.

I don't think it's helpful to disable the 64-bit integer intrinsics for 32-bit
mode, even though they are no longer always single instructions.  I guess it
could be worse if someone used it without thinking, assuming it would be the
same cost as MOVD, and didn't really need the full 64 bits.  In that case, a
compile-time error would prompt them to port more optimally to 32-bit.  But
it's not usually gcc's job to refuse to compile code that might be sub-optimal!

[Bug target/82339] Inefficient movabs instruction

2017-09-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339

--- Comment #5 from Peter Cordes  ---
(In reply to Richard Biener from comment #2)
> I always wondered if it is more efficient to have constant pools per function
> in .text so we can do %rip relative loads with short displacement?

There's no rel8 encoding for RIP-relative; it's always RIP+rel32, so this
doesn't save code-size.  (AMD64 hacked it in by repurposing one of the two
redundant ways to encode a 32-bit absolute address with no base or index
register; the ModRM machine-code encoding is otherwise the same between x86-32
and x86-64.)

> I suppose the assembler could even optimize things if there's the desired
> constant somewhere near in the code itself... (in case data loads from icache
> do not occur too much of a penalty).

There's no penalty for loads AFAIK, only stores to addresses near RIP are
snooped and cause self-modifying-code machine clears.

Code will often be hot in L2 cache as well as L1I, so an L1D miss could hit
there.  But L1dTLB is separate from L1iTLB, so you could TLB miss even when
loading from the instruction you're running.

(The L2TLB is usually a victim cache, IIRC, so a TLB miss that loaded the
translation into the L1iTLB doesn't also put it into L2TLB.)

>  The assembler could also replace
> .palign space before function start with (small) constant(s).

This could be a win in some cases, if L1D pressure is low or there wasn't any
locality with other constants anyway.  If there could have been locality,
you're just wasting space in L1D by having your data spread out across more
cache lines.

But in general on x86, it's probably not a good strategy.


BTW, gcc could do a lot better with vector constants.  e.g. set1_ps(1.0f) could
compile to a vbroadcastss load (which is the same cost as a normal vmovaps). 
But instead it actually repeats the 1.0f in memory 8 times.  That's useful if
you want to use it as a memory operand, because before AVX512 you can't have
broadcast memory operands to ALU instructions.  But if it's only ever loaded
ahead of a loop, a broadcast load or a PMOVZX load can save a lot of space.  In
a function with multiple vector constants, this is the difference between one
vs. multiple cache lines for all its data. 

(vpbroadcastd/q, ss/sd, and 128-bit broadcasts are handled in the load ports on
Intel and AMD, but vector PMOVZX/SX with a memory operand is still a micro-fused
load+ALU.  Still, it could easily be worth it for e.g.
_mm256_set_epi32(1,2,3,4,5,6,7,8), storing that as .byte 1,2,3,4,5,6,7,8.)
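
To make the trade-off concrete, here's the kind of function where the
broadcast-load form would win (a sketch, not from the bug report; gcc would
typically still materialize the full 32-byte repeated constant for this):

#include <stddef.h>
#include <immintrin.h>

// The constant is loaded into a register once, ahead of the loop, so a
// 4-byte 1.0f plus a vbroadcastss load would be enough.
// (Assumes n is a multiple of 8.)
void add_one_array(float *p, size_t n)
{
    const __m256 one = _mm256_set1_ps(1.0f);
    for (size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_loadu_ps(p + i);
        _mm256_storeu_ps(p + i, _mm256_add_ps(v, one));
    }
}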

The downside is lost opportunities for different functions to share the same
constant like with string-literal deduplication.  If one function wants the
full constant in memory for use as a memory operand, it's probably better for
all functions to use that copy.  Except that putting all the constants for a
given function into a couple cache lines is good for locality when it runs.  If
the full copy somewhere else isn't generally hot when a function that could use
a broadcast or pmovzx/pmovsx load runs, it might be better for it to use a
separate copy stored with the constants it does touch.

[Bug target/82339] Inefficient movabs instruction

2017-09-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339

--- Comment #4 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #0)
> At least on i7-5960X in the following testcase:
> 
> baz is fastest as well as shortest.
> So I think we should consider using movl $cst, %edx; shlq $shift, %rdx
> instead of movabsq $(cst << shift), %rdx.
> 
> Unfortunately I can't find in Agner Fog MOVABS and for MOV r64,i64 there is
> too little information, so it is unclear on which CPUs it is beneficial.

Agner uses Intel syntax, where imm64 doesn't have a special mnemonic.  It's
part of the  mov r,i  entry in the tables.  But those tables are throughput for
a flat sequence of the instruction repeated many times, not mixed with others
where front-end effects can be different.  Agner probably didn't actually test
mov r64,imm64, because its throughput is different when tested in a long
sequence (not in a small loop).  According to
http://users.atw.hu/instlatx64/GenuineIntel00506E3_Skylake2_InstLatX64.txt, a
regular desktop Skylake has 0.64c throughput for mov r64, imm64, vs. 0.25 for
mov r32, imm32.  (They don't test mov r/m64, imm32, the 7-byte encoding for
something like mov rax,-1).

Skylake with up-to-date microcode (including all SKX CPUs) disables the loop
buffer (LSD), and has to read uops from the uop cache every time even in short
loops.

Uop-cache effects could be a problem for instructions with a 64-bit immediate. 
Agner only did detailed testing for Sandybridge; it's likely that Skylake still
mostly works the same (although the uop cache read bandwidth is higher).

mov r64, imm64 takes 2 entries in the uop cache (because of the 64-bit
immediate that's outside the signed 32-bit range), and takes 2 cycles to read
from the uop cache, according to Agner's Table 9.1 in his microarch pdf.  It
can borrow space from another entry in the same uop cache line, but still takes
extra cycles to read.

See
https://stackoverflow.com/questions/46433208/which-is-faster-imm64-or-m64-for-x86-64
for an SO question the other day about loading constants from memory vs. imm64.
 (Although I didn't have anything very wise to say there, just that it depends
on surrounding code as always!)

> Peter, any information on what the MOV r64,i64 latency/throughput on various
> CPUs vs. MOV r32,i32; SHL r64,i8 is?

When not bottlenecked on the front-end,  mov r64,i64  is a single ALU uop with
1c latency.  I think it's pretty much universal that it's the best choice when
you bottleneck on anything else.

Some loops *do* bottleneck on the front-end, though, especially without
unrolling.  But then it comes down to whether we have a uop-cache read
bottleneck, or a decode bottleneck, or an issue bottleneck (4 fused-domain uops
per clock renamed/issued).  For issue/retire bandwidth mov/shl is 2 uops
instead of 1.

But for code that bottlenecks on reading the uop-cache, it's really hard to say
if one is better in general.  I think if the imm64 can borrow space in other
uops in the cache line, it's better for uop-cache density than mov/shl.  Unless
the extra code-size means one fewer instruction fits into a uop cache line that
wasn't nearly full (6 uops).

Front-end stuff is *very* context-sensitive.  :/  Calling a very short
non-inline function from a tiny loop is probably making the uop-cache issues
worse, and is probably favouring the mov/shift over the mov r64,imm64 approach
more than you'd see as part of a larger contiguous block.

I *think*  mov r64,imm64  should still generally be preferred in most cases. 
Usually the issue queue (IDQ) between the uop cache and the issue/rename stage
can absorb uop-cache read bubbles.

A constant pool might be worth considering if code-size is getting huge
(average instruction length much greater than 4).

Normally of course you'd really want to hoist an imm64 out of a loop, if you
have a spare register.  When optimizing small loops, you can usually avoid
front-end bottlenecks.  It's a lot harder for medium-sized loops involving
separate functions.  I'm not confident this noinline case is very
representative of real code.

---

Note that in this special case, you can save another byte of code by using 
ror rax  (implicit by-one encoding).

Also worth considering for tune=sandybridge or later: xor eax,eax / bts rax,
63.   2B + 5B = 7B.  BTS has 0.5c throughput, and xor-zeroing doesn't need an
ALU on SnB-family (so it has zero latency; the BTS can execute right away even
if it issues in the same cycle as xor-zeroing).  BTS runs on the same ports as
shifts (p0/p6 in HSW+, or p0/p5 in SnB/IvB).  On older Intel, it has 1 per
clock throughput for the reg,imm form.  On AMD, it's 2 uops, with 1c throughput
(0.5c on Ryzen), so it's not bad if used on AMD CPUs, but it doesn't look good
for tune=generic.

At -Os, you could consider  or eax, -1;  shl rax,63.  (Also 7 bytes, and works
for constants with multiple consecutive high-bits set). The false dependency on
the old RAX value is often not a bottleneck, and gcc 

[Bug target/68924] No intrinsic for x86 `MOVQ m64, %xmm` in 32bit mode.

2017-09-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68924

--- Comment #1 from Peter Cordes  ---
There's  __m128i _mm_loadl_epi64 (__m128i const*
mem_addr)(https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=movq=5450,4247,3115=SSE2),
which gcc makes available in 32-bit mode.

This does solve the correctness problem for 32-bit, but gcc still compiles it
to a separate vmovq before a vpmovzxbd %xmm,%ymm.  (Using _mm_loadu_si128 still
optimizes away to vpmovzxbd (%eax), %ymm0.)

https://godbolt.org/g/Zuf26P

[Bug target/82267] x32: unnecessary address-size prefixes. Why isn't -maddress-mode=long the default?

2017-09-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267

--- Comment #6 from Peter Cordes  ---
(In reply to H.J. Lu from comment #2)
> > Are there still cases where -maddress-mode=long makes worse code?
> 
> 
> Yes, there are more places where -maddress-mode=long needs to zero-extend
> address to 64 bits where 0x67 prefix does for you.

So ideally, gcc should use 0x67 opportunistically where it saves a
zero-extension instruction.

Using 64-bit address size opportunistically wherever we're sure it's safe seems
like a good idea, but I assume that's not easy to implement.

Can we teach  -maddress-mode=long  that a 0x67 prefix is a nearly-free way to
zero-extend as part of an addressing-mode, so it will use that instead of extra
instructions?


> > SSSE3 and later instructions need 66 0F 3A/38 before the opcode, so an
> > address-size or REX prefix will cause a decode stall on Silvermont.  With
> 
> That is true.

> > Similarly, Bulldozer-family has a 3-prefix limit, but doesn't
> > count escape bytes, and VEX only counts as 0 or 1 (for 2/3 byte VEX).
> 
> But 0x67 prefix is still better.

For tune=silvermont or knl, ideally we'd count prefixes and use an extra
instruction when it avoids a decode bottleneck.

For tune=generic we should probably always use 0x67 when it saves an
instruction.  IDK about tune=bdver2.  Probably not worth worrying about too
much.


> Since the upper 32 bits of stack register are always zero for x32, we
> can encode %esp as %rsp to avoid 0x67 prefix in address if there is no
> index or base register.

Note that %rsp can't be an index register, so you only have to check if it's
the base register.

The SIB encodings that would mean index=RSP actually mean "no index".  The
ModRM encoding that would mean base=RSP instead means "there's a SIB byte". 
https://stackoverflow.com/a/46263495/224132

This means that `(%rsp)` is encodeable, instead of (%rsp, %rsp, scale).  Any
other register can be used as a base with no SIB byte (unfortunately for
code-size with -fomit-frame-pointer).

Can this check be applied to  %rbp  in functions that use a frame pointer?

That might be possible even if we can't as easily decide whether other
registers need to be zero or sign extended if we're not sure whether they're
"the pointer" or a signed integer pointer-difference.

However, simple dereference addressing modes (one register, no displacement)
can always use 64-bit address size when the register is known to be
zero-extended.

[Bug target/82158] _Noreturn functions that do return clobber caller's registers on ARM32 (but not other arches)

2017-09-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158

Peter Cordes  changed:

   What|Removed |Added

 Resolution|INVALID |WONTFIX

--- Comment #11 from Peter Cordes  ---
Flagging this WONTFIX instead of INVALID, since I'm still convinced gcc could
do better.  What I'm proposing wouldn't be any worse than breaking the ABI on
UB.  I guess unsafe code-gen for something that gcc warns about by default is
not too bad, though.

I guess I should file a separate missed-optimization bug about not eliminating
the code from a basic block that returns in a noreturn function.

See also
https://stackoverflow.com/questions/45981545/why-does-noreturn-function-return/46407858#46407858
(description of the possible optimizations).

[Bug target/82328] New: x86 rdrand: flags not used directly when branching on success/failure

2017-09-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82328

Bug ID: 82328
   Summary: x86 rdrand: flags not used directly when branching on
success/failure
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

#include 
unsigned long long use_intrinsic(void) {
unsigned long long rand;
    while(!_rdrand64_step(&rand));  // FIXME: limited retry in case RNG is broken
return rand;
}
// https://godbolt.org/g/x7mUvj
gcc 8.0.0 20170926 -O3 -mrdrnd

movl    $1, %edx
.L4:
rdrand  %rax
movq    %rax, -8(%rsp)   # spill to memory, really?
cmovc   %edx, %eax
testl   %eax, %eax
je  .L4
movq    -8(%rsp), %rax
ret

Note that RDRAND (http://felixcloutier.com/x86/RDRAND.html) indicates failure
by clearing CF *and* putting 0 in the destination register.  So this code is
correct (returning a valid RDRAND result even if it was zero), just much worse
than clang's:

.LBB1_1:
rdrandq %rax
jae .LBB1_1
retq
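
For what it's worth, the "limited retry" the FIXME alludes to could look
something like this (just a sketch; the retry count of 10 is an arbitrary
choice, not something from this report; needs -mrdrnd):

#include <immintrin.h>

// Retry a bounded number of times so a broken RNG can't hang the caller.
// Returns 1 on success (CF was set), 0 if every attempt failed.
int rdrand64_retry(unsigned long long *out)
{
    for (int i = 0; i < 10; i++)
        if (_rdrand64_step(out))
            return 1;
    return 0;
}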

[Bug target/82298] New: x86 BMI: no peephole for BZHI

2017-09-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82298

Bug ID: 82298
   Summary: x86 BMI: no peephole for BZHI
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

gcc never seems to emit BZHI on its own.

// exact BZHI behaviour for all inputs (with no C UB)
unsigned bzhi_exact(unsigned x, unsigned c) {
c &= 0xff;
if (c <= 31) {
  x &= ((1U << c) - 1);
  // 1ULL defeats clang's peephole, but is a convenient way to avoid UB for count=32.
}
return x;
}
// https://godbolt.org/g/tZKnV3

unsigned long bzhi_l(unsigned long x, unsigned c) {
return x & ((1UL << c) - 1);
}

Out-of-range shift UB allows peepholing to BZHI for the simpler case, so these
(respectively) should compile to

bzhil   %esi, %edi, %edi
bzhiq   %rsi, %rdi, %rax

But we actually get (gcc8 -O3 -march=haswell (-mbmi2))

movq    $-1, %rax
shlx    %rsi, %rax, %rdx
andn    %rdi, %rdx, %rax
ret

Or that with a test for bzhi_exact.  Clang succeeds at peepholing BZHI
here, but it still does the &0xff and the test to skip BZHI when it
would do nothing.  It's easy to imagine cases where the source would use a
conditional to avoid UB when it wants to leave x unmodified for c==32, and the
range is 1 to 32:

unsigned bzhi_1_to_32(unsigned x, unsigned c) {
if (c != 32)
x &= ((1U << c) - 1);
return x;
}


BZHI is defined to saturate the index to OperandSize, so it copies src1
unmodified when the low 8 bits of src2 are >= 32 or >= 64.  (See the Operation
section of http://felixcloutier.com/x86/BZHI.html.  The text description is
wrong, claiming it saturates to OperandSize-1, which would zero the high bit.)

Other ways to express it (which clang fails to peephole to BZHI, like gcc):

unsigned bzhi2(unsigned x, unsigned c) {
//  c &= 0xff;
//  if(c < 32) {
  x &= (0xffffffffUL >> (32-c));
//  }
return x;
}

unsigned bzhi3(unsigned long x, unsigned c) {
// c &= 0xff;
return x & ~(-1U << c);
}



Related: pr65871 suggested this, but was really about taking advantage of flags
set by __builtin_ia32_bzhi_si so it is correctly closed.  pr66872 suggested
transforming x & ((1 << t) - 1); to x & ~(-1 << t); to enable ANDN.  Compiling
both to BZHI when BMI2 is available was mentioned, but that wasn't the main
subject of that bug either.

[Bug target/82281] New: Bulldozer/Zen tuning: uses XMM for single 64-bit integer AND, even with a simple mask

2017-09-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82281

Bug ID: 82281
   Summary: Bulldozer/Zen tuning: uses XMM for single 64-bit
integer AND, even with a simple mask
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization, ssemmx
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

long long test_and(long long x) {
    return x & 0x77ffffffffULL;
}
// https://godbolt.org/g/D6XujV
# -O3 -march=znver1 -m32 -mno-avx
movaps  .LC0, %xmm1
movq    4(%esp), %xmm0
andps   %xmm1, %xmm0
movd    %xmm0, %eax
pextrd  $1, %xmm0, %edx
ret

# -O3 -m32
movl    8(%esp), %edx
movl    4(%esp), %eax
andl    $119, %edx
ret

We get this with znver1 and bdver1-4, but not barcelona or btver2.

Also not haswell, skylake or knl.

So something is wrong with the tunings for recent AMD that makes them over-eager
to go to vector registers for 64-bit integers in the most trivial case possible.
Fortunately it only happens when the value comes from memory:

long long ext();
long long test_and() {
long long x = ext();
    return x & 0x77ffffffffULL;
}
  # -O3 -march=znver1 -m32
subl    $12, %esp
call    ext()
addl    $12, %esp
andl    $119, %edx
ret

[Bug target/81602] Unnecessary zero-extension after 16 bit popcnt

2017-09-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602

--- Comment #3 from Peter Cordes  ---
Forgot to mention: memory-source popcnt with an indexed addressing mode would
also be worse on SnB/IvB: it can't stay micro-fused, so the front-end
un-laminates it in the issue stage.

Haswell and later can keep  popcnt (%rdi, %rdx), %eax  micro-fused throughout
the pipeline, so it's always 1 fused-domain uop instead of expanding to 2, but
it's still 2 unfused-domain uops so it takes more room in the scheduler than
the reg-reg form.

When Intel fixes the output dependency in some future uarch, it might
un-laminate again with indexed addressing modes.  That's what happens on
Skylake for tzcnt/lzcnt, because SKL fixed their output dependency.  (And
judging from the published errata, they meant to fix popcnt as well.)  But
indexed addressing modes can only stay micro-fused with an ALU uop with
"traditional" x86-style instructions with 2 operands where the destination is
read/write, not write-only.   (Tested on Haswell and Skylake).  And yes, this
makes indexed addressing modes with AVX instructions worse than with the SSE
equivalent. :/

[Bug target/81602] Unnecessary zero-extension after 16 bit popcnt

2017-09-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #2 from Peter Cordes  ---
(In reply to Uroš Bizjak from comment #1)
> The "xorl %eax, %eax; movw %ax, (%rsi)" pair is just optimized way to 
> implement "movw $0, (%rsi);".

That's questionable on Skylake.  Sandybridge and later Intel CPUs don't have
LCP decode stalls on mov with a 16-bit immediate, only on ALU instructions. 
movw $0, (%rsi)  has no displacement and a 16-bit immediate, so it only takes 1
entry in the uop cache (Agner Fog's microarch pdf, table 9.1).  I don't see any
other downsides for a mov-imm16 to memory on Skylake.

> It just happens that peephole pass found unused %eax as an empty temporary 
> reg when splitting direct move of immediate to memory.

Then it's a missed-optimization to keep re-zeroing inside the loop, instead of
picking %ecx and hoisting the xor so the rest of the loop can keep clobbering
%eax.

Although really doing what Christoph suggested is even better if you do want
the xor-zeroing for something else.  If a peephole pass is introducing
xor-zeroing instructions, then it's a missed optimization if other instructions
can't take advantage.  Having an xor-zeroing inside the loop *instead* of a
movzx is pretty good if the xor-zero is needed for some other reason.

> popcntl has a false dependency on its output in certain situations,

Yes, always on Intel Sandybridge-family, including Skylake.

> where popcntw doesn't have this limitation. So, gcc chose this approach for a
> reason.

Intel Haswell/Skylake (and I think IvyBridge) don't rename low16 separately
from the full register
(https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to).
 *Any* write of a 16-bit register has a false dependency on the old value.  So
popcntw isn't a special case of false dependency, it's like other 16-bit
instructions.  Is that what you're thinking of?

The only CPUs where it's even theoretically possible for popcntw to not have an
output dependency are Nehalem and Sandybridge.  All other CPUs don't rename
low16 separately from the full register or are too old to have popcnt.

Do you have a source for your claim that popcntw doesn't have a dependency on
the 16-bit destination register, on Sandybridge?  It's probably true on
Nehalem, because they rename %ax and don't have false dependencies for
popcntl/q, but more likely Sandybridge will treat %ax as an input dependency.

If you don't have an xor-zeroed register you can clobber, movzx-load + popcntl
is pretty clearly better than popcntw mem,%ax;  movzx %ax, %eax on all CPUs, or
at least not worse.  It saves a code byte (no operand-size prefix), guarantees
no false dependency on %eax, and avoids a 16-bit popcnt (slow on
Silvermont/KNL).

It's also fewer total uops for the out-of-order scheduler / execution units:
popcnt (mem), reg  is a micro-fused load + ALU, and movzwl r16,r32 is another
ALU uop.  But a movzx load decodes to just a load uop on Intel CPUs, no ALU uop
needed.


For example, gcc -m32 -O3 -mtune=generic compiles this
int pc_ushort(unsigned short a) { return __builtin_popcount(a); }
// https://godbolt.org/g/nnNYLU
popcntw 4(%esp), %ax    # false dep on most CPUs
movzwl  %ax, %eax   # extra ALU uop on the critical path
ret

Much better:

movzwl   4(%esp), %eax  # just as fast as a mov load
popcntl  %eax, %eax     # popcntl false dep taken care of by same,same


Or another -m32 example:
int pcl(unsigned long a) { return __builtin_popcountll(a); }
# -O3 -mpopcnt -m32
xorl    %eax, %eax    # correctly omitted with -mtune=znver1
popcntl 4(%esp), %eax

extra non-eliminated ALU uop for the execution units on Nehalem and AMD vs.

movl    4(%esp), %eax
popcntl %eax, %eax

Also, the load might issue 1 cycle earlier this way, or can even be hoisted
ahead of other surrounding code.  Splitting a micro-fused load+ALU into a
mov-load is always at least as good as xor-zeroing for dep breaking when the
source is in memory.  (A reg-reg mov is not always better; putting a mov on the
critical path is always worse for older CPUs, vs. xor off the critical path.)

Here are some more examples of current gcc 8.0.0 20170920  getting things wrong
for various tunings:

int pcll_uint(unsigned int a) { return __builtin_popcountll(a); }
   # -O3 -march=bdver2
movl    %edi, %eax    # zero-extend to 64-bit
popcntq %rax, %rax    # 64-bit popcntq has 4c throughput, vs. 2c for 16/32-bit (Agner Fog's table for Piledriver)
ret

Missed optimization to  popcntl %edi, %eax  (without mov) applies to znver1 /
knl / others as well, but

[Bug target/82158] _Noreturn functions that do return clobber caller's registers on ARM32 (but not other arches)

2017-09-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158

--- Comment #10 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #9)
> None of the above options is IMHO acceptable.
> This is UB like any other.

I still think it's a quality-of-implementation bug that could be fixed without
downsides for conforming programs by emitting an illegal instruction instead of
bx lr.  (Thanks for pointing out some cases I hadn't considered, BTW.  That
narrows down the possible good solutions.)

IMO corrupting registers while otherwise working normally is really dangerous. 
That could appear to work in a unit test but fail sometimes with a different
caller, maybe even causing security problems if the clobbered register was only
used in an error-handling case.

If a fix would take too much work to implement, then I'm fine with leaving this
as WONTFIX; there is a warning enabled by default for cases where gcc is sure
that a noreturn function *does* return.  I disagree about INVALID, though.

I'm imagining a codebase where a stray _Noreturn attribute got attached to a
function that was never intended to be _Noreturn.

A buggy _Noreturn function that has a combination of inputs that does return
would always be problematic regardless of clobbering registers, and I guess
that's why you're thinking about sanitize?

> What we could add is -fsanitize=noreturn that would add runtime
> instrumentation

I'm not proposing anything that heavy, just an illegal instruction instead of a
return, or simply no instruction at all.  In a _Noreturn function (that has
clobbered call-preserved registers), I think gcc should never emit instructions
that return.

As you point out, even if gcc can't prove whether that path is/isn't taken, a
correct program *won't* take it, so an illegal instruction is a good thing. 
(For performance, an illegal instruction can stop incorrect speculation down a
wrong path after a mispredicted branch, potentially saving useless prefetch of
code/data and speculative page walks).

There is precedent for trapping with illegal instructions on UB:

int* address_of_local(int v) { return &v; }
int foo() {
int* p = address_of_local(4);
return *p;
}
   // gcc4.x returns 4
   // x86-64 gcc5 and later.  https://godbolt.org/g/W5vUiA

foo():
movl    0, %eax   # load from absolute address 0
ud2               # undefined 2-byte instruction: future-proof way to raise #UD

Falling through into whatever's next would also be a better option, since it
saves code size instead of generating instructions that a correct program will
never execute.  (gcc does that for foo() on ARM after trying a NULL-pointer
load.  ARM64 gcc uses brk #1000 after trying a NULL-pointer load.)

If the block that would return isn't at the end of the function, then the
fall-through would be into another block of the current function.  This is
potentially hard to debug, so probably an illegal instruction is a good idea,
maybe with an option to omit it.

Anything in a basic block that returns can be assumed not to be executed, so
the example from the original report could be compiled to:

  push {r4, lr}
 #  mov r5, r1   # optionally optimize these out, because we know
 #  mov r4, r0   # the code that uses them can't be reached.
  bl ext
   # don't emit the store
   # or the pop / bx lr
   # Fall through into padding.
   # Ideally pad with illegal instructions instead of NOPs

Hmm, unless the store faults, and that's what made this function not return. 
But we can definitely replace the pop/bx with an illegal instruction or a
fall-through, because executing them is guaranteed to be the wrong thing.

Making functions that almost work but violate the ABI seems like the worst
possible way to handle this corner case.  Crashing is better than silent
corruption when there's no correct way to continue execution, right?


> Compile time error is undesirable

That's what I thought, too.  -Werror exists for users that want that.

I only mentioned it for completeness, but good point that it's not an option
even when gcc can prove that a function returns, because it might never be
called. 

> , or say the
> return is only conditional and that path doesn't ever happen in a valid
> program, or say there are calls in the noreturn function that just throw
> exceptions or loop forever or abort or don't really return some other way.

That's a good point.  We don't want to save extra registers that a correct
program doesn't need saved, so ignoring the noreturn attribute and emitting
code to save/restore is not a good solution for the general case.

Illegal instruction or fall through into other code is left as the only
solution I think is really good.

[Bug target/82158] _Noreturn functions that do return clobber caller's registers on ARM32 (but not other arches)

2017-09-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158

Peter Cordes  changed:

   What|Removed |Added

 Status|VERIFIED|UNCONFIRMED
 Resolution|WONTFIX |---

--- Comment #8 from Peter Cordes  ---
reopening

[Bug target/82158] _Noreturn functions that do return clobber caller's registers on ARM32 (but not other arches)

2017-09-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158

Peter Cordes  changed:

   What|Removed |Added

 Status|RESOLVED|VERIFIED

--- Comment #7 from Peter Cordes  ---
(In reply to Ramana Radhakrishnan from comment #6)
> (In reply to Peter Cordes from comment #5)
> >  That's what I thought; just be able to print backtraces.  Good point about
> > -fno-exceptions.  I forgot I was building as C in the first place. :P
> 
> Yes and that's why I'm closing this as a WONTFIX.

Hang on, there's still a bug here for _Noreturn functions that *do* return.

That discussion established that the current behaviour is correct for actual
noreturn functions (thank you), but that's *not* what this bug report is about.
 It's about this function:

void ext(void);
_Noreturn void foo(int *p, int y) {
ext();
*p = y;   // then use args that had to survive a call
return;
}

gcc for ARM32 has a quality-of-implementation bug here: it only warns, and then
makes code that silently clobbers the caller's R5.

It's UB so the standard allows this, but it's definitely not nice, and can
cause problems that are hard to trace back to this function.

The valid options are:
* preserve the caller's registers as if _Noreturn hadn't been specified
* abort / trap at the end of the function instead of returning.  (Enable
-mabort-on-noreturn by default?)
* compile-time error.

[Bug target/82260] [x86] Unnecessary use of 8-bit registers with -Os. slightly slower and larger code

2017-09-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82260

--- Comment #5 from Peter Cordes  ---
> (not (match_test "TARGET_PARTIAL_REG_STALL"))

gcc is doing this even with -mtune=core2.

Core2 / Nehalem stall (the front-end) for 2-3 cycles to insert a merging uop
when reading a full register after writing a partial register.  Sandybridge
inserts a merging uop without stalling.  Haswell/Skylake doesn't rename low8 in
the first place (but inserts a merging uop for high8 without stalling).

gcc should be trying to avoid partial-register shenanigans on Core2 / Nehalem,
but the penalty is low enough that it's probably not worth changing
-mtune=generic.

Related: gcc likes to do set-flags / setcc / movzx, but it would be
significantly better to do  xor-zero / set-flags / setcc when possible, when a
zero-extended bool is needed.

setcc into the low8 of a register zeroed with a recognized zeroing idiom avoids
partial-register penalties when reading the full register, and it has a shorter
critical path from test -> 32-bit result.  It also avoids a false dependency on
the old value of the register.  (Fun fact: on early P6 (PPro to Pentium III),
xor-zeroing was not dependency-breaking, but did avoid partial-register
stalls.)
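
A minimal example of the two orderings being compared (a sketch; the asm in
the comments shows the patterns described above, not gcc's exact output for
this function):

#include <stdbool.h>

// Zero-extended bool result.
bool is_equal(int a, int b)
{
    return a == b;
    // set-flags / setcc / movzx:    cmpl %esi,%edi ; sete %al ; movzbl %al,%eax
    // xor-zero / set-flags / setcc: xorl %eax,%eax ; cmpl %esi,%edi ; sete %al
}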

Also, movzx %al, %eax defeats mov-elimination on Intel, so it's always better
to movzx to a different architectural register for zero-extension, modulo
register pressure and not costing any extra instructions total.

Is there already an open bug for either of these latter problems?  (Sorry I
have a bad habit of taking bugs off topic.)

[Bug target/82260] [x86] Unnecessary use of 8-bit registers with -Os. slightly slower and larger code

2017-09-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82260

--- Comment #4 from Peter Cordes  ---
(In reply to Jakub Jelinek from comment #2)
> From pure instruction size POV, for the first 2 alternatives as can be seen
> say on:
> ...
> movb $0x15, %al
> movl $0x15, %eax
> movb $-0x78, %bl
> movl $-0x78, %ebx

There are ways to save code-size when setting up constants.  If you already
have one constant in a register, you can get other nearby constants in 3 bytes
with LEA

  xor  %edi, %edi # you often need a zero for something
  lea -0x78(%rdi), %ebx   # 3 bytes vs. 5 for mov $imm32, %r32

Or a 4-byte LEA with a 64-bit destination to replace a 7-byte mov $imm32, %r64.
 Modern CPUs have pretty good LEA throughput (2 per clock on Intel SnB-family +
KNL and AMD K8/K10/BD-family/Zen), especially for 2-component LEA (base + disp,
no index).  1 per clock on others, still 1c latency.  With efficient
xor-zeroing support, the LEA can execute without any extra delay even if it
issues in the same cycle as the xor-zeroing.  If using LEA relative to some
other constant, well it's still just 1c extra.

If gcc had a -Oz mode like clang does (optimize for size even more), you could
consider stuff like 3-byte push+pop (clobbering the top of the red zone).

  push $-0x78   # imm8 sign-extended to 64-bit 
  pop  %rbx

https://stackoverflow.com/questions/45105164/set-all-bits-in-cpu-register-to-1-efficiently
https://stackoverflow.com/questions/33825546/shortest-intel-x86-64-opcode-for-rax=1

[Bug target/82267] New: x32: unnecessary address-size prefixes. Why isn't -maddress-mode=long the default?

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267

Bug ID: 82267
   Summary: x32: unnecessary address-size prefixes.  Why isn't
-maddress-mode=long the default?
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: ABI, missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*

x32 defaults to using 32-bit address-size everywhere, it seems.  (Apparently
introduced by rev 185396 for bug 50797, which introduced -maddress-mode=short
and made it the default.)

This takes an extra 1-byte prefix on every instruction with a memory operand. 
It's not just code-size; this is potentially a big throughput problem on Intel
Silvermont where more than 3 prefixes (including mandatory prefixes and 0F
escape bytes for SSE and other instructions) cause a stall.  These are exactly
the systems where a memory-saving ABI might be most useful.  (I'm not building
one, I just think x32 is a good idea if implemented optimally.)

long long doublederef(long long **p){
return **p;
}
//  https://godbolt.org/g/NHbURq
gcc8 -mx32 -O3
movl    (%edi), %eax  # 0x67 prefix
movq    (%eax), %rax  # 0x67 prefix
ret

The second instruction is 1 byte longer for no reason: it needs a 0x67
address-size prefix to encode.
But we know for certain that the address is already zero-extended into %rax,
because we just put it there.  Also, the ABI requires p to be zero-extended to
64 bits, so it would be safe to use `movl (%rdi), %eax` as the first
instruction.

Even (%rsp) is avoided for some reason, even though -mx32 still uses
push/pop/call/ret which use the full %rsp, so it has to be valid.

int stackuse(void) {
volatile int foo = 2;
return foo * 3;
}
movl    $2, -4(%esp)        # 0x67 prefix
movl    -4(%esp), %eax      # 0x67 prefix
leal    (%rax,%rax,2), %eax # no prefixes
ret


Compiling with -maddress-mode=long appears to generate optimal code for all the
simple test cases I looked at, e.g.

movl    $2, -4(%rsp)        # no prefixes
movl    -4(%rsp), %eax      # no prefixes
leal    (%rax,%rax,2), %eax # no prefixes
ret

-maddress-mode=long still uses an address-size prefix instead of an LEA to make
sure addresses wrap at 4G, and to ignore high garbage in registers:

long long fooi(long long *arr, int offset){
return arr[offset];
}
movq    (%edi,%esi,8), %rax    # same for mode=short or long.
ret

Are there still cases where -maddress-mode=long makes worse code?



Is it really necessary for an unsigned offset to wrap at 4G?  Does ISO C or
GNU C guarantee that large unsigned values work like negative signed integers
when used for pointer arithmetic?

// 64-bit offset so it won't have high garbage
long long fooull(long long *arr, unsigned long long offset){
return arr[offset];
}

movq    (%edi,%esi,8), %rax    # but couldn't this be (%rdi,%rsi,8)
ret

Allowing 64-bit addressing modes with unsigned indexes could potentially save
significant code-size, couldn't it?

address-mode=long already allows constant offsets to go outside 4G, for
example:

foo_constant: #return arr[123456];
movq    987648(%rdi), %rax
ret

But it does treat the offset as signed, so 0xULL will  movq -8(%rdi),
%rax.

The ABI doc (https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI) doesn't
specify anything about C pointer-wrapping semantics, and I don't know where
else to look to find out what behaviour is required/guaranteed and what is just
how the current implementation happens to work.

Anyway, this is a side-track from the issue of not using address-size prefixes
in single-pointer cases where it's already zero extended.

-

SSSE3 and later instructions need 66 0F 3A/38 before the opcode, so an
address-size or REX prefix will cause a decode stall on Silvermont.  With the
default x32 behaviour, even SSE2 instructions (66 0F opcode) will cause decode
stalls with a REX and address-size prefix.  e.g. paddb (%r8d), %xmm8   or even
movdqa (but not movaps or other SSE1 instructions).  Fortunately KNL isn't
really affected: VEX/EVEX is fine unless there's a segment prefix before it,
but Agner Fog seems to be saying that other prefixes are fine.

In integer code, REX + operand-size + address-size + a 0F escape byte would be
a problem for Silvermont/KNL, e.g. imul (%edi), %r10w needs all 4.   movbe %ax,
(%edi) has 4 prefixes, including the 2 mandatory escape bytes: 67 66 0f 38 f1
07.


In-order Atom also has "severe delays" (according to
http://agner.org/optimize/) with more than 3 prefixes, but unlike Silvermont,
that apparently doesn't include mandator

[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259

--- Comment #4 from Peter Cordes  ---
(In reply to Uroš Bizjak from comment #2)
> A couple of *scc_bt patterns are missing. These are similar to already
> existing *jcc_bt patterns. Combine wants:

Does gcc also need patterns for bt + cmovcc?

Thinking about this again, with an immediate count <= 31 it might be best to
test $0x0100, %edi / setz %al.  BT might be shorter, needing only an imm8
instead of imm32.  But TEST can run on more ports than BT on Intel.  (Ryzen has
4 per clock bt throughput).

(In some registers, TEST can check the low8 or high8 using an imm8, but high8
can have extra latency on HSW/SKL:
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to.
 But test $imm8, %al is only 2 bytes, or 3 bytes for low8 other than AL if a
REX isn't needed.  There's no test $imm8_sign_extended, r32/r64, so you need a
REX to test the low byte of edi/esi/ebp.)

But for a variable count, it's likely that BT is the best bet, even when
booleanizing with setcc.  At least if we avoid `movzx`, because bt/setcc/movzx
is significantly worse than  xor-zero / bt / setcc, for latency and for a false
dependency on the destination register.

With a constant count, SHR / AND is very good if we don't need to invert the
boolean, and it's ok to destroy the source register.  (Or of course just SHR if
we want the high bit).  If adding new BT/SETCC patterns, I guess we need to
make sure gcc still uses SHR or SHR/AND where appropriate.
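
For reference, the constant-count cases meant here, where plain SHR (or
SHR + AND) is already right and a new bt/setcc pattern shouldn't take over
(sketch, hypothetical function names):

#include <stdbool.h>

bool high_bit(unsigned x) { return x >> 31; }        // just SHR
bool bit13(unsigned x)    { return (x >> 13) & 1; }  // SHR + AND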

[Bug target/82261] New: x86: missing peephole for SHLD / SHRD

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261

Bug ID: 82261
   Summary: x86: missing peephole for SHLD / SHRD
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

unsigned shld(unsigned a, unsigned b, unsigned n){
//n=13;
a <<= n;
b >>= (32-n); //&31;
return a|b;
}
// https://godbolt.org/g/3jbgbR

g++ (GCC-Explorer-Build) 8.0.0 20170919 -O3 -march=haswell
movl    $32, %eax
subl    %edx, %eax  # missed optimization: NEG would work
shrx    %eax, %esi, %eax
shlx    %edx, %edi, %esi
orl     %esi, %eax
ret

Intel has efficient SHLD/SHRD, so this should be compiled similar to what clang
does:

movl    %edx, %ecx
movl    %edi, %eax   # move first so we overwrite a mov-elimination result right away
shldl   %cl, %esi, %eax
retq

Without SHLD, there's another missed optimization: shifts mask their count, and
32 & 31 is 0, so we could just NEG instead of setting up a constant 32.

shlx    %edx, %edi, %eax
neg     %edx
shrx    %edx, %esi, %esi
orl %esi, %eax
ret

This *might* be worth it on AMD, where SHLD is 7 uops and one per 3 clock
throughput/latency.  Without BMI2, though, it may be good to just use SHLD
anyway.

There are various inefficiencies (extra copying of the shift count) in the
non-BMI2 output, but this bug report is supposed to be about the SHRD/SHLD
peephole.  (I didn't check for SHRD).

[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259

--- Comment #3 from Peter Cordes  ---
Oops, BT sets CF, not ZF.  So

bt  $13, %edi
setnc   %al    # aka setae
ret

This is what clang does for the bt_ functions, and might be optimal for many
use-cases.  (For branching with an immediate, test/jcc is of course better
because it can macro-fuse into a test+branch uop on Intel and AMD.)
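
And the branching case from that parenthetical, where TEST with an immediate
plus JCC is preferred because it can macro-fuse (a sketch; ext2() is just a
stand-in callee, not from the bug report):

void ext2(void);  // stand-in for whatever the branch guards

void branch_on_bit13(int x)
{
    if (x & (1 << 13))   // test $0x2000, %edi ; jne ...  (macro-fusable)
        ext2();
}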

[Bug target/82260] New: [x86] Unnecessary use of 8-bit registers with -Os. slightly slower and larger code

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82260

Bug ID: 82260
   Summary: [x86] Unnecessary use of 8-bit registers with -Os.
slightly slower and larger code
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

int shift(int x, int c) {
return x >> c;
}
// https://godbolt.org/g/waovLu

gcc8 20170915 -Os -mtune=haswell:
movl    %edi, %eax
movb    %sil, %cl   # bad
sarl    %cl, %eax
ret

-O3:
movl    %edi, %eax
movl    %esi, %ecx  # good
sarl    %cl, %eax
ret

The 8-bit MOV needs a REX prefix to access %sil, and has a false dependency on
the old value of RCX.  Haswell/Skylake don't rename low8 partial registers,
only high8.  https://stackoverflow.com/q/45660139/224132.  P6 and Sandybridge
do, but an 8-bit mov is definitely *not* better when a 32-bit mov is also an
option.

So -Os makes code that's larger and also potentially slower.

[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259

--- Comment #1 from Peter Cordes  ---
More generally, you can flip a higher bit while copying with

lea  64(%rdi), %eax

That leaves the bits above that position munged by carry-out, but that isn't
always a problem.
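
In C terms (a hedged restatement, not from the original report), the munged
upper bits are harmless whenever the result gets masked or shifted down
afterwards:

// +64 flips bit 6 while copying; bits above bit 6 may be munged by the
// carry-out, but they're discarded by the >>6 & 1 anyway.
int bit6_clear(unsigned x)
{
    return ((x + 64) >> 6) & 1;   // same value as !(x & 64)
}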

[Bug target/82259] New: missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259

Bug ID: 82259
   Summary: missed optimization: use LEA to add 1 to flip the low
bit when copying before AND with 1
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

bool bt_signed(int x, unsigned bit) {
bit = 13;
return !(x & (1<<bit));
}
// https://godbolt.org/g/rzdtzm
movl    %edi, %eax
sarl    $13, %eax
notl    %eax
andl    $1, %eax
ret

This is pretty good, but we could do better by using addition instead of a
separate NOT.  (XOR is add-without-carry.  Adding 1 will always flip the low
bit).

sarl    $13, %edi
lea     1(%edi), %eax
andl    $1, %eax
ret

If partial-registers aren't a problem, this will be even better on most CPUs:

bt  $13, %edi
setz    %al
ret

related: bug 47769 about missed BTR peepholes.  That probably covers the missed
BT.

But *this* bug is about the LEA+AND vs. MOV+NOT+AND optimization.  This might
be relevant for other 2-operand ISAs with mostly destructive instructions, like
ARM Thumb.


Related:

bool bt_unsigned(unsigned x, unsigned bit) {
//bit = 13;
return !(x & (1<<bit));  // 1U avoids test/set
}

movl    %esi, %ecx
movl    $1, %eax
sall    %cl, %eax
testl   %edi, %eax
sete    %al
ret

This is weird.  The code generated with  1U << bit  is like the bt_signed code
above and has identical results, so gcc should emit whatever is optimal for
both cases.  There are similar differences on ARM32.

(With a fixed count, it just makes the difference between NOT vs. XOR $1.)

If we're going to use setcc, it's definitely *much* better to use  bt  instead
of a variable-count shift + test.

bt  %esi, %edi
setz    %al
ret

[Bug target/47769] [missed optimization] use of btr (bit test and reset)

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #6 from Peter Cordes  ---
This seems to be partially fixed in gcc8.0:

#include <stdint.h>
uint64_t btr_variable(uint64_t x, unsigned bit) {
//bit = 53;  // produces btr in older gcc, too.
return x & ~(1ULL << bit);
}
movq    %rdi, %rax
btrq    %rsi, %rax
ret

vs. gcc7.2 -O3 -mtune=haswell
movl    %esi, %ecx
movq    $-2, %rdx
rolq    %cl, %rdx
movq    %rdx, %rax  # this is dumb, should have put the mask in rax in the first place
andq    %rdi, %rax
ret

Or with bit=53:
movabsq $-9007199254740993, %rax
andq    %rdi, %rax
ret

btr $53, %rax  only has 2 per clock throughput instead of 4 per clock for AND,
but a 10-byte mov instruction to set up the constant is almost never going to
be worth it for -mtune=haswell.  It takes up extra slots in the uop cache.

---

The inner loop from Matthias's attached program *really* confuses gcc, so
badly that it never gets to the btr pattern, apparently.

unsigned long cfunc_one(unsigned long tmp) {
for (unsigned long bit = 0; bit < sizeof(unsigned long) * 8; bit += 3) {
tmp &= ~(1UL << bit);
}
return tmp;
}

movq    %rdi, %rax
xorl    %ecx, %ecx
movl    $1, %esi
.L5:
movq    %rsi, %rdx   # start with 1UL every time
salq    %cl, %rdx
addq    $3, %rcx
notq    %rdx         # what happened to rotating -2?
andq    %rdx, %rax
cmpq    $66, %rcx
jne .L5
ret


This is obviously horrible, but the right answer isn't btr in a loop, it's what
clang does:

movabsq $7905747460161236406, %rax # imm = 0x6DB6DB6DB6DB6DB6, every third bit unset
andq    %rdi, %rax
retq

gcc does spot this with `bit += 7`, I guess because with fewer iterations it
decides to try fully unrolling and then can optimize.

With a constant shift count and an inline function call, gcc manages to get
really confused auto-vectorizing the loop:

uint64_t btr64(uint64_t x, unsigned bit) {
bit = 53;
return x & ~(1ULL << bit);
}

unsigned long cfunc_one(unsigned long tmp) {
for (unsigned long bit = 0; bit < sizeof(unsigned long) * 8; bit += 7) {
//tmp &= ~(1UL << bit);
tmp = btr64(tmp, bit);
}
return tmp;
}


movdqa  .LC0(%rip), %xmm0    # constant with both halves the same.
movdqa  %xmm0, %xmm1
psrldq  $8, %xmm1
pand    %xmm1, %xmm0
movq    %xmm0, %rax
 # The above is equivalent to mov .LC0(%rip), %rax
andq    %rdi, %rax
ret



(In reply to Richard Biener from comment #1)
> Can you provide a testcase that can be compiled please?
> 
> Cut from i386.md:
> 
> ;; %%% bts, btr, btc, bt.
> ;; In general these instructions are *slow* when applied to memory,
> ;; since they enforce atomic operation.

This error is fixed in the current version 
https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/config/i386/i386.md.

They're slow because of crazy-CISC semantics, and aren't atomic without a lock
prefix.  btr %rax, (%rdi) uses %rax as a bit index into memory relative to
%rdi, so the actual byte or dword or qword eventually accessed is *not*
determined by the addressing mode alone.  It's micro-coded as several uops.

>  When applied to registers,
> ;; it depends on the cpu implementation.  They're never faster than
> ;; the corresponding and/ior/xor operations, so with 32-bit there's
> ;; no point.  But in 64-bit, we can't hold the relevant immediates
> ;; within the instruction itself, so operating on bits in the high
> ;; 32-bits of a register becomes easier.

This section is talking about using it with an immediate operand like
  btr  $53, %rax   because  and $imm64, %rax  doesn't exist, only and
$sign_extended_imm32, %rax

Does `(set_attr "type" "alu1")` mean gcc thinks it only has 1 per clock
throughput?  Or that it competes with other "alu1" instructions?

On Intel since Sandybridge, bt/btr/bts/btc reg,reg or imm,reg is 2 per clock. 
It's 1 per clock on Bulldozer-family and Jaguar, 2 per clock on Ryzen.

On Silvermont / KNL, they're 1 per clock occupying both integer ports.
