[Bug target/47769] [missed optimization] use of btr (bit test and reset)

2022-05-26 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769
Bug 47769 depends on bug 105735, which changed state.

Bug 105735 Summary: GCC failed to reduce &= loop_inv in loop.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105735

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

[Bug target/47769] [missed optimization] use of btr (bit test and reset)

2021-11-28 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

--- Comment #7 from Hongtao.liu  ---

> 
> This is obviously horrible, but the right answer isn't btr in a loop, it's
> what clang does:
> 
> movabsq $7905747460161236406, %rax  # imm = 0x6DB6DB6DB6DB6DB6, every third bit unset
> andq    %rdi, %rax
> retq
> 

Opened PR103462 for this.

[Bug target/47769] [missed optimization] use of btr (bit test and reset)

2021-11-27 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

Andrew Pinski  changed:

   What|Removed |Added

   Severity|minor   |enhancement
   Last reconfirmed|2011-02-16 19:46:40 |2021-11-27

[Bug target/47769] [missed optimization] use of btr (bit test and reset)

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #6 from Peter Cordes  ---
This seems to be partially fixed in gcc8.0:

#include <stdint.h>
uint64_t btr_variable(uint64_t x, unsigned bit) {
    //bit = 53;  // produces btr in older gcc, too.
    return x & ~(1ULL << bit);
}

movq    %rdi, %rax
btrq    %rsi, %rax
ret

vs. gcc7.2 -O3 -mtune=haswell
movl    %esi, %ecx
movq    $-2, %rdx
rolq    %cl, %rdx
movq    %rdx, %rax  # this is dumb, should have put the mask in rax in the first place
andq    %rdi, %rax
ret

Or with bit=53:
movabsq $-9007199254740993, %rax
andq    %rdi, %rax
ret

btr $53, %rax  only has 2 per clock throughput instead of 4 per clock for AND,
but a 10-byte mov instruction to set up the constant is almost never going to
be worth it for -mtune=haswell.  It takes up extra slots in the uop cache.

---

The inner loop from Matthias's attached program *really* confuses gcc, so
badly that it never gets to the btr pattern, apparently.

unsigned long cfunc_one(unsigned long tmp) {
    for (unsigned long bit = 0; bit < sizeof(unsigned long) * 8; bit += 3) {
        tmp &= ~(1UL << bit);
    }
    return tmp;
}

movq    %rdi, %rax
xorl    %ecx, %ecx
movl    $1, %esi
.L5:
movq    %rsi, %rdx   # start with 1UL every time
salq    %cl, %rdx
addq    $3, %rcx
notq    %rdx         # what happened to rotating -2?
andq    %rdx, %rax
cmpq    $66, %rcx
jne     .L5
ret


This is obviously horrible, but the right answer isn't btr in a loop, it's what
clang does:

movabsq $7905747460161236406, %rax  # imm = 0x6DB6DB6DB6DB6DB6, every third bit unset
andq    %rdi, %rax
retq

gcc does spot this with `bit += 7`, I guess because with fewer iterations it
decides to try fully unrolling and then can optimize.

With a constant shift count and an inline function call, gcc manages to get
really confused auto-vectorizing the loop:

uint64_t btr64(uint64_t x, unsigned bit) {
    bit = 53;
    return x & ~(1ULL << bit);
}

unsigned long cfunc_one(unsigned long tmp) {
    for (unsigned long bit = 0; bit < sizeof(unsigned long) * 8; bit += 7) {
        //tmp &= ~(1UL << bit);
        tmp = btr64(tmp, bit);
    }
    return tmp;
}


movdqa  .LC0(%rip), %xmm0    # constant with both halves the same.
movdqa  %xmm0, %xmm1
psrldq  $8, %xmm1
pand    %xmm1, %xmm0
movq    %xmm0, %rax
        # The above is equivalent to mov .LC0(%rip), %rax
andq    %rdi, %rax
ret



(In reply to Richard Biener from comment #1)
> Can you provide a testcase that can be compiled please?
> 
> Cut&pasting from i386.md:
> 
> ;; %%% bts, btr, btc, bt.
> ;; In general these instructions are *slow* when applied to memory,
> ;; since they enforce atomic operation.

This error is fixed in the current version of the file:
https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/config/i386/i386.md.

They're slow because of crazy-CISC semantics, and aren't atomic without a lock
prefix.  btr %rax, (%rdi) uses %rax as a bit index into memory relative to
%rdi, so the actual byte or dword or qword eventually accessed is *not*
determined by the addressing mode alone.  It's micro-coded as several uops.
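
To make the "bit index into memory" point concrete, here is a rough C model (my sketch,
not anything from the compiler or the manual's pseudocode) of what btr with a register
bit offset and a 64-bit memory operand does; the real instruction also writes the old
bit value into CF, and it accepts negative offsets, which this sketch skips:

#include <stdint.h>

/* Approximate behaviour of `btr %rax, (%rdi)` with a 64-bit operand size
   and a non-negative bit offset: the offset first selects which qword of
   the bit string is touched (that's why the accessed address isn't known
   from the addressing mode alone), then the bit inside that qword is reset. */
int btr_mem_model(uint64_t *base, uint64_t bit)
{
    uint64_t *word = base + (bit >> 6);    /* which qword gets loaded/stored */
    uint64_t  mask = 1ULL << (bit & 63);   /* bit position inside that qword */
    int old = (*word & mask) != 0;         /* what the CPU would put in CF */
    *word &= ~mask;                        /* reset the bit, write back */
    return old;
}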

>  When applied to registers,
> ;; it depends on the cpu implementation.  They're never faster than
> ;; the corresponding and/ior/xor operations, so with 32-bit there's
> ;; no point.  But in 64-bit, we can't hold the relevant immediates
> ;; within the instruction itself, so operating on bits in the high
> ;; 32-bits of a register becomes easier.

This section is talking about using it with an immediate operand like
  btr $53, %rax
because  and $imm64, %rax  doesn't exist, only  and $sign_extended_imm32, %rax.
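
A quick illustration of that point (my own example, not from the .md file): a mask that
only clears a bit below bit 31 still fits in a sign-extended 32-bit AND immediate, while
clearing a higher bit forces either a movabsq of the mask or a btr:

#include <stdint.h>

/* ~(1ULL << 5) == -33 fits in a sign-extended imm32, so something like
   `andq $-33, %rdi` is possible here... */
uint64_t clear_bit5(uint64_t x)  { return x & ~(1ULL << 5); }

/* ...but ~(1ULL << 53) == 0xFFDFFFFFFFFFFFFF does not, so the compiler
   has to choose between a 10-byte movabsq + andq and `btrq $53, %reg`. */
uint64_t clear_bit53(uint64_t x) { return x & ~(1ULL << 53); }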

Does `(set_attr "type" "alu1")` mean gcc thinks it only has 1 per clock
throughput?  Or that it competes with other "alu1" instructions?

On Intel since Sandybridge, bt/btr/bts/btc reg,reg or imm,reg is 2 per clock. 
It's 1 per clock on Bulldozer-family and Jaguar, 2 per clock on Ryzen.

On Silvermont / KNL, they're 1 per clock occupying both integer ports.

[Bug target/47769] [missed optimization] use of btr (bit test and reset)

2013-05-03 Thread paolo.carlini at oracle dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

Paolo Carlini  changed:

   What|Removed |Added

 Status|WAITING |NEW

[Bug target/47769] [missed optimization] use of btr (bit test and reset)

2013-05-03 Thread kretz at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

--- Comment #5 from Matthias Kretz  2013-05-03 11:45:49 UTC ---

Another ping.

The bug status is still WAITING...


[Bug target/47769] [missed optimization] use of btr (bit test and reset)

2012-03-29 Thread kretz at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

--- Comment #4 from Matthias Kretz  2012-03-29 09:55:46 UTC ---
ping.
Are you still waiting for more input from me?


[Bug target/47769] [missed optimization] use of btr (bit test and reset)

2011-02-17 Thread kretz at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

--- Comment #3 from Matthias Kretz  2011-02-17 10:01:34 UTC ---
Created attachment 23376
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23376
TimeStampCounter class for benchmarking


[Bug target/47769] [missed optimization] use of btr (bit test and reset)

2011-02-17 Thread kretz at kde dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

--- Comment #2 from Matthias Kretz  2011-02-17 10:00:55 UTC ---
Created attachment 23375
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23375
Test code to see whether btr gets used automatically and to compare speed

compile with
g++ -O3 -march=core2 -msse4 -Wall -funroll-loops

-funroll-loops is not required to see the speedup, but it shows that higher
instruction-level parallelism can be achieved with the use of btr.


[Bug target/47769] [missed optimization] use of btr (bit test and reset)

2011-02-16 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

Richard Guenther  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Target||x86_64-*-*, i?86-*-*
 Status|UNCONFIRMED |WAITING
   Last reconfirmed||2011.02.16 19:46:40
 Ever Confirmed|0   |1

--- Comment #1 from Richard Guenther  2011-02-16 19:46:40 UTC ---
Can you provide a testcase that can be compiled please?

Cut&pasting from i386.md:

;; %%% bts, btr, btc, bt.
;; In general these instructions are *slow* when applied to memory,
;; since they enforce atomic operation.  When applied to registers,
;; it depends on the cpu implementation.  They're never faster than
;; the corresponding and/ior/xor operations, so with 32-bit there's
;; no point.  But in 64-bit, we can't hold the relevant immediates
;; within the instruction itself, so operating on bits in the high
;; 32-bits of a register becomes easier.
;;
;; These are slow on Nocona, but fast on Athlon64.  We do require the use
;; of btrq and btcq for corner cases of post-reload expansion of absdf and
;; negdf respectively, so they can never be disabled entirely.



(define_insn "*btrq"
  [(set (zero_extract:DI (match_operand:DI 0 "register_operand" "+r")
                         (const_int 1)
                         (match_operand:DI 1 "const_0_to_63_operand" ""))
        (const_int 0))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_64BIT && (TARGET_USE_BT || reload_completed)"
  "btr{q}\t{%1, %0|%0, %1}"
  [(set_attr "type" "alu1")
   (set_attr "prefix_0f" "1")
   (set_attr "mode" "DI")])

and

  /* X86_TUNE_USE_BT */
  m_AMD_MULTIPLE | m_ATOM | m_CORE2I7 | m_GENERIC,

so it appears it should already be used by default.