[Bug target/47769] [missed optimization] use of btr (bit test and reset)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

Bug 47769 depends on bug 105735, which changed state.

Bug 105735 Summary: GCC failed to reduce &= loop_inv in loop.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105735

           What       |Removed     |Added
           -------------------------------------
           Status     |UNCONFIRMED |RESOLVED
           Resolution |---         |DUPLICATE
[Bug target/47769] [missed optimization] use of btr (bit test and reset)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

--- Comment #7 from Hongtao.liu ---
> This is obviously horrible, but the right answer isn't btr in a loop, it's
> what clang does:
>
>         movabsq $7905747460161236406, %rax  # imm = 0x6DB6DB6DB6DB6DB6, every third bit unset
>         andq    %rdi, %rax
>         retq

Opened pr103462 for this.
[Bug target/47769] [missed optimization] use of btr (bit test and reset)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

Andrew Pinski changed:

           What             |Removed             |Added
           --------------------------------------------------
           Severity         |minor               |enhancement
           Last reconfirmed |2011-02-16 19:46:40 |2021-11-27
[Bug target/47769] [missed optimization] use of btr (bit test and reset)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

Peter Cordes changed:

           What |Removed |Added
           -----------------------------------
           CC   |        |peter at cordes dot ca

--- Comment #6 from Peter Cordes ---
This seems to be partially fixed in gcc8.0:

#include <stdint.h>
uint64_t btr_variable(uint64_t x, unsigned bit) {
    //bit = 53;   // produces btr in older gcc, too.
    return x & ~(1ULL << bit);
}

        movq    %rdi, %rax
        btrq    %rsi, %rax
        ret

vs. gcc7.2 -O3 -mtune=haswell

        movl    %esi, %ecx
        movq    $-2, %rdx
        rolq    %cl, %rdx
        movq    %rdx, %rax   # this is dumb, should have put the mask in rax in the first place
        andq    %rdi, %rax
        ret

Or with bit=53:

        movabsq $-9007199254740993, %rax
        andq    %rdi, %rax
        ret

btr $53, %rax only has 2 per clock throughput instead of 4 per clock for AND,
but a 10-byte mov instruction to set up the constant is almost never going to
be worth it for -mtune=haswell.  It takes up extra slots in the uop cache.

---

The inner loop from Matthias's attached program *really* confuses gcc, so
badly that it never gets to the btr pattern, apparently.

unsigned long cfunc_one(unsigned long tmp) {
    for (unsigned long bit = 0; bit < sizeof(unsigned long) * 8; bit += 3) {
        tmp &= ~(1UL << bit);
    }
    return tmp;
}

        movq    %rdi, %rax
        xorl    %ecx, %ecx
        movl    $1, %esi
.L5:
        movq    %rsi, %rdx   # start with 1UL every time
        salq    %cl, %rdx
        addq    $3, %rcx
        notq    %rdx         # what happened to rotating -2?
        andq    %rdx, %rax
        cmpq    $66, %rcx
        jne     .L5
        ret

This is obviously horrible, but the right answer isn't btr in a loop, it's
what clang does:

        movabsq $7905747460161236406, %rax  # imm = 0x6DB6DB6DB6DB6DB6, every third bit unset
        andq    %rdi, %rax
        retq

gcc does spot this with `bit += 7`, I guess because with fewer iterations it
decides to try fully unrolling and then can optimize.
With a constant shift count and an inline function call, gcc manages to get
really confused auto-vectorizing the loop:

uint64_t btr64(uint64_t x, unsigned bit) {
    bit = 53;
    return x & ~(1ULL << bit);
}

unsigned long cfunc_one(unsigned long tmp) {
    for (unsigned long bit = 0; bit < sizeof(unsigned long) * 8; bit += 7) {
        //tmp &= ~(1UL << bit);
        tmp = btr64(tmp, bit);
    }
    return tmp;
}

        movdqa  .LC0(%rip), %xmm0   # constant with both halves the same
        movdqa  %xmm0, %xmm1
        psrldq  $8, %xmm1
        pand    %xmm1, %xmm0
        movq    %xmm0, %rax
        # The above is equivalent to:  mov .LC0(%rip), %rax
        andq    %rdi, %rax
        ret

(In reply to Richard Biener from comment #1)
> Can you provide a testcase that can be compiled please?
>
> Cut&pasting from i386.md:
>
> ;; %%% bts, btr, btc, bt.
> ;; In general these instructions are *slow* when applied to memory,
> ;; since they enforce atomic operation.

This error is fixed in the current version,
https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/config/i386/i386.md.
They're slow because of crazy-CISC semantics, and aren't atomic without a
lock prefix.  btr %rax, (%rdi) uses %rax as a bit index into memory relative
to %rdi, so the actual byte or dword or qword eventually accessed is *not*
determined by the addressing mode alone.  It's micro-coded as several uops.

> When applied to registers,
> ;; it depends on the cpu implementation.  They're never faster than
> ;; the corresponding and/ior/xor operations, so with 32-bit there's
> ;; no point.  But in 64-bit, we can't hold the relevant immediates
> ;; within the instruction itself, so operating on bits in the high
> ;; 32-bits of a register becomes easier.

This section is talking about using it with an immediate operand like
btr $53, %rax, because and $imm64, %rax doesn't exist, only
and $sign_extended_imm32, %rax.

Does `(set_attr "type" "alu1")` mean gcc thinks it only has 1 per clock
throughput?  Or that it competes with other "alu1" instructions?
On Intel since Sandybridge, bt/btr/bts/btc reg,reg or imm,reg is 2 per clock. It's 1 per clock on Bulldozer-family and Jaguar, 2 per clock on Ryzen. On Silvermont / KNL, they're 1 per clock occupying both integer ports.
[Bug target/47769] [missed optimization] use of btr (bit test and reset)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

Paolo Carlini changed:

           What   |Removed |Added
           ----------------------
           Status |WAITING |NEW
[Bug target/47769] [missed optimization] use of btr (bit test and reset)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769 --- Comment #5 from Matthias Kretz 2013-05-03 11:45:49 UTC --- Another ping. The bug status is still WAITING...
[Bug target/47769] [missed optimization] use of btr (bit test and reset)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769 --- Comment #4 from Matthias Kretz 2012-03-29 09:55:46 UTC --- ping. Are you still waiting for more input from me?
[Bug target/47769] [missed optimization] use of btr (bit test and reset)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

--- Comment #3 from Matthias Kretz 2011-02-17 10:01:34 UTC ---
Created attachment 23376
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23376
TimeStampCounter class for benchmarking
[Bug target/47769] [missed optimization] use of btr (bit test and reset)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

--- Comment #2 from Matthias Kretz 2011-02-17 10:00:55 UTC ---
Created attachment 23375
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=23375
Test code to see whether btr gets used automatically and to compare speed.

Compile with: g++ -O3 -march=core2 -msse4 -Wall -funroll-loops

-funroll-loops is not required to see the speedup, but it shows that higher
instruction-level parallelism can be achieved with the use of btr.
[Bug target/47769] [missed optimization] use of btr (bit test and reset)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769

Richard Guenther changed:

           What             |Removed     |Added
           ---------------------------------------------------
           Keywords         |            |missed-optimization
           Target           |            |x86_64-*-*, i?86-*-*
           Status           |UNCONFIRMED |WAITING
           Last reconfirmed |            |2011.02.16 19:46:40
           Ever Confirmed   |0           |1

--- Comment #1 from Richard Guenther 2011-02-16 19:46:40 UTC ---
Can you provide a testcase that can be compiled please?

Cut&pasting from i386.md:

;; %%% bts, btr, btc, bt.
;; In general these instructions are *slow* when applied to memory,
;; since they enforce atomic operation.  When applied to registers,
;; it depends on the cpu implementation.  They're never faster than
;; the corresponding and/ior/xor operations, so with 32-bit there's
;; no point.  But in 64-bit, we can't hold the relevant immediates
;; within the instruction itself, so operating on bits in the high
;; 32-bits of a register becomes easier.
;;
;; These are slow on Nocona, but fast on Athlon64.  We do require the use
;; of btrq and btcq for corner cases of post-reload expansion of absdf and
;; negdf respectively, so they can never be disabled entirely.

(define_insn "*btrq"
  [(set (zero_extract:DI (match_operand:DI 0 "register_operand" "+r")
                         (const_int 1)
                         (match_operand:DI 1 "const_0_to_63_operand" ""))
        (const_int 0))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_64BIT && (TARGET_USE_BT || reload_completed)"
  "btr{q}\t{%1, %0|%0, %1}"
  [(set_attr "type" "alu1")
   (set_attr "prefix_0f" "1")
   (set_attr "mode" "DI")])

and

  /* X86_TUNE_USE_BT */
  m_AMD_MULTIPLE | m_ATOM | m_CORE2I7 | m_GENERIC,

so it appears it should already be used by default.