https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86693

            Bug ID: 86693
           Summary: inefficient atomic_fetch_xor
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

(Compiled with -O2 on x86-64)

Consider the following example:

void func1();

void func(unsigned long *counter)
{
        if (__atomic_fetch_xor(counter, 1, __ATOMIC_ACQ_REL) == 1) {
                func1();
        }
}

The code can clearly be optimized to a single 'lock xorq' instead of a cmpxchg
loop: since xor is its own inverse, the test "old value == 1" is equivalent to
"old value ^ 1 == 0", so the condition can be read directly from the flags set
by the locked xor (just as in the similar fetch_sub and fetch_add cases, which
gcc already optimizes well).
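
A minimal sketch of the output one would expect instead (mirroring the
fetch_sub case shown further below; the .L4 label and tail-call layout are
illustrative, not actual gcc output):

func:
        lock xorq       $1, (%rdi)
        je      .L4
        rep ret
.L4:
        jmp     func1

Since xorq $1 sets ZF exactly when the old counter value was 1, the je can feed
directly off the flags of the locked instruction, with no loop and no separate
compare.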

However, gcc currently generates a cmpxchg loop:
func:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %rax
.L2:
        movq    %rax, %rcx
        movq    %rax, %rdx
        xorq    $1, %rcx
        lock cmpxchgq   %rcx, (%rdi)
        jne     .L2
        cmpq    $1, %rdx
        je      .L7
        rep ret

Compare this with the same function using fetch_sub instead of fetch_xor (a
sketch of that source variant follows the assembly):
func:
.LFB0:
        .cfi_startproc
        lock subq       $1, (%rdi)
        je      .L4
        rep ret
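
For completeness, the fetch_sub variant that presumably produced the assembly
above would be the same function with the builtin swapped (a sketch, not the
exact source from the report):

void func1();

void func(unsigned long *counter)
{
        /* old == 1 is equivalent to old - 1 == 0, which is exactly the ZF
           produced by lock subq, so gcc folds the test into the flags. */
        if (__atomic_fetch_sub(counter, 1, __ATOMIC_ACQ_REL) == 1) {
                func1();
        }
}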
