https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245

            Bug ID: 71245
           Summary: std::atomic<double> load/store bounces the data to the
                    stack using fild/fistp
           Product: gcc
           Version: 6.1.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: i386-linux-gnu

Same result with gcc 4.8, gcc5, and gcc6.1.  Didn't test exhaustively.

#include <atomic>

std::atomic<double> d(5.0);
void foo_d(void) {
  d =  d + 1.0;
  // d+=1.0;     // unimplemented
}

with gcc6.1 -m32 -O3 -march=i586 (https://godbolt.org/g/w3VKpG) this compiles
to

foo_d():
        subl    $20, %esp       #,
        fildq   d     #                # 
        fistpq  (%esp)      # %sfp     # copy `d`'s bits to the stack (with no
loss)
        fldl    (%esp)  # %sfp
        fadds   .LC0  #
        fstpl   (%esp)        # %sfp   # store d + 1.0 to the stack
        fildq   (%esp)        # %sfp   # 
        fistpq  d   #                  # atomic store using fild
        lock orl        $0, (%esp)     ## mfence equivalent
        addl    $20, %esp       #,
        ret

I assume fild/fistp is gcc's trick for implementing atomic loads and stores
without resorting to cmpxchg8b.  Clever, since 80bit float can't munge the
data.  The fild/fistp pairs are of course not necessary in this case, where the
data is already 64bit float.  The function should just fld / fstp to  d 
directly.

With -march=i486 or lower, gcc correctly doesn't assume that 64bit FP
loads/stores are atomic, so it calls a library function to do the atomic load
and store.  With SSE or SSE2 available, it uses an SSE load/store to copy to
the stack.  

With -msse2 and -mfpmath=sse, we finally load/store directly from/to d, with
movq / addsd / movq.  movq vs. movsd shouldn't make a performance difference, I
think.

--------

We don't need to allocate any stack space.  We could implement the StoreLoad
barrier with  lock or $0, -4(%esp) instead of reserving extra stack to avoid
doing it to our return address (which would introduce extra store-forwarding
delay before the ret could eventually retire).

Reply via email to