[Bug target/103069] cmpxchg isn't optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

Richard Biener changed:
           What             |Removed     |Added
  Target Milestone          |12.3        |12.4

--- Comment #24 from Richard Biener ---
GCC 12.3 is being released, retargeting bugs to GCC 12.4.
Richard Biener changed:
           What             |Removed     |Added
  Target Milestone          |12.2        |12.3

--- Comment #23 from Richard Biener ---
GCC 12.2 is being released, retargeting bugs to GCC 12.3.
Jakub Jelinek changed:
           What             |Removed     |Added
  Target Milestone          |12.0        |12.2

--- Comment #22 from Jakub Jelinek ---
GCC 12.1 is being released, retargeting bugs to GCC 12.2.
--- Comment #21 from CVS Commits ---
The master branch has been updated by Hongyu Wang:

https://gcc.gnu.org/g:522f25e90c781d284f8347a04940db8b41c42fd5

commit r12-8136-g522f25e90c781d284f8347a04940db8b41c42fd5
Author: Hongyu Wang
Date:   Wed Apr 13 14:51:36 2022 +0800

    i386: Fix infinite loop under -mrelax-cmpxchg-loop [PR103069]

    For -mrelax-cmpxchg-loop, which relaxes atomic_fetch_<logic> loops,
    there is a missing set to %eax when the compare fails, which would
    result in an infinite loop in some benchmarks.  Add the set to %eax
    to avoid it.

    gcc/ChangeLog:

            PR target/103069
            * config/i386/i386-expand.cc (ix86_expand_cmpxchg_loop):
            Add missing set to target_val at pause label.
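The contract that this fix restores is visible at the source level too: on failure, a compare-exchange must write the value it actually observed back into the "expected" operand (which GCC keeps in %eax for cmpxchg), otherwise the caller retries against stale data forever. A minimal portable sketch of that semantic, assuming C11 atomics; `cas_once` is a hypothetical wrapper, not GCC internals:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* On failure, atomic_compare_exchange_* must store the value actually
   seen in memory back into *expected -- the store the relaxed expansion
   was missing (the set to %eax at the pause label).  */
static bool cas_once(atomic_int *obj, int *expected, int desired)
{
    return atomic_compare_exchange_strong(obj, expected, desired);
}
```

If the failed path left `*expected` untouched, a retry loop built on this primitive could never make progress, which is exactly the infinite loop the commit describes.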
--- Comment #20 from Thiago Macieira ---
I think there will be cases where the relaxation makes sense and others
where it doesn't, because the surrounding code already does it.  So I'd
like to control it per emission.  If I can't do it per code block, I
suppose I could make a lambda:

  [&]() __attribute__((target("relax-cmpxchg-loop"))) {
      return __atomic_compare_exchange_weak();
  }();

Of course, it might be easier to simply relax it myself at that point.
--- Comment #19 from Hongyu Wang ---
(In reply to Thiago Macieira from comment #18)
> (In reply to Jakub Jelinek from comment #17)
> > _Pragma("GCC target \"relax-cmpxchg-loop\"")
> > should do that (ditto the target("relax-cmpxchg-loop") attribute).
>
> The attribute is applied to a function.  I'm hoping to do it for a
> block of code:
>
>   _Pragma("GCC push_options")
>   _Pragma("GCC target \"relax-cmpxchg-loop\"")
>   __atomic_compare_exchange_weak();
>   _Pragma("GCC pop_options")

I'm not aware of any target __attribute__ or #pragma that can be
applied to a code block.  At that level the user can change their code
directly, so I don't see why it is needed.
--- Comment #18 from Thiago Macieira ---
(In reply to Jakub Jelinek from comment #17)
> _Pragma("GCC target \"relax-cmpxchg-loop\"")
> should do that (ditto the target("relax-cmpxchg-loop") attribute).

The attribute is applied to a function.  I'm hoping to do it for a
block of code:

  _Pragma("GCC push_options")
  _Pragma("GCC target \"relax-cmpxchg-loop\"")
  __atomic_compare_exchange_weak();
  _Pragma("GCC pop_options")
Jakub Jelinek changed:
           What             |Removed     |Added
  CC                        |            |jakub at gcc dot gnu.org

--- Comment #17 from Jakub Jelinek ---
_Pragma("GCC target \"relax-cmpxchg-loop\"")
should do that (ditto the target("relax-cmpxchg-loop") attribute).
--- Comment #16 from Thiago Macieira ---
Can this option be enabled and disabled with a _Pragma?
--- Comment #15 from Hongyu Wang ---
(In reply to Thiago Macieira from comment #14)
> I'd restrict relaxations to loops emitted by the compiler.  All other
> atomic operations shouldn't be modified at all, unless the user asks
> for it.  That includes non-looping atomic operations (like LOCK BTC,
> LOCK XADD) as well as a pure LOCK CMPXCHG that came from a single
> __atomic_compare_exchange by the user.
>
> I'd welcome the ability to relax the latter, especially if with one
> codebase I could be efficient in CAS architectures as well as LL/SC
> ones.

The latest patch relaxes the pure LOCK CMPXCHG under
-mrelax-cmpxchg-loop, as the commit message shows.  So if you want, I
can split that part out into another switch like -mrelax-cmpxchg-insn.
--- Comment #14 from Thiago Macieira ---
I'd restrict relaxations to loops emitted by the compiler.  All other
atomic operations shouldn't be modified at all, unless the user asks
for it.  That includes non-looping atomic operations (like LOCK BTC,
LOCK XADD) as well as a pure LOCK CMPXCHG that came from a single
__atomic_compare_exchange by the user.

I'd welcome the ability to relax the latter, especially if with one
codebase I could be efficient in CAS architectures as well as LL/SC
ones.
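Relaxing a CAS loop "myself" in user code, as discussed above, amounts to the classic test-and-test-and-set shape: read the line for sharing with a plain atomic load, and only issue the exclusive LOCK CMPXCHG when the compare could actually succeed. A hedged portable sketch under C11 atomics; `set_bit_relaxed_loop` is a hypothetical helper, not glibc or GCC code:

```c
#include <stdatomic.h>

/* Test-and-test-and-set shape: load first (shared cache line access),
   attempt the exclusive lock cmpxchg only when it can succeed.
   Returns the previous value of *word, like a fetch-op would.  */
static unsigned set_bit_relaxed_loop(atomic_uint *word, unsigned bit)
{
    unsigned old = atomic_load_explicit(word, memory_order_relaxed);
    for (;;) {
        unsigned desired = old | bit;
        if (old == desired)
            return old;          /* bit already set: no LOCK'ed write at all */
        /* On failure, old is refreshed with the observed value. */
        if (atomic_compare_exchange_weak(word, &old, desired))
            return old;
    }
}
```

Note the early return skips the RMW entirely when the bit is already set, which is the same trade-off comment #1 raises: fine up to acq_rel on x86, but seq_cst callers may still need the RMW or a fence.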
--- Comment #13 from Hongyu Wang ---
All the glibc cases above are now relaxed with a load/cmp to skip the
cmpxchg under -mrelax-cmpxchg-loop, but for

> do
>   {
>     flags = THREAD_GETMEM (self, cancelhandling);
>     newval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
>                                         flags & ~SETXID_BITMASK, flags);
>   }
> while (flags != newval);

if we want to optimize it to lock btc, we need to know that the
cmpxchg lies in a loop.  That would require an extra pass to do further
analysis and optimization, which is not a good idea in stage 4.
--- Comment #12 from CVS Commits ---
The master branch has been updated by Hongyu Wang:

https://gcc.gnu.org/g:0435b978f95971e139882549f5a1765c50682216

commit r12-7316-g0435b978f95971e139882549f5a1765c50682216
Author: Hongyu Wang
Date:   Fri Feb 11 14:44:15 2022 +0800

    i386: Relax cmpxchg instruction under -mrelax-cmpxchg-loop [PR103069]

    cmpxchg is commonly used in spin loops, and some user code (such as
    pthread) takes cmpxchg directly as the loop condition, which causes
    heavy cache bouncing.

    This patch extends the previous implementation to relax all cmpxchg
    instructions under -mrelax-cmpxchg-loop with an extra atomic load, a
    compare, and emulation of the failed-cmpxchg behavior.

    An original spin loop that looks like

        loop: mov     %eax,%r8d
              or      $1,%r8d
              lock cmpxchg %r8d,(%rdi)
              jne     loop

    now turns into

        loop: mov     %eax,%r8d
              or      $1,%r8d
              mov     (%r8),%rsi       <--- load lock first
              cmp     %rsi,%rax        <--- compare with expected input
              jne     .L2              <--- lock ne expected
              lock cmpxchg %r8d,(%rdi)
              jne     loop
        .L2:  mov     %rsi,%rax        <--- perform the behavior of failed cmpxchg
              jne     loop

    under -mrelax-cmpxchg-loop.

    gcc/ChangeLog:

            PR target/103069
            * config/i386/i386-expand.cc (ix86_expand_atomic_fetch_op_loop):
            Split atomic fetch and loop part.
            (ix86_expand_cmpxchg_loop): New expander for cmpxchg loop.
            * config/i386/i386-protos.h (ix86_expand_cmpxchg_loop): New
            prototype.
            * config/i386/sync.md (atomic_compare_and_swap<mode>): Call new
            expander under TARGET_RELAX_CMPXCHG_LOOP.
            (atomic_compare_and_swap<mode>): Likewise for doubleword modes.

    gcc/testsuite/ChangeLog:

            PR target/103069
            * gcc.target/i386/pr103069-2.c: Adjust result check.
            * gcc.target/i386/pr103069-3.c: New test.
            * gcc.target/i386/pr103069-4.c: Likewise.
--- Comment #11 from Hongyu Wang ---
For the case with atomic_compare_exchange_weak_release, it can be
expanded as

    loop: mov     %eax,%r8d
          and     $0xfff80000,%r8d
          mov     (%r8),%rsi       <--- load lock first
          cmp     %rsi,%rax        <--- compare with expected input
          jne     .L2              <--- lock ne expected
          lock cmpxchg %r8d,(%rdi)
          jne     loop
    .L2:  mov     %rsi,%rax        <--- perform the behavior of failed cmpxchg
          jne     loop

But this is not suitable for atomic_compare_exchange_strong, as the
documentation says:

> Unlike atomic_compare_exchange_weak, this strong version is required
> to always return true when expected indeed compares equal to the
> contained object, not allowing spurious failures.

If we expand cmpxchg as above, it would result in spurious failures,
since the load is not atomic with the cmpxchg.  So for

    do
      pd->nextevent = __nptl_last_event;
    while (atomic_compare_and_exchange_bool_acq (&__nptl_last_event, pd,
                                                 pd->nextevent));

which invokes atomic_compare_exchange_strong, we may not simply adjust
the expander.  It is better to know that the call is in a loop
condition and relax it accordingly.
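The weak/strong distinction above is the standard one: weak CAS may fail spuriously, which is only acceptable when the caller retries in a loop that absorbs the failure. A hedged sketch of a correct weak-CAS retry loop, assuming C11 atomics; `push_node` is a hypothetical Treiber-stack-style example, not the glibc code:

```c
#include <stdatomic.h>

struct node { struct node *next; };

/* Lock-free stack push: the retry loop absorbs spurious failures, so
   the weak form (and a relaxed pre-load expansion like the one above)
   is allowed here.  */
static void push_node(struct node *_Atomic *head, struct node *n)
{
    n->next = atomic_load(head);
    /* On failure, n->next is updated to the current head and we retry. */
    while (!atomic_compare_exchange_weak(head, &n->next, n))
        ;
}
```

A single-shot strong CAS, by contrast, has no surrounding retry, so an expansion that can report a false failure (because the pre-load raced) would change observable behavior.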
--- Comment #10 from Thiago Macieira ---
(In reply to H.J. Lu from comment #9)
> nptl/nptl_setxid.c in glibc has
>
>   do
>     {
>       flags = THREAD_GETMEM (self, cancelhandling);
>       newval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
>                                           flags & ~SETXID_BITMASK, flags);
>     }
>   while (flags != newval);
>
> GCC 12 generates:
>
>   899f0: 64 8b 14 25 08 03 00    mov    %fs:0x308,%edx
>   899f8: 89 d6                   mov    %edx,%esi
>   899fa: 89 d0                   mov    %edx,%eax
>   899fc: 83 e6 bf                and    $0xffffffbf,%esi
>   899ff: f0 0f b1 31             lock cmpxchg %esi,(%rcx)
>   89a03: 75 eb                   jne    899f0 <__GI___nptl_setxid_sighandler+0x90>

This one is a single bit.  It should be replaced with a LOCK BTC and no
loop.
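The loop quoted above is just "atomically clear one bit and return the old flags", which needs no retry loop at all: a single LOCK'ed RMW does it. A hedged C-level sketch, assuming C11 atomics; `clear_setxid` is a hypothetical helper, and the bit value 1<<6 is inferred from the `$0xffffffbf` mask in the disassembly, not taken from glibc headers:

```c
#include <stdatomic.h>

#define SETXID_BITMASK (1u << 6)   /* illustrative: matches the 0xffffffbf mask */

/* One LOCK'ed and/btr-class instruction instead of a cmpxchg loop;
   returns the old flags, like THREAD_ATOMIC_CMPXCHG_VAL's result.  */
static unsigned clear_setxid(atomic_uint *flags)
{
    return atomic_fetch_and(flags, ~SETXID_BITMASK);
}
```

Whether GCC emits lock and, lock btr, or lock btc for this depends on how the result is used; the point is that the source-level operation is a single fetch-op, not a loop.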
H.J. Lu changed:
           What             |Removed     |Added
  Status                    |REOPENED    |NEW

--- Comment #9 from H.J. Lu ---
nptl/nptl_setxid.c in glibc has

  do
    {
      flags = THREAD_GETMEM (self, cancelhandling);
      newval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
                                          flags & ~SETXID_BITMASK, flags);
    }
  while (flags != newval);

GCC 12 generates:

  899f0: 64 8b 14 25 08 03 00    mov    %fs:0x308,%edx
  899f8: 89 d6                   mov    %edx,%esi
  899fa: 89 d0                   mov    %edx,%eax
  899fc: 83 e6 bf                and    $0xffffffbf,%esi
  899ff: f0 0f b1 31             lock cmpxchg %esi,(%rcx)
  89a03: 75 eb                   jne    899f0 <__GI___nptl_setxid_sighandler+0x90>
--- Comment #8 from H.J. Lu ---
nptl/pthread_create.c has

  do
    pd->nextevent = __nptl_last_event;
  while (atomic_compare_and_exchange_bool_acq (&__nptl_last_event, pd,
                                               pd->nextevent));

and we get

  8d1b7: 48 8b 05 da 55 16 00    mov    0x1655da(%rip),%rax
  8d1be: 48 89 83 68 06 00 00    mov    %rax,0x668(%rbx)
  8d1c5: f0 4c 0f b1 35 ca 55    lock cmpxchg %r14,0x1655ca(%rip)
  8d1ce: 75 e7                   jne    8d1b7
--- Comment #7 from H.J. Lu ---
nptl/pthread_mutex_unlock.c in glibc has:

  do
    {
      newval = oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK;
    }
  while (!atomic_compare_exchange_weak_release (&mutex->__data.__lock,
                                                &oldval, newval));

GCC 12 generates:

  90a81: 41 89 c0                mov    %eax,%r8d
  90a84: 41 81 e0 00 00 f8 ff    and    $0xfff80000,%r8d
  90a8b: f0 44 0f b1 07          lock cmpxchg %r8d,(%rdi)
  90a90: 75 ef                   jne    90a81 <__pthread_mutex_unlock_full+0x81>
H.J. Lu changed:
           What             |Removed     |Added
  Status                    |RESOLVED    |REOPENED
  Ever confirmed            |0           |1
  Resolution                |FIXED       |---
  Last reconfirmed          |            |2022-01-24

--- Comment #6 from H.J. Lu ---
nptl/pthread_rwlock_common.c in glibc has:

  unsigned int r = atomic_load_relaxed (&rwlock->__data.__readers);
  /* Release MO so that subsequent readers or writers synchronize with
     us.  */
  while (!atomic_compare_exchange_weak_release
         (&rwlock->__data.__readers, &r,
          ((r ^ PTHREAD_RWLOCK_WRLOCKED)
           ^ ((r >> PTHREAD_RWLOCK_READER_SHIFT) == 0 ? 0
              : PTHREAD_RWLOCK_WRPHASE))))
    {
      /* TODO Back-off.  */
    }

-mrelax-cmpxchg-loop -O2 generates:

  9240e: 89 c1                   mov    %eax,%ecx
  92410: 31 d2                   xor    %edx,%edx
  92412: c1 e9 03                shr    $0x3,%ecx
  92415: 0f 95 c2                setne  %dl
  92418: 31 c2                   xor    %eax,%edx
  9241a: 83 f2 02                xor    $0x2,%edx
  9241d: f0 41 0f b1 10          lock cmpxchg %edx,(%r8)
  92422: 75 ea                   jne    9240e <__pthread_rwlock_unlock@GLIBC_2.2.5+0x10e>
--- Comment #5 from CVS Commits ---
The master branch has been updated by Hongyu Wang:

https://gcc.gnu.org/g:15f5e70cbb33b40c97325ef9d7747a148d39

commit r12-5363-g15f5e70cbb33b40c97325ef9d7747a148d39
Author: Hongyu Wang
Date:   Thu Nov 18 14:45:23 2021 +0800

    i386: Fix wrong codegen for -mrelax-cmpxchg-loop

    -mrelax-cmpxchg-loop, introduced by PR103069 / r12-5265, could
    produce an infinite loop.  The correct code should be

        .L84:
            movl    (%rdi), %ecx
            movl    %eax, %edx
            orl     %esi, %edx
            cmpl    %eax, %ecx
            jne     .L82
            lock cmpxchgl %edx, (%rdi)
            jne     .L84
            movl    %r8d, %eax    <<< retval was missing in previous impl
            ret
        .L82:
            rep nop
            jmp     .L84

    Adjust the corresponding expander to fix the issue, and fix the
    runtime test so the problem is exposed.

    gcc/ChangeLog:

            * config/i386/i386-expand.c (ix86_expand_atomic_fetch_op_loop):
            Adjust generated cfg to avoid infinite loop.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr103069-2.c: Adjust.
H.J. Lu changed:
           What             |Removed     |Added
  Status                    |UNCONFIRMED |RESOLVED
  Target Milestone          |---         |12.0
  Resolution                |---         |FIXED

--- Comment #4 from H.J. Lu ---
Fixed for GCC 12.
--- Comment #3 from CVS Commits ---
The master branch has been updated by Hongyu Wang:

https://gcc.gnu.org/g:4d281ff7ddd8f6365943c0a622107f92315bb8a6

commit r12-5265-g4d281ff7ddd8f6365943c0a622107f92315bb8a6
Author: Hongyu Wang
Date:   Fri Nov 12 10:50:46 2021 +0800

    PR target/103069: Relax cmpxchg loop for x86 target

    From the CPU's point of view, getting a cache line for writing is
    more expensive than reading.  See Appendix A.2 "Spinlock" in:

    https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

    The full compare-and-swap grabs the cache line exclusively and
    causes excessive cache line bouncing.

    The atomic_fetch_{or,xor,and,nand} builtins generate a cmpxchg loop
    under -march=x86-64 like:

            movl    v(%rip), %eax
        .L2:
            movl    %eax, %ecx
            movl    %eax, %edx
            orl     $1, %ecx
            lock cmpxchgl %ecx, v(%rip)
            jne     .L2
            movl    %edx, %eax
            andl    $1, %eax
            ret

    To relax the above loop, GCC should first emit a normal load, check,
    and jump to .L2 if cmpxchgl may fail.  Before the jump to .L2, PAUSE
    should be inserted to yield the CPU to another hyperthread and to
    save power, so the code becomes

        .L84:
            movl    (%rdi), %ecx
            movl    %eax, %edx
            orl     %esi, %edx
            cmpl    %eax, %ecx
            jne     .L82
            lock cmpxchgl %edx, (%rdi)
            jne     .L84
        .L82:
            rep nop
            jmp     .L84

    This patch adds the corresponding atomic_fetch_op expanders to
    insert the load/compare and pause for all the atomic logic fetch
    builtins, and adds the flag -mrelax-cmpxchg-loop to control whether
    to generate the relaxed loop.

    gcc/ChangeLog:

            PR target/103069
            * config/i386/i386-expand.c (ix86_expand_atomic_fetch_op_loop):
            New expand function.
            * config/i386/i386-options.c (ix86_target_string): Add
            -mrelax-cmpxchg-loop flag.
            (ix86_valid_target_attribute_inner_p): Likewise.
            * config/i386/i386-protos.h (ix86_expand_atomic_fetch_op_loop):
            New expand function prototype.
            * config/i386/i386.opt: Add -mrelax-cmpxchg-loop.
            * config/i386/sync.md (atomic_fetch_<logic><mode>): New
            expander for SI,HI,QI modes.
            (atomic_<logic>_fetch<mode>): Likewise.
            (atomic_fetch_nand<mode>): Likewise.
            (atomic_nand_fetch<mode>): Likewise.
            (atomic_fetch_<logic><mode>): New expander for DI,TI modes.
            (atomic_<logic>_fetch<mode>): Likewise.
            (atomic_fetch_nand<mode>): Likewise.
            (atomic_nand_fetch<mode>): Likewise.
            * doc/invoke.texi: Document -mrelax-cmpxchg-loop.

    gcc/testsuite/ChangeLog:

            PR target/103069
            * gcc.target/i386/pr103069-1.c: New test.
            * gcc.target/i386/pr103069-2.c: Ditto.
--- Comment #2 from Thiago Macieira ---
See also bug 103090 for a few more (restricted) possibilities to
replace a cmpxchg loop with a LOCK RMW operation.
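The rewrite that bug 103090 asks for is of this shape: when a cmpxchg loop only computes old op constant (old+1, old&mask, old|bit, ...), the whole loop is semantically one LOCK'ed fetch-op. A hedged before/after sketch in C11 atomics; both functions are hypothetical examples, not glibc code:

```c
#include <stdatomic.h>

/* Before: hand-written CAS loop -- compiles to a lock cmpxchg loop. */
static int bump_naive(atomic_int *v)
{
    int old = atomic_load(v);
    while (!atomic_compare_exchange_weak(v, &old, old + 1))
        ;                         /* old is refreshed on each failure */
    return old;
}

/* After: the same semantics as one RMW -- a single lock xadd on x86. */
static int bump_rmw(atomic_int *v)
{
    return atomic_fetch_add(v, 1);
}
```

Recognizing that the first form can be rewritten as the second requires proving the update is a pure function of the old value with no other loop-carried state, which is the "restricted" part.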
--- Comment #1 from Thiago Macieira ---
(The assembly doesn't match the source code, but we got your point.)

Another possible improvement for the __atomic_fetch_{and,nand,or}
builtins is to check whether the fetched value is already correct and
branch out.  In your example, the __atomic_fetch_or with 0x4000 can
check if that bit is already set and, if so, not execute the CMPXCHG
at all.

This is a valid solution for x86 on memory orderings up to acq_rel;
other architectures may still need barriers.  For seq_cst, we either
need a barrier or we need to execute the CMPXCHG at least once.
Therefore, the emitted code might want to optimistically execute the
operation once and, if it fails, enter the load loop.  That's slightly
longer codegen; whether we want that under -Os or not, you'll have to
be the judge.

Prior art, from glibc/sysdeps/x86_64/nptl/pthread_spin_lock.S:

  ENTRY(__pthread_spin_lock)
  1:      LOCK
          decl    0(%rdi)
          jne     2f
          xor     %eax, %eax
          ret
          .align  16
  2:      rep nop
          cmpl    $0, 0(%rdi)
          jg      1b
          jmp     2b
  END(__pthread_spin_lock)

This does the atomic operation once, hoping it'll succeed.  If it
fails, it enters the PAUSE+CMP+JG loop until the value is suitable.
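The check-before-RMW idea above can be expressed at the source level too. A hedged portable sketch, assuming C11 atomics; `try_set_flag` is a hypothetical helper, not the builtin itself, and it deliberately uses acquire ordering on the early-out path, matching the "valid up to acq_rel" caveat:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Branch out early: if the bit is already set, the RMW would not change
   memory, so a plain load answers without any LOCK'ed write.
   Returns true iff this call is the one that set the bit.  */
static bool try_set_flag(atomic_uint *word, unsigned bit)
{
    if (atomic_load_explicit(word, memory_order_acquire) & bit)
        return false;                       /* already set: no CMPXCHG/LOCK OR */
    return !(atomic_fetch_or(word, bit) & bit);
}
```

For seq_cst callers this shortcut is not valid on its own, per the comment above: the RMW (or a fence) must still execute at least once.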