[Bug target/103069] cmpxchg isn't optimized

2023-05-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|12.3|12.4

--- Comment #24 from Richard Biener  ---
GCC 12.3 is being released, retargeting bugs to GCC 12.4.

[Bug target/103069] cmpxchg isn't optimized

2022-08-19 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|12.2|12.3

--- Comment #23 from Richard Biener  ---
GCC 12.2 is being released, retargeting bugs to GCC 12.3.

[Bug target/103069] cmpxchg isn't optimized

2022-05-06 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

Jakub Jelinek  changed:

   What|Removed |Added

   Target Milestone|12.0|12.2

--- Comment #22 from Jakub Jelinek  ---
GCC 12.1 is being released, retargeting bugs to GCC 12.2.

[Bug target/103069] cmpxchg isn't optimized

2022-04-13 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #21 from CVS Commits  ---
The master branch has been updated by Hongyu Wang :

https://gcc.gnu.org/g:522f25e90c781d284f8347a04940db8b41c42fd5

commit r12-8136-g522f25e90c781d284f8347a04940db8b41c42fd5
Author: Hongyu Wang 
Date:   Wed Apr 13 14:51:36 2022 +0800

i386: Fix infinite loop under -mrelax-cmpxchg-loop [PR 103069]

For -mrelax-cmpxchg-loop, which relaxes atomic_fetch_ loops,
there is a missing set to %eax when the compare fails, which would result
in an infinite loop in some benchmarks. Add the missing set to %eax to avoid it.

gcc/ChangeLog:

PR target/103069
* config/i386/i386-expand.cc (ix86_expand_cmpxchg_loop):
  Add missing set to target_val at pause label.

[Bug target/103069] cmpxchg isn't optimized

2022-02-22 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #20 from Thiago Macieira  ---
I think there will be cases where the relaxation makes sense and others where
it doesn't, because the surrounding code already does it. So I'd like to control
it per emission.

If I can't do it per code block, I suppose I could make a lambda block

  [&]() __attribute__((target("relax-cmpxchg-loop"))) {
    return __atomic_compare_exchange_weak();
  }();

Of course, it might be easier to simply relax it myself at that point.
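
For reference, a hand-rolled per-call-site relaxation needs no compiler switch
at all; a minimal C11 sketch (the helper name and memory orderings here are
illustrative, not from any patch):

#include <stdatomic.h>
#include <stdbool.h>

/* Load and compare first; only attempt the LOCK'ed cmpxchg when it has
   a chance to succeed.  The fast-path failure emulates a failed cmpxchg
   by refreshing *expected, which the weak form permits.  */
static bool
relaxed_cas_weak (atomic_int *obj, int *expected, int desired)
{
  int cur = atomic_load_explicit (obj, memory_order_relaxed);
  if (cur != *expected)
    {
      *expected = cur;
      return false;
    }
  return atomic_compare_exchange_weak_explicit (obj, expected, desired,
                                                memory_order_acq_rel,
                                                memory_order_relaxed);
}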

[Bug target/103069] cmpxchg isn't optimized

2022-02-22 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #19 from Hongyu Wang  ---
(In reply to Thiago Macieira from comment #18)
> (In reply to Jakub Jelinek from comment #17)
> > _Pragma("GCC target \"relax-cmpxchg-loop\"")
> > should do that (ditto target("relax-cmpxchg-loop") attribute).
> 
> The attribute is applied to a function. I'm hoping to do it for a block of
> code:
> 
>  _Pragma("GCC push_options")
>  _Pragma("GCC target \"relax-cmpxchg-loop\"")
>  __atomic_compare_exchange_weak();
>  _Pragma("GCC pop_options")

I'm not aware of any target __attribute__ or #pragma that can be applied to a
code block. At that level users can change their code directly, so I don't know
why it is needed.

[Bug target/103069] cmpxchg isn't optimized

2022-02-22 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #18 from Thiago Macieira  ---
(In reply to Jakub Jelinek from comment #17)
> _Pragma("GCC target \"relax-cmpxchg-loop\"")
> should do that (ditto target("relax-cmpxchg-loop") attribute).

The attribute is applied to a function. I'm hoping to do it for a block of
code:

 _Pragma("GCC push_options")
 _Pragma("GCC target \"relax-cmpxchg-loop\"")
 __atomic_compare_exchange_weak();
 _Pragma("GCC pop_options")

[Bug target/103069] cmpxchg isn't optimized

2022-02-22 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #17 from Jakub Jelinek  ---
_Pragma("GCC target \"relax-cmpxchg-loop\"")
should do that (ditto target("relax-cmpxchg-loop") attribute).

[Bug target/103069] cmpxchg isn't optimized

2022-02-22 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #16 from Thiago Macieira  ---
Can this option be enabled and disabled with a _Pragma?

[Bug target/103069] cmpxchg isn't optimized

2022-02-22 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #15 from Hongyu Wang  ---
(In reply to Thiago Macieira from comment #14)
> I'd restrict relaxations to loops emitted by the compiler. All other atomic
> operations shouldn't be modified at all, unless the user asks for it. That
> includes non-looping atomic operations (like LOCK BTC, LOCK XADD) as well as
> a pure LOCK CMPXCHG that came from a single __atomic_compare_exchange by the
> user.
> 
> I'd welcome the ability to relax the latter, especially if with one codebase
> I could be efficient in CAS architectures as well as LL/SC ones.

The latest patch relaxed the pure LOCK CMPXCHG with -mrelax-cmpxchg-loop, as the
commit message shows. So if you want, I can split this part out into another
switch like -mrelax-cmpxchg-insn.

[Bug target/103069] cmpxchg isn't optimized

2022-02-21 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #14 from Thiago Macieira  ---
I'd restrict relaxations to loops emitted by the compiler. All other atomic
operations shouldn't be modified at all, unless the user asks for it. That
includes non-looping atomic operations (like LOCK BTC, LOCK XADD) as well as a
pure LOCK CMPXCHG that came from a single __atomic_compare_exchange by the
user.

I'd welcome the ability to relax the latter, especially if with one codebase I
could be efficient in CAS architectures as well as LL/SC ones.

[Bug target/103069] cmpxchg isn't optimized

2022-02-21 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #13 from Hongyu Wang  ---
All of the above glibc cases are now relaxed by a load/compare that skips the
cmpxchg under -mrelax-cmpxchg-loop,

but for

>   do
>     {
>       flags = THREAD_GETMEM (self, cancelhandling);
>       newval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
>                                           flags & ~SETXID_BITMASK, flags);
>     }
>   while (flags != newval);

If we want to optimize it to a LOCK BTC, we need to know that the cmpxchg lies
in a loop, so it would require an extra pass to do further analysis and
optimization, which is not a good idea in stage 4.
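
For comparison, the loop above only ever clears one bit, so at the source
level it is equivalent to a single atomic fetch-and, which the x86 backend can
already emit as one LOCK'ed instruction instead of a loop. A C11 sketch, with
the mask value assumed from the and $0xffffffbf in the disassembly in comment
#9 (i.e. SETXID_BITMASK = 0x40):

#include <stdatomic.h>

#define SETXID_BITMASK 0x40  /* assumed from the disassembly above */

/* One LOCK AND (or LOCK BTR when the old bit value is needed) replaces
   the whole CMPXCHG loop.  */
static int
clear_setxid_flag (atomic_int *cancelhandling)
{
  return atomic_fetch_and_explicit (cancelhandling, ~SETXID_BITMASK,
                                    memory_order_acquire);
}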

[Bug target/103069] cmpxchg isn't optimized

2022-02-21 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #12 from CVS Commits  ---
The master branch has been updated by Hongyu Wang :

https://gcc.gnu.org/g:0435b978f95971e139882549f5a1765c50682216

commit r12-7316-g0435b978f95971e139882549f5a1765c50682216
Author: Hongyu Wang 
Date:   Fri Feb 11 14:44:15 2022 +0800

i386: Relax cmpxchg instruction under -mrelax-cmpxchg-loop [PR103069]

cmpxchg is commonly used in spin loops, and some user code such as
pthread directly uses cmpxchg as the loop condition, which causes
heavy cache-line bouncing.

This patch extends the previous implementation to relax all cmpxchg
instructions under -mrelax-cmpxchg-loop with an extra atomic load and
compare, emulating the failed-cmpxchg behavior.

For the original spin loop, which looks like

loop: mov     %eax,%r8d
      or      $1,%r8d
      lock cmpxchg %r8d,(%rdi)
      jne     loop

it will now turn into

loop: mov     %eax,%r8d
      or      $1,%r8d
      mov     (%r8),%rsi    <--- load lock first
      cmp     %rsi,%rax     <--- compare with expected input
      jne     .L2           <--- lock ne expected
      lock cmpxchg %r8d,(%rdi)
      jne     loop
.L2:  mov     %rsi,%rax     <--- perform the behavior of failed cmpxchg
      jne     loop

under -mrelax-cmpxchg-loop.

gcc/ChangeLog:

PR target/103069
* config/i386/i386-expand.cc (ix86_expand_atomic_fetch_op_loop):
Split atomic fetch and loop part.
(ix86_expand_cmpxchg_loop): New expander for cmpxchg loop.
* config/i386/i386-protos.h (ix86_expand_cmpxchg_loop): New
prototype.
* config/i386/sync.md (atomic_compare_and_swap): Call new
expander under TARGET_RELAX_CMPXCHG_LOOP.
(atomic_compare_and_swap): Likewise for doubleword modes.

gcc/testsuite/ChangeLog:

PR target/103069
* gcc.target/i386/pr103069-2.c: Adjust result check.
* gcc.target/i386/pr103069-3.c: New test.
* gcc.target/i386/pr103069-4.c: Likewise.
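
In C11 terms, the spin-loop shape this commit targets looks roughly like the
following sketch (function name and memory orderings are illustrative, not
from the patch); under -mrelax-cmpxchg-loop the compiler itself guards the
LOCK CMPXCHG with a plain load and compare:

#include <stdatomic.h>

/* CAS used directly as the loop condition, as in the pthread cases
   above; the expected value is refreshed on every failure.  */
static void
spin_set_bit (atomic_int *lock)
{
  int expected = atomic_load_explicit (lock, memory_order_relaxed);
  while (!atomic_compare_exchange_weak_explicit (lock, &expected,
                                                 expected | 1,
                                                 memory_order_acquire,
                                                 memory_order_relaxed))
    ;  /* expected now holds the current value; retry */
}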

[Bug target/103069] cmpxchg isn't optimized

2022-02-15 Thread wwwhhhyyy333 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #11 from Hongyu Wang  ---


For the case with atomic_compare_exchange_weak_release, it can be expanded as

loop: mov     %eax,%r8d
      and     $0xfff80000,%r8d
      mov     (%r8),%rsi    <--- load lock first
      cmp     %rsi,%rax     <--- compare with expected input
      jne     .L2           <--- lock ne expected
      lock cmpxchg %r8d,(%rdi)
      mov     %rsi,%rax     <--- perform the behavior of failed cmpxchg
      jne     loop

But this is not suitable for atomic_compare_exchange_strong; as the
documentation says:

Unlike atomic_compare_exchange_weak, this strong version is required to always
return true when expected indeed compares equal to the contained object, not
allowing spurious failures.

If we expand cmpxchg as above, it would result in spurious failures, since the
load is not atomic.

So for

 do
   pd->nextevent = __nptl_last_event;
 while (atomic_compare_and_exchange_bool_acq (&__nptl_last_event,
                                              pd, pd->nextevent));

which invokes atomic_compare_exchange_strong, we may not simply adjust the
expander. It would be better to know that the call is in a loop condition and
relax it accordingly.
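
A C11 sketch of why the load-first expansion is only legal for the weak form
(an illustrative wrapper, not the expander itself):

#include <stdatomic.h>
#include <stdbool.h>

/* Weak form: spurious failure is allowed, so bailing out after a plain
   load that observed a mismatch is a conforming implementation.  */
static bool
cas_weak_relaxed (atomic_int *obj, int *expected, int desired)
{
  int cur = atomic_load_explicit (obj, memory_order_relaxed);
  if (cur != *expected)
    {
      *expected = cur;          /* emulate a failed cmpxchg */
      return false;
    }
  return atomic_compare_exchange_weak (obj, expected, desired);
}

/* Strong form: must NOT return false when the object actually compared
   equal, so the plain (possibly non-atomic for doubleword modes) load
   cannot be used to skip the LOCK CMPXCHG.  */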

[Bug target/103069] cmpxchg isn't optimized

2022-01-24 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #10 from Thiago Macieira  ---
(In reply to H.J. Lu from comment #9)
> nptl/nptl_setxid.c in glibc has
> 
>   do
>     {
>       flags = THREAD_GETMEM (self, cancelhandling);
>       newval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
>                                           flags & ~SETXID_BITMASK, flags);
>     }
>   while (flags != newval);
> 
> GCC 12 generates:
> 
>  899f0: 64 8b 14 25 08 03 00 00  mov    %fs:0x308,%edx
>  899f8: 89 d6                    mov    %edx,%esi
>  899fa: 89 d0                    mov    %edx,%eax
>  899fc: 83 e6 bf                 and    $0xffffffbf,%esi
>  899ff: f0 0f b1 31              lock cmpxchg %esi,(%rcx)
>  89a03: 75 eb                    jne    899f0 <__GI___nptl_setxid_sighandler+0x90>

This one is a single bit, so it should be replaced with a LOCK BTC and no
loop.

[Bug target/103069] cmpxchg isn't optimized

2022-01-24 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

H.J. Lu  changed:

   What|Removed |Added

 Status|REOPENED|NEW

--- Comment #9 from H.J. Lu  ---
nptl/nptl_setxid.c in glibc has

  do
    {
      flags = THREAD_GETMEM (self, cancelhandling);
      newval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
                                          flags & ~SETXID_BITMASK, flags);
    }
  while (flags != newval);

GCC 12 generates:

 899f0: 64 8b 14 25 08 03 00 00  mov    %fs:0x308,%edx
 899f8: 89 d6                    mov    %edx,%esi
 899fa: 89 d0                    mov    %edx,%eax
 899fc: 83 e6 bf                 and    $0xffffffbf,%esi
 899ff: f0 0f b1 31              lock cmpxchg %esi,(%rcx)
 89a03: 75 eb                    jne    899f0 <__GI___nptl_setxid_sighandler+0x90>

[Bug target/103069] cmpxchg isn't optimized

2022-01-24 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #8 from H.J. Lu  ---
nptl/pthread_create.c has

  do
    pd->nextevent = __nptl_last_event;
  while (atomic_compare_and_exchange_bool_acq (&__nptl_last_event,
                                               pd, pd->nextevent));

we got:

 8d1b7: 48 8b 05 da 55 16 00     mov    0x1655da(%rip),%rax
 8d1be: 48 89 83 68 06 00 00     mov    %rax,0x668(%rbx)
 8d1c5: f0 4c 0f b1 35 ca 55 16 00  lock cmpxchg %r14,0x1655ca(%rip)
 8d1ce: 75 e7                    jne    8d1b7


[Bug target/103069] cmpxchg isn't optimized

2022-01-24 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #7 from H.J. Lu  ---
nptl/pthread_mutex_unlock.c in glibc has:

  do
    {
      newval = oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK;
    }
  while (!atomic_compare_exchange_weak_release (&mutex->__data.__lock,
                                                &oldval, newval));

GCC 12 generates:

 90a81: 41 89 c0                 mov    %eax,%r8d
 90a84: 41 81 e0 00 00 f8 ff     and    $0xfff80000,%r8d
 90a8b: f0 44 0f b1 07           lock cmpxchg %r8d,(%rdi)
 90a90: 75 ef                    jne    90a81 <__pthread_mutex_unlock_full+0x81>

[Bug target/103069] cmpxchg isn't optimized

2022-01-24 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

H.J. Lu  changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Ever confirmed|0   |1
 Resolution|FIXED   |---
   Last reconfirmed||2022-01-24

--- Comment #6 from H.J. Lu  ---
nptl/pthread_rwlock_common.c in glibc has:

  unsigned int r = atomic_load_relaxed (&rwlock->__data.__readers);
  /* Release MO so that subsequent readers or writers synchronize with us.  */
  while (!atomic_compare_exchange_weak_release
         (&rwlock->__data.__readers, &r,
          ((r ^ PTHREAD_RWLOCK_WRLOCKED)
           ^ ((r >> PTHREAD_RWLOCK_READER_SHIFT) == 0 ? 0
              : PTHREAD_RWLOCK_WRPHASE))))
    {
      /* TODO Back-off.  */
    }

-mrelax-cmpxchg-loop -O2 generates:

 9240e: 89 c1                    mov    %eax,%ecx
 92410: 31 d2                    xor    %edx,%edx
 92412: c1 e9 03                 shr    $0x3,%ecx
 92415: 0f 95 c2                 setne  %dl
 92418: 31 c2                    xor    %eax,%edx
 9241a: 83 f2 02                 xor    $0x2,%edx
 9241d: f0 41 0f b1 10           lock cmpxchg %edx,(%r8)
 92422: 75 ea                    jne    9240e <__pthread_rwlock_unlock@GLIBC_2.2.5+0x10e>

[Bug target/103069] cmpxchg isn't optimized

2021-11-18 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #5 from CVS Commits  ---
The master branch has been updated by Hongyu Wang :

https://gcc.gnu.org/g:15f5e70cbb33b40c97325ef9d7747a148d39

commit r12-5363-g15f5e70cbb33b40c97325ef9d7747a148d39
Author: Hongyu Wang 
Date:   Thu Nov 18 14:45:23 2021 +0800

i386: Fix wrong codegen for -mrelax-cmpxchg-loop

For -mrelax-cmpxchg-loop, introduced by PR 103069/r12-5265, it would
produce an infinite loop. The correct code should be

.L84:
        movl    (%rdi), %ecx
        movl    %eax, %edx
        orl     %esi, %edx
        cmpl    %eax, %ecx
        jne     .L82
        lock cmpxchgl   %edx, (%rdi)
        jne     .L84
        movl    %r8d, %eax  <<< retval is missing in previous impl
        ret
.L82:
        rep nop
        jmp     .L84

Adjust the corresponding expander to fix this issue, and fix the runtime test
so the problem is exposed.

gcc/ChangeLog:

* config/i386/i386-expand.c (ix86_expand_atomic_fetch_op_loop):
Adjust generated cfg to avoid infinite loop.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr103069-2.c: Adjust.

[Bug target/103069] cmpxchg isn't optimized

2021-11-15 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

H.J. Lu  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
   Target Milestone|--- |12.0
 Resolution|--- |FIXED

--- Comment #4 from H.J. Lu  ---
Fixed for GCC 12.

[Bug target/103069] cmpxchg isn't optimized

2021-11-15 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #3 from CVS Commits  ---
The master branch has been updated by Hongyu Wang :

https://gcc.gnu.org/g:4d281ff7ddd8f6365943c0a622107f92315bb8a6

commit r12-5265-g4d281ff7ddd8f6365943c0a622107f92315bb8a6
Author: Hongyu Wang 
Date:   Fri Nov 12 10:50:46 2021 +0800

PR target/103069: Relax cmpxchg loop for x86 target

From the CPU's point of view, getting a cache line for writing is more
expensive than reading.  See Appendix A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

The full compare and swap grabs the cache line exclusively and causes
excessive cache-line bouncing.

The atomic_fetch_{or,xor,and,nand} builtins generate a cmpxchg loop under
-march=x86-64 like:

        movl    v(%rip), %eax
.L2:
        movl    %eax, %ecx
        movl    %eax, %edx
        orl     $1, %ecx
        lock cmpxchgl   %ecx, v(%rip)
        jne     .L2
        movl    %edx, %eax
        andl    $1, %eax
        ret

To relax the above loop, GCC should first emit a normal load, then check and
jump away if the cmpxchgl may fail. A PAUSE should be inserted there before
jumping back, to yield the CPU to another hyperthread and to save power, so the
code looks like

.L84:
        movl    (%rdi), %ecx
        movl    %eax, %edx
        orl     %esi, %edx
        cmpl    %eax, %ecx
        jne     .L82
        lock cmpxchgl   %edx, (%rdi)
        jne     .L84
.L82:
        rep nop
        jmp     .L84

This patch adds corresponding atomic_fetch_op expanders to insert the load/
compare and pause for all the atomic logic fetch builtins, and adds the flag
-mrelax-cmpxchg-loop to control whether to generate the relaxed loop.

gcc/ChangeLog:

PR target/103069
* config/i386/i386-expand.c (ix86_expand_atomic_fetch_op_loop):
New expand function.
* config/i386/i386-options.c (ix86_target_string): Add
-mrelax-cmpxchg-loop flag.
(ix86_valid_target_attribute_inner_p): Likewise.
* config/i386/i386-protos.h (ix86_expand_atomic_fetch_op_loop):
New expand function prototype.
* config/i386/i386.opt: Add -mrelax-cmpxchg-loop.
* config/i386/sync.md (atomic_fetch_): New expander
for SI,HI,QI modes.
(atomic__fetch): Likewise.
(atomic_fetch_nand): Likewise.
(atomic_nand_fetch): Likewise.
(atomic_fetch_): New expander for DI,TI modes.
(atomic__fetch): Likewise.
(atomic_fetch_nand): Likewise.
(atomic_nand_fetch): Likewise.
* doc/invoke.texi: Document -mrelax-cmpxchg-loop.

gcc/testsuite/ChangeLog:

PR target/103069
* gcc.target/i386/pr103069-1.c: New test.
* gcc.target/i386/pr103069-2.c: Ditto.
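
At the source level, the relaxed loop corresponds roughly to this C11 sketch
(the function name is illustrative; __builtin_ia32_pause is GCC's x86 PAUSE
intrinsic, standing in for the rep nop above):

#include <stdatomic.h>

static int
relaxed_fetch_or (atomic_int *v, int mask)
{
  int expected = atomic_load_explicit (v, memory_order_relaxed);
  for (;;)
    {
      /* Plain load and compare first; only LOCK CMPXCHG when the
         compare can succeed, otherwise PAUSE and retry.  */
      int cur = atomic_load_explicit (v, memory_order_relaxed);
      if (cur != expected)
        {
          expected = cur;
          __builtin_ia32_pause ();  /* rep nop */
          continue;
        }
      if (atomic_compare_exchange_weak_explicit (v, &expected,
                                                 expected | mask,
                                                 memory_order_seq_cst,
                                                 memory_order_relaxed))
        return expected;  /* old value, as atomic_fetch_or returns */
    }
}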

[Bug target/103069] cmpxchg isn't optimized

2021-11-04 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #2 from Thiago Macieira  ---
See also bug 103090 for a few more (restricted) possibilities to replace a
cmpxchg loop with a LOCK RMW operation.

[Bug target/103069] cmpxchg isn't optimized

2021-11-03 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #1 from Thiago Macieira  ---
(the assembly doesn't match the source code, but we got your point)

Another possible improvement for the __atomic_fetch_{and,nand,or} functions is
that they can check whether the fetched value is already correct and branch out.
In your example, the __atomic_fetch_or with 0x4000 can check if that bit is
already set and, if so, not execute the CMPXCHG at all.

This is a valid solution for x86 on memory orderings up to acq_rel. Other
architectures may still need barriers. For seq_cst, we either need a
barrier or we need to execute the CMPXCHG at least once.

Therefore, the emitted code might want to optimistically execute the operation
once and, if it fails, enter the load loop. That's a slightly longer codegen.
Whether we want that under -Os or not, you'll have to be the judge.
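
A sketch of that check-before-RMW idea at the source level (the helper name is
illustrative; per the caveat above, this relies on orderings up to acq_rel):

#include <stdatomic.h>

/* Skip the LOCK'ed RMW entirely when the bit is already set; the
   early-return path performs only an acquire load.  */
static int
fetch_or_checked (atomic_int *v, int bit)
{
  int cur = atomic_load_explicit (v, memory_order_acquire);
  if (cur & bit)
    return cur;                 /* bit already set, no CMPXCHG */
  return atomic_fetch_or_explicit (v, bit, memory_order_acq_rel);
}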

Prior art: glibc/sysdeps/x86_64/nptl/pthread_spin_lock.S:

ENTRY(__pthread_spin_lock)
1:      LOCK
        decl    0(%rdi)
        jne     2f
        xor     %eax, %eax
        ret

        .align  16
2:      rep
        nop
        cmpl    $0, 0(%rdi)
        jg      1b
        jmp     2b
END(__pthread_spin_lock)

This does the atomic operation once, hoping it'll succeed. If it fails, it
enters the PAUSE+CMP+JG loop until the value is suitable.
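
That optimistic-once shape in C11 (illustrative, mirroring the spin lock above
rather than any GCC-emitted code; __builtin_ia32_pause is GCC's x86 PAUSE
intrinsic):

#include <stdatomic.h>

static void
spin_lock_optimistic (atomic_int *lock)  /* lock starts at 1 (free) */
{
  for (;;)
    {
      /* Optimistic atomic attempt first, hoping it succeeds.  */
      if (atomic_fetch_sub_explicit (lock, 1, memory_order_acquire) == 1)
        return;                 /* was 1 (free): lock acquired */
      /* Failed: spin on plain loads with PAUSE until it looks free.  */
      while (atomic_load_explicit (lock, memory_order_relaxed) <= 0)
        __builtin_ia32_pause ();
    }
}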