https://gcc.gnu.org/g:e14c673ea9ab2eca5de4db91b478f0b5297ef321
commit r15-696-ge14c673ea9ab2eca5de4db91b478f0b5297ef321
Author: Wilco Dijkstra
Date: Wed Apr 17 17:18:23 2024 +0100
AArch64: Improve costing of ctz
Improve costing of ctz - both TARGET_CSSC and vector cases were not handled yet.
https://gcc.gnu.org/g:804fa0bb92f8073394b3859edb810c3e23375530
commit r15-695-g804fa0bb92f8073394b3859edb810c3e23375530
Author: Wilco Dijkstra
Date: Thu Apr 25 17:33:00 2024 +0100
AArch64: Fix printing of 2-instruction alternatives
Add missing '\' in 2-instruction movsi/di alternatives so that they are printed on separate lines.
Hi Andrew,
A few comments on the implementation, I think it can be simplified a lot:
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -700,8 +700,9 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE =
> AARCH64_FL_SM_OFF;
> #define DWARF2_UNWIND_INFO 1
>
> /* Use R0 through R3 to pass exception handling
Hi Andrew,
> I should note popcount has a similar issue which I hope to fix next week.
> Popcount cost is used during expand so it is very useful to be slightly more
> correct.
It's useful to set the cost so that all of the special cases still apply - even if popcount is relatively fast, it's
https://gcc.gnu.org/g:43fb827f259e6fdea39bc4021950c810be769d58
commit r15-513-g43fb827f259e6fdea39bc4021950c810be769d58
Author: Wilco Dijkstra
Date: Wed May 15 13:07:27 2024 +0100
AArch64: Use UZP1 instead of INS
Use UZP1 instead of INS when combining low and high halves of vectors.
Improve costing of ctz - both TARGET_CSSC and vector cases were not handled yet.
Passes regress & bootstrap - OK for commit?
gcc:
* config/aarch64/aarch64.cc (aarch64_rtx_costs): Improve CTZ costing.
---
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index
Add missing '\' in 2-instruction movsi/di alternatives so that they are
printed on separate lines.
Passes bootstrap and regress, OK for commit once stage 1 reopens?
gcc:
* config/aarch64/aarch64.md (movsi_aarch64): Use '\;' to force
newline in 2-instruction pattern.
Use LDP/STP for large struct types as they have useful immediate offsets and
are typically faster.
This removes differences between little and big endian and allows use of
LDP/STP without UNSPEC.
Passes regress and bootstrap, OK for commit?
gcc:
* config/aarch64/aarch64.cc
Use UZP1 instead of INS when combining low and high halves of vectors.
UZP1 has 3 operands which improves register allocation, and is faster on
some microarchitectures.
Passes regress & bootstrap, OK for commit?
gcc:
* config/aarch64/aarch64-simd.md (aarch64_combine_internal):
According to documentation, '^' should only have an effect during reload.
However ira-costs.cc treats it in the same way as '?' during early costing.
As a result using '^' can accidentally disable valid alternatives and cause
significant regressions (see PR114741). Avoid this by ignoring '^'
https://gcc.gnu.org/g:6b86f71165de9ee64fb76489c04ce032dd74ac21
commit r15-8-g6b86f71165de9ee64fb76489c04ce032dd74ac21
Author: Wilco Dijkstra
Date: Wed Feb 21 23:34:37 2024 +
AArch64: Cleanup memset expansion
Cleanup memset implementation. Similar to memcpy/memmove, use an offset and bytes throughout.
https://gcc.gnu.org/g:768fbb56b3285b2a3cf067881e745e0f8caec215
commit r15-7-g768fbb56b3285b2a3cf067881e745e0f8caec215
Author: Wilco Dijkstra
Date: Fri Apr 26 15:09:31 2024 +0100
AArch64: Remove AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS
Remove the tune AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS since it is only used by an old core.
https://gcc.gnu.org/g:5716f8daf3f2abc54ececa61350fff0af2e7ce90
commit r15-6-g5716f8daf3f2abc54ececa61350fff0af2e7ce90
Author: Wilco Dijkstra
Date: Tue Mar 26 15:42:16 2024 +
libatomic: Cleanup macros in atomic_16.S
Cleanup the macros to add the libat_ prefixes in atomic_16.S
https://gcc.gnu.org/g:27b6d081f68528435066be2234c7329e31e0e84f
commit r14-9796-g27b6d081f68528435066be2234c7329e31e0e84f
Author: Wilco Dijkstra
Date: Tue Mar 26 15:08:02 2024 +
libatomic: Fix build for --disable-gnu-indirect-function [PR113986]
Fix libatomic build to support --disable-gnu-indirect-function on AArch64.
https://gcc.gnu.org/g:8f9e92eec3230d2f1305d414984e89aaebdfe0c6
commit r14-9776-g8f9e92eec3230d2f1305d414984e89aaebdfe0c6
Author: Wilco Dijkstra
Date: Wed Mar 27 16:06:13 2024 +
libgcc: Add missing HWCAP entries to aarch64/cpuinfo.c
A few HWCAP entries are missing from aarch64/cpuinfo.c. This results in build errors on older machines.
This counts as a trivial build fix, but since it's late in stage 4 I'll let maintainers chip in.
OK for commit?
libgcc/
* config/aarch64/cpuinfo.c: Add HWCAP_EVTSTRM, HWCAP_CRC32,
As mentioned in
https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648397.html ,
do some additional cleanup of the macros and aliases:
Cleanup the macros to add the libat_ prefixes in atomic_16.S. Emit the
alias to __atomic_ when ifuncs are not enabled in the ENTRY macro.
Passes regress and
Hi Richard,
> This description is too brief for me. Could you say in detail how the
> new scheme works? E.g. the description doesn't explain:
>
> -if ARCH_AARCH64_HAVE_LSE128
> -AM_CPPFLAGS = -DHAVE_FEAT_LSE128
> -endif
That is not needed because we can include auto-config.h in
On Thumb-2 the use of CBZ blocks conditional execution, so change the
test to compare with a non-zero value.
gcc/testsuite/ChangeLog:
PR target/113915
* gcc.target/arm/builtin-bswap.x: Fix test to avoid emitting CBZ.
---
diff --git a/gcc/testsuite/gcc.target/arm/builtin-bswap.x
https://gcc.gnu.org/g:5119c7927c70b02ab9768b30f40564480f556432
commit r14-9394-g5119c7927c70b02ab9768b30f40564480f556432
Author: Wilco Dijkstra
Date: Fri Mar 8 15:01:15 2024 +
ARM: Fix builtin-bswap-1.c test [PR113915]
On Thumb-2 the use of CBZ blocks conditional execution, so change the test to compare with a non-zero value.
https://gcc.gnu.org/g:19b23bf3c32df3cbb96b3d898a1d7142f7bea4a0
commit r14-9373-g19b23bf3c32df3cbb96b3d898a1d7142f7bea4a0
Author: Wilco Dijkstra
Date: Wed Feb 21 23:33:58 2024 +
AArch64: memcpy/memset expansions should not emit LDP/STP [PR113618]
The new RTL introduced for LDP/STP results in regressions due to use of UNSPEC.
https://gcc.gnu.org/g:b575f37a342cebb954aa85fa45df0604bfa1ada9
commit r14-9343-gb575f37a342cebb954aa85fa45df0604bfa1ada9
Author: Wilco Dijkstra
Date: Wed Mar 6 17:35:16 2024 +
ARM: Fix conditional execution [PR113915]
By default most patterns can be conditionalized on Arm targets.
Hi Richard,
> Did you test this on a thumb1 target? It seems to me that the target parts
> that you've
> removed were likely related to that. In fact, I don't see why this test
> would need to be changed at all.
The testcase explicitly forces a Thumb-2 target (arm_arch_v6t2). The patterns
Fix libatomic build to support --disable-gnu-indirect-function on AArch64.
Always build atomic_16.S and add aliases to the __atomic_* functions if
!HAVE_IFUNC.
Passes regress and bootstrap, OK for commit?
libatomic:
PR target/113986
* Makefile.in: Regenerated.
*
Hi Richard,
> This bit isn't. The correct fix here is to fix the pattern(s) concerned to
> add the missing predicate.
>
> Note that builtin-bswap.x explicitly mentions predicated mnemonics in the
> comments.
I fixed the patterns in v2. There are likely some more, plus we could likely
merge
Hi Richard,
> It looks like this is really doing two things at once: disabling the
> direct emission of LDP/STP Qs, and switching the GPR handling from using
> pairs of DImode moves to single TImode moves. At least, that seems to be
> the effect of...
No it still uses TImode for the
By default most patterns can be conditionalized on Arm targets. However
Thumb-2 predication requires the "predicable" attribute be explicitly
set to "yes". Most patterns are shared between Arm and Thumb(-2) and are
marked with "predicable". Given this sharing, it does not make sense to
use a
The new RTL introduced for LDP/STP results in regressions due to use of UNSPEC.
Given the new LDP fusion pass is good at finding LDP opportunities, change the
memcpy, memmove and memset expansions to emit single vector loads/stores.
This fixes the regression and enables more RTL optimization on
Hi Richard,
>> That tune is only used by an obsolete core. I ran the memcpy and memset
>> benchmarks from Optimized Routines on xgene-1 with and without LDP/STP.
>> There is no measurable penalty for using LDP/STP. I'm not sure why it was
>> ever added given it does not do anything useful. I'll
(follow-on based on review comments on
https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641913.html)
Remove the tune AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS since it is only
used by an old core and doesn't properly support -Os. SPECINT_2017
shows that removing it has no performance
Hi,
>> Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer
>> ID).
>>
>> Passes regress, OK for commit?
>
> Ok.
Also OK to backport to GCC 13, 12 and 11?
Cheers,
Wilco
Add support for -mcpu=cobalt-100 (Neoverse N2 with a different implementer ID).
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64-cores.def (AARCH64_CORE): Add 'cobalt-100' CPU.
* config/aarch64/aarch64-tune.md: Regenerated.
* doc/invoke.texi
Hi Richard,
>> + rtx base = strip_offset_and_salt (XEXP (x, 1), );
>
> This should be just strip_offset, so that we don't lose the salt
> during optimisation.
Fixed.
> +
> + if (offset.is_constant ())
> I'm not sure this is really required. Logically the same thing
> would apply to
GCC tends to optimistically create CONST of globals with an immediate offset.
However it is almost always better to CSE addresses of globals and add immediate
offsets separately (the offset could be merged later in single-use cases).
Splitting CONST expressions with an index in
Hi Richard,
>> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
>
> Since this isn't (AFAIK) a standard macro, there doesn't seem to be
> any need to put it in the header file. It could just go at the head
> of aarch64.cc instead.
Sure, I've moved it in v4.
>> + if (len <= 24 ||
Hi Richard,
>> Benchmarking showed that LSE and LSE2 RMW atomics have similar performance
>> once
>> the atomic is acquire, release or both. Given there is already a significant
>> overhead due
>> to the function call, PLT indirection and argument setup, it doesn't make
>> sense to add
>>
Hi,
>> Is there no benefit to using SWPPL for RELEASE here? Similarly for the
>> others.
>
> We started off implementing all possible memory orderings available.
> Wilco saw value in merging less restricted orderings into more
> restricted ones - mainly to reduce codesize in less frequently
v3: rebased to latest trunk
Cleanup memset implementation. Similar to memcpy/memmove, use an offset and
bytes throughout. Simplify the complex calculations when optimizing for size
by using a fixed limit.
Passes regress & bootstrap.
gcc/ChangeLog:
* config/aarch64/aarch64.h
Hi Richard,
>> Enable lock-free 128-bit atomics on AArch64. This is backwards compatible
>> with
>> existing binaries, gives better performance than locking atomics and is what
>> most users expect.
>
> Please add a justification for why it's backwards compatible, rather
> than just stating
Hi Richard,
> + rtx load[max_ops], store[max_ops];
>
> Please either add a comment explaining why 40 is guaranteed to be
> enough, or (my preference) use:
>
> auto_vec, ...> ops;
I've changed to using auto_vec since that should help reduce conflicts
with Alex' LDP changes. I double-checked
Hi Richard,
Thanks for the review, now committed.
> The new aarch64_split_compare_and_swap code looks a bit twisty.
> The approach in lse.S seems more obvious. But I'm guessing you
> didn't want to spend any time restructuring the pre-LSE
> -mno-outline-atomics code, and I agree the patch in
Hi Richard,
> +/* Maximum bytes set for an inline memset expansion. With -Os use 3 STP
> + and 1 MOVI/DUP (same size as a call). */
> +#define MAX_SET_SIZE(speed) (speed ? 256 : 96)
> So it looks like this assumes we have AdvSIMD. What about
> -mgeneral-regs-only?
After my strictalign
Hi,
>>> I checked codesize on SPECINT2017, and 96 had practically identical size.
>>> Using 128 would also be a reasonable Os value with a very slight size
>>> increase,
>>> and 384 looks good for O2 - however I didn't want to tune these values
>>> as this
>>> is a cleanup patch.
>>>
>>> Cheers,
Hi Kyrill,
> + if (!(hwcap & HWCAP_CPUID))
> + return false;
> +
> + unsigned long midr;
> + asm volatile ("mrs %0, midr_el1" : "=r" (midr));
> From what I recall that midr_el1 register is emulated by the kernel and so
> userspace software
> has to check that the kernel supports that
Hi Kyrill,
> + /* Reduce the maximum size with -Os. */
> + if (optimize_function_for_size_p (cfun))
> + max_set_size = 96;
> +
> This is a new "magic" number in this code. It looks sensible, but how
> did you arrive at it?
We need 1 instruction to create the value to store (DUP or
ping
From: Wilco Dijkstra
Sent: 02 June 2023 18:28
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64
[PR110061]
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries
ping
From: Wilco Dijkstra
Sent: 04 August 2023 16:05
To: GCC Patches ; Richard Sandiford
Cc: Kyrylo Tkachov
Subject: [PATCH] libatomic: Improve ifunc selection on AArch64
Add support for ifunc selection based on CPUID register. Neoverse N1 supports
atomic 128-bit load/store, so use
ping
__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails. In
ping
Cleanup memset implementation. Similar to memcpy/memmove, use an offset and
bytes throughout. Simplify the complex calculations when optimizing for size
by using a fixed limit.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64.cc
ping
v2: further cleanups, improved comments
Add support for inline memmove expansions. The generated code is identical to that for memcpy, except that all loads are emitted before stores rather than being interleaved. The maximum size is 256 bytes which requires at most 16 registers.
Passes regress/bootstrap, OK for commit?
ping
v2: Use UINTVAL, rename max_mops_size.
The cpymemdi/setmemdi implementation doesn't fully support strict alignment.
Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT.
Clean up the condition when to use MOPS.
Passes regress/bootstrap, OK for commit?
v2: Use check-function-bodies in tests
Further improve immediate generation by adding support for 2-instruction
MOV/EOR bitmask immediates. This reduces the number of 3/4-instruction
immediates in SPECCPU2017 by ~2%.
Passes regress, OK for commit?
gcc/ChangeLog:
*
Further improve immediate generation by adding support for 2-instruction
MOV/EOR bitmask immediates. This reduces the number of 3/4-instruction
immediates in SPECCPU2017 by ~2%.
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_internal_mov_immediate)
Hi Ramana,
> I remember this to be the previous discussions and common understanding.
>
> https://gcc.gnu.org/legacy-ml/gcc/2016-06/msg00017.html
>
> and here
>
> https://gcc.gnu.org/legacy-ml/gcc-patches/2017-02/msg00168.html
>
> Can you point any discussion recently that shows this has changed
Hi Ramana,
>> I used --target=arm-none-linux-gnueabihf --host=arm-none-linux-gnueabihf
>> --build=arm-none-linux-gnueabihf --with-float=hard. However it seems that the
>> default armhf settings are incorrect. I shouldn't need the --with-float=hard
>> since
>> that is obviously implied by armhf,
Hi Ramana,
> Hope this helps.
Yes definitely!
>> Passes regress/bootstrap, OK for commit?
>
> Target ? armhf ? --with-arch , -with-fpu , -with-float parameters ?
> Please be specific.
I used --target=arm-none-linux-gnueabihf --host=arm-none-linux-gnueabihf
--build=arm-none-linux-gnueabihf
The outline atomic functions have hidden visibility and can only be called
directly. Therefore we can remove the BTI at function entry. This improves
security by reducing the number of indirect entry points in a binary.
The BTI markings on the objects are still emitted.
Passes regress, OK for
Hi Ramana,
>> __sync_val_compare_and_swap may be used on 128-bit types and either calls the
>> outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic
>> if
>> the value is stored successfully using STXP, but the current implementations
>> do not perform the store if the
Hi Richard,
> * config/aarch64/aarch64.md (cpymemdi): Remove pattern condition.
> Shouldn't this be a separate patch? It's not immediately obvious that this
> is a necessary part of this change.
You mean this?
@@ -1627,7 +1627,7 @@ (define_expand "cpymemdi"
(match_operand:BLK 1
A MOPS memmove may corrupt registers since there is no copy of the input
operands to temporary registers. Fix this by calling
aarch64_expand_cpymem_mops.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog/
PR target/21
* config/aarch64/aarch64.md
The cpymemdi/setmemdi implementation doesn't fully support strict alignment.
Block the expansion if the alignment is less than 16 with STRICT_ALIGNMENT.
Clean up the condition when to use MOPS.
Passes regress/bootstrap, OK for commit?
gcc/ChangeLog/
PR target/103100
*
Hi Richard,
>> Note that aarch64_internal_mov_immediate may be called after reload,
>> so it would end up even more complex.
>
> The sequence I quoted was supposed to work before and after reload. The:
>
> rtx tmp = aarch64_target_reg (dest, DImode);
>
> would create a fresh
Hi Richard,
> I was worried that reusing "dest" for intermediate results would
> prevent CSE for cases like:
>
> void g (long long, long long);
> void
> f (long long *ptr)
> {
> g (0xee11ee22ee11ee22LL, 0xdc23dc44ee11ee22LL);
> }
Note that aarch64_internal_mov_immediate may be called after
Support immediate expansion of immediates which can be created from 2 MOVKs
and a shifted ORR or BIC instruction. Change aarch64_split_dimode_const_store
to apply if we save one instruction.
This reduces the number of 4-instruction immediates in SPECINT/FP by 5%.
Passes regress, OK for commit?
__sync_val_compare_and_swap may be used on 128-bit types and either calls the
outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic if
the value is stored successfully using STXP, but the current implementations
do not perform the store if the comparison fails. In this case
List official cores first so that -mcpu=native does not show a codename with -v or in errors/warnings.
Passes regress, OK for commit?
gcc/ChangeLog:
* config/aarch64/aarch64-cores.def (neoverse-n1): Place before ares.
(neoverse-v1): Place before zeus.
(neoverse-v2): Place
The v7 memory ordering model allows reordering of conditional atomic
instructions.
To avoid this, make all atomic patterns unconditional. Expand atomic loads and
stores for all architectures so the memory access can be wrapped into an UNSPEC.
Passes regress/bootstrap, OK for commit?
Hi Richard,
(that's quick!)
> + if (size > max_copy_size || size > max_mops_size)
> +return aarch64_expand_cpymem_mops (operands, is_memmove);
>
> Could you explain this a bit more? If I've followed the logic correctly,
> max_copy_size will always be 0 for movmem, so this "if" condition
A MOPS memmove may corrupt registers since there is no copy of the input operands to temporary registers. Fix this by calling aarch64_expand_cpymem which does this. Also fix an issue with STRICT_ALIGNMENT being ignored if TARGET_MOPS is true, and avoid crashing or generating a huge expansion
Hi Richard,
>>> Answering my own question, N1 does not officially have FEAT_LSE2.
>>
>> It doesn't indeed. However most cores support atomic 128-bit load/store
>> (part of LSE2), so we can still use the LSE2 ifunc for those cores. Since
>> there
>> isn't a feature bit for this in the CPU or
Hi Richard,
>> Why would HWCAP_USCAT not be set by the kernel?
>>
>> Failing that, I would think you would check ID_AA64MMFR2_EL1.AT.
>>
> Answering my own question, N1 does not officially have FEAT_LSE2.
It doesn't indeed. However most cores support atomic 128-bit load/store
(part of LSE2), so
Add support for ifunc selection based on CPUID register. Neoverse N1 supports
atomic 128-bit load/store, so use the FEAT_USCAT ifunc like newer Neoverse
cores.
Passes regress, OK for commit?
libatomic/
* config/linux/aarch64/host-config.h (ifunc1): Use CPUID in ifunc
selection.
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with
existing binaries, gives better performance than locking atomics and is what
most users expect.
Note 128-bit atomic loads use a load/store exclusive loop if LSE2 is not
supported.
This results in an implicit store
ping
From: Wilco Dijkstra
Sent: 23 February 2023 15:11
To: GCC Patches
Cc: Richard Sandiford ; Kyrylo Tkachov
Subject: [PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891]
The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP -
without it, it effectively has
The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP -
without it, it effectively has Load-AcquirePC semantics similar to LDAPR,
which is less restrictive than what __ATOMIC_SEQ_CST requires. This patch
fixes this and adds comments to make it easier to see which sequence is
Hi,
>> + /* Return-address signing state is toggled by DW_CFA_GNU_window_save
>> (where
>> + REG_UNDEFINED means enabled), or set by a DW_CFA_expression. */
>
> Needs updating to REG_UNSAVED_ARCHEXT.
>
> OK with that changes, thanks, and sorry for the delays & runaround.
Thanks, I've
Hi,
> @Wilco, can you please send the rebased patch for patch review? We would
> need in out openSUSE package soon.
Here is an updated and rebased version:
Cheers,
Wilco
v4: rebase and add REG_UNSAVED_ARCHEXT.
A recent change only initializes the regs.how[] during Dwarf unwinding
which
Hi,
> On 1/10/23 19:12, Jakub Jelinek via Gcc-patches wrote:
>> Anyway, the sooner this makes it into gcc trunk, the better, it breaks quite
>> a lot of stuff.
>
> Yep, please, we're also waiting for this patch for pushing to our gcc13
> package.
Well I'm waiting for an OK from a maintainer...
Hi Szabolcs,
> i would keep the assert: how[reg] must be either UNSAVED or UNDEFINED
> here, other how[reg] means the toggle cfi instruction is mixed with
> incompatible instructions for the pseudo reg.
>
> and i would add a comment about this e.g. saying that UNSAVED/UNDEFINED
> how[reg] is used
Hi Richard,
> Hmm, but the point of the original patch was to support code generators
> that emit DW_CFA_val_expression instead of DW_CFA_AARCH64_negate_ra_state.
> Doesn't this patch undo that?
Well it wasn't clear from the code or comments that was supported. I've
added that back in v2.
>
Hi,
I don't believe there is a missing optimization here: compilers expand mempcpy
by default into memcpy since that is the standard library call. That means even
if your source code contains mempcpy, there will never be any calls to mempcpy.
The reason is obvious: most targets support optimized
Enable TARGET_CONST_ANCHOR to allow complex constants to be created via immediate add. Use a 24-bit range as that enables a 3 or 4-instruction immediate to be replaced by 2 additions. Fix the costing of immediate add to support 24-bit immediates and 12-bit shifted immediates. The generated
Hi Andreas,
Thanks for the report, I've committed the fix:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108006
Cheers,
Wilco
Ensure we only pass SI/DImode which fixes the assert.
Committed as obvious.
gcc/
PR target/108006
* config/aarch64/aarch64.c (aarch64_expand_sve_const_vector):
Fix call to aarch64_move_imm to use SI/DI.
---
diff --git a/gcc/config/aarch64/aarch64.cc
Hi,
> i don't think how[*RA_STATE] can ever be set to REG_SAVED_OFFSET,
> this pseudo reg is not spilled to the stack, it is reset to 0 in
> each frame and then toggled within a frame.
It is just a state; we can use any state we want since it is a pseudo reg.
These registers are global and
Hi Richard,
> - scalar_int_mode imode = (mode == HFmode
> - ? SImode
> - : int_mode_for_mode (mode).require ());
> + machine_mode imode = (mode == DFmode) ? DImode : SImode;
> It looks like this might mishandle DDmode, if not now