Re: [PATCH v2 1/2] aarch64: Use standard names for saturating arithmetic
Hi Kyrill, On 17/12/2024 15:15, Kyrylo Tkachov wrote: We avoid using the __builtin_aarch64_* builtins in test cases as they are undocumented and we don’t make any guarantees about their stability to users. I’d prefer if the saturating operation was open-coded in C. I expect the midend machinery is smart enough to recognize the saturating logic for scalars by now? Thanks for the detailed feedback. It's been really helpful, and I've gone ahead and implemented almost all of it. I'm struggling to find a pattern that's recognised for signed arithmetic though- the following emits branching code: int64_t __attribute__((noipa)) sadd64 (int64_t __a, int64_t __b) { if (__a > 0) { if (__b > INT64_MAX - __a) return INT64_MAX; } else if (__b < INT64_MIN - __a) { return INT64_MIN; } return __a + __b; } Resulting assembly: |sadd64: .LFB6: .cfi_startproc mov x3, x0 cmp x0, 0 ble .L9 mov x2, 9223372036854775807 sub x4, x2, x0 mov x0, x2 cmp x4, x1 blt .L8 .L11: add x0, x3, x1 .L8: ret .p2align 2,,3 .L9: mov x2, -9223372036854775808 sub x0, x2, x0 cmp x0, x1 ble .L11 mov x0, x2 ret Is there a way to force this not to use branches by any chance? I'll keep looking and see if there are some patterns recently added to match that will work here. If I don't find something, would it be sufficient to use the scalar NEON intrinsics for this? And if so, would that mean the test should move to the Adv. SIMD directory? Many thanks once again, Akram |
Re: [PATCH v2 1/2] aarch64: Use standard names for saturating arithmetic
Hi Akram, > On 14 Nov 2024, at 16:53, Akram Ahmad wrote: > > This renames the existing {s,u}q{add,sub} instructions to use the > standard names {s,u}s{add,sub}3 which are used by IFN_SAT_ADD and > IFN_SAT_SUB. > > The NEON intrinsics for saturating arithmetic and their corresponding > builtins are changed to use these standard names too. > > Using the standard names for the instructions causes 32 and 64-bit > unsigned scalar saturating arithmetic to use the NEON instructions, > resulting in an additional (and inefficient) FMOV to be generated when > the original operands are in GP registers. This patch therefore also > restores the original behaviour of using the adds/subs instructions > in this circumstance. > > Furthermore, this patch introduces a new optimisation for signed 32 > and 64-bit scalar saturating arithmetic which uses adds/subs in place > of the NEON instruction. > > Addition, before: > fmov d0, x0 > fmov d1, x1 > sqadd d0, d0, d1 > fmov x0, d0 > > Addition, after: > asr x2, x1, 63 > adds x0, x0, x1 > eor x2, x2, 0x8000 > csinv x0, x0, x2, vc > > In the above example, subtraction replaces the adds with subs and the > csinv with csel. The 32-bit case follows the same approach. Arithmetic > with a constant operand is simplified further by directly storing the > saturating limit in the temporary register, resulting in only three > instructions being used. It is important to note that this only works > when early-ra is disabled due to an early-ra bug which erroneously > assigns FP registers to the operands; if early-ra is enabled, then the > original behaviour (NEON instruction) occurs. > > Additional tests are written for the scalar and Adv. SIMD cases to > ensure that the correct instructions are used. The NEON intrinsics are > already tested elsewhere. The signed scalar case is also tested with > an execution test to check the results. > > gcc/ChangeLog: > > * config/aarch64/aarch64-builtins.cc: Expand iterators. > * config/aarch64/aarch64-simd-builtins.def: Use standard names > * config/aarch64/aarch64-simd.md: Use standard names, split insn > definitions on signedness of operator and type of operands. > * config/aarch64/arm_neon.h: Use standard builtin names. > * config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to > simplify splitting of insn for scalar arithmetic. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc: > Template file for unsigned vector saturating arithmetic tests. > * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c: > 8-bit vector type tests. > * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c: > 16-bit vector type tests. > * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c: > 32-bit vector type tests. > * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c: > 64-bit vector type tests. > * gcc.target/aarch64/saturating_arithmetic.inc: Template file > for scalar saturating arithmetic tests. > * gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests. > * gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests. > * gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests. > * gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests. > * gcc.target/aarch64/saturating_arithmetic_signed.c: Signed tests. > --- > gcc/config/aarch64/aarch64-builtins.cc| 13 + > gcc/config/aarch64/aarch64-simd-builtins.def | 8 +- > gcc/config/aarch64/aarch64-simd.md| 209 ++- > gcc/config/aarch64/arm_neon.h | 96 +++ > gcc/config/aarch64/iterators.md | 4 + > .../saturating_arithmetic_autovect.inc| 58 + > .../saturating_arithmetic_autovect_1.c| 79 ++ > .../saturating_arithmetic_autovect_2.c| 79 ++ > .../saturating_arithmetic_autovect_3.c| 75 ++ > .../saturating_arithmetic_autovect_4.c| 77 ++ > .../aarch64/saturating-arithmetic-signed.c| 244 ++ > .../aarch64/saturating_arithmetic.inc | 39 +++ > .../aarch64/saturating_arithmetic_1.c | 36 +++ > .../aarch64/saturating_arithmetic_2.c | 36 +++ > .../aarch64/saturating_arithmetic_3.c | 30 +++ > .../aarch64/saturating_arithmetic_4.c | 30 +++ > 16 files changed, 1057 insertions(+), 56 deletions(-) > create mode 100644 > gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc > create mode 100644 > gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c > create mode 100644 > gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c > create mode 100644 > gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c > create mode 100644 > gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c > create mode 100644 > gcc/testsu
[PATCH v2 1/2] aarch64: Use standard names for saturating arithmetic
This renames the existing {s,u}q{add,sub} instructions to use the standard names {s,u}s{add,sub}3 which are used by IFN_SAT_ADD and IFN_SAT_SUB. The NEON intrinsics for saturating arithmetic and their corresponding builtins are changed to use these standard names too. Using the standard names for the instructions causes 32 and 64-bit unsigned scalar saturating arithmetic to use the NEON instructions, resulting in an additional (and inefficient) FMOV to be generated when the original operands are in GP registers. This patch therefore also restores the original behaviour of using the adds/subs instructions in this circumstance. Furthermore, this patch introduces a new optimisation for signed 32 and 64-bit scalar saturating arithmetic which uses adds/subs in place of the NEON instruction. Addition, before: fmovd0, x0 fmovd1, x1 sqadd d0, d0, d1 fmovx0, d0 Addition, after: asr x2, x1, 63 addsx0, x0, x1 eor x2, x2, 0x8000 csinv x0, x0, x2, vc In the above example, subtraction replaces the adds with subs and the csinv with csel. The 32-bit case follows the same approach. Arithmetic with a constant operand is simplified further by directly storing the saturating limit in the temporary register, resulting in only three instructions being used. It is important to note that this only works when early-ra is disabled due to an early-ra bug which erroneously assigns FP registers to the operands; if early-ra is enabled, then the original behaviour (NEON instruction) occurs. Additional tests are written for the scalar and Adv. SIMD cases to ensure that the correct instructions are used. The NEON intrinsics are already tested elsewhere. The signed scalar case is also tested with an execution test to check the results. gcc/ChangeLog: * config/aarch64/aarch64-builtins.cc: Expand iterators. * config/aarch64/aarch64-simd-builtins.def: Use standard names * config/aarch64/aarch64-simd.md: Use standard names, split insn definitions on signedness of operator and type of operands. * config/aarch64/arm_neon.h: Use standard builtin names. * config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to simplify splitting of insn for scalar arithmetic. gcc/testsuite/ChangeLog: * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc: Template file for unsigned vector saturating arithmetic tests. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c: 8-bit vector type tests. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c: 16-bit vector type tests. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c: 32-bit vector type tests. * gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c: 64-bit vector type tests. * gcc.target/aarch64/saturating_arithmetic.inc: Template file for scalar saturating arithmetic tests. * gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests. * gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests. * gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests. * gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests. * gcc.target/aarch64/saturating_arithmetic_signed.c: Signed tests. --- gcc/config/aarch64/aarch64-builtins.cc| 13 + gcc/config/aarch64/aarch64-simd-builtins.def | 8 +- gcc/config/aarch64/aarch64-simd.md| 209 ++- gcc/config/aarch64/arm_neon.h | 96 +++ gcc/config/aarch64/iterators.md | 4 + .../saturating_arithmetic_autovect.inc| 58 + .../saturating_arithmetic_autovect_1.c| 79 ++ .../saturating_arithmetic_autovect_2.c| 79 ++ .../saturating_arithmetic_autovect_3.c| 75 ++ .../saturating_arithmetic_autovect_4.c| 77 ++ .../aarch64/saturating-arithmetic-signed.c| 244 ++ .../aarch64/saturating_arithmetic.inc | 39 +++ .../aarch64/saturating_arithmetic_1.c | 36 +++ .../aarch64/saturating_arithmetic_2.c | 36 +++ .../aarch64/saturating_arithmetic_3.c | 30 +++ .../aarch64/saturating_arithmetic_4.c | 30 +++ 16 files changed, 1057 insertions(+), 56 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc create mode 100644 gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c create mode 100644 gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c create mode 100644 gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c create mode 100644 gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_auto