Re: [PATCH] AArch64: Improve immediate expansion [PR105928]

2023-09-18 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > I was worried that reusing "dest" for intermediate results would prevent CSE for cases like: void g (long long, long long); void f (long long *ptr) { g (0xee11ee22ee11ee22LL, 0xdc23dc44ee11ee22LL); } Note that aarch64_internal_mov_immediate may be called after

[PATCH] AArch64: Improve immediate expansion [PR105928]

2023-09-14 Thread Wilco Dijkstra via Gcc-patches
Support expansion of immediates which can be created from 2 MOVKs and a shifted ORR or BIC instruction. Change aarch64_split_dimode_const_store to apply if we save one instruction. This reduces the number of 4-instruction immediates in SPECINT/FP by 5%. Passes regress, OK for commit?

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-09-13 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 02 June 2023 18:28 To: GCC Patches Cc: Richard Sandiford; Kyrylo Tkachov Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061] Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with existing binaries,

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-09-13 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 04 August 2023 16:05 To: GCC Patches; Richard Sandiford Cc: Kyrylo Tkachov Subject: [PATCH] libatomic: Improve ifunc selection on AArch64 Add support for ifunc selection based on CPUID register. Neoverse N1 supports atomic 128-bit load/store, so use the

[PATCH] AArch64: Fix __sync_val_compare_and_swap [PR111404]

2023-09-13 Thread Wilco Dijkstra via Gcc-patches
__sync_val_compare_and_swap may be used on 128-bit types and either calls the outline atomic code or uses an inline loop. On AArch64 LDXP is only atomic if the value is stored successfully using STXP, but the current implementations do not perform the store if the comparison fails. In this case

[PATCH] AArch64: List official cores before codenames

2023-09-13 Thread Wilco Dijkstra via Gcc-patches
List official cores first so that -mcpu=native does not show a codename with -v or in errors/warnings. Passes regress, OK for commit? gcc/ChangeLog: * config/aarch64/aarch64-cores.def (neoverse-n1): Place before ares. (neoverse-v1): Place before zeus. (neoverse-v2): Place

[PATCH] ARM: Block predication on atomics [PR111235]

2023-09-07 Thread Wilco Dijkstra via Gcc-patches
The v7 memory ordering model allows reordering of conditional atomic instructions. To avoid this, make all atomic patterns unconditional. Expand atomic loads and stores for all architectures so the memory access can be wrapped into an UNSPEC. Passes regress/bootstrap, OK for commit?

Re: [PATCH] AArch64: Fix MOPS memmove operand corruption [PR111121]

2023-08-23 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, (that's quick!) > + if (size > max_copy_size || size > max_mops_size) > +return aarch64_expand_cpymem_mops (operands, is_memmove); > > Could you explain this a bit more? If I've followed the logic correctly, > max_copy_size will always be 0 for movmem, so this "if" condition

[PATCH] AArch64: Fix MOPS memmove operand corruption [PR111121]

2023-08-23 Thread Wilco Dijkstra via Gcc-patches
A MOPS memmove may corrupt registers since there is no copy of the input operands to temporary registers. Fix this by calling aarch64_expand_cpymem which does this. Also fix an issue with STRICT_ALIGNMENT being ignored if TARGET_MOPS is true, and avoid crashing or generating a huge expansion

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-08-10 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, >>> Answering my own question, N1 does not officially have FEAT_LSE2. >> It doesn't indeed. However most cores support atomic 128-bit load/store (part of LSE2), so we can still use the LSE2 ifunc for those cores. Since there isn't a feature bit for this in the CPU or

Re: [PATCH] libatomic: Improve ifunc selection on AArch64

2023-08-10 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, >> Why would HWCAP_USCAT not be set by the kernel? >> Failing that, I would think you would check ID_AA64MMFR2_EL1.AT. > Answering my own question, N1 does not officially have FEAT_LSE2. It doesn't indeed. However most cores support atomic 128-bit load/store (part of LSE2), so

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-08-04 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 02 June 2023 18:28 To: GCC Patches Cc: Richard Sandiford; Kyrylo Tkachov Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061] Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with existing binaries,

[PATCH] libatomic: Improve ifunc selection on AArch64

2023-08-04 Thread Wilco Dijkstra via Gcc-patches
Add support for ifunc selection based on CPUID register. Neoverse N1 supports atomic 128-bit load/store, so use the FEAT_USCAT ifunc like newer Neoverse cores. Passes regress, OK for commit? libatomic/ config/linux/aarch64/host-config.h (ifunc1): Use CPUID in ifunc selection.

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-07-05 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 02 June 2023 18:28 To: GCC Patches Cc: Richard Sandiford; Kyrylo Tkachov Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061] Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with existing binaries,

Re: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-06-16 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 02 June 2023 18:28 To: GCC Patches Cc: Richard Sandiford; Kyrylo Tkachov Subject: [PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061] Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with existing binaries,

[PATCH] libatomic: Enable lock-free 128-bit atomics on AArch64 [PR110061]

2023-06-02 Thread Wilco Dijkstra via Gcc-patches
Enable lock-free 128-bit atomics on AArch64. This is backwards compatible with existing binaries, gives better performance than locking atomics and is what most users expect. Note 128-bit atomic loads use a load/store exclusive loop if LSE2 is not supported. This results in an implicit store

Re: [PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891]

2023-03-16 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 23 February 2023 15:11 To: GCC Patches Cc: Richard Sandiford; Kyrylo Tkachov Subject: [PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891] The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP - without it, it effectively has

[PATCH] libatomic: Fix SEQ_CST 128-bit atomic load [PR108891]

2023-02-23 Thread Wilco Dijkstra via Gcc-patches
The LSE2 ifunc for 16-byte atomic load requires a barrier before the LDP - without it, it effectively has Load-AcquirePC semantics similar to LDAPR, which is less restrictive than what __ATOMIC_SEQ_CST requires. This patch fixes this and adds comments to make it easier to see which sequence is

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-18 Thread Wilco Dijkstra via Gcc-patches
Hi, >> + /* Return-address signing state is toggled by DW_CFA_GNU_window_save (where REG_UNDEFINED means enabled), or set by a DW_CFA_expression.  */ > Needs updating to REG_UNSAVED_ARCHEXT. > OK with that change, thanks, and sorry for the delays & runaround. Thanks, I've

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-17 Thread Wilco Dijkstra via Gcc-patches
Hi, > @Wilco, can you please send the rebased patch for patch review? We would need it in our openSUSE package soon. Here is an updated and rebased version: Cheers, Wilco v4: rebase and add REG_UNSAVED_ARCHEXT. A recent change only initializes the regs.how[] during Dwarf unwinding which

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-11 Thread Wilco Dijkstra via Gcc-patches
Hi, > On 1/10/23 19:12, Jakub Jelinek via Gcc-patches wrote: >> Anyway, the sooner this makes it into gcc trunk, the better, it breaks quite >> a lot of stuff. > > Yep, please, we're also waiting for this patch for pushing to our gcc13 > package. Well I'm waiting for an OK from a maintainer...

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-10 Thread Wilco Dijkstra via Gcc-patches
Hi Szabolcs, > i would keep the assert: how[reg] must be either UNSAVED or UNDEFINED > here, other how[reg] means the toggle cfi instruction is mixed with > incompatible instructions for the pseudo reg. > > and i would add a comment about this e.g. saying that UNSAVED/UNDEFINED > how[reg] is used

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2023-01-03 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Hmm, but the point of the original patch was to support code generators that emit DW_CFA_val_expression instead of DW_CFA_AARCH64_negate_ra_state. > Doesn't this patch undo that? Well it wasn't clear from the code or comments that was supported. I've added that back in v2.

[PATCH] AArch64: Enable TARGET_CONST_ANCHOR

2022-12-09 Thread Wilco Dijkstra via Gcc-patches
Enable TARGET_CONST_ANCHOR to allow complex constants to be created via immediate add. Use a 24-bit range as that enables a 3 or 4-instruction immediate to be replaced by 2 additions. Fix the costing of immediate add to support 24-bit immediate and 12-bit shifted immediates. The generated

Re: [PATCH][AArch64] Cleanup move immediate code

2022-12-07 Thread Wilco Dijkstra via Gcc-patches
Hi Andreas, Thanks for the report, I've committed the fix: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108006 Cheers, Wilco

[COMMITTED] AArch64: Fix assert in aarch64_move_imm [PR108006]

2022-12-07 Thread Wilco Dijkstra via Gcc-patches
Ensure we only pass SI/DImode which fixes the assert. Committed as obvious. gcc/ PR target/108006 * config/aarch64/aarch64.c (aarch64_expand_sve_const_vector): Fix call to aarch64_move_imm to use SI/DI. --- diff --git a/gcc/config/aarch64/aarch64.cc

Re: [PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2022-12-06 Thread Wilco Dijkstra via Gcc-patches
Hi, > i don't think how[*RA_STATE] can ever be set to REG_SAVED_OFFSET, this pseudo reg is not spilled to the stack, it is reset to 0 in each frame and then toggled within a frame. It is just a state, we can use any state we want since it is a pseudo reg. These registers are global and

Re: [PATCH][AArch64] Cleanup move immediate code

2022-12-05 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > -  scalar_int_mode imode = (mode == HFmode ? SImode : int_mode_for_mode (mode).require ()); > +  machine_mode imode = (mode == DFmode) ? DImode : SImode; > It looks like this might mishandle DDmode, if not now

[PATCH] libgcc: Fix uninitialized RA signing on AArch64 [PR107678]

2022-12-01 Thread Wilco Dijkstra via Gcc-patches
A recent change only initializes the regs.how[] during Dwarf unwinding which resulted in an uninitialized offset used in return address signing and random failures during unwinding. The fix is to use REG_SAVED_OFFSET as the state where the return address signing bit is valid, and if the state is

Re: [PATCH][AArch64] Cleanup move immediate code

2022-11-29 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Just to make sure I understand: isn't it really just MOVN?  I would have expected a 32-bit MOVZ to be equivalent to (and add no capabilities over) a 64-bit MOVZ. The 32-bit MOVZ immediates are equivalent, MOVN never overlaps, and MOVI has some overlaps. Since we allow all 3

Re: [PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-23 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, >> A smart reassociation pass could form more FMAs while also increasing >> parallelism, but the way it currently works always results in fewer FMAs. > > Yeah, as Richard said, that seems the right long-term fix. > It would also avoid the hack of treating PLUS_EXPR as a signal > of an

Re: [PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-22 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > I guess an obvious question is: if 1 (rather than 2) was the right value > for cores with 2 FMA pipes, why is 4 the right value for cores with 4 FMA > pipes?  It would be good to clarify how, conceptually, the core property > should map to the fma_reassoc_width value. 1 turns off

Re: [PATCH] AArch64: Add support for -mdirect-extern-access

2022-11-17 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Can you go into more detail about: > >    Use :option:`-mdirect-extern-access` either in shared libraries or in >    executables, but not in both.  Protected symbols used both in a shared >    library and executable may cause linker errors or fail to work correctly > > If this is

[PATCH] AArch64: Add support for -mdirect-extern-access

2022-11-11 Thread Wilco Dijkstra via Gcc-patches
Add a new option -mdirect-extern-access similar to other targets. This removes GOT indirections on external symbols with -fPIE, resulting in significantly better code quality. With -fPIC it only affects protected symbols, allowing for more efficient shared libraries which can be linked with

[PATCH] libatomic: Add support for LSE and LSE2

2022-11-11 Thread Wilco Dijkstra via Gcc-patches
Add support for AArch64 LSE and LSE2 to libatomic. Disable outline atomics, and use LSE ifuncs for 1-8 byte atomics and LSE2 ifuncs for 16-byte atomics. On Neoverse V1, 16-byte atomics are ~4x faster due to avoiding locks. Note this is safe since we swap all 16-byte atomics using the same ifunc,

[PATCH] AArch64: Add fma_reassoc_width [PR107413]

2022-11-09 Thread Wilco Dijkstra via Gcc-patches
Add a reassocation width for FMAs in per-CPU tuning structures. Keep the existing setting for cores with 2 FMA pipes, and use 4 for cores with 4 FMA pipes. This improves SPECFP2017 on Neoverse V1 by ~1.5%. Passes regress/bootstrap, OK for commit? gcc/ PR 107413 *

[committed] AArch64: Fix testcase

2022-11-04 Thread Wilco Dijkstra via Gcc-patches
Committed as trivial fix. gcc/testsuite/ * gcc.target/aarch64/mgeneral-regs_3.c: Fix testcase. --- diff --git a/gcc/testsuite/gcc.target/aarch64/mgeneral-regs_3.c b/gcc/testsuite/gcc.target/aarch64/mgeneral-regs_3.c index

[PATCH][AArch64] Cleanup move immediate code

2022-11-01 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, Here is the immediate cleanup splitoff from the previous patch: Simplify, refactor and improve various move immediate functions. Allow 32-bit MOVZ/N as a valid 64-bit immediate which removes special cases in aarch64_internal_mov_immediate. Add new constraint so the movdi pattern

Re: [PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-20 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Can you do the aarch64_mov_imm changes as a separate patch?  It's difficult > to review the two changes folded together like this. Sure, I'll send a separate patch. So here is version 2 again: [PATCH v2][AArch64] Improve immediate expansion [PR106583] Improve immediate expansion

Re: [PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-19 Thread Wilco Dijkstra via Gcc-patches
ping Hi Richard, >>> Sounds good, but could you put it before the mode version, >>> to avoid the forward declaration? >> >> I can swap them around but the forward declaration is still required as >> aarch64_check_bitmask is 5000 lines before aarch64_bitmask_imm. > > OK, how about moving them

Re: [PATCH][AArch64] Improve bit tests [PR105773]

2022-10-13 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Maybe pre-existing, but are ordered comparisons safe for the > ZERO_EXTRACT case?  If we extract the top 8 bits (say), zero extend, > and compare with zero, the result should be >= 0, whereas TST would > set N to the top bit. Yes in principle zero extract should always be positive

Re: [PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-12 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, >>> Sounds good, but could you put it before the mode version, >>> to avoid the forward declaration? >> >> I can swap them around but the forward declaration is still required as >> aarch64_check_bitmask is 5000 lines before aarch64_bitmask_imm. > > OK, how about moving them both

Re: [PATCH][AArch64] Improve bit tests [PR105773]

2022-10-12 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Realise this is awkward, but: CC_NZmode is for operations that set only > the N and Z flags to useful values.  If we want to take advantage of V > being zero then I think we need a different mode. > > We can't go all the way to CCmode because the carry flag has the opposite > value

Re: [PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-07 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, >> Yes, with a more general search loop we can get that case too - >> it doesn't trigger much though. The code that checks for this is >> now refactored into a new function. Given there are now many >> more calls to aarch64_bitmask_imm, I added a streamlined internal >> entry point

Re: [PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-06 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Did you consider handling the case where the movks aren't for > consecutive bitranges?  E.g. the patch handles: > but it looks like it would be fairly easy to extend it to: > >  0x12345678 Yes, with a more general search loop we can get that case too - it doesn't trigger

[PATCH][AArch64] Improve bit tests [PR105773]

2022-10-05 Thread Wilco Dijkstra via Gcc-patches
Since AArch64 sets all flags on logical operations, comparisons with zero can be combined into an AND even if the condition is LE or GT. Passes regress, OK for commit? gcc: PR target/105773 * config/aarch64/aarch64.cc (aarch64_select_cc_mode): Allow GT/LE for merging

[PATCH][AArch64] Improve immediate expansion [PR106583]

2022-10-04 Thread Wilco Dijkstra via Gcc-patches
Improve expansion of immediates which can be created from a bitmask immediate and 2 MOVKs. This reduces the number of 4-instruction immediates in SPECINT/FP by 10-15%. Passes regress, OK for commit? gcc/ChangeLog: PR target/106583 * config/aarch64/aarch64.cc

[PATCH] AArch64: Cleanup option processing code

2022-05-25 Thread Wilco Dijkstra via Gcc-patches
Further cleanup option processing. Remove the duplication of global variables for CPU and tune settings so that CPU option processing is simplified even further. Move global variables that need save and restore due to target option processing into aarch64.opt. This removes the need for explicit

Re: [PATCH] AArch64: Prioritise init_have_lse_atomics constructor [PR 105708]

2022-05-25 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, I've added a comment - as usual it's just a number. A quick grep in gcc and glibc showed that priorities 98-101 are used, so I just went a bit below so it has a higher priority than typical initializations. Cheers, Wilco Here is v2: Increase the priority of the

[PATCH] AArch64: Prioritise init_have_lse_atomics constructor [PR 105708]

2022-05-24 Thread Wilco Dijkstra via Gcc-patches
Increase the priority of the init_have_lse_atomics constructor so it runs before other constructors. This improves chances that rr works when LSE atomics are supported. Regress and bootstrap pass, OK for commit? 2022-05-24 Wilco Dijkstra libgcc/ PR libgcc/105708 *

Re: [AArch64] PR105162: emit barrier for __sync and __atomic builtins on CPUs without LSE

2022-05-13 Thread Wilco Dijkstra via Gcc-patches
Hi Sebastian, >> Note the patch still needs an appropriate commit message. > Added the following ChangeLog entry to the commit message. > * config/aarch64/aarch64-protos.h (atomic_ool_names): Increase dimension of str array. > * config/aarch64/aarch64.cc

Re: [PATCH] AArch64: Improve address rematerialization costs

2022-05-12 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > But even if the costs are too high, the patch seems to be overcompensating. > It doesn't make logical sense for an ADRP+LDR to be cheaper than an LDR. An LDR is not a replacement for ADRP+LDR, you need a store in addition to the original ADRP+LDR. Basically a simple spill would be

Re: [PATCH] AArch64: Improve address rematerialization costs

2022-05-12 Thread Wilco Dijkstra via Gcc-patches
Hi, >> It's also said that chosen alternatives might be the reason that rematerialization is not chosen and alternatives are chosen based on reload heuristics, not based on actual costs. > Thanks for the pointer.  Yeah, it'd be interesting to know if this > is the same issue,

Re: [PATCH] AArch64: Cleanup CPU option processing code

2022-05-12 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Looks like you might have attached the old patch.  The aarch64_option_restore > change is mentioned in the changelog but doesn't appear in the patch itself. Indeed, not sure how that happened. Here is the correct v2 anyway. Wilco The --with-cpu/--with-arch configure option

Re: [PATCH] AArch64: Cleanup CPU option processing code

2022-05-11 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Although invoking ./cc1 directly only half-works with --with-arch, > it half-works well-enough that I'd still like to keep it working. > But I agree we should apply your change first, then I can follow up > with a patch to make --with-* work with ./cc1 later.  (I have a version >

Re: [PATCH] AArch64: Improve address rematerialization costs

2022-05-11 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Yeah, I'm not disagreeing with any of that.  It's just a question of > whether the problem should be fixed by artificially lowering the general > rtx costs with one particular user (RA spill costs) in mind, or whether > it should be fixed by making the RA spill code take the factors

Re: [PATCH] AArch64: Improve address rematerialization costs

2022-05-10 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, >> There isn't really a better way of doing this within the existing costing >> code. > > Yeah, I was wondering whether we could change something there. > ADRP+LDR is logically more expensive than a single LDR, especially > when optimising for size, so I think it's reasonable for the

Re: [PATCH] AArch64: Improve address rematerialization costs

2022-05-09 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > I'm not questioning the results, but I think we need to look in more > detail why rematerialisation requires such low costs.  The point of > comparison should be against a spill and reload, so any constant > that is as cheap as a load should be rematerialised.  If that isn't >

[PATCH] AArch64: Cleanup CPU option processing code

2022-05-09 Thread Wilco Dijkstra via Gcc-patches
The --with-cpu/--with-arch configure option processing not only checks valid arguments but also sets TARGET_CPU_DEFAULT with a CPU and extension bitmask. This isn't used however since a --with-cpu is translated into a -mcpu option which is processed as if written on the command-line (so

[PATCH] AArch64: Improve address rematerialization costs

2022-05-09 Thread Wilco Dijkstra via Gcc-patches
Improve rematerialization costs of addresses. The current costs are set too high which results in extra register pressure and spilling. Using lower costs means addresses will be rematerialized more often rather than being spilled or causing spills. This results in significant codesize

Re: [AArch64] PR105162: emit barrier for __sync and __atomic builtins on CPUs without LSE

2022-05-03 Thread Wilco Dijkstra via Gcc-patches
Hi Sebastian, > Please find attached the patch amended following your recommendations. > The number of new functions for _sync is reduced by 3x. > I tested the patch on Graviton2 aarch64-linux. > I also checked by hand that the outline functions in libgcc look similar to > what GCC produces for

Re: [AArch64] PR105162: emit barrier for __sync and __atomic builtins on CPUs without LSE

2022-04-19 Thread Wilco Dijkstra via Gcc-patches
Hi Sebastian, > Wilco pointed out in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105162#c7 that "Only __sync needs the extra full barrier, but __atomic does not." > The attached patch does that by adding out-of-line functions for MEMMODEL_SYNC_*. > Those new functions contain a barrier

[PATCH] AArch64: Improve rotate patterns

2021-12-08 Thread Wilco Dijkstra via Gcc-patches
Improve and generalize rotate patterns. Rotates by more than half the bitwidth of a register are canonicalized to rotate left. Many existing shift patterns don't handle this case correctly, so add rotate left to the shift iterator and convert rotate left into ror during assembly output. Add

Re: [PATCH] AArch64: Improve address rematerialization costs

2021-11-24 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Can you fold in the rtx costs part of the original GOT relaxation patch? Sure, see below for the updated version. > I don't think there's enough information here for me to be able to review > the patch though.  I'll need to find testcases, look in detail at what > the rtl passes

[PATCH] AArch64: Fix PR103085

2021-11-05 Thread Wilco Dijkstra via Gcc-patches
The stack protector implementation hides symbols in a const unspec, which means movdi/movsi patterns must always support const on symbol operands and explicitly strip away the unspec. Do this for the recently added GOT alternatives. Add a test to ensure stack-protector tests GOT accesses as well.

Re: [PATCH] AArch64: Improve address rematerialization costs

2021-11-04 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 02 June 2021 11:21 To: GCC Patches Cc: Kyrylo Tkachov; Richard Sandiford Subject: [PATCH] AArch64: Improve address rematerialization costs Hi, Given the large improvements from better register allocation of GOT accesses, I decided to generalize it to get

[PATCH v2] AArch64: Cleanup CPU option processing code

2021-11-04 Thread Wilco Dijkstra via Gcc-patches
v2: rebased The --with-cpu/--with-arch configure option processing not only checks valid arguments but also sets TARGET_CPU_DEFAULT with a CPU and extension bitmask. This isn't used however since a --with-cpu is translated into a -mcpu option which is processed as if written on the

Re: [PATCH v3] AArch64: Improve GOT addressing

2021-11-02 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > - Why do we rewrite the constant moves after reload into ldr_got_small_sidi and ldr_got_small_?  Couldn't we just get the move patterns to output the sequence directly? That's possible too, however it makes the movsi/di patterns more complex. See version v4 below. > - I

Re: [PATCH v3] AArch64: Improve GOT addressing

2021-10-20 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 04 June 2021 14:44 To: Richard Sandiford Cc: Kyrylo Tkachov; GCC Patches Subject: [PATCH v3] AArch64: Improve GOT addressing Hi Richard, This merges the v1 and v2 patches and removes the spurious MEM from ldr_got_small_si/di. This has been rebased after

Re: [PATCH] AArch64: Improve address rematerialization costs

2021-10-20 Thread Wilco Dijkstra via Gcc-patches
ping From: Wilco Dijkstra Sent: 02 June 2021 11:21 To: GCC Patches Cc: Kyrylo Tkachov; Richard Sandiford Subject: [PATCH] AArch64: Improve address rematerialization costs Hi, Given the large improvements from better register allocation of GOT accesses, I decided to generalize it to get

Re: [PATCH] AArch64: Tune case-values-threshold

2021-10-19 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > The problem is that you're effectively asking for these values to be > taken on faith without providing any analysis and without describing > how you arrived at the new numbers.  Did you try other values too? > If so, how did they compare with the numbers that you finally chose? >

Re: [PATCH] AArch64: Tune case-values-threshold

2021-10-19 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > I'm just concerned that here we're using the same explanation but with > different numbers.  Why are the new numbers more right than the old ones > (especially when it comes to code size, where the trade-off hasn't > really changed)? Like all tuning/costing parameters, these values

[PATCH] AArch64: Enable fast shifts on Neoverse V1/N2

2021-10-18 Thread Wilco Dijkstra via Gcc-patches
Enable the fast shift feature in Neoverse V1 and N2 tunings as well. ChangeLog: 2021-10-18 Wilco Dijkstra * config/aarch64/aarch64.c (neoversev1_tunings): Enable AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND. (neoversen2_tunings): Likewise. --- diff --git

[PATCH] AArch64: Tune case-values-threshold

2021-10-18 Thread Wilco Dijkstra via Gcc-patches
Tune the case-values-threshold setting for modern cores. A value of 11 improves SPECINT2017 by 0.2% and reduces codesize by 0.04%. With -Os use value 8 which reduces codesize by 0.07%. Passes regress, OK for commit? ChangeLog: 2021-10-18 Wilco Dijkstra * config/aarch64/aarch64.c

Re: [PATCH] AArch64: Add support for __builtin_roundeven[f] [PR100966]

2021-06-22 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > So rather than have two patterns that generate frintn, I think > it would be better to change the existing frint_pattern entry to > "roundeven" instead, and fix whatever the fallout is. Hopefully it > shouldn't be too bad, since we already use the optab names for the > other

[PATCH] AArch64: Add support for __builtin_roundeven[f] [PR100966]

2021-06-18 Thread Wilco Dijkstra via Gcc-patches
Enable __builtin_roundeven[f] by adding roundeven as an alias to the existing frintn support. Bootstrap OK and passes regress. ChangeLog: 2021-06-18 Wilco Dijkstra PR target/100966 * config/aarch64/aarch64.md (UNSPEC_FRINTR): Add. * config/aarch64/aarch64.c

[PATCH v3] AArch64: Improve GOT addressing

2021-06-04 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, This merges the v1 and v2 patches and removes the spurious MEM from ldr_got_small_si/di. This has been rebased after [1], and the performance gain has now doubled. [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-June/571708.html Improve GOT addressing by treating the instructions

Re: [PATCH] AArch64: Improve address rematerialization costs

2021-06-02 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > No.  It's never correct to completely wipe out the existing cost - you > don't know the context where this is being used. > > The most you can do is not add any additional cost. Remember that aarch64_rtx_costs starts like this: /* By default, assume that everything has

[PATCH] AArch64: Improve address rematerialization costs

2021-06-02 Thread Wilco Dijkstra via Gcc-patches
Hi, Given the large improvements from better register allocation of GOT accesses, I decided to generalize it to get large gains for normal addressing too: Improve rematerialization costs of addresses. The current costs are set too high which results in extra register pressure and spilling.

Re: [PATCH v2] AArch64: Improve GOT addressing

2021-05-26 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Are we actually planning to do any linker relaxations here, or is this > purely theoretical?  If doing relaxations is a realistic possibility then > I agree that would be a good/legitimate reason to use a single define_insn > for both instructions.  In that case though, there should

[PATCH v2] AArch64: Improve GOT addressing

2021-05-24 Thread Wilco Dijkstra via Gcc-patches
Version v2 uses movsi/di for GOT accesses until after reload as suggested. This caused worse spilling; however, improving the costs of GOT accesses resulted in better codesize and performance gains: Improve GOT addressing by treating the instructions as a pair. This reduces register pressure and

Re: [PATCH] AArch64: Improve GOT addressing

2021-05-10 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Normally we should only put two instructions in the same define_insn > if there's a specific ABI or architectural reason for not separating > them.  Doing it purely for optimisation reasons is going against the > general direction of travel.  So I think the first question is: why >

[PATCH] AArch64: Improve GOT addressing

2021-05-05 Thread Wilco Dijkstra via Gcc-patches
Improve GOT addressing by emitting the instructions as a pair. This reduces register pressure and improves code quality. With -fPIC codesize improves by 0.65% and SPECINT2017 improves by 0.25%. Passes bootstrap and regress. OK for commit? ChangeLog: 2021-05-05 Wilco Dijkstra *

Re: [PATCH] AArch64: Cleanup aarch64_classify_symbol

2021-04-30 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Hmm, OK.  I guess it makes things more consistent in that sense > (PIC vs. non-PIC).  But on the other side it's making things less > internally consistent for non-PIC, since we don't use the GOT for > anything else there.  I guess in principle there's a danger that a > custom *-elf

Re: [PATCH] AArch64: Cleanup aarch64_classify_symbol

2021-04-28 Thread Wilco Dijkstra via Gcc-patches
Hi Andrew, > I thought that was changed not to use the GOT on purpose. > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63874 > > That is if the symbol is not declared in the TU, then using the GOT is > correct thing to do. > Is the testcase gcc.target/aarch64/pr63874.c still working or is not >

Re: [PATCH] AArch64: Cleanup aarch64_classify_symbol

2021-04-28 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > Just to check: I guess this part is an optimisation, because it > means that we can share the GOT entry with other TUs.  Is that right? > I think it would be worth having a comment either way, whatever the > rationale.  A couple of other very minor things: It's just to make the

[PATCH] AArch64: Cleanup aarch64_classify_symbol

2021-04-28 Thread Wilco Dijkstra via Gcc-patches
Use a GOT indirection for extern weak symbols instead of a literal - this is the same as PIC/PIE and mirrors LLVM behaviour. Ensure PIC/PIE use the same offset limits for symbols that don't use the GOT. Passes bootstrap and regress. OK for commit? ChangeLog: 2021-04-27 Wilco Dijkstra

[GCC8 backport] AArch64: Fix symbol offset limit (PR 98618)

2021-01-21 Thread Wilco Dijkstra via Gcc-patches
In aarch64_classify_symbol symbols are allowed large offsets on relocations. This means the offset can use all of the +/-4GB offset, leaving no offset available for the symbol itself. This results in relocation overflow and link-time errors for simple expressions like _array + 0xff00. To

[GCC9 backport] AArch64: Fix symbol offset limit (PR 98618)

2021-01-19 Thread Wilco Dijkstra via Gcc-patches
In aarch64_classify_symbol symbols are allowed large offsets on relocations. This means the offset can use all of the +/-4GB offset, leaving no offset available for the symbol itself. This results in relocation overflow and link-time errors for simple expressions like _array + 0xff00. To

Re: [AArch64] Add --with-tune configure flag

2020-12-10 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > I specifically want to test generic SVE rather than SVE tuned for a > specific core, so --with-arch=armv8.2-a+sve is the thing I want to test. Btw that's not actually what you get if you use cc1 - you always get armv8.0, so --with-arch doesn't work at all. The only case that

Re: [AArch64] Add --with-tune configure flag

2020-12-07 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, >>> I share Richard E's concern about the effect of this on people who run >>> ./cc1 directly.  (And I'm being selfish here, because I regularly run >>> ./cc1 directly on toolchains configured with --with-arch=armv8.2-a+sve.) >>> So TBH my preference would be to keep the

Re: [AArch64] Add --with-tune configure flag

2020-12-07 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, > I share Richard E's concern about the effect of this on people who run > ./cc1 directly.  (And I'm being selfish here, because I regularly run > ./cc1 directly on toolchains configured with --with-arch=armv8.2-a+sve.) > So TBH my preference would be to keep the

Re: [AArch64] Add --with-tune configure flag

2020-11-19 Thread Wilco Dijkstra via Gcc-patches
Hi, >>>    As for your second patch, --with-cpu-64 could be a simple alias indeed, >>>    but what is the exact definition/expected behaviour of a --with-cpu-32 >>>    on a target that only supports 64-bit code? The AArch64 target cannot >>>    generate AArch32 code, so we shouldn't silently

Re: [AArch64] Add --with-tune configure flag

2020-11-18 Thread Wilco Dijkstra via Gcc-patches
Hi Sebastian, I presume you're trying to unify the --with- options across most targets? That would be very useful! However there are significant differences between targets in how they interpret options like --with-arch=native (or -march). So those differences also need to be looked at and fixed

[PATCH] AArch64: Add cost table for Cortex-A76

2020-11-18 Thread Wilco Dijkstra via Gcc-patches
Add an initial cost table for Cortex-A76 - this is copied from cortexa57_extra_costs but updates it based on the Optimization Guide. Use the new cost table on all Neoverse tunings and ensure the tunings are consistent for all. As a result more compact code is generated with more combined shift+alu

Re: [PATCH] AArch64: Improve inline memcpy expansion

2020-11-16 Thread Wilco Dijkstra via Gcc-patches
Hi Richard, >> +  if (size <= 24 || !TARGET_SIMD > > Nit: one condition per line when the condition spans multiple lines. Fixed. >> +  || (size <= (max_copy_size / 2) >> +  && (aarch64_tune_params.extra_tuning_flags >> +  & AARCH64_EXTRA_TUNE_NO_LDP_STP_QREGS))) >> +    copy_bits =

[PATCH] AArch64: Improve inline memcpy expansion

2020-11-05 Thread Wilco Dijkstra via Gcc-patches
Improve the inline memcpy expansion. Use integer load/store for copies <= 24 bytes instead of SIMD. Set the maximum copy to expand to 256 by default, except that -Os or no Neon expands up to 128 bytes. When using LDP/STP of Q-registers, also use Q-register accesses for the unaligned tail,

Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c

2020-10-08 Thread Wilco Dijkstra via Gcc-patches
Hi Jakub, > On Thu, Oct 08, 2020 at 11:37:24AM +0000, Wilco Dijkstra via Gcc-patches > wrote: >> Which optimizations does it enable that aren't possible if the value is >> defined? > > See bugzilla.  Note other compilers heavily optimize on those builtins > undefin

Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c

2020-10-08 Thread Wilco Dijkstra via Gcc-patches
Hi Jakub,  > Having it undefined allows optimizations, and has been that way for years. Which optimizations does it enable that aren't possible if the value is defined? > We just should make sure that we optimize code like x ? __builtin_c[lt]z (x) > : 32; > etc. properly (and I believe we do).

Re: [PATCH] PR target/97312: Tweak gcc.target/aarch64/pr90838.c

2020-10-08 Thread Wilco Dijkstra via Gcc-patches
Btw for PowerPC it is 0..32: https://www.ibm.com/support/knowledgecenter/ssw_aix_72/assembler/idalangref_cntlzw_instrs.html Wilco
