On Fri, 11 Mar 2022 12:18:36 GMT, Florian Weimer <fwei...@openjdk.org> wrote:
> > According to
> > https://forums.swift.org/t/concurrencys-use-of-thread-local-variables/48654:
> > "these accesses are just a move from a system register plus a load/store
> > at a constant offset."
>
> Ideally you'd still benchmark that. Some AArch64 implementations have really,
> really slow moves from the system register used as the thread pointer.
> Hopefully Apple's implementation isn't in that category.

In a tight loop, loads from __thread variables take 1ns. It's this:

    0x18ea1c530 <+0>:  ldr x16, [x0, #0x8]
    0x18ea1c534 <+4>:  mrs x17, TPIDRRO_EL0
    0x18ea1c538 <+8>:  and x17, x17, #0xfffffffffffffff8
    0x18ea1c53c <+12>: ldr x17, [x17, x16, lsl #3]
    0x18ea1c540 <+16>: cbz x17, 0x18ea1c550 ; only executed first time
    0x18ea1c544 <+20>: ldr x16, [x0, #0x10]
    0x18ea1c548 <+24>: add x0, x17, x16
    0x18ea1c54c <+28>: ret

... which looks the same as what glibc does. Not bad, but quite a lot more to do
than a simple load.

I'd still use a static SafeFetch, with no W^X fiddling. It just seems to me much
more reasonable.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7727
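
For anyone who wants to reproduce a tight-loop figure like the one quoted above, a minimal sketch of such a measurement might look like this. It is not the code behind the 1ns number: the names tls_value, load_tls and ns_per_load are hypothetical, and it assumes GCC or Clang (for __thread and the noinline attribute) on a POSIX system that provides clock_gettime.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std= modes */

#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical thread-local variable. It is volatile so that every call
 * to load_tls() performs a real load. */
static __thread volatile uint64_t tls_value = 42;

/* Kept out of line (GCC/Clang attribute) so the thread-local address
 * computation is repeated on every call instead of being hoisted out of
 * the timing loop. */
__attribute__((noinline))
static uint64_t load_tls(void)
{
    return tls_value;
}

/* Returns the average wall-clock time in nanoseconds per call to
 * load_tls(). The sum of the loaded values is written to *sum_out so the
 * caller can check that the loads really happened. */
static double ns_per_load(uint64_t iterations, uint64_t *sum_out)
{
    struct timespec start, end;
    uint64_t sum = 0;

    assert(iterations > 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iterations; i++) {
        sum += load_tls();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    *sum_out = sum;

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling ns_per_load() from main with a large iteration count (say 100 million) and printing the result gives a rough per-load cost. The figure includes the call and loop overhead, so it is an upper bound on the thread-local access itself, and the instruction sequence the compiler emits depends on the platform and the TLS model in use.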