Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Sat, Apr 25, 2020 at 01:40:24PM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 24, 2020 3:42 am:
> > On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 23/04/2020 13:43, Rich Felker wrote:
> >> > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote:
> >> >>
> >> >>
> >> >> On 23/04/2020 13:18, Rich Felker wrote:
> >> >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
> >> >>>>
> >> >>>>
> >> >>>> On 22/04/2020 23:36, Rich Felker wrote:
> >> >>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
> >> >>>>>> Yeah I had a bit of a play around with musl (which is very nice code I
> >> >>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.
> >> >>>>>> Fortunately adding it doesn't change code generation for me, but it
> >> >>>>>> should be fixed. glibc had the same bug at one point I think (probably
> >> >>>>>> due to syscall ABI documentation not existing -- something now lives in
> >> >>>>>> linux/Documentation/powerpc/syscall64-abi.rst).
> >> >>>>>
> >> >>>>> Do you know anywhere I can read about the ctr issue, possibly the
> >> >>>>> relevant glibc bug report? I'm not particularly familiar with ppc
> >> >>>>> register file (at least I have to refamiliarize myself every time I
> >> >>>>> work on this stuff) so it'd be nice to understand what's
> >> >>>>> potentially-wrong now.
> >> >>>>
> >> >>>> My understanding is the ctr issue only happens for vDSO calls where it
> >> >>>> fallback to a syscall in case an error (invalid argument, etc. and
> >> >>>> assuming if vDSO does not fallback to a syscall it always succeed).
> >> >>>> This makes the vDSO call on powerpc to have same same ABI constraint
> >> >>>> as a syscall, where it clobbers CR0.
> >> >>>
> >> >>> I think you mean "vsyscall", the old thing glibc used where there are
> >> >>> in-userspace implementations of some syscalls with call interfaces
> >> >>> roughly equivalent to a syscall. musl has never used this.
> >> >>> It only uses the actual exported functions from the vdso which have normal
> >> >>> external function call ABI.
> >> >>
> >> >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
> >> >> The issue is indeed when calling the powerpc provided functions in
> >> >> vDSO, which musl might want to do eventually.
> >> >
> >> > AIUI (at least this is true for all other archs) the functions have
> >> > normal external function call ABI and calling them has nothing to do
> >> > with syscall mechanisms.
> >>
> >> My point is powerpc specifically does not follow it, since it issues a
> >> syscall in fallback and its semantic follow kernel syscalls (error
> >> signalled in cr0, r3 being always a positive value):
> >
> > Oh, then I think we'll just ignore these unless the kernel can make
> > ones with a reasonable ABI. It's not worth having ppc-specific code
> > for this... It would be really nice if ones that actually behave like
> > functions could be added though.
>
> Yeah this is an annoyance for me after making the scv ABI return -ve in
> r3 for error and other things that more closely follow function calls,
> we still have the vdso functions using the old style.
>
> Maybe we should add function call style vdso too.

Please do.

Rich
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Rich Felker's message of April 24, 2020 3:42 am:
> On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 23/04/2020 13:43, Rich Felker wrote:
>> > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote:
>> >>
>> >>
>> >> On 23/04/2020 13:18, Rich Felker wrote:
>> >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
>> >>>>
>> >>>>
>> >>>> On 22/04/2020 23:36, Rich Felker wrote:
>> >>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>> >>>>>> Yeah I had a bit of a play around with musl (which is very nice code I
>> >>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.
>> >>>>>> Fortunately adding it doesn't change code generation for me, but it
>> >>>>>> should be fixed. glibc had the same bug at one point I think (probably
>> >>>>>> due to syscall ABI documentation not existing -- something now lives in
>> >>>>>> linux/Documentation/powerpc/syscall64-abi.rst).
>> >>>>>
>> >>>>> Do you know anywhere I can read about the ctr issue, possibly the
>> >>>>> relevant glibc bug report? I'm not particularly familiar with ppc
>> >>>>> register file (at least I have to refamiliarize myself every time I
>> >>>>> work on this stuff) so it'd be nice to understand what's
>> >>>>> potentially-wrong now.
>> >>>>
>> >>>> My understanding is the ctr issue only happens for vDSO calls where it
>> >>>> fallback to a syscall in case an error (invalid argument, etc. and
>> >>>> assuming if vDSO does not fallback to a syscall it always succeed).
>> >>>> This makes the vDSO call on powerpc to have same same ABI constraint
>> >>>> as a syscall, where it clobbers CR0.
>> >>>
>> >>> I think you mean "vsyscall", the old thing glibc used where there are
>> >>> in-userspace implementations of some syscalls with call interfaces
>> >>> roughly equivalent to a syscall. musl has never used this. It only
>> >>> uses the actual exported functions from the vdso which have normal
>> >>> external function call ABI.
>> >>
>> >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
>> >> The issue is indeed when calling the powerpc provided functions in
>> >> vDSO, which musl might want to do eventually.
>> >
>> > AIUI (at least this is true for all other archs) the functions have
>> > normal external function call ABI and calling them has nothing to do
>> > with syscall mechanisms.
>>
>> My point is powerpc specifically does not follow it, since it issues a
>> syscall in fallback and its semantic follow kernel syscalls (error
>> signalled in cr0, r3 being always a positive value):
>
> Oh, then I think we'll just ignore these unless the kernel can make
> ones with a reasonable ABI. It's not worth having ppc-specific code
> for this... It would be really nice if ones that actually behave like
> functions could be added though.

Yeah this is an annoyance for me after making the scv ABI return -ve in
r3 for error and other things that more closely follow function calls,
we still have the vdso functions using the old style.

Maybe we should add function call style vdso too.

Thanks,
Nick
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Rich Felker's message of April 23, 2020 12:36 pm:
> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>> Yeah I had a bit of a play around with musl (which is very nice code I
>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.
>> Fortunately adding it doesn't change code generation for me, but it
>> should be fixed. glibc had the same bug at one point I think (probably
>> due to syscall ABI documentation not existing -- something now lives in
>> linux/Documentation/powerpc/syscall64-abi.rst).
>
> Do you know anywhere I can read about the ctr issue, possibly the
> relevant glibc bug report? I'm not particularly familiar with ppc
> register file (at least I have to refamiliarize myself every time I
> work on this stuff) so it'd be nice to understand what's
> potentially-wrong now.

Ah I was misremembering, glibc was (and still is) actually missing cr
clobbers from its "vsyscall", probably because it copied syscall which
only clobbers cr0, but vsyscall clobbers cr0-1,5-7 like a normal
function call.

musl is missing the ctr register clobber from syscalls.

powerpc has gpr0-31 GPRs, cr0-7 condition regs, and lr and ctr branch
registers (lr is generally used for function returns, ctr for other
indirect branches). ctr is volatile (caller saved) across C function
calls, and sc system calls on Linux.

Thanks,
Nick
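As a rough illustration of the point about ctr being volatile across sc, here is a minimal sketch of a zero-argument syscall wrapper whose inline asm lists "ctr" among its clobbers. The helper name raw_syscall0 is hypothetical and the code is not copied from musl or glibc. The clobber list only reflects the registers named in this thread and is not claimed to be complete. On non-powerpc64 hosts the sketch defers to libc's syscall() so that it still builds.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* getpid, syscall */

/* Hypothetical helper: issue a zero-argument system call. */
static inline long raw_syscall0(long n)
{
#if defined(__powerpc64__)
	register long r0 __asm__("r0") = n;
	register long r3 __asm__("r3");
	__asm__ __volatile__(
		"sc\n\t"
		"bns+ 1f\n\t"     /* cr0.SO clear: success, keep r3 as is   */
		"neg %1, %1\n"    /* cr0.SO set: r3 is an errno, negate it  */
		"1:"
		: "+r"(r0), "=r"(r3)
		:
		: "memory", "cr0",
		  "ctr",          /* volatile across sc, per the message above */
		  "r4", "r5", "r6", "r7", "r8",
		  "r9", "r10", "r11", "r12");
	return r3;
#else
	/* Assumption for portability of the sketch only: use libc's wrapper
	 * (which reports errors via errno rather than a negative return). */
	return syscall(n);
#endif
}

/* Usage: getpid cannot fail, so both paths must agree with libc. */
void raw_syscall0_example(void)
{
	assert(raw_syscall0(SYS_getpid) == (long)getpid());
}
```

Without the "ctr" entry the compiler is free to keep a live value in ctr across the asm statement, which is the latent bug being described.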
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote:
>
>
> On 23/04/2020 13:43, Rich Felker wrote:
> > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 23/04/2020 13:18, Rich Felker wrote:
> >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
> >>>>
> >>>>
> >>>> On 22/04/2020 23:36, Rich Felker wrote:
> >>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
> >>>>>> Yeah I had a bit of a play around with musl (which is very nice code I
> >>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.
> >>>>>> Fortunately adding it doesn't change code generation for me, but it
> >>>>>> should be fixed. glibc had the same bug at one point I think (probably
> >>>>>> due to syscall ABI documentation not existing -- something now lives in
> >>>>>> linux/Documentation/powerpc/syscall64-abi.rst).
> >>>>>
> >>>>> Do you know anywhere I can read about the ctr issue, possibly the
> >>>>> relevant glibc bug report? I'm not particularly familiar with ppc
> >>>>> register file (at least I have to refamiliarize myself every time I
> >>>>> work on this stuff) so it'd be nice to understand what's
> >>>>> potentially-wrong now.
> >>>>
> >>>> My understanding is the ctr issue only happens for vDSO calls where it
> >>>> fallback to a syscall in case an error (invalid argument, etc. and
> >>>> assuming if vDSO does not fallback to a syscall it always succeed).
> >>>> This makes the vDSO call on powerpc to have same same ABI constraint
> >>>> as a syscall, where it clobbers CR0.
> >>>
> >>> I think you mean "vsyscall", the old thing glibc used where there are
> >>> in-userspace implementations of some syscalls with call interfaces
> >>> roughly equivalent to a syscall. musl has never used this. It only
> >>> uses the actual exported functions from the vdso which have normal
> >>> external function call ABI.
> >>
> >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
> >> The issue is indeed when calling the powerpc provided functions in
> >> vDSO, which musl might want to do eventually.
> >
> > AIUI (at least this is true for all other archs) the functions have
> > normal external function call ABI and calling them has nothing to do
> > with syscall mechanisms.
>
> My point is powerpc specifically does not follow it, since it issues a
> syscall in fallback and its semantic follow kernel syscalls (error
> signalled in cr0, r3 being always a positive value):

Oh, then I think we'll just ignore these unless the kernel can make
ones with a reasonable ABI. It's not worth having ppc-specific code
for this... It would be really nice if ones that actually behave like
functions could be added though.

> --
> V_FUNCTION_BEGIN(__kernel_clock_gettime)
>   .cfi_startproc
>         [...]
>         /*
>          * syscall fallback
>          */
> 99:
>         li      r0,__NR_clock_gettime
>   .cfi_restore lr
>         sc
>         blr
>   .cfi_endproc
> V_FUNCTION_END(__kernel_clock_gettime)
>
>
> >
> > It looks like we're not using them right now and I'm not sure why. It
> > could be that there are ABI mismatch issues (are 32-bit ones
> > compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or
> > just that nobody proposed adding them. Also as of 5.4 32-bit ppc
> > lacked time64 versions of them; not sure if this is fixed yet.
>
> For 64-bit it also have an issue where vDSO does not provide an OPD
> for ELFv1, which has bitten glibc while trying to implement an ifunc
> optimization. I don't recall any issue for ELFv2.
>
> For 32-bit I am not sure secure-plt will change anything, at least not
> on powerpc where we use the same strategy for 64-bit and use a
> mtctr/bctr directly.

Indeed, I don't think there's a secure-plt distinction unless you're
making outgoing calls to possibly-cross-DSO functions.

Rich
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On 23/04/2020 13:43, Rich Felker wrote:
> On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 23/04/2020 13:18, Rich Felker wrote:
>>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
>>>>
>>>>
>>>> On 22/04/2020 23:36, Rich Felker wrote:
>>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>>>>>> Yeah I had a bit of a play around with musl (which is very nice code I
>>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.
>>>>>> Fortunately adding it doesn't change code generation for me, but it
>>>>>> should be fixed. glibc had the same bug at one point I think (probably
>>>>>> due to syscall ABI documentation not existing -- something now lives in
>>>>>> linux/Documentation/powerpc/syscall64-abi.rst).
>>>>>
>>>>> Do you know anywhere I can read about the ctr issue, possibly the
>>>>> relevant glibc bug report? I'm not particularly familiar with ppc
>>>>> register file (at least I have to refamiliarize myself every time I
>>>>> work on this stuff) so it'd be nice to understand what's
>>>>> potentially-wrong now.
>>>>
>>>> My understanding is the ctr issue only happens for vDSO calls where it
>>>> fallback to a syscall in case an error (invalid argument, etc. and
>>>> assuming if vDSO does not fallback to a syscall it always succeed).
>>>> This makes the vDSO call on powerpc to have same same ABI constraint
>>>> as a syscall, where it clobbers CR0.
>>>
>>> I think you mean "vsyscall", the old thing glibc used where there are
>>> in-userspace implementations of some syscalls with call interfaces
>>> roughly equivalent to a syscall. musl has never used this. It only
>>> uses the actual exported functions from the vdso which have normal
>>> external function call ABI.
>>
>> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
>> The issue is indeed when calling the powerpc provided functions in
>> vDSO, which musl might want to do eventually.
>
> AIUI (at least this is true for all other archs) the functions have
> normal external function call ABI and calling them has nothing to do
> with syscall mechanisms.

My point is powerpc specifically does not follow it, since it issues a
syscall in fallback and its semantic follow kernel syscalls (error
signalled in cr0, r3 being always a positive value):

--
V_FUNCTION_BEGIN(__kernel_clock_gettime)
  .cfi_startproc
        [...]
        /*
         * syscall fallback
         */
99:
        li      r0,__NR_clock_gettime
  .cfi_restore lr
        sc
        blr
  .cfi_endproc
V_FUNCTION_END(__kernel_clock_gettime)

>
> It looks like we're not using them right now and I'm not sure why. It
> could be that there are ABI mismatch issues (are 32-bit ones
> compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or
> just that nobody proposed adding them. Also as of 5.4 32-bit ppc
> lacked time64 versions of them; not sure if this is fixed yet.

For 64-bit it also have an issue where vDSO does not provide an OPD
for ELFv1, which has bitten glibc while trying to implement an ifunc
optimization. I don't recall any issue for ELFv2.

For 32-bit I am not sure secure-plt will change anything, at least not
on powerpc where we use the same strategy for 64-bit and use a
mtctr/bctr directly.
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote:
>
>
> On 23/04/2020 13:18, Rich Felker wrote:
> > On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 22/04/2020 23:36, Rich Felker wrote:
> >>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
> >>>> Yeah I had a bit of a play around with musl (which is very nice code I
> >>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.
> >>>> Fortunately adding it doesn't change code generation for me, but it
> >>>> should be fixed. glibc had the same bug at one point I think (probably
> >>>> due to syscall ABI documentation not existing -- something now lives in
> >>>> linux/Documentation/powerpc/syscall64-abi.rst).
> >>>
> >>> Do you know anywhere I can read about the ctr issue, possibly the
> >>> relevant glibc bug report? I'm not particularly familiar with ppc
> >>> register file (at least I have to refamiliarize myself every time I
> >>> work on this stuff) so it'd be nice to understand what's
> >>> potentially-wrong now.
> >>
> >> My understanding is the ctr issue only happens for vDSO calls where it
> >> fallback to a syscall in case an error (invalid argument, etc. and
> >> assuming if vDSO does not fallback to a syscall it always succeed).
> >> This makes the vDSO call on powerpc to have same same ABI constraint
> >> as a syscall, where it clobbers CR0.
> >
> > I think you mean "vsyscall", the old thing glibc used where there are
> > in-userspace implementations of some syscalls with call interfaces
> > roughly equivalent to a syscall. musl has never used this. It only
> > uses the actual exported functions from the vdso which have normal
> > external function call ABI.
>
> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
> The issue is indeed when calling the powerpc provided functions in
> vDSO, which musl might want to do eventually.

AIUI (at least this is true for all other archs) the functions have
normal external function call ABI and calling them has nothing to do
with syscall mechanisms.

It looks like we're not using them right now and I'm not sure why. It
could be that there are ABI mismatch issues (are 32-bit ones
compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or
just that nobody proposed adding them. Also as of 5.4 32-bit ppc
lacked time64 versions of them; not sure if this is fixed yet.

Rich
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On 23/04/2020 13:18, Rich Felker wrote:
> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 22/04/2020 23:36, Rich Felker wrote:
>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>>>> Yeah I had a bit of a play around with musl (which is very nice code I
>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.
>>>> Fortunately adding it doesn't change code generation for me, but it
>>>> should be fixed. glibc had the same bug at one point I think (probably
>>>> due to syscall ABI documentation not existing -- something now lives in
>>>> linux/Documentation/powerpc/syscall64-abi.rst).
>>>
>>> Do you know anywhere I can read about the ctr issue, possibly the
>>> relevant glibc bug report? I'm not particularly familiar with ppc
>>> register file (at least I have to refamiliarize myself every time I
>>> work on this stuff) so it'd be nice to understand what's
>>> potentially-wrong now.
>>
>> My understanding is the ctr issue only happens for vDSO calls where it
>> fallback to a syscall in case an error (invalid argument, etc. and
>> assuming if vDSO does not fallback to a syscall it always succeed).
>> This makes the vDSO call on powerpc to have same same ABI constraint
>> as a syscall, where it clobbers CR0.
>
> I think you mean "vsyscall", the old thing glibc used where there are
> in-userspace implementations of some syscalls with call interfaces
> roughly equivalent to a syscall. musl has never used this. It only
> uses the actual exported functions from the vdso which have normal
> external function call ABI.

I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
The issue is indeed when calling the powerpc provided functions in
vDSO, which musl might want to do eventually.
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
>
>
> On 22/04/2020 23:36, Rich Felker wrote:
> > On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
> >> Yeah I had a bit of a play around with musl (which is very nice code I
> >> must say). The powerpc64 syscall asm is missing ctr clobber by the way.
> >> Fortunately adding it doesn't change code generation for me, but it
> >> should be fixed. glibc had the same bug at one point I think (probably
> >> due to syscall ABI documentation not existing -- something now lives in
> >> linux/Documentation/powerpc/syscall64-abi.rst).
> >
> > Do you know anywhere I can read about the ctr issue, possibly the
> > relevant glibc bug report? I'm not particularly familiar with ppc
> > register file (at least I have to refamiliarize myself every time I
> > work on this stuff) so it'd be nice to understand what's
> > potentially-wrong now.
>
> My understanding is the ctr issue only happens for vDSO calls where it
> fallback to a syscall in case an error (invalid argument, etc. and
> assuming if vDSO does not fallback to a syscall it always succeed).
> This makes the vDSO call on powerpc to have same same ABI constraint
> as a syscall, where it clobbers CR0.

I think you mean "vsyscall", the old thing glibc used where there are
in-userspace implementations of some syscalls with call interfaces
roughly equivalent to a syscall. musl has never used this. It only
uses the actual exported functions from the vdso which have normal
external function call ABI.

Rich
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On 22/04/2020 23:36, Rich Felker wrote:
> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>> Yeah I had a bit of a play around with musl (which is very nice code I
>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.
>> Fortunately adding it doesn't change code generation for me, but it
>> should be fixed. glibc had the same bug at one point I think (probably
>> due to syscall ABI documentation not existing -- something now lives in
>> linux/Documentation/powerpc/syscall64-abi.rst).
>
> Do you know anywhere I can read about the ctr issue, possibly the
> relevant glibc bug report? I'm not particularly familiar with ppc
> register file (at least I have to refamiliarize myself every time I
> work on this stuff) so it'd be nice to understand what's
> potentially-wrong now.

My understanding is that the ctr issue only happens for vDSO calls that
fall back to a syscall in case of an error (invalid argument, etc., and
assuming that a vDSO call which does not fall back to a syscall always
succeeds). This gives the vDSO call on powerpc the same ABI constraint
as a syscall, where it clobbers CR0.

On glibc we handle this by simulating a function call and analysing the
CR0 result:

  __asm__ __volatile__
    ("mtctr %0\n\t"
     "bctrl\n\t"
     "mfcr %0\n\t"
     "0:"
     : "+r" (r0), "+r" (r3), "+r" (r4), "+r" (r5), "+r" (r6), "+r" (r7),
       "+r" (r8)
     :
     : "r9", "r10", "r11", "r12", "cr0", "ctr", "lr", "memory");
  __asm__ __volatile__ ("" : "=r" (rval) : "r" (r3));

On musl you don't have this issue because it does not enable vDSO
support on powerpc. And if it eventually does so with the VDSO_* macros,
the only issue I see is when the vDSO falls back to the syscall and it
also fails (the return code won't be negated since on musl it uses a
default C function pointer issue which does not model the CR0 kernel
abi).

So I think the extra ctr constraint on glibc powerpc syscall code is not
really required. I think I have some patches to optimize this a bit
based on previous discussions.
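The mismatch described in this message (the error flag lives in cr0 while r3 stays non-negative) can be modelled in plain C. The following is only a sketch of the convention as stated in the thread. The struct and function names are hypothetical, and no real vDSO entry point is called.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of what an 'sc'-style call leaves behind:
 * r3 is non-negative, and cr0.SO says whether r3 is an errno value. */
struct sc_result {
	long r3;
	int  cr0_so;   /* 1 if the summary-overflow bit is set (error) */
};

/* Fold the pair into the "negative errno" convention. This is the step
 * a caller must perform itself when the callee follows sc semantics. */
long sc_to_negative_errno(struct sc_result res)
{
	return res.cr0_so ? -res.r3 : res.r3;
}

/* What a caller observes if it treats the entry point as an ordinary
 * C function: cr0 is not part of the C return value, so only r3 is seen. */
long seen_through_function_pointer(struct sc_result res)
{
	return res.r3;
}

void sc_convention_example(void)
{
	struct sc_result failed = { EINVAL, 1 };

	/* Correct handling yields -EINVAL ... */
	assert(sc_to_negative_errno(failed) == -EINVAL);
	/* ... while a plain function call sees a positive "success" value. */
	assert(seen_through_function_pointer(failed) == EINVAL);
}
```

This is why a function-pointer call into such a vDSO routine cannot distinguish a failed fallback syscall from a successful call that happens to return the same small positive number.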
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
> Yeah I had a bit of a play around with musl (which is very nice code I
> must say). The powerpc64 syscall asm is missing ctr clobber by the way.
> Fortunately adding it doesn't change code generation for me, but it
> should be fixed. glibc had the same bug at one point I think (probably
> due to syscall ABI documentation not existing -- something now lives in
> linux/Documentation/powerpc/syscall64-abi.rst).

Do you know anywhere I can read about the ctr issue, possibly the
relevant glibc bug report? I'm not particularly familiar with ppc
register file (at least I have to refamiliarize myself every time I
work on this stuff) so it'd be nice to understand what's
potentially-wrong now.

Rich
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Nicholas Piggin's message of April 22, 2020 4:18 pm:
> If we go further and try to preserve r3 as well by putting the return
> value in r9 or r0, we go backwards about 300 bytes. It's good for the
> lock loops and complex functions, but hurts a lot of simpler functions
> that have to add 'mr r3,r9' etc.
>
> Most of the time there are saved non-volatile GPRs around anyway though,
> so not sure which way to go on this. Text size savings can't be ignored
> and it's pretty easy for the kernel to do (we already save r3-r8 and
> zero them on exit, so we could load them instead from cache line that's
> should be hot).
>
> So I may be inclined to go this way, even if we won't see benefit now.

By "this way" I don't mean r9 or r0 return value (which is larger
code), but r3 return value with r0,r4-r8 preserved.

Thanks,
Nick
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Rich Felker's message of April 21, 2020 3:27 am:
> On Mon, Apr 20, 2020 at 02:31:58PM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
>> > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
>> >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> >> >> > Note that because lr is clobbered we need at least once normally
>> >> >> > call-clobbered register that's not syscall clobbered to save lr in.
>> >> >> > Otherwise stack frame setup is required to spill it.
>> >> >>
>> >> >> The kernel would like to use r9-r12 for itself. We could do with fewer
>> >> >> registers, but we have some delay establishing the stack (depends on a
>> >> >> load which depends on a mfspr), and entry code tends to be quite store
>> >> >> heavy whereas on the caller side you have r1 set up (modulo stack
>> >> >> updates), and the system call is a long delay during which time the
>> >> >> store queue has significant time to drain.
>> >> >>
>> >> >> My feeling is it would be better for kernel to have these scratch
>> >> >> registers.
>> >> >
>> >> > If your new kernel syscall mechanism requires the caller to make a
>> >> > whole stack frame it otherwise doesn't need and spill registers to it,
>> >> > it becomes a lot less attractive. Some of those 90 cycles saved are
>> >> > immediately lost on the userspace side, plus you either waste icache
>> >> > at the call point or require the syscall to go through a
>> >> > userspace-side helper function that performs the spill and restore.
>> >>
>> >> You would be surprised how few cycles that takes on a high end CPU. Some
>> >> might be a couple of %. I am one for counting cycles mind you, I'm not
>> >> being flippant about it. If we can come up with something faster I'd be
>> >> up for it.
>> >
>> > If the cycle count is trivial then just do it on the kernel side.
>>
>> The cycle count for user is, because you have r1 ready. Kernel does not
>> have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to
>> save into.
>>
>> Which is also wasted work for a userspace.
>>
>> Now that I think about it, no stack frame is even required! lr is saved
>> into the caller's stack when its clobbered with an asm, just as when
>> it's used for a function call.
>
> No. If there is a non-clobbered register, lr can be moved to the
> non-clobbered register rather than saved to the stack. However it
> looks like (1) gcc doesn't take advantage of that possibility, but (2)
> the caller already arranged for there to be space on the stack to save
> lr, so the cost is only one store and one load, not any stack
> adjustment or other frame setup. So it's probably not a really big
> deal. However, just adding "lr" clobber to existing syscall in musl
> increased the size of a simple syscall function (getuid) from 20 bytes
> to 36 bytes.

Yeah I had a bit of a play around with musl (which is very nice code I
must say). The powerpc64 syscall asm is missing ctr clobber by the way.
Fortunately adding it doesn't change code generation for me, but it
should be fixed. glibc had the same bug at one point I think (probably
due to syscall ABI documentation not existing -- something now lives in
linux/Documentation/powerpc/syscall64-abi.rst).

Yes lr needs to be saved, I didn't see any new requirement for stack
frames, and it was often already saved, but it does hurt the small
wrapper functions.

I did look at entirely replacing sc with scv though, just as an
experiment. One day you might make sc optional! Text size improves by
about 3kB with the proposed ABI. Mostly seems to be the bns+ ; neg
sequence. __syscall1/2/3 get out-of-lined by the compiler in a lot of
cases.

Linux's bloat-o-meter says:

add/remove: 0/5 grow/shrink: 24/260 up/down: 220/-3428 (-3208)
Function                                     old     new   delta
fcntl                                        400     424     +24
popen                                        600     620     +20
times                                         32      40      +8
[...]
alloc_rev                                    816     784     -32
alloc_fwd                                    812     780     -32
__syscall1.constprop                          32       -     -32
__fdopen                                     504     472     -32
__expand_heap                                628     592     -36
__syscall2                                    40       -     -40
__syscall3                                    44       -     -44
fchmodat                                     372     324     -48
__wake.constprop                             408     360     -48
child                                       1116    1064     -52
checker                                      220     156     -64
__bin_chunk
RE: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
From: Nicholas Piggin
> Sent: 20 April 2020 02:10
...
> >> Yes, but does it really matter to optimize this specific usage case
> >> for size? glibc, for instance, tries to leverage the syscall mechanism
> >> by adding some complex pre-processor asm directives. It optimizes
> >> the syscall code size in most cases. For instance, kill in static case
> >> generates on x86_64:
> >>
> >> 0000000000000000 <__kill>:
> >>    0:   b8 3e 00 00 00          mov    $0x3e,%eax
> >>    5:   0f 05                   syscall
> >>    7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
> >>    d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>

Hmmm... that cmp + jae is unnecessary here.
It is also a 32bit offset jump.
I also suspect it gets predicted very badly.

> >>   13:   c3                      retq
> >>
> >> While on musl:
> >>
> >> 0000000000000000 <kill>:
> >>    0:   48 83 ec 08             sub    $0x8,%rsp
> >>    4:   48 63 ff                movslq %edi,%rdi
> >>    7:   48 63 f6                movslq %esi,%rsi
> >>    a:   b8 3e 00 00 00          mov    $0x3e,%eax
> >>    f:   0f 05                   syscall
> >>   11:   48 89 c7                mov    %rax,%rdi
> >>   14:   e8 00 00 00 00          callq  19 <kill+0x19>
> >>   19:   5a                      pop    %rdx
> >>   1a:   c3                      retq
> >
> > Wow that's some extraordinarily bad codegen going on by gcc... The
> > sign-extension is semantically needed and I don't see a good way
> > around it (glibc's asm is kinda a hack taking advantage of kernel not
> > looking at high bits, I think), but the gratuitous stack adjustment
> > and refusal to generate a tail call isn't. I'll see if we can track
> > down what's going on and get it fixed.

A suitable cast might get rid of the sign extension.
Possibly just (unsigned int).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
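To make the sign-extension point concrete, the sketch below contrasts the two ways of widening a 32-bit int argument to a 64-bit register value: what movslq does, and what an (unsigned int) cast would do instead. It assumes a 32-bit int. The function names are hypothetical, and nothing here is taken from musl or glibc.

```c
#include <assert.h>
#include <stdint.h>

/* What movslq does: widen a 32-bit int while preserving its sign. */
int64_t widen_signed(int x)
{
	return (int64_t)x;
}

/* What an (unsigned int) cast would do: zero-extend, so no movslq
 * is needed, but the upper 32 bits are always zero. */
int64_t widen_unsigned(int x)
{
	return (int64_t)(unsigned int)x;
}

void widen_example(void)
{
	/* kill(-1, sig) is a meaningful call, so negative pids matter. */
	assert(widen_signed(-1) == -1);
	assert(widen_unsigned(-1) == INT64_C(0xffffffff));

	/* The low 32 bits agree; only the high half differs. */
	assert((uint32_t)widen_signed(-1) == (uint32_t)widen_unsigned(-1));
}
```

The two results differ only in the upper half of the register, so the cast is safe only if the callee is known to ignore those bits, which is the reliance on kernel behaviour that the quoted text calls "kinda a hack".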
RE: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
From: Adhemerval Zanella
> Sent: 21 April 2020 16:01
>
> On 21/04/2020 11:39, Rich Felker wrote:
> > On Tue, Apr 21, 2020 at 12:28:25PM +0000, David Laight wrote:
> >> From: Nicholas Piggin
> >>> Sent: 20 April 2020 02:10
> >> ...
> >>>>> Yes, but does it really matter to optimize this specific usage case
> >>>>> for size? glibc, for instance, tries to leverage the syscall mechanism
> >>>>> by adding some complex pre-processor asm directives. It optimizes
> >>>>> the syscall code size in most cases. For instance, kill in static case
> >>>>> generates on x86_64:
> >>>>>
> >>>>> 0000000000000000 <__kill>:
> >>>>>    0:   b8 3e 00 00 00          mov    $0x3e,%eax
> >>>>>    5:   0f 05                   syscall
> >>>>>    7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
> >>>>>    d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>
> >>
> >> Hmmm... that cmp + jae is unnecessary here.
> >
> > It's not.. Rather the objdump was just mistakenly done without -r so
> > it looks like a nop jump rather than a conditional tail call to the
> > function that sets errno.
> >
>
> Indeed, the output with -r is:
>
> 0000000000000000 <__kill>:
>    0:   b8 3e 00 00 00          mov    $0x3e,%eax
>    5:   0f 05                   syscall
>    7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
>    d:   0f 83 00 00 00 00       jae    13 <__kill+0x13>
>                         f: R_X86_64_PLT32       __syscall_error-0x4
>   13:   c3                      retq

Yes, I probably should have remembered it looked like that :-)

...
> >> I also suspect it gets predicted very badly.
> >
> > I doubt that. This is a very standard idiom and the size of the offset
> > (which is necessarily 32-bit because it has a relocation on it) is
> > orthogonal to the condition on the jump.

Yes, it only gets mispredicted as badly as any other conditional jump.
I believe modern intel x86 will randomly predict it taken (regardless
of the direction) and then hit a TLB fault on text.unlikely :-)

> > FWIW a syscall like kill takes global kernel-side locks to be able to
> > address a target process by pid, and the rate of meaningful calls you
> > can make to it is very low (since it's bounded by time for target
> > process to act on the signal).
Trying to optimize it for speed is > > pointless, and even size isn't important locally (although in > > aggregate, lots of wasted small size can add up to more pages = more > > TLB entries = ...). > > I agree and I would prefer to focus on code simplicity to have a > platform neutral way to handle error and let the compiler optimize > it than messy with assembly macros to squeeze this kind of > micro-optimizations. syscall entry does get micro-optimised. Real speed-ups can probably be found by optimising other places. I've a patch i need to resumbit that should improve the reading of iov[] from user space. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
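For readers following the cmp/jae discussion: the pair tests whether the raw kernel return value lies in the error range [-4095, -1]. A minimal sketch of the same check in C follows. The function name is hypothetical and chosen only for illustration; it is meant to show the idiom, not to reproduce any particular libc's source.

```c
#include <errno.h>

/* Hypothetical helper, named only for this illustration.
 * A raw Linux syscall return in [-4095, -1] encodes -errno;
 * any other value is a successful result. */
long syscall_ret_sketch(unsigned long raw)
{
    /* One unsigned compare covers the whole error range. On x86_64
     * this is the same test as "cmp $0xfffffffffffff001,%rax; jae". */
    if (raw > -4096UL) {
        errno = (int)-raw;   /* e.g. raw == (unsigned long)-3 gives errno 3 */
        return -1;
    }
    return (long)raw;
}
```

Because the error branch is rarely taken, a compiler or hand-written stub can turn it into a conditional tail call to an out-of-line function, which is what the relocation against __syscall_error in the listing shows.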
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On 21/04/2020 11:39, Rich Felker wrote: > On Tue, Apr 21, 2020 at 12:28:25PM +, David Laight wrote: >> From: Nicholas Piggin >>> Sent: 20 April 2020 02:10 >> ... > Yes, but does it really matter to optimize this specific usage case > for size? glibc, for instance, tries to leverage the syscall mechanism > by adding some complex pre-processor asm directives. It optimizes > the syscall code size in most cases. For instance, kill in static case > generates on x86_64: > > <__kill>: >0: b8 3e 00 00 00 mov$0x3e,%eax >5: 0f 05 syscall >7: 48 3d 01 f0 ff ff cmp$0xf001,%rax >d: 0f 83 00 00 00 00 jae13 <__kill+0x13> >> >> Hmmm... that cmp + jae is unnecessary here. > > It's not.. Rather the objdump was just mistakenly done without -r so > it looks like a nop jump rather than a conditional tail call to the > function that sets errno. > Indeed, the output with -r is: <__kill>: 0: b8 3e 00 00 00 mov$0x3e,%eax 5: 0f 05 syscall 7: 48 3d 01 f0 ff ff cmp$0xf001,%rax d: 0f 83 00 00 00 00 jae13 <__kill+0x13> f: R_X86_64_PLT32 __syscall_error-0x4 13: c3 retq And for x86_64 __syscall_error is defined as: <__syscall_error>: 0: 48 f7 d8neg%rax 0003 <__syscall_error_1>: 3: 64 89 04 25 00 00 00mov%eax,%fs:0x0 a: 00 7: R_X86_64_TPOFF32 errno b: 48 83 c8 ff or $0x,%rax f: c3 retq Different than musl, each architecture defines its own error handling mechanism (some embedded errno setting in syscall itself, other branches to a __syscall_error like function as x86_64). This is due most likely from the glibc long history. One of my long term plan is to just simplify, get rid of the assembly pre-processor, implement all syscall in C code, and set error handling mechanism in a platform neutral way using a tail call (most likely you do on musl). >> It is also a 32bit offset jump. >> I also suspect it gets predicted very badly. > > I doubt that. 
This is a very standard idiom and the size of the offset > (which is necessarily 32-bit because it has a relocation on it) is > orthogonal to the condition on the jump. > > FWIW a syscall like kill takes global kernel-side locks to be able to > address a target process by pid, and the rate of meaningful calls you > can make to it is very low (since it's bounded by time for target > process to act on the signal). Trying to optimize it for speed is > pointless, and even size isn't important locally (although in > aggregate, lots of wasted small size can add up to more pages = more > TLB entries = ...). I agree and I would prefer to focus on code simplicity to have a platform neutral way to handle error and let the compiler optimize it than messy with assembly macros to squeeze this kind of micro-optimizations. > > 13: c3 retq > > While on musl: > > : >0: 48 83 ec 08 sub$0x8,%rsp >4: 48 63 ffmovslq %edi,%rdi >7: 48 63 f6movslq %esi,%rsi >a: b8 3e 00 00 00 mov$0x3e,%eax >f: 0f 05 syscall > 11: 48 89 c7mov%rax,%rdi > 14: e8 00 00 00 00 callq 19 > 19: 5a pop%rdx > 1a: c3 retq Wow that's some extraordinarily bad codegen going on by gcc... The sign-extension is semantically needed and I don't see a good way around it (glibc's asm is kinda a hack taking advantage of kernel not looking at high bits, I think), but the gratuitous stack adjustment and refusal to generate a tail call isn't. I'll see if we can track down what's going on and get it fixed. >> >> A suitable cast might get rid of the sign extension. >> Possibly just (unsigned int). > > No, it won't. The problem is that there is no representation of the > fact that the kernel is only going to inspect the low 32 bits (by > declaring the kernel-side function as taking an int argument). The > external kill function receives arguments by the ABI, where the upper > bits of int args can contain junk, and the asm register constraints > for syscalls use longs (or rather an abstract syscall-arg type). 
It > wouldn't even work to have macro magic detect that the expressions > passed are ints and use hacks to avoid that, since it's perfectly > valid to pass an int to a syscall that expects a long argument (e.g. > offset to mmap), in which case it needs to be sign-extended. > > The only way to avoid this is encoding somewhere th
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Tue, Apr 21, 2020 at 12:28:25PM +, David Laight wrote:
> From: Nicholas Piggin
> > Sent: 20 April 2020 02:10
> ...
> > >> Yes, but does it really matter to optimize this specific usage case
> > >> for size? glibc, for instance, tries to leverage the syscall mechanism
> > >> by adding some complex pre-processor asm directives. It optimizes
> > >> the syscall code size in most cases. For instance, kill in static case
> > >> generates on x86_64:
> > >>
> > >> <__kill>:
> > >>    0: b8 3e 00 00 00        mov    $0x3e,%eax
> > >>    5: 0f 05                 syscall
> > >>    7: 48 3d 01 f0 ff ff     cmp    $0xfffffffffffff001,%rax
> > >>    d: 0f 83 00 00 00 00     jae    13 <__kill+0x13>
>
> Hmmm... that cmp + jae is unnecessary here.

It's not. Rather the objdump was just mistakenly done without -r so
it looks like a nop jump rather than a conditional tail call to the
function that sets errno.

> It is also a 32bit offset jump.
> I also suspect it gets predicted very badly.

I doubt that. This is a very standard idiom and the size of the offset
(which is necessarily 32-bit because it has a relocation on it) is
orthogonal to the condition on the jump.

FWIW a syscall like kill takes global kernel-side locks to be able to
address a target process by pid, and the rate of meaningful calls you
can make to it is very low (since it's bounded by time for target
process to act on the signal). Trying to optimize it for speed is
pointless, and even size isn't important locally (although in
aggregate, lots of wasted small size can add up to more pages = more
TLB entries = ...).

> > >>   13: c3                    retq
> > >>
> > >> While on musl:
> > >>
> > >> :
> > >>    0: 48 83 ec 08           sub    $0x8,%rsp
> > >>    4: 48 63 ff              movslq %edi,%rdi
> > >>    7: 48 63 f6              movslq %esi,%rsi
> > >>    a: b8 3e 00 00 00        mov    $0x3e,%eax
> > >>    f: 0f 05                 syscall
> > >>   11: 48 89 c7              mov    %rax,%rdi
> > >>   14: e8 00 00 00 00        callq  19
> > >>   19: 5a                    pop    %rdx
> > >>   1a: c3                    retq
> > >
> > > Wow that's some extraordinarily bad codegen going on by gcc... The
> > > sign-extension is semantically needed and I don't see a good way
> > > around it (glibc's asm is kinda a hack taking advantage of kernel not
> > > looking at high bits, I think), but the gratuitous stack adjustment
> > > and refusal to generate a tail call isn't. I'll see if we can track
> > > down what's going on and get it fixed.
>
> A suitable cast might get rid of the sign extension.
> Possibly just (unsigned int).

No, it won't. The problem is that there is no representation of the
fact that the kernel is only going to inspect the low 32 bits (by
declaring the kernel-side function as taking an int argument). The
external kill function receives arguments by the ABI, where the upper
bits of int args can contain junk, and the asm register constraints
for syscalls use longs (or rather an abstract syscall-arg type). It
wouldn't even work to have macro magic detect that the expressions
passed are ints and use hacks to avoid that, since it's perfectly
valid to pass an int to a syscall that expects a long argument (e.g.
offset to mmap), in which case it needs to be sign-extended.

The only way to avoid this is encoding somewhere the syscall-specific
knowledge of what arg size the kernel function expects. That's way too
much redundant effort and too error-prone for the incredibly minuscule
size benefit you'd get out of it.

Rich
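To make the sign-extension argument concrete, here is a small sketch in C. It assumes an LP64 target (64-bit long, 32-bit int); the type and function names are hypothetical stand-ins for the abstract syscall-argument type mentioned above, not names taken from musl or glibc.

```c
/* Hypothetical stand-in for the abstract syscall-argument type. */
typedef long syscall_arg_sketch_t;

/* What a generic wrapper has to do: a value-preserving conversion,
 * so a negative int stays negative when widened (the movslq in the
 * listing). */
syscall_arg_sketch_t arg_from_int(int v)
{
    return (syscall_arg_sketch_t)v;
}

/* The "(unsigned int)" cast zero-extends instead. That is harmless
 * only if the kernel looks at the low 32 bits; for a syscall that
 * takes a long, a negative int would arrive as a large positive
 * value. */
syscall_arg_sketch_t arg_from_int_zero_extended(int v)
{
    return (syscall_arg_sketch_t)(unsigned int)v;
}
```

On LP64, arg_from_int(-1) is -1 while arg_from_int_zero_extended(-1) is 4294967295, and a generic wrapper cannot tell which of the two the kernel-side function needs without per-syscall knowledge.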
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
* Szabolcs Nagy:

> * Nicholas Piggin [2020-04-20 12:08:36 +1000]:
>> Excerpts from Rich Felker's message of April 20, 2020 11:29 am:
>> > Also, allowing patching of executable pages is generally frowned upon
>> > these days because W^X is a desirable hardening property.
>>
>> Right, it would want to be write-protected after being patched.
>
> "frowned upon" means that users may have to update
> their security policy setting in pax, selinux, apparmor,
> seccomp bpf filters and who knows what else that may
> monitor and flag W&X mprotect.
>
> libc update can break systems if the new libc does W&X.

It's possible to map over pre-compiled alternative implementations,
though. Basically, we would do the patching at build time and store
the results in the file. It works best if the variance is
concentrated on a few pages, and there are very few alternatives.
For example, having two syscall APIs and supporting threading and
no-threading versions would need four code versions in total, which
is likely excessive.
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
* Nicholas Piggin [2020-04-20 12:08:36 +1000]:
> Excerpts from Rich Felker's message of April 20, 2020 11:29 am:
> > Also, allowing patching of executable pages is generally frowned upon
> > these days because W^X is a desirable hardening property.
>
> Right, it would want to be write-protected after being patched.

"frowned upon" means that users may have to update
their security policy setting in pax, selinux, apparmor,
seccomp bpf filters and who knows what else that may
monitor and flag W&X mprotect.

a libc update can break systems if the new libc does W&X.
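As a rough illustration of what "flag W&X mprotect" means, the sketch below shows the simplest form of such a check in C: reject any protection request that is writable and executable at once. The function name is hypothetical, and real policies (PaX, SELinux and the like) live in the kernel and track more state than this.

```c
#include <sys/mman.h>

/* Hypothetical predicate for illustration: nonzero when a requested
 * protection would make a mapping writable and executable at the
 * same time, which is what a W^X policy refuses. */
int prot_is_wx(int prot)
{
    return (prot & PROT_WRITE) && (prot & PROT_EXEC);
}
```

A libc that patched its own text in place would at some point need to request write access to pages that are, or will again become, executable, and that is the kind of request such monitors are set up to catch.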
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Mon, Apr 20, 2020 at 02:31:58PM +1000, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 20, 2020 2:09 pm: > > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote: > >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am: > >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: > >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am: > >> >> > Note that because lr is clobbered we need at least once normally > >> >> > call-clobbered register that's not syscall clobbered to save lr in. > >> >> > Otherwise stack frame setup is required to spill it. > >> >> > >> >> The kernel would like to use r9-r12 for itself. We could do with fewer > >> >> registers, but we have some delay establishing the stack (depends on a > >> >> load which depends on a mfspr), and entry code tends to be quite store > >> >> heavy whereas on the caller side you have r1 set up (modulo stack > >> >> updates), and the system call is a long delay during which time the > >> >> store queue has significant time to drain. > >> >> > >> >> My feeling is it would be better for kernel to have these scratch > >> >> registers. > >> > > >> > If your new kernel syscall mechanism requires the caller to make a > >> > whole stack frame it otherwise doesn't need and spill registers to it, > >> > it becomes a lot less attractive. Some of those 90 cycles saved are > >> > immediately lost on the userspace side, plus you either waste icache > >> > at the call point or require the syscall to go through a > >> > userspace-side helper function that performs the spill and restore. > >> > >> You would be surprised how few cycles that takes on a high end CPU. Some > >> might be a couple of %. I am one for counting cycles mind you, I'm not > >> being flippant about it. If we can come up with something faster I'd be > >> up for it. > > > > If the cycle count is trivial then just do it on the kernel side. 
> > The cycle count for user is, because you have r1 ready. Kernel does not > have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to > save into. > > Which is also wasted work for a userspace. > > Now that I think about it, no stack frame is even required! lr is saved > into the caller's stack when its clobbered with an asm, just as when > it's used for a function call. No. If there is a non-clobbered register, lr can be moved to the non-clobbered register rather than saved to the stack. However it looks like (1) gcc doesn't take advantage of that possibility, but (2) the caller already arranged for there to be space on the stack to save lr, so the cost is only one store and one load, not any stack adjustment or other frame setup. So it's probably not a really big deal. However, just adding "lr" clobber to existing syscall in musl increased the size of a simple syscall function (getuid) from 20 bytes to 36 bytes. > >> > syscall arg registers still preserved? If not, this is a major cost on > >> > the userspace side, since any call point that has to loop-and-retry > >> > (e.g. futex) now needs to make its own place to store the original > >> > values.) > >> > >> Powerpc system calls never did. We could have scv preserve them, but > >> you'd still need to restore r3. We could make an ABI which does not > >> clobber r3 but puts the return value in r9, say. I'd like to see what > >> the user side code looks like to take advantage of such a thing though. > > > > Oh wow, I hadn't realized that, but indeed the code we have now is > > allowing for the kernel to clobber them all. So at least this isn't > > getting any worse I guess. I think it was a very poor choice of > > behavior though and a disadvantage vs what other archs do (some of > > them preserve all registers; others preserve only normally call-saved > > ones plus the syscall arg ones and possibly a few other specials). > > Well, we could change it. 
Does the generated code improve significantly > we take those clobbers away? I'd have to experiment a bit more to see. It's not going to help at all in functions which are pure syscall wrappers that just do the syscall and return, since the arg regs are dead after the syscall anyway (the caller must assume they were clobbered). But where syscalls are inlined and used in a loop, like a futex wait, it might make a nontrivial difference. Unfortunately even if you did change it for the new scv mechanism, it would be hard to take advantage of the change while also supporting sc, unless we used a helper function that just did scv directly, but saved/restored all the arg regs when using the legacy sc mechanism. Just inlining the hwcap conditional and clobbering more regs in one code path than in the other likely would not help; gcc won't shrink-wrap the clobbered/non-clobbered paths separately, and even if it did, when this were inlined somewhere like a futex loop, it'd end up having to lift the conditional out of the loop to be very advantageous, then making th
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
> On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
>> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> >> > Note that because lr is clobbered we need at least one normally
>> >> > call-clobbered register that's not syscall clobbered to save lr in.
>> >> > Otherwise stack frame setup is required to spill it.
>> >>
>> >> The kernel would like to use r9-r12 for itself. We could do with fewer
>> >> registers, but we have some delay establishing the stack (depends on a
>> >> load which depends on a mfspr), and entry code tends to be quite store
>> >> heavy whereas on the caller side you have r1 set up (modulo stack
>> >> updates), and the system call is a long delay during which time the
>> >> store queue has significant time to drain.
>> >>
>> >> My feeling is it would be better for kernel to have these scratch
>> >> registers.
>> >
>> > If your new kernel syscall mechanism requires the caller to make a
>> > whole stack frame it otherwise doesn't need and spill registers to it,
>> > it becomes a lot less attractive. Some of those 90 cycles saved are
>> > immediately lost on the userspace side, plus you either waste icache
>> > at the call point or require the syscall to go through a
>> > userspace-side helper function that performs the spill and restore.
>>
>> You would be surprised how few cycles that takes on a high end CPU. Some
>> might be a couple of %. I am one for counting cycles mind you, I'm not
>> being flippant about it. If we can come up with something faster I'd be
>> up for it.
>
> If the cycle count is trivial then just do it on the kernel side.

The cycle count for user is, because you have r1 ready. Kernel does not
have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to
save into.

Which is also wasted work for a userspace.

Now that I think about it, no stack frame is even required! lr is saved
into the caller's stack when it's clobbered with an asm, just as when
it's used for a function call.

>> > The right way to do this is to have the kernel preserve enough
>> > registers that userspace can avoid having any spills. It doesn't have
>> > to preserve everything, probably just enough to save lr. (BTW are
>>
>> Again, the problem is the kernel doesn't have its dependencies
>> immediately ready to spill, and spilling (may be) more costly
>> immediately after the call because we're doing a lot of stores.
>>
>> I could try measure this. Unfortunately our pipeline simulator tool
>> doesn't model system calls properly so it's hard to see what's happening
>> across the user/kernel horizon, I might check if that can be improved
>> or I can hack it by putting some isync in there or something.
>
> I think it's unlikely to make any real difference to the total number
> of cycles spent which side it happens on, but putting it on the kernel
> side makes it easier to avoid wasting size/icache at each syscall
> site.
>
>> > syscall arg registers still preserved? If not, this is a major cost on
>> > the userspace side, since any call point that has to loop-and-retry
>> > (e.g. futex) now needs to make its own place to store the original
>> > values.)
>>
>> Powerpc system calls never did. We could have scv preserve them, but
>> you'd still need to restore r3. We could make an ABI which does not
>> clobber r3 but puts the return value in r9, say. I'd like to see what
>> the user side code looks like to take advantage of such a thing though.
>
> Oh wow, I hadn't realized that, but indeed the code we have now is
> allowing for the kernel to clobber them all. So at least this isn't
> getting any worse I guess. I think it was a very poor choice of
> behavior though and a disadvantage vs what other archs do (some of
> them preserve all registers; others preserve only normally call-saved
> ones plus the syscall arg ones and possibly a few other specials).

Well, we could change it. Does the generated code improve significantly
if we take those clobbers away?

Thanks,
Nick
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> >> > Note that because lr is clobbered we need at least one normally
> >> > call-clobbered register that's not syscall clobbered to save lr in.
> >> > Otherwise stack frame setup is required to spill it.
> >>
> >> The kernel would like to use r9-r12 for itself. We could do with fewer
> >> registers, but we have some delay establishing the stack (depends on a
> >> load which depends on a mfspr), and entry code tends to be quite store
> >> heavy whereas on the caller side you have r1 set up (modulo stack
> >> updates), and the system call is a long delay during which time the
> >> store queue has significant time to drain.
> >>
> >> My feeling is it would be better for kernel to have these scratch
> >> registers.
> >
> > If your new kernel syscall mechanism requires the caller to make a
> > whole stack frame it otherwise doesn't need and spill registers to it,
> > it becomes a lot less attractive. Some of those 90 cycles saved are
> > immediately lost on the userspace side, plus you either waste icache
> > at the call point or require the syscall to go through a
> > userspace-side helper function that performs the spill and restore.
>
> You would be surprised how few cycles that takes on a high end CPU. Some
> might be a couple of %. I am one for counting cycles mind you, I'm not
> being flippant about it. If we can come up with something faster I'd be
> up for it.

If the cycle count is trivial then just do it on the kernel side.

> > The right way to do this is to have the kernel preserve enough
> > registers that userspace can avoid having any spills. It doesn't have
> > to preserve everything, probably just enough to save lr. (BTW are
>
> Again, the problem is the kernel doesn't have its dependencies
> immediately ready to spill, and spilling (may be) more costly
> immediately after the call because we're doing a lot of stores.
>
> I could try measure this. Unfortunately our pipeline simulator tool
> doesn't model system calls properly so it's hard to see what's happening
> across the user/kernel horizon, I might check if that can be improved
> or I can hack it by putting some isync in there or something.

I think it's unlikely to make any real difference to the total number
of cycles spent which side it happens on, but putting it on the kernel
side makes it easier to avoid wasting size/icache at each syscall
site.

> > syscall arg registers still preserved? If not, this is a major cost on
> > the userspace side, since any call point that has to loop-and-retry
> > (e.g. futex) now needs to make its own place to store the original
> > values.)
>
> Powerpc system calls never did. We could have scv preserve them, but
> you'd still need to restore r3. We could make an ABI which does not
> clobber r3 but puts the return value in r9, say. I'd like to see what
> the user side code looks like to take advantage of such a thing though.

Oh wow, I hadn't realized that, but indeed the code we have now is
allowing for the kernel to clobber them all. So at least this isn't
getting any worse I guess. I think it was a very poor choice of
behavior though and a disadvantage vs what other archs do (some of
them preserve all registers; others preserve only normally call-saved
ones plus the syscall arg ones and possibly a few other specials).

Rich
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
> On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> > Note that because lr is clobbered we need at least one normally
>> > call-clobbered register that's not syscall clobbered to save lr in.
>> > Otherwise stack frame setup is required to spill it.
>>
>> The kernel would like to use r9-r12 for itself. We could do with fewer
>> registers, but we have some delay establishing the stack (depends on a
>> load which depends on a mfspr), and entry code tends to be quite store
>> heavy whereas on the caller side you have r1 set up (modulo stack
>> updates), and the system call is a long delay during which time the
>> store queue has significant time to drain.
>>
>> My feeling is it would be better for kernel to have these scratch
>> registers.
>
> If your new kernel syscall mechanism requires the caller to make a
> whole stack frame it otherwise doesn't need and spill registers to it,
> it becomes a lot less attractive. Some of those 90 cycles saved are
> immediately lost on the userspace side, plus you either waste icache
> at the call point or require the syscall to go through a
> userspace-side helper function that performs the spill and restore.

You would be surprised how few cycles that takes on a high end CPU. Some
might be a couple of %. I am one for counting cycles mind you, I'm not
being flippant about it. If we can come up with something faster I'd be
up for it.

>
> The right way to do this is to have the kernel preserve enough
> registers that userspace can avoid having any spills. It doesn't have
> to preserve everything, probably just enough to save lr. (BTW are

Again, the problem is the kernel doesn't have its dependencies
immediately ready to spill, and spilling (may be) more costly
immediately after the call because we're doing a lot of stores.

I could try to measure this. Unfortunately our pipeline simulator tool
doesn't model system calls properly so it's hard to see what's happening
across the user/kernel horizon. I might check if that can be improved
or I can hack it by putting some isync in there or something.

> syscall arg registers still preserved? If not, this is a major cost on
> the userspace side, since any call point that has to loop-and-retry
> (e.g. futex) now needs to make its own place to store the original
> values.)

Powerpc system calls never did. We could have scv preserve them, but
you'd still need to restore r3. We could make an ABI which does not
clobber r3 but puts the return value in r9, say. I'd like to see what
the user side code looks like to take advantage of such a thing though.

Thanks,
Nick
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Rich Felker's message of April 20, 2020 11:29 am:
> On Mon, Apr 20, 2020 at 10:27:58AM +1000, Nicholas Piggin wrote:
>> Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm:
>> > * Nicholas Piggin via Libc-alpha [2020-04-16 10:16:54 +1000]:
>> >> Well it would have to test HWCAP and patch in or branch to two
>> >> completely different sequences including register save/restores yes.
>> >> You could have the same asm and matching clobbers to put the sequence
>> >> inline and then you could patch the one sc/scv instruction I suppose.
>> >
>> > how would that 'patch' work?
>> >
>> > there are many reasons why you don't
>> > want libc to write its .text
>>
>> I guess I don't know what I'm talking about when it comes to libraries.
>> Shame if there is no good way to load-time patch libc. It's orthogonal
>> to the scv selection though -- if you don't patch you have to
>> conditional or indirect branch however you implement it.
>
> Patched pages cannot be shared. The whole design of PIC and shared
> libraries is that the code("text")/rodata is immutable and shared and
> that only a minimal amount of data, packed tightly together (the GOT)
> has to exist per-instance.

Yeah the pages which were patched couldn't be shared across exec, which
is a significant downside, unless you could group all patch sites into
their own section and similarly pack it together (which has issues of
being out of line).

>
> Also, allowing patching of executable pages is generally frowned upon
> these days because W^X is a desirable hardening property.

Right, it would want to be write-protected after being patched.

Thanks,
Nick
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> > Note that because lr is clobbered we need at least one normally
> > call-clobbered register that's not syscall clobbered to save lr in.
> > Otherwise stack frame setup is required to spill it.
>
> The kernel would like to use r9-r12 for itself. We could do with fewer
> registers, but we have some delay establishing the stack (depends on a
> load which depends on a mfspr), and entry code tends to be quite store
> heavy whereas on the caller side you have r1 set up (modulo stack
> updates), and the system call is a long delay during which time the
> store queue has significant time to drain.
>
> My feeling is it would be better for kernel to have these scratch
> registers.

If your new kernel syscall mechanism requires the caller to make a
whole stack frame it otherwise doesn't need and spill registers to it,
it becomes a lot less attractive. Some of those 90 cycles saved are
immediately lost on the userspace side, plus you either waste icache
at the call point or require the syscall to go through a
userspace-side helper function that performs the spill and restore.

The right way to do this is to have the kernel preserve enough
registers that userspace can avoid having any spills. It doesn't have
to preserve everything, probably just enough to save lr. (BTW are
syscall arg registers still preserved? If not, this is a major cost on
the userspace side, since any call point that has to loop-and-retry
(e.g. futex) now needs to make its own place to store the original
values.)

Rich
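The loop-and-retry concern can be seen in a minimal futex wait loop. The sketch below is Linux-only, goes through the generic syscall() wrapper rather than inline asm, and uses a hypothetical function name; it assumes SYS_futex is available on the target.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: block while *word == val, retrying when the
 * wait is interrupted by a signal. Returns 0 once woken, or -1 with
 * errno set (EAGAIN if *word already differs from val). */
int futex_wait_sketch(uint32_t *word, uint32_t val)
{
    for (;;) {
        /* word and val are needed again on every iteration, so they
         * must stay live across the syscall. If the syscall ABI
         * clobbers its argument registers, the compiler has to keep
         * copies elsewhere and reload them before each retry. */
        long r = syscall(SYS_futex, word,
                         FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
                         (long)val, (void *)0, (void *)0, 0L);
        if (r == 0)
            return 0;
        if (errno != EINTR)
            return -1;
    }
}
```

When the syscall is inlined into such a loop, an ABI that preserves the argument registers lets the same register contents be reused on the next iteration, which is the saving being asked about.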
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Mon, Apr 20, 2020 at 10:27:58AM +1000, Nicholas Piggin wrote:
> Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm:
> > * Nicholas Piggin via Libc-alpha [2020-04-16 10:16:54 +1000]:
> >> Well it would have to test HWCAP and patch in or branch to two
> >> completely different sequences including register save/restores yes.
> >> You could have the same asm and matching clobbers to put the sequence
> >> inline and then you could patch the one sc/scv instruction I suppose.
> >
> > how would that 'patch' work?
> >
> > there are many reasons why you don't
> > want libc to write its .text
>
> I guess I don't know what I'm talking about when it comes to libraries.
> Shame if there is no good way to load-time patch libc. It's orthogonal
> to the scv selection though -- if you don't patch you have to
> conditional or indirect branch however you implement it.

Patched pages cannot be shared. The whole design of PIC and shared
libraries is that the code("text")/rodata is immutable and shared and
that only a minimal amount of data, packed tightly together (the GOT)
has to exist per-instance.

Also, allowing patching of executable pages is generally frowned upon
these days because W^X is a desirable hardening property.

Rich
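The non-patching alternative mentioned above, a conditional branch on HWCAP, might look roughly like the following C sketch. The constant value is an assumption made here so the example builds on any architecture; on a real powerpc build the PPC_FEATURE2_SCV bit would come from the kernel's uapi headers. The enum and function names are hypothetical.

```c
/* Assumed value of the PPC_FEATURE2_SCV bit, repeated here only so
 * the sketch compiles everywhere; use the kernel header on powerpc. */
#define SKETCH_PPC_FEATURE2_SCV 0x00100000UL

/* Hypothetical names for the two entry mechanisms. */
enum syscall_path_sketch { PATH_SC, PATH_SCV };

/* Decide once, from the AT_HWCAP2 auxv word, which instruction the
 * syscall stubs should use. Text pages stay immutable and shared;
 * the cost is a conditional or indirect branch at each call. */
enum syscall_path_sketch select_syscall_path(unsigned long hwcap2)
{
    return (hwcap2 & SKETCH_PPC_FEATURE2_SCV) ? PATH_SCV : PATH_SC;
}
```

On a powerpc system the argument would typically be getauxval(AT_HWCAP2) from <sys/auxv.h>, read once at startup and cached in a variable the syscall stubs can test.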
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Rich Felker's message of April 17, 2020 4:31 am: > On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote: >> >> >> On 16/04/2020 14:59, Rich Felker wrote: >> > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: >> >> >> >> >> >> On 16/04/2020 12:37, Rich Felker wrote: >> >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >> > My preference would be that it work just like the i386 AT_SYSINFO >> > where you just replace "int $128" with "call *%%gs:16" and the kernel >> > provides a stub in the vdso that performs either scv or the old >> > mechanism with the same calling convention. Then if the kernel doesn't >> > provide it (because the kernel is too old) libc would have to provide >> > its own stub that uses the legacy method and matches the calling >> > convention of the one the kernel is expected to provide. >> >> What about pthread cancellation and the requirement of checking the >> cancellable syscall anchors in asynchronous cancellation? My plan is >> still to use musl strategy on glibc (BZ#12683) and for i686 it >> requires to always use old int$128 for program that uses cancellation >> (static case) or just threads (dynamic mode, which should be more >> common on glibc). >> >> Using the i686 strategy of a vDSO bridge symbol would require to always >> fallback to 'sc' to still use the same cancellation strategy (and >> thus defeating this optimization in such cases). >> >>> >> >>> Yes, I assumed it would be the same, ignoring the new syscall >> >>> mechanism for cancellable syscalls. While there are some exceptions, >> >>> cancellable syscalls are generally not hot paths but things that are >> >>> expected to block and to have significant amounts of work to do in >> >>> kernelspace, so saving a few tens of cycles is rather pointless. 
>> >>> >> >>> It's possible to do a branch/multiple versions of the syscall asm for >> >>> cancellation but would require extending the cancellation handler to >> >>> support checking against multiple independent address ranges or using >> >>> some alternate markup of them. >> >> >> >> The main issue is at least for glibc dynamic linking is way more common >> >> than static linking and once the program become multithread the fallback >> >> will be always used. >> > >> > I'm not relying on static linking optimizing out the cancellable >> > version. I'm talking about how cancellable syscalls are pretty much >> > all "heavy" operations to begin with where a few tens of cycles are in >> > the realm of "measurement noise" relative to the dominating time >> > costs. >> >> Yes I am aware, but at same time I am not sure how it plays on real world. >> For instance, some workloads might issue kernel query syscalls, such as >> recv, where buffer copying might not be dominant factor. So I see that if >> the idea is optimizing syscall mechanism, we should try to leverage it >> as whole in libc. > > Have you timed a minimal recv? I'm not assuming buffer copying is the > dominant factor. I'm assuming the overhead of all the kernel layers > involved is dominant. > >> >> And besides the cancellation performance issue, a new bridge vDSO >> >> mechanism >> >> will still require to setup some extra bridge for the case of the older >> >> kernel. In the scheme you suggested: >> >> >> >> __asm__("indirect call" ... with common clobbers); >> >> >> >> The indirect call will be either the vDSO bridge or an libc provided that >> >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain >> >> against: >> >> >> >>if (hwcap & PPC_FEATURE2_SCV) { >> >> __asm__(... with some clobbers); >> >>} else { >> >> __asm__(... 
with different clobbers); >> >>} >> > >> > If the indirect call can be made roughly as efficiently as the sc >> > sequence now (which already have some cost due to handling the nasty >> > error return convention, making the indirect call likely just as small >> > or smaller), it's O(1) additional code size (and thus icache usage) >> > rather than O(n) where n is number of syscall points. >> > >> > Of course it would work just as well (for avoiding O(n) growth) to >> > have a direct call to out-of-line branch like you suggested. >> >> Yes, but does it really matter to optimize this specific usage case >> for size? glibc, for instance, tries to leverage the syscall mechanism >> by adding some complex pre-processor asm directives. It optimizes >> the syscall code size in most cases. For instance, kill in static case >> generates on x86_64: >> >> <__kill>: >>0: b8 3e 00 00 00 mov$0x3e,%eax >>5: 0f 05 syscall >>7: 48 3d 01 f0 ff ff cmp$0xf001,%rax >>d: 0f 83 00 00 00 00 jae13 <__kill+0x13> >> 13: c3 retq >> >> While on musl: >> >> : >>0:
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Adhemerval Zanella's message of April 17, 2020 4:52 am: > > > On 16/04/2020 15:31, Rich Felker wrote: >> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote: >>> >>> >>> On 16/04/2020 14:59, Rich Felker wrote: On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: > > > On 16/04/2020 12:37, Rich Felker wrote: >> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: My preference would be that it work just like the i386 AT_SYSINFO where you just replace "int $128" with "call *%%gs:16" and the kernel provides a stub in the vdso that performs either scv or the old mechanism with the same calling convention. Then if the kernel doesn't provide it (because the kernel is too old) libc would have to provide its own stub that uses the legacy method and matches the calling convention of the one the kernel is expected to provide. >>> >>> What about pthread cancellation and the requirement of checking the >>> cancellable syscall anchors in asynchronous cancellation? My plan is >>> still to use musl strategy on glibc (BZ#12683) and for i686 it >>> requires to always use old int$128 for program that uses cancellation >>> (static case) or just threads (dynamic mode, which should be more >>> common on glibc). >>> >>> Using the i686 strategy of a vDSO bridge symbol would require to always >>> fallback to 'sc' to still use the same cancellation strategy (and >>> thus defeating this optimization in such cases). >> >> Yes, I assumed it would be the same, ignoring the new syscall >> mechanism for cancellable syscalls. While there are some exceptions, >> cancellable syscalls are generally not hot paths but things that are >> expected to block and to have significant amounts of work to do in >> kernelspace, so saving a few tens of cycles is rather pointless. 
>> >> It's possible to do a branch/multiple versions of the syscall asm for >> cancellation but would require extending the cancellation handler to >> support checking against multiple independent address ranges or using >> some alternate markup of them. > > The main issue is at least for glibc dynamic linking is way more common > than static linking and once the program become multithread the fallback > will be always used. I'm not relying on static linking optimizing out the cancellable version. I'm talking about how cancellable syscalls are pretty much all "heavy" operations to begin with where a few tens of cycles are in the realm of "measurement noise" relative to the dominating time costs. >>> >>> Yes I am aware, but at same time I am not sure how it plays on real world. >>> For instance, some workloads might issue kernel query syscalls, such as >>> recv, where buffer copying might not be dominant factor. So I see that if >>> the idea is optimizing syscall mechanism, we should try to leverage it >>> as whole in libc. >> >> Have you timed a minimal recv? I'm not assuming buffer copying is the >> dominant factor. I'm assuming the overhead of all the kernel layers >> involved is dominant. > > Not really, but reading the advantages of using 'scv' over 'sc' also does > not outline the real expect gain. Taking in consideration this should > be a micro-optimization (focused on entry syscall patch), I think we should > use where it possible.

It's around a 90 cycle improvement. Depending on config options and the speculative mitigations in place, this may be roughly 5-20% of a gettid syscall, which itself probably bears little relationship to what a recv syscall doing real work would cost; it's easy to swamp it with other work. But it's a pretty big win in terms of how much we try to optimise this path.

Thanks,
Nick
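A rough way to put the "5-20% of a gettid syscall" figure in context is to time a minimal syscall in a loop. The sketch below is only an illustration, not a benchmark taken from this thread: the helper name `time_gettid_ns` is hypothetical, it goes through the generic `syscall()` wrapper (so libc wrapper overhead is included), and it reports wall-clock nanoseconds rather than cycles.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for syscall() on glibc */
#endif
#include <assert.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical helper: average wall-clock cost, in nanoseconds, of one
 * gettid syscall issued through the generic syscall() wrapper. */
static double time_gettid_ns(long iterations)
{
	struct timespec start, end;
	volatile long sink = 0;   /* keeps the loop from being optimized away */

	assert(iterations > 0);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < iterations; i++)
		sink += syscall(SYS_gettid);
	clock_gettime(CLOCK_MONOTONIC, &end);
	(void)sink;

	double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
	                  + (double)(end.tv_nsec - start.tv_nsec);
	return elapsed_ns / (double)iterations;
}
```

Calling `time_gettid_ns(1000000)` before and after switching the entry mechanism would give a first-order comparison; a saving of a fixed number of cycles shows up as a smaller fraction for any syscall that does more kernel work than gettid.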
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm:
> * Nicholas Piggin via Libc-alpha [2020-04-16
> 10:16:54 +1000]:
>> Well it would have to test HWCAP and patch in or branch to two
>> completely different sequences including register save/restores yes.
>> You could have the same asm and matching clobbers to put the sequence
>> inline and then you could patch the one sc/scv instruction I suppose.
>
> how would that 'patch' work?
>
> there are many reasons why you don't
> want libc to write its .text

I guess I don't know what I'm talking about when it comes to libraries.
Shame if there is no good way to load-time patch libc. It's orthogonal
to the scv selection though -- if you don't patch, you have to use a
conditional or indirect branch however you implement it.

Thanks,
Nick
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
* Segher Boessenkool: > On Thu, Apr 16, 2020 at 08:34:42PM -0400, Rich Felker wrote: >> On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote: >> > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote: >> > > > I think my choice would be just making the inline syscall be a single >> > > > call insn to an asm source file that out-of-lines the loading of TOC >> > > > pointer and call through it or branch based on hwcap so that it's not >> > > > repeated all over the place. >> > > >> > > I don't know how problematic control flow out of an inline asm is on >> > > POWER. But this is basically the -moutline-atomics approach. >> > >> > Control flow out of inline asm (other than with "asm goto") is not >> > allowed at all, just like on any other target (and will not work in >> > practice, either -- just like on any other target). But the suggestion >> > was to use actual assembler code, not inline asm? >> >> Calling it control flow out of inline asm is something of a misnomer. >> The enclosing state is not discarded or altered; the asm statement >> exits normally, reaching the next instruction in the enclosing >> block/function as soon as the call from the asm statement returns, >> with all register/clobber constraints satisfied. > > Ah. That should always Just Work, then -- our ABIs guarantee you can. After thinking about it, I agree: GCC will handle spilling of the link register. Branch-and-link instructions do not clobber the protected zone, so no stack adjustment is needed (which would be problematic to reflect in the unwind information). Of course, the target function has to be written in assembler because it must not use a regular stack frame.
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 08:34:42PM -0400, Rich Felker wrote: > On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote: > > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote: > > > > I think my choice would be just making the inline syscall be a single > > > > call insn to an asm source file that out-of-lines the loading of TOC > > > > pointer and call through it or branch based on hwcap so that it's not > > > > repeated all over the place. > > > > > > I don't know how problematic control flow out of an inline asm is on > > > POWER. But this is basically the -moutline-atomics approach. > > > > Control flow out of inline asm (other than with "asm goto") is not > > allowed at all, just like on any other target (and will not work in > > practice, either -- just like on any other target). But the suggestion > > was to use actual assembler code, not inline asm? > > Calling it control flow out of inline asm is something of a misnomer. > The enclosing state is not discarded or altered; the asm statement > exits normally, reaching the next instruction in the enclosing > block/function as soon as the call from the asm statement returns, > with all register/clobber constraints satisfied. Ah. That should always Just Work, then -- our ABIs guarantee you can. > Control flow out of inline asm would be more like longjmp, and it can > be valid -- for instance, you can implement coroutines this way > (assuming you switch stack correctly) or do longjmp this way (jumping > to the location saved by setjmp). But it's not what'd be happening > here. Yeah, you cannot do that in C, not without making assumptions about what machine code the compiler generates. GCC explicitly disallows it, too: 'asm' statements may not perform jumps into other 'asm' statements, only to the listed GOTOLABELS. GCC's optimizers do not know about other jumps; therefore they cannot take account of them when deciding how to optimize. Segher
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote: > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote: > > > I think my choice would be just making the inline syscall be a single > > > call insn to an asm source file that out-of-lines the loading of TOC > > > pointer and call through it or branch based on hwcap so that it's not > > > repeated all over the place. > > > > I don't know how problematic control flow out of an inline asm is on > > POWER. But this is basically the -moutline-atomics approach. > > Control flow out of inline asm (other than with "asm goto") is not > allowed at all, just like on any other target (and will not work in > practice, either -- just like on any other target). But the suggestion > was to use actual assembler code, not inline asm? Calling it control flow out of inline asm is something of a misnomer. The enclosing state is not discarded or altered; the asm statement exits normally, reaching the next instruction in the enclosing block/function as soon as the call from the asm statement returns, with all register/clobber constraints satisfied. Control flow out of inline asm would be more like longjmp, and it can be valid -- for instance, you can implement coroutines this way (assuming you switch stack correctly) or do longjmp this way (jumping to the location saved by setjmp). But it's not what'd be happening here. Rich
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote: > > I think my choice would be just making the inline syscall be a single > > call insn to an asm source file that out-of-lines the loading of TOC > > pointer and call through it or branch based on hwcap so that it's not > > repeated all over the place. > > I don't know how problematic control flow out of an inline asm is on > POWER. But this is basically the -moutline-atomics approach. Control flow out of inline asm (other than with "asm goto") is not allowed at all, just like on any other target (and will not work in practice, either -- just like on any other target). But the suggestion was to use actual assembler code, not inline asm? Segher
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
* Nicholas Piggin via Libc-alpha:
> We may or may not be getting a new ABI that will use instructions not
> supported by old processors.
>
> https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html
>
> Current ABI continues to work of course and be the default for some
> time, but building for new one would give some opportunity to drop
> such support for old procs, at least for glibc.

If I recall correctly, it was pretty clear during last year's GNU Tools Cauldron that this was only to be used for intra-DSO ABIs, not cross-DSO optimization. Relocatable object files have an ABI, too, of course, which is why ABI documentation is needed.

For cross-DSO optimization, the link editor would look at the DSO being linked in, check if it uses the -mfuture ABI, and apply some shortcuts. But at that point, if the DSO is swapped back to a version built without -mfuture, it no longer works with the binaries that were newly linked against the -mfuture version. Such a thing is a clear ABI bump, and based on what I remember from Cauldron, that is not the plan here.

(I don't have any insider knowledge. I just don't want people to read this and think: gosh, yet another POWER ABI bump. But the PCREL stuff *is* exciting!)
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On 16/04/2020 15:31, Rich Felker wrote: > On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote: >> >> >> On 16/04/2020 14:59, Rich Felker wrote: >>> On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: On 16/04/2020 12:37, Rich Felker wrote: > On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >>> My preference would be that it work just like the i386 AT_SYSINFO >>> where you just replace "int $128" with "call *%%gs:16" and the kernel >>> provides a stub in the vdso that performs either scv or the old >>> mechanism with the same calling convention. Then if the kernel doesn't >>> provide it (because the kernel is too old) libc would have to provide >>> its own stub that uses the legacy method and matches the calling >>> convention of the one the kernel is expected to provide. >> >> What about pthread cancellation and the requirement of checking the >> cancellable syscall anchors in asynchronous cancellation? My plan is >> still to use musl strategy on glibc (BZ#12683) and for i686 it >> requires to always use old int$128 for program that uses cancellation >> (static case) or just threads (dynamic mode, which should be more >> common on glibc). >> >> Using the i686 strategy of a vDSO bridge symbol would require to always >> fallback to 'sc' to still use the same cancellation strategy (and >> thus defeating this optimization in such cases). > > Yes, I assumed it would be the same, ignoring the new syscall > mechanism for cancellable syscalls. While there are some exceptions, > cancellable syscalls are generally not hot paths but things that are > expected to block and to have significant amounts of work to do in > kernelspace, so saving a few tens of cycles is rather pointless. > > It's possible to do a branch/multiple versions of the syscall asm for > cancellation but would require extending the cancellation handler to > support checking against multiple independent address ranges or using > some alternate markup of them. 
The main issue is at least for glibc dynamic linking is way more common than static linking and once the program become multithread the fallback will be always used. >>> >>> I'm not relying on static linking optimizing out the cancellable >>> version. I'm talking about how cancellable syscalls are pretty much >>> all "heavy" operations to begin with where a few tens of cycles are in >>> the realm of "measurement noise" relative to the dominating time >>> costs. >> >> Yes I am aware, but at same time I am not sure how it plays on real world. >> For instance, some workloads might issue kernel query syscalls, such as >> recv, where buffer copying might not be dominant factor. So I see that if >> the idea is optimizing syscall mechanism, we should try to leverage it >> as whole in libc. > > Have you timed a minimal recv? I'm not assuming buffer copying is the > dominant factor. I'm assuming the overhead of all the kernel layers > involved is dominant. Not really, but reading the advantages of using 'scv' over 'sc' also does not outline the real expect gain. Taking in consideration this should be a micro-optimization (focused on entry syscall patch), I think we should use where it possible. > And besides the cancellation performance issue, a new bridge vDSO mechanism will still require to setup some extra bridge for the case of the older kernel. In the scheme you suggested: __asm__("indirect call" ... with common clobbers); The indirect call will be either the vDSO bridge or an libc provided that fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain against: if (hwcap & PPC_FEATURE2_SCV) { __asm__(... with some clobbers); } else { __asm__(... 
with different clobbers); } >>> >>> If the indirect call can be made roughly as efficiently as the sc >>> sequence now (which already have some cost due to handling the nasty >>> error return convention, making the indirect call likely just as small >>> or smaller), it's O(1) additional code size (and thus icache usage) >>> rather than O(n) where n is number of syscall points. >>> >>> Of course it would work just as well (for avoiding O(n) growth) to >>> have a direct call to out-of-line branch like you suggested. >> >> Yes, but does it really matter to optimize this specific usage case >> for size? glibc, for instance, tries to leverage the syscall mechanism >> by adding some complex pre-processor asm directives. It optimizes >> the syscall code size in most cases. For instance, kill in static case >> generates on x86_64: >> >> <__kill>: >>0: b8 3e 00 00 00 mov$0x3e,%eax >>5: 0f 05 syscall >>7: 48 3d 01 f0 ff ff cmp$0xf001,%rax >>d: 0f 83 00
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 02:31:51PM -0400, Rich Felker wrote:
> > While on musl:
> >
> > :
> >    0: 48 83 ec 08       sub    $0x8,%rsp
> >    4: 48 63 ff          movslq %edi,%rdi
> >    7: 48 63 f6          movslq %esi,%rsi
> >    a: b8 3e 00 00 00    mov    $0x3e,%eax
> >    f: 0f 05             syscall
> >   11: 48 89 c7          mov    %rax,%rdi
> >   14: e8 00 00 00 00    callq  19
> >   19: 5a                pop    %rdx
> >   1a: c3                retq
>
> Wow that's some extraordinarily bad codegen going on by gcc... The
> sign-extension is semantically needed and I don't see a good way
> around it (glibc's asm is kinda a hack taking advantage of kernel not
> looking at high bits, I think), but the gratuitous stack adjustment
> and refusal to generate a tail call isn't. I'll see if we can track
> down what's going on and get it fixed.

It seems to be https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14441
which I've updated with a comment about the above.

Rich
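To see why the `movslq` sign-extensions are semantically needed: `kill()` takes 32-bit `pid_t` and `int` arguments, while the syscall argument registers are 64 bits wide. A small sketch, assuming an LP64 target (64-bit `long`, 32-bit `int`); the helper names are illustrative only and do not come from musl or glibc.

```c
#include <assert.h>

/* Widen a 32-bit argument the way C's int -> long conversion does:
 * the sign bit is replicated into the upper 32 bits (movslq on x86_64). */
static long widen_signed(int arg)
{
	return (long)arg;
}

/* What the register would hold if the upper bits were simply zeroed
 * instead (zero-extension). */
static long widen_zero(int arg)
{
	/* The comparison only makes sense when long is wider than int. */
	assert(sizeof(long) > sizeof(int));
	return (long)(unsigned int)arg;
}
```

For `kill(-1, sig)`, `widen_signed(-1)` is `-1`, while `widen_zero(-1)` is `4294967295` on LP64, a different 64-bit value. Skipping the extension is therefore only safe if the kernel is known to ignore the upper bits of that argument.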
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote: > > > On 16/04/2020 14:59, Rich Felker wrote: > > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: > >> > >> > >> On 16/04/2020 12:37, Rich Felker wrote: > >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: > > My preference would be that it work just like the i386 AT_SYSINFO > > where you just replace "int $128" with "call *%%gs:16" and the kernel > > provides a stub in the vdso that performs either scv or the old > > mechanism with the same calling convention. Then if the kernel doesn't > > provide it (because the kernel is too old) libc would have to provide > > its own stub that uses the legacy method and matches the calling > > convention of the one the kernel is expected to provide. > > What about pthread cancellation and the requirement of checking the > cancellable syscall anchors in asynchronous cancellation? My plan is > still to use musl strategy on glibc (BZ#12683) and for i686 it > requires to always use old int$128 for program that uses cancellation > (static case) or just threads (dynamic mode, which should be more > common on glibc). > > Using the i686 strategy of a vDSO bridge symbol would require to always > fallback to 'sc' to still use the same cancellation strategy (and > thus defeating this optimization in such cases). > >>> > >>> Yes, I assumed it would be the same, ignoring the new syscall > >>> mechanism for cancellable syscalls. While there are some exceptions, > >>> cancellable syscalls are generally not hot paths but things that are > >>> expected to block and to have significant amounts of work to do in > >>> kernelspace, so saving a few tens of cycles is rather pointless. 
> >>> > >>> It's possible to do a branch/multiple versions of the syscall asm for > >>> cancellation but would require extending the cancellation handler to > >>> support checking against multiple independent address ranges or using > >>> some alternate markup of them. > >> > >> The main issue is at least for glibc dynamic linking is way more common > >> than static linking and once the program become multithread the fallback > >> will be always used. > > > > I'm not relying on static linking optimizing out the cancellable > > version. I'm talking about how cancellable syscalls are pretty much > > all "heavy" operations to begin with where a few tens of cycles are in > > the realm of "measurement noise" relative to the dominating time > > costs. > > Yes I am aware, but at same time I am not sure how it plays on real world. > For instance, some workloads might issue kernel query syscalls, such as > recv, where buffer copying might not be dominant factor. So I see that if > the idea is optimizing syscall mechanism, we should try to leverage it > as whole in libc. Have you timed a minimal recv? I'm not assuming buffer copying is the dominant factor. I'm assuming the overhead of all the kernel layers involved is dominant. > >> And besides the cancellation performance issue, a new bridge vDSO mechanism > >> will still require to setup some extra bridge for the case of the older > >> kernel. In the scheme you suggested: > >> > >> __asm__("indirect call" ... with common clobbers); > >> > >> The indirect call will be either the vDSO bridge or an libc provided that > >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain > >> against: > >> > >>if (hwcap & PPC_FEATURE2_SCV) { > >> __asm__(... with some clobbers); > >>} else { > >> __asm__(... 
with different clobbers); > >>} > > > > If the indirect call can be made roughly as efficiently as the sc > > sequence now (which already have some cost due to handling the nasty > > error return convention, making the indirect call likely just as small > > or smaller), it's O(1) additional code size (and thus icache usage) > > rather than O(n) where n is number of syscall points. > > > > Of course it would work just as well (for avoiding O(n) growth) to > > have a direct call to out-of-line branch like you suggested. > > Yes, but does it really matter to optimize this specific usage case > for size? glibc, for instance, tries to leverage the syscall mechanism > by adding some complex pre-processor asm directives. It optimizes > the syscall code size in most cases. For instance, kill in static case > generates on x86_64: > > <__kill>: >0: b8 3e 00 00 00 mov$0x3e,%eax >5: 0f 05 syscall >7: 48 3d 01 f0 ff ff cmp$0xf001,%rax >d: 0f 83 00 00 00 00 jae13 <__kill+0x13> > 13: c3 retq > > While on musl: > > : >0: 48 83 ec 08 sub$0x8,%rsp >4: 48 63 ffmovslq %edi,%rdi >7: 48 63 f6movslq %esi,%rsi >a: b8 3e 00 00 00 mov$0
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On 16/04/2020 14:59, Rich Felker wrote: > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: >> >> >> On 16/04/2020 12:37, Rich Felker wrote: >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: > My preference would be that it work just like the i386 AT_SYSINFO > where you just replace "int $128" with "call *%%gs:16" and the kernel > provides a stub in the vdso that performs either scv or the old > mechanism with the same calling convention. Then if the kernel doesn't > provide it (because the kernel is too old) libc would have to provide > its own stub that uses the legacy method and matches the calling > convention of the one the kernel is expected to provide. What about pthread cancellation and the requirement of checking the cancellable syscall anchors in asynchronous cancellation? My plan is still to use musl strategy on glibc (BZ#12683) and for i686 it requires to always use old int$128 for program that uses cancellation (static case) or just threads (dynamic mode, which should be more common on glibc). Using the i686 strategy of a vDSO bridge symbol would require to always fallback to 'sc' to still use the same cancellation strategy (and thus defeating this optimization in such cases). >>> >>> Yes, I assumed it would be the same, ignoring the new syscall >>> mechanism for cancellable syscalls. While there are some exceptions, >>> cancellable syscalls are generally not hot paths but things that are >>> expected to block and to have significant amounts of work to do in >>> kernelspace, so saving a few tens of cycles is rather pointless. >>> >>> It's possible to do a branch/multiple versions of the syscall asm for >>> cancellation but would require extending the cancellation handler to >>> support checking against multiple independent address ranges or using >>> some alternate markup of them. 
>> >> The main issue is at least for glibc dynamic linking is way more common >> than static linking and once the program become multithread the fallback >> will be always used. > > I'm not relying on static linking optimizing out the cancellable > version. I'm talking about how cancellable syscalls are pretty much > all "heavy" operations to begin with where a few tens of cycles are in > the realm of "measurement noise" relative to the dominating time > costs. Yes I am aware, but at same time I am not sure how it plays on real world. For instance, some workloads might issue kernel query syscalls, such as recv, where buffer copying might not be dominant factor. So I see that if the idea is optimizing syscall mechanism, we should try to leverage it as whole in libc. > >> And besides the cancellation performance issue, a new bridge vDSO mechanism >> will still require to setup some extra bridge for the case of the older >> kernel. In the scheme you suggested: >> >> __asm__("indirect call" ... with common clobbers); >> >> The indirect call will be either the vDSO bridge or an libc provided that >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain >> against: >> >>if (hwcap & PPC_FEATURE2_SCV) { >> __asm__(... with some clobbers); >>} else { >> __asm__(... with different clobbers); >>} > > If the indirect call can be made roughly as efficiently as the sc > sequence now (which already have some cost due to handling the nasty > error return convention, making the indirect call likely just as small > or smaller), it's O(1) additional code size (and thus icache usage) > rather than O(n) where n is number of syscall points. > > Of course it would work just as well (for avoiding O(n) growth) to > have a direct call to out-of-line branch like you suggested. Yes, but does it really matter to optimize this specific usage case for size? glibc, for instance, tries to leverage the syscall mechanism by adding some complex pre-processor asm directives. 
It optimizes the syscall code size in most cases. For instance, kill in static case generates on x86_64:

<__kill>:
   0: b8 3e 00 00 00       mov    $0x3e,%eax
   5: 0f 05                syscall
   7: 48 3d 01 f0 ff ff    cmp    $0xf001,%rax
   d: 0f 83 00 00 00 00    jae    13 <__kill+0x13>
  13: c3                   retq

While on musl:

:
   0: 48 83 ec 08          sub    $0x8,%rsp
   4: 48 63 ff             movslq %edi,%rdi
   7: 48 63 f6             movslq %esi,%rsi
   a: b8 3e 00 00 00       mov    $0x3e,%eax
   f: 0f 05                syscall
  11: 48 89 c7             mov    %rax,%rdi
  14: e8 00 00 00 00       callq  19
  19: 5a                   pop    %rdx
  1a: c3                   retq

But I hardly think it pays off the required code complexity. Same for providing an O(1) bridge: this will require additional complexity to write it and set it up correctly.

>
>> Specially
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
* Rich Felker: > On Thu, Apr 16, 2020 at 06:42:32PM +0200, Florian Weimer wrote: >> * Rich Felker: >> >> > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote: >> >> * Rich Felker: >> >> >> >> > My preference would be that it work just like the i386 AT_SYSINFO >> >> > where you just replace "int $128" with "call *%%gs:16" and the kernel >> >> > provides a stub in the vdso that performs either scv or the old >> >> > mechanism with the same calling convention. >> >> >> >> The i386 mechanism has received some criticism because it provides an >> >> effective means to redirect execution flow to anyone who can write to >> >> the TCB. I am not sure if it makes sense to copy it. >> > >> > Indeed that's a good point. Do you have ideas for making it equally >> > efficient without use of a function pointer in the TCB? >> >> We could add a shared non-writable mapping at a 64K offset from the >> thread pointer and store the function pointer or the code there. Then >> it would be safe. >> >> However, since this is apparently tied to POWER9 and we already have a >> POWER9 multilib, and assuming that we are going to backport the kernel >> change, I would tweak the selection criterion for that multilib to >> include the new HWCAP2 flag. If a user runs this glibc on a kernel >> which does not have support, they will get set baseline (POWER8) >> multilib, which still works. This way, outside the dynamic loader, no >> run-time dispatch is needed at all. I guess this is not at all the >> answer you were looking for. 8-) > > How does this work with -static? :-) -static is not supported. 8-) (If you use the unsupported static libraries, you get POWER8 code.) (Just to be clear, in case someone doesn't get the joke: This is about a potential approach for a heavily constrained, vertically integrated environment. It does not reflect general glibc recommendations.) 
>> If a single binary is needed, I would perhaps follow what Arm did for >> -moutline-atomics: lay out the code so that its easy to execute for >> the non-POWER9 case, assuming that POWER9 machines will be better at >> predicting things than their predecessors. >> >> Or you could also put the function pointer into a RELRO segment. Then >> there's overlap with the __libc_single_threaded discussion, where >> people objected to this kind of optimization (although I did not >> propose to change the TCB ABI, that would be required for >> __libc_single_threaded because it's an external interface). > > Of course you can use a normal global, but now every call point needs > to setup a TOC pointer (= two entry points and more icache lines for > otherwise trivial functions). > > I think my choice would be just making the inline syscall be a single > call insn to an asm source file that out-of-lines the loading of TOC > pointer and call through it or branch based on hwcap so that it's not > repeated all over the place. I don't know how problematic control flow out of an inline asm is on POWER. But this is basically the -moutline-atomics approach. > Alternatively, it would perhaps work to just put hwcap in the TCB and > branch on it rather than making an indirect call to a function pointer > in the TCB, so that the worst you could do by clobbering it is execute > the wrong syscall insn and thereby get SIGILL. The HWCAP is already in the TCB. I expect this is what generic glibc builds are going to use (perhaps with a bit of tweaking favorable to POWER8 implementations, but we'll see).
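The "branch on hwcap in the TCB" alternative can be sketched in portable C. This is an illustration under stated assumptions, not glibc's or musl's code: `struct tcb_model`, `pick_syscall_insn` and the enum are hypothetical names, and `MODEL_PPC_FEATURE2_SCV` is defined locally with a value assumed to match `PPC_FEATURE2_SCV` in the Linux powerpc uapi headers. A real implementation would load the field relative to the thread pointer and execute `scv 0` or `sc` in assembly.

```c
#include <assert.h>

/* Assumed to match PPC_FEATURE2_SCV in the Linux powerpc uapi
 * <asm/cputable.h>; defined locally so the sketch builds anywhere. */
#define MODEL_PPC_FEATURE2_SCV 0x00100000UL

enum syscall_insn { INSN_SC, INSN_SCV };

/* Hypothetical stand-in for the thread control block: only the cached
 * AT_HWCAP2 word matters here. */
struct tcb_model {
	unsigned long hwcap2;
};

/* Choose the entry instruction from data, not from a code pointer.
 * A corrupted hwcap2 can only select the other instruction (worst case
 * SIGILL where scv is unsupported); it cannot redirect control flow to
 * an arbitrary address the way a writable function pointer can. */
static enum syscall_insn pick_syscall_insn(const struct tcb_model *tcb)
{
	assert(tcb != 0);
	return (tcb->hwcap2 & MODEL_PPC_FEATURE2_SCV) ? INSN_SCV : INSN_SC;
}
```

On Linux the `hwcap2` word would typically be filled once at startup from `getauxval(AT_HWCAP2)`; the design trade-off against the AT_SYSINFO-style pointer is one extra conditional branch per syscall in exchange for not keeping a call target in writable per-thread memory.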
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: > > > On 16/04/2020 12:37, Rich Felker wrote: > > On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: > >>> My preference would be that it work just like the i386 AT_SYSINFO > >>> where you just replace "int $128" with "call *%%gs:16" and the kernel > >>> provides a stub in the vdso that performs either scv or the old > >>> mechanism with the same calling convention. Then if the kernel doesn't > >>> provide it (because the kernel is too old) libc would have to provide > >>> its own stub that uses the legacy method and matches the calling > >>> convention of the one the kernel is expected to provide. > >> > >> What about pthread cancellation and the requirement of checking the > >> cancellable syscall anchors in asynchronous cancellation? My plan is > >> still to use musl strategy on glibc (BZ#12683) and for i686 it > >> requires to always use old int$128 for program that uses cancellation > >> (static case) or just threads (dynamic mode, which should be more > >> common on glibc). > >> > >> Using the i686 strategy of a vDSO bridge symbol would require to always > >> fallback to 'sc' to still use the same cancellation strategy (and > >> thus defeating this optimization in such cases). > > > > Yes, I assumed it would be the same, ignoring the new syscall > > mechanism for cancellable syscalls. While there are some exceptions, > > cancellable syscalls are generally not hot paths but things that are > > expected to block and to have significant amounts of work to do in > > kernelspace, so saving a few tens of cycles is rather pointless. > > > > It's possible to do a branch/multiple versions of the syscall asm for > > cancellation but would require extending the cancellation handler to > > support checking against multiple independent address ranges or using > > some alternate markup of them. 
> > The main issue is at least for glibc dynamic linking is way more common > than static linking and once the program become multithread the fallback > will be always used. I'm not relying on static linking optimizing out the cancellable version. I'm talking about how cancellable syscalls are pretty much all "heavy" operations to begin with where a few tens of cycles are in the realm of "measurement noise" relative to the dominating time costs. > And besides the cancellation performance issue, a new bridge vDSO mechanism > will still require to setup some extra bridge for the case of the older > kernel. In the scheme you suggested: > > __asm__("indirect call" ... with common clobbers); > > The indirect call will be either the vDSO bridge or an libc provided that > fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain > against: > >if (hwcap & PPC_FEATURE2_SCV) { > __asm__(... with some clobbers); >} else { > __asm__(... with different clobbers); >} If the indirect call can be made roughly as efficiently as the sc sequence now (which already have some cost due to handling the nasty error return convention, making the indirect call likely just as small or smaller), it's O(1) additional code size (and thus icache usage) rather than O(n) where n is number of syscall points. Of course it would work just as well (for avoiding O(n) growth) to have a direct call to out-of-line branch like you suggested. > Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a > TCB member (as we do on glibc) and if we could make the asm clever > enough to not require different clobbers (although not sure if > it would be possible). The easy way not to require different clobbers is just using the union of the clobbers, no? Does the proposed new method clobber any call-saved registers that would make it painful (requiring new call frames to save them in)? Rich
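The "union of the clobbers" idea can be sketched with clobber sets modelled as bitmasks. This is only an illustration: the bit names are hypothetical, the 'sc' set is an assumed reading of the syscall64 ABI document, and the 'scv' additions (LR, CR1, CR5-CR7) are the ones listed in the proposal quoted in this thread.

```c
/* Hypothetical bit names for registers a syscall sequence may clobber. */
enum {
    CLOB_R0      = 1u << 0,
    CLOB_R4_R12  = 1u << 1,  /* argument and scratch GPRs */
    CLOB_CR0     = 1u << 2,
    CLOB_CTR     = 1u << 3,
    CLOB_XER     = 1u << 4,
    CLOB_LR      = 1u << 5,
    CLOB_CR1     = 1u << 6,
    CLOB_CR5_CR7 = 1u << 7,
    CLOB_MEMORY  = 1u << 8
};

enum {
    /* Assumed clobber set for the legacy 'sc' sequence (illustrative). */
    SC_CLOBBERS  = CLOB_R0 | CLOB_R4_R12 | CLOB_CR0 | CLOB_CTR |
                   CLOB_XER | CLOB_MEMORY,
    /* 'scv 0' as proposed: the same set plus LR, CR1 and CR5-CR7. */
    SCV_CLOBBERS = SC_CLOBBERS | CLOB_LR | CLOB_CR1 | CLOB_CR5_CR7
};

/* A single asm statement that may execute either instruction has to
 * declare everything that either one can clobber. */
static unsigned clobber_union(unsigned a, unsigned b)
{
    return a | b;
}
```

Under these assumptions the union is simply the scv set, so one inline asm with that clobber list would cover both instructions.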
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On 16/04/2020 12:37, Rich Felker wrote: > On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >>> My preference would be that it work just like the i386 AT_SYSINFO >>> where you just replace "int $128" with "call *%%gs:16" and the kernel >>> provides a stub in the vdso that performs either scv or the old >>> mechanism with the same calling convention. Then if the kernel doesn't >>> provide it (because the kernel is too old) libc would have to provide >>> its own stub that uses the legacy method and matches the calling >>> convention of the one the kernel is expected to provide. >> >> What about pthread cancellation and the requirement of checking the >> cancellable syscall anchors in asynchronous cancellation? My plan is >> still to use musl strategy on glibc (BZ#12683) and for i686 it >> requires to always use old int$128 for program that uses cancellation >> (static case) or just threads (dynamic mode, which should be more >> common on glibc). >> >> Using the i686 strategy of a vDSO bridge symbol would require to always >> fallback to 'sc' to still use the same cancellation strategy (and >> thus defeating this optimization in such cases). > > Yes, I assumed it would be the same, ignoring the new syscall > mechanism for cancellable syscalls. While there are some exceptions, > cancellable syscalls are generally not hot paths but things that are > expected to block and to have significant amounts of work to do in > kernelspace, so saving a few tens of cycles is rather pointless. > > It's possible to do a branch/multiple versions of the syscall asm for > cancellation but would require extending the cancellation handler to > support checking against multiple independent address ranges or using > some alternate markup of them. The main issue is at least for glibc dynamic linking is way more common than static linking and once the program become multithread the fallback will be always used. 
And besides the cancellation performance issue, a new bridge vDSO mechanism will still require setting up some extra bridge for the case of the older kernel. In the scheme you suggested:

  __asm__("indirect call" ... with common clobbers);

The indirect call will be either the vDSO bridge or a libc-provided one that falls back to 'sc' for !PPC_FEATURE2_SCV. I am not sure this is really a gain against:

  if (hwcap & PPC_FEATURE2_SCV) {
    __asm__(... with some clobbers);
  } else {
    __asm__(... with different clobbers);
  }

Especially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a TCB member (as we do on glibc) and if we could make the asm clever enough to not require different clobbers (although I am not sure if it would be possible).
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 06:42:32PM +0200, Florian Weimer wrote: > * Rich Felker: > > > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote: > >> * Rich Felker: > >> > >> > My preference would be that it work just like the i386 AT_SYSINFO > >> > where you just replace "int $128" with "call *%%gs:16" and the kernel > >> > provides a stub in the vdso that performs either scv or the old > >> > mechanism with the same calling convention. > >> > >> The i386 mechanism has received some criticism because it provides an > >> effective means to redirect execution flow to anyone who can write to > >> the TCB. I am not sure if it makes sense to copy it. > > > > Indeed that's a good point. Do you have ideas for making it equally > > efficient without use of a function pointer in the TCB? > > We could add a shared non-writable mapping at a 64K offset from the > thread pointer and store the function pointer or the code there. Then > it would be safe. > > However, since this is apparently tied to POWER9 and we already have a > POWER9 multilib, and assuming that we are going to backport the kernel > change, I would tweak the selection criterion for that multilib to > include the new HWCAP2 flag. If a user runs this glibc on a kernel > which does not have support, they will get set baseline (POWER8) > multilib, which still works. This way, outside the dynamic loader, no > run-time dispatch is needed at all. I guess this is not at all the > answer you were looking for. 8-) How does this work with -static? :-) > If a single binary is needed, I would perhaps follow what Arm did for > -moutline-atomics: lay out the code so that its easy to execute for > the non-POWER9 case, assuming that POWER9 machines will be better at > predicting things than their predecessors. > > Or you could also put the function pointer into a RELRO segment. 
Then > there's overlap with the __libc_single_threaded discussion, where > people objected to this kind of optimization (although I did not > propose to change the TCB ABI, that would be required for > __libc_single_threaded because it's an external interface). Of course you can use a normal global, but now every call point needs to setup a TOC pointer (= two entry points and more icache lines for otherwise trivial functions). I think my choice would be just making the inline syscall be a single call insn to an asm source file that out-of-lines the loading of TOC pointer and call through it or branch based on hwcap so that it's not repeated all over the place. Alternatively, it would perhaps work to just put hwcap in the TCB and branch on it rather than making an indirect call to a function pointer in the TCB, so that the worst you could do by clobbering it is execute the wrong syscall insn and thereby get SIGILL. Rich
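A rough sketch of the "put hwcap in the TCB and branch on it" variant follows. The struct and function names are hypothetical and the real TCB layout is not shown here. The point is that the value loaded from the TCB only selects between two fixed instructions, so a corrupted value cannot redirect control flow to an arbitrary address.

```c
#define PPC_FEATURE2_SCV 0x00100000UL

/* Hypothetical stand-in for the thread control block; only the cached
 * HWCAP2 word matters for this sketch. */
struct tcb_sketch {
    unsigned long hwcap2;
};

enum syscall_insn { INSN_SC, INSN_SCV };

/* Branch on the cached capability bit instead of calling through a
 * pointer stored in the TCB. */
static enum syscall_insn pick_insn(const struct tcb_sketch *tcb)
{
    return (tcb->hwcap2 & PPC_FEATURE2_SCV) ? INSN_SCV : INSN_SC;
}
```

In a real libc the two outcomes would be the 'sc' and 'scv 0' instruction sequences; clobbering the cached word would at worst select the unsupported one and raise SIGILL.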
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
* Rich Felker: > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote: >> * Rich Felker: >> >> > My preference would be that it work just like the i386 AT_SYSINFO >> > where you just replace "int $128" with "call *%%gs:16" and the kernel >> > provides a stub in the vdso that performs either scv or the old >> > mechanism with the same calling convention. >> >> The i386 mechanism has received some criticism because it provides an >> effective means to redirect execution flow to anyone who can write to >> the TCB. I am not sure if it makes sense to copy it. > > Indeed that's a good point. Do you have ideas for making it equally > efficient without use of a function pointer in the TCB? We could add a shared non-writable mapping at a 64K offset from the thread pointer and store the function pointer or the code there. Then it would be safe. However, since this is apparently tied to POWER9 and we already have a POWER9 multilib, and assuming that we are going to backport the kernel change, I would tweak the selection criterion for that multilib to include the new HWCAP2 flag. If a user runs this glibc on a kernel which does not have support, they will get the baseline (POWER8) multilib, which still works. This way, outside the dynamic loader, no run-time dispatch is needed at all. I guess this is not at all the answer you were looking for. 8-) If a single binary is needed, I would perhaps follow what Arm did for -moutline-atomics: lay out the code so that it's easy to execute for the non-POWER9 case, assuming that POWER9 machines will be better at predicting things than their predecessors. Or you could also put the function pointer into a RELRO segment. Then there's overlap with the __libc_single_threaded discussion, where people objected to this kind of optimization (although I did not propose to change the TCB ABI, that would be required for __libc_single_threaded because it's an external interface).
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 11:21:56AM -0400, Jeffrey Walton wrote: > On Wed, Apr 15, 2020 at 8:17 PM Nicholas Piggin wrote: > > > > Excerpts from Rich Felker's message of April 16, 2020 8:55 am: > > > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: > > >> I would like to enable Linux support for the powerpc 'scv' instruction, > > >> as a faster system call instruction. > > >> > > >> This requires two things to be defined: Firstly a way to advertise to > > >> userspace that kernel supports scv, and a way to allocate and advertise > > >> support for individual scv vectors. Secondly, a calling convention ABI > > >> for this new instruction. > > >> ... > > > Note that any libc that actually makes use of the new functionality is > > > not going to be able to make clobbers conditional on support for it; > > > branching around different clobbers is going to defeat any gains vs > > > always just treating anything clobbered by either method as clobbered. > > > > Well it would have to test HWCAP and patch in or branch to two > > completely different sequences including register save/restores yes. > > You could have the same asm and matching clobbers to put the sequence > > inline and then you could patch the one sc/scv instruction I suppose. > > Could GCC function multiversioning work here? > https://gcc.gnu.org/wiki/FunctionMultiVersioning > > It seems like selecting a runtime version of a function is the sort of > thing you are trying to do. On glibc it potentially could. This is ifunc-based functionality though and musl explicitly does not (and will not) support ifunc because of lots of fundamental problems it entails. But even on glibc the underlying mechanisms for ifunc are just the same as a normal indirect call and there's no real reason to prefer implementing it with ifunc/multiversioning vs directly. Rich
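To illustrate the "normal indirect call" alternative to ifunc/multiversioning, here is a minimal sketch in which libc picks an entry routine once at startup and every later syscall goes through the stored pointer. All names are hypothetical, and the two entry functions are placeholders that only report which path was taken; real ones would contain the 'sc' and 'scv 0' sequences.

```c
#define PPC_FEATURE2_SCV 0x00100000UL

typedef long (*syscall_entry)(long nr, long arg);

/* Placeholder entry routines (hypothetical): return a tag instead of
 * entering the kernel so the selection logic can be observed. */
static long entry_sc(long nr, long arg)  { (void)nr; (void)arg; return 0; }
static long entry_scv(long nr, long arg) { (void)nr; (void)arg; return 1; }

/* Defaults to the legacy path until startup code has looked at HWCAP2. */
static syscall_entry syscall_ptr = entry_sc;

/* Called once during startup with the HWCAP2 value from the aux vector. */
static void init_syscall_ptr(unsigned long hwcap2)
{
    syscall_ptr = (hwcap2 & PPC_FEATURE2_SCV) ? entry_scv : entry_sc;
}

/* Every syscall site performs the same indirect call. */
static long do_syscall(long nr, long arg)
{
    return syscall_ptr(nr, arg);
}
```

Where the pointer lives (TCB, RELRO data, a read-only mapping) is the security question raised elsewhere in the thread; this sketch uses a plain writable global only for brevity.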
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: > > My preference would be that it work just like the i386 AT_SYSINFO > > where you just replace "int $128" with "call *%%gs:16" and the kernel > > provides a stub in the vdso that performs either scv or the old > > mechanism with the same calling convention. Then if the kernel doesn't > > provide it (because the kernel is too old) libc would have to provide > > its own stub that uses the legacy method and matches the calling > > convention of the one the kernel is expected to provide. > > What about pthread cancellation and the requirement of checking the > cancellable syscall anchors in asynchronous cancellation? My plan is > still to use musl strategy on glibc (BZ#12683) and for i686 it > requires to always use old int$128 for program that uses cancellation > (static case) or just threads (dynamic mode, which should be more > common on glibc). > > Using the i686 strategy of a vDSO bridge symbol would require to always > fallback to 'sc' to still use the same cancellation strategy (and > thus defeating this optimization in such cases). Yes, I assumed it would be the same, ignoring the new syscall mechanism for cancellable syscalls. While there are some exceptions, cancellable syscalls are generally not hot paths but things that are expected to block and to have significant amounts of work to do in kernelspace, so saving a few tens of cycles is rather pointless. It's possible to do a branch/multiple versions of the syscall asm for cancellation but would require extending the cancellation handler to support checking against multiple independent address ranges or using some alternate markup of them. Rich
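Extending the cancellation handler to several independent address ranges could look roughly like the following. This is a sketch under assumptions: the struct and function names are hypothetical, and it models the check as "is the interrupted program counter inside any cancellable syscall sequence", which is the kind of test a begin/end-marker scheme performs for a single range.

```c
#include <stddef.h>
#include <stdint.h>

/* One cancellable code region, as a half-open interval [begin, end). */
struct cp_range {
    uintptr_t begin;
    uintptr_t end;
};

/* Returns 1 if the interrupted pc lies inside any registered region,
 * e.g. one region for the 'sc' variant and one for the 'scv' variant. */
static int pc_in_cancellable_region(uintptr_t pc,
                                    const struct cp_range *ranges,
                                    size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (pc >= ranges[i].begin && pc < ranges[i].end)
            return 1;
    }
    return 0;
}
```

With two syscall variants the signal handler would pass a two-element table instead of comparing against a single pair of markers.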
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote: > * Rich Felker: > > > My preference would be that it work just like the i386 AT_SYSINFO > > where you just replace "int $128" with "call *%%gs:16" and the kernel > > provides a stub in the vdso that performs either scv or the old > > mechanism with the same calling convention. > > The i386 mechanism has received some criticism because it provides an > effective means to redirect execution flow to anyone who can write to > the TCB. I am not sure if it makes sense to copy it. Indeed that's a good point. Do you have ideas for making it equally efficient without use of a function pointer in the TCB? Rich
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Wed, Apr 15, 2020 at 8:17 PM Nicholas Piggin wrote: > > Excerpts from Rich Felker's message of April 16, 2020 8:55 am: > > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: > >> I would like to enable Linux support for the powerpc 'scv' instruction, > >> as a faster system call instruction. > >> > >> This requires two things to be defined: Firstly a way to advertise to > >> userspace that kernel supports scv, and a way to allocate and advertise > >> support for individual scv vectors. Secondly, a calling convention ABI > >> for this new instruction. > >> ... > > Note that any libc that actually makes use of the new functionality is > > not going to be able to make clobbers conditional on support for it; > > branching around different clobbers is going to defeat any gains vs > > always just treating anything clobbered by either method as clobbered. > > Well it would have to test HWCAP and patch in or branch to two > completely different sequences including register save/restores yes. > You could have the same asm and matching clobbers to put the sequence > inline and then you could patch the one sc/scv instruction I suppose. Could GCC function multiversioning work here? https://gcc.gnu.org/wiki/FunctionMultiVersioning It seems like selecting a runtime version of a function is the sort of thing you are trying to do. Jeff
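For reference, the HWCAP test mentioned in the quoted text can be sketched with getauxval(). The helper names are hypothetical. The fallback definitions are assumptions that mirror the values in the Linux uapi headers and are only used when the system headers do not provide them.

```c
#include <sys/auxv.h>

#ifndef AT_HWCAP2
#define AT_HWCAP2 26                 /* assumed aux vector key */
#endif
#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000  /* assumed powerpc HWCAP2 bit for scv */
#endif

/* Pure helper: does this HWCAP2 word have the scv bit set? */
static int hwcap2_has_scv(unsigned long hwcap2)
{
    return (hwcap2 & PPC_FEATURE2_SCV) != 0;
}

/* Asks the running kernel. The result is only meaningful on powerpc;
 * on other architectures the same bit position means something else. */
static int kernel_advertises_scv(void)
{
    return hwcap2_has_scv(getauxval(AT_HWCAP2));
}
```

A libc would typically read the aux vector once during startup and cache the result rather than calling getauxval() at each syscall site.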
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
* Nicholas Piggin via Libc-alpha [2020-04-16 10:16:54 +1000]: > Well it would have to test HWCAP and patch in or branch to two > completely different sequences including register save/restores yes. > You could have the same asm and matching clobbers to put the sequence > inline and then you could patch the one sc/scv instruction I suppose. how would that 'patch' work? there are many reasons why you don't want libc to write its .text
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On 15/04/2020 19:55, Rich Felker wrote: > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: >> I would like to enable Linux support for the powerpc 'scv' instruction, >> as a faster system call instruction. >> >> This requires two things to be defined: Firstly a way to advertise to >> userspace that kernel supports scv, and a way to allocate and advertise >> support for individual scv vectors. Secondly, a calling convention ABI >> for this new instruction. >> >> Thanks to those who commented last time, since then I have removed my >> answered questions and unpopular alternatives but you can find them >> here >> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html >> >> Let me try one more with a wider cc list, and then we'll get something >> merged. Any questions or counter-opinions are welcome. >> >> System Call Vectored (scv) ABI >> == >> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an >> rfscv counter-part. The benefit of these instructions is performance >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR >> updates. The scv instruction has 128 interrupt entry points (not enough >> to cover the Linux system call space). >> >> The proposal is to assign scv numbers very conservatively and allocate >> them as individual HWCAP features as we add support for more. The zero >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. >> >> Advertisement >> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a >> SIGILL in current environments. Linux has defined a HWCAP2 bit >> PPC_FEATURE2_SCV for SCV support, but does not set it. >> >> When scv instruction support and the scv 0 vector for system calls are >> added, PPC_FEATURE2_SCV will indicate support for these. 
Other vectors >> should not be used without future HWCAP bits indicating support, which is >> how we will allocate them. (Should unallocated ones generate SIGILL, or >> return -ENOSYS in r3?) >> >> Calling convention >> >> The proposal is for scv 0 to provide the standard Linux system call ABI >> with the following differences from sc convention[1]: >> >> - LR is to be volatile across scv calls. This is necessary because the >> scv instruction clobbers LR. From previous discussion, this should be >> possible to deal with in GCC clobbers and CFI. >> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the >> kernel system call exit to avoid restoring the CR register (although >> we probably still would anyway to avoid information leak). >> >> - Error handling: I think the consensus has been to move to using negative >> return value in r3 rather than CR0[SO]=1 to indicate error, which matches >> most other architectures and is closer to a function call. >> >> The number of scratch registers (r9-r12) at kernel entry seems >> sufficient that we don't have any costly spilling, patch is here[2]. >> >> [1] >> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html > > My preference would be that it work just like the i386 AT_SYSINFO > where you just replace "int $128" with "call *%%gs:16" and the kernel > provides a stub in the vdso that performs either scv or the old > mechanism with the same calling convention. Then if the kernel doesn't > provide it (because the kernel is too old) libc would have to provide > its own stub that uses the legacy method and matches the calling > convention of the one the kernel is expected to provide. What about pthread cancellation and the requirement of checking the cancellable syscall anchors in asynchronous cancellation? 
My plan is still to use musl strategy on glibc (BZ#12683) and for i686 it requires to always use old int$128 for program that uses cancellation (static case) or just threads (dynamic mode, which should be more common on glibc). Using the i686 strategy of a vDSO bridge symbol would require to always fallback to 'sc' to still use the same cancellation strategy (and thus defeating this optimization in such cases). > Note that any libc that actually makes use of the new functionality is > not going to be able to make clobbers conditional on support for it; > branching around different clobbers is going to defeat any gains vs > always just treating anything clobbered by either method as clobbered. > Likewise, it's not useful to have different error return mechanisms > because the caller just has to branch to support both (or the > kernel-provided stub just has to emulate one for it; that could work > if you really want to change the bad existing convention). > > Thoughts? > > Rich >
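The two error-return conventions under discussion can be compared with a small sketch. The function names are hypothetical. It assumes the legacy 'sc' convention reports an error by setting CR0[SO] with a positive errno in r3, and that the proposed 'scv' convention returns a negative errno in r3 within the usual Linux range of -4095 to -1.

```c
/* Legacy 'sc': convert (r3, CR0[SO]) into a single negative-errno
 * style return value. */
static long normalize_sc_result(long r3, int cr0_so)
{
    return cr0_so ? -r3 : r3;
}

/* Proposed 'scv': r3 already carries the negative errno, so the caller
 * only needs a range check to tell errors from valid results. */
static int is_syscall_error(long ret)
{
    return (unsigned long)ret > -4096UL;
}
```

A caller that has to support both instructions would normalize the 'sc' result first and then apply the same range check, which is the extra branch the quoted objection refers to.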
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
* Rich Felker: > My preference would be that it work just like the i386 AT_SYSINFO > where you just replace "int $128" with "call *%%gs:16" and the kernel > provides a stub in the vdso that performs either scv or the old > mechanism with the same calling convention. The i386 mechanism has received some criticism because it provides an effective means to redirect execution flow to anyone who can write to the TCB. I am not sure if it makes sense to copy it.
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Rich Felker's message of April 16, 2020 1:03 pm: > On Thu, Apr 16, 2020 at 12:53:31PM +1000, Nicholas Piggin wrote: >> > Not to mention the dcache line to access >> > __hwcap or whatever, and the icache lines to setup access TOC-relative >> > access to it. (Of course you could put a copy of its value in TLS at a >> > fixed offset, which would somewhat mitigate both.) >> > >> >> And finally, the HWCAP test can eventually go away in future. A vdso >> >> call can not. >> > >> > We support nearly arbitrarily old kernels (with limited functionality) >> > and hardware (with full functionality) and don't intend for that to >> > change, ever. But indeed glibc might want too eventually drop the >> > check. >> >> Ah, cool. Any build-time flexibility there? >> >> We may or may not be getting a new ABI that will use instructions not >> supported by old processors. >> >> https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html >> >> Current ABI continues to work of course and be the default for some >> time, but building for new one would give some opportunity to drop >> such support for old procs, at least for glibc. > > What does "new ABI" entail to you? In the terminology I use with musl, > "new ABI" and "new ISA level" are different things. You can compile > (explicit -march or compiler default) binaries that won't run on older > cpus due to use of new insns etc., but we consider it the same ABI if > you can link code for an older/baseline ISA level with the > newer-ISA-level object files, i.e. if the interface surface for > linkage remains compatible. We also try to avoid gratuitous > proliferation of different ABIs unless there's a strong underlying > need (like addition of softfloat ABIs for archs that usually have FPU, > or vice versa). Yeah it will be a new ABI type that also requires a new ISA level. 
As far as I know (and I'm not on the toolchain side) there will be some call compatibility between the two, so it may be fine to continue with existing ABI for libc. But it's just something that comes to mind as a build-time cutover where we might be able to assume particular features. > In principle the same could be done for kernels except it's a bigger > silent gotcha (possible ENOSYS in places where it shouldn't be able to > happen rather than a trapping SIGILL or similar) and there's rarely > any serious performance or size benefit to dropping support for older > kernels. Right, I don't think it'd be a huge problem whatever way we go, compared with the cost of the system call. Thanks, Nick
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 12:53:31PM +1000, Nicholas Piggin wrote: > > Not to mention the dcache line to access > > __hwcap or whatever, and the icache lines to setup access TOC-relative > > access to it. (Of course you could put a copy of its value in TLS at a > > fixed offset, which would somewhat mitigate both.) > > > >> And finally, the HWCAP test can eventually go away in future. A vdso > >> call can not. > > > > We support nearly arbitrarily old kernels (with limited functionality) > > and hardware (with full functionality) and don't intend for that to > > change, ever. But indeed glibc might want too eventually drop the > > check. > > Ah, cool. Any build-time flexibility there? > > We may or may not be getting a new ABI that will use instructions not > supported by old processors. > > https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html > > Current ABI continues to work of course and be the default for some > time, but building for new one would give some opportunity to drop > such support for old procs, at least for glibc. What does "new ABI" entail to you? In the terminology I use with musl, "new ABI" and "new ISA level" are different things. You can compile (explicit -march or compiler default) binaries that won't run on older cpus due to use of new insns etc., but we consider it the same ABI if you can link code for an older/baseline ISA level with the newer-ISA-level object files, i.e. if the interface surface for linkage remains compatible. We also try to avoid gratuitous proliferation of different ABIs unless there's a strong underlying need (like addition of softfloat ABIs for archs that usually have FPU, or vice versa). In principle the same could be done for kernels except it's a bigger silent gotcha (possible ENOSYS in places where it shouldn't be able to happen rather than a trapping SIGILL or similar) and there's rarely any serious performance or size benefit to dropping support for older kernels. Rich
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Rich Felker's message of April 16, 2020 12:35 pm: > On Thu, Apr 16, 2020 at 12:24:16PM +1000, Nicholas Piggin wrote: >> >> > Likewise, it's not useful to have different error return mechanisms >> >> > because the caller just has to branch to support both (or the >> >> > kernel-provided stub just has to emulate one for it; that could work >> >> > if you really want to change the bad existing convention). >> >> > >> >> > Thoughts? >> >> >> >> The existing convention has to change somewhat because of the clobbers, >> >> so I thought we could change the error return at the same time. I'm >> >> open to not changing it and using CR0[SO], but others liked the idea. >> >> Pro: it matches sc and vsyscall. Con: it's different from other common >> >> archs. Performnce-wise it would really be a wash -- cost of conditional >> >> branch is not the cmp but the mispredict. >> > >> > If you do the branch on hwcap at each syscall, then you significantly >> > increase code size of every syscall point, likely turning a bunch of >> > trivial functions that didn't need stack frames into ones that do. You >> > also potentially make them need a TOC pointer. Making them all just do >> > an indirect call unconditionally (with pointer in TLS like i386?) is a >> > lot more efficient in code size and at least as good for performance. >> >> I disagree. Doing the long vdso indirect call *necessarily* requires >> touching a new icache line, and even a new TLB entry. Indirect branches > > The increase in number of icache lines from the branch at every > syscall point is far greater than the use of a single extra icache > line shared by all syscalls. That's true, I was thinking of a single function that does the test and calls syscalls, which might be the fair comparison. > Not to mention the dcache line to access > __hwcap or whatever, and the icache lines to setup access TOC-relative > access to it. 
(Of course you could put a copy of its value in TLS at a > fixed offset, which would somewhat mitigate both.) > >> And finally, the HWCAP test can eventually go away in future. A vdso >> call can not. > > We support nearly arbitrarily old kernels (with limited functionality) > and hardware (with full functionality) and don't intend for that to > change, ever. But indeed glibc might want too eventually drop the > check. Ah, cool. Any build-time flexibility there? We may or may not be getting a new ABI that will use instructions not supported by old processors. https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html Current ABI continues to work of course and be the default for some time, but building for new one would give some opportunity to drop such support for old procs, at least for glibc. > >> If you really want to select with an indirect branch rather than >> direct conditional, you can do that all within the library. > > OK. It's a little bit more work if that's not the interface the kernel > will give us, but it's no big deal. Okay. Thanks, Nick
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 12:24:16PM +1000, Nicholas Piggin wrote: > >> > Likewise, it's not useful to have different error return mechanisms > >> > because the caller just has to branch to support both (or the > >> > kernel-provided stub just has to emulate one for it; that could work > >> > if you really want to change the bad existing convention). > >> > > >> > Thoughts? > >> > >> The existing convention has to change somewhat because of the clobbers, > >> so I thought we could change the error return at the same time. I'm > >> open to not changing it and using CR0[SO], but others liked the idea. > >> Pro: it matches sc and vsyscall. Con: it's different from other common > >> archs. Performnce-wise it would really be a wash -- cost of conditional > >> branch is not the cmp but the mispredict. > > > > If you do the branch on hwcap at each syscall, then you significantly > > increase code size of every syscall point, likely turning a bunch of > > trivial functions that didn't need stack frames into ones that do. You > > also potentially make them need a TOC pointer. Making them all just do > > an indirect call unconditionally (with pointer in TLS like i386?) is a > > lot more efficient in code size and at least as good for performance. > > I disagree. Doing the long vdso indirect call *necessarily* requires > touching a new icache line, and even a new TLB entry. Indirect branches The increase in number of icache lines from the branch at every syscall point is far greater than the use of a single extra icache line shared by all syscalls. Not to mention the dcache line to access __hwcap or whatever, and the icache lines to setup access TOC-relative access to it. (Of course you could put a copy of its value in TLS at a fixed offset, which would somewhat mitigate both.) > And finally, the HWCAP test can eventually go away in future. A vdso > call can not. 
We support nearly arbitrarily old kernels (with limited functionality) and hardware (with full functionality) and don't intend for that to change, ever. But indeed glibc might want to eventually drop the check. > If you really want to select with an indirect branch rather than > direct conditional, you can do that all within the library. OK. It's a little bit more work if that's not the interface the kernel will give us, but it's no big deal. Rich
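[Editorial sketch, not part of the original message.] The "select with an indirect branch, all within the library" option discussed above can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not code from musl, glibc, or the kernel: the names entry_sc, entry_scv0, select_syscall_entry, syscall_entry and do_syscall are hypothetical, and the two stubs only report which path was chosen instead of executing sc or scv, so the example builds on any host. PPC_FEATURE2_SCV is the HWCAP2 bit named in the proposal; the numeric value below is believed to match the Linux powerpc uapi header and is defined locally only for portability. On a real system the hwcap2 argument would come from getauxval(AT_HWCAP2).

```c
#include <assert.h>
#include <stddef.h>

/* HWCAP2 bit named in the proposal. The value is assumed to match the
 * Linux powerpc uapi header; it is defined here only so the sketch
 * builds on any host. */
#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000UL
#endif

/* Tags returned by the stand-in stubs so a caller can see which path ran. */
enum { USED_SC = 0, USED_SCV0 = 1 };

typedef long (*syscall_entry_fn)(long nr, long a, long b, long c);

/* Hypothetical stand-in for the legacy 'sc' sequence (no real syscall). */
static long entry_sc(long nr, long a, long b, long c)
{
    (void)nr; (void)a; (void)b; (void)c;
    return USED_SC;
}

/* Hypothetical stand-in for the proposed 'scv 0' sequence (no real syscall). */
static long entry_scv0(long nr, long a, long b, long c)
{
    (void)nr; (void)a; (void)b; (void)c;
    return USED_SCV0;
}

/* Decide once, for example at libc startup, which entry point to use.
 * The caller is assumed to pass the value of getauxval(AT_HWCAP2). */
static syscall_entry_fn select_syscall_entry(unsigned long hwcap2)
{
    return (hwcap2 & PPC_FEATURE2_SCV) ? entry_scv0 : entry_sc;
}

/* Defaults to the legacy path until startup code overrides it. */
static syscall_entry_fn syscall_entry = entry_sc;

/* Each syscall site makes one unconditional indirect call, so there is
 * no per-site hwcap test. */
static long do_syscall(long nr, long a, long b, long c)
{
    assert(syscall_entry != NULL);
    return syscall_entry(nr, a, b, c);
}
```

The alternative argued for by Nick would instead test the HWCAP2 bit with a direct conditional branch at each call site; the sketch above only shows the single-pointer variant, and both stubs would need to present the same clobbers and error convention to their callers for this to work.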
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Rich Felker's message of April 16, 2020 10:48 am: > On Thu, Apr 16, 2020 at 10:16:54AM +1000, Nicholas Piggin wrote: >> Excerpts from Rich Felker's message of April 16, 2020 8:55 am: >> > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: >> >> I would like to enable Linux support for the powerpc 'scv' instruction, >> >> as a faster system call instruction. >> >> >> >> This requires two things to be defined: Firstly a way to advertise to >> >> userspace that kernel supports scv, and a way to allocate and advertise >> >> support for individual scv vectors. Secondly, a calling convention ABI >> >> for this new instruction. >> >> >> >> Thanks to those who commented last time, since then I have removed my >> >> answered questions and unpopular alternatives but you can find them >> >> here >> >> >> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html >> >> >> >> Let me try one more with a wider cc list, and then we'll get something >> >> merged. Any questions or counter-opinions are welcome. >> >> >> >> System Call Vectored (scv) ABI >> >> == >> >> >> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an >> >> rfscv counter-part. The benefit of these instructions is performance >> >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the >> >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR >> >> updates. The scv instruction has 128 interrupt entry points (not enough >> >> to cover the Linux system call space). >> >> >> >> The proposal is to assign scv numbers very conservatively and allocate >> >> them as individual HWCAP features as we add support for more. The zero >> >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. >> >> >> >> Advertisement >> >> >> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a >> >> SIGILL in current environments. 
Linux has defined a HWCAP2 bit >> >> PPC_FEATURE2_SCV for SCV support, but does not set it. >> >> >> >> When scv instruction support and the scv 0 vector for system calls are >> >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors >> >> should not be used without future HWCAP bits indicating support, which is >> >> how we will allocate them. (Should unallocated ones generate SIGILL, or >> >> return -ENOSYS in r3?) >> >> >> >> Calling convention >> >> >> >> The proposal is for scv 0 to provide the standard Linux system call ABI >> >> with the following differences from sc convention[1]: >> >> >> >> - LR is to be volatile across scv calls. This is necessary because the >> >> scv instruction clobbers LR. From previous discussion, this should be >> >> possible to deal with in GCC clobbers and CFI. >> >> >> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the >> >> kernel system call exit to avoid restoring the CR register (although >> >> we probably still would anyway to avoid information leak). >> >> >> >> - Error handling: I think the consensus has been to move to using negative >> >> return value in r3 rather than CR0[SO]=1 to indicate error, which >> >> matches >> >> most other architectures and is closer to a function call. >> >> >> >> The number of scratch registers (r9-r12) at kernel entry seems >> >> sufficient that we don't have any costly spilling, patch is here[2]. >> >> >> >> [1] >> >> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst >> >> [2] >> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html >> > >> > My preference would be that it work just like the i386 AT_SYSINFO >> > where you just replace "int $128" with "call *%%gs:16" and the kernel >> > provides a stub in the vdso that performs either scv or the old >> > mechanism with the same calling convention.
Then if the kernel doesn't >> > provide it (because the kernel is too old) libc would have to provide >> > its own stub that uses the legacy method and matches the calling >> > convention of the one the kernel is expected to provide. >> >> I'm not sure if that's necessary. That's done on x86-32 because they >> select different sequences to use based on the CPU running and if the host >> kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP >> bits and select the right sequence in libc as well I suppose. > > It's not just a HWCAP. It's a contract between the kernel and > userspace to support a particular calling convention that's not > exposed except as the public entry point the kernel exports via > AT_SYSINFO. Right. >> > Note that any libc that actually makes use of the new functionality is >> > not going to be able to make clobbers conditional on support for it; >> > branching around different clobbers is going to defeat any gains vs >> > always just treating anything clobbered by either method as clobbered. >> >> Well it would have to test HWCAP and
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 10:16:54AM +1000, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 16, 2020 8:55 am: > > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: > >> I would like to enable Linux support for the powerpc 'scv' instruction, > >> as a faster system call instruction. > >> > >> This requires two things to be defined: Firstly a way to advertise to > >> userspace that kernel supports scv, and a way to allocate and advertise > >> support for individual scv vectors. Secondly, a calling convention ABI > >> for this new instruction. > >> > >> Thanks to those who commented last time, since then I have removed my > >> answered questions and unpopular alternatives but you can find them > >> here > >> > >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html > >> > >> Let me try one more with a wider cc list, and then we'll get something > >> merged. Any questions or counter-opinions are welcome. > >> > >> System Call Vectored (scv) ABI > >> == > >> > >> The scv instruction is introduced with POWER9 / ISA3, it comes with an > >> rfscv counter-part. The benefit of these instructions is performance > >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the > >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR > >> updates. The scv instruction has 128 interrupt entry points (not enough > >> to cover the Linux system call space). > >> > >> The proposal is to assign scv numbers very conservatively and allocate > >> them as individual HWCAP features as we add support for more. The zero > >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. > >> > >> Advertisement > >> > >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a > >> SIGILL in current environments. Linux has defined a HWCAP2 bit > >> PPC_FEATURE2_SCV for SCV support, but does not set it. 
> >> > >> When scv instruction support and the scv 0 vector for system calls are > >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors > >> should not be used without future HWCAP bits indicating support, which is > >> how we will allocate them. (Should unallocated ones generate SIGILL, or > >> return -ENOSYS in r3?) > >> > >> Calling convention > >> > >> The proposal is for scv 0 to provide the standard Linux system call ABI > >> with the following differences from sc convention[1]: > >> > >> - LR is to be volatile across scv calls. This is necessary because the > >> scv instruction clobbers LR. From previous discussion, this should be > >> possible to deal with in GCC clobbers and CFI. > >> > >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the > >> kernel system call exit to avoid restoring the CR register (although > >> we probably still would anyway to avoid information leak). > >> > >> - Error handling: I think the consensus has been to move to using negative > >> return value in r3 rather than CR0[SO]=1 to indicate error, which matches > >> most other architectures and is closer to a function call. > >> > >> The number of scratch registers (r9-r12) at kernel entry seems > >> sufficient that we don't have any costly spilling, patch is here[2]. > >> > >> [1] > >> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst > >> [2] > >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html > > > > My preference would be that it work just like the i386 AT_SYSINFO > > where you just replace "int $128" with "call *%%gs:16" and the kernel > > provides a stub in the vdso that performs either scv or the old > > mechanism with the same calling convention.
Then if the kernel doesn't > > provide it (because the kernel is too old) libc would have to provide > > its own stub that uses the legacy method and matches the calling > > convention of the one the kernel is expected to provide. > > I'm not sure if that's necessary. That's done on x86-32 because they > select different sequences to use based on the CPU running and if the host > kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP > bits and select the right sequence in libc as well I suppose. It's not just a HWCAP. It's a contract between the kernel and userspace to support a particular calling convention that's not exposed except as the public entry point the kernel exports via AT_SYSINFO. > > Note that any libc that actually makes use of the new functionality is > > not going to be able to make clobbers conditional on support for it; > > branching around different clobbers is going to defeat any gains vs > > always just treating anything clobbered by either method as clobbered. > > Well it would have to test HWCAP and patch in or branch to two > completely different sequences including register save/restores yes. > You could have the same asm and matching clobbers to put the sequence > inline
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
Excerpts from Rich Felker's message of April 16, 2020 8:55 am: > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: >> I would like to enable Linux support for the powerpc 'scv' instruction, >> as a faster system call instruction. >> >> This requires two things to be defined: Firstly a way to advertise to >> userspace that kernel supports scv, and a way to allocate and advertise >> support for individual scv vectors. Secondly, a calling convention ABI >> for this new instruction. >> >> Thanks to those who commented last time, since then I have removed my >> answered questions and unpopular alternatives but you can find them >> here >> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html >> >> Let me try one more with a wider cc list, and then we'll get something >> merged. Any questions or counter-opinions are welcome. >> >> System Call Vectored (scv) ABI >> == >> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an >> rfscv counter-part. The benefit of these instructions is performance >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR >> updates. The scv instruction has 128 interrupt entry points (not enough >> to cover the Linux system call space). >> >> The proposal is to assign scv numbers very conservatively and allocate >> them as individual HWCAP features as we add support for more. The zero >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. >> >> Advertisement >> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a >> SIGILL in current environments. Linux has defined a HWCAP2 bit >> PPC_FEATURE2_SCV for SCV support, but does not set it. >> >> When scv instruction support and the scv 0 vector for system calls are >> added, PPC_FEATURE2_SCV will indicate support for these. 
Other vectors >> should not be used without future HWCAP bits indicating support, which is >> how we will allocate them. (Should unallocated ones generate SIGILL, or >> return -ENOSYS in r3?) >> >> Calling convention >> >> The proposal is for scv 0 to provide the standard Linux system call ABI >> with the following differences from sc convention[1]: >> >> - LR is to be volatile across scv calls. This is necessary because the >> scv instruction clobbers LR. From previous discussion, this should be >> possible to deal with in GCC clobbers and CFI. >> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the >> kernel system call exit to avoid restoring the CR register (although >> we probably still would anyway to avoid information leak). >> >> - Error handling: I think the consensus has been to move to using negative >> return value in r3 rather than CR0[SO]=1 to indicate error, which matches >> most other architectures and is closer to a function call. >> >> The number of scratch registers (r9-r12) at kernel entry seems >> sufficient that we don't have any costly spilling, patch is here[2]. >> >> [1] >> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html > > My preference would be that it work just like the i386 AT_SYSINFO > where you just replace "int $128" with "call *%%gs:16" and the kernel > provides a stub in the vdso that performs either scv or the old > mechanism with the same calling convention. Then if the kernel doesn't > provide it (because the kernel is too old) libc would have to provide > its own stub that uses the legacy method and matches the calling > convention of the one the kernel is expected to provide. I'm not sure if that's necessary. That's done on x86-32 because they select different sequences to use based on the CPU running and if the host kernel is 32 or 64 bit. 
Sure they could in theory have a bunch of HWCAP bits and select the right sequence in libc as well I suppose. > Note that any libc that actually makes use of the new functionality is > not going to be able to make clobbers conditional on support for it; > branching around different clobbers is going to defeat any gains vs > always just treating anything clobbered by either method as clobbered. Well it would have to test HWCAP and patch in or branch to two completely different sequences including register save/restores yes. You could have the same asm and matching clobbers to put the sequence inline and then you could patch the one sc/scv instruction I suppose. A bit of logic to select between them doesn't defeat gains though, it's about 90 cycle improvement which is a handful of branch mispredicts so it really is an improvement. Eventually userspace will stop supporting the old variant too. > Likewise, it's not useful to have different error return mechanisms > because the caller just has to branch to support both (or the > kernel-provided s
Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2
On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: > I would like to enable Linux support for the powerpc 'scv' instruction, > as a faster system call instruction. > > This requires two things to be defined: Firstly a way to advertise to > userspace that kernel supports scv, and a way to allocate and advertise > support for individual scv vectors. Secondly, a calling convention ABI > for this new instruction. > > Thanks to those who commented last time, since then I have removed my > answered questions and unpopular alternatives but you can find them > here > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html > > Let me try one more with a wider cc list, and then we'll get something > merged. Any questions or counter-opinions are welcome. > > System Call Vectored (scv) ABI > == > > The scv instruction is introduced with POWER9 / ISA3, it comes with an > rfscv counter-part. The benefit of these instructions is performance > (trading slower SRR0/1 with faster LR/CTR registers, and entering the > kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR > updates. The scv instruction has 128 interrupt entry points (not enough > to cover the Linux system call space). > > The proposal is to assign scv numbers very conservatively and allocate > them as individual HWCAP features as we add support for more. The zero > vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. > > Advertisement > > Linux has not enabled FSCR[SCV] yet, so the instruction will cause a > SIGILL in current environments. Linux has defined a HWCAP2 bit > PPC_FEATURE2_SCV for SCV support, but does not set it. > > When scv instruction support and the scv 0 vector for system calls are > added, PPC_FEATURE2_SCV will indicate support for these. Other vectors > should not be used without future HWCAP bits indicating support, which is > how we will allocate them. (Should unallocated ones generate SIGILL, or > return -ENOSYS in r3?) 
> > Calling convention > > The proposal is for scv 0 to provide the standard Linux system call ABI > with the following differences from sc convention[1]: > > - LR is to be volatile across scv calls. This is necessary because the > scv instruction clobbers LR. From previous discussion, this should be > possible to deal with in GCC clobbers and CFI. > > - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the > kernel system call exit to avoid restoring the CR register (although > we probably still would anyway to avoid information leak). > > - Error handling: I think the consensus has been to move to using negative > return value in r3 rather than CR0[SO]=1 to indicate error, which matches > most other architectures and is closer to a function call. > > The number of scratch registers (r9-r12) at kernel entry seems > sufficient that we don't have any costly spilling, patch is here[2]. > > [1] > https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst > [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html My preference would be that it work just like the i386 AT_SYSINFO where you just replace "int $128" with "call *%%gs:16" and the kernel provides a stub in the vdso that performs either scv or the old mechanism with the same calling convention. Then if the kernel doesn't provide it (because the kernel is too old) libc would have to provide its own stub that uses the legacy method and matches the calling convention of the one the kernel is expected to provide. Note that any libc that actually makes use of the new functionality is not going to be able to make clobbers conditional on support for it; branching around different clobbers is going to defeat any gains vs always just treating anything clobbered by either method as clobbered. 
Likewise, it's not useful to have different error return mechanisms because the caller just has to branch to support both (or the kernel-provided stub just has to emulate one for it; that could work if you really want to change the bad existing convention). Thoughts? Rich
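[Editorial sketch, not part of the original message.] To make the point about two error return mechanisms concrete, here is a minimal C sketch of what a caller or a kernel-provided stub would have to do to present a single convention. It is an illustration under stated assumptions, not code from any libc: struct raw_result and the function names are hypothetical, and the sketch models register state as plain values instead of reading real registers. It assumes the existing sc convention reports an error as CR0[SO]=1 with a positive errno in r3, and that the proposed scv 0 convention returns a negative errno in r3, with the usual Linux range of -4095 to -1 treated as errors.

```c
#include <assert.h>
#include <stdbool.h>

/* Largest errno magnitude Linux encodes in a negative return value. */
#define MAX_ERRNO 4095L

/* Hypothetical model of the state a caller sees right after the
 * system call instruction returns. */
struct raw_result {
    long r3;       /* return value register */
    bool cr0_so;   /* CR0[SO] bit; only meaningful for the 'sc' convention */
};

/* Existing 'sc' convention (assumed): CR0[SO]=1 flags an error and r3
 * holds a positive errno. Convert that to a negative errno. */
static long normalize_sc(struct raw_result res)
{
    assert(!res.cr0_so || (res.r3 > 0 && res.r3 <= MAX_ERRNO));
    return res.cr0_so ? -res.r3 : res.r3;
}

/* Proposed 'scv 0' convention: r3 is already a negative errno on
 * failure, so there is nothing to convert. */
static long normalize_scv0(struct raw_result res)
{
    return res.r3;
}

/* Common check once both paths produce the same representation. */
static bool is_error(long ret)
{
    return ret < 0 && ret >= -MAX_ERRNO;
}
```

With both paths normalized like this, the code after the call site does not need to know which instruction was used, which is the property Rich's stub proposal relies on; the cost is the extra conversion on the sc path.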