subject:"RE\: \[musl\] Powerpc Linux 'scv' system call ABI proposal take 2"

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-24 Thread Rich Felker

On Sat, Apr 25, 2020 at 01:40:24PM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 24, 2020 3:42 am:
> > On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote:
> >> 
> >> 
> >> On 23/04/2020 13:43, Rich Felker wrote:
> >> > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote:
> >> >>
> >> >>
> >> >> On 23/04/2020 13:18, Rich Felker wrote:
> >> >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
> >> 
> >> 
> >>  On 22/04/2020 23:36, Rich Felker wrote:
> >> > On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
> >> >> Yeah I had a bit of a play around with musl (which is very nice 
> >> >> code I
> >> >> must say). The powerpc64 syscall asm is missing ctr clobber by the 
> >> >> way.  
> >> >> Fortunately adding it doesn't change code generation for me, but it 
> >> >> should be fixed. glibc had the same bug at one point I think 
> >> >> (probably 
> >> >> due to syscall ABI documentation not existing -- something now 
> >> >> lives in 
> >> >> linux/Documentation/powerpc/syscall64-abi.rst).
> >> >
> >> > Do you know anywhere I can read about the ctr issue, possibly the
> >> > relevant glibc bug report? I'm not particularly familiar with ppc
> >> > register file (at least I have to refamiliarize myself every time I
> >> > work on this stuff) so it'd be nice to understand what's
> >> > potentially-wrong now.
> >> 
> >>  My understanding is the ctr issue only happens for vDSO calls where it
> >>  fallback to a syscall in case an error (invalid argument, etc. and
> >>  assuming if vDSO does not fallback to a syscall it always succeed).
> >>  This makes the vDSO call on powerpc to have same same ABI constraint
> >>  as a syscall, where it clobbers CR0.
> >> >>>
> >> >>> I think you mean "vsyscall", the old thing glibc used where there are
> >> >>> in-userspace implementations of some syscalls with call interfaces
> >> >>> roughly equivalent to a syscall. musl has never used this. It only
> >> >>> uses the actual exported functions from the vdso which have normal
> >> >>> external function call ABI.
> >> >>
> >> >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
> >> >> The issue is indeed when calling the powerpc provided functions in 
> >> >> vDSO, which musl might want to do eventually.
> >> > 
> >> > AIUI (at least this is true for all other archs) the functions have
> >> > normal external function call ABI and calling them has nothing to do
> >> > with syscall mechanisms.
> >> 
> >> My point is powerpc specifically does not follow it, since it issues a
> >> syscall in fallback and its semantic follow kernel syscalls (error
> >> signalled in cr0, r3 being always a positive value):
> > 
> > Oh, then I think we'll just ignore these unless the kernel can make
> > ones with a reasonable ABI. It's not worth having ppc-specific code
> > for this... It would be really nice if ones that actually behave like
> > functions could be added though.
> 
> Yeah this is an annoyance for me after making the scv ABI return -ve in 
> r3 for error and other things that more closely follow function calls, 
> we still have the vdso functions using the old style.
> 
> Maybe we should add function call style vdso too.

Please do.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-24 Thread Nicholas Piggin

Excerpts from Rich Felker's message of April 24, 2020 3:42 am:
> On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote:
>> 
>> 
>> On 23/04/2020 13:43, Rich Felker wrote:
>> > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote:
>> >>
>> >>
>> >> On 23/04/2020 13:18, Rich Felker wrote:
>> >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
>> 
>> 
>>  On 22/04/2020 23:36, Rich Felker wrote:
>> > On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>> >> Yeah I had a bit of a play around with musl (which is very nice code I
>> >> must say). The powerpc64 syscall asm is missing ctr clobber by the 
>> >> way.  
>> >> Fortunately adding it doesn't change code generation for me, but it 
>> >> should be fixed. glibc had the same bug at one point I think 
>> >> (probably 
>> >> due to syscall ABI documentation not existing -- something now lives 
>> >> in 
>> >> linux/Documentation/powerpc/syscall64-abi.rst).
>> >
>> > Do you know anywhere I can read about the ctr issue, possibly the
>> > relevant glibc bug report? I'm not particularly familiar with ppc
>> > register file (at least I have to refamiliarize myself every time I
>> > work on this stuff) so it'd be nice to understand what's
>> > potentially-wrong now.
>> 
>>  My understanding is the ctr issue only happens for vDSO calls where it
>>  fallback to a syscall in case an error (invalid argument, etc. and
>>  assuming if vDSO does not fallback to a syscall it always succeed).
>>  This makes the vDSO call on powerpc to have same same ABI constraint
>>  as a syscall, where it clobbers CR0.
>> >>>
>> >>> I think you mean "vsyscall", the old thing glibc used where there are
>> >>> in-userspace implementations of some syscalls with call interfaces
>> >>> roughly equivalent to a syscall. musl has never used this. It only
>> >>> uses the actual exported functions from the vdso which have normal
>> >>> external function call ABI.
>> >>
>> >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
>> >> The issue is indeed when calling the powerpc provided functions in 
>> >> vDSO, which musl might want to do eventually.
>> > 
>> > AIUI (at least this is true for all other archs) the functions have
>> > normal external function call ABI and calling them has nothing to do
>> > with syscall mechanisms.
>> 
>> My point is powerpc specifically does not follow it, since it issues a
>> syscall in fallback and its semantic follow kernel syscalls (error
>> signalled in cr0, r3 being always a positive value):
> 
> Oh, then I think we'll just ignore these unless the kernel can make
> ones with a reasonable ABI. It's not worth having ppc-specific code
> for this... It would be really nice if ones that actually behave like
> functions could be added though.

Yeah this is an annoyance for me after making the scv ABI return -ve in 
r3 for error and other things that more closely follow function calls, 
we still have the vdso functions using the old style.

Maybe we should add function call style vdso too.

Thanks,
Nick

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-24 Thread Nicholas Piggin

Excerpts from Rich Felker's message of April 23, 2020 12:36 pm:
> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>> Yeah I had a bit of a play around with musl (which is very nice code I
>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
>> Fortunately adding it doesn't change code generation for me, but it 
>> should be fixed. glibc had the same bug at one point I think (probably 
>> due to syscall ABI documentation not existing -- something now lives in 
>> linux/Documentation/powerpc/syscall64-abi.rst).
> 
> Do you know anywhere I can read about the ctr issue, possibly the
> relevant glibc bug report? I'm not particularly familiar with ppc
> register file (at least I have to refamiliarize myself every time I
> work on this stuff) so it'd be nice to understand what's
> potentially-wrong now.

Ah I was misremembering, glibc was (and still is) actually missing cr 
clobbers from its "vsyscall", probably because it copied syscall which 
only clobbers cr0, but vsyscall clobbers cr0-1,5-7 like a normal 
function call.

musl is missing the ctr register clobber from syscalls.

powerpc has gpr0-31 GPRs, cr0-7 condition regs, and lr and ctr branch 
registers (lr is generally used for function returns, ctr for other 
indirect branches). ctr is volatile (caller saved) across C function 
calls, and sc system calls on Linux.

Thanks,
Nick

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-23 Thread Rich Felker

On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 23/04/2020 13:43, Rich Felker wrote:
> > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 23/04/2020 13:18, Rich Felker wrote:
> >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
> 
> 
>  On 22/04/2020 23:36, Rich Felker wrote:
> > On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
> >> Yeah I had a bit of a play around with musl (which is very nice code I
> >> must say). The powerpc64 syscall asm is missing ctr clobber by the 
> >> way.  
> >> Fortunately adding it doesn't change code generation for me, but it 
> >> should be fixed. glibc had the same bug at one point I think (probably 
> >> due to syscall ABI documentation not existing -- something now lives 
> >> in 
> >> linux/Documentation/powerpc/syscall64-abi.rst).
> >
> > Do you know anywhere I can read about the ctr issue, possibly the
> > relevant glibc bug report? I'm not particularly familiar with ppc
> > register file (at least I have to refamiliarize myself every time I
> > work on this stuff) so it'd be nice to understand what's
> > potentially-wrong now.
> 
>  My understanding is the ctr issue only happens for vDSO calls where it
>  fallback to a syscall in case an error (invalid argument, etc. and
>  assuming if vDSO does not fallback to a syscall it always succeed).
>  This makes the vDSO call on powerpc to have same same ABI constraint
>  as a syscall, where it clobbers CR0.
> >>>
> >>> I think you mean "vsyscall", the old thing glibc used where there are
> >>> in-userspace implementations of some syscalls with call interfaces
> >>> roughly equivalent to a syscall. musl has never used this. It only
> >>> uses the actual exported functions from the vdso which have normal
> >>> external function call ABI.
> >>
> >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
> >> The issue is indeed when calling the powerpc provided functions in 
> >> vDSO, which musl might want to do eventually.
> > 
> > AIUI (at least this is true for all other archs) the functions have
> > normal external function call ABI and calling them has nothing to do
> > with syscall mechanisms.
> 
> My point is powerpc specifically does not follow it, since it issues a
> syscall in fallback and its semantic follow kernel syscalls (error
> signalled in cr0, r3 being always a positive value):

Oh, then I think we'll just ignore these unless the kernel can make
ones with a reasonable ABI. It's not worth having ppc-specific code
for this... It would be really nice if ones that actually behave like
functions could be added though.

> --
> V_FUNCTION_BEGIN(__kernel_clock_gettime)
>   .cfi_startproc
> [...]
> /*
>  * syscall fallback
>  */
> 99:
> li  r0,__NR_clock_gettime
>   .cfi_restore lr
> sc
> blr
>   .cfi_endproc
> V_FUNCTION_END(__kernel_clock_gettime)
> 
> 
> > 
> > It looks like we're not using them right now and I'm not sure why. It
> > could be that there are ABI mismatch issues (are 32-bit ones
> > compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or
> > just that nobody proposed adding them. Also as of 5.4 32-bit ppc
> > lacked time64 versions of them; not sure if this is fixed yet.
> 
> For 64-bit it also have an issue where vDSO does not provide an OPD
> for ELFv1, which has bitten glibc while trying to implement an ifunc
> optimization. I don't recall any issue for ELFv2.
> 
> For 32-bit I am not sure secure-plt will change anything, at least not
> on powerpc where we use the same strategy for 64-bit and use a
> mtctr/bctr directly.

Indeed, I don't think there's a secure-plt distinction unless you're
making outgoing calls to possibly-cross-DSO functions.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-23 Thread Adhemerval Zanella




On 23/04/2020 13:43, Rich Felker wrote:
> On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 23/04/2020 13:18, Rich Felker wrote:
>>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:


 On 22/04/2020 23:36, Rich Felker wrote:
> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>> Yeah I had a bit of a play around with musl (which is very nice code I
>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
>> Fortunately adding it doesn't change code generation for me, but it 
>> should be fixed. glibc had the same bug at one point I think (probably 
>> due to syscall ABI documentation not existing -- something now lives in 
>> linux/Documentation/powerpc/syscall64-abi.rst).
>
> Do you know anywhere I can read about the ctr issue, possibly the
> relevant glibc bug report? I'm not particularly familiar with ppc
> register file (at least I have to refamiliarize myself every time I
> work on this stuff) so it'd be nice to understand what's
> potentially-wrong now.

 My understanding is the ctr issue only happens for vDSO calls where it
 fallback to a syscall in case an error (invalid argument, etc. and
 assuming if vDSO does not fallback to a syscall it always succeed).
 This makes the vDSO call on powerpc to have same same ABI constraint
 as a syscall, where it clobbers CR0.
>>>
>>> I think you mean "vsyscall", the old thing glibc used where there are
>>> in-userspace implementations of some syscalls with call interfaces
>>> roughly equivalent to a syscall. musl has never used this. It only
>>> uses the actual exported functions from the vdso which have normal
>>> external function call ABI.
>>
>> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
>> The issue is indeed when calling the powerpc provided functions in 
>> vDSO, which musl might want to do eventually.
> 
> AIUI (at least this is true for all other archs) the functions have
> normal external function call ABI and calling them has nothing to do
> with syscall mechanisms.

My point is powerpc specifically does not follow it, since it issues a
syscall in fallback and its semantic follow kernel syscalls (error
signalled in cr0, r3 being always a positive value):

--
V_FUNCTION_BEGIN(__kernel_clock_gettime)
  .cfi_startproc
[...]
/*
 * syscall fallback
 */
99:
li  r0,__NR_clock_gettime
  .cfi_restore lr
sc
blr
  .cfi_endproc
V_FUNCTION_END(__kernel_clock_gettime)


> 
> It looks like we're not using them right now and I'm not sure why. It
> could be that there are ABI mismatch issues (are 32-bit ones
> compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or
> just that nobody proposed adding them. Also as of 5.4 32-bit ppc
> lacked time64 versions of them; not sure if this is fixed yet.

For 64-bit it also have an issue where vDSO does not provide an OPD
for ELFv1, which has bitten glibc while trying to implement an ifunc
optimization. I don't recall any issue for ELFv2.

For 32-bit I am not sure secure-plt will change anything, at least not
on powerpc where we use the same strategy for 64-bit and use a
mtctr/bctr directly.

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-23 Thread Rich Felker

On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 23/04/2020 13:18, Rich Felker wrote:
> > On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 22/04/2020 23:36, Rich Felker wrote:
> >>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>  Yeah I had a bit of a play around with musl (which is very nice code I
>  must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
>  Fortunately adding it doesn't change code generation for me, but it 
>  should be fixed. glibc had the same bug at one point I think (probably 
>  due to syscall ABI documentation not existing -- something now lives in 
>  linux/Documentation/powerpc/syscall64-abi.rst).
> >>>
> >>> Do you know anywhere I can read about the ctr issue, possibly the
> >>> relevant glibc bug report? I'm not particularly familiar with ppc
> >>> register file (at least I have to refamiliarize myself every time I
> >>> work on this stuff) so it'd be nice to understand what's
> >>> potentially-wrong now.
> >>
> >> My understanding is the ctr issue only happens for vDSO calls where it
> >> fallback to a syscall in case an error (invalid argument, etc. and
> >> assuming if vDSO does not fallback to a syscall it always succeed).
> >> This makes the vDSO call on powerpc to have same same ABI constraint
> >> as a syscall, where it clobbers CR0.
> > 
> > I think you mean "vsyscall", the old thing glibc used where there are
> > in-userspace implementations of some syscalls with call interfaces
> > roughly equivalent to a syscall. musl has never used this. It only
> > uses the actual exported functions from the vdso which have normal
> > external function call ABI.
> 
> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
> The issue is indeed when calling the powerpc provided functions in 
> vDSO, which musl might want to do eventually.

AIUI (at least this is true for all other archs) the functions have
normal external function call ABI and calling them has nothing to do
with syscall mechanisms.

It looks like we're not using them right now and I'm not sure why. It
could be that there are ABI mismatch issues (are 32-bit ones
compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or
just that nobody proposed adding them. Also as of 5.4 32-bit ppc
lacked time64 versions of them; not sure if this is fixed yet.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-23 Thread Adhemerval Zanella




On 23/04/2020 13:18, Rich Felker wrote:
> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 22/04/2020 23:36, Rich Felker wrote:
>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
 Yeah I had a bit of a play around with musl (which is very nice code I
 must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
 Fortunately adding it doesn't change code generation for me, but it 
 should be fixed. glibc had the same bug at one point I think (probably 
 due to syscall ABI documentation not existing -- something now lives in 
 linux/Documentation/powerpc/syscall64-abi.rst).
>>>
>>> Do you know anywhere I can read about the ctr issue, possibly the
>>> relevant glibc bug report? I'm not particularly familiar with ppc
>>> register file (at least I have to refamiliarize myself every time I
>>> work on this stuff) so it'd be nice to understand what's
>>> potentially-wrong now.
>>
>> My understanding is the ctr issue only happens for vDSO calls where it
>> fallback to a syscall in case an error (invalid argument, etc. and
>> assuming if vDSO does not fallback to a syscall it always succeed).
>> This makes the vDSO call on powerpc to have same same ABI constraint
>> as a syscall, where it clobbers CR0.
> 
> I think you mean "vsyscall", the old thing glibc used where there are
> in-userspace implementations of some syscalls with call interfaces
> roughly equivalent to a syscall. musl has never used this. It only
> uses the actual exported functions from the vdso which have normal
> external function call ABI.

I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing.
The issue is indeed when calling the powerpc provided functions in 
vDSO, which musl might want to do eventually.

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-23 Thread Rich Felker

On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote:
> 
> 
> On 22/04/2020 23:36, Rich Felker wrote:
> > On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
> >> Yeah I had a bit of a play around with musl (which is very nice code I
> >> must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
> >> Fortunately adding it doesn't change code generation for me, but it 
> >> should be fixed. glibc had the same bug at one point I think (probably 
> >> due to syscall ABI documentation not existing -- something now lives in 
> >> linux/Documentation/powerpc/syscall64-abi.rst).
> > 
> > Do you know anywhere I can read about the ctr issue, possibly the
> > relevant glibc bug report? I'm not particularly familiar with ppc
> > register file (at least I have to refamiliarize myself every time I
> > work on this stuff) so it'd be nice to understand what's
> > potentially-wrong now.
> 
> My understanding is the ctr issue only happens for vDSO calls where it
> fallback to a syscall in case an error (invalid argument, etc. and
> assuming if vDSO does not fallback to a syscall it always succeed).
> This makes the vDSO call on powerpc to have same same ABI constraint
> as a syscall, where it clobbers CR0.

I think you mean "vsyscall", the old thing glibc used where there are
in-userspace implementations of some syscalls with call interfaces
roughly equivalent to a syscall. musl has never used this. It only
uses the actual exported functions from the vdso which have normal
external function call ABI.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-23 Thread Adhemerval Zanella

On 22/04/2020 23:36, Rich Felker wrote:
> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
>> Yeah I had a bit of a play around with musl (which is very nice code I
>> must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
>> Fortunately adding it doesn't change code generation for me, but it 
>> should be fixed. glibc had the same bug at one point I think (probably 
>> due to syscall ABI documentation not existing -- something now lives in 
>> linux/Documentation/powerpc/syscall64-abi.rst).
> 
> Do you know anywhere I can read about the ctr issue, possibly the
> relevant glibc bug report? I'm not particularly familiar with ppc
> register file (at least I have to refamiliarize myself every time I
> work on this stuff) so it'd be nice to understand what's
> potentially-wrong now.

My understanding is the ctr issue only happens for vDSO calls where it
fallback to a syscall in case an error (invalid argument, etc. and
assuming if vDSO does not fallback to a syscall it always succeed).
This makes the vDSO call on powerpc to have same same ABI constraint
as a syscall, where it clobbers CR0.

On glibc we handle by simulating a function call and analysing the CR0
result:

  __asm__ __volatile__ 
  ("mtctr %0\n\t"
   "bctrl\n\t"
   "mfcr  %0\n\t"
   "0:"
   : "+r" (r0), "+r" (r3), "+r" (r4), "+r" (r5),  "+r" (r6),
 "+r" (r7), "+r" (r8)
   : : "r9", "r10", "r11", "r12", "cr0", "ctr", "lr", "memory");
  __asm__ __volatile__ ("" : "=r" (rval) : "r" (r3));

On musl you don't have this issue because it does not enable vDSO
support on powerpc.  And if it eventually does it with the VDSO_*
macros the only issue I see is on when vDSO fallbacks to the syscall 
and it also fails (the return code won't be negated since on musl it 
uses a default C function pointer issue which does not model the CR0
kernel abi). 

So I think the extra ctr constraint on glibc powerpc syscall code is
not really required.  I think I have some patches to optimize this a
bit based on previous discussions.

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-22 Thread Rich Felker

On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote:
> Yeah I had a bit of a play around with musl (which is very nice code I
> must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
> Fortunately adding it doesn't change code generation for me, but it 
> should be fixed. glibc had the same bug at one point I think (probably 
> due to syscall ABI documentation not existing -- something now lives in 
> linux/Documentation/powerpc/syscall64-abi.rst).

Do you know anywhere I can read about the ctr issue, possibly the
relevant glibc bug report? I'm not particularly familiar with ppc
register file (at least I have to refamiliarize myself every time I
work on this stuff) so it'd be nice to understand what's
potentially-wrong now.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-21 Thread Nicholas Piggin

Excerpts from Nicholas Piggin's message of April 22, 2020 4:18 pm:
> If we go further and try to preserve r3 as well by putting the return 
> value in r9 or r0, we go backwards about 300 bytes. It's good for the 
> lock loops and complex functions, but hurts a lot of simpler functions 
> that have to add 'mr r3,r9' etc.  
> 
> Most of the time there are saved non-volatile GPRs around anyway though, 
> so not sure which way to go on this. Text size savings can't be ignored
> and it's pretty easy for the kernel to do (we already save r3-r8 and
> zero them on exit, so we could load them instead from cache line that's
> should be hot).
> 
> So I may be inclined to go this way, even if we won't see benefit now.

By, "this way" I don't mean r9 or r0 return value (which is larger code),
but r3 return value with r0,r4-r8 preserved.

Thanks,
Nick

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-21 Thread Nicholas Piggin

Excerpts from Rich Felker's message of April 21, 2020 3:27 am:
> On Mon, Apr 20, 2020 at 02:31:58PM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
>> > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
>> >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> >> >> > Note that because lr is clobbered we need at least once normally
>> >> >> > call-clobbered register that's not syscall clobbered to save lr in.
>> >> >> > Otherwise stack frame setup is required to spill it.
>> >> >> 
>> >> >> The kernel would like to use r9-r12 for itself. We could do with fewer 
>> >> >> registers, but we have some delay establishing the stack (depends on a
>> >> >> load which depends on a mfspr), and entry code tends to be quite store
>> >> >> heavy whereas on the caller side you have r1 set up (modulo stack 
>> >> >> updates), and the system call is a long delay during which time the 
>> >> >> store queue has significant time to drain.
>> >> >> 
>> >> >> My feeling is it would be better for kernel to have these scratch 
>> >> >> registers.
>> >> > 
>> >> > If your new kernel syscall mechanism requires the caller to make a
>> >> > whole stack frame it otherwise doesn't need and spill registers to it,
>> >> > it becomes a lot less attractive. Some of those 90 cycles saved are
>> >> > immediately lost on the userspace side, plus you either waste icache
>> >> > at the call point or require the syscall to go through a
>> >> > userspace-side helper function that performs the spill and restore.
>> >> 
>> >> You would be surprised how few cycles that takes on a high end CPU. Some 
>> >> might be a couple of %. I am one for counting cycles mind you, I'm not 
>> >> being flippant about it. If we can come up with something faster I'd be 
>> >> up for it.
>> > 
>> > If the cycle count is trivial then just do it on the kernel side.
>> 
>> The cycle count for user is, because you have r1 ready. Kernel does not 
>> have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to 
>> save into.
>> 
>> Which is also wasted work for a userspace.
>> 
>> Now that I think about it, no stack frame is even required! lr is saved 
>> into the caller's stack when its clobbered with an asm, just as when 
>> it's used for a function call.
> 
> No. If there is a non-clobbered register, lr can be moved to the
> non-clobbered register rather than saved to the stack. However it
> looks like (1) gcc doesn't take advantage of that possibility, but (2)
> the caller already arranged for there to be space on the stack to save
> lr, so the cost is only one store and one load, not any stack
> adjustment or other frame setup. So it's probably not a really big
> deal. However, just adding "lr" clobber to existing syscall in musl
> increased the size of a simple syscall function (getuid) from 20 bytes
> to 36 bytes.

Yeah I had a bit of a play around with musl (which is very nice code I
must say). The powerpc64 syscall asm is missing ctr clobber by the way.  
Fortunately adding it doesn't change code generation for me, but it 
should be fixed. glibc had the same bug at one point I think (probably 
due to syscall ABI documentation not existing -- something now lives in 
linux/Documentation/powerpc/syscall64-abi.rst).

Yes lr needs to be saved, I didn't see any new requirement for stack
frames, and it was often already saved, but it does hurt the small 
wrapper functions.

I did look at entirely replacing sc with scv though, just as an 
experiment. One day you might make sc optional! Text size impoves by 
about 3kB with the proposed ABI.

Mostly seems to be the bns+ ; neg sequence. __syscall1/2/3 get
out-of-lined by the compiler in a lot of cases. Linux's bloat-o-meter 
says:

add/remove: 0/5 grow/shrink: 24/260 up/down: 220/-3428 (-3208)
Function old new   delta
fcntl400 424 +24
popen600 620 +20
times 32  40  +8
[...]
alloc_rev816 784 -32
alloc_fwd812 780 -32
__syscall1.constprop  32   - -32
__fdopen 504 472 -32
__expand_heap628 592 -36
__syscall240   - -40
__syscall344   - -44
fchmodat 372 324 -48
__wake.constprop 408 360 -48
child   11161064 -52
checker  220 156 -64
__bin_chunk

RE: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-21 Thread David Laight

From: Nicholas Piggin
> Sent: 20 April 2020 02:10
...
> >> Yes, but does it really matter to optimize this specific usage case
> >> for size? glibc, for instance, tries to leverage the syscall mechanism
> >> by adding some complex pre-processor asm directives.  It optimizes
> >> the syscall code size in most cases.  For instance, kill in static case
> >> generates on x86_64:
> >>
> >>  <__kill>:
> >>0:   b8 3e 00 00 00  mov$0x3e,%eax
> >>5:   0f 05   syscall
> >>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
> >>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>

Hmmm... that cmp + jae is unnecessary here.
It is also a 32bit offset jump.
I also suspect it gets predicted very badly.

> >>   13:   c3  retq
> >>
> >> While on musl:
> >>
> >>  :
> >>0:  48 83 ec 08 sub$0x8,%rsp
> >>4:  48 63 ffmovslq %edi,%rdi
> >>7:  48 63 f6movslq %esi,%rsi
> >>a:  b8 3e 00 00 00  mov$0x3e,%eax
> >>f:  0f 05   syscall
> >>   11:  48 89 c7mov%rax,%rdi
> >>   14:  e8 00 00 00 00  callq  19 
> >>   19:  5a  pop%rdx
> >>   1a:  c3  retq
> >
> > Wow that's some extraordinarily bad codegen going on by gcc... The
> > sign-extension is semantically needed and I don't see a good way
> > around it (glibc's asm is kinda a hack taking advantage of kernel not
> > looking at high bits, I think), but the gratuitous stack adjustment
> > and refusal to generate a tail call isn't. I'll see if we can track
> > down what's going on and get it fixed.

A suitable cast might get rid of the sign extension.
Possibly just (unsigned int).

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

RE: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-21 Thread David Laight

From: Adhemerval Zanella
> Sent: 21 April 2020 16:01
> 
> On 21/04/2020 11:39, Rich Felker wrote:
> > On Tue, Apr 21, 2020 at 12:28:25PM +, David Laight wrote:
> >> From: Nicholas Piggin
> >>> Sent: 20 April 2020 02:10
> >> ...
> > Yes, but does it really matter to optimize this specific usage case
> > for size? glibc, for instance, tries to leverage the syscall mechanism
> > by adding some complex pre-processor asm directives.  It optimizes
> > the syscall code size in most cases.  For instance, kill in static case
> > generates on x86_64:
> >
> >  <__kill>:
> >0:   b8 3e 00 00 00  mov$0x3e,%eax
> >5:   0f 05   syscall
> >7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
> >d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
> >>
> >> Hmmm... that cmp + jae is unnecessary here.
> >
> > It's not.. Rather the objdump was just mistakenly done without -r so
> > it looks like a nop jump rather than a conditional tail call to the
> > function that sets errno.
> >
> 
> Indeed, the output with -r is:
> 
>  <__kill>:
>0:   b8 3e 00 00 00  mov$0x3e,%eax
>5:   0f 05   syscall
>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
> f: R_X86_64_PLT32   __syscall_error-0x4
>   13:   c3  retq

Yes, I probably should have remembered it looked like that :-)
...
> >> I also suspect it gets predicted very badly.
> >
> > I doubt that. This is a very standard idiom and the size of the offset
> > (which is necessarily 32-bit because it has a relocation on it) is
> > orthogonal to the condition on the jump.

Yes, it only gets mispredicted as badly as any other conditional jump.
I believe modern intel x86 will randomly predict it taken (regardless
of the direction) and then hit a TLB fault on text.unlikely :-)

> > FWIW a syscall like kill takes global kernel-side locks to be able to
> > address a target process by pid, and the rate of meaningful calls you
> > can make to it is very low (since it's bounded by time for target
> > process to act on the signal). Trying to optimize it for speed is
> > pointless, and even size isn't important locally (although in
> > aggregate, lots of wasted small size can add up to more pages = more
> > TLB entries = ...).
> 
> I agree and I would prefer to focus on code simplicity to have a
> platform neutral way to handle error and let the compiler optimize
> it than messy with assembly macros to squeeze this kind of
> micro-optimizations.

syscall entry does get micro-optimised.
Real speed-ups can probably be found by optimising other places.
I've a patch i need to resumbit that should improve the reading
of iov[] from user space.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-21 Thread Adhemerval Zanella




On 21/04/2020 11:39, Rich Felker wrote:
> On Tue, Apr 21, 2020 at 12:28:25PM +, David Laight wrote:
>> From: Nicholas Piggin
>>> Sent: 20 April 2020 02:10
>> ...
> Yes, but does it really matter to optimize this specific usage case
> for size? glibc, for instance, tries to leverage the syscall mechanism
> by adding some complex pre-processor asm directives.  It optimizes
> the syscall code size in most cases.  For instance, kill in static case
> generates on x86_64:
>
>  <__kill>:
>0:   b8 3e 00 00 00  mov$0x3e,%eax
>5:   0f 05   syscall
>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
>>
>> Hmmm... that cmp + jae is unnecessary here.
> 
> It's not.. Rather the objdump was just mistakenly done without -r so
> it looks like a nop jump rather than a conditional tail call to the
> function that sets errno.
> 

Indeed, the output with -r is:

 <__kill>:
   0:   b8 3e 00 00 00  mov$0x3e,%eax
   5:   0f 05   syscall 
   7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
   d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
f: R_X86_64_PLT32   __syscall_error-0x4
  13:   c3  retq   

And for x86_64 __syscall_error is defined as:

 <__syscall_error>:
   0:   48 f7 d8neg%rax

0003 <__syscall_error_1>:
   3:   64 89 04 25 00 00 00mov%eax,%fs:0x0
   a:   00
7: R_X86_64_TPOFF32 errno
   b:   48 83 c8 ff or $0x,%rax
   f:   c3  retq

Different than musl, each architecture defines its own error handling
mechanism (some embedded errno setting in syscall itself, other branches
to a __syscall_error like function as x86_64).  

This is due most likely from the glibc long history.  One of my long 
term plan is to just simplify, get rid of the assembly pre-processor,
implement all syscall in C code, and set error handling mechanism in
a platform neutral way using a tail call (most likely you do on musl).

>> It is also a 32bit offset jump.
>> I also suspect it gets predicted very badly.
> 
> I doubt that. This is a very standard idiom and the size of the offset
> (which is necessarily 32-bit because it has a relocation on it) is
> orthogonal to the condition on the jump.
> 
> FWIW a syscall like kill takes global kernel-side locks to be able to
> address a target process by pid, and the rate of meaningful calls you
> can make to it is very low (since it's bounded by time for target
> process to act on the signal). Trying to optimize it for speed is
> pointless, and even size isn't important locally (although in
> aggregate, lots of wasted small size can add up to more pages = more
> TLB entries = ...).

I agree and I would prefer to focus on code simplicity to have a
platform neutral way to handle error and let the compiler optimize
it than messy with assembly macros to squeeze this kind of
micro-optimizations.

> 
>   13:   c3  retq
>
> While on musl:
>
>  :
>0: 48 83 ec 08 sub$0x8,%rsp
>4: 48 63 ffmovslq %edi,%rdi
>7: 48 63 f6movslq %esi,%rsi
>a: b8 3e 00 00 00  mov$0x3e,%eax
>f: 0f 05   syscall
>   11: 48 89 c7mov%rax,%rdi
>   14: e8 00 00 00 00  callq  19 
>   19: 5a  pop%rdx
>   1a: c3  retq

 Wow that's some extraordinarily bad codegen going on by gcc... The
 sign-extension is semantically needed and I don't see a good way
 around it (glibc's asm is kinda a hack taking advantage of kernel not
 looking at high bits, I think), but the gratuitous stack adjustment
 and refusal to generate a tail call isn't. I'll see if we can track
 down what's going on and get it fixed.
>>
>> A suitable cast might get rid of the sign extension.
>> Possibly just (unsigned int).
> 
> No, it won't. The problem is that there is no representation of the
> fact that the kernel is only going to inspect the low 32 bits (by
> declaring the kernel-side function as taking an int argument). The
> external kill function receives arguments by the ABI, where the upper
> bits of int args can contain junk, and the asm register constraints
> for syscalls use longs (or rather an abstract syscall-arg type). It
> wouldn't even work to have macro magic detect that the expressions
> passed are ints and use hacks to avoid that, since it's perfectly
> valid to pass an int to a syscall that expects a long argument (e.g.
> offset to mmap), in which case it needs to be sign-extended.
> 
> The only way to avoid this is encoding somewhere th

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-21 Thread Rich Felker

On Tue, Apr 21, 2020 at 12:28:25PM +, David Laight wrote:
> From: Nicholas Piggin
> > Sent: 20 April 2020 02:10
> ...
> > >> Yes, but does it really matter to optimize this specific usage case
> > >> for size? glibc, for instance, tries to leverage the syscall mechanism
> > >> by adding some complex pre-processor asm directives.  It optimizes
> > >> the syscall code size in most cases.  For instance, kill in static case
> > >> generates on x86_64:
> > >>
> > >>  <__kill>:
> > >>0:   b8 3e 00 00 00  mov$0x3e,%eax
> > >>5:   0f 05   syscall
> > >>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
> > >>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
> 
> Hmmm... that cmp + jae is unnecessary here.

It's not.. Rather the objdump was just mistakenly done without -r so
it looks like a nop jump rather than a conditional tail call to the
function that sets errno.

> It is also a 32bit offset jump.
> I also suspect it gets predicted very badly.

I doubt that. This is a very standard idiom and the size of the offset
(which is necessarily 32-bit because it has a relocation on it) is
orthogonal to the condition on the jump.

FWIW a syscall like kill takes global kernel-side locks to be able to
address a target process by pid, and the rate of meaningful calls you
can make to it is very low (since it's bounded by time for target
process to act on the signal). Trying to optimize it for speed is
pointless, and even size isn't important locally (although in
aggregate, lots of wasted small size can add up to more pages = more
TLB entries = ...).

> > >>   13:   c3  retq
> > >>
> > >> While on musl:
> > >>
> > >>  :
> > >>0:48 83 ec 08 sub$0x8,%rsp
> > >>4:48 63 ffmovslq %edi,%rdi
> > >>7:48 63 f6movslq %esi,%rsi
> > >>a:b8 3e 00 00 00  mov$0x3e,%eax
> > >>f:0f 05   syscall
> > >>   11:48 89 c7mov%rax,%rdi
> > >>   14:e8 00 00 00 00  callq  19 
> > >>   19:5a  pop%rdx
> > >>   1a:c3  retq
> > >
> > > Wow that's some extraordinarily bad codegen going on by gcc... The
> > > sign-extension is semantically needed and I don't see a good way
> > > around it (glibc's asm is kinda a hack taking advantage of kernel not
> > > looking at high bits, I think), but the gratuitous stack adjustment
> > > and refusal to generate a tail call isn't. I'll see if we can track
> > > down what's going on and get it fixed.
> 
> A suitable cast might get rid of the sign extension.
> Possibly just (unsigned int).

No, it won't. The problem is that there is no representation of the
fact that the kernel is only going to inspect the low 32 bits (by
declaring the kernel-side function as taking an int argument). The
external kill function receives arguments by the ABI, where the upper
bits of int args can contain junk, and the asm register constraints
for syscalls use longs (or rather an abstract syscall-arg type). It
wouldn't even work to have macro magic detect that the expressions
passed are ints and use hacks to avoid that, since it's perfectly
valid to pass an int to a syscall that expects a long argument (e.g.
offset to mmap), in which case it needs to be sign-extended.

The only way to avoid this is encoding somewhere the syscall-specific
knowledge of what arg size the kernel function expects. That's way too
much redundant effort and too error-prone for the incredibly miniscule
size benefit you'd get out of it.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-21 Thread Florian Weimer

* Szabolcs Nagy:

> * Nicholas Piggin  [2020-04-20 12:08:36 +1000]:
>> Excerpts from Rich Felker's message of April 20, 2020 11:29 am:
>> > Also, allowing patching of executable pages is generally frowned upon
>> > these days because W^X is a desirable hardening property.
>> 
>> Right, it would want be write-protected after being patched.
>
> "frowned upon" means that users may have to update
> their security policy setting in pax, selinux, apparmor,
> seccomp bpf filters and who knows what else that may
> monitor and flag W&X mprotect.
>
> libc update can break systems if the new libc does W&X.

It's possible to map over pre-compiled alternative implementations,
though.  Basically, we would do the patching and build time and store
the results in the file.

It works best if the variance is concentrated on a few pages, and
there are very few alternatives.  For example, having two syscall APIs
and supporting threading and no-threading versions would need four
code versions in total, which is likely excessive.

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-20 Thread Szabolcs Nagy

* Nicholas Piggin  [2020-04-20 12:08:36 +1000]:
> Excerpts from Rich Felker's message of April 20, 2020 11:29 am:
> > Also, allowing patching of executable pages is generally frowned upon
> > these days because W^X is a desirable hardening property.
> 
> Right, it would want be write-protected after being patched.

"frowned upon" means that users may have to update
their security policy setting in pax, selinux, apparmor,
seccomp bpf filters and who knows what else that may
monitor and flag W&X mprotect.

libc update can break systems if the new libc does W&X.

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-20 Thread Rich Felker

On Mon, Apr 20, 2020 at 02:31:58PM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
> > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
> >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
> >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
> >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> >> >> > Note that because lr is clobbered we need at least once normally
> >> >> > call-clobbered register that's not syscall clobbered to save lr in.
> >> >> > Otherwise stack frame setup is required to spill it.
> >> >> 
> >> >> The kernel would like to use r9-r12 for itself. We could do with fewer 
> >> >> registers, but we have some delay establishing the stack (depends on a
> >> >> load which depends on a mfspr), and entry code tends to be quite store
> >> >> heavy whereas on the caller side you have r1 set up (modulo stack 
> >> >> updates), and the system call is a long delay during which time the 
> >> >> store queue has significant time to drain.
> >> >> 
> >> >> My feeling is it would be better for kernel to have these scratch 
> >> >> registers.
> >> > 
> >> > If your new kernel syscall mechanism requires the caller to make a
> >> > whole stack frame it otherwise doesn't need and spill registers to it,
> >> > it becomes a lot less attractive. Some of those 90 cycles saved are
> >> > immediately lost on the userspace side, plus you either waste icache
> >> > at the call point or require the syscall to go through a
> >> > userspace-side helper function that performs the spill and restore.
> >> 
> >> You would be surprised how few cycles that takes on a high end CPU. Some 
> >> might be a couple of %. I am one for counting cycles mind you, I'm not 
> >> being flippant about it. If we can come up with something faster I'd be 
> >> up for it.
> > 
> > If the cycle count is trivial then just do it on the kernel side.
> 
> The cycle count for user is, because you have r1 ready. Kernel does not 
> have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to 
> save into.
> 
> Which is also wasted work for a userspace.
> 
> Now that I think about it, no stack frame is even required! lr is saved 
> into the caller's stack when its clobbered with an asm, just as when 
> it's used for a function call.

No. If there is a non-clobbered register, lr can be moved to the
non-clobbered register rather than saved to the stack. However it
looks like (1) gcc doesn't take advantage of that possibility, but (2)
the caller already arranged for there to be space on the stack to save
lr, so the cost is only one store and one load, not any stack
adjustment or other frame setup. So it's probably not a really big
deal. However, just adding "lr" clobber to existing syscall in musl
increased the size of a simple syscall function (getuid) from 20 bytes
to 36 bytes.

> >> > syscall arg registers still preserved? If not, this is a major cost on
> >> > the userspace side, since any call point that has to loop-and-retry
> >> > (e.g. futex) now needs to make its own place to store the original
> >> > values.)
> >> 
> >> Powerpc system calls never did. We could have scv preserve them, but 
> >> you'd still need to restore r3. We could make an ABI which does not
> >> clobber r3 but puts the return value in r9, say. I'd like to see what
> >> the user side code looks like to take advantage of such a thing though.
> > 
> > Oh wow, I hadn't realized that, but indeed the code we have now is
> > allowing for the kernel to clobber them all. So at least this isn't
> > getting any worse I guess. I think it was a very poor choice of
> > behavior though and a disadvantage vs what other archs do (some of
> > them preserve all registers; others preserve only normally call-saved
> > ones plus the syscall arg ones and possibly a few other specials).
> 
> Well, we could change it. Does the generated code improve significantly
> we take those clobbers away?

I'd have to experiment a bit more to see. It's not going to help at
all in functions which are pure syscall wrappers that just do the
syscall and return, since the arg regs are dead after the syscall
anyway (the caller must assume they were clobbered). But where
syscalls are inlined and used in a loop, like a futex wait, it might
make a nontrivial difference.

Unfortunately even if you did change it for the new scv mechanism, it
would be hard to take advantage of the change while also supporting
sc, unless we used a helper function that just did scv directly, but
saved/restored all the arg regs when using the legacy sc mechanism.
Just inlining the hwcap conditional and clobbering more regs in one
code path than in the other likely would not help; gcc won't
shrink-wrap the clobbered/non-clobbered paths separately, and even if
it did, when this were inlined somewhere like a futex loop, it'd end
up having to lift the conditional out of the loop to be very
advantageous, then making th

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Nicholas Piggin

Excerpts from Rich Felker's message of April 20, 2020 2:09 pm:
> On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
>> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> >> > Note that because lr is clobbered we need at least once normally
>> >> > call-clobbered register that's not syscall clobbered to save lr in.
>> >> > Otherwise stack frame setup is required to spill it.
>> >> 
>> >> The kernel would like to use r9-r12 for itself. We could do with fewer 
>> >> registers, but we have some delay establishing the stack (depends on a
>> >> load which depends on a mfspr), and entry code tends to be quite store
>> >> heavy whereas on the caller side you have r1 set up (modulo stack 
>> >> updates), and the system call is a long delay during which time the 
>> >> store queue has significant time to drain.
>> >> 
>> >> My feeling is it would be better for kernel to have these scratch 
>> >> registers.
>> > 
>> > If your new kernel syscall mechanism requires the caller to make a
>> > whole stack frame it otherwise doesn't need and spill registers to it,
>> > it becomes a lot less attractive. Some of those 90 cycles saved are
>> > immediately lost on the userspace side, plus you either waste icache
>> > at the call point or require the syscall to go through a
>> > userspace-side helper function that performs the spill and restore.
>> 
>> You would be surprised how few cycles that takes on a high end CPU. Some 
>> might be a couple of %. I am one for counting cycles mind you, I'm not 
>> being flippant about it. If we can come up with something faster I'd be 
>> up for it.
> 
> If the cycle count is trivial then just do it on the kernel side.

The cycle count for user is, because you have r1 ready. Kernel does not 
have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to 
save into.

Which is also wasted work for a userspace.

Now that I think about it, no stack frame is even required! lr is saved 
into the caller's stack when its clobbered with an asm, just as when 
it's used for a function call.

>> > The right way to do this is to have the kernel preserve enough
>> > registers that userspace can avoid having any spills. It doesn't have
>> > to preserve everything, probably just enough to save lr. (BTW are
>> 
>> Again, the problem is the kernel doesn't have its dependencies 
>> immediately ready to spill, and spilling (may be) more costly 
>> immediately after the call because we're doing a lot of stores.
>> 
>> I could try measure this. Unfortunately our pipeline simulator tool 
>> doesn't model system calls properly so it's hard to see what's happening 
>> across the user/kernel horizon, I might check if that can be improved
>> or I can hack it by putting some isync in there or something.
> 
> I think it's unlikely to make any real difference to the total number
> of cycles spent which side it happens on, but putting it on the kernel
> side makes it easier to avoid wasting size/icache at each syscall
> site.
> 
>> > syscall arg registers still preserved? If not, this is a major cost on
>> > the userspace side, since any call point that has to loop-and-retry
>> > (e.g. futex) now needs to make its own place to store the original
>> > values.)
>> 
>> Powerpc system calls never did. We could have scv preserve them, but 
>> you'd still need to restore r3. We could make an ABI which does not
>> clobber r3 but puts the return value in r9, say. I'd like to see what
>> the user side code looks like to take advantage of such a thing though.
> 
> Oh wow, I hadn't realized that, but indeed the code we have now is
> allowing for the kernel to clobber them all. So at least this isn't
> getting any worse I guess. I think it was a very poor choice of
> behavior though and a disadvantage vs what other archs do (some of
> them preserve all registers; others preserve only normally call-saved
> ones plus the syscall arg ones and possibly a few other specials).

Well, we could change it. Does the generated code improve significantly
we take those clobbers away?

Thanks,
Nick

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Rich Felker

On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> >> > Note that because lr is clobbered we need at least once normally
> >> > call-clobbered register that's not syscall clobbered to save lr in.
> >> > Otherwise stack frame setup is required to spill it.
> >> 
> >> The kernel would like to use r9-r12 for itself. We could do with fewer 
> >> registers, but we have some delay establishing the stack (depends on a
> >> load which depends on a mfspr), and entry code tends to be quite store
> >> heavy whereas on the caller side you have r1 set up (modulo stack 
> >> updates), and the system call is a long delay during which time the 
> >> store queue has significant time to drain.
> >> 
> >> My feeling is it would be better for kernel to have these scratch 
> >> registers.
> > 
> > If your new kernel syscall mechanism requires the caller to make a
> > whole stack frame it otherwise doesn't need and spill registers to it,
> > it becomes a lot less attractive. Some of those 90 cycles saved are
> > immediately lost on the userspace side, plus you either waste icache
> > at the call point or require the syscall to go through a
> > userspace-side helper function that performs the spill and restore.
> 
> You would be surprised how few cycles that takes on a high end CPU. Some 
> might be a couple of %. I am one for counting cycles mind you, I'm not 
> being flippant about it. If we can come up with something faster I'd be 
> up for it.

If the cycle count is trivial then just do it on the kernel side.

> > The right way to do this is to have the kernel preserve enough
> > registers that userspace can avoid having any spills. It doesn't have
> > to preserve everything, probably just enough to save lr. (BTW are
> 
> Again, the problem is the kernel doesn't have its dependencies 
> immediately ready to spill, and spilling (may be) more costly 
> immediately after the call because we're doing a lot of stores.
> 
> I could try measure this. Unfortunately our pipeline simulator tool 
> doesn't model system calls properly so it's hard to see what's happening 
> across the user/kernel horizon, I might check if that can be improved
> or I can hack it by putting some isync in there or something.

I think it's unlikely to make any real difference to the total number
of cycles spent which side it happens on, but putting it on the kernel
side makes it easier to avoid wasting size/icache at each syscall
site.

> > syscall arg registers still preserved? If not, this is a major cost on
> > the userspace side, since any call point that has to loop-and-retry
> > (e.g. futex) now needs to make its own place to store the original
> > values.)
> 
> Powerpc system calls never did. We could have scv preserve them, but 
> you'd still need to restore r3. We could make an ABI which does not
> clobber r3 but puts the return value in r9, say. I'd like to see what
> the user side code looks like to take advantage of such a thing though.

Oh wow, I hadn't realized that, but indeed the code we have now is
allowing for the kernel to clobber them all. So at least this isn't
getting any worse I guess. I think it was a very poor choice of
behavior though and a disadvantage vs what other archs do (some of
them preserve all registers; others preserve only normally call-saved
ones plus the syscall arg ones and possibly a few other specials).

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Nicholas Piggin

Excerpts from Rich Felker's message of April 20, 2020 11:34 am:
> On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
>> > Note that because lr is clobbered we need at least once normally
>> > call-clobbered register that's not syscall clobbered to save lr in.
>> > Otherwise stack frame setup is required to spill it.
>> 
>> The kernel would like to use r9-r12 for itself. We could do with fewer 
>> registers, but we have some delay establishing the stack (depends on a
>> load which depends on a mfspr), and entry code tends to be quite store
>> heavy whereas on the caller side you have r1 set up (modulo stack 
>> updates), and the system call is a long delay during which time the 
>> store queue has significant time to drain.
>> 
>> My feeling is it would be better for kernel to have these scratch 
>> registers.
> 
> If your new kernel syscall mechanism requires the caller to make a
> whole stack frame it otherwise doesn't need and spill registers to it,
> it becomes a lot less attractive. Some of those 90 cycles saved are
> immediately lost on the userspace side, plus you either waste icache
> at the call point or require the syscall to go through a
> userspace-side helper function that performs the spill and restore.

You would be surprised how few cycles that takes on a high end CPU. Some 
might be a couple of %. I am one for counting cycles mind you, I'm not 
being flippant about it. If we can come up with something faster I'd be 
up for it.

> 
> The right way to do this is to have the kernel preserve enough
> registers that userspace can avoid having any spills. It doesn't have
> to preserve everything, probably just enough to save lr. (BTW are

Again, the problem is the kernel doesn't have its dependencies 
immediately ready to spill, and spilling (may be) more costly 
immediately after the call because we're doing a lot of stores.

I could try measure this. Unfortunately our pipeline simulator tool 
doesn't model system calls properly so it's hard to see what's happening 
across the user/kernel horizon, I might check if that can be improved
or I can hack it by putting some isync in there or something.

> syscall arg registers still preserved? If not, this is a major cost on
> the userspace side, since any call point that has to loop-and-retry
> (e.g. futex) now needs to make its own place to store the original
> values.)

Powerpc system calls never did. We could have scv preserve them, but 
you'd still need to restore r3. We could make an ABI which does not
clobber r3 but puts the return value in r9, say. I'd like to see what
the user side code looks like to take advantage of such a thing though.

Thanks,
Nick

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Nicholas Piggin

Excerpts from Rich Felker's message of April 20, 2020 11:29 am:
> On Mon, Apr 20, 2020 at 10:27:58AM +1000, Nicholas Piggin wrote:
>> Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm:
>> > * Nicholas Piggin via Libc-alpha  [2020-04-16 
>> > 10:16:54 +1000]:
>> >> Well it would have to test HWCAP and patch in or branch to two 
>> >> completely different sequences including register save/restores yes.
>> >> You could have the same asm and matching clobbers to put the sequence
>> >> inline and then you could patch the one sc/scv instruction I suppose.
>> > 
>> > how would that 'patch' work?
>> > 
>> > there are many reasons why you don't
>> > want libc to write its .text
>> 
>> I guess I don't know what I'm talking about when it comes to libraries. 
>> Shame if there is no good way to load-time patch libc. It's orthogonal
>> to the scv selection though -- if you don't patch you have to 
>> conditional or indirect branch however you implement it.
> 
> Patched pages cannot be shared. The whole design of PIC and shared
> libraries is that the code("text")/rodata is immutable and shared and
> that only a minimal amount of data, packed tightly together (the GOT)
> has to exist per-instance.

Yeah the pages which were patched couldn't be shared across exec, which
is a significant downside, unless you could group all patch sites into
their own section and similarly pack it together (which has issues of
being out of line).

> 
> Also, allowing patching of executable pages is generally frowned upon
> these days because W^X is a desirable hardening property.

Right, it would want be write-protected after being patched.

Thanks,
Nick

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Rich Felker

On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> > Note that because lr is clobbered we need at least once normally
> > call-clobbered register that's not syscall clobbered to save lr in.
> > Otherwise stack frame setup is required to spill it.
> 
> The kernel would like to use r9-r12 for itself. We could do with fewer 
> registers, but we have some delay establishing the stack (depends on a
> load which depends on a mfspr), and entry code tends to be quite store
> heavy whereas on the caller side you have r1 set up (modulo stack 
> updates), and the system call is a long delay during which time the 
> store queue has significant time to drain.
> 
> My feeling is it would be better for kernel to have these scratch 
> registers.

If your new kernel syscall mechanism requires the caller to make a
whole stack frame it otherwise doesn't need and spill registers to it,
it becomes a lot less attractive. Some of those 90 cycles saved are
immediately lost on the userspace side, plus you either waste icache
at the call point or require the syscall to go through a
userspace-side helper function that performs the spill and restore.

The right way to do this is to have the kernel preserve enough
registers that userspace can avoid having any spills. It doesn't have
to preserve everything, probably just enough to save lr. (BTW are
syscall arg registers still preserved? If not, this is a major cost on
the userspace side, since any call point that has to loop-and-retry
(e.g. futex) now needs to make its own place to store the original
values.)

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Rich Felker

On Mon, Apr 20, 2020 at 10:27:58AM +1000, Nicholas Piggin wrote:
> Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm:
> > * Nicholas Piggin via Libc-alpha  [2020-04-16 
> > 10:16:54 +1000]:
> >> Well it would have to test HWCAP and patch in or branch to two 
> >> completely different sequences including register save/restores yes.
> >> You could have the same asm and matching clobbers to put the sequence
> >> inline and then you could patch the one sc/scv instruction I suppose.
> > 
> > how would that 'patch' work?
> > 
> > there are many reasons why you don't
> > want libc to write its .text
> 
> I guess I don't know what I'm talking about when it comes to libraries. 
> Shame if there is no good way to load-time patch libc. It's orthogonal
> to the scv selection though -- if you don't patch you have to 
> conditional or indirect branch however you implement it.

Patched pages cannot be shared. The whole design of PIC and shared
libraries is that the code("text")/rodata is immutable and shared and
that only a minimal amount of data, packed tightly together (the GOT)
has to exist per-instance.

Also, allowing patching of executable pages is generally frowned upon
these days because W^X is a desirable hardening property.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Nicholas Piggin

Excerpts from Rich Felker's message of April 17, 2020 4:31 am:
> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote:
>> 
>> 
>> On 16/04/2020 14:59, Rich Felker wrote:
>> > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
>> >>
>> >>
>> >> On 16/04/2020 12:37, Rich Felker wrote:
>> >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
>> > My preference would be that it work just like the i386 AT_SYSINFO
>> > where you just replace "int $128" with "call *%%gs:16" and the kernel
>> > provides a stub in the vdso that performs either scv or the old
>> > mechanism with the same calling convention. Then if the kernel doesn't
>> > provide it (because the kernel is too old) libc would have to provide
>> > its own stub that uses the legacy method and matches the calling
>> > convention of the one the kernel is expected to provide.
>> 
>>  What about pthread cancellation and the requirement of checking the
>>  cancellable syscall anchors in asynchronous cancellation? My plan is
>>  still to use musl strategy on glibc (BZ#12683) and for i686 it 
>>  requires to always use old int$128 for program that uses cancellation
>>  (static case) or just threads (dynamic mode, which should be more
>>  common on glibc).
>> 
>>  Using the i686 strategy of a vDSO bridge symbol would require to always
>>  fallback to 'sc' to still use the same cancellation strategy (and
>>  thus defeating this optimization in such cases).
>> >>>
>> >>> Yes, I assumed it would be the same, ignoring the new syscall
>> >>> mechanism for cancellable syscalls. While there are some exceptions,
>> >>> cancellable syscalls are generally not hot paths but things that are
>> >>> expected to block and to have significant amounts of work to do in
>> >>> kernelspace, so saving a few tens of cycles is rather pointless.
>> >>>
>> >>> It's possible to do a branch/multiple versions of the syscall asm for
>> >>> cancellation but would require extending the cancellation handler to
>> >>> support checking against multiple independent address ranges or using
>> >>> some alternate markup of them.
>> >>
>> >> The main issue is at least for glibc dynamic linking is way more common
>> >> than static linking and once the program become multithread the fallback
>> >> will be always used.
>> > 
>> > I'm not relying on static linking optimizing out the cancellable
>> > version. I'm talking about how cancellable syscalls are pretty much
>> > all "heavy" operations to begin with where a few tens of cycles are in
>> > the realm of "measurement noise" relative to the dominating time
>> > costs.
>> 
>> Yes I am aware, but at same time I am not sure how it plays on real world.
>> For instance, some workloads might issue kernel query syscalls, such as
>> recv, where buffer copying might not be dominant factor. So I see that if
>> the idea is optimizing syscall mechanism, we should try to leverage it
>> as whole in libc.
> 
> Have you timed a minimal recv? I'm not assuming buffer copying is the
> dominant factor. I'm assuming the overhead of all the kernel layers
> involved is dominant.
> 
>> >> And besides the cancellation performance issue, a new bridge vDSO 
>> >> mechanism
>> >> will still require to setup some extra bridge for the case of the older
>> >> kernel.  In the scheme you suggested:
>> >>
>> >>   __asm__("indirect call" ... with common clobbers);
>> >>
>> >> The indirect call will be either the vDSO bridge or an libc provided that
>> >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain
>> >> against:
>> >>
>> >>if (hwcap & PPC_FEATURE2_SCV) {
>> >>  __asm__(... with some clobbers);
>> >>} else {
>> >>  __asm__(... with different clobbers);
>> >>}
>> > 
>> > If the indirect call can be made roughly as efficiently as the sc
>> > sequence now (which already have some cost due to handling the nasty
>> > error return convention, making the indirect call likely just as small
>> > or smaller), it's O(1) additional code size (and thus icache usage)
>> > rather than O(n) where n is number of syscall points.
>> > 
>> > Of course it would work just as well (for avoiding O(n) growth) to
>> > have a direct call to out-of-line branch like you suggested.
>> 
>> Yes, but does it really matter to optimize this specific usage case
>> for size? glibc, for instance, tries to leverage the syscall mechanism 
>> by adding some complex pre-processor asm directives.  It optimizes
>> the syscall code size in most cases.  For instance, kill in static case 
>> generates on x86_64:
>> 
>>  <__kill>:
>>0:   b8 3e 00 00 00  mov$0x3e,%eax
>>5:   0f 05   syscall 
>>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
>>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
>>   13:   c3  retq   
>> 
>> While on musl:
>> 
>>  :
>>0:

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Nicholas Piggin

Excerpts from Adhemerval Zanella's message of April 17, 2020 4:52 am:
> 
> 
> On 16/04/2020 15:31, Rich Felker wrote:
>> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote:
>>>
>>>
>>> On 16/04/2020 14:59, Rich Felker wrote:
 On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
>
>
> On 16/04/2020 12:37, Rich Felker wrote:
>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
 My preference would be that it work just like the i386 AT_SYSINFO
 where you just replace "int $128" with "call *%%gs:16" and the kernel
 provides a stub in the vdso that performs either scv or the old
 mechanism with the same calling convention. Then if the kernel doesn't
 provide it (because the kernel is too old) libc would have to provide
 its own stub that uses the legacy method and matches the calling
 convention of the one the kernel is expected to provide.
>>>
>>> What about pthread cancellation and the requirement of checking the
>>> cancellable syscall anchors in asynchronous cancellation? My plan is
>>> still to use musl strategy on glibc (BZ#12683) and for i686 it 
>>> requires to always use old int$128 for program that uses cancellation
>>> (static case) or just threads (dynamic mode, which should be more
>>> common on glibc).
>>>
>>> Using the i686 strategy of a vDSO bridge symbol would require to always
>>> fallback to 'sc' to still use the same cancellation strategy (and
>>> thus defeating this optimization in such cases).
>>
>> Yes, I assumed it would be the same, ignoring the new syscall
>> mechanism for cancellable syscalls. While there are some exceptions,
>> cancellable syscalls are generally not hot paths but things that are
>> expected to block and to have significant amounts of work to do in
>> kernelspace, so saving a few tens of cycles is rather pointless.
>>
>> It's possible to do a branch/multiple versions of the syscall asm for
>> cancellation but would require extending the cancellation handler to
>> support checking against multiple independent address ranges or using
>> some alternate markup of them.
>
> The main issue is at least for glibc dynamic linking is way more common
> than static linking and once the program become multithread the fallback
> will be always used.

 I'm not relying on static linking optimizing out the cancellable
 version. I'm talking about how cancellable syscalls are pretty much
 all "heavy" operations to begin with where a few tens of cycles are in
 the realm of "measurement noise" relative to the dominating time
 costs.
>>>
>>> Yes I am aware, but at same time I am not sure how it plays on real world.
>>> For instance, some workloads might issue kernel query syscalls, such as
>>> recv, where buffer copying might not be dominant factor. So I see that if
>>> the idea is optimizing syscall mechanism, we should try to leverage it
>>> as whole in libc.
>> 
>> Have you timed a minimal recv? I'm not assuming buffer copying is the
>> dominant factor. I'm assuming the overhead of all the kernel layers
>> involved is dominant.
> 
> Not really, but reading the advantages of using 'scv' over 'sc' also does
> not outline the real expect gain.  Taking in consideration this should
> be a micro-optimization (focused on entry syscall patch), I think we should
> use where it possible.

It's around 90 cycles improvement, depending on config options and 
speculative mitigations in place, this may be roughly 5-20% of a gettid
syscall, which itself probably bears little relationship to what a recv
syscall doing real work would do, it's easy to swamp it with other work.

But it's a pretty big win in terms of how much we try to optimise this
path.

Thanks,
Nick

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-19 Thread Nicholas Piggin

Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm:
> * Nicholas Piggin via Libc-alpha  [2020-04-16 
> 10:16:54 +1000]:
>> Well it would have to test HWCAP and patch in or branch to two 
>> completely different sequences including register save/restores yes.
>> You could have the same asm and matching clobbers to put the sequence
>> inline and then you could patch the one sc/scv instruction I suppose.
> 
> how would that 'patch' work?
> 
> there are many reasons why you don't
> want libc to write its .text

I guess I don't know what I'm talking about when it comes to libraries. 
Shame if there is no good way to load-time patch libc. It's orthogonal
to the scv selection though -- if you don't patch you have to 
conditional or indirect branch however you implement it.

Thanks,
Nick

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-17 Thread Florian Weimer

* Segher Boessenkool:

> On Thu, Apr 16, 2020 at 08:34:42PM -0400, Rich Felker wrote:
>> On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote:
>> > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote:
>> > > > I think my choice would be just making the inline syscall be a single
>> > > > call insn to an asm source file that out-of-lines the loading of TOC
>> > > > pointer and call through it or branch based on hwcap so that it's not
>> > > > repeated all over the place.
>> > > 
>> > > I don't know how problematic control flow out of an inline asm is on
>> > > POWER.  But this is basically the -moutline-atomics approach.
>> > 
>> > Control flow out of inline asm (other than with "asm goto") is not
>> > allowed at all, just like on any other target (and will not work in
>> > practice, either -- just like on any other target).  But the suggestion
>> > was to use actual assembler code, not inline asm?
>> 
>> Calling it control flow out of inline asm is something of a misnomer.
>> The enclosing state is not discarded or altered; the asm statement
>> exits normally, reaching the next instruction in the enclosing
>> block/function as soon as the call from the asm statement returns,
>> with all register/clobber constraints satisfied.
>
> Ah.  That should always Just Work, then -- our ABIs guarantee you can.

After thinking about it, I agree: GCC will handle spilling of the link
register.  Branch-and-link instructions do not clobber the protected
zone, so no stack adjustment is needed (which would be problematic to
reflect in the unwind information).

Of course, the target function has to be written in assembler because
it must not use a regular stack frame.

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Segher Boessenkool

On Thu, Apr 16, 2020 at 08:34:42PM -0400, Rich Felker wrote:
> On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote:
> > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote:
> > > > I think my choice would be just making the inline syscall be a single
> > > > call insn to an asm source file that out-of-lines the loading of TOC
> > > > pointer and call through it or branch based on hwcap so that it's not
> > > > repeated all over the place.
> > > 
> > > I don't know how problematic control flow out of an inline asm is on
> > > POWER.  But this is basically the -moutline-atomics approach.
> > 
> > Control flow out of inline asm (other than with "asm goto") is not
> > allowed at all, just like on any other target (and will not work in
> > practice, either -- just like on any other target).  But the suggestion
> > was to use actual assembler code, not inline asm?
> 
> Calling it control flow out of inline asm is something of a misnomer.
> The enclosing state is not discarded or altered; the asm statement
> exits normally, reaching the next instruction in the enclosing
> block/function as soon as the call from the asm statement returns,
> with all register/clobber constraints satisfied.

Ah.  That should always Just Work, then -- our ABIs guarantee you can.

> Control flow out of inline asm would be more like longjmp, and it can
> be valid -- for instance, you can implement coroutines this way
> (assuming you switch stack correctly) or do longjmp this way (jumping
> to the location saved by setjmp). But it's not what'd be happening
> here.

Yeah, you cannot do that in C, not without making assumptions about what
machine code the compiler generates.  GCC explicitly disallows it, too:

 'asm' statements may not perform jumps into other 'asm' statements,
 only to the listed GOTOLABELS.  GCC's optimizers do not know about
 other jumps; therefore they cannot take account of them when
 deciding how to optimize.


Segher

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Rich Felker

On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote:
> On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote:
> > > I think my choice would be just making the inline syscall be a single
> > > call insn to an asm source file that out-of-lines the loading of TOC
> > > pointer and call through it or branch based on hwcap so that it's not
> > > repeated all over the place.
> > 
> > I don't know how problematic control flow out of an inline asm is on
> > POWER.  But this is basically the -moutline-atomics approach.
> 
> Control flow out of inline asm (other than with "asm goto") is not
> allowed at all, just like on any other target (and will not work in
> practice, either -- just like on any other target).  But the suggestion
> was to use actual assembler code, not inline asm?

Calling it control flow out of inline asm is something of a misnomer.
The enclosing state is not discarded or altered; the asm statement
exits normally, reaching the next instruction in the enclosing
block/function as soon as the call from the asm statement returns,
with all register/clobber constraints satisfied.

Control flow out of inline asm would be more like longjmp, and it can
be valid -- for instance, you can implement coroutines this way
(assuming you switch stack correctly) or do longjmp this way (jumping
to the location saved by setjmp). But it's not what'd be happening
here.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Segher Boessenkool

On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote:
> > I think my choice would be just making the inline syscall be a single
> > call insn to an asm source file that out-of-lines the loading of TOC
> > pointer and call through it or branch based on hwcap so that it's not
> > repeated all over the place.
> 
> I don't know how problematic control flow out of an inline asm is on
> POWER.  But this is basically the -moutline-atomics approach.

Control flow out of inline asm (other than with "asm goto") is not
allowed at all, just like on any other target (and will not work in
practice, either -- just like on any other target).  But the suggestion
was to use actual assembler code, not inline asm?

Segher

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Florian Weimer

* Nicholas Piggin via Libc-alpha:

> We may or may not be getting a new ABI that will use instructions not 
> supported by old processors.
>
> https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html
>
> Current ABI continues to work of course and be the default for some 
> time, but building for new one would give some opportunity to drop
> such support for old procs, at least for glibc.

If I recall correctly, during last year's GNU Tools Cauldron, I think
it was pretty clear that this was only to be used for intra-DSO ABIs,
not cross-DSO optimization.  Relocatable object files have an ABI,
too, of course, so that's why there's a ABI documentation needed.

For cross-DSO optimization, the link editor would look at the DSO
being linked in, check if it uses the -mfuture ABI, and apply some
shortcuts.  But at that point, if the DSO is swapped back to a version
built without -mfuture, it no longer works with those newly linked
binaries against the -mfuture version.  Such a thing is a clear ABI
bump, and based what I remember from Cauldron, that is not the plan
here.

(I don't have any insider knowledge—I just don't want people to read
this think: gosh, yet another POWER ABI bump.  But the PCREL stuff
*is* exciting!)

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Adhemerval Zanella




On 16/04/2020 15:31, Rich Felker wrote:
> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 16/04/2020 14:59, Rich Felker wrote:
>>> On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:


 On 16/04/2020 12:37, Rich Felker wrote:
> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
>>> My preference would be that it work just like the i386 AT_SYSINFO
>>> where you just replace "int $128" with "call *%%gs:16" and the kernel
>>> provides a stub in the vdso that performs either scv or the old
>>> mechanism with the same calling convention. Then if the kernel doesn't
>>> provide it (because the kernel is too old) libc would have to provide
>>> its own stub that uses the legacy method and matches the calling
>>> convention of the one the kernel is expected to provide.
>>
>> What about pthread cancellation and the requirement of checking the
>> cancellable syscall anchors in asynchronous cancellation? My plan is
>> still to use musl strategy on glibc (BZ#12683) and for i686 it 
>> requires to always use old int$128 for program that uses cancellation
>> (static case) or just threads (dynamic mode, which should be more
>> common on glibc).
>>
>> Using the i686 strategy of a vDSO bridge symbol would require to always
>> fallback to 'sc' to still use the same cancellation strategy (and
>> thus defeating this optimization in such cases).
>
> Yes, I assumed it would be the same, ignoring the new syscall
> mechanism for cancellable syscalls. While there are some exceptions,
> cancellable syscalls are generally not hot paths but things that are
> expected to block and to have significant amounts of work to do in
> kernelspace, so saving a few tens of cycles is rather pointless.
>
> It's possible to do a branch/multiple versions of the syscall asm for
> cancellation but would require extending the cancellation handler to
> support checking against multiple independent address ranges or using
> some alternate markup of them.

 The main issue is at least for glibc dynamic linking is way more common
 than static linking and once the program become multithread the fallback
 will be always used.
>>>
>>> I'm not relying on static linking optimizing out the cancellable
>>> version. I'm talking about how cancellable syscalls are pretty much
>>> all "heavy" operations to begin with where a few tens of cycles are in
>>> the realm of "measurement noise" relative to the dominating time
>>> costs.
>>
>> Yes I am aware, but at same time I am not sure how it plays on real world.
>> For instance, some workloads might issue kernel query syscalls, such as
>> recv, where buffer copying might not be dominant factor. So I see that if
>> the idea is optimizing syscall mechanism, we should try to leverage it
>> as whole in libc.
> 
> Have you timed a minimal recv? I'm not assuming buffer copying is the
> dominant factor. I'm assuming the overhead of all the kernel layers
> involved is dominant.

Not really, but reading the advantages of using 'scv' over 'sc' also does
not outline the real expect gain.  Taking in consideration this should
be a micro-optimization (focused on entry syscall patch), I think we should
use where it possible.

> 
 And besides the cancellation performance issue, a new bridge vDSO mechanism
 will still require to setup some extra bridge for the case of the older
 kernel.  In the scheme you suggested:

   __asm__("indirect call" ... with common clobbers);

 The indirect call will be either the vDSO bridge or an libc provided that
 fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain
 against:

if (hwcap & PPC_FEATURE2_SCV) {
  __asm__(... with some clobbers);
} else {
  __asm__(... with different clobbers);
}
>>>
>>> If the indirect call can be made roughly as efficiently as the sc
>>> sequence now (which already have some cost due to handling the nasty
>>> error return convention, making the indirect call likely just as small
>>> or smaller), it's O(1) additional code size (and thus icache usage)
>>> rather than O(n) where n is number of syscall points.
>>>
>>> Of course it would work just as well (for avoiding O(n) growth) to
>>> have a direct call to out-of-line branch like you suggested.
>>
>> Yes, but does it really matter to optimize this specific usage case
>> for size? glibc, for instance, tries to leverage the syscall mechanism 
>> by adding some complex pre-processor asm directives.  It optimizes
>> the syscall code size in most cases.  For instance, kill in static case 
>> generates on x86_64:
>>
>>  <__kill>:
>>0:   b8 3e 00 00 00  mov$0x3e,%eax
>>5:   0f 05   syscall 
>>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
>>d:   0f 83 00

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Rich Felker

On Thu, Apr 16, 2020 at 02:31:51PM -0400, Rich Felker wrote:
> > While on musl:
> > 
> >  :
> >0:   48 83 ec 08 sub$0x8,%rsp
> >4:   48 63 ffmovslq %edi,%rdi
> >7:   48 63 f6movslq %esi,%rsi
> >a:   b8 3e 00 00 00  mov$0x3e,%eax
> >f:   0f 05   syscall 
> >   11:   48 89 c7mov%rax,%rdi
> >   14:   e8 00 00 00 00  callq  19 
> >   19:   5a  pop%rdx
> >   1a:   c3  retq   
> 
> Wow that's some extraordinarily bad codegen going on by gcc... The
> sign-extension is semantically needed and I don't see a good way
> around it (glibc's asm is kinda a hack taking advantage of kernel not
> looking at high bits, I think), but the gratuitous stack adjustment
> and refusal to generate a tail call isn't. I'll see if we can track
> down what's going on and get it fixed.

It seems to be https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14441
which I've updated with a comment about the above.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Rich Felker

On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 16/04/2020 14:59, Rich Felker wrote:
> > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
> >>
> >>
> >> On 16/04/2020 12:37, Rich Felker wrote:
> >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
> > My preference would be that it work just like the i386 AT_SYSINFO
> > where you just replace "int $128" with "call *%%gs:16" and the kernel
> > provides a stub in the vdso that performs either scv or the old
> > mechanism with the same calling convention. Then if the kernel doesn't
> > provide it (because the kernel is too old) libc would have to provide
> > its own stub that uses the legacy method and matches the calling
> > convention of the one the kernel is expected to provide.
> 
>  What about pthread cancellation and the requirement of checking the
>  cancellable syscall anchors in asynchronous cancellation? My plan is
>  still to use musl strategy on glibc (BZ#12683) and for i686 it 
>  requires to always use old int$128 for program that uses cancellation
>  (static case) or just threads (dynamic mode, which should be more
>  common on glibc).
> 
>  Using the i686 strategy of a vDSO bridge symbol would require to always
>  fallback to 'sc' to still use the same cancellation strategy (and
>  thus defeating this optimization in such cases).
> >>>
> >>> Yes, I assumed it would be the same, ignoring the new syscall
> >>> mechanism for cancellable syscalls. While there are some exceptions,
> >>> cancellable syscalls are generally not hot paths but things that are
> >>> expected to block and to have significant amounts of work to do in
> >>> kernelspace, so saving a few tens of cycles is rather pointless.
> >>>
> >>> It's possible to do a branch/multiple versions of the syscall asm for
> >>> cancellation but would require extending the cancellation handler to
> >>> support checking against multiple independent address ranges or using
> >>> some alternate markup of them.
> >>
> >> The main issue is at least for glibc dynamic linking is way more common
> >> than static linking and once the program become multithread the fallback
> >> will be always used.
> > 
> > I'm not relying on static linking optimizing out the cancellable
> > version. I'm talking about how cancellable syscalls are pretty much
> > all "heavy" operations to begin with where a few tens of cycles are in
> > the realm of "measurement noise" relative to the dominating time
> > costs.
> 
> Yes I am aware, but at same time I am not sure how it plays on real world.
> For instance, some workloads might issue kernel query syscalls, such as
> recv, where buffer copying might not be dominant factor. So I see that if
> the idea is optimizing syscall mechanism, we should try to leverage it
> as whole in libc.

Have you timed a minimal recv? I'm not assuming buffer copying is the
dominant factor. I'm assuming the overhead of all the kernel layers
involved is dominant.

> >> And besides the cancellation performance issue, a new bridge vDSO mechanism
> >> will still require to setup some extra bridge for the case of the older
> >> kernel.  In the scheme you suggested:
> >>
> >>   __asm__("indirect call" ... with common clobbers);
> >>
> >> The indirect call will be either the vDSO bridge or an libc provided that
> >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain
> >> against:
> >>
> >>if (hwcap & PPC_FEATURE2_SCV) {
> >>  __asm__(... with some clobbers);
> >>} else {
> >>  __asm__(... with different clobbers);
> >>}
> > 
> > If the indirect call can be made roughly as efficiently as the sc
> > sequence now (which already have some cost due to handling the nasty
> > error return convention, making the indirect call likely just as small
> > or smaller), it's O(1) additional code size (and thus icache usage)
> > rather than O(n) where n is number of syscall points.
> > 
> > Of course it would work just as well (for avoiding O(n) growth) to
> > have a direct call to out-of-line branch like you suggested.
> 
> Yes, but does it really matter to optimize this specific usage case
> for size? glibc, for instance, tries to leverage the syscall mechanism 
> by adding some complex pre-processor asm directives.  It optimizes
> the syscall code size in most cases.  For instance, kill in static case 
> generates on x86_64:
> 
>  <__kill>:
>0:   b8 3e 00 00 00  mov$0x3e,%eax
>5:   0f 05   syscall 
>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
>   13:   c3  retq   
> 
> While on musl:
> 
>  :
>0: 48 83 ec 08 sub$0x8,%rsp
>4: 48 63 ffmovslq %edi,%rdi
>7: 48 63 f6movslq %esi,%rsi
>a: b8 3e 00 00 00  mov$0

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Adhemerval Zanella




On 16/04/2020 14:59, Rich Felker wrote:
> On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
>>
>>
>> On 16/04/2020 12:37, Rich Felker wrote:
>>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
> My preference would be that it work just like the i386 AT_SYSINFO
> where you just replace "int $128" with "call *%%gs:16" and the kernel
> provides a stub in the vdso that performs either scv or the old
> mechanism with the same calling convention. Then if the kernel doesn't
> provide it (because the kernel is too old) libc would have to provide
> its own stub that uses the legacy method and matches the calling
> convention of the one the kernel is expected to provide.

 What about pthread cancellation and the requirement of checking the
 cancellable syscall anchors in asynchronous cancellation? My plan is
 still to use musl strategy on glibc (BZ#12683) and for i686 it 
 requires to always use old int$128 for program that uses cancellation
 (static case) or just threads (dynamic mode, which should be more
 common on glibc).

 Using the i686 strategy of a vDSO bridge symbol would require to always
 fallback to 'sc' to still use the same cancellation strategy (and
 thus defeating this optimization in such cases).
>>>
>>> Yes, I assumed it would be the same, ignoring the new syscall
>>> mechanism for cancellable syscalls. While there are some exceptions,
>>> cancellable syscalls are generally not hot paths but things that are
>>> expected to block and to have significant amounts of work to do in
>>> kernelspace, so saving a few tens of cycles is rather pointless.
>>>
>>> It's possible to do a branch/multiple versions of the syscall asm for
>>> cancellation but would require extending the cancellation handler to
>>> support checking against multiple independent address ranges or using
>>> some alternate markup of them.
>>
>> The main issue is at least for glibc dynamic linking is way more common
>> than static linking and once the program become multithread the fallback
>> will be always used.
> 
> I'm not relying on static linking optimizing out the cancellable
> version. I'm talking about how cancellable syscalls are pretty much
> all "heavy" operations to begin with where a few tens of cycles are in
> the realm of "measurement noise" relative to the dominating time
> costs.

Yes I am aware, but at same time I am not sure how it plays on real world.
For instance, some workloads might issue kernel query syscalls, such as
recv, where buffer copying might not be dominant factor. So I see that if
the idea is optimizing syscall mechanism, we should try to leverage it
as whole in libc.

> 
>> And besides the cancellation performance issue, a new bridge vDSO mechanism
>> will still require to setup some extra bridge for the case of the older
>> kernel.  In the scheme you suggested:
>>
>>   __asm__("indirect call" ... with common clobbers);
>>
>> The indirect call will be either the vDSO bridge or an libc provided that
>> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain
>> against:
>>
>>if (hwcap & PPC_FEATURE2_SCV) {
>>  __asm__(... with some clobbers);
>>} else {
>>  __asm__(... with different clobbers);
>>}
> 
> If the indirect call can be made roughly as efficiently as the sc
> sequence now (which already have some cost due to handling the nasty
> error return convention, making the indirect call likely just as small
> or smaller), it's O(1) additional code size (and thus icache usage)
> rather than O(n) where n is number of syscall points.
> 
> Of course it would work just as well (for avoiding O(n) growth) to
> have a direct call to out-of-line branch like you suggested.

Yes, but does it really matter to optimize this specific usage case
for size? glibc, for instance, tries to leverage the syscall mechanism 
by adding some complex pre-processor asm directives.  It optimizes
the syscall code size in most cases.  For instance, kill in static case 
generates on x86_64:

 <__kill>:
   0:   b8 3e 00 00 00  mov$0x3e,%eax
   5:   0f 05   syscall 
   7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
   d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
  13:   c3  retq   

While on musl:

 :
   0:   48 83 ec 08 sub$0x8,%rsp
   4:   48 63 ffmovslq %edi,%rdi
   7:   48 63 f6movslq %esi,%rsi
   a:   b8 3e 00 00 00  mov$0x3e,%eax
   f:   0f 05   syscall 
  11:   48 89 c7mov%rax,%rdi
  14:   e8 00 00 00 00  callq  19 
  19:   5a  pop%rdx
  1a:   c3  retq   

But I hardly think it pays off the required code complexity.  Some
for providing a O(1) bridge: this will require additional complexity
to write it and setup correctly.

> 
>> Specially

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Florian Weimer

* Rich Felker:

> On Thu, Apr 16, 2020 at 06:42:32PM +0200, Florian Weimer wrote:
>> * Rich Felker:
>> 
>> > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote:
>> >> * Rich Felker:
>> >> 
>> >> > My preference would be that it work just like the i386 AT_SYSINFO
>> >> > where you just replace "int $128" with "call *%%gs:16" and the kernel
>> >> > provides a stub in the vdso that performs either scv or the old
>> >> > mechanism with the same calling convention.
>> >> 
>> >> The i386 mechanism has received some criticism because it provides an
>> >> effective means to redirect execution flow to anyone who can write to
>> >> the TCB.  I am not sure if it makes sense to copy it.
>> >
>> > Indeed that's a good point. Do you have ideas for making it equally
>> > efficient without use of a function pointer in the TCB?
>> 
>> We could add a shared non-writable mapping at a 64K offset from the
>> thread pointer and store the function pointer or the code there.  Then
>> it would be safe.
>> 
>> However, since this is apparently tied to POWER9 and we already have a
>> POWER9 multilib, and assuming that we are going to backport the kernel
>> change, I would tweak the selection criterion for that multilib to
>> include the new HWCAP2 flag.  If a user runs this glibc on a kernel
>> which does not have support, they will get set baseline (POWER8)
>> multilib, which still works.  This way, outside the dynamic loader, no
>> run-time dispatch is needed at all.  I guess this is not at all the
>> answer you were looking for. 8-)
>
> How does this work with -static? :-)

-static is not supported. 8-) (If you use the unsupported static
libraries, you get POWER8 code.)

(Just to be clear, in case someone doesn't get the joke: This is about
a potential approach for a heavily constrained, vertically integrated
environment.  It does not reflect general glibc recommendations.)

>> If a single binary is needed, I would perhaps follow what Arm did for
>> -moutline-atomics: lay out the code so that its easy to execute for
>> the non-POWER9 case, assuming that POWER9 machines will be better at
>> predicting things than their predecessors.
>> 
>> Or you could also put the function pointer into a RELRO segment.  Then
>> there's overlap with the __libc_single_threaded discussion, where
>> people objected to this kind of optimization (although I did not
>> propose to change the TCB ABI, that would be required for
>> __libc_single_threaded because it's an external interface).
>
> Of course you can use a normal global, but now every call point needs
> to setup a TOC pointer (= two entry points and more icache lines for
> otherwise trivial functions).
>
> I think my choice would be just making the inline syscall be a single
> call insn to an asm source file that out-of-lines the loading of TOC
> pointer and call through it or branch based on hwcap so that it's not
> repeated all over the place.

I don't know how problematic control flow out of an inline asm is on
POWER.  But this is basically the -moutline-atomics approach.

> Alternatively, it would perhaps work to just put hwcap in the TCB and
> branch on it rather than making an indirect call to a function pointer
> in the TCB, so that the worst you could do by clobbering it is execute
> the wrong syscall insn and thereby get SIGILL.

The HWCAP is already in the TCB.  I expect this is what generic glibc
builds are going to use (perhaps with a bit of tweaking favorable to
POWER8 implementations, but we'll see).

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Rich Felker

On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote:
> 
> 
> On 16/04/2020 12:37, Rich Felker wrote:
> > On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
> >>> My preference would be that it work just like the i386 AT_SYSINFO
> >>> where you just replace "int $128" with "call *%%gs:16" and the kernel
> >>> provides a stub in the vdso that performs either scv or the old
> >>> mechanism with the same calling convention. Then if the kernel doesn't
> >>> provide it (because the kernel is too old) libc would have to provide
> >>> its own stub that uses the legacy method and matches the calling
> >>> convention of the one the kernel is expected to provide.
> >>
> >> What about pthread cancellation and the requirement of checking the
> >> cancellable syscall anchors in asynchronous cancellation? My plan is
> >> still to use musl strategy on glibc (BZ#12683) and for i686 it 
> >> requires to always use old int$128 for program that uses cancellation
> >> (static case) or just threads (dynamic mode, which should be more
> >> common on glibc).
> >>
> >> Using the i686 strategy of a vDSO bridge symbol would require to always
> >> fallback to 'sc' to still use the same cancellation strategy (and
> >> thus defeating this optimization in such cases).
> > 
> > Yes, I assumed it would be the same, ignoring the new syscall
> > mechanism for cancellable syscalls. While there are some exceptions,
> > cancellable syscalls are generally not hot paths but things that are
> > expected to block and to have significant amounts of work to do in
> > kernelspace, so saving a few tens of cycles is rather pointless.
> > 
> > It's possible to do a branch/multiple versions of the syscall asm for
> > cancellation but would require extending the cancellation handler to
> > support checking against multiple independent address ranges or using
> > some alternate markup of them.
> 
> The main issue is at least for glibc dynamic linking is way more common
> than static linking and once the program become multithread the fallback
> will be always used.

I'm not relying on static linking optimizing out the cancellable
version. I'm talking about how cancellable syscalls are pretty much
all "heavy" operations to begin with where a few tens of cycles are in
the realm of "measurement noise" relative to the dominating time
costs.

> And besides the cancellation performance issue, a new bridge vDSO mechanism
> will still require to setup some extra bridge for the case of the older
> kernel.  In the scheme you suggested:
> 
>   __asm__("indirect call" ... with common clobbers);
> 
> The indirect call will be either the vDSO bridge or an libc provided that
> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain
> against:
> 
>if (hwcap & PPC_FEATURE2_SCV) {
>  __asm__(... with some clobbers);
>} else {
>  __asm__(... with different clobbers);
>}

If the indirect call can be made roughly as efficiently as the sc
sequence now (which already have some cost due to handling the nasty
error return convention, making the indirect call likely just as small
or smaller), it's O(1) additional code size (and thus icache usage)
rather than O(n) where n is number of syscall points.

Of course it would work just as well (for avoiding O(n) growth) to
have a direct call to out-of-line branch like you suggested.

> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a 
> TCB member (as we do on glibc) and if we could make the asm clever
> enough to not require different clobbers (although not sure if
> it would be possible).

The easy way not to require different clobbers is just using the union
of the clobbers, no? Does the proposed new method clobber any
call-saved registers that would make it painful (requiring new call
frames to save them in)?

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Adhemerval Zanella




On 16/04/2020 12:37, Rich Felker wrote:
> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
>>> My preference would be that it work just like the i386 AT_SYSINFO
>>> where you just replace "int $128" with "call *%%gs:16" and the kernel
>>> provides a stub in the vdso that performs either scv or the old
>>> mechanism with the same calling convention. Then if the kernel doesn't
>>> provide it (because the kernel is too old) libc would have to provide
>>> its own stub that uses the legacy method and matches the calling
>>> convention of the one the kernel is expected to provide.
>>
>> What about pthread cancellation and the requirement of checking the
>> cancellable syscall anchors in asynchronous cancellation? My plan is
>> still to use musl strategy on glibc (BZ#12683) and for i686 it 
>> requires to always use old int$128 for program that uses cancellation
>> (static case) or just threads (dynamic mode, which should be more
>> common on glibc).
>>
>> Using the i686 strategy of a vDSO bridge symbol would require to always
>> fallback to 'sc' to still use the same cancellation strategy (and
>> thus defeating this optimization in such cases).
> 
> Yes, I assumed it would be the same, ignoring the new syscall
> mechanism for cancellable syscalls. While there are some exceptions,
> cancellable syscalls are generally not hot paths but things that are
> expected to block and to have significant amounts of work to do in
> kernelspace, so saving a few tens of cycles is rather pointless.
> 
> It's possible to do a branch/multiple versions of the syscall asm for
> cancellation but would require extending the cancellation handler to
> support checking against multiple independent address ranges or using
> some alternate markup of them.

The main issue is at least for glibc dynamic linking is way more common
than static linking and once the program become multithread the fallback
will be always used.

And besides the cancellation performance issue, a new bridge vDSO mechanism
will still require to setup some extra bridge for the case of the older
kernel.  In the scheme you suggested:

  __asm__("indirect call" ... with common clobbers);

The indirect call will be either the vDSO bridge or an libc provided that
fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain
against:

   if (hwcap & PPC_FEATURE2_SCV) {
 __asm__(... with some clobbers);
   } else {
 __asm__(... with different clobbers);
   }

Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a 
TCB member (as we do on glibc) and if we could make the asm clever
enough to not require different clobbers (although not sure if
it would be possible).

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Rich Felker

On Thu, Apr 16, 2020 at 06:42:32PM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote:
> >> * Rich Felker:
> >> 
> >> > My preference would be that it work just like the i386 AT_SYSINFO
> >> > where you just replace "int $128" with "call *%%gs:16" and the kernel
> >> > provides a stub in the vdso that performs either scv or the old
> >> > mechanism with the same calling convention.
> >> 
> >> The i386 mechanism has received some criticism because it provides an
> >> effective means to redirect execution flow to anyone who can write to
> >> the TCB.  I am not sure if it makes sense to copy it.
> >
> > Indeed that's a good point. Do you have ideas for making it equally
> > efficient without use of a function pointer in the TCB?
> 
> We could add a shared non-writable mapping at a 64K offset from the
> thread pointer and store the function pointer or the code there.  Then
> it would be safe.
> 
> However, since this is apparently tied to POWER9 and we already have a
> POWER9 multilib, and assuming that we are going to backport the kernel
> change, I would tweak the selection criterion for that multilib to
> include the new HWCAP2 flag.  If a user runs this glibc on a kernel
> which does not have support, they will get set baseline (POWER8)
> multilib, which still works.  This way, outside the dynamic loader, no
> run-time dispatch is needed at all.  I guess this is not at all the
> answer you were looking for. 8-)

How does this work with -static? :-)

> If a single binary is needed, I would perhaps follow what Arm did for
> -moutline-atomics: lay out the code so that its easy to execute for
> the non-POWER9 case, assuming that POWER9 machines will be better at
> predicting things than their predecessors.
> 
> Or you could also put the function pointer into a RELRO segment.  Then
> there's overlap with the __libc_single_threaded discussion, where
> people objected to this kind of optimization (although I did not
> propose to change the TCB ABI, that would be required for
> __libc_single_threaded because it's an external interface).

Of course you can use a normal global, but now every call point needs
to setup a TOC pointer (= two entry points and more icache lines for
otherwise trivial functions).

I think my choice would be just making the inline syscall be a single
call insn to an asm source file that out-of-lines the loading of TOC
pointer and call through it or branch based on hwcap so that it's not
repeated all over the place.

Alternatively, it would perhaps work to just put hwcap in the TCB and
branch on it rather than making an indirect call to a function pointer
in the TCB, so that the worst you could do by clobbering it is execute
the wrong syscall insn and thereby get SIGILL.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Florian Weimer

* Rich Felker:

> On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote:
>> * Rich Felker:
>> 
>> > My preference would be that it work just like the i386 AT_SYSINFO
>> > where you just replace "int $128" with "call *%%gs:16" and the kernel
>> > provides a stub in the vdso that performs either scv or the old
>> > mechanism with the same calling convention.
>> 
>> The i386 mechanism has received some criticism because it provides an
>> effective means to redirect execution flow to anyone who can write to
>> the TCB.  I am not sure if it makes sense to copy it.
>
> Indeed that's a good point. Do you have ideas for making it equally
> efficient without use of a function pointer in the TCB?

We could add a shared non-writable mapping at a 64K offset from the
thread pointer and store the function pointer or the code there.  Then
it would be safe.

However, since this is apparently tied to POWER9 and we already have a
POWER9 multilib, and assuming that we are going to backport the kernel
change, I would tweak the selection criterion for that multilib to
include the new HWCAP2 flag.  If a user runs this glibc on a kernel
which does not have support, they will get set baseline (POWER8)
multilib, which still works.  This way, outside the dynamic loader, no
run-time dispatch is needed at all.  I guess this is not at all the
answer you were looking for. 8-)

If a single binary is needed, I would perhaps follow what Arm did for
-moutline-atomics: lay out the code so that its easy to execute for
the non-POWER9 case, assuming that POWER9 machines will be better at
predicting things than their predecessors.

Or you could also put the function pointer into a RELRO segment.  Then
there's overlap with the __libc_single_threaded discussion, where
people objected to this kind of optimization (although I did not
propose to change the TCB ABI, that would be required for
__libc_single_threaded because it's an external interface).

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Rich Felker

On Thu, Apr 16, 2020 at 11:21:56AM -0400, Jeffrey Walton wrote:
> On Wed, Apr 15, 2020 at 8:17 PM Nicholas Piggin  wrote:
> >
> > Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
> > > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
> > >> I would like to enable Linux support for the powerpc 'scv' instruction,
> > >> as a faster system call instruction.
> > >>
> > >> This requires two things to be defined: Firstly a way to advertise to
> > >> userspace that kernel supports scv, and a way to allocate and advertise
> > >> support for individual scv vectors. Secondly, a calling convention ABI
> > >> for this new instruction.
> > >> ...
> > > Note that any libc that actually makes use of the new functionality is
> > > not going to be able to make clobbers conditional on support for it;
> > > branching around different clobbers is going to defeat any gains vs
> > > always just treating anything clobbered by either method as clobbered.
> >
> > Well it would have to test HWCAP and patch in or branch to two
> > completely different sequences including register save/restores yes.
> > You could have the same asm and matching clobbers to put the sequence
> > inline and then you could patch the one sc/scv instruction I suppose.
> 
> Could GCC function multiversioning work here?
> https://gcc.gnu.org/wiki/FunctionMultiVersioning
> 
> It seems like selecting a runtime version of a function is the sort of
> thing you are trying to do.

On glibc it potentially could. This is ifunc-based functionality
though and musl explicitly does not (and will not) support ifunc
because of lots of fundamental problems it entails. But even on glibc
the underlying mechanisms for ifunc are just the same as a normal
indirect call and there's no real reason to prefer implementing it
with ifunc/multiversioning vs directly.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Rich Felker

On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote:
> > My preference would be that it work just like the i386 AT_SYSINFO
> > where you just replace "int $128" with "call *%%gs:16" and the kernel
> > provides a stub in the vdso that performs either scv or the old
> > mechanism with the same calling convention. Then if the kernel doesn't
> > provide it (because the kernel is too old) libc would have to provide
> > its own stub that uses the legacy method and matches the calling
> > convention of the one the kernel is expected to provide.
> 
> What about pthread cancellation and the requirement of checking the
> cancellable syscall anchors in asynchronous cancellation? My plan is
> still to use musl strategy on glibc (BZ#12683) and for i686 it 
> requires to always use old int$128 for program that uses cancellation
> (static case) or just threads (dynamic mode, which should be more
> common on glibc).
> 
> Using the i686 strategy of a vDSO bridge symbol would require to always
> fallback to 'sc' to still use the same cancellation strategy (and
> thus defeating this optimization in such cases).

Yes, I assumed it would be the same, ignoring the new syscall
mechanism for cancellable syscalls. While there are some exceptions,
cancellable syscalls are generally not hot paths but things that are
expected to block and to have significant amounts of work to do in
kernelspace, so saving a few tens of cycles is rather pointless.

It's possible to do a branch/multiple versions of the syscall asm for
cancellation but would require extending the cancellation handler to
support checking against multiple independent address ranges or using
some alternate markup of them.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Rich Felker

On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> > My preference would be that it work just like the i386 AT_SYSINFO
> > where you just replace "int $128" with "call *%%gs:16" and the kernel
> > provides a stub in the vdso that performs either scv or the old
> > mechanism with the same calling convention.
> 
> The i386 mechanism has received some criticism because it provides an
> effective means to redirect execution flow to anyone who can write to
> the TCB.  I am not sure if it makes sense to copy it.

Indeed that's a good point. Do you have ideas for making it equally
efficient without use of a function pointer in the TCB?

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Jeffrey Walton

On Wed, Apr 15, 2020 at 8:17 PM Nicholas Piggin  wrote:
>
> Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
> > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
> >> I would like to enable Linux support for the powerpc 'scv' instruction,
> >> as a faster system call instruction.
> >>
> >> This requires two things to be defined: Firstly a way to advertise to
> >> userspace that kernel supports scv, and a way to allocate and advertise
> >> support for individual scv vectors. Secondly, a calling convention ABI
> >> for this new instruction.
> >> ...
> > Note that any libc that actually makes use of the new functionality is
> > not going to be able to make clobbers conditional on support for it;
> > branching around different clobbers is going to defeat any gains vs
> > always just treating anything clobbered by either method as clobbered.
>
> Well it would have to test HWCAP and patch in or branch to two
> completely different sequences including register save/restores yes.
> You could have the same asm and matching clobbers to put the sequence
> inline and then you could patch the one sc/scv instruction I suppose.

Could GCC function multiversioning work here?
https://gcc.gnu.org/wiki/FunctionMultiVersioning

It seems like selecting a runtime version of a function is the sort of
thing you are trying to do.

Jeff

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Szabolcs Nagy

* Nicholas Piggin via Libc-alpha  [2020-04-16 
10:16:54 +1000]:
> Well it would have to test HWCAP and patch in or branch to two 
> completely different sequences including register save/restores yes.
> You could have the same asm and matching clobbers to put the sequence
> inline and then you could patch the one sc/scv instruction I suppose.

how would that 'patch' work?

there are many reasons why you don't
want libc to write its .text

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-16 Thread Adhemerval Zanella




On 15/04/2020 19:55, Rich Felker wrote:
> On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
>> I would like to enable Linux support for the powerpc 'scv' instruction,
>> as a faster system call instruction.
>>
>> This requires two things to be defined: Firstly a way to advertise to 
>> userspace that kernel supports scv, and a way to allocate and advertise
>> support for individual scv vectors. Secondly, a calling convention ABI
>> for this new instruction.
>>
>> Thanks to those who commented last time, since then I have removed my
>> answered questions and unpopular alternatives but you can find them
>> here
>>
>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html
>>
>> Let me try one more with a wider cc list, and then we'll get something
>> merged. Any questions or counter-opinions are welcome.
>>
>> System Call Vectored (scv) ABI
>> ==
>>
>> The scv instruction is introduced with POWER9 / ISA3, it comes with an
>> rfscv counter-part. The benefit of these instructions is performance
>> (trading slower SRR0/1 with faster LR/CTR registers, and entering the
>> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR 
>> updates. The scv instruction has 128 interrupt entry points (not enough 
>> to cover the Linux system call space).
>>
>> The proposal is to assign scv numbers very conservatively and allocate 
>> them as individual HWCAP features as we add support for more. The zero 
>> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.
>>
>> Advertisement
>>
>> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
>> SIGILL in current environments. Linux has defined a HWCAP2 bit 
>> PPC_FEATURE2_SCV for SCV support, but does not set it.
>>
>> When scv instruction support and the scv 0 vector for system calls are 
>> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors 
>> should not be used without future HWCAP bits indicating support, which is
>> how we will allocate them. (Should unallocated ones generate SIGILL, or
>> return -ENOSYS in r3?)
>>
>> Calling convention
>>
>> The proposal is for scv 0 to provide the standard Linux system call ABI 
>> with the following differences from sc convention[1]:
>>
>> - LR is to be volatile across scv calls. This is necessary because the 
>>   scv instruction clobbers LR. From previous discussion, this should be 
>>   possible to deal with in GCC clobbers and CFI.
>>
>> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
>>   kernel system call exit to avoid restoring the CR register (although 
>>   we probably still would anyway to avoid information leak).
>>
>> - Error handling: I think the consensus has been to move to using negative
>>   return value in r3 rather than CR0[SO]=1 to indicate error, which matches
>>   most other architectures and is closer to a function call.
>>
>> The number of scratch registers (r9-r12) at kernel entry seems 
>> sufficient that we don't have any costly spilling, patch is here[2].  
>>
>> [1] 
>> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
>> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html
> 
> My preference would be that it work just like the i386 AT_SYSINFO
> where you just replace "int $128" with "call *%%gs:16" and the kernel
> provides a stub in the vdso that performs either scv or the old
> mechanism with the same calling convention. Then if the kernel doesn't
> provide it (because the kernel is too old) libc would have to provide
> its own stub that uses the legacy method and matches the calling
> convention of the one the kernel is expected to provide.

What about pthread cancellation and the requirement of checking the
cancellable syscall anchors in asynchronous cancellation? My plan is
still to use musl strategy on glibc (BZ#12683) and for i686 it 
requires to always use old int$128 for program that uses cancellation
(static case) or just threads (dynamic mode, which should be more
common on glibc).

Using the i686 strategy of a vDSO bridge symbol would require to always
fallback to 'sc' to still use the same cancellation strategy (and
thus defeating this optimization in such cases).

> Note that any libc that actually makes use of the new functionality is
> not going to be able to make clobbers conditional on support for it;
> branching around different clobbers is going to defeat any gains vs
> always just treating anything clobbered by either method as clobbered.
> Likewise, it's not useful to have different error return mechanisms
> because the caller just has to branch to support both (or the
> kernel-provided stub just has to emulate one for it; that could work
> if you really want to change the bad existing convention).
> 
> Thoughts?
> 
> Rich
>

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-15 Thread Florian Weimer

* Rich Felker:

> My preference would be that it work just like the i386 AT_SYSINFO
> where you just replace "int $128" with "call *%%gs:16" and the kernel
> provides a stub in the vdso that performs either scv or the old
> mechanism with the same calling convention.

The i386 mechanism has received some criticism because it provides an
effective means to redirect execution flow to anyone who can write to
the TCB.  I am not sure if it makes sense to copy it.

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-15 Thread Nicholas Piggin

Excerpts from Rich Felker's message of April 16, 2020 1:03 pm:
> On Thu, Apr 16, 2020 at 12:53:31PM +1000, Nicholas Piggin wrote:
>> > Not to mention the dcache line to access
>> > __hwcap or whatever, and the icache lines to setup access TOC-relative
>> > access to it. (Of course you could put a copy of its value in TLS at a
>> > fixed offset, which would somewhat mitigate both.)
>> > 
>> >> And finally, the HWCAP test can eventually go away in future. A vdso
>> >> call can not.
>> > 
>> > We support nearly arbitrarily old kernels (with limited functionality)
>> > and hardware (with full functionality) and don't intend for that to
>> > change, ever. But indeed glibc might want too eventually drop the
>> > check.
>> 
>> Ah, cool. Any build-time flexibility there?
>> 
>> We may or may not be getting a new ABI that will use instructions not 
>> supported by old processors.
>> 
>> https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html
>> 
>> Current ABI continues to work of course and be the default for some 
>> time, but building for new one would give some opportunity to drop
>> such support for old procs, at least for glibc.
> 
> What does "new ABI" entail to you? In the terminology I use with musl,
> "new ABI" and "new ISA level" are different things. You can compile
> (explicit -march or compiler default) binaries that won't run on older
> cpus due to use of new insns etc., but we consider it the same ABI if
> you can link code for an older/baseline ISA level with the
> newer-ISA-level object files, i.e. if the interface surface for
> linkage remains compatible. We also try to avoid gratuitous
> proliferation of different ABIs unless there's a strong underlying
> need (like addition of softfloat ABIs for archs that usually have FPU,
> or vice versa).

Yeah it will be a new ABI type that also requires a new ISA level.
As far as I know (and I'm not on the toolchain side) there will be
some call compatibility between the two, so it may be fine to
continue with existing ABI for libc. But it just something that
comes to mind as a build-time cutover where we might be able to
assume particular features.

> In principle the same could be done for kernels except it's a bigger
> silent gotcha (possible ENOSYS in places where it shouldn't be able to
> happen rather than a trapping SIGILL or similar) and there's rarely
> any serious performance or size benefit to dropping support for older
> kernels.

Right, I don't think it'd be a huge problem whatever way we go,
compared with the cost of the system call.

Thanks,
Nick

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-15 Thread Rich Felker

On Thu, Apr 16, 2020 at 12:53:31PM +1000, Nicholas Piggin wrote:
> > Not to mention the dcache line to access
> > __hwcap or whatever, and the icache lines to setup access TOC-relative
> > access to it. (Of course you could put a copy of its value in TLS at a
> > fixed offset, which would somewhat mitigate both.)
> > 
> >> And finally, the HWCAP test can eventually go away in future. A vdso
> >> call can not.
> > 
> > We support nearly arbitrarily old kernels (with limited functionality)
> > and hardware (with full functionality) and don't intend for that to
> > change, ever. But indeed glibc might want too eventually drop the
> > check.
> 
> Ah, cool. Any build-time flexibility there?
> 
> We may or may not be getting a new ABI that will use instructions not 
> supported by old processors.
> 
> https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html
> 
> Current ABI continues to work of course and be the default for some 
> time, but building for new one would give some opportunity to drop
> such support for old procs, at least for glibc.

What does "new ABI" entail to you? In the terminology I use with musl,
"new ABI" and "new ISA level" are different things. You can compile
(explicit -march or compiler default) binaries that won't run on older
cpus due to use of new insns etc., but we consider it the same ABI if
you can link code for an older/baseline ISA level with the
newer-ISA-level object files, i.e. if the interface surface for
linkage remains compatible. We also try to avoid gratuitous
proliferation of different ABIs unless there's a strong underlying
need (like addition of softfloat ABIs for archs that usually have FPU,
or vice versa).

In principle the same could be done for kernels except it's a bigger
silent gotcha (possible ENOSYS in places where it shouldn't be able to
happen rather than a trapping SIGILL or similar) and there's rarely
any serious performance or size benefit to dropping support for older
kernels.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-15 Thread Nicholas Piggin

Excerpts from Rich Felker's message of April 16, 2020 12:35 pm:
> On Thu, Apr 16, 2020 at 12:24:16PM +1000, Nicholas Piggin wrote:
>> >> > Likewise, it's not useful to have different error return mechanisms
>> >> > because the caller just has to branch to support both (or the
>> >> > kernel-provided stub just has to emulate one for it; that could work
>> >> > if you really want to change the bad existing convention).
>> >> > 
>> >> > Thoughts?
>> >> 
>> >> The existing convention has to change somewhat because of the clobbers,
>> >> so I thought we could change the error return at the same time. I'm
>> >> open to not changing it and using CR0[SO], but others liked the idea.
>> >> Pro: it matches sc and vsyscall. Con: it's different from other common
>> >> archs. Performnce-wise it would really be a wash -- cost of conditional
>> >> branch is not the cmp but the mispredict.
>> > 
>> > If you do the branch on hwcap at each syscall, then you significantly
>> > increase code size of every syscall point, likely turning a bunch of
>> > trivial functions that didn't need stack frames into ones that do. You
>> > also potentially make them need a TOC pointer. Making them all just do
>> > an indirect call unconditionally (with pointer in TLS like i386?) is a
>> > lot more efficient in code size and at least as good for performance.
>> 
>> I disagree. Doing the long vdso indirect call *necessarily* requires
>> touching a new icache line, and even a new TLB entry. Indirect branches
> 
> The increase in number of icache lines from the branch at every
> syscall point is far greater than the use of a single extra icache
> line shared by all syscalls.

That's true, I was thinking of a single function that does the test and 
calls syscalls, which might be the fair comparison.

> Not to mention the dcache line to access
> __hwcap or whatever, and the icache lines to setup access TOC-relative
> access to it. (Of course you could put a copy of its value in TLS at a
> fixed offset, which would somewhat mitigate both.)
> 
>> And finally, the HWCAP test can eventually go away in future. A vdso
>> call can not.
> 
> We support nearly arbitrarily old kernels (with limited functionality)
> and hardware (with full functionality) and don't intend for that to
> change, ever. But indeed glibc might want too eventually drop the
> check.

Ah, cool. Any build-time flexibility there?

We may or may not be getting a new ABI that will use instructions not 
supported by old processors.

https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html

Current ABI continues to work of course and be the default for some 
time, but building for new one would give some opportunity to drop
such support for old procs, at least for glibc.

> 
>> If you really want to select with an indirect branch rather than
>> direct conditional, you can do that all within the library.
> 
> OK. It's a little bit more work if that's not the interface the kernel
> will give us, but it's no big deal.

Okay.

Thanks,
Nick

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-15 Thread Rich Felker

On Thu, Apr 16, 2020 at 12:24:16PM +1000, Nicholas Piggin wrote:
> >> > Likewise, it's not useful to have different error return mechanisms
> >> > because the caller just has to branch to support both (or the
> >> > kernel-provided stub just has to emulate one for it; that could work
> >> > if you really want to change the bad existing convention).
> >> > 
> >> > Thoughts?
> >> 
> >> The existing convention has to change somewhat because of the clobbers,
> >> so I thought we could change the error return at the same time. I'm
> >> open to not changing it and using CR0[SO], but others liked the idea.
> >> Pro: it matches sc and vsyscall. Con: it's different from other common
> >> archs. Performnce-wise it would really be a wash -- cost of conditional
> >> branch is not the cmp but the mispredict.
> > 
> > If you do the branch on hwcap at each syscall, then you significantly
> > increase code size of every syscall point, likely turning a bunch of
> > trivial functions that didn't need stack frames into ones that do. You
> > also potentially make them need a TOC pointer. Making them all just do
> > an indirect call unconditionally (with pointer in TLS like i386?) is a
> > lot more efficient in code size and at least as good for performance.
> 
> I disagree. Doing the long vdso indirect call *necessarily* requires
> touching a new icache line, and even a new TLB entry. Indirect branches

The increase in number of icache lines from the branch at every
syscall point is far greater than the use of a single extra icache
line shared by all syscalls. Not to mention the dcache line to access
__hwcap or whatever, and the icache lines to setup access TOC-relative
access to it. (Of course you could put a copy of its value in TLS at a
fixed offset, which would somewhat mitigate both.)

> And finally, the HWCAP test can eventually go away in future. A vdso
> call can not.

We support nearly arbitrarily old kernels (with limited functionality)
and hardware (with full functionality) and don't intend for that to
change, ever. But indeed glibc might want too eventually drop the
check.

> If you really want to select with an indirect branch rather than
> direct conditional, you can do that all within the library.

OK. It's a little bit more work if that's not the interface the kernel
will give us, but it's no big deal.

Rich

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-15 Thread Nicholas Piggin

Excerpts from Rich Felker's message of April 16, 2020 10:48 am:
> On Thu, Apr 16, 2020 at 10:16:54AM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
>> > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
>> >> I would like to enable Linux support for the powerpc 'scv' instruction,
>> >> as a faster system call instruction.
>> >> 
>> >> This requires two things to be defined: Firstly a way to advertise to 
>> >> userspace that kernel supports scv, and a way to allocate and advertise
>> >> support for individual scv vectors. Secondly, a calling convention ABI
>> >> for this new instruction.
>> >> 
>> >> Thanks to those who commented last time, since then I have removed my
>> >> answered questions and unpopular alternatives but you can find them
>> >> here
>> >> 
>> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html
>> >> 
>> >> Let me try one more with a wider cc list, and then we'll get something
>> >> merged. Any questions or counter-opinions are welcome.
>> >> 
>> >> System Call Vectored (scv) ABI
>> >> ==
>> >> 
>> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an
>> >> rfscv counter-part. The benefit of these instructions is performance
>> >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the
>> >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR 
>> >> updates. The scv instruction has 128 interrupt entry points (not enough 
>> >> to cover the Linux system call space).
>> >> 
>> >> The proposal is to assign scv numbers very conservatively and allocate 
>> >> them as individual HWCAP features as we add support for more. The zero 
>> >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.
>> >> 
>> >> Advertisement
>> >> 
>> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
>> >> SIGILL in current environments. Linux has defined a HWCAP2 bit 
>> >> PPC_FEATURE2_SCV for SCV support, but does not set it.
>> >> 
>> >> When scv instruction support and the scv 0 vector for system calls are 
>> >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors 
>> >> should not be used without future HWCAP bits indicating support, which is
>> >> how we will allocate them. (Should unallocated ones generate SIGILL, or
>> >> return -ENOSYS in r3?)
>> >> 
>> >> Calling convention
>> >> 
>> >> The proposal is for scv 0 to provide the standard Linux system call ABI 
>> >> with the following differences from sc convention[1]:
>> >> 
>> >> - LR is to be volatile across scv calls. This is necessary because the 
>> >>   scv instruction clobbers LR. From previous discussion, this should be 
>> >>   possible to deal with in GCC clobbers and CFI.
>> >> 
>> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
>> >>   kernel system call exit to avoid restoring the CR register (although 
>> >>   we probably still would anyway to avoid information leak).
>> >> 
>> >> - Error handling: I think the consensus has been to move to using negative
>> >>   return value in r3 rather than CR0[SO]=1 to indicate error, which 
>> >> matches
>> >>   most other architectures and is closer to a function call.
>> >> 
>> >> The number of scratch registers (r9-r12) at kernel entry seems 
>> >> sufficient that we don't have any costly spilling, patch is here[2].  
>> >> 
>> >> [1] 
>> >> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
>> >> [2] 
>> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840..html
>> > 
>> > My preference would be that it work just like the i386 AT_SYSINFO
>> > where you just replace "int $128" with "call *%%gs:16" and the kernel
>> > provides a stub in the vdso that performs either scv or the old
>> > mechanism with the same calling convention. Then if the kernel doesn't
>> > provide it (because the kernel is too old) libc would have to provide
>> > its own stub that uses the legacy method and matches the calling
>> > convention of the one the kernel is expected to provide.
>> 
>> I'm not sure if that's necessary. That's done on x86-32 because they
>> select different sequences to use based on the CPU running and if the host
>> kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP
>> bits and select the right sequence in libc as well I suppose.
> 
> It's not just a HWCAP. It's a contract between the kernel and
> userspace to support a particular calling convention that's not
> exposed except as the public entry point the kernel exports via
> AT_SYSINFO.

Right.

>> > Note that any libc that actually makes use of the new functionality is
>> > not going to be able to make clobbers conditional on support for it;
>> > branching around different clobbers is going to defeat any gains vs
>> > always just treating anything clobbered by either method as clobbered.
>> 
>> Well it would have to test HWCAP and

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-15 Thread Rich Felker

On Thu, Apr 16, 2020 at 10:16:54AM +1000, Nicholas Piggin wrote:
> Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
> > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
> >> I would like to enable Linux support for the powerpc 'scv' instruction,
> >> as a faster system call instruction.
> >> 
> >> This requires two things to be defined: Firstly a way to advertise to 
> >> userspace that kernel supports scv, and a way to allocate and advertise
> >> support for individual scv vectors. Secondly, a calling convention ABI
> >> for this new instruction.
> >> 
> >> Thanks to those who commented last time, since then I have removed my
> >> answered questions and unpopular alternatives but you can find them
> >> here
> >> 
> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html
> >> 
> >> Let me try one more with a wider cc list, and then we'll get something
> >> merged. Any questions or counter-opinions are welcome.
> >> 
> >> System Call Vectored (scv) ABI
> >> ==
> >> 
> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an
> >> rfscv counter-part. The benefit of these instructions is performance
> >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the
> >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR 
> >> updates. The scv instruction has 128 interrupt entry points (not enough 
> >> to cover the Linux system call space).
> >> 
> >> The proposal is to assign scv numbers very conservatively and allocate 
> >> them as individual HWCAP features as we add support for more. The zero 
> >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.
> >> 
> >> Advertisement
> >> 
> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
> >> SIGILL in current environments. Linux has defined a HWCAP2 bit 
> >> PPC_FEATURE2_SCV for SCV support, but does not set it.
> >> 
> >> When scv instruction support and the scv 0 vector for system calls are 
> >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors 
> >> should not be used without future HWCAP bits indicating support, which is
> >> how we will allocate them. (Should unallocated ones generate SIGILL, or
> >> return -ENOSYS in r3?)
> >> 
> >> Calling convention
> >> 
> >> The proposal is for scv 0 to provide the standard Linux system call ABI 
> >> with the following differences from sc convention[1]:
> >> 
> >> - LR is to be volatile across scv calls. This is necessary because the 
> >>   scv instruction clobbers LR. From previous discussion, this should be 
> >>   possible to deal with in GCC clobbers and CFI.
> >> 
> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
> >>   kernel system call exit to avoid restoring the CR register (although 
> >>   we probably still would anyway to avoid information leak).
> >> 
> >> - Error handling: I think the consensus has been to move to using negative
> >>   return value in r3 rather than CR0[SO]=1 to indicate error, which matches
> >>   most other architectures and is closer to a function call.
> >> 
> >> The number of scratch registers (r9-r12) at kernel entry seems 
> >> sufficient that we don't have any costly spilling, patch is here[2].  
> >> 
> >> [1] 
> >> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
> >> [2] 
> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840..html
> > 
> > My preference would be that it work just like the i386 AT_SYSINFO
> > where you just replace "int $128" with "call *%%gs:16" and the kernel
> > provides a stub in the vdso that performs either scv or the old
> > mechanism with the same calling convention. Then if the kernel doesn't
> > provide it (because the kernel is too old) libc would have to provide
> > its own stub that uses the legacy method and matches the calling
> > convention of the one the kernel is expected to provide.
> 
> I'm not sure if that's necessary. That's done on x86-32 because they
> select different sequences to use based on the CPU running and if the host
> kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP
> bits and select the right sequence in libc as well I suppose.

It's not just a HWCAP. It's a contract between the kernel and
userspace to support a particular calling convention that's not
exposed except as the public entry point the kernel exports via
AT_SYSINFO.

> > Note that any libc that actually makes use of the new functionality is
> > not going to be able to make clobbers conditional on support for it;
> > branching around different clobbers is going to defeat any gains vs
> > always just treating anything clobbered by either method as clobbered.
> 
> Well it would have to test HWCAP and patch in or branch to two 
> completely different sequences including register save/restores yes.
> You could have the same asm and matching clobbers to put the sequence
> inline

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-15 Thread Nicholas Piggin

Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
> On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
>> I would like to enable Linux support for the powerpc 'scv' instruction,
>> as a faster system call instruction.
>> 
>> This requires two things to be defined: Firstly a way to advertise to 
>> userspace that kernel supports scv, and a way to allocate and advertise
>> support for individual scv vectors. Secondly, a calling convention ABI
>> for this new instruction.
>> 
>> Thanks to those who commented last time, since then I have removed my
>> answered questions and unpopular alternatives but you can find them
>> here
>> 
>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html
>> 
>> Let me try one more with a wider cc list, and then we'll get something
>> merged. Any questions or counter-opinions are welcome.
>> 
>> System Call Vectored (scv) ABI
>> ==
>> 
>> The scv instruction is introduced with POWER9 / ISA3, it comes with an
>> rfscv counter-part. The benefit of these instructions is performance
>> (trading slower SRR0/1 with faster LR/CTR registers, and entering the
>> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR 
>> updates. The scv instruction has 128 interrupt entry points (not enough 
>> to cover the Linux system call space).
>> 
>> The proposal is to assign scv numbers very conservatively and allocate 
>> them as individual HWCAP features as we add support for more. The zero 
>> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.
>> 
>> Advertisement
>> 
>> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
>> SIGILL in current environments. Linux has defined a HWCAP2 bit 
>> PPC_FEATURE2_SCV for SCV support, but does not set it.
>> 
>> When scv instruction support and the scv 0 vector for system calls are 
>> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors 
>> should not be used without future HWCAP bits indicating support, which is
>> how we will allocate them. (Should unallocated ones generate SIGILL, or
>> return -ENOSYS in r3?)
>> 
>> Calling convention
>> 
>> The proposal is for scv 0 to provide the standard Linux system call ABI 
>> with the following differences from sc convention[1]:
>> 
>> - LR is to be volatile across scv calls. This is necessary because the 
>>   scv instruction clobbers LR. From previous discussion, this should be 
>>   possible to deal with in GCC clobbers and CFI.
>> 
>> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
>>   kernel system call exit to avoid restoring the CR register (although 
>>   we probably still would anyway to avoid information leak).
>> 
>> - Error handling: I think the consensus has been to move to using negative
>>   return value in r3 rather than CR0[SO]=1 to indicate error, which matches
>>   most other architectures and is closer to a function call.
>> 
>> The number of scratch registers (r9-r12) at kernel entry seems 
>> sufficient that we don't have any costly spilling, patch is here[2].  
>> 
>> [1] 
>> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
>> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html
> 
> My preference would be that it work just like the i386 AT_SYSINFO
> where you just replace "int $128" with "call *%%gs:16" and the kernel
> provides a stub in the vdso that performs either scv or the old
> mechanism with the same calling convention. Then if the kernel doesn't
> provide it (because the kernel is too old) libc would have to provide
> its own stub that uses the legacy method and matches the calling
> convention of the one the kernel is expected to provide.

I'm not sure if that's necessary. That's done on x86-32 because they
select different sequences to use based on the CPU running and if the host
kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP
bits and select the right sequence in libc as well I suppose.

> Note that any libc that actually makes use of the new functionality is
> not going to be able to make clobbers conditional on support for it;
> branching around different clobbers is going to defeat any gains vs
> always just treating anything clobbered by either method as clobbered.

Well it would have to test HWCAP and patch in or branch to two 
completely different sequences including register save/restores yes.
You could have the same asm and matching clobbers to put the sequence
inline and then you could patch the one sc/scv instruction I suppose.

A bit of logic to select between them doesn't defeat gains though,
it's about 90 cycle improvement which is a handful of branch mispredicts 
so it really is an improvement. Eventually userspace will stop 
supporting the old variant too.

> Likewise, it's not useful to have different error return mechanisms
> because the caller just has to branch to support both (or the
> kernel-provided s

Re: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-15 Thread Rich Felker

On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
> I would like to enable Linux support for the powerpc 'scv' instruction,
> as a faster system call instruction.
> 
> This requires two things to be defined: Firstly a way to advertise to 
> userspace that kernel supports scv, and a way to allocate and advertise
> support for individual scv vectors. Secondly, a calling convention ABI
> for this new instruction.
> 
> Thanks to those who commented last time, since then I have removed my
> answered questions and unpopular alternatives but you can find them
> here
> 
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html
> 
> Let me try one more with a wider cc list, and then we'll get something
> merged. Any questions or counter-opinions are welcome.
> 
> System Call Vectored (scv) ABI
> ==
> 
> The scv instruction is introduced with POWER9 / ISA3, it comes with an
> rfscv counter-part. The benefit of these instructions is performance
> (trading slower SRR0/1 with faster LR/CTR registers, and entering the
> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR 
> updates. The scv instruction has 128 interrupt entry points (not enough 
> to cover the Linux system call space).
> 
> The proposal is to assign scv numbers very conservatively and allocate 
> them as individual HWCAP features as we add support for more. The zero 
> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.
> 
> Advertisement
> 
> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
> SIGILL in current environments. Linux has defined a HWCAP2 bit 
> PPC_FEATURE2_SCV for SCV support, but does not set it.
> 
> When scv instruction support and the scv 0 vector for system calls are 
> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors 
> should not be used without future HWCAP bits indicating support, which is
> how we will allocate them. (Should unallocated ones generate SIGILL, or
> return -ENOSYS in r3?)
> 
> Calling convention
> 
> The proposal is for scv 0 to provide the standard Linux system call ABI 
> with the following differences from sc convention[1]:
> 
> - LR is to be volatile across scv calls. This is necessary because the 
>   scv instruction clobbers LR. From previous discussion, this should be 
>   possible to deal with in GCC clobbers and CFI.
> 
> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
>   kernel system call exit to avoid restoring the CR register (although 
>   we probably still would anyway to avoid information leak).
> 
> - Error handling: I think the consensus has been to move to using negative
>   return value in r3 rather than CR0[SO]=1 to indicate error, which matches
>   most other architectures and is closer to a function call.
> 
> The number of scratch registers (r9-r12) at kernel entry seems 
> sufficient that we don't have any costly spilling, patch is here[2].  
> 
> [1] 
> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html

My preference would be that it work just like the i386 AT_SYSINFO
where you just replace "int $128" with "call *%%gs:16" and the kernel
provides a stub in the vdso that performs either scv or the old
mechanism with the same calling convention. Then if the kernel doesn't
provide it (because the kernel is too old) libc would have to provide
its own stub that uses the legacy method and matches the calling
convention of the one the kernel is expected to provide.

Note that any libc that actually makes use of the new functionality is
not going to be able to make clobbers conditional on support for it;
branching around different clobbers is going to defeat any gains vs
always just treating anything clobbered by either method as clobbered.
Likewise, it's not useful to have different error return mechanisms
because the caller just has to branch to support both (or the
kernel-provided stub just has to emulate one for it; that could work
if you really want to change the bad existing convention).

Thoughts?

Rich

57 matches

Mail list logo