Re: xgetbv nondeterminism
On Sat, Jun 17, 2017 at 9:32 AM, Andy Lutomirski wrote:
> On Sat, Jun 17, 2017 at 5:51 AM, H.J. Lu wrote:
>> On Fri, Jun 16, 2017 at 11:21 PM, Andy Lutomirski wrote:
>>>>> In any event, I still don't understand the issue.  The code does this,
>>>>> effectively:
>>>>>
>>>>> PLT -> GOT
>>>>> GOT points to a stub that transfers control to ld.so
>>>>> ld.so resolves the symbol (_dl_fixup, I think)
>>>>> ld.so patches the GOT
>>>>> ld.so jumps to the resolved function
>>>>>
>>>>> As far as I can tell, the only part of the whole process that might
>>>>> touch vector registers at all is elf_ifunc_invoke().  Couldn't all the
>>>>> register saving and restoring be moved to elf_ifunc_invoke()?
>>>>
>>>> Please grep for FOREIGN_CALL in the elf directory.
>>>
>>> I grepped FOREIGN_CALL.  It has no explanation whatsoever and appears
>>> to unconditionally do nothing in the current glibc version.
>>>
>>> In f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e^, in pseudocode, it does:
>>>
>>> __thread bool must_save;
>>>
>>> RTLD_CHECK_FOREIGN_CALL: return must_save;
>>>
>>> RTLD_ENABLE_FOREIGN_CALL: old_must_save = must_save; must_save = true;
>>>
>>> RTLD_PREPARE_FOREIGN_CALL: save_state(); must_save = false;
>>>
>>> RTLD_FINALIZE_FOREIGN_CALL: if (must_save) restore_state(); must_save =
>>> old_must_save;
>>>
>>> save_state() and restore_state() operate on TLS buffers.
>>>
>>> In summary: this is not async-signal-safe.  It's also really messy --
>>> there are macros that declare local variables, and the logic isn't
>>> apparent without really digging in to all the code.
>>>
>>> I still don't see why this couldn't be:
>>>
>>> static void elf_do_foreign_stuff(args here)
>>> {
>>>     void *buf = alloca(state_size);
>>>     xsaveopt(buf); /* or open-code it if you prefer */
>>>     call_the_ifunc();
>>>     xrstor(buf);
>>> }
>>
>> As you have found out, it doesn't work this way, since
>>
>> RTLD_PREPARE_FOREIGN_CALL
>>
>> and
>>
>> RTLD_FINALIZE_FOREIGN_CALL
>>
>> are used in 2 DIFFERENT files.
> That ought to be fixable, either by rearranging code or by doing
> something like:
>
> RTLD_INIT_FOREIGN_CALL(foreign_call_state);
>
> _dl_whatever_helper(&foreign_call_state);
>
> RTLD_FINALIZE_FOREIGN_CALL(foreign_call_state);
>
> _dl_whatever_helper would do
> RTLD_PREPARE_FOREIGN_CALL(ptr_to_foreign_call_state);
>
> renaming these macros a bit might help, too.

We have no plan to do anything.

>>> If there's more than just the ifunc (malloc? profiling? printf?)
>>> then all of that could be wrapped as well.
>>
>> It has nothing to do with ifunc.
>
> What's it for, then?  I don't understand why, in a sensible ld.so
> architecture, there would ever be a call out from ld.so during runtime
> binding to anything other than an ifunc, but I realize that glibc is
> weird and ld.so might call out to libc.so for some reason.  It doesn't
> really matter, though.

ld.so has a set of minimal implementations of some library functions,
like malloc, free, ..., which are used by ld.so to bootstrap itself.
After libc.so is loaded and relocated, the full implementations from
libc.so are used in ld.so.  Since they are outside of ld.so, they are
foreign calls to ld.so, and ld.so needs to preserve the first 8 vector
registers when a foreign call is made to libc.so.

If there are any suggestions for ld.so, please discuss them at

https://sourceware.org/ml/libc-alpha/

Thanks.

-- H.J.
Re: xgetbv nondeterminism
On Sat, Jun 17, 2017 at 5:51 AM, H.J. Lu wrote: > On Fri, Jun 16, 2017 at 11:21 PM, Andy Lutomirski wrote: In any event, I still don't understand the issue. The code does this, effectively: PLT -> GOT GOT points to a stub that transfers control to ld.so ld.so resolves the symbol (_dl_fixup, I think) ld.so patches the GOT ld.so jumps to the resolved function As far as I can tell, the only part of the whole process that might touch vector registers at all is elf_ifunc_invoke(). Couldn't all the register saving and restoring be moved to elf_ifunc_invoke()? >>> >>> Please grep for FOREIGN_CALL the elf directory. >> >> I grepped FOREIGN_CALL. It has no explanation whatsoever and appears >> to unconditionally do nothing in the current glibc version. >> >> In f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e^, in pseudocode, it does: >> >> __thread bool must_save; >> >> RTLD_CHECK_FOREIGN_CALL: return must_save; >> >> RTLD_ENABLE_FOREIGN_CALL: old_must_save = must_save; must_save = true; >> >> RTLD_PREPARE_FOREIGN_CALL: save_state(); must_save = false; >> >> RTLD_FINALIZE_FOREIGN_CALL: if (must_save) restore(); must_save = >> old_must_save; >> >> save_state() and restore_state() operate on TLS buffers. >> >> In summary: this is not async-signal-safe. It's also really messy -- >> there are macros that declare local variables, and the logic isn't >> apparent without really digging in to all the code. >> >> I still don't see why this couldn't be: >> >> static void elf_do_foreign_stuff(args here) >> { >> void *buf = alloca(state_size); >> xsaveopt(buf); /* or open-code it if you prefer */ >> call_the_ifunc(); >> xrstor(buf); >> } > > As you have found out that it doesn't work this way since > > RTLD_PREPARE_FOREIGN_CALL > > and > > RTLD_FINALIZE_FOREIGN_CALL > > are used in 2 DIFFERENT files. 
That ought to be fixable, either by rearranging code or by doing
something like:

RTLD_INIT_FOREIGN_CALL(foreign_call_state);

_dl_whatever_helper(&foreign_call_state);

RTLD_FINALIZE_FOREIGN_CALL(foreign_call_state);

_dl_whatever_helper would do
RTLD_PREPARE_FOREIGN_CALL(ptr_to_foreign_call_state);

renaming these macros a bit might help, too.

>> If there's more than just the ifunc (malloc? profiling? printf?)
>> then all of that could be wrapped as well.
>
> It has nothing to do with ifunc.

What's it for, then?  I don't understand why, in a sensible ld.so
architecture, there would ever be a call out from ld.so during runtime
binding to anything other than an ifunc, but I realize that glibc is
weird and ld.so might call out to libc.so for some reason.  It doesn't
really matter, though.

>> All this stuff comes from:
>>
>> commit b48a267b8fbb885191a04cffdb4050a4d4c8a20b
>> Author: Ulrich Drepper
>> Date:   Wed Jul 29 08:33:03 2009 -0700
>>
>>     Preserve SSE registers in runtime relocations on x86-64.
>>
>>     SSE registers are used for passing parameters and must be preserved
>>     in runtime relocations.  This is inside ld.so enforced through the
>>     tests in tst-xmmymm.sh.  But the malloc routines used after startup
>>     come from libc.so and can be arbitrarily complex.  It's overkill
>>     to save the SSE registers all the time because of that.  These calls
>>     are rare.  Instead we save them on demand.  The new infrastructure
>>     put in place in this patch makes this possible and efficient.
>>
>> While I think that the control flow is a giant mess and the use of TLS
>
> Yes.
>
>> was a mistake, I think Uli had the right idea: explicitly save the
>
> Yes.
>
>> extended state only when needed.
>
> Only its implementation led to a race condition.

I'm suggesting that the races could be fixed without making the
save/restore unconditional.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 11:21 PM, Andy Lutomirski wrote:
>>> In any event, I still don't understand the issue.  The code does this,
>>> effectively:
>>>
>>> PLT -> GOT
>>> GOT points to a stub that transfers control to ld.so
>>> ld.so resolves the symbol (_dl_fixup, I think)
>>> ld.so patches the GOT
>>> ld.so jumps to the resolved function
>>>
>>> As far as I can tell, the only part of the whole process that might
>>> touch vector registers at all is elf_ifunc_invoke().  Couldn't all the
>>> register saving and restoring be moved to elf_ifunc_invoke()?
>>
>> Please grep for FOREIGN_CALL in the elf directory.
>
> I grepped FOREIGN_CALL.  It has no explanation whatsoever and appears
> to unconditionally do nothing in the current glibc version.
>
> In f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e^, in pseudocode, it does:
>
> __thread bool must_save;
>
> RTLD_CHECK_FOREIGN_CALL: return must_save;
>
> RTLD_ENABLE_FOREIGN_CALL: old_must_save = must_save; must_save = true;
>
> RTLD_PREPARE_FOREIGN_CALL: save_state(); must_save = false;
>
> RTLD_FINALIZE_FOREIGN_CALL: if (must_save) restore_state(); must_save =
> old_must_save;
>
> save_state() and restore_state() operate on TLS buffers.
>
> In summary: this is not async-signal-safe.  It's also really messy --
> there are macros that declare local variables, and the logic isn't
> apparent without really digging in to all the code.
>
> I still don't see why this couldn't be:
>
> static void elf_do_foreign_stuff(args here)
> {
>     void *buf = alloca(state_size);
>     xsaveopt(buf); /* or open-code it if you prefer */
>     call_the_ifunc();
>     xrstor(buf);
> }

As you have found out, it doesn't work this way, since

RTLD_PREPARE_FOREIGN_CALL

and

RTLD_FINALIZE_FOREIGN_CALL

are used in 2 DIFFERENT files.

> If there's more than just the ifunc (malloc? profiling? printf?)
> then all of that could be wrapped as well.

It has nothing to do with ifunc.
> All this stuff comes from:
>
> commit b48a267b8fbb885191a04cffdb4050a4d4c8a20b
> Author: Ulrich Drepper
> Date:   Wed Jul 29 08:33:03 2009 -0700
>
>     Preserve SSE registers in runtime relocations on x86-64.
>
>     SSE registers are used for passing parameters and must be preserved
>     in runtime relocations.  This is inside ld.so enforced through the
>     tests in tst-xmmymm.sh.  But the malloc routines used after startup
>     come from libc.so and can be arbitrarily complex.  It's overkill
>     to save the SSE registers all the time because of that.  These calls
>     are rare.  Instead we save them on demand.  The new infrastructure
>     put in place in this patch makes this possible and efficient.
>
> While I think that the control flow is a giant mess and the use of TLS

Yes.

> was a mistake, I think Uli had the right idea: explicitly save the

Yes.

> extended state only when needed.

Only its implementation led to a race condition.

-- H.J.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 11:03 AM, H.J. Lu wrote: > On Fri, Jun 16, 2017 at 10:56 AM, Andy Lutomirski wrote: >> On Fri, Jun 16, 2017 at 10:44 AM, H.J. Lu wrote: >>> On Fri, Jun 16, 2017 at 9:38 AM, Andy Lutomirski wrote: On Fri, Jun 16, 2017 at 9:17 AM, H.J. Lu wrote: > On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: >>> On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski >>> wrote: On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski > wrote: >> On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: >>> It is used for lazy binding the first time when an external >>> function is called. >>> >> >> Maybe I'm just being dense, but why? What does ld.so need to do to >> resolve a symbol and update the GOT that requires using extended >> state? > > Since the first 8 vector registers are used to pass function > parameters > and ld.so uses vector registers, _dl_runtime_resolve needs to preserve > the first 8 vector registers when transferring control to ld.so. > Wouldn't it be faster and more future-proof to recompile the relevant parts of ld.so to avoid using extended state? >>> >>> Are you suggesting not to use vector in ld.so? >> >> Yes, exactly. >> >>> We used to do that >>> several years ago, which leads to some subtle bugs, like >>> >>> https://sourceware.org/bugzilla/show_bug.cgi?id=15128 >> >> I don't think x86_64 has the issue that ARM has there. The Linux >> kernel, for example, has always been compiled to not use vector or >> floating point registers on x86 (32 and 64), and it works fine. Linux >> doesn't save extended regs on kernel entry and it doesn't restore them >> on exit. >> >> I would suggest that ld.so be compiled without use of vector >> registers, that the normal lazy binding path not try to save any extra >> regs, and that ifuncs be called through a thunk that saves whatever >> registers need saving, possibly just using XSAVEOPT. 
After all, ifunc >> is used for only a tiny fraction of symbols. > > x86-64 was the only target which used FOREIGN_CALL macros > in ld.so, FOREIGN_CALL macros were the cause of race condition > in ld.so: > > https://sourceware.org/bugzilla/show_bug.cgi?id=11214 > > Not to save and restore the first 8 vector registers means that > FOREIGN_CALL macros have to be used. We don't want to > do that on x86-64. > > You're talking about this, right: commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e Author: H.J. Lu Date: Tue Aug 25 04:33:54 2015 -0700 Save and restore vector registers in x86-64 ld.so It seems to me that the problem wasn't that the save/restore happened on some of the time -- it was that the save and restore code used a TLS variable to track its own state. Shouldn't it have been a stack variable or even just implicit in the control flow? >>> >>> No, it can't use stack variable since _dl_runtime_resolve never >>> returns. >> >> I haven't dug all the way through the source, but surely ifuncs are >> CALLed, not JMPed to. That means you have a stack somewhere. This >> stuff is mostly written in C, and local variables should work just >> fine. >> >>> In any case, glibc is effectively doing a foreign call anyway, right? >>> >>> No. >>> It's doing the foreign call to itself on every lazy binding resolution, though, which seems quite expensive. I'm saying that it seems like it would be more sensible to do the complicated foreign call logic only when doing the dangerous case, which is when lazy binding calls an ifunc. If I were to rewrite this, I would do it like this: void *call_runtime_ifunc(void (*ifunc)()); // or whatever the signature needs to be >>> >>> It is unrelated to IFUNC. This is how external function call works. >> >> External function call to what external function? Are there any calls >> to any non-IFUNC external functions that are triggered by runtime >> resolution? >> >> In any event, I still don't understand the issue. 
>> The code does this, effectively:
>>
>> PLT -> GOT
>> GOT points to a stub that transfers control to ld.so
>> ld.so resolves the symbol (_dl_fixup, I think)
>> ld.so patches the GOT
>> ld.so jumps to the resolved function
>>
>> As far as I can tell, the only part of the whole process that might
>> touch vector registers at all is elf_ifunc_invoke().  Couldn't all the
>> register saving and restoring be moved to elf_ifunc_invoke()?
>
> Please grep for FOREIGN_CALL in the elf directory.

I grepped FOREIGN_CALL.  It has no explanation whatsoever and appears
to unconditionally do nothing in the current glibc version.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 10:56 AM, Andy Lutomirski wrote: > On Fri, Jun 16, 2017 at 10:44 AM, H.J. Lu wrote: >> On Fri, Jun 16, 2017 at 9:38 AM, Andy Lutomirski wrote: >>> On Fri, Jun 16, 2017 at 9:17 AM, H.J. Lu wrote: On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: >> It is used for lazy binding the first time when an external function >> is called. >> > > Maybe I'm just being dense, but why? What does ld.so need to do to > resolve a symbol and update the GOT that requires using extended > state? Since the first 8 vector registers are used to pass function parameters and ld.so uses vector registers, _dl_runtime_resolve needs to preserve the first 8 vector registers when transferring control to ld.so. >>> >>> Wouldn't it be faster and more future-proof to recompile the relevant >>> parts of ld.so to avoid using extended state? >>> >> >> Are you suggesting not to use vector in ld.so? > > Yes, exactly. > >> We used to do that >> several years ago, which leads to some subtle bugs, like >> >> https://sourceware.org/bugzilla/show_bug.cgi?id=15128 > > I don't think x86_64 has the issue that ARM has there. The Linux > kernel, for example, has always been compiled to not use vector or > floating point registers on x86 (32 and 64), and it works fine. Linux > doesn't save extended regs on kernel entry and it doesn't restore them > on exit. > > I would suggest that ld.so be compiled without use of vector > registers, that the normal lazy binding path not try to save any extra > regs, and that ifuncs be called through a thunk that saves whatever > registers need saving, possibly just using XSAVEOPT. After all, ifunc > is used for only a tiny fraction of symbols. 
x86-64 was the only target which used FOREIGN_CALL macros in ld.so, FOREIGN_CALL macros were the cause of race condition in ld.so: https://sourceware.org/bugzilla/show_bug.cgi?id=11214 Not to save and restore the first 8 vector registers means that FOREIGN_CALL macros have to be used. We don't want to do that on x86-64. >>> >>> You're talking about this, right: >>> >>> commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e >>> Author: H.J. Lu >>> Date: Tue Aug 25 04:33:54 2015 -0700 >>> >>> Save and restore vector registers in x86-64 ld.so >>> >>> It seems to me that the problem wasn't that the save/restore happened >>> on some of the time -- it was that the save and restore code used a >>> TLS variable to track its own state. Shouldn't it have been a stack >>> variable or even just implicit in the control flow? >> >> No, it can't use stack variable since _dl_runtime_resolve never >> returns. > > I haven't dug all the way through the source, but surely ifuncs are > CALLed, not JMPed to. That means you have a stack somewhere. This > stuff is mostly written in C, and local variables should work just > fine. > >> >>> In any case, glibc is effectively doing a foreign call anyway, right? >> >> No. >> >>> It's doing the foreign call to itself on every lazy binding >>> resolution, though, which seems quite expensive. I'm saying that it >>> seems like it would be more sensible to do the complicated foreign >>> call logic only when doing the dangerous case, which is when lazy >>> binding calls an ifunc. >>> >>> If I were to rewrite this, I would do it like this: >>> >>> void *call_runtime_ifunc(void (*ifunc)()); // or whatever the >>> signature needs to be >> >> It is unrelated to IFUNC. This is how external function call works. > > External function call to what external function? Are there any calls > to any non-IFUNC external functions that are triggered by runtime > resolution? > > In any event, I still don't understand the issue. 
> The code does this, effectively:
>
> PLT -> GOT
> GOT points to a stub that transfers control to ld.so
> ld.so resolves the symbol (_dl_fixup, I think)
> ld.so patches the GOT
> ld.so jumps to the resolved function
>
> As far as I can tell, the only part of the whole process that might
> touch vector registers at all is elf_ifunc_invoke().  Couldn't all the
> register saving and restoring be moved to elf_ifunc_invoke()?

Please grep for FOREIGN_CALL in the elf directory.

-- H.J.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 10:44 AM, H.J. Lu wrote: > On Fri, Jun 16, 2017 at 9:38 AM, Andy Lutomirski wrote: >> On Fri, Jun 16, 2017 at 9:17 AM, H.J. Lu wrote: >>> On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski wrote: On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: >>> On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski >>> wrote: On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: > It is used for lazy binding the first time when an external function > is called. > Maybe I'm just being dense, but why? What does ld.so need to do to resolve a symbol and update the GOT that requires using extended state? >>> >>> Since the first 8 vector registers are used to pass function parameters >>> and ld.so uses vector registers, _dl_runtime_resolve needs to preserve >>> the first 8 vector registers when transferring control to ld.so. >>> >> >> Wouldn't it be faster and more future-proof to recompile the relevant >> parts of ld.so to avoid using extended state? >> > > Are you suggesting not to use vector in ld.so? Yes, exactly. > We used to do that > several years ago, which leads to some subtle bugs, like > > https://sourceware.org/bugzilla/show_bug.cgi?id=15128 I don't think x86_64 has the issue that ARM has there. The Linux kernel, for example, has always been compiled to not use vector or floating point registers on x86 (32 and 64), and it works fine. Linux doesn't save extended regs on kernel entry and it doesn't restore them on exit. I would suggest that ld.so be compiled without use of vector registers, that the normal lazy binding path not try to save any extra regs, and that ifuncs be called through a thunk that saves whatever registers need saving, possibly just using XSAVEOPT. After all, ifunc is used for only a tiny fraction of symbols. 
>>> >>> x86-64 was the only target which used FOREIGN_CALL macros >>> in ld.so, FOREIGN_CALL macros were the cause of race condition >>> in ld.so: >>> >>> https://sourceware.org/bugzilla/show_bug.cgi?id=11214 >>> >>> Not to save and restore the first 8 vector registers means that >>> FOREIGN_CALL macros have to be used. We don't want to >>> do that on x86-64. >>> >>> >> >> You're talking about this, right: >> >> commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e >> Author: H.J. Lu >> Date: Tue Aug 25 04:33:54 2015 -0700 >> >> Save and restore vector registers in x86-64 ld.so >> >> It seems to me that the problem wasn't that the save/restore happened >> on some of the time -- it was that the save and restore code used a >> TLS variable to track its own state. Shouldn't it have been a stack >> variable or even just implicit in the control flow? > > No, it can't use stack variable since _dl_runtime_resolve never > returns. I haven't dug all the way through the source, but surely ifuncs are CALLed, not JMPed to. That means you have a stack somewhere. This stuff is mostly written in C, and local variables should work just fine. > >> In any case, glibc is effectively doing a foreign call anyway, right? > > No. > >> It's doing the foreign call to itself on every lazy binding >> resolution, though, which seems quite expensive. I'm saying that it >> seems like it would be more sensible to do the complicated foreign >> call logic only when doing the dangerous case, which is when lazy >> binding calls an ifunc. >> >> If I were to rewrite this, I would do it like this: >> >> void *call_runtime_ifunc(void (*ifunc)()); // or whatever the >> signature needs to be > > It is unrelated to IFUNC. This is how external function call works. External function call to what external function? Are there any calls to any non-IFUNC external functions that are triggered by runtime resolution? In any event, I still don't understand the issue. 
The code does this, effectively:

PLT -> GOT
GOT points to a stub that transfers control to ld.so
ld.so resolves the symbol (_dl_fixup, I think)
ld.so patches the GOT
ld.so jumps to the resolved function

As far as I can tell, the only part of the whole process that might
touch vector registers at all is elf_ifunc_invoke().  Couldn't all the
register saving and restoring be moved to elf_ifunc_invoke()?

>> call_runtime_ifunc would be implemented in asm (or maybe even C!) and
>> would use XSAVEOPT or similar to save the state to a buffer on the
>> stack.  Then it would call the ifunc and restore the state.  No TLS
>> needed, so there wouldn't be any races.  In fact, it would work very
>> much like your current save/restore code, except that it wouldn't need
>> to be as highly optimized because it would be called much less
>> frequently.  This should improve performance and could be quite a bit
>> simpler.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 9:38 AM, Andy Lutomirski wrote: > On Fri, Jun 16, 2017 at 9:17 AM, H.J. Lu wrote: >> On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: It is used for lazy binding the first time when an external function is called. >>> >>> Maybe I'm just being dense, but why? What does ld.so need to do to >>> resolve a symbol and update the GOT that requires using extended >>> state? >> >> Since the first 8 vector registers are used to pass function parameters >> and ld.so uses vector registers, _dl_runtime_resolve needs to preserve >> the first 8 vector registers when transferring control to ld.so. >> > > Wouldn't it be faster and more future-proof to recompile the relevant > parts of ld.so to avoid using extended state? > Are you suggesting not to use vector in ld.so? >>> >>> Yes, exactly. >>> We used to do that several years ago, which leads to some subtle bugs, like https://sourceware.org/bugzilla/show_bug.cgi?id=15128 >>> >>> I don't think x86_64 has the issue that ARM has there. The Linux >>> kernel, for example, has always been compiled to not use vector or >>> floating point registers on x86 (32 and 64), and it works fine. Linux >>> doesn't save extended regs on kernel entry and it doesn't restore them >>> on exit. >>> >>> I would suggest that ld.so be compiled without use of vector >>> registers, that the normal lazy binding path not try to save any extra >>> regs, and that ifuncs be called through a thunk that saves whatever >>> registers need saving, possibly just using XSAVEOPT. After all, ifunc >>> is used for only a tiny fraction of symbols. 
>> >> x86-64 was the only target which used FOREIGN_CALL macros >> in ld.so, FOREIGN_CALL macros were the cause of race condition >> in ld.so: >> >> https://sourceware.org/bugzilla/show_bug.cgi?id=11214 >> >> Not to save and restore the first 8 vector registers means that >> FOREIGN_CALL macros have to be used. We don't want to >> do that on x86-64. >> >> > > You're talking about this, right: > > commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e > Author: H.J. Lu > Date: Tue Aug 25 04:33:54 2015 -0700 > > Save and restore vector registers in x86-64 ld.so > > It seems to me that the problem wasn't that the save/restore happened > on some of the time -- it was that the save and restore code used a > TLS variable to track its own state. Shouldn't it have been a stack > variable or even just implicit in the control flow? No, it can't use stack variable since _dl_runtime_resolve never returns. > In any case, glibc is effectively doing a foreign call anyway, right? No. > It's doing the foreign call to itself on every lazy binding > resolution, though, which seems quite expensive. I'm saying that it > seems like it would be more sensible to do the complicated foreign > call logic only when doing the dangerous case, which is when lazy > binding calls an ifunc. > > If I were to rewrite this, I would do it like this: > > void *call_runtime_ifunc(void (*ifunc)()); // or whatever the > signature needs to be It is unrelated to IFUNC. This is how external function call works. > call_runtime_ifunc would be implemented in asm (or maybe even C!) and > would use XSAVEOPT or similar to save the state to a buffer on the > stack. Then it would call the ifunc and restore the state. No TLS > needed, so there wouldn't be any races. In fact, it would work very > much like your current save/restore code, except that it wouldn't need > to be as highly optimized because it would be called much less > frequently. This should improve performance and could be quite a bit > simpler. 
> As an aside, why is saving the first eight registers enough?  I don't
> think there's any particular guarantee that a call through the GOT
> uses the psABI, is there?  Compilers can and do produce custom calling
> conventions, and ISTM that some day a compiler might do that between
> DSOs.  Or those DSOs might not be written in C in the first place.

The result is undefined if the psABI isn't followed.

-- H.J.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 9:17 AM, H.J. Lu wrote: > On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: >>> On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: >>> It is used for lazy binding the first time when an external function is >>> called. >>> >> >> Maybe I'm just being dense, but why? What does ld.so need to do to >> resolve a symbol and update the GOT that requires using extended >> state? > > Since the first 8 vector registers are used to pass function parameters > and ld.so uses vector registers, _dl_runtime_resolve needs to preserve > the first 8 vector registers when transferring control to ld.so. > Wouldn't it be faster and more future-proof to recompile the relevant parts of ld.so to avoid using extended state? >>> >>> Are you suggesting not to use vector in ld.so? >> >> Yes, exactly. >> >>> We used to do that >>> several years ago, which leads to some subtle bugs, like >>> >>> https://sourceware.org/bugzilla/show_bug.cgi?id=15128 >> >> I don't think x86_64 has the issue that ARM has there. The Linux >> kernel, for example, has always been compiled to not use vector or >> floating point registers on x86 (32 and 64), and it works fine. Linux >> doesn't save extended regs on kernel entry and it doesn't restore them >> on exit. >> >> I would suggest that ld.so be compiled without use of vector >> registers, that the normal lazy binding path not try to save any extra >> regs, and that ifuncs be called through a thunk that saves whatever >> registers need saving, possibly just using XSAVEOPT. After all, ifunc >> is used for only a tiny fraction of symbols. 
> > x86-64 was the only target which used FOREIGN_CALL macros > in ld.so, FOREIGN_CALL macros were the cause of race condition > in ld.so: > > https://sourceware.org/bugzilla/show_bug.cgi?id=11214 > > Not to save and restore the first 8 vector registers means that > FOREIGN_CALL macros have to be used. We don't want to > do that on x86-64. > > You're talking about this, right: commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e Author: H.J. Lu Date: Tue Aug 25 04:33:54 2015 -0700 Save and restore vector registers in x86-64 ld.so It seems to me that the problem wasn't that the save/restore happened on some of the time -- it was that the save and restore code used a TLS variable to track its own state. Shouldn't it have been a stack variable or even just implicit in the control flow? In any case, glibc is effectively doing a foreign call anyway, right? It's doing the foreign call to itself on every lazy binding resolution, though, which seems quite expensive. I'm saying that it seems like it would be more sensible to do the complicated foreign call logic only when doing the dangerous case, which is when lazy binding calls an ifunc. If I were to rewrite this, I would do it like this: void *call_runtime_ifunc(void (*ifunc)()); // or whatever the signature needs to be call_runtime_ifunc would be implemented in asm (or maybe even C!) and would use XSAVEOPT or similar to save the state to a buffer on the stack. Then it would call the ifunc and restore the state. No TLS needed, so there wouldn't be any races. In fact, it would work very much like your current save/restore code, except that it wouldn't need to be as highly optimized because it would be called much less frequently. This should improve performance and could be quite a bit simpler. As an aside, why is saving the first eight registers enough? I don't think there's any particular guarantee that a call through the GOT uses the psABI, is there? 
Compilers can and do produce custom calling conventions, and ISTM that some day a compiler might do that between DSOs. Or those DSOs might not be written in C in the first place.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: >> It is used for lazy binding the first time when an external function is >> called. >> > > Maybe I'm just being dense, but why? What does ld.so need to do to > resolve a symbol and update the GOT that requires using extended > state? Since the first 8 vector registers are used to pass function parameters and ld.so uses vector registers, _dl_runtime_resolve needs to preserve the first 8 vector registers when transferring control to ld.so. >>> >>> Wouldn't it be faster and more future-proof to recompile the relevant >>> parts of ld.so to avoid using extended state? >>> >> >> Are you suggesting not to use vector in ld.so? > > Yes, exactly. > >> We used to do that >> several years ago, which leads to some subtle bugs, like >> >> https://sourceware.org/bugzilla/show_bug.cgi?id=15128 > > I don't think x86_64 has the issue that ARM has there. The Linux > kernel, for example, has always been compiled to not use vector or > floating point registers on x86 (32 and 64), and it works fine. Linux > doesn't save extended regs on kernel entry and it doesn't restore them > on exit. > > I would suggest that ld.so be compiled without use of vector > registers, that the normal lazy binding path not try to save any extra > regs, and that ifuncs be called through a thunk that saves whatever > registers need saving, possibly just using XSAVEOPT. After all, ifunc > is used for only a tiny fraction of symbols. 
x86-64 was the only target which used the FOREIGN_CALL macros in ld.so, and the FOREIGN_CALL macros were the cause of a race condition in ld.so: https://sourceware.org/bugzilla/show_bug.cgi?id=11214 Not saving and restoring the first 8 vector registers means that the FOREIGN_CALL macros have to be used. We don't want to do that on x86-64. -- H.J.
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: >>> On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: > It is used for lazy binding the first time when an external function is > called. > Maybe I'm just being dense, but why? What does ld.so need to do to resolve a symbol and update the GOT that requires using extended state? >>> >>> Since the first 8 vector registers are used to pass function parameters >>> and ld.so uses vector registers, _dl_runtime_resolve needs to preserve >>> the first 8 vector registers when transferring control to ld.so. >>> >> >> Wouldn't it be faster and more future-proof to recompile the relevant >> parts of ld.so to avoid using extended state? >> > > Are you suggesting not to use vector in ld.so? Yes, exactly. > We used to do that > several years ago, which leads to some subtle bugs, like > > https://sourceware.org/bugzilla/show_bug.cgi?id=15128 I don't think x86_64 has the issue that ARM has there. The Linux kernel, for example, has always been compiled to not use vector or floating point registers on x86 (32 and 64), and it works fine. Linux doesn't save extended regs on kernel entry and it doesn't restore them on exit. I would suggest that ld.so be compiled without use of vector registers, that the normal lazy binding path not try to save any extra regs, and that ifuncs be called through a thunk that saves whatever registers need saving, possibly just using XSAVEOPT. After all, ifunc is used for only a tiny fraction of symbols.
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: On Thu, Jun 15, 2017 at 3:45 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 3:40 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen >>> wrote: On 06/14/2017 10:18 PM, Andy Lutomirski wrote: > Dave, why is XINUSE exposed at all to userspace? You need it for XSAVEOPT when it is using the init optimization to be able to tell which state was written and which state in the XSAVE buffer is potentially stale with respect to what's in the registers. I guess you can just use XSAVE instead of XSAVEOPT, though. As you pointed out, if you are using XSAVEC's compaction features by leaving bits unset in the requested feature bitmap registers, you have no idea how much data XSAVEC will write, unless you read XINUSE with XGETBV. But, you can get around *that* by just presizing the XSAVE buffer to be big. >>> >>> I imagine that, if you're going to save, do something quick, and >>> restore, you'd be better off allocating a big buffer rather than >>> trying to find the smallest buffer you can get away with by reading >>> XINUSE. Also, what happens if XINUSE nondeterministically changes out >>> from under you before you do XSAVEC? I assume you can avoid this >>> becoming a problem by using RFBM carefully. >>> So, I guess that leaves its use to just figuring out how much XSAVEOPT (and friends) are going to write. > To be fair, glibc uses this new XGETBV feature, but I suspect its > usage is rather dubious. Shouldn't it just do XSAVEC directly rather > than rolling its own code? A quick grep through my glibc source only shows XGETBV(0) used which reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss it. 
>>> >>> Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. >> >> I wrote a test to compare latency against different approaches. This >> is on Skylake: >> >> [hjl@gnu-skl-1 glibc-test]$ make >> ./test >> move: 47212 >> fxsave : 719440 >> xsave : 925146 >> xsavec : 811036 >> xsave_state_size: 1088 >> xsave_state_comp_size: 896 >> >> load/store is about 17X faster than xsavec. >> >> I put my hjl/pr21265/xsavec branch at >> >> https://sourceware.org/git/?p=glibc.git;a=summary >> >> It uses xsave/xsave/xsavec in _dl_runtime_resolve. > > What is this used for? Is it just to avoid clobbering argument regs > when resolving a symbol that uses an ifunc, or is there more to it? It is used for lazy binding the first time when an external function is called. >>> >>> Maybe I'm just being dense, but why? What does ld.so need to do to >>> resolve a symbol and update the GOT that requires using extended >>> state? >> >> Since the first 8 vector registers are used to pass function parameters >> and ld.so uses vector registers, _dl_runtime_resolve needs to preserve >> the first 8 vector registers when transferring control to ld.so. >> > > Wouldn't it be faster and more future-proof to recompile the relevant > parts of ld.so to avoid using extended state? > Are you suggesting not to use vector in ld.so? We used to do that several years ago, which leads to some subtle bugs, like https://sourceware.org/bugzilla/show_bug.cgi?id=15128 Also x86-64 was the only target which used FOREIGN_CALL macros in ld.so, FOREIGN_CALL macros were the cause of race condition in ld.so: https://sourceware.org/bugzilla/show_bug.cgi?id=11214 -- H.J.
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: >>> On Thu, Jun 15, 2017 at 3:45 PM, Andy Lutomirski wrote: On Thu, Jun 15, 2017 at 3:40 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen >> wrote: >>> On 06/14/2017 10:18 PM, Andy Lutomirski wrote: Dave, why is XINUSE exposed at all to userspace? >>> >>> You need it for XSAVEOPT when it is using the init optimization to be >>> able to tell which state was written and which state in the XSAVE buffer >>> is potentially stale with respect to what's in the registers. I guess >>> you can just use XSAVE instead of XSAVEOPT, though. >>> >>> As you pointed out, if you are using XSAVEC's compaction features by >>> leaving bits unset in the requested feature bitmap registers, you have >>> no idea how much data XSAVEC will write, unless you read XINUSE with >>> XGETBV. But, you can get around *that* by just presizing the XSAVE >>> buffer to be big. >> >> I imagine that, if you're going to save, do something quick, and >> restore, you'd be better off allocating a big buffer rather than >> trying to find the smallest buffer you can get away with by reading >> XINUSE. Also, what happens if XINUSE nondeterministically changes out >> from under you before you do XSAVEC? I assume you can avoid this >> becoming a problem by using RFBM carefully. >> >>> >>> So, I guess that leaves its use to just figuring out how much XSAVEOPT >>> (and friends) are going to write. >>> To be fair, glibc uses this new XGETBV feature, but I suspect its usage is rather dubious. Shouldn't it just do XSAVEC directly rather than rolling its own code? >>> >>> A quick grep through my glibc source only shows XGETBV(0) used which >>> reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss >>> it. 
>> >> Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. > > I wrote a test to compare latency against different approaches. This > is on Skylake: > > [hjl@gnu-skl-1 glibc-test]$ make > ./test > move: 47212 > fxsave : 719440 > xsave : 925146 > xsavec : 811036 > xsave_state_size: 1088 > xsave_state_comp_size: 896 > > load/store is about 17X faster than xsavec. > > I put my hjl/pr21265/xsavec branch at > > https://sourceware.org/git/?p=glibc.git;a=summary > > It uses xsave/xsave/xsavec in _dl_runtime_resolve. What is this used for? Is it just to avoid clobbering argument regs when resolving a symbol that uses an ifunc, or is there more to it? >>> >>> It is used for lazy binding the first time when an external function is >>> called. >>> >> >> Maybe I'm just being dense, but why? What does ld.so need to do to >> resolve a symbol and update the GOT that requires using extended >> state? > > Since the first 8 vector registers are used to pass function parameters > and ld.so uses vector registers, _dl_runtime_resolve needs to preserve > the first 8 vector registers when transferring control to ld.so. > Wouldn't it be faster and more future-proof to recompile the relevant parts of ld.so to avoid using extended state? --Andy
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 4:37 PM, Dave Hansen wrote: > On 06/15/2017 03:18 PM, Andy Lutomirski wrote: >>> As you pointed out, if you are using XSAVEC's compaction features by >>> leaving bits unset in the requested feature bitmap registers, you have >>> no idea how much data XSAVEC will write, unless you read XINUSE with >>> XGETBV. But, you can get around *that* by just presizing the XSAVE >>> buffer to be big. >> I imagine that, if you're going to save, do something quick, and >> restore, you'd be better off allocating a big buffer rather than >> trying to find the smallest buffer you can get away with by reading >> XINUSE. Also, what happens if XINUSE nondeterministically changes out >> from under you before you do XSAVEC? I assume you can avoid this >> becoming a problem by using RFBM carefully. > > That's a good point. HJ, does your XGETBV1 code disable signals, btw? > A signal could theoretically change XINUSE if the handler modified the > on-stack XSAVE state before sigreturn. We are preserving the first 8 vector registers used by the caller to pass function parameters. As long as a signal doesn't change XINUSE from yes to no, we are OK. > Also, your glibc patch talks a lot about the upper parts of the registers > being zeroed, but that isn't precisely what XGETBV1 does, right? It > tells you whether the upper portion of the registers are in the init > state. But, the high parts of the registers could be zero, and not in > the init state, right? We are preserving the first 8 vector registers used by the caller to pass function parameters. We only care about the vector bits used to pass function parameters. If the caller doesn't use the upper bits, we will always zero the upper bits upon exit. > I'm missing something, though... Is the stuff in question here called > *every* time one of these AVX-using functions is called, or called only > the first time when the binding is done? Only the first time.
If you pass "-z now" to the linker or set LD_BIND_NOW=1, _dl_runtime_resolve won't be used. > It's also bonkers that software has to go to this trouble. This is > precisely what XSAVEOPT is supposed to do for us. -- H.J.
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 3:45 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 3:40 PM, H.J. Lu wrote: On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen > wrote: >> On 06/14/2017 10:18 PM, Andy Lutomirski wrote: >>> Dave, why is XINUSE exposed at all to userspace? >> >> You need it for XSAVEOPT when it is using the init optimization to be >> able to tell which state was written and which state in the XSAVE buffer >> is potentially stale with respect to what's in the registers. I guess >> you can just use XSAVE instead of XSAVEOPT, though. >> >> As you pointed out, if you are using XSAVEC's compaction features by >> leaving bits unset in the requested feature bitmap registers, you have >> no idea how much data XSAVEC will write, unless you read XINUSE with >> XGETBV. But, you can get around *that* by just presizing the XSAVE >> buffer to be big. > > I imagine that, if you're going to save, do something quick, and > restore, you'd be better off allocating a big buffer rather than > trying to find the smallest buffer you can get away with by reading > XINUSE. Also, what happens if XINUSE nondeterministically changes out > from under you before you do XSAVEC? I assume you can avoid this > becoming a problem by using RFBM carefully. > >> >> So, I guess that leaves its use to just figuring out how much XSAVEOPT >> (and friends) are going to write. >> >>> To be fair, glibc uses this new XGETBV feature, but I suspect its >>> usage is rather dubious. Shouldn't it just do XSAVEC directly rather >>> than rolling its own code? >> >> A quick grep through my glibc source only shows XGETBV(0) used which >> reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss >> it. > > Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. 
I wrote a test to compare latency against different approaches. This is on Skylake: [hjl@gnu-skl-1 glibc-test]$ make ./test move: 47212 fxsave : 719440 xsave : 925146 xsavec : 811036 xsave_state_size: 1088 xsave_state_comp_size: 896 load/store is about 17X faster than xsavec. I put my hjl/pr21265/xsavec branch at https://sourceware.org/git/?p=glibc.git;a=summary It uses xsave/xsave/xsavec in _dl_runtime_resolve. >>> >>> What is this used for? Is it just to avoid clobbering argument regs >>> when resolving a symbol that uses an ifunc, or is there more to it? >> >> It is used for lazy binding the first time when an external function is >> called. >> > > Maybe I'm just being dense, but why? What does ld.so need to do to > resolve a symbol and update the GOT that requires using extended > state? Since the first 8 vector registers are used to pass function parameters and ld.so uses vector registers, _dl_runtime_resolve needs to preserve the first 8 vector registers when transferring control to ld.so. -- H.J.
Re: xgetbv nondeterminism
On 06/15/2017 03:18 PM, Andy Lutomirski wrote: >> As you pointed out, if you are using XSAVEC's compaction features by >> leaving bits unset in the requested feature bitmap registers, you have >> no idea how much data XSAVEC will write, unless you read XINUSE with >> XGETBV. But, you can get around *that* by just presizing the XSAVE >> buffer to be big. > I imagine that, if you're going to save, do something quick, and > restore, you'd be better off allocating a big buffer rather than > trying to find the smallest buffer you can get away with by reading > XINUSE. Also, what happens if XINUSE nondeterministically changes out > from under you before you do XSAVEC? I assume you can avoid this > becoming a problem by using RFBM carefully. That's a good point. HJ, does your XGETBV1 code disable signals, btw? A signal could theoretically change XINUSE if the handler modified the on-stack XSAVE state before sigreturn. Also, your glibc patch talks a lot about the upper parts of the registers being zeroed, but that isn't precisely what XGETBV1 does, right? It tells you whether the upper portion of the registers are in the init state. But, the high parts of the registers could be zero, and not in the init state, right? I'm missing something, though... Is the stuff in question here called *every* time one of these AVX-using functions is called, or called only the first time when the binding is done? It's also bonkers that software has to go to this trouble. This is precisely what XSAVEOPT is supposed to do for us.
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 3:45 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 3:40 PM, H.J. Lu wrote: >>> On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen wrote: > On 06/14/2017 10:18 PM, Andy Lutomirski wrote: >> Dave, why is XINUSE exposed at all to userspace? > > You need it for XSAVEOPT when it is using the init optimization to be > able to tell which state was written and which state in the XSAVE buffer > is potentially stale with respect to what's in the registers. I guess > you can just use XSAVE instead of XSAVEOPT, though. > > As you pointed out, if you are using XSAVEC's compaction features by > leaving bits unset in the requested feature bitmap registers, you have > no idea how much data XSAVEC will write, unless you read XINUSE with > XGETBV. But, you can get around *that* by just presizing the XSAVE > buffer to be big. I imagine that, if you're going to save, do something quick, and restore, you'd be better off allocating a big buffer rather than trying to find the smallest buffer you can get away with by reading XINUSE. Also, what happens if XINUSE nondeterministically changes out from under you before you do XSAVEC? I assume you can avoid this becoming a problem by using RFBM carefully. > > So, I guess that leaves its use to just figuring out how much XSAVEOPT > (and friends) are going to write. > >> To be fair, glibc uses this new XGETBV feature, but I suspect its >> usage is rather dubious. Shouldn't it just do XSAVEC directly rather >> than rolling its own code? > > A quick grep through my glibc source only shows XGETBV(0) used which > reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss it. Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. >>> >>> I wrote a test to compare latency against different approaches. 
This >>> is on Skylake: >>> >>> [hjl@gnu-skl-1 glibc-test]$ make >>> ./test >>> move: 47212 >>> fxsave : 719440 >>> xsave : 925146 >>> xsavec : 811036 >>> xsave_state_size: 1088 >>> xsave_state_comp_size: 896 >>> >>> load/store is about 17X faster than xsavec. >>> >>> I put my hjl/pr21265/xsavec branch at >>> >>> https://sourceware.org/git/?p=glibc.git;a=summary >>> >>> It uses xsave/xsave/xsavec in _dl_runtime_resolve. >> >> What is this used for? Is it just to avoid clobbering argument regs >> when resolving a symbol that uses an ifunc, or is there more to it? > > It is used for lazy binding the first time when an external function is > called. > Maybe I'm just being dense, but why? What does ld.so need to do to resolve a symbol and update the GOT that requires using extended state?
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 3:45 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 3:40 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen wrote: On 06/14/2017 10:18 PM, Andy Lutomirski wrote: > Dave, why is XINUSE exposed at all to userspace? You need it for XSAVEOPT when it is using the init optimization to be able to tell which state was written and which state in the XSAVE buffer is potentially stale with respect to what's in the registers. I guess you can just use XSAVE instead of XSAVEOPT, though. As you pointed out, if you are using XSAVEC's compaction features by leaving bits unset in the requested feature bitmap registers, you have no idea how much data XSAVEC will write, unless you read XINUSE with XGETBV. But, you can get around *that* by just presizing the XSAVE buffer to be big. >>> >>> I imagine that, if you're going to save, do something quick, and >>> restore, you'd be better off allocating a big buffer rather than >>> trying to find the smallest buffer you can get away with by reading >>> XINUSE. Also, what happens if XINUSE nondeterministically changes out >>> from under you before you do XSAVEC? I assume you can avoid this >>> becoming a problem by using RFBM carefully. >>> So, I guess that leaves its use to just figuring out how much XSAVEOPT (and friends) are going to write. > To be fair, glibc uses this new XGETBV feature, but I suspect its > usage is rather dubious. Shouldn't it just do XSAVEC directly rather > than rolling its own code? A quick grep through my glibc source only shows XGETBV(0) used which reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss it. >>> >>> Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. >> >> I wrote a test to compare latency against different approaches. 
This >> is on Skylake: >> >> [hjl@gnu-skl-1 glibc-test]$ make >> ./test >> move: 47212 >> fxsave : 719440 >> xsave : 925146 >> xsavec : 811036 >> xsave_state_size: 1088 >> xsave_state_comp_size: 896 >> >> load/store is about 17X faster than xsavec. >> >> I put my hjl/pr21265/xsavec branch at >> >> https://sourceware.org/git/?p=glibc.git;a=summary >> >> It uses xsave/xsave/xsavec in _dl_runtime_resolve. > > What is this used for? Is it just to avoid clobbering argument regs > when resolving a symbol that uses an ifunc, or is there more to it? It is used for lazy binding the first time when an external function is called. -- H.J.
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 3:40 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen wrote: >>> On 06/14/2017 10:18 PM, Andy Lutomirski wrote: Dave, why is XINUSE exposed at all to userspace? >>> >>> You need it for XSAVEOPT when it is using the init optimization to be >>> able to tell which state was written and which state in the XSAVE buffer >>> is potentially stale with respect to what's in the registers. I guess >>> you can just use XSAVE instead of XSAVEOPT, though. >>> >>> As you pointed out, if you are using XSAVEC's compaction features by >>> leaving bits unset in the requested feature bitmap registers, you have >>> no idea how much data XSAVEC will write, unless you read XINUSE with >>> XGETBV. But, you can get around *that* by just presizing the XSAVE >>> buffer to be big. >> >> I imagine that, if you're going to save, do something quick, and >> restore, you'd be better off allocating a big buffer rather than >> trying to find the smallest buffer you can get away with by reading >> XINUSE. Also, what happens if XINUSE nondeterministically changes out >> from under you before you do XSAVEC? I assume you can avoid this >> becoming a problem by using RFBM carefully. >> >>> >>> So, I guess that leaves its use to just figuring out how much XSAVEOPT >>> (and friends) are going to write. >>> To be fair, glibc uses this new XGETBV feature, but I suspect its usage is rather dubious. Shouldn't it just do XSAVEC directly rather than rolling its own code? >>> >>> A quick grep through my glibc source only shows XGETBV(0) used which >>> reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss it. >> >> Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. > > I wrote a test to compare latency against different approaches. 
This > is on Skylake: > > [hjl@gnu-skl-1 glibc-test]$ make > ./test > move: 47212 > fxsave : 719440 > xsave : 925146 > xsavec : 811036 > xsave_state_size: 1088 > xsave_state_comp_size: 896 > > load/store is about 17X faster than xsavec. > > I put my hjl/pr21265/xsavec branch at > > https://sourceware.org/git/?p=glibc.git;a=summary > > It uses xsave/xsave/xsavec in _dl_runtime_resolve. What is this used for? Is it just to avoid clobbering argument regs when resolving a symbol that uses an ifunc, or is there more to it?
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen wrote: >> On 06/14/2017 10:18 PM, Andy Lutomirski wrote: >>> Dave, why is XINUSE exposed at all to userspace? >> >> You need it for XSAVEOPT when it is using the init optimization to be >> able to tell which state was written and which state in the XSAVE buffer >> is potentially stale with respect to what's in the registers. I guess >> you can just use XSAVE instead of XSAVEOPT, though. >> >> As you pointed out, if you are using XSAVEC's compaction features by >> leaving bits unset in the requested feature bitmap registers, you have >> no idea how much data XSAVEC will write, unless you read XINUSE with >> XGETBV. But, you can get around *that* by just presizing the XSAVE >> buffer to be big. > > I imagine that, if you're going to save, do something quick, and > restore, you'd be better off allocating a big buffer rather than > trying to find the smallest buffer you can get away with by reading > XINUSE. Also, what happens if XINUSE nondeterministically changes out > from under you before you do XSAVEC? I assume you can avoid this > becoming a problem by using RFBM carefully. > >> >> So, I guess that leaves its use to just figuring out how much XSAVEOPT >> (and friends) are going to write. >> >>> To be fair, glibc uses this new XGETBV feature, but I suspect its >>> usage is rather dubious. Shouldn't it just do XSAVEC directly rather >>> than rolling its own code? >> >> A quick grep through my glibc source only shows XGETBV(0) used which >> reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss it. > > Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. I wrote a test to compare latency against different approaches. This is on Skylake: [hjl@gnu-skl-1 glibc-test]$ make ./test move: 47212 fxsave : 719440 xsave : 925146 xsavec : 811036 xsave_state_size: 1088 xsave_state_comp_size: 896 load/store is about 17X faster than xsavec. 
I put my hjl/pr21265/xsavec branch at https://sourceware.org/git/?p=glibc.git;a=summary It uses fxsave/xsave/xsavec in _dl_runtime_resolve. -- H.J. [Attachment: plt.tar.xz]
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen wrote: > On 06/14/2017 10:18 PM, Andy Lutomirski wrote: >> Dave, why is XINUSE exposed at all to userspace? > > You need it for XSAVEOPT when it is using the init optimization to be > able to tell which state was written and which state in the XSAVE buffer > is potentially stale with respect to what's in the registers. I guess > you can just use XSAVE instead of XSAVEOPT, though. > > As you pointed out, if you are using XSAVEC's compaction features by > leaving bits unset in the requested feature bitmap registers, you have > no idea how much data XSAVEC will write, unless you read XINUSE with > XGETBV. But, you can get around *that* by just presizing the XSAVE > buffer to be big. I imagine that, if you're going to save, do something quick, and restore, you'd be better off allocating a big buffer rather than trying to find the smallest buffer you can get away with by reading XINUSE. Also, what happens if XINUSE nondeterministically changes out from under you before you do XSAVEC? I assume you can avoid this becoming a problem by using RFBM carefully. > > So, I guess that leaves its use to just figuring out how much XSAVEOPT > (and friends) are going to write. > >> To be fair, glibc uses this new XGETBV feature, but I suspect its >> usage is rather dubious. Shouldn't it just do XSAVEC directly rather >> than rolling its own code? > > A quick grep through my glibc source only shows XGETBV(0) used which > reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss it. Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version.
Re: xgetbv nondeterminism
On 06/14/2017 10:18 PM, Andy Lutomirski wrote: > Dave, why is XINUSE exposed at all to userspace? You need it for XSAVEOPT when it is using the init optimization to be able to tell which state was written and which state in the XSAVE buffer is potentially stale with respect to what's in the registers. I guess you can just use XSAVE instead of XSAVEOPT, though. As you pointed out, if you are using XSAVEC's compaction features by leaving bits unset in the requested feature bitmap registers, you have no idea how much data XSAVEC will write, unless you read XINUSE with XGETBV. But, you can get around *that* by just presizing the XSAVE buffer to be big. So, I guess that leaves its use to just figuring out how much XSAVEOPT (and friends) are going to write. > To be fair, glibc uses this new XGETBV feature, but I suspect its > usage is rather dubious. Shouldn't it just do XSAVEC directly rather > than rolling its own code? A quick grep through my glibc source only shows XGETBV(0) used, which reads XCR0. I don't see any XGETBV(1), which reads XINUSE. Did I miss it?
xgetbv nondeterminism
I saw your post about XGETBV (http://robert.ocallahan.org/2017/06/another-case-of-obscure-cpu.html), and it sounds like it could plausibly be a kernel bug. What kernel are you on? I wonder if CPUs have an optimization in which, if a given register set is in the init state but XINUSE=1, then they'll notice when XRSTORS runs and clear XINUSE. If so, that would be rather unfortunate IMO. Dave, why is XINUSE exposed at all to userspace? IIRC it's visible via XGETBV, XSAVEC, and maybe even XSAVE, and ISTM that making it visible to userspace serves basically no performance purpose and just encourages annoying corner cases. I can see an argument that making XSAVEC fast is nice for user threading libraries, but user threading libraries that use XSAVEC are probably doing it wrong and should instead rely on ABI guarantees to avoid saving and restoring extended state at all. To be fair, glibc uses this new XGETBV feature, but I suspect its usage is rather dubious. Shouldn't it just do XSAVEC directly rather than rolling its own code?