Re: xgetbv nondeterminism
On Sat, Jun 17, 2017 at 9:32 AM, Andy Lutomirski wrote:
> On Sat, Jun 17, 2017 at 5:51 AM, H.J. Lu wrote:
>> On Fri, Jun 16, 2017 at 11:21 PM, Andy Lutomirski wrote:
>>>>> In any event, I still don't understand the issue.  The code does this,
>>>>> effectively:
>>>>>
>>>>> PLT -> GOT
>>>>> GOT points to a stub that transfers control to ld.so
>>>>> ld.so resolves the symbol (_dl_fixup, I think)
>>>>> ld.so patches the GOT
>>>>> ld.so jumps to the resolved function
>>>>>
>>>>> As far as I can tell, the only part of the whole process that might
>>>>> touch vector registers at all is elf_ifunc_invoke().  Couldn't all the
>>>>> register saving and restoring be moved to elf_ifunc_invoke()?
>>>>
>>>> Please grep for FOREIGN_CALL in the elf directory.
>>>
>>> I grepped FOREIGN_CALL.  It has no explanation whatsoever and appears
>>> to unconditionally do nothing in the current glibc version.
>>>
>>> In f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e^, in pseudocode, it does:
>>>
>>> __thread bool must_save;
>>>
>>> RTLD_CHECK_FOREIGN_CALL: return must_save;
>>>
>>> RTLD_ENABLE_FOREIGN_CALL: old_must_save = must_save; must_save = true;
>>>
>>> RTLD_PREPARE_FOREIGN_CALL: save_state(); must_save = false;
>>>
>>> RTLD_FINALIZE_FOREIGN_CALL: if (must_save) restore_state(); must_save =
>>> old_must_save;
>>>
>>> save_state() and restore_state() operate on TLS buffers.
>>>
>>> In summary: this is not async-signal-safe.  It's also really messy --
>>> there are macros that declare local variables, and the logic isn't
>>> apparent without really digging in to all the code.
>>>
>>> I still don't see why this couldn't be:
>>>
>>> static void elf_do_foreign_stuff(args here)
>>> {
>>>     void *buf = alloca(state_size);
>>>     xsaveopt(buf); /* or open-code it if you prefer */
>>>     call_the_ifunc();
>>>     xrstor(buf);
>>> }
>>
>> As you have found out, it doesn't work this way, since
>>
>> RTLD_PREPARE_FOREIGN_CALL
>>
>> and
>>
>> RTLD_FINALIZE_FOREIGN_CALL
>>
>> are used in 2 DIFFERENT files.
> That ought to be fixable, either by rearranging code or by doing
> something like:
>
> RTLD_INIT_FOREIGN_CALL(foreign_call_state);
>
> _dl_whatever_helper(&foreign_call_state);
>
> RTLD_FINALIZE_FOREIGN_CALL(foreign_call_state);
>
> _dl_whatever_helper would do
> RTLD_PREPARE_FOREIGN_CALL(ptr_to_foreign_call_state);
>
> renaming these macros a bit might help, too.

We have no plan to do anything.

>>> If there's more than just the ifunc (malloc? profiling? printf?)
>>> then all of that could be wrapped as well.
>>
>> It has nothing to do with ifunc.
>
> What's it for, then?  I don't understand why, in a sensible ld.so
> architecture, there would ever be a call out from ld.so during runtime
> binding to anything other than an ifunc, but I realize that glibc is
> weird and ld.so might call out to libc.so for some reason.  It doesn't
> really matter, though.

ld.so has a set of minimal implementations of some library functions,
like malloc, free, ..., which are used by ld.so to bootstrap itself.
After libc.so is loaded and relocated, the full implementations from
libc.so are used in ld.so.  Since they are outside of ld.so, they are
foreign calls to ld.so, and ld.so needs to preserve the first 8 vector
registers when a foreign call is made to libc.so.

If there are any suggestions for ld.so, please discuss them at

https://sourceware.org/ml/libc-alpha/

Thanks.

-- H.J.
Re: xgetbv nondeterminism
On Sat, Jun 17, 2017 at 5:51 AM, H.J. Lu wrote: > On Fri, Jun 16, 2017 at 11:21 PM, Andy Lutomirski wrote: In any event, I still don't understand the issue. The code does this, effectively: PLT -> GOT GOT points to a stub that transfers control to ld.so ld.so resolves the symbol (_dl_fixup, I think) ld.so patches the GOT ld.so jumps to the resolved function As far as I can tell, the only part of the whole process that might touch vector registers at all is elf_ifunc_invoke(). Couldn't all the register saving and restoring be moved to elf_ifunc_invoke()? >>> >>> Please grep for FOREIGN_CALL the elf directory. >> >> I grepped FOREIGN_CALL. It has no explanation whatsoever and appears >> to unconditionally do nothing in the current glibc version. >> >> In f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e^, in pseudocode, it does: >> >> __thread bool must_save; >> >> RTLD_CHECK_FOREIGN_CALL: return must_save; >> >> RTLD_ENABLE_FOREIGN_CALL: old_must_save = must_save; must_save = true; >> >> RTLD_PREPARE_FOREIGN_CALL: save_state(); must_save = false; >> >> RTLD_FINALIZE_FOREIGN_CALL: if (must_save) restore(); must_save = >> old_must_save; >> >> save_state() and restore_state() operate on TLS buffers. >> >> In summary: this is not async-signal-safe. It's also really messy -- >> there are macros that declare local variables, and the logic isn't >> apparent without really digging in to all the code. >> >> I still don't see why this couldn't be: >> >> static void elf_do_foreign_stuff(args here) >> { >> void *buf = alloca(state_size); >> xsaveopt(buf); /* or open-code it if you prefer */ >> call_the_ifunc(); >> xrstor(buf); >> } > > As you have found out that it doesn't work this way since > > RTLD_PREPARE_FOREIGN_CALL > > and > > RTLD_FINALIZE_FOREIGN_CALL > > are used in 2 DIFFERENT files. 
That ought to be fixable, either by rearranging code or by doing
something like:

RTLD_INIT_FOREIGN_CALL(foreign_call_state);

_dl_whatever_helper(&foreign_call_state);

RTLD_FINALIZE_FOREIGN_CALL(foreign_call_state);

_dl_whatever_helper would do
RTLD_PREPARE_FOREIGN_CALL(ptr_to_foreign_call_state);

renaming these macros a bit might help, too.

>> If there's more than just the ifunc (malloc? profiling? printf?)
>> then all of that could be wrapped as well.
>
> It has nothing to do with ifunc.

What's it for, then?  I don't understand why, in a sensible ld.so
architecture, there would ever be a call out from ld.so during runtime
binding to anything other than an ifunc, but I realize that glibc is
weird and ld.so might call out to libc.so for some reason.  It doesn't
really matter, though.

>> All this stuff comes from:
>>
>> commit b48a267b8fbb885191a04cffdb4050a4d4c8a20b
>> Author: Ulrich Drepper
>> Date:   Wed Jul 29 08:33:03 2009 -0700
>>
>>     Preserve SSE registers in runtime relocations on x86-64.
>>
>>     SSE registers are used for passing parameters and must be preserved
>>     in runtime relocations.  This is inside ld.so enforced through the
>>     tests in tst-xmmymm.sh.  But the malloc routines used after startup
>>     come from libc.so and can be arbitrarily complex.  It's overkill
>>     to save the SSE registers all the time because of that.  These calls
>>     are rare.  Instead we save them on demand.  The new infrastructure
>>     put in place in this patch makes this possible and efficient.
>>
>> While I think that the control flow is a giant mess and the use of TLS
>
> Yes.
>
>> was a mistake, I think Uli had the right idea: explicitly save the
>
> Yes.
>
>> extended state only when needed.
>
> Only its implementation led to a race condition.

I'm suggesting that the races could be fixed without making the
save/restore unconditional.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 11:21 PM, Andy Lutomirski wrote:
>>> In any event, I still don't understand the issue.  The code does this,
>>> effectively:
>>>
>>> PLT -> GOT
>>> GOT points to a stub that transfers control to ld.so
>>> ld.so resolves the symbol (_dl_fixup, I think)
>>> ld.so patches the GOT
>>> ld.so jumps to the resolved function
>>>
>>> As far as I can tell, the only part of the whole process that might
>>> touch vector registers at all is elf_ifunc_invoke().  Couldn't all the
>>> register saving and restoring be moved to elf_ifunc_invoke()?
>>
>> Please grep for FOREIGN_CALL in the elf directory.
>
> I grepped FOREIGN_CALL.  It has no explanation whatsoever and appears
> to unconditionally do nothing in the current glibc version.
>
> In f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e^, in pseudocode, it does:
>
> __thread bool must_save;
>
> RTLD_CHECK_FOREIGN_CALL: return must_save;
>
> RTLD_ENABLE_FOREIGN_CALL: old_must_save = must_save; must_save = true;
>
> RTLD_PREPARE_FOREIGN_CALL: save_state(); must_save = false;
>
> RTLD_FINALIZE_FOREIGN_CALL: if (must_save) restore_state(); must_save =
> old_must_save;
>
> save_state() and restore_state() operate on TLS buffers.
>
> In summary: this is not async-signal-safe.  It's also really messy --
> there are macros that declare local variables, and the logic isn't
> apparent without really digging in to all the code.
>
> I still don't see why this couldn't be:
>
> static void elf_do_foreign_stuff(args here)
> {
>     void *buf = alloca(state_size);
>     xsaveopt(buf); /* or open-code it if you prefer */
>     call_the_ifunc();
>     xrstor(buf);
> }

As you have found out, it doesn't work this way, since

RTLD_PREPARE_FOREIGN_CALL

and

RTLD_FINALIZE_FOREIGN_CALL

are used in 2 DIFFERENT files.

> If there's more than just the ifunc (malloc? profiling? printf?)
> then all of that could be wrapped as well.

It has nothing to do with ifunc.
> All this stuff comes from:
>
> commit b48a267b8fbb885191a04cffdb4050a4d4c8a20b
> Author: Ulrich Drepper
> Date:   Wed Jul 29 08:33:03 2009 -0700
>
>     Preserve SSE registers in runtime relocations on x86-64.
>
>     SSE registers are used for passing parameters and must be preserved
>     in runtime relocations.  This is inside ld.so enforced through the
>     tests in tst-xmmymm.sh.  But the malloc routines used after startup
>     come from libc.so and can be arbitrarily complex.  It's overkill
>     to save the SSE registers all the time because of that.  These calls
>     are rare.  Instead we save them on demand.  The new infrastructure
>     put in place in this patch makes this possible and efficient.
>
> While I think that the control flow is a giant mess and the use of TLS

Yes.

> was a mistake, I think Uli had the right idea: explicitly save the

Yes.

> extended state only when needed.

Only its implementation led to a race condition.

-- H.J.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 11:03 AM, H.J. Lu wrote: > On Fri, Jun 16, 2017 at 10:56 AM, Andy Lutomirski wrote: >> On Fri, Jun 16, 2017 at 10:44 AM, H.J. Lu wrote: >>> On Fri, Jun 16, 2017 at 9:38 AM, Andy Lutomirski wrote: On Fri, Jun 16, 2017 at 9:17 AM, H.J. Lu wrote: > On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: >>> On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski >>> wrote: On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski > wrote: >> On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: >>> It is used for lazy binding the first time when an external >>> function is called. >>> >> >> Maybe I'm just being dense, but why? What does ld.so need to do to >> resolve a symbol and update the GOT that requires using extended >> state? > > Since the first 8 vector registers are used to pass function > parameters > and ld.so uses vector registers, _dl_runtime_resolve needs to preserve > the first 8 vector registers when transferring control to ld.so. > Wouldn't it be faster and more future-proof to recompile the relevant parts of ld.so to avoid using extended state? >>> >>> Are you suggesting not to use vector in ld.so? >> >> Yes, exactly. >> >>> We used to do that >>> several years ago, which leads to some subtle bugs, like >>> >>> https://sourceware.org/bugzilla/show_bug.cgi?id=15128 >> >> I don't think x86_64 has the issue that ARM has there. The Linux >> kernel, for example, has always been compiled to not use vector or >> floating point registers on x86 (32 and 64), and it works fine. Linux >> doesn't save extended regs on kernel entry and it doesn't restore them >> on exit. >> >> I would suggest that ld.so be compiled without use of vector >> registers, that the normal lazy binding path not try to save any extra >> regs, and that ifuncs be called through a thunk that saves whatever >> registers need saving, possibly just using XSAVEOPT. 
After all, ifunc >> is used for only a tiny fraction of symbols. > > x86-64 was the only target which used FOREIGN_CALL macros > in ld.so, FOREIGN_CALL macros were the cause of race condition > in ld.so: > > https://sourceware.org/bugzilla/show_bug.cgi?id=11214 > > Not to save and restore the first 8 vector registers means that > FOREIGN_CALL macros have to be used. We don't want to > do that on x86-64. > > You're talking about this, right: commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e Author: H.J. Lu Date: Tue Aug 25 04:33:54 2015 -0700 Save and restore vector registers in x86-64 ld.so It seems to me that the problem wasn't that the save/restore happened on some of the time -- it was that the save and restore code used a TLS variable to track its own state. Shouldn't it have been a stack variable or even just implicit in the control flow? >>> >>> No, it can't use stack variable since _dl_runtime_resolve never >>> returns. >> >> I haven't dug all the way through the source, but surely ifuncs are >> CALLed, not JMPed to. That means you have a stack somewhere. This >> stuff is mostly written in C, and local variables should work just >> fine. >> >>> In any case, glibc is effectively doing a foreign call anyway, right? >>> >>> No. >>> It's doing the foreign call to itself on every lazy binding resolution, though, which seems quite expensive. I'm saying that it seems like it would be more sensible to do the complicated foreign call logic only when doing the dangerous case, which is when lazy binding calls an ifunc. If I were to rewrite this, I would do it like this: void *call_runtime_ifunc(void (*ifunc)()); // or whatever the signature needs to be >>> >>> It is unrelated to IFUNC. This is how external function call works. >> >> External function call to what external function? Are there any calls >> to any non-IFUNC external functions that are triggered by runtime >> resolution? >> >> In any event, I still don't understand the issue. 
>> The code does this, effectively:
>>
>> PLT -> GOT
>> GOT points to a stub that transfers control to ld.so
>> ld.so resolves the symbol (_dl_fixup, I think)
>> ld.so patches the GOT
>> ld.so jumps to the resolved function
>>
>> As far as I can tell, the only part of the whole process that might
>> touch vector registers at all is elf_ifunc_invoke().  Couldn't all the
>> register saving and restoring be moved to elf_ifunc_invoke()?
>
> Please grep for FOREIGN_CALL in the elf directory.

I grepped FOREIGN_CALL.  It has no explanation whatsoever and appears
to unconditionally do nothing in the current glibc version.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 10:56 AM, Andy Lutomirski wrote: > On Fri, Jun 16, 2017 at 10:44 AM, H.J. Lu wrote: >> On Fri, Jun 16, 2017 at 9:38 AM, Andy Lutomirski wrote: >>> On Fri, Jun 16, 2017 at 9:17 AM, H.J. Lu wrote: On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: >> It is used for lazy binding the first time when an external function >> is called. >> > > Maybe I'm just being dense, but why? What does ld.so need to do to > resolve a symbol and update the GOT that requires using extended > state? Since the first 8 vector registers are used to pass function parameters and ld.so uses vector registers, _dl_runtime_resolve needs to preserve the first 8 vector registers when transferring control to ld.so. >>> >>> Wouldn't it be faster and more future-proof to recompile the relevant >>> parts of ld.so to avoid using extended state? >>> >> >> Are you suggesting not to use vector in ld.so? > > Yes, exactly. > >> We used to do that >> several years ago, which leads to some subtle bugs, like >> >> https://sourceware.org/bugzilla/show_bug.cgi?id=15128 > > I don't think x86_64 has the issue that ARM has there. The Linux > kernel, for example, has always been compiled to not use vector or > floating point registers on x86 (32 and 64), and it works fine. Linux > doesn't save extended regs on kernel entry and it doesn't restore them > on exit. > > I would suggest that ld.so be compiled without use of vector > registers, that the normal lazy binding path not try to save any extra > regs, and that ifuncs be called through a thunk that saves whatever > registers need saving, possibly just using XSAVEOPT. After all, ifunc > is used for only a tiny fraction of symbols. 
x86-64 was the only target which used FOREIGN_CALL macros in ld.so, FOREIGN_CALL macros were the cause of race condition in ld.so: https://sourceware.org/bugzilla/show_bug.cgi?id=11214 Not to save and restore the first 8 vector registers means that FOREIGN_CALL macros have to be used. We don't want to do that on x86-64. >>> >>> You're talking about this, right: >>> >>> commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e >>> Author: H.J. Lu >>> Date: Tue Aug 25 04:33:54 2015 -0700 >>> >>> Save and restore vector registers in x86-64 ld.so >>> >>> It seems to me that the problem wasn't that the save/restore happened >>> on some of the time -- it was that the save and restore code used a >>> TLS variable to track its own state. Shouldn't it have been a stack >>> variable or even just implicit in the control flow? >> >> No, it can't use stack variable since _dl_runtime_resolve never >> returns. > > I haven't dug all the way through the source, but surely ifuncs are > CALLed, not JMPed to. That means you have a stack somewhere. This > stuff is mostly written in C, and local variables should work just > fine. > >> >>> In any case, glibc is effectively doing a foreign call anyway, right? >> >> No. >> >>> It's doing the foreign call to itself on every lazy binding >>> resolution, though, which seems quite expensive. I'm saying that it >>> seems like it would be more sensible to do the complicated foreign >>> call logic only when doing the dangerous case, which is when lazy >>> binding calls an ifunc. >>> >>> If I were to rewrite this, I would do it like this: >>> >>> void *call_runtime_ifunc(void (*ifunc)()); // or whatever the >>> signature needs to be >> >> It is unrelated to IFUNC. This is how external function call works. > > External function call to what external function? Are there any calls > to any non-IFUNC external functions that are triggered by runtime > resolution? > > In any event, I still don't understand the issue. 
> The code does this, effectively:
>
> PLT -> GOT
> GOT points to a stub that transfers control to ld.so
> ld.so resolves the symbol (_dl_fixup, I think)
> ld.so patches the GOT
> ld.so jumps to the resolved function
>
> As far as I can tell, the only part of the whole process that might
> touch vector registers at all is elf_ifunc_invoke().  Couldn't all the
> register saving and restoring be moved to elf_ifunc_invoke()?

Please grep for FOREIGN_CALL in the elf directory.

-- H.J.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 10:44 AM, H.J. Lu wrote: > On Fri, Jun 16, 2017 at 9:38 AM, Andy Lutomirski wrote: >> On Fri, Jun 16, 2017 at 9:17 AM, H.J. Lu wrote: >>> On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski wrote: On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: >>> On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski >>> wrote: On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: > It is used for lazy binding the first time when an external function > is called. > Maybe I'm just being dense, but why? What does ld.so need to do to resolve a symbol and update the GOT that requires using extended state? >>> >>> Since the first 8 vector registers are used to pass function parameters >>> and ld.so uses vector registers, _dl_runtime_resolve needs to preserve >>> the first 8 vector registers when transferring control to ld.so. >>> >> >> Wouldn't it be faster and more future-proof to recompile the relevant >> parts of ld.so to avoid using extended state? >> > > Are you suggesting not to use vector in ld.so? Yes, exactly. > We used to do that > several years ago, which leads to some subtle bugs, like > > https://sourceware.org/bugzilla/show_bug.cgi?id=15128 I don't think x86_64 has the issue that ARM has there. The Linux kernel, for example, has always been compiled to not use vector or floating point registers on x86 (32 and 64), and it works fine. Linux doesn't save extended regs on kernel entry and it doesn't restore them on exit. I would suggest that ld.so be compiled without use of vector registers, that the normal lazy binding path not try to save any extra regs, and that ifuncs be called through a thunk that saves whatever registers need saving, possibly just using XSAVEOPT. After all, ifunc is used for only a tiny fraction of symbols. 
>>> >>> x86-64 was the only target which used FOREIGN_CALL macros >>> in ld.so, FOREIGN_CALL macros were the cause of race condition >>> in ld.so: >>> >>> https://sourceware.org/bugzilla/show_bug.cgi?id=11214 >>> >>> Not to save and restore the first 8 vector registers means that >>> FOREIGN_CALL macros have to be used. We don't want to >>> do that on x86-64. >>> >>> >> >> You're talking about this, right: >> >> commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e >> Author: H.J. Lu >> Date: Tue Aug 25 04:33:54 2015 -0700 >> >> Save and restore vector registers in x86-64 ld.so >> >> It seems to me that the problem wasn't that the save/restore happened >> on some of the time -- it was that the save and restore code used a >> TLS variable to track its own state. Shouldn't it have been a stack >> variable or even just implicit in the control flow? > > No, it can't use stack variable since _dl_runtime_resolve never > returns. I haven't dug all the way through the source, but surely ifuncs are CALLed, not JMPed to. That means you have a stack somewhere. This stuff is mostly written in C, and local variables should work just fine. > >> In any case, glibc is effectively doing a foreign call anyway, right? > > No. > >> It's doing the foreign call to itself on every lazy binding >> resolution, though, which seems quite expensive. I'm saying that it >> seems like it would be more sensible to do the complicated foreign >> call logic only when doing the dangerous case, which is when lazy >> binding calls an ifunc. >> >> If I were to rewrite this, I would do it like this: >> >> void *call_runtime_ifunc(void (*ifunc)()); // or whatever the >> signature needs to be > > It is unrelated to IFUNC. This is how external function call works. External function call to what external function? Are there any calls to any non-IFUNC external functions that are triggered by runtime resolution? In any event, I still don't understand the issue. 
The code does this, effectively:

PLT -> GOT
GOT points to a stub that transfers control to ld.so
ld.so resolves the symbol (_dl_fixup, I think)
ld.so patches the GOT
ld.so jumps to the resolved function

As far as I can tell, the only part of the whole process that might
touch vector registers at all is elf_ifunc_invoke().  Couldn't all the
register saving and restoring be moved to elf_ifunc_invoke()?

>> call_runtime_ifunc would be implemented in asm (or maybe even C!) and
>> would use XSAVEOPT or similar to save the state to a buffer on the
>> stack.  Then it would call the ifunc and restore the state.  No TLS
>> needed, so there wouldn't be any races.  In fact, it would work very
>> much like your current save/restore code, except that it wouldn't need
>> to be as highly optimized because it would be called much less
>> frequently.  This should improve performance and could be quite a bit
>> simpler.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 9:38 AM, Andy Lutomirski wrote: > On Fri, Jun 16, 2017 at 9:17 AM, H.J. Lu wrote: >> On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: It is used for lazy binding the first time when an external function is called. >>> >>> Maybe I'm just being dense, but why? What does ld.so need to do to >>> resolve a symbol and update the GOT that requires using extended >>> state? >> >> Since the first 8 vector registers are used to pass function parameters >> and ld.so uses vector registers, _dl_runtime_resolve needs to preserve >> the first 8 vector registers when transferring control to ld.so. >> > > Wouldn't it be faster and more future-proof to recompile the relevant > parts of ld.so to avoid using extended state? > Are you suggesting not to use vector in ld.so? >>> >>> Yes, exactly. >>> We used to do that several years ago, which leads to some subtle bugs, like https://sourceware.org/bugzilla/show_bug.cgi?id=15128 >>> >>> I don't think x86_64 has the issue that ARM has there. The Linux >>> kernel, for example, has always been compiled to not use vector or >>> floating point registers on x86 (32 and 64), and it works fine. Linux >>> doesn't save extended regs on kernel entry and it doesn't restore them >>> on exit. >>> >>> I would suggest that ld.so be compiled without use of vector >>> registers, that the normal lazy binding path not try to save any extra >>> regs, and that ifuncs be called through a thunk that saves whatever >>> registers need saving, possibly just using XSAVEOPT. After all, ifunc >>> is used for only a tiny fraction of symbols. 
>> >> x86-64 was the only target which used FOREIGN_CALL macros >> in ld.so, FOREIGN_CALL macros were the cause of race condition >> in ld.so: >> >> https://sourceware.org/bugzilla/show_bug.cgi?id=11214 >> >> Not to save and restore the first 8 vector registers means that >> FOREIGN_CALL macros have to be used. We don't want to >> do that on x86-64. >> >> > > You're talking about this, right: > > commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e > Author: H.J. Lu > Date: Tue Aug 25 04:33:54 2015 -0700 > > Save and restore vector registers in x86-64 ld.so > > It seems to me that the problem wasn't that the save/restore happened > on some of the time -- it was that the save and restore code used a > TLS variable to track its own state. Shouldn't it have been a stack > variable or even just implicit in the control flow? No, it can't use stack variable since _dl_runtime_resolve never returns. > In any case, glibc is effectively doing a foreign call anyway, right? No. > It's doing the foreign call to itself on every lazy binding > resolution, though, which seems quite expensive. I'm saying that it > seems like it would be more sensible to do the complicated foreign > call logic only when doing the dangerous case, which is when lazy > binding calls an ifunc. > > If I were to rewrite this, I would do it like this: > > void *call_runtime_ifunc(void (*ifunc)()); // or whatever the > signature needs to be It is unrelated to IFUNC. This is how external function call works. > call_runtime_ifunc would be implemented in asm (or maybe even C!) and > would use XSAVEOPT or similar to save the state to a buffer on the > stack. Then it would call the ifunc and restore the state. No TLS > needed, so there wouldn't be any races. In fact, it would work very > much like your current save/restore code, except that it wouldn't need > to be as highly optimized because it would be called much less > frequently. This should improve performance and could be quite a bit > simpler. 
> As an aside, why is saving the first eight registers enough?  I don't
> think there's any particular guarantee that a call through the GOT
> uses the psABI, is there?  Compilers can and do produce custom calling
> conventions, and ISTM that some day a compiler might do that between
> DSOs.  Or those DSOs might not be written in C in the first place.

The result is undefined if the psABI isn't followed.

-- H.J.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 9:17 AM, H.J. Lu wrote: > On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: >>> On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: >>> It is used for lazy binding the first time when an external function is >>> called. >>> >> >> Maybe I'm just being dense, but why? What does ld.so need to do to >> resolve a symbol and update the GOT that requires using extended >> state? > > Since the first 8 vector registers are used to pass function parameters > and ld.so uses vector registers, _dl_runtime_resolve needs to preserve > the first 8 vector registers when transferring control to ld.so. > Wouldn't it be faster and more future-proof to recompile the relevant parts of ld.so to avoid using extended state? >>> >>> Are you suggesting not to use vector in ld.so? >> >> Yes, exactly. >> >>> We used to do that >>> several years ago, which leads to some subtle bugs, like >>> >>> https://sourceware.org/bugzilla/show_bug.cgi?id=15128 >> >> I don't think x86_64 has the issue that ARM has there. The Linux >> kernel, for example, has always been compiled to not use vector or >> floating point registers on x86 (32 and 64), and it works fine. Linux >> doesn't save extended regs on kernel entry and it doesn't restore them >> on exit. >> >> I would suggest that ld.so be compiled without use of vector >> registers, that the normal lazy binding path not try to save any extra >> regs, and that ifuncs be called through a thunk that saves whatever >> registers need saving, possibly just using XSAVEOPT. After all, ifunc >> is used for only a tiny fraction of symbols. 
> > x86-64 was the only target which used FOREIGN_CALL macros > in ld.so, FOREIGN_CALL macros were the cause of race condition > in ld.so: > > https://sourceware.org/bugzilla/show_bug.cgi?id=11214 > > Not to save and restore the first 8 vector registers means that > FOREIGN_CALL macros have to be used. We don't want to > do that on x86-64. > > You're talking about this, right: commit f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e Author: H.J. Lu Date: Tue Aug 25 04:33:54 2015 -0700 Save and restore vector registers in x86-64 ld.so It seems to me that the problem wasn't that the save/restore happened on some of the time -- it was that the save and restore code used a TLS variable to track its own state. Shouldn't it have been a stack variable or even just implicit in the control flow? In any case, glibc is effectively doing a foreign call anyway, right? It's doing the foreign call to itself on every lazy binding resolution, though, which seems quite expensive. I'm saying that it seems like it would be more sensible to do the complicated foreign call logic only when doing the dangerous case, which is when lazy binding calls an ifunc. If I were to rewrite this, I would do it like this: void *call_runtime_ifunc(void (*ifunc)()); // or whatever the signature needs to be call_runtime_ifunc would be implemented in asm (or maybe even C!) and would use XSAVEOPT or similar to save the state to a buffer on the stack. Then it would call the ifunc and restore the state. No TLS needed, so there wouldn't be any races. In fact, it would work very much like your current save/restore code, except that it wouldn't need to be as highly optimized because it would be called much less frequently. This should improve performance and could be quite a bit simpler. As an aside, why is saving the first eight registers enough? I don't think there's any particular guarantee that a call through the GOT uses the psABI, is there? 
Compilers can and do produce custom calling conventions, and ISTM that some day a compiler might do that between DSOs. Or those DSOs might not be written in C in the first place.
Re: xgetbv nondeterminism
On Fri, Jun 16, 2017 at 9:01 AM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: >> It is used for lazy binding the first time when an external function is >> called. >> > > Maybe I'm just being dense, but why? What does ld.so need to do to > resolve a symbol and update the GOT that requires using extended > state? Since the first 8 vector registers are used to pass function parameters and ld.so uses vector registers, _dl_runtime_resolve needs to preserve the first 8 vector registers when transferring control to ld.so. >>> >>> Wouldn't it be faster and more future-proof to recompile the relevant >>> parts of ld.so to avoid using extended state? >>> >> >> Are you suggesting not to use vector in ld.so? > > Yes, exactly. > >> We used to do that >> several years ago, which leads to some subtle bugs, like >> >> https://sourceware.org/bugzilla/show_bug.cgi?id=15128 > > I don't think x86_64 has the issue that ARM has there. The Linux > kernel, for example, has always been compiled to not use vector or > floating point registers on x86 (32 and 64), and it works fine. Linux > doesn't save extended regs on kernel entry and it doesn't restore them > on exit. > > I would suggest that ld.so be compiled without use of vector > registers, that the normal lazy binding path not try to save any extra > regs, and that ifuncs be called through a thunk that saves whatever > registers need saving, possibly just using XSAVEOPT. After all, ifunc > is used for only a tiny fraction of symbols. 
x86-64 was the only target which used the FOREIGN_CALL macros in ld.so, and the FOREIGN_CALL macros were the cause of a race condition in ld.so: https://sourceware.org/bugzilla/show_bug.cgi?id=11214 Not saving and restoring the first 8 vector registers means that the FOREIGN_CALL macros have to be used. We don't want to do that on x86-64. -- H.J.
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 9:34 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: >>> On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: > It is used for lazy binding the first time when an external function is > called. > Maybe I'm just being dense, but why? What does ld.so need to do to resolve a symbol and update the GOT that requires using extended state? >>> >>> Since the first 8 vector registers are used to pass function parameters >>> and ld.so uses vector registers, _dl_runtime_resolve needs to preserve >>> the first 8 vector registers when transferring control to ld.so. >>> >> >> Wouldn't it be faster and more future-proof to recompile the relevant >> parts of ld.so to avoid using extended state? >> > > Are you suggesting not to use vector in ld.so? Yes, exactly. > We used to do that > several years ago, which leads to some subtle bugs, like > > https://sourceware.org/bugzilla/show_bug.cgi?id=15128 I don't think x86_64 has the issue that ARM has there. The Linux kernel, for example, has always been compiled to not use vector or floating point registers on x86 (32 and 64), and it works fine. Linux doesn't save extended regs on kernel entry and it doesn't restore them on exit. I would suggest that ld.so be compiled without use of vector registers, that the normal lazy binding path not try to save any extra regs, and that ifuncs be called through a thunk that saves whatever registers need saving, possibly just using XSAVEOPT. After all, ifunc is used for only a tiny fraction of symbols.
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 8:05 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: On Thu, Jun 15, 2017 at 3:45 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 3:40 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen >>> wrote: On 06/14/2017 10:18 PM, Andy Lutomirski wrote: > Dave, why is XINUSE exposed at all to userspace? You need it for XSAVEOPT when it is using the init optimization to be able to tell which state was written and which state in the XSAVE buffer is potentially stale with respect to what's in the registers. I guess you can just use XSAVE instead of XSAVEOPT, though. As you pointed out, if you are using XSAVEC's compaction features by leaving bits unset in the requested feature bitmap registers, you have no idea how much data XSAVEC will write, unless you read XINUSE with XGETBV. But, you can get around *that* by just presizing the XSAVE buffer to be big. >>> >>> I imagine that, if you're going to save, do something quick, and >>> restore, you'd be better off allocating a big buffer rather than >>> trying to find the smallest buffer you can get away with by reading >>> XINUSE. Also, what happens if XINUSE nondeterministically changes out >>> from under you before you do XSAVEC? I assume you can avoid this >>> becoming a problem by using RFBM carefully. >>> So, I guess that leaves its use to just figuring out how much XSAVEOPT (and friends) are going to write. > To be fair, glibc uses this new XGETBV feature, but I suspect its > usage is rather dubious. Shouldn't it just do XSAVEC directly rather > than rolling its own code? A quick grep through my glibc source only shows XGETBV(0) used which reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss it. 
>>> >>> Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. >> >> I wrote a test to compare latency against different approaches. This >> is on Skylake: >> >> [hjl@gnu-skl-1 glibc-test]$ make >> ./test >> move: 47212 >> fxsave : 719440 >> xsave : 925146 >> xsavec : 811036 >> xsave_state_size: 1088 >> xsave_state_comp_size: 896 >> >> load/store is about 17X faster than xsavec. >> >> I put my hjl/pr21265/xsavec branch at >> >> https://sourceware.org/git/?p=glibc.git;a=summary >> >> It uses xsave/xsave/xsavec in _dl_runtime_resolve. > > What is this used for? Is it just to avoid clobbering argument regs > when resolving a symbol that uses an ifunc, or is there more to it? It is used for lazy binding the first time when an external function is called. >>> >>> Maybe I'm just being dense, but why? What does ld.so need to do to >>> resolve a symbol and update the GOT that requires using extended >>> state? >> >> Since the first 8 vector registers are used to pass function parameters >> and ld.so uses vector registers, _dl_runtime_resolve needs to preserve >> the first 8 vector registers when transferring control to ld.so. >> > > Wouldn't it be faster and more future-proof to recompile the relevant > parts of ld.so to avoid using extended state? > Are you suggesting not to use vector in ld.so? We used to do that several years ago, which leads to some subtle bugs, like https://sourceware.org/bugzilla/show_bug.cgi?id=15128 Also x86-64 was the only target which used FOREIGN_CALL macros in ld.so, FOREIGN_CALL macros were the cause of race condition in ld.so: https://sourceware.org/bugzilla/show_bug.cgi?id=11214 -- H.J.
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 7:17 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: >>> On Thu, Jun 15, 2017 at 3:45 PM, Andy Lutomirski wrote: On Thu, Jun 15, 2017 at 3:40 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen >> wrote: >>> On 06/14/2017 10:18 PM, Andy Lutomirski wrote: Dave, why is XINUSE exposed at all to userspace? >>> >>> You need it for XSAVEOPT when it is using the init optimization to be >>> able to tell which state was written and which state in the XSAVE buffer >>> is potentially stale with respect to what's in the registers. I guess >>> you can just use XSAVE instead of XSAVEOPT, though. >>> >>> As you pointed out, if you are using XSAVEC's compaction features by >>> leaving bits unset in the requested feature bitmap registers, you have >>> no idea how much data XSAVEC will write, unless you read XINUSE with >>> XGETBV. But, you can get around *that* by just presizing the XSAVE >>> buffer to be big. >> >> I imagine that, if you're going to save, do something quick, and >> restore, you'd be better off allocating a big buffer rather than >> trying to find the smallest buffer you can get away with by reading >> XINUSE. Also, what happens if XINUSE nondeterministically changes out >> from under you before you do XSAVEC? I assume you can avoid this >> becoming a problem by using RFBM carefully. >> >>> >>> So, I guess that leaves its use to just figuring out how much XSAVEOPT >>> (and friends) are going to write. >>> To be fair, glibc uses this new XGETBV feature, but I suspect its usage is rather dubious. Shouldn't it just do XSAVEC directly rather than rolling its own code? >>> >>> A quick grep through my glibc source only shows XGETBV(0) used which >>> reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss >>> it. 
>> >> Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. > > I wrote a test to compare latency against different approaches. This > is on Skylake: > > [hjl@gnu-skl-1 glibc-test]$ make > ./test > move: 47212 > fxsave : 719440 > xsave : 925146 > xsavec : 811036 > xsave_state_size: 1088 > xsave_state_comp_size: 896 > > load/store is about 17X faster than xsavec. > > I put my hjl/pr21265/xsavec branch at > > https://sourceware.org/git/?p=glibc.git;a=summary > > It uses xsave/xsave/xsavec in _dl_runtime_resolve. What is this used for? Is it just to avoid clobbering argument regs when resolving a symbol that uses an ifunc, or is there more to it? >>> >>> It is used for lazy binding the first time when an external function is >>> called. >>> >> >> Maybe I'm just being dense, but why? What does ld.so need to do to >> resolve a symbol and update the GOT that requires using extended >> state? > > Since the first 8 vector registers are used to pass function parameters > and ld.so uses vector registers, _dl_runtime_resolve needs to preserve > the first 8 vector registers when transferring control to ld.so. > Wouldn't it be faster and more future-proof to recompile the relevant parts of ld.so to avoid using extended state? --Andy
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 4:37 PM, Dave Hansen wrote: > On 06/15/2017 03:18 PM, Andy Lutomirski wrote: >>> As you pointed out, if you are using XSAVEC's compaction features by >>> leaving bits unset in the requested feature bitmap registers, you have >>> no idea how much data XSAVEC will write, unless you read XINUSE with >>> XGETBV. But, you can get around *that* by just presizing the XSAVE >>> buffer to be big. >> I imagine that, if you're going to save, do something quick, and >> restore, you'd be better off allocating a big buffer rather than >> trying to find the smallest buffer you can get away with by reading >> XINUSE. Also, what happens if XINUSE nondeterministically changes out >> from under you before you do XSAVEC? I assume you can avoid this >> becoming a problem by using RFBM carefully. > > That's a good point. HJ, does your XGETBV1 code disable signals, btw? > A signal could theoretically change XINUSE if the handler modified the > on-stack XSAVE state before sigreturn. We are preserving the first 8 vector registers used by the caller to pass function parameters. As long as a signal doesn't change XINUSE from yes to no, we are OK. > Also, your glibc patch talks a lot about the upper parts of the registers > being zeroed, but that isn't precisely what XGETBV1 does, right? It > tells you whether the upper portion of the registers are in the init > state. But, the high parts of the registers could be zero, and not in > the init state, right? We are preserving the first 8 vector registers used by the caller to pass function parameters. We only care about the vector bits used to pass function parameters. If the caller doesn't use the upper bits, we will always zero the upper bits upon exit. > I'm missing something, though... Is the stuff in question here called > *every* time one of these AVX-using functions is called, or called only > the first time when the binding is done? Only the first time.
If you pass "-z now" to the linker or set LD_BIND_NOW=1, _dl_runtime_resolve won't be used. > It's also bonkers that software has to go to this trouble. This is > precisely what XSAVEOPT is supposed to do for us. -- H.J.
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 4:28 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 3:45 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 3:40 PM, H.J. Lu wrote: On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen > wrote: >> On 06/14/2017 10:18 PM, Andy Lutomirski wrote: >>> Dave, why is XINUSE exposed at all to userspace? >> >> You need it for XSAVEOPT when it is using the init optimization to be >> able to tell which state was written and which state in the XSAVE buffer >> is potentially stale with respect to what's in the registers. I guess >> you can just use XSAVE instead of XSAVEOPT, though. >> >> As you pointed out, if you are using XSAVEC's compaction features by >> leaving bits unset in the requested feature bitmap registers, you have >> no idea how much data XSAVEC will write, unless you read XINUSE with >> XGETBV. But, you can get around *that* by just presizing the XSAVE >> buffer to be big. > > I imagine that, if you're going to save, do something quick, and > restore, you'd be better off allocating a big buffer rather than > trying to find the smallest buffer you can get away with by reading > XINUSE. Also, what happens if XINUSE nondeterministically changes out > from under you before you do XSAVEC? I assume you can avoid this > becoming a problem by using RFBM carefully. > >> >> So, I guess that leaves its use to just figuring out how much XSAVEOPT >> (and friends) are going to write. >> >>> To be fair, glibc uses this new XGETBV feature, but I suspect its >>> usage is rather dubious. Shouldn't it just do XSAVEC directly rather >>> than rolling its own code? >> >> A quick grep through my glibc source only shows XGETBV(0) used which >> reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss >> it. > > Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. 
I wrote a test to compare latency against different approaches. This is on Skylake: [hjl@gnu-skl-1 glibc-test]$ make ./test move: 47212 fxsave : 719440 xsave : 925146 xsavec : 811036 xsave_state_size: 1088 xsave_state_comp_size: 896 load/store is about 17X faster than xsavec. I put my hjl/pr21265/xsavec branch at https://sourceware.org/git/?p=glibc.git;a=summary It uses xsave/xsave/xsavec in _dl_runtime_resolve. >>> >>> What is this used for? Is it just to avoid clobbering argument regs >>> when resolving a symbol that uses an ifunc, or is there more to it? >> >> It is used for lazy binding the first time when an external function is >> called. >> > > Maybe I'm just being dense, but why? What does ld.so need to do to > resolve a symbol and update the GOT that requires using extended > state? Since the first 8 vector registers are used to pass function parameters and ld.so uses vector registers, _dl_runtime_resolve needs to preserve the first 8 vector registers when transferring control to ld.so. -- H.J.
Re: xgetbv nondeterminism
On 06/15/2017 03:18 PM, Andy Lutomirski wrote: >> As you pointed out, if you are using XSAVEC's compaction features by >> leaving bits unset in the requested feature bitmap registers, you have >> no idea how much data XSAVEC will write, unless you read XINUSE with >> XGETBV. But, you can get around *that* by just presizing the XSAVE >> buffer to be big. > I imagine that, if you're going to save, do something quick, and > restore, you'd be better off allocating a big buffer rather than > trying to find the smallest buffer you can get away with by reading > XINUSE. Also, what happens if XINUSE nondeterministically changes out > from under you before you do XSAVEC? I assume you can avoid this > becoming a problem by using RFBM carefully. That's a good point. HJ, does your XGETBV1 code disable signals, btw? A signal could theoretically change XINUSE if the handler modified the on-stack XSAVE state before sigreturn. Also, your glibc patch talks a lot about the upper parts of the registers being zeroed, but that isn't precisely what XGETBV1 does, right? It tells you whether the upper portion of the registers are in the init state. But, the high parts of the registers could be zero, and not in the init state, right? I'm missing something, though... Is the stuff in question here called *every* time one of these AVX-using functions is called, or called only the first time when the binding is done? It's also bonkers that software has to go to this trouble. This is precisely what XSAVEOPT is supposed to do for us.
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 4:11 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 3:45 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 3:40 PM, H.J. Lu wrote: >>> On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen wrote: > On 06/14/2017 10:18 PM, Andy Lutomirski wrote: >> Dave, why is XINUSE exposed at all to userspace? > > You need it for XSAVEOPT when it is using the init optimization to be > able to tell which state was written and which state in the XSAVE buffer > is potentially stale with respect to what's in the registers. I guess > you can just use XSAVE instead of XSAVEOPT, though. > > As you pointed out, if you are using XSAVEC's compaction features by > leaving bits unset in the requested feature bitmap registers, you have > no idea how much data XSAVEC will write, unless you read XINUSE with > XGETBV. But, you can get around *that* by just presizing the XSAVE > buffer to be big. I imagine that, if you're going to save, do something quick, and restore, you'd be better off allocating a big buffer rather than trying to find the smallest buffer you can get away with by reading XINUSE. Also, what happens if XINUSE nondeterministically changes out from under you before you do XSAVEC? I assume you can avoid this becoming a problem by using RFBM carefully. > > So, I guess that leaves its use to just figuring out how much XSAVEOPT > (and friends) are going to write. > >> To be fair, glibc uses this new XGETBV feature, but I suspect its >> usage is rather dubious. Shouldn't it just do XSAVEC directly rather >> than rolling its own code? > > A quick grep through my glibc source only shows XGETBV(0) used which > reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss it. Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. >>> >>> I wrote a test to compare latency against different approaches. 
This >>> is on Skylake: >>> >>> [hjl@gnu-skl-1 glibc-test]$ make >>> ./test >>> move: 47212 >>> fxsave : 719440 >>> xsave : 925146 >>> xsavec : 811036 >>> xsave_state_size: 1088 >>> xsave_state_comp_size: 896 >>> >>> load/store is about 17X faster than xsavec. >>> >>> I put my hjl/pr21265/xsavec branch at >>> >>> https://sourceware.org/git/?p=glibc.git;a=summary >>> >>> It uses xsave/xsave/xsavec in _dl_runtime_resolve. >> >> What is this used for? Is it just to avoid clobbering argument regs >> when resolving a symbol that uses an ifunc, or is there more to it? > > It is used for lazy binding the first time when an external function is > called. > Maybe I'm just being dense, but why? What does ld.so need to do to resolve a symbol and update the GOT that requires using extended state?
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 3:45 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 3:40 PM, H.J. Lu wrote: >> On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: >>> On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen wrote: On 06/14/2017 10:18 PM, Andy Lutomirski wrote: > Dave, why is XINUSE exposed at all to userspace? You need it for XSAVEOPT when it is using the init optimization to be able to tell which state was written and which state in the XSAVE buffer is potentially stale with respect to what's in the registers. I guess you can just use XSAVE instead of XSAVEOPT, though. As you pointed out, if you are using XSAVEC's compaction features by leaving bits unset in the requested feature bitmap registers, you have no idea how much data XSAVEC will write, unless you read XINUSE with XGETBV. But, you can get around *that* by just presizing the XSAVE buffer to be big. >>> >>> I imagine that, if you're going to save, do something quick, and >>> restore, you'd be better off allocating a big buffer rather than >>> trying to find the smallest buffer you can get away with by reading >>> XINUSE. Also, what happens if XINUSE nondeterministically changes out >>> from under you before you do XSAVEC? I assume you can avoid this >>> becoming a problem by using RFBM carefully. >>> So, I guess that leaves its use to just figuring out how much XSAVEOPT (and friends) are going to write. > To be fair, glibc uses this new XGETBV feature, but I suspect its > usage is rather dubious. Shouldn't it just do XSAVEC directly rather > than rolling its own code? A quick grep through my glibc source only shows XGETBV(0) used which reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss it. >>> >>> Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. >> >> I wrote a test to compare latency against different approaches. 
This >> is on Skylake: >> >> [hjl@gnu-skl-1 glibc-test]$ make >> ./test >> move: 47212 >> fxsave : 719440 >> xsave : 925146 >> xsavec : 811036 >> xsave_state_size: 1088 >> xsave_state_comp_size: 896 >> >> load/store is about 17X faster than xsavec. >> >> I put my hjl/pr21265/xsavec branch at >> >> https://sourceware.org/git/?p=glibc.git;a=summary >> >> It uses xsave/xsave/xsavec in _dl_runtime_resolve. > > What is this used for? Is it just to avoid clobbering argument regs > when resolving a symbol that uses an ifunc, or is there more to it? It is used for lazy binding the first time when an external function is called. -- H.J.
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 3:40 PM, H.J. Lu wrote: > On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: >> On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen wrote: >>> On 06/14/2017 10:18 PM, Andy Lutomirski wrote: Dave, why is XINUSE exposed at all to userspace? >>> >>> You need it for XSAVEOPT when it is using the init optimization to be >>> able to tell which state was written and which state in the XSAVE buffer >>> is potentially stale with respect to what's in the registers. I guess >>> you can just use XSAVE instead of XSAVEOPT, though. >>> >>> As you pointed out, if you are using XSAVEC's compaction features by >>> leaving bits unset in the requested feature bitmap registers, you have >>> no idea how much data XSAVEC will write, unless you read XINUSE with >>> XGETBV. But, you can get around *that* by just presizing the XSAVE >>> buffer to be big. >> >> I imagine that, if you're going to save, do something quick, and >> restore, you'd be better off allocating a big buffer rather than >> trying to find the smallest buffer you can get away with by reading >> XINUSE. Also, what happens if XINUSE nondeterministically changes out >> from under you before you do XSAVEC? I assume you can avoid this >> becoming a problem by using RFBM carefully. >> >>> >>> So, I guess that leaves its use to just figuring out how much XSAVEOPT >>> (and friends) are going to write. >>> To be fair, glibc uses this new XGETBV feature, but I suspect its usage is rather dubious. Shouldn't it just do XSAVEC directly rather than rolling its own code? >>> >>> A quick grep through my glibc source only shows XGETBV(0) used which >>> reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss it. >> >> Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. > > I wrote a test to compare latency against different approaches. 
This > is on Skylake: > > [hjl@gnu-skl-1 glibc-test]$ make > ./test > move: 47212 > fxsave : 719440 > xsave : 925146 > xsavec : 811036 > xsave_state_size: 1088 > xsave_state_comp_size: 896 > > load/store is about 17X faster than xsavec. > > I put my hjl/pr21265/xsavec branch at > > https://sourceware.org/git/?p=glibc.git;a=summary > > It uses xsave/xsave/xsavec in _dl_runtime_resolve. What is this used for? Is it just to avoid clobbering argument regs when resolving a symbol that uses an ifunc, or is there more to it?
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 3:18 PM, Andy Lutomirski wrote: > On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen wrote: >> On 06/14/2017 10:18 PM, Andy Lutomirski wrote: >>> Dave, why is XINUSE exposed at all to userspace? >> >> You need it for XSAVEOPT when it is using the init optimization to be >> able to tell which state was written and which state in the XSAVE buffer >> is potentially stale with respect to what's in the registers. I guess >> you can just use XSAVE instead of XSAVEOPT, though. >> >> As you pointed out, if you are using XSAVEC's compaction features by >> leaving bits unset in the requested feature bitmap registers, you have >> no idea how much data XSAVEC will write, unless you read XINUSE with >> XGETBV. But, you can get around *that* by just presizing the XSAVE >> buffer to be big. > > I imagine that, if you're going to save, do something quick, and > restore, you'd be better off allocating a big buffer rather than > trying to find the smallest buffer you can get away with by reading > XINUSE. Also, what happens if XINUSE nondeterministically changes out > from under you before you do XSAVEC? I assume you can avoid this > becoming a problem by using RFBM carefully. > >> >> So, I guess that leaves its use to just figuring out how much XSAVEOPT >> (and friends) are going to write. >> >>> To be fair, glibc uses this new XGETBV feature, but I suspect its >>> usage is rather dubious. Shouldn't it just do XSAVEC directly rather >>> than rolling its own code? >> >> A quick grep through my glibc source only shows XGETBV(0) used which >> reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss it. > > Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version. I wrote a test to compare latency against different approaches. This is on Skylake: [hjl@gnu-skl-1 glibc-test]$ make ./test move: 47212 fxsave : 719440 xsave : 925146 xsavec : 811036 xsave_state_size: 1088 xsave_state_comp_size: 896 load/store is about 17X faster than xsavec. 
I put my hjl/pr21265/xsavec branch at https://sourceware.org/git/?p=glibc.git;a=summary It uses fxsave/xsave/xsavec in _dl_runtime_resolve. -- H.J. [Attachment: plt.tar.xz]
Re: xgetbv nondeterminism
On Thu, Jun 15, 2017 at 7:33 AM, Dave Hansen wrote: > On 06/14/2017 10:18 PM, Andy Lutomirski wrote: >> Dave, why is XINUSE exposed at all to userspace? > > You need it for XSAVEOPT when it is using the init optimization to be > able to tell which state was written and which state in the XSAVE buffer > is potentially stale with respect to what's in the registers. I guess > you can just use XSAVE instead of XSAVEOPT, though. > > As you pointed out, if you are using XSAVEC's compaction features by > leaving bits unset in the requested feature bitmap registers, you have > no idea how much data XSAVEC will write, unless you read XINUSE with > XGETBV. But, you can get around *that* by just presizing the XSAVE > buffer to be big. I imagine that, if you're going to save, do something quick, and restore, you'd be better off allocating a big buffer rather than trying to find the smallest buffer you can get away with by reading XINUSE. Also, what happens if XINUSE nondeterministically changes out from under you before you do XSAVEC? I assume you can avoid this becoming a problem by using RFBM carefully. > > So, I guess that leaves its use to just figuring out how much XSAVEOPT > (and friends) are going to write. > >> To be fair, glibc uses this new XGETBV feature, but I suspect its >> usage is rather dubious. Shouldn't it just do XSAVEC directly rather >> than rolling its own code? > > A quick grep through my glibc source only shows XGETBV(0) used which > reads XCR0. I don't see any XGETBV(1) which reads XINUSE. Did I miss it. Take a look at sysdeps/x86_64/dl-trampoline.h in a new enough version.
Re: xgetbv nondeterminism
On 06/14/2017 10:18 PM, Andy Lutomirski wrote: > Dave, why is XINUSE exposed at all to userspace? You need it for XSAVEOPT when it is using the init optimization to be able to tell which state was written and which state in the XSAVE buffer is potentially stale with respect to what's in the registers. I guess you can just use XSAVE instead of XSAVEOPT, though. As you pointed out, if you are using XSAVEC's compaction features by leaving bits unset in the requested feature bitmap registers, you have no idea how much data XSAVEC will write, unless you read XINUSE with XGETBV. But, you can get around *that* by just presizing the XSAVE buffer to be big. So, I guess that leaves its use to just figuring out how much XSAVEOPT (and friends) are going to write. > To be fair, glibc uses this new XGETBV feature, but I suspect its > usage is rather dubious. Shouldn't it just do XSAVEC directly rather > than rolling its own code? A quick grep through my glibc source only shows XGETBV(0) used, which reads XCR0. I don't see any XGETBV(1), which reads XINUSE. Did I miss it?
xgetbv nondeterminism
I saw your post about XGETBV (http://robert.ocallahan.org/2017/06/another-case-of-obscure-cpu.html), and it sounds like it could plausibly be a kernel bug. What kernel are you on? I wonder if CPUs have an optimization in which, if a given register set is in the init state but XINUSE=1, then they'll notice when XRSTORS runs and clear XINUSE. If so, that would be rather unfortunate IMO. Dave, why is XINUSE exposed at all to userspace? IIRC it's visible via XGETBV, XSAVEC, and maybe even XSAVE, and ISTM that making it visible to userspace serves basically no performance purpose and just encourages annoying corner cases. I can see an argument that making XSAVEC fast is nice for user threading libraries, but user threading libraries that use XSAVEC are probably doing it wrong and should instead rely on ABI guarantees to avoid saving and restoring extended state at all. To be fair, glibc uses this new XGETBV feature, but I suspect its usage is rather dubious. Shouldn't it just do XSAVEC directly rather than rolling its own code?