Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 02, 2016 at 09:38:38AM -0800, Andy Lutomirski wrote:
> TBH, I didn't start down this path for performance. I did it because
> I wanted to kill off a CPUID that was breaking on old CPUs that don't
> have CPUID. So I propose MOV-to-CR2 followed by an unconditional
> jump. My goal here is to make the #*!& thing work reliably and not be
> ludicrously slow. Borislav and I mulled over using an alternative to
> use CPUID if and only if we have CPUID, but that doesn't work because
> we call sync_core() before we're done applying alternatives.

Btw, if the noinline thing which Linus suggested works out, we can still do the alternatives thing with CPUID in sync_core(), because we won't need it in alternatives.c itself anymore.

Just putting it on the table: I know you complained about the mess yesterday on IRC, and in case the CR2 move looks painful on Xen, we can still do what we initially considered. I.e., that thing:

+	/* Do a CPUID if available, otherwise do a forward jump. */
+	alternative_io("jmp 1f\n\t1:", "cpuid",
+		       X86_FEATURE_CPUID,
+		       "=a" (tmp),
+		       "0" (1)
+		       : "ebx", "ecx", "edx", "memory");

--
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 2:55 PM, Andy Lutomirski wrote:
>>
>> Honestly, I think Intel should clean up their documentation.
>
> I'm not sure I follow. If a user program gets migrated, it might end
> up doing cross-modification when it expects self-modification. If
> that trips the program up, is that a user bug or a kernel bug?

Well, the user may not see it as a cross-modification.

Example: user compiles a program, and writes out the new binary. That write goes to the page cache. The user then immediately executes that program.

It's technically a "cross modification", because the build that wrote the page cache ran on one CPU, and then it gets loaded on another.

Now, page faulting the binary does bring in a known serializing instruction: iret. But let's theorize that we have your "optimistic sysret" return path, because sometimes it can happen. So the "iret" isn't exactly fundamental.

But we know we will write %cr2, which is a serializing instruction.

But that's not fundamental either, because we could just have a program load the object file into its own address space using the dynamic linker. And if you don't unmap anything, there won't be any TLB flushes.

Now, that is safe too, but by then we're not even relying on a serializing instruction at all -- simply on the fact that the code couldn't even have been in any virtual caches in the running environment, so it _must_ have come from the physically indexed data cache. So no actual serializing instruction is even _needed_.

So there is no room for any I$ cache coherency issue at any point, but note how we got to the point where we're now basically depending on some fairly fundamental logic that is not in the Intel documentation?

THAT is what I don't like. I don't doubt for a moment that what we're doing is entirely coherent, and we're fine. But the Intel memory ordering documentation simply doesn't cover this situation at all. The "real" memory ordering documentation only covers the normal data cache.

And then they handwave the "self-modifying code" situation with incomplete examples and just bullshit "you need a serializing instruction", which clearly isn't actually the case, and is also something that we very much don't even do.

It would be better if there was actual documentation, and we had some nice clear "yeah, we don't need any stinking serializing instructions, because we're already doing X".

Linus
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 1:10 PM, Linus Torvalds wrote:
> On Fri, Dec 2, 2016 at 12:41 PM, Andy Lutomirski wrote:
>>
>> Because, if so, we should maybe serialize whenever we migrate a
>> process to a different CPU.
>
> The intel docs are bad on this issue.
>
> Technically what we do could fall under the "cross-modifying code"
> case, where one CPU does the write, and then we run it on another CPU.
>
> And no, we do *not* do a serializing instruction before returning to
> user space. Sure, we might do an iret (which is serializing), but we
> equally well might be doing a sysret (which is not).
>
> Honestly, I think Intel should clean up their documentation.

I'm not sure I follow. If a user program gets migrated, it might end up doing cross-modification when it expects self-modification. If that trips the program up, is that a user bug or a kernel bug?

Admittedly, I'd be very surprised if this happened in practice. Migration is *slow*, caches tend to get blown away, lots of code gets executed, etc. Presumably any prefetched / trace-cached / decoded / i-cached user code is long gone when we migrate.
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 12:41 PM, Andy Lutomirski wrote:
>
> Because, if so, we should maybe serialize whenever we migrate a
> process to a different CPU.

The intel docs are bad on this issue.

Technically what we do could fall under the "cross-modifying code" case, where one CPU does the write, and then we run it on another CPU.

And no, we do *not* do a serializing instruction before returning to user space. Sure, we might do an iret (which is serializing), but we equally well might be doing a sysret (which is not).

Honestly, I think Intel should clean up their documentation.

> (We *definitely* need to flush the store buffer when migrating,

There is no such thing as flushing the store buffer. But we do end up doing a memory barrier which gives you the required semantics.

That's not a problem. Those operations are fast. The serializing instructions are not.

Linus
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 11:35 AM, Linus Torvalds wrote:
> On Fri, Dec 2, 2016 at 11:30 AM, Andy Lutomirski wrote:
>>
>> How's this?
>
> Looks ok. I do think that
>
>> I suppose it could be an unconditional IRET-to-self, but that's a good
>> deal slower and not a whole lot simpler. Although if we start doing
>> it right, performance won't really matter here.
>
> Considering you already got the iret-to-self wrong in the first
> version, I really like the "one single unconditional version" so that
> everybody tests that _one_ thing and there isn't anything subtle going
> on.
>
> Hmm?

Okay, sold. It makes the patchset much much shorter, too.

> And yes, if it turns out that performance matters, we almost certainly
> are doing something really wrong, and we shouldn't be using that
> sync_core() thing in that place anyway.

To play devil's advocate (and definitely out of scope for this particular patchset), is user code permitted to do:

1. Map a page RX at one address and RW somewhere else (for nice ASLR).
2. Write to the RW mapping.
3. CPUID or IRET-to-self.
4. Jump to the RX mapping.

Because, if so, we should maybe serialize whenever we migrate a process to a different CPU. (We *definitely* need to flush the store buffer when migrating, because the x86 architecture makes some memory ordering promises that get broken if a store from a thread stays in the store buffer of a different CPU when the thread gets migrated.)

And if we're going to start serializing when migrating a thread, then we actually care about performance, in which case we should optimize the crap out of this thing, which probably means using MFENCE on AMD CPUs. (AMD promises that MFENCE flushes the pipeline. Intel seems to be confused as to exactly what effect MFENCE has, or at least I'm confused as to what Intel thinks MFENCE does.) And we should make sure that we only do the extra flush when we don't switch mms.

--Andy
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On 12/02/2016 06:44 AM, Andrew Cooper wrote:
> On 02/12/16 00:35, Andy Lutomirski wrote:
>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
>> serialize on Xen PV, but, in practice, any trap it generates will
>> serialize.)
>
> Well, Xen will enable CPUID faulting wherever it can, which is
> realistically all IvyBridge hardware and newer.
>
> All hypercalls are a privilege change to cpl0. I'd hope this condition
> is serialising, but I can't actually find any documentation proving or
> disproving this.
>
>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
>> should end up being a nice speedup.
>>
>> Cc: Andrew Cooper
>> Signed-off-by: Andy Lutomirski

Executing CPUID in an HVM guest is quite expensive since it will cause a VMEXIT. (And that should be true for any hypervisor, at least on Intel. On AMD it's configurable.)

-boris
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 11:30 AM, Andy Lutomirski wrote:
>
> How's this?

Looks ok. I do think that

> I suppose it could be an unconditional IRET-to-self, but that's a good
> deal slower and not a whole lot simpler. Although if we start doing
> it right, performance won't really matter here.

Considering you already got the iret-to-self wrong in the first version, I really like the "one single unconditional version" so that everybody tests that _one_ thing and there isn't anything subtle going on.

Hmm?

And yes, if it turns out that performance matters, we almost certainly are doing something really wrong, and we shouldn't be using that sync_core() thing in that place anyway.

Linus
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Dec 2, 2016 10:48 AM, "Boris Ostrovsky" wrote:
>
> On 12/02/2016 06:44 AM, Andrew Cooper wrote:
> > On 02/12/16 00:35, Andy Lutomirski wrote:
> >> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
> >> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
> >> serialize on Xen PV, but, in practice, any trap it generates will
> >> serialize.)
> > Well, Xen will enable CPUID faulting wherever it can, which is
> > realistically all IvyBridge hardware and newer.
> >
> > All hypercalls are a privilege change to cpl0. I'd hope this condition
> > is serialising, but I can't actually find any documentation proving or
> > disproving this.
> >
> >> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
> >> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
> >> should end up being a nice speedup.
> >>
> >> Cc: Andrew Cooper
> >> Signed-off-by: Andy Lutomirski
> > CC'ing xen-devel and the Xen maintainers in Linux.
> >
> > As this is the only email from this series in my inbox, I will say this
> > here, but it should really be against patch 6.
> >
> > A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
> > serialising on the 486, but I don't have a manual to hand to check.
> >
> > ~Andrew
> >
> >> ---
> >>  arch/x86/xen/enlighten.c | 35 +++
> >>  1 file changed, 35 insertions(+)
> >>
> >> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> >> index bdd855685403..1f765b41eee7 100644
> >> --- a/arch/x86/xen/enlighten.c
> >> +++ b/arch/x86/xen/enlighten.c
> >> @@ -311,6 +311,39 @@ static __read_mostly unsigned int cpuid_leaf1_ecx_set_mask;
> >>  static __read_mostly unsigned int cpuid_leaf5_ecx_val;
> >>  static __read_mostly unsigned int cpuid_leaf5_edx_val;
> >>
> >> +static void xen_sync_core(void)
> >> +{
> >> +	register void *__sp asm(_ASM_SP);
> >> +
> >> +#ifdef CONFIG_X86_32
> >> +	asm volatile (
> >> +		"pushl %%ss\n\t"
> >> +		"pushl %%esp\n\t"
> >> +		"addl $4, (%%esp)\n\t"
> >> +		"pushfl\n\t"
> >> +		"pushl %%cs\n\t"
> >> +		"pushl $1f\n\t"
> >> +		"iret\n\t"
> >> +		"1:"
> >> +		: "+r" (__sp) : : "cc");
>
> This breaks 32-bit PV guests.
>
> Why are we pushing %ss? We are not changing privilege levels, so why not
> just push flags, cs and eip (which, incidentally, does work)?

Doh! I carefully tested 64-bit on Xen and 32-bit in user mode.
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 11:24 AM, Linus Torvalds wrote:
> On Fri, Dec 2, 2016 at 11:20 AM, Borislav Petkov wrote:
>>
>> Something like below?
>
> The optimize-nops thing needs it too, I think.
>
> Again, this will never matter in practice (even if somebody has an i486
> still, the prefetch window size is like 16 bytes or something), but
> from a documentation standpoint it's good.

How's this?

/*
 * This function forces the icache and prefetched instruction stream to
 * catch up with reality in two very specific cases:
 *
 *  a) Text was modified using one virtual address and is about to be
 *     executed from the same physical page at a different virtual address.
 *
 *  b) Text was modified on a different CPU, may subsequently be
 *     executed on this CPU, and you want to make sure the new version
 *     gets executed.  This generally means you're calling this in an IPI.
 *
 * If you're calling this for a different reason, you're probably doing
 * it wrong.
 */
static inline void native_sync_core(void)
{
	...
}

The body will do a MOV-to-CR2 followed by jmp 1f; 1:. This sequence should be guaranteed to flush the pipeline on any real CPU. On Xen it will do IRET-to-self.

I suppose it could be an unconditional IRET-to-self, but that's a good deal slower and not a whole lot simpler. Although if we start doing it right, performance won't really matter here.

--Andy
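For concreteness, the body Andy describes might look something like the following kernel-only sketch. This is my guess at the shape, not the actual patch, and writing %cr2 is privileged, so it cannot run as a normal user program:

```c
static inline void native_sync_core(void)
{
	/*
	 * MOV to a control register is architecturally serializing
	 * (except CR8), and %cr2 is harmless to rewrite with the value
	 * it already holds.  The forward jump is belt-and-suspenders
	 * for very old parts: Andrew notes a %cr2 write is reportedly
	 * not serializing on the 486, and a jump flushes the prefetch
	 * queue there.
	 */
	unsigned long cr2 = native_read_cr2();

	asm volatile("mov %0, %%cr2\n\t"
		     "jmp 1f\n\t"
		     "1:"
		     : : "r" (cr2) : "memory");
}
```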
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 02, 2016 at 11:23:09AM -0800, Andy Lutomirski wrote:
> Not even firmware loading wants it. Firmware loading needs

Microcode...

> specifically cpuid(eax=1). It has nothing to do with serializing

... but yes, of course. NOT sync_core() but CPUID(1).

Thanks!

--
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 02, 2016 at 11:24:09AM -0800, Linus Torvalds wrote:
> The optimize-nops thing needs it too, I think.

Ah, it is called only from apply_alternatives() but sure, it is safer this way. Lemme do that and run it through the boxes to see whether anything catches fire.

> Again, this will never matter in practice (even if somebody has an i486
> still, the prefetch window size is like 16 bytes or something), but
> from a documentation standpoint it's good.

Right.

Thanks.

--
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 11:20 AM, Borislav Petkov wrote:
>
> Something like below?

The optimize-nops thing needs it too, I think.

Again, this will never matter in practice (even if somebody has an i486 still, the prefetch window size is like 16 bytes or something), but from a documentation standpoint it's good.

Linus
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 11:03 AM, Linus Torvalds wrote:
> On Fri, Dec 2, 2016 at 10:50 AM, Borislav Petkov wrote:
>>
>> Right, we can try to do something like invalidate_icache() or so in
>> there with the JMP so that the BSP refetches modified code and see where
>> it gets us.
>
> I'd really rather just mark it noinline with a comment. That way the
> return from the function acts as the control flow change.
>
>> The good thing is, the early patching paths run before SMP is
>> up but from looking at load_module(), for example, which does
>> post_relocation()->module_finalize()->apply_alternatives(), this can
>> happen late.
>>
>> Now there I'd like to avoid other cores walking into that address being
>> patched. Or are we "safe" there in the sense that load_module() happens
>> on one CPU only sequentially? (I haven't looked at that code to see
>> what's going on there, actually.)
>
> 'sync_core()' doesn't help for other CPUs anyway, you need to do the
> cross-call IPI. So worrying about other CPUs is *not* a valid reason
> to keep a "sync_core()" call.
>
> Seriously, the only reason I can see for "sync_core()" really is:
>
>  - some deep non-serialized MSR access or similar (ie things like
>    firmware loading etc really might want it, and a machine check might
>    want it)

Not even firmware loading wants it. Firmware loading needs specifically cpuid(eax=1). It has nothing to do with serializing anything -- it's just a CPU bug that was turned into "architecture".

I think it really is just cross-address or cross-core modification, and I'll add a comment to that effect.

--Andy
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 02, 2016 at 11:03:50AM -0800, Linus Torvalds wrote: > I'd really rather just mark it noinline with a comment. That way the > return from the function acts as the control flow change. Something like below? It boots in a guest but that doesn't mean anything. > 'sync_core()' doesn't help for other CPU's anyway, you need to do the > cross-call IPI. So worrying about other CPU's is *not* a valid reason > to keep a "sync_core()" call. Yeah, no, I'm not crazy about it either - I was just sanity-checking all call sites of apply_alternatives(). But as you say, we would've gotten much bigger problems if other CPUs would walk in there on us. > Seriously, the only reason I can see for "sync_core()" really is: > > - some deep non-serialized MSR access or similar (ie things like > firmware loading etc really might want it, and a machine check might > want it) Yah, we do it in the #MC handler - apparently we need it there - and in the microcode loader to tickle out the version of the microcode currently applied into the MSR. > The issues with modifying code while another CPU may be just about to > access it is a separate issue. And as noted, "sync_core()" is not > sufficient for that, you have to do a whole careful dance with > single-byte debug instruction writes and then a final cross-call. > > See the whole "text_poke_bp()" and "text_poke()" for *that* whole > dance. That's a much more complex thing from the normal > apply_alternatives(). Yeah.
--- diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 5cb272a7a5a3..b1d0c35e6dcb 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -346,7 +346,6 @@ static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr) local_irq_save(flags); add_nops(instr + (a->instrlen - a->padlen), a->padlen); - sync_core(); local_irq_restore(flags); DUMP_BYTES(instr, a->instrlen, "%p: [%d:%d) optimized NOPs: ", @@ -359,9 +358,12 @@ static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr) * This implies that asymmetric systems where APs have less capabilities than * the boot processor are not handled. Tough. Make sure you disable such * features by hand. + * + * Marked "noinline" to cause control flow change and thus insn cache + * to refetch changed I$ lines. */ -void __init_or_module apply_alternatives(struct alt_instr *start, -struct alt_instr *end) +void __init_or_module noinline apply_alternatives(struct alt_instr *start, + struct alt_instr *end) { struct alt_instr *a; u8 *instr, *replacement; @@ -667,7 +669,6 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode, unsigned long flags; local_irq_save(flags); memcpy(addr, opcode, len); - sync_core(); local_irq_restore(flags); /* Could also do a CLFLUSH here to speed up CPU recovery; but that causes hangs on some VIA CPUs. */ -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 10:50 AM, Borislav Petkov wrote: > > Right, we can try to do something like invalidate_icache() or so in > there with the JMP so that the BSP refetches modified code and see where > it gets us. I'd really rather just mark it noinline with a comment. That way the return from the function acts as the control flow change. > The good thing is, the early patching paths run before SMP is > up but from looking at load_module(), for example, which does > post_relocation()->module_finalize()->apply_alternatives(), this can > happen late. > > Now there I'd like to avoid other cores walking into that address being > patched. Or are we "safe" there in the sense that load_module() happens > on one CPU only sequentially? (I haven't looked at that code to see > what's going on there, actually). 'sync_core()' doesn't help for other CPU's anyway, you need to do the cross-call IPI. So worrying about other CPU's is *not* a valid reason to keep a "sync_core()" call. Seriously, the only reason I can see for "sync_core()" really is: - some deep non-serialized MSR access or similar (ie things like firmware loading etc really might want it, and a machine check might want it) - writing to one virtual address, and then executing from another. We do this for (a) user mode (of course) and (b) text_poke(), but text_poke() is a whole different dance. but I may have forgotten some other case. The issues with modifying code while another CPU may be just about to access it is a separate issue. And as noted, "sync_core()" is not sufficient for that, you have to do a whole careful dance with single-byte debug instruction writes and then a final cross-call. See the whole "text_poke_bp()" and "text_poke()" for *that* whole dance. That's a much more complex thing from the normal apply_alternatives(). Linus
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 02, 2016 at 10:27:29AM -0800, Linus Torvalds wrote: > That is, of course, assuming that there is no really subtle reason > why that stupid sync_core() is there. Right, we can try to do something like invalidate_icache() or so in there with the JMP so that the BSP refetches modified code and see where it gets us. The good thing is, the early patching paths run before SMP is up but from looking at load_module(), for example, which does post_relocation()->module_finalize()->apply_alternatives(), this can happen late. Now there I'd like to avoid other cores walking into that address being patched. Or are we "safe" there in the sense that load_module() happens on one CPU only sequentially? (I haven't looked at that code to see what's going on there, actually). -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On 12/02/2016 06:44 AM, Andrew Cooper wrote: > On 02/12/16 00:35, Andy Lutomirski wrote: >> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't >> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to >> serialize on Xen PV, but, in practice, any trap it generates will >> serialize.) > Well, Xen will enabled CPUID Faulting wherever it can, which is > realistically all IvyBridge hardware and newer. > > All hypercalls are a privilege change to cpl0. I'd hope this condition > is serialising, but I can't actually find any documentation proving or > disproving this. > >> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is >> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self >> should end up being a nice speedup. >> >> Cc: Andrew Cooper >> Signed-off-by: Andy Lutomirski > CC'ing xen-devel and the Xen maintainers in Linux. > > As this is the only email from this series in my inbox, I will say this > here, but it should really be against patch 6. > > A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not > serialising on the 486, but I don't have a manual to hand to check. > > ~Andrew > >> --- >> arch/x86/xen/enlighten.c | 35 +++ >> 1 file changed, 35 insertions(+) >> >> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c >> index bdd855685403..1f765b41eee7 100644 >> --- a/arch/x86/xen/enlighten.c >> +++ b/arch/x86/xen/enlighten.c >> @@ -311,6 +311,39 @@ static __read_mostly unsigned int >> cpuid_leaf1_ecx_set_mask; >> static __read_mostly unsigned int cpuid_leaf5_ecx_val; >> static __read_mostly unsigned int cpuid_leaf5_edx_val; >> >> +static void xen_sync_core(void) >> +{ >> +register void *__sp asm(_ASM_SP); >> + >> +#ifdef CONFIG_X86_32 >> +asm volatile ( >> +"pushl %%ss\n\t" >> +"pushl %%esp\n\t" >> +"addl $4, (%%esp)\n\t" >> +"pushfl\n\t" >> +"pushl %%cs\n\t" >> +"pushl $1f\n\t" >> +"iret\n\t" >> +"1:" >> +: "+r" (__sp) : : "cc"); This breaks 32-bit PV guests. Why are we pushing %ss? 
We are not changing privilege levels so why not just flags, cs and eip (which, incidentally, does work)? -boris >> +#else >> +unsigned long tmp; >> + >> +asm volatile ( >> +"movq %%ss, %0\n\t" >> +"pushq %0\n\t" >> +"pushq %%rsp\n\t" >> +"addq $8, (%%rsp)\n\t" >> +"pushfq\n\t" >> +"movq %%cs, %0\n\t" >> +"pushq %0\n\t" >> +"pushq $1f\n\t" >> +"iretq\n\t" >> +"1:" >> +: "=r" (tmp), "+r" (__sp) : : "cc"); >> +#endif >> +} >> + >> static void xen_cpuid(unsigned int *ax, unsigned int *bx, >>unsigned int *cx, unsigned int *dx) >> { >> @@ -1289,6 +1322,8 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst >> = { >> >> .start_context_switch = paravirt_start_context_switch, >> .end_context_switch = xen_end_context_switch, >> + >> +.sync_core = xen_sync_core, >> }; >> >> static void xen_reboot(int reason)
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 10:03 AM, Borislav Petkov wrote: > > SNB: > * before: > * after: I suspect it's entirely invisible on raw hardware. But quite possibly more noticeable in a VM that takes slow faults for every case. But yes, even there it's probably not *that* noticeable. I'd prefer to get rid of them just because I don't like voodoo programming. If there is no actual reason for "sync_core()", we shouldn't have one there. Even if it were entirely free and the compiler optimized it away because the compiler was so smart that it would see that it's pointless, it's misleading and wrong on a source level. That is, of course, assuming that there is no really subtle reason why that stupid sync_core() is there. Linus
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 02, 2016 at 09:38:38AM -0800, Andy Lutomirski wrote: > apply_alternatives, unfortunately. It's performance-critical because > it's intensely stupid and does sync_core() for every single patch. > Fixing that would be nice, too. So I did experiment at the time to batch that sync_core() and interrupts disabling in text_poke_early() but that amounted to almost nothing. That's why I didn't bother chasing this further. I can try to dig out that old thread... /me goes and searches... ah, here's something: SNB: * before: [0.011707] SMP alternatives: Starting apply_alternatives... [0.011784] SMP alternatives: done. [0.011806] Freeing SMP alternatives: 20k freed * after: [0.011699] SMP alternatives: Starting apply_alternatives... [0.011721] SMP alternatives: done. [0.011743] Freeing SMP alternatives: 20k freed -> 63 microseconds speedup * kvm guest: * before: [0.017005] SMP alternatives: Starting apply_alternatives... [0.019024] SMP alternatives: done. [0.020119] Freeing SMP alternatives: 20k freed * after: [0.015008] SMP alternatives: Starting apply_alternatives... [0.016029] SMP alternatives: done. [0.017118] Freeing SMP alternatives: 20k freed -> ~3 milliseconds speedup --- I tried something simple like this: --- diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 1850592f4700..f4d5689ea503 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -253,10 +253,13 @@ void __init_or_module apply_alternatives(struct alt_instr *start, struct alt_instr *end) { struct alt_instr *a; + unsigned long flags; u8 *instr, *replacement; u8 insnbuf[MAX_PATCH_LEN]; DPRINTK("%s: alt table %p -> %p\n", __func__, start, end); + + local_irq_save(flags); /* * The scan order should be from start to end. A later scanned * alternative code can overwrite a previous scanned alternative code. 
@@ -284,8 +287,10 @@ void __init_or_module apply_alternatives(struct alt_instr *start, add_nops(insnbuf + a->replacementlen, a->instrlen - a->replacementlen); - text_poke_early(instr, insnbuf, a->instrlen); + memcpy(instr, insnbuf, a->instrlen); } + sync_core(); + local_irq_restore(flags); } #ifdef CONFIG_SMP --- I could try and redo the measurements again but I doubt it'll be any different. Unless I haven't made a mistake somewhere... -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 9:38 AM, Andy Lutomirski wrote: > > apply_alternatives, unfortunately. It's performance-critical because > it's intensely stupid and does sync_core() for every single patch. > Fixing that would be nice, too. So looking at text_poke_early(), that's very much a case that really shouldn't need any "sync_core()" at all as far as I can tell. Only the current CPU is running, and for local CPU I$ coherence all you need is a jump instruction, and even that is only on really old CPU's. From the PPro onwards (maybe even Pentium?) the I$ is entirely serialized as long as you change the data using the same linear address. So at most, that function could mark itself "noinline" just to guarantee that it will cause a control flow change before returning. The sync_core() seems entirely bogus. Same goes for optimize_nops() too. Linus
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 9:32 AM, Linus Torvalds wrote: > On Thu, Dec 1, 2016 at 4:35 PM, Andy Lutomirski wrote: >> >> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is >> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self >> should end up being a nice speedup. > > So if we care deeply about the performance of this, we should really > ask ourselves how much we need this... > > There are *very* few places where we really need to do a full > serializing instruction, and I'd worry that we really don't need it in > many of the places we do this. > > The only real case I'm aware of is modifying code that is modified > through a different linear address than it's executed. TBH, I didn't start down this path for performance. I did it because I wanted to kill off a CPUID that was breaking on old CPUs that don't have CPUID. So I propose MOV-to-CR2 followed by an unconditional jump. My goal here is to make the #*!& thing work reliably and not be ludicrously slow. Borislav and I mulled over using an alternative to use CPUID if and only if we have CPUID, but that doesn't work because we call sync_core() before we're done applying alternatives. > > Is there anything else where we _really_ need this sync-core thing? > Sure, the microcode loader looks fine, but that doesn't look > particularly performance-critical either. > > So I'd like to know which sync_core is actually so > performance-critical that we care about it, and then I'd like to > understand why it's needed at all, because I suspect a number of them > has been added with the model of "sprinkle random things around and > hope". apply_alternatives, unfortunately. It's performance-critical because it's intensely stupid and does sync_core() for every single patch. Fixing that would be nice, too. > Adding Peter Anvin to the participants list, because iirc he was the > one who really talked to hardware engineers about the synchronization > issues with serializing kernel code. > > Linus
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Thu, Dec 1, 2016 at 4:35 PM, Andy Lutomirski wrote: > > On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is > ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self > should end up being a nice speedup. So if we care deeply about the performance of this, we should really ask ourselves how much we need this... There are *very* few places where we really need to do a full serializing instruction, and I'd worry that we really don't need it in many of the places we do this. The only real case I'm aware of is modifying code that is modified through a different linear address than it's executed. Is there anything else where we _really_ need this sync-core thing? Sure, the microcode loader looks fine, but that doesn't look particularly performance-critical either. So I'd like to know which sync_core is actually so performance-critical that we care about it, and then I'd like to understand why it's needed at all, because I suspect a number of them has been added with the model of "sprinkle random things around and hope". Looking at ftrace, for example, which is one of the users, does it actually _really_ need sync_core() at all? It seems to use the kernel virtual address, and then the instruction stream will be coherent. Adding Peter Anvin to the participants list, because iirc he was the one who really talked to hardware engineers about the synchronization issues with serializing kernel code. Linus
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On 02/12/16 17:23, Andy Lutomirski wrote:
> On Fri, Dec 2, 2016 at 9:16 AM, Andrew Cooper wrote:
>> On 02/12/16 17:07, Andy Lutomirski wrote:
>>> On Dec 2, 2016 3:44 AM, "Andrew Cooper" wrote:
>>>> On 02/12/16 00:35, Andy Lutomirski wrote:
>>>>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>>>>> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
>>>>> serialize on Xen PV, but, in practice, any trap it generates will
>>>>> serialize.)
>>>> Well, Xen will enable CPUID Faulting wherever it can, which is
>>>> realistically all IvyBridge hardware and newer.
>>>>
>>>> All hypercalls are a privilege change to cpl0. I'd hope this condition
>>>> is serialising, but I can't actually find any documentation proving or
>>>> disproving this.
>>> I don't know for sure. IRET is serializing, and if Xen returns using
>>> IRET, we're fine.
>> All returns to a 64bit PV guest at defined points (hypercall return,
>> exception entry, etc) are from SYSRET, not IRET.
> But CPUID faulting isn't like this, right? Unless Xen does
> opportunistic SYSRET.

Correct. Xen doesn't do opportunistic SYSRET.

>
>> Talking of, I still have a patch to remove
>> PARAVIRT_ADJUST_EXCEPTION_FRAME which I need to complete and send upstream.
>
>>>>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>>>>> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
>>>>> should end up being a nice speedup.
>>>>>
>>>>> Cc: Andrew Cooper
>>>>> Signed-off-by: Andy Lutomirski
>>>> CC'ing xen-devel and the Xen maintainers in Linux.
>>>>
>>>> As this is the only email from this series in my inbox, I will say this
>>>> here, but it should really be against patch 6.
>>>>
>>>> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
>>>> serialising on the 486, but I don't have a manual to hand to check.
>>> I'll quote the (modern) SDM. For self-modifying code, "The use of one
>>> of these options is not required for programs intended to run on the
>>> Pentium or Intel486 processors, but are recommended to ensure
>>> compatibility with the P6 and more recent processor families."
>>> For cross-modifying code, "The use of this option is not required for
>>> programs intended to run on the Intel486 processor, but is recommended
>>> to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and
>>> Pentium processors." So I'm not sure there's a problem.
>> Fair enough. (Assuming similar properties hold for the older processors
>> of other vendors.)
> No, you were right -- a different section of the SDM contradicts it:
>
> For Intel486 processors, a write to an instruction in the cache will
> modify it in both the cache and memory, but if the instruction was
> prefetched before the write, the old version of the instruction could
> be the one executed. To prevent the old instruction from being
> executed, flush the instruction prefetch unit by coding a jump
> instruction immediately after any write that modifies an instruction.

:( Presumably this means patching has been subtly broken forever on the 486?

~Andrew
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Fri, Dec 2, 2016 at 9:16 AM, Andrew Cooper wrote:
> On 02/12/16 17:07, Andy Lutomirski wrote:
>> On Dec 2, 2016 3:44 AM, "Andrew Cooper" wrote:
>>> On 02/12/16 00:35, Andy Lutomirski wrote:
>>>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>>>> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
>>>> serialize on Xen PV, but, in practice, any trap it generates will
>>>> serialize.)
>>> Well, Xen will enable CPUID Faulting wherever it can, which is
>>> realistically all IvyBridge hardware and newer.
>>>
>>> All hypercalls are a privilege change to cpl0. I'd hope this condition
>>> is serialising, but I can't actually find any documentation proving or
>>> disproving this.
>> I don't know for sure. IRET is serializing, and if Xen returns using
>> IRET, we're fine.
>
> All returns to a 64bit PV guest at defined points (hypercall return,
> exception entry, etc) are from SYSRET, not IRET.

But CPUID faulting isn't like this, right? Unless Xen does
opportunistic SYSRET.

>
> Talking of, I still have a patch to remove
> PARAVIRT_ADJUST_EXCEPTION_FRAME which I need to complete and send upstream.
>
>>>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>>>> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
>>>> should end up being a nice speedup.
>>>>
>>>> Cc: Andrew Cooper
>>>> Signed-off-by: Andy Lutomirski
>>> CC'ing xen-devel and the Xen maintainers in Linux.
>>>
>>> As this is the only email from this series in my inbox, I will say this
>>> here, but it should really be against patch 6.
>>>
>>> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
>>> serialising on the 486, but I don't have a manual to hand to check.
>> I'll quote the (modern) SDM. For self-modifying code, "The use of one
>> of these options is not required for programs intended to run on the
>> Pentium or Intel486 processors, but are recommended to ensure
>> compatibility with the P6 and more recent processor families."
>> For cross-modifying code, "The use of this option is not required for
>> programs intended to run on the Intel486 processor, but is recommended
>> to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and
>> Pentium processors." So I'm not sure there's a problem.
>
> Fair enough. (Assuming similar properties hold for the older processors
> of other vendors.)

No, you were right -- a different section of the SDM contradicts it:

For Intel486 processors, a write to an instruction in the cache will
modify it in both the cache and memory, but if the instruction was
prefetched before the write, the old version of the instruction could
be the one executed. To prevent the old instruction from being
executed, flush the instruction prefetch unit by coding a jump
instruction immediately after any write that modifies an instruction.

--Andy
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On 02/12/16 17:07, Andy Lutomirski wrote:
> On Dec 2, 2016 3:44 AM, "Andrew Cooper" wrote:
>> On 02/12/16 00:35, Andy Lutomirski wrote:
>>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>>> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
>>> serialize on Xen PV, but, in practice, any trap it generates will
>>> serialize.)
>> Well, Xen will enable CPUID Faulting wherever it can, which is
>> realistically all IvyBridge hardware and newer.
>>
>> All hypercalls are a privilege change to cpl0. I'd hope this condition
>> is serialising, but I can't actually find any documentation proving or
>> disproving this.
> I don't know for sure. IRET is serializing, and if Xen returns using
> IRET, we're fine.

All returns to a 64bit PV guest at defined points (hypercall return,
exception entry, etc) are from SYSRET, not IRET.

Talking of, I still have a patch to remove
PARAVIRT_ADJUST_EXCEPTION_FRAME which I need to complete and send upstream.

>
>>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>>> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
>>> should end up being a nice speedup.
>>>
>>> Cc: Andrew Cooper
>>> Signed-off-by: Andy Lutomirski
>> CC'ing xen-devel and the Xen maintainers in Linux.
>>
>> As this is the only email from this series in my inbox, I will say this
>> here, but it should really be against patch 6.
>>
>> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
>> serialising on the 486, but I don't have a manual to hand to check.
> I'll quote the (modern) SDM. For self-modifying code, "The use of one
> of these options is not required for programs intended to run on the
> Pentium or Intel486 processors, but are recommended to ensure
> compatibility with the P6 and more recent processor families."
> For cross-modifying code, "The use of this option is not required for
> programs intended to run on the Intel486 processor, but is recommended
> to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and
> Pentium processors." So I'm not sure there's a problem.

Fair enough. (Assuming similar properties hold for the older processors
of other vendors.)

~Andrew
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Dec 2, 2016 3:44 AM, "Andrew Cooper" wrote:
>
> On 02/12/16 00:35, Andy Lutomirski wrote:
> > On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
> > guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
> > serialize on Xen PV, but, in practice, any trap it generates will
> > serialize.)
>
> Well, Xen will enable CPUID Faulting wherever it can, which is
> realistically all IvyBridge hardware and newer.
>
> All hypercalls are a privilege change to cpl0. I'd hope this condition
> is serialising, but I can't actually find any documentation proving or
> disproving this.

I don't know for sure. IRET is serializing, and if Xen returns using
IRET, we're fine.

>
> > On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
> > ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
> > should end up being a nice speedup.
> >
> > Cc: Andrew Cooper
> > Signed-off-by: Andy Lutomirski
>
> CC'ing xen-devel and the Xen maintainers in Linux.
>
> As this is the only email from this series in my inbox, I will say this
> here, but it should really be against patch 6.
>
> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
> serialising on the 486, but I don't have a manual to hand to check.

I'll quote the (modern) SDM. For self-modifying code, "The use of one
of these options is not required for programs intended to run on the
Pentium or Intel486 processors, but are recommended to ensure
compatibility with the P6 and more recent processor families."
For cross-modifying code, "The use of this option is not required for
programs intended to run on the Intel486 processor, but is recommended
to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and
Pentium processors." So I'm not sure there's a problem.

I can add an unconditional jump just to make sure. It costs basically
nothing on modern CPUs.

(Also, CPUID isn't serializing on the 486 according to the table.)

--Andy
Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On 02/12/16 00:35, Andy Lutomirski wrote:
> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
> guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
> serialize on Xen PV, but, in practice, any trap it generates will
> serialize.)

Well, Xen will enable CPUID Faulting wherever it can, which is
realistically all IvyBridge hardware and newer.

All hypercalls are a privilege change to cpl0. I'd hope this condition
is serialising, but I can't actually find any documentation proving or
disproving this.

>
> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
> ~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
> should end up being a nice speedup.
>
> Cc: Andrew Cooper
> Signed-off-by: Andy Lutomirski

CC'ing xen-devel and the Xen maintainers in Linux.

As this is the only email from this series in my inbox, I will say this
here, but it should really be against patch 6.

A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
serialising on the 486, but I don't have a manual to hand to check.

~Andrew

> ---
>  arch/x86/xen/enlighten.c | 35 +++++++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
>
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index bdd855685403..1f765b41eee7 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -311,6 +311,39 @@ static __read_mostly unsigned int cpuid_leaf1_ecx_set_mask;
>  static __read_mostly unsigned int cpuid_leaf5_ecx_val;
>  static __read_mostly unsigned int cpuid_leaf5_edx_val;
>
> +static void xen_sync_core(void)
> +{
> +	register void *__sp asm(_ASM_SP);
> +
> +#ifdef CONFIG_X86_32
> +	asm volatile (
> +		"pushl %%ss\n\t"
> +		"pushl %%esp\n\t"
> +		"addl $4, (%%esp)\n\t"
> +		"pushfl\n\t"
> +		"pushl %%cs\n\t"
> +		"pushl $1f\n\t"
> +		"iret\n\t"
> +		"1:"
> +		: "+r" (__sp) : : "cc");
> +#else
> +	unsigned long tmp;
> +
> +	asm volatile (
> +		"movq %%ss, %0\n\t"
> +		"pushq %0\n\t"
> +		"pushq %%rsp\n\t"
> +		"addq $8, (%%rsp)\n\t"
> +		"pushfq\n\t"
> +		"movq %%cs, %0\n\t"
> +		"pushq %0\n\t"
> +		"pushq $1f\n\t"
> +		"iretq\n\t"
> +		"1:"
> +		: "=r" (tmp), "+r" (__sp) : : "cc");
> +#endif
> +}
> +
>  static void xen_cpuid(unsigned int *ax, unsigned int *bx,
>  		      unsigned int *cx, unsigned int *dx)
>  {
> @@ -1289,6 +1322,8 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = {
>
>  	.start_context_switch = paravirt_start_context_switch,
>  	.end_context_switch = xen_end_context_switch,
> +
> +	.sync_core = xen_sync_core,
>  };
>
>  static void xen_reboot(int reason)
[PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation
On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
guaranteed to serialize. (Even CPUID isn't *really* guaranteed to
serialize on Xen PV, but, in practice, any trap it generates will
serialize.)

On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
~110ns. But Xen PV will trap CPUID if possible, so IRET-to-self
should end up being a nice speedup.

Cc: Andrew Cooper
Signed-off-by: Andy Lutomirski
---
 arch/x86/xen/enlighten.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index bdd855685403..1f765b41eee7 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -311,6 +311,39 @@ static __read_mostly unsigned int cpuid_leaf1_ecx_set_mask;
 static __read_mostly unsigned int cpuid_leaf5_ecx_val;
 static __read_mostly unsigned int cpuid_leaf5_edx_val;
 
+static void xen_sync_core(void)
+{
+	register void *__sp asm(_ASM_SP);
+
+#ifdef CONFIG_X86_32
+	asm volatile (
+		"pushl %%ss\n\t"
+		"pushl %%esp\n\t"
+		"addl $4, (%%esp)\n\t"
+		"pushfl\n\t"
+		"pushl %%cs\n\t"
+		"pushl $1f\n\t"
+		"iret\n\t"
+		"1:"
+		: "+r" (__sp) : : "cc");
+#else
+	unsigned long tmp;
+
+	asm volatile (
+		"movq %%ss, %0\n\t"
+		"pushq %0\n\t"
+		"pushq %%rsp\n\t"
+		"addq $8, (%%rsp)\n\t"
+		"pushfq\n\t"
+		"movq %%cs, %0\n\t"
+		"pushq %0\n\t"
+		"pushq $1f\n\t"
+		"iretq\n\t"
+		"1:"
+		: "=r" (tmp), "+r" (__sp) : : "cc");
+#endif
+}
+
 static void xen_cpuid(unsigned int *ax, unsigned int *bx,
 		      unsigned int *cx, unsigned int *dx)
 {
@@ -1289,6 +1322,8 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = {
 
 	.start_context_switch = paravirt_start_context_switch,
 	.end_context_switch = xen_end_context_switch,
+
+	.sync_core = xen_sync_core,
 };
 
 static void xen_reboot(int reason)
-- 
2.9.3