Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-03 Thread Borislav Petkov
On Fri, Dec 02, 2016 at 09:38:38AM -0800, Andy Lutomirski wrote:
> TBH, I didn't start down this path for performance.  I did it because
> I wanted to kill off a CPUID that was breaking on old CPUs that don't
> have CPUID.  So I propose MOV-to-CR2 followed by an unconditional
> jump.  My goal here is to make the #*!& thing work reliably and not be
> ludicrously slow.  Borislav and I mulled over using an alternative to
> use CPUID if and only if we have CPUID, but that doesn't work because
> we call sync_core() before we're done applying alternatives.

Btw, if the noinline thing Linus suggested works out, we can still do
the alternatives approach with CPUID in sync_core(), because we won't
need it in alternatives.c itself anymore.

Just putting it on the table: I know you complained about the mess
yesterday on IRC, and in case the CR2 move looks painful on Xen, we can
still do what we initially considered, i.e. this:

+   /* Do a CPUID if available, otherwise do a forward jump. */
+   alternative_io("jmp 1f\n\t1:", "cpuid",
+   X86_FEATURE_CPUID,
+   "=a" (tmp),
+   "0" (1)
+   : "ebx", "ecx", "edx", "memory");
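For context, here is roughly how that could be wrapped into sync_core()
- a sketch only, spelled with the bare ALTERNATIVE() macro so the
clobber list can be given explicitly, and assuming a synthetic
X86_FEATURE_CPUID flag is available by then:

/* Sketch, not merged code: CPUID when the CPU has it, a forward jump
 * otherwise.  Assumes <asm/alternative.h> and a synthetic
 * X86_FEATURE_CPUID bit set during early feature detection. */
static inline void sync_core(void)
{
        int tmp;

        asm volatile (ALTERNATIVE("jmp 1f\n\t1:", "cpuid", X86_FEATURE_CPUID)
                      : "=a" (tmp)
                      : "0" (1)
                      : "ebx", "ecx", "edx", "memory");
}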

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Linus Torvalds
On Fri, Dec 2, 2016 at 2:55 PM, Andy Lutomirski  wrote:
>>
>> Honestly, I think Intel should clean up their documentation.
>
> I'm not sure I follow.  If a user program gets migrated, it might end
> up doing cross-modification when it expects self-modification.  If
> that trips the program up, is that a user bug or a kernel bug?

Well, the user may not see it as a cross-modification.

Example: user compiles a program, and writes out the new binary. That
write goes to the page cache.

The user then immediately executes that program.

It's technically a "cross modification", because the build that wrote
the page cache ran on one CPU, and then it gets loaded on another.

Now, page faulting the binary does bring in a known serializing
instruction: iret.

But let's theorize that we have your "optimistic sysret" return path
because sometimes it can happen. So the "iret" isn't exactly
fundamental.

But we know we will write %cr2, which is a serializing instruction.

But that's not fundamental either, because we could just have a
program just load the object file into its own address space using the
dynamic linker. And if you don't unmap anything, there won't be any
TLB flushes.
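
For concreteness, a stripped-down user-space sketch of that kind of
pattern - code bytes stored through a regular file write, then executed
through a fresh PROT_EXEC mapping, with no serializing instruction
anywhere (the file name and code bytes are made up, error handling
omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* x86-64 machine code for: mov eax, 42; ret */
        static const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };
        int fd = open("/tmp/jitted.bin", O_RDWR | O_CREAT | O_TRUNC, 0700);

        write(fd, code, sizeof(code));          /* store goes via the page cache */

        int (*fn)(void) = (int (*)(void))mmap(NULL, 4096, PROT_READ | PROT_EXEC,
                                              MAP_PRIVATE, fd, 0);
        return fn();                            /* executed through the new mapping */
}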

Now, that is safe too, but by then we're relying simply on the fact
that the code couldn't have been in any virtual caches in the running
environment, so it _must_ have come from the physically indexed data
cache. So no actual serializing instruction is even _needed_.

So there is no room for any cache I$ coherency issue at any point, but
note how we got to the point where we're now basically depending on
some fairly fundamental logic that is not in the Intel documentation?

THAT is what I don't like. I don't doubt for a moment that what we're
doing is entirely coherent, and we're fine. But the intel memory
ordering documentation simply doesn't cover this situation at all. The
"real" memory ordering documentation only covers the normal data
cache. And then they handwave the "self-modifying code" situation with
incomplete examples and just bullshit "you need a serializing
instruction", which clearly isn't actually the case, and is also
something that we very much don't even do.

It would be better if there was actual documentation, and we had some
nice clear "yeah, we don't need any stinking serializing instructions,
because we're already doing X".

  Linus


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Andy Lutomirski
On Fri, Dec 2, 2016 at 1:10 PM, Linus Torvalds
 wrote:
> On Fri, Dec 2, 2016 at 12:41 PM, Andy Lutomirski  wrote:
>>
>> Because, if so, we should maybe serialize whenever we migrate a
>> process to a different CPU.
>
> The intel docs are bad on this issue.
>
> Technically what we do could fall under the "cross-modifying code"
> case, where one CPU does the write, and then we run it on another CPU.
>
> And no, we do *not* do a serializing instruction before returning to
> user space. Sure, we might do an iret (which is serializing), but we
> equally well might be doing a sysret (which is not).
>
> Honestly, I think Intel should clean up their documentation.
>

I'm not sure I follow.  If a user program gets migrated, it might end
up doing cross-modification when it expects self-modification.  If
that trips the program up, is that a user bug or a kernel bug?

Admittedly, I'd be very surprised if this happened in practice.
Migration is *slow*, caches tend to get blown away, lots of code gets
executed, etc.  Presumably any prefetched / trace cached / decoded /
i-cached user code is long gone when we migrate.


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Linus Torvalds
On Fri, Dec 2, 2016 at 12:41 PM, Andy Lutomirski  wrote:
>
> Because, if so, we should maybe serialize whenever we migrate a
> process to a different CPU.

The intel docs are bad on this issue.

Technically what we do could fall under the "cross-modifying code"
case, where one CPU does the write, and then we run it on another CPU.

And no, we do *not* do a serializing instruction before returning to
user space. Sure, we might do an iret (which is serializing), but we
equally well might be doing a sysret (which is not).

Honestly, I think Intel should clean up their documentation.

> (We *definitely* need to flush the store  buffer when migrating,

There is no such thing as flushing the store buffer.

But we do end up doing a memory barrier which gives you the required
semantics. That's not a problem. Those operations are fast. The
serializing instructions are not.

   Linus


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Andy Lutomirski
On Fri, Dec 2, 2016 at 11:35 AM, Linus Torvalds
 wrote:
> On Fri, Dec 2, 2016 at 11:30 AM, Andy Lutomirski  wrote:
>>
>> How's this?
>
> Looks ok. I do think that
>
>> I suppose it could be an unconditional IRET-to-self, but that's a good
>> deal slower and not a whole lot simpler.  Although if we start doing
>> it right, performance won't really matter here.
>
> Considering you already got the iret-to-self wrong in the first
> version, I really like the "one single unconditional version" so that
> everybody tests that _one_ thing and there isn't anything subtle going
> on.
>
> Hmm?

Okay, sold.  It makes the patchset much much shorter, too.

>
> And yes, if it turns out that performance matters, we almost certainly
> are doing something really wrong, and we shouldn't be using that
> sync_core() thing in that place anyway.

To play devil's advocate (and definitely out of scope for this
particular patchset), is user code permitted to do:

1. Map a page RX at one address and RW somewhere else (for nice ASLR).
2. Write to the RW mapping.
3. CPUID or IRET-to-self.
4. Jump to the RX mapping.
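
For concreteness, a user-space sketch of those four steps
(memfd_create() needs _GNU_SOURCE and a recent glibc; the code bytes
and sizes are illustrative, error handling omitted):

#define _GNU_SOURCE
#include <cpuid.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* x86-64 machine code for: mov eax, 42; ret */
        static const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };
        unsigned int a, b, c, d;
        int fd = memfd_create("jit", 0);

        ftruncate(fd, 4096);

        /* 1. One RW view and one RX view of the same page */
        unsigned char *rw = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
        int (*rx)(void) = (int (*)(void))mmap(NULL, 4096, PROT_READ | PROT_EXEC,
                                              MAP_SHARED, fd, 0);

        memcpy(rw, code, sizeof(code));         /* 2. write through the RW view */
        __get_cpuid(0, &a, &b, &c, &d);         /* 3. CPUID as the "serialization" */
        return rx();                            /* 4. jump to the RX view */
}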

Because, if so, we should maybe serialize whenever we migrate a
process to a different CPU.  (We *definitely* need to flush the store
buffer when migrating, because the x86 architecture makes some memory
ordering promises that get broken if a store from a thread stays in
the store buffer of a different CPU when the thread gets migrated.)
And if we're going to start serializing when migrating a thread, then
we actually care about performance, in which case we should optimize
the crap out of this thing, which probably means using MFENCE on AMD
CPUs (AMD promises that MFENCE flushes the pipeline.  Intel seems to
be confused as to exactly what effect MFENCE has, or at least I'm
confused as to what Intel thinks MFENCE does.)  And we should make
sure that we only do the extra flush when we don't switch mms.

--Andy


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Boris Ostrovsky
On 12/02/2016 06:44 AM, Andrew Cooper wrote:
> On 02/12/16 00:35, Andy Lutomirski wrote:
>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>> guaranteed to serialize.  (Even CPUID isn't *really* guaranteed to
>> serialize on Xen PV, but, in practice, any trap it generates will
>> serialize.)
> Well, Xen will enable CPUID Faulting wherever it can, which is
> realistically all IvyBridge hardware and newer.
>
> All hypercalls are a privilege change to cpl0.  I'd hope this condition
> is serialising, but I can't actually find any documentation proving or
> disproving this.
>
>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>> ~110ns.  But Xen PV will trap CPUID if possible, so IRET-to-self
>> should end up being a nice speedup.
>>
>> Cc: Andrew Cooper 
>> Signed-off-by: Andy Lutomirski 

Executing CPUID in an HVM guest is quite expensive since it will cause a
VMEXIT. (And that should be true for any hypervisor, at least on Intel;
on AMD it's configurable.)

-boris


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Linus Torvalds
On Fri, Dec 2, 2016 at 11:30 AM, Andy Lutomirski  wrote:
>
> How's this?

Looks ok. I do think that

> I suppose it could be an unconditional IRET-to-self, but that's a good
> deal slower and not a whole lot simpler.  Although if we start doing
> it right, performance won't really matter here.

Considering you already got the iret-to-self wrong in the first
version, I really like the "one single unconditional version" so that
everybody tests that _one_ thing and there isn't anything subtle going
on.

Hmm?

And yes, if it turns out that performance matters, we almost certainly
are doing something really wrong, and we shouldn't be using that
sync_core() thing in that place anyway.

 Linus


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Andy Lutomirski
On Dec 2, 2016 10:48 AM, "Boris Ostrovsky"  wrote:
>
> On 12/02/2016 06:44 AM, Andrew Cooper wrote:
> > On 02/12/16 00:35, Andy Lutomirski wrote:
> >> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
> >> guaranteed to serialize.  (Even CPUID isn't *really* guaranteed to
> >> serialize on Xen PV, but, in practice, any trap it generates will
> >> serialize.)
> > Well, Xen will enable CPUID Faulting wherever it can, which is
> > realistically all IvyBridge hardware and newer.
> >
> > All hypercalls are a privilege change to cpl0.  I'd hope this condition
> > is serialising, but I can't actually find any documentation proving or
> > disproving this.
> >
> >> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
> >> ~110ns.  But Xen PV will trap CPUID if possible, so IRET-to-self
> >> should end up being a nice speedup.
> >>
> >> Cc: Andrew Cooper 
> >> Signed-off-by: Andy Lutomirski 
> > CC'ing xen-devel and the Xen maintainers in Linux.
> >
> > As this is the only email from this series in my inbox, I will say this
> > here, but it should really be against patch 6.
> >
> > A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
> > serialising on the 486, but I don't have a manual to hand to check.
> >
> > ~Andrew
> >
> >> ---
> >>  arch/x86/xen/enlighten.c | 35 +++
> >>  1 file changed, 35 insertions(+)
> >>
> >> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> >> index bdd855685403..1f765b41eee7 100644
> >> --- a/arch/x86/xen/enlighten.c
> >> +++ b/arch/x86/xen/enlighten.c
> >> @@ -311,6 +311,39 @@ static __read_mostly unsigned int cpuid_leaf1_ecx_set_mask;
> >>  static __read_mostly unsigned int cpuid_leaf5_ecx_val;
> >>  static __read_mostly unsigned int cpuid_leaf5_edx_val;
> >>
> >> +static void xen_sync_core(void)
> >> +{
> >> +register void *__sp asm(_ASM_SP);
> >> +
> >> +#ifdef CONFIG_X86_32
> >> +asm volatile (
> >> +"pushl %%ss\n\t"
> >> +"pushl %%esp\n\t"
> >> +"addl $4, (%%esp)\n\t"
> >> +"pushfl\n\t"
> >> +"pushl %%cs\n\t"
> >> +"pushl $1f\n\t"
> >> +"iret\n\t"
> >> +"1:"
> >> +: "+r" (__sp) : : "cc");
>
> This breaks 32-bit PV guests.
>
> Why are we pushing %ss? We are not changing privilege levels so why not
> just flags, cs and eip (which, incidentally, does work)?
>

Doh!  I carefully tested 64-bit on Xen and 32-bit in user mode.


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Andy Lutomirski
On Fri, Dec 2, 2016 at 11:24 AM, Linus Torvalds
 wrote:
> On Fri, Dec 2, 2016 at 11:20 AM, Borislav Petkov  wrote:
>>
>> Something like below?
>
> The optimize-nops thing needs it too, I think.
>
> Again, this will never matter in practice (even if somebody has an i486
> still, the prefetch window size is like 16 bytes or something), but
> from a documentation standpoint it's good.

How's this?

/*
 * This function forces the icache and prefetched instruction stream to
 * catch up with reality in two very specific cases:
 *
 *  a) Text was modified using one virtual address and is about to be executed
 * from the same physical page at a different virtual address.
 *
 *  b) Text was modified on a different CPU, may subsequently be
 * executed on this CPU, and you want to make sure the new version
 * gets executed.  This generally means you're calling this in an IPI.
 *
 * If you're calling this for a different reason, you're probably doing
 * it wrong.
 */
static inline void native_sync_core(void) { ... }

The body will do a MOV-to-CR2 followed by jmp 1f; 1:.  This sequence
should be guaranteed to flush the pipeline on any real CPU.  On Xen it
will do IRET-to-self.
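
For illustration, the MOV-to-CR2 body could look something like this
sketch (it reads CR2 back and rewrites the same value so the saved
page-fault address isn't clobbered; treat it as illustrative, not the
final form):

static inline void native_sync_core(void)
{
        unsigned long tmp;

        /* MOV to a control register is serializing (the 486 question
         * raised earlier in the thread aside); the trailing jump
         * drains any stale prefetched bytes. */
        asm volatile ("mov %%cr2, %0\n\t"
                      "mov %0, %%cr2\n\t"
                      "jmp 1f\n\t"
                      "1:"
                      : "=r" (tmp) : : "memory");
}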

I suppose it could be an unconditional IRET-to-self, but that's a good
deal slower and not a whole lot simpler.  Although if we start doing
it right, performance won't really matter here.

--Andy


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Borislav Petkov
On Fri, Dec 02, 2016 at 11:23:09AM -0800, Andy Lutomirski wrote:
> Not even firmware loading wants it.  Firmware loading needs

Microcode...

> specifically cpuid(eax=1).  It has nothing to do with serializing

... but yes, of course. NOT sync_core() but CPUID(1).

Thanks!

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Borislav Petkov
On Fri, Dec 02, 2016 at 11:24:09AM -0800, Linus Torvalds wrote:
> The optimize-nops thing needs it too, I think.

Ah, it is called only from apply_alternatives() but sure, it is safer
this way. Lemme do that and run it through the boxes to see whether
anything catches fire.

> Again, this will never matter in practice (even if somebody has an i486
> still, the prefetch window size is like 16 bytes or something), but
> from a documentation standpoint it's good.

Right.

Thanks.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Linus Torvalds
On Fri, Dec 2, 2016 at 11:20 AM, Borislav Petkov  wrote:
>
> Something like below?

The optimize-nops thing needs it too, I think.

Again, this will never matter in practice (even if somebody has an i486
still, the prefetch window size is like 16 bytes or something), but
from a documentation standpoint it's good.

  Linus


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Andy Lutomirski
On Fri, Dec 2, 2016 at 11:03 AM, Linus Torvalds
 wrote:
> On Fri, Dec 2, 2016 at 10:50 AM, Borislav Petkov  wrote:
>>
>> Right, we can try to do something like invalidate_icache() or so in
>> there with the JMP so that the BSP refetches modified code and see where
>> it gets us.
>
> I'd really rather just mark it noinline with a comment. That way the
> return from the function acts as the control flow change.
>
>> The good thing is, the early patching paths run before SMP is
>> up but from looking at load_module(), for example, which does
>> post_relocation()->module_finalize()->apply_alternatives(), this can
>> happen late.
>>
>> Now there I'd like to avoid other cores walking into that address being
>> patched. Or are we "safe" there in the sense that load_module() happens
>> on one CPU only sequentially? (I haven't looked at that code to see
>> what's going on there, actually).
>
> 'sync_core()' doesn't help for other CPU's anyway, you need to do the
> cross-call IPI. So worrying about other CPU's is *not* a valid reason
> to keep a "sync_core()" call.
>
> Seriously, the only reason I can see for "sync_core()" really is:
>
>  - some deep non-serialized MSR access or similar (i.e. things like
> firmware loading etc really might want it, and a machine check might
> want it)

Not even firmware loading wants it.  Firmware loading needs
specifically cpuid(eax=1).  It has nothing to do with serializing
anything -- it's just a CPU bug that was turned into "architecture".
I think it really is just cross-address or cross-core modification,
and I'll add a comment to that effect.
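
For reference, the CPUID(1) use in question looks roughly like this -
Intel's documented microcode-update flow is to clear the revision MSR,
execute CPUID with EAX=1, then read the revision back; a sketch only,
not the exact loader code:

static unsigned int current_microcode_rev(void)
{
        unsigned int low, rev;

        native_wrmsr(MSR_IA32_UCODE_REV, 0, 0);     /* clear the reported revision */
        cpuid_eax(1);                               /* latch the loaded revision */
        native_rdmsr(MSR_IA32_UCODE_REV, low, rev); /* revision lives in EDX */
        return rev;
}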

--Andy


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Borislav Petkov
On Fri, Dec 02, 2016 at 11:03:50AM -0800, Linus Torvalds wrote:
> I'd really rather just mark it noinline with a comment. That way the
> return from the function acts as the control flow change.

Something like below?

It boots in a guest but that doesn't mean anything.

> 'sync_core()' doesn't help for other CPU's anyway, you need to do the
> cross-call IPI. So worrying about other CPU's is *not* a valid reason
> to keep a "sync_core()" call.

Yeah, no, I'm not crazy about it either - I was just sanity-checking all
call sites of apply_alternatives(). But as you say, we would've gotten
much bigger problems if other CPUs would walk in there on us.

> Seriously, the only reason I can see for "sync_core()" really is:
> 
>  - some deep non-serialized MSR access or similar (i.e. things like
> firmware loading etc really might want it, and a machine check might
> want it)

Yah, we do it in the #MC handler - apparently we need it there - and
in the microcode loader to get the version of the currently applied
microcode reported in the MSR.

> The issues with modifying code while another CPU may be just about to
> access it is a separate issue. And as noted, "sync_core()" is not
> sufficient for that, you have to do a whole careful dance with
> single-byte debug instruction writes and then a final cross-call.
> 
> See the whole "text_poke_bp()" and "text_poke()" for *that* whole
> dance. That's a much more complex thing from the normal
> apply_alternatives().

Yeah.

---
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5cb272a7a5a3..b1d0c35e6dcb 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -346,7 +346,6 @@ static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)
 
local_irq_save(flags);
add_nops(instr + (a->instrlen - a->padlen), a->padlen);
-   sync_core();
local_irq_restore(flags);
 
DUMP_BYTES(instr, a->instrlen, "%p: [%d:%d) optimized NOPs: ",
@@ -359,9 +358,12 @@ static void __init_or_module optimize_nops(struct alt_instr *a, u8 *instr)
  * This implies that asymmetric systems where APs have less capabilities than
  * the boot processor are not handled. Tough. Make sure you disable such
  * features by hand.
+ *
+ * Marked "noinline" to cause control flow change and thus insn cache
+ * to refetch changed I$ lines.
  */
-void __init_or_module apply_alternatives(struct alt_instr *start,
-struct alt_instr *end)
+void __init_or_module noinline apply_alternatives(struct alt_instr *start,
+ struct alt_instr *end)
 {
struct alt_instr *a;
u8 *instr, *replacement;
@@ -667,7 +669,6 @@ void *__init_or_module text_poke_early(void *addr, const void *opcode,
unsigned long flags;
local_irq_save(flags);
memcpy(addr, opcode, len);
-   sync_core();
local_irq_restore(flags);
/* Could also do a CLFLUSH here to speed up CPU recovery; but
   that causes hangs on some VIA CPUs. */

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Linus Torvalds
On Fri, Dec 2, 2016 at 10:50 AM, Borislav Petkov  wrote:
>
> Right, we can try to do something like invalidate_icache() or so in
> there with the JMP so that the BSP refetches modified code and see where
> it gets us.

I'd really rather just mark it noinline with a comment. That way the
return from the function acts as the control flow change.

> The good thing is, the early patching paths run before SMP is
> up but from looking at load_module(), for example, which does
> post_relocation()->module_finalize()->apply_alternatives(), this can
> happen late.
>
> Now there I'd like to avoid other cores walking into that address being
> patched. Or are we "safe" there in the sense that load_module() happens
> on one CPU only sequentially? (I haven't looked at that code to see
> what's going on there, actually).

'sync_core()' doesn't help for other CPU's anyway, you need to do the
cross-call IPI. So worrying about other CPU's is *not* a valid reason
to keep a "sync_core()" call.

Seriously, the only reason I can see for "sync_core()" really is:

 - some deep non-serialized MSR access or similar (i.e. things like
firmware loading etc really might want it, and a machine check might
want it)

 - writing to one virtual address, and then executing from another. We
do this for (a) user mode (of course) and (b) text_poke(), but
text_poke() is a whole different dance.

but I may have forgotten some other case.

The issues with modifying code while another CPU may be just about to
access it is a separate issue. And as noted, "sync_core()" is not
sufficient for that, you have to do a whole careful dance with
single-byte debug instruction writes and then a final cross-call.

See the whole "text_poke_bp()" and "text_poke()" for *that* whole
dance. That's a much more complex thing from the normal
apply_alternatives().

 Linus


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Borislav Petkov
On Fri, Dec 02, 2016 at 10:27:29AM -0800, Linus Torvalds wrote:
> That is, of course, assuming that there  is no really subtle reason
> why that stupid sync_core() is there.

Right, we can try to do something like invalidate_icache() or so in
there with the JMP so that the BSP refetches modified code and see where
it gets us.

The good thing is, the early patching paths run before SMP is
up but from looking at load_module(), for example, which does
post_relocation()->module_finalize()->apply_alternatives(), this can
happen late.

Now there I'd like to avoid other cores walking into that address being
patched. Or are we "safe" there in the sense that load_module() happens
on one CPU only sequentially? (I haven't looked at that code to see
what's going on there, actually).

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Boris Ostrovsky
On 12/02/2016 06:44 AM, Andrew Cooper wrote:
> On 02/12/16 00:35, Andy Lutomirski wrote:
>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>> guaranteed to serialize.  (Even CPUID isn't *really* guaranteed to
>> serialize on Xen PV, but, in practice, any trap it generates will
>> serialize.)
> Well, Xen will enable CPUID Faulting wherever it can, which is
> realistically all IvyBridge hardware and newer.
>
> All hypercalls are a privilege change to cpl0.  I'd hope this condition
> is serialising, but I can't actually find any documentation proving or
> disproving this.
>
>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>> ~110ns.  But Xen PV will trap CPUID if possible, so IRET-to-self
>> should end up being a nice speedup.
>>
>> Cc: Andrew Cooper 
>> Signed-off-by: Andy Lutomirski 
> CC'ing xen-devel and the Xen maintainers in Linux.
>
> As this is the only email from this series in my inbox, I will say this
> here, but it should really be against patch 6.
>
> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
> serialising on the 486, but I don't have a manual to hand to check.
>
> ~Andrew
>
>> ---
>>  arch/x86/xen/enlighten.c | 35 +++
>>  1 file changed, 35 insertions(+)
>>
>> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>> index bdd855685403..1f765b41eee7 100644
>> --- a/arch/x86/xen/enlighten.c
>> +++ b/arch/x86/xen/enlighten.c
>> @@ -311,6 +311,39 @@ static __read_mostly unsigned int cpuid_leaf1_ecx_set_mask;
>>  static __read_mostly unsigned int cpuid_leaf5_ecx_val;
>>  static __read_mostly unsigned int cpuid_leaf5_edx_val;
>>  
>> +static void xen_sync_core(void)
>> +{
>> +register void *__sp asm(_ASM_SP);
>> +
>> +#ifdef CONFIG_X86_32
>> +asm volatile (
>> +"pushl %%ss\n\t"
>> +"pushl %%esp\n\t"
>> +"addl $4, (%%esp)\n\t"
>> +"pushfl\n\t"
>> +"pushl %%cs\n\t"
>> +"pushl $1f\n\t"
>> +"iret\n\t"
>> +"1:"
>> +: "+r" (__sp) : : "cc");

This breaks 32-bit PV guests.

Why are we pushing %ss? We are not changing privilege levels so why not
just flags, cs and eip (which, incidentally, does work)?
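
For illustration, the shorter 32-bit sequence would look something like
this sketch (same-privilege IRET pops only EIP, CS and EFLAGS; the
stack-pointer constraint mirrors the quoted patch; the function name is
just a placeholder):

static void xen_sync_core_32_sketch(void)
{
        register void *__sp asm("esp");

        asm volatile ("pushfl\n\t"
                      "pushl %%cs\n\t"
                      "pushl $1f\n\t"
                      "iret\n\t"
                      "1:"
                      : "+r" (__sp) : : "memory", "cc");
}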

-boris

>> +#else
>> +unsigned long tmp;
>> +
>> +asm volatile (
>> +"movq %%ss, %0\n\t"
>> +"pushq %0\n\t"
>> +"pushq %%rsp\n\t"
>> +"addq $8, (%%rsp)\n\t"
>> +"pushfq\n\t"
>> +"movq %%cs, %0\n\t"
>> +"pushq %0\n\t"
>> +"pushq $1f\n\t"
>> +"iretq\n\t"
>> +"1:"
>> +: "=r" (tmp), "+r" (__sp) : : "cc");
>> +#endif
>> +}
>> +
>>  static void xen_cpuid(unsigned int *ax, unsigned int *bx,
>>unsigned int *cx, unsigned int *dx)
>>  {
>> @@ -1289,6 +1322,8 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = {
>>  
>>  .start_context_switch = paravirt_start_context_switch,
>>  .end_context_switch = xen_end_context_switch,
>> +
>> +.sync_core = xen_sync_core,
>>  };
>>  
>>  static void xen_reboot(int reason)



Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Linus Torvalds
On Fri, Dec 2, 2016 at 10:03 AM, Borislav Petkov  wrote:
>
> SNB:
> * before:
> * after:

I suspect it's entirely invisible on raw hardware. But quite possibly
more noticeable in a VM that takes slow faults for every case.

But yes, even there it's probably not *that* noticeable.

I'd prefer to get rid of them just because I don't like voodoo
programming. If there is no actual reason for "sync_core()", we
shouldn't have one there. Even if it were entirely free and the
compiler optimized it away because the compiler was so smart that it
would see that it's pointless, it's misleading and wrong on a source
level.

That is, of course, assuming that there  is no really subtle reason
why that stupid sync_core() is there.

  Linus


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Borislav Petkov
On Fri, Dec 02, 2016 at 09:38:38AM -0800, Andy Lutomirski wrote:
> apply_alternatives, unfortunately.  It's performance-critical because
> it's intensely stupid and does sync_core() for every single patch.
> Fixing that would be nice, too.

So I did experiment at the time to batch that sync_core() and interrupts
disabling in text_poke_early() but that amounted to almost nothing.
That's why I didn't bother chasing this further. I can try to dig out
that old thread...

/me goes and searches...

ah, here's something:

SNB:
* before:
[0.011707] SMP alternatives: Starting apply_alternatives...
[0.011784] SMP alternatives: done.
[0.011806] Freeing SMP alternatives: 20k freed

* after:
[0.011699] SMP alternatives: Starting apply_alternatives...
[0.011721] SMP alternatives: done.
[0.011743] Freeing SMP alternatives: 20k freed

-> 63 microseconds speedup


* kvm guest:
* before:
[0.017005] SMP alternatives: Starting apply_alternatives...
[0.019024] SMP alternatives: done.
[0.020119] Freeing SMP alternatives: 20k freed

* after:
[0.015008] SMP alternatives: Starting apply_alternatives...
[0.016029] SMP alternatives: done.
[0.017118] Freeing SMP alternatives: 20k freed

-> ~3 milliseconds speedup

---

I tried something simple like this:

---
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 1850592f4700..f4d5689ea503 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -253,10 +253,13 @@ void __init_or_module apply_alternatives(struct alt_instr *start,
 struct alt_instr *end)
 {
struct alt_instr *a;
+   unsigned long flags;
u8 *instr, *replacement;
u8 insnbuf[MAX_PATCH_LEN];

DPRINTK("%s: alt table %p -> %p\n", __func__, start, end);
+
+   local_irq_save(flags);
/*
 * The scan order should be from start to end. A later scanned
 * alternative code can overwrite a previous scanned alternative code.
@@ -284,8 +287,10 @@ void __init_or_module apply_alternatives(struct alt_instr *start,
add_nops(insnbuf + a->replacementlen,
 a->instrlen - a->replacementlen);

-   text_poke_early(instr, insnbuf, a->instrlen);
+   memcpy(instr, insnbuf, a->instrlen);
}
+   sync_core();
+   local_irq_restore(flags);
 }

 #ifdef CONFIG_SMP
---

I could try and redo the measurements but I doubt they'll be any
different - unless I've made a mistake somewhere...

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Linus Torvalds
On Fri, Dec 2, 2016 at 9:38 AM, Andy Lutomirski  wrote:
>
> apply_alternatives, unfortunately.  It's performance-critical because
> it's intensely stupid and does sync_core() for every single patch.
> Fixing that would be nice, too.

So looking at text_poke_early(), that's very much a case that really
shouldn't need any "sync_core()" at all as far as I can tell.

Only the current CPU is running, and for local CPU I$ coherence all
you need is a jump instruction, and even that is only on really old
CPUs. From the PPro onwards (maybe even Pentium?) the I$ is entirely
serialized as long as you change the data using the same linear
address.

So at most, that function could mark itself "noinline" just to
guarantee that it will cause a control flow change before returning.
The sync_core() seems entirely bogus.

Same goes for optimize_nops() too.

  Linus
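
(For illustration: a rough sketch of the noinline idea above, assuming the
mainline text_poke_early() shape of the time; the signature is approximated
and this is not an actual patch.)

noinline void *__init_or_module text_poke_early(void *addr, const void *opcode,
						size_t len)
{
	unsigned long flags;

	local_irq_save(flags);
	memcpy(addr, opcode, len);
	local_irq_restore(flags);
	/*
	 * No sync_core(): the return from this noinline function is the
	 * control-flow change relied on above.
	 */
	return addr;
}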


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Andy Lutomirski
On Fri, Dec 2, 2016 at 9:32 AM, Linus Torvalds
 wrote:
> On Thu, Dec 1, 2016 at 4:35 PM, Andy Lutomirski  wrote:
>>
>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>> ~110ns.  But Xen PV will trap CPUID if possible, so IRET-to-self
>> should end up being a nice speedup.
>
> So if we care deeply about the performance of this, we should really
> ask ourselves how much we need this...
>
> There are *very* few places where we really need to do a full
> serializing instruction, and I'd worry that we really don't need it in
> many of the places we do this.
>
> The only real case I'm aware of is modifying code that is modified
> through a different linear address than it's executed.

TBH, I didn't start down this path for performance.  I did it because
I wanted to kill off a CPUID that was breaking on old CPUs that don't
have CPUID.  So I propose MOV-to-CR2 followed by an unconditional
jump.  My goal here is to make the #*!& thing work reliably and not be
ludicrously slow.  Borislav and I mulled over using an alternative to
use CPUID if and only if we have CPUID, but that doesn't work because
we call sync_core() before we're done applying alternatives.

>
> Is there anything else where we _really_ need this sync-core thing?
> Sure, the microcode loader looks fine, but that doesn't look
> particularly performance-critical either.
>
> So I'd like to know which sync_core is actually so
> performance-critical that we care about it, and then I'd like to
> understand why it's needed at all, because I suspect a number of them
> have been added with the model of "sprinkle random things around and
> hope".

apply_alternatives, unfortunately.  It's performance-critical because
it's intensely stupid and does sync_core() for every single patch.
Fixing that would be nice, too.

> Adding Peter Anvin to the participants list, because iirc he was the
> one who really talked to hardware engineers about the synchronization
> issues with serializing kernel code.
>
> Linus
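
(A minimal sketch of the MOV-to-CR2 plus unconditional jump proposal above;
illustration only, not code from this series. The function name is made up,
it assumes ring-0 context since writing %cr2 is privileged, and it clobbers
the saved page-fault address in %cr2.)

static inline void sync_core_cr2_sketch(void)
{
	unsigned long val = 0;

	asm volatile ("mov %0, %%cr2\n\t"	/* MOV to a control register is documented as serializing */
		      "jmp 1f\n\t"		/* forward jump for very old parts */
		      "1:"
		      : : "r" (val) : "memory");
}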


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Linus Torvalds
On Thu, Dec 1, 2016 at 4:35 PM, Andy Lutomirski  wrote:
>
> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
> ~110ns.  But Xen PV will trap CPUID if possible, so IRET-to-self
> should end up being a nice speedup.

So if we care deeply about the performance of this, we should really
ask ourselves how much we need this...

There are *very* few places where we really need to do a full
serializing instruction, and I'd worry that we really don't need it in
many of the places we do this.

The only real case I'm aware of is modifying code that is modified
through a different linear address than it's executed.

Is there anything else where we _really_ need this sync-core thing?
Sure, the microcode loader looks fine, but that doesn't look
particularly performance-critical either.

So I'd like to know which sync_core is actually so
performance-critical that we care about it, and then I'd like to
understand why it's needed at all, because I suspect a number of them
have been added with the model of "sprinkle random things around and
hope".

Looking at ftrace, for example, which is one of the users, does it
actually _really_ need sync_core() at all? It seems to use the kernel
virtual address, and then the instruction stream will be coherent.

Adding Peter Anvin to the participants list, because iirc he was the
one who really talked to hardware engineers about the synchronization
issues with serializing kernel code.

Linus


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Andrew Cooper
On 02/12/16 17:23, Andy Lutomirski wrote:
> On Fri, Dec 2, 2016 at 9:16 AM, Andrew Cooper  
> wrote:
>> On 02/12/16 17:07, Andy Lutomirski wrote:
>>> On Dec 2, 2016 3:44 AM, "Andrew Cooper"  wrote:
>>>> On 02/12/16 00:35, Andy Lutomirski wrote:
>>>>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>>>>> guaranteed to serialize.  (Even CPUID isn't *really* guaranteed to
>>>>> serialize on Xen PV, but, in practice, any trap it generates will
>>>>> serialize.)
>>>> Well, Xen will enable CPUID Faulting wherever it can, which is
>>>> realistically all IvyBridge hardware and newer.
>>>>
>>>> All hypercalls are a privilege change to cpl0.  I'd hope this condition
>>>> is serialising, but I can't actually find any documentation proving or
>>>> disproving this.
>>> I don't know for sure.  IRET is serializing, and if Xen returns using
>>> IRET, we're fine.
>> All returns to a 64bit PV guest at defined points (hypercall return,
>> exception entry, etc) are from SYSRET, not IRET.
> But CPUID faulting isn't like this, right?  Unless Xen does
> opportunistic SYSRET.

Correct.  Xen doesn't do opportunistic SYSRET.

>
>> Talking of, I still have a patch to remove
>> PARAVIRT_ADJUST_EXCEPTION_FRAME which I need to complete and send upstream.
>>
>>>>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>>>>> ~110ns.  But Xen PV will trap CPUID if possible, so IRET-to-self
>>>>> should end up being a nice speedup.
>>>>>
>>>>> Cc: Andrew Cooper 
>>>>> Signed-off-by: Andy Lutomirski 
>>>> CC'ing xen-devel and the Xen maintainers in Linux.
>>>>
>>>> As this is the only email from this series in my inbox, I will say this
>>>> here, but it should really be against patch 6.
>>>>
>>>> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
>>>> serialising on the 486, but I don't have a manual to hand to check.
>>> I'll quote the (modern) SDM.  For self-modifying code "The use of one
>>> of these options is not required for programs intended to run on the
>>> Pentium or Intel486 processors,
>>> but are recommended to ensure compatibility with the P6 and more
>>> recent processor families.".  For cross-modifying code "The use of
>>> this option is not required for programs intended to run on the
>>> Intel486 processor, but is recommended
>>> to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and
>>> Pentium processors."  So I'm not sure there's a problem.
>> Fair enough.  (Assuming similar properties hold for the older processors
>> of other vendors.)
> No, you were right -- a different section of the SDM contradicts it:
>
> For Intel486 processors, a write to an instruction in the cache will
> modify it in both the cache and memory, but if
> the instruction was prefetched before the write, the old version of
> the instruction could be the one executed. To
> prevent the old instruction from being executed, flush the instruction
> prefetch unit by coding a jump instruction
> immediately after any write that modifies an instruction.

:(

Presumably this means patching has been subtly broken forever on the 486?

~Andrew


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Andy Lutomirski
On Fri, Dec 2, 2016 at 9:16 AM, Andrew Cooper  wrote:
> On 02/12/16 17:07, Andy Lutomirski wrote:
>> On Dec 2, 2016 3:44 AM, "Andrew Cooper"  wrote:
>>> On 02/12/16 00:35, Andy Lutomirski wrote:
>>>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>>>> guaranteed to serialize.  (Even CPUID isn't *really* guaranteed to
>>>> serialize on Xen PV, but, in practice, any trap it generates will
>>>> serialize.)
>>> Well, Xen will enable CPUID Faulting wherever it can, which is
>>> realistically all IvyBridge hardware and newer.
>>>
>>> All hypercalls are a privilege change to cpl0.  I'd hope this condition
>>> is serialising, but I can't actually find any documentation proving or
>>> disproving this.
>> I don't know for sure.  IRET is serializing, and if Xen returns using
>> IRET, we're fine.
>
> All returns to a 64bit PV guest at defined points (hypercall return,
> exception entry, etc) are from SYSRET, not IRET.

But CPUID faulting isn't like this, right?  Unless Xen does
opportunistic SYSRET.

>
> Talking of, I still have a patch to remove
> PARAVIRT_ADJUST_EXCEPTION_FRAME which I need to complete and send upstream.
>
>>
>>>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>>>> ~110ns.  But Xen PV will trap CPUID if possible, so IRET-to-self
>>>> should end up being a nice speedup.
>>>>
>>>> Cc: Andrew Cooper 
>>>> Signed-off-by: Andy Lutomirski 
>>> CC'ing xen-devel and the Xen maintainers in Linux.
>>>
>>> As this is the only email from this series in my inbox, I will say this
>>> here, but it should really be against patch 6.
>>>
>>> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
>>> serialising on the 486, but I don't have a manual to hand to check.
>> I'll quote the (modern) SDM.  For self-modifying code "The use of one
>> of these options is not required for programs intended to run on the
>> Pentium or Intel486 processors,
>> but are recommended to ensure compatibility with the P6 and more
>> recent processor families.".  For cross-modifying code "The use of
>> this option is not required for programs intended to run on the
>> Intel486 processor, but is recommended
>> to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and
>> Pentium processors."  So I'm not sure there's a problem.
>
> Fair enough.  (Assuming similar properties hold for the older processors
> of other vendors.)

No, you were right -- a different section of the SDM contradicts it:

For Intel486 processors, a write to an instruction in the cache will
modify it in both the cache and memory, but if
the instruction was prefetched before the write, the old version of
the instruction could be the one executed. To
prevent the old instruction from being executed, flush the instruction
prefetch unit by coding a jump instruction
immediately after any write that modifies an instruction.

--Andy
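
(A tiny sketch of the write-then-jump pattern that SDM passage calls for on
the 486; illustrative only, the helper name is invented.)

static void patch_insn_byte(u8 *insn, u8 val)
{
	*insn = val;		/* modify the instruction */
	/* jump right after the write to flush the 486 prefetch queue */
	asm volatile ("jmp 1f\n\t"
		      "1:" ::: "memory");
}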


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Andrew Cooper
On 02/12/16 17:07, Andy Lutomirski wrote:
> On Dec 2, 2016 3:44 AM, "Andrew Cooper"  wrote:
>> On 02/12/16 00:35, Andy Lutomirski wrote:
>>> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
>>> guaranteed to serialize.  (Even CPUID isn't *really* guaranteed to
>>> serialize on Xen PV, but, in practice, any trap it generates will
>>> serialize.)
>> Well, Xen will enable CPUID Faulting wherever it can, which is
>> realistically all IvyBridge hardware and newer.
>>
>> All hypercalls are a privilege change to cpl0.  I'd hope this condition
>> is serialising, but I can't actually find any documentation proving or
>> disproving this.
> I don't know for sure.  IRET is serializing, and if Xen returns using
> IRET, we're fine.

All returns to a 64bit PV guest at defined points (hypercall return,
exception entry, etc) are from SYSRET, not IRET.

Talking of, I still have a patch to remove
PARAVIRT_ADJUST_EXCEPTION_FRAME which I need to complete and send upstream.

>
>>> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
>>> ~110ns.  But Xen PV will trap CPUID if possible, so IRET-to-self
>>> should end up being a nice speedup.
>>>
>>> Cc: Andrew Cooper 
>>> Signed-off-by: Andy Lutomirski 
>> CC'ing xen-devel and the Xen maintainers in Linux.
>>
>> As this is the only email from this series in my inbox, I will say this
>> here, but it should really be against patch 6.
>>
>> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
>> serialising on the 486, but I don't have a manual to hand to check.
> I'll quote the (modern) SDM.  For self-modifying code "The use of one
> of these options is not required for programs intended to run on the
> Pentium or Intel486 processors,
> but are recommended to ensure compatibility with the P6 and more
> recent processor families.".  For cross-modifying code "The use of
> this option is not required for programs intended to run on the
> Intel486 processor, but is recommended
> to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and
> Pentium processors."  So I'm not sure there's a problem.

Fair enough.  (Assuming similar properties hold for the older processors
of other vendors.)

~Andrew


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Andy Lutomirski
On Dec 2, 2016 3:44 AM, "Andrew Cooper"  wrote:
>
> On 02/12/16 00:35, Andy Lutomirski wrote:
> > On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
> > guaranteed to serialize.  (Even CPUID isn't *really* guaranteed to
> > serialize on Xen PV, but, in practice, any trap it generates will
> > serialize.)
>
> Well, Xen will enable CPUID Faulting wherever it can, which is
> realistically all IvyBridge hardware and newer.
>
> All hypercalls are a privilege change to cpl0.  I'd hope this condition
> is serialising, but I can't actually find any documentation proving or
> disproving this.

I don't know for sure.  IRET is serializing, and if Xen returns using
IRET, we're fine.

>
> >
> > On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
> > ~110ns.  But Xen PV will trap CPUID if possible, so IRET-to-self
> > should end up being a nice speedup.
> >
> > Cc: Andrew Cooper 
> > Signed-off-by: Andy Lutomirski 
>
> CC'ing xen-devel and the Xen maintainers in Linux.
>
> As this is the only email from this series in my inbox, I will say this
> here, but it should really be against patch 6.
>
> A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
> serialising on the 486, but I don't have a manual to hand to check.

I'll quote the (modern) SDM.  For self-modifying code "The use of one
of these options is not required for programs intended to run on the
Pentium or Intel486 processors,
but are recommended to ensure compatibility with the P6 and more
recent processor families.".  For cross-modifying code "The use of
this option is not required for programs intended to run on the
Intel486 processor, but is recommended
to ensure compatibility with the Pentium 4, Intel Xeon, P6 family, and
Pentium processors."  So I'm not sure there's a problem.

I can add an unconditional jump just to make sure.  It costs basically
nothing on modern CPUs.  (Also, CPUID isn't serializing on the 486,
according to the table.)

--Andy


Re: [PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-02 Thread Andrew Cooper
On 02/12/16 00:35, Andy Lutomirski wrote:
> On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
> guaranteed to serialize.  (Even CPUID isn't *really* guaranteed to
> serialize on Xen PV, but, in practice, any trap it generates will
> serialize.)

Well, Xen will enable CPUID Faulting wherever it can, which is
realistically all IvyBridge hardware and newer.

All hypercalls are a privilege change to cpl0.  I'd hope this condition
is serialising, but I can't actually find any documentation proving or
disproving this.

>
> On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
> ~110ns.  But Xen PV will trap CPUID if possible, so IRET-to-self
> should end up being a nice speedup.
>
> Cc: Andrew Cooper 
> Signed-off-by: Andy Lutomirski 

CC'ing xen-devel and the Xen maintainers in Linux.

As this is the only email from this series in my inbox, I will say this
here, but it should really be against patch 6.

A write to %cr2 is apparently (http://sandpile.org/x86/coherent.htm) not
serialising on the 486, but I don't have a manual to hand to check.

~Andrew

> ---
>  arch/x86/xen/enlighten.c | 35 +++
>  1 file changed, 35 insertions(+)
>
> diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
> index bdd855685403..1f765b41eee7 100644
> --- a/arch/x86/xen/enlighten.c
> +++ b/arch/x86/xen/enlighten.c
> @@ -311,6 +311,39 @@ static __read_mostly unsigned int cpuid_leaf1_ecx_set_mask;
>  static __read_mostly unsigned int cpuid_leaf5_ecx_val;
>  static __read_mostly unsigned int cpuid_leaf5_edx_val;
>  
> +static void xen_sync_core(void)
> +{
> + register void *__sp asm(_ASM_SP);
> +
> +#ifdef CONFIG_X86_32
> + asm volatile (
> + "pushl %%ss\n\t"
> + "pushl %%esp\n\t"
> + "addl $4, (%%esp)\n\t"
> + "pushfl\n\t"
> + "pushl %%cs\n\t"
> + "pushl $1f\n\t"
> + "iret\n\t"
> + "1:"
> + : "+r" (__sp) : : "cc");
> +#else
> + unsigned long tmp;
> +
> + asm volatile (
> + "movq %%ss, %0\n\t"
> + "pushq %0\n\t"
> + "pushq %%rsp\n\t"
> + "addq $8, (%%rsp)\n\t"
> + "pushfq\n\t"
> + "movq %%cs, %0\n\t"
> + "pushq %0\n\t"
> + "pushq $1f\n\t"
> + "iretq\n\t"
> + "1:"
> + : "=r" (tmp), "+r" (__sp) : : "cc");
> +#endif
> +}
> +
>  static void xen_cpuid(unsigned int *ax, unsigned int *bx,
> unsigned int *cx, unsigned int *dx)
>  {
> @@ -1289,6 +1322,8 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = {
>  
>   .start_context_switch = paravirt_start_context_switch,
>   .end_context_switch = xen_end_context_switch,
> +
> + .sync_core = xen_sync_core,
>  };
>  
>  static void xen_reboot(int reason)



[PATCH v2 5/6] x86/xen: Add a Xen-specific sync_core() implementation

2016-12-01 Thread Andy Lutomirski
On Xen PV, CPUID is likely to trap, and Xen hypercalls aren't
guaranteed to serialize.  (Even CPUID isn't *really* guaranteed to
serialize on Xen PV, but, in practice, any trap it generates will
serialize.)

On my laptop, CPUID(eax=1, ecx=0) is ~83ns and IRET-to-self is
~110ns.  But Xen PV will trap CPUID if possible, so IRET-to-self
should end up being a nice speedup.

Cc: Andrew Cooper 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/xen/enlighten.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index bdd855685403..1f765b41eee7 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -311,6 +311,39 @@ static __read_mostly unsigned int cpuid_leaf1_ecx_set_mask;
 static __read_mostly unsigned int cpuid_leaf5_ecx_val;
 static __read_mostly unsigned int cpuid_leaf5_edx_val;
 
+static void xen_sync_core(void)
+{
+   register void *__sp asm(_ASM_SP);
+
+#ifdef CONFIG_X86_32
+   asm volatile (
+   "pushl %%ss\n\t"
+   "pushl %%esp\n\t"
+   "addl $4, (%%esp)\n\t"
+   "pushfl\n\t"
+   "pushl %%cs\n\t"
+   "pushl $1f\n\t"
+   "iret\n\t"
+   "1:"
+   : "+r" (__sp) : : "cc");
+#else
+   unsigned long tmp;
+
+   asm volatile (
+   "movq %%ss, %0\n\t"
+   "pushq %0\n\t"
+   "pushq %%rsp\n\t"
+   "addq $8, (%%rsp)\n\t"
+   "pushfq\n\t"
+   "movq %%cs, %0\n\t"
+   "pushq %0\n\t"
+   "pushq $1f\n\t"
+   "iretq\n\t"
+   "1:"
+   : "=r" (tmp), "+r" (__sp) : : "cc");
+#endif
+}
+
 static void xen_cpuid(unsigned int *ax, unsigned int *bx,
  unsigned int *cx, unsigned int *dx)
 {
@@ -1289,6 +1322,8 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = {
 
.start_context_switch = paravirt_start_context_switch,
.end_context_switch = xen_end_context_switch,
+
+   .sync_core = xen_sync_core,
 };
 
 static void xen_reboot(int reason)
-- 
2.9.3
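
(For reference, a rough user-space harness along the lines of the comparison
quoted in the changelog; my own sketch, not part of the series. rdtsc is used
without extra serialization, so treat the output as ballpark only.)

/* gcc -O2 -mno-red-zone -o sync_bench sync_bench.c   (x86_64 only) */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

static inline void cpuid_leaf1(void)
{
	unsigned int a = 1, b, c = 0, d;

	asm volatile ("cpuid"
		      : "+a" (a), "=b" (b), "+c" (c), "=d" (d)
		      : : "memory");
}

static inline void iret_to_self(void)
{
	unsigned long tmp;

	/* Build the same frame the patch builds: SS, RSP, RFLAGS, CS, RIP. */
	asm volatile ("movq %%ss, %0\n\t"
		      "pushq %0\n\t"
		      "pushq %%rsp\n\t"
		      "addq $8, (%%rsp)\n\t"
		      "pushfq\n\t"
		      "movq %%cs, %0\n\t"
		      "pushq %0\n\t"
		      "leaq 1f(%%rip), %0\n\t"	/* RIP-relative so PIE builds work */
		      "pushq %0\n\t"
		      "iretq\n\t"
		      "1:"
		      : "=r" (tmp) : : "cc", "memory");
}

int main(void)
{
	enum { LOOPS = 1000000 };
	uint64_t t0, t1, t2;
	int i;

	t0 = rdtsc();
	for (i = 0; i < LOOPS; i++)
		cpuid_leaf1();
	t1 = rdtsc();
	for (i = 0; i < LOOPS; i++)
		iret_to_self();
	t2 = rdtsc();

	printf("cpuid(1):     %.1f cycles/iteration\n", (double)(t1 - t0) / LOOPS);
	printf("iret-to-self: %.1f cycles/iteration\n", (double)(t2 - t1) / LOOPS);
	return 0;
}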


