Re: [Xen-devel] [PATCH RFC] x86/xsave: prefer eager clearing of state over eager restoring

2018-08-21 Thread Jan Beulich
>>> On 21.08.18 at 12:10,  wrote:
> On Thu, Aug 16, 2018 at 10:07:00AM +0100, Andrew Cooper wrote:
>> Irrespective of what we do here, I'd really like Wei to rebase his work
>> to remove the lazy FPU logic from the nested virt paths, because it's a
>> no-brainer (perf wise) and comes with a massive amount of code
>> simplification in Xen.
> 
> I am very happy to get rid of more code if that's agreed. :-)

Well, we'll have to see. I don't recall a series removing "lazy FPU
logic from the nested virt paths"; I only recall a (giant) patch
removing it altogether from Xen, which would need to be backed by
numbers imo. First and foremost, the price of eager state loading is
only going to grow going forward, since new states will only ever be
added, while the performance of lazy state loading - especially in the
load-avoided case - is likely to remain relatively stable.

Jan




Re: [Xen-devel] [PATCH RFC] x86/xsave: prefer eager clearing of state over eager restoring

2018-08-21 Thread Wei Liu
On Thu, Aug 16, 2018 at 10:07:00AM +0100, Andrew Cooper wrote:
> Irrespective of what we do here, I'd really like Wei to rebase his work
> to remove the lazy FPU logic from the nested virt paths, because it's a
> no-brainer (perf wise) and comes with a massive amount of code
> simplification in Xen.

I am very happy to get rid of more code if that's agreed. :-)

Wei.


Re: [Xen-devel] [PATCH RFC] x86/xsave: prefer eager clearing of state over eager restoring

2018-08-16 Thread Jan Beulich
>>> On 16.08.18 at 13:27,  wrote:
> On 16/08/18 11:03, Jan Beulich wrote:
>>>>> On 16.08.18 at 11:07,  wrote:
>>> On 22/06/2018 11:57, Jan Beulich wrote:
>>>> --- a/xen/arch/x86/spec_ctrl.c
>>>> +++ b/xen/arch/x86/spec_ctrl.c
>>>> @@ -616,7 +616,7 @@ void __init init_speculation_mitigations
>>>>  
>>>>      /* Check whether Eager FPU should be enabled by default. */
>>>>      if ( opt_eager_fpu == -1 )
>>>> -        opt_eager_fpu = should_use_eager_fpu();
>>>> +        opt_eager_fpu = !cpu_has_xsave && should_use_eager_fpu();
>>> I'd not spotted this the first time round.
>>>
>>> Intel is very clear that, if you're using xsave, you should be using
>>> eager FPU.  Therefore, this goes specifically against the advice in the
>>> ORM, and the advice we were given during the LazyFPU timeframe.
>>>
>>> Furthermore we (XenServer) and customers have seen a reliable perf
>>> improvement from the LazyFPU security fix, up to 8% in places, for
>>> normal VDI and server workloads.  As I said during the development of the
>>> LazyFPU fixes, this is almost certainly down to the fact that all code
>>> uses the FPU these days.
>> Well - as said in the description, the observation in my tests (which
>> are not a typical server workload) was that about 50% of the context
>> switches were not followed by a (lazy) restore before the vCPU was
>> de-scheduled again.
> 
> Counting absolute numbers gives a false impression.
> 
> You've got to account for the relative difference in cycles between an
> xrstor and servicing #NM (which includes the xrstor you previously skipped).
> 
> The 50/50 split you see here is definitely going to result in a net perf
> hit because servicing #NM is several orders of magnitude more expensive
> than xrstor.  (For HVM guests, you've got to add another order of
> magnitude for the vmexit).
> 
> (At a guess, seeing as it's been a little too long since I last did this
> kind of stats), you've got to get to somewhere like 85-95% before you're
> likely to break even from a performance point of view.

That's all understood; hence the post-commit message remark in the
patch.

>> The change as presented is in fact trying to move to a middle ground,
>> in that it doesn't leave stale state in the registers anymore, but
>> instead frees the underlying physical ones up for other uses (by
>> putting the state components into init state).
>>
>>> I'm still waiting on a more formal statement from AMD, and don't yet
>>> have any perf numbers on their hardware.
>>>
>>> However, as we will definitely get an extra perf boost from fully
>>> deleting the remaining lazy paths (no more clts/stts in the context
>>> switch path), my gut feeling is that there is going to have to be some
>>> terrible chronic case on AMD for us to consider not switching to
>>> fully eager.
>> Yes, eliminating the stts() in particular is certainly going to help
>> performance. With ever-growing state sizes, though, I'm not convinced
>> that in the long run (and even already with AVX-512 and its well over
>> 2k of state) the CR0 access is indeed going to remain worse than the
>> (perhaps unnecessary) state load.
> 
> You've got to consider what code does in practice, and in practice code
> is either number crunching heavily (in which case eager is definitely
> the best option), or it's using vzeroall/vzeroupper/etc., in which case you're
> not loading 2k of state, and eager is still the better option.

You realize that vzeroall / vzeroupper don't touch the high 16 registers
(at least as per the doc; I've yet to verify this on hardware)? Together
with the mask registers and other components, that's still way more than
1k then. And as said - the set is only ever growing.

Jan




Re: [Xen-devel] [PATCH RFC] x86/xsave: prefer eager clearing of state over eager restoring

2018-08-16 Thread Andrew Cooper
On 16/08/18 11:03, Jan Beulich wrote:
>>>> On 16.08.18 at 11:07,  wrote:
>> On 22/06/2018 11:57, Jan Beulich wrote:
>>> --- a/xen/arch/x86/spec_ctrl.c
>>> +++ b/xen/arch/x86/spec_ctrl.c
>>> @@ -616,7 +616,7 @@ void __init init_speculation_mitigations
>>>  
>>>      /* Check whether Eager FPU should be enabled by default. */
>>>      if ( opt_eager_fpu == -1 )
>>> -        opt_eager_fpu = should_use_eager_fpu();
>>> +        opt_eager_fpu = !cpu_has_xsave && should_use_eager_fpu();
>> I'd not spotted this the first time round.
>>
>> Intel is very clear that, if you're using xsave, you should be using
>> eager FPU.  Therefore, this goes specifically against the advice in the
>> ORM, and the advice we were given during the LazyFPU timeframe.
>>
>> Furthermore we (XenServer) and customers have seen a reliable perf
>> improvement from the LazyFPU security fix, up to 8% in places, for
>> normal VDI and server workloads.  As I said during the development of the
>> LazyFPU fixes, this is almost certainly down to the fact that all code
>> uses the FPU these days.
> Well - as said in the description, the observation in my tests (which
> are not a typical server workload) was that about 50% of the context
> switches were not followed by a (lazy) restore before the vCPU was
> de-scheduled again.

Counting absolute numbers gives a false impression.

You've got to account for the relative difference in cycles between an
xrstor and servicing #NM (which includes the xrstor you previously skipped).

The 50/50 split you see here is definitely going to result in a net perf
hit because servicing #NM is several orders of magnitude more expensive
than xrstor.  (For HVM guests, you've got to add another order of
magnitude for the vmexit).

(At a guess, seeing as it's been a little too long since I last did this
kind of stats), you've got to get to somewhere like 85-95% before you're
likely to break even from a performance point of view.
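
As a back-of-the-envelope model (purely illustrative, no measured
costs): if a fraction p of context switches is followed by a restore,
lazy costs about p * C_nm per switch while eager costs C_xrstor, so
lazy only wins while p < C_xrstor / C_nm.  Put differently, with
R = C_nm / C_xrstor the break-even point is p = 1/R; an R of roughly
10-20 is what puts break-even into the 85-95% "no restore" range
guessed above.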

> The change as presented is in fact trying to move to a middle ground,
> in that it doesn't leave stale state in the registers anymore, but
> instead frees the underlying physical ones up for other uses (by
> putting the state components into init state).
>
>> I'm still waiting on a more formal statement from AMD, and don't yet
>> have any perf numbers on their hardware.
>>
>> However, as we will definitely get an extra perf boost from fully
>> deleting the remaining lazy paths (no more clts/stts in the context
>> switch path), my gut feeling is that there is going to have to be some
>> terrible chronic case on AMD for us to consider not switching to
>> fully eager.
> Yes, eliminating the stts() in particular is certainly going to help
> performance. With ever-growing state sizes, though, I'm not convinced
> that in the long run (and even already with AVX-512 and its well over
> 2k of state) the CR0 access is indeed going to remain worse than the
> (perhaps unnecessary) state load.

You've got to consider what code does in practice, and in practice code
is either number crunching heavily (in which case eager is definitely
the best option), or it's using vzeroall/vzeroupper/etc., in which case you're
not loading 2k of state, and eager is still the better option.

~Andrew


Re: [Xen-devel] [PATCH RFC] x86/xsave: prefer eager clearing of state over eager restoring

2018-08-16 Thread Jan Beulich
>>> On 16.08.18 at 11:07,  wrote:
> On 22/06/2018 11:57, Jan Beulich wrote:
>> --- a/xen/arch/x86/spec_ctrl.c
>> +++ b/xen/arch/x86/spec_ctrl.c
>> @@ -616,7 +616,7 @@ void __init init_speculation_mitigations
>>  
>>      /* Check whether Eager FPU should be enabled by default. */
>>      if ( opt_eager_fpu == -1 )
>> -        opt_eager_fpu = should_use_eager_fpu();
>> +        opt_eager_fpu = !cpu_has_xsave && should_use_eager_fpu();
> 
> I'd not spotted this the first time round.
> 
> Intel is very clear that, if you're using xsave, you should be using
> eager FPU.  Therefore, this goes specifically against the advice in the
> ORM, and the advice we were given during the LazyFPU timeframe.
> 
> Furthermore we (XenServer) and customers have seen a reliable perf
> improvement from the LazyFPU security fix, up to 8% in places, for
> normal VDI and server workloads.  As I said during the development of the
> LazyFPU fixes, this is almost certainly down to the fact that all code
> uses the FPU these days.

Well - as said in the description, the observation in my tests (which
are not a typical server workload) was that about 50% of the context
switches were not followed by a (lazy) restore before the vCPU was
de-scheduled again.

The change as presented is in fact trying to move to a middle ground,
in that it doesn't leave stale state in the registers anymore, but
instead frees the underlying physical ones up for other uses (by
putting the state components into init state).
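
(For reference, the mechanism this builds on: XRSTOR initializes,
rather than loads, every component whose bit is set in the requested
mask but clear in the save area's XSTATE_BV. A minimal sketch, close
to but not literally the patch:

    memset(&xsave_area->xsave_hdr, 0, sizeof(xsave_area->xsave_hdr));
    xrstor(v, mask);   /* all of @mask's components -> init state */

which is what lets the physical registers backing them be freed up.)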

> I'm still waiting on a more formal statement from AMD, and don't yet
> have any perf numbers on their hardware.
> 
> However, as we will definitely get an extra perf boost from fully
> deleting the remaining lazy paths (no more clts/stts in the context
> switch path), my gut feeling is that there is going to have to be some
> terrible chronic case on AMD for us to consider not switching to
> fully eager.

Yes, eliminating the stts() in particular is certainly going to help
performance. With ever-growing state sizes, though, I'm not convinced
that in the long run (and even already with AVX-512 and its well over
2k of state) the CR0 access is indeed going to remain worse than the
(perhaps unnecessary) state load.
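
(For context, the clts() / stts() pair implements the classic CR0.TS
based lazy scheme. Roughly, and purely as an illustration - fpu_restore()
is a stand-in name, not Xen's real helper:

    static void ctxt_switch_fpu_sketch(void)
    {
        stts();             /* set CR0.TS: the next FPU insn raises #NM */
    }

    static void do_nm_sketch(struct vcpu *curr)   /* #NM handler */
    {
        clts();             /* clear CR0.TS ... */
        fpu_restore(curr);  /* ... and only now load the saved state */
    }

So going fully eager trades one CR0 write per switch against an
unconditional state load.)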

Jan




Re: [Xen-devel] [PATCH RFC] x86/xsave: prefer eager clearing of state over eager restoring

2018-08-16 Thread Andrew Cooper
On 16/08/2018 10:07, Andrew Cooper wrote:
> On 22/06/2018 11:57, Jan Beulich wrote:
>> --- a/xen/arch/x86/spec_ctrl.c
>> +++ b/xen/arch/x86/spec_ctrl.c
>> @@ -616,7 +616,7 @@ void __init init_speculation_mitigations
>>  
>>      /* Check whether Eager FPU should be enabled by default. */
>>      if ( opt_eager_fpu == -1 )
>> -        opt_eager_fpu = should_use_eager_fpu();
>> +        opt_eager_fpu = !cpu_has_xsave && should_use_eager_fpu();
> I'd not spotted this the first time round.
>
> Intel is very clear that, if you're using xsave, you should be using
> eager FPU.  Therefore, this goes specifically against the advice in the
> ORM, and the advice we were given during the LazyFPU timeframe.
>
> Furthermore we (XenServer) and customers have seen a reliable perf
> improvement from the LazyFPU security fix, up to 8% in places, for
> normal VDI and server workloads.  As I said during the development of the
> LazyFPU fixes, this is almost certainly down to the fact that all code
> uses the FPU these days.
>
> I'm still waiting on a more formal statement from AMD, and don't yet
> have any perf numbers on their hardware.
>
> However, as we will definitely get an extra perf boost from fully
> deleting the remaining lazy paths (no more clts/stts in the context
> switch path), my gut feeling is that there is going to have to be some
> terrible chronic case on AMD for us to consider not switching to
> fully eager.
>
> Irrespective of what we do here, I'd really like Wei to rebase his work
> to remove the lazy FPU logic from the nested virt paths, because it's a
> no-brainer (perf wise) and comes with a massive amount of code
> simplification in Xen.

Actually, this reminds me of a bug report given during XenSummit in
Nanjing.  Once Xen has restored lazy state, we drop the interception of
#NM, but we still take a vmexit on the clts.  This was from Alibaba
iirc, and came in at an astounding 70% perf hit to one particular HPC
workload.

I think this can be fixed by using the host/guest cr0 mask to allow
writes of cr0.ts, in exactly the same way as we have recently gained for
cr4.pge.  Also, AMD has a specific option for virtualisation of cr0.ts
writes, and I can't remember if we're using it or not.
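
The VMX side would be shaped something like this (a sketch under the
assumption that all CR0 bits are currently host-owned; the actual Xen
accessors and defaults may differ):

    /* Hand CR0.TS to the guest so clts/stts no longer vmexit.  Xen
     * would then also have to stop relying on intercepting TS writes
     * for its lazy-FPU bookkeeping. */
    unsigned long cr0_mask = ~0UL;    /* assumed default: own all bits */

    cr0_mask &= ~X86_CR0_TS;
    __vmwrite(CR0_GUEST_HOST_MASK, cr0_mask);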

~Andrew


Re: [Xen-devel] [PATCH RFC] x86/xsave: prefer eager clearing of state over eager restoring

2018-08-16 Thread Andrew Cooper
On 22/06/2018 11:57, Jan Beulich wrote:
> --- a/xen/arch/x86/spec_ctrl.c
> +++ b/xen/arch/x86/spec_ctrl.c
> @@ -616,7 +616,7 @@ void __init init_speculation_mitigations
>  
>      /* Check whether Eager FPU should be enabled by default. */
>      if ( opt_eager_fpu == -1 )
> -        opt_eager_fpu = should_use_eager_fpu();
> +        opt_eager_fpu = !cpu_has_xsave && should_use_eager_fpu();

I'd not spotted this the first time round.

Intel is very clear that, if you're using xsave, you should be using
eager FPU.  Therefore, this goes specifically against the advice in the
ORM, and the advice we were given during the LazyFPU timeframe.

Furthermore we (XenServer) and customers have seen a reliable perf
improvement from the LazyFPU security fix, up to 8% in places, for
normal VDI and server workloads.  As I said during the development of the
LazyFPU fixes, this is almost certainly down to the fact that all code
uses the FPU these days.

I'm still waiting on a more formal statement from AMD, and don't yet
have any perf numbers on their hardware.

However, as we will definitely get an extra perf boost from fully
deleting the remaining lazy paths (no more clts/stts in the context
switch path), my gut feeling is that there is going to have to be some
terrible chronic case on AMD for us to consider not switching to
fully eager.

Irrespective of what we do here, I'd really like Wei to rebase his work
to remove the lazy FPU logic from the nested virt paths, because it's a
no-brainer (perf wise) and comes with a massive amount of code
simplification in Xen.

~Andrew


[Xen-devel] [PATCH RFC] x86/xsave: prefer eager clearing of state over eager restoring

2018-06-22 Thread Jan Beulich
Unlike FXRSTOR, XRSTOR allows for setting components to their
initial state. Utilize this to clear register state immediately after
having saved a vCPU's state (which we don't defer past
__context_switch()), considering that
- this supposedly reduces power consumption,
- this might even free up physical registers,
- we don't normally save/restore FPU state for a vCPU on every context
  switch (in some initial measurements I've observed an approximately
  50:50 split between context switches with and without a subsequent
  lazy restore, on a not overly heavily loaded system; it's clear anyway
  that this is heavily dependent on what exactly a vCPU is used for).

Signed-off-by: Jan Beulich 
---
RFC since the full performance effect is still not very clear.
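
For reviewers, the architectural behaviour being relied upon (a sketch
using Xen-like names, assuming the standard non-compacted save format;
not part of the patch): with a component's bit set in the XRSTOR
instruction mask but clear in the save area's XSTATE_BV, the CPU
initializes that component instead of loading it from memory.

static void xrstor_to_init(struct xsave_struct *area, uint64_t mask)
{
    area->xsave_hdr.xstate_bv = 0;     /* mark all components "not saved" */
    asm volatile ( "xrstor %0"
                   :: "m" (*area), "a" ((uint32_t)mask),
                      "d" ((uint32_t)(mask >> 32)) );
    /* Every component named in @mask is now in its init state. */
}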

--- a/xen/arch/x86/i387.c
+++ b/xen/arch/x86/i387.c
@@ -33,6 +33,7 @@ static inline void fpu_xrstor(struct vcp
     ok = set_xcr0(v->arch.xcr0_accum | XSTATE_FP_SSE);
     ASSERT(ok);
     xrstor(v, mask);
+    v->arch.xstate_dirty = mask;
     ok = set_xcr0(v->arch.xcr0 ?: XSTATE_FP_SSE);
     ASSERT(ok);
 }
@@ -148,6 +149,9 @@ static inline void fpu_xsave(struct vcpu
     ok = set_xcr0(v->arch.xcr0_accum | XSTATE_FP_SSE);
     ASSERT(ok);
     xsave(v, mask);
+    xstate_load_init(v->arch.xstate_dirty &
+                     v->arch.xsave_area->xsave_hdr.xstate_bv);
+    v->arch.xstate_dirty = 0;
     ok = set_xcr0(v->arch.xcr0 ?: XSTATE_FP_SSE);
     ASSERT(ok);
 }
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -616,7 +616,7 @@ void __init init_speculation_mitigations
 
     /* Check whether Eager FPU should be enabled by default. */
     if ( opt_eager_fpu == -1 )
-        opt_eager_fpu = should_use_eager_fpu();
+        opt_eager_fpu = !cpu_has_xsave && should_use_eager_fpu();
 
     /* (Re)init BSP state now that default_spec_ctrl_flags has been calculated. */
     init_shadow_spec_ctrl_state();
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -734,6 +734,7 @@ int handle_xsetbv(u32 index, u64 new_bv)
         cr0 &= ~X86_CR0_TS;
     }
     xrstor(curr, mask);
+    curr->arch.xstate_dirty |= mask;
     if ( cr0 & X86_CR0_TS )
         write_cr0(cr0);
 }
@@ -774,12 +775,19 @@ uint64_t read_bndcfgu(void)
     return xstate->xsave_hdr.xstate_bv & X86_XCR0_BNDCSR ? bndcsr->bndcfgu : 0;
 }
 
+void xstate_load_init(uint64_t mask)
+{
+    struct vcpu *v = idle_vcpu[smp_processor_id()];
+    struct xsave_struct *xstate = v->arch.xsave_area;
+
+    memset(&xstate->xsave_hdr, 0, sizeof(xstate->xsave_hdr));
+    xrstor(v, mask);
+}
+
 void xstate_set_init(uint64_t mask)
 {
     unsigned long cr0 = read_cr0();
     unsigned long xcr0 = this_cpu(xcr0);
-    struct vcpu *v = idle_vcpu[smp_processor_id()];
-    struct xsave_struct *xstate = v->arch.xsave_area;
 
     if ( ~xfeature_mask & mask )
     {
@@ -792,8 +800,7 @@ void xstate_set_init(uint64_t mask)
 
     clts();
 
-    memset(&xstate->xsave_hdr, 0, sizeof(xstate->xsave_hdr));
-    xrstor(v, mask);
+    xstate_load_init(mask);
 
     if ( cr0 & X86_CR0_TS )
         write_cr0(cr0);
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -559,6 +559,11 @@ struct arch_vcpu
      * it explicitly enables it via xcr0.
      */
     uint64_t xcr0_accum;
+    /*
+     * Accumulated set of components which may currently be dirty, and hence
+     * should be cleared immediately after saving state.
+     */
+    uint64_t xstate_dirty;
     /* This variable determines whether nonlazy extended state has been used,
      * and thus should be saved/restored. */
     bool_t nonlazy_xstate_used;
--- a/xen/include/asm-x86/xstate.h
+++ b/xen/include/asm-x86/xstate.h
@@ -95,6 +95,7 @@ uint64_t get_msr_xss(void);
 uint64_t read_bndcfgu(void);
 void xsave(struct vcpu *v, uint64_t mask);
 void xrstor(struct vcpu *v, uint64_t mask);
+void xstate_load_init(uint64_t mask);
 void xstate_set_init(uint64_t mask);
 bool xsave_enabled(const struct vcpu *v);
 int __must_check validate_xstate(u64 xcr0, u64 xcr0_accum,



