On 06/04/18 17:17, Andrew Cooper wrote:
> On 06/04/18 08:52, Juergen Gross wrote:
>> Instead of flushing the TLB from global pages when switching address
>> spaces with XPTI being active just disable global pages via %cr4
>> completely when a domain subject to XPTI is active. This avoids the
>> need for extra TLB flushes as loading %cr3 will remove all TLB
>> entries.
>>
>> In order to avoid states with cr3/cr4 having inconsistent values
>> (e.g. global pages being activated while cr3 already specifies a XPTI
>> address space) move loading of the new cr4 value to write_ptbase()
>> (actually to write_cr3_cr4() called by write_ptbase()).
>>
>> Signed-off-by: Juergen Gross <jgr...@suse.com>
>> Reviewed-by: Jan Beulich <jbeul...@suse.com>
> 
> This is contrary to the performance optimisations taken by Linux,
> Windows and Apple, which borrowing Xen's 64bit PV optimisation of having
> global user pages, because it really is a (small) performance improvement.
> 
> Are there performance numbers for this change alone, and/or are later
> changes strictly dependent on this functionality?

Yes and yes.

Performance is much better for XPTI (again my standard compile test):
elapsed time dropped from 112 to 106 seconds, system time from 160 to
139 seconds, user time from 187 to 186 seconds.

Page invalidation with PCID enabled is _much_ easier.

> On the Xen side of things, an argument could probably be made that the
> extra cr4 rewrites due to the L4 shadowing might eat away the
> performance we would otherwise gain, but I'd be hesitant to blindly
> assume that this is the case.

Another problem I wanted to avoid was the global page handling of Xen
private pages: I would have needed to remove all the global bits, either
even for AMD cpus, or do that dynamically.

> A complicating factor is that Intel have said that the performance gains
> from user global pages would be more noticeable on older hardware, due
> to differences in the TLB architecture.
> 
>> ---
>> V4:
>> - don't use mmu_cr4_features for setting new cr4 value (Jan Beulich)
>> - use simpler scheme for setting X86_CR4_PGE in
>>   pv_guest_cr4_to_real_cr4() (Jan Beulich)
>>
>> V3:
>> - move cr4 loading for all domains from *_ctxt_switch_to() to
>>   write_cr3_cr4() called by write_ptbase() (Jan Beulich)
>> - rebase
>> ---
>>  xen/arch/x86/domain.c          |  5 -----
>>  xen/arch/x86/flushtlb.c        | 13 ++++++++-----
>>  xen/arch/x86/mm.c              | 14 +++++++++++---
>>  xen/arch/x86/x86_64/entry.S    | 10 ----------
>>  xen/common/efi/runtime.c       |  4 ++--
>>  xen/include/asm-x86/domain.h   |  3 ++-
>>  xen/include/asm-x86/flushtlb.h |  2 +-
>>  7 files changed, 24 insertions(+), 27 deletions(-)
>>
>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
>> index 9c229594f4..c2bb70c483 100644
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1522,17 +1522,12 @@ void paravirt_ctxt_switch_from(struct vcpu *v)
>>  void paravirt_ctxt_switch_to(struct vcpu *v)
>>  {
>>      root_pgentry_t *root_pgt = this_cpu(root_pgt);
>> -    unsigned long cr4;
>>  
>>      if ( root_pgt )
>>          root_pgt[root_table_offset(PERDOMAIN_VIRT_START)] =
>>              l4e_from_page(v->domain->arch.perdomain_l3_pg,
>>                            __PAGE_HYPERVISOR_RW);
>>  
>> -    cr4 = pv_guest_cr4_to_real_cr4(v);
>> -    if ( unlikely(cr4 != read_cr4()) )
>> -        write_cr4(cr4);
>> -
>>      if ( unlikely(v->arch.debugreg[7] & DR7_ACTIVE_MASK) )
>>          activate_debugregs(v);
>>  
>> diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
>> index f793b70696..5dcd9a2bf6 100644
>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -89,20 +89,23 @@ static void do_tlb_flush(void)
>>      post_flush(t);
>>  }
>>  
>> -void write_cr3(unsigned long cr3)
>> +void write_cr3_cr4(unsigned long cr3, unsigned long cr4)
>>  {
>> -    unsigned long flags, cr4;
>> +    unsigned long flags;
>>      u32 t;
>>  
>>      /* This non-reentrant function is sometimes called in interrupt 
>> context. */
>>      local_irq_save(flags);
>>  
>>      t = pre_flush();
>> -    cr4 = read_cr4();
>>  
>> -    write_cr4(cr4 & ~X86_CR4_PGE);
>> +    if ( read_cr4() & X86_CR4_PGE )
>> +        write_cr4(cr4 & ~X86_CR4_PGE);
>> +
>>      asm volatile ( "mov %0, %%cr3" : : "r" (cr3) : "memory" );
>> -    write_cr4(cr4);
>> +
>> +    if ( read_cr4() != cr4 )
> 
> read_cr4(), despite being a cached read, isn't free because of the %rsp
> manipulation required to access the variable.  I'd keep the locally
> cached cr4, and use "if ( cr4 & X86_CR4_PGE )" here.

I did this on purpose: it might be cr4 is being modified not only
regarding pge. I can nevertheless cache the read cr4 value, of course.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Reply via email to