Re: [Xen-devel] [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible

2018-03-22 Thread Jan Beulich
>>> On 22.03.18 at 16:26,  wrote:
> On 22/03/18 15:31, Jan Beulich wrote:
> On 21.03.18 at 13:51,  wrote:
>>>  void write_ptbase(struct vcpu *v)
>>>  {
>>> +if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
>>> +get_cpu_info()->root_pgt_changed = true;
>>>  write_cr3(v->arch.cr3);
>> 
>> When you come here from e.g. __sync_local_execstate(), you
>> don't really need to set the flag. Of course you'll come here again
>> before the next 64-bit PV vCPU will make it to restore_all_guest,
>> so by the time we make it there the flag will be set anyway.
>> However, if you already use such a subtlety, then there's also
>> no point excluding 32-bit vCPU-s here (nor in make_cr3()), as
>> those will never make it to restore_all_guest. Same then for
>> excluding HVM vCPU-s. And I then wonder whether (here or
>> more likely in a later patch) the root_pgt check couldn't go away
>> as well.
> 
> I'm not sure this is worth it. Patch 3 will re-introduce a conditional
> here and it will look rather different (e.g. without the root_pgt
> check). So micro-optimizing this patch barely makes any sense.

Yes, I've seen that once I made it there. Perhaps worth dropping
the two is_*() checks here, but not worry about the root_pgt one.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible

2018-03-22 Thread Juergen Gross
On 22/03/18 15:31, Jan Beulich wrote:
 On 21.03.18 at 13:51,  wrote:
>> --- a/xen/arch/x86/flushtlb.c
>> +++ b/xen/arch/x86/flushtlb.c
>> @@ -158,6 +158,9 @@ unsigned int flush_area_local(const void *va, unsigned 
>> int flags)
>>  }
>>  }
>>  
>> +if ( flags & FLUSH_ROOT_PGTBL )
>> +get_cpu_info()->root_pgt_changed = true;
>> +
>>  local_irq_restore(irqfl);
>>  
>>  return flags;
> 
> Does this really need to sit inside the interrupts disabled section?

Hmm, no, I don't think so. I'll move it below local_irq_restore().

> Thinking about it I even wonder whether the cache flush part needs
> to be. Even for the INVLPG portion of the TLB flush part I can't
> seem to see a need for IRQs to be off. I think it's really just the
> pre_flush() / post_flush() pair which needs to be inside such a
> section. I'll prepare a patch (for after 4.11). I think some of the
> changes later in your series will actually further ease this.
> 
>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -499,10 +499,15 @@ void free_shared_domheap_page(struct page_info *page)
>>  void make_cr3(struct vcpu *v, mfn_t mfn)
>>  {
>>  v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
>> +if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
>> + !is_pv_32bit_vcpu(v) )
>> +get_cpu_info()->root_pgt_changed = true;
>>  }
> 
> As this doesn't actually update CR3, setting the flag shouldn't
> generally be necessary if the caller then invokes write_ptbase().
> Isn't setting the flag here needed solely in the case of
> _toggle_guest_pt() being up the call tree? In which case it would
> perhaps better be set there (and in turn some or even all of the
> conditional around it could be dropped)?

Yes, you are right.

> 
>>  void write_ptbase(struct vcpu *v)
>>  {
>> +if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
>> +get_cpu_info()->root_pgt_changed = true;
>>  write_cr3(v->arch.cr3);
> 
> When you come here from e.g. __sync_local_execstate(), you
> don't really need to set the flag. Of course you'll come here again
> before the next 64-bit PV vCPU will make it to restore_all_guest,
> so by the time we make it there the flag will be set anyway.
> However, if you already use such a subtlety, then there's also
> no point excluding 32-bit vCPU-s here (nor in make_cr3()), as
> those will never make it to restore_all_guest. Same then for
> excluding HVM vCPU-s. And I then wonder whether (here or
> more likely in a later patch) the root_pgt check couldn't go away
> as well.

I'm not sure this is worth it. Patch 3 will re-introduce a conditional
here and it will look rather different (e.g. without the root_pgt
check). So micro-optimizing this patch barely makes any sense.

> 
>> @@ -3698,18 +3703,29 @@ long do_mmu_update(
>>  break;
>>  rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> -/*
>> - * No need to sync if all uses of the page can be 
>> accounted
>> - * to the page lock we hold, its pinned status, and 
>> uses on
>> - * this (v)CPU.
>> - */
>> -if ( !rc && !cpu_has_no_xpti &&
>> - ((page->u.inuse.type_info & PGT_count_mask) >
>> -  (1 + !!(page->u.inuse.type_info & PGT_pinned) +
>> -   (pagetable_get_pfn(curr->arch.guest_table) == 
>> mfn) 
>> +
>> -   (pagetable_get_pfn(curr->arch.guest_table_user) 
>> ==
>> -mfn))) )
>> -sync_guest = true;
>> +if ( !rc && !cpu_has_no_xpti )
>> +{
>> +bool local_in_use = false;
>> +
>> +if ( (pagetable_get_pfn(curr->arch.guest_table) ==
>> +  mfn) ||
>> + 
>> (pagetable_get_pfn(curr->arch.guest_table_user) ==
>> +  mfn) )
>> +{
>> +local_in_use = true;
>> +get_cpu_info()->root_pgt_changed = true;
>> +}
> 
> The conditional causes root_pgt_changed to get set even in cases
> where what CR3 points to doesn't actually change (if it's the user
> page tables that get modified). I think you want to check
> curr->arch.cr3 here, or only curr->arch.guest_table (as user mode
> can't invoke hypercalls).

I'll go with curr->arch.guest_table.

> 
>> +/*
>> + * No need to sync if all uses of the page can be
>> + * accounted to the page lock we hold, its pinned
>> + * status, and uses on this (v)CPU.
>> + */
>> +

Re: [Xen-devel] [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible

2018-03-22 Thread Jan Beulich
>>> On 21.03.18 at 13:51,  wrote:
> --- a/xen/arch/x86/flushtlb.c
> +++ b/xen/arch/x86/flushtlb.c
> @@ -158,6 +158,9 @@ unsigned int flush_area_local(const void *va, unsigned 
> int flags)
>  }
>  }
>  
> +if ( flags & FLUSH_ROOT_PGTBL )
> +get_cpu_info()->root_pgt_changed = true;
> +
>  local_irq_restore(irqfl);
>  
>  return flags;

Does this really need to sit inside the interrupts disabled section?

Thinking about it I even wonder whether the cache flush part needs
to be. Even for the INVLPG portion of the TLB flush part I can't
seem to see a need for IRQs to be off. I think it's really just the
pre_flush() / post_flush() pair which needs to be inside such a
section. I'll prepare a patch (for after 4.11). I think some of the
changes later in your series will actually further ease this.

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -499,10 +499,15 @@ void free_shared_domheap_page(struct page_info *page)
>  void make_cr3(struct vcpu *v, mfn_t mfn)
>  {
>  v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
> +if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
> + !is_pv_32bit_vcpu(v) )
> +get_cpu_info()->root_pgt_changed = true;
>  }

As this doesn't actually update CR3, setting the flag shouldn't
generally be necessary if the caller then invokes write_ptbase().
Isn't setting the flag here needed solely in the case of
_toggle_guest_pt() being up the call tree? In which case it would
perhaps better be set there (and in turn some or even all of the
conditional around it could be dropped)?

>  void write_ptbase(struct vcpu *v)
>  {
> +if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
> +get_cpu_info()->root_pgt_changed = true;
>  write_cr3(v->arch.cr3);

When you come here from e.g. __sync_local_execstate(), you
don't really need to set the flag. Of course you'll come here again
before the next 64-bit PV vCPU will make it to restore_all_guest,
so by the time we make it there the flag will be set anyway.
However, if you already use such a subtlety, then there's also
no point excluding 32-bit vCPU-s here (nor in make_cr3()), as
those will never make it to restore_all_guest. Same then for
excluding HVM vCPU-s. And I then wonder whether (here or
more likely in a later patch) the root_pgt check couldn't go away
as well.

> @@ -3698,18 +3703,29 @@ long do_mmu_update(
>  break;
>  rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> -/*
> - * No need to sync if all uses of the page can be 
> accounted
> - * to the page lock we hold, its pinned status, and uses 
> on
> - * this (v)CPU.
> - */
> -if ( !rc && !cpu_has_no_xpti &&
> - ((page->u.inuse.type_info & PGT_count_mask) >
> -  (1 + !!(page->u.inuse.type_info & PGT_pinned) +
> -   (pagetable_get_pfn(curr->arch.guest_table) == 
> mfn) 
> +
> -   (pagetable_get_pfn(curr->arch.guest_table_user) ==
> -mfn))) )
> -sync_guest = true;
> +if ( !rc && !cpu_has_no_xpti )
> +{
> +bool local_in_use = false;
> +
> +if ( (pagetable_get_pfn(curr->arch.guest_table) ==
> +  mfn) ||
> + (pagetable_get_pfn(curr->arch.guest_table_user) 
> ==
> +  mfn) )
> +{
> +local_in_use = true;
> +get_cpu_info()->root_pgt_changed = true;
> +}

The conditional causes root_pgt_changed to get set even in cases
where what CR3 points to doesn't actually change (if it's the user
page tables that get modified). I think you want to check
curr->arch.cr3 here, or only curr->arch.guest_table (as user mode
can't invoke hypercalls).

> +/*
> + * No need to sync if all uses of the page can be
> + * accounted to the page lock we hold, its pinned
> + * status, and uses on this (v)CPU.
> + */
> +if ( (page->u.inuse.type_info & PGT_count_mask) >
> + (1 + !!(page->u.inuse.type_info & PGT_pinned) +
> +  local_in_use) )

The boolean local_in_use evaluates to 1 here, when previously the
value could have been 1 or 2 (I agree that's highly theoretical, but
anyway). Of course this will be addressed implicitly if you check
(only) curr->arch.guest_table above and move the
curr->arch.guest_table_user check here.

Jan



[Xen-devel] [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible

2018-03-21 Thread Juergen Gross
For mitigation of Meltdown the current L4 page table is copied to the
cpu local root page table each time a 64 bit pv guest is entered.

Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.

So add a per-cpu flag whether the copying should be performed and set
that flag only when loading a new %cr3 or modifying the L4 page table.
This includes synchronization of the cpu local root page table with
other cpus, so add a special synchronization flag for that case.

A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:

- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds

Signed-off-by: Juergen Gross 
---
V3:
- set flag locally only if affected L4 is active (Jan Beulich)
- add setting flag to flush_area_mask() (Jan Beulich)
- set flag in make_cr3() only if called for current active vcpu

To be applied on top of Jan's "Meltdown band-aid overhead reduction"
series
---
 xen/arch/x86/flushtlb.c   |  3 +++
 xen/arch/x86/mm.c | 42 +++
 xen/arch/x86/smp.c|  2 +-
 xen/arch/x86/x86_64/asm-offsets.c |  1 +
 xen/arch/x86/x86_64/entry.S   |  8 ++--
 xen/include/asm-x86/current.h |  8 
 xen/include/asm-x86/flushtlb.h|  2 ++
 7 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 8a7a76b8ff..9a9af71d5a 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -158,6 +158,9 @@ unsigned int flush_area_local(const void *va, unsigned int 
flags)
 }
 }
 
+if ( flags & FLUSH_ROOT_PGTBL )
+get_cpu_info()->root_pgt_changed = true;
+
 local_irq_restore(irqfl);
 
 return flags;
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index f0195561c2..352600ad73 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -499,10 +499,15 @@ void free_shared_domheap_page(struct page_info *page)
 void make_cr3(struct vcpu *v, mfn_t mfn)
 {
 v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT;
+if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) &&
+ !is_pv_32bit_vcpu(v) )
+get_cpu_info()->root_pgt_changed = true;
 }
 
 void write_ptbase(struct vcpu *v)
 {
+if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) )
+get_cpu_info()->root_pgt_changed = true;
 write_cr3(v->arch.cr3);
 }
 
@@ -3698,18 +3703,29 @@ long do_mmu_update(
 break;
 rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
   cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
-/*
- * No need to sync if all uses of the page can be accounted
- * to the page lock we hold, its pinned status, and uses on
- * this (v)CPU.
- */
-if ( !rc && !cpu_has_no_xpti &&
- ((page->u.inuse.type_info & PGT_count_mask) >
-  (1 + !!(page->u.inuse.type_info & PGT_pinned) +
-   (pagetable_get_pfn(curr->arch.guest_table) == mfn) +
-   (pagetable_get_pfn(curr->arch.guest_table_user) ==
-mfn))) )
-sync_guest = true;
+if ( !rc && !cpu_has_no_xpti )
+{
+bool local_in_use = false;
+
+if ( (pagetable_get_pfn(curr->arch.guest_table) ==
+  mfn) ||
+ (pagetable_get_pfn(curr->arch.guest_table_user) ==
+  mfn) )
+{
+local_in_use = true;
+get_cpu_info()->root_pgt_changed = true;
+}
+
+/*
+ * No need to sync if all uses of the page can be
+ * accounted to the page lock we hold, its pinned
+ * status, and uses on this (v)CPU.
+ */
+if ( (page->u.inuse.type_info & PGT_count_mask) >
+ (1 + !!(page->u.inuse.type_info & PGT_pinned) +
+  local_in_use) )
+sync_guest = true;
+}
 break;
 
 case PGT_writable_page:
@@ -3824,7 +3840,7 @@ long do_mmu_update(
 
 cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
 if ( !cpumask_empty(mask) )
-flush_mask(mask, FLUSH_TLB_GLOBAL);
+flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
 }
 
 perfc_add(num_page_updates,