Re: [Xen-devel] [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible
>>> On 22.03.18 at 16:26,wrote: > On 22/03/18 15:31, Jan Beulich wrote: > On 21.03.18 at 13:51, wrote: >>> void write_ptbase(struct vcpu *v) >>> { >>> +if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) ) >>> +get_cpu_info()->root_pgt_changed = true; >>> write_cr3(v->arch.cr3); >> >> When you come here from e.g. __sync_local_execstate(), you >> don't really need to set the flag. Of course you'll come here again >> before the next 64-bit PV vCPU will make it to restore_all_guest, >> so by the time we make it there the flag will be set anyway. >> However, if you already use such a subtlety, then there's also >> no point excluding 32-bit vCPU-s here (nor in make_cr3()), as >> those will never make it to restore_all_guest. Same then for >> excluding HVM vCPU-s. And I then wonder whether (here or >> more likely in a later patch) the root_pgt check couldn't go away >> as well. > > I'm not sure this is worth it. Patch 3 will re-introduce a conditional > here and it will look rather different (e.g. without the root_pgt > check). So micro-optimizing this patch barely makes any sense. Yes, I've seen that once I made it there. Perhaps worth dropping the two is_*() checks here, but not worry about the root_pgt one. Jan ___ Xen-devel mailing list Xen-devel@lists.xenproject.org https://lists.xenproject.org/mailman/listinfo/xen-devel
Re: [Xen-devel] [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible
On 22/03/18 15:31, Jan Beulich wrote: On 21.03.18 at 13:51,wrote: >> --- a/xen/arch/x86/flushtlb.c >> +++ b/xen/arch/x86/flushtlb.c >> @@ -158,6 +158,9 @@ unsigned int flush_area_local(const void *va, unsigned >> int flags) >> } >> } >> >> +if ( flags & FLUSH_ROOT_PGTBL ) >> +get_cpu_info()->root_pgt_changed = true; >> + >> local_irq_restore(irqfl); >> >> return flags; > > Does this really need to sit inside the interrupts disabled section? Hmm, no, I don't think so. I'll move it below local_irq_restore(). > Thinking about it I even wonder whether the cache flush part needs > to be. Even for the INVLPG portion of the TLB flush part I can't > seem to see a need for IRQs to be off. I think it's really just the > pre_flush() / post_flush() pair which needs to be inside such a > section. I'll prepare a patch (for after 4.11). I think some of the > changes later in your series will actually further ease this. > >> --- a/xen/arch/x86/mm.c >> +++ b/xen/arch/x86/mm.c >> @@ -499,10 +499,15 @@ void free_shared_domheap_page(struct page_info *page) >> void make_cr3(struct vcpu *v, mfn_t mfn) >> { >> v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT; >> +if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) && >> + !is_pv_32bit_vcpu(v) ) >> +get_cpu_info()->root_pgt_changed = true; >> } > > As this doesn't actually update CR3, setting the flag shouldn't > generally be necessary if the caller then invokes write_ptbase(). > Isn't setting the flag here needed solely in the case of > _toggle_guest_pt() being up the call tree? In which case it would > perhaps better be set there (and in turn some or even all of the > conditional around it could be dropped)? Yes, you are right. > >> void write_ptbase(struct vcpu *v) >> { >> +if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) ) >> +get_cpu_info()->root_pgt_changed = true; >> write_cr3(v->arch.cr3); > > When you come here from e.g. __sync_local_execstate(), you > don't really need to set the flag. Of course you'll come here again > before the next 64-bit PV vCPU will make it to restore_all_guest, > so by the time we make it there the flag will be set anyway. > However, if you already use such a subtlety, then there's also > no point excluding 32-bit vCPU-s here (nor in make_cr3()), as > those will never make it to restore_all_guest. Same then for > excluding HVM vCPU-s. And I then wonder whether (here or > more likely in a later patch) the root_pgt check couldn't go away > as well. I'm not sure this is worth it. Patch 3 will re-introduce a conditional here and it will look rather different (e.g. without the root_pgt check). So micro-optimizing this patch barely makes any sense. > >> @@ -3698,18 +3703,29 @@ long do_mmu_update( >> break; >> rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn, >>cmd == MMU_PT_UPDATE_PRESERVE_AD, v); >> -/* >> - * No need to sync if all uses of the page can be >> accounted >> - * to the page lock we hold, its pinned status, and >> uses on >> - * this (v)CPU. >> - */ >> -if ( !rc && !cpu_has_no_xpti && >> - ((page->u.inuse.type_info & PGT_count_mask) > >> - (1 + !!(page->u.inuse.type_info & PGT_pinned) + >> - (pagetable_get_pfn(curr->arch.guest_table) == >> mfn) >> + >> - (pagetable_get_pfn(curr->arch.guest_table_user) >> == >> -mfn))) ) >> -sync_guest = true; >> +if ( !rc && !cpu_has_no_xpti ) >> +{ >> +bool local_in_use = false; >> + >> +if ( (pagetable_get_pfn(curr->arch.guest_table) == >> + mfn) || >> + >> (pagetable_get_pfn(curr->arch.guest_table_user) == >> + mfn) ) >> +{ >> +local_in_use = true; >> +get_cpu_info()->root_pgt_changed = true; >> +} > > The conditional causes root_pgt_changed to get set even in cases > where what CR3 points to doesn't actually change (if it's the user > page tables that get modified). I think you want to check > curr->arch.cr3 here, or only curr->arch.guest_table (as user mode > can't invoke hypercalls). I'll go with curr->arch.guest_table. > >> +/* >> + * No need to sync if all uses of the page can be >> + * accounted to the page lock we hold, its pinned >> + * status, and uses on this (v)CPU. >> + */ >> +
Re: [Xen-devel] [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible
>>> On 21.03.18 at 13:51,wrote: > --- a/xen/arch/x86/flushtlb.c > +++ b/xen/arch/x86/flushtlb.c > @@ -158,6 +158,9 @@ unsigned int flush_area_local(const void *va, unsigned > int flags) > } > } > > +if ( flags & FLUSH_ROOT_PGTBL ) > +get_cpu_info()->root_pgt_changed = true; > + > local_irq_restore(irqfl); > > return flags; Does this really need to sit inside the interrupts disabled section? Thinking about it I even wonder whether the cache flush part needs to be. Even for the INVLPG portion of the TLB flush part I can't seem to see a need for IRQs to be off. I think it's really just the pre_flush() / post_flush() pair which needs to be inside such a section. I'll prepare a patch (for after 4.11). I think some of the changes later in your series will actually further ease this. > --- a/xen/arch/x86/mm.c > +++ b/xen/arch/x86/mm.c > @@ -499,10 +499,15 @@ void free_shared_domheap_page(struct page_info *page) > void make_cr3(struct vcpu *v, mfn_t mfn) > { > v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT; > +if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) && > + !is_pv_32bit_vcpu(v) ) > +get_cpu_info()->root_pgt_changed = true; > } As this doesn't actually update CR3, setting the flag shouldn't generally be necessary if the caller then invokes write_ptbase(). Isn't setting the flag here needed solely in the case of _toggle_guest_pt() being up the call tree? In which case it would perhaps better be set there (and in turn some or even all of the conditional around it could be dropped)? > void write_ptbase(struct vcpu *v) > { > +if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) ) > +get_cpu_info()->root_pgt_changed = true; > write_cr3(v->arch.cr3); When you come here from e.g. __sync_local_execstate(), you don't really need to set the flag. Of course you'll come here again before the next 64-bit PV vCPU will make it to restore_all_guest, so by the time we make it there the flag will be set anyway. However, if you already use such a subtlety, then there's also no point excluding 32-bit vCPU-s here (nor in make_cr3()), as those will never make it to restore_all_guest. Same then for excluding HVM vCPU-s. And I then wonder whether (here or more likely in a later patch) the root_pgt check couldn't go away as well. > @@ -3698,18 +3703,29 @@ long do_mmu_update( > break; > rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn, >cmd == MMU_PT_UPDATE_PRESERVE_AD, v); > -/* > - * No need to sync if all uses of the page can be > accounted > - * to the page lock we hold, its pinned status, and uses > on > - * this (v)CPU. > - */ > -if ( !rc && !cpu_has_no_xpti && > - ((page->u.inuse.type_info & PGT_count_mask) > > - (1 + !!(page->u.inuse.type_info & PGT_pinned) + > - (pagetable_get_pfn(curr->arch.guest_table) == > mfn) > + > - (pagetable_get_pfn(curr->arch.guest_table_user) == > -mfn))) ) > -sync_guest = true; > +if ( !rc && !cpu_has_no_xpti ) > +{ > +bool local_in_use = false; > + > +if ( (pagetable_get_pfn(curr->arch.guest_table) == > + mfn) || > + (pagetable_get_pfn(curr->arch.guest_table_user) > == > + mfn) ) > +{ > +local_in_use = true; > +get_cpu_info()->root_pgt_changed = true; > +} The conditional causes root_pgt_changed to get set even in cases where what CR3 points to doesn't actually change (if it's the user page tables that get modified). I think you want to check curr->arch.cr3 here, or only curr->arch.guest_table (as user mode can't invoke hypercalls). > +/* > + * No need to sync if all uses of the page can be > + * accounted to the page lock we hold, its pinned > + * status, and uses on this (v)CPU. > + */ > +if ( (page->u.inuse.type_info & PGT_count_mask) > > + (1 + !!(page->u.inuse.type_info & PGT_pinned) + > + local_in_use) ) The boolean local_in_use evaluates to 1 here, when previously the value could have been 1 or 2 (I agree that's highly theoretical, but anyway). Of course this will be addressed implicitly if you check (only) curr->arch.guest_table above and move the curr->arch.guest_table_user check here. Jan
[Xen-devel] [PATCH v3 1/7] x86/xpti: avoid copying L4 page table contents when possible
For mitigation of Meltdown the current L4 page table is copied to the cpu local root page table each time a 64 bit pv guest is entered. Copying can be avoided in cases where the guest L4 page table hasn't been modified while running the hypervisor, e.g. when handling interrupts or any hypercall not modifying the L4 page table or %cr3. So add a per-cpu flag whether the copying should be performed and set that flag only when loading a new %cr3 or modifying the L4 page table. This includes synchronization of the cpu local root page table with other cpus, so add a special synchronization flag for that case. A simple performance check (compiling the hypervisor via "make -j 4") in dom0 with 4 vcpus shows a significant improvement: - real time drops from 112 seconds to 103 seconds - system time drops from 142 seconds to 131 seconds Signed-off-by: Juergen Gross--- V3: - set flag locally only if affected L4 is active (Jan Beulich) - add setting flag to flush_area_mask() (Jan Beulich) - set flag in make_cr3() only if called for current active vcpu To be applied on top of Jan's "Meltdown band-aid overhead reduction" series --- xen/arch/x86/flushtlb.c | 3 +++ xen/arch/x86/mm.c | 42 +++ xen/arch/x86/smp.c| 2 +- xen/arch/x86/x86_64/asm-offsets.c | 1 + xen/arch/x86/x86_64/entry.S | 8 ++-- xen/include/asm-x86/current.h | 8 xen/include/asm-x86/flushtlb.h| 2 ++ 7 files changed, 50 insertions(+), 16 deletions(-) diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c index 8a7a76b8ff..9a9af71d5a 100644 --- a/xen/arch/x86/flushtlb.c +++ b/xen/arch/x86/flushtlb.c @@ -158,6 +158,9 @@ unsigned int flush_area_local(const void *va, unsigned int flags) } } +if ( flags & FLUSH_ROOT_PGTBL ) +get_cpu_info()->root_pgt_changed = true; + local_irq_restore(irqfl); return flags; diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index f0195561c2..352600ad73 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -499,10 +499,15 @@ void free_shared_domheap_page(struct page_info *page) void make_cr3(struct vcpu *v, mfn_t mfn) { v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT; +if ( v == current && this_cpu(root_pgt) && is_pv_vcpu(v) && + !is_pv_32bit_vcpu(v) ) +get_cpu_info()->root_pgt_changed = true; } void write_ptbase(struct vcpu *v) { +if ( this_cpu(root_pgt) && is_pv_vcpu(v) && !is_pv_32bit_vcpu(v) ) +get_cpu_info()->root_pgt_changed = true; write_cr3(v->arch.cr3); } @@ -3698,18 +3703,29 @@ long do_mmu_update( break; rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn, cmd == MMU_PT_UPDATE_PRESERVE_AD, v); -/* - * No need to sync if all uses of the page can be accounted - * to the page lock we hold, its pinned status, and uses on - * this (v)CPU. - */ -if ( !rc && !cpu_has_no_xpti && - ((page->u.inuse.type_info & PGT_count_mask) > - (1 + !!(page->u.inuse.type_info & PGT_pinned) + - (pagetable_get_pfn(curr->arch.guest_table) == mfn) + - (pagetable_get_pfn(curr->arch.guest_table_user) == -mfn))) ) -sync_guest = true; +if ( !rc && !cpu_has_no_xpti ) +{ +bool local_in_use = false; + +if ( (pagetable_get_pfn(curr->arch.guest_table) == + mfn) || + (pagetable_get_pfn(curr->arch.guest_table_user) == + mfn) ) +{ +local_in_use = true; +get_cpu_info()->root_pgt_changed = true; +} + +/* + * No need to sync if all uses of the page can be + * accounted to the page lock we hold, its pinned + * status, and uses on this (v)CPU. + */ +if ( (page->u.inuse.type_info & PGT_count_mask) > + (1 + !!(page->u.inuse.type_info & PGT_pinned) + + local_in_use) ) +sync_guest = true; +} break; case PGT_writable_page: @@ -3824,7 +3840,7 @@ long do_mmu_update( cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu)); if ( !cpumask_empty(mask) ) -flush_mask(mask, FLUSH_TLB_GLOBAL); +flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL); } perfc_add(num_page_updates,