Re: [patch 09/10] KVM: MMU: out of sync shadow core v2
Avi Kivity wrote:

Marcelo Tosatti wrote:
>> I don't understand how the variables sp, child, and parent interact.
>> You either need recursion or an explicit stack?
>
> It restarts at the parent level whenever it finishes any children:
>
> +		if (i == PT64_ENT_PER_PAGE) {
> +			sp->unsync_children = 0;
> +			sp = parent;
> +		}
>
> No efficiency.

Oh okay.  'parent' is never assigned to.  Lack of concentration.

>>> Yes.  The next element for_each_entry_safe saved could have been
>>> zapped.
>>
>> Ouch.  Ouch.  I hate doing this.  Can see no alternative though.
>
> Me neither.

Well.  But I don't see kvm_mmu_zap_page()'s return value used anywhere.

Actually, I think I see an alternative: set the invalid flag on these
pages and queue them in a list, like we do for roots in use.  Flush the
list on some cleanup path.

>>> Windows 2008 64-bit has all sorts of sharing a pagetable at multiple
>>> levels too.
>>
>> We still want to allow oos for the two quadrants of a nonpae shadow
>> page.
>
> Sure, can be an optimization step later?

I'd like to reexamine this from another angle: what if we allow oos of
any level?  This will simplify the can_unsync path (always true) and
remove a special case.  The cost is implementing invlpg and resync for
non-leaf pages (invlpg has to resync the pte for every level).

Are there other problems with this?

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
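The unsafe-walk problem debated above (zapping one page may also zap the saved "next" element, so the walk must restart when kvm_mmu_zap_page() reports collateral zaps) can be modeled outside the kernel. This is a minimal sketch with hypothetical names (`oos_page`, `zap_page`, `zap_all_for_gfn`); the real code walks an hlist of struct kvm_mmu_page and restarts when kvm_mmu_zap_page() returns nonzero:

```c
#include <assert.h>
#include <stddef.h>

/* Toy singly linked list of shadow pages.  zap_page() may remove more
 * than the page asked for (an unsync child), which is what makes a
 * saved-next-pointer walk unsafe. */
struct oos_page {
	int gfn;
	int child_of_prev;	/* zapped together with its predecessor */
	struct oos_page *next;
};

static struct oos_page *oos_list;

/* Unlink sp; also unlink a directly following child page.  Returns
 * nonzero when pages other than sp were removed, mirroring the
 * kvm_mmu_zap_page() return value discussed in the thread. */
static int zap_page(struct oos_page *sp)
{
	struct oos_page **pp = &oos_list;
	int extra = 0;

	while (*pp && *pp != sp)
		pp = &(*pp)->next;
	if (!*pp)
		return 0;
	*pp = sp->next;
	if (*pp && (*pp)->child_of_prev) {
		*pp = (*pp)->next;	/* child goes away with its parent */
		extra = 1;
	}
	return extra;
}

/* Zap every page with a matching gfn; restart the walk whenever zap
 * reports that the saved next pointer may now be stale. */
static void zap_all_for_gfn(int gfn)
{
	struct oos_page *sp, *next;

restart:
	for (sp = oos_list; sp; sp = next) {
		next = sp->next;	/* "safe" iteration */
		if (sp->gfn == gfn && zap_page(sp))
			goto restart;
	}
}
```

The alternative Avi proposes (mark pages invalid and free them from a deferred list) trades this restart for an invalid-flag check on every lookup path, which is the trade-off Marcelo questions in his reply.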
Re: [patch 09/10] KVM: MMU: out of sync shadow core v2
On Tue, Sep 23, 2008 at 01:46:23PM +0300, Avi Kivity wrote:
> Marcelo Tosatti wrote:
>>> I don't understand how the variables sp, child, and parent interact.
>>> You either need recursion or an explicit stack?
>>
>> It restarts at the parent level whenever it finishes any children:
>>
>> +		if (i == PT64_ENT_PER_PAGE) {
>> +			sp->unsync_children = 0;
>> +			sp = parent;
>> +		}
>>
>> No efficiency.
>
> Oh okay.  'parent' is never assigned to.  Lack of concentration.
>
>>>> Yes.  The next element for_each_entry_safe saved could have been
>>>> zapped.
>>>
>>> Ouch.  Ouch.  I hate doing this.  Can see no alternative though.
>>
>> Me neither.
>
> Well.  But I don't see kvm_mmu_zap_page()'s return value used anywhere.

It is.  The list walk becomes unsafe otherwise.

> Actually, I think I see an alternative: set the invalid flag on these
> pages and queue them in a list, like we do for roots in use.  Flush
> the list on some cleanup path.

Yes, it is an alternative.  But then you would have to test for the
invalid flag on all those paths that currently test for the
kvm_mmu_zap_page return value.  I'm not sure if that's any better?

>>>> Windows 2008 64-bit has all sorts of sharing a pagetable at
>>>> multiple levels too.
>>>
>>> We still want to allow oos for the two quadrants of a nonpae shadow
>>> page.
>>
>> Sure, can be an optimization step later?
>
> I'd like to reexamine this from another angle: what if we allow oos of
> any level?  This will simplify the can_unsync path (always true)

The can_unsync flag is there to keep the resync path
(mmu_unsync_walk->kvm_sync_page) from unsyncing pages of the root being
synced.  Say, if at every resync you end up unsyncing two pages
(unlikely, but possible).  However, we can probably get rid of it with
the bitmap walk (which won't restart the walk from the beginning).

> and remove a special case.  The cost is implementing invlpg and resync
> for non-leaf pages (invlpg has to resync the pte for every level).
>
> Are there other problems with this?

There is no gfn cache for non-leaf pages, so you either need to
introduce it or go for gfn_to_page_atomic-like functionality
(expensive).
I was hoping to leave non-leaf unsync as another later optimization
step, if it's found to be worthwhile.
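The bitmap walk mentioned above (one that "won't restart the walk from the beginning") replaces scanning all 512 sptes with skipping directly between set bits of a per-page bitmap of unsync children. This is a hypothetical userspace sketch of that idea, roughly the shape that later became the unsync child bitmap in kvm_mmu_page; `__builtin_ctzll` is a GCC/Clang builtin standing in for the kernel's find-next-bit helpers:

```c
#include <assert.h>
#include <stdint.h>

#define ENT_PER_PAGE	512
#define BITS_PER_WORD	64
#define BITMAP_WORDS	(ENT_PER_PAGE / BITS_PER_WORD)

/* Find the next set bit at or after 'start'; ENT_PER_PAGE if none. */
static int next_unsync(const uint64_t bitmap[BITMAP_WORDS], int start)
{
	int i = start;

	while (i < ENT_PER_PAGE) {
		uint64_t word = bitmap[i / BITS_PER_WORD] >> (i % BITS_PER_WORD);

		if (!word) {	/* nothing left in this word: skip it */
			i = (i / BITS_PER_WORD + 1) * BITS_PER_WORD;
			continue;
		}
		return i + __builtin_ctzll(word);
	}
	return ENT_PER_PAGE;
}

/* Count unsync children by hopping between set bits instead of
 * scanning all 512 entries. */
int count_unsync(const uint64_t bitmap[BITMAP_WORDS])
{
	int n = 0;
	int i;

	for (i = next_unsync(bitmap, 0); i < ENT_PER_PAGE;
	     i = next_unsync(bitmap, i + 1))
		n++;
	return n;
}
```

With a sparse bitmap this visits only the set bits plus one word-skip per empty 64-entry region, instead of 512 iterations per restart.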
Re: [patch 09/10] KVM: MMU: out of sync shadow core v2
On Mon, Sep 22, 2008 at 11:41:14PM +0300, Avi Kivity wrote:
> Marcelo Tosatti wrote:
>>>> +	while (parent->unsync_children) {
>>>> +		for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>>>> +			u64 ent = sp->spt[i];
>>>> +
>>>> +			if (is_shadow_present_pte(ent)) {
>>>> +				struct kvm_mmu_page *child;
>>>> +				child = page_header(ent & PT64_BASE_ADDR_MASK);
>>>
>>> What does this do?
>>
>> Walks all children of the given page, with no efficiency.  It's
>> replaced later by the bitmap version.
>
> I don't understand how the variables sp, child, and parent interact.
> You either need recursion or an explicit stack?

It restarts at the parent level whenever it finishes any children:

+		if (i == PT64_ENT_PER_PAGE) {
+			sp->unsync_children = 0;
+			sp = parent;
+		}

No efficiency.

>> Yes.  The next element for_each_entry_safe saved could have been
>> zapped.
>
> Ouch.  Ouch.  I hate doing this.  Can see no alternative though.

Me neither.

>>>> +	/* don't unsync if pagetable is shadowed with multiple roles */
>>>> +	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
>>>> +		if (s->gfn != sp->gfn || s->role.metaphysical)
>>>> +			continue;
>>>> +		if (s->role.word != sp->role.word)
>>>> +			return 1;
>>>> +	}
>>>
>>> This will happen for nonpae paging.  But why not allow it?  Zap all
>>> unsynced pages on mode switch.
>>>
>>> Oh, if a page is both a page directory and page table, yes.
>>
>> Yes.  So to allow nonpae oos, check the level instead.
>>
>> Windows 2008 64-bit has all sorts of sharing a pagetable at multiple
>> levels too.
>
> We still want to allow oos for the two quadrants of a nonpae shadow
> page.

Sure, can be an optimization step later?
Re: [patch 09/10] KVM: MMU: out of sync shadow core v2
On Mon, Sep 22, 2008 at 06:55:03PM -0300, Marcelo Tosatti wrote:
> On Mon, Sep 22, 2008 at 11:41:14PM +0300, Avi Kivity wrote:
>> Marcelo Tosatti wrote:
>>>>> +	while (parent->unsync_children) {
>>>>> +		for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>>>>> +			u64 ent = sp->spt[i];
>>>>> +
>>>>> +			if (is_shadow_present_pte(ent)) {
>>>>> +				struct kvm_mmu_page *child;
>>>>> +				child = page_header(ent & PT64_BASE_ADDR_MASK);
>>>>
>>>> What does this do?
>>>
>>> Walks all children of the given page, with no efficiency.  It's
>>> replaced later by the bitmap version.
>>
>> I don't understand how the variables sp, child, and parent interact.
>> You either need recursion or an explicit stack?
>
> It restarts at the parent level whenever it finishes any children:
>
> +		if (i == PT64_ENT_PER_PAGE) {
> +			sp->unsync_children = 0;
> +			sp = parent;
> +		}
>
> No efficiency.

Do you prefer a recursive version for this one too?
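The iterative shape under discussion (descend to an interior child, and snap back to the parent after a complete scan finds nothing pending, so neither recursion nor an explicit stack is needed) can be sketched in plain C. The names, the 4-entry pages, and the `count_fn` helper are hypothetical stand-ins for the 512-entry shadow pages and kvm_sync_page:

```c
#include <assert.h>

#define ENTRIES 4

struct shadow_page {
	int unsync;		/* leaf with stale sptes */
	int unsync_children;	/* some page below is unsync */
	struct shadow_page *child[ENTRIES];
};

/* Visit every unsync descendant of parent, calling fn on each.  After
 * each descent completes, the walk snaps back to the top and rescans,
 * which is the "no efficiency" property the thread complains about. */
static int unsync_walk(struct shadow_page *parent,
		       int (*fn)(struct shadow_page *sp, void *priv),
		       void *priv)
{
	int i, ret;
	struct shadow_page *sp = parent;

	while (parent->unsync_children) {
		for (i = 0; i < ENTRIES; ++i) {
			struct shadow_page *child = sp->child[i];

			if (!child)
				continue;
			if (child->unsync_children) {
				sp = child;	/* descend one level */
				break;
			}
			if (child->unsync) {
				ret = fn(child, priv);
				if (ret)
					return ret;
				child->unsync = 0;	/* fn resynced it */
			}
		}
		if (i == ENTRIES) {
			/* a full scan found nothing pending below sp */
			sp->unsync_children = 0;
			sp = parent;
		}
	}
	return 0;
}

static int visited;

static int count_fn(struct shadow_page *sp, void *priv)
{
	(void)sp;
	(void)priv;
	++visited;
	return 0;
}
```

A recursive version would avoid re-scanning from the top after every descent, which is presumably why a recursive variant is offered in the follow-up.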
Re: [patch 09/10] KVM: MMU: out of sync shadow core v2
On Fri, Sep 19, 2008 at 06:22:52PM -0700, Avi Kivity wrote:
> Instead of private, have an object contain both callback and private
> data, and use container_of().  Reduces the chance of type errors.

OK.

>> +	while (parent->unsync_children) {
>> +		for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
>> +			u64 ent = sp->spt[i];
>> +
>> +			if (is_shadow_present_pte(ent)) {
>> +				struct kvm_mmu_page *child;
>> +				child = page_header(ent & PT64_BASE_ADDR_MASK);
>
> What does this do?

Walks all children of the given page, with no efficiency.  It's replaced
later by the bitmap version.

>> +static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>> +{
>> +	if (sp->role.glevels != vcpu->arch.mmu.root_level) {
>> +		kvm_mmu_zap_page(vcpu->kvm, sp);
>> +		return 1;
>> +	}
>
> Suppose we switch to real mode, touch a pte, switch back.  Is this
> handled?

The shadow page will go unsync on the pte touch and be resynced as soon
as it's visible (after the return to paging).  Or, while still in real
mode, it might be zapped by kvm_mmu_get_page->kvm_sync_page.  Am I
missing something?

>> @@ -991,8 +1066,18 @@ static struct kvm_mmu_page *kvm_mmu_get_
>>  		 gfn, role.word);
>>  	index = kvm_page_table_hashfn(gfn);
>>  	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
>> -	hlist_for_each_entry(sp, node, bucket, hash_link)
>> -		if (sp->gfn == gfn && sp->role.word == role.word) {
>> +	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
>> +		if (sp->gfn == gfn) {
>> +			if (sp->unsync)
>> +				if (kvm_sync_page(vcpu, sp))
>> +					continue;
>> +
>> +			if (sp->role.word != role.word)
>> +				continue;
>> +
>> +			if (sp->unsync_children)
>> +				vcpu->arch.mmu.need_root_sync = 1;
>
> mmu_reload() maybe?

Hum, will think about it.

>>  static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>> -	return 0;
>> +	return ret;
>>  }
>
> Why does the caller care if zap also zapped some other random pages?
> To restart walking the list?

Yes.  The next element for_each_entry_safe saved could have been zapped.
>> +	/* don't unsync if pagetable is shadowed with multiple roles */
>> +	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
>> +		if (s->gfn != sp->gfn || s->role.metaphysical)
>> +			continue;
>> +		if (s->role.word != sp->role.word)
>> +			return 1;
>> +	}
>
> This will happen for nonpae paging.  But why not allow it?  Zap all
> unsynced pages on mode switch.
>
> Oh, if a page is both a page directory and page table, yes.

Yes.

> So to allow nonpae oos, check the level instead.

Windows 2008 64-bit has all sorts of sharing a pagetable at multiple
levels too.
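The "multiple roles" check discussed in this hunk refuses to unsync a guest page whenever it is shadowed under more than one role (e.g. the same guest page used as both a page directory and a page table). A minimal userspace sketch, with an array standing in for the real hash bucket and hypothetical names (`multiply_shadowed`, `role_word`):

```c
#include <assert.h>
#include <stddef.h>

struct toy_shadow_page {
	unsigned gfn;		/* guest frame being shadowed */
	unsigned role_word;	/* packed role: level, access, quadrant... */
	int metaphysical;	/* shadows no guest pagetable; ignore it */
};

/* Return 1 (refuse to unsync) if sp's gfn is also shadowed under a
 * different role by some other page in the bucket. */
int multiply_shadowed(const struct toy_shadow_page *bucket, size_t n,
		      const struct toy_shadow_page *sp)
{
	size_t i;

	for (i = 0; i < n; i++) {
		const struct toy_shadow_page *s = &bucket[i];

		if (s->gfn != sp->gfn || s->metaphysical)
			continue;
		if (s->role_word != sp->role_word)
			return 1;
	}
	return 0;
}
```

Checking only the level, as suggested above, would relax this: two shadows differing solely in quadrant (the nonpae case) could still go out of sync, while a page shadowed at different levels would still be refused.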
Re: [patch 09/10] KVM: MMU: out of sync shadow core v2
Marcelo Tosatti wrote:
>  static struct kmem_cache *rmap_desc_cache;
> @@ -942,6 +943,39 @@ static void nonpaging_invlpg(struct kvm_
>  {
>  }
>
> +static int mmu_unsync_walk(struct kvm_mmu_page *parent, mmu_unsync_fn fn,
> +			   void *priv)

Instead of private, have an object contain both callback and private
data, and use container_of().  Reduces the chance of type errors.

> +{
> +	int i, ret;
> +	struct kvm_mmu_page *sp = parent;
> +
> +	while (parent->unsync_children) {
> +		for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
> +			u64 ent = sp->spt[i];
> +
> +			if (is_shadow_present_pte(ent)) {
> +				struct kvm_mmu_page *child;
> +				child = page_header(ent & PT64_BASE_ADDR_MASK);
> +
> +				if (child->unsync_children) {
> +					sp = child;
> +					break;
> +				}
> +				if (child->unsync) {
> +					ret = fn(child, priv);
> +					if (ret)
> +						return ret;
> +				}
> +			}
> +		}
> +		if (i == PT64_ENT_PER_PAGE) {
> +			sp->unsync_children = 0;
> +			sp = parent;
> +		}
> +	}
> +	return 0;
> +}

What does this do?

> +static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> +{
> +	if (sp->role.glevels != vcpu->arch.mmu.root_level) {
> +		kvm_mmu_zap_page(vcpu->kvm, sp);
> +		return 1;
> +	}

Suppose we switch to real mode, touch a pte, switch back.  Is this
handled?

> @@ -991,8 +1066,18 @@ static struct kvm_mmu_page *kvm_mmu_get_
>  		 gfn, role.word);
>  	index = kvm_page_table_hashfn(gfn);
>  	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
> -	hlist_for_each_entry(sp, node, bucket, hash_link)
> -		if (sp->gfn == gfn && sp->role.word == role.word) {
> +	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
> +		if (sp->gfn == gfn) {
> +			if (sp->unsync)
> +				if (kvm_sync_page(vcpu, sp))
> +					continue;
> +
> +			if (sp->role.word != role.word)
> +				continue;
> +
> +			if (sp->unsync_children)
> +				vcpu->arch.mmu.need_root_sync = 1;

mmu_reload() maybe?
>  static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
> +	int ret;
>  	++kvm->stat.mmu_shadow_zapped;
> +	ret = mmu_zap_unsync_children(kvm, sp);
>  	kvm_mmu_page_unlink_children(kvm, sp);
>  	kvm_mmu_unlink_parents(kvm, sp);
>  	kvm_flush_remote_tlbs(kvm);
>  	if (!sp->role.invalid && !sp->role.metaphysical)
>  		unaccount_shadowed(kvm, sp->gfn);
> +	if (sp->unsync)
> +		kvm_unlink_unsync_page(kvm, sp);
>  	if (!sp->root_count) {
>  		hlist_del(&sp->hash_link);
>  		kvm_mmu_free_page(kvm, sp);
> @@ -1129,7 +1245,7 @@ static int kvm_mmu_zap_page(struct kvm *
>  		kvm_reload_remote_mmus(kvm);
>  	}
>  	kvm_mmu_reset_last_pte_updated(kvm);
> -	return 0;
> +	return ret;
>  }

Why does the caller care if zap also zapped some other random pages?
To restart walking the list?

> +
> +static int kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> +{
> +	unsigned index;
> +	struct hlist_head *bucket;
> +	struct kvm_mmu_page *s;
> +	struct hlist_node *node, *n;
> +
> +	index = kvm_page_table_hashfn(sp->gfn);
> +	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
> +	/* don't unsync if pagetable is shadowed with multiple roles */
> +	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
> +		if (s->gfn != sp->gfn || s->role.metaphysical)
> +			continue;
> +		if (s->role.word != sp->role.word)
> +			return 1;
> +	}

This will happen for nonpae paging.  But why not allow it?  Zap all
unsynced pages on mode switch.

Oh, if a page is both a page directory and page table, yes.  So to allow
nonpae oos, check the level instead.

-- 
I have a truly marvellous patch that fixes the bug which this signature
is too narrow to contain.
[patch 09/10] KVM: MMU: out of sync shadow core v2
Allow guest pagetables to go out of sync.

Signed-off-by: Marcelo Tosatti [EMAIL PROTECTED]

Index: kvm/arch/x86/kvm/mmu.c
===================================================================
--- kvm.orig/arch/x86/kvm/mmu.c
+++ kvm/arch/x86/kvm/mmu.c
@@ -148,6 +148,7 @@ struct kvm_shadow_walk {
 };
 
 typedef int (*mmu_parent_walk_fn) (struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp);
+typedef int (*mmu_unsync_fn) (struct kvm_mmu_page *sp, void *priv);
 
 static struct kmem_cache *pte_chain_cache;
 static struct kmem_cache *rmap_desc_cache;
@@ -942,6 +943,39 @@ static void nonpaging_invlpg(struct kvm_
 {
 }
 
+static int mmu_unsync_walk(struct kvm_mmu_page *parent, mmu_unsync_fn fn,
+			   void *priv)
+{
+	int i, ret;
+	struct kvm_mmu_page *sp = parent;
+
+	while (parent->unsync_children) {
+		for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
+			u64 ent = sp->spt[i];
+
+			if (is_shadow_present_pte(ent)) {
+				struct kvm_mmu_page *child;
+				child = page_header(ent & PT64_BASE_ADDR_MASK);
+
+				if (child->unsync_children) {
+					sp = child;
+					break;
+				}
+				if (child->unsync) {
+					ret = fn(child, priv);
+					if (ret)
+						return ret;
+				}
+			}
+		}
+		if (i == PT64_ENT_PER_PAGE) {
+			sp->unsync_children = 0;
+			sp = parent;
+		}
+	}
+	return 0;
+}
+
 static struct kvm_mmu_page *kvm_mmu_lookup_page(struct kvm *kvm, gfn_t gfn)
 {
 	unsigned index;
@@ -962,6 +996,47 @@ static struct kvm_mmu_page *kvm_mmu_look
 	return NULL;
 }
 
+static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	WARN_ON(!sp->unsync);
+	sp->unsync = 0;
+	--kvm->stat.mmu_unsync;
+}
+
+static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+
+static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	if (sp->role.glevels != vcpu->arch.mmu.root_level) {
+		kvm_mmu_zap_page(vcpu->kvm, sp);
+		return 1;
+	}
+
+	rmap_write_protect(vcpu->kvm, sp->gfn);
+	if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
+		kvm_mmu_zap_page(vcpu->kvm, sp);
+		return 1;
+	}
+
+	kvm_mmu_flush_tlb(vcpu);
+	kvm_unlink_unsync_page(vcpu->kvm, sp);
+	return 0;
+}
+
+static int mmu_sync_fn(struct kvm_mmu_page *sp, void *priv)
+{
+	struct kvm_vcpu *vcpu = priv;
+
+	kvm_sync_page(vcpu, sp);
+	return (need_resched() || spin_needbreak(&vcpu->kvm->mmu_lock));
+}
+
+static void mmu_sync_children(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	while (mmu_unsync_walk(sp, mmu_sync_fn, vcpu))
+		cond_resched_lock(&vcpu->kvm->mmu_lock);
+}
+
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
@@ -975,7 +1050,7 @@ static struct kvm_mmu_page *kvm_mmu_get_
 	unsigned quadrant;
 	struct hlist_head *bucket;
 	struct kvm_mmu_page *sp;
-	struct hlist_node *node;
+	struct hlist_node *node, *tmp;
 
 	role.word = 0;
 	role.glevels = vcpu->arch.mmu.root_level;
@@ -991,8 +1066,18 @@ static struct kvm_mmu_page *kvm_mmu_get_
 		 gfn, role.word);
 	index = kvm_page_table_hashfn(gfn);
 	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
-	hlist_for_each_entry(sp, node, bucket, hash_link)
-		if (sp->gfn == gfn && sp->role.word == role.word) {
+	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
+		if (sp->gfn == gfn) {
+			if (sp->unsync)
+				if (kvm_sync_page(vcpu, sp))
+					continue;
+
+			if (sp->role.word != role.word)
+				continue;
+
+			if (sp->unsync_children)
+				vcpu->arch.mmu.need_root_sync = 1;
+
 			mmu_page_add_parent_pte(vcpu, sp, parent_pte);
 			pgprintk("%s: found\n", __func__);
 			return sp;
@@ -1112,14 +1197,45 @@ static void kvm_mmu_unlink_parents(struc
 	}
 }
 
+struct mmu_zap_walk {
+	struct kvm *kvm;
+	int zapped;
+};
+
+static int mmu_zap_fn(struct kvm_mmu_page *sp, void *private)
+{
+	struct mmu_zap_walk *zap_walk = private;
+
+	kvm_mmu_zap_page(zap_walk->kvm, sp);
+	zap_walk->zapped = 1;
+	return 0;
+}
+
+static int