Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/04/2016 04:45 PM, Xiao Guangrong wrote:
On 07/04/2016 04:41 PM, Neo Jia wrote:
On Mon, Jul 04, 2016 at 04:19:20PM +0800, Xiao Guangrong wrote:
On 07/04/2016 03:53 PM, Neo Jia wrote:
On Mon, Jul 04, 2016 at 03:37:35PM +0800, Xiao Guangrong wrote:
On 07/04/2016 03:03 PM, Neo Jia wrote:
On Mon, Jul 04, 2016 at 02:39:22PM +0800, Xiao Guangrong wrote:
On 06/30/2016 09:01 PM, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler then can use remap_pfn_range to place some non-reserved pages in the VMA.

Why does it require fetching the pfn when the fault is triggered rather than when mmap() is called?

Hi Guangrong, such mapping information between virtual mmio and physical mmio is only available at runtime.

Sorry, I do not know what the difference is between mmap() time and the time the VM actually accesses the memory in your case. Could you please give more detail?

Hi Guangrong, sure. mmap() gets called by QEMU or any VFIO API userspace consumer when setting up the virtual mmio; at that moment nobody has any knowledge about how the physical mmio gets virtualized. When the VM (or application, if we don't want to limit ourselves to VMM terminology) starts, the virtual and physical mmio get mapped by the mpci kernel module with help from the vendor-supplied mediated host driver, according to the hw resources assigned to this VM / application.

Thanks for your explanation. It sounds like a resource-allocation strategy: you delay the allocation until the VM really accesses it, right?

Yes, that is where the fault handler inside the mpci code comes into the picture.

I am not sure this strategy is good. The instance is created successfully and started successfully, but then the VM crashes because the resources of that instance are not enough. That sounds unreasonable. Especially, you cannot squeeze this kind of memory to balance the usage between all VMs. Does this strategy still make sense?
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/04/2016 04:41 PM, Neo Jia wrote:
On Mon, Jul 04, 2016 at 04:19:20PM +0800, Xiao Guangrong wrote:
On 07/04/2016 03:53 PM, Neo Jia wrote:
On Mon, Jul 04, 2016 at 03:37:35PM +0800, Xiao Guangrong wrote:
On 07/04/2016 03:03 PM, Neo Jia wrote:
On Mon, Jul 04, 2016 at 02:39:22PM +0800, Xiao Guangrong wrote:
On 06/30/2016 09:01 PM, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler then can use remap_pfn_range to place some non-reserved pages in the VMA.

Why does it require fetching the pfn when the fault is triggered rather than when mmap() is called?

Hi Guangrong, such mapping information between virtual mmio and physical mmio is only available at runtime.

Sorry, I do not know what the difference is between mmap() time and the time the VM actually accesses the memory in your case. Could you please give more detail?

Hi Guangrong, sure. mmap() gets called by QEMU or any VFIO API userspace consumer when setting up the virtual mmio; at that moment nobody has any knowledge about how the physical mmio gets virtualized. When the VM (or application, if we don't want to limit ourselves to VMM terminology) starts, the virtual and physical mmio get mapped by the mpci kernel module with help from the vendor-supplied mediated host driver, according to the hw resources assigned to this VM / application.

Thanks for your explanation. It sounds like a resource-allocation strategy: you delay the allocation until the VM really accesses it, right?

Yes, that is where the fault handler inside the mpci code comes into the picture.

I am not sure this strategy is good. The instance is created successfully and started successfully, but then the VM crashes because the resources of that instance are not enough. That sounds unreasonable.
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/04/2016 04:14 PM, Paolo Bonzini wrote:
On 04/07/2016 09:59, Xiao Guangrong wrote:

But apart from this, it's much more obvious to consider the refcount. The x86 MMU code doesn't care if the page is reserved or not; mmu_set_spte does a kvm_release_pfn_clean, hence it makes sense for hva_to_pfn_remapped to try doing a get_page (via kvm_get_pfn) after invoking the fault handler, just like the get_user_pages family of functions does.

Well, it's a little strange that you always try to get a refcount for a PFNMAP region without MIXEDMAP, which indicates that no memory in this region has a 'struct page' backing.

Fair enough, I can modify the comment.

/*
 * In case the VMA has VM_MIXEDMAP set, whoever called remap_pfn_range
 * is also going to call e.g. unmap_mapping_range before the underlying
 * non-reserved pages are freed, which will then call our MMU notifier.
 * We still have to get a reference here to the page, because the callers
 * of *hva_to_pfn* and *gfn_to_pfn* ultimately end up doing a
 * kvm_release_pfn_clean on the returned pfn.  If the pfn is
 * reserved, the kvm_get_pfn/kvm_release_pfn_clean pair will simply
 * do nothing.
 */

Excellent. I like it. :)
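The behaviour the comment relies on can be illustrated with a small stand-alone model (a toy sketch, not the real kernel code; the `toy_*` names are made up here): taking and dropping a reference is a matched pair that degenerates to a no-op when the pfn is reserved, which is why the unconditional get/release in the KVM path is harmless.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a pfn-backed page: refcounted unless reserved. */
struct toy_page {
	bool reserved;
	int refcount;
};

/* Mimics the kvm_get_pfn idea: taking a reference is a no-op for
 * reserved pfns, which have no 'struct page' refcount to manage. */
static void toy_get_pfn(struct toy_page *p)
{
	if (!p->reserved)
		p->refcount++;
}

/* Mimics kvm_release_pfn_clean: dropping the reference is likewise
 * a no-op for reserved pfns, so the pair always balances. */
static void toy_release_pfn_clean(struct toy_page *p)
{
	if (!p->reserved)
		p->refcount--;
}
```

Running the pair on a reserved page leaves the count untouched, while a normal page sees a balanced get/put, matching the "will simply do nothing" wording of the comment.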
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/04/2016 03:53 PM, Neo Jia wrote:
On Mon, Jul 04, 2016 at 03:37:35PM +0800, Xiao Guangrong wrote:
On 07/04/2016 03:03 PM, Neo Jia wrote:
On Mon, Jul 04, 2016 at 02:39:22PM +0800, Xiao Guangrong wrote:
On 06/30/2016 09:01 PM, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler then can use remap_pfn_range to place some non-reserved pages in the VMA.

Why does it require fetching the pfn when the fault is triggered rather than when mmap() is called?

Hi Guangrong, such mapping information between virtual mmio and physical mmio is only available at runtime.

Sorry, I do not know what the difference is between mmap() time and the time the VM actually accesses the memory in your case. Could you please give more detail?

Hi Guangrong, sure. mmap() gets called by QEMU or any VFIO API userspace consumer when setting up the virtual mmio; at that moment nobody has any knowledge about how the physical mmio gets virtualized. When the VM (or application, if we don't want to limit ourselves to VMM terminology) starts, the virtual and physical mmio get mapped by the mpci kernel module with help from the vendor-supplied mediated host driver, according to the hw resources assigned to this VM / application.

Thanks for your explanation. It sounds like a resource-allocation strategy: you delay the allocation until the VM really accesses it, right?
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/04/2016 03:48 PM, Paolo Bonzini wrote:
On 04/07/2016 09:37, Xiao Guangrong wrote:

It actually is a portion of the physical mmio which is set by vfio mmap. So I do not think we need to care about its refcount, i.e., we can consider it a reserved pfn, Paolo?

nVidia provided me (offlist) with a simple patch that modified VFIO to exhibit the problem, and it didn't use reserved PFNs. This is why the commit message for the patch is not entirely accurate.

It's clear now.

But apart from this, it's much more obvious to consider the refcount. The x86 MMU code doesn't care if the page is reserved or not; mmu_set_spte does a kvm_release_pfn_clean, hence it makes sense for hva_to_pfn_remapped to try doing a get_page (via kvm_get_pfn) after invoking the fault handler, just like the get_user_pages family of functions does.

Well, it's a little strange that you always try to get a refcount for a PFNMAP region without MIXEDMAP, which indicates that no memory in this region has a 'struct page' backing. But it works, as kvm_{get, release}_* are already aware of reserved pfns, so I am okay with it.
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/04/2016 03:38 PM, Paolo Bonzini wrote:
On 04/07/2016 08:39, Xiao Guangrong wrote:

Why is the memory mapped by this mmap() not a portion of MMIO from the underlying physical device? If it is valid system memory, does this interface really need to be implemented in vfio? (You at least need to set VM_MIXEDMAP if it mixes system memory with MMIO.)

The KVM code does not care whether VM_MIXEDMAP is set or not; it works in either case.

Yes, it is. I mean nVidia's vfio patchset should use VM_MIXEDMAP if the memory is mixed. :)
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 07/04/2016 03:03 PM, Neo Jia wrote:
On Mon, Jul 04, 2016 at 02:39:22PM +0800, Xiao Guangrong wrote:
On 06/30/2016 09:01 PM, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler then can use remap_pfn_range to place some non-reserved pages in the VMA.

Why does it require fetching the pfn when the fault is triggered rather than when mmap() is called?

Hi Guangrong, such mapping information between virtual mmio and physical mmio is only available at runtime.

Sorry, I do not know what the difference is between mmap() time and the time the VM actually accesses the memory in your case. Could you please give more detail? Why is the memory mapped by this mmap() not a portion of MMIO from the underlying physical device? If it is valid system memory, does this interface really need to be implemented in vfio? (You at least need to set VM_MIXEDMAP if it mixes system memory with MMIO.) It actually is a portion of the physical mmio which is set by vfio mmap. So I do not think we need to care about its refcount, i.e., we can consider it a reserved pfn, Paolo? IIUC, the kernel assumes that a VM_PFNMAP region covers physically continuous memory, e.g., as current KVM and vaddr_get_pfn() in vfio assume, but it seems nVidia's patchset breaks this semantic, as ops->validate_map_request() can adjust the physical address arbitrarily. (Again, the name 'validate' should be changed to match what it is really doing.)

The vgpu api will allow you to adjust the target mmio address and the size via validate_map_request, but it is still physically contiguous as <start_pfn, size>.

Okay, the interface confused us; maybe this interface needs to be reworked to reflect that fact. Thanks!
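The contiguity guarantee described above is what keeps the fault path simple. A hypothetical sketch (the names and signature are illustrative, not the real vgpu API): once validate_map_request has produced a single physically contiguous range <start_pfn, npages>, resolving any faulting page offset is a linear translation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: map a page offset inside a validated, physically
 * contiguous range <start_pfn, npages> to a host pfn. Because the range
 * is contiguous, no per-page lookup table is needed. Returns -1 for an
 * offset outside the validated range. */
static int64_t fault_pgoff_to_pfn(uint64_t start_pfn, uint64_t npages,
				  uint64_t pgoff)
{
	if (pgoff >= npages)
		return -1;	/* outside the validated range */
	return (int64_t)(start_pfn + pgoff);
}
```

If validate_map_request could instead return a scattered set of pfns, every fault would need a lookup structure, which is the semantic break Xiao was worried about.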
Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed
On 06/30/2016 09:01 PM, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler then can use remap_pfn_range to place some non-reserved pages in the VMA.

Why does it require fetching the pfn when the fault is triggered rather than when mmap() is called?

Why is the memory mapped by this mmap() not a portion of MMIO from the underlying physical device? If it is valid system memory, does this interface really need to be implemented in vfio? (You at least need to set VM_MIXEDMAP if it mixes system memory with MMIO.)

IIUC, the kernel assumes that a VM_PFNMAP region covers physically continuous memory, e.g., as current KVM and vaddr_get_pfn() in vfio assume, but it seems nVidia's patchset breaks this semantic, as ops->validate_map_request() can adjust the physical address arbitrarily. (Again, the name 'validate' should be changed to match what it is really doing.)
Re: [PATCH 3/5] mmu: don't set the present bit unconditionally
On 06/29/2016 04:18 PM, Paolo Bonzini wrote:
On 29/06/2016 05:17, Xiao Guangrong wrote:

+++ b/arch/x86/kvm/mmu.c
@@ -2516,13 +2516,17 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn, kvm_pfn_t pfn, bool speculative, bool can_unsync, bool host_writable)
 {
-	u64 spte;
+	u64 spte = 0;
 	int ret = 0;
+	struct kvm_mmu *context = &vcpu->arch.mmu;
+	bool execonly = !(context->guest_rsvd_check.bad_mt_xwr &
+			  (1ull << VMX_EPT_EXECUTABLE_MASK));

Could we introduce a new field, say execonly, to "struct kvm_mmu"? That would make the code clearer.

Given how execonly is used, let's add shadow_present_mask instead.

Yup, that is better.
Re: [PATCH 3/5] mmu: don't set the present bit unconditionally
On 06/28/2016 12:32 PM, Bandan Das wrote:

To support execute-only mappings on behalf of L1 hypervisors, we teach set_spte() to honor L1's valid XWR bits. This is only if the host supports EPT execute-only. Reuse ACC_USER_MASK to signify whether the L1 hypervisor has the R bit set.

Signed-off-by: Bandan Das
---
 arch/x86/kvm/mmu.c         | 9 +++--
 arch/x86/kvm/paging_tmpl.h | 2 +-
 arch/x86/kvm/vmx.c         | 2 +-
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 875d4f7..ee2fb16 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2516,13 +2516,17 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn, kvm_pfn_t pfn, bool speculative, bool can_unsync, bool host_writable)
 {
-	u64 spte;
+	u64 spte = 0;
 	int ret = 0;
+	struct kvm_mmu *context = &vcpu->arch.mmu;
+	bool execonly = !(context->guest_rsvd_check.bad_mt_xwr &
+			  (1ull << VMX_EPT_EXECUTABLE_MASK));

Could we introduce a new field, say execonly, to "struct kvm_mmu"? That would make the code clearer.
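For readers puzzling over the `bad_mt_xwr` test in the hunk above, a small sketch of the idea may help (illustrative only; it assumes the KVM convention that `bad_mt_xwr` has one bit per low-bits combination of an EPT PTE, with bit 0 = R, bit 1 = W, bit 2 = X as the index, and a set bit marking the combination reserved). Execute-only is X=1, W=0, R=0, i.e. index 4, which is why the patch can reuse VMX_EPT_EXECUTABLE_MASK (the value 4) as the shift count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The X permission bit of an EPT PTE; its numeric value, 4, doubles as
 * the bad_mt_xwr index of the execute-only (X=1, W=0, R=0) combination. */
#define VMX_EPT_EXECUTABLE_MASK 4

/* Execute-only is supported exactly when bad_mt_xwr does NOT mark the
 * X-only combination (index 4) as reserved. Sketch of the patch's test. */
static bool ept_execonly_supported(uint64_t bad_mt_xwr)
{
	return !(bad_mt_xwr & (1ULL << VMX_EPT_EXECUTABLE_MASK));
}
```

So a host that lists X-only among the bad XWR combinations reports execonly as unsupported, and the guest's execute-only PTEs must then be mapped with R set as well.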
Re: [PATCH 2/5] mmu: pass execonly value when initializing rsvd bits
On 06/28/2016 12:32 PM, Bandan Das wrote:

In reset_tdp_shadow_zero_bits_mask, we always pass false when initializing the reserved bits. By initializing with the correct value of ept exec-only, the host can correctly identify whether the guest pte is valid. Note that kvm_init_shadow_ept_mmu() already knows about execonly.

Signed-off-by: Bandan Das
---
 arch/x86/kvm/mmu.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a50af79..875d4f7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3831,23 +3831,27 @@ static inline bool boot_cpu_is_amd(void)
 /*
  * the direct page table on host, use as much mmu features as
- * possible, however, kvm currently does not do execution-protection.
+ * possible
  */
 static void reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
 {

It is not necessary. reset_tdp_shadow_zero_bits_mask() is used for guests without nested EPT; the host never actively sets shadow execute-only. For nested EPT guests, kvm_init_shadow_ept_mmu() already handles the execute-only case perfectly.
Re: [PATCH 1/5] mmu: mark spte present if the x bit is set
On 06/29/2016 04:49 AM, Paolo Bonzini wrote:
On 28/06/2016 22:37, Bandan Das wrote:
Paolo Bonzini writes:
On 28/06/2016 19:33, Bandan Das wrote:

static int is_shadow_present_pte(u64 pte)
{
-	return pte & PT_PRESENT_MASK && !is_mmio_spte(pte);
+	return pte & (PT_PRESENT_MASK | shadow_x_mask) &&
+		!is_mmio_spte(pte);

This should really be pte & 7 when using EPT. But this is okay as an alternative to a new shadow_present_mask.

I could revive shadow_xonly_valid probably... Anyway, for now I will add a TODO comment here.

It's okay to do it like this, because the only invalid PTEs reaching this point are those that is_mmio_spte filters away. Hence you'll never get -W- PTEs here, and pte & 7 is really the same as how you wrote it. It's pretty clever, and doesn't need a TODO at all. :)

Thanks, understood. So, the way it is written now covers all cases for pte & 7. Let's still add a comment - clever things are usually confusing to many!

I think another way to write it is "(pte & 0xffffffffull) && !is_mmio_spte(pte)", since non-present/non-MMIO SPTEs never use bits 1..31 (they can have non-zero bits 32..63 on 32-bit CPUs, where we don't update the PTEs atomically). Guangrong, what do you prefer?

I think the way you suggested is better. :)
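Paolo's "never get -W- PTEs here" argument can be checked exhaustively with a small stand-alone sketch (illustrative names, not the KVM macros): with EPT, bit 0 is R (what PT_PRESENT_MASK tests) and bit 2 is X (shadow_x_mask), and write-only is the only permission combination on which the patch's check and the stricter "pte & 7" form disagree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* EPT permission bits (illustrative names). */
#define EPT_R 0x1ULL
#define EPT_W 0x2ULL
#define EPT_X 0x4ULL

/* The patch's form: pte & (PT_PRESENT_MASK | shadow_x_mask). */
static bool present_rx(uint64_t pte)
{
	return (pte & (EPT_R | EPT_X)) != 0;
}

/* The "pte & 7" form: present if any of R/W/X is set. */
static bool present_rwx(uint64_t pte)
{
	return (pte & (EPT_R | EPT_W | EPT_X)) != 0;
}
```

Enumerating all eight R/W/X combinations shows the two predicates differ only on W-only (binary 010), which is not a valid EPT combination and is filtered out before is_shadow_present_pte runs; hence the two forms are interchangeable in practice.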
Re: [PATCH 1/4] KVM: MMU: fix permission_fault()
On 04/06/2016 04:56 PM, Paolo Bonzini wrote:
On 25/03/2016 14:19, Xiao Guangrong wrote:

@@ -193,11 +193,11 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));

One more tweak is needed in the line above; pfec - 1 must become pfec & ~1, because you've removed the pfec |= PFERR_PRESENT_MASK; line. Applied with this change.

Yes, indeed. Thanks for your fix.
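The tweak rests on a small bit identity: once `pfec |= PFERR_PRESENT_MASK` has forced bit 0 on, subtracting 1 clears exactly that bit, so `(pfec | 1) - 1` and `pfec & ~1` are the same value for every pfec. A sketch of the identity (not the real permission_fault code):

```c
#include <assert.h>
#include <stdint.h>

#define PFERR_PRESENT_MASK 1u

/* Old shape of the computation: OR in the present bit, then rely on
 * "- 1" to clear it again. Only correct because the OR guarantees
 * bit 0 is set before the subtraction. */
static uint32_t clear_present_old(uint32_t pfec)
{
	return (pfec | PFERR_PRESENT_MASK) - 1;
}

/* New shape: with the OR removed from the caller, the present bit
 * must be cleared explicitly by masking. */
static uint32_t clear_present_new(uint32_t pfec)
{
	return pfec & ~PFERR_PRESENT_MASK;
}
```

Without the OR, a bare `pfec - 1` on an even pfec would borrow into the higher error-code bits instead of clearing bit 0, which is the bug the tweak avoids.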
Re: [PATCH 1/4] KVM: MMU: fix permission_fault()
On 03/30/2016 02:39 PM, Xiao Guangrong wrote:
On 03/30/2016 02:36 PM, Paolo Bonzini wrote:
On 30/03/2016 03:56, Xiao Guangrong wrote:

x86/access.flat is currently using the "other" definition, i.e., PFEC.PK is only set if W=1 or CR0.WP=0 && PFEC.U=0 or PFEC.W=0. Can you use it (with ept=1 of course) to check what the processor is doing?

Sure. And with ept=1 it is hard to trigger MMU issues; I am enabling PKEY on the shadow MMU, let's see what will happen. ;)

No, don't do that! ept=1 lets you test what the processor does. It means you cannot test permission_fault(), but what we want here is just reverse engineering the microcode. ept=1 lets you do exactly that.

Yes, I got the point. Huaitong will do the test once the machine is free.

I tested it and it failed:

test pte.p pte.user pde.p pde.user pde.a pde.pse pkru.wd pkey=1 user write efer.nx cr4.pke: FAIL: error code 27 expected 7
Dump mapping: address: 0x1234
--L4: 2ebe007
--L3: 2ebf007
--L2: 80002a5

So PFEC.PKEY is set even though the ordinary check failed (caused by pde.rw = 0); the kvm code is right. :)
Re: [PATCH 1/4] KVM: MMU: fix permission_fault()
On 03/30/2016 02:36 PM, Paolo Bonzini wrote: On 30/03/2016 03:56, Xiao Guangrong wrote: x86/access.flat is currently using the "other" definition, i.e., PFEC.PK is only set if W=1 or CR0.WP=0 && PFEC.U=0 or PFEC.W=0. Can you use it (with ept=1 of course) to check what the processor is doing? Sure. And since ept=1 can hardly trigger the MMU issue, I am enabling PKEY on shadow MMU; let's see what will happen. ;) No, don't do that! ept=1 lets you test what the processor does. It means you cannot test permission_fault(), but what we want here is just reverse engineering the microcode. ept=1 lets you do exactly that. Yes, I got this point. Huaitong will do the test once the machine gets free.
Re: [PATCH 1/4] KVM: MMU: fix permission_fault()
On 03/30/2016 04:09 AM, Paolo Bonzini wrote: On 29/03/2016 19:43, Xiao Guangrong wrote: Based on the SDM: PK flag (bit 5). This flag is 1 if (1) IA32_EFER.LMA = CR4.PKE = 1; (2) the access causing the page-fault exception was a data access; (3) the linear address was a user-mode address with protection key i; and (4) the PKRU register (see Section 4.6.2) is such that either (a) ADi = 1; or (b) the following all hold: (i) WDi = 1; (ii) the access is a write access; and (iii) either CR0.WP = 1 or the access causing the page-fault exception was a user-mode access. So I think the PKEY check and the ordinary check are independent, i.e., PFEC.PKEY may be set even if the permission on the page table is not adequate. x86/access.flat is currently using the "other" definition, i.e., PFEC.PK is only set if W=1 or CR0.WP=0 && PFEC.U=0 or PFEC.W=0. Can you use it (with ept=1 of course) to check what the processor is doing? Sure. And since ept=1 can hardly trigger the MMU issue, I am enabling PKEY on shadow MMU; let's see what will happen. ;)
Re: [PATCH 1/4] KVM: MMU: fix permission_fault()
On 03/25/2016 10:21 PM, Paolo Bonzini wrote: On 25/03/2016 14:19, Xiao Guangrong wrote: WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK)); - pfec |= PFERR_PRESENT_MASK; + errcode = PFERR_PRESENT_MASK; if (unlikely(mmu->pkru_mask)) { u32 pkru_bits, offset; @@ -193,11 +193,11 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, ((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT)); pkru_bits &= mmu->pkru_mask >> offset; - pfec |= -pkru_bits & PFERR_PK_MASK; + errcode |= -pkru_bits & PFERR_PK_MASK; fault |= (pkru_bits != 0); } - return -(uint32_t)fault & pfec; + return -(uint32_t)fault & errcode; } I have another doubt here. If you get a fault due to U=0, you would not get PFERR_PK_MASK. This is checked implicitly through the pte_user bit which we moved to PFERR_RSVD_BIT. However, if you get a fault due to W=0 _and_ PKRU.AD=1 or PKRU.WD=1 for the page's protection key, would the PK bit be set in the error code? If not, we would need something like this: Based on the SDM: PK flag (bit 5). This flag is 1 if (1) IA32_EFER.LMA = CR4.PKE = 1; (2) the access causing the page-fault exception was a data access; (3) the linear address was a user-mode address with protection key i; and (4) the PKRU register (see Section 4.6.2) is such that either (a) ADi = 1; or (b) the following all hold: (i) WDi = 1; (ii) the access is a write access; and (iii) either CR0.WP = 1 or the access causing the page-fault exception was a user-mode access. So I think the PKEY check and the ordinary check are independent, i.e., PFEC.PKEY may be set even if the permission on the page table is not adequate.
Re: [PATCH 3/4] KVM: MMU: reduce the size of mmu_page_path
On 03/25/2016 09:56 PM, Paolo Bonzini wrote: On 25/03/2016 14:48, Xiao Guangrong wrote: This patch and the previous one are basically redoing commit 0a47cd85833e ("KVM: MMU: Fix ubsan warnings", 2016-03-04). While you find your version easier to understand, I of course find mine easier. Rather than getting stuck in a ko fight, the solution is to stick with the code in KVM and add comments. I'll give it a try... If you do not like this one, we can just make .idx [PT64_ROOT_LEVEL - 1] and keep the sentinel in .parents[], a small change that keeps the code shape nice. I suppose you'd have something like this then: diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 70e95d097ef1..15e1735a2e3a 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1980,7 +1980,7 @@ static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, struct mmu_page_path { struct kvm_mmu_page *parent[PT64_ROOT_LEVEL]; - unsigned int idx[PT64_ROOT_LEVEL]; + unsigned int idx[PT64_ROOT_LEVEL-1]; }; #define for_each_sp(pvec, sp, parents, i) \ @@ -2037,13 +2037,14 @@ static void mmu_pages_clear_parents(struct mmu_page_path *parents) { struct kvm_mmu_page *sp; unsigned int level = 0; + unsigned int idx; do { - unsigned int idx = parents->idx[level]; sp = parents->parent[level]; - if (!sp) + if (!sp || WARN_ON(level == PT64_ROOT_LEVEL-1)) return; + idx = parents->idx[level]; WARN_ON(idx == INVALID_INDEX); clear_unsync_child_bit(sp, idx); level++; Yes, exactly. [ actually, we can keep mmu_pages_clear_parents() unchanged ] By making the arrays the same size, the effect of the sentinel seems clearer to me. It doesn't seem worth 4 bytes (and strictly speaking those 4 bytes would be there anyway due to padding)... The sentinel is NULL forever so it cannot enter the inner loop anyway... Okay, I don't have a strong opinion on it; it is not a big deal. Let's happily drop it if you really dislike it. :)
Re: [PATCH 3/4] KVM: MMU: reduce the size of mmu_page_path
On 03/25/2016 09:45 PM, Paolo Bonzini wrote: On 25/03/2016 14:19, Xiao Guangrong wrote: Currently only PT64_ROOT_LEVEL - 1 levels are used; one additional entry in .parent[] is used as a sentinel and the additional entry in .idx[] is purely wasted. This patch reduces its size and sets the sentinel on the upper level of the place where we start from. This patch and the previous one are basically redoing commit 0a47cd85833e ("KVM: MMU: Fix ubsan warnings", 2016-03-04). While you find your version easier to understand, I of course find mine easier. Rather than getting stuck in a ko fight, the solution is to stick with the code in KVM and add comments. I'll give it a try... If you do not like this one, we can just make .idx [PT64_ROOT_LEVEL - 1] and keep the sentinel in .parents[], a small change that keeps the code shape nice.
Re: [PATCH 1/4] KVM: MMU: fix permission_fault()
On 03/25/2016 09:35 PM, Paolo Bonzini wrote: On 25/03/2016 14:19, Xiao Guangrong wrote: kvm-unit-tests complained that the PFEC is not set properly, e.g.: test pte.rw pte.d pte.nx pde.p pde.rw pde.pse user fetch: FAIL: error code 15 expected 5 Dump mapping: address: 0x1234 --L4: 3e95007 --L3: 3e96007 --L2: 283 What's the command line for the reproducer? QEMU=/home/eric/qemu/x86_64-softmmu/qemu-system-x86_64 ./x86-run x86/access.flat diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index b70df72..81bffd1 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -154,7 +154,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned pfec) { int cpl = kvm_x86_ops->get_cpl(vcpu); - unsigned long rflags = kvm_x86_ops->get_rflags(vcpu); + unsigned long errcode, rflags = kvm_x86_ops->get_rflags(vcpu); /* * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1. @@ -175,7 +175,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, bool fault = (mmu->permissions[index] >> pte_access) & 1; WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK)); - pfec |= PFERR_PRESENT_MASK; + errcode = PFERR_PRESENT_MASK; So is this patch doing the same as "KVM: MMU: precompute page fault error code"? It was necessary after all. :) Sorry for my mistake... I missed the logic you changed :( I still prefer calculating the error code on the fault path, which is rare, or finding a way to encapsulate it in permission_fault()...
[PATCH 3/4] KVM: MMU: reduce the size of mmu_page_path
Currently only PT64_ROOT_LEVEL - 1 levels are used; one additional entry
in .parent[] is used as a sentinel and the additional entry in .idx[] is
purely wasted.

This patch reduces its size and sets the sentinel on the upper level of
the place where we start from.

Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 arch/x86/kvm/mmu.c | 32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e273144..c396e8b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1984,12 +1984,12 @@ static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
 }
 
 struct mmu_page_path {
-	struct kvm_mmu_page *parent[PT64_ROOT_LEVEL];
-	unsigned int idx[PT64_ROOT_LEVEL];
+	struct kvm_mmu_page *parent[PT64_ROOT_LEVEL - 1];
+	unsigned int idx[PT64_ROOT_LEVEL - 1];
 };
 
 #define for_each_sp(pvec, sp, parents, i)			\
-	for (i = mmu_pages_first(&pvec, &parents);		\
+	for (i = mmu_pages_next(&pvec, &parents, -1);		\
 		i < pvec.nr && ({ sp = pvec.page[i].sp; 1;});	\
 			i = mmu_pages_next(&pvec, &parents, i))
 
@@ -2016,25 +2016,15 @@ static int mmu_pages_next(struct kvm_mmu_pages *pvec,
 	return n;
 }
 
-static int mmu_pages_first(struct kvm_mmu_pages *pvec,
-			   struct mmu_page_path *parents)
+static void
+mmu_pages_init(struct mmu_page_path *parents, struct kvm_mmu_page *parent)
 {
-	struct kvm_mmu_page *sp;
-	int level;
-
-	if (pvec->nr == 0)
-		return 0;
-
-	sp = pvec->page[0].sp;
-	level = sp->role.level;
-	WARN_ON(level == PT_PAGE_TABLE_LEVEL);
-
 	/*
-	 * Also set up a sentinel. Further entries in pvec are all
-	 * children of sp, so this element is never overwritten.
+	 * set up a sentinel. Further entries in pvec are all children of
+	 * sp, so this element is never overwritten.
 	 */
-	parents->parent[level - 1] = NULL;
-	return mmu_pages_next(pvec, parents, -1);
+	if (parent->role.level < PT64_ROOT_LEVEL)
+		parents->parent[parent->role.level - 1] = NULL;
 }
 
 static void mmu_pages_clear_parents(struct mmu_page_path *parents)
@@ -2051,7 +2041,7 @@ static void mmu_pages_clear_parents(struct mmu_page_path *parents)
 		WARN_ON(idx == INVALID_INDEX);
 		clear_unsync_child_bit(sp, idx);
 		level++;
-	} while (!sp->unsync_children);
+	} while (!sp->unsync_children && (level < PT64_ROOT_LEVEL - 1));
 }
 
 static void mmu_sync_children(struct kvm_vcpu *vcpu,
@@ -2064,6 +2054,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 	LIST_HEAD(invalid_list);
 	bool flush = false;
 
+	mmu_pages_init(&parents, parent);
 	while (mmu_unsync_walk(parent, &pages)) {
 		bool protected = false;
 
@@ -2335,6 +2326,7 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
 	if (parent->role.level == PT_PAGE_TABLE_LEVEL)
 		return 0;
 
+	mmu_pages_init(&parents, parent);
 	while (mmu_unsync_walk(parent, &pages)) {
 		struct kvm_mmu_page *sp;
-- 
1.8.3.1
[PATCH 2/4] KVM: MMU: simplify the logic of __mmu_unsync_walk()
Each time I looked into the logic of walking unsync shadow pages, it cost
a lot of time to understand what it is doing. The trick of this logic is
that the item (sp and idx) saved to kvm_mmu_pages is the sp and the index
in the _parent_ level, and it lacks any comment to explain this fact.

This patch simplifies it by saving the sp and its own index to
kvm_mmu_pages; then it is much easier to understand the operations on
the index.

Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 arch/x86/kvm/mmu.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6bdfbc2..e273144 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1830,6 +1830,8 @@ static inline void clear_unsync_child_bit(struct kvm_mmu_page *sp, int idx)
 	__clear_bit(idx, sp->unsync_child_bitmap);
 }
 
+#define INVALID_INDEX (-1)
+
 static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
 			     struct kvm_mmu_pages *pvec)
 {
@@ -1846,10 +1848,10 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
 
 		child = page_header(ent & PT64_BASE_ADDR_MASK);
 
-		if (child->unsync_children) {
-			if (mmu_pages_add(pvec, child, i))
-				return -ENOSPC;
+		if (mmu_pages_add(pvec, sp, i))
+			return -ENOSPC;
 
+		if (child->unsync_children) {
 			ret = __mmu_unsync_walk(child, pvec);
 			if (!ret) {
 				clear_unsync_child_bit(sp, i);
@@ -1860,7 +1862,13 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
 				return ret;
 		} else if (child->unsync) {
 			nr_unsync_leaf++;
-			if (mmu_pages_add(pvec, child, i))
+
+			/*
+			 * the unsync is on the last level so its 'idx' is
+			 * useless, we set it to INVALID_INDEX to catch
+			 * potential bugs.
+			 */
+			if (mmu_pages_add(pvec, child, INVALID_INDEX))
 				return -ENOSPC;
 		} else
 			clear_unsync_child_bit(sp, i);
@@ -1869,8 +1877,6 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
 	return nr_unsync_leaf;
 }
 
-#define INVALID_INDEX (-1)
-
 static int mmu_unsync_walk(struct kvm_mmu_page *sp,
 			   struct kvm_mmu_pages *pvec)
 {
@@ -1878,7 +1884,6 @@ static int mmu_unsync_walk(struct kvm_mmu_page *sp,
 	if (!sp->unsync_children)
 		return 0;
 
-	mmu_pages_add(pvec, sp, INVALID_INDEX);
 	return __mmu_unsync_walk(sp, pvec);
 }
 
@@ -1994,16 +1999,18 @@ static int mmu_pages_next(struct kvm_mmu_pages *pvec,
 {
 	int n;
 
-	for (n = i+1; n < pvec->nr; n++) {
+	for (n = i + 1; n < pvec->nr; n++) {
 		struct kvm_mmu_page *sp = pvec->page[n].sp;
 		unsigned idx = pvec->page[n].idx;
 		int level = sp->role.level;
 
-		parents->idx[level-1] = idx;
-		if (level == PT_PAGE_TABLE_LEVEL)
+		if (level == PT_PAGE_TABLE_LEVEL) {
+			WARN_ON(idx != INVALID_INDEX);
 			break;
+		}
 
-		parents->parent[level-2] = sp;
+		parents->idx[level - 2] = idx;
+		parents->parent[level - 2] = sp;
 	}
 
 	return n;
@@ -2018,19 +2025,16 @@ static int mmu_pages_first(struct kvm_mmu_pages *pvec,
 	if (pvec->nr == 0)
 		return 0;
 
-	WARN_ON(pvec->page[0].idx != INVALID_INDEX);
-
 	sp = pvec->page[0].sp;
 	level = sp->role.level;
 	WARN_ON(level == PT_PAGE_TABLE_LEVEL);
 
-	parents->parent[level-2] = sp;
-
-	/* Also set up a sentinel. Further entries in pvec are all
+	/*
+	 * Also set up a sentinel. Further entries in pvec are all
 	 * children of sp, so this element is never overwritten.
 	 */
-	parents->parent[level-1] = NULL;
-	return mmu_pages_next(pvec, parents, 0);
+	parents->parent[level - 1] = NULL;
+	return mmu_pages_next(pvec, parents, -1);
 }
 
 static void mmu_pages_clear_parents(struct mmu_page_path *parents)
-- 
1.8.3.1
[PATCH 1/4] KVM: MMU: fix permission_fault()
kvm-unit-tests complained that the PFEC is not set properly, e.g.:

test pte.rw pte.d pte.nx pde.p pde.rw pde.pse user fetch: FAIL: error code 15 expected 5
Dump mapping: address: 0x1234 --L4: 3e95007 --L3: 3e96007 --L2: 283

It's caused by the PFEC returned to the guest being copied from the PFEC
triggered by the shadow page table.

This patch fixes it and makes the logic of updating errcode cleaner.

Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 arch/x86/kvm/mmu.h         | 8 ++++----
 arch/x86/kvm/paging_tmpl.h | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b70df72..81bffd1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -154,7 +154,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 				  unsigned pfec)
 {
 	int cpl = kvm_x86_ops->get_cpl(vcpu);
-	unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
+	unsigned long errcode, rflags = kvm_x86_ops->get_rflags(vcpu);
 
 	/*
 	 * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1.
@@ -175,7 +175,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	bool fault = (mmu->permissions[index] >> pte_access) & 1;
 
 	WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
-	pfec |= PFERR_PRESENT_MASK;
+	errcode = PFERR_PRESENT_MASK;
 
 	if (unlikely(mmu->pkru_mask)) {
 		u32 pkru_bits, offset;
@@ -193,11 +193,11 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			((pte_access & PT_USER_MASK) <<
 				(PFERR_RSVD_BIT - PT_USER_SHIFT));
 
 		pkru_bits &= mmu->pkru_mask >> offset;
-		pfec |= -pkru_bits & PFERR_PK_MASK;
+		errcode |= -pkru_bits & PFERR_PK_MASK;
 		fault |= (pkru_bits != 0);
 	}
 
-	return -(uint32_t)fault & pfec;
+	return -(uint32_t)fault & errcode;
 }
 
 void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 1d971c7..bc019f7 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -360,7 +360,7 @@ retry_walk:
 		goto error;
 
 	if (unlikely(is_rsvd_bits_set(mmu, pte, walker->level))) {
-		errcode |= PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
+		errcode = PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
 		goto error;
 	}
-- 
1.8.3.1
[PATCH 4/4] KVM: MMU: skip obsolete sp in for_each_gfn_*()
The obsolete sp should not be used on current vCPUs and should not hurt a
vCPU's running, so skip it in for_each_gfn_sp() and
for_each_gfn_indirect_valid_sp().

The side effect is that we will double-check role.invalid in
kvm_mmu_get_page(), but I think that is okay as role is well cached.

Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 arch/x86/kvm/mmu.c | 28 ++++++++++++------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c396e8b..4d66a9e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1906,18 +1906,17 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
  * since it has been deleted from active_mmu_pages but still can be found
  * at hash list.
  *
- * for_each_gfn_indirect_valid_sp has skipped that kind of page and
- * kvm_mmu_get_page(), the only user of for_each_gfn_sp(), has skipped
- * all the obsolete pages.
+ * for_each_gfn_valid_sp() has skipped that kind of pages.
  */
-#define for_each_gfn_sp(_kvm, _sp, _gfn)				\
+#define for_each_gfn_valid_sp(_kvm, _sp, _gfn)				\
 	hlist_for_each_entry(_sp,					\
 	  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
-		if ((_sp)->gfn != (_gfn)) {} else
+		if ((_sp)->gfn != (_gfn) || is_obsolete_sp((_kvm), (_sp)) \
+			|| (_sp)->role.invalid) {} else
 
 #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn)			\
-	for_each_gfn_sp(_kvm, _sp, _gfn)				\
-		if ((_sp)->role.direct || (_sp)->role.invalid) {} else
+	for_each_gfn_valid_sp(_kvm, _sp, _gfn)				\
+		if ((_sp)->role.direct) {} else
 
 /* @sp->gfn should be write-protected at the call site */
 static bool __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
@@ -1958,6 +1957,11 @@ static void kvm_mmu_audit(struct kvm_vcpu *vcpu, int point) { }
 static void mmu_audit_disable(void) { }
 #endif
 
+static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
+}
+
 static bool kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			  struct list_head *invalid_list)
 {
@@ -2092,11 +2096,6 @@ static void clear_sp_write_flooding_count(u64 *spte)
 	__clear_sp_write_flooding_count(sp);
 }
 
-static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
-	return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
-}
-
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 					     gfn_t gfn,
 					     gva_t gaddr,
@@ -2123,10 +2122,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 		quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
 		role.quadrant = quadrant;
 	}
-	for_each_gfn_sp(vcpu->kvm, sp, gfn) {
-		if (is_obsolete_sp(vcpu->kvm, sp))
-			continue;
-
+	for_each_gfn_valid_sp(vcpu->kvm, sp, gfn) {
 		if (!need_sync && sp->unsync)
 			need_sync = true;
-- 
1.8.3.1
Re: [PATCH] KVM: page_track: fix access to NULL slot
On 03/23/2016 12:40 AM, Paolo Bonzini wrote:

This happens when doing the reboot test from virt-tests:

[ 131.833653] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 131.842461] IP: [] kvm_page_track_is_active+0x17/0x60 [kvm]
[ 131.850500] PGD 0
[ 131.852763] Oops: [#1] SMP
[ 132.007188] task: 880075fbc500 ti: 880850a3c000 task.ti: 880850a3c000
[ 132.138891] Call Trace:
[ 132.141639] [] page_fault_handle_page_track+0x31/0x40 [kvm]
[ 132.149732] [] paging64_page_fault+0xff/0x910 [kvm]
[ 132.172159] [] kvm_mmu_page_fault+0x64/0x110 [kvm]
[ 132.179372] [] handle_exception+0x1b2/0x430 [kvm_intel]
[ 132.187072] [] vmx_handle_exit+0x1e1/0xc50 [kvm_intel]
...

Thank you for fixing it, Paolo!

Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
Re: [PATCH 1/1] KVM: don't allow irq_fpu_usable when the VCPU's XCR0 is loaded
On 03/16/2016 03:32 AM, Paolo Bonzini wrote: On 15/03/2016 19:27, Andy Lutomirski wrote: On Mon, Mar 14, 2016 at 6:17 AM, Paolo Bonzini wrote: On 11/03/2016 22:33, David Matlack wrote:

Is this better than just always keeping the host's XCR0 loaded outside of the KVM interrupts-disabled region?

Probably not. AFAICT KVM does not rely on it being loaded outside that region. xsetbv isn't insanely expensive, is it? Maybe it was put outside to minimize the time spent with interrupts disabled. I do like that your solution would be contained to KVM.

I agree with Andy. We do want a fix for recent kernels because of the !eager_fpu case that Guangrong mentioned.

Relying on interrupts is not easy, as XCR0 cannot be automatically saved/loaded by the VMCS... once an interrupt happens, it will use the guest's XCR0 anyway.
Re: [PATCH 0/1] KVM: x86: using the fpu in interrupt context with a guest's xcr0
On 03/16/2016 03:01 AM, David Matlack wrote: On Mon, Mar 14, 2016 at 12:46 AM, Xiao Guangrong <guangrong.x...@linux.intel.com> wrote: On 03/12/2016 04:47 AM, David Matlack wrote:

I have not been able to trigger this bug on Linux 4.3, and suspect it is due to this commit from Linux 4.2: 653f52c kvm,x86: load guest FPU context more eagerly. With this commit, as long as the host is using eagerfpu, the guest's fpu is always loaded just before the guest's xcr0 (vcpu->fpu_active is always 1 in the following snippet):

6569         if (vcpu->fpu_active)
6570                 kvm_load_guest_fpu(vcpu);
6571         kvm_load_guest_xcr0(vcpu);

When the guest's fpu is loaded, irq_fpu_usable() returns false.

Er, I did not see that commit introduce this change.

We've included our workaround for this bug, which applies to Linux 3.11. It does not apply cleanly to HEAD since the fpu subsystem was refactored in Linux 4.2. While the latest kernel does not look vulnerable, we may want to apply a fix to the vulnerable stable kernels.

Is the latest kvm safe if we use !eager fpu?

Yes, I believe so. When !eagerfpu, interrupted_kernel_fpu_idle() returns "!current->thread.fpu.fpregs_active && (read_cr0() & X86_CR0_TS)". This should ensure the interrupt handler never does XSAVE/XRSTOR with the guest's xcr0.

interrupted_kernel_fpu_idle() returns true if the KVM-based hypervisor (e.g. QEMU) is not using the fpu. That cannot stop the handler from using the fpu.
Re: [PATCH 0/1] KVM: x86: using the fpu in interrupt context with a guest's xcr0
On 03/12/2016 04:47 AM, David Matlack wrote:

I have not been able to trigger this bug on Linux 4.3, and suspect it is due to this commit from Linux 4.2: 653f52c kvm,x86: load guest FPU context more eagerly. With this commit, as long as the host is using eagerfpu, the guest's fpu is always loaded just before the guest's xcr0 (vcpu->fpu_active is always 1 in the following snippet):

6569         if (vcpu->fpu_active)
6570                 kvm_load_guest_fpu(vcpu);
6571         kvm_load_guest_xcr0(vcpu);

When the guest's fpu is loaded, irq_fpu_usable() returns false.

Er, I did not see that commit introduce this change.

We've included our workaround for this bug, which applies to Linux 3.11. It does not apply cleanly to HEAD since the fpu subsystem was refactored in Linux 4.2. While the latest kernel does not look vulnerable, we may want to apply a fix to the vulnerable stable kernels.

Is the latest kvm safe if we use !eager fpu? In that case, kvm_load_guest_fpu() is not called on every single VM-enter, which means the kernel will use the guest's xcr0 to save/restore the XSAVE area.

Maybe a simpler fix is just calling __kernel_fpu_begin() when the CPU switches to the vCPU and reverting it when the vCPU is scheduled out or returns to userspace.
Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()
On 03/11/2016 01:07 AM, Paolo Bonzini wrote: On 09/03/2016 08:18, Lan Tianyu wrote:

How about the following comments?

Log for kvm_mmu_commit_zap_page():

	/*
	 * We need to make sure everyone sees our modifications to
	 * the page tables and see changes to vcpu->mode here.

Please mention that this pairs with vcpu_enter_guest and walk_shadow_page_lockless_begin/end.

	 * The barrier in the kvm_flush_remote_tlbs() helps us to achieve
	 * these. Otherwise, wait for all vcpus to exit guest mode
	 * and/or lockless shadow page table walks.
	 */
	kvm_flush_remote_tlbs(kvm);

The rest of the comment is okay, but please replace "Otherwise" with "In addition, we need to".

Log for kvm_flush_remote_tlbs():

	/*
	 * We want to publish modifications to the page tables before
	 * reading mode. Pairs with a memory barrier in arch-specific
	 * code.
	 * - x86: smp_mb__after_srcu_read_unlock in vcpu_enter_guest.

... and smp_mb in walk_shadow_page_lockless_begin/end.

	 * - powerpc: smp_mb in kvmppc_prepare_to_enter.
	 */
	smp_mb__before_atomic();

The comment looks good, but the smp_mb__before_atomic() is not needed. As mentioned in the reply to Guangrong, only a smp_load_acquire is required. So the comment should say something like "There is already an smp_mb() before kvm_make_all_cpus_request reads vcpu->mode. We reuse that barrier here."

On top of this there is:

- the change to paging_tmpl.h that Guangrong posted, adding smp_wmb() before each increment of vcpu->kvm->tlbs_dirty

Yes, please make it a separate patch.

- the change to smp_mb__after_atomic() in kvm_make_all_cpus_request

- if you want :) you can also replace the store+mb in walk_shadow_page_lockless_begin with smp_store_mb, and the mb+store in walk_shadow_page_lockless_end with smp_store_release.

These changes are good to me. TianYu, please CC me when you post the new version out.
Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()
On 03/11/2016 12:04 AM, Paolo Bonzini wrote: On 10/03/2016 16:45, Xiao Guangrong wrote:

Compared to smp_load_acquire(), smp_mb() adds an ordering between stores and loads. Here, the ordering is load-store, hence...

Yes, this is why I put smp_mb() in the code. :)

Here is a table of barriers:

                 |  after: load       |  after: store
  ---------------+--------------------+------------------------
                 |  smp_rmb           |  smp_load_acquire
  before: load   |  smp_load_acquire  |  smp_store_release   XX
                 |  smp_mb            |  smp_mb
  ---------------+--------------------+------------------------
                 |                    |  smp_wmb
  before: store  |  smp_mb            |  smp_store_release
                 |                    |  smp_mb

Your case is the one marked with XX, so a smp_load_acquire() is enough---and it's preferable, because it's cheaper than smp_mb() and more self-documenting.

Yes, you are right and thank you for pointing it out.
Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()
On 03/10/2016 11:31 PM, Paolo Bonzini wrote: On 10/03/2016 16:26, Paolo Bonzini wrote:

Compared to smp_load_acquire(), smp_mb() adds an ordering between stores and loads. Here, the ordering is load-store, hence...

Yes, this is why I put smp_mb() in the code. :)
Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()
On 03/08/2016 11:27 PM, Paolo Bonzini wrote: On 08/03/2016 09:36, Lan Tianyu wrote:

Summary about the smp_mb()s we met in this thread. If misunderstood, please correct me. Thanks. The smp_mb() in kvm_flush_remote_tlbs() was introduced by commit a4ee1ca4 and it seems to keep the order of the read and the cmpxchg of kvm->tlbs_dirty. Quote from Avi:

| I don't think we need to flush immediately; set a "tlb dirty" bit somewhere
| that is cleared when we flush the tlb. kvm_mmu_notifier_invalidate_page()
| can consult the bit and force a flush if set.

Signed-off-by: Xiao Guangrong <xiaoguangr...@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

Unfortunately that patch added a bad memory barrier: 1) it lacks a comment; 2) it lacks obvious pairing; 3) it is an smp_mb() after a read, so it's not even obvious that this memory barrier has to do with the immediately preceding read of kvm->tlbs_dirty. It also is not documented in Documentation/virtual/kvm/mmu.txt (Guangrong documented there most of his other work, back in 2013, but not this one :)).

The cmpxchg is ordered anyway against the read, because 1) x86 has implicit ordering between earlier loads and later stores; 2) even store-load barriers are unnecessary for accesses to the same variable (in this case kvm->tlbs_dirty).

So offhand, I cannot say what it orders. There are four possibilities:

1) it orders the read of tlbs_dirty with the read of mode. In this case, an smp_rmb() would have been enough, and it's not clear where is the matching smp_wmb().

2) it orders the read of tlbs_dirty with the KVM_REQ_TLB_FLUSH request. In this case a smp_load_acquire would be better.

3) it does the same as kvm_mmu_commit_zap_page's smp_mb() but for other callers of kvm_flush_remote_tlbs(). In this case, we know what's the matching memory barrier (walk_shadow_page_lockless_*).

4) it is completely unnecessary.
Sorry, memory barriers were missed in sync_page(); this diff should fix it:

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 91e939b..4cad57f 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -948,6 +948,12 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 			return -EINVAL;
 
 		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
+			/*
+			 * update spte before increasing tlbs_dirty to make sure no tlb
+			 * flush is lost after spte is zapped, see the comments in
+			 * kvm_flush_remote_tlbs().
+			 */
+			smp_wmb();
 			vcpu->kvm->tlbs_dirty++;
 			continue;
 		}
@@ -963,6 +969,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 
 		if (gfn != sp->gfns[i]) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
+			/* the same as above where we are doing prefetch_invalid_gpte(). */
+			smp_wmb();
 			vcpu->kvm->tlbs_dirty++;
 			continue;
 		}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 314c777..82c86ea 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -193,7 +193,12 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
 {
 	long dirty_count = kvm->tlbs_dirty;
 
+	/*
+	 * read tlbs_dirty before doing tlb flush to make sure no tlb request
+	 * is lost.
+	 */
 	smp_mb();
+
 	if (kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
 		++kvm->stat.remote_tlb_flush;
 	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);

Any comment?
Re: [RFC PATCH 2/2] KVM: MMU: return page fault error code from permission_fault
On 03/08/2016 07:45 PM, Paolo Bonzini wrote:

Intel's new PKU extension introduces a bit of the page fault error code that must be dynamically computed after the PTE lookup (unlike I/D, U/S and W/R). To ease this, these two patches make permission_fault return the page fault error code. Right now it only adds the P bit to the input parameter "pfec", but PKU can change that.

Yep, I got the same idea when I reviewed the pkey patchset. This patch looks good to me.

Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
Re: [RFC PATCH 1/2] KVM: MMU: precompute page fault error code
On 03/08/2016 07:45 PM, Paolo Bonzini wrote:

For the next patch, we will want to filter PFERR_FETCH_MASK away early, and not pass it to permission_fault if neither NX nor SMEP are enabled. Prepare for the change.

Why is it needed? It is much easier to drop PFEC.F in update_permission_bitmask().

Signed-off-by: Paolo Bonzini
---
 arch/x86/kvm/mmu.c         |  2 +-
 arch/x86/kvm/paging_tmpl.h | 26 +++---
 2 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2463de0b935c..e57f7be061e3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3883,7 +3883,7 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
 			u = bit & ACC_USER_MASK;
 
 			if (!ept) {
-				/* Not really needed: !nx will cause pte.nx to fault */
+				/* Not really needed: if !nx, ff will always be zero */
 				x |= !mmu->nx;
 				/* Allow supervisor writes if !cr0.wp */
 				w |= !is_write_protection(vcpu) && !uf;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 6013f3685ef4..285858d3223b 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -272,13 +272,24 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	gpa_t pte_gpa;
 	int offset;
 	const int write_fault = access & PFERR_WRITE_MASK;
-	const int user_fault = access & PFERR_USER_MASK;
-	const int fetch_fault = access & PFERR_FETCH_MASK;
-	u16 errcode = 0;
+	u16 errcode;
 	gpa_t real_gpa;
 	gfn_t gfn;
 
 	trace_kvm_mmu_pagetable_walk(addr, access);
+
+	/*
+	 * Do not modify PFERR_FETCH_MASK in access. It is used later in the call to
+	 * mmu->translate_gpa and, when nested virtualization is in use, the X or NX
+	 * bit of nested page tables always applies---even if NX and SMEP are disabled
+	 * in the guest.
+	 *
+	 * TODO: cache the result of the NX and SMEP test in struct kvm_mmu?
+	 */
+	errcode = access;
+	if (!(mmu->nx || kvm_read_cr4_bits(vcpu, X86_CR4_SMEP)))
+		errcode &= ~PFERR_FETCH_MASK;
+

PFEC.P might be set since it is copied from @access.

And the workload is moved from the rare path to the common path...

 retry_walk:
 	walker->level = mmu->root_level;
 	pte = mmu->get_cr3(vcpu);
@@ -389,9 +400,7 @@ retry_walk:
 	if (unlikely(!accessed_dirty)) {
 		ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker,
							write_fault);
-		if (unlikely(ret < 0))
-			goto error;
-		else if (ret)
+		if (ret > 0)
 			goto retry_walk;

So it returns normally even if update_accessed_dirty_bits() failed.
Re: [RFC PATCH 1/2] KVM: MMU: precompute page fault error code
On 03/08/2016 07:45 PM, Paolo Bonzini wrote: For the next patch, we will want to filter PFERR_FETCH_MASK away early, and not pass it to permission_fault if neither NX nor SMEP are enabled. Prepare for the change. Why it is needed? It is much easier to drop PFEC.F in update_permission_bitmask(). Signed-off-by: Paolo Bonzini --- arch/x86/kvm/mmu.c | 2 +- arch/x86/kvm/paging_tmpl.h | 26 +++--- 2 files changed, 16 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 2463de0b935c..e57f7be061e3 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3883,7 +3883,7 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu, u = bit & ACC_USER_MASK; if (!ept) { - /* Not really needed: !nx will cause pte.nx to fault */ + /* Not really needed: if !nx, ff will always be zero */ x |= !mmu->nx; /* Allow supervisor writes if !cr0.wp */ w |= !is_write_protection(vcpu) && !uf; diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 6013f3685ef4..285858d3223b 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -272,13 +272,24 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, gpa_t pte_gpa; int offset; const int write_fault = access & PFERR_WRITE_MASK; - const int user_fault = access & PFERR_USER_MASK; - const int fetch_fault = access & PFERR_FETCH_MASK; - u16 errcode = 0; + u16 errcode; gpa_t real_gpa; gfn_t gfn; trace_kvm_mmu_pagetable_walk(addr, access); + + /* +* Do not modify PFERR_FETCH_MASK in access. It is used later in the call to +* mmu->translate_gpa and, when nested virtualization is in use, the X or NX +* bit of nested page tables always applies---even if NX and SMEP are disabled +* in the guest. +* +* TODO: cache the result of the NX and SMEP test in struct kvm_mmu? +*/ + errcode = access; + if (!(mmu->nx || kvm_read_cr4_bits(vcpu, X86_CR4_SMEP))) + errcode &= ~PFERR_FETCH_MASK; + PFEC.P might be set since it is copied from @access. 
And the workload is moved from the rare path to the common path... retry_walk: walker->level = mmu->root_level; pte = mmu->get_cr3(vcpu); @@ -389,9 +400,7 @@ retry_walk: if (unlikely(!accessed_dirty)) { ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, write_fault); - if (unlikely(ret < 0)) - goto error; - else if (ret) + if (ret > 0) goto retry_walk; So it returns normally even if update_accessed_dirty_bits() failed.
Re: [PATCH 0/2] KVM: x86: disable MPX if host did not enable MPX XSAVE features
On 03/08/2016 07:44 PM, Paolo Bonzini wrote: Patch 1 ensures that all aspects of MPX are disabled when eager FPU is disabled on the host. Patch 2 is just a cleanup. It looks good to me. Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com> Now, more and more features depend on eager xsave, e.g., fpu, mpx and protection-key, maybe it is time to rename eager-fpu to eager-xsave?
Re: [PATCH 1/2] KVM: MMU: fix ept=0/pte.u=0/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo
On 03/10/2016 06:09 PM, Paolo Bonzini wrote: On 10/03/2016 09:27, Xiao Guangrong wrote: +if (!enable_ept) { +guest_efer |= EFER_NX; +ignore_bits |= EFER_NX; Updating ignore_bits is not necessary, I think. More precisely, ignore_bits is only needed if guest EFER.NX=0 and we're not in this CR0.WP=1/CR4.SMEP=0 situation. In theory you could have guest EFER.NX=1 and host EFER.NX=0. That does not happen on Linux: the kernel always sets EFER.NX if CPUID reports it, see arch/x86/kernel/head_64.S: 204 /* Setup EFER (Extended Feature Enable Register) */ 205 movl $MSR_EFER, %ecx 206 rdmsr 207 btsl $_EFER_SCE, %eax /* Enable System Call */ 208 btl $20, %edi /* No Execute supported? */ 209 jnc 1f 210 btsl $_EFER_NX, %eax 211 btsq $_PAGE_BIT_NX, early_pmd_flags(%rip) 212 1: wrmsr /* Make changes effective */ So if the guest sees NX in its cpuid then host EFER.NX should be 1. This is what I came up with (plus some comments :)): u64 guest_efer = vmx->vcpu.arch.efer; u64 ignore_bits = 0; if (!enable_ept) { if (boot_cpu_has(X86_FEATURE_SMEP)) guest_efer |= EFER_NX; else if (!(guest_efer & EFER_NX)) ignore_bits |= EFER_NX; } Your logic is right. My suggestion is that we can keep ignore_bits = EFER_NX | EFER_SCE; (no need to conditionally adjust it) because EFER_NX must be the same between guest and host if we switch EFER manually. My patch is bigger but the resulting code is smaller and easier to follow: guest_efer = vmx->vcpu.arch.efer; if (!enable_ept) guest_efer |= EFER_NX; ... if (...) { ... } else { guest_efer &= ~ignore_bits; guest_efer |= host_efer & ignore_bits; } Agreed. :)
Re: [PATCH 1/2] KVM: MMU: fix ept=0/pte.u=0/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo
On 03/08/2016 07:44 PM, Paolo Bonzini wrote: Yes, all of these are needed. :) This is admittedly a bit odd, but kvm-unit-tests access.flat tests this if you run it with "-cpu host" and of course ept=0. KVM handles supervisor writes of a pte.u=0/pte.w=0/CR0.WP=0 page by setting U=0 and W=1 in the shadow PTE. This will cause a user write to fault and a supervisor write to succeed (which is correct because CR0.WP=0). A user read instead will flip U=0 to 1 and W=1 back to 0. BTW, it should be pte.u = 1 in the description above.
Re: [PATCH 2/2] KVM: MMU: fix reserved bit check for pte.u=0/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0
On 03/08/2016 07:44 PM, Paolo Bonzini wrote: KVM handles supervisor writes of a pte.u=0/pte.w=0/CR0.WP=0 page by setting U=0 and W=1 in the shadow PTE. This will cause a user write to fault and a supervisor write to succeed (which is correct because CR0.WP=0). A user read instead will flip U=0 to 1 and W=1 back to 0. This enables user reads; it also disables supervisor writes, the next of which will then flip the bits again. When SMEP is in effect, however, pte.u=0 will enable kernel execution of this page. To avoid this, KVM also sets pte.nx=1. The reserved bit catches this because it only looks at the guest's EFER.NX bit. Teach it that smep_andnot_wp will also use the NX bit of SPTEs. Cc: sta...@vger.kernel.org Cc: Xiao Guangrong <guangrong.x...@redhat.com> As a redhat guy I am so proud. :) Fixes: c258b62b264fdc469b6d3610a907708068145e3b Thanks for fixing it, Paolo! Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> --- arch/x86/kvm/mmu.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 95a955de5964..0cd4ee01de94 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3721,13 +3721,15 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu, void reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context) { + int uses_nx = context->nx || context->base_role.smep_andnot_wp; It would be better if this were a 'bool'.
Re: [PATCH 1/2] KVM: MMU: fix ept=0/pte.u=0/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo
On 03/08/2016 07:44 PM, Paolo Bonzini wrote: Yes, all of these are needed. :) This is admittedly a bit odd, but kvm-unit-tests access.flat tests this if you run it with "-cpu host" and of course ept=0. KVM handles supervisor writes of a pte.u=0/pte.w=0/CR0.WP=0 page by setting U=0 and W=1 in the shadow PTE. This will cause a user write to fault and a supervisor write to succeed (which is correct because CR0.WP=0). A user read instead will flip U=0 to 1 and W=1 back to 0. This enables user reads; it also disables supervisor writes, the next of which will then flip the bits again. When SMEP is in effect, however, U=0 will enable kernel execution of this page. To avoid this, KVM also sets NX=1 in the shadow PTE together with U=0. If the guest has not enabled NX, the result is a continuous stream of page faults due to the NX bit being reserved. The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER switch. Good catch! So it only hurts the box which has cpu_has_load_ia32_efer support otherwise NX is inherited from kernel (kernel always sets NX if CPU supports it), right? There is another bug in the reserved bit check, which I've split to a separate patch for easier application to stable kernels. Cc: sta...@vger.kernel.org Cc: Xiao Guangrong <guangrong.x...@redhat.com> Cc: Andy Lutomirski <l...@amacapital.net> Fixes: f6577a5fa15d82217ca73c74cd2dcbc0f6c781dd Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> --- Documentation/virtual/kvm/mmu.txt | 3 ++- arch/x86/kvm/vmx.c| 25 +++-- 2 files changed, 17 insertions(+), 11 deletions(-) diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt index daf9c0f742d2..c81731096a43 100644 --- a/Documentation/virtual/kvm/mmu.txt +++ b/Documentation/virtual/kvm/mmu.txt @@ -358,7 +358,8 @@ In the first case there are two additional complications: - if CR4.SMEP is enabled: since we've turned the page into a kernel page, the kernel may now execute it. We handle this by also setting spte.nx. 
If we get a user fetch or read fault, we'll change spte.u=1 and - spte.nx=gpte.nx back. + spte.nx=gpte.nx back. For this to work, KVM forces EFER.NX to 1 when + shadow paging is in use. - if CR4.SMAP is disabled: since the page has been changed to a kernel page, it can not be reused when CR4.SMAP is enabled. We set CR4.SMAP && !CR0.WP into shadow page's role to avoid this case. Note, diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 6e51493ff4f9..91830809d837 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -1863,20 +1863,20 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset) guest_efer = vmx->vcpu.arch.efer; /* -* NX is emulated; LMA and LME handled by hardware; SCE meaningless -* outside long mode +* LMA and LME handled by hardware; SCE meaningless outside long mode. */ - ignore_bits = EFER_NX | EFER_SCE; + ignore_bits = EFER_SCE; #ifdef CONFIG_X86_64 ignore_bits |= EFER_LMA | EFER_LME; /* SCE is meaningful only in long mode on Intel */ if (guest_efer & EFER_LMA) ignore_bits &= ~(u64)EFER_SCE; #endif - guest_efer &= ~ignore_bits; - guest_efer |= host_efer & ignore_bits; - vmx->guest_msrs[efer_offset].data = guest_efer; - vmx->guest_msrs[efer_offset].mask = ~ignore_bits; + /* NX is needed to handle CR0.WP=1, CR4.SMEP=1. */ + if (!enable_ept) { + guest_efer |= EFER_NX; + ignore_bits |= EFER_NX; Updating ignore_bits is not necessary, I think. + } clear_atomic_switch_msr(vmx, MSR_EFER); @@ -1887,16 +1887,21 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset) */ if (cpu_has_load_ia32_efer || (enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX))) { - guest_efer = vmx->vcpu.arch.efer; if (!(guest_efer & EFER_LMA)) guest_efer &= ~EFER_LME; if (guest_efer != host_efer) add_atomic_switch_msr(vmx, MSR_EFER, guest_efer, host_efer); So, why not set EFER_NX (if !ept) just in this branch to make the fix simpler?
Re: [PATCH v2 0/9] cleanup around kvm_sync_page, and a few micro-optimizations
On 03/07/2016 10:15 PM, Paolo Bonzini wrote: Having committed the ubsan fixes, these are the cleanups that are left. Compared to v1, I have fixed the patch to coalesce page zapping after mmu_sync_children (as requested by Takuya and Guangrong), and I have rewritten is_last_gpte again in an even simpler way. Nice work! Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()
On 03/04/2016 04:04 PM, Paolo Bonzini wrote: On 04/03/2016 02:35, Lan Tianyu wrote: The following kvm_flush_remote_tlbs() will call smp_mb() inside and so remove smp_mb() here. Signed-off-by: Lan Tianyu --- arch/x86/kvm/mmu.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index a54ecd9..6315416 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2383,12 +2383,6 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm, return; /* -* wmb: make sure everyone sees our modifications to the page tables -* rmb: make sure we see changes to vcpu->mode -*/ - smp_mb(); - - /* * Wait for all vcpus to exit guest mode and/or lockless shadow * page table walks. */ kvm_flush_remote_tlbs loads kvm->tlbs_dirty before issuing the memory barrier. I think it's okay if the load is done earlier, but I'd like Guangrong to take a look. Yes, it looks good to me, but please keep the comment. Thanks!
Re: [PATCH] KVM: VMX: use vmcs_clear/set_bits for debug register exits
On 02/26/2016 07:55 PM, Paolo Bonzini wrote: Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> --- arch/x86/kvm/vmx.c | 14 +++--- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index aa16d5874fe6..46154dac71e6 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5619,11 +5619,8 @@ static int handle_dr(struct kvm_vcpu *vcpu) } if (vcpu->guest_debug == 0) { - u32 cpu_based_vm_exec_control; - - cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL); - cpu_based_vm_exec_control &= ~CPU_BASED_MOV_DR_EXITING; - vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control); + vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL, + CPU_BASED_MOV_DR_EXITING); /* * No more DR vmexits; force a reload of the debug registers @@ -5660,8 +5657,6 @@ static void vmx_set_dr6(struct kvm_vcpu *vcpu, unsigned long val) static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu) { - u32 cpu_based_vm_exec_control; - get_debugreg(vcpu->arch.db[0], 0); get_debugreg(vcpu->arch.db[1], 1); get_debugreg(vcpu->arch.db[2], 2); @@ -5670,10 +5665,7 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu) vcpu->arch.dr7 = vmcs_readl(GUEST_DR7); vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_WONT_EXIT; - - cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL); - cpu_based_vm_exec_control |= CPU_BASED_MOV_DR_EXITING; - vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control); + vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, CPU_BASED_MOV_DR_EXITING); } static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val) Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
Re: [PATCH] KVM: x86: fix root cause for missed hardware breakpoints
On 02/26/2016 07:46 PM, Paolo Bonzini wrote: Commit 172b2386ed16 ("KVM: x86: fix missed hardware breakpoints", 2016-02-10) worked around a case where the debug registers are not loaded correctly on preemption and on the first entry to KVM_RUN. However, Xiao Guangrong pointed out that the root cause must be that KVM_DEBUGREG_BP_ENABLED is not being set correctly. This can indeed happen due to the lazy debug exit mechanism, which does not call kvm_update_dr7. Fix it by replacing the existing loop (more or less equivalent to kvm_update_dr0123) with calls to all the kvm_update_dr* functions. The original patch is good enough for stable releases before 4.1, since it does not have any dependencies such as commit ae561edeb421 ("KVM: x86: DR0-DR3 are not clear on reset", 2015-04-02). Cc: sta...@vger.kernel.org # 4.1+ Fixes: 172b2386ed16a9143d9a456aae5ec87275c61489 Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> --- arch/x86/kvm/x86.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f4891f2ece23..eaf6ee8c28b8 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2752,7 +2752,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) } kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); - vcpu->arch.switch_db_regs |= KVM_DEBUGREG_RELOAD; } void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) @@ -6619,12 +6618,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) * KVM_DEBUGREG_WONT_EXIT again. */ if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) { - int i; - WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP); kvm_x86_ops->sync_dirty_debug_regs(vcpu); - for (i = 0; i < KVM_NR_DB_REGS; i++) - vcpu->arch.eff_db[i] = vcpu->arch.db[i]; + kvm_update_dr0123(vcpu); + kvm_update_dr6(vcpu); + kvm_update_dr7(vcpu); + vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD; } It is good to me now. 
It can also fix the case where the debug registers are changed by the debugging process (gdb/perf) in the context of the vcpu, as I described in another thread. Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
Re: [PATCH] KVM: x86: fix missed hardware breakpoints
On 02/26/2016 07:28 PM, Nadav Amit wrote: Xiao Guangrong <guangrong.x...@linux.intel.com> wrote: On 02/19/2016 06:56 PM, Paolo Bonzini wrote: Sometimes when setting a breakpoint a process doesn't stop on it. This is because the debug registers are not loaded correctly on VCPU load. diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4244c2baf57d..f4891f2ece23 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2752,6 +2752,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) } kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); + vcpu->arch.switch_db_regs |= KVM_DEBUGREG_RELOAD; Er, I do not understand how it works. The BP is enabled in this test case, so the debug registers are always reloaded before entering the guest, as the KVM_DEBUGREG_BP_ENABLED bit is always set in switch_db_regs. What did I miss? Note that KVM_DEBUGREG_BP_ENABLED does not have to be set, since kvm_update_dr7() is not called once the guest has KVM_DEBUGREG_WONT_EXIT. Ah, yes. So we will enter the guest without reloading the debug registers in this case. Is that safe? What if gdb/perf attaches to QEMU and resets DR0-DR3 (in this case, the vcpu is interrupted by an IPI and is not rescheduled)?
Re: [PATCH] KVM: x86: fix missed hardware breakpoints
On 02/19/2016 06:56 PM, Paolo Bonzini wrote: Sometimes when setting a breakpoint a process doesn't stop on it. This is because the debug registers are not loaded correctly on VCPU load. The following simple reproducer from Oleg Nesterov tries using debug registers in two threads. To see the bug, run a 2-VCPU guest under "taskset -c 0", then run "./bp 0 1" inside the guest. #include #include #include #include #include #include #include #include #include #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER) unsigned long encode_dr7(int drnum, int enable, unsigned int type, unsigned int len) { unsigned long dr7; dr7 = ((len | type) & 0xf) << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE); if (enable) dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE)); return dr7; } int write_dr(int pid, int dr, unsigned long val) { return ptrace(PTRACE_POKEUSER, pid, offsetof (struct user, u_debugreg[dr]), val); } void set_bp(pid_t pid, void *addr) { unsigned long dr7; assert(write_dr(pid, 0, (long)addr) == 0); dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1); assert(write_dr(pid, 7, dr7) == 0); } void *get_rip(int pid) { return (void*)ptrace(PTRACE_PEEKUSER, pid, offsetof(struct user, regs.rip), 0); } void test(int nr) { void *bp_addr = & + nr, *bp_hit; int pid; printf("test bp %d\n", nr); assert(nr < 16); // see 16 asm nops below pid = fork(); if (!pid) { assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0); kill(getpid(), SIGSTOP); for (;;) { label: asm ( "nop; nop; nop; nop;" "nop; nop; nop; nop;" "nop; nop; nop; nop;" "nop; nop; nop; nop;" ); } } assert(pid == wait(NULL)); set_bp(pid, bp_addr); for (;;) { assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0); assert(pid == wait(NULL)); bp_hit = get_rip(pid); if (bp_hit != bp_addr) fprintf(stderr, "ERR!! 
hit wrong bp %ld != %d\n", bp_hit - &, nr); } } int main(int argc, const char *argv[]) { while (--argc) { int nr = atoi(*++argv); if (!fork()) test(nr); } while (wait(NULL) > 0) ; return 0; } Cc: sta...@vger.kernel.org Suggested-by: Nadav Amit Reported-by: Andrey Wagin Signed-off-by: Paolo Bonzini --- Andrey, sorry for the time it took to get to this one. arch/x86/kvm/x86.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4244c2baf57d..f4891f2ece23 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2752,6 +2752,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) } kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu); + vcpu->arch.switch_db_regs |= KVM_DEBUGREG_RELOAD; Er, I do not understand how it works. The BP is enabled in this test case, so the debug registers are always reloaded before entering the guest, as the KVM_DEBUGREG_BP_ENABLED bit is always set in switch_db_regs. What did I miss? Another impact of this fix is that when the vcpu is rescheduled we always need to reload the debug registers even if the guest does not enable them; is that really needed?
Re: [PATCH] KVM: x86: fix missed hardware breakpoints
On 02/19/2016 06:56 PM, Paolo Bonzini wrote:

Sometimes when setting a breakpoint a process doesn't stop on it. This is because the debug registers are not loaded correctly on VCPU load.

The following simple reproducer from Oleg Nesterov tries using debug registers in two threads. To see the bug, run a 2-VCPU guest under "taskset -c 0", then run "./bp 0 1" inside the guest.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/user.h>
#include <sys/debugreg.h>

#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

unsigned long encode_dr7(int drnum, int enable, unsigned int type, unsigned int len)
{
	unsigned long dr7;

	dr7 = ((len | type) & 0xf) << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE);
	if (enable)
		dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE));
	return dr7;
}

int write_dr(int pid, int dr, unsigned long val)
{
	return ptrace(PTRACE_POKEUSER, pid,
		      offsetof(struct user, u_debugreg[dr]), val);
}

void set_bp(pid_t pid, void *addr)
{
	unsigned long dr7;

	assert(write_dr(pid, 0, (long)addr) == 0);
	dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1);
	assert(write_dr(pid, 7, dr7) == 0);
}

void *get_rip(int pid)
{
	return (void*)ptrace(PTRACE_PEEKUSER, pid,
			     offsetof(struct user, regs.rip), 0);
}

void test(int nr)
{
	void *bp_addr = &&label + nr, *bp_hit;
	int pid;

	printf("test bp %d\n", nr);
	assert(nr < 16); // see 16 asm nops below

	pid = fork();
	if (!pid) {
		assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
		kill(getpid(), SIGSTOP);
		for (;;) {
			label: asm (
				"nop; nop; nop; nop;"
				"nop; nop; nop; nop;"
				"nop; nop; nop; nop;"
				"nop; nop; nop; nop;"
			);
		}
	}

	assert(pid == wait(NULL));
	set_bp(pid, bp_addr);

	for (;;) {
		assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0);
		assert(pid == wait(NULL));

		bp_hit = get_rip(pid);
		if (bp_hit != bp_addr)
			fprintf(stderr, "ERR!! hit wrong bp %ld != %d\n",
				bp_hit - &&label, nr);
	}
}

int main(int argc, const char *argv[])
{
	while (--argc) {
		int nr = atoi(*++argv);
		if (!fork())
			test(nr);
	}

	while (wait(NULL) > 0)
		;
	return 0;
}

Cc: sta...@vger.kernel.org
Suggested-by: Nadav Amit
Reported-by: Andrey Wagin
Signed-off-by: Paolo Bonzini
---
Andrey, sorry for the time it took to get to this one.

 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4244c2baf57d..f4891f2ece23 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2752,6 +2752,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	}

 	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
+	vcpu->arch.switch_db_regs |= KVM_DEBUGREG_RELOAD;

Er, I do not understand how it works. The BP is enabled in this test case, so the debug registers are always reloaded before entering the guest, as the KVM_DEBUGREG_BP_ENABLED bit is always set in switch_db_regs. What did I miss?

Another impact of this fix is that when a vcpu is rescheduled we always reload the debug registers even if the guest does not enable them. Is that really needed?
Re: [PATCH 09/12] KVM: MMU: coalesce zapping page after mmu_sync_children
On 02/25/2016 04:49 PM, Paolo Bonzini wrote:

On 25/02/2016 08:35, Xiao Guangrong wrote:

This may release the mmu_lock before committing the zapping. Is it safe? If so, we may want to see the reason in the changelog.

It is unsafe indeed, please do not do it.

Can you explain why? kvm_zap_obsolete_pages does the same.

It's not the same; please see the comment in kvm_mmu_invalidate_zap_all_pages:

	/*
	 * Notify all vcpus to reload its shadow page table
	 * and flush TLB. Then all vcpus will switch to new
	 * shadow page table with the new mmu_valid_gen.
	 *
	 * Note: we should do this under the protection of
	 * mmu-lock, otherwise, vcpu would purge shadow page
	 * but miss tlb flush.
	 */
	kvm_reload_remote_mmus(kvm);

That means the TLB is flushed before releasing mmu-lock. An example is rmap_write_protect(): when KVM creates a shadow page table for the guest, it detects no spte pointing to the gfn, so the TLB is not flushed and the guest can freely update its pte.
Re: [PATCH] KVM: x86: MMU: fix ubsan index-out-of-range warning
On 02/25/2016 02:02 AM, Mike Krinkin wrote:

Ubsan reports the following warning due to a typo in the update_accessed_dirty_bits template; the patch fixes the typo:

[  168.791851]
[  168.791862] UBSAN: Undefined behaviour in arch/x86/kvm/paging_tmpl.h:252:15
[  168.791866] index 4 is out of range for type 'u64 [4]'
[  168.791871] CPU: 0 PID: 2950 Comm: qemu-system-x86 Tainted: G O L 4.5.0-rc5-next-20160222 #7
[  168.791873] Hardware name: LENOVO 23205NG/23205NG, BIOS G2ET95WW (2.55 ) 07/09/2013
[  168.791876]  8801cfcaf208 81c9f780 41b58ab3
[  168.791882]  82eb2cc1 81c9f6b4 8801cfcaf230 8801cfcaf1e0
[  168.791886]  0004 0001 a1981600
[  168.791891] Call Trace:
[  168.791899]  [] dump_stack+0xcc/0x12c
[  168.791904]  [] ? _atomic_dec_and_lock+0xc4/0xc4
[  168.791910]  [] ubsan_epilogue+0xd/0x8a
[  168.791914]  [] __ubsan_handle_out_of_bounds+0x15c/0x1a3
[  168.791918]  [] ? __ubsan_handle_shift_out_of_bounds+0x2bd/0x2bd
[  168.791922]  [] ? get_user_pages_fast+0x2bf/0x360
[  168.791954]  [] ? kvm_largepages_enabled+0x30/0x30 [kvm]
[  168.791958]  [] ? __get_user_pages_fast+0x360/0x360
[  168.791987]  [] paging64_walk_addr_generic+0x1b28/0x2600 [kvm]
[  168.792014]  [] ? init_kvm_mmu+0x1100/0x1100 [kvm]
[  168.792019]  [] ? debug_check_no_locks_freed+0x350/0x350
[  168.792044]  [] ? init_kvm_mmu+0x1100/0x1100 [kvm]
[  168.792076]  [] paging64_gva_to_gpa+0x7d/0x110 [kvm]
[  168.792121]  [] ? paging64_walk_addr_generic+0x2600/0x2600 [kvm]
[  168.792130]  [] ? debug_lockdep_rcu_enabled+0x7b/0x90
[  168.792178]  [] emulator_read_write_onepage+0x27a/0x1150 [kvm]
[  168.792208]  [] ? __kvm_read_guest_page+0x54/0x70 [kvm]
[  168.792234]  [] ? kvm_task_switch+0x160/0x160 [kvm]
[  168.792238]  [] ? debug_lockdep_rcu_enabled+0x7b/0x90
[  168.792263]  [] emulator_read_write+0xe7/0x6d0 [kvm]
[  168.792290]  [] ? em_cr_write+0x230/0x230 [kvm]
[  168.792314]  [] emulator_write_emulated+0x15/0x20 [kvm]
[  168.792340]  [] segmented_write+0xf8/0x130 [kvm]
[  168.792367]  [] ? em_lgdt+0x20/0x20 [kvm]
[  168.792374]  [] ? vmx_read_guest_seg_ar+0x42/0x1e0 [kvm_intel]
[  168.792400]  [] writeback+0x3f2/0x700 [kvm]
[  168.792424]  [] ? em_sidt+0xa0/0xa0 [kvm]
[  168.792449]  [] ? x86_decode_insn+0x1b3d/0x4f70 [kvm]
[  168.792474]  [] x86_emulate_insn+0x572/0x3010 [kvm]
[  168.792499]  [] x86_emulate_instruction+0x3bd/0x2110 [kvm]
[  168.792524]  [] ? reexecute_instruction.part.110+0x2e0/0x2e0 [kvm]
[  168.792532]  [] handle_ept_misconfig+0x61/0x460 [kvm_intel]
[  168.792539]  [] ? handle_pause+0x450/0x450 [kvm_intel]
[  168.792546]  [] vmx_handle_exit+0xd6a/0x1ad0 [kvm_intel]
[  168.792572]  [] ? kvm_arch_vcpu_ioctl_run+0xbdc/0x6090 [kvm]
[  168.792597]  [] kvm_arch_vcpu_ioctl_run+0xd3d/0x6090 [kvm]
[  168.792621]  [] ? kvm_arch_vcpu_ioctl_run+0xbdc/0x6090 [kvm]
[  168.792627]  [] ? __ww_mutex_lock_interruptible+0x1630/0x1630
[  168.792651]  [] ? kvm_arch_vcpu_runnable+0x4f0/0x4f0 [kvm]
[  168.792656]  [] ? preempt_notifier_unregister+0x190/0x190
[  168.792681]  [] ? kvm_arch_vcpu_load+0x127/0x650 [kvm]
[  168.792704]  [] kvm_vcpu_ioctl+0x553/0xda0 [kvm]
[  168.792727]  [] ? vcpu_put+0x40/0x40 [kvm]
[  168.792732]  [] ? debug_check_no_locks_freed+0x350/0x350
[  168.792735]  [] ? _raw_spin_unlock+0x27/0x40
[  168.792740]  [] ? handle_mm_fault+0x1673/0x2e40
[  168.792744]  [] ? trace_hardirqs_on_caller+0x478/0x6c0
[  168.792747]  [] ? trace_hardirqs_on+0xd/0x10
[  168.792751]  [] ? debug_lockdep_rcu_enabled+0x7b/0x90
[  168.792756]  [] do_vfs_ioctl+0x1b0/0x12b0
[  168.792759]  [] ? ioctl_preallocate+0x210/0x210
[  168.792763]  [] ? __fget+0x273/0x4a0
[  168.792766]  [] ? __fget+0x50/0x4a0
[  168.792770]  [] ? __fget_light+0x96/0x2b0
[  168.792773]  [] SyS_ioctl+0x79/0x90
[  168.792777]  [] entry_SYSCALL_64_fastpath+0x23/0xc1
[  168.792780]

Signed-off-by: Mike Krinkin <krinkin@gmail.com>

Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
Re: [PATCH 00/12] KVM: MMU: cleanup around kvm_sync_page, and a few micro-optimizations
On 02/24/2016 09:17 PM, Paolo Bonzini wrote:

This series started from looking at mmu_unsync_walk for the ubsan thread. Patches 1 and 2 are the result of the discussions in that thread.

Patches 3 to 9 do more cleanups in __kvm_sync_page and its callers. Among other changes, it removes kvm_sync_page_transient and avoids duplicate code between __kvm_sync_page and kvm_sync_pages. I stopped where I had questions about the existing kvm_mmu_get_page code (see patch 8 for the question). However, perhaps more cleanups are possible, also thanks to Takuya's work on that function and link_shadow_page.

Patches 10 to 12 are just micro-optimizations.

Guangrong, it would be great if you took a look since you know this part of KVM very well.

I have reviewed it and it works fine, except for the patch that leaks a TLB flush outside of mmu-lock. I will continue to simplify the path that walks unsync sps to keep mmu_page_path smaller, and will add comments for kvm_mmu_get_page on top of this patchset.

BTW, is there any conflict in applying my page-tracking patchset on top of this patchset (I noticed you've merged this patchset into kvm/queue)? Please tell me to rebase it if needed.
Re: [PATCH 09/12] KVM: MMU: coalesce zapping page after mmu_sync_children
On 02/25/2016 10:15 AM, Takuya Yoshikawa wrote:

On 2016/02/24 22:17, Paolo Bonzini wrote:

Move the call to kvm_mmu_flush_or_zap outside the loop.

Signed-off-by: Paolo Bonzini
---
 arch/x86/kvm/mmu.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 725316df32ec..6d47b5c43246 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2029,24 +2029,27 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 	struct mmu_page_path parents;
 	struct kvm_mmu_pages pages;
 	LIST_HEAD(invalid_list);
+	bool flush = false;

 	while (mmu_unsync_walk(parent, &pages)) {
 		bool protected = false;
-		bool flush = false;

 		for_each_sp(pages, sp, parents, i)
 			protected |= rmap_write_protect(vcpu, sp->gfn);

-		if (protected)
+		if (protected) {
 			kvm_flush_remote_tlbs(vcpu->kvm);
+			flush = false;
+		}

 		for_each_sp(pages, sp, parents, i) {
 			flush |= kvm_sync_page(vcpu, sp, &invalid_list);
 			mmu_pages_clear_parents(&parents);
 		}
-		kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
 		cond_resched_lock(&vcpu->kvm->mmu_lock);

This may release the mmu_lock before committing the zapping. Is it safe? If so, we may want to see the reason in the changelog.

It is unsafe indeed, please do not do it.
Re: [PATCH 08/12] KVM: MMU: move zap/flush to kvm_mmu_get_page
On 02/24/2016 09:17 PM, Paolo Bonzini wrote:

kvm_mmu_get_page is the only caller of kvm_sync_page_transient and kvm_sync_pages. Moving the handling of the invalid_list there removes the need for the underdocumented kvm_sync_page_transient function.

Signed-off-by: Paolo Bonzini
---

Guangrong, at this point I am confused about why kvm_sync_page_transient didn't clear sp->unsync. Do you remember? Or perhaps kvm_mmu_get_page could just call kvm_sync_page now?

It is an optimization to reduce write protection: changing unsync to sync requires write-protecting the page and syncing all sptes pointing to the same gfn. However, after syncing the content between the unsync-ed spte and the guest pte, we can reuse this spte perfectly.

Also, can you explain the need_sync variable in kvm_mmu_get_page?

This is because we need to protect the semantics of 'unsync spte': only a spte on the last level (level = 1) can be unsync, so if a spte on an upper level is created we should eliminate all the unsync sptes pointing to the same gfn.

As you have already merged this patchset into the kvm tree, I will post a patch to comment these cases to make the code more understandable.
[PATCH 03/11] KVM: MMU: introduce kvm_mmu_slot_gfn_write_protect
Split rmap_write_protect() and introduce a function that abstracts write protection based on the slot. This function will be used in a later patch.

Reviewed-by: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 arch/x86/kvm/mmu.c | 16 +++-
 arch/x86/kvm/mmu.h | 2 ++
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 566639d..7184218 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1336,23 +1336,29 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
 	kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }

-static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
+				    struct kvm_memory_slot *slot, u64 gfn)
 {
-	struct kvm_memory_slot *slot;
 	struct kvm_rmap_head *rmap_head;
 	int i;
 	bool write_protected = false;

-	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-
 	for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
 		rmap_head = __gfn_to_rmap(gfn, i, slot);
-		write_protected |= __rmap_write_protect(vcpu->kvm, rmap_head, true);
+		write_protected |= __rmap_write_protect(kvm, rmap_head, true);
 	}

 	return write_protected;
 }

+static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+{
+	struct kvm_memory_slot *slot;
+
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+	return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn);
+}
+
 static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
 {
 	u64 *sptep;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index de92bed..58fe98a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -177,4 +177,6 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);

+bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
+				    struct kvm_memory_slot *slot, u64 gfn);
 #endif
--
1.8.3.1