Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-04 Thread Xiao Guangrong



On 07/04/2016 04:45 PM, Xiao Guangrong wrote:



On 07/04/2016 04:41 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 04:19:20PM +0800, Xiao Guangrong wrote:



On 07/04/2016 03:53 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 03:37:35PM +0800, Xiao Guangrong wrote:



On 07/04/2016 03:03 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 02:39:22PM +0800, Xiao Guangrong wrote:



On 06/30/2016 09:01 PM, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.


Why does it require fetching the pfn when the fault is triggered rather
than when mmap() is called?


Hi Guangrong,

as such, the mapping information between virtual mmio and physical mmio is only
available at runtime.


Sorry, I do not know what the difference is between mmap() time and the time the
VM actually accesses the memory in your case. Could you please provide more detail?


Hi Guangrong,

Sure. The mmap() gets called by qemu or any VFIO API userspace consumer when
setting up the virtual mmio; at that moment nobody has any knowledge about how
the physical mmio gets virtualized.

When the vm (or application, if we don't want to limit ourselves to the vmm
term) starts, the virtual and physical mmio get mapped by the mpci kernel
module with help from the vendor-supplied mediated host driver, according to
the hw resources assigned to this vm / application.


Thanks for your explanation.

It sounds like a resource-allocation strategy: you delay the allocation until
the VM actually accesses it, right?


Yes, that is where the fault handler inside the mpci code comes into the picture.



I am not sure this strategy is good. The instance is created successfully and
started successfully, but then the VM crashes because the resources for that
instance are not enough. That sounds unreasonable.



In particular, you cannot squeeze this kind of memory to balance the usage
between all VMs. Does this strategy still make sense?
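
For concreteness, a minimal sketch of the lazy-mapping scheme under discussion
could look like the following. lookup_backing_pfn() is a hypothetical stand-in
for the mpci/vendor code that only learns the physical mmio layout at run time;
this is not the actual NVIDIA patchset.

#include <linux/mm.h>

/* Hypothetical: resolves which host pfn backs this BAR offset at run time. */
unsigned long lookup_backing_pfn(void *priv, pgoff_t pgoff);

static int vgpu_bar_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	unsigned long size = vma->vm_end - vma->vm_start;
	unsigned long pfn = lookup_backing_pfn(vma->vm_private_data, vmf->pgoff);

	/* Populate the whole VMA on first touch. */
	if (remap_pfn_range(vma, vma->vm_start, pfn, size, vma->vm_page_prot))
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct vgpu_bar_vm_ops = {
	.fault = vgpu_bar_fault,
};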




Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-04 Thread Xiao Guangrong



On 07/04/2016 04:41 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 04:19:20PM +0800, Xiao Guangrong wrote:



On 07/04/2016 03:53 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 03:37:35PM +0800, Xiao Guangrong wrote:



On 07/04/2016 03:03 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 02:39:22PM +0800, Xiao Guangrong wrote:



On 06/30/2016 09:01 PM, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.


Why does it require fetching the pfn when the fault is triggered rather
than when mmap() is called?


Hi Guangrong,

as such, the mapping information between virtual mmio and physical mmio is only
available at runtime.


Sorry, I do not know what the difference is between mmap() time and the time the
VM actually accesses the memory in your case. Could you please provide more detail?


Hi Guangrong,

Sure. The mmap() gets called by qemu or any VFIO API userspace consumer when
setting up the virtual mmio; at that moment nobody has any knowledge about how
the physical mmio gets virtualized.

When the vm (or application, if we don't want to limit ourselves to the vmm
term) starts, the virtual and physical mmio get mapped by the mpci kernel
module with help from the vendor-supplied mediated host driver, according to
the hw resources assigned to this vm / application.


Thanks for your explanation.

It sounds like a resource-allocation strategy: you delay the allocation until
the VM actually accesses it, right?


Yes, that is where the fault handler inside the mpci code comes into the picture.



I am not sure this strategy is good. The instance is created successfully and
started successfully, but then the VM crashes because the resources for that
instance are not enough. That sounds unreasonable.





Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-04 Thread Xiao Guangrong



On 07/04/2016 04:14 PM, Paolo Bonzini wrote:



On 04/07/2016 09:59, Xiao Guangrong wrote:



But apart from this, it's much more obvious to consider the refcount.
The x86 MMU code doesn't care if the page is reserved or not;
mmu_set_spte does a kvm_release_pfn_clean, hence it makes sense for
hva_to_pfn_remapped to try doing a get_page (via kvm_get_pfn) after
invoking the fault handler, just like the get_user_pages family of
functions does.


Well, it's a little strange that you always try to get a refcount for a PFNMAP
region without MIXEDMAP, which indicates that none of the memory in this region
has a 'struct page' backing.


Fair enough, I can modify the comment.

/*
 * In case the VMA has VM_MIXEDMAP set, whoever called remap_pfn_range
 * is also going to call e.g. unmap_mapping_range before the underlying
 * non-reserved pages are freed, which will then call our MMU notifier.
 * We still have to get a reference here to the page, because the callers
 * of *hva_to_pfn* and *gfn_to_pfn* ultimately end up doing a
 * kvm_release_pfn_clean on the returned pfn.  If the pfn is
 * reserved, the kvm_get_pfn/kvm_release_pfn_clean pair will simply
 * do nothing.
 */



Excellent. I like it. :)
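
A rough sketch of the flow being agreed on here, simplified and not the exact
patch (error handling and the retry through the fault handler are omitted):

#include <linux/kvm_host.h>
#include <linux/mm.h>

static int hva_to_pfn_remapped_sketch(struct vm_area_struct *vma,
				      unsigned long addr, kvm_pfn_t *p_pfn)
{
	unsigned long pfn;

	/* The VMA's fault handler has already run and installed the PTE. */
	if (follow_pfn(vma, addr, &pfn))
		return -EFAULT;

	/*
	 * mmu_set_spte() will eventually do kvm_release_pfn_clean(), so
	 * balance it here.  For reserved pfns the get/release pair is a
	 * no-op, as noted in the comment above.
	 */
	kvm_get_pfn(pfn);
	*p_pfn = pfn;
	return 0;
}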


Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-04 Thread Xiao Guangrong



On 07/04/2016 03:53 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 03:37:35PM +0800, Xiao Guangrong wrote:



On 07/04/2016 03:03 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 02:39:22PM +0800, Xiao Guangrong wrote:



On 06/30/2016 09:01 PM, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.


Why does it require fetching the pfn when the fault is triggered rather
than when mmap() is called?


Hi Guangrong,

as such, the mapping information between virtual mmio and physical mmio is only
available at runtime.


Sorry, I do not know what the difference is between mmap() time and the time the
VM actually accesses the memory in your case. Could you please provide more detail?


Hi Guangrong,

Sure. The mmap() gets called by qemu or any VFIO API userspace consumer when
setting up the virtual mmio; at that moment nobody has any knowledge about how
the physical mmio gets virtualized.

When the vm (or application, if we don't want to limit ourselves to the vmm
term) starts, the virtual and physical mmio get mapped by the mpci kernel
module with help from the vendor-supplied mediated host driver, according to
the hw resources assigned to this vm / application.


Thanks for your explanation.

It sounds like a resource-allocation strategy: you delay the allocation until
the VM actually accesses it, right?


Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-04 Thread Xiao Guangrong



On 07/04/2016 03:48 PM, Paolo Bonzini wrote:



On 04/07/2016 09:37, Xiao Guangrong wrote:




It actually is a portion of the physical mmio which is set by vfio mmap.


So I do not think we need to care about its refcount, i.e., we can consider it
as a reserved pfn. Paolo?


nVidia provided me (offlist) with a simple patch that modified VFIO to
exhibit the problem, and it didn't use reserved PFNs.  This is why the
commit message for the patch is not entirely accurate.



It's clear now.


But apart from this, it's much more obvious to consider the refcount.
The x86 MMU code doesn't care if the page is reserved or not;
mmu_set_spte does a kvm_release_pfn_clean, hence it makes sense for
hva_to_pfn_remapped to try doing a get_page (via kvm_get_pfn) after
invoking the fault handler, just like the get_user_pages family of
functions does.


Well, it's a little strange that you always try to get a refcount for a PFNMAP
region without MIXEDMAP, which indicates that none of the memory in this region
has a 'struct page' backing.

But it works, since kvm_{get,release}_* are already aware of reserved pfns, so
I am okay with it.
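
For reference, the helpers referred to here look roughly like this (an
approximate rendering of that era's virt/kvm/kvm_main.c, shown only to make
the reserved-pfn point concrete):

/* from virt/kvm/kvm_main.c (approximate) */
void kvm_get_pfn(kvm_pfn_t pfn)
{
	if (!kvm_is_reserved_pfn(pfn))
		get_page(pfn_to_page(pfn));
}

void kvm_release_pfn_clean(kvm_pfn_t pfn)
{
	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
		put_page(pfn_to_page(pfn));
}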


Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-04 Thread Xiao Guangrong



On 07/04/2016 03:38 PM, Paolo Bonzini wrote:



On 04/07/2016 08:39, Xiao Guangrong wrote:

Why is the memory mapped by this mmap() not a portion of MMIO from the
underlying physical device? If it is valid system memory, does this interface
really need to be implemented in vfio? (You at least need to set VM_MIXEDMAP
if it mixes system memory with MMIO.)


The KVM code does not care if VM_MIXEDMAP is set or not, it works in
either case.


Yes, it is. I mean NVIDIA's vfio patchset should use VM_MIXEDMAP if the memory
is mixed. :)
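
To make the point concrete, a hedged sketch of what an mmap handler for a
mixed region might set; the names are illustrative and not taken from the
patchset:

#include <linux/fs.h>
#include <linux/mm.h>

/* Fault handler defined elsewhere, along the lines of the earlier sketch. */
extern const struct vm_operations_struct mdev_vm_ops;

static int mdev_mmap(struct file *file, struct vm_area_struct *vma)
{
	vma->vm_ops = &mdev_vm_ops;
	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
	/* The region mixes system RAM with MMIO, so advertise that. */
	vma->vm_flags |= VM_MIXEDMAP;
	return 0;
}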


Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-04 Thread Xiao Guangrong



On 07/04/2016 03:03 PM, Neo Jia wrote:

On Mon, Jul 04, 2016 at 02:39:22PM +0800, Xiao Guangrong wrote:



On 06/30/2016 09:01 PM, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.


Why does it require fetching the pfn when the fault is triggered rather
than when mmap() is called?


Hi Guangrong,

as such, the mapping information between virtual mmio and physical mmio is only
available at runtime.


Sorry, I do not know what the difference is between mmap() time and the time the
VM actually accesses the memory in your case. Could you please provide more detail?





Why is the memory mapped by this mmap() not a portion of MMIO from the
underlying physical device? If it is valid system memory, does this interface
really need to be implemented in vfio? (You at least need to set VM_MIXEDMAP
if it mixes system memory with MMIO.)



It actually is a portion of the physical mmio which is set by vfio mmap.


So I do not think we need to care about its refcount, i.e., we can consider it
as a reserved pfn. Paolo?




IIUC, the kernel assumes that a VM_PFNMAP region is contiguous memory, e.g.,
like current KVM and vaddr_get_pfn() in vfio, but it seems NVIDIA's patchset
breaks this semantic, as ops->validate_map_request() can adjust the physical
address arbitrarily. (Again, the name 'validate' should be changed to match
what it is really doing.)


The vgpu api will allow you to adjust the target mmio address and the size via
validate_map_request, but it is still physically contiguous as <start_pfn, size>.


Okay, the interface confused us; maybe it needs to be reworked to reflect this
fact.

Thanks!



Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-04 Thread Xiao Guangrong



On 06/30/2016 09:01 PM, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.


Why does it require fetching the pfn when the fault is triggered rather
than when mmap() is called?

Why is the memory mapped by this mmap() not a portion of MMIO from the
underlying physical device? If it is valid system memory, does this interface
really need to be implemented in vfio? (You at least need to set VM_MIXEDMAP
if it mixes system memory with MMIO.)

IIUC, the kernel assumes that a VM_PFNMAP region is contiguous memory, e.g.,
like current KVM and vaddr_get_pfn() in vfio, but it seems NVIDIA's patchset
breaks this semantic, as ops->validate_map_request() can adjust the physical
address arbitrarily. (Again, the name 'validate' should be changed to match
what it is really doing.)




Re: [PATCH 3/5] mmu: don't set the present bit unconditionally

2016-06-30 Thread Xiao Guangrong



On 06/29/2016 04:18 PM, Paolo Bonzini wrote:



On 29/06/2016 05:17, Xiao Guangrong wrote:


+++ b/arch/x86/kvm/mmu.c
@@ -2516,13 +2516,17 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
   gfn_t gfn, kvm_pfn_t pfn, bool speculative,
   bool can_unsync, bool host_writable)
   {
-u64 spte;
+u64 spte = 0;
   int ret = 0;
+struct kvm_mmu *context = &vcpu->arch.mmu;
+bool execonly = !(context->guest_rsvd_check.bad_mt_xwr &
+  (1ull << VMX_EPT_EXECUTABLE_MASK));


Could we introduce a new field, say execonly, to "struct kvm_mmu"?
That would make the code clearer.


Given how execonly is used, let's add shadow_present_mask instead.


Yup, it is better.
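
A standalone illustration of the shadow_present_mask idea; this mirrors the
naming but is not the upstream implementation. The present bits are chosen
once at MMU setup, so set_spte() no longer needs to know whether exec-only
EPT is in use.

#include <stdbool.h>
#include <stdint.h>

#define PT_PRESENT_MASK (1ULL << 0)

static uint64_t shadow_present_mask;

/* With exec-only EPT there is no single bit that must always be set. */
static void mmu_choose_present_mask(bool ept_execute_only)
{
	shadow_present_mask = ept_execute_only ? 0 : PT_PRESENT_MASK;
}

/* set_spte() would then start from this instead of hard-coding the bit. */
static uint64_t spte_initial_bits(void)
{
	return shadow_present_mask;
}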



Re: [PATCH 3/5] mmu: don't set the present bit unconditionally

2016-06-28 Thread Xiao Guangrong



On 06/28/2016 12:32 PM, Bandan Das wrote:

To support execute only mappings on behalf of L1
hypervisors, we teach set_spte() to honor L1's valid XWR
bits. This is only if host supports EPT execute only. Reuse
ACC_USER_MASK to signify if the L1 hypervisor has the R bit
set

Signed-off-by: Bandan Das 
---
  arch/x86/kvm/mmu.c | 9 +++--
  arch/x86/kvm/paging_tmpl.h | 2 +-
  arch/x86/kvm/vmx.c | 2 +-
  3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 875d4f7..ee2fb16 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2516,13 +2516,17 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
gfn_t gfn, kvm_pfn_t pfn, bool speculative,
bool can_unsync, bool host_writable)
  {
-   u64 spte;
+   u64 spte = 0;
int ret = 0;
+   struct kvm_mmu *context = &vcpu->arch.mmu;
+   bool execonly = !(context->guest_rsvd_check.bad_mt_xwr &
+ (1ull << VMX_EPT_EXECUTABLE_MASK));


Could we introduce a new field, say execonly, to "struct kvm_mmu"?
That would make the code clearer.


Re: [PATCH 2/5] mmu: pass execonly value when initializing rsvd bits

2016-06-28 Thread Xiao Guangrong



On 06/28/2016 12:32 PM, Bandan Das wrote:

In reset_tdp_shadow_zero_bits_mask, we always pass false
when initializing the reserved bits. By initializing with the
correct value of ept exec only, the host can correctly
identify if the guest pte is valid. Note that
kvm_init_shadow_ept_mmu() already knows about execonly.

Signed-off-by: Bandan Das 
---
  arch/x86/kvm/mmu.c | 12 
  1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a50af79..875d4f7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3831,23 +3831,27 @@ static inline bool boot_cpu_is_amd(void)

  /*
   * the direct page table on host, use as much mmu features as
- * possible, however, kvm currently does not do execution-protection.
+ * possible
   */
  static void
  reset_tdp_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
struct kvm_mmu *context)
  {


It is not necessary. reset_tdp_shadow_zero_bits_mask() is used for guests
without nested EPT; the host never sets shadow execute-only actively.

For nested EPT guests, kvm_init_shadow_ept_mmu() already handles the
exec-only case perfectly.


Re: [PATCH 1/5] mmu: mark spte present if the x bit is set

2016-06-28 Thread Xiao Guangrong



On 06/29/2016 04:49 AM, Paolo Bonzini wrote:



On 28/06/2016 22:37, Bandan Das wrote:

Paolo Bonzini  writes:


On 28/06/2016 19:33, Bandan Das wrote:

  static int is_shadow_present_pte(u64 pte)
  {
-   return pte & PT_PRESENT_MASK && !is_mmio_spte(pte);
+   return pte & (PT_PRESENT_MASK | shadow_x_mask) &&
+   !is_mmio_spte(pte);


This should really be pte & 7 when using EPT.  But this is okay as an
alternative to a new shadow_present_mask.

I could revive shadow_xonly_valid probably... Anyway, for now I will
add a TODO comment here.


It's okay to do it like this, because the only invalid PTEs reaching this
point are those that is_mmio_spte filters away.  Hence you'll never get
-W- PTEs here, and pte & 7 is really the same as how you wrote it.  It's
pretty clever, and doesn't need a TODO at all. :)


Thanks, understood. So, the way it is written now covers all cases for
pte & 7. Let's still add a comment - clever things are usually
confusing to many!


I think another way to write it is "(pte & 0xffffffffull) &&
!is_mmio_spte(pte)", since non-present/non-MMIO SPTEs never use bits
1..31 (they can have non-zero bits 32..63 on 32-bit CPUs where we don't
update the PTEs atomically).  Guangrong, what do you prefer?


I think the way you innovated is better. :)
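
The equivalent forms discussed above, written out as a standalone sketch
(EPT bit layout assumed: bit 0 = R, bit 1 = W, bit 2 = X; MMIO SPTEs are
assumed to have been filtered out already, so -W- never occurs):

#include <stdint.h>

#define PT_PRESENT_MASK (1ULL << 0)

/* Form used in the patch: present or executable. */
static int present_a(uint64_t pte, uint64_t shadow_x_mask)
{
	return (pte & (PT_PRESENT_MASK | shadow_x_mask)) != 0;
}

/* "pte & 7" form for EPT: any of R/W/X set. */
static int present_b(uint64_t pte)
{
	return (pte & 7) != 0;
}

/* Low-32-bit form: non-present/non-MMIO SPTEs never use bits 1..31. */
static int present_c(uint64_t pte)
{
	return (pte & 0xffffffffull) != 0;
}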


Re: [PATCH 1/4] KVM: MMU: fix permission_fault()

2016-04-06 Thread Xiao Guangrong



On 04/06/2016 04:56 PM, Paolo Bonzini wrote:



On 25/03/2016 14:19, Xiao Guangrong wrote:

@@ -193,11 +193,11 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - 
PT_USER_SHIFT));


One more tweak is needed in the line above; pfec - 1 must become pfec &
~1, because you've removed the

pfec |= PFERR_PRESENT_MASK;

line.  Applied with this change.


Yes, indeed. Thanks for your fix.


Re: [PATCH 1/4] KVM: MMU: fix permission_fault()

2016-04-05 Thread Xiao Guangrong



On 03/30/2016 02:39 PM, Xiao Guangrong wrote:



On 03/30/2016 02:36 PM, Paolo Bonzini wrote:



On 30/03/2016 03:56, Xiao Guangrong wrote:

x86/access.flat is currently using the "other" definition, i.e., PFEC.PK
is only set if W=1 or CR0.WP=0 && PFEC.U=0 or PFEC.W=0.  Can you use it
(with ept=1 of course) to check what the processor is doing?


Sure.

And with ept=1 it is hard to trigger MMU issues, so I am enabling PKEY on the
shadow MMU; let's see what will happen. ;)


No, don't do that!

ept=1 lets you test what the processor does.  It means you cannot test
permission_fault(), but what we want here is just reverse engineering
the microcode.  ept=1 lets you do exactly that.


Yes, I got this point. Huaitong will do the test once the machine is free.


I tested it and it failed:

test pte.p pte.user pde.p pde.user pde.a pde.pse pkru.wd pkey=1 user write efer.nx cr4.pke: FAIL: 
error code 27 expected 7

Dump mapping: address: 0x1234
--L4: 2ebe007
--L3: 2ebf007
--L2: 80002a5

So PFEC.PKEY is set even if the ordinary check fails (caused by pde.rw = 0);
the KVM code is right. :)


Re: [PATCH 1/4] KVM: MMU: fix permission_fault()

2016-03-30 Thread Xiao Guangrong



On 03/30/2016 02:36 PM, Paolo Bonzini wrote:



On 30/03/2016 03:56, Xiao Guangrong wrote:

x86/access.flat is currently using the "other" definition, i.e., PFEC.PK
is only set if W=1 or CR0.WP=0 && PFEC.U=0 or PFEC.W=0.  Can you use it
(with ept=1 of course) to check what the processor is doing?


Sure.

And with ept=1 it is hard to trigger MMU issues, so I am enabling PKEY on the
shadow MMU; let's see what will happen. ;)


No, don't do that!

ept=1 lets you test what the processor does.  It means you cannot test
permission_fault(), but what we want here is just reverse engineering
the microcode.  ept=1 lets you do exactly that.


Yes, I got this point. Huaitong will do the test once the machine is free.


Re: [PATCH 1/4] KVM: MMU: fix permission_fault()

2016-03-29 Thread Xiao Guangrong



On 03/30/2016 04:09 AM, Paolo Bonzini wrote:



On 29/03/2016 19:43, Xiao Guangrong wrote:

Based on the SDM:
PK flag (bit 5).
This flag is 1 if (1) IA32_EFER.LMA = CR4.PKE = 1; (2) the access
causing the page-fault exception was a data access; (3) the linear
address was a user-mode address with protection key i; and (5) the PKRU
register (see Section 4.6.2) is such that either (a) ADi = 1; or (b) the
following all hold: (i) WDi = 1; (ii) the access is a write access; and
(iii) either CR0.WP = 1 or the access causing the page-fault exception
was a user-mode access.

So I think the PKEY check and the ordinary check are independent, i.e.,
PFEC.PKEY may be set even if the permission on the page table is not adequate.


x86/access.flat is currently using the "other" definition, i.e., PFEC.PK
is only set if W=1 or CR0.WP=0 && PFEC.U=0 or PFEC.W=0.  Can you use it
(with ept=1 of course) to check what the processor is doing?



Sure.

And with ept=1 it is hard to trigger MMU issues, so I am enabling PKEY on the
shadow MMU; let's see what will happen. ;)


Re: [PATCH 1/4] KVM: MMU: fix permission_fault()

2016-03-29 Thread Xiao Guangrong



On 03/25/2016 10:21 PM, Paolo Bonzini wrote:



On 25/03/2016 14:19, Xiao Guangrong wrote:

WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
-   pfec |= PFERR_PRESENT_MASK;
+   errcode = PFERR_PRESENT_MASK;

if (unlikely(mmu->pkru_mask)) {
u32 pkru_bits, offset;
@@ -193,11 +193,11 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - 
PT_USER_SHIFT));

pkru_bits &= mmu->pkru_mask >> offset;
-   pfec |= -pkru_bits & PFERR_PK_MASK;
+   errcode |= -pkru_bits & PFERR_PK_MASK;
fault |= (pkru_bits != 0);
}

-   return -(uint32_t)fault & pfec;
+   return -(uint32_t)fault & errcode;
  }


I have another doubt here.

If you get a fault due to U=0, you would not get PFERR_PK_MASK.  This
is checked implicitly through the pte_user bit which we moved to
PFERR_RSVD_BIT.  However, if you get a fault due to W=0 _and_
PKRU.AD=1 or PKRU.WD=1 for the page's protection key, would the PK
bit be set in the error code?  If not, we would need something like
this:


Based on the SDM:
PK flag (bit 5).
This flag is 1 if (1) IA32_EFER.LMA = CR4.PKE = 1; (2) the access causing the page-fault exception 
was a data access; (3) the linear address was a user-mode address with protection key i; and (5) the 
PKRU register (see Section 4.6.2) is such that either (a) ADi = 1; or (b) the following all hold: 
(i) WDi = 1; (ii) the access is a write access; and (iii) either CR0.WP = 1 or the access causing 
the page-fault exception was a user-mode access.


So I think the PKEY check and the ordinary check are independent, i.e.,
PFEC.PKEY may be set even if the permission on the page table is not adequate.
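
The SDM conditions quoted above, transcribed as a boolean expression (field
names are illustrative, not KVM code); note that nothing in it depends on the
ordinary R/W/U permission check succeeding or failing:

#include <stdbool.h>

struct pk_fault_ctx {
	bool lma, pke;		/* IA32_EFER.LMA, CR4.PKE */
	bool data_access;	/* not an instruction fetch */
	bool user_addr;		/* linear address is a user-mode address */
	bool ad, wd;		/* PKRU.ADi, PKRU.WDi for the page's key i */
	bool write;		/* the access is a write */
	bool wp;		/* CR0.WP */
	bool user_access;	/* the access was a user-mode access */
};

static bool pfec_pk_bit(const struct pk_fault_ctx *c)
{
	return c->lma && c->pke && c->data_access && c->user_addr &&
	       (c->ad || (c->wd && c->write && (c->wp || c->user_access)));
}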



Re: [PATCH 3/4] KVM: MMU: reduce the size of mmu_page_path

2016-03-25 Thread Xiao Guangrong



On 03/25/2016 09:56 PM, Paolo Bonzini wrote:



On 25/03/2016 14:48, Xiao Guangrong wrote:




This patch and the previous one are basically redoing commit
0a47cd85833e ("KVM: MMU: Fix ubsan warnings", 2016-03-04).  While you
find your version easier to understand, I of course find mine easier.

Rather than getting stuck in a ko fight, the solution is to stick with
the code in KVM and add comments.  I'll give it a try...


If you do not like this one, we can just make .idx have [PT64_ROOT_LEVEL - 1]
entries and keep the sentinel in .parent[]; that is a small change with a nice
code shape.


I suppose you'd have something like this then:

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 70e95d097ef1..15e1735a2e3a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1980,7 +1980,7 @@ static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t 
gfn,

  struct mmu_page_path {
struct kvm_mmu_page *parent[PT64_ROOT_LEVEL];
-   unsigned int idx[PT64_ROOT_LEVEL];
+   unsigned int idx[PT64_ROOT_LEVEL-1];
  };

  #define for_each_sp(pvec, sp, parents, i) \
@@ -2037,13 +2037,14 @@ static void mmu_pages_clear_parents(struct 
mmu_page_path *parents)
  {
struct kvm_mmu_page *sp;
unsigned int level = 0;
+   unsigned int idx;

do {
-   unsigned int idx = parents->idx[level];
sp = parents->parent[level];
-   if (!sp)
+   if (!sp || WARN_ON(level == PT64_ROOT_LEVEL-1))
return;

+   idx = parents->idx[level];
WARN_ON(idx == INVALID_INDEX);
clear_unsync_child_bit(sp, idx);
level++;



Yes, exactly.

[ actually, we can keep mmu_pages_clear_parents() unchanged ]


By making the arrays the same size, the effect of the sentinel seems
clearer to me.  It doesn't seem worth 4 bytes (and strictly speaking
those 4 bytes would be there anyway due to padding)...


The sentinel is NULL forever, so it can never reach the inner loop anyway...

Okay, I do not have a strong opinion on it; it is not a big deal. Let's
happily drop it if you really dislike it. :)


Re: [PATCH 3/4] KVM: MMU: reduce the size of mmu_page_path

2016-03-25 Thread Xiao Guangrong



On 03/25/2016 09:45 PM, Paolo Bonzini wrote:



On 25/03/2016 14:19, Xiao Guangrong wrote:

Currently only PT64_ROOT_LEVEL - 1 levels are used; one additional entry
in .parent[] is used as a sentinel, and the additional entry in .idx[] is
purely wasted.

This patch reduces its size and sets the sentinel one level above the
place where we start from.


This patch and the previous one are basically redoing commit
0a47cd85833e ("KVM: MMU: Fix ubsan warnings", 2016-03-04).  While you
find your version easier to understand, I of course find mine easier.

Rather than getting stuck in a ko fight, the solution is to stick with
the code in KVM and add comments.  I'll give it a try...


If you do not like this one, we can just make .idx have [PT64_ROOT_LEVEL - 1]
entries and keep the sentinel in .parent[]; that is a small change with a nice
code shape.



Re: [PATCH 1/4] KVM: MMU: fix permission_fault()

2016-03-25 Thread Xiao Guangrong



On 03/25/2016 09:35 PM, Paolo Bonzini wrote:



On 25/03/2016 14:19, Xiao Guangrong wrote:

kvm-unit-tests complained that the PFEC is not set properly, e.g.:
test pte.rw pte.d pte.nx pde.p pde.rw pde.pse user fetch: FAIL: error code 15
expected 5
Dump mapping: address: 0x1234
--L4: 3e95007
--L3: 3e96007
--L2: 283


What's the command line for the reproducer?


QEMU=/home/eric/qemu/x86_64-softmmu/qemu-system-x86_64 ./x86-run x86/access.flat




diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b70df72..81bffd1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -154,7 +154,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
  unsigned pfec)
  {
int cpl = kvm_x86_ops->get_cpl(vcpu);
-   unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
+   unsigned long errcode, rflags = kvm_x86_ops->get_rflags(vcpu);

/*
 * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1.
@@ -175,7 +175,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
bool fault = (mmu->permissions[index] >> pte_access) & 1;

WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
-   pfec |= PFERR_PRESENT_MASK;
+   errcode = PFERR_PRESENT_MASK;


So is this patch doing the same as "KVM: MMU: precompute page fault
error code"?  It was necessary after all. :)


Sorry for my mistake... I missed the logic you changed :(

I still prefer calculating the error code on the fault path, which is rare, or
thinking of a way to encapsulate it in permission_fault()...


[PATCH 3/4] KVM: MMU: reduce the size of mmu_page_path

2016-03-25 Thread Xiao Guangrong
Currently only PT64_ROOT_LEVEL - 1 levels are used; one additional entry
in .parent[] is used as a sentinel, and the additional entry in .idx[] is
purely wasted.

This patch reduces its size and sets the sentinel one level above the
place where we start from.

Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 arch/x86/kvm/mmu.c | 32 
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e273144..c396e8b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1984,12 +1984,12 @@ static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t 
gfn,
 }
 
 struct mmu_page_path {
-   struct kvm_mmu_page *parent[PT64_ROOT_LEVEL];
-   unsigned int idx[PT64_ROOT_LEVEL];
+   struct kvm_mmu_page *parent[PT64_ROOT_LEVEL - 1];
+   unsigned int idx[PT64_ROOT_LEVEL - 1];
 };
 
 #define for_each_sp(pvec, sp, parents, i)  \
-   for (i = mmu_pages_first(&pvec, &parents);  \
+   for (i = mmu_pages_next(&pvec, &parents, -1);   \
i < pvec.nr && ({ sp = pvec.page[i].sp; 1;});   \
i = mmu_pages_next(&pvec, &parents, i))
 
@@ -2016,25 +2016,15 @@ static int mmu_pages_next(struct kvm_mmu_pages *pvec,
return n;
 }
 
-static int mmu_pages_first(struct kvm_mmu_pages *pvec,
-  struct mmu_page_path *parents)
+static void
+mmu_pages_init(struct mmu_page_path *parents, struct kvm_mmu_page *parent)
 {
-   struct kvm_mmu_page *sp;
-   int level;
-
-   if (pvec->nr == 0)
-   return 0;
-
-   sp = pvec->page[0].sp;
-   level = sp->role.level;
-   WARN_ON(level == PT_PAGE_TABLE_LEVEL);
-
/*
-* Also set up a sentinel. Further entries in pvec are all
-* children of sp, so this element is never overwritten.
+* set up a sentinel. Further entries in pvec are all children of
+* sp, so this element is never overwritten.
 */
-   parents->parent[level - 1] = NULL;
-   return mmu_pages_next(pvec, parents, -1);
+   if (parent->role.level < PT64_ROOT_LEVEL)
+   parents->parent[parent->role.level - 1] = NULL;
 }
 
 static void mmu_pages_clear_parents(struct mmu_page_path *parents)
@@ -2051,7 +2041,7 @@ static void mmu_pages_clear_parents(struct mmu_page_path 
*parents)
WARN_ON(idx == INVALID_INDEX);
clear_unsync_child_bit(sp, idx);
level++;
-   } while (!sp->unsync_children);
+   } while (!sp->unsync_children && (level < PT64_ROOT_LEVEL - 1));
 }
 
 static void mmu_sync_children(struct kvm_vcpu *vcpu,
@@ -2064,6 +2054,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
LIST_HEAD(invalid_list);
bool flush = false;
 
+   mmu_pages_init(&parents, parent);
while (mmu_unsync_walk(parent, &pages)) {
bool protected = false;
 
@@ -2335,6 +2326,7 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
if (parent->role.level == PT_PAGE_TABLE_LEVEL)
return 0;
 
+   mmu_pages_init(&parents, parent);
while (mmu_unsync_walk(parent, &pages)) {
struct kvm_mmu_page *sp;
 
-- 
1.8.3.1



[PATCH 2/4] KVM: MMU: simplify the logic of __mmu_unsync_walk()

2016-03-25 Thread Xiao Guangrong
Each time I looked into the logic of walking unsync shadow pages, it cost
lots of time to understand what it is doing. The trick of this logic is
that the item (sp and idx) saved to kvm_mmu_pages is the sp and the index
in the _parent_ level, and it lacks any comment to explain this fact.

This patch simplifies it by saving the sp and its own index to kvm_mmu_pages;
then it is much easier to understand the operations on the index.

Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 arch/x86/kvm/mmu.c | 40 ++--
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6bdfbc2..e273144 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1830,6 +1830,8 @@ static inline void clear_unsync_child_bit(struct 
kvm_mmu_page *sp, int idx)
__clear_bit(idx, sp->unsync_child_bitmap);
 }
 
+#define INVALID_INDEX (-1)
+
 static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
   struct kvm_mmu_pages *pvec)
 {
@@ -1846,10 +1848,10 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
 
child = page_header(ent & PT64_BASE_ADDR_MASK);
 
-   if (child->unsync_children) {
-   if (mmu_pages_add(pvec, child, i))
-   return -ENOSPC;
+   if (mmu_pages_add(pvec, sp, i))
+   return -ENOSPC;
 
+   if (child->unsync_children) {
ret = __mmu_unsync_walk(child, pvec);
if (!ret) {
clear_unsync_child_bit(sp, i);
@@ -1860,7 +1862,13 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
return ret;
} else if (child->unsync) {
nr_unsync_leaf++;
-   if (mmu_pages_add(pvec, child, i))
+
+   /*
+* the unsync is on the last level so its 'idx' is
+* useless, we set it to INVALID_INDEX to catch
+* potential bugs.
+*/
+   if (mmu_pages_add(pvec, child, INVALID_INDEX))
return -ENOSPC;
} else
clear_unsync_child_bit(sp, i);
@@ -1869,8 +1877,6 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
return nr_unsync_leaf;
 }
 
-#define INVALID_INDEX (-1)
-
 static int mmu_unsync_walk(struct kvm_mmu_page *sp,
   struct kvm_mmu_pages *pvec)
 {
@@ -1878,7 +1884,6 @@ static int mmu_unsync_walk(struct kvm_mmu_page *sp,
if (!sp->unsync_children)
return 0;
 
-   mmu_pages_add(pvec, sp, INVALID_INDEX);
return __mmu_unsync_walk(sp, pvec);
 }
 
@@ -1994,16 +1999,18 @@ static int mmu_pages_next(struct kvm_mmu_pages *pvec,
 {
int n;
 
-   for (n = i+1; n < pvec->nr; n++) {
+   for (n = i + 1; n < pvec->nr; n++) {
struct kvm_mmu_page *sp = pvec->page[n].sp;
unsigned idx = pvec->page[n].idx;
int level = sp->role.level;
 
-   parents->idx[level-1] = idx;
-   if (level == PT_PAGE_TABLE_LEVEL)
+   if (level == PT_PAGE_TABLE_LEVEL) {
+   WARN_ON(idx != INVALID_INDEX);
break;
+   }
 
-   parents->parent[level-2] = sp;
+   parents->idx[level - 2] = idx;
+   parents->parent[level - 2] = sp;
}
 
return n;
@@ -2018,19 +2025,16 @@ static int mmu_pages_first(struct kvm_mmu_pages *pvec,
if (pvec->nr == 0)
return 0;
 
-   WARN_ON(pvec->page[0].idx != INVALID_INDEX);
-
sp = pvec->page[0].sp;
level = sp->role.level;
WARN_ON(level == PT_PAGE_TABLE_LEVEL);
 
-   parents->parent[level-2] = sp;
-
-   /* Also set up a sentinel.  Further entries in pvec are all
+   /*
+* Also set up a sentinel. Further entries in pvec are all
 * children of sp, so this element is never overwritten.
 */
-   parents->parent[level-1] = NULL;
-   return mmu_pages_next(pvec, parents, 0);
+   parents->parent[level - 1] = NULL;
+   return mmu_pages_next(pvec, parents, -1);
 }
 
 static void mmu_pages_clear_parents(struct mmu_page_path *parents)
-- 
1.8.3.1



[PATCH 3/4] KVM: MMU: reduce the size of mmu_page_path

2016-03-25 Thread Xiao Guangrong
Currently only PT64_ROOT_LEVEL - 1 levels are used; one additional entry
in .parent[] is used as a sentinel, and the additional entry in .idx[] is
purely wasted.

This patch reduces the size of mmu_page_path and sets the sentinel one
level above the level where the walk starts.

Signed-off-by: Xiao Guangrong 
---
 arch/x86/kvm/mmu.c | 32 
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e273144..c396e8b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1984,12 +1984,12 @@ static bool kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn,
 }
 
 struct mmu_page_path {
-   struct kvm_mmu_page *parent[PT64_ROOT_LEVEL];
-   unsigned int idx[PT64_ROOT_LEVEL];
+   struct kvm_mmu_page *parent[PT64_ROOT_LEVEL - 1];
+   unsigned int idx[PT64_ROOT_LEVEL - 1];
 };
 
 #define for_each_sp(pvec, sp, parents, i)  \
-   for (i = mmu_pages_first(&pvec, &parents);  \
+   for (i =  mmu_pages_next(&pvec, &parents, -1);  \
i < pvec.nr && ({ sp = pvec.page[i].sp; 1;});   \
i = mmu_pages_next(&pvec, &parents, i))
 
@@ -2016,25 +2016,15 @@ static int mmu_pages_next(struct kvm_mmu_pages *pvec,
return n;
 }
 
-static int mmu_pages_first(struct kvm_mmu_pages *pvec,
-  struct mmu_page_path *parents)
+static void
+mmu_pages_init(struct mmu_page_path *parents, struct kvm_mmu_page *parent)
 {
-   struct kvm_mmu_page *sp;
-   int level;
-
-   if (pvec->nr == 0)
-   return 0;
-
-   sp = pvec->page[0].sp;
-   level = sp->role.level;
-   WARN_ON(level == PT_PAGE_TABLE_LEVEL);
-
/*
-* Also set up a sentinel. Further entries in pvec are all
-* children of sp, so this element is never overwritten.
+* set up a sentinel. Further entries in pvec are all children of
+* sp, so this element is never overwritten.
 */
-   parents->parent[level - 1] = NULL;
-   return mmu_pages_next(pvec, parents, -1);
+   if (parent->role.level < PT64_ROOT_LEVEL)
+   parents->parent[parent->role.level - 1] = NULL;
 }
 
 static void mmu_pages_clear_parents(struct mmu_page_path *parents)
@@ -2051,7 +2041,7 @@ static void mmu_pages_clear_parents(struct mmu_page_path *parents)
WARN_ON(idx == INVALID_INDEX);
clear_unsync_child_bit(sp, idx);
level++;
-   } while (!sp->unsync_children);
+   } while (!sp->unsync_children && (level < PT64_ROOT_LEVEL - 1));
 }
 
 static void mmu_sync_children(struct kvm_vcpu *vcpu,
@@ -2064,6 +2054,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
LIST_HEAD(invalid_list);
bool flush = false;
 
+   mmu_pages_init(&parents, parent);
while (mmu_unsync_walk(parent, &pages)) {
bool protected = false;
 
@@ -2335,6 +2326,7 @@ static int mmu_zap_unsync_children(struct kvm *kvm,
if (parent->role.level == PT_PAGE_TABLE_LEVEL)
return 0;
 
+   mmu_pages_init(&parents, parent);
while (mmu_unsync_walk(parent, &pages)) {
struct kvm_mmu_page *sp;
 
-- 
1.8.3.1



[PATCH 1/4] KVM: MMU: fix permission_fault()

2016-03-25 Thread Xiao Guangrong
kvm-unit-tests complained that the PFEC was not set properly, e.g.:
test pte.rw pte.d pte.nx pde.p pde.rw pde.pse user fetch: FAIL: error code 15 expected 5
Dump mapping: address: 0x1234
--L4: 3e95007
--L3: 3e96007
--L2: 283

It is caused by the PFEC returned to the guest being copied from the PFEC
triggered by the shadow page table.

This patch fixes it and makes the logic of updating errcode cleaner.

Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 arch/x86/kvm/mmu.h | 8 
 arch/x86/kvm/paging_tmpl.h | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b70df72..81bffd1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -154,7 +154,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
  unsigned pfec)
 {
int cpl = kvm_x86_ops->get_cpl(vcpu);
-   unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
+   unsigned long errcode, rflags = kvm_x86_ops->get_rflags(vcpu);
 
/*
 * If CPL < 3, SMAP prevention are disabled if EFLAGS.AC = 1.
@@ -175,7 +175,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
bool fault = (mmu->permissions[index] >> pte_access) & 1;
 
WARN_ON(pfec & (PFERR_PK_MASK | PFERR_RSVD_MASK));
-   pfec |= PFERR_PRESENT_MASK;
+   errcode = PFERR_PRESENT_MASK;
 
if (unlikely(mmu->pkru_mask)) {
u32 pkru_bits, offset;
@@ -193,11 +193,11 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
((pte_access & PT_USER_MASK) << (PFERR_RSVD_BIT - PT_USER_SHIFT));
 
pkru_bits &= mmu->pkru_mask >> offset;
-   pfec |= -pkru_bits & PFERR_PK_MASK;
+   errcode |= -pkru_bits & PFERR_PK_MASK;
fault |= (pkru_bits != 0);
}
 
-   return -(uint32_t)fault & pfec;
+   return -(uint32_t)fault & errcode;
 }
 
 void kvm_mmu_invalidate_zap_all_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 1d971c7..bc019f7 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -360,7 +360,7 @@ retry_walk:
goto error;
 
if (unlikely(is_rsvd_bits_set(mmu, pte, walker->level))) {
-   errcode |= PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
+   errcode = PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
goto error;
}
 
-- 
1.8.3.1



[PATCH 4/4] KVM: MMU: skip obsolete sp in for_each_gfn_*()

2016-03-25 Thread Xiao Guangrong
An obsolete sp should not be used on current vCPUs and should not hurt a
vCPU's running, so skip it in for_each_gfn_sp() and
for_each_gfn_indirect_valid_sp().

The side effect is that we will double-check role.invalid in
kvm_mmu_get_page(), but I think that is okay as role is well cached.

Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 arch/x86/kvm/mmu.c | 28 
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c396e8b..4d66a9e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1906,18 +1906,17 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
  * since it has been deleted from active_mmu_pages but still can be found
  * at hast list.
  *
- * for_each_gfn_indirect_valid_sp has skipped that kind of page and
- * kvm_mmu_get_page(), the only user of for_each_gfn_sp(), has skipped
- * all the obsolete pages.
+ * for_each_gfn_valid_sp() has skipped that kind of pages.
  */
-#define for_each_gfn_sp(_kvm, _sp, _gfn)   \
+#define for_each_gfn_valid_sp(_kvm, _sp, _gfn) \
hlist_for_each_entry(_sp,   \
  &(_kvm)->arch.mmu_page_hash[kvm_page_table_hashfn(_gfn)], hash_link) \
-   if ((_sp)->gfn != (_gfn)) {} else
+   if ((_sp)->gfn != (_gfn) || is_obsolete_sp((_kvm), (_sp)) \
+   || (_sp)->role.invalid) {} else
 
 #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn)    \
-   for_each_gfn_sp(_kvm, _sp, _gfn)\
-   if ((_sp)->role.direct || (_sp)->role.invalid) {} else
+   for_each_gfn_valid_sp(_kvm, _sp, _gfn)  \
+   if ((_sp)->role.direct) {} else
 
 /* @sp->gfn should be write-protected at the call site */
 static bool __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
@@ -1958,6 +1957,11 @@ static void kvm_mmu_audit(struct kvm_vcpu *vcpu, int point) { }
 static void mmu_audit_disable(void) { }
 #endif
 
+static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+   return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
+}
+
 static bool kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 struct list_head *invalid_list)
 {
@@ -2092,11 +2096,6 @@ static void clear_sp_write_flooding_count(u64 *spte)
__clear_sp_write_flooding_count(sp);
 }
 
-static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
-   return unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
-}
-
 static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 gfn_t gfn,
 gva_t gaddr,
@@ -2123,10 +2122,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
role.quadrant = quadrant;
}
-   for_each_gfn_sp(vcpu->kvm, sp, gfn) {
-   if (is_obsolete_sp(vcpu->kvm, sp))
-   continue;
-
+   for_each_gfn_valid_sp(vcpu->kvm, sp, gfn) {
if (!need_sync && sp->unsync)
need_sync = true;
 
-- 
1.8.3.1



Re: [PATCH] KVM: page_track: fix access to NULL slot

2016-03-22 Thread Xiao Guangrong



On 03/23/2016 12:40 AM, Paolo Bonzini wrote:

This happens when doing the reboot test from virt-tests:

[  131.833653] BUG: unable to handle kernel NULL pointer dereference at 
  (null)
[  131.842461] IP: [] kvm_page_track_is_active+0x17/0x60 [kvm]
[  131.850500] PGD 0
[  131.852763] Oops:  [#1] SMP
[  132.007188] task: 880075fbc500 ti: 880850a3c000 task.ti: 
880850a3c000
[  132.138891] Call Trace:
[  132.141639]  [] page_fault_handle_page_track+0x31/0x40 
[kvm]
[  132.149732]  [] paging64_page_fault+0xff/0x910 [kvm]
[  132.172159]  [] kvm_mmu_page_fault+0x64/0x110 [kvm]
[  132.179372]  [] handle_exception+0x1b2/0x430 [kvm_intel]
[  132.187072]  [] vmx_handle_exit+0x1e1/0xc50 [kvm_intel]
...


Thank you for fixing it, Paolo!

Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>


Re: [PATCH 1/1] KVM: don't allow irq_fpu_usable when the VCPU's XCR0 is loaded

2016-03-15 Thread Xiao Guangrong



On 03/16/2016 03:32 AM, Paolo Bonzini wrote:



On 15/03/2016 19:27, Andy Lutomirski wrote:

On Mon, Mar 14, 2016 at 6:17 AM, Paolo Bonzini  wrote:



On 11/03/2016 22:33, David Matlack wrote:

Is this better than just always keeping the host's XCR0 loaded outside
if the KVM interrupts-disabled region?


Probably not. AFAICT KVM does not rely on it being loaded outside that
region. xsetbv isn't insanely expensive, is it? Maybe to minimize the
time spent with interrupts disabled it was put outside.

I do like that your solution would be contained to KVM.


I agree with Andy.  We do want a fix for recent kernels because of the
!eager_fpu case that Guangrong mentioned.


Relying on interrupts is not easy, as XCR0 cannot be automatically saved/loaded
by the VMCS... Once an interrupt happens, it will use the guest's XCR0 anyway.



Re: [PATCH 0/1] KVM: x86: using the fpu in interrupt context with a guest's xcr0

2016-03-15 Thread Xiao Guangrong



On 03/16/2016 03:01 AM, David Matlack wrote:

On Mon, Mar 14, 2016 at 12:46 AM, Xiao Guangrong
<guangrong.x...@linux.intel.com> wrote:



On 03/12/2016 04:47 AM, David Matlack wrote:


I have not been able to trigger this bug on Linux 4.3, and suspect
it is due to this commit from Linux 4.2:

653f52c kvm,x86: load guest FPU context more eagerly

With this commit, as long as the host is using eagerfpu, the guest's
fpu is always loaded just before the guest's xcr0 (vcpu->fpu_active
is always 1 in the following snippet):

6569 if (vcpu->fpu_active)
6570 kvm_load_guest_fpu(vcpu);
6571 kvm_load_guest_xcr0(vcpu);

When the guest's fpu is loaded, irq_fpu_usable() returns false.



Er, I did not realize that commit introduced this change.



We've included our workaround for this bug, which applies to Linux 3.11.
It does not apply cleanly to HEAD since the fpu subsystem was refactored
in Linux 4.2. While the latest kernel does not look vulnerable, we may
want to apply a fix to the vulnerable stable kernels.



Is the latest kvm safe if we use !eager fpu?


Yes I believe so. When !eagerfpu, interrupted_kernel_fpu_idle()
returns "!current->thread.fpu.fpregs_active && (read_cr0() &
X86_CR0_TS)". This should ensure the interrupt handler never does
XSAVE/XRSTOR with the guest's xcr0.



interrupted_kernel_fpu_idle() returns true if the KVM-based hypervisor (e.g. QEMU)
is not using the FPU. That cannot stop the handler from using the FPU.
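
For reference, a minimal sketch (not taken from the thread) of the !eagerfpu
check David quoted above, paraphrased for illustration; field names such as
fpregs_active differ across kernel versions:

	/*
	 * Illustration only: the !eagerfpu predicate quoted above.  It only
	 * asks whether the interrupted task owns the FPU; it says nothing
	 * about which xcr0 value is currently loaded.
	 */
	static bool interrupted_kernel_fpu_idle_sketch(void)
	{
		return !current->thread.fpu.fpregs_active &&
		       (read_cr0() & X86_CR0_TS);
	}

If the interrupted host thread is not marked as using the FPU (and CR0.TS is
set), this returns true, irq_fpu_usable() allows FPU use, and the handler may
issue XSAVE/XRSTOR with the guest's xcr0 still loaded, which is the objection
raised above.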


Re: [PATCH 0/1] KVM: x86: using the fpu in interrupt context with a guest's xcr0

2016-03-14 Thread Xiao Guangrong



On 03/12/2016 04:47 AM, David Matlack wrote:


I have not been able to trigger this bug on Linux 4.3, and suspect
it is due to this commit from Linux 4.2:

653f52c kvm,x86: load guest FPU context more eagerly

With this commit, as long as the host is using eagerfpu, the guest's
fpu is always loaded just before the guest's xcr0 (vcpu->fpu_active
is always 1 in the following snippet):

6569 if (vcpu->fpu_active)
6570 kvm_load_guest_fpu(vcpu);
6571 kvm_load_guest_xcr0(vcpu);

When the guest's fpu is loaded, irq_fpu_usable() returns false.


Er, I did not realize that commit introduced this change.



We've included our workaround for this bug, which applies to Linux 3.11.
It does not apply cleanly to HEAD since the fpu subsystem was refactored
in Linux 4.2. While the latest kernel does not look vulnerable, we may
want to apply a fix to the vulnerable stable kernels.


Is the latest kvm safe if we use !eager fpu? In that case kvm_load_guest_fpu()
is not called on every single VM-enter, which means the kernel will use the
guest's xcr0 to save/restore the XSAVE area.

Maybe a simpler fix is to just call __kernel_fpu_begin() when the CPU switches
to the vCPU and revert it when the vCPU is scheduled out or returns to
userspace (a rough sketch of this idea follows).
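
A minimal sketch of that suggestion, assuming a kernel of this era where
__kernel_fpu_begin()/__kernel_fpu_end() are available; the two wrappers and
their hook points are illustrative only, not a posted patch:

	/*
	 * Sketch of the idea above.  The intent: while the guest's xcr0 may be
	 * loaded, mark the FPU as in use by the kernel so that irq_fpu_usable()
	 * fails and interrupt handlers stay away from XSAVE/XRSTOR.
	 */
	static void vcpu_claim_fpu(void)	/* when the CPU switches to the vCPU */
	{
		__kernel_fpu_begin();
	}

	static void vcpu_release_fpu(void)	/* on sched-out or return to userspace */
	{
		__kernel_fpu_end();
	}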



Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()

2016-03-10 Thread Xiao Guangrong



On 03/11/2016 01:07 AM, Paolo Bonzini wrote:

On 09/03/2016 08:18, Lan Tianyu wrote:

How about the following comments.

Log for kvm_mmu_commit_zap_page()
/*
 * We need to make sure everyone sees our modifications to
 * the page tables and see changes to vcpu->mode here.


Please mention that this pairs with vcpu_enter_guest and
walk_shadow_page_lockless_begin/end.


The
 * barrier in the kvm_flush_remote_tlbs() helps us to achieve
 * these. Otherwise, wait for all vcpus to exit guest mode
 * and/or lockless shadow page table walks.
 */
kvm_flush_remote_tlbs(kvm);


The rest of the comment is okay, but please replace "Otherwise" with "In
addition, we need to".


Log for kvm_flush_remote_tlbs()
/*
 * We want to publish modifications to the page tables before
 * reading mode. Pairs with a memory barrier in arch-specific
 * code.
 * - x86: smp_mb__after_srcu_read_unlock in vcpu_enter_guest.


... and smp_mb in walk_shadow_page_lockless_begin/end.


 * - powerpc: smp_mb in kvmppc_prepare_to_enter.
 */
 smp_mb__before_atomic();


The comment looks good, but the smp_mb__before_atomic() is not needed.
As mentioned in the reply to Guangrong, only a smp_load_acquire is required.

So the comment should say something like "There is already an smp_mb()
before kvm_make_all_cpus_request reads vcpu->mode.  We reuse that
barrier here.".

On top of this there is:

- the change to paging_tmpl.h that Guangrong posted, adding smp_wmb()
before each increment of vcpu->kvm->tlbs_dirty


Yes, please make it a separate patch.



- the change to smp_mb__after_atomic() in kvm_make_all_cpus_request

- if you want :) you can also replace the store+mb in
walk_shadow_page_lockless_begin with smp_store_mb, and the mb+store in
walk_shadow_page_lockless_end with smp_store_release.


These changes look good to me.

Tianyu, please CC me when you post the new version.
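
Pulling the suggestions above together, the comment could end up reading
roughly as follows (a sketch, not a posted patch):

	/*
	 * We need to make sure everyone sees our modifications to the page
	 * tables and sees changes to vcpu->mode here.  This pairs with
	 * vcpu_enter_guest() and walk_shadow_page_lockless_begin/end().
	 * The barrier in kvm_flush_remote_tlbs() helps us to achieve these.
	 * In addition, we need to wait for all vcpus to exit guest mode
	 * and/or lockless shadow page table walks.
	 */
	kvm_flush_remote_tlbs(kvm);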



Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()

2016-03-10 Thread Xiao Guangrong



On 03/11/2016 12:04 AM, Paolo Bonzini wrote:



On 10/03/2016 16:45, Xiao Guangrong wrote:



Compared to smp_load_acquire(), smp_mb() adds an ordering between stores
and loads.


Here, the ordering is load-store, hence...


Yes, this is why I put smp_mb() in the code. :)


Here is a table of barriers:


          '. after |                   |
 before     '.     | load              | store
 _____________'.___|___________________|____________________
                   |                   |
                   | smp_rmb           | smp_load_acquire
 load              | smp_load_acquire  | smp_store_release   XX
                   | smp_mb            | smp_mb
 __________________|___________________|____________________
                   |                   |
                   |                   | smp_wmb
 store             | smp_mb            | smp_store_release
                   |                   | smp_mb
                   |                   |

Your case is the one marked with XX, so a smp_load_acquire() is
enough---and it's preferrable, because it's cheaper than smp_mb() and
more self-documenting.


Yes, you are right and thank you for pointing it out.
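
For the load-store case marked XX, a minimal sketch (not the patch that was
actually posted) of kvm_flush_remote_tlbs() using smp_load_acquire() in place
of the standalone smp_mb():

	void kvm_flush_remote_tlbs(struct kvm *kvm)
	{
		/*
		 * Acquire semantics order this load before the later request
		 * stores, which is all the load-store case needs, so the
		 * separate smp_mb() is no longer required.
		 */
		long dirty_count = smp_load_acquire(&kvm->tlbs_dirty);

		if (kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
			++kvm->stat.remote_tlb_flush;
		cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
	}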



Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()

2016-03-10 Thread Xiao Guangrong



On 03/10/2016 11:31 PM, Paolo Bonzini wrote:



On 10/03/2016 16:26, Paolo Bonzini wrote:

Compared to smp_load_acquire(), smp_mb() adds an ordering between stores
and loads.


Here, the ordering is load-store, hence...


Yes, this is why I put smp_mb() in the code. :)



Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()

2016-03-10 Thread Xiao Guangrong



On 03/08/2016 11:27 PM, Paolo Bonzini wrote:



On 08/03/2016 09:36, Lan Tianyu wrote:

Summary about smp_mb()s we met in this thread. If misunderstood, please
correct me. Thanks.

The smp_mb() in the kvm_flush_remote_tlbs() was introduced by the commit
a4ee1ca4 and it seems to keep the order of reading and cmpxchg
kvm->tlbs_dirty.

 Quote from Avi:
 | I don't think we need to flush immediately; set a "tlb dirty" bit
somewhere
 | that is cleareded when we flush the tlb.
kvm_mmu_notifier_invalidate_page()
 | can consult the bit and force a flush if set.

 Signed-off-by: Xiao Guangrong <xiaoguangr...@cn.fujitsu.com>
 Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>


Unfortunately that patch added a bad memory barrier: 1) it lacks a
comment; 2) it lacks obvious pairing; 3) it is an smp_mb() after a read,
so it's not even obvious that this memory barrier has to do with the
immediately preceding read of kvm->tlbs_dirty.  It also is not
documented in Documentation/virtual/kvm/mmu.txt (Guangrong documented
there most of his other work, back in 2013, but not this one :)).

The cmpxchg is ordered anyway against the read, because 1) x86 has
implicit ordering between earlier loads and later stores; 2) even
store-load barriers are unnecessary for accesses to the same variable
(in this case kvm->tlbs_dirty).

So offhand, I cannot say what it orders.  There are two possibilities:

1) it orders the read of tlbs_dirty with the read of mode.  In this
case, a smp_rmb() would have been enough, and it's not clear where is
the matching smp_wmb().

2) it orders the read of tlbs_dirty with the KVM_REQ_TLB_FLUSH request.
  In this case a smp_load_acquire would be better.

3) it does the same as kvm_mmu_commit_zap_page's smp_mb() but for other
callers of kvm_flush_remote_tlbs().  In this case, we know what's the
matching memory barrier (walk_shadow_page_lockless_*).

4) it is completely unnecessary.


Sorry, memory barriers were missed in sync_page(); this diff should fix it:

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 91e939b..4cad57f 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -948,6 +948,12 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 			return -EINVAL;
 
 		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
+			/*
+			 * update spte before increasing tlbs_dirty to make sure no tlb
+			 * flush is lost after spte is zapped, see the comments in
+			 * kvm_flush_remote_tlbs().
+			 */
+			smp_wmb();
 			vcpu->kvm->tlbs_dirty++;
 			continue;
 		}
@@ -963,6 +969,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 
 		if (gfn != sp->gfns[i]) {
 			drop_spte(vcpu->kvm, &sp->spt[i]);
+			/* the same as above where we are doing prefetch_invalid_gpte(). */
+			smp_wmb();
 			vcpu->kvm->tlbs_dirty++;
 			continue;
 		}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 314c777..82c86ea 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -193,7 +193,12 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
 {
 	long dirty_count = kvm->tlbs_dirty;
 
+	/*
+	 * read tlbs_dirty before doing tlb flush to make sure no tlb request
+	 * is lost.
+	 */
 	smp_mb();
+
 	if (kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
 		++kvm->stat.remote_tlb_flush;
 	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);


Any comment?


Re: [RFC PATCH 2/2] KVM: MMU: return page fault error code from permission_fault

2016-03-10 Thread Xiao Guangrong



On 03/08/2016 07:45 PM, Paolo Bonzini wrote:

Intel's new PKU extension introduces a bit of the page fault error code
that must be dynamically computed after the PTE lookup (unlike I/D, U/S
and W/R).  To ease this, these two patches make permission_fault return
the page fault error code.

Right now it only adds the P bit to the input parameter "pfec", but PKU
can change that.


Yep, I got the same idea when I reviewed the pkey patchset. This patch
looks good to me.

Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>


Re: [RFC PATCH 1/2] KVM: MMU: precompute page fault error code

2016-03-10 Thread Xiao Guangrong



On 03/08/2016 07:45 PM, Paolo Bonzini wrote:

For the next patch, we will want to filter PFERR_FETCH_MASK away early,
and not pass it to permission_fault if neither NX nor SMEP are enabled.
Prepare for the change.


Why is it needed? It is much easier to drop PFEC.F in
update_permission_bitmask().



Signed-off-by: Paolo Bonzini 
---
  arch/x86/kvm/mmu.c |  2 +-
  arch/x86/kvm/paging_tmpl.h | 26 +++---
  2 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2463de0b935c..e57f7be061e3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3883,7 +3883,7 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
u = bit & ACC_USER_MASK;

if (!ept) {
-   /* Not really needed: !nx will cause pte.nx to fault */
+   /* Not really needed: if !nx, ff will always be zero */
x |= !mmu->nx;
/* Allow supervisor writes if !cr0.wp */
w |= !is_write_protection(vcpu) && !uf;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 6013f3685ef4..285858d3223b 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -272,13 +272,24 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
gpa_t pte_gpa;
int offset;
const int write_fault = access & PFERR_WRITE_MASK;
-   const int user_fault  = access & PFERR_USER_MASK;
-   const int fetch_fault = access & PFERR_FETCH_MASK;
-   u16 errcode = 0;
+   u16 errcode;
gpa_t real_gpa;
gfn_t gfn;

trace_kvm_mmu_pagetable_walk(addr, access);
+
+   /*
+* Do not modify PFERR_FETCH_MASK in access.  It is used later in the call to
+* mmu->translate_gpa and, when nested virtualization is in use, the X or NX
+* bit of nested page tables always applies---even if NX and SMEP are disabled
+* in the guest.
+*
+* TODO: cache the result of the NX and SMEP test in struct kvm_mmu?
+*/
+   errcode = access;
+   if (!(mmu->nx || kvm_read_cr4_bits(vcpu, X86_CR4_SMEP)))
+   errcode &= ~PFERR_FETCH_MASK;
+


PFEC.P might be set since it is copied from @access.

And the workload is moved from a rare path to the common path...


  retry_walk:
walker->level = mmu->root_level;
pte   = mmu->get_cr3(vcpu);
@@ -389,9 +400,7 @@ retry_walk:

if (unlikely(!accessed_dirty)) {
ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker, 
write_fault);
-   if (unlikely(ret < 0))
-   goto error;
-   else if (ret)
+   if (ret > 0)
goto retry_walk;


So it returns normally even if update_accessed_dirty_bits() failed.


Re: [PATCH 0/2] KVM: x86: disable MPX if host did not enable MPX XSAVE features

2016-03-10 Thread Xiao Guangrong



On 03/08/2016 07:44 PM, Paolo Bonzini wrote:

Patch 1 ensures that all aspects of MPX are disabled when eager FPU
is disabled on the host.  Patch 2 is just a cleanup.


It looks good to me.

Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>

Now that more and more features depend on eager xsave (e.g. fpu, mpx and
protection-key), maybe it is time to rename eager-fpu to eager-xsave?




Re: [PATCH 1/2] KVM: MMU: fix ept=0/pte.u=0/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo

2016-03-10 Thread Xiao Guangrong



On 03/10/2016 06:09 PM, Paolo Bonzini wrote:



On 10/03/2016 09:27, Xiao Guangrong wrote:





+if (!enable_ept) {
+guest_efer |= EFER_NX;
+ignore_bits |= EFER_NX;


Updating ignore_bits is not necessary, I think.


More precisely, ignore_bits is only needed if guest EFER.NX=0 and we're
not in this CR0.WP=1/CR4.SMEP=0 situation.  In theory you could have
guest EFER.NX=1 and host EFER.NX=0.


That is not the case in Linux; the kernel always sets EFER.NX if CPUID
reports it, see arch/x86/kernel/head_64.S:

204     /* Setup EFER (Extended Feature Enable Register) */
205     movl    $MSR_EFER, %ecx
206     rdmsr
207     btsl    $_EFER_SCE, %eax        /* Enable System Call */
208     btl     $20,%edi                /* No Execute supported? */
209     jnc     1f
210     btsl    $_EFER_NX, %eax
211     btsq    $_PAGE_BIT_NX,early_pmd_flags(%rip)
212 1:  wrmsr                           /* Make changes effective */

So if the guest sees NX in its CPUID, then host EFER.NX should be 1.



This is what I came up with (plus some comments :)):

u64 guest_efer = vmx->vcpu.arch.efer;
u64 ignore_bits = 0;

if (!enable_ept) {
if (boot_cpu_has(X86_FEATURE_SMEP))
guest_efer |= EFER_NX;
else if (!(guest_efer & EFER_NX))
ignore_bits |= EFER_NX;
}


Your logic is right.

My suggestion is that we can keep ignore_bits = EFER_NX | EFER_SCE
(no need to adjust it conditionally), because EFER_NX must be the same
between guest and host if we switch EFER manually.


My patch is bigger but the resulting code is smaller and easier to follow:

guest_efer = vmx->vcpu.arch.efer;
if (!enable_ept)
guest_efer |= EFER_NX;
...
if (...) {
...
} else {
guest_efer &= ~ignore_bits;
guest_efer |= host_efer & ignore_bits;
}


I agreed. :)


Re: [PATCH 1/2] KVM: MMU: fix ept=0/pte.u=0/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo

2016-03-10 Thread Xiao Guangrong



On 03/08/2016 07:44 PM, Paolo Bonzini wrote:

Yes, all of these are needed. :) This is admittedly a bit odd, but
kvm-unit-tests access.flat tests this if you run it with "-cpu host"
and of course ept=0.

KVM handles supervisor writes of a pte.u=0/pte.w=0/CR0.WP=0 page by
setting U=0 and W=1 in the shadow PTE.  This will cause a user write
to fault and a supervisor write to succeed (which is correct because
CR0.WP=0).  A user read instead will flip U=0 to 1 and W=1 back to 0.


BTW, it should be pte.u = 1 in the place you mentioned above.


Re: [PATCH 2/2] KVM: MMU: fix reserved bit check for pte.u=0/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0

2016-03-10 Thread Xiao Guangrong



On 03/08/2016 07:44 PM, Paolo Bonzini wrote:

KVM handles supervisor writes of a pte.u=0/pte.w=0/CR0.WP=0 page by
setting U=0 and W=1 in the shadow PTE.  This will cause a user write
to fault and a supervisor write to succeed (which is correct because
CR0.WP=0).  A user read instead will flip U=0 to 1 and W=1 back to 0.
This enables user reads; it also disables supervisor writes, the next
of which will then flip the bits again.

When SMEP is in effect, however, pte.u=0 will enable kernel execution
of this page.  To avoid this, KVM also sets pte.nx=1.  The reserved bit
catches this because it only looks at the guest's EFER.NX bit.  Teach it
that smep_andnot_wp will also use the NX bit of SPTEs.

Cc: sta...@vger.kernel.org
Cc: Xiao Guangrong <guangrong.x...@redhat.com>


As a Red Hat guy I am so proud. :)


Fixes: c258b62b264fdc469b6d3610a907708068145e3b


Thanks for fixing it, Paolo!

Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>


Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
---
  arch/x86/kvm/mmu.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 95a955de5964..0cd4ee01de94 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3721,13 +3721,15 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
  void
  reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
  {
+   int uses_nx = context->nx || context->base_role.smep_andnot_wp;


It would be better if it were 'bool'.


Re: [PATCH 1/2] KVM: MMU: fix ept=0/pte.u=0/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo

2016-03-10 Thread Xiao Guangrong



On 03/08/2016 07:44 PM, Paolo Bonzini wrote:

Yes, all of these are needed. :) This is admittedly a bit odd, but
kvm-unit-tests access.flat tests this if you run it with "-cpu host"
and of course ept=0.

KVM handles supervisor writes of a pte.u=0/pte.w=0/CR0.WP=0 page by
setting U=0 and W=1 in the shadow PTE.  This will cause a user write
to fault and a supervisor write to succeed (which is correct because
CR0.WP=0).  A user read instead will flip U=0 to 1 and W=1 back to 0.
This enables user reads; it also disables supervisor writes, the next
of which will then flip the bits again.

When SMEP is in effect, however, U=0 will enable kernel execution of
this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
with U=0.  If the guest has not enabled NX, the result is a continuous
stream of page faults due to the NX bit being reserved.

The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
switch.


Good catch!

So it only hurts boxes that have cpu_has_load_ia32_efer support; otherwise
NX is inherited from the kernel (the kernel always sets NX if the CPU
supports it), right?



There is another bug in the reserved bit check, which I've split to a
separate patch for easier application to stable kernels.




Cc: sta...@vger.kernel.org
Cc: Xiao Guangrong <guangrong.x...@redhat.com>
Cc: Andy Lutomirski <l...@amacapital.net>
Fixes: f6577a5fa15d82217ca73c74cd2dcbc0f6c781dd
Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
---
  Documentation/virtual/kvm/mmu.txt |  3 ++-
  arch/x86/kvm/vmx.c| 25 +++--
  2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index daf9c0f742d2..c81731096a43 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -358,7 +358,8 @@ In the first case there are two additional complications:
  - if CR4.SMEP is enabled: since we've turned the page into a kernel page,
the kernel may now execute it.  We handle this by also setting spte.nx.
If we get a user fetch or read fault, we'll change spte.u=1 and
-  spte.nx=gpte.nx back.
+  spte.nx=gpte.nx back.  For this to work, KVM forces EFER.NX to 1 when
+  shadow paging is in use.
  - if CR4.SMAP is disabled: since the page has been changed to a kernel
page, it can not be reused when CR4.SMAP is enabled. We set
CR4.SMAP && !CR0.WP into shadow page's role to avoid this case. Note,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6e51493ff4f9..91830809d837 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1863,20 +1863,20 @@ static bool update_transition_efer(struct vcpu_vmx 
*vmx, int efer_offset)
guest_efer = vmx->vcpu.arch.efer;

/*
-* NX is emulated; LMA and LME handled by hardware; SCE meaningless
-* outside long mode
+* LMA and LME handled by hardware; SCE meaningless outside long mode.
 */
-   ignore_bits = EFER_NX | EFER_SCE;
+   ignore_bits = EFER_SCE;
  #ifdef CONFIG_X86_64
ignore_bits |= EFER_LMA | EFER_LME;
/* SCE is meaningful only in long mode on Intel */
if (guest_efer & EFER_LMA)
ignore_bits &= ~(u64)EFER_SCE;
  #endif
-   guest_efer &= ~ignore_bits;
-   guest_efer |= host_efer & ignore_bits;
-   vmx->guest_msrs[efer_offset].data = guest_efer;
-   vmx->guest_msrs[efer_offset].mask = ~ignore_bits;
+   /* NX is needed to handle CR0.WP=1, CR4.SMEP=1.  */



+   if (!enable_ept) {
+   guest_efer |= EFER_NX;
+   ignore_bits |= EFER_NX;


Updating ignore_bits is not necessary, I think.


+   }

clear_atomic_switch_msr(vmx, MSR_EFER);

@@ -1887,16 +1887,21 @@ static bool update_transition_efer(struct vcpu_vmx 
*vmx, int efer_offset)
 */
if (cpu_has_load_ia32_efer ||
(enable_ept && ((vmx->vcpu.arch.efer ^ host_efer) & EFER_NX))) {
-   guest_efer = vmx->vcpu.arch.efer;
if (!(guest_efer & EFER_LMA))
guest_efer &= ~EFER_LME;
if (guest_efer != host_efer)
add_atomic_switch_msr(vmx, MSR_EFER,
  guest_efer, host_efer);


So, why not set EFER_NX (if !ept) just in this branch, to make the fix
simpler?
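
[ A minimal sketch of the masking done in update_transition_efer(), not the
  kernel code itself: bits in ignore_bits are taken from the host value,
  everything else from the guest value.  Assuming the host always runs with
  EFER.NX=1, putting EFER_NX into ignore_bits gives the same NX=1 result as
  the explicit "guest_efer |= EFER_NX" above, which may be why the extra
  ignore_bits update looks redundant here. ]

    #include <stdint.h>

    #define TOY_EFER_SCE (1ULL << 0)
    #define TOY_EFER_NX  (1ULL << 11)

    static uint64_t effective_guest_efer(uint64_t guest_efer,
                                         uint64_t host_efer,
                                         uint64_t ignore_bits)
    {
            /* host bits win wherever ignore_bits is set */
            return (guest_efer & ~ignore_bits) | (host_efer & ignore_bits);
    }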




Re: [PATCH v2 0/9] cleanup around kvm_sync_page, and a few micro-optimizations

2016-03-08 Thread Xiao Guangrong



On 03/07/2016 10:15 PM, Paolo Bonzini wrote:

Having committed the ubsan fixes, these are the cleanups that are left.

Compared to v1, I have fixed the patch to coalesce page zapping after
mmu_sync_children (as requested by Takuya and Guangrong), and I have
rewritten is_last_gpte again in an even simpler way.



Nice work!

Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>



Re: [PATCH] KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()

2016-03-04 Thread Xiao Guangrong



On 03/04/2016 04:04 PM, Paolo Bonzini wrote:

On 04/03/2016 02:35, Lan Tianyu wrote:

The following kvm_flush_remote_tlbs() will call smp_mb() inside and so
remove smp_mb() here.

Signed-off-by: Lan Tianyu 
---
  arch/x86/kvm/mmu.c | 6 --
  1 file changed, 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a54ecd9..6315416 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2383,12 +2383,6 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
return;

/*
-* wmb: make sure everyone sees our modifications to the page tables
-* rmb: make sure we see changes to vcpu->mode
-*/
-   smp_mb();
-
-   /*
 * Wait for all vcpus to exit guest mode and/or lockless shadow
 * page table walks.
 */



kvm_flush_remote_tlbs loads kvm->tlbs_dirty before issuing the memory
barrier.  I think it's okay if the load is done earlier, but I'd like
Guangrong to take a look.


Yes, it looks good to me, but please keep the comment.

Thanks!
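
[ For context, a self-contained sketch (C11 atomics, not kernel code) of the
  pairing the removed comment described: the zapping side publishes its page
  table changes before reading vcpu->mode, and the vcpu publishes its mode
  before walking the page tables, so at least one side observes the other.
  The fence on the zap side stands in for the smp_mb() now supplied inside
  kvm_flush_remote_tlbs(). ]

    #include <stdatomic.h>
    #include <stdbool.h>

    enum toy_mode { TOY_OUTSIDE_GUEST_MODE, TOY_IN_GUEST_MODE };

    static _Atomic int  toy_vcpu_mode = TOY_OUTSIDE_GUEST_MODE;
    static _Atomic bool toy_spte_zapped;

    /* zap side: roughly kvm_mmu_commit_zap_page() + kvm_flush_remote_tlbs() */
    static bool zap_side_must_kick_vcpu(void)
    {
            atomic_store_explicit(&toy_spte_zapped, true, memory_order_relaxed);
            atomic_thread_fence(memory_order_seq_cst);      /* the smp_mb() */
            return atomic_load_explicit(&toy_vcpu_mode,
                                        memory_order_relaxed) == TOY_IN_GUEST_MODE;
    }

    /* vcpu side: roughly the mode update before entering the guest */
    static bool vcpu_side_sees_zap(void)
    {
            atomic_store_explicit(&toy_vcpu_mode, TOY_IN_GUEST_MODE,
                                  memory_order_relaxed);
            atomic_thread_fence(memory_order_seq_cst);      /* pairs with above */
            return atomic_load_explicit(&toy_spte_zapped, memory_order_relaxed);
    }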



Re: [PATCH] KVM: VMX: use vmcs_clear/set_bits for debug register exits

2016-02-26 Thread Xiao Guangrong



On 02/26/2016 07:55 PM, Paolo Bonzini wrote:

Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
---
  arch/x86/kvm/vmx.c | 14 +++---
  1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aa16d5874fe6..46154dac71e6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5619,11 +5619,8 @@ static int handle_dr(struct kvm_vcpu *vcpu)
}

if (vcpu->guest_debug == 0) {
-   u32 cpu_based_vm_exec_control;
-
-   cpu_based_vm_exec_control = 
vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
-   cpu_based_vm_exec_control &= ~CPU_BASED_MOV_DR_EXITING;
-   vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, 
cpu_based_vm_exec_control);
+   vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
+   CPU_BASED_MOV_DR_EXITING);

/*
 * No more DR vmexits; force a reload of the debug registers
@@ -5660,8 +5657,6 @@ static void vmx_set_dr6(struct kvm_vcpu *vcpu, unsigned 
long val)

  static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
  {
-   u32 cpu_based_vm_exec_control;
-
get_debugreg(vcpu->arch.db[0], 0);
get_debugreg(vcpu->arch.db[1], 1);
get_debugreg(vcpu->arch.db[2], 2);
@@ -5670,10 +5665,7 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu 
*vcpu)
vcpu->arch.dr7 = vmcs_readl(GUEST_DR7);

vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_WONT_EXIT;
-
-   cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
-   cpu_based_vm_exec_control |= CPU_BASED_MOV_DR_EXITING;
-   vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
+   vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL, CPU_BASED_MOV_DR_EXITING);
  }

  static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)



Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>





Re: [PATCH] KVM: x86: fix root cause for missed hardware breakpoints

2016-02-26 Thread Xiao Guangrong



On 02/26/2016 07:46 PM, Paolo Bonzini wrote:

Commit 172b2386ed16 ("KVM: x86: fix missed hardware breakpoints",
2016-02-10) worked around a case where the debug registers are not loaded
correctly on preemption and on the first entry to KVM_RUN.

However, Xiao Guangrong pointed out that the root cause must be that
KVM_DEBUGREG_BP_ENABLED is not being set correctly.  This can indeed
happen due to the lazy debug exit mechanism, which does not call
kvm_update_dr7.  Fix it by replacing the existing loop (more or less
equivalent to kvm_update_dr0123) with calls to all the kvm_update_dr*
functions.

The original patch is good enough for stable releases before 4.1, since
it does not have any dependencies such as commit ae561edeb421 ("KVM:
x86: DR0-DR3 are not clear on reset", 2015-04-02).

Cc: sta...@vger.kernel.org   # 4.1+
Fixes: 172b2386ed16a9143d9a456aae5ec87275c61489
Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
---
  arch/x86/kvm/x86.c | 9 -
  1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f4891f2ece23..eaf6ee8c28b8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2752,7 +2752,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
}

kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
-   vcpu->arch.switch_db_regs |= KVM_DEBUGREG_RELOAD;
  }

  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -6619,12 +6618,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 * KVM_DEBUGREG_WONT_EXIT again.
 */
if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
-   int i;
-
WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
kvm_x86_ops->sync_dirty_debug_regs(vcpu);
-   for (i = 0; i < KVM_NR_DB_REGS; i++)
-   vcpu->arch.eff_db[i] = vcpu->arch.db[i];
+   kvm_update_dr0123(vcpu);
+   kvm_update_dr6(vcpu);
+   kvm_update_dr7(vcpu);
+   vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
}


It looks good to me now. It also fixes the case where the debug registers are
changed by a debugging process (gdb/perf) in the context of the vcpu, as I
described in another thread.

Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>



Re: [PATCH] KVM: x86: fix missed hardware breakpoints

2016-02-26 Thread Xiao Guangrong



On 02/26/2016 07:28 PM, Nadav Amit wrote:

Xiao Guangrong <guangrong.x...@linux.intel.com> wrote:


On 02/19/2016 06:56 PM, Paolo Bonzini wrote:

Sometimes when setting a breakpoint a process doesn't stop on it.
This is because the debug registers are not loaded correctly on
VCPU load.


diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4244c2baf57d..f4891f2ece23 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2752,6 +2752,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
}

kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
+   vcpu->arch.switch_db_regs |= KVM_DEBUGREG_RELOAD;


Er, I do not understand how it works. The BP is enabled in this test case, so
the debug registers are always reloaded before entering the guest, as the
KVM_DEBUGREG_BP_ENABLED bit is always set in switch_db_regs. What did I miss?


Note that KVM_DEBUGREG_BP_ENABLED does not have to be set, since
kvm_update_dr7() is not called once the guest has KVM_DEBUGREG_WONT_EXIT.


Ah, yes. So we will enter the guest without reloading the debug registers in
this case. Is it safe? What if gdb/perf attaches to QEMU and resets DR0-DR3
(in this case, the vcpu is interrupted by an IPI and is not rescheduled)?



Re: [PATCH] KVM: x86: fix missed hardware breakpoints

2016-02-26 Thread Xiao Guangrong



On 02/19/2016 06:56 PM, Paolo Bonzini wrote:

Sometimes when setting a breakpoint a process doesn't stop on it.
This is because the debug registers are not loaded correctly on
VCPU load.

The following simple reproducer from Oleg Nesterov tries using debug
registers in two threads.  To see the bug, run a 2-VCPU guest under
"taskset -c 0", then run "./bp 0 1" inside the guest.

 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <signal.h>
 #include <assert.h>
 #include <sys/ptrace.h>
 #include <sys/wait.h>
 #include <sys/user.h>
 #include <sys/debugreg.h>

 #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

 unsigned long encode_dr7(int drnum, int enable, unsigned int type, 
unsigned int len)
 {
 unsigned long dr7;

 dr7 = ((len | type) & 0xf)
 << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE);
 if (enable)
 dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE));

 return dr7;
 }

 int write_dr(int pid, int dr, unsigned long val)
 {
 return ptrace(PTRACE_POKEUSER, pid,
 offsetof (struct user, u_debugreg[dr]),
 val);
 }

 void set_bp(pid_t pid, void *addr)
 {
 unsigned long dr7;
 assert(write_dr(pid, 0, (long)addr) == 0);
 dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1);
 assert(write_dr(pid, 7, dr7) == 0);
 }

 void *get_rip(int pid)
 {
 return (void*)ptrace(PTRACE_PEEKUSER, pid,
 offsetof(struct user, regs.rip), 0);
 }

 void test(int nr)
 {
 void *bp_addr = &&label + nr, *bp_hit;
 int pid;

 printf("test bp %d\n", nr);
 assert(nr < 16); // see 16 asm nops below

 pid = fork();
 if (!pid) {
 assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
 kill(getpid(), SIGSTOP);
 for (;;) {
 label: asm (
 "nop; nop; nop; nop;"
 "nop; nop; nop; nop;"
 "nop; nop; nop; nop;"
 "nop; nop; nop; nop;"
 );
 }
 }

 assert(pid == wait(NULL));
 set_bp(pid, bp_addr);

 for (;;) {
 assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0);
 assert(pid == wait(NULL));

 bp_hit = get_rip(pid);
 if (bp_hit != bp_addr)
 fprintf(stderr, "ERR!! hit wrong bp %ld != %d\n",
 bp_hit - &&label, nr);
 }
 }

 int main(int argc, const char *argv[])
 {
 while (--argc) {
 int nr = atoi(*++argv);
 if (!fork())
 test(nr);
 }

 while (wait(NULL) > 0)
 ;
 return 0;
 }

Cc: sta...@vger.kernel.org
Suggested-by: Nadav Amit 
Reported-by: Andrey Wagin 
Signed-off-by: Paolo Bonzini 
---
Andrey, sorry for the time it took to get to this one.

  arch/x86/kvm/x86.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4244c2baf57d..f4891f2ece23 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2752,6 +2752,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
}

kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
+   vcpu->arch.switch_db_regs |= KVM_DEBUGREG_RELOAD;


Er, I do not understand how it works. The BP is enabled in this test case, so
the debug registers are always reloaded before entering the guest, as the
KVM_DEBUGREG_BP_ENABLED bit is always set in switch_db_regs. What did I miss?

Another impact of this fix is that when the vcpu is rescheduled we always
reload the debug registers even if the guest does not enable them; is that
really needed?





Re: [PATCH 09/12] KVM: MMU: coalesce zapping page after mmu_sync_children

2016-02-25 Thread Xiao Guangrong



On 02/25/2016 04:49 PM, Paolo Bonzini wrote:



On 25/02/2016 08:35, Xiao Guangrong wrote:

This may release the mmu_lock before committing the zapping.
Is it safe?  If so, we may want to see the reason in the changelog.


It is unsafe indeed, please do not do it.


Can you explain why?  kvm_zap_obsolete_pages does the same.


It's not the same, please see the comment in  kvm_mmu_invalidate_zap_all_pages:
/*
 * Notify all vcpus to reload its shadow page table
 * and flush TLB. Then all vcpus will switch to new
 * shadow page table with the new mmu_valid_gen.
 *
 * Note: we should do this under the protection of
 * mmu-lock, otherwise, vcpu would purge shadow page
 * but miss tlb flush.
 */
kvm_reload_remote_mmus(kvm);

That means the tlb is flushed before releasing mmu-lock.

An example is in rmap_write_protect(): when KVM creates a shadow page table
for the guest and detects no spte pointing to the gfn, the TLB is not flushed,
so the guest can freely update its pte.
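
[ A hedged sketch of the invariant above, not kernel code:
  zap_obsolete_pages() is a placeholder name, while kvm_reload_remote_mmus()
  and the mmu_lock are real.  The point is only that the reload/flush request
  is issued before mmu_lock is dropped. ]

    static void toy_zap_all_under_lock(struct kvm *kvm)
    {
            bool zapped;

            spin_lock(&kvm->mmu_lock);
            zapped = zap_obsolete_pages(kvm);       /* placeholder helper */
            if (zapped)
                    kvm_reload_remote_mmus(kvm);    /* still under mmu_lock */
            spin_unlock(&kvm->mmu_lock);
    }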



Re: [PATCH] KVM: x86: MMU: fix ubsan index-out-of-range warning

2016-02-25 Thread Xiao Guangrong



On 02/25/2016 02:02 AM, Mike Krinkin wrote:

Ubsan reports the following warning due to a typo in
update_accessed_dirty_bits template, the patch fixes
the typo:

[  168.791851] 

[  168.791862] UBSAN: Undefined behaviour in arch/x86/kvm/paging_tmpl.h:252:15
[  168.791866] index 4 is out of range for type 'u64 [4]'
[  168.791871] CPU: 0 PID: 2950 Comm: qemu-system-x86 Tainted: G   O L  
4.5.0-rc5-next-20160222 #7
[  168.791873] Hardware name: LENOVO 23205NG/23205NG, BIOS G2ET95WW (2.55 ) 
07/09/2013
[  168.791876]   8801cfcaf208 81c9f780 
41b58ab3
[  168.791882]  82eb2cc1 81c9f6b4 8801cfcaf230 
8801cfcaf1e0
[  168.791886]  0004 0001  
a1981600
[  168.791891] Call Trace:
[  168.791899]  [] dump_stack+0xcc/0x12c
[  168.791904]  [] ? _atomic_dec_and_lock+0xc4/0xc4
[  168.791910]  [] ubsan_epilogue+0xd/0x8a
[  168.791914]  [] __ubsan_handle_out_of_bounds+0x15c/0x1a3
[  168.791918]  [] ? 
__ubsan_handle_shift_out_of_bounds+0x2bd/0x2bd
[  168.791922]  [] ? get_user_pages_fast+0x2bf/0x360
[  168.791954]  [] ? kvm_largepages_enabled+0x30/0x30 [kvm]
[  168.791958]  [] ? __get_user_pages_fast+0x360/0x360
[  168.791987]  [] paging64_walk_addr_generic+0x1b28/0x2600 
[kvm]
[  168.792014]  [] ? init_kvm_mmu+0x1100/0x1100 [kvm]
[  168.792019]  [] ? debug_check_no_locks_freed+0x350/0x350
[  168.792044]  [] ? init_kvm_mmu+0x1100/0x1100 [kvm]
[  168.792076]  [] paging64_gva_to_gpa+0x7d/0x110 [kvm]
[  168.792121]  [] ? paging64_walk_addr_generic+0x2600/0x2600 
[kvm]
[  168.792130]  [] ? debug_lockdep_rcu_enabled+0x7b/0x90
[  168.792178]  [] emulator_read_write_onepage+0x27a/0x1150 
[kvm]
[  168.792208]  [] ? __kvm_read_guest_page+0x54/0x70 [kvm]
[  168.792234]  [] ? kvm_task_switch+0x160/0x160 [kvm]
[  168.792238]  [] ? debug_lockdep_rcu_enabled+0x7b/0x90
[  168.792263]  [] emulator_read_write+0xe7/0x6d0 [kvm]
[  168.792290]  [] ? em_cr_write+0x230/0x230 [kvm]
[  168.792314]  [] emulator_write_emulated+0x15/0x20 [kvm]
[  168.792340]  [] segmented_write+0xf8/0x130 [kvm]
[  168.792367]  [] ? em_lgdt+0x20/0x20 [kvm]
[  168.792374]  [] ? vmx_read_guest_seg_ar+0x42/0x1e0 
[kvm_intel]
[  168.792400]  [] writeback+0x3f2/0x700 [kvm]
[  168.792424]  [] ? em_sidt+0xa0/0xa0 [kvm]
[  168.792449]  [] ? x86_decode_insn+0x1b3d/0x4f70 [kvm]
[  168.792474]  [] x86_emulate_insn+0x572/0x3010 [kvm]
[  168.792499]  [] x86_emulate_instruction+0x3bd/0x2110 [kvm]
[  168.792524]  [] ? 
reexecute_instruction.part.110+0x2e0/0x2e0 [kvm]
[  168.792532]  [] handle_ept_misconfig+0x61/0x460 [kvm_intel]
[  168.792539]  [] ? handle_pause+0x450/0x450 [kvm_intel]
[  168.792546]  [] vmx_handle_exit+0xd6a/0x1ad0 [kvm_intel]
[  168.792572]  [] ? kvm_arch_vcpu_ioctl_run+0xbdc/0x6090 
[kvm]
[  168.792597]  [] kvm_arch_vcpu_ioctl_run+0xd3d/0x6090 [kvm]
[  168.792621]  [] ? kvm_arch_vcpu_ioctl_run+0xbdc/0x6090 
[kvm]
[  168.792627]  [] ? 
__ww_mutex_lock_interruptible+0x1630/0x1630
[  168.792651]  [] ? kvm_arch_vcpu_runnable+0x4f0/0x4f0 [kvm]
[  168.792656]  [] ? preempt_notifier_unregister+0x190/0x190
[  168.792681]  [] ? kvm_arch_vcpu_load+0x127/0x650 [kvm]
[  168.792704]  [] kvm_vcpu_ioctl+0x553/0xda0 [kvm]
[  168.792727]  [] ? vcpu_put+0x40/0x40 [kvm]
[  168.792732]  [] ? debug_check_no_locks_freed+0x350/0x350
[  168.792735]  [] ? _raw_spin_unlock+0x27/0x40
[  168.792740]  [] ? handle_mm_fault+0x1673/0x2e40
[  168.792744]  [] ? trace_hardirqs_on_caller+0x478/0x6c0
[  168.792747]  [] ? trace_hardirqs_on+0xd/0x10
[  168.792751]  [] ? debug_lockdep_rcu_enabled+0x7b/0x90
[  168.792756]  [] do_vfs_ioctl+0x1b0/0x12b0
[  168.792759]  [] ? ioctl_preallocate+0x210/0x210
[  168.792763]  [] ? __fget+0x273/0x4a0
[  168.792766]  [] ? __fget+0x50/0x4a0
[  168.792770]  [] ? __fget_light+0x96/0x2b0
[  168.792773]  [] SyS_ioctl+0x79/0x90
[  168.792777]  [] entry_SYSCALL_64_fastpath+0x23/0xc1
[  168.792780] 


Signed-off-by: Mike Krinkin <krinkin@gmail.com>


Reviewed-by: Xiao Guangrong <guangrong.x...@linux.intel.com>




Re: [PATCH 00/12] KVM: MMU: cleanup around kvm_sync_page, and a few micro-optimizations

2016-02-25 Thread Xiao Guangrong



On 02/24/2016 09:17 PM, Paolo Bonzini wrote:

This series started from looking at mmu_unsync_walk for the ubsan thread.
Patches 1 and 2 are the result of the discussions in that thread.

Patches 3 to 9 do more cleanups in __kvm_sync_page and its callers.
Among other changes, it removes kvm_sync_page_transient and avoids
duplicate code between __kvm_sync_page and kvm_sync_pages.

I stopped where I had questions about the existing kvm_mmu_get_page
code (see patch 8 for the question).  However perhaps more cleanups
are possible, also thanks to Takuya's work on that function and
link_shadow_page.

Patches 10 to 12 are just micro-optimizations.

Guangrong, it would be great if you took a look since you know this part
of KVM very well.


I have reviewed it and it works fine, except for the patch that leaks the
TLB flush outside of the mmu-lock.

I will continue to simplify the path that walks unsync sps to keep
mmu_page_path smaller, and add comments for kvm_mmu_get_page on top of
this patchset.

BTW, is there any conflict in applying my page-tracking patchset on top of
this one (I noticed you've merged this patchset into kvm/queue)? Please
tell me to rebase it if needed.





Re: [PATCH 09/12] KVM: MMU: coalesce zapping page after mmu_sync_children

2016-02-24 Thread Xiao Guangrong



On 02/25/2016 10:15 AM, Takuya Yoshikawa wrote:

On 2016/02/24 22:17, Paolo Bonzini wrote:

Move the call to kvm_mmu_flush_or_zap outside the loop.

Signed-off-by: Paolo Bonzini 
---
   arch/x86/kvm/mmu.c | 9 ++---
   1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 725316df32ec..6d47b5c43246 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2029,24 +2029,27 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
struct mmu_page_path parents;
struct kvm_mmu_pages pages;
LIST_HEAD(invalid_list);
+   bool flush = false;

while (mmu_unsync_walk(parent, &pages)) {
bool protected = false;
-   bool flush = false;

for_each_sp(pages, sp, parents, i)
protected |= rmap_write_protect(vcpu, sp->gfn);

-   if (protected)
+   if (protected) {
kvm_flush_remote_tlbs(vcpu->kvm);
+   flush = false;
+   }

for_each_sp(pages, sp, parents, i) {
flush |= kvm_sync_page(vcpu, sp, &invalid_list);
mmu_pages_clear_parents(&parents);
}
-   kvm_mmu_flush_or_zap(vcpu, &invalid_list, false, flush);
cond_resched_lock(&vcpu->kvm->mmu_lock);


This may release the mmu_lock before committing the zapping.
Is it safe?  If so, we may want to see the reason in the changelog.


It is unsafe indeed, please do not do it.





Re: [PATCH 08/12] KVM: MMU: move zap/flush to kvm_mmu_get_page

2016-02-24 Thread Xiao Guangrong



On 02/24/2016 09:17 PM, Paolo Bonzini wrote:

kvm_mmu_get_page is the only caller of kvm_sync_page_transient
and kvm_sync_pages.  Moving the handling of the invalid_list there
removes the need for the underdocumented kvm_sync_page_transient
function.

Signed-off-by: Paolo Bonzini 
---
Guangrong, at this point I am confused about why
kvm_sync_page_transient didn't clear sp->unsync.  Do
you remember?  Or perhaps kvm_mmu_get_page could just
call kvm_sync_page now?



It is an optimization to reduce write protection: changing unsync to
sync needs to write-protect the page and sync all sptes pointing to the
same gfn.

However, after syncing the content between the unsync-ed spte and the guest
pte, we can reuse this spte perfectly.


Also, can you explain the need_sync variable in
kvm_mmu_get_page?


This is because we need to preserve the semantics of 'unsync spte': only
an spte on the last level (level = 1) can be unsync, so if an spte on an
upper level is created we must first eliminate all the unsync sptes
pointing to the same gfn.

As you have already merged this patchset into the kvm tree, I will post
a patch to comment these cases to make the code more understandable.
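
[ A self-contained toy model of that rule (not the kvm_mmu_get_page() code):
  only level-1 shadow pages may stay unsync, so creating an upper-level page
  for a gfn requires syncing any unsync pages for that gfn first. ]

    struct toy_sp { int level; int unsync; };

    static int creating_sp_needs_sync(const struct toy_sp *sps_for_gfn, int n,
                                      int new_level)
    {
            int i, any_unsync = 0;

            for (i = 0; i < n; i++)
                    any_unsync |= sps_for_gfn[i].unsync;

            /* a new level-1 page may coexist with unsync pages; anything
             * higher requires syncing them first */
            return new_level > 1 && any_unsync;
    }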


Re: [PATCH 08/12] KVM: MMU: move zap/flush to kvm_mmu_get_page

2016-02-24 Thread Xiao Guangrong



On 02/24/2016 09:17 PM, Paolo Bonzini wrote:

kvm_mmu_get_page is the only caller of kvm_sync_page_transient
and kvm_sync_pages.  Moving the handling of the invalid_list there
removes the need for the underdocumented kvm_sync_page_transient
function.

Signed-off-by: Paolo Bonzini 
---
Guangrong, at this point I am confused about why
kvm_sync_page_transient didn't clear sp->unsync.  Do
you remember?  Or perhaps kvm_mmu_get_page could just
call kvm_sync_page now?



It is the optimization to reduce write-protect as changing unsync to
sync need to write-protect the page and sync all sptes pointing to the
same gfn.

However, after syncing the content between unsync-ed spte and guest pte,
we can reuse this spte perfectly.


Also, can you explain the need_sync variable in
kvm_mmu_get_page?


This is because we need to to protect the semanteme of 'unsync spte' as
only the spte on last level (level = 1) can be unsync so that if a spte
on the upper level is created we should eliminate all the unsync sptes
pointing to the same gfn.

As you have already merged this patchset to the kvm tree, i will post
a patch to comment these cases to make the code be more understandable.


[PATCH 03/11] KVM: MMU: introduce kvm_mmu_slot_gfn_write_protect

2016-02-24 Thread Xiao Guangrong
Split rmap_write_protect() and introduce a function that abstracts the write
protection based on the slot.

This function will be used in a later patch.

Reviewed-by: Paolo Bonzini <pbonz...@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.x...@linux.intel.com>
---
 arch/x86/kvm/mmu.c | 16 +++-
 arch/x86/kvm/mmu.h |  2 ++
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 566639d..7184218 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1336,23 +1336,29 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm 
*kvm,
kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
-static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
+   struct kvm_memory_slot *slot, u64 gfn)
 {
-   struct kvm_memory_slot *slot;
struct kvm_rmap_head *rmap_head;
int i;
bool write_protected = false;
 
-   slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
-
for (i = PT_PAGE_TABLE_LEVEL; i <= PT_MAX_HUGEPAGE_LEVEL; ++i) {
rmap_head = __gfn_to_rmap(gfn, i, slot);
-   write_protected |= __rmap_write_protect(vcpu->kvm, rmap_head, 
true);
+   write_protected |= __rmap_write_protect(kvm, rmap_head, true);
}
 
return write_protected;
 }
 
+static bool rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+{
+   struct kvm_memory_slot *slot;
+
+   slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+   return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn);
+}
+
 static bool kvm_zap_rmapp(struct kvm *kvm, struct kvm_rmap_head *rmap_head)
 {
u64 *sptep;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index de92bed..58fe98a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -177,4 +177,6 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, 
gfn_t gfn_end);
 
 void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
+bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
+   struct kvm_memory_slot *slot, u64 gfn);
 #endif
-- 
1.8.3.1
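
[ A hedged sketch of how a later caller might use the new helper; the wrapper
  name below is made up for illustration.  Like rmap_write_protect(), it must
  run under kvm->mmu_lock, and the caller flushes the TLB if any spte was
  actually write-protected. ]

    /* caller holds kvm->mmu_lock */
    static void toy_write_protect_tracked_gfn(struct kvm *kvm,
                                              struct kvm_memory_slot *slot,
                                              gfn_t gfn)
    {
            if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn))
                    kvm_flush_remote_tlbs(kvm);
    }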


