Re: [RFC PATCH v2 14/20] x86/kvm: Make kvm_async_pf_enabled __ro_after_init

2023-10-09 Thread Maxim Levitsky
У чт, 2023-07-20 у 17:30 +0100, Valentin Schneider пише:
> objtool now warns about it:
> 
>   vmlinux.o: warning: objtool: exc_page_fault+0x2a: Non __ro_after_init 
> static key "kvm_async_pf_enabled" in .noinstr section
> 
> The key can only be enabled (and not disabled) in the __init function
> kvm_guest_init(), so mark it as __ro_after_init.
> 
> Signed-off-by: Valentin Schneider 
> ---
>  arch/x86/kernel/kvm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 1cceac5984daa..319460090a836 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -44,7 +44,7 @@
>  #include 
>  #include 
>  
> -DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
> +DEFINE_STATIC_KEY_FALSE_RO(kvm_async_pf_enabled);
>  
>  static int kvmapf = 1;
>  
Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky
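
As background, here is a minimal sketch of the init-time-only static key pattern the patch relies on. This is an editor's illustration, not code from the patch; my_feature_enabled, my_feature_init and my_feature_hot_path are made-up names.

#include <linux/jump_label.h>
#include <linux/init.h>

/* The key lives in .data..ro_after_init: writable during boot,
 * read-only once mark_rodata_ro() has run. */
DEFINE_STATIC_KEY_FALSE_RO(my_feature_enabled);

static int __init my_feature_init(void)
{
        /* The only write the key ever sees, and it happens at init time. */
        static_branch_enable(&my_feature_enabled);
        return 0;
}
early_initcall(my_feature_init);

static void my_feature_hot_path(void)
{
        /* Compiles to a patched jump/NOP; no writable data is read here,
         * which is what makes such keys acceptable in .noinstr code. */
        if (static_branch_likely(&my_feature_enabled)) {
                /* feature-specific work */
        }
}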




Re: [PATCH v2 0/9] KVM: my debug patch queue

2021-04-06 Thread Maxim Levitsky
On Fri, 2021-04-02 at 19:38 +0200, Paolo Bonzini wrote:
> On 01/04/21 15:54, Maxim Levitsky wrote:
> > Hi!
> > 
> > I would like to publish two debug features which were needed for other stuff
> > I work on.
> > 
> > One is the reworked lx-symbols script which now actually works on at least
> > gdb 9.1 (gdb 9.2 was reported to fail to load the debug symbols from the 
> > kernel
> > for some reason, not related to this patch) and upstream qemu.
> 
> Queued patches 2-5 for now.  6 is okay but it needs a selftest (e.g.
> using KVM_SET_VCPU_EVENTS) and the correct name for the constant.

Thanks!
I will do this very soon.

Best regards,
Maxim Levitsky
> 
> Paolo
> 
> > The other feature is the ability to trap all guest exceptions (on SVM for 
> > now)
> > and see them in kvmtrace prior to potential merge to double/triple fault.
> > 
> > This can be very useful and I already had to manually patch KVM a few
> > times for this.
> > I will, once time permits, implement this feature on Intel as well.
> > 
> > V2:
> > 
> >   * Some more refactoring and workarounds for lx-symbols script
> > 
> >   * added KVM_GUESTDBG_BLOCKEVENTS flag to enable 'block interrupts on
> > single step' together with KVM_CAP_SET_GUEST_DEBUG2 capability
> > to indicate which guest debug flags are supported.
> > 
> > This is a replacement for unconditional block of interrupts on single
> > step that was done in previous version of this patch set.
> > Patches to qemu to use that feature will be sent soon.
> > 
> >   * Reworked the 'intercept all exceptions for debug' feature according
> > to the review feedback:
> > 
> > - renamed the parameter that enables the feature and
> >   moved it to common kvm module.
> >   (only SVM part is currently implemented though)
> > 
> > - disable the feature for SEV guests as was suggested during the review
> > - made the vmexit table const again, as was suggested in the review as 
> > well.
> > 
> > Best regards,
> > Maxim Levitsky
> > 
> > Maxim Levitsky (9):
> >scripts/gdb: rework lx-symbols gdb script
> >KVM: introduce KVM_CAP_SET_GUEST_DEBUG2
> >KVM: x86: implement KVM_CAP_SET_GUEST_DEBUG2
> >KVM: aarch64: implement KVM_CAP_SET_GUEST_DEBUG2
> >KVM: s390x: implement KVM_CAP_SET_GUEST_DEBUG2
> >KVM: x86: implement KVM_GUESTDBG_BLOCKEVENTS
> >KVM: SVM: split svm_handle_invalid_exit
> >KVM: x86: add force_intercept_exceptions_mask
> >KVM: SVM: implement force_intercept_exceptions_mask
> > 
> >   Documentation/virt/kvm/api.rst|   4 +
> >   arch/arm64/include/asm/kvm_host.h |   4 +
> >   arch/arm64/kvm/arm.c  |   2 +
> >   arch/arm64/kvm/guest.c|   5 -
> >   arch/s390/include/asm/kvm_host.h  |   4 +
> >   arch/s390/kvm/kvm-s390.c  |   3 +
> >   arch/x86/include/asm/kvm_host.h   |  12 ++
> >   arch/x86/include/uapi/asm/kvm.h   |   1 +
> >   arch/x86/kvm/svm/svm.c|  87 +++--
> >   arch/x86/kvm/svm/svm.h|   6 +-
> >   arch/x86/kvm/x86.c|  14 ++-
> >   arch/x86/kvm/x86.h|   2 +
> >   include/uapi/linux/kvm.h  |   1 +
> >   kernel/module.c   |   8 +-
> >   scripts/gdb/linux/symbols.py  | 203 --
> >   15 files changed, 272 insertions(+), 84 deletions(-)
> > 




Re: [PATCH 1/6] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES

2021-04-06 Thread Maxim Levitsky
On Fri, 2021-04-02 at 17:27 +, Sean Christopherson wrote:
> On Thu, Apr 01, 2021, Maxim Levitsky wrote:
> > Similar to the rest of guest page accesses after migration,
> > this should be delayed to KVM_REQ_GET_NESTED_STATE_PAGES
> > request.
> 
> FWIW, I still object to this approach, and this patch has a plethora of 
> issues.
> 
> I'm not against deferring various state loading to KVM_RUN, but wholesale 
> moving
> all of GUEST_CR3 processing without in-depth consideration of all the side
> effects is a really bad idea.
It could be, I won't argue about this.

> 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  arch/x86/kvm/vmx/nested.c | 14 +-
> >  1 file changed, 9 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index fd334e4aa6db..b44f1f6b68db 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -2564,11 +2564,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
> > struct vmcs12 *vmcs12,
> > return -EINVAL;
> > }
> >  
> > -   /* Shadow page tables on either EPT or shadow page tables. */
> > -   if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, 
> > nested_cpu_has_ept(vmcs12),
> > -   entry_failure_code))
> > -   return -EINVAL;
> > -
> > /*
> >  * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
> >  * on nested VM-Exit, which can occur without actually running L2 and
> > @@ -3109,11 +3104,16 @@ static bool nested_get_evmcs_page(struct kvm_vcpu 
> > *vcpu)
> >  static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
> >  {
> > struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +   enum vm_entry_failure_code entry_failure_code;
> > struct vcpu_vmx *vmx = to_vmx(vcpu);
> > struct kvm_host_map *map;
> > struct page *page;
> > u64 hpa;
> >  
> > +   if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, 
> > nested_cpu_has_ept(vmcs12),
> > +   &entry_failure_code))
> 
> This results in KVM_RUN returning 0 without filling vcpu->run->exit_reason.
> Speaking from experience, debugging those types of issues is beyond painful.
> 
> It also means CR3 is double loaded in the from_vmentry case.
> 
> And it will cause KVM to incorrectly return NVMX_VMENTRY_KVM_INTERNAL_ERROR
> if a consistency check fails when nested_get_vmcs12_pages() is called on
> from_vmentry.  E.g. run unit tests with this and it will silently disappear.

I do remember now that you said something about this, but I wasn't able
to find it in my email. Sorry about this.
I agree with you.

I think the question I should ask is why we really need to
delay accessing guest memory after a migration.

So far I mostly just assumed that we need to do so, thinking that qemu
updates the memslots or something, or maybe because guest memory
isn't fully migrated and relies on post-copy to finish it.

Also, I am not against leaving the CR3 processing here and doing only the
PDPTR load in KVM_RUN (and only when the *SREGS2 API is not used).

> 
> diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
> index bbb006a..b8ccc69 100644
> --- a/x86/vmx_tests.c
> +++ b/x86/vmx_tests.c
> @@ -8172,6 +8172,16 @@ static void test_guest_segment_base_addr_fields(void)
> vmcs_write(GUEST_AR_ES, ar_saved);
>  }
> 
> +static void test_guest_cr3(void)
> +{
> +   u64 cr3_saved = vmcs_read(GUEST_CR3);
> +
> +   vmcs_write(GUEST_CR3, -1ull);
> +   test_guest_state("Bad CR3 fails VM-Enter", true, -1ull, "GUEST_CR3");
> +
> +   vmcs_write(GUEST_CR3, cr3_saved);
> +}
> +
Could you send this test to kvm unit tests?

>  /*
>   * Check that the virtual CPU checks the VMX Guest State Area as
>   * documented in the Intel SDM.
> @@ -8181,6 +8191,8 @@ static void vmx_guest_state_area_test(void)
> vmx_set_test_stage(1);
> test_set_guest(guest_state_test_main);
> 
> +   test_guest_cr3();
> +
> /*
>  * The IA32_SYSENTER_ESP field and the IA32_SYSENTER_EIP field
>  * must each contain a canonical address.
> 
> 
> > +   return false;
> > +
> > if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
> > /*
> >  * Translate L1 physical address to host physical
> > @@ -3357,6 +3357,10 @@ enum nvmx_vmentry_status 
> > nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
> > }
> >  
> > if (from_vmentry) {
> > +   if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3,

Re: [PATCH 5/6] KVM: nSVM: avoid loading PDPTRs after migration when possible

2021-04-06 Thread Maxim Levitsky
On Mon, 2021-04-05 at 17:01 +, Sean Christopherson wrote:
> On Thu, Apr 01, 2021, Maxim Levitsky wrote:
> > if new KVM_*_SREGS2 ioctls are used, the PDPTRs are
> > part of the migration state and thus are loaded
> > by those ioctls.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  arch/x86/kvm/svm/nested.c | 15 +--
> >  1 file changed, 13 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index ac5e3e17bda4..b94916548cfa 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -373,10 +373,9 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, 
> > unsigned long cr3,
> > return -EINVAL;
> >  
> > if (!nested_npt && is_pae_paging(vcpu) &&
> > -   (cr3 != kvm_read_cr3(vcpu) || pdptrs_changed(vcpu))) {
> > +   (cr3 != kvm_read_cr3(vcpu) || !kvm_register_is_available(vcpu, 
> > VCPU_EXREG_PDPTR)))
> > if (CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)))
> 
> What if we ditch the optimizations[*] altogether and just do:
> 
>   if (!nested_npt && is_pae_paging(vcpu) &&
>   CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)))
>   return -EINVAL;
> 
> Won't that obviate the need for KVM_{GET|SET}_SREGS2 since KVM will always 
> load
> the PDPTRs from memory?  IMO, nested migration with shadowing paging doesn't
> warrant this level of optimization complexity.

It's not an optimization; it was done to stay 100% within the x86 spec.
PDPTRs are internal CPU registers which are loaded only when
CR3/CR0/CR4 are written by the guest, when guest entry loads CR3, or
when guest exit loads CR3 (I checked both the Intel and AMD manuals).

In addition, when NPT is enabled, AMD drops this silliness and
just treats the PDPTRs as normal paging entries, while on the Intel side,
when EPT is enabled, the PDPTRs are stored in the VMCS.

Nested migration is neither of these cases, thus the PDPTRs should be
stored out of band.
Same for non-nested migration.

This was requested by Jim Mattson, and I went ahead and
implemented it, even though I do understand that no sane OS
relies on the PDPTRs being out of sync with the actual page
table that contains them.

Best regards,
Maxim Levitsky


> 
> [*] For some definitions of "optimization", since the extra pdptrs_changed()
> check in the existing code is likely a net negative.
> 
> > return -EINVAL;
> > -   }
> >  
> > /*
> >  * TODO: optimize unconditional TLB flush/MMU sync here and in




[PATCH 2/6] KVM: nSVM: call nested_svm_load_cr3 on nested state load

2021-04-01 Thread Maxim Levitsky
While KVM's MMU should be fully reset by the loading of nested CR0/CR3/CR4
via KVM_SET_SREGS, we are not yet in nested mode when we do it, and therefore
only the root_mmu is reset.

On regular nested entries we call nested_svm_load_cr3, which both updates
the guest's CR3 in the MMU when needed and also re-initializes
the MMU, which makes it initialize the walk_mmu as well when nested
paging is enabled in both host and guest.

Since we don't call nested_svm_load_cr3 on nested state load,
the walk_mmu can be left uninitialized, which can lead to a NULL pointer
dereference while accessing it: if we happen to get a nested page fault
right after entering the nested guest for the first time after the migration,
and we decide to emulate it, the emulator ends up trying to call
walk_mmu->gva_to_gpa, which is NULL.

Therefore we should call this function on nested state load as well.

Suggested-by: Paolo Bonzini 
Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 40 +--
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 8523f60adb92..ac5e3e17bda4 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -215,24 +215,6 @@ static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
return true;
 }
 
-static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
-{
-   struct vcpu_svm *svm = to_svm(vcpu);
-
-   if (WARN_ON(!is_guest_mode(vcpu)))
-   return true;
-
-   if (!nested_svm_vmrun_msrpm(svm)) {
-   vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
-   vcpu->run->internal.suberror =
-   KVM_INTERNAL_ERROR_EMULATION;
-   vcpu->run->internal.ndata = 0;
-   return false;
-   }
-
-   return true;
-}
-
 static bool nested_vmcb_check_controls(struct vmcb_control_area *control)
 {
if (CC(!vmcb_is_intercept(control, INTERCEPT_VMRUN)))
@@ -1312,6 +1294,28 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
return ret;
 }
 
+static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_svm *svm = to_svm(vcpu);
+
+   if (WARN_ON(!is_guest_mode(vcpu)))
+   return true;
+
+   if (nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
+   nested_npt_enabled(svm)))
+   return false;
+
+   if (!nested_svm_vmrun_msrpm(svm)) {
+   vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+   vcpu->run->internal.suberror =
+   KVM_INTERNAL_ERROR_EMULATION;
+   vcpu->run->internal.ndata = 0;
+   return false;
+   }
+
+   return true;
+}
+
 struct kvm_x86_nested_ops svm_nested_ops = {
.check_events = svm_check_nested_events,
.triple_fault = nested_svm_triple_fault,
-- 
2.26.2



[PATCH 3/6] KVM: x86: introduce kvm_register_clear_available

2021-04-01 Thread Maxim Levitsky
Small refactoring that will be used in the next patch.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/kvm_cache_regs.h | 7 +++
 arch/x86/kvm/svm/svm.c| 6 ++
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 2e11da2f5621..07d607947805 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -55,6 +55,13 @@ static inline void kvm_register_mark_available(struct 
kvm_vcpu *vcpu,
__set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
 }
 
+static inline void kvm_register_clear_available(struct kvm_vcpu *vcpu,
+  enum kvm_reg reg)
+{
+   __clear_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+   __clear_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
+}
+
 static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
   enum kvm_reg reg)
 {
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 271196400495..2843732299a2 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3880,10 +3880,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct 
kvm_vcpu *vcpu)
vcpu->arch.apf.host_apf_flags =
kvm_read_and_reset_apf_flags();
 
-   if (npt_enabled) {
-   vcpu->arch.regs_avail &= ~(1 << VCPU_EXREG_PDPTR);
-   vcpu->arch.regs_dirty &= ~(1 << VCPU_EXREG_PDPTR);
-   }
+   if (npt_enabled)
+   kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
 
/*
 * We need to handle MC intercepts here before the vcpu has a chance to
-- 
2.26.2



[PATCH 5/6] KVM: nSVM: avoid loading PDPTRs after migration when possible

2021-04-01 Thread Maxim Levitsky
If the new KVM_*_SREGS2 ioctls are used, the PDPTRs are
part of the migration state and thus are loaded
by those ioctls.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index ac5e3e17bda4..b94916548cfa 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -373,10 +373,9 @@ static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, 
unsigned long cr3,
return -EINVAL;
 
if (!nested_npt && is_pae_paging(vcpu) &&
-   (cr3 != kvm_read_cr3(vcpu) || pdptrs_changed(vcpu))) {
+   (cr3 != kvm_read_cr3(vcpu) || !kvm_register_is_available(vcpu, 
VCPU_EXREG_PDPTR)))
if (CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)))
return -EINVAL;
-   }
 
/*
 * TODO: optimize unconditional TLB flush/MMU sync here and in
@@ -552,6 +551,8 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 
vmcb12_gpa,
nested_vmcb02_prepare_control(svm);
nested_vmcb02_prepare_save(svm, vmcb12);
 
+   kvm_register_clear_available(&svm->vcpu, VCPU_EXREG_PDPTR);
+
ret = nested_svm_load_cr3(&svm->vcpu, vmcb12->save.cr3,
  nested_npt_enabled(svm));
if (ret)
@@ -779,6 +780,8 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 
nested_svm_uninit_mmu_context(vcpu);
 
+   kvm_register_clear_available(&svm->vcpu, VCPU_EXREG_PDPTR);
+
rc = nested_svm_load_cr3(vcpu, svm->vmcb->save.cr3, false);
if (rc)
return 1;
@@ -1301,6 +1304,14 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu 
*vcpu)
if (WARN_ON(!is_guest_mode(vcpu)))
return true;
 
+   if (vcpu->arch.reload_pdptrs_on_nested_entry) {
+   /* If legacy KVM_SET_SREGS API was used, it might have
+* loaded wrong PDPTRs from memory so we have to reload
+* them here (which is against x86 spec)
+*/
+   kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
+   }
+
if (nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
nested_npt_enabled(svm)))
return false;
-- 
2.26.2



[PATCH 6/6] KVM: nVMX: avoid loading PDPTRs after migration when possible

2021-04-01 Thread Maxim Levitsky
If the new KVM_*_SREGS2 ioctls are used, the PDPTRs are
part of the migration state and thus are loaded
by those ioctls.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/vmx/nested.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b44f1f6b68db..f2291165995e 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1115,7 +1115,7 @@ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, 
unsigned long cr3, bool ne
 * must not be dereferenced.
 */
if (!nested_ept && is_pae_paging(vcpu) &&
-   (cr3 != kvm_read_cr3(vcpu) || pdptrs_changed(vcpu))) {
+   (cr3 != kvm_read_cr3(vcpu) || !kvm_register_is_available(vcpu, 
VCPU_EXREG_PDPTR))) {
if (CC(!load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))) {
*entry_failure_code = ENTRY_FAIL_PDPTE;
return -EINVAL;
@@ -3110,6 +3110,14 @@ static bool nested_get_vmcs12_pages(struct kvm_vcpu 
*vcpu)
struct page *page;
u64 hpa;
 
+   if (vcpu->arch.reload_pdptrs_on_nested_entry) {
+   /* if legacy KVM_SET_SREGS API was used, it might have loaded
+* wrong PDPTRs from memory so we have to reload them here
+* (which is against x86 spec)
+*/
+   kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
+   }
+
if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, 
nested_cpu_has_ept(vmcs12),
+   &entry_failure_code))
return false;
@@ -3357,6 +3365,7 @@ enum nvmx_vmentry_status 
nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
}
 
if (from_vmentry) {
+   kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3,
nested_cpu_has_ept(vmcs12), &entry_failure_code))
goto vmentry_fail_vmexit_guest_mode;
@@ -4195,6 +4204,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 * Only PDPTE load can fail as the value of cr3 was checked on entry and
 * couldn't have changed.
 */
+   kvm_register_clear_available(vcpu, VCPU_EXREG_PDPTR);
if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, ))
nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
 
-- 
2.26.2



[PATCH 0/6] Introduce KVM_{GET|SET}_SREGS2 and fix PDPTR migration

2021-04-01 Thread Maxim Levitsky
This patch set aims to fix a few flaws that were discovered
in KVM_{GET|SET}_SREGS on x86:

* There is no support for reading/writing PDPTRs, although
  these are considered to be part of the guest state.

* There is a useless interrupt bitmap which isn't needed

* No support for future extensions (via flags and such)

The final two patches in this series allow PDPTRs to be
migrated correctly when the new API is used.

This patch series was tested by doing a nested migration test
of 32 bit PAE L1 + 32 bit PAE L2 on AMD and Intel, and by a
nested migration test of 64 bit L1 + 32 bit PAE L2 on AMD.
The latter test currently fails on Intel (regardless of my patches).

Finally, patch 2 in this series fixes a rare L0 kernel oops,
which I can trigger by migrating a Hyper-V machine.

Best regards,
Maxim Levitsky

Maxim Levitsky (6):
  KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES
  KVM: nSVM: call nested_svm_load_cr3 on nested state load
  KVM: x86: introduce kvm_register_clear_available
  KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2
  KVM: nSVM: avoid loading PDPTRs after migration when possible
  KVM: nVMX: avoid loading PDPTRs after migration when possible

 Documentation/virt/kvm/api.rst  |  43 ++
 arch/x86/include/asm/kvm_host.h |   7 ++
 arch/x86/include/uapi/asm/kvm.h |  13 +++
 arch/x86/kvm/kvm_cache_regs.h   |  12 +++
 arch/x86/kvm/svm/nested.c   |  55 -
 arch/x86/kvm/svm/svm.c  |   6 +-
 arch/x86/kvm/vmx/nested.c   |  26 --
 arch/x86/kvm/x86.c  | 136 ++--
 include/uapi/linux/kvm.h|   5 ++
 9 files changed, 249 insertions(+), 54 deletions(-)

-- 
2.26.2




[PATCH 4/6] KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2

2021-04-01 Thread Maxim Levitsky
This is a new version of the KVM_GET_SREGS / KVM_SET_SREGS ioctls,
aiming to replace them.

It has the following changes:
   * Has flags for future extensions
   * Has the vCPU's PDPTRs, which allows saving/restoring them on migration.
   * Lacks the obsolete interrupt bitmap (now done via KVM_SET_VCPU_EVENTS)

A new capability, KVM_CAP_SREGS2, is added to signal
the availability of this ioctl to userspace.

Currently only implemented on x86.
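
To illustrate the intended userspace flow, a rough sketch follows. This is an editor's illustration, not part of the patch; it assumes the uapi definitions added below are available in the installed headers and that vm_fd and the two vcpu_fd descriptors are already open.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Copy the special registers, including the PDPTRs, from a source vCPU
 * to a destination vCPU, bailing out if the capability is missing. */
static int migrate_sregs2(int vm_fd, int src_vcpu_fd, int dst_vcpu_fd)
{
        struct kvm_sregs2 sregs2;

        if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SREGS2) <= 0)
                return -1;      /* caller falls back to KVM_GET/SET_SREGS */

        memset(&sregs2, 0, sizeof(sregs2));
        if (ioctl(src_vcpu_fd, KVM_GET_SREGS2, &sregs2))
                return -1;

        /* sregs2.pdptrs[] now holds the architectural PDPTR values. */
        sregs2.flags = 0;       /* must be zero in this version */
        return ioctl(dst_vcpu_fd, KVM_SET_SREGS2, &sregs2);
}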

Signed-off-by: Maxim Levitsky 
---
 Documentation/virt/kvm/api.rst  |  43 ++
 arch/x86/include/asm/kvm_host.h |   7 ++
 arch/x86/include/uapi/asm/kvm.h |  13 +++
 arch/x86/kvm/kvm_cache_regs.h   |   5 ++
 arch/x86/kvm/x86.c  | 136 ++--
 include/uapi/linux/kvm.h|   5 ++
 6 files changed, 185 insertions(+), 24 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 38e327d4b479..b006d5b5f554 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4941,6 +4941,49 @@ see KVM_XEN_VCPU_SET_ATTR above.
 The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
 with the KVM_XEN_VCPU_GET_ATTR ioctl.
 
+
+4.131 KVM_GET_SREGS2
+--------------------
+
+:Capability: KVM_CAP_SREGS2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs2 (out)
+:Returns: 0 on success, -1 on error
+
+Reads special registers from the vcpu.
+This ioctl is preferred over KVM_GET_SREGS when available.
+
+::
+
+struct kvm_sregs2 {
+   /* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
+   struct kvm_segment cs, ds, es, fs, gs, ss;
+   struct kvm_segment tr, ldt;
+   struct kvm_dtable gdt, idt;
+   __u64 cr0, cr2, cr3, cr4, cr8;
+   __u64 efer;
+   __u64 apic_base;
+   __u64 flags; /* must be zero*/
+   __u64 pdptrs[4];
+   __u64 padding;
+};
+
+
+4.132 KVM_SET_SREGS2
+--------------------
+
+:Capability: KVM_CAP_SREGS2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs2 (in)
+:Returns: 0 on success, -1 on error
+
+Writes special registers into the vcpu.
+See KVM_GET_SREGS2 for the data structures.
+This ioctl is preferred over the KVM_SET_SREGS when available.
+
+
 5. The kvm_run structure
 
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a52f973bdff6..87b680d111f9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -838,6 +838,13 @@ struct kvm_vcpu_arch {
 
/* Protected Guests */
bool guest_state_protected;
+
+   /*
+* Do we need to reload the pdptrs when entering nested state?
+* Set after nested migration if userspace didn't use the
+* newer KVM_SET_SREGS2 ioctl to load pdptrs from the migration state.
+*/
+   bool reload_pdptrs_on_nested_entry;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5a3022c8af82..201a85884c81 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -159,6 +159,19 @@ struct kvm_sregs {
__u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
 };
 
+struct kvm_sregs2 {
+   /* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
+   struct kvm_segment cs, ds, es, fs, gs, ss;
+   struct kvm_segment tr, ldt;
+   struct kvm_dtable gdt, idt;
+   __u64 cr0, cr2, cr3, cr4, cr8;
+   __u64 efer;
+   __u64 apic_base;
+   __u64 flags; /* must be zero*/
+   __u64 pdptrs[4];
+   __u64 padding;
+};
+
 /* for KVM_GET_FPU and KVM_SET_FPU */
 struct kvm_fpu {
__u8  fpr[8][16];
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 07d607947805..1a6e2de4248a 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -120,6 +120,11 @@ static inline u64 kvm_pdptr_read(struct kvm_vcpu *vcpu, 
int index)
return vcpu->arch.walk_mmu->pdptrs[index];
 }
 
+static inline void kvm_pdptr_write(struct kvm_vcpu *vcpu, int index, u64 value)
+{
+   vcpu->arch.walk_mmu->pdptrs[index] = value;
+}
+
 static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
 {
ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a9d95f90a048..f10a37f88c30 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -112,6 +112,9 @@ static void __kvm_set_rflags(struct kvm_vcpu *vcpu, 
unsigned long rflags);
 static void store_regs(struct kvm_vcpu *vcpu);
 static int sync_regs(struct kvm_vcpu *vcpu);
 
+static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
+static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
+
 struct kvm_x86_ops kvm_x86_ops __read_mostly;
 EXPORT_SYMBOL_GPL(kvm_x86_ops);
 
@@ -3796,6 +3799,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
case KVM_CAP_X86_USER_SPACE_MSR:
case KVM_CAP_X86_MSR_FILTER:
case KVM_CAP_ENFORCE_

[PATCH 1/6] KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES

2021-04-01 Thread Maxim Levitsky
Similar to the rest of guest page accesses after migration,
this should be delayed to KVM_REQ_GET_NESTED_STATE_PAGES
request.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/vmx/nested.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index fd334e4aa6db..b44f1f6b68db 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2564,11 +2564,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12,
return -EINVAL;
}
 
-   /* Shadow page tables on either EPT or shadow page tables. */
-   if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, 
nested_cpu_has_ept(vmcs12),
-   entry_failure_code))
-   return -EINVAL;
-
/*
 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
 * on nested VM-Exit, which can occur without actually running L2 and
@@ -3109,11 +3104,16 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
 static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 {
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+   enum vm_entry_failure_code entry_failure_code;
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct kvm_host_map *map;
struct page *page;
u64 hpa;
 
+   if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, 
nested_cpu_has_ept(vmcs12),
+   &entry_failure_code))
+   return false;
+
if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
/*
 * Translate L1 physical address to host physical
@@ -3357,6 +3357,10 @@ enum nvmx_vmentry_status 
nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
}
 
if (from_vmentry) {
+   if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3,
+   nested_cpu_has_ept(vmcs12), &entry_failure_code))
+   goto vmentry_fail_vmexit_guest_mode;
+
failed_index = nested_vmx_load_msr(vcpu,
   
vmcs12->vm_entry_msr_load_addr,
   
vmcs12->vm_entry_msr_load_count);
-- 
2.26.2



[PATCH v2 7/9] KVM: SVM: split svm_handle_invalid_exit

2021-04-01 Thread Maxim Levitsky
Split the check for having a vmexit handler into
svm_check_exit_valid, and make svm_handle_invalid_exit
only handle a vmexit that is already known to be invalid.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/svm.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 271196400495..2aa951bc470c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3220,12 +3220,14 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
   "excp_to:", save->last_excp_to);
 }
 
-static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code)
+static bool svm_check_exit_valid(struct kvm_vcpu *vcpu, u64 exit_code)
 {
-   if (exit_code < ARRAY_SIZE(svm_exit_handlers) &&
-   svm_exit_handlers[exit_code])
-   return 0;
+   return (exit_code < ARRAY_SIZE(svm_exit_handlers) &&
+   svm_exit_handlers[exit_code]);
+}
 
+static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code)
+{
vcpu_unimpl(vcpu, "svm: unexpected exit reason 0x%llx\n", exit_code);
dump_vmcb(vcpu);
vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
@@ -3233,14 +3235,13 @@ static int svm_handle_invalid_exit(struct kvm_vcpu 
*vcpu, u64 exit_code)
vcpu->run->internal.ndata = 2;
vcpu->run->internal.data[0] = exit_code;
vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
-
-   return -EINVAL;
+   return 0;
 }
 
 int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
 {
-   if (svm_handle_invalid_exit(vcpu, exit_code))
-   return 0;
+   if (!svm_check_exit_valid(vcpu, exit_code))
+   return svm_handle_invalid_exit(vcpu, exit_code);
 
 #ifdef CONFIG_RETPOLINE
if (exit_code == SVM_EXIT_MSR)
-- 
2.26.2



[PATCH v2 3/9] KVM: x86: implement KVM_CAP_SET_GUEST_DEBUG2

2021-04-01 Thread Maxim Levitsky
Store the supported bits in the KVM_GUESTDBG_VALID_MASK
macro, similar to how arm does this.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h | 9 +
 arch/x86/kvm/x86.c  | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a52f973bdff6..cc7c82a449d5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -221,6 +221,15 @@ enum x86_intercept_stage;
 #define DR7_FIXED_10x0400
 #define DR7_VOLATILE   0x2bff
 
+#define KVM_GUESTDBG_VALID_MASK \
+   (KVM_GUESTDBG_ENABLE | \
+   KVM_GUESTDBG_SINGLESTEP | \
+   KVM_GUESTDBG_USE_HW_BP | \
+   KVM_GUESTDBG_USE_SW_BP | \
+   KVM_GUESTDBG_INJECT_BP | \
+   KVM_GUESTDBG_INJECT_DB)
+
+
 #define PFERR_PRESENT_BIT 0
 #define PFERR_WRITE_BIT 1
 #define PFERR_USER_BIT 2
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a9d95f90a048..956e8e0bd6af 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3798,6 +3798,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
r = 1;
break;
+   case KVM_CAP_SET_GUEST_DEBUG2:
+   return KVM_GUESTDBG_VALID_MASK;
 #ifdef CONFIG_KVM_XEN
case KVM_CAP_XEN_HVM:
r = KVM_XEN_HVM_CONFIG_HYPERCALL_MSR |
-- 
2.26.2



[PATCH v2 6/9] KVM: x86: implement KVM_GUESTDBG_BLOCKEVENTS

2021-04-01 Thread Maxim Levitsky
KVM_GUESTDBG_BLOCKEVENTS is a guest debug feature that
will allow KVM to block all interrupts while running.
It is mostly intended to be used together with single stepping,
to make it more robust, and has the following benefits:

* Resuming from a breakpoint is much more reliable:
  When resuming execution from a breakpoint, with interrupts enabled,
  more often than not, KVM would inject an interrupt and make the CPU
  jump immediately to the interrupt handler and eventually return to
  the breakpoint, only to trigger it again.

  From the gdb user's point of view it looks like the CPU has never
  executed a single instruction, and in some cases that can even
  prevent forward progress, for example, when the breakpoint
  is placed by an automated script (e.g lx-symbols), which does
  something in response to the breakpoint and then continues
  the guest automatically.
  If the script execution takes enough time for another interrupt to
  arrive, the guest will be stuck on the same breakpoint forever.

* Normal single stepping is much more predictable, since it won't
  land the debugger into an interrupt handler.

* Chances of RFLAGS.TF being leaked to the guest are reduced:

  KVM sets that flag behind the guest's back to single step it,
  but if the single step lands the vCPU in an
  interrupt/exception handler, RFLAGS.TF will be leaked to the
  guest by being pushed onto the stack.
  This doesn't completely eliminate this problem as exceptions
  can still happen, but at least this eliminates the common
  case.
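
For reference, a minimal sketch of how a debugger stub might request this behaviour from userspace. This is an editor's illustration, not part of the series; it assumes the KVM_GUESTDBG_BLOCKIRQ definition added below is visible in the uapi headers and that vcpu_fd is an open vCPU descriptor.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Single-step the vCPU while suppressing interrupt/NMI/SMI injection. */
static int single_step_blocking_irqs(int vcpu_fd)
{
        struct kvm_guest_debug dbg;

        memset(&dbg, 0, sizeof(dbg));
        dbg.control = KVM_GUESTDBG_ENABLE |
                      KVM_GUESTDBG_SINGLESTEP |
                      KVM_GUESTDBG_BLOCKIRQ;

        /* The next KVM_RUN will single-step without injecting events. */
        return ioctl(vcpu_fd, KVM_SET_GUEST_DEBUG, &dbg);
}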

Signed-off-by: Maxim Levitsky 
---
 Documentation/virt/kvm/api.rst  | 1 +
 arch/x86/include/asm/kvm_host.h | 3 ++-
 arch/x86/include/uapi/asm/kvm.h | 1 +
 arch/x86/kvm/x86.c  | 4 
 4 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9778b2434c03..a4f2dc84741f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -3338,6 +3338,7 @@ flags which can include the following:
   - KVM_GUESTDBG_INJECT_DB: inject DB type exception [x86]
   - KVM_GUESTDBG_INJECT_BP: inject BP type exception [x86]
   - KVM_GUESTDBG_EXIT_PENDING:  trigger an immediate guest exit [s390]
+  - KVM_GUESTDBG_BLOCKIRQ:  avoid injecting interrupts/NMI/SMI [x86]
 
 For example KVM_GUESTDBG_USE_SW_BP indicates that software breakpoints
 are enabled in memory so we need to ensure breakpoint exceptions are
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cc7c82a449d5..8c529ae9dbbe 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -227,7 +227,8 @@ enum x86_intercept_stage;
KVM_GUESTDBG_USE_HW_BP | \
KVM_GUESTDBG_USE_SW_BP | \
KVM_GUESTDBG_INJECT_BP | \
-   KVM_GUESTDBG_INJECT_DB)
+   KVM_GUESTDBG_INJECT_DB | \
+   KVM_GUESTDBG_BLOCKIRQ)
 
 
 #define PFERR_PRESENT_BIT 0
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5a3022c8af82..b0f9945067f7 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -282,6 +282,7 @@ struct kvm_debug_exit_arch {
 #define KVM_GUESTDBG_USE_HW_BP 0x0002
 #define KVM_GUESTDBG_INJECT_DB 0x0004
 #define KVM_GUESTDBG_INJECT_BP 0x0008
+#define KVM_GUESTDBG_BLOCKIRQ  0x0010
 
 /* for KVM_SET_GUEST_DEBUG */
 struct kvm_guest_debug_arch {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 956e8e0bd6af..3627ce8fe5bb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8460,6 +8460,10 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, 
bool *req_immediate_exit
can_inject = false;
}
 
+   /* Don't inject interrupts if the user asked to avoid doing so */
+   if (vcpu->guest_debug & KVM_GUESTDBG_BLOCKIRQ)
+   return;
+
/*
 * Finally, inject interrupt events.  If an event cannot be injected
 * due to architectural conditions (e.g. IF=0) a window-open exit
-- 
2.26.2



[PATCH v2 4/9] KVM: aarch64: implement KVM_CAP_SET_GUEST_DEBUG2

2021-04-01 Thread Maxim Levitsky
Move KVM_GUESTDBG_VALID_MASK to kvm_host.h
and use it to return the value of this capability.
Compile tested only.

Signed-off-by: Maxim Levitsky 
---
 arch/arm64/include/asm/kvm_host.h | 4 
 arch/arm64/kvm/arm.c  | 2 ++
 arch/arm64/kvm/guest.c| 5 -
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 3d10e6527f7d..613421454ab6 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -401,6 +401,10 @@ struct kvm_vcpu_arch {
 #define KVM_ARM64_PENDING_EXCEPTION(1 << 8) /* Exception pending */
 #define KVM_ARM64_EXCEPT_MASK  (7 << 9) /* Target EL/MODE */
 
+#define KVM_GUESTDBG_VALID_MASK (KVM_GUESTDBG_ENABLE | \
+KVM_GUESTDBG_USE_SW_BP | \
+KVM_GUESTDBG_USE_HW | \
+KVM_GUESTDBG_SINGLESTEP)
 /*
  * When KVM_ARM64_PENDING_EXCEPTION is set, KVM_ARM64_EXCEPT_MASK can
  * take the following values:
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 7f06ba76698d..e575eff76e97 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -208,6 +208,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VCPU_ATTRIBUTES:
r = 1;
break;
+   case KVM_CAP_SET_GUEST_DEBUG2:
+   return KVM_GUESTDBG_VALID_MASK;
case KVM_CAP_ARM_SET_DEVICE_ADDR:
r = 1;
break;
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 9bbd30e62799..6cb39ee74acd 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -888,11 +888,6 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
return -EINVAL;
 }
 
-#define KVM_GUESTDBG_VALID_MASK (KVM_GUESTDBG_ENABLE |\
-   KVM_GUESTDBG_USE_SW_BP | \
-   KVM_GUESTDBG_USE_HW | \
-   KVM_GUESTDBG_SINGLESTEP)
-
 /**
  * kvm_arch_vcpu_ioctl_set_guest_debug - set up guest debugging
  * @kvm:   pointer to the KVM struct
-- 
2.26.2



[PATCH v2 9/9] KVM: SVM: implement force_intercept_exceptions_mask

2021-04-01 Thread Maxim Levitsky
Currently #TS interception is only done once.
Also exception interception is not enabled for SEV guests.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/svm/svm.c  | 70 +
 arch/x86/kvm/svm/svm.h  |  6 ++-
 arch/x86/kvm/x86.c  |  5 ++-
 4 files changed, 80 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8c529ae9dbbe..d15ae64a2c4e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1574,6 +1574,8 @@ int kvm_emulate_rdpmc(struct kvm_vcpu *vcpu);
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
 void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long 
payload);
+void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
+u32 error_code, unsigned long payload);
 void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 
error_code);
 void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 2aa951bc470c..de7fd7922ec7 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -220,6 +220,8 @@ static const u32 msrpm_ranges[] = {0, 0xc000, 
0xc001};
 #define MSRS_RANGE_SIZE 2048
 #define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2)
 
+static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code);
+
 u32 svm_msrpm_offset(u32 msr)
 {
u32 offset;
@@ -1113,6 +1115,22 @@ static void svm_check_invpcid(struct vcpu_svm *svm)
}
 }
 
+static void svm_init_force_exceptions_intercepts(struct vcpu_svm *svm)
+{
+   int exc;
+
+   svm->force_intercept_exceptions_mask = force_intercept_exceptions_mask;
+   for (exc = 0 ; exc < 32 ; exc++) {
+   if (!(svm->force_intercept_exceptions_mask & (1 << exc)))
+   continue;
+
+   /* Those are defined to have undefined behavior in the SVM spec 
*/
+   if (exc != 2 && exc != 9)
+   continue;
+   set_exception_intercept(svm, exc);
+   }
+}
+
 static void init_vmcb(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
@@ -1288,6 +1306,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
 
enable_gif(svm);
 
+   if (!sev_es_guest(vcpu->kvm))
+   svm_init_force_exceptions_intercepts(svm);
+
 }
 
 static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -1913,6 +1934,17 @@ static int pf_interception(struct kvm_vcpu *vcpu)
u64 fault_address = svm->vmcb->control.exit_info_2;
u64 error_code = svm->vmcb->control.exit_info_1;
 
+   if ((svm->force_intercept_exceptions_mask & (1 << PF_VECTOR)))
+   if (npt_enabled && !vcpu->arch.apf.host_apf_flags) {
+   /* If the #PF was only intercepted for debug, inject
+* it directly to the guest, since the kvm's mmu code
+* is not ready to deal with such page faults.
+*/
+   kvm_queue_exception_e_p(vcpu, PF_VECTOR,
+   error_code, fault_address);
+   return 1;
+   }
+
return kvm_handle_page_fault(vcpu, error_code, fault_address,
static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
svm->vmcb->control.insn_bytes : NULL,
@@ -1988,6 +2020,40 @@ static int ac_interception(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static int gen_exc_interception(struct kvm_vcpu *vcpu)
+{
+   /*
+* Generic exception intercept handler which forwards a guest exception
+* as-is to the guest.
+* For exceptions that don't have a special intercept handler.
+*
+* Used only for 'force_intercept_exceptions_mask' KVM debug feature.
+*/
+   struct vcpu_svm *svm = to_svm(vcpu);
+   int exc = svm->vmcb->control.exit_code - SVM_EXIT_EXCP_BASE;
+
+   /* SVM doesn't provide us with an error code for the #DF */
+   u32 err_code = exc == DF_VECTOR ? 0 : svm->vmcb->control.exit_info_1;
+
+   if (!(svm->force_intercept_exceptions_mask & (1 << exc)))
+   return svm_handle_invalid_exit(vcpu, 
svm->vmcb->control.exit_code);
+
+   if (exc == TS_VECTOR) {
+   /*
+* SVM doesn't provide us with an error code to be able to
+* re-inject the #TS exception, so just disable its
+* intercept, and let the guest re-execute the instruction.
+*/
+   vmcb_clr_in

[PATCH v2 1/9] scripts/gdb: rework lx-symbols gdb script

2021-04-01 Thread Maxim Levitsky
Fix several issues that are present in lx-symbols script:

* Track module unloads by placing another software breakpoint at
  'free_module' (force-uninline this symbol just in case), and use the
  remove-symbol-file gdb command to unload the symbols of the module
  that is being unloaded.

  That gives gdb a chance to mark all software breakpoints from
  this module as pending again.
  Also remove the module from the 'known' module list once it is unloaded.

* Since we now track module unloads, we no longer need to reload all
  symbols when a 'known' module is loaded again
  (that can't happen anymore).
  This allows reloading a module in the debugged kernel to finish
  much faster, while lx-symbols tracks module loads and unloads.

* Disable/enable all gdb breakpoints on both module load and unload
  breakpoint hits, and not only in 'load_all_symbols' as was done before
  (load_all_symbols is no longer called on a breakpoint hit).
  That keeps gdb from getting confused about the state of the
  (now two) internal breakpoints we place.
  Otherwise it would leave them in the kernel code segment when
  continuing, which triggers a guest kernel panic as soon as it skips
  over the 'int3' instruction and executes the garbage tail of the opcode
  on which the breakpoint was placed.

* Block SIGINT while symbols are reloading as this seems to crash gdb.
  (new in V2)

* Add a basic check that kernel is already loaded into the guest memory
  to avoid confusing errors.
  (new in V2)

Signed-off-by: Maxim Levitsky 
---
 kernel/module.c  |   8 +-
 scripts/gdb/linux/symbols.py | 203 +++
 2 files changed, 143 insertions(+), 68 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 30479355ab85..ea81fc06ea1f 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -901,8 +901,12 @@ int module_refcount(struct module *mod)
 }
 EXPORT_SYMBOL(module_refcount);
 
-/* This exists whether we can unload or not */
-static void free_module(struct module *mod);
+/* This exists whether we can unload or not
+ * Keep it uninlined to provide a reliable breakpoint target,
+ * e.g. for the gdb helper command 'lx-symbols'.
+ */
+
+static noinline void free_module(struct module *mod);
 
 SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
unsigned int, flags)
diff --git a/scripts/gdb/linux/symbols.py b/scripts/gdb/linux/symbols.py
index 1be9763cf8bb..e1374a6e06f7 100644
--- a/scripts/gdb/linux/symbols.py
+++ b/scripts/gdb/linux/symbols.py
@@ -14,45 +14,23 @@
 import gdb
 import os
 import re
+import signal
 
 from linux import modules, utils
 
 
 if hasattr(gdb, 'Breakpoint'):
-class LoadModuleBreakpoint(gdb.Breakpoint):
-def __init__(self, spec, gdb_command):
-super(LoadModuleBreakpoint, self).__init__(spec, internal=True)
+
+class BreakpointWrapper(gdb.Breakpoint):
+def __init__(self, callback, **kwargs):
+super(BreakpointWrapper, self).__init__(internal=True, **kwargs)
 self.silent = True
-self.gdb_command = gdb_command
+self.callback = callback
 
 def stop(self):
-module = gdb.parse_and_eval("mod")
-module_name = module['name'].string()
-cmd = self.gdb_command
-
-# enforce update if object file is not found
-cmd.module_files_updated = False
-
-# Disable pagination while reporting symbol (re-)loading.
-# The console input is blocked in this context so that we would
-# get stuck waiting for the user to acknowledge paged output.
-show_pagination = gdb.execute("show pagination", to_string=True)
-pagination = show_pagination.endswith("on.\n")
-gdb.execute("set pagination off")
-
-if module_name in cmd.loaded_modules:
-gdb.write("refreshing all symbols to reload module "
-  "'{0}'\n".format(module_name))
-cmd.load_all_symbols()
-else:
-cmd.load_module_symbols(module)
-
-# restore pagination state
-gdb.execute("set pagination %s" % ("on" if pagination else "off"))
-
+self.callback()
 return False
 
-
 class LxSymbols(gdb.Command):
 """(Re-)load symbols of Linux kernel and currently loaded modules.
 
@@ -61,15 +39,52 @@ are scanned recursively, starting in the same directory. 
Optionally, the module
 search path can be extended by a space separated list of paths passed to the
 lx-symbols command."""
 
-module_paths = []
-module_files = []
-module_files_updated = False
-loaded_modules = []
-breakpoint = None
-
 def __init__(self):
 super(LxSymbols, self).__init__("lx-symbols", gdb.COMMAND_FILES,

[PATCH v2 8/9] KVM: x86: add force_intercept_exceptions_mask

2021-04-01 Thread Maxim Levitsky
This parameter will be used by VMX and SVM code to force
interception of a set of exceptions, given by a bitmask
for guest debug and/or kvm debug.

This option is not intended for production.

This is based on an idea first shown here:
https://patchwork.kernel.org/project/kvm/patch/20160301192822.gd22...@pd.tnic/
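
As a concrete example of such a bitmask (an editor's sketch; the vector numbers are the architectural x86 assignments and the parameter name matches the one added below), intercepting #GP and #PF would look like this:

/* x86 architectural exception vectors. */
#define GP_VECTOR       13
#define PF_VECTOR       14

/* Value to write into the force_intercept_exceptions_mask module
 * parameter to trace #GP and #PF: (1 << 13) | (1 << 14) == 0x6000. */
static const unsigned int debug_exc_mask =
        (1u << GP_VECTOR) | (1u << PF_VECTOR);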

CC: Borislav Petkov 
Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/x86.c | 3 +++
 arch/x86/kvm/x86.h | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3627ce8fe5bb..1a51031d64d8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -176,6 +176,9 @@ module_param(force_emulation_prefix, bool, S_IRUGO);
 int __read_mostly pi_inject_timer = -1;
 module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
 
+uint force_intercept_exceptions_mask;
+module_param(force_intercept_exceptions_mask, uint, S_IRUGO | S_IWUSR);
+EXPORT_SYMBOL_GPL(force_intercept_exceptions_mask);
 /*
  * Restoring the host value for MSRs that are only consumed when running in
  * usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index daccf20fbcd5..644480711ff7 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -311,6 +311,8 @@ extern struct static_key kvm_no_apic_vcpu;
 
 extern bool report_ignored_msrs;
 
+extern uint force_intercept_exceptions_mask;
+
 static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
 {
return pvclock_scale_delta(nsec, vcpu->arch.virtual_tsc_mult,
-- 
2.26.2



[PATCH v2 5/9] KVM: s390x: implement KVM_CAP_SET_GUEST_DEBUG2

2021-04-01 Thread Maxim Levitsky
Define KVM_GUESTDBG_VALID_MASK and use it to implement this capability.
Compile tested only.

Signed-off-by: Maxim Levitsky 
---
 arch/s390/include/asm/kvm_host.h | 4 
 arch/s390/kvm/kvm-s390.c | 3 +++
 2 files changed, 7 insertions(+)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 6bcfc5614bbc..a3902b57b825 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -700,6 +700,10 @@ struct kvm_hw_bp_info_arch {
 #define guestdbg_exit_pending(vcpu) (guestdbg_enabled(vcpu) && \
(vcpu->guest_debug & KVM_GUESTDBG_EXIT_PENDING))
 
+#define KVM_GUESTDBG_VALID_MASK \
+   (KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP |\
+   KVM_GUESTDBG_USE_HW_BP | KVM_GUESTDBG_EXIT_PENDING)
+
 struct kvm_guestdbg_info_arch {
unsigned long cr0;
unsigned long cr9;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 2f09e9d7dc95..2049fc8c222a 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -544,6 +544,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_S390_DIAG318:
r = 1;
break;
+   case KVM_CAP_SET_GUEST_DEBUG2:
+   r = KVM_GUESTDBG_VALID_MASK;
+   break;
case KVM_CAP_S390_HPAGE_1M:
r = 0;
if (hpage && !kvm_is_ucontrol(kvm))
-- 
2.26.2



[PATCH v2 2/9] KVM: introduce KVM_CAP_SET_GUEST_DEBUG2

2021-04-01 Thread Maxim Levitsky
This capability will allow the user to know which KVM_GUESTDBG_* bits
are supported.
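
A minimal sketch of how userspace would consume the new capability (an editor's illustration, assuming kvm_fd is an open /dev/kvm or VM file descriptor):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns the bitmask of KVM_GUESTDBG_* flags the kernel supports,
 * or 0 when the capability is absent (older kernels). */
static unsigned int supported_guestdbg_flags(int kvm_fd)
{
        int r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SET_GUEST_DEBUG2);

        return r > 0 ? (unsigned int)r : 0;
}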

Signed-off-by: Maxim Levitsky 
---
 Documentation/virt/kvm/api.rst | 3 +++
 include/uapi/linux/kvm.h   | 1 +
 2 files changed, 4 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 38e327d4b479..9778b2434c03 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -3357,6 +3357,9 @@ indicating the number of supported registers.
 For ppc, the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability indicates whether
 the single-step debug event (KVM_GUESTDBG_SINGLESTEP) is supported.
 
+Also when supported, KVM_CAP_SET_GUEST_DEBUG2 capability indicates the
+supported KVM_GUESTDBG_* bits in the control field.
+
 When debug events exit the main run loop with the reason
 KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run
 structure containing architecture specific debug information.
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f6afee209620..727010788eff 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1078,6 +1078,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING 192
 #define KVM_CAP_X86_BUS_LOCK_EXIT 193
 #define KVM_CAP_PPC_DAWR1 194
+#define KVM_CAP_SET_GUEST_DEBUG2 195
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.26.2



[PATCH v2 0/9] KVM: my debug patch queue

2021-04-01 Thread Maxim Levitsky
Hi!

I would like to publish two debug features which were needed for other stuff
I work on.

One is the reworked lx-symbols script which now actually works on at least
gdb 9.1 (gdb 9.2 was reported to fail to load the debug symbols from the kernel
for some reason, not related to this patch) and upstream qemu.

The other feature is the ability to trap all guest exceptions (on SVM for now)
and see them in kvmtrace prior to potential merge to double/triple fault.

This can be very useful and I already had to manually patch KVM a few
times for this.
I will, once time permits, implement this feature on Intel as well.

V2:

 * Some more refactoring and workarounds for lx-symbols script

 * added KVM_GUESTDBG_BLOCKEVENTS flag to enable 'block interrupts on
   single step' together with KVM_CAP_SET_GUEST_DEBUG2 capability
   to indicate which guest debug flags are supported.

   This is a replacement for unconditional block of interrupts on single
   step that was done in previous version of this patch set.
   Patches to qemu to use that feature will be sent soon.

 * Reworked the 'intercept all exceptions for debug' feature according
   to the review feedback:

   - renamed the parameter that enables the feature and
 moved it to common kvm module.
 (only SVM part is currently implemented though)

   - disable the feature for SEV guests as was suggested during the review
   - made the vmexit table const again, as was suggested in the review as well.

Best regards,
Maxim Levitsky

Maxim Levitsky (9):
  scripts/gdb: rework lx-symbols gdb script
  KVM: introduce KVM_CAP_SET_GUEST_DEBUG2
  KVM: x86: implement KVM_CAP_SET_GUEST_DEBUG2
  KVM: aarch64: implement KVM_CAP_SET_GUEST_DEBUG2
  KVM: s390x: implement KVM_CAP_SET_GUEST_DEBUG2
  KVM: x86: implement KVM_GUESTDBG_BLOCKEVENTS
  KVM: SVM: split svm_handle_invalid_exit
  KVM: x86: add force_intercept_exceptions_mask
  KVM: SVM: implement force_intercept_exceptions_mask

 Documentation/virt/kvm/api.rst|   4 +
 arch/arm64/include/asm/kvm_host.h |   4 +
 arch/arm64/kvm/arm.c  |   2 +
 arch/arm64/kvm/guest.c|   5 -
 arch/s390/include/asm/kvm_host.h  |   4 +
 arch/s390/kvm/kvm-s390.c  |   3 +
 arch/x86/include/asm/kvm_host.h   |  12 ++
 arch/x86/include/uapi/asm/kvm.h   |   1 +
 arch/x86/kvm/svm/svm.c|  87 +++--
 arch/x86/kvm/svm/svm.h|   6 +-
 arch/x86/kvm/x86.c|  14 ++-
 arch/x86/kvm/x86.h|   2 +
 include/uapi/linux/kvm.h  |   1 +
 kernel/module.c   |   8 +-
 scripts/gdb/linux/symbols.py  | 203 --
 15 files changed, 272 insertions(+), 84 deletions(-)

-- 
2.26.2




Re: [PATCH 0/2] KVM: x86: nSVM: fixes for SYSENTER emulation

2021-04-01 Thread Maxim Levitsky
On Thu, 2021-04-01 at 14:16 +0300, Maxim Levitsky wrote:
> This is the result of a deep rabbit-hole dive in regard to why
> nested migration of 32 bit guests is currently
> totally broken on AMD.

Please ignore this patch series, I didn't update the patch version.

Best regards,
    Maxim Levitsky

> 
> It turns out that, due to slight differences between the original AMD64
> implementation and Intel's remake, the SYSENTER instruction behaves a
> bit differently on Intel, and to support migration from Intel to AMD we
> try to emulate those differences away.
> 
> Sadly that collides with the virtual vmload/vmsave feature that is used
> in nesting.
> The problem was that when it is enabled,
> on migration (and otherwise when userspace reads MSR_IA32_SYSENTER_{EIP|ESP}),
> wrong values were returned, which leads to a #DF in the
> nested guest when the wrong value is loaded back.
> 
> The patch I prepared carefully fixes this, mostly by disabling that
> SYSENTER emulation when we don't spoof Intel's vendor ID; if we do,
> and yet somehow SVM is enabled (this is a very rare edge case), then
> virtual vmload/vmsave is force disabled.
> 
> V2: incorporated review feedback from Paolo.
> 
> Best regards,
> Maxim Levitsky
> 
> Maxim Levitsky (2):
>   KVM: x86: add guest_cpuid_is_intel
>   KVM: nSVM: improve SYSENTER emulation on AMD
> 
>  arch/x86/kvm/cpuid.h   |  8 
>  arch/x86/kvm/svm/svm.c | 99 +++---
>  arch/x86/kvm/svm/svm.h |  6 +--
>  3 files changed, 76 insertions(+), 37 deletions(-)
> 
> -- 
> 2.26.2
> 




[PATCH v2 0/2] KVM: x86: nSVM: fixes for SYSENTER emulation

2021-04-01 Thread Maxim Levitsky
This is the result of a deep rabbit-hole dive in regard to why
nested migration of 32 bit guests is currently
totally broken on AMD.

It turns out that, due to slight differences between the original AMD64
implementation and Intel's remake, the SYSENTER instruction behaves a
bit differently on Intel, and to support migration from Intel to AMD we
try to emulate those differences away.

Sadly that collides with the virtual vmload/vmsave feature that is used in nesting.
The problem was that when it is enabled,
on migration (and otherwise when userspace reads MSR_IA32_SYSENTER_{EIP|ESP}),
wrong values were returned, which leads to a #DF in the
nested guest when the wrong value is loaded back.

The patch I prepared carefully fixes this, mostly by disabling that
SYSENTER emulation when we don't spoof Intel's vendor ID; if we do,
and yet somehow SVM is enabled (this is a very rare edge case), then
virtual vmload/vmsave is force disabled.

V2: incorporated review feedback from Paolo.

Best regards,
Maxim Levitsky

Maxim Levitsky (2):
  KVM: x86: add guest_cpuid_is_intel
  KVM: nSVM: improve SYSENTER emulation on AMD

 arch/x86/kvm/cpuid.h   |  8 
 arch/x86/kvm/svm/svm.c | 99 +++---
 arch/x86/kvm/svm/svm.h |  6 +--
 3 files changed, 76 insertions(+), 37 deletions(-)

-- 
2.26.2




[PATCH v2 1/2] KVM: x86: add guest_cpuid_is_intel

2021-04-01 Thread Maxim Levitsky
This is similar to the existing 'guest_cpuid_is_amd_or_hygon'.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/cpuid.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 2a0c5064497f..ded84d244f19 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -248,6 +248,14 @@ static inline bool guest_cpuid_is_amd_or_hygon(struct 
kvm_vcpu *vcpu)
is_guest_vendor_hygon(best->ebx, best->ecx, best->edx));
 }
 
+static inline bool guest_cpuid_is_intel(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 0, 0);
+   return best && is_guest_vendor_intel(best->ebx, best->ecx, best->edx);
+}
+
 static inline int guest_cpuid_family(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
-- 
2.26.2



[PATCH v2 2/2] KVM: nSVM: improve SYSENTER emulation on AMD

2021-04-01 Thread Maxim Levitsky
Currently, to support Intel->AMD migration, if the CPU vendor is GenuineIntel,
we emulate the full 64-bit value of the MSR_IA32_SYSENTER_{EIP|ESP}
MSRs, and we also emulate the sysenter/sysexit instructions in long mode.

(The emulator does still refuse to emulate sysenter in 64 bit mode, on the
grounds that the code for that wasn't tested and likely has no users.)

However, when virtual vmload/vmsave is enabled, the vmload instruction will
update these 32 bit MSRs without triggering their MSR intercept,
which leads to stale values in KVM's shadow copy of these MSRs,
which relies on the intercept to stay up to date.

Fix/optimize this by doing the following:

1. Enable the MSR intercepts for SYSENTER MSRs iff vendor=GenuineIntel
   (This is both a tiny optimization and also ensures that in case
   the guest cpu vendor is AMD, the msrs will be 32 bit wide as
   AMD defined).

2. Store only high 32 bit part of these msrs on interception and combine
   it with hardware msr value on intercepted read/writes
   iff vendor=GenuineIntel.

3. Disable vmload/vmsave virtualization if vendor=GenuineIntel.
   (It is somewhat insane to set vendor=GenuineIntel and still enable
   SVM for the guest but well whatever).
   Then zero the high 32 bit parts when kvm intercepts and emulates vmload.

Thanks a lot to Paulo Bonzini for helping me with fixing this in the most
correct way.

This patch fixes nested migration of 32 bit nested guests, which was
broken because incorrect cached values of the SYSENTER msrs were stored in
the migration stream if L1 changed these msrs with
vmload prior to L2 entry.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/svm.c | 99 +++---
 arch/x86/kvm/svm/svm.h |  6 +--
 2 files changed, 68 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 271196400495..6c39b0cd6ec6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -95,6 +95,8 @@ static const struct svm_direct_access_msrs {
 } direct_access_msrs[MAX_DIRECT_ACCESS_MSRS] = {
{ .index = MSR_STAR,.always = true  },
{ .index = MSR_IA32_SYSENTER_CS,.always = true  },
+   { .index = MSR_IA32_SYSENTER_EIP,   .always = false },
+   { .index = MSR_IA32_SYSENTER_ESP,   .always = false },
 #ifdef CONFIG_X86_64
{ .index = MSR_GS_BASE, .always = true  },
{ .index = MSR_FS_BASE, .always = true  },
@@ -1258,16 +1260,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
if (kvm_vcpu_apicv_active(vcpu))
avic_init_vmcb(svm);
 
-   /*
-* If hardware supports Virtual VMLOAD VMSAVE then enable it
-* in VMCB and clear intercepts to avoid #VMEXIT.
-*/
-   if (vls) {
-   svm_clr_intercept(svm, INTERCEPT_VMLOAD);
-   svm_clr_intercept(svm, INTERCEPT_VMSAVE);
-   svm->vmcb->control.virt_ext |= 
VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
-   }
-
if (vgif) {
svm_clr_intercept(svm, INTERCEPT_STGI);
svm_clr_intercept(svm, INTERCEPT_CLGI);
@@ -2133,9 +2125,11 @@ static int vmload_vmsave_interception(struct kvm_vcpu 
*vcpu, bool vmload)
 
ret = kvm_skip_emulated_instruction(vcpu);
 
-   if (vmload)
+   if (vmload) {
nested_svm_vmloadsave(vmcb12, svm->vmcb);
-   else
+   svm->sysenter_eip_hi = 0;
+   svm->sysenter_esp_hi = 0;
+   } else
nested_svm_vmloadsave(svm->vmcb, vmcb12);
 
kvm_vcpu_unmap(vcpu, , true);
@@ -2677,10 +2671,14 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
msr_info->data = svm->vmcb01.ptr->save.sysenter_cs;
break;
case MSR_IA32_SYSENTER_EIP:
-   msr_info->data = svm->sysenter_eip;
+   msr_info->data = (u32)svm->vmcb01.ptr->save.sysenter_eip;
+   if (guest_cpuid_is_intel(vcpu))
+   msr_info->data |= (u64)svm->sysenter_eip_hi << 32;
break;
case MSR_IA32_SYSENTER_ESP:
-   msr_info->data = svm->sysenter_esp;
+   msr_info->data = svm->vmcb01.ptr->save.sysenter_esp;
+   if (guest_cpuid_is_intel(vcpu))
+   msr_info->data |= (u64)svm->sysenter_esp_hi << 32;
break;
case MSR_TSC_AUX:
if (!boot_cpu_has(X86_FEATURE_RDTSCP))
@@ -2885,12 +2883,19 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr)
svm->vmcb01.ptr->save.sysenter_cs = data;
break;
case MSR_IA32_SYSENTER_EIP:
-   svm->sysenter_eip = data;
-   svm->vmcb01.ptr->save.sysenter_eip = data;
+   svm->vmcb01.ptr->save.sysente

Re: [PATCH 1/4] KVM: x86: pending exceptions must not be blocked by an injected event

2021-04-01 Thread Maxim Levitsky
On Thu, 2021-04-01 at 19:05 +0200, Paolo Bonzini wrote:
> On 01/04/21 16:38, Maxim Levitsky wrote:
> > Injected interrupts/nmi should not block a pending exception,
> > but rather be either lost if nested hypervisor doesn't
> > intercept the pending exception (as in stock x86), or be delivered
> > in exitintinfo/IDT_VECTORING_INFO field, as a part of a VMexit
> > that corresponds to the pending exception.
> > 
> > The only reason for an exception to be blocked is when nested run
> > is pending (and that can't really happen currently
> > but still worth checking for).
> > 
> > Signed-off-by: Maxim Levitsky 
> 
> This patch would be an almost separate bugfix, right?  I am going to 
> queue this, but a confirmation would be helpful.

Yes, this patch doesn't depend on anything else.
Thanks!
Best regards,
Maxim Levitsky

> 
> Paolo
> 
> > ---
> >   arch/x86/kvm/svm/nested.c |  8 +++-
> >   arch/x86/kvm/vmx/nested.c | 10 --
> >   2 files changed, 15 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index 8523f60adb92..34a37b2bd486 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1062,7 +1062,13 @@ static int svm_check_nested_events(struct kvm_vcpu 
> > *vcpu)
> > }
> >   
> > if (vcpu->arch.exception.pending) {
> > -   if (block_nested_events)
> > +   /*
> > +* Only a pending nested run can block a pending exception.
> > +* Otherwise an injected NMI/interrupt should either be
> > +* lost or delivered to the nested hypervisor in the EXITINTINFO
> > +* vmcb field, while delivering the pending exception.
> > +*/
> > +   if (svm->nested.nested_run_pending)
> >   return -EBUSY;
> > if (!nested_exit_on_exception(svm))
> > return 0;
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index fd334e4aa6db..c3ba842fc07f 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -3806,9 +3806,15 @@ static int vmx_check_nested_events(struct kvm_vcpu 
> > *vcpu)
> >   
> > /*
> >  * Process any exceptions that are not debug traps before MTF.
> > +*
> > +* Note that only a pending nested run can block a pending exception.
> > +* Otherwise an injected NMI/interrupt should either be
> > +* lost or delivered to the nested hypervisor in the IDT_VECTORING_INFO,
> > +* while delivering the pending exception.
> >  */
> > +
> > if (vcpu->arch.exception.pending && !vmx_pending_dbg_trap(vcpu)) {
> > -   if (block_nested_events)
> > +   if (vmx->nested.nested_run_pending)
> > return -EBUSY;
> > if (!nested_vmx_check_exception(vcpu, _qual))
> > goto no_vmexit;
> > @@ -3825,7 +3831,7 @@ static int vmx_check_nested_events(struct kvm_vcpu 
> > *vcpu)
> > }
> >   
> > if (vcpu->arch.exception.pending) {
> > -   if (block_nested_events)
> > +   if (vmx->nested.nested_run_pending)
> > return -EBUSY;
> > if (!nested_vmx_check_exception(vcpu, _qual))
> > goto no_vmexit;
> > 




[PATCH 2/2] KVM: nSVM: improve SYSENTER emulation on AMD

2021-04-01 Thread Maxim Levitsky
Currently, to support Intel->AMD migration, if the guest CPU vendor is
GenuineIntel we emulate the full 64 bit value of the MSR_IA32_SYSENTER_{EIP|ESP}
msrs, and we also emulate the sysenter/sysexit instructions in long mode.

(The emulator does still refuse to emulate sysenter in 64 bit mode, on the
grounds that the code for that wasn't tested and likely has no users.)

However, when virtual vmload/vmsave is enabled, the vmload instruction will
update these 32 bit msrs without triggering their msr intercept, which leads
to stale values in kvm's shadow copy of these msrs, since that copy relies
on the intercept to stay up to date.

Fix/optimize this by doing the following:

1. Enable the MSR intercepts for SYSENTER MSRs iff vendor=GenuineIntel
   (This is both a tiny optimization and also ensures that in case
   the guest cpu vendor is AMD, the msrs will be 32 bit wide as
   AMD defined).

2. Store only high 32 bit part of these msrs on interception and combine
   it with hardware msr value on intercepted read/writes
   iff vendor=GenuineIntel.

3. Disable vmload/vmsave virtualization if vendor=GenuineIntel.
   (It is somewhat insane to set vendor=GenuineIntel and still enable
   SVM for the guest but well whatever).
   Then zero the high 32 bit parts when kvm intercepts and emulates vmload.

Thanks a lot to Paulo Bonzini for helping me with fixing this in the most
correct way.

This patch fixes nested migration of 32 bit nested guests, which was
broken because incorrect cached values of the SYSENTER msrs were stored in
the migration stream if L1 changed these msrs with
vmload prior to L2 entry.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/svm.c | 99 +++---
 arch/x86/kvm/svm/svm.h |  6 +--
 2 files changed, 68 insertions(+), 37 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 271196400495..6c39b0cd6ec6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -95,6 +95,8 @@ static const struct svm_direct_access_msrs {
 } direct_access_msrs[MAX_DIRECT_ACCESS_MSRS] = {
{ .index = MSR_STAR,.always = true  },
{ .index = MSR_IA32_SYSENTER_CS,.always = true  },
+   { .index = MSR_IA32_SYSENTER_EIP,   .always = false },
+   { .index = MSR_IA32_SYSENTER_ESP,   .always = false },
 #ifdef CONFIG_X86_64
{ .index = MSR_GS_BASE, .always = true  },
{ .index = MSR_FS_BASE, .always = true  },
@@ -1258,16 +1260,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
if (kvm_vcpu_apicv_active(vcpu))
avic_init_vmcb(svm);
 
-   /*
-* If hardware supports Virtual VMLOAD VMSAVE then enable it
-* in VMCB and clear intercepts to avoid #VMEXIT.
-*/
-   if (vls) {
-   svm_clr_intercept(svm, INTERCEPT_VMLOAD);
-   svm_clr_intercept(svm, INTERCEPT_VMSAVE);
-   svm->vmcb->control.virt_ext |= 
VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
-   }
-
if (vgif) {
svm_clr_intercept(svm, INTERCEPT_STGI);
svm_clr_intercept(svm, INTERCEPT_CLGI);
@@ -2133,9 +2125,11 @@ static int vmload_vmsave_interception(struct kvm_vcpu 
*vcpu, bool vmload)
 
ret = kvm_skip_emulated_instruction(vcpu);
 
-   if (vmload)
+   if (vmload) {
nested_svm_vmloadsave(vmcb12, svm->vmcb);
-   else
+   svm->sysenter_eip_hi = 0;
+   svm->sysenter_esp_hi = 0;
+   } else
nested_svm_vmloadsave(svm->vmcb, vmcb12);
 
kvm_vcpu_unmap(vcpu, , true);
@@ -2677,10 +2671,14 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
msr_info->data = svm->vmcb01.ptr->save.sysenter_cs;
break;
case MSR_IA32_SYSENTER_EIP:
-   msr_info->data = svm->sysenter_eip;
+   msr_info->data = (u32)svm->vmcb01.ptr->save.sysenter_eip;
+   if (guest_cpuid_is_intel(vcpu))
+   msr_info->data |= (u64)svm->sysenter_eip_hi << 32;
break;
case MSR_IA32_SYSENTER_ESP:
-   msr_info->data = svm->sysenter_esp;
+   msr_info->data = svm->vmcb01.ptr->save.sysenter_esp;
+   if (guest_cpuid_is_intel(vcpu))
+   msr_info->data |= (u64)svm->sysenter_esp_hi << 32;
break;
case MSR_TSC_AUX:
if (!boot_cpu_has(X86_FEATURE_RDTSCP))
@@ -2885,12 +2883,19 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr)
svm->vmcb01.ptr->save.sysenter_cs = data;
break;
case MSR_IA32_SYSENTER_EIP:
-   svm->sysenter_eip = data;
-   svm->vmcb01.ptr->save.sysenter_eip = data;
+   svm->vmcb01.ptr->save.sysente

[PATCH 0/4] KVM: nSVM/nVMX: fix nested virtualization treatment of nested exceptions

2021-04-01 Thread Maxim Levitsky
clone of "kernel-starship-5.12.unstable"

Maxim Levitsky (4):
  KVM: x86: pending exceptions must not be blocked by an injected event
  KVM: x86: separate pending and injected exception
  KVM: x86: correctly merge pending and injected exception
  KVM: x86: remove tweaking of inject_page_fault

 arch/x86/include/asm/kvm_host.h |  34 +++-
 arch/x86/kvm/svm/nested.c   |  65 +++
 arch/x86/kvm/svm/svm.c  |   8 +-
 arch/x86/kvm/vmx/nested.c   | 107 +--
 arch/x86/kvm/vmx/vmx.c  |  14 +-
 arch/x86/kvm/x86.c  | 302 ++--
 arch/x86/kvm/x86.h  |   6 +-
 7 files changed, 283 insertions(+), 253 deletions(-)

-- 
2.26.2




Re: [PATCH 4/6] KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2

2021-04-01 Thread Maxim Levitsky
On Thu, 2021-04-01 at 16:44 +0200, Paolo Bonzini wrote:
> Just a quick review on the API:
> 
> On 01/04/21 16:18, Maxim Levitsky wrote:
> > +struct kvm_sregs2 {
> > +   /* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
> > +   struct kvm_segment cs, ds, es, fs, gs, ss;
> > +   struct kvm_segment tr, ldt;
> > +   struct kvm_dtable gdt, idt;
> > +   __u64 cr0, cr2, cr3, cr4, cr8;
> > +   __u64 efer;
> > +   __u64 apic_base;
> > +   __u64 flags; /* must be zero*/
> 
> I think it would make sense to define a flag bit for the PDPTRs, so that 
> userspace can use KVM_SET_SREGS2 unconditionally (e.g. even when 
> migrating from a source that uses KVM_GET_SREGS and therefore doesn't 
> provide the PDPTRs).
Yes, I didn't think about this case! I'll add this to the next version.
Thanks!
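
Something along these lines is what I have in mind (just a sketch to
confirm I understood the suggestion; the flag name and exact layout are
not final, and the padding field is dropped as you suggest below):

struct kvm_sregs2 {
	/* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
	struct kvm_segment cs, ds, es, fs, gs, ss;
	struct kvm_segment tr, ldt;
	struct kvm_dtable gdt, idt;
	__u64 cr0, cr2, cr3, cr4, cr8;
	__u64 efer;
	__u64 apic_base;
	__u64 flags;
	__u64 pdptrs[4];
};

#define KVM_SREGS2_FLAGS_PDPTRS_VALID 1

and then the KVM_SET_SREGS2 path would only touch the PDPTRs when the
flag is set, so a source that only has KVM_GET_SREGS data can leave it
clear:

	if ((sregs2->flags & KVM_SREGS2_FLAGS_PDPTRS_VALID) && is_pae_paging(vcpu)) {
		for (i = 0 ; i < 4 ; i++)
			kvm_pdptr_write(vcpu, i, sregs2->pdptrs[i]);
		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
		mmu_reset_needed = 1;
	}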

> 
> > +   __u64 pdptrs[4];
> > +   __u64 padding;
> 
> No need to add padding; if we add more fields in the future we can use 
> the flags to determine the length of the userspace data, similar to 
> KVM_GET/SET_NESTED_STATE.
Got it, will fix. I added it just in case.

> 
> 
> > +   idx = srcu_read_lock(>kvm->srcu);
> > +   if (is_pae_paging(vcpu)) {
> > +   for (i = 0 ; i < 4 ; i++)
> > +   kvm_pdptr_write(vcpu, i, sregs2->pdptrs[i]);
> > +   kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> > +   mmu_reset_needed = 1;
> > +   }
> > +   srcu_read_unlock(>kvm->srcu, idx);
> > +
> 
> SRCU should not be needed here?

I haven't yet studied the locking that is used in KVM in depth,
so I added this to be on the safe side.

I looked at it a bit, and it looks like the PDPTR-reading code takes
this lock because it accesses the memslots, which is not done here,
so the lock is indeed not needed here.

I still need to study how locking is done in KVM in depth to be 100% sure
about this.


> 
> > +   case KVM_GET_SREGS2: {
> > +   u.sregs2 = kzalloc(sizeof(struct kvm_sregs2), 
> > GFP_KERNEL_ACCOUNT);
> > +   r = -ENOMEM;
> > +   if (!u.sregs2)
> > +   goto out;
> 
> No need to account, I think it's a little slower and this allocation is 
> very short lived.
Right, I will fix this in the next version.

> 
> >  #define KVM_CAP_PPC_DAWR1 194
> > +#define KVM_CAP_SREGS2 196
> 
> 195, not 196.

I am also planning to add KVM_CAP_SET_GUEST_DEBUG2 for which I
used 195.
Prior to sending I rebased all of my patch series on top of kvm/queue,
but I kept the numbers just in case.

> 
> >  #define KVM_XEN_VCPU_GET_ATTR  _IOWR(KVMIO, 0xca, struct 
> > kvm_xen_vcpu_attr)
> >  #define KVM_XEN_VCPU_SET_ATTR  _IOW(KVMIO,  0xcb, struct 
> > kvm_xen_vcpu_attr)
> > +
> > +#define KVM_GET_SREGS2 _IOR(KVMIO,  0xca, struct kvm_sregs2)
> > +#define KVM_SET_SREGS2 _IOW(KVMIO,  0xcb, struct kvm_sregs2)
> > +
> 
> It's not exactly overlapping, but please bump the ioctls to 0xcc/0xcd.
Will do.


Thanks a lot for the review!

Best regards,
Maxim Levitsky

> 
> Paolo
> 




[PATCH 1/2] KVM: x86: add guest_cpuid_is_intel

2021-04-01 Thread Maxim Levitsky
This is similar to existing 'guest_cpuid_is_amd_or_hygon'

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/cpuid.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 2a0c5064497f..ded84d244f19 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -248,6 +248,14 @@ static inline bool guest_cpuid_is_amd_or_hygon(struct 
kvm_vcpu *vcpu)
is_guest_vendor_hygon(best->ebx, best->ecx, best->edx));
 }
 
+static inline bool guest_cpuid_is_intel(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 0, 0);
+   return best && is_guest_vendor_intel(best->ebx, best->ecx, best->edx);
+}
+
 static inline int guest_cpuid_family(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
-- 
2.26.2



[PATCH 0/2] KVM: x86: nSVM: fixes for SYSENTER emulation

2021-04-01 Thread Maxim Levitsky
This is the result of a deep rabbit-hole dive into why nested migration
of 32 bit guests is currently totally broken on AMD.

It turns out that due to slight differences between the original AMD64
implementation and Intel's remake, the SYSENTER instruction behaves a
bit differently on Intel, and to support migration from Intel to AMD we
try to emulate those differences away.

Sadly that collides with the virtual vmload/vmsave feature that is used in nesting.
The problem was that when it is enabled, on migration (and otherwise, whenever
userspace reads MSR_IA32_SYSENTER_{EIP|ESP}) wrong values were returned,
which leads to a #DF in the nested guest when the wrong value is loaded back.

The patch I prepared carefully fixes this, mostly by disabling that
SYSENTER emulation when we don't spoof Intel's vendor ID; if we do,
and yet somehow SVM is also enabled (a very rare edge case), then
virtual vmload/vmsave is force-disabled.

V2: incorporated review feedback from Paulo.

Best regards,
Maxim Levitsky

Maxim Levitsky (2):
  KVM: x86: add guest_cpuid_is_intel
  KVM: nSVM: improve SYSENTER emulation on AMD

 arch/x86/kvm/cpuid.h   |  8 
 arch/x86/kvm/svm/svm.c | 99 +++---
 arch/x86/kvm/svm/svm.h |  6 +--
 3 files changed, 76 insertions(+), 37 deletions(-)

-- 
2.26.2




[PATCH 3/4] KVM: x86: correctly merge pending and injected exception

2021-04-01 Thread Maxim Levitsky
Allow a pending and an injected exception to co-exist
when both are raised.

Add a 'kvm_deliver_pending_exception' function which 'merges' the pending
and injected exceptions, or delivers a VM exit carrying both for the case
when L1 intercepts the pending exception.

The latter is done by vendor code, using the new nested callback
'deliver_exception_as_vmexit'.

kvm_deliver_pending_exception is called after each VM exit and prior to
VM entry, which ensures that during userspace VM exits only an injected
exception can be in a raised state.
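
For illustration, the intended flow is roughly the following (heavily
simplified sketch, not the actual code; the real function also has to
deal with exception payloads, error codes and #DF/triple fault
promotion, and the merge helper below is only a placeholder name):

static void kvm_deliver_pending_exception(struct kvm_vcpu *vcpu)
{
	if (!vcpu->arch.pending_exception.valid)
		return;

	/* give L1 a chance to intercept the pending exception as a VM exit */
	if (is_guest_mode(vcpu) &&
	    kvm_x86_ops.nested_ops->deliver_exception_as_vmexit(vcpu) == -EBUSY)
		return;	/* nested run pending, retry on the next entry */

	if (!vcpu->arch.pending_exception.valid)
		return;	/* the exception was turned into a nested VM exit */

	/*
	 * Otherwise merge it with an already injected exception (which may
	 * promote the pair to #DF or a triple fault) and move the result
	 * into injected_exception so it is injected on the next VM entry.
	 */
	kvm_merge_pending_exception(vcpu);	/* placeholder for the merge logic */
}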

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h |   9 ++
 arch/x86/kvm/svm/nested.c   |  27 ++--
 arch/x86/kvm/svm/svm.c  |   2 +-
 arch/x86/kvm/vmx/nested.c   |  58 
 arch/x86/kvm/vmx/vmx.c  |   2 +-
 arch/x86/kvm/x86.c  | 233 ++--
 6 files changed, 181 insertions(+), 150 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3b2fd276e8d5..a9b9cd030d9a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1346,6 +1346,15 @@ struct kvm_x86_ops {
 
 struct kvm_x86_nested_ops {
int (*check_events)(struct kvm_vcpu *vcpu);
+
+   /*
+* Deliver a pending exception as a VM exit if the L1 intercepts it.
+* Returns -EBUSY if L1 does intercept the exception but,
+* it is not possible to deliver it right now.
+* (for example when nested run is pending)
+*/
+   int (*deliver_exception_as_vmexit)(struct kvm_vcpu *vcpu);
+
bool (*hv_timer_pending)(struct kvm_vcpu *vcpu);
void (*triple_fault)(struct kvm_vcpu *vcpu);
int (*get_state)(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 7adad9b6dcad..ff745d59ffcf 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1061,21 +1061,6 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
return 0;
}
 
-   if (vcpu->arch.pending_exception.valid) {
-   /*
-* Only a pending nested run can block a pending exception.
-* Otherwise an injected NMI/interrupt should either be
-* lost or delivered to the nested hypervisor in the EXITINTINFO
-* vmcb field, while delivering the pending exception.
-*/
-   if (svm->nested.nested_run_pending)
-return -EBUSY;
-   if (!nested_exit_on_exception(svm))
-   return 0;
-   nested_svm_inject_exception_vmexit(svm);
-   return 0;
-   }
-
if (vcpu->arch.smi_pending && !svm_smi_blocked(vcpu)) {
if (block_nested_events)
return -EBUSY;
@@ -1107,6 +1092,17 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
return 0;
 }
 
+int svm_deliver_nested_exception_as_vmexit(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_svm *svm = to_svm(vcpu);
+
+   if (svm->nested.nested_run_pending)
+   return -EBUSY;
+   if (nested_exit_on_exception(svm))
+   nested_svm_inject_exception_vmexit(svm);
+   return 0;
+}
+
 int nested_svm_exit_special(struct vcpu_svm *svm)
 {
u32 exit_code = svm->vmcb->control.exit_code;
@@ -1321,6 +1317,7 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 struct kvm_x86_nested_ops svm_nested_ops = {
.check_events = svm_check_nested_events,
.triple_fault = nested_svm_triple_fault,
+   .deliver_exception_as_vmexit = svm_deliver_nested_exception_as_vmexit,
.get_nested_state_pages = svm_get_nested_state_pages,
.get_state = svm_get_nested_state,
.set_state = svm_set_nested_state,
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 90b541138c5a..b89e48574c39 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -363,7 +363,7 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu)
bool has_error_code = vcpu->arch.injected_exception.has_error_code;
u32 error_code = vcpu->arch.injected_exception.error_code;
 
-   kvm_deliver_exception_payload(vcpu);
+   WARN_ON_ONCE(vcpu->arch.pending_exception.valid);
 
if (nr == BP_VECTOR && !nrips) {
unsigned long rip, old_rip = kvm_rip_read(vcpu);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 5d54fecff9a7..1c09b132c55c 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3768,7 +3768,6 @@ static bool nested_vmx_preemption_timer_pending(struct 
kvm_vcpu *vcpu)
 static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
-   unsigned long exit_qual;
bool block_nested_events =
vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
bool mtf_pending =

[PATCH 4/4] KVM: x86: remove tweaking of inject_page_fault

2021-04-01 Thread Maxim Levitsky
This is no longer needed, since page faults can now be
injected as regular exceptions in all cases.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 20 
 arch/x86/kvm/vmx/nested.c | 23 ---
 2 files changed, 43 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index ff745d59ffcf..25840399841e 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -53,23 +53,6 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
nested_svm_vmexit(svm);
 }
 
-static void svm_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct 
x86_exception *fault)
-{
-   struct vcpu_svm *svm = to_svm(vcpu);
-   WARN_ON(!is_guest_mode(vcpu));
-
-   if (vmcb_is_intercept(>nested.ctl, INTERCEPT_EXCEPTION_OFFSET + 
PF_VECTOR) &&
-  !svm->nested.nested_run_pending) {
-   svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + PF_VECTOR;
-   svm->vmcb->control.exit_code_hi = 0;
-   svm->vmcb->control.exit_info_1 = fault->error_code;
-   svm->vmcb->control.exit_info_2 = fault->address;
-   nested_svm_vmexit(svm);
-   } else {
-   kvm_inject_page_fault(vcpu, fault);
-   }
-}
-
 static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
 {
struct vcpu_svm *svm = to_svm(vcpu);
@@ -575,9 +558,6 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 
vmcb12_gpa,
if (ret)
return ret;
 
-   if (!npt_enabled)
-   vcpu->arch.mmu->inject_page_fault = 
svm_inject_page_fault_nested;
-
svm_set_gif(svm, true);
 
return 0;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 1c09b132c55c..8add4c27e718 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -418,26 +418,6 @@ static int nested_vmx_check_exception(struct kvm_vcpu 
*vcpu, unsigned long *exit
return 0;
 }
 
-
-static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
-   struct x86_exception *fault)
-{
-   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-
-   WARN_ON(!is_guest_mode(vcpu));
-
-   if (nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code) &&
-   !to_vmx(vcpu)->nested.nested_run_pending) {
-   vmcs12->vm_exit_intr_error_code = fault->error_code;
-   nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
- PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
- INTR_INFO_DELIVER_CODE_MASK | 
INTR_INFO_VALID_MASK,
- fault->address);
-   } else {
-   kvm_inject_page_fault(vcpu, fault);
-   }
-}
-
 static int nested_vmx_check_io_bitmap_controls(struct kvm_vcpu *vcpu,
   struct vmcs12 *vmcs12)
 {
@@ -2588,9 +2568,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12,
vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
}
 
-   if (!enable_ept)
-   vcpu->arch.walk_mmu->inject_page_fault = 
vmx_inject_page_fault_nested;
-
if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
 vmcs12->guest_ia32_perf_global_ctrl)))
-- 
2.26.2



[PATCH 1/4] KVM: x86: pending exceptions must not be blocked by an injected event

2021-04-01 Thread Maxim Levitsky
An injected interrupt/NMI should not block a pending exception; rather,
it should either be lost if the nested hypervisor doesn't intercept the
pending exception (as on bare-metal x86), or be delivered in the
EXITINTINFO/IDT_VECTORING_INFO field as part of the VM exit that
corresponds to the pending exception.

The only reason for a pending exception to be blocked is a pending
nested run (which can't really happen currently,
but is still worth checking for).

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c |  8 +++-
 arch/x86/kvm/vmx/nested.c | 10 --
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 8523f60adb92..34a37b2bd486 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1062,7 +1062,13 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
}
 
if (vcpu->arch.exception.pending) {
-   if (block_nested_events)
+   /*
+* Only a pending nested run can block a pending exception.
+* Otherwise an injected NMI/interrupt should either be
+* lost or delivered to the nested hypervisor in the EXITINTINFO
+* vmcb field, while delivering the pending exception.
+*/
+   if (svm->nested.nested_run_pending)
 return -EBUSY;
if (!nested_exit_on_exception(svm))
return 0;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index fd334e4aa6db..c3ba842fc07f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3806,9 +3806,15 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 
/*
 * Process any exceptions that are not debug traps before MTF.
+*
+* Note that only a pending nested run can block a pending exception.
+* Otherwise an injected NMI/interrupt should either be
+* lost or delivered to the nested hypervisor in the IDT_VECTORING_INFO,
+* while delivering the pending exception.
 */
+
if (vcpu->arch.exception.pending && !vmx_pending_dbg_trap(vcpu)) {
-   if (block_nested_events)
+   if (vmx->nested.nested_run_pending)
return -EBUSY;
if (!nested_vmx_check_exception(vcpu, _qual))
goto no_vmexit;
@@ -3825,7 +3831,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
}
 
if (vcpu->arch.exception.pending) {
-   if (block_nested_events)
+   if (vmx->nested.nested_run_pending)
return -EBUSY;
if (!nested_vmx_check_exception(vcpu, _qual))
goto no_vmexit;
-- 
2.26.2



[PATCH 2/4] KVM: x86: separate pending and injected exception

2021-04-01 Thread Maxim Levitsky
Use separate 'pending_exception' and 'injected_exception' fields
to store the pending and the injected exceptions.

After this patch still only one of them can be active at a time,
but the next patch will allow both to co-exist in some cases.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h |  25 --
 arch/x86/kvm/svm/nested.c   |  26 +++---
 arch/x86/kvm/svm/svm.c  |   6 +-
 arch/x86/kvm/vmx/nested.c   |  36 
 arch/x86/kvm/vmx/vmx.c  |  12 +--
 arch/x86/kvm/x86.c  | 145 ++--
 arch/x86/kvm/x86.h  |   6 +-
 7 files changed, 143 insertions(+), 113 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a52f973bdff6..3b2fd276e8d5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -547,6 +547,14 @@ struct kvm_vcpu_xen {
u64 runstate_times[4];
 };
 
+struct kvm_queued_exception {
+   bool valid;
+   u8 nr;
+   bool has_error_code;
+   u32 error_code;
+};
+
+
 struct kvm_vcpu_arch {
/*
 * rip and regs accesses must go through
@@ -645,16 +653,15 @@ struct kvm_vcpu_arch {
 
u8 event_exit_inst_len;
 
-   struct kvm_queued_exception {
-   bool pending;
-   bool injected;
-   bool has_error_code;
-   u8 nr;
-   u32 error_code;
-   unsigned long payload;
-   bool has_payload;
+   struct kvm_queued_exception pending_exception;
+
+   struct kvm_exception_payload {
+   bool valid;
+   unsigned long value;
u8 nested_apf;
-   } exception;
+   } exception_payload;
+
+   struct kvm_queued_exception injected_exception;
 
struct kvm_queued_interrupt {
bool injected;
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 34a37b2bd486..7adad9b6dcad 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -349,14 +349,14 @@ static void nested_save_pending_event_to_vmcb12(struct 
vcpu_svm *svm,
u32 exit_int_info = 0;
unsigned int nr;
 
-   if (vcpu->arch.exception.injected) {
-   nr = vcpu->arch.exception.nr;
+   if (vcpu->arch.injected_exception.valid) {
+   nr = vcpu->arch.injected_exception.nr;
exit_int_info = nr | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT;
 
-   if (vcpu->arch.exception.has_error_code) {
+   if (vcpu->arch.injected_exception.has_error_code) {
exit_int_info |= SVM_EVTINJ_VALID_ERR;
vmcb12->control.exit_int_info_err =
-   vcpu->arch.exception.error_code;
+   vcpu->arch.injected_exception.error_code;
}
 
} else if (vcpu->arch.nmi_injected) {
@@ -1000,30 +1000,30 @@ int nested_svm_check_permissions(struct kvm_vcpu *vcpu)
 
 static bool nested_exit_on_exception(struct vcpu_svm *svm)
 {
-   unsigned int nr = svm->vcpu.arch.exception.nr;
+   unsigned int nr = svm->vcpu.arch.pending_exception.nr;
 
return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(nr));
 }
 
 static void nested_svm_inject_exception_vmexit(struct vcpu_svm *svm)
 {
-   unsigned int nr = svm->vcpu.arch.exception.nr;
+   unsigned int nr = svm->vcpu.arch.pending_exception.nr;
 
svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr;
svm->vmcb->control.exit_code_hi = 0;
 
-   if (svm->vcpu.arch.exception.has_error_code)
-   svm->vmcb->control.exit_info_1 = 
svm->vcpu.arch.exception.error_code;
+   if (svm->vcpu.arch.pending_exception.has_error_code)
+   svm->vmcb->control.exit_info_1 = 
svm->vcpu.arch.pending_exception.error_code;
 
/*
 * EXITINFO2 is undefined for all exception intercepts other
 * than #PF.
 */
if (nr == PF_VECTOR) {
-   if (svm->vcpu.arch.exception.nested_apf)
+   if (svm->vcpu.arch.exception_payload.nested_apf)
svm->vmcb->control.exit_info_2 = 
svm->vcpu.arch.apf.nested_apf_token;
-   else if (svm->vcpu.arch.exception.has_payload)
-   svm->vmcb->control.exit_info_2 = 
svm->vcpu.arch.exception.payload;
+   else if (svm->vcpu.arch.exception_payload.valid)
+   svm->vmcb->control.exit_info_2 = 
svm->vcpu.arch.exception_payload.value;
else
svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
} else if (nr == DB_VECTOR) {
@@ -1034,7 +1034,7 @@ static void nested_svm_inject_exception_vmexit(struct 
vcpu_svm *svm)
kvm_update_dr7(>vcpu);
}
} else
-   WARN_ON

Re: [PATCH v2 2/2] KVM: nSVM: improve SYSENTER emulation on AMD

2021-04-01 Thread Maxim Levitsky
On Thu, 2021-04-01 at 15:03 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky  writes:
> 
> > Currently to support Intel->AMD migration, if CPU vendor is GenuineIntel,
> > we emulate the full 64 value for MSR_IA32_SYSENTER_{EIP|ESP}
> > msrs, and we also emulate the sysenter/sysexit instruction in long mode.
> > 
> > (Emulator does still refuse to emulate sysenter in 64 bit mode, on the
> > ground that the code for that wasn't tested and likely has no users)
> > 
> > However when virtual vmload/vmsave is enabled, the vmload instruction will
> > update these 32 bit msrs without triggering their msr intercept,
> > which will lead to having stale values in kvm's shadow copy of these msrs,
> > which relies on the intercept to be up to date.
> > 
> > Fix/optimize this by doing the following:
> > 
> > 1. Enable the MSR intercepts for SYSENTER MSRs iff vendor=GenuineIntel
> >(This is both a tiny optimization and also ensures that in case
> >the guest cpu vendor is AMD, the msrs will be 32 bit wide as
> >AMD defined).
> > 
> > 2. Store only high 32 bit part of these msrs on interception and combine
> >it with hardware msr value on intercepted read/writes
> >iff vendor=GenuineIntel.
> > 
> > 3. Disable vmload/vmsave virtualization if vendor=GenuineIntel.
> >(It is somewhat insane to set vendor=GenuineIntel and still enable
> >SVM for the guest but well whatever).
> >Then zero the high 32 bit parts when kvm intercepts and emulates vmload.
> > 
> > Thanks a lot to Paulo Bonzini for helping me with fixing this in the most
> 
> s/Paulo/Paolo/ :-)
Sorry about that!

> 
> > correct way.
> > 
> > This patch fixes nested migration of 32 bit nested guests, that was
> > broken because incorrect cached values of SYSENTER msrs were stored in
> > the migration stream if L1 changed these msrs with
> > vmload prior to L2 entry.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  arch/x86/kvm/svm/svm.c | 99 +++---
> >  arch/x86/kvm/svm/svm.h |  6 +--
> >  2 files changed, 68 insertions(+), 37 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 271196400495..6c39b0cd6ec6 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -95,6 +95,8 @@ static const struct svm_direct_access_msrs {
> >  } direct_access_msrs[MAX_DIRECT_ACCESS_MSRS] = {
> > { .index = MSR_STAR,.always = true  },
> > { .index = MSR_IA32_SYSENTER_CS,.always = true  },
> > +   { .index = MSR_IA32_SYSENTER_EIP,   .always = false },
> > +   { .index = MSR_IA32_SYSENTER_ESP,   .always = false },
> >  #ifdef CONFIG_X86_64
> > { .index = MSR_GS_BASE, .always = true  },
> > { .index = MSR_FS_BASE, .always = true  },
> > @@ -1258,16 +1260,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
> > if (kvm_vcpu_apicv_active(vcpu))
> > avic_init_vmcb(svm);
> >  
> > -   /*
> > -* If hardware supports Virtual VMLOAD VMSAVE then enable it
> > -* in VMCB and clear intercepts to avoid #VMEXIT.
> > -*/
> > -   if (vls) {
> > -   svm_clr_intercept(svm, INTERCEPT_VMLOAD);
> > -   svm_clr_intercept(svm, INTERCEPT_VMSAVE);
> > -   svm->vmcb->control.virt_ext |= 
> > VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
> > -   }
> > -
> > if (vgif) {
> > svm_clr_intercept(svm, INTERCEPT_STGI);
> > svm_clr_intercept(svm, INTERCEPT_CLGI);
> > @@ -2133,9 +2125,11 @@ static int vmload_vmsave_interception(struct 
> > kvm_vcpu *vcpu, bool vmload)
> >  
> >     ret = kvm_skip_emulated_instruction(vcpu);
> >  
> > -   if (vmload)
> > +   if (vmload) {
> > nested_svm_vmloadsave(vmcb12, svm->vmcb);
> > -   else
> > +   svm->sysenter_eip_hi = 0;
> > +   svm->sysenter_esp_hi = 0;
> > +   } else
> > nested_svm_vmloadsave(svm->vmcb, vmcb12);
> 
> Nitpicking: {} are now needed for both branches here.
I didn't know about this rule; I'll take it into
account next time. Thanks!


Best regards,
Maxim Levitsky




Re: [PATCH 3/3] KVM: SVM: allow to intercept all exceptions for debug

2021-03-18 Thread Maxim Levitsky
On Thu, 2021-03-18 at 16:35 +, Sean Christopherson wrote:
> On Thu, Mar 18, 2021, Joerg Roedel wrote:
> > On Thu, Mar 18, 2021 at 11:24:25AM +0200, Maxim Levitsky wrote:
> > > But again this is a debug feature, and it is intended to allow the user
> > > to shoot himself in the foot.
> > 
> > And one can't debug SEV-ES guests with it, so what is the point of
> > enabling it for them too?
You can create a special SEV-ES guest which does handle all exceptions via
#VC, or just observe it fail, which can be useful for whatever reason.
> 
> Agreed.  I can see myself enabling debug features by default, it would be nice
> to not having to go out of my way to disable them for SEV-ES/SNP guests.
This does sound like a valid reason to disable this for SEV-ES.

> 
> Skipping SEV-ES guests should not be difficult; KVM could probably even
> print a message stating that the debug hook is being ignored.  One thought 
> would
> be to snapshot debug_intercept_exceptions at VM creation, and simply zero it 
> out
> for incompatible guests.  That would also allow changing 
> debug_intercept_exceptions
> without reloading KVM, which IMO would be very convenient.
> 
All right, I'll disable this for SEV-ES.
Changing debug_intercept_exceptions on the fly is a good idea as well;
I will implement it in the next version of the patches.
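
I.e. something along these lines (untested sketch; the kvm->arch field
name is made up, and whether the check ends up in sev_es_init_vmcb() or
somewhere else is still to be decided):

/* at VM creation, snapshot the module parameter (made-up field name): */
kvm->arch.debug_intercept_exceptions = debug_intercept_exceptions;

/* and when the guest turns out to be SEV-ES, drop it and warn once: */
static void sev_es_init_vmcb(struct vcpu_svm *svm)
{
	struct kvm *kvm = svm->vcpu.kvm;

	if (kvm->arch.debug_intercept_exceptions) {
		pr_warn_once("KVM: ignoring debug exception intercepts for SEV-ES guest\n");
		kvm->arch.debug_intercept_exceptions = 0;
	}

	/* existing SEV-ES vmcb setup continues here */
}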

Thanks for the review,
Best regards,
Maxim Levitsky



Re: [PATCH 3/3] KVM: SVM: allow to intercept all exceptions for debug

2021-03-18 Thread Maxim Levitsky
On Thu, 2021-03-18 at 10:19 +0100, Joerg Roedel wrote:
> On Tue, Mar 16, 2021 at 12:51:20PM +0200, Maxim Levitsky wrote:
> > I agree but what is wrong with that? 
> > This is a debug feature, and it only can be enabled by the root,
> > and so someone might actually want this case to happen
> > (e.g to see if a SEV guest can cope with extra #VC exceptions).
> 
> That doesn't make sense, we know that and SEV-ES guest can't cope with
> extra #VC exceptions, so there is no point in testing this. It is more a
> way to shot oneself into the foot for the user and a potential source of
> bug reports for SEV-ES guests.

But again, this is a debug feature, and it is intended to allow the user
to shoot himself in the foot. Bug reports for a debug feature
are auto-closed. It is no different from, say, poking kernel memory with
the kernel's built-in gdbstub.

Best regards,
Maxim Levitsky

> 
> 
> > I have nothing against not allowing this for SEV-ES guests though.
> > What do you think?
> 
> I think SEV-ES guests should only have the intercept bits set which
> guests acutally support

> 
> Regards,
> 
>   Joerg
> 




Re: [PATCH 2/3] KVM: x86: guest debug: don't inject interrupts while single stepping

2021-03-16 Thread Maxim Levitsky
On Tue, 2021-03-16 at 18:01 +0100, Jan Kiszka wrote:
> On 16.03.21 17:50, Sean Christopherson wrote:
> > On Tue, Mar 16, 2021, Maxim Levitsky wrote:
> > > On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote:
> > > > Back then, when I was hacking on the gdb-stub and KVM support, the
> > > > monitor trap flag was not yet broadly available, but the idea to once
> > > > use it was already there. Now it can be considered broadly available,
> > > > but it would still require some changes to get it in.
> > > > 
> > > > Unfortunately, we don't have such thing with SVM, even recent versions,
> > > > right? So, a proper way of avoiding diverting event injections while we
> > > > are having the guest in an "incorrect" state should definitely be the 
> > > > goal.
> > > Yes, I am not aware of anything like monitor trap on SVM.
> > > 
> > > > Given that KVM knows whether TF originates solely from guest debugging
> > > > or was (also) injected by the guest, we should be able to identify the
> > > > cases where your approach is best to apply. And that without any extra
> > > > control knob that everyone will only forget to set.
> > > Well I think that the downside of this patch is that the user might 
> > > actually
> > > want to single step into an interrupt handler, and this patch makes it a 
> > > bit
> > > more complicated, and changes the default behavior.
> > 
> > Yes.  And, as is, this also blocks NMIs and SMIs.  I suspect it also doesn't
> > prevent weirdness if the guest is running in L2, since IRQs for L1 will 
> > cause
> > exits from L2 during nested_ops->check_events().
> > 
> > > I have no objections though to use this patch as is, or at least make this
> > > the new default with a new flag to override this.
> > 
> > That's less bad, but IMO still violates the principle of least surprise, 
> > e.g.
> > someone that is single-stepping a guest and is expecting an IRQ to fire 
> > will be
> > all kinds of confused if they see all the proper IRR, ISR, EFLAGS.IF, etc...
> > settings, but no interrupt.
> 
> From my practical experience with debugging guests via single step,
> seeing an interrupt in that case is everything but handy and generally
> also not expected (though logical, I agree). IOW: When there is a knob
> for it, it will remain off in 99% of the time.
> 
> But I see the point of having some control, in an ideal world also an
> indication that there are pending events, permitting the user to decide
> what to do. But I suspect the gdb frontend and protocol does not easily
> permit that.

The qemu gdbstub actually does have control over suppressing interrupts
across a single step, and it is even enabled by default:

https://qemu.readthedocs.io/en/latest/system/gdb.html
(advanced debug options)

However, it is currently only implemented in TCG (software emulator) mode
and not in KVM mode (I could argue that this is a qemu bug).

So my plan was to add a new kvm guest debug flag, KVM_GUESTDBG_BLOCKEVENTS,
and let qemu enable it when its 'NOIRQ' mode is enabled (which it is by default).

However, due to the discussion in this thread about the leakage of RFLAGS.TF,
I wonder if kvm should suppress events by default and have something like
KVM_GUESTDBG_SSTEP_ALLOW_EVENTS to override this, wiring that to qemu's
NOIRQ=false case.

This would allow older qemu to work correctly, while a newer qemu would be
able to choose the old, less ideal behavior.
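
On the qemu side the wiring would then be roughly the following (sketch
only; KVM_GUESTDBG_BLOCKEVENTS is the proposed flag name, and the helper
below is a placeholder, not an existing qemu function):

static void update_guest_debug_flags(CPUState *cpu, struct kvm_guest_debug *dbg)
{
	if (cpu->singlestep_enabled) {
		dbg->control |= KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP;

		/* qemu's existing NOIRQ single-step setting */
		if (single_step_blocks_interrupts())	/* placeholder */
			dbg->control |= KVM_GUESTDBG_BLOCKEVENTS;
	}
}

Plus, of course, a KVM_CAP_SET_GUEST_DEBUG2 check so that qemu only sets
the new flag when the kernel actually advertises it.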

> 
> > > Sean Christopherson, what do you think?
> > 
> > Rather than block all events in KVM, what about having QEMU "pause" the 
> > timer?
> > E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out
> > which flavor it's using), clear them to zero, then restore both when
> > single-stepping is disabled.  I think that will work?
> > 
> 
> No one can stop the clock, and timers are only one source of interrupts.
> Plus they do not all come from QEMU, some also from KVM or in-kernel
> sources directly. Would quickly become a mess.

This, plus, as we have seen, even toggling RFLAGS.TF leaks it into the guest.
Changing things like MSR_TSC_DEADLINE will also become visible to the guest
sooner or later, and is a mess that I would rather not get into.

It is _possible_ to disable timer interrupts 'out of band', but that is messy
too if done from userspace. For example, what if the timer interrupt is
already pending in the local APIC when qemu decides to single step?

Also, with the gdbstub the user doesn't have to stop all vcpus (there is a
non-stop mode in which only some vcpus are stopped, which is actually a very
cool feature), and of course the running vcpus can raise events.

Also, interrupts can indeed come from things like vhost.

Best regards,
Maxim Levitsky


> Jan
> 




Re: [PATCH 2/3] KVM: x86: guest debug: don't inject interrupts while single stepping

2021-03-16 Thread Maxim Levitsky
On Tue, 2021-03-16 at 14:46 +0100, Jan Kiszka wrote:
> On 16.03.21 13:34, Maxim Levitsky wrote:
> > On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote:
> > > On 16.03.21 11:59, Maxim Levitsky wrote:
> > > > On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote:
> > > > > On 16.03.21 00:37, Sean Christopherson wrote:
> > > > > > On Tue, Mar 16, 2021, Maxim Levitsky wrote:
> > > > > > > This change greatly helps with two issues:
> > > > > > > 
> > > > > > > * Resuming from a breakpoint is much more reliable.
> > > > > > > 
> > > > > > >   When resuming execution from a breakpoint, with interrupts 
> > > > > > > enabled, more often
> > > > > > >   than not, KVM would inject an interrupt and make the CPU jump 
> > > > > > > immediately to
> > > > > > >   the interrupt handler and eventually return to the breakpoint, 
> > > > > > > to trigger it
> > > > > > >   again.
> > > > > > > 
> > > > > > >   From the user point of view it looks like the CPU never 
> > > > > > > executed a
> > > > > > >   single instruction and in some cases that can even prevent 
> > > > > > > forward progress,
> > > > > > >   for example, when the breakpoint is placed by an automated 
> > > > > > > script
> > > > > > >   (e.g lx-symbols), which does something in response to the 
> > > > > > > breakpoint and then
> > > > > > >   continues the guest automatically.
> > > > > > >   If the script execution takes enough time for another interrupt 
> > > > > > > to arrive,
> > > > > > >   the guest will be stuck on the same breakpoint RIP forever.
> > > > > > > 
> > > > > > > * Normal single stepping is much more predictable, since it won't 
> > > > > > > land the
> > > > > > >   debugger into an interrupt handler, so it is much more usable.
> > > > > > > 
> > > > > > >   (If entry to an interrupt handler is desired, the user can 
> > > > > > > still place a
> > > > > > >   breakpoint at it and resume the guest, which won't activate 
> > > > > > > this workaround
> > > > > > >   and let the gdb still stop at the interrupt handler)
> > > > > > > 
> > > > > > > Since this change is only active when guest is debugged, it won't 
> > > > > > > affect
> > > > > > > KVM running normal 'production' VMs.
> > > > > > > 
> > > > > > > 
> > > > > > > Signed-off-by: Maxim Levitsky 
> > > > > > > Tested-by: Stefano Garzarella 
> > > > > > > ---
> > > > > > >  arch/x86/kvm/x86.c | 6 ++
> > > > > > >  1 file changed, 6 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > > index a9d95f90a0487..b75d990fcf12b 100644
> > > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct 
> > > > > > > kvm_vcpu *vcpu, bool *req_immediate_exit
> > > > > > >   can_inject = false;
> > > > > > >   }
> > > > > > >  
> > > > > > > + /*
> > > > > > > +  * Don't inject interrupts while single stepping to make guest 
> > > > > > > debug easier
> > > > > > > +  */
> > > > > > > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> > > > > > > + return;
> > > > > > 
> > > > > > Is this something userspace can deal with?  E.g. disable IRQs 
> > > > > > and/or set NMI
> > > > > > blocking at the start of single-stepping, unwind at the end?  
> > > > > > Deviating this far
> > > > > > from architectural behavior will end in tears at some point.
> > > > > > 
> > > > > 
> > > > > Does this happen to address this suspicious workaround in the kern

Re: [PATCH 1/3] scripts/gdb: rework lx-symbols gdb script

2021-03-16 Thread Maxim Levitsky
On Tue, 2021-03-16 at 14:38 +0100, Jan Kiszka wrote:
> On 15.03.21 23:10, Maxim Levitsky wrote:
> > Fix several issues that are present in lx-symbols script:
> > 
> > * Track module unloads by placing another software breakpoint at 
> > 'free_module'
> >   (force uninline this symbol just in case), and use remove-symbol-file
> >   gdb command to unload the symobls of the module that is unloading.
> > 
> >   That gives the gdb a chance to mark all software breakpoints from
> >   this module as pending again.
> >   Also remove the module from the 'known' module list once it is unloaded.
> > 
> > * Since we now track module unload, we don't need to reload all
> >   symbols anymore when 'known' module loaded again (that can't happen 
> > anymore).
> >   This allows reloading a module in the debugged kernel to finish much 
> > faster,
> >   while lx-symbols tracks module loads and unloads.
> > 
> > * Disable/enable all gdb breakpoints on both module load and unload 
> > breakpoint
> >   hits, and not only in 'load_all_symbols' as was done before.
> >   (load_all_symbols is no longer called on breakpoint hit)
> >   That allows gdb to avoid getting confused about the state of the (now two)
> >   internal breakpoints we place.
> > 
> >   Otherwise it will leave them in the kernel code segment, when continuing
> >   which triggers a guest kernel panic as soon as it skips over the 'int3'
> >   instruction and executes the garbage tail of the optcode on which
> >   the breakpoint was placed.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  kernel/module.c  |   8 ++-
> >  scripts/gdb/linux/symbols.py | 106 +--
> >  2 files changed, 83 insertions(+), 31 deletions(-)
> > 
> > diff --git a/kernel/module.c b/kernel/module.c
> > index 30479355ab850..ea81fc06ea1f5 100644
> > --- a/kernel/module.c
> > +++ b/kernel/module.c
> > @@ -901,8 +901,12 @@ int module_refcount(struct module *mod)
> >  }
> >  EXPORT_SYMBOL(module_refcount);
> >  
> > -/* This exists whether we can unload or not */
> > -static void free_module(struct module *mod);
> > +/* This exists whether we can unload or not
> > + * Keep it uninlined to provide a reliable breakpoint target,
> > + * e.g. for the gdb helper command 'lx-symbols'.
> > + */
> > +
> > +static noinline void free_module(struct module *mod);
> >  
> >  SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
> > unsigned int, flags)
> > diff --git a/scripts/gdb/linux/symbols.py b/scripts/gdb/linux/symbols.py
> > index 1be9763cf8bb2..4ce879548a1ae 100644
> > --- a/scripts/gdb/linux/symbols.py
> > +++ b/scripts/gdb/linux/symbols.py
> > @@ -17,6 +17,24 @@ import re
> >  
> >  from linux import modules, utils
> >  
> > +def save_state():
> 
> Naming is a bit too generic. And it's not only saving the state, it's
> also disabling things.
> 
> > +breakpoints = []
> > +if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None:
> > +for bp in gdb.breakpoints():
> > +breakpoints.append({'breakpoint': bp, 'enabled': 
> > bp.enabled})
> > +bp.enabled = False
> > +
> > +show_pagination = gdb.execute("show pagination", to_string=True)
> > +pagination = show_pagination.endswith("on.\n")
> > +gdb.execute("set pagination off")
> > +
> > +return {"breakpoints":breakpoints, "show_pagination": 
> > show_pagination}
> > +
> > +def load_state(state):
> 
> Maybe rather something with "restore", to make naming balanced. Or is
> there a use case where "state" is not coming from the function above?

I didn't put much thought into naming these functions. 
I'll think of something better.

> 
> > +for breakpoint in state["breakpoints"]:
> > +breakpoint['breakpoint'].enabled = breakpoint['enabled']
> > +gdb.execute("set pagination %s" % ("on" if state["show_pagination"] 
> > else "off"))
> > +
> >  
> >  if hasattr(gdb, 'Breakpoint'):
> >  class LoadModuleBreakpoint(gdb.Breakpoint):
> > @@ -30,26 +48,38 @@ if hasattr(gdb, 'Breakpoint'):
> >  module_name = module['name'].string()
> >  cmd = self.gdb_command
> >  
> > +# module already loaded, false alarm
> > +if module_name in cmd.loaded_module

Re: [PATCH 2/3] KVM: x86: guest debug: don't inject interrupts while single stepping

2021-03-16 Thread Maxim Levitsky
On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote:
> On 16.03.21 11:59, Maxim Levitsky wrote:
> > On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote:
> > > On 16.03.21 00:37, Sean Christopherson wrote:
> > > > On Tue, Mar 16, 2021, Maxim Levitsky wrote:
> > > > > This change greatly helps with two issues:
> > > > > 
> > > > > * Resuming from a breakpoint is much more reliable.
> > > > > 
> > > > >   When resuming execution from a breakpoint, with interrupts enabled, 
> > > > > more often
> > > > >   than not, KVM would inject an interrupt and make the CPU jump 
> > > > > immediately to
> > > > >   the interrupt handler and eventually return to the breakpoint, to 
> > > > > trigger it
> > > > >   again.
> > > > > 
> > > > >   From the user point of view it looks like the CPU never executed a
> > > > >   single instruction and in some cases that can even prevent forward 
> > > > > progress,
> > > > >   for example, when the breakpoint is placed by an automated script
> > > > >   (e.g lx-symbols), which does something in response to the 
> > > > > breakpoint and then
> > > > >   continues the guest automatically.
> > > > >   If the script execution takes enough time for another interrupt to 
> > > > > arrive,
> > > > >   the guest will be stuck on the same breakpoint RIP forever.
> > > > > 
> > > > > * Normal single stepping is much more predictable, since it won't 
> > > > > land the
> > > > >   debugger into an interrupt handler, so it is much more usable.
> > > > > 
> > > > >   (If entry to an interrupt handler is desired, the user can still 
> > > > > place a
> > > > >   breakpoint at it and resume the guest, which won't activate this 
> > > > > workaround
> > > > >   and let the gdb still stop at the interrupt handler)
> > > > > 
> > > > > Since this change is only active when guest is debugged, it won't 
> > > > > affect
> > > > > KVM running normal 'production' VMs.
> > > > > 
> > > > > 
> > > > > Signed-off-by: Maxim Levitsky 
> > > > > Tested-by: Stefano Garzarella 
> > > > > ---
> > > > >  arch/x86/kvm/x86.c | 6 ++
> > > > >  1 file changed, 6 insertions(+)
> > > > > 
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index a9d95f90a0487..b75d990fcf12b 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct 
> > > > > kvm_vcpu *vcpu, bool *req_immediate_exit
> > > > >   can_inject = false;
> > > > >   }
> > > > >  
> > > > > + /*
> > > > > +  * Don't inject interrupts while single stepping to make guest 
> > > > > debug easier
> > > > > +  */
> > > > > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> > > > > + return;
> > > > 
> > > > Is this something userspace can deal with?  E.g. disable IRQs and/or 
> > > > set NMI
> > > > blocking at the start of single-stepping, unwind at the end?  Deviating 
> > > > this far
> > > > from architectural behavior will end in tears at some point.
> > > > 
> > > 
> > > Does this happen to address this suspicious workaround in the kernel?
> > > 
> > > /*
> > >  * The kernel doesn't use TF single-step outside of:
> > >  *
> > >  *  - Kprobes, consumed through kprobe_debug_handler()
> > >  *  - KGDB, consumed through notify_debug()
> > >  *
> > >  * So if we get here with DR_STEP set, something is wonky.
> > >  *
> > >  * A known way to trigger this is through QEMU's GDB stub,
> > >  * which leaks #DB into the guest and causes IST recursion.
> > >  */
> > > if (WARN_ON_ONCE(dr6 & DR_STEP))
> > > regs->flags &= ~X86_EFLAGS_TF;
> > > 
> > > (arch/x86/kernel/traps.c, exc_debug_kernel)
> > > 
> > > I wonder why this

Re: [PATCH 2/3] KVM: x86: guest debug: don't inject interrupts while single stepping

2021-03-16 Thread Maxim Levitsky
On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote:
> On 16.03.21 00:37, Sean Christopherson wrote:
> > On Tue, Mar 16, 2021, Maxim Levitsky wrote:
> > > This change greatly helps with two issues:
> > > 
> > > * Resuming from a breakpoint is much more reliable.
> > > 
> > >   When resuming execution from a breakpoint, with interrupts enabled, 
> > > more often
> > >   than not, KVM would inject an interrupt and make the CPU jump 
> > > immediately to
> > >   the interrupt handler and eventually return to the breakpoint, to 
> > > trigger it
> > >   again.
> > > 
> > >   From the user point of view it looks like the CPU never executed a
> > >   single instruction and in some cases that can even prevent forward 
> > > progress,
> > >   for example, when the breakpoint is placed by an automated script
> > >   (e.g lx-symbols), which does something in response to the breakpoint 
> > > and then
> > >   continues the guest automatically.
> > >   If the script execution takes enough time for another interrupt to 
> > > arrive,
> > >   the guest will be stuck on the same breakpoint RIP forever.
> > > 
> > > * Normal single stepping is much more predictable, since it won't land the
> > >   debugger into an interrupt handler, so it is much more usable.
> > > 
> > >   (If entry to an interrupt handler is desired, the user can still place a
> > >   breakpoint at it and resume the guest, which won't activate this 
> > > workaround
> > >   and let the gdb still stop at the interrupt handler)
> > > 
> > > Since this change is only active when guest is debugged, it won't affect
> > > KVM running normal 'production' VMs.
> > > 
> > > 
> > > Signed-off-by: Maxim Levitsky 
> > > Tested-by: Stefano Garzarella 
> > > ---
> > >  arch/x86/kvm/x86.c | 6 ++
> > >  1 file changed, 6 insertions(+)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index a9d95f90a0487..b75d990fcf12b 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu 
> > > *vcpu, bool *req_immediate_exit
> > >   can_inject = false;
> > >   }
> > >  
> > > + /*
> > > +  * Don't inject interrupts while single stepping to make guest debug 
> > > easier
> > > +  */
> > > + if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> > > + return;
> > 
> > Is this something userspace can deal with?  E.g. disable IRQs and/or set NMI
> > blocking at the start of single-stepping, unwind at the end?  Deviating 
> > this far
> > from architectural behavior will end in tears at some point.
> > 
> 
> Does this happen to address this suspicious workaround in the kernel?
> 
> /*
>  * The kernel doesn't use TF single-step outside of:
>  *
>  *  - Kprobes, consumed through kprobe_debug_handler()
>  *  - KGDB, consumed through notify_debug()
>  *
>  * So if we get here with DR_STEP set, something is wonky.
>  *
>  * A known way to trigger this is through QEMU's GDB stub,
>  * which leaks #DB into the guest and causes IST recursion.
>  */
> if (WARN_ON_ONCE(dr6 & DR_STEP))
> regs->flags &= ~X86_EFLAGS_TF;
> 
> (arch/x86/kernel/traps.c, exc_debug_kernel)
> 
> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh,
> yeah, question to myself as well, dancing around broken guest debugging
> for a long time while trying to fix other issues...

To be honest I haven't seen that warning even once, but I can imagine KVM
leaking #DB due to bugs in that code. That area historically didn't receive
much attention since it can only be triggered by
KVM_GET/SET_GUEST_DEBUG, which isn't used in production.

The only issue that I did see, on the other hand, is mostly gdb's fault:
it fails to remove a software breakpoint when resuming over it,
if that breakpoint's python handler messes with gdb's symbols,
which is what lx-symbols does.

And that despite the fact that lx-symbols doesn't touch the object
(that is, the kernel) in which the breakpoint is defined.

Just adding/removing one symbol file is enough to trigger this issue.

Since lx-symbols already works around this when it reloads all symbols,
I extended that workaround to also happen when loading/unloading
only a single symbol file.

Best regards,
Maxim Levitsky

> 
> Jan
> 
> > > +
> > >   /*
> > >* Finally, inject interrupt events.  If an event cannot be injected
> > >* due to architectural conditions (e.g. IF=0) a window-open exit
> > > -- 
> > > 2.26.2
> > > 




Re: [PATCH 2/3] KVM: x86: guest debug: don't inject interrupts while single stepping

2021-03-16 Thread Maxim Levitsky
On Mon, 2021-03-15 at 16:37 -0700, Sean Christopherson wrote:
> On Tue, Mar 16, 2021, Maxim Levitsky wrote:
> > This change greatly helps with two issues:
> > 
> > * Resuming from a breakpoint is much more reliable.
> > 
> >   When resuming execution from a breakpoint, with interrupts enabled, more 
> > often
> >   than not, KVM would inject an interrupt and make the CPU jump immediately 
> > to
> >   the interrupt handler and eventually return to the breakpoint, to trigger 
> > it
> >   again.
> > 
> >   From the user point of view it looks like the CPU never executed a
> >   single instruction and in some cases that can even prevent forward 
> > progress,
> >   for example, when the breakpoint is placed by an automated script
> >   (e.g lx-symbols), which does something in response to the breakpoint and 
> > then
> >   continues the guest automatically.
> >   If the script execution takes enough time for another interrupt to arrive,
> >   the guest will be stuck on the same breakpoint RIP forever.
> > 
> > * Normal single stepping is much more predictable, since it won't land the
> >   debugger into an interrupt handler, so it is much more usable.
> > 
> >   (If entry to an interrupt handler is desired, the user can still place a
> >   breakpoint at it and resume the guest, which won't activate this 
> > workaround
> >   and let the gdb still stop at the interrupt handler)
> > 
> > Since this change is only active when guest is debugged, it won't affect
> > KVM running normal 'production' VMs.
> > 
> > 
> > Signed-off-by: Maxim Levitsky 
> > Tested-by: Stefano Garzarella 
> > ---
> >  arch/x86/kvm/x86.c | 6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index a9d95f90a0487..b75d990fcf12b 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu 
> > *vcpu, bool *req_immediate_exit
> > can_inject = false;
> > }
> >  
> > +   /*
> > +* Don't inject interrupts while single stepping to make guest debug 
> > easier
> > +*/
> > +   if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
> > +   return;
> 
> Is this something userspace can deal with?  E.g. disable IRQs and/or set NMI
> blocking at the start of single-stepping, unwind at the end?  Deviating this 
> far
> from architectural behavior will end in tears at some point.

I don't worry about NMIs, but for IRQs userspace could clear EFLAGS.IF; however
that can be messy to unwind if an instruction that clears the interrupt flag
was single-stepped over.

There is also the notion of an interrupt shadow, but it is reserved for things
like delaying interrupts for one instruction after sti, and such.

IMHO KVM_GUESTDBG_SINGLESTEP is already a non-architectural feature (userspace
basically tells KVM to single-step the guest, it doesn't set the TF flag
or anything like that), so changing its definition shouldn't be a problem.

If you worry about some automated script breaking due to the change
(I expect that KVM_GUESTDBG_SINGLESTEP is mostly used manually, especially
since single stepping is never 100% reliable due to various issues like this),
I can add another flag which will block all the interrupts
(say, KVM_GUESTDBG_BLOCKEVENTS).
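
From the userspace side that would look roughly like the minimal sketch below
(KVM_GUESTDBG_ENABLE and KVM_GUESTDBG_SINGLESTEP already exist in
<linux/kvm.h>; KVM_GUESTDBG_BLOCKEVENTS is only the proposed name and doesn't
exist yet, so it is mentioned only in the comment):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* enable single stepping on one vcpu via KVM_SET_GUEST_DEBUG */
static int enable_singlestep(int vcpu_fd)
{
	struct kvm_guest_debug dbg = {
		.control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP,
	};

	/*
	 * With the proposed flag, a debugger that wants single stepping
	 * without interrupt injection would additionally set:
	 * dbg.control |= KVM_GUESTDBG_BLOCKEVENTS;
	 */
	return ioctl(vcpu_fd, KVM_SET_GUEST_DEBUG, &dbg);
}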

In fact qemu already has single-step flags, exposed via a special qemu gdb
extension ('maintenance packet qqemu.sstepbits').

Those single-step flags allow disabling interrupts and qemu timers during
single stepping (and both modes are enabled by default).
However the kvm code in qemu ignores these bits.


What do you think? 

Best regards,
Maxim Levitsky


> 
> > +
> > /*
> >  * Finally, inject interrupt events.  If an event cannot be injected
> >  * due to architectural conditions (e.g. IF=0) a window-open exit
> > -- 
> > 2.26.2
> > 




Re: [PATCH 3/3] KVM: SVM: allow to intercept all exceptions for debug

2021-03-16 Thread Maxim Levitsky
On Tue, 2021-03-16 at 09:32 +0100, Joerg Roedel wrote:
> Hi Maxim,
> 
> On Tue, Mar 16, 2021 at 12:10:20AM +0200, Maxim Levitsky wrote:
> > -static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> > +static int (*svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
> 
> Can you keep this const and always set the necessary handlers? If
> exceptions are not intercepted they will not be used.
> 
> > @@ -333,7 +334,9 @@ static inline void clr_exception_intercept(struct 
> > vcpu_svm *svm, u32 bit)
> > struct vmcb *vmcb = svm->vmcb01.ptr;
> >  
> > WARN_ON_ONCE(bit >= 32);
> > -   vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit);
> > +
> > +   if (!((1 << bit) & debug_intercept_exceptions))
> > +   vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + 
> > bit);
> 
> This will break SEV-ES guests, as those will not cause an intercept but
> now start to get #VC exceptions on every other exception that is raised.
> SEV-ES guests are not prepared for that and will not even boot, so
> please don't enable this feature for them.

I agree, but what is wrong with that?
This is a debug feature, it can only be enabled by root,
and someone might actually want this case to happen
(e.g. to see if a SEV guest can cope with extra #VC exceptions).

I have nothing against not allowing this for SEV-ES guests though.
What do you think?


Best regards,
Maxim Levitsky



Re: [PATCH 2/2] KVM: nSVM: improve SYSENTER emulation on AMD

2021-03-16 Thread Maxim Levitsky
On Tue, 2021-03-16 at 09:16 +0100, Paolo Bonzini wrote:
> On 15/03/21 19:19, Maxim Levitsky wrote:
> > On Mon, 2021-03-15 at 18:56 +0100, Paolo Bonzini wrote:
> > > On 15/03/21 18:43, Maxim Levitsky wrote:
> > > > +   if (!guest_cpuid_is_intel(vcpu)) {
> > > > +   /*
> > > > +* If hardware supports Virtual VMLOAD VMSAVE then 
> > > > enable it
> > > > +* in VMCB and clear intercepts to avoid #VMEXIT.
> > > > +*/
> > > > +   if (vls) {
> > > > +   svm_clr_intercept(svm, INTERCEPT_VMLOAD);
> > > > +   svm_clr_intercept(svm, INTERCEPT_VMSAVE);
> > > > +   svm->vmcb->control.virt_ext |= 
> > > > VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
> > > > +   }
> > > > +   /* No need to intercept these msrs either */
> > > > +   set_msr_interception(vcpu, svm->msrpm, 
> > > > MSR_IA32_SYSENTER_EIP, 1, 1);
> > > > +   set_msr_interception(vcpu, svm->msrpm, 
> > > > MSR_IA32_SYSENTER_ESP, 1, 1);
> > > > +   }
> > > 
> > > An "else" is needed here to do the opposite setup (removing the "if
> > > (vls)" from init_vmcb).
> > 
> > init_vmcb currently set the INTERCEPT_VMLOAD and INTERCEPT_VMSAVE and it 
> > doesn't enable vls
> 
> There's also this towards the end of the function:
> 
>  /*
>   * If hardware supports Virtual VMLOAD VMSAVE then enable it
>   * in VMCB and clear intercepts to avoid #VMEXIT.
>   */
>  if (vls) {
>  svm_clr_intercept(svm, INTERCEPT_VMLOAD);
>  svm_clr_intercept(svm, INTERCEPT_VMSAVE);
>  svm->vmcb->control.virt_ext |= 
> VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
>  }
> 
> > thus there is nothing to do if I don't want to enable vls.
> > It seems reasonable to me.
> > 
> > Both msrs I marked as '.always = false' in the
> > 'direct_access_msrs', which makes them be intercepted by the default.
> > If I were to use '.always = true' it would feel a bit wrong as the 
> > intercept is not always
> > enabled.
> 
> I agree that .always = false is correct.
> 
> > What do you think?
> 
> You can set the CPUID multiple times, so you could go from AMD to Intel 
> and back.

I understand now, I will send V2 with that. Thanks for the review!

Best regards,
Maxim Levitsky

> 
> Thanks,
> 
> Paolo
> 




[PATCH 3/3] KVM: SVM: allow to intercept all exceptions for debug

2021-03-15 Thread Maxim Levitsky
Add a new debug module param 'debug_intercept_exceptions' which will allow
KVM to intercept any guest exception, and forward it to the guest.

This can be very useful for guest debugging and/or KVM debugging with kvm trace.
This is not intended to be used on production systems.

This is based on an idea first shown here:
https://patchwork.kernel.org/project/kvm/patch/20160301192822.gd22...@pd.tnic/

CC: Borislav Petkov 
Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/svm/svm.c  | 77 -
 arch/x86/kvm/svm/svm.h  |  5 ++-
 arch/x86/kvm/x86.c  |  5 ++-
 4 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a52f973bdff6d..c8f44a88b3153 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1564,6 +1564,8 @@ int kvm_emulate_rdpmc(struct kvm_vcpu *vcpu);
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
 void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long 
payload);
+void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
+u32 error_code, unsigned long payload);
 void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 
error_code);
 void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 271196400495f..94156a367a663 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -197,6 +197,9 @@ module_param(sev_es, int, 0444);
 bool __read_mostly dump_invalid_vmcb;
 module_param(dump_invalid_vmcb, bool, 0644);
 
+uint debug_intercept_exceptions;
+module_param(debug_intercept_exceptions, uint, 0444);
+
 static bool svm_gp_erratum_intercept = true;
 
 static u8 rsm_ins_bytes[] = "\x0f\xaa";
@@ -220,6 +223,8 @@ static const u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000};
 #define MSRS_RANGE_SIZE 2048
 #define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2)
 
+static void init_debug_exceptions_intercept(struct vcpu_svm *svm);
+
 u32 svm_msrpm_offset(u32 msr)
 {
u32 offset;
@@ -1137,6 +1142,8 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
set_exception_intercept(svm, MC_VECTOR);
set_exception_intercept(svm, AC_VECTOR);
set_exception_intercept(svm, DB_VECTOR);
+
+   init_debug_exceptions_intercept(svm);
/*
 * Guest access to VMware backdoor ports could legitimately
 * trigger #GP because of TSS I/O permission bitmap.
@@ -1913,6 +1920,17 @@ static int pf_interception(struct kvm_vcpu *vcpu)
u64 fault_address = svm->vmcb->control.exit_info_2;
u64 error_code = svm->vmcb->control.exit_info_1;
 
+   if ((debug_intercept_exceptions & (1 << PF_VECTOR)))
+   if (npt_enabled && !vcpu->arch.apf.host_apf_flags) {
+   /* If #PF was only intercepted for debug, inject
+* it directly to the guest, since the mmu code
+* is not ready to deal with such page faults
+*/
+   kvm_queue_exception_e_p(vcpu, PF_VECTOR,
+   error_code, fault_address);
+   return 1;
+   }
+
return kvm_handle_page_fault(vcpu, error_code, fault_address,
static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
svm->vmcb->control.insn_bytes : NULL,
@@ -3025,7 +3043,7 @@ static int invpcid_interception(struct kvm_vcpu *vcpu)
return kvm_handle_invpcid(vcpu, type, gva);
 }
 
-static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
+static int (*svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[SVM_EXIT_READ_CR0] = cr_interception,
[SVM_EXIT_READ_CR3] = cr_interception,
[SVM_EXIT_READ_CR4] = cr_interception,
@@ -3099,6 +3117,63 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu 
*vcpu) = {
[SVM_EXIT_VMGEXIT]  = sev_handle_vmgexit,
 };
 
+static int generic_exception_interception(struct kvm_vcpu *vcpu)
+{
+   /*
+* Generic exception handler which forwards a guest exception
+* as-is to the guest.
+* For exceptions that don't have a special intercept handler.
+*
+* Used for 'debug_intercept_exceptions' KVM debug feature only.
+*/
+   struct vcpu_svm *svm = to_svm(vcpu);
+   int exc = svm->vmcb->control.exit_code - SVM_EXIT_EXCP_BASE;
+
+   WARN_ON(exc < 0 || exc > 31);
+
+   if (exc == TS_VECTOR) {
+   /*
+* SVM doesn't provide us 

[PATCH 2/3] KVM: x86: guest debug: don't inject interrupts while single stepping

2021-03-15 Thread Maxim Levitsky
This change greatly helps with two issues:

* Resuming from a breakpoint is much more reliable.

  When resuming execution from a breakpoint, with interrupts enabled, more often
  than not, KVM would inject an interrupt and make the CPU jump immediately to
  the interrupt handler and eventually return to the breakpoint, to trigger it
  again.

  From the user point of view it looks like the CPU never executed a
  single instruction and in some cases that can even prevent forward progress,
  for example, when the breakpoint is placed by an automated script
  (e.g lx-symbols), which does something in response to the breakpoint and then
  continues the guest automatically.
  If the script execution takes enough time for another interrupt to arrive,
  the guest will be stuck on the same breakpoint RIP forever.

* Normal single stepping is much more predictable, since it won't land the
  debugger into an interrupt handler, so it is much more usable.

  (If entry to an interrupt handler is desired, the user can still place a
  breakpoint at it and resume the guest, which won't activate this workaround
  and let the gdb still stop at the interrupt handler)

Since this change is only active when the guest is being debugged, it won't
affect KVM running normal 'production' VMs.


Signed-off-by: Maxim Levitsky 
Tested-by: Stefano Garzarella 
---
 arch/x86/kvm/x86.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a9d95f90a0487..b75d990fcf12b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, 
bool *req_immediate_exit
can_inject = false;
}
 
+   /*
+* Don't inject interrupts while single stepping to make guest debug 
easier
+*/
+   if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
+   return;
+
/*
 * Finally, inject interrupt events.  If an event cannot be injected
 * due to architectural conditions (e.g. IF=0) a window-open exit
-- 
2.26.2



[PATCH 0/3] KVM: my debug patch queue

2021-03-15 Thread Maxim Levitsky
Hi!

I would like to publish two debug features which were needed for other stuff
I work on.

One is the reworked lx-symbols script which now actually works on at least
gdb 9.1 (gdb 9.2 was reported to fail to load the debug symbols from the kernel
for some reason, not related to this patch) and upstream qemu.

The other feature is the ability to trap all guest exceptions (on SVM for now)
and see them in kvmtrace prior to potential merge to double/triple fault.

This can be very useful and I already had to manually patch KVM a few
times for this.
I will, once time permits, implement this feature on Intel as well.

Best regards,
Maxim Levitsky

Maxim Levitsky (3):
  scripts/gdb: rework lx-symbols gdb script
  KVM: x86: guest debug: don't inject interrupts while single stepping
  KVM: SVM: allow to intercept all exceptions for debug

 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/svm/svm.c  |  77 ++-
 arch/x86/kvm/svm/svm.h  |   5 +-
 arch/x86/kvm/x86.c  |  11 +++-
 kernel/module.c |   8 ++-
 scripts/gdb/linux/symbols.py| 106 +++-
 6 files changed, 174 insertions(+), 35 deletions(-)

-- 
2.26.2




[PATCH 1/3] scripts/gdb: rework lx-symbols gdb script

2021-03-15 Thread Maxim Levitsky
Fix several issues that are present in lx-symbols script:

* Track module unloads by placing another software breakpoint at 'free_module'
  (forcing this symbol to be uninlined just in case), and use the
  remove-symbol-file gdb command to unload the symbols of the module that is
  being unloaded.

  That gives gdb a chance to mark all software breakpoints from
  this module as pending again.
  Also remove the module from the 'known' module list once it is unloaded.

* Since we now track module unloads, we don't need to reload all
  symbols anymore when a 'known' module is loaded again (that can't happen
  anymore).  This allows reloading a module in the debugged kernel to finish
  much faster, while lx-symbols tracks module loads and unloads.

* Disable/enable all gdb breakpoints on both module load and unload breakpoint
  hits, and not only in 'load_all_symbols' as was done before.
  (load_all_symbols is no longer called on breakpoint hit)
  That allows gdb to avoid getting confused about the state of the (now two)
  internal breakpoints we place.

  Otherwise it would leave them in the kernel code segment when continuing,
  which triggers a guest kernel panic as soon as it skips over the 'int3'
  instruction and executes the garbage tail of the opcode on which
  the breakpoint was placed.

Signed-off-by: Maxim Levitsky 
---
 kernel/module.c  |   8 ++-
 scripts/gdb/linux/symbols.py | 106 +--
 2 files changed, 83 insertions(+), 31 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 30479355ab850..ea81fc06ea1f5 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -901,8 +901,12 @@ int module_refcount(struct module *mod)
 }
 EXPORT_SYMBOL(module_refcount);
 
-/* This exists whether we can unload or not */
-static void free_module(struct module *mod);
+/* This exists whether we can unload or not
+ * Keep it uninlined to provide a reliable breakpoint target,
+ * e.g. for the gdb helper command 'lx-symbols'.
+ */
+
+static noinline void free_module(struct module *mod);
 
 SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
unsigned int, flags)
diff --git a/scripts/gdb/linux/symbols.py b/scripts/gdb/linux/symbols.py
index 1be9763cf8bb2..4ce879548a1ae 100644
--- a/scripts/gdb/linux/symbols.py
+++ b/scripts/gdb/linux/symbols.py
@@ -17,6 +17,24 @@ import re
 
 from linux import modules, utils
 
+def save_state():
+breakpoints = []
+if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None:
+for bp in gdb.breakpoints():
+breakpoints.append({'breakpoint': bp, 'enabled': bp.enabled})
+bp.enabled = False
+
+show_pagination = gdb.execute("show pagination", to_string=True)
+pagination = show_pagination.endswith("on.\n")
+gdb.execute("set pagination off")
+
+return {"breakpoints":breakpoints, "show_pagination": show_pagination}
+
+def load_state(state):
+for breakpoint in state["breakpoints"]:
+breakpoint['breakpoint'].enabled = breakpoint['enabled']
+gdb.execute("set pagination %s" % ("on" if state["show_pagination"] else 
"off"))
+
 
 if hasattr(gdb, 'Breakpoint'):
 class LoadModuleBreakpoint(gdb.Breakpoint):
@@ -30,26 +48,38 @@ if hasattr(gdb, 'Breakpoint'):
 module_name = module['name'].string()
 cmd = self.gdb_command
 
+# module already loaded, false alarm
+if module_name in cmd.loaded_modules:
+return False
+
 # enforce update if object file is not found
 cmd.module_files_updated = False
 
 # Disable pagination while reporting symbol (re-)loading.
 # The console input is blocked in this context so that we would
 # get stuck waiting for the user to acknowledge paged output.
-show_pagination = gdb.execute("show pagination", to_string=True)
-pagination = show_pagination.endswith("on.\n")
-gdb.execute("set pagination off")
+state = save_state()
+cmd.load_module_symbols(module)
+load_state(state)
+return False
 
-if module_name in cmd.loaded_modules:
-gdb.write("refreshing all symbols to reload module "
-  "'{0}'\n".format(module_name))
-cmd.load_all_symbols()
-else:
-cmd.load_module_symbols(module)
+class UnLoadModuleBreakpoint(gdb.Breakpoint):
+def __init__(self, spec, gdb_command):
+super(UnLoadModuleBreakpoint, self).__init__(spec, internal=True)
+self.silent = True
+self.gdb_command = gdb_command
+
+def stop(self):
+module = gdb.parse_and_eval("mod")
+module_name = module['name'].str

Re: [PATCH 2/2] KVM: nSVM: improve SYSENTER emulation on AMD

2021-03-15 Thread Maxim Levitsky
On Mon, 2021-03-15 at 18:56 +0100, Paolo Bonzini wrote:
> On 15/03/21 18:43, Maxim Levitsky wrote:
> > +   if (!guest_cpuid_is_intel(vcpu)) {
> > +   /*
> > +* If hardware supports Virtual VMLOAD VMSAVE then enable it
> > +* in VMCB and clear intercepts to avoid #VMEXIT.
> > +*/
> > +   if (vls) {
> > +   svm_clr_intercept(svm, INTERCEPT_VMLOAD);
> > +   svm_clr_intercept(svm, INTERCEPT_VMSAVE);
> > +   svm->vmcb->control.virt_ext |= 
> > VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
> > +   }
> > +   /* No need to intercept these msrs either */
> > +   set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_EIP, 
> > 1, 1);
> > +   set_msr_interception(vcpu, svm->msrpm, MSR_IA32_SYSENTER_ESP, 
> > 1, 1);
> > +   }
> 
> An "else" is needed here to do the opposite setup (removing the "if 
> (vls)" from init_vmcb).

init_vmcb currently sets INTERCEPT_VMLOAD and INTERCEPT_VMSAVE and doesn't
enable vls, thus there is nothing to do if I don't want to enable vls.
It seems reasonable to me.

Both msrs I marked as '.always = false' in 'direct_access_msrs', which makes
them be intercepted by default.
If I were to use '.always = true' it would feel a bit wrong, as the intercept
is not always enabled.

What do you think?

> 
> This also makes the code more readable since you can write
> 
>   if (guest_cpuid_is_intel(vcpu)) {
>   /*
>* We must intercept SYSENTER_EIP and SYSENTER_ESP
>* accesses because the processor only stores 32 bits.
>* For the same reason we cannot use virtual
>    * VMLOAD/VMSAVE.
>*/
>   ...
>   } else {
>   /* Do the opposite.  */
>   ...
>   }

Best regards,
Maxim Levitsky

> 
> Paolo
> 




[PATCH 0/2] KVM: x86: nSVM: fixes for SYSENTER emulation

2021-03-15 Thread Maxim Levitsky
This is the result of a deep rabbit-hole dive into why nested migration of
32 bit guests is currently totally broken on AMD.

It turns out that due to slight differences between the original AMD64
implementation and Intel's remake, the SYSENTER instruction behaves a bit
differently on Intel, and to support migration from Intel to AMD we try to
emulate those differences away.

Sadly that collides with the virtual vmload/vmsave feature that is used in
nesting: on migration (and otherwise, whenever userspace reads
MSR_IA32_SYSENTER_EIP/MSR_IA32_SYSENTER_ESP) a wrong value is returned, which
leads to a #DF in the nested guest when that wrong value is loaded back.

The patch I prepared carefully fixes this, mostly by disabling that SYSENTER
emulation when we don't spoof Intel's vendor ID; and if we do, and yet somehow
SVM is enabled (a very rare corner case), then virtual vmload/vmsave is
force-disabled.

Best regards,
Maxim Levitsky

Maxim Levitsky (2):
  KVM: x86: add guest_cpuid_is_intel
  KVM: nSVM: improve SYSENTER emulation on AMD

 arch/x86/kvm/cpuid.h   |  8 
 arch/x86/kvm/svm/svm.c | 97 --
 arch/x86/kvm/svm/svm.h |  7 +--
 3 files changed, 77 insertions(+), 35 deletions(-)

-- 
2.26.2




[PATCH 1/2] KVM: x86: add guest_cpuid_is_intel

2021-03-15 Thread Maxim Levitsky
This is similar to existing 'guest_cpuid_is_amd_or_hygon'

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/cpuid.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 2a0c5064497f3..ded84d244f19f 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -248,6 +248,14 @@ static inline bool guest_cpuid_is_amd_or_hygon(struct 
kvm_vcpu *vcpu)
is_guest_vendor_hygon(best->ebx, best->ecx, best->edx));
 }
 
+static inline bool guest_cpuid_is_intel(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 0, 0);
+   return best && is_guest_vendor_intel(best->ebx, best->ecx, best->edx);
+}
+
 static inline int guest_cpuid_family(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
-- 
2.26.2



[PATCH 2/2] KVM: nSVM: improve SYSENTER emulation on AMD

2021-03-15 Thread Maxim Levitsky
Currently, to support Intel->AMD migration, if the CPU vendor is GenuineIntel,
we emulate the full 64 bit value of the
MSR_IA32_SYSENTER_EIP/MSR_IA32_SYSENTER_ESP msrs, and we also emulate the
sysenter/sysexit instructions in long mode.

(The emulator still refuses to emulate sysenter in 64 bit mode, on the grounds
that the code for this wasn't tested and likely has no users.)

However when virtual vmload/vmsave is enabled, the vmload instruction will
update these 32 bit msrs without triggering their msr intercept,
which leads to stale values in our shadow copy of these msrs, which is in
turn only updated when those msrs are intercepted.

Fix/optimize this by doing the following:

1. Enable the MSR intercepts for these MSRs iff vendor=GenuineIntel
   (This is both a tiny optimization and also will ensure that when guest
   cpu vendor is AMD, the msrs will be 32 bit wide as AMD defined).

2. Store only high 32 bit part of these msrs on interception and combine
   it with hardware msr value on intercepted read/writes iff 
vendor=GenuineIntel.

3. Disable vmload/vmsave virtualization if vendor=GenuineIntel.
   (It is somewhat insane to set vendor=GenuineIntel and still enable
   SVM for the guest but well whatever).
   Then zero the high 32 bit parts when we intercept and emulate vmload.
   And since we now read the low 32 bit part from the VMCB, it will be
   correct.

Thanks a lot to Paolo Bonzini for helping me fix this in the most correct way.

This patch fixes nested migration of 32 bit nested guests which was broken due
to incorrect cached values of these msrs being read if L1 changed these
msrs with vmload prior to L2 entry.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/svm.c | 97 --
 arch/x86/kvm/svm/svm.h |  7 +--
 2 files changed, 69 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 271196400495f..8bf243e0b1f7c 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -95,6 +95,8 @@ static const struct svm_direct_access_msrs {
 } direct_access_msrs[MAX_DIRECT_ACCESS_MSRS] = {
{ .index = MSR_STAR,.always = true  },
{ .index = MSR_IA32_SYSENTER_CS,.always = true  },
+   { .index = MSR_IA32_SYSENTER_EIP,   .always = false },
+   { .index = MSR_IA32_SYSENTER_ESP,   .always = false },
 #ifdef CONFIG_X86_64
{ .index = MSR_GS_BASE, .always = true  },
{ .index = MSR_FS_BASE, .always = true  },
@@ -1258,16 +1260,6 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
if (kvm_vcpu_apicv_active(vcpu))
avic_init_vmcb(svm);
 
-   /*
-* If hardware supports Virtual VMLOAD VMSAVE then enable it
-* in VMCB and clear intercepts to avoid #VMEXIT.
-*/
-   if (vls) {
-   svm_clr_intercept(svm, INTERCEPT_VMLOAD);
-   svm_clr_intercept(svm, INTERCEPT_VMSAVE);
-   svm->vmcb->control.virt_ext |= 
VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
-   }
-
if (vgif) {
svm_clr_intercept(svm, INTERCEPT_STGI);
svm_clr_intercept(svm, INTERCEPT_CLGI);
@@ -2133,9 +2125,11 @@ static int vmload_vmsave_interception(struct kvm_vcpu 
*vcpu, bool vmload)
 
ret = kvm_skip_emulated_instruction(vcpu);
 
-   if (vmload)
+   if (vmload) {
nested_svm_vmloadsave(vmcb12, svm->vmcb);
-   else
+   svm->sysenter_eip_hi = 0;
+   svm->sysenter_esp_hi = 0;
+   } else
nested_svm_vmloadsave(svm->vmcb, vmcb12);
 
	kvm_vcpu_unmap(vcpu, &map, true);
@@ -2676,11 +2670,18 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
case MSR_IA32_SYSENTER_CS:
msr_info->data = svm->vmcb01.ptr->save.sysenter_cs;
break;
+
case MSR_IA32_SYSENTER_EIP:
-   msr_info->data = svm->sysenter_eip;
+   msr_info->data = (u32)svm->vmcb01.ptr->save.sysenter_eip;
+   if (guest_cpuid_is_intel(vcpu))
+   msr_info->data |= (u64)svm->sysenter_eip_hi << 32;
+
break;
case MSR_IA32_SYSENTER_ESP:
-   msr_info->data = svm->sysenter_esp;
+   msr_info->data = svm->vmcb01.ptr->save.sysenter_esp;
+   if (guest_cpuid_is_intel(vcpu))
+   msr_info->data |= (u64)svm->sysenter_esp_hi << 32;
+
break;
case MSR_TSC_AUX:
if (!boot_cpu_has(X86_FEATURE_RDTSCP))
@@ -2885,12 +2886,20 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr)
svm->vmcb01.ptr->save.sysenter_cs = data;
break;
case MSR_IA32_SYSENTER_EIP:
-   svm->sysenter_eip = dat

Re: [PATCH 2/2] KVM: x86/mmu: Exclude the MMU_PRESENT bit from MMIO SPTE's generation

2021-03-09 Thread Maxim Levitsky
On Tue, 2021-03-09 at 14:12 +0100, Paolo Bonzini wrote:
> On 09/03/21 11:09, Maxim Levitsky wrote:
> > What happens if mmio generation overflows (e.g if userspace keeps on 
> > updating the memslots)?
> > In theory if we have a SPTE with a stale generation, it can became valid, 
> > no?
> > 
> > I think that we should in the case of the overflow zap all mmio sptes.
> > What do you think?
> 
> Zapping all MMIO SPTEs is done by updating the generation count.  When 
> it overflows, all SPs are zapped:
> 
>  /*
>   * The very rare case: if the MMIO generation number has wrapped,
>   * zap all shadow pages.
>   */
>  if (unlikely(gen == 0)) {
>  kvm_debug_ratelimited("kvm: zapping shadow pages for 
> mmio generation wraparound\n");
>  kvm_mmu_zap_all_fast(kvm);
>  }
> 
> So giving it more bits make this more rare, at the same time having to 
> remove one or two bits is not the end of the world.

This is exactly what I expected to happen, I just didn't find that code.
Thanks for the explanation; it shows that I didn't study the mmio spte
code much.

Best regards,
Maxim Levitsky

> 
> Paolo
> 




Re: [PATCH 2/2] KVM: x86/mmu: Exclude the MMU_PRESENT bit from MMIO SPTE's generation

2021-03-09 Thread Maxim Levitsky
On Mon, 2021-03-08 at 18:19 -0800, Sean Christopherson wrote:
> Drop bit 11, used for the MMU_PRESENT flag, from the set of bits used to
> store the generation number in MMIO SPTEs.  MMIO SPTEs with bit 11 set,
> which occurs when userspace creates 128+ memslots in an address space,
> get false positives for is_shadow_present_spte(), which lead to a variety
> of fireworks, crashes KVM, and likely hangs the host kernel.
> 
> Fixes: b14e28f37e9b ("KVM: x86/mmu: Use a dedicated bit to track 
> shadow/MMU-present SPTEs")
> Reported-by: Tom Lendacky 
> Reported-by: Paolo Bonzini 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/mmu/spte.h | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index b53036d9ddf3..bca0ba11cccf 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -101,11 +101,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & 
> SHADOW_ACC_TRACK_SAVED_MASK));
>  #undef SHADOW_ACC_TRACK_SAVED_MASK
>  
>  /*
> - * Due to limited space in PTEs, the MMIO generation is a 20 bit subset of
> + * Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
>   * the memslots generation and is derived as follows:
>   *
> - * Bits 0-8 of the MMIO generation are propagated to spte bits 3-11
> - * Bits 9-19 of the MMIO generation are propagated to spte bits 52-62
> + * Bits 0-7 of the MMIO generation are propagated to spte bits 3-10
> + * Bits 8-18 of the MMIO generation are propagated to spte bits 52-62
>   *
>   * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included 
> in
>   * the MMIO generation number, as doing so would require stealing a bit from
> @@ -116,7 +116,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & 
> SHADOW_ACC_TRACK_SAVED_MASK));
>   */
>  
>  #define MMIO_SPTE_GEN_LOW_START  3
> -#define MMIO_SPTE_GEN_LOW_END11
> +#define MMIO_SPTE_GEN_LOW_END10
>  
>  #define MMIO_SPTE_GEN_HIGH_START 52
>  #define MMIO_SPTE_GEN_HIGH_END   62
> @@ -125,12 +125,14 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & 
> SHADOW_ACC_TRACK_SAVED_MASK));
>   MMIO_SPTE_GEN_LOW_START)
>  #define MMIO_SPTE_GEN_HIGH_MASK  
> GENMASK_ULL(MMIO_SPTE_GEN_HIGH_END, \
>   MMIO_SPTE_GEN_HIGH_START)
> +static_assert(!(SPTE_MMU_PRESENT_MASK &
> + (MMIO_SPTE_GEN_LOW_MASK | MMIO_SPTE_GEN_HIGH_MASK)));
>  
>  #define MMIO_SPTE_GEN_LOW_BITS   (MMIO_SPTE_GEN_LOW_END - 
> MMIO_SPTE_GEN_LOW_START + 1)
>  #define MMIO_SPTE_GEN_HIGH_BITS  (MMIO_SPTE_GEN_HIGH_END - 
> MMIO_SPTE_GEN_HIGH_START + 1)
>  
>  /* remember to adjust the comment above as well if you change these */
> -static_assert(MMIO_SPTE_GEN_LOW_BITS == 9 && MMIO_SPTE_GEN_HIGH_BITS == 11);
> +static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
>  
>  #define MMIO_SPTE_GEN_LOW_SHIFT  (MMIO_SPTE_GEN_LOW_START - 0)
>  #define MMIO_SPTE_GEN_HIGH_SHIFT (MMIO_SPTE_GEN_HIGH_START - 
> MMIO_SPTE_GEN_LOW_BITS)
I bisected this and I reached the same conclusion that bit 11 has to be removed 
from mmio generation mask.

Reviewed-by: Maxim Levitsky 
 
I do wonder why we need 19 (and now 18) bits for the mmio generation:

What happens if the mmio generation overflows (e.g. if userspace keeps on
updating the memslots)?
In theory, if we have a SPTE with a stale generation, it can become valid, no?

I think that we should in the case of the overflow zap all mmio sptes.
What do you think?

Best regards,
Maxim Levitsky



Re: [PATCH] KVM: SVM: Connect 'npt' module param to KVM's internal 'npt_enabled'

2021-03-08 Thread Maxim Levitsky
On Mon, 2021-03-08 at 09:18 -0800, Sean Christopherson wrote:
> On Mon, Mar 08, 2021, Maxim Levitsky wrote:
> > On Thu, 2021-03-04 at 18:16 -0800, Sean Christopherson wrote:
> > > Directly connect the 'npt' param to the 'npt_enabled' variable so that
> > > runtime adjustments to npt_enabled are reflected in sysfs.  Move the
> > > !PAE restriction to a runtime check to ensure NPT is forced off if the
> > > host is using 2-level paging, and add a comment explicitly stating why
> > > NPT requires a 64-bit kernel or a kernel with PAE enabled.
> > 
> > Let me ask a small question for a personal itch.
> > 
> > Do you think it is feasable to allow the user to enable npt/ept per guest?
> > (the default should still of course come from npt module parameter)
> 
> Feasible, yes.  Worth the extra maintenance, probably not.  It's a niche use
> case, and only viable if you have a priori knowledge of the guest being run.
> I doubt there are more than a few people in the world that meet those 
> criteria,
> and want to run multiple VMs, and also care deeply about the performance
> degregation of the other VMs.
I understand.
On one of these weekends when I am bored I'll probably implement it anyway,
and post it upstream. I don't count on getting this merged.

It's just that I often run VMs which I don't want to stop, and sometimes I want
to boot my retro VM, which finally works, but for that I need
to reload KVM and disable NPT.

> 
> > This weekend I checked it a bit and I think that it shouldn't be hard
> > to do.
> > 
> > There are some old and broken OSes which can't work with npt=1
> > https://blog.stuffedcow.net/2015/08/win9x-tlb-invalidation-bug/
> > https://blog.stuffedcow.net/2015/08/pagewalk-coherence/
> > 
> > I won't be surprised if some other old OSes
> > are affected by this as well knowing from the above 
> > that on Intel the MMU speculates less and doesn't
> > break their assumptions up to today.
> > (This is tested to be true on my Kabylake laptop)
> 
> Heh, I would be quite surprised if Intel CPUs speculate less.  I wouldn't be
> surprised if the old Windows behavior got grandfathered into Intel CPUs 
> because
> the buggy behavior worked on old CPUs and so must continue to work on new 
> CPUs.

Yes, this sounds exactly like what happened. Besides, we might not care, but
other hypervisors are often sold as a means to run very old software, and that
includes very old operating systems.
So Intel might have kept this working for that reason as well,
while AMD didn't have time to care about an obvious OS bug which is
even given as an example of what not to do in the manual.

> 
> > In addition to that, on semi-unrelated note,
> > our shadowing MMU also shows up the exact same issue since it
> > also caches translations in form of unsync MMU pages.
> > 
> > But I can (and did disable) this using a hack (see below)
> > and this finally made my win98 "hobby" guest actually work fine 
> > on AMD for me.
> > 
> > I am also thinking to make this "sync" mmu mode to be 
> > another module param (this can also be useful for debug,
> > see below)
> > What do you think?
> > 
> > On yet another semi-unrelated note,
> > A "sync" mmu mode affects another bug I am tracking,
> > but I don't yet understand why:
> > 
> > I found out that while windows 10 doesn't boot at all with 
> > disabled tdp on the host (npt/ept - I tested both) 
> >  the "sync" mmu mode does make it work.
> 
> Intel and AMD?  Or just AMD?  If both architectures fail, this definitely 
> needs
> to be debugged and fixed.  Given the lack of bug reports, most KVM users
> obviously don't care about TDP=0, but any bug in the unsync code likely 
> affects
> nested TDP as well, which far more people do care about.

Both Intel and AMD, in exactly the same way.
Win10 always fails to boot, with various blue screens, or it just
hangs. With the 'sync' mmu hack it is slow but always boots.

It even boots nested (very slowly) with TDP disabled on the host
(with my fix for booting nested guests on AMD with TDP disabled on the host).

Note that otherwise this isn't related to nesting, I just boot a regular win10
guest.
I also see this happen with several different win10 VMs.


> 
> > I was also able to reproduce a crash on Linux 
> > (but only with nested migration loop)
> 
> With or without TDP enabled?
Without TDP enabled. I was also able to reproduce this on both Intel and AMD.

For this case, since Linux does seem to boot, I ran my nested migration test,
and it is the nested guest that crashes; but this is also most likely not
related to nesting, as no TDP was enabled on the host (L0).

With the sync mmu hack I wasn't able to make anything crash (though I think I
tested this case only on AMD so far).

I also tried *a lot* of various hacks to the mmu code
(like avoiding any prefetching, syncing everything on every cr0/cr3/cr4 write,
flushing the real TLB on each guest entry, and stuff like that), and nothing
seems to help.


Best regards,
Maxim Levitsky

> 




Re: [PATCH] KVM: SVM: Connect 'npt' module param to KVM's internal 'npt_enabled'

2021-03-08 Thread Maxim Levitsky
On Thu, 2021-03-04 at 18:16 -0800, Sean Christopherson wrote:
> Directly connect the 'npt' param to the 'npt_enabled' variable so that
> runtime adjustments to npt_enabled are reflected in sysfs.  Move the
> !PAE restriction to a runtime check to ensure NPT is forced off if the
> host is using 2-level paging, and add a comment explicitly stating why
> NPT requires a 64-bit kernel or a kernel with PAE enabled.

Let me ask a small question for a personal itch.

Do you think it is feasible to allow the user to enable npt/ept per guest?
(The default should still of course come from the npt module parameter.)

This weekend I checked it a bit and I think that it shouldn't be hard
to do.
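
Just to illustrate what I mean, a per-VM knob could look roughly like the
hypothetical sketch below (KVM_CAP_TDP_CONTROL is a made-up name and number,
no such capability exists; only KVM_ENABLE_CAP and struct kvm_enable_cap are
real):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* made-up capability number, for illustration only */
#define KVM_CAP_TDP_CONTROL 9999

/* hypothetical: ask KVM to use shadow paging (no NPT/EPT) for this VM only,
 * presumably before any vcpus are created */
static int disable_tdp_for_vm(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap  = KVM_CAP_TDP_CONTROL,	/* hypothetical */
		.args = { 0 },			/* 0 = shadow paging */
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}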

There are some old and broken OSes which can't work with npt=1
https://blog.stuffedcow.net/2015/08/win9x-tlb-invalidation-bug/
https://blog.stuffedcow.net/2015/08/pagewalk-coherence/

I won't be surprised if some other old OSes
are affected by this as well, knowing from the above
that on Intel the MMU speculates less and doesn't
break their assumptions, even today.
(This is tested to be true on my Kabylake laptop.)

In addition to that, on a semi-unrelated note,
our shadow MMU also shows the exact same issue, since it
also caches translations in the form of unsync MMU pages.

But I can disable this (and did) using a hack (see below),
and this finally made my win98 "hobby" guest actually work fine
on AMD for me.

I am also thinking of making this "sync" mmu mode
another module param (it can also be useful for debugging,
see below).
What do you think?

On yet another semi-unrelated note,
the "sync" mmu mode affects another bug I am tracking,
but I don't yet understand why:

I found out that while windows 10 doesn't boot at all with
tdp disabled on the host (npt/ept - I tested both),
the "sync" mmu mode does make it work.

I was also able to reproduce a crash on Linux
(but only with a nested migration loop),
without the "sync" mmu mode and without npt on the host.
With the "sync" mmu mode it passed an overnight test of more
than 1000 iterations.

For reference this is my "sync" mmu hack:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index febe71935bb5a..1046d8c97702d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2608,7 +2608,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
}
 
set_spte_ret = set_spte(vcpu, sptep, pte_access, level, gfn, pfn,
-   speculative, true, host_writable);
+   speculative, false, host_writable);
if (set_spte_ret & SET_SPTE_WRITE_PROTECTED_PT) {
if (write_fault)
ret = RET_PF_EMULATE;


It is a hack since it only happens to work because we eventually
unprotect the guest mmu pages when we detect write flooding to them.
Still, performance-wise, my win98 guest works very well with this
(with npt=0 on the host).

Best regards,
Maxim Levitsky


> 
> Opportunistically switch the param to octal permissions.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/svm/svm.c | 27 ++-
>  1 file changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 54610270f66a..0ee74321461e 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -115,13 +115,6 @@ static const struct svm_direct_access_msrs {
>   { .index = MSR_INVALID, .always = false },
>  };
>  
> -/* enable NPT for AMD64 and X86 with PAE */
> -#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
> -bool npt_enabled = true;
> -#else
> -bool npt_enabled;
> -#endif
> -
>  /*
>   * These 2 parameters are used to config the controls for Pause-Loop Exiting:
>   * pause_filter_count: On processors that support Pause filtering(indicated
> @@ -170,9 +163,12 @@ module_param(pause_filter_count_shrink, ushort, 0444);
>  static unsigned short pause_filter_count_max = 
> KVM_SVM_DEFAULT_PLE_WINDOW_MAX;
>  module_param(pause_filter_count_max, ushort, 0444);
>  
> -/* allow nested paging (virtualized MMU) for all guests */
> -static int npt = true;
> -module_param(npt, int, S_IRUGO);
> +/*
> + * Use nested page tables by default.  Note, NPT may get forced off by
> + * svm_hardware_setup() if it's unsupported by hardware or the host kernel.
> + */
> +bool npt_enabled = true;
> +module_param_named(npt, npt_enabled, bool, 0444);
>  
>  /* allow nested virtualization in KVM/SVM */
>  static int nested = true;
> @@ -988,12 +984,17 @@ static __init int svm_hardware_setup(void)
>   goto err;
>   }
>  
> + /*
> +  * KVM's MMU doesn't support using 2-level paging for itself, and thus
> +  * NPT isn't supported if the host

Re: [PATCH 3/4] KVM: x86: pending exception must be be injected even with an injected event

2021-02-25 Thread Maxim Levitsky
On Thu, 2021-02-25 at 17:05 +0100, Paolo Bonzini wrote:
> On 25/02/21 16:41, Maxim Levitsky wrote:
> > Injected events should not block a pending exception, but rather,
> > should either be lost or be delivered to the nested hypervisor as part of
> > exitintinfo/IDT_VECTORING_INFO
> > (if nested hypervisor intercepts the pending exception)
> > 
> > Signed-off-by: Maxim Levitsky 
> 
> Does this already fix some of your new test cases?

Yes, this fixes the 'interrupted' interrupt delivery test,
while patch fixes the 'interrupted' exception delivery.
Both are interrupted by an exception.

Best regards
Maxim Levitsky
> 
> Paolo
> 
> > ---
> >   arch/x86/kvm/svm/nested.c | 7 ++-
> >   arch/x86/kvm/vmx/nested.c | 9 +++--
> >   2 files changed, 13 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index 881e3954d753b..4c82abce0ea0c 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1024,7 +1024,12 @@ static int svm_check_nested_events(struct kvm_vcpu 
> > *vcpu)
> > }
> >   
> > if (vcpu->arch.exception.pending) {
> > -   if (block_nested_events)
> > +   /*
> > +* Only pending nested run can block an pending exception
> > +* Otherwise an injected NMI/interrupt should either be
> > +* lost or delivered to the nested hypervisor in EXITINTINFO
> > +* */
> > +   if (svm->nested.nested_run_pending)
> >   return -EBUSY;
> > if (!nested_exit_on_exception(svm))
> > return 0;
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index b34e284bfa62a..20ed1a351b2d9 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -3810,9 +3810,14 @@ static int vmx_check_nested_events(struct kvm_vcpu 
> > *vcpu)
> >   
> > /*
> >  * Process any exceptions that are not debug traps before MTF.
> > +*
> > +* Note that only pending nested run can block an pending exception
> > +* Otherwise an injected NMI/interrupt should either be
> > +* lost or delivered to the nested hypervisor in EXITINTINFO
> >  */
> > +
> > if (vcpu->arch.exception.pending && !vmx_pending_dbg_trap(vcpu)) {
> > -   if (block_nested_events)
> > +   if (vmx->nested.nested_run_pending)
> > return -EBUSY;
> > if (!nested_vmx_check_exception(vcpu, _qual))
> > goto no_vmexit;
> > @@ -3829,7 +3834,7 @@ static int vmx_check_nested_events(struct kvm_vcpu 
> > *vcpu)
> > }
> >   
> > if (vcpu->arch.exception.pending) {
> > -   if (block_nested_events)
> > +   if (vmx->nested.nested_run_pending)
> > return -EBUSY;
> > if (!nested_vmx_check_exception(vcpu, _qual))
> > goto no_vmexit;
> > 




Re: [PATCH 0/4] RFC/WIP: KVM: separate injected and pending exception + few more fixes

2021-02-25 Thread Maxim Levitsky
On Thu, 2021-02-25 at 17:41 +0200, Maxim Levitsky wrote:
> clone of "kernel-starship-5.11"
> 
> Maxim Levitsky (4):
>   KVM: x86: determine if an exception has an error code only when
> injecting it.
>   KVM: x86: mmu: initialize fault.async_page_fault in walk_addr_generic
>   KVM: x86: pending exception must be be injected even with an injected
> event
>   kvm: WIP separation of injected and pending exception
> 
>  arch/x86/include/asm/kvm_host.h |  23 +-
>  arch/x86/include/uapi/asm/kvm.h |  14 +-
>  arch/x86/kvm/mmu/paging_tmpl.h  |   1 +
>  arch/x86/kvm/svm/nested.c   |  57 +++--
>  arch/x86/kvm/svm/svm.c  |   8 +-
>  arch/x86/kvm/vmx/nested.c   | 109 +
>  arch/x86/kvm/vmx/vmx.c  |  14 +-
>  arch/x86/kvm/x86.c  | 377 +++-
>  arch/x86/kvm/x86.h  |   6 +-
>  include/uapi/linux/kvm.h|   1 +
>  10 files changed, 374 insertions(+), 236 deletions(-)
> 
> -- 
> 2.26.2
> 
git-publish ate the cover letter, so here it goes:


RFC/WIP: KVM: separate injected and pending exception + few more fixes

This is a result of my deep dive into why we need the special .inject_page_fault
for cases when TDP paging is disabled on the host while running nested guests.

The first 3 patches fix relatively small issues I found.
Some of them can be squashed into patch 4, assuming it is accepted.

Patch 4 is WIP and I would like to hear your feedback on it:

Basically the issue is that during delivery of one exception
we (the emulator or the mmu) can signal another exception, and if the new
exception is intercepted by the nested hypervisor, we should do a VM exit with
the former exception signaled in exitintinfo (or the equivalent
IDT_VECTORING_INFO_FIELD).

Sadly we either lose the former exception and signal a VM exit, or deliver
a #DF, since we only store either a pending or an injected exception
and we merge them in kvm_multiple_exception although we shouldn't.

Only later do we deliver the VM exit in .check_nested_events, when wrong
data is already in the pending/injected exception.

There are multiple ways to fix this, and I chose a somewhat hard, but I think
the most correct, way of dealing with it.

1. I split pending and injected exceptions in kvm_vcpu_arch, thus allowing
both to co-exist.

2. I made kvm_multiple_exception avoid merging exceptions, and instead only
set up either a pending or an injected exception
(there is another bug where we don't deliver a triple fault as a nested vm
exit, which I'll fix later).

3. I created kvm_deliver_pending_exception, whose goal is to
convert the pending exception into an injected exception, or to deliver a VM
exit with both the pending and the injected exception/interrupt/nmi.

It itself only deals with the non-vmexit cases, while it calls a new
'kvm_x86_ops.nested_ops->deliver_exception' callback to deliver the exception
VM exit if needed.

The latter's implementation is simple, as it just checks if we should VM exit
and then delivers both exceptions (or an interrupt and an exception, in case
interrupt delivery was interrupted by an exception).
This new callback returns 0 if it delivered the VM exit,
1 if no VM exit is needed (the exception is not intercepted by the nested
hypervisor), or -EBUSY when a nested run is pending,
in which case the exception delivery will be retried after the nested
run is done.
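
To make that contract a bit more concrete, here is a rough sketch of how the
caller is meant to use the new callback (illustration only, not the actual
code from patch 4):

/* illustration: turn a pending exception into either a nested VM exit
 * or an injected exception, based on the new callback's return value */
static void kvm_deliver_pending_exception(struct kvm_vcpu *vcpu)
{
	int r;

	if (!vcpu->arch.pending_exception.valid)
		return;

	r = kvm_x86_ops.nested_ops->deliver_exception(vcpu);
	if (r == -EBUSY)
		return;	/* nested run pending, retry on the next entry */

	if (r == 1) {
		/* not intercepted by L1: convert pending -> injected and let
		 * the regular injection path deliver it to the guest */
		vcpu->arch.injected_exception.valid = true;
		vcpu->arch.injected_exception.nr =
			vcpu->arch.pending_exception.nr;
		vcpu->arch.injected_exception.has_error_code =
			vcpu->arch.pending_exception.has_error_code;
		vcpu->arch.injected_exception.error_code =
			vcpu->arch.pending_exception.error_code;
		vcpu->arch.pending_exception.valid = false;
		static_call(kvm_x86_queue_exception)(vcpu);
	}
	/* r == 0: a nested VM exit carrying the exception was delivered */
}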

kvm_deliver_pending_exception is called each time we inject pending events,
and all exception-related code is removed from .check_nested_events, which now
only deals with pending interrupts and events such as INIT, NMI, SMI, etc.

A new KVM cap is added to expose both the pending and the injected exception
via KVM_GET_VCPU_EVENTS/KVM_SET_VCPU_EVENTS.

If this cap is not enabled and we have both a pending and an injected exception
when KVM_GET_VCPU_EVENTS is called, the exception is delivered.

The code was tested with SVM, and it currently seems to pass all the tests I
usually do (including nested migration). KVM unit tests seem to pass as well.

I am still almost sure that I broke something, since this is a far from
trivial change, therefore this is RFC/WIP.

Also the VMX side has not been tested yet beyond a basic compile, and I am
sure that at least a few issues remain to be fixed.

I should also note that with these patches I can boot nested guests with npt=0 
without
any changes to .inject_page_fault.

I also wrote 2 KVM unit tests: one for this issue, and one for the similar
issue where an interrupt is lost when its delivery causes an exception.
These tests pass now.

Best regards,
Maxim Levitsky




[PATCH 4/4] kvm: WIP separation of injected and pending exception

2021-02-25 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm_host.h |  23 +-
 arch/x86/include/uapi/asm/kvm.h |  14 +-
 arch/x86/kvm/svm/nested.c   |  62 +++---
 arch/x86/kvm/svm/svm.c  |   8 +-
 arch/x86/kvm/vmx/nested.c   | 114 +-
 arch/x86/kvm/vmx/vmx.c  |  14 +-
 arch/x86/kvm/x86.c  | 370 +++-
 arch/x86/kvm/x86.h  |   6 +-
 include/uapi/linux/kvm.h|   1 +
 9 files changed, 367 insertions(+), 245 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4aa48fb55361d..190e245aa6670 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -637,16 +637,22 @@ struct kvm_vcpu_arch {
 
u8 event_exit_inst_len;
 
-   struct kvm_queued_exception {
-   bool pending;
-   bool injected;
+   struct kvm_pending_exception {
+   bool valid;
bool has_error_code;
u8 nr;
u32 error_code;
unsigned long payload;
bool has_payload;
u8 nested_apf;
-   } exception;
+   } pending_exception;
+
+   struct kvm_queued_exception {
+   bool valid;
+   bool has_error_code;
+   u8 nr;
+   u32 error_code;
+   } injected_exception;
 
struct kvm_queued_interrupt {
bool injected;
@@ -1018,6 +1024,7 @@ struct kvm_arch {
 
bool guest_can_read_msr_platform_info;
bool exception_payload_enabled;
+   bool exception_separate_injected_pending;
 
/* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
u32 user_space_msr_mask;
@@ -1351,6 +1358,14 @@ struct kvm_x86_ops {
 
 struct kvm_x86_nested_ops {
int (*check_events)(struct kvm_vcpu *vcpu);
+
+   /*
+* return value: 0 - delivered vm exit, 1 - exception not intercepted,
+* negative - failure
+* */
+
+   int (*deliver_exception)(struct kvm_vcpu *vcpu);
+
bool (*hv_timer_pending)(struct kvm_vcpu *vcpu);
int (*get_state)(struct kvm_vcpu *vcpu,
 struct kvm_nested_state __user *user_kvm_nested_state,
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5a3022c8af82b..9556e420e8ecb 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -345,9 +345,17 @@ struct kvm_vcpu_events {
__u8 smm_inside_nmi;
__u8 latched_init;
} smi;
-   __u8 reserved[27];
-   __u8 exception_has_payload;
-   __u64 exception_payload;
+
+   __u8 reserved[20];
+
+   struct {
+   __u32 error_code;
+   __u8 nr;
+   __u8 pad;
+   __u8 has_error_code;
+   __u8 has_payload;
+   __u64 payload;
+   } pending_exception;
 };
 
 /* for KVM_GET/SET_DEBUGREGS */
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 4c82abce0ea0c..9df01b6e2e091 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -315,15 +315,16 @@ static void nested_save_pending_event_to_vmcb12(struct 
vcpu_svm *svm,
u32 exit_int_info = 0;
unsigned int nr;
 
-   if (vcpu->arch.exception.injected) {
-   nr = vcpu->arch.exception.nr;
+   if (vcpu->arch.injected_exception.valid) {
+   nr = vcpu->arch.injected_exception.nr;
exit_int_info = nr | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT;
 
-   if (vcpu->arch.exception.has_error_code) {
+   if (vcpu->arch.injected_exception.has_error_code) {
exit_int_info |= SVM_EVTINJ_VALID_ERR;
vmcb12->control.exit_int_info_err =
-   vcpu->arch.exception.error_code;
+   vcpu->arch.injected_exception.error_code;
}
+   vcpu->arch.injected_exception.valid = false;
 
} else if (vcpu->arch.nmi_injected) {
exit_int_info = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
@@ -923,30 +924,30 @@ int nested_svm_check_permissions(struct kvm_vcpu *vcpu)
 
 static bool nested_exit_on_exception(struct vcpu_svm *svm)
 {
-   unsigned int nr = svm->vcpu.arch.exception.nr;
+   unsigned int nr = svm->vcpu.arch.pending_exception.nr;
 
return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(nr));
 }
 
 static void nested_svm_inject_exception_vmexit(struct vcpu_svm *svm)
 {
-   unsigned int nr = svm->vcpu.arch.exception.nr;
+   unsigned int nr = svm->vcpu.arch.pending_exception.nr;
 
svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr;
svm->vmcb->control.exit_code_hi = 0;
 
-   if (svm->vcpu.arch.exception.has_error_code)
-   svm->vmcb->co

[PATCH 3/4] KVM: x86: pending exception must be be injected even with an injected event

2021-02-25 Thread Maxim Levitsky
Injected events should not block a pending exception, but rather,
should either be lost or be delivered to the nested hypervisor as part of
exitintinfo/IDT_VECTORING_INFO
(if nested hypervisor intercepts the pending exception)

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 7 ++-
 arch/x86/kvm/vmx/nested.c | 9 +++--
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 881e3954d753b..4c82abce0ea0c 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1024,7 +1024,12 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
}
 
if (vcpu->arch.exception.pending) {
-   if (block_nested_events)
+   /*
+* Only pending nested run can block an pending exception
+* Otherwise an injected NMI/interrupt should either be
+* lost or delivered to the nested hypervisor in EXITINTINFO
+* */
+   if (svm->nested.nested_run_pending)
 return -EBUSY;
if (!nested_exit_on_exception(svm))
return 0;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b34e284bfa62a..20ed1a351b2d9 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3810,9 +3810,14 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 
/*
 * Process any exceptions that are not debug traps before MTF.
+*
+* Note that only a pending nested run can block a pending exception.
+* Otherwise an injected NMI/interrupt should either be
+* lost or delivered to the nested hypervisor in EXITINTINFO.
 */
+
if (vcpu->arch.exception.pending && !vmx_pending_dbg_trap(vcpu)) {
-   if (block_nested_events)
+   if (vmx->nested.nested_run_pending)
return -EBUSY;
if (!nested_vmx_check_exception(vcpu, &exit_qual))
goto no_vmexit;
@@ -3829,7 +3834,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
}
 
if (vcpu->arch.exception.pending) {
-   if (block_nested_events)
+   if (vmx->nested.nested_run_pending)
return -EBUSY;
if (!nested_vmx_check_exception(vcpu, &exit_qual))
goto no_vmexit;
-- 
2.26.2



[PATCH 2/4] KVM: x86: mmu: initialize fault.async_page_fault in walk_addr_generic

2021-02-25 Thread Maxim Levitsky
This field was left uninitialized by mistake.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/mmu/paging_tmpl.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index d9f66cc459e84..3dc9a25772bd8 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -503,6 +503,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker 
*walker,
 #endif
walker->fault.address = addr;
walker->fault.nested_page_fault = mmu != vcpu->arch.walk_mmu;
+   walker->fault.async_page_fault = false;
 
trace_kvm_mmu_walker_error(walker->fault.error_code);
return 0;
-- 
2.26.2



[PATCH 1/4] KVM: x86: determine if an exception has an error code only when injecting it.

2021-02-25 Thread Maxim Levitsky
A page fault can be queued while the vCPU is in real mode on AMD, and the
AMD manual asks the user to always intercept it
(otherwise the result is undefined).
The resulting VM exit does have an error code.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/x86.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9fa0c7ff6e2fb..a9d814a0b5e4f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -544,8 +544,6 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 
if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
queue:
-   if (has_error && !is_protmode(vcpu))
-   has_error = false;
if (reinject) {
/*
 * On vmentry, vcpu->arch.exception.pending is only
@@ -8345,6 +8343,13 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu)
static_call(kvm_x86_update_cr8_intercept)(vcpu, tpr, max_irr);
 }
 
+static void kvm_inject_exception(struct kvm_vcpu *vcpu)
+{
+   if (vcpu->arch.exception.error_code && !is_protmode(vcpu))
+   vcpu->arch.exception.error_code = false;
+   static_call(kvm_x86_queue_exception)(vcpu);
+}
+
 static void inject_pending_event(struct kvm_vcpu *vcpu, bool 
*req_immediate_exit)
 {
int r;
@@ -8353,7 +8358,7 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, 
bool *req_immediate_exit
/* try to reinject previous events if any */
 
if (vcpu->arch.exception.injected) {
-   static_call(kvm_x86_queue_exception)(vcpu);
+   kvm_inject_exception(vcpu);
can_inject = false;
}
/*
@@ -8416,7 +8421,7 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, 
bool *req_immediate_exit
}
}
 
-   static_call(kvm_x86_queue_exception)(vcpu);
+   kvm_inject_exception(vcpu);
can_inject = false;
}
 
-- 
2.26.2



[PATCH 0/4] RFC/WIP: KVM: separate injected and pending exception + few more fixes

2021-02-25 Thread Maxim Levitsky
clone of "kernel-starship-5.11"

Maxim Levitsky (4):
  KVM: x86: determine if an exception has an error code only when
injecting it.
  KVM: x86: mmu: initialize fault.async_page_fault in walk_addr_generic
  KVM: x86: pending exception must be injected even with an injected
event
  kvm: WIP separation of injected and pending exception

 arch/x86/include/asm/kvm_host.h |  23 +-
 arch/x86/include/uapi/asm/kvm.h |  14 +-
 arch/x86/kvm/mmu/paging_tmpl.h  |   1 +
 arch/x86/kvm/svm/nested.c   |  57 +++--
 arch/x86/kvm/svm/svm.c  |   8 +-
 arch/x86/kvm/vmx/nested.c   | 109 +
 arch/x86/kvm/vmx/vmx.c  |  14 +-
 arch/x86/kvm/x86.c  | 377 +++-
 arch/x86/kvm/x86.h  |   6 +-
 include/uapi/linux/kvm.h|   1 +
 10 files changed, 374 insertions(+), 236 deletions(-)

-- 
2.26.2




Re: [PATCH 4/7] KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init

2021-02-17 Thread Maxim Levitsky
On Wed, 2021-02-17 at 18:37 +0100, Paolo Bonzini wrote:
> On 17/02/21 18:29, Sean Christopherson wrote:
> > All that being said, I'm pretty we can eliminate setting 
> > inject_page_fault dynamically. I think that would yield more 
> > maintainable code. Following these flows is a nightmare. The change 
> > itself will be scarier, but I'm pretty sure the end result will be a lot 
> > cleaner.

I agree with that.

> 
> I had a similar reaction, though my proposal was different.
> 
> The only thing we're changing in complete_mmu_init is the page fault 
> callback for init_kvm_softmmu, so couldn't that be the callback directly 
> (i.e. something like context->inject_page_fault = 
> kvm_x86_ops.inject_softmmu_page_fault)?  And then adding is_guest_mode 
> to the conditional that is already in vmx_inject_page_fault_nested and 
> svm_inject_page_fault_nested.

I was thinking about this as well; I tried to make as simple a solution
as possible that doesn't make things worse.
> 
> That said, I'm also rusty on _why_ this code is needed.  Why isn't it 
> enough to inject the exception normally, and let 
> nested_vmx_check_exception decide whether to inject a vmexit to L1 or an 
> exception into L2?
> 
> Also, bonus question which should have been in the 5/7 changelog: are 
> there kvm-unit-tests testcases that fail with npt=0, and if not could we 
> write one?  [Answer: the mode_switch testcase fails, but I haven't 
> checked why].

I agree with all of this. I'll see why this code is needed (it is needed,
since I once removed it accidentally on VMX, and it broke nesting with ept=0
in exactly the same way as it was broken on AMD).

I'll debug this a bit to see if I can make it work as you suggest.


Best regards,
Maxim Levitsky
> 
> 
> Paolo
> 




Re: [PATCH 4/7] KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init

2021-02-17 Thread Maxim Levitsky
On Wed, 2021-02-17 at 09:29 -0800, Sean Christopherson wrote:
> On Wed, Feb 17, 2021, Maxim Levitsky wrote:
> > This fixes a (mostly theoretical) bug which can happen if ept=0
> > on host and we run a nested guest which triggers a mmu context
> > reset while running nested.
> > In this case the .inject_page_fault callback will be lost.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  arch/x86/kvm/vmx/nested.c | 8 +---
> >  arch/x86/kvm/vmx/nested.h | 1 +
> >  arch/x86/kvm/vmx/vmx.c| 5 -
> >  3 files changed, 6 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 0b6dab6915a3..f9de729dbea6 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -419,7 +419,7 @@ static int nested_vmx_check_exception(struct kvm_vcpu 
> > *vcpu, unsigned long *exit
> >  }
> >  
> >  
> > -static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
> > +void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
> > struct x86_exception *fault)
> >  {
> > struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > @@ -2620,9 +2620,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
> > struct vmcs12 *vmcs12,
> > vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
> > }
> >  
> > -   if (!enable_ept)
> > -   vcpu->arch.walk_mmu->inject_page_fault = 
> > vmx_inject_page_fault_nested;
> > -
> > if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
> > WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
> >  vmcs12->guest_ia32_perf_global_ctrl)))
> > @@ -4224,9 +4221,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu 
> > *vcpu,
> > if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &ignored))
> > nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> >  
> > -   if (!enable_ept)
> > -   vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> 
> Oof, please explicitly call out these types of side effects in the changelog,
> it took me a while to piece together that this can be dropped because a MMU
> reset is guaranteed and is also guaranteed to restore inject_page_fault.
> 
> I would even go so far as to say this particular line of code should be 
> removed
> in a separate commit.  Unless I'm overlooking something, this code is
> effectively a nop, which means it doesn't need to be removed to make the bug 
> fix
> functionally correct.
> 
> All that being said, I'm pretty we can eliminate setting inject_page_fault
> dynamically.  I think that would yield more maintainable code.  Following 
> these
> flows is a nightmare.  The change itself will be scarier, but I'm pretty sure
> the end result will be a lot cleaner.
> 
> And I believe there's also a second bug that would be fixed by such an 
> approach.
> Doesn't vmx_inject_page_fault_nested() need to be used for the nested_mmu when
> ept=1?  E.g. if the emulator injects a #PF to L2, L1 should still be able to
> intercept the #PF even if L1 is using EPT.  This likely hasn't been noticed
> because hypervisors typically don't intercept #PF when EPT is enabled.

Let me explain what I know about this:
 
There are basically 3 cases:
 
1. npt/ept disabled in the host. In this case we have a single level of
shadowing, and a nested hypervisor has to do its own shadowing on top of it.
Here the MMU itself has to generate page faults (they are a result
of hardware page faults, but are completely different), and in case
of nesting these page faults sometimes have to be injected as VM exits.
 
2. npt/ept enabled on host and disabled in guest.
In this case we don't need to shadow anything, while the nested hypervisor
does need to do shadowing to run its guest.
In this case it is in fact likely that L1 intercepts the page faults;
however, they are just reflected to it as-is, which is what nested_svm_exit_special
does (it does have a special case for async page faults which I need to
investigate).

This is where the bug that you mention can happen. I haven't checked how VMX
reflects the page faults to the nested guest either.
 
3. (the common case) npt/ept are enabled on both host and guest.
In this case walk_mmu is used for all the page faults, and it is actually
tweaked in a similar way (see nested_svm_init_mmu_context for example).


Also, if the emulator injects the page fault, then indeed I think the
bug will happen.
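
To make the three cases a bit more concrete, below is a rough,
illustrative-only sketch of how a guest page fault ends up being delivered
in each case. This is not KVM code; every function name in it is invented
for the example.

/*
 * Illustrative-only summary of the three cases above.
 */
static void route_guest_page_fault(bool host_tdp, bool guest_tdp,
				   bool l2_running, bool l1_intercepts_pf)
{
	if (!host_tdp) {
		/*
		 * Case 1: KVM's shadow MMU synthesizes the #PF itself;
		 * while L2 runs it may have to become a #PF intercept
		 * vmexit to L1 instead of an injected exception.
		 */
		if (l2_running && l1_intercepts_pf)
			reflect_pf_as_vmexit_to_l1();
		else
			inject_pf_into_current_guest();
	} else if (!guest_tdp) {
		/*
		 * Case 2: the host uses NPT/EPT and L1 shadows L2 itself;
		 * hardware #PF intercepts taken while L2 runs are
		 * reflected to L1 as-is.
		 */
		reflect_hw_exit_to_l1();
	} else {
		/*
		 * Case 3: both levels use NPT/EPT; walk_mmu walks the
		 * guest page tables and guest-physical faults become
		 * NPF/EPT-violation vmexits.
		 */
		handle_nested_page_fault();
	}
}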


Best regards and thanks for the feedback,
Maxim Levitsky

> 
> Something like this (very incomplete):
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b

Re: [PATCH 1/7] KVM: VMX: read idt_vectoring_info a bit earlier

2021-02-17 Thread Maxim Levitsky
On Wed, 2021-02-17 at 17:06 +0100, Paolo Bonzini wrote:
> On 17/02/21 15:57, Maxim Levitsky wrote:
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index b3e36dc3f164..e428d69e21c0 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6921,13 +6921,15 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu 
> > *vcpu)
> > if (unlikely((u16)vmx->exit_reason.basic == 
> > EXIT_REASON_MCE_DURING_VMENTRY))
> > kvm_machine_check();
> >  
> > +   if (likely(!vmx->exit_reason.failed_vmentry))
> > +   vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> > +
> 
> Any reason for the if?

Sean Christopherson asked me to do this to avoid updating idt_vectoring_info on 
failed
VM entry, to keep things as they were logically before this patch.

Best regards,
Maxim Levitsky

> 
> Paolo
> 




Re: [PATCH 5/7] KVM: nSVM: fix running nested guests when npt=0

2021-02-17 Thread Maxim Levitsky
On Wed, 2021-02-17 at 16:57 +0200, Maxim Levitsky wrote:
> In case of npt=0 on host,
> nSVM needs the same .inject_page_fault tweak as VMX has,
> to make sure that shadow mmu faults are injected as vmexits.
> 
> Signed-off-by: Maxim Levitsky 
> ---
>  arch/x86/kvm/svm/nested.c | 18 ++
>  arch/x86/kvm/svm/svm.c|  5 -
>  arch/x86/kvm/svm/svm.h|  1 +
>  3 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 1bc31e2e8fe0..53b9037259b5 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -53,6 +53,23 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu 
> *vcpu,
>   nested_svm_vmexit(svm);
>  }
>  
> +void svm_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct 
> x86_exception *fault)
> +{
> +   struct vcpu_svm *svm = to_svm(vcpu);
> +   WARN_ON(!is_guest_mode(vcpu));
> +
> + if (vmcb_is_intercept(&svm->nested.ctl, INTERCEPT_EXCEPTION_OFFSET + 
> PF_VECTOR) &&
> +!svm->nested.nested_run_pending) {
> +   svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + PF_VECTOR;
> +   svm->vmcb->control.exit_code_hi = 0;
> +   svm->vmcb->control.exit_info_1 = fault->error_code;
> +   svm->vmcb->control.exit_info_2 = fault->address;
> +   nested_svm_vmexit(svm);
> +   } else {
> +   kvm_inject_page_fault(vcpu, fault);
> +   }
> +}
> +
>  static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
>  {
>   struct vcpu_svm *svm = to_svm(vcpu);
> @@ -531,6 +548,7 @@ int enter_svm_guest_mode(struct vcpu_svm *svm, u64 
> vmcb12_gpa,
>   if (ret)
>   return ret;
>  
> +
Sorry for this whitespace change.
Best regards,
Maxim Levitsky
>   svm_set_gif(svm, true);
>  
>   return 0;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 74a334c9902a..59e1767df030 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3915,7 +3915,10 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, 
> unsigned long root,
>  
>  static void svm_complete_mmu_init(struct kvm_vcpu *vcpu)
>  {
> -
> + if (!npt_enabled && is_guest_mode(vcpu)) {
> + WARN_ON(mmu_is_nested(vcpu));
> + vcpu->arch.mmu->inject_page_fault = 
> svm_inject_page_fault_nested;
> + }
>  }
>  
>  static int is_disabled(void)
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index 7b6ca0e49a14..fda80d56c6e3 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -437,6 +437,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm 
> *svm)
>   return vmcb_is_intercept(&svm->nested.ctl, INTERCEPT_NMI);
>  }
>  
> +void svm_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct 
> x86_exception *fault);
>  int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, struct vmcb 
> *vmcb12);
>  void svm_leave_nested(struct vcpu_svm *svm);
>  void svm_free_nested(struct vcpu_svm *svm);




[PATCH 5/7] KVM: nSVM: fix running nested guests when npt=0

2021-02-17 Thread Maxim Levitsky
In case of npt=0 on host,
nSVM needs the same .inject_page_fault tweak as VMX has,
to make sure that shadow mmu faults are injected as vmexits.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 18 ++
 arch/x86/kvm/svm/svm.c|  5 -
 arch/x86/kvm/svm/svm.h|  1 +
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 1bc31e2e8fe0..53b9037259b5 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -53,6 +53,23 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
nested_svm_vmexit(svm);
 }
 
+void svm_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct x86_exception 
*fault)
+{
+   struct vcpu_svm *svm = to_svm(vcpu);
+   WARN_ON(!is_guest_mode(vcpu));
+
+   if (vmcb_is_intercept(&svm->nested.ctl, INTERCEPT_EXCEPTION_OFFSET + 
PF_VECTOR) &&
+  !svm->nested.nested_run_pending) {
+   svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + PF_VECTOR;
+   svm->vmcb->control.exit_code_hi = 0;
+   svm->vmcb->control.exit_info_1 = fault->error_code;
+   svm->vmcb->control.exit_info_2 = fault->address;
+   nested_svm_vmexit(svm);
+   } else {
+   kvm_inject_page_fault(vcpu, fault);
+   }
+}
+
 static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
 {
struct vcpu_svm *svm = to_svm(vcpu);
@@ -531,6 +548,7 @@ int enter_svm_guest_mode(struct vcpu_svm *svm, u64 
vmcb12_gpa,
if (ret)
return ret;
 
+
svm_set_gif(svm, true);
 
return 0;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 74a334c9902a..59e1767df030 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3915,7 +3915,10 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, 
unsigned long root,
 
 static void svm_complete_mmu_init(struct kvm_vcpu *vcpu)
 {
-
+   if (!npt_enabled && is_guest_mode(vcpu)) {
+   WARN_ON(mmu_is_nested(vcpu));
+   vcpu->arch.mmu->inject_page_fault = 
svm_inject_page_fault_nested;
+   }
 }
 
 static int is_disabled(void)
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 7b6ca0e49a14..fda80d56c6e3 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -437,6 +437,7 @@ static inline bool nested_exit_on_nmi(struct vcpu_svm *svm)
return vmcb_is_intercept(&svm->nested.ctl, INTERCEPT_NMI);
 }
 
+void svm_inject_page_fault_nested(struct kvm_vcpu *vcpu, struct x86_exception 
*fault);
 int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa, struct vmcb 
*vmcb12);
 void svm_leave_nested(struct vcpu_svm *svm);
 void svm_free_nested(struct vcpu_svm *svm);
-- 
2.26.2



[PATCH 6/7] KVM: nVMX: don't load PDPTRS right after nested state set

2021-02-17 Thread Maxim Levitsky
Just like all other nested memory accesses, after a migration loading the
PDPTRs should be delayed until the first VM entry to ensure
that guest memory is fully initialized.

Just move the call to nested_vmx_load_cr3 to nested_get_vmcs12_pages
to implement this.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/vmx/nested.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index f9de729dbea6..26084f8eee82 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2596,11 +2596,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12,
return -EINVAL;
}
 
-   /* Shadow page tables on either EPT or shadow page tables. */
-   if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, 
nested_cpu_has_ept(vmcs12),
-   entry_failure_code))
-   return -EINVAL;
-
/*
 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
 * on nested VM-Exit, which can occur without actually running L2 and
@@ -3138,11 +3133,16 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
 static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu)
 {
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+   enum vm_entry_failure_code entry_failure_code;
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct kvm_host_map *map;
struct page *page;
u64 hpa;
 
+   if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, 
nested_cpu_has_ept(vmcs12),
+   &entry_failure_code))
+   return false;
+
if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
/*
 * Translate L1 physical address to host physical
@@ -3386,6 +3386,10 @@ enum nvmx_vmentry_status 
nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
}
 
if (from_vmentry) {
+   if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3,
+   nested_cpu_has_ept(vmcs12), &entry_failure_code))
+   goto vmentry_fail_vmexit_guest_mode;
+
failed_index = nested_vmx_load_msr(vcpu,
   
vmcs12->vm_entry_msr_load_addr,
   
vmcs12->vm_entry_msr_load_count);
-- 
2.26.2



[PATCH 7/7] KVM: nSVM: call nested_svm_load_cr3 on nested state load

2021-02-17 Thread Maxim Levitsky
While KVM's MMU should be fully reset by loading of nested CR0/CR3/CR4
by KVM_SET_SREGS, we are not in nested mode yet when we do it and therefore
only root_mmu is reset.

On regular nested entries we call nested_svm_load_cr3, which both updates the
guest's CR3 in the MMU when needed and initializes the MMU again, which also
initializes the walk_mmu when nested paging is enabled in both host and guest.

Since we don't call nested_svm_load_cr3 on nested state load,
the walk_mmu can be left uninitialized, which can lead to a NULL pointer
dereference while accessing it if we happen to get a nested page fault
right after entering the nested guest for the first time after migration and
decide to emulate it.
This makes the emulator access a NULL walk_mmu->gva_to_gpa.

Therefore we should call this function on nested state load as well.

Suggested-by: Paolo Bonzini 
Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 40 +--
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 53b9037259b5..ebc7dfaa9f13 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -215,24 +215,6 @@ static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
return true;
 }
 
-static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
-{
-   struct vcpu_svm *svm = to_svm(vcpu);
-
-   if (WARN_ON(!is_guest_mode(vcpu)))
-   return true;
-
-   if (!nested_svm_vmrun_msrpm(svm)) {
-   vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
-   vcpu->run->internal.suberror =
-   KVM_INTERNAL_ERROR_EMULATION;
-   vcpu->run->internal.ndata = 0;
-   return false;
-   }
-
-   return true;
-}
-
 static bool nested_vmcb_check_controls(struct vmcb_control_area *control)
 {
if (CC(!vmcb_is_intercept(control, INTERCEPT_VMRUN)))
@@ -1311,6 +1293,28 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
return ret;
 }
 
+static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_svm *svm = to_svm(vcpu);
+
+   if (WARN_ON(!is_guest_mode(vcpu)))
+   return true;
+
+   if (nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
+   nested_npt_enabled(svm)))
+   return false;
+
+   if (!nested_svm_vmrun_msrpm(svm)) {
+   vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+   vcpu->run->internal.suberror =
+   KVM_INTERNAL_ERROR_EMULATION;
+   vcpu->run->internal.ndata = 0;
+   return false;
+   }
+
+   return true;
+}
+
 struct kvm_x86_nested_ops svm_nested_ops = {
.check_events = svm_check_nested_events,
.get_nested_state_pages = svm_get_nested_state_pages,
-- 
2.26.2



[PATCH 2/7] KVM: nSVM: move nested vmrun tracepoint to enter_svm_guest_mode

2021-02-17 Thread Maxim Levitsky
This way the trace will capture all nested mode entries
(including entries after migration and from SMM).

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 519fe84f2100..1bc31e2e8fe0 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -500,6 +500,20 @@ int enter_svm_guest_mode(struct vcpu_svm *svm, u64 
vmcb12_gpa,
 {
int ret;
 
+   trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb12_gpa,
+  vmcb12->save.rip,
+  vmcb12->control.int_ctl,
+  vmcb12->control.event_inj,
+  vmcb12->control.nested_ctl);
+
+   trace_kvm_nested_intercepts(vmcb12->control.intercepts[INTERCEPT_CR] & 
0xffff,
+   vmcb12->control.intercepts[INTERCEPT_CR] >> 
16,
+   
vmcb12->control.intercepts[INTERCEPT_EXCEPTION],
+   vmcb12->control.intercepts[INTERCEPT_WORD3],
+   vmcb12->control.intercepts[INTERCEPT_WORD4],
+   
vmcb12->control.intercepts[INTERCEPT_WORD5]);
+
+
svm->nested.vmcb12_gpa = vmcb12_gpa;
 
WARN_ON(svm->vmcb == svm->nested.vmcb02.ptr);
@@ -559,18 +573,6 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
goto out;
}
 
-   trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb12_gpa,
-  vmcb12->save.rip,
-  vmcb12->control.int_ctl,
-  vmcb12->control.event_inj,
-  vmcb12->control.nested_ctl);
-
-   trace_kvm_nested_intercepts(vmcb12->control.intercepts[INTERCEPT_CR] & 
0xffff,
-   vmcb12->control.intercepts[INTERCEPT_CR] >> 
16,
-   
vmcb12->control.intercepts[INTERCEPT_EXCEPTION],
-   vmcb12->control.intercepts[INTERCEPT_WORD3],
-   vmcb12->control.intercepts[INTERCEPT_WORD4],
-   
vmcb12->control.intercepts[INTERCEPT_WORD5]);
 
/* Clear internal status */
kvm_clear_exception_queue(vcpu);
-- 
2.26.2



[PATCH 4/7] KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init

2021-02-17 Thread Maxim Levitsky
This fixes a (mostly theoretical) bug which can happen if ept=0
on the host and we run a nested guest which triggers an MMU context
reset while running nested.
In this case the .inject_page_fault callback will be lost.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/vmx/nested.c | 8 +---
 arch/x86/kvm/vmx/nested.h | 1 +
 arch/x86/kvm/vmx/vmx.c| 5 -
 3 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 0b6dab6915a3..f9de729dbea6 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -419,7 +419,7 @@ static int nested_vmx_check_exception(struct kvm_vcpu 
*vcpu, unsigned long *exit
 }
 
 
-static void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
+void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,
struct x86_exception *fault)
 {
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
@@ -2620,9 +2620,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12,
vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
}
 
-   if (!enable_ept)
-   vcpu->arch.walk_mmu->inject_page_fault = 
vmx_inject_page_fault_nested;
-
if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) &&
WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL,
 vmcs12->guest_ia32_perf_global_ctrl)))
@@ -4224,9 +4221,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &ignored))
nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
 
-   if (!enable_ept)
-   vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
-
nested_vmx_transition_tlb_flush(vcpu, vmcs12, false);
 
vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index 197148d76b8f..2ab279744d38 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -36,6 +36,7 @@ void nested_vmx_pmu_entry_exit_ctls_update(struct kvm_vcpu 
*vcpu);
 void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu);
 bool nested_vmx_check_io_bitmaps(struct kvm_vcpu *vcpu, unsigned int port,
 int size);
+void vmx_inject_page_fault_nested(struct kvm_vcpu *vcpu,struct x86_exception 
*fault);
 
 static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bf6ef674d688..c43324df4877 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3254,7 +3254,10 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, 
unsigned long pgd,
 
 static void vmx_complete_mmu_init(struct kvm_vcpu *vcpu)
 {
-
+   if (!enable_ept && is_guest_mode(vcpu)) {
+   WARN_ON(mmu_is_nested(vcpu));
+   vcpu->arch.mmu->inject_page_fault = 
vmx_inject_page_fault_nested;
+   }
 }
 
 static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
-- 
2.26.2



[PATCH 3/7] KVM: x86: add .complete_mmu_init arch callback

2021-02-17 Thread Maxim Levitsky
This callback will be used to tweak the mmu context
in arch specific code after it was reset.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/include/asm/kvm-x86-ops.h | 1 +
 arch/x86/include/asm/kvm_host.h| 2 ++
 arch/x86/kvm/mmu/mmu.c | 2 ++
 arch/x86/kvm/svm/svm.c | 6 ++
 arch/x86/kvm/vmx/vmx.c | 6 ++
 5 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h 
b/arch/x86/include/asm/kvm-x86-ops.h
index 355a2ab8fc09..041e5765dc67 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -86,6 +86,7 @@ KVM_X86_OP(set_tss_addr)
 KVM_X86_OP(set_identity_map_addr)
 KVM_X86_OP(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP(complete_mmu_init)
 KVM_X86_OP_NULL(has_wbinvd_exit)
 KVM_X86_OP(write_l1_tsc_offset)
 KVM_X86_OP(get_exit_info)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a8e1b57b1532..01a08f936781 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1251,6 +1251,8 @@ struct kvm_x86_ops {
void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, unsigned long pgd,
 int pgd_level);
 
+   void (*complete_mmu_init) (struct kvm_vcpu *vcpu);
+
bool (*has_wbinvd_exit)(void);
 
/* Returns actual tsc_offset set in active VMCS */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e507568cd55d..00bf9ff2e469 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4774,6 +4774,8 @@ void kvm_init_mmu(struct kvm_vcpu *vcpu, bool reset_roots)
init_kvm_tdp_mmu(vcpu);
else
init_kvm_softmmu(vcpu);
+
+   static_call(kvm_x86_complete_mmu_init)(vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_init_mmu);
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 754e07538b4a..74a334c9902a 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3913,6 +3913,11 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, 
unsigned long root,
vmcb_mark_dirty(svm->vmcb, VMCB_CR);
 }
 
+static void svm_complete_mmu_init(struct kvm_vcpu *vcpu)
+{
+
+}
+
 static int is_disabled(void)
 {
u64 vm_cr;
@@ -4522,6 +4527,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.write_l1_tsc_offset = svm_write_l1_tsc_offset,
 
.load_mmu_pgd = svm_load_mmu_pgd,
+   .complete_mmu_init = svm_complete_mmu_init,
 
.check_intercept = svm_check_intercept,
.handle_exit_irqoff = svm_handle_exit_irqoff,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index e428d69e21c0..bf6ef674d688 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3252,6 +3252,11 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, 
unsigned long pgd,
vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
+static void vmx_complete_mmu_init(struct kvm_vcpu *vcpu)
+{
+
+}
+
 static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
/*
@@ -7849,6 +7854,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.write_l1_tsc_offset = vmx_write_l1_tsc_offset,
 
.load_mmu_pgd = vmx_load_mmu_pgd,
+   .complete_mmu_init = vmx_complete_mmu_init,
 
.check_intercept = vmx_check_intercept,
.handle_exit_irqoff = vmx_handle_exit_irqoff,
-- 
2.26.2



[PATCH 0/7] KVM: random nested fixes

2021-02-17 Thread Maxim Levitsky
This is a set of mostly random fixes I have in my patch queue.

- Patches 1,2 are minor tracing fixes from a patch series I sent
  some time ago which I don't want to get lost in the noise.

- Patches 3,4 fix a theoretical bug in VMX with ept=0, but also
  allow moving the nested_vmx_load_cr3 call a bit, to make sure that the update to
  .inject_page_fault is not lost while entering a nested guest.

- Patch 5 fixes running nested guests with npt=0 on the host, which is sometimes
  useful for debugging and such (especially nested).

- Patch 6 fixes the (mostly theoretical) issue with PDPTR loading on VMX after
  nested migration.

- Patch 7 is hopefully the correct fix to eliminate a L0 crash in some rare
  cases when a HyperV guest is migrated.

This was tested with kvm_unit_tests on both VMX and SVM,
both native and in a VM.
Some tests fail on VMX, but I haven't observed new tests failing
due to the changes.

This patch series was also tested by doing my nested migration with:
1. npt/ept disabled on the host
2. npt/ept enabled on the host and disabled in the L1
3. npt/ept enabled on both.

In case of npt/ept=0 on the host (both on Intel and AMD),
the L2 eventually crashed but I strongly suspect a bug in shadow mmu,
which I track separately.
(see below for full explanation).

This patch series is based on kvm/queue branch.

Best regards,
Maxim Levitsky

PS: The shadow mmu bug which I spent most of this week on:

In my testing I am not able to boot win10 (without nesting, HyperV or
anything special) on either Intel or AMD without two-dimensional paging
enabled (ept/npt).
It always crashes in various ways during boot.

I found out (accidentally) that if I make KVM's shadow mmu not unsync last level
shadow pages, it starts working.
In addition to that, as I mentioned above this bug can happen on Linux as well,
while stressing the shadow mmu with repeated migrations
(and again with the same shadow unsync hack it just works).

While running without two-dimensional paging is very obsolete by now, a
bug in the shadow MMU is relevant to nesting, since nesting uses it as well.
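
For reference, the "don't unsync" hack mentioned above amounts to something
like the sketch below. This is a debugging hack, not a fix, and the helper
name is assumed from memory (arch/x86/kvm/mmu/mmu.c); it may be named or
shaped differently in this tree.

/*
 * Debugging hack sketch: never let shadow pages go unsync, i.e. always
 * keep them write-protected.
 */
static bool mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
				   bool can_unsync)
{
	return true;	/* HACK: force write protection, never unsync */
}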

Maxim Levitsky (7):
  KVM: VMX: read idt_vectoring_info a bit earlier
  KVM: nSVM: move nested vmrun tracepoint to enter_svm_guest_mode
  KVM: x86: add .complete_mmu_init arch callback
  KVM: nVMX: move inject_page_fault tweak to .complete_mmu_init
  KVM: nSVM: fix running nested guests when npt=0
  KVM: nVMX: don't load PDPTRS right after nested state set
  KVM: nSVM: call nested_svm_load_cr3 on nested state load

 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h|  2 +
 arch/x86/kvm/mmu/mmu.c |  2 +
 arch/x86/kvm/svm/nested.c  | 84 +++---
 arch/x86/kvm/svm/svm.c |  9 
 arch/x86/kvm/svm/svm.h |  1 +
 arch/x86/kvm/vmx/nested.c  | 22 
 arch/x86/kvm/vmx/nested.h  |  1 +
 arch/x86/kvm/vmx/vmx.c | 13 -
 9 files changed, 92 insertions(+), 43 deletions(-)

-- 
2.26.2




[PATCH 1/7] KVM: VMX: read idt_vectoring_info a bit earlier

2021-02-17 Thread Maxim Levitsky
trace_kvm_exit prints this value (using vmx_get_exit_info)
so it makes sense to read it before the trace point.

Fixes: dcf068da7eb2 ("KVM: VMX: Introduce generic fastpath handler")

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/vmx/vmx.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b3e36dc3f164..e428d69e21c0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6921,13 +6921,15 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
if (unlikely((u16)vmx->exit_reason.basic == 
EXIT_REASON_MCE_DURING_VMENTRY))
kvm_machine_check();
 
+   if (likely(!vmx->exit_reason.failed_vmentry))
+   vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+
trace_kvm_exit(vmx->exit_reason.full, vcpu, KVM_ISA_VMX);
 
if (unlikely(vmx->exit_reason.failed_vmentry))
return EXIT_FASTPATH_NONE;
 
vmx->loaded_vmcs->launched = 1;
-   vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
vmx_recover_nmi_blocking(vmx);
vmx_complete_interrupts(vmx);
-- 
2.26.2



[PATCH] KVM: nSVM: call nested_svm_load_cr3 on nested state load

2021-02-10 Thread Maxim Levitsky
While KVM's MMU should be fully reset by loading of nested CR0/CR3/CR4
by KVM_SET_SREGS, we are not in nested mode yet when we do it and therefore
only root_mmu is reset.

On regular nested entries we call nested_svm_load_cr3 which both updates the
guest's CR3 in the MMU when it is needed, and it also initializes
the mmu again which makes it initialize the walk_mmu as well when nested
paging is enabled in both host and guest.

Since we don't call nested_svm_load_cr3 on nested state load,
the walk_mmu can be left uninitialized, which can lead to a NULL pointer
dereference while accessing it if we happen to get a nested page fault
right after entering the nested guest for the first time after migration and
we decide to emulate it; the emulator then tries to access
walk_mmu->gva_to_gpa, which is NULL.

Therefore we should call this function on nested state load as well.

Suggested-by: Paolo Bonzini 
Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 519fe84f2100..c209f1232928 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1282,6 +1282,14 @@ static int svm_set_nested_state(struct kvm_vcpu *vcpu,
 
nested_vmcb02_prepare_control(svm);
 
+   ret = nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
+ nested_npt_enabled(svm));
+
+   if (ret) {
+   svm_leave_nested(svm);
+   goto out_free;
+   }
+
kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
ret = 0;
 out_free:
-- 
2.26.2



Re: [PATCH v3 4/4] KVM: SVM: Support #GP handling for the case of nested on nested

2021-01-26 Thread Maxim Levitsky
On Tue, 2021-01-26 at 03:18 -0500, Wei Huang wrote:
> Under the case of nested on nested (L0->L1->L2->L3), #GP triggered by
> SVM instructions can be hided from L1. Instead the hypervisor can
> inject the proper #VMEXIT to inform L1 of what is happening. Thus L1
> can avoid invoking the #GP workaround. For this reason we turns on
> guest VM's X86_FEATURE_SVME_ADDR_CHK bit for KVM running inside VM to
> receive the notification and change behavior.
> 
> Similarly we check if vcpu is under guest mode before emulating the
> vmware-backdoor instructions. For the case of nested on nested, we
> let the guest handle it.
> 
> Co-developed-by: Bandan Das 
> Signed-off-by: Bandan Das 
> Signed-off-by: Wei Huang 
> Tested-by: Maxim Levitsky 
> Reviewed-by: Maxim Levitsky 
> ---
>  arch/x86/kvm/svm/svm.c | 20 ++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index f9233c79265b..83c401d2709f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -929,6 +929,9 @@ static __init void svm_set_cpu_caps(void)
>  
>   if (npt_enabled)
>   kvm_cpu_cap_set(X86_FEATURE_NPT);
> +
> + /* Nested VM can receive #VMEXIT instead of triggering #GP */
> + kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
>   }
>  
>   /* CPUID 0x8008 */
> @@ -2198,6 +2201,11 @@ static int svm_instr_opcode(struct kvm_vcpu *vcpu)
>  
>  static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
>  {
> + const int guest_mode_exit_codes[] = {
> + [SVM_INSTR_VMRUN] = SVM_EXIT_VMRUN,
> + [SVM_INSTR_VMLOAD] = SVM_EXIT_VMLOAD,
> + [SVM_INSTR_VMSAVE] = SVM_EXIT_VMSAVE,
> + };
>   int (*const svm_instr_handlers[])(struct vcpu_svm *svm) = {
>   [SVM_INSTR_VMRUN] = vmrun_interception,
>   [SVM_INSTR_VMLOAD] = vmload_interception,
> @@ -2205,7 +2213,14 @@ static int emulate_svm_instr(struct kvm_vcpu *vcpu, 
> int opcode)
>   };
>   struct vcpu_svm *svm = to_svm(vcpu);
>  
> - return svm_instr_handlers[opcode](svm);
> + if (is_guest_mode(vcpu)) {
> + svm->vmcb->control.exit_code = guest_mode_exit_codes[opcode];
> + svm->vmcb->control.exit_info_1 = 0;
> + svm->vmcb->control.exit_info_2 = 0;
> +
> + return nested_svm_vmexit(svm);
> + } else
> + return svm_instr_handlers[opcode](svm);
>  }
>  
>  /*
> @@ -2239,7 +2254,8 @@ static int gp_interception(struct vcpu_svm *svm)
>* VMware backdoor emulation on #GP interception only handles
>* IN{S}, OUT{S}, and RDPMC.
>*/
> - return kvm_emulate_instruction(vcpu,
> + if (!is_guest_mode(vcpu))
> + return kvm_emulate_instruction(vcpu,
>   EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE);
>   } else
>   return emulate_svm_instr(vcpu, opcode);

To be honest I expected the vmware backdoor fix to be in a separate patch,
but I see that Paolo already took these patches, so I guess it is too late.

Anyway I am very happy to see this workaround merged, and see that bug
disappear forever.

Best regards,
Maxim Levitsky



Re: [PATCH v3 3/4] KVM: SVM: Add support for SVM instruction address check change

2021-01-26 Thread Maxim Levitsky
On Tue, 2021-01-26 at 03:18 -0500, Wei Huang wrote:
> New AMD CPUs have a change that checks #VMEXIT intercept on special SVM
> instructions before checking their EAX against reserved memory region.
> This change is indicated by CPUID_0x800A_EDX[28]. If it is 1, #VMEXIT
> is triggered before #GP. KVM doesn't need to intercept and emulate #GP
> faults as #GP is supposed to be triggered.
> 
> Co-developed-by: Bandan Das 
> Signed-off-by: Bandan Das 
> Signed-off-by: Wei Huang 
> Reviewed-by: Maxim Levitsky 
> ---
>  arch/x86/include/asm/cpufeatures.h | 1 +
>  arch/x86/kvm/svm/svm.c | 3 +++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index 84b887825f12..ea89d6fdd79a 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -337,6 +337,7 @@
>  #define X86_FEATURE_AVIC (15*32+13) /* Virtual Interrupt 
> Controller */
>  #define X86_FEATURE_V_VMSAVE_VMLOAD  (15*32+15) /* Virtual VMSAVE VMLOAD */
>  #define X86_FEATURE_VGIF (15*32+16) /* Virtual GIF */
> +#define X86_FEATURE_SVME_ADDR_CHK(15*32+28) /* "" SVME addr check */
>  
>  /* Intel-defined CPU features, CPUID level 0x0007:0 (ECX), word 16 */
>  #define X86_FEATURE_AVX512VBMI   (16*32+ 1) /* AVX512 Vector Bit 
> Manipulation instructions*/
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index e5ca01e25e89..f9233c79265b 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1036,6 +1036,9 @@ static __init int svm_hardware_setup(void)
>   }
>   }
>  
> + if (boot_cpu_has(X86_FEATURE_SVME_ADDR_CHK))
> + svm_gp_erratum_intercept = false;
> +
Again, I would make svm_gp_erratum_intercept a tri-state module param,
and here, if it is in the 'auto' state, do this.

I would also make this code fail if X86_FEATURE_SVME_ADDR_CHK is set but the
user insists on svm_gp_erratum_intercept = true.
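
For illustration, the tri-state idea could look roughly like the sketch
below (parameter handling only, -1 meaning 'auto'). This is just a sketch of
the suggestion, not the actual patch.

/* Sketch only: tri-state module parameter, -1 = auto, 0 = off, 1 = on */
static int svm_gp_erratum_intercept = -1;
module_param(svm_gp_erratum_intercept, int, 0444);

	/* in svm_hardware_setup(): */
	if (boot_cpu_has(X86_FEATURE_SVME_ADDR_CHK)) {
		if (svm_gp_erratum_intercept == 1) {
			pr_err("#GP erratum intercept forced on, but CPU has SVME_ADDR_CHK\n");
			return -EINVAL;
		}
		svm_gp_erratum_intercept = 0;
	} else if (svm_gp_erratum_intercept == -1) {
		svm_gp_erratum_intercept = 1;
	}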

>       if (vgif) {
>   if (!boot_cpu_has(X86_FEATURE_VGIF))
>   vgif = false;


Best regards,
Maxim Levitsky



Re: [PATCH v3 2/4] KVM: SVM: Add emulation support for #GP triggered by SVM instructions

2021-01-26 Thread Maxim Levitsky
gger #GP on
> + *  some AMD CPUs when EAX of these instructions are in the reserved 
> memory
> + *  regions (e.g. SMM memory on host).
> + *   2) VMware backdoor
> + */
> +static int gp_interception(struct vcpu_svm *svm)
> +{
> + struct kvm_vcpu *vcpu = &svm->vcpu;
> + u32 error_code = svm->vmcb->control.exit_info_1;
> + int opcode;
> +
> + /* Both #GP cases have zero error_code */
> + if (error_code)
> + goto reinject;
> +
> + /* Decode the instruction for usage later */
> + if (x86_decode_emulated_instruction(vcpu, 0, NULL, 0) != EMULATION_OK)
> + goto reinject;
> +
> + opcode = svm_instr_opcode(vcpu);
> +
> + if (opcode == NONE_SVM_INSTR) {
> + WARN_ON_ONCE(!enable_vmware_backdoor);
> +
> + /*
> +  * VMware backdoor emulation on #GP interception only handles
> +  * IN{S}, OUT{S}, and RDPMC.
> +  */
> + return kvm_emulate_instruction(vcpu,
> + EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE);
> + } else

I would check svm_gp_erratum_intercept here, not do any emulation
if not set, and print a warning.
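
Something along these lines (sketch only, not the actual patch):

	if (opcode != NONE_SVM_INSTR && !svm_gp_erratum_intercept) {
		pr_warn_once("kvm: unexpected #GP for an SVM instruction\n");
		goto reinject;
	}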

> + return emulate_svm_instr(vcpu, opcode);
> +
> +reinject:
> + kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
> + return 1;
> +}
> +
>  void svm_set_gif(struct vcpu_svm *svm, bool value)
>  {
>   if (value) {


Best regards,
Maxim Levitsky



Re: [PATCH v2 2/4] KVM: SVM: Add emulation support for #GP triggered by SVM instructions

2021-01-25 Thread Maxim Levitsky
On Thu, 2021-01-21 at 14:40 -0800, Sean Christopherson wrote:
> On Thu, Jan 21, 2021, Maxim Levitsky wrote:
> > BTW, on unrelated note, currently the smap test is broken in kvm-unit tests.
> > I bisected it to commit 322cdd6405250a2a3e48db199f97a45ef519e226
> > 
> > It seems that the following hack (I have no idea why it works,
> > since I haven't dug deep into the area 'fixes', the smap test for me)
> > 
> > -#define USER_BASE  (1 << 24)
> > +#define USER_BASE  (1 << 25)
> 
> https://lkml.kernel.org/r/2021012808.619347-2-imbre...@linux.ibm.com
> 
Thanks!

Best regards,
Maxim Levitsky



Thoughts on sharing KVM tracepoints [was:Re: [PATCH 2/2] KVM: nVMX: trace nested vm entry]

2021-01-25 Thread Maxim Levitsky
On Thu, 2021-01-21 at 14:27 -0800, Sean Christopherson wrote:
> On Thu, Jan 21, 2021, Maxim Levitsky wrote:
> > This is very helpful to debug nested VMX issues.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  arch/x86/kvm/trace.h  | 30 ++
> >  arch/x86/kvm/vmx/nested.c |  5 +
> >  arch/x86/kvm/x86.c|  3 ++-
> >  3 files changed, 37 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index 2de30c20bc264..ec75efdac3560 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -554,6 +554,36 @@ TRACE_EVENT(kvm_nested_vmrun,
> > __entry->npt ? "on" : "off")
> >  );
> >  
> > +
> > +/*
> > + * Tracepoint for nested VMLAUNCH/VMRESUME
> > + */
> > +TRACE_EVENT(kvm_nested_vmenter,
> > +   TP_PROTO(__u64 rip, __u64 vmcs, __u64 nested_rip,
> > +__u32 entry_intr_info),
> > +   TP_ARGS(rip, vmcs, nested_rip, entry_intr_info),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(__u64,  rip )
> > +   __field(__u64,  vmcs)
> > +   __field(__u64,  nested_rip  )
> > +   __field(__u32,  entry_intr_info )
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->rip= rip;
> > +   __entry->vmcs   = vmcs;
> > +   __entry->nested_rip = nested_rip;
> > +   __entry->entry_intr_info= entry_intr_info;
> > +   ),
> > +
> > +   TP_printk("rip: 0x%016llx vmcs: 0x%016llx nrip: 0x%016llx "
> > + "entry_intr_info: 0x%08x",
> > +   __entry->rip, __entry->vmcs, __entry->nested_rip,
> > +   __entry->entry_intr_info)
> 
> I still don't see why VMX can't share this with SVM.  "npt' can easily be 
> "tdp",
> differentiating between VMCB and VMCS can be down with ISA, and VMX can give 0
> for int_ctl (or throw in something else interesting/relevant).

I understand very well your point, and I don't strongly disagree with you.
However let me voice my own thoughts on this:
 
I think that sharing tracepoints between SVM and VMX isn't necessarily a good 
idea.
It does make sense in some cases but not in all of them.
 
The trace points are primarily intended for developers, thus they should
capture as much relevant info as possible, but not everything, because traces
can get huge.
 
Also, even though a developer will be the one looking at the traces, some
usability is welcome as well (e.g. for new developers), and looking at things
like info1/info2/intr_info/error_code isn't very usable (by the way, the
error_code should at least be called intr_info_error_code, and of course both
it and intr_info are VMX specific).
 
So I don't even like the fact that kvm_entry/kvm_exit are shared, and I don't
want to add even more shared trace points.
 
I understand that there are some benefits to sharing, namely that a userspace
tool can use the same event to *profile* kvm, but I am not sure that this is
worth it.
 
What we could have done is to have ISA-agnostic (and maybe even x86-agnostic)
kvm_exit/kvm_entry tracepoints that would have no data attached to them, or
very little (like maybe the RIP), and then have ISA-specific tracepoints with
the rest of the info.
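
As a rough sketch of that idea (the name kvm_exit_minimal is made up, and
this is not a proposal-quality patch), such a minimal ISA-agnostic
tracepoint could look like this, with only the RIP attached:

TRACE_EVENT(kvm_exit_minimal,
	TP_PROTO(__u64 rip),
	TP_ARGS(rip),

	TP_STRUCT__entry(
		__field(__u64, rip)
	),

	TP_fast_assign(
		__entry->rip = rip;
	),

	TP_printk("rip: 0x%016llx", __entry->rip)
);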
 
Same could be applied to kvm_nested_vmenter, although for this one I don't 
think that we
need an ISA agnostic tracepoint.
 
Having said all that, I am not hell bent on this. If you really want it to be 
this way,
I won't argue that much.
 
Thoughts?


Best regards,
Maxim Levitsky


> 
>   trace_kvm_nested_vmenter(kvm_rip_read(vcpu),
>vmx->nested.current_vmptr,
>vmcs12->guest_rip,
>0,
>vmcs12->vm_entry_intr_info_field,
>nested_cpu_has_ept(vmcs12),
>KVM_ISA_VMX);
> 
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index 2de30c20bc26..90f7cdb31fc1 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -522,12 +522,12 @@ TRACE_EVENT(kvm_pv_eoi,
>  );
> 
>  /*
> - * Tracepoint for nested VMRUN
> + * Tracepoint for nested VM-Enter.  Note, vmcb==vmcs on VMX.
>   */
> -TRACE_EVENT(kvm_nested_vmrun,
> +TRACE_EVENT(kvm_nested_vmenter,
> TP_PROTO(__u64 rip, __u64 vmcb, __u64 nested_rip, __u32 int_ctl,
> -__u32 event_inj, bool npt),
> -   TP_AR

Re: [PATCH v2 2/3] KVM: nVMX: add kvm_nested_vmlaunch_resume tracepoint

2021-01-21 Thread Maxim Levitsky
On Thu, 2021-01-14 at 16:14 -0800, Sean Christopherson wrote:
> On Thu, Jan 14, 2021, Maxim Levitsky wrote:
> > This is very helpful for debugging nested VMX issues.
> > 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  arch/x86/kvm/trace.h  | 30 ++
> >  arch/x86/kvm/vmx/nested.c |  6 ++
> >  arch/x86/kvm/x86.c|  1 +
> >  3 files changed, 37 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index 2de30c20bc264..663d1b1d8bf64 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -554,6 +554,36 @@ TRACE_EVENT(kvm_nested_vmrun,
> > __entry->npt ? "on" : "off")
> >  );
> >  
> > +
> > +/*
> > + * Tracepoint for nested VMLAUNCH/VMRESUME
> 
> VM-Enter, as below.

Will do

> 
> > + */
> > +TRACE_EVENT(kvm_nested_vmlaunch_resume,
> 
> s/vmlaunc_resume/vmenter, both for consistency with other code and so that it
> can sanely be reused by SVM.  IMO, trace_kvm_entry is wrong :-).
SVM already has trace_kvm_nested_vmrun and it contains some SVM-specific
stuff that doesn't exist on VMX, and vice versa.
So I do want to keep these trace points separate.


> 
> > +   TP_PROTO(__u64 rip, __u64 vmcs, __u64 nested_rip,
> > +__u32 entry_intr_info),
> > +   TP_ARGS(rip, vmcs, nested_rip, entry_intr_info),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(__u64,  rip )
> > +   __field(__u64,  vmcs)
> > +   __field(__u64,  nested_rip  )
> > +   __field(__u32,  entry_intr_info )
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->rip= rip;
> > +   __entry->vmcs   = vmcs;
> > +   __entry->nested_rip = nested_rip;
> > +   __entry->entry_intr_info= entry_intr_info;
> > +   ),
> > +
> > +   TP_printk("rip: 0x%016llx vmcs: 0x%016llx nrip: 0x%016llx "
> > + "entry_intr_info: 0x%08x",
> > +   __entry->rip, __entry->vmcs, __entry->nested_rip,
> > +   __entry->entry_intr_info)
> > +);
> > +
> > +
> >  TRACE_EVENT(kvm_nested_intercepts,
> > TP_PROTO(__u16 cr_read, __u16 cr_write, __u32 exceptions,
> >  __u32 intercept1, __u32 intercept2, __u32 intercept3),
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 776688f9d1017..cd51b66480d52 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -3327,6 +3327,12 @@ enum nvmx_vmentry_status 
> > nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
> > !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
> > vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
> >  
> > +   trace_kvm_nested_vmlaunch_resume(kvm_rip_read(vcpu),
> 
> Hmm, won't this RIP be wrong for the migration case?  I.e. it'll be L2, not L1
> as is the case for the "true" nested VM-Enter path.

> 
> > +vmx->nested.current_vmptr,
> > +vmcs12->guest_rip,
> > +vmcs12->vm_entry_intr_info_field);
> 
> The placement is a bit funky.  I assume you put it here so that calls from
> vmx_set_nested_state() also get traced.  But, that also means
> vmx_pre_leave_smm() will get traced, and it also creates some weirdness where
> some nested VM-Enters that VM-Fail will get traced, but others will not.
> 
> Tracing vmx_pre_leave_smm() isn't necessarily bad, but it could be confusing,
> especially if the debugger looks up the RIP and sees RSM.  Ditto for the
> migration case.
> 
> Not sure what would be a good answer.
> 
> > +
> > +
> > /*
> >  * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
> >  * nested early checks are disabled.  In the event of a "late" VM-Fail,
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index a480804ae27a3..7c6e94e32100e 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11562,6 +11562,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmlaunch_resume);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmrun);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit_inject);
> > -- 
> > 2.26.2
> > 




[PATCH 0/2] VMX: few tracing improvements

2021-01-21 Thread Maxim Levitsky
Since the fix for the bug in nested migration on VMX is
already merged by Paolo, these are the remaining
patches in this series.

I added a new patch to trace SVM nested entries from
SMM and nested state load as well.

Best regards,
Maxim Levitsky

Maxim Levitsky (2):
  KVM: nSVM: move nested vmrun tracepoint to enter_svm_guest_mode
  KVM: nVMX: trace nested vm entry

 arch/x86/kvm/svm/nested.c | 26 ++
 arch/x86/kvm/trace.h  | 30 ++
 arch/x86/kvm/vmx/nested.c |  5 +
 arch/x86/kvm/x86.c|  3 ++-
 4 files changed, 51 insertions(+), 13 deletions(-)

-- 
2.26.2




[PATCH 2/2] KVM: nVMX: trace nested vm entry

2021-01-21 Thread Maxim Levitsky
This is very helpful to debug nested VMX issues.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/trace.h  | 30 ++
 arch/x86/kvm/vmx/nested.c |  5 +
 arch/x86/kvm/x86.c|  3 ++-
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 2de30c20bc264..ec75efdac3560 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -554,6 +554,36 @@ TRACE_EVENT(kvm_nested_vmrun,
__entry->npt ? "on" : "off")
 );
 
+
+/*
+ * Tracepoint for nested VMLAUNCH/VMRESUME
+ */
+TRACE_EVENT(kvm_nested_vmenter,
+   TP_PROTO(__u64 rip, __u64 vmcs, __u64 nested_rip,
+__u32 entry_intr_info),
+   TP_ARGS(rip, vmcs, nested_rip, entry_intr_info),
+
+   TP_STRUCT__entry(
+   __field(__u64,  rip )
+   __field(__u64,  vmcs)
+   __field(__u64,  nested_rip  )
+   __field(__u32,  entry_intr_info )
+   ),
+
+   TP_fast_assign(
+   __entry->rip= rip;
+   __entry->vmcs   = vmcs;
+   __entry->nested_rip = nested_rip;
+   __entry->entry_intr_info= entry_intr_info;
+   ),
+
+   TP_printk("rip: 0x%016llx vmcs: 0x%016llx nrip: 0x%016llx "
+ "entry_intr_info: 0x%08x",
+   __entry->rip, __entry->vmcs, __entry->nested_rip,
+   __entry->entry_intr_info)
+);
+
+
 TRACE_EVENT(kvm_nested_intercepts,
TP_PROTO(__u16 cr_read, __u16 cr_write, __u32 exceptions,
 __u32 intercept1, __u32 intercept2, __u32 intercept3),
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 0fbb46990dfce..20b0954f31bda 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3327,6 +3327,11 @@ enum nvmx_vmentry_status 
nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
 
+   trace_kvm_nested_vmenter(kvm_rip_read(vcpu),
+vmx->nested.current_vmptr,
+vmcs12->guest_rip,
+vmcs12->vm_entry_intr_info_field);
+
/*
 * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
 * nested early checks are disabled.  In the event of a "late" VM-Fail,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a480804ae27a3..757f4f88072af 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11562,11 +11562,12 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmenter);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmenter_failed);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmrun);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit_inject);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intr_vmexit);
-EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmenter_failed);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
-- 
2.26.2



[PATCH 1/2] KVM: nSVM: move nested vmrun tracepoint to enter_svm_guest_mode

2021-01-21 Thread Maxim Levitsky
This way the trace will capture all nested mode entries
(including entries after migration and from SMM).

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/svm/nested.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index cb4c6ee10029c..a6a14f7b56278 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -440,6 +440,20 @@ int enter_svm_guest_mode(struct vcpu_svm *svm, u64 
vmcb12_gpa,
 {
int ret;
 
+   trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb12_gpa,
+  vmcb12->save.rip,
+  vmcb12->control.int_ctl,
+  vmcb12->control.event_inj,
+  vmcb12->control.nested_ctl);
+
+   trace_kvm_nested_intercepts(vmcb12->control.intercepts[INTERCEPT_CR] & 
0xffff,
+   vmcb12->control.intercepts[INTERCEPT_CR] >> 
16,
+   
vmcb12->control.intercepts[INTERCEPT_EXCEPTION],
+   vmcb12->control.intercepts[INTERCEPT_WORD3],
+   vmcb12->control.intercepts[INTERCEPT_WORD4],
+   
vmcb12->control.intercepts[INTERCEPT_WORD5]);
+
+
svm->nested.vmcb12_gpa = vmcb12_gpa;
load_nested_vmcb_control(svm, >control);
nested_prepare_vmcb_save(svm, vmcb12);
@@ -493,18 +507,6 @@ int nested_svm_vmrun(struct vcpu_svm *svm)
goto out;
}
 
-   trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb12_gpa,
-  vmcb12->save.rip,
-  vmcb12->control.int_ctl,
-  vmcb12->control.event_inj,
-  vmcb12->control.nested_ctl);
-
-   trace_kvm_nested_intercepts(vmcb12->control.intercepts[INTERCEPT_CR] & 
0xffff,
-   vmcb12->control.intercepts[INTERCEPT_CR] >> 
16,
-   
vmcb12->control.intercepts[INTERCEPT_EXCEPTION],
-   vmcb12->control.intercepts[INTERCEPT_WORD3],
-   vmcb12->control.intercepts[INTERCEPT_WORD4],
-   
vmcb12->control.intercepts[INTERCEPT_WORD5]);
 
/* Clear internal status */
kvm_clear_exception_queue(&svm->vcpu);
-- 
2.26.2



Re: [PATCH v2 3/3] KVM: VMX: read idt_vectoring_info a bit earlier

2021-01-21 Thread Maxim Levitsky
On Thu, 2021-01-14 at 16:29 -0800, Sean Christopherson wrote:
> On Thu, Jan 14, 2021, Maxim Levitsky wrote:
> > This allows it to be printed correctly by the trace print
> 
> It'd be helpful to explicitly say which tracepoint, and explain that the value
> is read by vmx_get_exit_info().  It's far from obvious how this gets consumed.
> 
> > that follows.
> > 
> 
> Fixes: dcf068da7eb2 ("KVM: VMX: Introduce generic fastpath handler")
> 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  arch/x86/kvm/vmx/vmx.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 2af05d3b05909..9b6e7dbf5e2bd 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6771,6 +6771,8 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> > }
> >  
> > vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
> > +   vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> 
> Hrm, it probably makes sense to either do the VMREAD conditionally, or to
> zero idt_vectoring_info in the vmx->fail path.  I don't care about the cycles
> on VM-Exit consistency checks, just that this would hide that the field is 
> valid
> if and only if VM-Enter fully succeeded.  A third option would be to add a
> comment saying that it's unnecessary if VM-Enter failed, but faster in the
> common case to just do the VMREAD.

Alright, I will add this.

Best regards,
Maxim Levitsky


> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 2af05d3b0590..3c172c05570a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6774,13 +6774,15 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> if (unlikely((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY))
> kvm_machine_check();
> 
> +   if (likely(!(vmx->exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY)))
> +   vmx->idt_vectoring_info = 
> vmcs_read32(IDT_VECTORING_INFO_FIELD);
> +
> trace_kvm_exit(vmx->exit_reason, vcpu, KVM_ISA_VMX);
> 
> if (unlikely(vmx->exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
> return EXIT_FASTPATH_NONE;
> 
> vmx->loaded_vmcs->launched = 1;
> -   vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> 
> vmx_recover_nmi_blocking(vmx);
> vmx_complete_interrupts(vmx);
> 
> 
> > +
> > if (unlikely((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY))
> > kvm_machine_check();
> >  
> > @@ -6780,7 +6782,6 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> > return EXIT_FASTPATH_NONE;
> >  
> > vmx->loaded_vmcs->launched = 1;
> > -   vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> >  
> > vmx_recover_nmi_blocking(vmx);
> > vmx_complete_interrupts(vmx);
> > -- 
> > 2.26.2
> > 




Re: [PATCH v2 2/3] KVM: nVMX: add kvm_nested_vmlaunch_resume tracepoint

2021-01-21 Thread Maxim Levitsky
On Fri, 2021-01-15 at 08:30 -0800, Sean Christopherson wrote:
> On Fri, Jan 15, 2021, Paolo Bonzini wrote:
> > On 15/01/21 01:14, Sean Christopherson wrote:
> > > > +   trace_kvm_nested_vmlaunch_resume(kvm_rip_read(vcpu),
> > > Hmm, won't this RIP be wrong for the migration case?  I.e. it'll be L2, 
> > > not L1
> > > as is the case for the "true" nested VM-Enter path.
> > 
> > It will be the previous RIP---might as well be 0xfff0 depending on what
> > userspace does.  I don't think you can do much better than that, using
> > vmcs12->host_rip would be confusing in the SMM case.
> > 
> > > > +vmx->nested.current_vmptr,
> > > > +vmcs12->guest_rip,
> > > > +
> > > > vmcs12->vm_entry_intr_info_field);
> > > The placement is a bit funky.  I assume you put it here so that calls from
> > > vmx_set_nested_state() also get traced.  But, that also means
> > > vmx_pre_leave_smm() will get traced, and it also creates some weirdness 
> > > where
> > > some nested VM-Enters that VM-Fail will get traced, but others will not.
> > > 
> > > Tracing vmx_pre_leave_smm() isn't necessarily bad, but it could be 
> > > confusing,
> > > especially if the debugger looks up the RIP and sees RSM.  Ditto for the
> > > migration case.
> > 
> > Actually tracing vmx_pre_leave_smm() is good, and pointing to RSM makes
> > sense so I'm not worried about that.
> 
> Ideally there would something in the tracepoint to differentiate the various
> cases.  Not that the RSM/migration cases will pop up often, but I think it's 
> an
> easily solved problem that could avoid confusion.
> 
> What if we captured vmx->nested.smm.guest_mode and from_vmentry, and 
> explicitly
> record what triggered the entry?
> 
>   TP_printk("from: %s rip: 0x%016llx vmcs: 0x%016llx nrip: 0x%016llx 
> intr_info: 0x%08x",
> __entry->vmenter ? "VM-Enter" : __entry->smm ? "RSM" : 
> "SET_STATE",
> __entry->rip, __entry->vmcs, __entry->nested_rip,
> __entry->entry_intr_info

I think that this is a good idea, but should be done in a separate patch.
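
For reference, a rough sketch of what capturing those two flags could look
like (the 'vmenter' and 'smm' booleans are assumed to be passed in by the
caller, e.g. from from_vmentry and vmx->nested.smm.guest_mode; this is only
an illustration, not part of any posted patch):

TRACE_EVENT(kvm_nested_vmlaunch_resume,
	TP_PROTO(__u64 rip, __u64 vmcs, __u64 nested_rip,
		 __u32 entry_intr_info, bool vmenter, bool smm),
	TP_ARGS(rip, vmcs, nested_rip, entry_intr_info, vmenter, smm),

	TP_STRUCT__entry(
		__field(__u64,	rip		)
		__field(__u64,	vmcs		)
		__field(__u64,	nested_rip	)
		__field(__u32,	entry_intr_info	)
		__field(bool,	vmenter		)
		__field(bool,	smm		)
	),

	TP_fast_assign(
		__entry->rip		= rip;
		__entry->vmcs		= vmcs;
		__entry->nested_rip	= nested_rip;
		__entry->entry_intr_info = entry_intr_info;
		__entry->vmenter	= vmenter;
		__entry->smm		= smm;
	),

	TP_printk("from: %s rip: 0x%016llx vmcs: 0x%016llx nrip: 0x%016llx entry_intr_info: 0x%08x",
		  __entry->vmenter ? "VM-Enter" : __entry->smm ? "RSM" : "SET_STATE",
		  __entry->rip, __entry->vmcs, __entry->nested_rip,
		  __entry->entry_intr_info)
);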

> 
> Side topic, can we have an "official" ruling on whether KVM tracepoints should
> use colons and/or commas? And probably same question for whether or not to
> prepend zeros.  E.g. kvm_entry has "vcpu %u, rip 0x%lx" versus "rip: 0x%016llx
> vmcs: 0x%016llx".  It bugs me that we're so inconsistent.
> 

As I said, KVM tracing has a lot of things that can be improved, and since it is
often the only way to figure out complex bugs such as the ones I had to deal with recently,
I will do more improvements in this area as time permits.

Best regards,
Maxim Levitsky





Re: [PATCH v2 2/3] KVM: nVMX: add kvm_nested_vmlaunch_resume tracepoint

2021-01-21 Thread Maxim Levitsky
On Fri, 2021-01-15 at 14:48 +0100, Paolo Bonzini wrote:
> On 15/01/21 01:14, Sean Christopherson wrote:
> > > + trace_kvm_nested_vmlaunch_resume(kvm_rip_read(vcpu),
> > Hmm, won't this RIP be wrong for the migration case?  I.e. it'll be L2, not 
> > L1
> > as is the case for the "true" nested VM-Enter path.

Actually in this case, the initial RIP of 0xfff0 will be printed
which isn't that bad.

A tracepoint in nested state load function would be very nice to add
to mark this explicitly. I'll do this later.

> 
> It will be the previous RIP---might as well be 0xfff0 depending on 
> what userspace does.  I don't think you can do much better than that, 
> using vmcs12->host_rip would be confusing in the SMM case.
> 
> > > +  vmx->nested.current_vmptr,
> > > +  vmcs12->guest_rip,
> > > +  vmcs12->vm_entry_intr_info_field);
> > The placement is a bit funky.  I assume you put it here so that calls from
> > vmx_set_nested_state() also get traced.  But, that also means
> > vmx_pre_leave_smm() will get traced, and it also creates some weirdness 
> > where
> > some nested VM-Enters that VM-Fail will get traced, but others will not.
> > 
> > Tracing vmx_pre_leave_smm() isn't necessarily bad, but it could be 
> > confusing,
> > especially if the debugger looks up the RIP and sees RSM.  Ditto for the
> > migration case.
> 
> Actually tracing vmx_pre_leave_smm() is good, and pointing to RSM makes 
> sense so I'm not worried about that.
> 
> Paolo
> 

I agree with that and indeed this was my intention.

In fact I will change the SVM tracepoint to behave the same way
in the next patch series (I'll move it to enter_svm_guest_mode).

(When I wrote this patch I somehow thought that this was what SVM already did.)

Best regards,
Maxim Levitsky





Re: [PATCH v2 2/4] KVM: SVM: Add emulation support for #GP triggered by SVM instructions

2021-01-21 Thread Maxim Levitsky
On Thu, 2021-01-21 at 10:06 -0600, Wei Huang wrote:
> 
> On 1/21/21 8:07 AM, Maxim Levitsky wrote:
> > On Thu, 2021-01-21 at 01:55 -0500, Wei Huang wrote:
> > > From: Bandan Das 
> > > 
> > > While running SVM related instructions (VMRUN/VMSAVE/VMLOAD), some AMD
> > > CPUs check EAX against reserved memory regions (e.g. SMM memory on host)
> > > before checking VMCB's instruction intercept. If EAX falls into such
> > > memory areas, #GP is triggered before VMEXIT. This causes problem under
> > > nested virtualization. To solve this problem, KVM needs to trap #GP and
> > > check the instructions triggering #GP. For VM execution instructions,
> > > KVM emulates these instructions.
> > > 
> > > Co-developed-by: Wei Huang 
> > > Signed-off-by: Wei Huang 
> > > Signed-off-by: Bandan Das 
> > > ---
> > >  arch/x86/kvm/svm/svm.c | 99 ++
> > >  1 file changed, 81 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > index 7ef171790d02..6ed523cab068 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -288,6 +288,9 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
> > >   if (!(efer & EFER_SVME)) {
> > >   svm_leave_nested(svm);
> > >   svm_set_gif(svm, true);
> > > + /* #GP intercept is still needed in vmware_backdoor */
> > > + if (!enable_vmware_backdoor)
> > > + clr_exception_intercept(svm, GP_VECTOR);
> > Again I would prefer a flag for the errata workaround, but this is still
> > better.
> 
> Instead of using !enable_vmware_backdoor, will the following be better?
> Or is the existing form acceptable?
> 
> if (!kvm_cpu_cap_has(X86_FEATURE_SVME_ADDR_CHK))
>   clr_exception_intercept(svm, GP_VECTOR);

To be honest I would prefer to have a module param named something like
'enable_svm_gp_errata_workaround' that takes a 3-state value (0, 1, -1),
aka off, on, auto:

0,1 - force disable/enable the workaround.
-1  - auto-select based on X86_FEATURE_SVME_ADDR_CHK.

0 could be used if, for example, someone is paranoid about attack surface.
-#define USER_BASE  (1 << 24)
+#define USER_BASE  (1 << 25)
This isn't that much important to me though, so if you prefer you can leave it
as is as well.
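
A minimal sketch of what such a tri-state parameter could look like (the
parameter name and the helper below are only illustrative, not an actual
posted patch):

/*
 * Sketch only: tri-state module parameter as described above.
 * -1 = auto (follow X86_FEATURE_SVME_ADDR_CHK), 0 = force off, 1 = force on.
 */
static int enable_svm_gp_errata_workaround = -1;
module_param(enable_svm_gp_errata_workaround, int, 0444);

static bool svm_need_gp_errata_workaround(void)
{
	if (enable_svm_gp_errata_workaround >= 0)
		return enable_svm_gp_errata_workaround;

	/* auto: the workaround is only needed when the CPU lacks the fix */
	return !boot_cpu_has(X86_FEATURE_SVME_ADDR_CHK);
}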

> 
> > >  
> > >   /*
> > >* Free the nested guest state, unless we are in SMM.
> > > @@ -309,6 +312,9 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
> > >  
> > >   svm->vmcb->save.efer = efer | EFER_SVME;
> > >   vmcb_mark_dirty(svm->vmcb, VMCB_CR);
> > > + /* Enable #GP interception for SVM instructions */
> > > + set_exception_intercept(svm, GP_VECTOR);
> > > +
> > >   return 0;
> > >  }
> > >  
> > > @@ -1957,24 +1963,6 @@ static int ac_interception(struct vcpu_svm *svm)
> > >   return 1;
> > >  }
> > >  
> > > -static int gp_interception(struct vcpu_svm *svm)
> > > -{
> > > - struct kvm_vcpu *vcpu = &svm->vcpu;
> > > - u32 error_code = svm->vmcb->control.exit_info_1;
> > > -
> > > - WARN_ON_ONCE(!enable_vmware_backdoor);
> > > -
> > > - /*
> > > -  * VMware backdoor emulation on #GP interception only handles IN{S},
> > > -  * OUT{S}, and RDPMC, none of which generate a non-zero error code.
> > > -  */
> > > - if (error_code) {
> > > - kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
> > > - return 1;
> > > - }
> > > - return kvm_emulate_instruction(vcpu, EMULTYPE_VMWARE_GP);
> > > -}
> > > -
> > >  static bool is_erratum_383(void)
> > >  {
> > >   int err, i;
> > > @@ -2173,6 +2161,81 @@ static int vmrun_interception(struct vcpu_svm *svm)
> > >   return nested_svm_vmrun(svm);
> > >  }
> > >  
> > > +enum {
> > > + NOT_SVM_INSTR,
> > > + SVM_INSTR_VMRUN,
> > > + SVM_INSTR_VMLOAD,
> > > + SVM_INSTR_VMSAVE,
> > > +};
> > > +
> > > +/* Return NOT_SVM_INSTR if not SVM instrs, otherwise return decode 
> > > result */
> > > +static int svm_instr_opcode(struct kvm_vcpu *vcpu)
> > > +{
> > > + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> > > +
> > > + if (ctxt->b != 0x1 || ctxt->o

Re: [PATCH v2 4/4] KVM: SVM: Support #GP handling for the case of nested on nested

2021-01-21 Thread Maxim Levitsky
On Thu, 2021-01-21 at 14:56 +, Dr. David Alan Gilbert wrote:
> * Wei Huang (wei.hua...@amd.com) wrote:
> > In the case of nested on nested (e.g. L0->L1->L2->L3), a #GP triggered
> > by SVM instructions can be hidden from L1. Instead the hypervisor can
> > inject the proper #VMEXIT to inform L1 of what is happening. Thus L1
> > can avoid invoking the #GP workaround. For this reason we turn on the
> > guest VM's X86_FEATURE_SVME_ADDR_CHK bit, so that a KVM running inside the
> > VM can receive the notification and change behavior.
> 
> Doesn't this mean a VM migrated between levels (hmm L2 to L1???) would
> see different behaviour?
> (I've never tried such a migration, but I thought in principle it should
> work.)

This is not an issue. The VM will always see the X86_FEATURE_SVME_ADDR_CHK bit set
(regardless of whether the host has it or KVM emulates it).
This is no different from what KVM does for the guest's x2apic:
KVM always emulates it regardless of host support.

The hypervisor, on the other hand, may or may not see that bit set,
but it is prepared to handle both cases, so it will support migrating VMs
between hosts that do and don't have that bit.

I hope that I understand this correctly.

Best regards,
Maxim Levitsky


> 
> Dave
> 
> 
> > Co-developed-by: Bandan Das 
> > Signed-off-by: Bandan Das 
> > Signed-off-by: Wei Huang 
> > ---
> >  arch/x86/kvm/svm/svm.c | 19 ++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 2a12870ac71a..89512c0e7663 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -2196,6 +2196,11 @@ static int svm_instr_opcode(struct kvm_vcpu *vcpu)
> >  
> >  static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
> >  {
> > +   const int guest_mode_exit_codes[] = {
> > +   [SVM_INSTR_VMRUN] = SVM_EXIT_VMRUN,
> > +   [SVM_INSTR_VMLOAD] = SVM_EXIT_VMLOAD,
> > +   [SVM_INSTR_VMSAVE] = SVM_EXIT_VMSAVE,
> > +   };
> > int (*const svm_instr_handlers[])(struct vcpu_svm *svm) = {
> > [SVM_INSTR_VMRUN] = vmrun_interception,
> > [SVM_INSTR_VMLOAD] = vmload_interception,
> > @@ -2203,7 +2208,14 @@ static int emulate_svm_instr(struct kvm_vcpu *vcpu, 
> > int opcode)
> > };
> > struct vcpu_svm *svm = to_svm(vcpu);
> >  
> > -   return svm_instr_handlers[opcode](svm);
> > +   if (is_guest_mode(vcpu)) {
> > +   svm->vmcb->control.exit_code = guest_mode_exit_codes[opcode];
> > +   svm->vmcb->control.exit_info_1 = 0;
> > +   svm->vmcb->control.exit_info_2 = 0;
> > +
> > +   return nested_svm_vmexit(svm);
> > +   } else
> > +   return svm_instr_handlers[opcode](svm);
> >  }
> >  
> >  /*
> > @@ -4034,6 +4046,11 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu 
> > *vcpu)
> > /* Check again if INVPCID interception if required */
> > svm_check_invpcid(svm);
> >  
> > +   if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM)) {
> > +   best = kvm_find_cpuid_entry(vcpu, 0x800A, 0);
> > +   best->edx |= (1 << 28);
> > +   }
> > +
> > /* For sev guests, the memory encryption bit is not reserved in CR3.  */
> > if (sev_guest(vcpu->kvm)) {
> > best = kvm_find_cpuid_entry(vcpu, 0x801F, 0);
> > -- 
> > 2.27.0
> > 




Re: [PATCH v2 4/4] KVM: SVM: Support #GP handling for the case of nested on nested

2021-01-21 Thread Maxim Levitsky
On Thu, 2021-01-21 at 01:55 -0500, Wei Huang wrote:
> In the case of nested on nested (e.g. L0->L1->L2->L3), a #GP triggered
> by SVM instructions can be hidden from L1. Instead the hypervisor can
> inject the proper #VMEXIT to inform L1 of what is happening. Thus L1
> can avoid invoking the #GP workaround. For this reason we turn on the
> guest VM's X86_FEATURE_SVME_ADDR_CHK bit, so that a KVM running inside the
> VM can receive the notification and change behavior.
> 
> Co-developed-by: Bandan Das 
> Signed-off-by: Bandan Das 
> Signed-off-by: Wei Huang 
> ---
>  arch/x86/kvm/svm/svm.c | 19 ++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 2a12870ac71a..89512c0e7663 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2196,6 +2196,11 @@ static int svm_instr_opcode(struct kvm_vcpu *vcpu)
>  
>  static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
>  {
> + const int guest_mode_exit_codes[] = {
> + [SVM_INSTR_VMRUN] = SVM_EXIT_VMRUN,
> + [SVM_INSTR_VMLOAD] = SVM_EXIT_VMLOAD,
> + [SVM_INSTR_VMSAVE] = SVM_EXIT_VMSAVE,
> + };
>   int (*const svm_instr_handlers[])(struct vcpu_svm *svm) = {
>   [SVM_INSTR_VMRUN] = vmrun_interception,
>   [SVM_INSTR_VMLOAD] = vmload_interception,
> @@ -2203,7 +2208,14 @@ static int emulate_svm_instr(struct kvm_vcpu *vcpu, 
> int opcode)
>   };
>   struct vcpu_svm *svm = to_svm(vcpu);
>  
> - return svm_instr_handlers[opcode](svm);
> + if (is_guest_mode(vcpu)) {
> + svm->vmcb->control.exit_code = guest_mode_exit_codes[opcode];
> + svm->vmcb->control.exit_info_1 = 0;
> + svm->vmcb->control.exit_info_2 = 0;
> +
> + return nested_svm_vmexit(svm);
> + } else
> + return svm_instr_handlers[opcode](svm);
>  }
>  
>  /*
> @@ -4034,6 +4046,11 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu 
> *vcpu)
>   /* Check again if INVPCID interception if required */
>   svm_check_invpcid(svm);
>  
> + if (nested && guest_cpuid_has(vcpu, X86_FEATURE_SVM)) {
> + best = kvm_find_cpuid_entry(vcpu, 0x8000000A, 0);
> + best->edx |= (1 << 28);
> + }
> +
>   /* For sev guests, the memory encryption bit is not reserved in CR3.  */
>   if (sev_guest(vcpu->kvm)) {
>   best = kvm_find_cpuid_entry(vcpu, 0x8000001F, 0);

Tested-by: Maxim Levitsky 
Reviewed-by: Maxim Levitsky 


Best regards,
Maxim Levitsky



Re: [PATCH v2 3/4] KVM: SVM: Add support for VMCB address check change

2021-01-21 Thread Maxim Levitsky
On Thu, 2021-01-21 at 01:55 -0500, Wei Huang wrote:
> New AMD CPUs have a change that checks VMEXIT intercept on special SVM
> instructions before checking their EAX against reserved memory region.
> This change is indicated by CPUID_0x8000000A_EDX[28]. If it is 1, #VMEXIT
> is triggered before #GP. KVM doesn't need to intercept and emulate #GP
> faults as #GP is supposed to be triggered.
> 
> Co-developed-by: Bandan Das 
> Signed-off-by: Bandan Das 
> Signed-off-by: Wei Huang 
> ---
>  arch/x86/include/asm/cpufeatures.h | 1 +
>  arch/x86/kvm/svm/svm.c | 6 +-
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index 84b887825f12..ea89d6fdd79a 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -337,6 +337,7 @@
>  #define X86_FEATURE_AVIC (15*32+13) /* Virtual Interrupt 
> Controller */
>  #define X86_FEATURE_V_VMSAVE_VMLOAD  (15*32+15) /* Virtual VMSAVE VMLOAD */
>  #define X86_FEATURE_VGIF (15*32+16) /* Virtual GIF */
> +#define X86_FEATURE_SVME_ADDR_CHK  (15*32+28) /* "" SVME addr check */
>  
>  /* Intel-defined CPU features, CPUID level 0x00000007:0 (ECX), word 16 */
>  #define X86_FEATURE_AVX512VBMI   (16*32+ 1) /* AVX512 Vector Bit 
> Manipulation instructions*/
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 6ed523cab068..2a12870ac71a 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -313,7 +313,8 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
>   svm->vmcb->save.efer = efer | EFER_SVME;
>   vmcb_mark_dirty(svm->vmcb, VMCB_CR);
>   /* Enable #GP interception for SVM instructions */
> - set_exception_intercept(svm, GP_VECTOR);
> + if (!kvm_cpu_cap_has(X86_FEATURE_SVME_ADDR_CHK))
> + set_exception_intercept(svm, GP_VECTOR);
>  
>   return 0;
>  }
> @@ -933,6 +934,9 @@ static __init void svm_set_cpu_caps(void)
>   boot_cpu_has(X86_FEATURE_AMD_SSBD))
>   kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD);
>  
> + if (boot_cpu_has(X86_FEATURE_SVME_ADDR_CHK))
> +     kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
> +
>   /* Enable INVPCID feature */
>   kvm_cpu_cap_check_and_set(X86_FEATURE_INVPCID);
>  }

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky



Re: [PATCH v2 2/4] KVM: SVM: Add emulation support for #GP triggered by SVM instructions

2021-01-21 Thread Maxim Levitsky
he VMWARE backdoor and that WARN_ON_ONCE that was removed.


> + if (error_code)
> + goto reinject;
> +
> + /* Decode the instruction for usage later */
> + if (x86_emulate_decoded_instruction(vcpu, 0, NULL, 0) != EMULATION_OK)
> + goto reinject;
> +
> + opcode = svm_instr_opcode(vcpu);
> + if (opcode)

I prefer opcode != NOT_SVM_INSTR.

> + return emulate_svm_instr(vcpu, opcode);
> + else

'WARN_ON_ONCE(!enable_vmware_backdoor)' I think can be placed here.


> + return kvm_emulate_instruction(vcpu,
> + EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE);

I tested the vmware backdoor a bit (using the kvm unit tests) and I found a
tiny pre-existing bug there:

We shouldn't emulate the vmware backdoor for a nested guest, but rather let L1
do it.

The below patch (on top of your patches) works for me and allows the vmware 
backdoor 
test to pass when kvm unit tests run in a guest.

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index fe97b0e41824a..4557fdc9c3e1b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2243,7 +2243,7 @@ static int gp_interception(struct vcpu_svm *svm)
opcode = svm_instr_opcode(vcpu);
if (opcode)
return emulate_svm_instr(vcpu, opcode);
-   else
+   else if (!is_guest_mode(vcpu))
return kvm_emulate_instruction(vcpu,
EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE);
 


Best regards,
Maxim Levitsky

> +
> +reinject:
> + kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
> + return 1;
> +}
> +
>  void svm_set_gif(struct vcpu_svm *svm, bool value)
>  {
>   if (value) {







Re: [PATCH v2 1/4] KVM: x86: Factor out x86 instruction emulation with decoding

2021-01-21 Thread Maxim Levitsky
On Thu, 2021-01-21 at 01:55 -0500, Wei Huang wrote:
> Move the instruction decode part out of x86_emulate_instruction() for it
> to be used in other places. Also kvm_clear_exception_queue() is moved
> inside the if-statement as it doesn't apply when KVM are coming back from
> userspace.
> 
> Co-developed-by: Bandan Das 
> Signed-off-by: Bandan Das 
> Signed-off-by: Wei Huang 
> ---
>  arch/x86/kvm/x86.c | 63 +-
>  arch/x86/kvm/x86.h |  2 ++
>  2 files changed, 42 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a8969a6dd06..580883cee493 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7298,6 +7298,43 @@ static bool is_vmware_backdoor_opcode(struct 
> x86_emulate_ctxt *ctxt)
>   return false;
>  }
>  
> +/*
> + * Decode and emulate instruction. Return EMULATION_OK if success.
> + */
> +int x86_emulate_decoded_instruction(struct kvm_vcpu *vcpu, int 
> emulation_type,
> + void *insn, int insn_len)

Isn't the name of this function wrong? This function decodes the instruction.
So I would expect something like x86_decode_instruction.

> +{
> + int r = EMULATION_OK;
> + struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> +
> + init_emulate_ctxt(vcpu);
> +
> + /*
> +  * We will reenter on the same instruction since
> +  * we do not set complete_userspace_io.  This does not
> +  * handle watchpoints yet, those would be handled in
> +  * the emulate_ops.
> +  */
> + if (!(emulation_type & EMULTYPE_SKIP) &&
> + kvm_vcpu_check_breakpoint(vcpu, &r))
> + return r;
> +
> + ctxt->interruptibility = 0;
> + ctxt->have_exception = false;
> + ctxt->exception.vector = -1;
> + ctxt->perm_ok = false;
> +
> + ctxt->ud = emulation_type & EMULTYPE_TRAP_UD;
> +
> + r = x86_decode_insn(ctxt, insn, insn_len);
> +
> + trace_kvm_emulate_insn_start(vcpu);
> + ++vcpu->stat.insn_emulation;
> +
> + return r;
> +}
> +EXPORT_SYMBOL_GPL(x86_emulate_decoded_instruction);
> +
>  int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>   int emulation_type, void *insn, int insn_len)
>  {
> @@ -7317,32 +7354,12 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, 
> gpa_t cr2_or_gpa,
>*/
>   write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
>   vcpu->arch.write_fault_to_shadow_pgtable = false;
> - kvm_clear_exception_queue(vcpu);

I think that this change is OK, but I can't be 100% sure about this.

Best regards,
Maxim Levitsky


>  
>   if (!(emulation_type & EMULTYPE_NO_DECODE)) {
> - init_emulate_ctxt(vcpu);
> -
> - /*
> -  * We will reenter on the same instruction since
> -  * we do not set complete_userspace_io.  This does not
> -  * handle watchpoints yet, those would be handled in
> -  * the emulate_ops.
> -  */
> - if (!(emulation_type & EMULTYPE_SKIP) &&
> - kvm_vcpu_check_breakpoint(vcpu, &r))
> - return r;
> -
> - ctxt->interruptibility = 0;
> - ctxt->have_exception = false;
> - ctxt->exception.vector = -1;
> - ctxt->perm_ok = false;
> -
> - ctxt->ud = emulation_type & EMULTYPE_TRAP_UD;
> -
> - r = x86_decode_insn(ctxt, insn, insn_len);
> + kvm_clear_exception_queue(vcpu);
>  
> - trace_kvm_emulate_insn_start(vcpu);
> - ++vcpu->stat.insn_emulation;
> + r = x86_emulate_decoded_instruction(vcpu, emulation_type,
> + insn, insn_len);
>   if (r != EMULATION_OK)  {
>   if ((emulation_type & EMULTYPE_TRAP_UD) ||
>   (emulation_type & EMULTYPE_TRAP_UD_FORCED)) {
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index c5ee0f5ce0f1..fc42454a4c27 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -273,6 +273,8 @@ bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu 
> *vcpu, gfn_t gfn,
> int page_num);
>  bool kvm_vector_hashing_enabled(void);
>  void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 
> error_code);
> +int x86_emulate_decoded_instruction(struct kvm_vcpu *vcpu, int 
> emulation_type,
> + void *insn, int insn_len);
>  int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>   int emulation_type, void *insn, int insn_len);
>  fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);







[PATCH v2 1/3] KVM: nVMX: Always call sync_vmcs02_to_vmcs12_rare on migration

2021-01-14 Thread Maxim Levitsky
Even when we are outside the nested guest, some vmcs02 fields
are not kept in sync with vmcs12.

However, during migration the vmcs12 has to be up to date
so that it can be loaded later, after the migration.

To fix that, also call the rare-field sync when the nested state is read
outside guest mode.

Fixes: 7952d769c29ca ("KVM: nVMX: Sync rarely accessed guest fields only when 
needed")

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/vmx/nested.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 0fbb46990dfce..776688f9d1017 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -6077,11 +6077,14 @@ static int vmx_get_nested_state(struct kvm_vcpu *vcpu,
if (is_guest_mode(vcpu)) {
sync_vmcs02_to_vmcs12(vcpu, vmcs12);
sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
-   } else if (!vmx->nested.need_vmcs12_to_shadow_sync) {
-   if (vmx->nested.hv_evmcs)
-   copy_enlightened_to_vmcs12(vmx);
-   else if (enable_shadow_vmcs)
-   copy_shadow_to_vmcs12(vmx);
+   } else  {
+   copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu));
+   if (!vmx->nested.need_vmcs12_to_shadow_sync) {
+   if (vmx->nested.hv_evmcs)
+   copy_enlightened_to_vmcs12(vmx);
+   else if (enable_shadow_vmcs)
+   copy_shadow_to_vmcs12(vmx);
+   }
}
 
BUILD_BUG_ON(sizeof(user_vmx_nested_state->vmcs12) < VMCS12_SIZE);
-- 
2.26.2



[PATCH v2 3/3] KVM: VMX: read idt_vectoring_info a bit earlier

2021-01-14 Thread Maxim Levitsky
This allows it to be printed correctly by the trace print
that follows.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/vmx/vmx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 2af05d3b05909..9b6e7dbf5e2bd 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6771,6 +6771,8 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
}
 
vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
+   vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
+
if (unlikely((u16)vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY))
kvm_machine_check();
 
@@ -6780,7 +6782,6 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
return EXIT_FASTPATH_NONE;
 
vmx->loaded_vmcs->launched = 1;
-   vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
vmx_recover_nmi_blocking(vmx);
vmx_complete_interrupts(vmx);
-- 
2.26.2



[PATCH v2 2/3] KVM: nVMX: add kvm_nested_vmlaunch_resume tracepoint

2021-01-14 Thread Maxim Levitsky
This is very helpful for debugging nested VMX issues.

Signed-off-by: Maxim Levitsky 
---
 arch/x86/kvm/trace.h  | 30 ++
 arch/x86/kvm/vmx/nested.c |  6 ++
 arch/x86/kvm/x86.c|  1 +
 3 files changed, 37 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 2de30c20bc264..663d1b1d8bf64 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -554,6 +554,36 @@ TRACE_EVENT(kvm_nested_vmrun,
__entry->npt ? "on" : "off")
 );
 
+
+/*
+ * Tracepoint for nested VMLAUNCH/VMRESUME
+ */
+TRACE_EVENT(kvm_nested_vmlaunch_resume,
+   TP_PROTO(__u64 rip, __u64 vmcs, __u64 nested_rip,
+__u32 entry_intr_info),
+   TP_ARGS(rip, vmcs, nested_rip, entry_intr_info),
+
+   TP_STRUCT__entry(
+   __field(__u64,  rip )
+   __field(__u64,  vmcs)
+   __field(__u64,  nested_rip  )
+   __field(__u32,  entry_intr_info )
+   ),
+
+   TP_fast_assign(
+   __entry->rip= rip;
+   __entry->vmcs   = vmcs;
+   __entry->nested_rip = nested_rip;
+   __entry->entry_intr_info= entry_intr_info;
+   ),
+
+   TP_printk("rip: 0x%016llx vmcs: 0x%016llx nrip: 0x%016llx "
+ "entry_intr_info: 0x%08x",
+   __entry->rip, __entry->vmcs, __entry->nested_rip,
+   __entry->entry_intr_info)
+);
+
+
 TRACE_EVENT(kvm_nested_intercepts,
TP_PROTO(__u16 cr_read, __u16 cr_write, __u32 exceptions,
 __u32 intercept1, __u32 intercept2, __u32 intercept3),
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 776688f9d1017..cd51b66480d52 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3327,6 +3327,12 @@ enum nvmx_vmentry_status 
nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
 
+   trace_kvm_nested_vmlaunch_resume(kvm_rip_read(vcpu),
+vmx->nested.current_vmptr,
+vmcs12->guest_rip,
+vmcs12->vm_entry_intr_info_field);
+
+
/*
 * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and*
 * nested early checks are disabled.  In the event of a "late" VM-Fail,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a480804ae27a3..7c6e94e32100e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11562,6 +11562,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmlaunch_resume);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmrun);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit_inject);
-- 
2.26.2



[PATCH v2 0/3] VMX: more nested fixes

2021-01-14 Thread Maxim Levitsky
This is hopefully the last fix for VMX nested migration
that finally allows my stress test of migration with a nested guest to pass.

In a nutshell, after an optimization that was done in commit 7952d769c29ca,
some of the vmcs02 fields which L2 can modify freely while it runs
(like GSBASE and such) were not copied back to vmcs12 unless:

1. L1 tries to vmread them (the update is done on intercept),
2. VMCLEAR or VMPTRLD on another vmcs is done, or
3. the nested state is read while the nested guest is running.

What wasn't done was to sync these 'rare' fields when L1 is running
but still has a loaded vmcs12 that might contain stale fields,
because that vmcs was already used to enter a guest and the optimization kicked in.

Plus I added two minor patches to improve the VMX tracepoints
a bit. There is still plenty of room for improvement.

Best regards,
Maxim Levitsky

Maxim Levitsky (3):
  KVM: nVMX: Always call sync_vmcs02_to_vmcs12_rare on migration
  KVM: nVMX: add kvm_nested_vmlaunch_resume tracepoint
  KVM: VMX: read idt_vectoring_info a bit earlier

 arch/x86/kvm/trace.h  | 30 ++
 arch/x86/kvm/vmx/nested.c | 19 ++-
 arch/x86/kvm/vmx/vmx.c|  3 ++-
 arch/x86/kvm/x86.c|  1 +
 4 files changed, 47 insertions(+), 6 deletions(-)

-- 
2.26.2



