Re: [PATCH 2/3] KVM: Stop looking for coalesced MMIO zones if the bus is destroyed

2021-04-15 Thread Jim Mattson
On Mon, Apr 12, 2021 at 3:21 PM Sean Christopherson  wrote:
>
> Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev()
> fails to allocate memory for the new instance of the bus.  If it can't
> instantiate a new bus, unregister_dev() destroys all devices _except_ the
> target device.   But, it doesn't tell the caller that it obliterated the
> bus and invoked the destructor for all devices that were on the bus.  In
> the coalesced MMIO case, this can result in a deleted list entry
> dereference due to attempting to continue iterating on coalesced_zones
> after future entries (in the walk) have been deleted.
>
> Opportunistically add curly braces to the for-loop, which encompasses
> many lines but sneaks by without braces due to the guts being a single
> if statement.
>
> Fixes: f65886606c2d ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
> Cc: sta...@vger.kernel.org
> Reported-by: Hao Sun 
> Signed-off-by: Sean Christopherson 
Reviewed-by: Jim Mattson 


Re: [PATCH 3/3] KVM: Add proper lockdep assertion in I/O bus unregister

2021-04-15 Thread Jim Mattson
On Mon, Apr 12, 2021 at 3:23 PM Sean Christopherson  wrote:
>
> Convert a comment above kvm_io_bus_unregister_dev() into an actual
> lockdep assertion, and opportunistically add curly braces to a multi-line
> for-loop.
>
> Signed-off-by: Sean Christopherson 
> ---
>  virt/kvm/kvm_main.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ab1fa6f92c82..ccc2ef1dbdda 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4485,21 +4485,23 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> return 0;
>  }
>
> -/* Caller must hold slots_lock. */
>  int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
>   struct kvm_io_device *dev)
>  {
> int i, j;
> struct kvm_io_bus *new_bus, *bus;
>
> +   lockdep_assert_held(&kvm->slots_lock);
> +
> bus = kvm_get_bus(kvm, bus_idx);
> if (!bus)
> return 0;
>
> -   for (i = 0; i < bus->dev_count; i++)
> +   for (i = 0; i < bus->dev_count; i++) {
> if (bus->range[i].dev == dev) {
> break;
> }
> +   }
Per coding-style.rst, neither the for loop nor the if-block should have braces.

"Do not unnecessarily use braces where a single statement will do."

Stylistic nits aside,

Reviewed-by: Jim Mattson 


Re: [PATCH 1/3] KVM: Destroy I/O bus devices on unregister failure _after_ sync'ing SRCU

2021-04-15 Thread Jim Mattson
On Mon, Apr 12, 2021 at 3:21 PM Sean Christopherson  wrote:
>
> If allocating a new instance of an I/O bus fails when unregistering a
> device, wait to destroy the device until after all readers are guaranteed
> to see the new null bus.  Destroying devices before the bus is nullified
> could lead to use-after-free since readers expect the devices on their
> reference of the bus to remain valid.
>
> Fixes: f65886606c2d ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Sean Christopherson 
Reviewed-by: Jim Mattson 


Re: [PATCH 0/6] KVM: x86: Make the cause of instruction emulation available to user-space

2021-04-12 Thread Jim Mattson
On Mon, Apr 12, 2021 at 6:09 AM David Edmondson
 wrote:
>
> Instruction emulation happens for a variety of reasons, yet on error
> we have no idea exactly what triggered it. Add a cause of emulation to
> the various originators and pass it upstream when emulation fails.

What is userspace going to do with this information? It's hard to say
whether or not this is the right ABI without more context.


Re: [PATCH] x86/tlb: Flush global mappings when KAISER is disabled

2021-03-25 Thread Jim Mattson
On Thu, Mar 25, 2021 at 9:33 AM Hugh Dickins  wrote:
>
> On Thu, 25 Mar 2021, Borislav Petkov wrote:
>
> > Ok,
> >
> > I tried to be as specific as possible in the commit message so that we
> > don't forget. Please lemme know if I've missed something.
> >
> > Babu, Jim, I'd appreciate it if you ran this to confirm.
Tested-by: Jim Mattson 


Re: [PATCH] KVM: VMX: Check the corresponding bits according to the intel sdm

2021-03-22 Thread Jim Mattson
On Mon, Mar 22, 2021 at 7:37 PM  wrote:
>
> From: Haiwei Li 
>
> According to IA-32 SDM Vol.3D "A.1 BASIC VMX INFORMATION", two inspections
> are missing.
> * Bit 31 is always 0. Earlier versions of this manual specified that the
> VMCS revision identifier was a 32-bit field in bits 31:0 of this MSR. For
> all processors produced prior to this change, bit 31 of this MSR was read
> as 0.

For all *Intel* processors produced prior to this change, bit 31 of
this MSR may have been 0. However, a conforming hypervisor may have
selected a full 32-bit VMCS revision identifier with the high bit set
for nested VMX. Furthermore, there are other vendors, such as VIA,
which have implemented the VMX extensions, and they, too, may have
selected a full 32-bit VMCS revision identifier with the high bit set.
Intel should know better than to change the documentation after the
horse is out of the barn.

What, exactly, is the value you are adding with this check?


Re: [PATCH v6 00/12] SVM cleanup and INVPCID feature support

2021-03-11 Thread Jim Mattson
On Thu, Mar 11, 2021 at 12:32 PM Borislav Petkov  wrote:
>
> On Thu, Mar 11, 2021 at 09:07:55PM +0100, Borislav Petkov wrote:
> > On Wed, Mar 10, 2021 at 07:21:23PM -0600, Babu Moger wrote:
> > > # git bisect good
> > > 59094faf3f618b2d2b2a45acb916437d611cede6 is the first bad commit
> > > commit 59094faf3f618b2d2b2a45acb916437d611cede6
> > > Author: Borislav Petkov 
> > > Date:   Mon Dec 25 13:57:16 2017 +0100
> > >
> > > x86/kaiser: Move feature detection up
> >
> > What is the reproducer?
> >
> > Boot latest 4.9 stable kernel in a SEV guest? Can you send guest
> > .config?
> >
> > Upthread is talking about PCID, so I'm guessing host needs to be Zen3
> > with PCID. Anything else?
>
> That oops points to:
>
> [1.237515] kernel BUG at 
> /build/linux-dqnRSc/linux-4.9.228/arch/x86/kernel/alternative.c:709!
>
> which is:
>
> local_flush_tlb();
> sync_core();
> /* Could also do a CLFLUSH here to speed up CPU recovery; but
>that causes hangs on some VIA CPUs. */
> for (i = 0; i < len; i++)
> BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);   <---
> local_irq_restore(flags);
> return addr;
>
> in text_poke() which basically says that the patching verification
> fails. And you have a local_flush_tlb() before that. And with PCID maybe
> it is not flushing properly or whatnot.
>
> And deep down in the TLB flushing code, it does:
>
> if (kaiser_enabled)
> kaiser_flush_tlb_on_return_to_user();
>
> and that uses PCID...

I would expect kaiser_enabled to be false (and PCIDs not to be used),
since AMD CPUs are not vulnerable to Meltdown.


Re: [PATCH] perf/x86/kvm: Fix inverted pebs_no_isolation check

2021-03-10 Thread Jim Mattson
On Wed, Mar 10, 2021 at 7:24 AM Andi Kleen  wrote:
>
> The pebs_no_isolation optimization check is inverted. We want to disable
> PEBS isolation when the microcode is at least the revision in the table,
> not for older microcode. So remove the extra !.
The original code was correct because of the double negative. A "good"
CPU _has_ PEBS isolation. It's bad to have pebs_no_isolation.
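
For reference (from memory, so treat it as a sketch rather than a quote), the
check in intel_check_pebs_isolation() reads roughly:

    /* A "good" CPU has isolation, so pebs_no_isolation is the negation. */
    x86_pmu.pebs_no_isolation = !x86_cpu_has_min_microcode_rev(isolation_ucodes);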


Re: [PATCH v2] x86/perf: Use RET0 as default for guest_get_msrs to handle "no PMU" case

2021-03-09 Thread Jim Mattson
On Tue, Mar 9, 2021 at 9:10 AM Sean Christopherson  wrote:
>
> Initialize x86_pmu.guest_get_msrs to return 0/NULL to handle the "nop"
> case.  Patching in perf_guest_get_msrs_nop() during setup does not work
> if there is no PMU, as setup bails before updating the static calls,
> leaving x86_pmu.guest_get_msrs NULL and thus a complete nop.  Ultimately,
> this causes VMX abort on VM-Exit due to KVM putting random garbage from
> the stack into the MSR load list.
>
> Add a comment in KVM to note that nr_msrs is valid if and only if the
> return value is non-NULL.
>
> Fixes: abd562df94d1 ("x86/perf: Use static_call for x86_pmu.guest_get_msrs")
> Cc: Like Xu 
> Cc: Paolo Bonzini 
> Cc: Jim Mattson 
> Reported-by: Dmitry Vyukov 
> Reported-by: syzbot+cce9ef2dd25246f81...@syzkaller.appspotmail.com
> Suggested-by: Peter Zijlstra 
> Signed-off-by: Sean Christopherson 
> ---
>
> v2:
>  - Use __static_call_return0 to return NULL instead of manually checking
>the hook at invocation.  [Peter]
>  - Rebase to tip/sched/core, commit 4117cebf1a9f ("psi: Optimize task
>switch inside shared cgroups").
>
...
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 50810d471462..32cf8287d4a7 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6580,8 +6580,8 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
> int i, nr_msrs;
> struct perf_guest_switch_msr *msrs;
>
> +   /* Note, nr_msrs may be garbage if perf_guest_get_msrs() returns NULL. */

You could drop the scary comment with a profligate initialization of
nr_msrs to 0.
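
Something along these lines (sketch):

    int i, nr_msrs = 0;
    struct perf_guest_switch_msr *msrs;

    msrs = perf_guest_get_msrs(&nr_msrs);

With that, nr_msrs stays well-defined even when the static call short-circuits
and returns NULL.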

[Apologies to those seeing this twice. I blame gmail.]


Re: [PATCH v6 00/12] SVM cleanup and INVPCID feature support

2021-02-23 Thread Jim Mattson
Any updates? What should we be telling customers with Debian 9 guests? :-)

On Fri, Jan 22, 2021 at 5:52 PM Babu Moger  wrote:
>
>
>
> On 1/21/21 5:51 PM, Babu Moger wrote:
> >
> >
> > On 1/20/21 9:10 PM, Babu Moger wrote:
> >>
> >>
> >> On 1/20/21 3:45 PM, Babu Moger wrote:
> >>>
> >>>
> >>> On 1/20/21 3:14 PM, Jim Mattson wrote:
> >>>> On Tue, Jan 19, 2021 at 3:45 PM Babu Moger  wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 1/19/21 5:01 PM, Jim Mattson wrote:
> >>>>>> On Mon, Sep 14, 2020 at 11:33 AM Babu Moger  wrote:
> >>>>>>
> >>>>>>> Thanks Paolo. Tested Guest/nested guest/kvm units tests. Everything 
> >>>>>>> works
> >>>>>>> as expected.
> >>>>>>
> >>>>>> Debian 9 does not like this patch set. As a kvm guest, it panics on a
> >>>>>> Milan CPU unless booted with 'nopcid'. Gmail mangles long lines, so
> >>>>>> please see the attached kernel log snippet. Debian 10 is fine, so I
> >>>>>> assume this is a guest bug.
> >>>>>>
> >>>>>
> >>>>> We had an issue with PCID feature earlier. This was showing only with 
> >>>>> SEV
> >>>>> guests. It is resolved recently. Do you think it is not related that?
> >>>>> Here are the patch set.
> >>>>> https://lore.kernel.org/kvm/160521930597.32054.4906933314022910996.stgit@bmoger-ubuntu/
> >>>>
> >>>> The Debian 9 release we tested is not an SEV guest.
> >>> ok. I have not tested Debian 9 before. I will try now. Will let you know
> >>> how it goes. thanks
> >>>
> >>
> >> I have reproduced the issue locally. Will investigate. thanks
> >>
> > Few updates.
> > 1. Like Jim mentioned earlier, this appears to be guest kernel issue.
> > Debian 9 runs the base kernel 4.9.0-14. Problem can be seen consistently
> > with this kernel.
> >
> > 2. This guest kernel(4.9.0-14) does not like the new feature INVPCID.
> >
> > 3. System comes up fine when invpcid feature is disabled with the boot
> > parameter "noinvpcid" and also with "nopcid". nopcid disables both pcid
> > and invpcid.
> >
> > 4. Upgraded the guest kernel to v5.0 and system comes up fine.
> >
> > 5. Also system comes up fine with latest guest kernel 5.11.0-rc4.
> >
> > I did not bisect further yet.
> > Babu
> > Thanks
>
>
> Some more update:
>  System comes up fine with kernel v4.9(checked out on upstream tag v4.9).
> So, I am assuming this is something specific to Debian 4.9.0-14 kernel.
>
> Note: I couldn't go back prior versions(v4.8 or earlier) due to compile
> issues.
> Thanks
> Babu
>


Re: [PATCH v2 1/3] KVM: x86: dump_vmcs should not assume GUEST_IA32_EFER is valid

2021-02-23 Thread Jim Mattson
On Fri, Feb 19, 2021 at 6:46 AM David Edmondson
 wrote:
>
> If the VM entry/exit controls for loading/saving MSR_EFER are either
> not available (an older processor or explicitly disabled) or not
> used (host and guest values are the same), reading GUEST_IA32_EFER
> from the VMCS returns an inaccurate value.
>
> Because of this, in dump_vmcs() don't use GUEST_IA32_EFER to decide
> whether to print the PDPTRs - do so if the EPT is in use and CR4.PAE
> is set.
>
> Fixes: 4eb64dce8d0a ("KVM: x86: dump VMCS on invalid entry")
> Signed-off-by: David Edmondson 
> ---
>  arch/x86/kvm/vmx/vmx.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index eb69fef57485..818051c9fa10 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5759,7 +5759,6 @@ void dump_vmcs(void)
> u32 vmentry_ctl, vmexit_ctl;
> u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
> unsigned long cr4;
> -   u64 efer;
>
> if (!dump_invalid_vmcs) {
> pr_warn_ratelimited("set kvm_intel.dump_invalid_vmcs=1 to 
> dump internal KVM state.\n");
> @@ -5771,7 +5770,6 @@ void dump_vmcs(void)
> cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
> cr4 = vmcs_readl(GUEST_CR4);
> -   efer = vmcs_read64(GUEST_IA32_EFER);
> secondary_exec_control = 0;
> if (cpu_has_secondary_exec_ctrls())
> secondary_exec_control = 
> vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> @@ -5784,8 +5782,7 @@ void dump_vmcs(void)
>cr4, vmcs_readl(CR4_READ_SHADOW), 
> vmcs_readl(CR4_GUEST_HOST_MASK));
> pr_err("CR3 = 0x%016lx\n", vmcs_readl(GUEST_CR3));
> if ((secondary_exec_control & SECONDARY_EXEC_ENABLE_EPT) &&
> -   (cr4 & X86_CR4_PAE) && !(efer & EFER_LMA))
> -   {
> +   (cr4 & X86_CR4_PAE)) {

Assuming that we really want to restrict the printing of the PDPTEs, I
think you also need to test "cr0 & CR0.PG" (cf. section 26.3.1.6 of
the SDM, volume 3).
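
Something like this, say (a sketch; it assumes a local cr0 read via
vmcs_readl(GUEST_CR0) is added alongside the existing cr4 read):

    if ((secondary_exec_control & SECONDARY_EXEC_ENABLE_EPT) &&
        (cr0 & X86_CR0_PG) && (cr4 & X86_CR4_PAE)) {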


Re: [PATCH v2 1/3] KVM: x86: dump_vmcs should not assume GUEST_IA32_EFER is valid

2021-02-23 Thread Jim Mattson
On Tue, Feb 23, 2021 at 2:51 PM Sean Christopherson  wrote:
>
> On Fri, Feb 19, 2021, David Edmondson wrote:
> > If the VM entry/exit controls for loading/saving MSR_EFER are either
> > not available (an older processor or explicitly disabled) or not
> > used (host and guest values are the same), reading GUEST_IA32_EFER
> > from the VMCS returns an inaccurate value.
> >
> > Because of this, in dump_vmcs() don't use GUEST_IA32_EFER to decide
> > whether to print the PDPTRs - do so if the EPT is in use and CR4.PAE
> > is set.
>
> This isn't necessarily correct either.  In a way, it's less correct as PDPTRs
> are more likely to be printed when they shouldn't, assuming most guests are
> 64-bit guests.  It's annoying to calculate the effective guest EFER, but not
> so awful that it's worth risking confusion over PDPTRs.

I still prefer a dump_vmcs that always dumps every VMCS field. But if
you really want to skip printing the PDPTEs when they're irrelevant,
can't you just use the "IA-32e mode guest" VM-entry control as a proxy
for EFER.LMA?
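
E.g. (a sketch; vmentry_ctl is already read near the top of dump_vmcs()):

    if ((secondary_exec_control & SECONDARY_EXEC_ENABLE_EPT) &&
        (cr4 & X86_CR4_PAE) && !(vmentry_ctl & VM_ENTRY_IA32E_MODE)) {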


Re: [PATCH] KVM: x86: dump_vmcs should not assume GUEST_IA32_EFER is valid

2021-02-18 Thread Jim Mattson
On Thu, Feb 18, 2021 at 8:35 AM Sean Christopherson  wrote:
>
> On Thu, Feb 18, 2021, Paolo Bonzini wrote:
> > On 18/02/21 13:56, David Edmondson wrote:
> > > On Thursday, 2021-02-18 at 12:54:52 +01, Paolo Bonzini wrote:
> > >
> > > > On 18/02/21 11:04, David Edmondson wrote:
> > > > > When dumping the VMCS, retrieve the current guest value of EFER from
> > > > > the kvm_vcpu structure if neither VM_EXIT_SAVE_IA32_EFER or
> > > > > VM_ENTRY_LOAD_IA32_EFER is set, which can occur if the processor does
> > > > > not support the relevant VM-exit/entry controls.
> > > >
> > > > Printing vcpu->arch.efer is not the best choice however.  Could we dump
> > > > the whole MSR load/store area instead?
> > >
> > > I'm happy to do that, and think that it would be useful, but it won't
> > > help with the original problem (which I should have explained more).
> > >
> > > If the guest has EFER_LMA set but we aren't using the entry/exit
> > > controls, vm_read64(GUEST_IA32_EFER) returns 0, causing dump_vmcs() to
> > > erroneously dump the PDPTRs.
> >
> > Got it now.  It would sort of help, because while dumping the MSR load/store
> > area you could get hold of the real EFER, and use it to decide whether to
> > dump the PDPTRs.
>
> EFER isn't guaranteed to be in the load list, either, e.g. if guest and host
> have the same desired value.
>
> The proper way to retrieve the effective EFER is to reuse the logic in
> nested_vmx_calc_efer(), i.e. look at VM_ENTRY_IA32E_MODE if EFER isn't being
> loaded via VMCS.

Shouldn't dump_vmcs() simply dump the contents of the VMCS, in its
entirety? What does it matter what the value of EFER is?


Re: [PATCH 2/3] KVM: x86: Advertise INVPCID by default

2021-02-11 Thread Jim Mattson
On Thu, Feb 11, 2021 at 4:34 PM Sean Christopherson  wrote:
>
> Advertise INVPCID by default (if supported by the host kernel) instead
> of having both SVM and VMX opt in.  INVPCID was opt in when it was a
> VMX only feature so that KVM wouldn't prematurely advertise support
> if/when it showed up in the kernel on AMD hardware.
>
> Signed-off-by: Sean Christopherson 
Reviewed-by: Jim Mattson 


Re: [PATCH 1/3] KVM: SVM: Intercept INVPCID when it's disabled to inject #UD

2021-02-11 Thread Jim Mattson
On Thu, Feb 11, 2021 at 4:34 PM Sean Christopherson  wrote:
>
> Intercept INVPCID if it's disabled in the guest, even when using NPT,
> as KVM needs to inject #UD in this case.
>
> Fixes: 4407a797e941 ("KVM: SVM: Enable INVPCID feature on AMD")
> Cc: Babu Moger 
> Signed-off-by: Sean Christopherson 
Reviewed-by: Jim Mattson 


Re: [PATCH 3/3] KVM: VMX: Allow INVPCID in guest without PCID

2021-02-11 Thread Jim Mattson
On Thu, Feb 11, 2021 at 4:34 PM Sean Christopherson  wrote:
>
> Remove the restriction that prevents VMX from exposing INVPCID to the
> guest without PCID also being exposed to the guest.  The justification of
> the restriction is that INVPCID will #UD if it's disabled in the VMCS.
> While that is a true statement, it's also true that RDTSCP will #UD if
> it's disabled in the VMCS.  Neither of those things has any dependency
> whatsoever on the guest being able to set CR4.PCIDE=1, which is what is
> effectively allowed by exposing PCID to the guest.
>
> Removing the bogus restriction aligns VMX with SVM, and also allows for
> an interesting configuration.  INVPCID is the fastest way to do a global
> TLB flush, e.g. see native_flush_tlb_global().  Allowing INVPCID without
> PCID would let a guest use the expedited flush while also limiting the
> number of ASIDs consumed by the guest.
>
> Signed-off-by: Sean Christopherson 
I always thought this was a bizarre one-off restriction.
Reviewed-by: Jim Mattson 


[tip: perf/core] perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[]

2021-02-10 Thread tip-bot2 for Jim Mattson
The following commit has been merged into the perf/core branch of tip:

Commit-ID: b3c3361fe325074d4144c29d46daae4fc5a268d5
Gitweb:
https://git.kernel.org/tip/b3c3361fe325074d4144c29d46daae4fc5a268d5
Author:Jim Mattson 
AuthorDate:Fri, 05 Feb 2021 11:13:24 -08:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 10 Feb 2021 14:44:54 +01:00

perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[]

Cascade Lake Xeon parts have the same model number as Skylake Xeon
parts, so they are tagged with the intel_pebs_isolation
quirk. However, as with Skylake Xeon H0 stepping parts, the PEBS
isolation issue is fixed in all microcode versions.

Add the Cascade Lake Xeon steppings (5, 6, and 7) to the
isolation_ucodes[] table so that these parts benefit from Andi's
optimization in commit 9b545c04abd4f ("perf/x86/kvm: Avoid unnecessary
work in guest filtering").

Signed-off-by: Jim Mattson 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Andi Kleen 
Link: https://lkml.kernel.org/r/20210205191324.2889006-1-jmatt...@google.com
---
 arch/x86/events/intel/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 67a7246..5bac48d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4513,6 +4513,9 @@ static const struct x86_cpu_desc isolation_ucodes[] = {
INTEL_CPU_DESC(INTEL_FAM6_BROADWELL_X,   2, 0x0b000014),
INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 3, 0x00000021),
INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 4, 0x00000000),
+   INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 5, 0x00000000),
+   INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 6, 0x00000000),
+   INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 7, 0x00000000),
INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_L, 3, 0x0000007c),
INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE,   3, 0x0000007c),
INTEL_CPU_DESC(INTEL_FAM6_KABYLAKE,  9, 0x0000004e),


Trouble with perf_guest_get_msrs() and pebs_no_isolation

2021-02-09 Thread Jim Mattson
On a host that suffers from pebs_no_isolation, perf_guest_get_msrs()
adds an entry to cpuc->guest_switch_msrs for
MSR_IA32_PEBS_ENABLE. Kvm's atomic_switch_perf_msrs() is the only
caller of perf_guest_get_msrs(). If atomic_switch_perf_msrs() finds an
entry for MSR_IA32_PEBS_ENABLE in cpuc->guest_switch_msrs, it puts the
"host" value of this entry on the VM-exit MSR-load list for the
current VMCS. At the next VM-exit, that "host" value will be written
to MSR_IA32_PEBS_ENABLE.
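
For reference, the relevant loop in atomic_switch_perf_msrs() looks roughly
like this (quoted from memory, so treat it as a sketch):

    msrs = perf_guest_get_msrs(&nr_msrs);
    if (!msrs)
        return;

    for (i = 0; i < nr_msrs; i++)
        if (msrs[i].host == msrs[i].guest)
            clear_atomic_switch_msr(vmx, msrs[i].msr);
        else
            add_atomic_switch_msr(vmx, msrs[i].msr, msrs[i].guest,
                                  msrs[i].host, false);

The msrs[i].host snapshot taken here is what ends up on the VM-exit MSR-load
list.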

The problem is that by the next VM-exit, that "host" value may be
stale. Though maskable interrupts are blocked from well before
atomic_switch_perf_msrs() to the next VM-entry, PMIs are delivered as
NMIs. Moreover, due to PMI throttling, the PMI handler may clear bits
in MSR_IA32_PEBS_ENABLE. See the comment to that effect in
handle_pmi_common(). In fact, by the time that perf_guest_get_msrs()
returns to its caller, the "host" value that it has recorded for
MSR_IA32_PEBS_ENABLE could already be stale.

What happens if a VM-exit sets a bit in MSR_IA32_PEBS_ENABLE that the
perf subsystem thinks should be clear? In the short term, nothing
happens at all. But note that this situation may not get better at the
next VM-exit, because kvm won't add MSR_IA32_PEBS_ENABLE to the
VM-exit MSR-load list if perf_guest_get_msrs() reports that the "host"
value of the MSR is 0. So, if the new MSR_IA32_PEBS_ENABLE "host"
value is 0, the stale bits can actually persist for a long time.

If, at some point in the future, the perf subsystem programs a counter
overflow interrupt on the same PMC for a PEBS-capable event, we are in
for some nasty surprises. (Note that the perf subsystem never
*intentionally* programs a PMC for both PEBS and counter overflow
interrupts at the same time.)

If a PEBS assist is triggered while in VMX non-root operation, the CPU
will try to access the host's DS_AREA using the guest's page tables,
and a page fault is likely (specifically on a read of the PEBS
interrupt threshold at offset 0x38 in the DS_AREA).

If the PEBS interrupt threshold is met while in VMX root operation,
two separate PMIs are generated: one for the PEBS interrupt threshold
and one for the counter overflow. This results in a message from
unknown_nmi_error(): "Uhhuh. NMI received for unknown reason  on
CPU ."


[PATCH RESEND] perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[]

2021-02-05 Thread Jim Mattson
Cascade Lake Xeon parts have the same model number as Skylake Xeon
parts, so they are tagged with the intel_pebs_isolation
quirk. However, as with Skylake Xeon H0 stepping parts, the PEBS
isolation issue is fixed in all microcode versions.

Add the Cascade Lake Xeon steppings (5, 6, and 7) to the
isolation_ucodes[] table so that these parts benefit from Andi's
optimization in commit 9b545c04abd4f ("perf/x86/kvm: Avoid unnecessary
work in guest filtering").

Signed-off-by: Jim Mattson 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
---
 arch/x86/events/intel/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d4569bfa83e3..4faaef3a8f6c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4397,6 +4397,9 @@ static const struct x86_cpu_desc isolation_ucodes[] = {
INTEL_CPU_DESC(INTEL_FAM6_BROADWELL_X,   2, 0x0b000014),
INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 3, 0x00000021),
INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 4, 0x00000000),
+   INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 5, 0x00000000),
+   INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 6, 0x00000000),
+   INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 7, 0x00000000),
INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_L, 3, 0x0000007c),
INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE,   3, 0x0000007c),
INTEL_CPU_DESC(INTEL_FAM6_KABYLAKE,  9, 0x0000004e),
-- 
2.30.0.478.g8a0d178c01-goog



Re: [PATCH v3 7/9] KVM: VMX: Add guest physical address check in EPT violation and misconfig

2021-01-27 Thread Jim Mattson
On Wed, Jan 20, 2021 at 1:16 PM Jim Mattson  wrote:
>
> On Fri, Jan 15, 2021 at 11:35 AM Jim Mattson  wrote:
> >
> > On Fri, Oct 23, 2020 at 10:43 AM Paolo Bonzini  wrote:
> > >
> > > On 23/10/20 19:23, Jim Mattson wrote:
> > > >> The information that we need is _not_ that provided by the advanced
> > > >> VM-exit information (or by a page walk).  If a page is neither writable
> > > >> nor executable, the advanced information doesn't say if the injected 
> > > >> #PF
> > > >> should be a W=1 or a F=1 fault.  We need the information in bits 0..2 
> > > >> of
> > > >> the exit qualification for the final access, which however is not
> > > >> available for the paging-structure access.
> > > >>
> > > > Are you planning to extend the emulator, then, to support all
> > > > instructions? I'm not sure where you are going with this.
> > >
> > > I'm going to fix the bit 8=1 case, but for bit 8=0 there's not much that
> > > you can do.  In all likelihood the guest is buggy anyway.
> >
> > Did this drop off your radar? Are you still planning to fix the bit8=1
> > case to use advanced EPT exit qualification information? Or did I just
> > miss it?
>
> Paolo,
> If you're not working on this, do you mind if I ask Aaron to take a look at 
> it?

Ugh. The advanced EPT exit qualification contains nothing useful here,
AFAICT. It only contains x86 page protection information--nothing
about the access itself.


[PATCH] perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[]

2021-01-27 Thread Jim Mattson
Cascade Lake Xeon parts have the same model number as Skylake Xeon
parts, so they are tagged with the intel_pebs_isolation
quirk. However, as with Skylake Xeon H0 stepping parts, the PEBS
isolation issue is fixed in all microcode versions.

Add the Cascade Lake Xeon steppings (5, 6, and 7) to the
isolation_ucodes[] table so that these parts benefit from Andi's
optimization in commit 9b545c04abd4f ("perf/x86/kvm: Avoid unnecessary
work in guest filtering").

Signed-off-by: Jim Mattson 
Cc: Andi Kleen 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
---
 arch/x86/events/intel/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index af457f8cb29d..af28b2f5f895 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4383,6 +4383,9 @@ static const struct x86_cpu_desc isolation_ucodes[] = {
INTEL_CPU_DESC(INTEL_FAM6_BROADWELL_X,   2, 0x0b000014),
INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 3, 0x00000021),
INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 4, 0x00000000),
+   INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 5, 0x00000000),
+   INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 6, 0x00000000),
+   INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_X, 7, 0x00000000),
INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE_L, 3, 0x0000007c),
INTEL_CPU_DESC(INTEL_FAM6_SKYLAKE,   3, 0x0000007c),
INTEL_CPU_DESC(INTEL_FAM6_KABYLAKE,  9, 0x0000004e),
-- 
2.30.0.280.ga3ce27912f-goog



Re: [PATCH v3 7/9] KVM: VMX: Add guest physical address check in EPT violation and misconfig

2021-01-20 Thread Jim Mattson
On Fri, Jan 15, 2021 at 11:35 AM Jim Mattson  wrote:
>
> On Fri, Oct 23, 2020 at 10:43 AM Paolo Bonzini  wrote:
> >
> > On 23/10/20 19:23, Jim Mattson wrote:
> > >> The information that we need is _not_ that provided by the advanced
> > >> VM-exit information (or by a page walk).  If a page is neither writable
> > >> nor executable, the advanced information doesn't say if the injected #PF
> > >> should be a W=1 or a F=1 fault.  We need the information in bits 0..2 of
> > >> the exit qualification for the final access, which however is not
> > >> available for the paging-structure access.
> > >>
> > > Are you planning to extend the emulator, then, to support all
> > > instructions? I'm not sure where you are going with this.
> >
> > I'm going to fix the bit 8=1 case, but for bit 8=0 there's not much that
> > you can do.  In all likelihood the guest is buggy anyway.
>
> Did this drop off your radar? Are you still planning to fix the bit8=1
> case to use advanced EPT exit qualification information? Or did I just
> miss it?

Paolo,
If you're not working on this, do you mind if I ask Aaron to take a look at it?


Re: [PATCH v6 00/12] SVM cleanup and INVPCID feature support

2021-01-20 Thread Jim Mattson
On Tue, Jan 19, 2021 at 3:45 PM Babu Moger  wrote:
>
>
>
> On 1/19/21 5:01 PM, Jim Mattson wrote:
> > On Mon, Sep 14, 2020 at 11:33 AM Babu Moger  wrote:
> >
> >> Thanks Paolo. Tested Guest/nested guest/kvm units tests. Everything works
> >> as expected.
> >
> > Debian 9 does not like this patch set. As a kvm guest, it panics on a
> > Milan CPU unless booted with 'nopcid'. Gmail mangles long lines, so
> > please see the attached kernel log snippet. Debian 10 is fine, so I
> > assume this is a guest bug.
> >
>
> We had an issue with PCID feature earlier. This was showing only with SEV
> guests. It is resolved recently. Do you think it is not related that?
> Here are the patch set.
> https://lore.kernel.org/kvm/160521930597.32054.4906933314022910996.stgit@bmoger-ubuntu/

The Debian 9 release we tested is not an SEV guest.


Re: [PATCH v6 00/12] SVM cleanup and INVPCID feature support

2021-01-19 Thread Jim Mattson
On Mon, Sep 14, 2020 at 11:33 AM Babu Moger  wrote:

> Thanks Paolo. Tested Guest/nested guest/kvm units tests. Everything works
> as expected.

Debian 9 does not like this patch set. As a kvm guest, it panics on a
Milan CPU unless booted with 'nopcid'. Gmail mangles long lines, so
please see the attached kernel log snippet. Debian 10 is fine, so I
assume this is a guest bug.


debian9.panic
Description: Binary data


Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

2021-01-15 Thread Jim Mattson
On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal  wrote:
>
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on 
> whether
> that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to 
> enable
> it.
>
> Signed-off-by: Mohammed Gamal 
> ---
>  arch/x86/kvm/vmx/vmx.c | 15 ++-
>  arch/x86/kvm/vmx/vmx.h |  3 +++
>  arch/x86/kvm/x86.c |  2 +-
>  3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 819c185adf09..dc778c7b5a06 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -129,6 +129,9 @@ static bool __read_mostly enable_preemption_timer = 1;
>  module_param_named(preemption_timer, enable_preemption_timer, bool, S_IRUGO);
>  #endif
>
> +extern bool __read_mostly allow_smaller_maxphyaddr;

Since this variable is in the kvm module rather than the kvm_intel
module, its current setting is preserved across "rmmod kvm_intel;
modprobe kvm_intel." That is, if set to true, it doesn't revert to
false after "rmmod kvm_intel." Is that the intended behavior?


Re: [PATCH v3 7/9] KVM: VMX: Add guest physical address check in EPT violation and misconfig

2021-01-15 Thread Jim Mattson
On Fri, Oct 23, 2020 at 10:43 AM Paolo Bonzini  wrote:
>
> On 23/10/20 19:23, Jim Mattson wrote:
> >> The information that we need is _not_ that provided by the advanced
> >> VM-exit information (or by a page walk).  If a page is neither writable
> >> nor executable, the advanced information doesn't say if the injected #PF
> >> should be a W=1 or a F=1 fault.  We need the information in bits 0..2 of
> >> the exit qualification for the final access, which however is not
> >> available for the paging-structure access.
> >>
> > Are you planning to extend the emulator, then, to support all
> > instructions? I'm not sure where you are going with this.
>
> I'm going to fix the bit 8=1 case, but for bit 8=0 there's not much that
> you can do.  In all likelihood the guest is buggy anyway.

Did this drop off your radar? Are you still planning to fix the bit8=1
case to use advanced EPT exit qualification information? Or did I just
miss it?

> It would be possible to only do the decode part of the emulator to get
> the PFEC (matching the GVA from the vmexit to the memory operand, for
> example, and retrying if the instruction is unexpected).  Then one would
> only need enough VEX/EVEX parsing to process the decoding.
>
> Paolo
>


Re: [PATCH] kvm: Add emulation for movups/movupd

2021-01-12 Thread Jim Mattson
On Wed, Apr 4, 2018 at 10:44 PM Paolo Bonzini  wrote:
>
> On 04/04/2018 19:35, Stefan Fritsch wrote:
> > On Wednesday, 4 April 2018 19:24:20 CEST Paolo Bonzini wrote:
> >> On 04/04/2018 19:10, Konrad Rzeszutek Wilk wrote:
> >>> Should there be a corresponding test-case?
> >>
> >> Good point!  Stefan, could you write one?
> >
> > Is there infrastructure for such tests? If yes, can you give me a pointer to
> > it?
>
> Yes, check out x86/emulator.c in
> https://git.kernel.org/pub/scm/virt/kvm/kvm-unit-tests.git.  There is
> already a movaps test.

Whatever became of this unit test? I don't see it in the
kvm-unit-tests repository.


Re: UBSAN: shift-out-of-bounds in kvm_vcpu_after_set_cpuid

2021-01-11 Thread Jim Mattson
It looks like userspace can possibly induce this by providing guest
CPUID information with a "physical address width" of 64 in leaf
0x80000008.

Perhaps cpuid_query_maxphyaddr() should just look at the low 5 bits of
CPUID.80000008H:EAX? Better would be to return an error for
out-of-range values, but I understand that the kvm community's stance
is that, in general, guest CPUID information should not be validated
by kvm.

On Tue, Dec 22, 2020 at 12:36 AM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:5e60366d Merge tag 'fallthrough-fixes-clang-5.11-rc1' of g..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=11c7046b50
> kernel config:  https://syzkaller.appspot.com/x/.config?x=db720fe37a6a41d8
> dashboard link: https://syzkaller.appspot.com/bug?extid=e87846c48bf72bc85311
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> userspace arch: i386
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+e87846c48bf72bc85...@syzkaller.appspotmail.com
>
> 
> UBSAN: shift-out-of-bounds in arch/x86/kvm/mmu.h:52:16
> shift exponent 64 is too large for 64-bit type 'long long unsigned int'
> CPU: 1 PID: 11156 Comm: syz-executor.1 Not tainted 5.10.0-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x107/0x163 lib/dump_stack.c:120
>  ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
>  __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
>  rsvd_bits arch/x86/kvm/mmu.h:52 [inline]
>  kvm_vcpu_after_set_cpuid.cold+0x35/0x3a arch/x86/kvm/cpuid.c:181
>  kvm_vcpu_ioctl_set_cpuid+0x28e/0x970 arch/x86/kvm/cpuid.c:273
>  kvm_arch_vcpu_ioctl+0x1091/0x2d70 arch/x86/kvm/x86.c:4699
>  kvm_vcpu_ioctl+0x7b9/0xdb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3386
>  kvm_vcpu_compat_ioctl+0x1a2/0x340 
> arch/x86/kvm/../../../virt/kvm/kvm_main.c:3430
>  __do_compat_sys_ioctl+0x1d3/0x230 fs/ioctl.c:842
>  do_syscall_32_irqs_on arch/x86/entry/common.c:78 [inline]
>  __do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:137
>  do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:160
>  entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
> RIP: 0023:0xf7fe8549
> Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 
> 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 
> 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
> RSP: 002b:f55e20cc EFLAGS: 0296 ORIG_RAX: 0036
> RAX: ffda RBX: 0005 RCX: 4008ae8a
> RDX: 20c0 RSI:  RDI: 
> RBP:  R08:  R09: 
> R10:  R11:  R12: 
> R13:  R14:  R15: 
> 
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


Re: [PATCH 2/2] KVM: SVM: Add support for Virtual SPEC_CTRL

2020-12-10 Thread Jim Mattson
On Thu, Dec 10, 2020 at 1:26 PM Babu Moger  wrote:
>
> Hi Jim,
>
> > -Original Message-----
> > From: Jim Mattson 
> > Sent: Monday, December 7, 2020 5:06 PM
> > To: Moger, Babu 
> > Cc: Paolo Bonzini ; Thomas Gleixner
> > ; Ingo Molnar ; Borislav Petkov
> > ; Yu, Fenghua ; Tony Luck
> > ; Wanpeng Li ; kvm list
> > ; Lendacky, Thomas ;
> > Peter Zijlstra ; Sean Christopherson
> > ; Joerg Roedel ; the arch/x86
> > maintainers ; kyung.min.p...@intel.com; LKML  > ker...@vger.kernel.org>; Krish Sadhukhan ; H .
> > Peter Anvin ; mgr...@linux.intel.com; Vitaly Kuznetsov
> > ; Phillips, Kim ; Huang2, Wei
> > 
> > Subject: Re: [PATCH 2/2] KVM: SVM: Add support for Virtual SPEC_CTRL
> >
> > On Mon, Dec 7, 2020 at 2:38 PM Babu Moger  wrote:
> > >
> > > Newer AMD processors have a feature to virtualize the use of the
> > > SPEC_CTRL MSR. When supported, the SPEC_CTRL MSR is automatically
> > > virtualized and no longer requires hypervisor intervention.
> > >
> > > This feature is detected via CPUID function 0x8000000A_EDX[20]:
> > > GuestSpecCtrl.
> > >
> > > Hypervisors are not required to enable this feature since it is
> > > automatically enabled on processors that support it.
> > >
> > > When this feature is enabled, the hypervisor no longer has to
> > > intercept the usage of the SPEC_CTRL MSR and no longer is required to
> > > save and restore the guest SPEC_CTRL setting when switching
> > > hypervisor/guest modes.  The effective SPEC_CTRL setting is the guest
> > > SPEC_CTRL setting or'ed with the hypervisor SPEC_CTRL setting. This
> > > allows the hypervisor to ensure a minimum SPEC_CTRL if desired.
> > >
> > > This support also fixes an issue where a guest may sometimes see an
> > > inconsistent value for the SPEC_CTRL MSR on processors that support
> > > this feature. With the current SPEC_CTRL support, the first write to
> > > SPEC_CTRL is intercepted and the virtualized version of the SPEC_CTRL
> > > MSR is not updated. When the guest reads back the SPEC_CTRL MSR, it
> > > will be 0x0, instead of the actual expected value. There isn’t a
> > > security concern here, because the host SPEC_CTRL value is or’ed with
> > > the Guest SPEC_CTRL value to generate the effective SPEC_CTRL value.
> > > KVM writes with the guest's virtualized SPEC_CTRL value to SPEC_CTRL
> > > MSR just before the VMRUN, so it will always have the actual value
> > > even though it doesn’t appear that way in the guest. The guest will
> > > only see the proper value for the SPEC_CTRL register if the guest was
> > > to write to the SPEC_CTRL register again. With Virtual SPEC_CTRL
> > > support, the MSR interception of SPEC_CTRL is disabled during
> > > vmcb_init, so this will no longer be an issue.
> > >
> > > Signed-off-by: Babu Moger 
> > > ---
> >
> > Shouldn't there be some code to initialize a new "guest SPEC_CTRL"
> > value in the VMCB, both at vCPU creation, and at virtual processor reset?
>
> Yes, I think so. I will check on this.
>
> >
> > >  arch/x86/kvm/svm/svm.c |   17 ++---
> > >  1 file changed, 14 insertions(+), 3 deletions(-)
> > >
> > > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > > > index 79b3a564f1c9..3d73ec0cdb87 100644
> > > --- a/arch/x86/kvm/svm/svm.c
> > > +++ b/arch/x86/kvm/svm/svm.c
> > > @@ -1230,6 +1230,14 @@ static void init_vmcb(struct vcpu_svm *svm)
> > >
> > > svm_check_invpcid(svm);
> > >
> > > +   /*
> > > +* If the host supports V_SPEC_CTRL then disable the interception
> > > +* of MSR_IA32_SPEC_CTRL.
> > > +*/
> > > +   if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL))
> > > > +   set_msr_interception(&svm->vcpu, svm->msrpm,
> > MSR_IA32_SPEC_CTRL,
> > > +1, 1);
> > > +
> > > > if (kvm_vcpu_apicv_active(&svm->vcpu))
> > > avic_init_vmcb(svm);
> > >
> > > @@ -3590,7 +3598,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
> > >  * is no need to worry about the conditional branch over the wrmsr
> > >  * being speculatively taken.
> > >  */
> > > -   x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl);
> > > +   if (!static_cpu_has(X86_FEATURE_V_SPEC_CTRL

Re: [PATCH 1/2] x86/cpufeatures: Add the Virtual SPEC_CTRL feature

2020-12-09 Thread Jim Mattson
On Wed, Dec 9, 2020 at 2:39 PM Babu Moger  wrote:
>
>
>
> On 12/7/20 5:22 PM, Jim Mattson wrote:
> > On Mon, Dec 7, 2020 at 2:38 PM Babu Moger  wrote:
> >>
> >> Newer AMD processors have a feature to virtualize the use of the SPEC_CTRL
> >> MSR. This feature is identified via CPUID 0x8000000A_EDX[20]. When present,
> >> the SPEC_CTRL MSR is automatically virtualized and no longer requires
> >> hypervisor intervention.
> >>
> >> Signed-off-by: Babu Moger 
> >> ---
> >>  arch/x86/include/asm/cpufeatures.h |1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/arch/x86/include/asm/cpufeatures.h 
> >> b/arch/x86/include/asm/cpufeatures.h
> >> index dad350d42ecf..d649ac5ed7c7 100644
> >> --- a/arch/x86/include/asm/cpufeatures.h
> >> +++ b/arch/x86/include/asm/cpufeatures.h
> >> @@ -335,6 +335,7 @@
> >>  #define X86_FEATURE_AVIC   (15*32+13) /* Virtual Interrupt 
> >> Controller */
> >>  #define X86_FEATURE_V_VMSAVE_VMLOAD(15*32+15) /* Virtual VMSAVE 
> >> VMLOAD */
> >>  #define X86_FEATURE_VGIF   (15*32+16) /* Virtual GIF */
> >> +#define X86_FEATURE_V_SPEC_CTRL(15*32+20) /* Virtual 
> >> SPEC_CTRL */
> >
> > Shouldn't this bit be reported by KVM_GET_SUPPORTED_CPUID when it's
> > enumerated on the host?
>
> Jim, I am not sure if this needs to be reported by
> KVM_GET_SUPPORTED_CPUID. I dont see V_VMSAVE_VMLOAD or VGIF being reported
> via KVM_GET_SUPPORTED_CPUID. Do you see the need for that?

Every little bit helps. No, it isn't *needed*. But then again, this
entire patchset isn't *needed*, is it?


Re: KVM_SET_CPUID doesn't check supported bits (was Re: [PATCH 0/6] KVM: x86: KVM_SET_SREGS.CR4 bug fixes and cleanup)

2020-12-07 Thread Jim Mattson
On Mon, Dec 7, 2020 at 3:47 AM stsp  wrote:
>
> 07.12.2020 14:29, Paolo Bonzini пишет:
> > On 07/12/20 12:24, stsp wrote:
> >> It tries to enable VME among other things.
> >> qemu appears to disable VME by default,
> >> unless you do "-cpu host". So we have a situation where
> >> the host (which is qemu) doesn't have VME,
> >> and guest (dosemu) is trying to enable it.
> >> Now obviously KVM_SET_CPUID doesn't check anyting
> >> at all and returns success. That later turns
> >> into an invalid guest state.
> >>
> >>
> >> Question: should KVM_SET_CPUID check for
> >> supported bits, end return error if not everything
> >> is supported?
> >
> > No, it is intentional.  Most bits of CPUID are not ever checked by
> > KVM, so userspace is supposed to set values that makes sense
> By "that makes sense" you probably
> meant to say "bits_that_makes_sense masked
> with the ones returned by KVM_GET_SUPPORTED_CPUID"?
>
> So am I right that KVM_SET_CPUID only "lowers"
> the supported bits? In which case I don't need to
> call it at all, but instead just call KVM_GET_SUPPORTED_CPUID
> and see if the needed bits are supported, and
> exit otherwise, right?

"Lowers" is a tricky concept for CPUID information. Some feature bits
report 0 for "present" and 1 for "not-present." Some multi-bit fields
are interpreted as numbers, which may be signed or unsigned. Some
multi-bit fields are strings. Some fields have dependencies on other
fields. Etc.
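
A concrete (hypothetical) illustration of why bitwise "lowering" breaks for
numeric fields: MAXPHYADDR is a number in CPUID.80000008H:EAX[7:0], so

    unsigned int host_width  = 46;  /* 0x2e */
    unsigned int guest_width = 45;  /* 0x2d */
    unsigned int masked = host_width & guest_width;  /* 44: neither value */

Masking gives 44, which is neither what the host supports nor what the guest
asked for.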


Re: [PATCH 1/2] x86/cpufeatures: Add the Virtual SPEC_CTRL feature

2020-12-07 Thread Jim Mattson
On Mon, Dec 7, 2020 at 2:38 PM Babu Moger  wrote:
>
> Newer AMD processors have a feature to virtualize the use of the SPEC_CTRL
> MSR. This feature is identified via CPUID 0x8000000A_EDX[20]. When present,
> the SPEC_CTRL MSR is automatically virtualized and no longer requires
> hypervisor intervention.
>
> Signed-off-by: Babu Moger 
> ---
>  arch/x86/include/asm/cpufeatures.h |1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index dad350d42ecf..d649ac5ed7c7 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -335,6 +335,7 @@
>  #define X86_FEATURE_AVIC   (15*32+13) /* Virtual Interrupt 
> Controller */
>  #define X86_FEATURE_V_VMSAVE_VMLOAD(15*32+15) /* Virtual VMSAVE VMLOAD */
>  #define X86_FEATURE_VGIF   (15*32+16) /* Virtual GIF */
> +#define X86_FEATURE_V_SPEC_CTRL(15*32+20) /* Virtual 
> SPEC_CTRL */

Shouldn't this bit be reported by KVM_GET_SUPPORTED_CPUID when it's
enumerated on the host?

>  /* Intel-defined CPU features, CPUID level 0x0007:0 (ECX), word 16 */
>  #define X86_FEATURE_AVX512VBMI (16*32+ 1) /* AVX512 Vector Bit 
> Manipulation instructions*/
>


Re: [PATCH 2/2] KVM: SVM: Add support for Virtual SPEC_CTRL

2020-12-07 Thread Jim Mattson
On Mon, Dec 7, 2020 at 2:38 PM Babu Moger  wrote:
>
> Newer AMD processors have a feature to virtualize the use of the
> SPEC_CTRL MSR. When supported, the SPEC_CTRL MSR is automatically
> virtualized and no longer requires hypervisor intervention.
>
> This feature is detected via CPUID function 0x8000000A_EDX[20]:
> GuestSpecCtrl.
>
> Hypervisors are not required to enable this feature since it is
> automatically enabled on processors that support it.
>
> When this feature is enabled, the hypervisor no longer has to
> intercept the usage of the SPEC_CTRL MSR and no longer is required to
> save and restore the guest SPEC_CTRL setting when switching
> hypervisor/guest modes.  The effective SPEC_CTRL setting is the guest
> SPEC_CTRL setting or'ed with the hypervisor SPEC_CTRL setting. This
> allows the hypervisor to ensure a minimum SPEC_CTRL if desired.
>
> This support also fixes an issue where a guest may sometimes see an
> inconsistent value for the SPEC_CTRL MSR on processors that support
> this feature. With the current SPEC_CTRL support, the first write to
> SPEC_CTRL is intercepted and the virtualized version of the SPEC_CTRL
> MSR is not updated. When the guest reads back the SPEC_CTRL MSR, it
> will be 0x0, instead of the actual expected value. There isn’t a
> security concern here, because the host SPEC_CTRL value is or’ed with
> the Guest SPEC_CTRL value to generate the effective SPEC_CTRL value.
> KVM writes with the guest's virtualized SPEC_CTRL value to SPEC_CTRL
> MSR just before the VMRUN, so it will always have the actual value
> even though it doesn’t appear that way in the guest. The guest will
> only see the proper value for the SPEC_CTRL register if the guest was
> to write to the SPEC_CTRL register again. With Virtual SPEC_CTRL
> support, the MSR interception of SPEC_CTRL is disabled during
> vmcb_init, so this will no longer be an issue.
>
> Signed-off-by: Babu Moger 
> ---

Shouldn't there be some code to initialize a new "guest SPEC_CTRL"
value in the VMCB, both at vCPU creation, and at virtual processor
reset?

>  arch/x86/kvm/svm/svm.c |   17 ++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 79b3a564f1c9..3d73ec0cdb87 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1230,6 +1230,14 @@ static void init_vmcb(struct vcpu_svm *svm)
>
> svm_check_invpcid(svm);
>
> +   /*
> +* If the host supports V_SPEC_CTRL then disable the interception
> +* of MSR_IA32_SPEC_CTRL.
> +*/
> +   if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL))
> > +   set_msr_interception(&svm->vcpu, svm->msrpm,
> MSR_IA32_SPEC_CTRL,
> +1, 1);
> +
> > if (kvm_vcpu_apicv_active(&svm->vcpu))
> avic_init_vmcb(svm);
>
> @@ -3590,7 +3598,8 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
>  * is no need to worry about the conditional branch over the wrmsr
>  * being speculatively taken.
>  */
> -   x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl);
> +   if (!static_cpu_has(X86_FEATURE_V_SPEC_CTRL))
> +   x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl);

Is this correct for the nested case? Presumably, there is now a "guest
SPEC_CTRL" value somewhere in the VMCB. If L1 does not intercept this
MSR, then we need to transfer the "guest SPEC_CTRL" value from the
vmcb01 to the vmcb02, don't we?

> svm_vcpu_enter_exit(vcpu, svm);
>
> @@ -3609,12 +3618,14 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
>  * If the L02 MSR bitmap does not intercept the MSR, then we need to
>  * save it.
>  */
> -   if (unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL)))
> +   if (!static_cpu_has(X86_FEATURE_V_SPEC_CTRL) &&
> +   unlikely(!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL)))
> svm->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);

Is this correct for the nested case? If L1 does not intercept this
MSR, then it might have changed while L2 is running. Presumably, the
hardware has stored the new value somewhere in the vmcb02 at #VMEXIT,
but now we need to move that value into the vmcb01, don't we?

> reload_tss(vcpu);
>
> -   x86_spec_ctrl_restore_host(svm->spec_ctrl, svm->virt_spec_ctrl);
> +   if (!static_cpu_has(X86_FEATURE_V_SPEC_CTRL))
> +   x86_spec_ctrl_restore_host(svm->spec_ctrl, 
> svm->virt_spec_ctrl);
>
> vcpu->arch.cr2 = svm->vmcb->save.cr2;
> vcpu->arch.regs[VCPU_REGS_RAX] = svm->vmcb->save.rax;
>

It would be great if you could add some tests to kvm-unit-tests.
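
For instance, a minimal sketch (hypothetical and untested; a real test would
first check CPUID for SPEC_CTRL support) targeting the read-back problem
described in the commit message:

    #include "libcflat.h"
    #include "processor.h"

    #define MSR_IA32_SPEC_CTRL 0x48  /* defined here in case msr.h lacks it */

    int main(void)
    {
        /* Per the commit message, the first write could read back as 0. */
        wrmsr(MSR_IA32_SPEC_CTRL, 1 /* IBRS */);
        report(rdmsr(MSR_IA32_SPEC_CTRL) == 1,
               "SPEC_CTRL reads back the value just written");
        return report_summary();
    }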


Re: [PATCH] KVM: SVM: check CR4 changes against vcpu->arch

2020-11-16 Thread Jim Mattson
On Mon, Nov 16, 2020 at 10:11 AM Paolo Bonzini  wrote:
>
> Similarly to what vmx/vmx.c does, use vcpu->arch.cr4 to check if CR4
> bits PGE, PKE and OSXSAVE have changed.  When switching between VMCB01
> and VMCB02, CPUID has to be adjusted every time if CR4.PKE or CR4.OSXSAVE
> change; without this patch, instead, CR4 would be checked against the
> previous value for L2 on vmentry, and against the previous value for
> L1 on vmexit, and CPUID would not be updated.
>
> Signed-off-by: Paolo Bonzini 
Reviewed-by: Jim Mattson 


Re: [tip: ras/core] x86/mce: Enable additional error logging on certain Intel CPUs

2020-11-09 Thread Jim Mattson
On Mon, Nov 9, 2020 at 2:57 PM Luck, Tony  wrote:
>
> > I thought Linux had long ago gone the route of turning rdmsr/wrmsr
> > into rdmsr_safe/wrmsr_safe, so that the guest would ignore the #GPs on
> > writes and return zero to the caller for #GPs on reads.
>
> Linux just switched that around for the machine check banks ... if they #GP
> fault, then something is seriously wrong.
>
> Maybe that isn't a general change of direction though. Perhaps I
> should either use rdmsrl_safe() in this code. Or (better?) add
>
> if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
> return;
>
> to the start of intel_imc_init().

I wouldn't expect all hypervisors to necessarily set CPUID.01H:ECX[bit
31]. Architecturally, on Intel CPUs, that bit is simply defined as
"not used." There is no documented contract between Intel and
hypervisor vendors regarding the use of that bit. (AMD, on the other
hand, *does* document that bit as "reserved for use by hypervisor to
indicate guest status.")


Re: [tip: ras/core] x86/mce: Enable additional error logging on certain Intel CPUs

2020-11-09 Thread Jim Mattson
On Mon, Nov 9, 2020 at 2:09 PM Luck, Tony  wrote:
>
> What does KVM do with model specific MSRs?

"Model specific model-specific registers?" :-)

KVM only implements a small subset of MSRs. By default, any access to
the rest raises #GP.

> Looks like you let the guest believe it was running on one of Sandy Bridge, 
> Ivy Bridge, Haswell (Xeon).
>
> So, the core MCE code tried to enable extended error reporting.
>
> If there is a mode to have KVM let the guest think that it read/wrote MSR 
> 0x17F,
> but actually, doesn't do it ... that would seem to be a reasonable thing to 
> do here.

There is an 'ignore_msrs' module parameter, to sink writes and return
zero on reads for unknown MSRs, but I don't think it's commonly used.

I thought Linux had long ago gone the route of turning rdmsr/wrmsr
into rdmsr_safe/wrmsr_safe, so that the guest would ignore the #GPs on
writes and return zero to the caller for #GPs on reads.


Re: [PATCH] KVM: VMX: Enable Notify VM exit

2020-11-02 Thread Jim Mattson
On Sun, Nov 1, 2020 at 10:14 PM Tao Xu  wrote:
>
> There are cases where a malicious virtual machine can get a CPU stuck
> (event windows never open up), e.g. an infinite loop in microcode on a
> nested #AC (CVE-2015-5307). No event window means no events: NMIs, SMIs,
> and IRQs are all blocked, which may leave the affected physical CPU
> unusable by the host or by other VMs.
>
> To resolve those cases, KVM can enable a notify VM exit, which triggers if no
> event window occurs in VMX non-root mode for a specified amount of
> time (notify window).
>
> Expose a module param for setting notify window, default setting it to
> the time as 1/10 of periodic tick, and user can set it to 0 to disable
> this feature.
>
> TODO:
> 1. The appropriate value of notify window.
> 2. Another patch to disable interception of #DB and #AC when notify
> VM-Exiting is enabled.
>
> Co-developed-by: Xiaoyao Li 
> Signed-off-by: Tao Xu 
> Signed-off-by: Xiaoyao Li 

Do you have test cases?


Re: [PATCH v3 7/9] KVM: VMX: Add guest physical address check in EPT violation and misconfig

2020-10-23 Thread Jim Mattson
On Fri, Oct 23, 2020 at 10:16 AM Paolo Bonzini  wrote:
>
> On 23/10/20 18:59, Jim Mattson wrote:
> >> The problem is that page fault error code bits cannot be reconstructed
> >> from bits 0..2 of the EPT violation exit qualification, if bit 8 is
> >> clear in the exit qualification (that is, if the access causing the EPT
> >> violation is to a paging-structure entry).  In that case bits 0..2 refer
> >> to the paging-structure access rather than to the final access.  In fact
> >> advanced information is not available at all for paging-structure access
> >> EPT violations.
> >
> > True, but the in-kernel emulator can only handle a very small subset
> > of the available instructions.
> >
> > If bit 8 is set in the exit qualification, we should use the advanced
> > VM-exit information. If it's clear, we should just do a software page
> > walk of the guest's x86 page tables.
>
> The information that we need is _not_ that provided by the advanced
> VM-exit information (or by a page walk).  If a page is neither writable
> nor executable, the advanced information doesn't say if the injected #PF
> should be a W=1 or a F=1 fault.  We need the information in bits 0..2 of
> the exit qualification for the final access, which however is not
> available for the paging-structure access.
>
Are you planning to extend the emulator, then, to support all
instructions? I'm not sure where you are going with this.


Re: [PATCH] KVM: X86: Expose KVM_HINTS_REALTIME in KVM_GET_SUPPORTED_CPUID

2020-10-23 Thread Jim Mattson
On Fri, Oct 23, 2020 at 2:07 AM Paolo Bonzini  wrote:
>
> On 22/10/20 19:13, Jim Mattson wrote:
> > We don't actually use KVM_GET_SUPPORTED_CPUID at all today. If it's
> > commonly being misinterpreted as you say, perhaps we should add a
> > KVM_GET_TRUE_SUPPORTED_CPUID ioctl. Or, perhaps we can just fix this
> > in the documentation?
>
> Yes, I think we should fix the documentation and document the best
> practices around MSRs and CPUID bits.  Mostly documenting what QEMU
> does, perhaps without all the quirks it has to support old kernels that
> messed things up even more.
>
> Paolo

I'd really like to be able to call an ioctl that will help me
determine whether the host can support the guest CPUID information
that I already have (e.g. on the target of a live migration). At first
glance, KVM_GET_SUPPORTED_CPUID appeared to be that ioctl. Sadly, it
appears that no such ioctl exists.


Re: [PATCH v3 7/9] KVM: VMX: Add guest physical address check in EPT violation and misconfig

2020-10-23 Thread Jim Mattson
On Fri, Oct 23, 2020 at 2:22 AM Paolo Bonzini  wrote:
>
> On 23/10/20 05:14, Sean Christopherson wrote:
>  +
>  +   /*
>  +* Check that the GPA doesn't exceed physical memory limits, as 
>  that is
>  +* a guest page fault.  We have to emulate the instruction here, 
>  because
>  +* if the illegal address is that of a paging structure, then
>  +* EPT_VIOLATION_ACC_WRITE bit is set.  Alternatively, if 
>  supported we
>  +* would also use advanced VM-exit information for EPT 
>  violations to
>  +* reconstruct the page fault error code.
>  +*/
>  +   if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
>  +   return kvm_emulate_instruction(vcpu, 0);
>  +
> >>> Is kvm's in-kernel emulator up to the task? What if the instruction in
> >>> question is AVX-512, or one of the myriad instructions that the
> >>> in-kernel emulator can't handle? Ice Lake must support the advanced
> >>> VM-exit information for EPT violations, so that would seem like a
> >>> better choice.
> >>>
> >> Anyone?
> >
> > Using "advanced info" if it's supported seems like the way to go.  Outright
> > requiring it is probably overkill; if userspace wants to risk having to 
> > kill a
> > (likely broken) guest, so be it.
>
> Yeah, the instruction is expected to page-fault here.  However the
> comment is incorrect and advanced information does not help here.
>
> The problem is that page fault error code bits cannot be reconstructed
> from bits 0..2 of the EPT violation exit qualification, if bit 8 is
> clear in the exit qualification (that is, if the access causing the EPT
> violation is to a paging-structure entry).  In that case bits 0..2 refer
> to the paging-structure access rather than to the final access.  In fact
> advanced information is not available at all for paging-structure access
> EPT violations.

True, but the in-kernel emulator can only handle a very small subset
of the available instructions.

If bit 8 is set in the exit qualification, we should use the advanced
VM-exit information. If it's clear, we should just do a software page
walk of the guest's x86 page tables. The in-kernel emulator should
only be used as a last resort on hardware that doesn't support the
advanced VM-exit information for EPT violations.
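
To sketch what I mean (the helper names below are made up; only the bit 8
test of the exit qualification is architectural):

static int handle_bad_gpa_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
                                        unsigned long exit_qualification)
{
        /* Bit 8 set: the guest linear address was translated (final access). */
        if (exit_qualification & BIT(8)) {
                /*
                 * The advanced VM-exit information describes the faulting
                 * access, so the #PF error code can be built from it.
                 */
                return inject_pf_from_advanced_info(vcpu, gpa,
                                                    exit_qualification);
        }

        /*
         * Paging-structure access: no advanced information is provided, so
         * walk the guest's page tables in software to construct the error
         * code, rather than punting to the in-kernel emulator.
         */
        return inject_pf_from_guest_walk(vcpu, gpa);
}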


Re: [PATCH] KVM: X86: Expose KVM_HINTS_REALTIME in KVM_GET_SUPPORTED_CPUID

2020-10-22 Thread Jim Mattson
On Thu, Oct 22, 2020 at 9:37 AM Paolo Bonzini  wrote:
>
> On 22/10/20 18:35, Jim Mattson wrote:
> > On Thu, Oct 22, 2020 at 6:02 AM Paolo Bonzini  wrote:
> >>
> >> On 22/10/20 03:34, Wanpeng Li wrote:
> >>> From: Wanpeng Li 
> >>>
> >>> Per KVM_GET_SUPPORTED_CPUID ioctl documentation:
> >>>
> >>> This ioctl returns x86 cpuid features which are supported by both the
> >>> hardware and kvm in its default configuration.
> >>>
> >>> A well-behaved userspace should not set the bit if it is not supported.
> >>>
> >>> Suggested-by: Jim Mattson 
> >>> Signed-off-by: Wanpeng Li 
> >>
> >> It's common for userspace to copy all supported CPUID bits to
> >> KVM_SET_CPUID2, I don't think this is the right behavior for
> >> KVM_HINTS_REALTIME.
> >
> > That is not how the API is defined, but I'm sure you know that. :-)
>
> Yes, though in my defense :) KVM_HINTS_REALTIME is not a property of the
> kernel, it's a property of the environment that the guest runs in.  This
> was the original reason to separate it from other feature bits in the
> same leaf.
>
> Paolo
>
We don't actually use KVM_GET_SUPPORTED_CPUID at all today. If it's
commonly being misinterpreted as you say, perhaps we should add a
KVM_GET_TRUE_SUPPORTED_CPUID ioctl. Or, perhaps we can just fix this
in the documentation?


Re: [PATCH] KVM: X86: Expose KVM_HINTS_REALTIME in KVM_GET_SUPPORTED_CPUID

2020-10-22 Thread Jim Mattson
On Thu, Oct 22, 2020 at 6:02 AM Paolo Bonzini  wrote:
>
> On 22/10/20 03:34, Wanpeng Li wrote:
> > From: Wanpeng Li 
> >
> > Per KVM_GET_SUPPORTED_CPUID ioctl documentation:
> >
> > This ioctl returns x86 cpuid features which are supported by both the
> > hardware and kvm in its default configuration.
> >
> > A well-behaved userspace should not set the bit if it is not supported.
> >
> > Suggested-by: Jim Mattson 
> > Signed-off-by: Wanpeng Li 
>
> It's common for userspace to copy all supported CPUID bits to
> KVM_SET_CPUID2, I don't think this is the right behavior for
> KVM_HINTS_REALTIME.

That is not how the API is defined, but I'm sure you know that. :-)

> (But maybe this was discussed already; if so, please point me to the
> previous discussion).
>
> Paolo


Re: [PATCH v3 7/9] KVM: VMX: Add guest physical address check in EPT violation and misconfig

2020-10-14 Thread Jim Mattson
On Fri, Oct 9, 2020 at 9:17 AM Jim Mattson  wrote:
>
> On Fri, Jul 10, 2020 at 8:48 AM Mohammed Gamal  wrote:
> >
> > Check guest physical address against it's maximum physical memory. If
> > the guest's physical address exceeds the maximum (i.e. has reserved bits
> > set), inject a guest page fault with PFERR_RSVD_MASK set.
> >
> > This has to be done both in the EPT violation and page fault paths, as
> > there are complications in both cases with respect to the computation
> > of the correct error code.
> >
> > For EPT violations, unfortunately the only possibility is to emulate,
> > because the access type in the exit qualification might refer to an
> > access to a paging structure, rather than to the access performed by
> > the program.
> >
> > Trapping page faults instead is needed in order to correct the error code,
> > but the access type can be obtained from the original error code and
> > passed to gva_to_gpa.  The corrections required in the error code are
> > subtle. For example, imagine that a PTE for a supervisor page has a reserved
> > bit set.  On a supervisor-mode access, the EPT violation path would trigger.
> > However, on a user-mode access, the processor will not notice the reserved
> > bit and not include PFERR_RSVD_MASK in the error code.
> >
> > Co-developed-by: Mohammed Gamal 
> > Signed-off-by: Paolo Bonzini 
> > ---
> >  arch/x86/kvm/vmx/vmx.c | 24 +---
> >  arch/x86/kvm/vmx/vmx.h |  3 ++-
> >  2 files changed, 23 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 770b090969fb..de3f436b2d32 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -4790,9 +4790,15 @@ static int handle_exception_nmi(struct kvm_vcpu 
> > *vcpu)
> >
> > if (is_page_fault(intr_info)) {
> > cr2 = vmx_get_exit_qual(vcpu);
> > -   /* EPT won't cause page fault directly */
> > -   WARN_ON_ONCE(!vcpu->arch.apf.host_apf_flags && enable_ept);
> > -   return kvm_handle_page_fault(vcpu, error_code, cr2, NULL, 
> > 0);
> > +   if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
> > +   /*
> > +* EPT will cause page fault only if we need to
> > +* detect illegal GPAs.
> > +*/
> > +   kvm_fixup_and_inject_pf_error(vcpu, cr2, 
> > error_code);
> > +   return 1;
> > +   } else
> > +   return kvm_handle_page_fault(vcpu, error_code, cr2, 
> > NULL, 0);
> > }
> >
> > ex_no = intr_info & INTR_INFO_VECTOR_MASK;
> > @@ -5308,6 +5314,18 @@ static int handle_ept_violation(struct kvm_vcpu 
> > *vcpu)
> >PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
> >
> > vcpu->arch.exit_qualification = exit_qualification;
> > +
> > +   /*
> > +* Check that the GPA doesn't exceed physical memory limits, as 
> > that is
> > +* a guest page fault.  We have to emulate the instruction here, 
> > because
> > +* if the illegal address is that of a paging structure, then
> > +* EPT_VIOLATION_ACC_WRITE bit is set.  Alternatively, if supported 
> > we
> > +* would also use advanced VM-exit information for EPT violations to
> > +* reconstruct the page fault error code.
> > +*/
> > +   if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
> > +   return kvm_emulate_instruction(vcpu, 0);
> > +
>
> Is kvm's in-kernel emulator up to the task? What if the instruction in
> question is AVX-512, or one of the myriad instructions that the
> in-kernel emulator can't handle? Ice Lake must support the advanced
> VM-exit information for EPT violations, so that would seem like a
> better choice.
>
Anyone?


Re: [PATCH v3 7/9] KVM: VMX: Add guest physical address check in EPT violation and misconfig

2020-10-09 Thread Jim Mattson
On Fri, Jul 10, 2020 at 8:48 AM Mohammed Gamal  wrote:
>
> Check guest physical address against it's maximum physical memory. If
> the guest's physical address exceeds the maximum (i.e. has reserved bits
> set), inject a guest page fault with PFERR_RSVD_MASK set.
>
> This has to be done both in the EPT violation and page fault paths, as
> there are complications in both cases with respect to the computation
> of the correct error code.
>
> For EPT violations, unfortunately the only possibility is to emulate,
> because the access type in the exit qualification might refer to an
> access to a paging structure, rather than to the access performed by
> the program.
>
> Trapping page faults instead is needed in order to correct the error code,
> but the access type can be obtained from the original error code and
> passed to gva_to_gpa.  The corrections required in the error code are
> subtle. For example, imagine that a PTE for a supervisor page has a reserved
> bit set.  On a supervisor-mode access, the EPT violation path would trigger.
> However, on a user-mode access, the processor will not notice the reserved
> bit and not include PFERR_RSVD_MASK in the error code.
>
> Co-developed-by: Mohammed Gamal 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/vmx/vmx.c | 24 +---
>  arch/x86/kvm/vmx/vmx.h |  3 ++-
>  2 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 770b090969fb..de3f436b2d32 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4790,9 +4790,15 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>
> if (is_page_fault(intr_info)) {
> cr2 = vmx_get_exit_qual(vcpu);
> -   /* EPT won't cause page fault directly */
> -   WARN_ON_ONCE(!vcpu->arch.apf.host_apf_flags && enable_ept);
> -   return kvm_handle_page_fault(vcpu, error_code, cr2, NULL, 0);
> +   if (enable_ept && !vcpu->arch.apf.host_apf_flags) {
> +   /*
> +* EPT will cause page fault only if we need to
> +* detect illegal GPAs.
> +*/
> +   kvm_fixup_and_inject_pf_error(vcpu, cr2, error_code);
> +   return 1;
> +   } else
> +   return kvm_handle_page_fault(vcpu, error_code, cr2, 
> NULL, 0);
> }
>
> ex_no = intr_info & INTR_INFO_VECTOR_MASK;
> @@ -5308,6 +5314,18 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
>
> vcpu->arch.exit_qualification = exit_qualification;
> +
> +   /*
> +* Check that the GPA doesn't exceed physical memory limits, as that 
> is
> +* a guest page fault.  We have to emulate the instruction here, 
> because
> +* if the illegal address is that of a paging structure, then
> +* EPT_VIOLATION_ACC_WRITE bit is set.  Alternatively, if supported we
> +* would also use advanced VM-exit information for EPT violations to
> +* reconstruct the page fault error code.
> +*/
> +   if (unlikely(kvm_mmu_is_illegal_gpa(vcpu, gpa)))
> +   return kvm_emulate_instruction(vcpu, 0);
> +

Is kvm's in-kernel emulator up to the task? What if the instruction in
question is AVX-512, or one of the myriad instructions that the
in-kernel emulator can't handle? Ice Lake must support the advanced
VM-exit information for EPT violations, so that would seem like a
better choice.

> return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>  }
>
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index b0e5e210f1c1..0d06951e607c 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -11,6 +11,7 @@
>  #include "kvm_cache_regs.h"
>  #include "ops.h"
>  #include "vmcs.h"
> +#include "cpuid.h"
>
>  extern const u32 vmx_msr_index[];
>
> @@ -552,7 +553,7 @@ static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
>
>  static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu)
>  {
> -   return !enable_ept;
> +   return !enable_ept || cpuid_maxphyaddr(vcpu) < 
> boot_cpu_data.x86_phys_bits;
>  }
>
>  void dump_vmcs(void);
> --
> 2.26.2
>


Re: [PATCH] KVM: nVMX: Morph notification vector IRQ on nested VM-Enter to pending PI

2020-10-06 Thread Jim Mattson
On Wed, Aug 12, 2020 at 10:51 AM Sean Christopherson
 wrote:
>
> On successful nested VM-Enter, check for pending interrupts and convert
> the highest priority interrupt to a pending posted interrupt if it
> matches L2's notification vector.  If the vCPU receives a notification
> interrupt before nested VM-Enter (assuming L1 disables IRQs before doing
> VM-Enter), the pending interrupt (for L1) should be recognized and
> processed as a posted interrupt when interrupts become unblocked after
> VM-Enter to L2.
>
> This fixes a bug where L1/L2 will get stuck in an infinite loop if L1 is
> trying to inject an interrupt into L2 by setting the appropriate bit in
> L2's PIR and sending a self-IPI prior to VM-Enter (as opposed to KVM's
> method of manually moving the vector from PIR->vIRR/RVI).  KVM will
> observe the IPI while the vCPU is in L1 context and so won't immediately
> morph it to a posted interrupt for L2.  The pending interrupt will be
> seen by vmx_check_nested_events(), cause KVM to force an immediate exit
> after nested VM-Enter, and eventually be reflected to L1 as a VM-Exit.
> After handling the VM-Exit, L1 will see that L2 has a pending interrupt
> in PIR, send another IPI, and repeat until L2 is killed.
>
> Note, posted interrupts require virtual interrupt deliveriy, and virtual
> interrupt delivery requires exit-on-interrupt, ergo interrupts will be
> unconditionally unmasked on VM-Enter if posted interrupts are enabled.
>
> Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")
> Cc: sta...@vger.kernel.org
> Cc: Liran Alon 
> Signed-off-by: Sean Christopherson 
> ---
I don't think this is the best fix.

I believe the real problem is the way that external and posted
interrupts are handled in vmx_check_nested_events().

First of all, I believe that the existing call to
vmx_complete_nested_posted_interrupt() at the end of
vmx_check_nested_events() is far too aggressive. Unless I am missing
something in the SDM, posted interrupt processing is *only* triggered
when the notification vector is received in VMX non-root mode. It is
not triggered on VM-entry.

Looking back one block, we have:

if (kvm_cpu_has_interrupt(vcpu) && !vmx_interrupt_blocked(vcpu)) {
        if (block_nested_events)
                return -EBUSY;
        if (!nested_exit_on_intr(vcpu))
                goto no_vmexit;
        nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
        return 0;
}

If nested_exit_on_intr() is true, we should first check to see if
"acknowledge interrupt on exit" is set. If so, we should acknowledge
the interrupt right here, with a call to kvm_cpu_get_interrupt(),
rather than deep in the guts of nested_vmx_vmexit(). If the vector we
get is the notification vector from VMCS12, then we should call
vmx_complete_nested_posted_interrupt(). Otherwise, we should call
nested_vmx_vmexit(EXIT_REASON_EXTERNAL_INTERRUPT) as we do now.

Furthermore, vmx_complete_nested_posted_interrupt() should write to
the L1 EOI register, as indicated in step 4 of the 7-step sequence
detailed in section 29.6 of the SDM, volume 3. It skips this step
today.
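
Putting the first part together, the block quoted above would become
something like this (untested sketch; nested_vmx_vmexit() would also need to
be taught not to acknowledge the interrupt a second time):

if (kvm_cpu_has_interrupt(vcpu) && !vmx_interrupt_blocked(vcpu)) {
        struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

        if (block_nested_events)
                return -EBUSY;
        if (!nested_exit_on_intr(vcpu))
                goto no_vmexit;

        if (nested_exit_intr_ack_set(vcpu)) {
                int irq = kvm_cpu_get_interrupt(vcpu);

                /* The notification vector itself is not reflected to L1. */
                if (irq == vmcs12->posted_intr_nv)
                        return vmx_complete_nested_posted_interrupt(vcpu);

                nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT,
                                  INTR_INFO_VALID_MASK |
                                  INTR_TYPE_EXT_INTR | irq, 0);
                return 0;
        }
        nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
        return 0;
}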


Re: [PATCH v2 4/9] KVM: VMX: Don't freeze guest when event delivery causes an APIC-access exit

2020-09-29 Thread Jim Mattson
On Thu, Sep 10, 2020 at 2:51 AM Wanpeng Li  wrote:
>
> From: Wanpeng Li 
>
> According to SDM 27.2.4, Event delivery causes an APIC-access VM exit.
> Don't report internal error and freeze guest when event delivery causes
> an APIC-access exit, it is handleable and the event will be re-injected
> during the next vmentry.
>
> Signed-off-by: Wanpeng Li 

I'm curious if you have a test case for this.


Re: [PATCH v8 1/8] KVM: x86: Return -ENOENT on unimplemented MSRs

2020-09-25 Thread Jim Mattson
On Fri, Sep 25, 2020 at 7:34 AM Alexander Graf  wrote:
>
> When we find an MSR that we can not handle, bubble up that error code as
> MSR error return code. Follow up patches will use that to expose the fact
> that an MSR is not handled by KVM to user space.
>
> Suggested-by: Aaron Lewis 
> Signed-off-by: Alexander Graf 
Reviewed-by: Jim Mattson 


Re: [PATCH] KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE

2020-09-24 Thread Jim Mattson
On Wed, Sep 23, 2020 at 2:54 PM Sean Christopherson
 wrote:
>
> Reset the MMU context during kvm_set_cr4() if SMAP or PKE is toggled.
> Recent commits to (correctly) not reload PDPTRs when SMAP/PKE are
> toggled inadvertantly skipped the MMU context reset due to the mask
> of bits that triggers PDPTR loads also being used to trigger MMU context
> resets.
>
> Fixes: 427890aff855 ("kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE 
> mode")
> Fixes: cb957adb4ea4 ("kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE 
> mode")
> Cc: Jim Mattson 
> Cc: Peter Shier 
> Cc: Oliver Upton 
> Signed-off-by: Sean Christopherson 

Thanks for the fix!

Reviewed-by: Jim Mattson 


Re: [PATCH v2 0/2] INVD intercept change to skip instruction

2020-09-24 Thread Jim Mattson
On Thu, Sep 24, 2020 at 11:42 AM Tom Lendacky  wrote:
>
> From: Tom Lendacky 
>
> This series updates the INVD intercept support for both SVM and VMX to
> skip the instruction rather than emulating it, since emulation of this
> instruction is just a NOP.

Isn't INVD a serializing instruction, whereas NOP isn't? IIRC, Intel
doesn't architect VM-entry or VM-exit as serializing, though they
probably are in practice. I'm not sure what AMD's stance on this is.


Re: [PATCH 2/4] KVM: VMX: Unconditionally clear CPUID.INVPCID if !CPUID.PCID

2020-09-23 Thread Jim Mattson
On Wed, Sep 23, 2020 at 9:51 AM Sean Christopherson
 wrote:
>
> If PCID is not exposed to the guest, clear INVPCID in the guest's CPUID
> even if the VMCS INVPCID enable is not supported.  This will allow
> consolidating the secondary execution control adjustment code without
> having to special case INVPCID.
>
> Technically, this fixes a bug where !CPUID.PCID && CPUID.INVCPID would
> result in unexpected guest behavior (#UD instead of #GP/#PF), but KVM
> doesn't support exposing INVPCID if it's not supported in the VMCS, i.e.
> such a config is broken/bogus no matter what.
>
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/vmx/vmx.c | 16 +++-
>  1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index cfed29329e4f..57e48c5a1e91 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4149,16 +4149,22 @@ static void vmx_compute_secondary_exec_control(struct 
> vcpu_vmx *vmx)
> }
> }
>
> +   /*
> +* Expose INVPCID if and only if PCID is also exposed to the guest.
> +* INVPCID takes a #UD when it's disabled in the VMCS, but a #GP or 
> #PF
> +* if CR4.PCIDE=0.  Enumerating CPUID.INVPCID=1 would lead to 
> incorrect
> +* behavior from the guest perspective (it would expect #GP or #PF).
> +*/
> +   if (!guest_cpuid_has(vcpu, X86_FEATURE_PCID))
> +   guest_cpuid_clear(vcpu, X86_FEATURE_INVPCID);
> +

I thought the general rule for userspace provided CPUID bits was that
kvm only made any adjustments necessary to prevent bad things from
happening at the host level. Proper guest semantics are entirely the
responsibility of userspace. Or did I misunderstand?


Re: [PATCH 3/4] KVM: VMX: Rename RDTSCP secondary exec control name to insert "ENABLE"

2020-09-23 Thread Jim Mattson
On Wed, Sep 23, 2020 at 9:51 AM Sean Christopherson
 wrote:
>
> Rename SECONDARY_EXEC_RDTSCP to SECONDARY_EXEC_ENABLE_RDTSCP in
> preparation for consolidating the logic for adjusting secondary exec
> controls based on the guest CPUID model.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson 
Reviewed-by: Jim Mattson 


Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

2020-09-03 Thread Jim Mattson
On Thu, Sep 3, 2020 at 1:02 PM Paolo Bonzini  wrote:
>
> On 03/09/20 20:32, Jim Mattson wrote:
> >> [Checking writes to CR3] would be way too slow.  Even the current
> >> trapping of present #PF can introduce some slowdown depending on the
> >> workload.
> >
> > Yes, I was concerned about that...which is why I would not want to
> > enable pedantic mode. But if you're going to be pedantic, why go
> > halfway?
>
> Because I am not sure about any guest, even KVM, caring about setting
> bits 51:46 in CR3.
>
> >>> Does the typical guest care about whether or not setting any of the
> >>> bits 51:46 in a PFN results in a fault?
> >>
> >> At least KVM with shadow pages does, which is a bit niche but it shows
> >> that you cannot really rely on no one doing it.  As you guessed, the
> >> main usage of the feature is for machines with 5-level page tables where
> >> there are no reserved bits; emulating smaller MAXPHYADDR allows
> >> migrating VMs from 4-level page-table hosts.
> >>
> >> Enabling per-VM would not be particularly useful IMO because if you want
> >> to disable this code you can just set host MAXPHYADDR = guest
> >> MAXPHYADDR, which should be the common case unless you want to do that
> >> kind of Skylake to Icelake (or similar) migration.
> >
> > I expect that it will be quite common to run 46-bit wide legacy VMs on
> > Ice Lake hardware, as Ice Lake machines start showing up in
> > heterogeneous data centers.
>
> If you'll be okay with running _all_ 46-bit wide legacy VMs without
> MAXPHYADDR emulation, that's what this patch is for.  If you'll be okay
> with running _only_ 46-bit wide VMs without emulation, you still don't
> need special enabling per-VM beyond the automatic one based on
> CPUID[0x8000_0008].  Do you think you'll need to enable it for some
> special 46-bit VMs?

Yes. From what you've said above, we would only want to enable this
for the niche case of 46-bit KVM guests using shadow paging. I would
expect that to be a very small number of VMs. :-)


Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

2020-09-03 Thread Jim Mattson
On Thu, Sep 3, 2020 at 11:03 AM Paolo Bonzini  wrote:
>
> On 03/09/20 19:57, Jim Mattson wrote:
> > On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal  wrote:
> >> This patch exposes allow_smaller_maxphyaddr to the user as a module 
> >> parameter.
> >>
> >> Since smaller physical address spaces are only supported on VMX, the 
> >> parameter
> >> is only exposed in the kvm_intel module.
> >> Modifications to VMX page fault and EPT violation handling will depend on 
> >> whether
> >> that parameter is enabled.
> >>
> >> Also disable support by default, and let the user decide if they want to 
> >> enable
> >> it.
> >
> > I think a smaller guest physical address width *should* be allowed.
> > However, perhaps the pedantic adherence to the architectural
> > specification could be turned on or off per-VM? And, if we're going to
> > be pedantic, I think we should go all the way and get MOV-to-CR3
> > correct.
>
> That would be way too slow.  Even the current trapping of present #PF
> can introduce some slowdown depending on the workload.

Yes, I was concerned about that...which is why I would not want to
enable pedantic mode. But if you're going to be pedantic, why go
halfway?

> > Does the typical guest care about whether or not setting any of the
> > bits 51:46 in a PFN results in a fault?
>
> At least KVM with shadow pages does, which is a bit niche but it shows
> that you cannot really rely on no one doing it.  As you guessed, the
> main usage of the feature is for machines with 5-level page tables where
> there are no reserved bits; emulating smaller MAXPHYADDR allows
> migrating VMs from 4-level page-table hosts.
>
> Enabling per-VM would not be particularly useful IMO because if you want
> to disable this code you can just set host MAXPHYADDR = guest
> MAXPHYADDR, which should be the common case unless you want to do that
> kind of Skylake to Icelake (or similar) migration.

I expect that it will be quite common to run 46-bit wide legacy VMs on
Ice Lake hardware, as Ice Lake machines start showing up in
heterogeneous data centers.


Re: [PATCH] KVM: x86: VMX: Make smaller physical guest address space support user-configurable

2020-09-03 Thread Jim Mattson
On Thu, Sep 3, 2020 at 7:12 AM Mohammed Gamal  wrote:
>
> This patch exposes allow_smaller_maxphyaddr to the user as a module parameter.
>
> Since smaller physical address spaces are only supported on VMX, the parameter
> is only exposed in the kvm_intel module.
> Modifications to VMX page fault and EPT violation handling will depend on 
> whether
> that parameter is enabled.
>
> Also disable support by default, and let the user decide if they want to 
> enable
> it.
>
> Signed-off-by: Mohammed Gamal 

I think a smaller guest physical address width *should* be allowed.
However, perhaps the pedantic adherence to the architectural
specification could be turned on or off per-VM? And, if we're going to
be pedantic, I think we should go all the way and get MOV-to-CR3
correct.

Does the typical guest care about whether or not setting any of the
bits 51:46 in a PFN results in a fault?


Re: [PATCH 3/5] KVM: nVMX: Update VMX controls MSR according to guest CPUID after setting VMX MSRs

2020-09-02 Thread Jim Mattson
On Wed, Sep 2, 2020 at 11:16 AM Sean Christopherson
 wrote:
>
> On Fri, Aug 28, 2020 at 01:39:39PM -0700, Jim Mattson wrote:
> > On Fri, Aug 28, 2020 at 1:54 AM Chenyi Qiang  wrote:
> > >
> > > Update the fields (i.e. VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS and
> > > VM_{ENTRY, EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL) in
> > > nested MSR_IA32_VMX_TRUE_{ENTRY, EXIT}_CTLS according to guest CPUID
> > > when user space initializes the features MSRs. Regardless of the order
> > > of SET_CPUID and SET_MSRS from the user space, do the update to avoid
> > > MSR values overriding.
> > >
> > > Signed-off-by: Chenyi Qiang 
> > > ---
> > >  arch/x86/kvm/vmx/vmx.c | 6 +-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 819c185adf09..f9664ccc003b 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -345,6 +345,7 @@ static bool guest_state_valid(struct kvm_vcpu *vcpu);
> > >  static u32 vmx_segment_access_rights(struct kvm_segment *var);
> > >  static __always_inline void vmx_disable_intercept_for_msr(unsigned long 
> > > *msr_bitmap,
> > >   u32 msr, int 
> > > type);
> > > +static void nested_vmx_entry_exit_ctls_update(struct kvm_vcpu *vcpu);
> > >
> > >  void vmx_vmexit(void);
> > >
> > > @@ -2161,7 +2162,10 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, 
> > > struct msr_data *msr_info)
> > > return 1; /* they are read-only */
> > > if (!nested_vmx_allowed(vcpu))
> > > return 1;
> > > -   return vmx_set_vmx_msr(vcpu, msr_index, data);
> > > +   ret = vmx_set_vmx_msr(vcpu, msr_index, data);
> > > +   nested_vmx_pmu_entry_exit_ctls_update(vcpu);
> > > +   nested_vmx_entry_exit_ctls_update(vcpu);
> > > +   break;
> >
> > Now I see what you're doing. This commit should probably come before
> > the previous commit, so that at no point in the series can userspace
> > set VMX MSR bits that should be cleared based on the guest CPUID.
> >
> > There's an ABI change here: userspace may no longer get -EINVAL if it
> > tries to set an illegal VMX MSR bit. Instead, some illegal bits are
> > silently cleared. Moreover, these functions will potentially set VMX
> > MSR bits that userspace has just asked to clear.
>
> Can we simply remove nested_vmx_entry_exit_ctls_update() and
> nested_vmx_pmu_entry_exit_ctls_update()?  It's userspace's responsibility
> to present a valid vCPU model to the guest, I don't see any reason to
> silently tweak the VMX MSRs unless allowing the bogus config breaks KVM.
> E.g. there are many more controls that are non-sensical without "native"
> support for the associated feature.

We might need a test for kvm_mpx_supported() here:

/* If not VM_EXIT_CLEAR_BNDCFGS, the L2 value propagates to L1.  */
if (vmcs12->vm_exit_controls & VM_EXIT_CLEAR_BNDCFGS)
        vmcs_write64(GUEST_BNDCFGS, 0);
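
i.e., something along these lines (sketch only):

/* If not VM_EXIT_CLEAR_BNDCFGS, the L2 value propagates to L1.  */
if (kvm_mpx_supported() &&
    (vmcs12->vm_exit_controls & VM_EXIT_CLEAR_BNDCFGS))
        vmcs_write64(GUEST_BNDCFGS, 0);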

BTW, where does the L2 value propagate to L1 if not VM_EXIT_CLEAR_BNDCFGS?


Re: [PATCH 1/5] KVM: nVMX: Fix VMX controls MSRs setup when nested VMX enabled

2020-08-31 Thread Jim Mattson
On Fri, Aug 28, 2020 at 7:51 PM Xiaoyao Li  wrote:
>
> On 8/29/2020 9:49 AM, Chenyi Qiang wrote:
> >
> >
> > On 8/29/2020 1:43 AM, Jim Mattson wrote:
> >> On Fri, Aug 28, 2020 at 1:54 AM Chenyi Qiang 
> >> wrote:
> >>>
> >>> KVM supports the nested VM_{EXIT, ENTRY}_LOAD_IA32_PERF_GLOBAL_CTRL and
> >>> VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS, but they doesn't expose during
> >>> the setup of nested VMX controls MSR.
> >>>
> >>
> >> Aren't these features added conditionally in
> >> nested_vmx_entry_exit_ctls_update() and
> >> nested_vmx_pmu_entry_exit_ctls_update()?
> >>
> >
> > Yes, but I assume vmcs_config.nested should reflect the global
> > capability of VMX MSR. KVM supports these two controls, so should be
> > exposed here.
>
> No. I prefer to say they are removed conditionally in
> nested_vmx_entry_exit_ctls_update() and
> nested_vmx_pmu_entry_exit_ctls_update().
>
> Userspace calls vmx_get_msr_feature() to query what KVM supports for
> these VMX MSR. In vmx_get_msr_feature(), it returns the value of
> vmcs_config.nested. As KVM supports these two bits, we should advertise
> them in vmcs_config.nested and report to userspace.

It would be nice if there was an API to query what MSR values KVM
supports for a specific VCPU configuration, but given that this is a
system ioctl, I agree with you.


Re: [PATCH 3/5] KVM: nVMX: Update VMX controls MSR according to guest CPUID after setting VMX MSRs

2020-08-28 Thread Jim Mattson
On Fri, Aug 28, 2020 at 1:54 AM Chenyi Qiang  wrote:
>
> Update the fields (i.e. VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS and
> VM_{ENTRY, EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL) in
> nested MSR_IA32_VMX_TRUE_{ENTRY, EXIT}_CTLS according to guest CPUID
> when user space initializes the features MSRs. Regardless of the order
> of SET_CPUID and SET_MSRS from the user space, do the update to avoid
> MSR values overriding.
>
> Signed-off-by: Chenyi Qiang 
> ---
>  arch/x86/kvm/vmx/vmx.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 819c185adf09..f9664ccc003b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -345,6 +345,7 @@ static bool guest_state_valid(struct kvm_vcpu *vcpu);
>  static u32 vmx_segment_access_rights(struct kvm_segment *var);
>  static __always_inline void vmx_disable_intercept_for_msr(unsigned long 
> *msr_bitmap,
>   u32 msr, int type);
> +static void nested_vmx_entry_exit_ctls_update(struct kvm_vcpu *vcpu);
>
>  void vmx_vmexit(void);
>
> @@ -2161,7 +2162,10 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
> return 1; /* they are read-only */
> if (!nested_vmx_allowed(vcpu))
> return 1;
> -   return vmx_set_vmx_msr(vcpu, msr_index, data);
> +   ret = vmx_set_vmx_msr(vcpu, msr_index, data);
> +   nested_vmx_pmu_entry_exit_ctls_update(vcpu);
> +   nested_vmx_entry_exit_ctls_update(vcpu);
> +   break;

Now I see what you're doing. This commit should probably come before
the previous commit, so that at no point in the series can userspace
set VMX MSR bits that should be cleared based on the guest CPUID.

There's an ABI change here: userspace may no longer get -EINVAL if it
tries to set an illegal VMX MSR bit. Instead, some illegal bits are
silently cleared. Moreover, these functions will potentially set VMX
MSR bits that userspace has just asked to clear.

> case MSR_IA32_RTIT_CTL:
> if (!vmx_pt_mode_is_host_guest() ||
> vmx_rtit_ctl_check(vcpu, data) ||
> --
> 2.17.1
>


Re: [PATCH 4/5] KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control

2020-08-28 Thread Jim Mattson
On Fri, Aug 28, 2020 at 1:54 AM Chenyi Qiang  wrote:
>
> A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field
> in exit_ctls_high.
>
> Fixes: 03a8871add95 ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL
> VM-{Entry,Exit} control")
> Signed-off-by: Chenyi Qiang 
> Reviewed-by: Xiaoyao Li 
Reviewed-by: Jim Mattson 


Re: [PATCH 2/5] KVM: nVMX: Verify the VMX controls MSRs with the global capability when setting VMX MSRs

2020-08-28 Thread Jim Mattson
On Fri, Aug 28, 2020 at 1:54 AM Chenyi Qiang  wrote:
>
> When setting the nested VMX MSRs, verify it with the values in
> vmcs_config.nested_vmx_msrs, which reflects the global capability of
> VMX controls MSRs.
>
> Signed-off-by: Chenyi Qiang 

You seem to have entirely missed the point of this code, which is to
prevent userspace from adding features that have previously been
removed for this vCPU (e.g as a side-effect of KVM_SET_CPUID).


Re: [PATCH 1/5] KVM: nVMX: Fix VMX controls MSRs setup when nested VMX enabled

2020-08-28 Thread Jim Mattson
On Fri, Aug 28, 2020 at 1:54 AM Chenyi Qiang  wrote:
>
> KVM supports the nested VM_{EXIT, ENTRY}_LOAD_IA32_PERF_GLOBAL_CTRL and
> VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS, but they doesn't expose during
> the setup of nested VMX controls MSR.
>

Aren't these features added conditionally in
nested_vmx_entry_exit_ctls_update() and
nested_vmx_pmu_entry_exit_ctls_update()?


Re: [PATCH v2] KVM: nVMX: fix the layout of struct kvm_vmx_nested_state_hdr

2020-08-27 Thread Jim Mattson
On Mon, Jul 13, 2020 at 11:23 AM Jim Mattson  wrote:
>
> On Mon, Jul 13, 2020 at 9:22 AM Vitaly Kuznetsov  wrote:
> >
> > Before commit 850448f35aaf ("KVM: nVMX: Fix VMX preemption timer
> > migration") struct kvm_vmx_nested_state_hdr looked like:
> >
> > struct kvm_vmx_nested_state_hdr {
> > __u64 vmxon_pa;
> > __u64 vmcs12_pa;
> > struct {
> > __u16 flags;
> > } smm;
> > }
> >
> > The ABI got broken by the above mentioned commit and an attempt
> > to fix that was made in commit 83d31e5271ac ("KVM: nVMX: fixes for
> > preemption timer migration") which made the structure look like:
> >
> > struct kvm_vmx_nested_state_hdr {
> > __u64 vmxon_pa;
> > __u64 vmcs12_pa;
> > struct {
> > __u16 flags;
> > } smm;
> > __u32 flags;
> > __u64 preemption_timer_deadline;
> > };
> >
> > The problem with this layout is that before both changes compilers were
> > allocating 24 bytes for this and although smm.flags is padded to 8 bytes,
> > it is initialized as a 2 byte value. Chances are that legacy userspaces
> > using old layout will be passing uninitialized bytes which will slip into
> > what is now known as 'flags'.
> >
> > Suggested-by: Sean Christopherson 
> > Fixes: 850448f35aaf ("KVM: nVMX: Fix VMX preemption timer migration")
> > Fixes: 83d31e5271ac ("KVM: nVMX: fixes for preemption timer migration")
> > Signed-off-by: Vitaly Kuznetsov 
>
> Oops!
>
> Reviewed-by: Jim Mattson 

Whatever happened to this?


Re: [PATCH v5 10/12] KVM: X86: Rename and move the function vmx_handle_memory_failure to x86.c

2020-08-26 Thread Jim Mattson
On Wed, Aug 26, 2020 at 12:15 PM Babu Moger  wrote:
>
> Handling of kvm_read/write_guest_virt*() errors can be moved to common
> code. The same code can be used by both VMX and SVM.
>
> Signed-off-by: Babu Moger 
Reviewed-by: Jim Mattson 


Re: [PATCH v5 04/12] KVM: SVM: Modify intercept_exceptions to generic intercepts

2020-08-26 Thread Jim Mattson
On Wed, Aug 26, 2020 at 12:14 PM Babu Moger  wrote:
>
> Modify intercept_exceptions to generic intercepts in vmcb_control_area. Use
> the generic vmcb_set_intercept, vmcb_clr_intercept and vmcb_is_intercept to
> set/clear/test the intercept_exceptions bits.
>
> Signed-off-by: Babu Moger 
> Reviewed-by: Jim Mattson 
> ---

> @@ -835,7 +832,7 @@ static bool nested_exit_on_exception(struct vcpu_svm *svm)
>  {
> unsigned int nr = svm->vcpu.arch.exception.nr;
>
> -   return (svm->nested.ctl.intercept_exceptions & (1 << nr));
> +   return (svm->nested.ctl.intercepts[EXCEPTION_VECTOR] & (1 << nr));
Nit: BIT(nr) rather than (1 << nr).


Re: [PATCH v5 03/12] KVM: SVM: Change intercept_dr to generic intercepts

2020-08-26 Thread Jim Mattson
On Wed, Aug 26, 2020 at 12:14 PM Babu Moger  wrote:
>
> Modify intercept_dr to generic intercepts in vmcb_control_area. Use
> the generic vmcb_set_intercept, vmcb_clr_intercept and vmcb_is_intercept
> to set/clear/test the intercept_dr bits.
>
> Signed-off-by: Babu Moger 
Reviewed-by: Jim Mattson 


Re: [PATCH v5 02/12] KVM: SVM: Change intercept_cr to generic intercepts

2020-08-26 Thread Jim Mattson
On Wed, Aug 26, 2020 at 12:14 PM Babu Moger  wrote:
>
> Change intercept_cr to generic intercepts in vmcb_control_area.
> Use the new vmcb_set_intercept, vmcb_clr_intercept and vmcb_is_intercept
> where applicable.
>
> Signed-off-by: Babu Moger 
Reviewed-by: Jim Mattson 


Re: [PATCH v5 01/12] KVM: SVM: Introduce vmcb_(set_intercept/clr_intercept/_is_intercept)

2020-08-26 Thread Jim Mattson
On Wed, Aug 26, 2020 at 12:13 PM Babu Moger  wrote:
>
> This is in preparation for the future intercept vector additions.
>
> Add new functions vmcb_set_intercept, vmcb_clr_intercept and vmcb_is_intercept
> using kernel APIs __set_bit, __clear_bit and test_bit espectively.
Nit: "respectively"
>
> Signed-off-by: Babu Moger 
Reviewed-by: Jim Mattson 


Re: [PATCH] KVM: VMX: fix crash cleanup when KVM wasn't used

2020-08-24 Thread Jim Mattson
On Mon, Aug 24, 2020 at 11:57 AM Jim Mattson  wrote:
>
> On Fri, Aug 21, 2020 at 8:40 PM Sean Christopherson
>  wrote:
> >
> > On Thu, Aug 20, 2020 at 01:08:22PM -0700, Jim Mattson wrote:
> > > On Wed, Apr 1, 2020 at 1:13 AM Vitaly Kuznetsov  
> > > wrote:
> > > > ---
> > > >  arch/x86/kvm/vmx/vmx.c | 12 +++-
> > > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index 3aba51d782e2..39a5dde12b79 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -2257,10 +2257,6 @@ static int hardware_enable(void)
> > > > !hv_get_vp_assist_page(cpu))
> > > > return -EFAULT;
> > > >
> > > > -   INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> > > > -   INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> > > > -   spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > > -
> > > > r = kvm_cpu_vmxon(phys_addr);
> > > > if (r)
> > > > return r;
> > > > @@ -8006,7 +8002,7 @@ module_exit(vmx_exit);
> > > >
> > > >  static int __init vmx_init(void)
> > > >  {
> > > > -   int r;
> > > > +   int r, cpu;
> > > >
> > > >  #if IS_ENABLED(CONFIG_HYPERV)
> > > > /*
> > > > @@ -8060,6 +8056,12 @@ static int __init vmx_init(void)
> > > > return r;
> > > > }
> > > >
> > > > +   for_each_possible_cpu(cpu) {
> > > > +   INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> > > > +   INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> > > > +   spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > > +   }
> > >
> > > Just above this chunk, we have:
> > >
> > > r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
> > > if (r) {
> > > vmx_exit();
> > > ...
> > >
> > > If we take that early exit, because vmx_setup_l1d_flush() fails, we
> > > won't initialize loaded_vmcss_on_cpu. However, vmx_exit() calls
> > > kvm_exit(), which calls on_each_cpu(hardware_disable_nolock, NULL, 1).
> > > Hardware_disable_nolock() then calls kvm_arch_hardware_disable(),
> > > which calls kvm_x86_ops.hardware_disable() [the vmx.c
> > > hardware_disable()], which calls vmclear_local_loaded_vmcss().
> > >
> > > I believe that vmclear_local_loaded_vmcss() will then try to
> > > dereference a NULL pointer, since per_cpu(loaded_vmcss_on_cpu, cpu) is
> > > uninitialzed.
> >
> > I agree the code is a mess (kvm_init() and kvm_exit() included), but I'm
> > pretty sure hardware_disable_nolock() is guaranteed to be a nop as it's
> > impossible for kvm_usage_count to be non-zero if vmx_init() hasn't
> > finished.
>
> Unless I'm missing something, there's no check for a non-zero
> kvm_usage_count on this path. There is such a check in
> hardware_disable_all_nolock(), but not in hardware_disable_nolock().

However, cpus_hardware_enabled shouldn't have any bits set, so
everything's fine. Nothing to see here, after all.
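
For the record, the guard I overlooked is at the top of
hardware_disable_nolock() in kvm_main.c; paraphrasing from memory, it looks
roughly like:

static void hardware_disable_nolock(void *junk)
{
        int cpu = raw_smp_processor_id();

        /* Hardware was never enabled on this CPU, so nothing to undo. */
        if (!cpumask_test_cpu(cpu, cpus_hardware_enabled))
                return;

        cpumask_clear_cpu(cpu, cpus_hardware_enabled);
        kvm_arch_hardware_disable();
}

Since vmx_init() bails out before any VM can be created, no CPU ever sets
its bit, and vmclear_local_loaded_vmcss() is never reached.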


Re: [PATCH] KVM: VMX: fix crash cleanup when KVM wasn't used

2020-08-24 Thread Jim Mattson
On Fri, Aug 21, 2020 at 8:40 PM Sean Christopherson
 wrote:
>
> On Thu, Aug 20, 2020 at 01:08:22PM -0700, Jim Mattson wrote:
> > On Wed, Apr 1, 2020 at 1:13 AM Vitaly Kuznetsov  wrote:
> > > ---
> > >  arch/x86/kvm/vmx/vmx.c | 12 +++-
> > >  1 file changed, 7 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 3aba51d782e2..39a5dde12b79 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -2257,10 +2257,6 @@ static int hardware_enable(void)
> > > !hv_get_vp_assist_page(cpu))
> > > return -EFAULT;
> > >
> > > -   INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> > > -   INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> > > -   spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > -
> > > r = kvm_cpu_vmxon(phys_addr);
> > > if (r)
> > > return r;
> > > @@ -8006,7 +8002,7 @@ module_exit(vmx_exit);
> > >
> > >  static int __init vmx_init(void)
> > >  {
> > > -   int r;
> > > +   int r, cpu;
> > >
> > >  #if IS_ENABLED(CONFIG_HYPERV)
> > > /*
> > > @@ -8060,6 +8056,12 @@ static int __init vmx_init(void)
> > > return r;
> > > }
> > >
> > > +   for_each_possible_cpu(cpu) {
> > > +   INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> > > +   INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> > > +   spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> > > +   }
> >
> > Just above this chunk, we have:
> >
> > r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
> > if (r) {
> > vmx_exit();
> > ...
> >
> > If we take that early exit, because vmx_setup_l1d_flush() fails, we
> > won't initialize loaded_vmcss_on_cpu. However, vmx_exit() calls
> > kvm_exit(), which calls on_each_cpu(hardware_disable_nolock, NULL, 1).
> > Hardware_disable_nolock() then calls kvm_arch_hardware_disable(),
> > which calls kvm_x86_ops.hardware_disable() [the vmx.c
> > hardware_disable()], which calls vmclear_local_loaded_vmcss().
> >
> > I believe that vmclear_local_loaded_vmcss() will then try to
> > dereference a NULL pointer, since per_cpu(loaded_vmcss_on_cpu, cpu) is
> > uninitialzed.
>
> I agree the code is a mess (kvm_init() and kvm_exit() included), but I'm
> pretty sure hardware_disable_nolock() is guaranteed to be a nop as it's
> impossible for kvm_usage_count to be non-zero if vmx_init() hasn't
> finished.

Unless I'm missing something, there's no check for a non-zero
kvm_usage_count on this path. There is such a check in
hardware_disable_all_nolock(), but not in hardware_disable_nolock().


Re: [PATCH v2 4/7] KVM: x86: allow kvm_x86_ops.set_efer to return a value

2020-08-20 Thread Jim Mattson
On Thu, Aug 20, 2020 at 6:34 AM Maxim Levitsky  wrote:
>
> This will be used later to return an error when setting this msr fails.
>
> For VMX, it already has an error condition when EFER is
> not in the shared MSR list, so return an error in this case.
>
> Signed-off-by: Maxim Levitsky 
> ---

> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1471,7 +1471,8 @@ static int set_efer(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
> efer &= ~EFER_LMA;
> efer |= vcpu->arch.efer & EFER_LMA;
>
> -   kvm_x86_ops.set_efer(vcpu, efer);
> +   if (kvm_x86_ops.set_efer(vcpu, efer))
> +   return 1;

This seems like a userspace ABI change to me. Previously, it looks
like userspace could always use KVM_SET_MSRS to set MSR_EFER to 0 or
EFER_SCE, and it would always succeed. Now, it looks like it will fail
on CPUs that don't support EFER in hardware. (Perhaps it should fail,
but it didn't before, AFAICT.)


Re: [PATCH v2 3/7] KVM: SVM: refactor msr permission bitmap allocation

2020-08-20 Thread Jim Mattson
On Thu, Aug 20, 2020 at 6:34 AM Maxim Levitsky  wrote:
>
> Replace svm_vcpu_init_msrpm with svm_vcpu_alloc_msrpm, that also allocates
> the msr bitmap and add svm_vcpu_free_msrpm to free it.
>
> This will be used later to move the nested msr permission bitmap allocation
> to nested.c
>
> No functional change intended.
>
> Signed-off-by: Maxim Levitsky 
> ---
>  arch/x86/kvm/svm/svm.c | 45 +-
>  1 file changed, 23 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index d33013b9b4d7..7bb094bf6494 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -609,18 +609,29 @@ static void set_msr_interception(u32 *msrpm, unsigned 
> msr,
> msrpm[offset] = tmp;
>  }
>
> -static void svm_vcpu_init_msrpm(u32 *msrpm)
> +static u32 *svm_vcpu_alloc_msrpm(void)

I prefer the original name, since this function does more than allocation.

>  {
> int i;
> +   u32 *msrpm;
> +   struct page *pages = alloc_pages(GFP_KERNEL_ACCOUNT, 
> MSRPM_ALLOC_ORDER);
> +
> +   if (!pages)
> +   return NULL;
>
> +   msrpm = page_address(pages);
> memset(msrpm, 0xff, PAGE_SIZE * (1 << MSRPM_ALLOC_ORDER));
>
> for (i = 0; direct_access_msrs[i].index != MSR_INVALID; i++) {
> if (!direct_access_msrs[i].always)
> continue;
> -
> set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 
> 1);
> }
> +   return msrpm;
> +}
> +
> +static void svm_vcpu_free_msrpm(u32 *msrpm)
> +{
> +   __free_pages(virt_to_page(msrpm), MSRPM_ALLOC_ORDER);
>  }
>
>  static void add_msr_offset(u32 offset)
> @@ -1172,9 +1183,7 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
>  {
> struct vcpu_svm *svm;
> struct page *vmcb_page;
> -   struct page *msrpm_pages;
> struct page *hsave_page;
> -   struct page *nested_msrpm_pages;
> int err;
>
> BUILD_BUG_ON(offsetof(struct vcpu_svm, vcpu) != 0);
> @@ -1185,21 +1194,13 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
> if (!vmcb_page)
> goto out;
>
> -   msrpm_pages = alloc_pages(GFP_KERNEL_ACCOUNT, MSRPM_ALLOC_ORDER);
> -   if (!msrpm_pages)
> -   goto free_page1;
> -
> -   nested_msrpm_pages = alloc_pages(GFP_KERNEL_ACCOUNT, 
> MSRPM_ALLOC_ORDER);
> -   if (!nested_msrpm_pages)
> -   goto free_page2;
> -

Reordering the allocations does seem like a functional change to me,
albeit one that should (hopefully) be benign. For example, if the
MSRPM_ALLOC_ORDER allocations fail, in the new version of the code,
the hsave_page will be cleared, but in the old version of the code, no
page would be cleared.

> hsave_page = alloc_page(GFP_KERNEL_ACCOUNT);

Speaking of clearing pages, why not add __GFP_ZERO to the flags above
and skip the clear_page() call below?

> if (!hsave_page)
> -   goto free_page3;
> +   goto free_page1;
>
> err = avic_init_vcpu(svm);
> if (err)
> -   goto free_page4;
> +   goto free_page2;
>
> /* We initialize this flag to true to make sure that the is_running
>  * bit would be set the first time the vcpu is loaded.
> @@ -1210,11 +1211,13 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
> svm->nested.hsave = page_address(hsave_page);
> clear_page(svm->nested.hsave);
>
> -   svm->msrpm = page_address(msrpm_pages);
> -   svm_vcpu_init_msrpm(svm->msrpm);
> +   svm->msrpm = svm_vcpu_alloc_msrpm();
> +   if (!svm->msrpm)
> +   goto free_page2;
>
> -   svm->nested.msrpm = page_address(nested_msrpm_pages);
> -   svm_vcpu_init_msrpm(svm->nested.msrpm);
> +   svm->nested.msrpm = svm_vcpu_alloc_msrpm();
> +   if (!svm->nested.msrpm)
> +   goto free_page3;
>
> svm->vmcb = page_address(vmcb_page);
> clear_page(svm->vmcb);
> @@ -1227,12 +1230,10 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
>
> return 0;
>
> -free_page4:
> -   __free_page(hsave_page);
>  free_page3:
> -   __free_pages(nested_msrpm_pages, MSRPM_ALLOC_ORDER);
> +   svm_vcpu_free_msrpm(svm->msrpm);
>  free_page2:
> -   __free_pages(msrpm_pages, MSRPM_ALLOC_ORDER);
> +   __free_page(hsave_page);
>  free_page1:
> __free_page(vmcb_page);
>  out:

While you're here, could you improve these labels? Coding-style.rst says:

Choose label names which say what the goto does or why the goto exists.  An
example of a good name could be ``out_free_buffer:`` if the goto frees
``buffer``.
Avoid using GW-BASIC names like ``err1:`` and ``err2:``, as you would have to
renumber them if you ever add or remove exit paths, and they make correctness
difficult to verify anyway.
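
For this function, that might look something like the following (sketch
only, keeping the existing unwind order), with the goto sites updated to
jump to the matching label:

out_free_msrpm:
        svm_vcpu_free_msrpm(svm->msrpm);
out_free_hsave:
        __free_page(hsave_page);
out_free_vmcb:
        __free_page(vmcb_page);
out:
        return err;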


Re: [PATCH v2 2/7] KVM: nSVM: rename nested 'vmcb' to vmcb12_gpa in few places

2020-08-20 Thread Jim Mattson
On Thu, Aug 20, 2020 at 6:33 AM Maxim Levitsky  wrote:
>
> No functional changes.
>
> Signed-off-by: Maxim Levitsky 
> ---
>  arch/x86/kvm/svm/nested.c | 10 +-
>  arch/x86/kvm/svm/svm.c| 13 +++--
>  arch/x86/kvm/svm/svm.h|  2 +-
>  3 files changed, 13 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index fb68467e6049..f5b17920a2ca 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -431,7 +431,7 @@ int enter_svm_guest_mode(struct vcpu_svm *svm, u64 
> vmcb_gpa,
For consistency, should the vmcb_gpa argument be renamed to vmcb12_gpa as well?


> @@ -579,7 +579,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>
> /* Exit Guest-Mode */
> leave_guest_mode(>vcpu);
> -   svm->nested.vmcb = 0;
> +   svm->nested.vmcb12_gpa = 0;
Perhaps in a follow-up change, this could be set to an illegal value
rather than 0?


> @@ -1018,7 +1018,7 @@ static int svm_get_nested_state(struct kvm_vcpu *vcpu,
>
> /* First fill in the header and copy it out.  */
> if (is_guest_mode(vcpu)) {
> -   kvm_state.hdr.svm.vmcb_pa = svm->nested.vmcb;
> +   kvm_state.hdr.svm.vmcb_pa = svm->nested.vmcb12_gpa;
It's unfortunate that we have "_pa" on the LHS and "_gpa" on the RHS. Oh, well.


> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 562a79e3e63a..d33013b9b4d7 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1102,7 +1102,7 @@ static void init_vmcb(struct vcpu_svm *svm)
> }
> svm->asid_generation = 0;
>
> -   svm->nested.vmcb = 0;
> +   svm->nested.vmcb12_gpa = 0;
Here, too, perhaps this could be changed from 0 to an illegal value in
a follow-up change.

Reviewed-by: Jim Mattson 


Re: [PATCH v2 1/7] KVM: SVM: rename a variable in the svm_create_vcpu

2020-08-20 Thread Jim Mattson
On Thu, Aug 20, 2020 at 6:33 AM Maxim Levitsky  wrote:
>
> The 'page' is to hold the vcpu's vmcb so name it as such to
> avoid confusion.
>
> Signed-off-by: Maxim Levitsky 
Reviewed-by: Jim Mattson 


Re: [PATCH] KVM: VMX: fix crash cleanup when KVM wasn't used

2020-08-20 Thread Jim Mattson
On Wed, Apr 1, 2020 at 1:13 AM Vitaly Kuznetsov  wrote:
>
> If KVM wasn't used at all before we crash the cleanup procedure fails with
>  BUG: unable to handle page fault for address: ffc8
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x) - not-present page
>  PGD 23215067 P4D 23215067 PUD 23217067 PMD 0
>  Oops:  [#8] SMP PTI
>  CPU: 0 PID: 3542 Comm: bash Kdump: loaded Tainted: G  D   
> 5.6.0-rc2+ #823
>  RIP: 0010:crash_vmclear_local_loaded_vmcss.cold+0x19/0x51 [kvm_intel]
>
> The root cause is that loaded_vmcss_on_cpu list is not yet initialized,
> we initialize it in hardware_enable() but this only happens when we start
> a VM.
>
> Previously, we used to have a bitmap with enabled CPUs and that was
> preventing [masking] the issue.
>
> Initialized loaded_vmcss_on_cpu list earlier, right before we assign
> crash_vmclear_loaded_vmcss pointer. blocked_vcpu_on_cpu list and
> blocked_vcpu_on_cpu_lock are moved altogether for consistency.
>
> Fixes: 31603d4fc2bb ("KVM: VMX: Always VMCLEAR in-use VMCSes during crash 
> with kexec support")
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/kvm/vmx/vmx.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 3aba51d782e2..39a5dde12b79 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2257,10 +2257,6 @@ static int hardware_enable(void)
> !hv_get_vp_assist_page(cpu))
> return -EFAULT;
>
> -   INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> -   INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> -   spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> -
> r = kvm_cpu_vmxon(phys_addr);
> if (r)
> return r;
> @@ -8006,7 +8002,7 @@ module_exit(vmx_exit);
>
>  static int __init vmx_init(void)
>  {
> -   int r;
> +   int r, cpu;
>
>  #if IS_ENABLED(CONFIG_HYPERV)
> /*
> @@ -8060,6 +8056,12 @@ static int __init vmx_init(void)
> return r;
> }
>
> +   for_each_possible_cpu(cpu) {
> +   INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> +   INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
> +   spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> +   }

Just above this chunk, we have:

r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
if (r) {
vmx_exit();
...

If we take that early exit, because vmx_setup_l1d_flush() fails, we
won't initialize loaded_vmcss_on_cpu. However, vmx_exit() calls
kvm_exit(), which calls on_each_cpu(hardware_disable_nolock, NULL, 1).
Hardware_disable_nolock() then calls kvm_arch_hardware_disable(),
which calls kvm_x86_ops.hardware_disable() [the vmx.c
hardware_disable()], which calls vmclear_local_loaded_vmcss().

I believe that vmclear_local_loaded_vmcss() will then try to
dereference a NULL pointer, since per_cpu(loaded_vmcss_on_cpu, cpu) is
uninitialized.

>  #ifdef CONFIG_KEXEC_CORE
> rcu_assign_pointer(crash_vmclear_loaded_vmcss,
>crash_vmclear_local_loaded_vmcss);
> --
> 2.25.1
>


Re: FSGSBASE causing panic on 5.9-rc1

2020-08-20 Thread Jim Mattson
On Thu, Aug 20, 2020 at 11:38 AM Jim Mattson  wrote:
>
> On Thu, Aug 20, 2020 at 11:34 AM Tom Lendacky  wrote:
> >
> > On 8/20/20 11:30 AM, Tom Lendacky wrote:
> > > On 8/20/20 11:17 AM, Tom Lendacky wrote:
> > >> On 8/20/20 10:55 AM, Andy Lutomirski wrote:
> > >>> On Thu, Aug 20, 2020 at 8:21 AM Tom Lendacky 
> > >>> wrote:
> > >>>>
> > >>>> On 8/20/20 10:10 AM, Sean Christopherson wrote:
> > >>>>> On Wed, Aug 19, 2020 at 05:21:33PM -0700, Andy Lutomirski wrote:
> > >>>>>> On Wed, Aug 19, 2020 at 2:25 PM Andy Lutomirski 
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>> On Wed, Aug 19, 2020 at 11:19 AM Tom Lendacky
> > >>>>>>>  wrote:
> > >>>>>>>>
> > >>>>>>>> On 8/19/20 1:07 PM, Tom Lendacky wrote:
> > >>>>>>>>> It looks like the FSGSBASE support is crashing my second
> > >>>>>>>>> generation EPYC
> > >>>>>>>>> system. I was able to bisect it to:
> > >>>>>>>>>
> > >>>>>>>>> b745cfba44c1 ("x86/cpu: Enable FSGSBASE on 64bit by default and
> > >>>>>>>>> add a chicken bit")
> > >>>>>>>>>
> > >>>>>>>>> The panic only happens when using KVM. Doing kernel builds or 
> > >>>>>>>>> stress
> > >>>>>>>>> on bare-metal appears fine. But if I fire up, in this case, a
> > >>>>>>>>> 64-vCPU
> > >>>>>>>>> guest and do a kernel build within the guest, I get the following:
> > >>>>>>>>
> > >>>>>>>> I should clarify that this panic is on the bare-metal system, not
> > >>>>>>>> in the
> > >>>>>>>> guest. And that specifying nofsgsbase on the bare-metal command
> > >>>>>>>> line fixes
> > >>>>>>>> the issue.
> > >>>>>>>
> > >>>>>>> I certainly see some oddities:
> > >>>>>>>
> > >>>>>>> We have this code:
> > >>>>>>>
> > >>>>>>> static void svm_vcpu_put(struct kvm_vcpu *vcpu)
> > >>>>>>> {
> > >>>>>>>   struct vcpu_svm *svm = to_svm(vcpu);
> > >>>>>>>   int i;
> > >>>>>>>
> > >>>>>>>   avic_vcpu_put(vcpu);
> > >>>>>>>
> > >>>>>>>   ++vcpu->stat.host_state_reload;
> > >>>>>>>   kvm_load_ldt(svm->host.ldt);
> > >>>>>>> #ifdef CONFIG_X86_64
> > >>>>>>>   loadsegment(fs, svm->host.fs);
> > >>>>>>>   wrmsrl(MSR_KERNEL_GS_BASE, current->thread.gsbase);
> > >>>>>
> > >>>>> Pretty sure current->thread.gsbase can be stale, i.e. this needs:
> > >>>>>
> > >>>>>current_save_fsgs();
> > >>>>
> > >>>> I did try adding current_save_fsgs() in svm_vcpu_load(), saving the
> > >>>> current->thread.gsbase value to a new variable in the svm struct. I 
> > >>>> then
> > >>>> used that variable in the wrmsrl below, but it still crashed.
> > >>>
> > >>> Can you try bisecting all the way back to:
> > >>>
> > >>> commit dd649bd0b3aa012740059b1ba31ecad28a408f7f
> > >>> Author: Andy Lutomirski 
> > >>> Date:   Thu May 28 16:13:48 2020 -0400
> > >>>
> > >>>  x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
> > >>>
> > >>> and adding the unsafe_fsgsbase command line option while you bisect.
> > >>
> > >> I'll give that a try.
> >
> > Bisecting with unsafe_fsgsbase identified:
> >
> > c82965f9e530 ("x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit")
> >
> > But I'm thinking that could be because it starts using GET_PERCPU_BASE,
> > which on Rome would use RDPID. So is SVM restoring TSC_AUX_MSR too late?
> > That would explain why I don't see the issue on Naples, which doesn't
> > support RDPID.
>
> It looks to me like SVM loads the guest TSC_AUX from vcpu_load to
> vcpu_put, with this comment:
>
> /* This assumes that the kernel never uses MSR_TSC_AUX */
> if (static_cpu_has(X86_FEATURE_RDTSCP))
> wrmsrl(MSR_TSC_AUX, svm->tsc_aux);

Correction: It never restores TSC_AUX, AFAICT.
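
If it helps the discussion, one possible shape of a fix is to save and
restore the host value explicitly around the guest run. This is only a
sketch, and "host_tsc_aux" is a made-up field name, not something in
the tree:

/* Hypothetical: remember the host's TSC_AUX before loading the guest's. */
static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	if (static_cpu_has(X86_FEATURE_RDTSCP)) {
		rdmsrl(MSR_TSC_AUX, svm->host_tsc_aux);
		wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
	}
}

/* ...and write it back when the vCPU is put. */
static void svm_vcpu_put(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	if (static_cpu_has(X86_FEATURE_RDTSCP))
		wrmsrl(MSR_TSC_AUX, svm->host_tsc_aux);
}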


Re: FSGSBASE causing panic on 5.9-rc1

2020-08-20 Thread Jim Mattson
On Thu, Aug 20, 2020 at 11:34 AM Tom Lendacky  wrote:
>
> On 8/20/20 11:30 AM, Tom Lendacky wrote:
> > On 8/20/20 11:17 AM, Tom Lendacky wrote:
> >> On 8/20/20 10:55 AM, Andy Lutomirski wrote:
> >>> On Thu, Aug 20, 2020 at 8:21 AM Tom Lendacky 
> >>> wrote:
> 
>  On 8/20/20 10:10 AM, Sean Christopherson wrote:
> > On Wed, Aug 19, 2020 at 05:21:33PM -0700, Andy Lutomirski wrote:
> >> On Wed, Aug 19, 2020 at 2:25 PM Andy Lutomirski 
> >> wrote:
> >>>
> >>> On Wed, Aug 19, 2020 at 11:19 AM Tom Lendacky
> >>>  wrote:
> 
>  On 8/19/20 1:07 PM, Tom Lendacky wrote:
> > It looks like the FSGSBASE support is crashing my second
> > generation EPYC
> > system. I was able to bisect it to:
> >
> > b745cfba44c1 ("x86/cpu: Enable FSGSBASE on 64bit by default and
> > add a chicken bit")
> >
> > The panic only happens when using KVM. Doing kernel builds or stress
> > on bare-metal appears fine. But if I fire up, in this case, a
> > 64-vCPU
> > guest and do a kernel build within the guest, I get the following:
> 
>  I should clarify that this panic is on the bare-metal system, not
>  in the
>  guest. And that specifying nofsgsbase on the bare-metal command
>  line fixes
>  the issue.
> >>>
> >>> I certainly see some oddities:
> >>>
> >>> We have this code:
> >>>
> >>> static void svm_vcpu_put(struct kvm_vcpu *vcpu)
> >>> {
> >>>   struct vcpu_svm *svm = to_svm(vcpu);
> >>>   int i;
> >>>
> >>>   avic_vcpu_put(vcpu);
> >>>
> >>>   ++vcpu->stat.host_state_reload;
> >>>   kvm_load_ldt(svm->host.ldt);
> >>> #ifdef CONFIG_X86_64
> >>>   loadsegment(fs, svm->host.fs);
> >>>   wrmsrl(MSR_KERNEL_GS_BASE, current->thread.gsbase);
> >
> > Pretty sure current->thread.gsbase can be stale, i.e. this needs:
> >
> >current_save_fsgs();
> 
>  I did try adding current_save_fsgs() in svm_vcpu_load(), saving the
>  current->thread.gsbase value to a new variable in the svm struct. I then
>  used that variable in the wrmsrl below, but it still crashed.
> >>>
> >>> Can you try bisecting all the way back to:
> >>>
> >>> commit dd649bd0b3aa012740059b1ba31ecad28a408f7f
> >>> Author: Andy Lutomirski 
> >>> Date:   Thu May 28 16:13:48 2020 -0400
> >>>
> >>>  x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
> >>>
> >>> and adding the unsafe_fsgsbase command line option while you bisect.
> >>
> >> I'll give that a try.
>
> Bisecting with unsafe_fsgsbase identified:
>
> c82965f9e530 ("x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit")
>
> But I'm thinking that could be because it starts using GET_PERCPU_BASE,
> which on Rome would use RDPID. So is SVM restoring TSC_AUX_MSR too late?
> That would explain why I don't see the issue on Naples, which doesn't
> support RDPID.

It looks to me like SVM loads the guest TSC_AUX from vcpu_load to
vcpu_put, with this comment:

/* This assumes that the kernel never uses MSR_TSC_AUX */
if (static_cpu_has(X86_FEATURE_RDTSCP))
wrmsrl(MSR_TSC_AUX, svm->tsc_aux);

We are talking about mainline here, right?


Re: [PATCH v4 0/3] Allow user space to restrict and augment MSR emulation

2020-08-19 Thread Jim Mattson
On Wed, Aug 19, 2020 at 3:09 PM Jim Mattson  wrote:
>
> On Wed, Aug 19, 2020 at 2:46 PM Graf (AWS), Alexander  wrote:
>
> > Special MSRs like EFER also irritate me a bit. We can't really trap on them 
> > - most code paths just know they're handled in kernel. Maybe I'll add some 
> > sanity checks as well...
>
> Why can't we intercept EFER?

Some MSRs (IA32_GSBASE comes to mind) can't be completely handled by
userspace, even if we do intercept RDMSR and WRMSR. The EFER.LMA bit
also falls into that category, though the rest of the register isn't a
problem (and EFER.LMA is always derivable). Is that what you meant?


Re: [PATCH v4 2/3] KVM: x86: Introduce allow list for MSR emulation

2020-08-19 Thread Jim Mattson
On Mon, Aug 3, 2020 at 2:14 PM Alexander Graf  wrote:

> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -901,6 +901,13 @@ struct kvm_hv {
> struct kvm_hv_syndbg hv_syndbg;
>  };
>
> +struct msr_bitmap_range {
> +   u32 flags;
> +   u32 nmsrs;
> +   u32 base;
> +   unsigned long *bitmap;
> +};
> +
>  enum kvm_irqchip_mode {
> KVM_IRQCHIP_NONE,
> KVM_IRQCHIP_KERNEL,   /* created with KVM_CREATE_IRQCHIP */
> @@ -1005,6 +1012,9 @@ struct kvm_arch {
> /* Deflect RDMSR and WRMSR to user space when they trigger a #GP */
> bool user_space_msr_enabled;
>
> +   struct msr_bitmap_range msr_allowlist_ranges[10];

Why 10? I think this is the only use of this constant, but a macro
would still be nice, especially since the number appears to be
arbitrary.
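
For example, something along these lines (macro name is illustrative,
not from the patch):

/* Illustrative name only -- the 10 in the posted patch is arbitrary. */
#define KVM_MSR_ALLOWLIST_MAX_RANGES 10

	struct msr_bitmap_range msr_allowlist_ranges[KVM_MSR_ALLOWLIST_MAX_RANGES];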

> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 0780f97c1850..c33fb1d72d52 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -192,6 +192,21 @@ struct kvm_msr_list {
> __u32 indices[0];
>  };
>
> +#define KVM_MSR_ALLOW_READ  (1 << 0)
> +#define KVM_MSR_ALLOW_WRITE (1 << 1)
> +
> +/* Maximum size of the of the bitmap in bytes */
> +#define KVM_MSR_ALLOWLIST_MAX_LEN 0x600

Wouldn't 0x400 be a more natural size, since both Intel and AMD MSR
permission bitmaps cover ranges of 8192 MSRs?

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e1139124350f..25e58ceb19de 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1472,6 +1472,38 @@ void kvm_enable_efer_bits(u64 mask)
>  }
>  EXPORT_SYMBOL_GPL(kvm_enable_efer_bits);
>
> +static bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)

In another thread, when I suggested that a function should return
bool, you said, "'I'm not a big fan of bool returning APIs unless they
have an "is" in their name.' This function doesn't have "is" in its
name. :-)

> +{
> +   struct kvm *kvm = vcpu->kvm;
> +   struct msr_bitmap_range *ranges = kvm->arch.msr_allowlist_ranges;
> +   u32 count = kvm->arch.msr_allowlist_ranges_count;

Shouldn't the read of kvm->arch.msr_allowlist_ranges_count be guarded
by the mutex, below?
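
A sketch of the ordering being suggested, restructured from the hunk
quoted above and below (untested):

static bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
{
	struct kvm *kvm = vcpu->kvm;
	struct msr_bitmap_range *ranges = kvm->arch.msr_allowlist_ranges;
	u32 count, i;
	bool r;

	/* Sample the count only under kvm->lock, so the count and the
	 * array contents cannot be torn by a concurrent clear ioctl. */
	mutex_lock(&kvm->lock);

	count = kvm->arch.msr_allowlist_ranges_count;

	/* No allowlist configured: allow everything. */
	r = !count;

	for (i = 0; i < count; i++) {
		u32 start = ranges[i].base;
		u32 end = start + ranges[i].nmsrs;

		if (index >= start && index < end && (ranges[i].flags & type)) {
			r = test_bit(index - start, ranges[i].bitmap);
			break;
		}
	}

	mutex_unlock(&kvm->lock);
	return r;
}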

> +   u32 i;
> +   bool r = false;
> +
> +   /* MSR allowlist not set up, allow everything */
> +   if (!count)
> +   return true;
> +
> +   /* Prevent collision with clear_msr_allowlist */
> +   mutex_lock(&kvm->lock);
> +
> +   for (i = 0; i < count; i++) {
> +   u32 start = ranges[i].base;
> +   u32 end = start + ranges[i].nmsrs;
> +   u32 flags = ranges[i].flags;
> +   unsigned long *bitmap = ranges[i].bitmap;
> +
> +   if ((index >= start) && (index < end) && (flags & type)) {
> +   r = !!test_bit(index - start, bitmap);

The !! seems gratuitous, since r is of type bool.

> @@ -1483,6 +1515,9 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 
> index, u64 data,
>  {
> struct msr_data msr;
>
> +   if (!host_initiated && !kvm_msr_allowed(vcpu, index, 
> KVM_MSR_ALLOW_WRITE))
> +   return -ENOENT;

Perhaps -EPERM is more appropriate here?

> switch (index) {
> case MSR_FS_BASE:
> case MSR_GS_BASE:
> @@ -1528,6 +1563,9 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 
> *data,
> struct msr_data msr;
> int ret;
>
> +   if (!host_initiated && !kvm_msr_allowed(vcpu, index, 
> KVM_MSR_ALLOW_READ))
> +   return -ENOENT;

...and here?

> +static bool msr_range_overlaps(struct kvm *kvm, struct msr_bitmap_range 
> *range)

Another bool function with no "is"? :-)

> +{
> +   struct msr_bitmap_range *ranges = kvm->arch.msr_allowlist_ranges;
> +   u32 i, count = kvm->arch.msr_allowlist_ranges_count;
> +   bool r = false;
> +
> +   for (i = 0; i < count; i++) {
> +   u32 start = max(range->base, ranges[i].base);
> +   u32 end = min(range->base + range->nmsrs,
> + ranges[i].base + ranges[i].nmsrs);
> +
> +   if ((start < end) && (range->flags & ranges[i].flags)) {
> +   r = true;
> +   break;
> +   }
> +   }
> +
> +   return r;
> +}

This seems like an awkward constraint. Would it be possible to allow
overlapping ranges as long as the access types don't clash? So, for
example, could I specify an allow list for READ of MSRs 0-0x1 and
an allow list for WRITE of MSRs 0-0x1? Actually, I don't see why
you have to prohibit overlapping ranges at all.


> +static int kvm_vm_ioctl_clear_msr_allowlist(struct kvm *kvm)
> +{
> +   int i;

Nit: In earlier code, you use u32 for this index. (I'm actually a fan
of int, myself.)


> @@ -10086,6 +10235,8 @@ void kvm_arch_pre_destroy_vm(struct kvm *kvm)
>
>  void kvm_arch_destroy_vm(struct kvm *kvm)
>  {
> +   

Re: [PATCH v4 0/3] Allow user space to restrict and augment MSR emulation

2020-08-19 Thread Jim Mattson
On Wed, Aug 19, 2020 at 2:46 PM Graf (AWS), Alexander  wrote:

> Special MSRs like EFER also irritate me a bit. We can't really trap on them - 
> most code paths just know they're handled in kernel. Maybe I'll add some 
> sanity checks as well...

Why can't we intercept EFER?


Re: [PATCH v4 0/3] Allow user space to restrict and augment MSR emulation

2020-08-19 Thread Jim Mattson
On Mon, Aug 3, 2020 at 2:14 PM Alexander Graf  wrote:
>
> While tying to add support for the MSR_CORE_THREAD_COUNT MSR in KVM,
> I realized that we were still in a world where user space has no control
> over what happens with MSR emulation in KVM.
>
> That is bad for multiple reasons. In my case, I wanted to emulate the
> MSR in user space, because it's a CPU specific register that does not
> exist on older CPUs and that really only contains informational data that
> is on the package level, so it's a natural fit for user space to provide
> it.
>
> However, it is also bad on a platform compatibility level. Currrently,
> KVM has no way to expose different MSRs based on the selected target CPU
> type.
>
> This patch set introduces a way for user space to indicate to KVM which
> MSRs should be handled in kernel space. With that, we can solve part of
> the platform compatibility story. Or at least we can not handle AMD specific
> MSRs on an Intel platform and vice versa.
>
> In addition, it introduces a way for user space to get into the loop
> when an MSR access would generate a #GP fault, such as when KVM finds an
> MSR that is not handled by the in-kernel MSR emulation or when the guest
> is trying to access reserved registers.
>
> In combination with the allow list, the user space trapping allows us
> to emulate arbitrary MSRs in user space, paving the way for target CPU
> specific MSR implementations from user space.

This is somewhat misleading. If you don't modify the MSR permission
bitmaps, as Aaron has done, you cannot emulate *arbitrary* MSRs in
userspace. You can only emulate MSRs that kvm is going to intercept.
Moreover, since the set of intercepted MSRs evolves over time, this
isn't a stable API.

> v1 -> v2:
>
>   - s/ETRAP_TO_USER_SPACE/ENOENT/g
>   - deflect all #GP injection events to user space, not just unknown MSRs.
> That was we can also deflect allowlist errors later
>   - fix emulator case
>   - new patch: KVM: x86: Introduce allow list for MSR emulation
>   - new patch: KVM: selftests: Add test for user space MSR handling
>
> v2 -> v3:
>
>   - return r if r == X86EMUL_IO_NEEDED
>   - s/KVM_EXIT_RDMSR/KVM_EXIT_X86_RDMSR/g
>   - s/KVM_EXIT_WRMSR/KVM_EXIT_X86_WRMSR/g
>   - Use complete_userspace_io logic instead of reply field
>   - Simplify trapping code
>   - document flags for KVM_X86_ADD_MSR_ALLOWLIST
>   - generalize exit path, always unlock when returning
>   - s/KVM_CAP_ADD_MSR_ALLOWLIST/KVM_CAP_X86_MSR_ALLOWLIST/g
>   - Add KVM_X86_CLEAR_MSR_ALLOWLIST
>   - Add test to clear whitelist
>   - Adjust to reply-less API
>   - Fix asserts
>   - Actually trap on MSR_IA32_POWER_CTL writes
>
> v3 -> v4:
>
>   - Mention exit reasons in re-enter mandatory section of API documentation
>   - Clear padding bytes
>   - Generalize get/set deflect functions
>   - Remove redundant pending_user_msr field
>   - lock allow check and clearing
>   - free bitmaps on clear
>
> Alexander Graf (3):
>   KVM: x86: Deflect unknown MSR accesses to user space
>   KVM: x86: Introduce allow list for MSR emulation
>   KVM: selftests: Add test for user space MSR handling
>
>  Documentation/virt/kvm/api.rst| 157 ++-
>  arch/x86/include/asm/kvm_host.h   |  13 +
>  arch/x86/include/uapi/asm/kvm.h   |  15 +
>  arch/x86/kvm/emulate.c|  18 +-
>  arch/x86/kvm/x86.c| 259 +-
>  include/trace/events/kvm.h|   2 +-
>  include/uapi/linux/kvm.h  |  15 +
>  tools/testing/selftests/kvm/Makefile  |   1 +
>  .../selftests/kvm/x86_64/user_msr_test.c  | 221 +++
>  9 files changed, 692 insertions(+), 9 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/x86_64/user_msr_test.c
>
> --
> 2.17.1
>
>
>
>
> Amazon Development Center Germany GmbH
> Krausenstr. 38
> 10117 Berlin
> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
> Sitz: Berlin
> Ust-ID: DE 289 237 879
>
>
>


Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR

2020-08-18 Thread Jim Mattson
On Tue, Aug 18, 2020 at 12:28 AM Chenyi Qiang  wrote:
>
>
>
> On 8/14/2020 1:31 AM, Jim Mattson wrote:
> > On Wed, Aug 12, 2020 at 10:42 PM Chenyi Qiang  
> > wrote:
> >>
> >>
> >>
> >> On 8/13/2020 5:21 AM, Jim Mattson wrote:
> >>> On Fri, Aug 7, 2020 at 1:46 AM Chenyi Qiang  
> >>> wrote:
> >>>>
> >>>> Protection Keys for Supervisor Pages (PKS) uses IA32_PKRS MSR (PKRS) at
> >>>> index 0x6E1 to allow software to manage supervisor protection key
> >>>> rights. For performance consideration, PKRS intercept will be disabled
> >>>> so that the guest can access the PKRS without VM exits.
> >>>> PKS introduces dedicated control fields in VMCS to switch PKRS, which
> >>>> only does the retore part. In addition, every VM exit saves PKRS into
> >>>> the guest-state area in VMCS, while VM enter won't save the host value
> >>>> due to the expectation that the host won't change the MSR often. Update
> >>>> the host's value in VMCS manually if the MSR has been changed by the
> >>>> kernel since the last time the VMCS was run.
> >>>> The function get_current_pkrs() in arch/x86/mm/pkeys.c exports the
> >>>> per-cpu variable pkrs_cache to avoid frequent rdmsr of PKRS.
> >>>>
> >>>> Signed-off-by: Chenyi Qiang 
> >>>> ---
> >>>
> >>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> >>>> index 11e4df560018..df2c2e733549 100644
> >>>> --- a/arch/x86/kvm/vmx/nested.c
> >>>> +++ b/arch/x86/kvm/vmx/nested.c
> >>>> @@ -289,6 +289,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx 
> >>>> *vmx,
> >>>>   dest->ds_sel = src->ds_sel;
> >>>>   dest->es_sel = src->es_sel;
> >>>>#endif
> >>>> +   dest->pkrs = src->pkrs;
> >>>
> >>> Why isn't this (and other PKRS code) inside the #ifdef CONFIG_X86_64?
> >>> PKRS isn't usable outside of long mode, is it?
> >>>
> >>
> >> Yes, I'm also thinking about whether to put all pks code into
> >> CONFIG_X86_64. The kernel implementation also wrap its pks code inside
> >> CONFIG_ARCH_HAS_SUPERVISOR_PKEYS which has dependency with CONFIG_X86_64.
> >> However, maybe this can help when host kernel disable PKS but the guest
> >> enable it. What do you think about this?
> >
> > I see no problem in exposing PKRS to the guest even if the host
> > doesn't have CONFIG_ARCH_HAS_SUPERVISOR_PKEYS.
> >
>
> Yes, but I would prefer to keep it outside CONFIG_X86_64. PKS code has
> several code blocks and putting them under x86_64 may end up being a
> mess. In addition, PKU KVM related code isn't under CONFIG_X86_64 as
> well. So, is it really necessary to put inside?

I'll let someone who actually cares about the i386 build answer that question.


Re: [RFC 7/7] KVM: VMX: Enable PKS for nested VM

2020-08-14 Thread Jim Mattson
On Fri, Aug 14, 2020 at 3:09 AM Chenyi Qiang  wrote:
>
>
>
> On 8/14/2020 1:52 AM, Jim Mattson wrote:
> > On Wed, Aug 12, 2020 at 9:54 PM Chenyi Qiang  wrote:
> >>
> >>
> >>
> >> On 8/11/2020 8:05 AM, Jim Mattson wrote:
> >>> On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang  
> >>> wrote:
> >>>>
> >>>> PKS MSR passes through guest directly. Configure the MSR to match the
> >>>> L0/L1 settings so that nested VM runs PKS properly.
> >>>>
> >>>> Signed-off-by: Chenyi Qiang 
> >>>> ---
> >
> >>>> +   (!vmx->nested.nested_run_pending ||
> >>>> +!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
> >>>> +   vmcs_write64(GUEST_IA32_PKRS, 
> >>>> vmx->nested.vmcs01_guest_pkrs);
> >>>
> >>> This doesn't seem right to me. On the target of a live migration, with
> >>> L2 active at the time the snapshot was taken (i.e.,
> >>> vmx->nested.nested_run_pending=0), it looks like we're going to try to
> >>> overwrite the current L2 PKRS value with L1's PKRS value (except that
> >>> in this situation, vmx->nested.vmcs01_guest_pkrs should actually be
> >>> 0). Am I missing something?
> >>>
> >>
> >> We overwrite the L2 PKRS with L1's value when L2 doesn't support PKS.
> >> Because the L1's VM_ENTRY_LOAD_IA32_PKRS is off, we need to migrate L1's
> >> PKRS to L2.
> >
> > I'm thinking of the case where vmx->nested.nested_run_pending is
> > false, and we are processing a KVM_SET_NESTED_STATE ioctl, yet
> > VM_ENTRY_LOAD_IA32_PKRS *is* set in the vmcs12.
> >
>
> Oh, I miss this case. What I'm still confused here is that the
> restoration for GUEST_IA32_DEBUGCTL and GUEST_BNDCFGS have the same
> issue, right? or I miss something.

I take it back. This does work, assuming that userspace calls
KVM_SET_MSRS before calling KVM_SET_NESTED_STATE. Assuming L2 is
active when the checkpoint is taken, the MSR values saved will be the
L2 values. When restoring the MSRs with KVM_SET_MSRS, the L2 MSR
values will be written into vmcs01. They don't belong there, but we're
never going to launch vmcs01 with those MSR values. Instead, when
userspace calls KVM_SET_NESTED_STATE, those values will be transferred
first to the vmcs01_ fields of the vmx->nested struct, and then
to vmcs02.

This is subtle, and I don't think it's documented anywhere that
KVM_SET_NESTED_STATE must be called after KVM_SET_MSRS. In fact, there
are probably a number of dependencies among the various KVM_SET_*
functions that aren't documented anywhere.


Re: [RFC 6/7] KVM: X86: Expose PKS to guest and userspace

2020-08-13 Thread Jim Mattson
On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang  wrote:
>
> Existence of PKS is enumerated via CPUID.(EAX=7H,ECX=0):ECX[31]. It is
> enabled by setting CR4.PKS when long mode is active. PKS is only
> implemented when EPT is enabled and requires the support of VM_{ENTRY,
> EXIT}_LOAD_IA32_PKRS currently.
>
> Signed-off-by: Chenyi Qiang 

> @@ -967,7 +969,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  {
> unsigned long old_cr4 = kvm_read_cr4(vcpu);
> unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
> -  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
> +  X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE |
> +  X86_CR4_PKS;

This list already seems overly long, but I don't think CR4.PKS belongs
here. In volume 3 of the SDM, section 4.4.1, it says:

- If PAE paging would be in use following an execution of MOV to CR0
or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any
of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP;
then the PDPTEs are loaded from the address in CR3.

CR4.PKS is not in the list of CR4 bits that result in a PDPTE load.
Since it has no effect on PAE paging, I would be surprised if it did
result in a PDPTE load.
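
Concretely, that would just mean leaving the mask as it was before the
patch (a sketch; assumes nothing else here needs to change):

	/* CR4.PKS has no effect on PAE paging, so it does not need to be
	 * part of the set of bits that force a PDPTE reload. */
	unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
				   X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;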

> if (kvm_valid_cr4(vcpu, cr4))
> return 1;
> @@ -1202,7 +1205,7 @@ static const u32 msrs_to_save_all[] = {
> MSR_IA32_RTIT_ADDR1_A, MSR_IA32_RTIT_ADDR1_B,
> MSR_IA32_RTIT_ADDR2_A, MSR_IA32_RTIT_ADDR2_B,
> MSR_IA32_RTIT_ADDR3_A, MSR_IA32_RTIT_ADDR3_B,
> -   MSR_IA32_UMWAIT_CONTROL,
> +   MSR_IA32_UMWAIT_CONTROL, MSR_IA32_PKRS,

Should MSR_IA32_PKRS be added to the switch statement in
kvm_init_msr_list()? Something like...

case MSR_IA32_PKRS:
if (!kvm_cpu_cap_has(X86_FEATURE_PKRS))
continue;
break;


Re: [RFC 7/7] KVM: VMX: Enable PKS for nested VM

2020-08-13 Thread Jim Mattson
On Wed, Aug 12, 2020 at 9:54 PM Chenyi Qiang  wrote:
>
>
>
> On 8/11/2020 8:05 AM, Jim Mattson wrote:
> > On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang  wrote:
> >>
> >> PKS MSR passes through guest directly. Configure the MSR to match the
> >> L0/L1 settings so that nested VM runs PKS properly.
> >>
> >> Signed-off-by: Chenyi Qiang 
> >> ---

> >> +   (!vmx->nested.nested_run_pending ||
> >> +!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
> >> +   vmcs_write64(GUEST_IA32_PKRS, 
> >> vmx->nested.vmcs01_guest_pkrs);
> >
> > This doesn't seem right to me. On the target of a live migration, with
> > L2 active at the time the snapshot was taken (i.e.,
> > vmx->nested.nested_run_pending=0), it looks like we're going to try to
> > overwrite the current L2 PKRS value with L1's PKRS value (except that
> > in this situation, vmx->nested.vmcs01_guest_pkrs should actually be
> > 0). Am I missing something?
> >
>
> We overwrite the L2 PKRS with L1's value when L2 doesn't support PKS.
> Because the L1's VM_ENTRY_LOAD_IA32_PKRS is off, we need to migrate L1's
> PKRS to L2.

I'm thinking of the case where vmx->nested.nested_run_pending is
false, and we are processing a KVM_SET_NESTED_STATE ioctl, yet
VM_ENTRY_LOAD_IA32_PKRS *is* set in the vmcs12.


Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR

2020-08-13 Thread Jim Mattson
On Wed, Aug 12, 2020 at 10:42 PM Chenyi Qiang  wrote:
>
>
>
> On 8/13/2020 5:21 AM, Jim Mattson wrote:
> > On Fri, Aug 7, 2020 at 1:46 AM Chenyi Qiang  wrote:
> >>
> >> Protection Keys for Supervisor Pages (PKS) uses IA32_PKRS MSR (PKRS) at
> >> index 0x6E1 to allow software to manage supervisor protection key
> >> rights. For performance consideration, PKRS intercept will be disabled
> >> so that the guest can access the PKRS without VM exits.
> >> PKS introduces dedicated control fields in VMCS to switch PKRS, which
> >> only does the retore part. In addition, every VM exit saves PKRS into
> >> the guest-state area in VMCS, while VM enter won't save the host value
> >> due to the expectation that the host won't change the MSR often. Update
> >> the host's value in VMCS manually if the MSR has been changed by the
> >> kernel since the last time the VMCS was run.
> >> The function get_current_pkrs() in arch/x86/mm/pkeys.c exports the
> >> per-cpu variable pkrs_cache to avoid frequent rdmsr of PKRS.
> >>
> >> Signed-off-by: Chenyi Qiang 
> >> ---
> >
> >> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> >> index 11e4df560018..df2c2e733549 100644
> >> --- a/arch/x86/kvm/vmx/nested.c
> >> +++ b/arch/x86/kvm/vmx/nested.c
> >> @@ -289,6 +289,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx 
> >> *vmx,
> >>  dest->ds_sel = src->ds_sel;
> >>  dest->es_sel = src->es_sel;
> >>   #endif
> >> +   dest->pkrs = src->pkrs;
> >
> > Why isn't this (and other PKRS code) inside the #ifdef CONFIG_X86_64?
> > PKRS isn't usable outside of long mode, is it?
> >
>
> Yes, I'm also thinking about whether to put all pks code into
> CONFIG_X86_64. The kernel implementation also wrap its pks code inside
> CONFIG_ARCH_HAS_SUPERVISOR_PKEYS which has dependency with CONFIG_X86_64.
> However, maybe this can help when host kernel disable PKS but the guest
> enable it. What do you think about this?

I see no problem in exposing PKRS to the guest even if the host
doesn't have CONFIG_ARCH_HAS_SUPERVISOR_PKEYS.


Re: [RFC 2/7] KVM: VMX: Expose IA32_PKRS MSR

2020-08-12 Thread Jim Mattson
On Fri, Aug 7, 2020 at 1:46 AM Chenyi Qiang  wrote:
>
> Protection Keys for Supervisor Pages (PKS) uses IA32_PKRS MSR (PKRS) at
> index 0x6E1 to allow software to manage supervisor protection key
> rights. For performance consideration, PKRS intercept will be disabled
> so that the guest can access the PKRS without VM exits.
> PKS introduces dedicated control fields in VMCS to switch PKRS, which
> only does the retore part. In addition, every VM exit saves PKRS into
> the guest-state area in VMCS, while VM enter won't save the host value
> due to the expectation that the host won't change the MSR often. Update
> the host's value in VMCS manually if the MSR has been changed by the
> kernel since the last time the VMCS was run.
> The function get_current_pkrs() in arch/x86/mm/pkeys.c exports the
> per-cpu variable pkrs_cache to avoid frequent rdmsr of PKRS.
>
> Signed-off-by: Chenyi Qiang 
> ---

> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 11e4df560018..df2c2e733549 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -289,6 +289,7 @@ static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx,
> dest->ds_sel = src->ds_sel;
> dest->es_sel = src->es_sel;
>  #endif
> +   dest->pkrs = src->pkrs;

Why isn't this (and other PKRS code) inside the #ifdef CONFIG_X86_64?
PKRS isn't usable outside of long mode, is it?

>  }
>
>  static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
> index 7a3675fddec2..39ec3d0c844b 100644
> --- a/arch/x86/kvm/vmx/vmcs.h
> +++ b/arch/x86/kvm/vmx/vmcs.h
> @@ -40,6 +40,7 @@ struct vmcs_host_state {
>  #ifdef CONFIG_X86_64
> u16   ds_sel, es_sel;
>  #endif
> +   u32   pkrs;

One thing that does seem odd throughout is that PKRS is a 64-bit
resource, but we store and pass around only 32-bits. Yes, the top 32
bits are currently reserved, but the same can be said of, say, cr4, a
few lines above this. Yet, we store and pass around cr4 as 64-bits.
How do we decide?

>  };
>
>  struct vmcs_controls_shadow {

> @@ -1163,6 +1164,20 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>  */
> host_state->ldt_sel = kvm_read_ldt();
>
> +   /*
> +* Update the host pkrs vmcs field before vcpu runs.
> +* The setting of VM_EXIT_LOAD_IA32_PKRS can ensure
> +* kvm_cpu_cap_has(X86_FEATURE_PKS) &&
> +* guest_cpuid_has(vcpu, X86_FEATURE_VMX).
> +*/
> +   if (vm_exit_controls_get(vmx) & VM_EXIT_LOAD_IA32_PKRS) {
> +   host_pkrs = get_current_pkrs();
> +   if (unlikely(host_pkrs != host_state->pkrs)) {
> +   vmcs_write64(HOST_IA32_PKRS, host_pkrs);
> +   host_state->pkrs = host_pkrs;
> +   }
> +   }
> +
>  #ifdef CONFIG_X86_64
> savesegment(ds, host_state->ds_sel);
> savesegment(es, host_state->es_sel);
> @@ -1951,6 +1966,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
> else
> msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
> break;
> +   case MSR_IA32_PKRS:
> +   if (!kvm_cpu_cap_has(X86_FEATURE_PKS) ||
> +   (!msr_info->host_initiated &&
> +   !guest_cpuid_has(vcpu, X86_FEATURE_PKS)))

Could this be simplified if we required
kvm_cpu_cap_has(X86_FEATURE_PKS) as a prerequisite for
guest_cpuid_has(vcpu, X86_FEATURE_PKS)? If not, what is the expected
behavior if guest_cpuid_has(vcpu, X86_FEATURE_PKS) is true and
kvm_cpu_cap_has(X86_FEATURE_PKS)  is false?
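
If that prerequisite were enforced when guest CPUID is set, the read
path could presumably shrink to something like this (untested sketch;
the hunk under discussion continues below):

	case MSR_IA32_PKRS:
		/* Assumes guest_cpuid_has(X86_FEATURE_PKS) can only be true
		 * when kvm_cpu_cap_has(X86_FEATURE_PKS) is also true. */
		if (!msr_info->host_initiated &&
		    !guest_cpuid_has(vcpu, X86_FEATURE_PKS))
			return 1;
		msr_info->data = vmcs_read64(GUEST_IA32_PKRS);
		break;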

> +   return 1;
> +   msr_info->data = vmcs_read64(GUEST_IA32_PKRS);
> +   break;
> case MSR_TSC_AUX:
> if (!msr_info->host_initiated &&
> !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP))

> @@ -7230,6 +7267,26 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
> vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
>  }
>
> +static void vmx_update_pkrs_cfg(struct kvm_vcpu *vcpu)
> +{
> +   struct vcpu_vmx *vmx = to_vmx(vcpu);
> +   unsigned long *msr_bitmap = vmx->vmcs01.msr_bitmap;
> +   bool pks_supported = guest_cpuid_has(vcpu, X86_FEATURE_PKS);
> +
> +   /*
> +* set intercept for PKRS when the guest doesn't support pks
> +*/
> +   vmx_set_intercept_for_msr(msr_bitmap, MSR_IA32_PKRS, MSR_TYPE_RW, 
> !pks_supported);
> +
> +   if (pks_supported) {
> +   vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
> +   vm_exit_controls_setbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
> +   } else {
> +   vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_PKRS);
> +   vm_exit_controls_clearbit(vmx, VM_EXIT_LOAD_IA32_PKRS);
> +   }
> +}

Not just your change, but it is unclear to 

Re: [RFC 7/7] KVM: VMX: Enable PKS for nested VM

2020-08-12 Thread Jim Mattson
On Wed, Aug 12, 2020 at 8:00 AM Sean Christopherson
 wrote:
>
> On Mon, Aug 10, 2020 at 05:05:36PM -0700, Jim Mattson wrote:
> > On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang  wrote:
> > >
> > > PKS MSR passes through guest directly. Configure the MSR to match the
> > > L0/L1 settings so that nested VM runs PKS properly.
> > >
> > > Signed-off-by: Chenyi Qiang 
> > > ---
> > >  arch/x86/kvm/vmx/nested.c | 32 
> > >  arch/x86/kvm/vmx/vmcs12.c |  2 ++
> > >  arch/x86/kvm/vmx/vmcs12.h |  6 +-
> > >  arch/x86/kvm/vmx/vmx.c| 10 ++
> > >  arch/x86/kvm/vmx/vmx.h|  1 +
> > >  5 files changed, 50 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > index df2c2e733549..1f9823d21ecd 100644
> > > --- a/arch/x86/kvm/vmx/nested.c
> > > +++ b/arch/x86/kvm/vmx/nested.c
> > > @@ -647,6 +647,12 @@ static inline bool 
> > > nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> > > MSR_IA32_PRED_CMD,
> > > MSR_TYPE_W);
> > >
> > > +   if (!msr_write_intercepted_l01(vcpu, MSR_IA32_PKRS))
> > > +   nested_vmx_disable_intercept_for_msr(
> > > +   msr_bitmap_l1, msr_bitmap_l0,
> > > +   MSR_IA32_PKRS,
> > > +   MSR_TYPE_R | MSR_TYPE_W);
> >
> > What if L1 intercepts only *reads* of MSR_IA32_PKRS?
>
> nested_vmx_disable_intercept_for_msr() handles merging L1's desires, the
> (MSR_TYPE_R | MSR_TYPE_W) param is effectively L0's desire for L2.

I should know better than to assume that a function in kvm actually
does anything like what its name implies, but I never seem to learn.
:-(

Thanks!


Re: [RFC 7/7] KVM: VMX: Enable PKS for nested VM

2020-08-10 Thread Jim Mattson
On Fri, Aug 7, 2020 at 1:47 AM Chenyi Qiang  wrote:
>
> PKS MSR passes through guest directly. Configure the MSR to match the
> L0/L1 settings so that nested VM runs PKS properly.
>
> Signed-off-by: Chenyi Qiang 
> ---
>  arch/x86/kvm/vmx/nested.c | 32 
>  arch/x86/kvm/vmx/vmcs12.c |  2 ++
>  arch/x86/kvm/vmx/vmcs12.h |  6 +-
>  arch/x86/kvm/vmx/vmx.c| 10 ++
>  arch/x86/kvm/vmx/vmx.h|  1 +
>  5 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index df2c2e733549..1f9823d21ecd 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -647,6 +647,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct 
> kvm_vcpu *vcpu,
> MSR_IA32_PRED_CMD,
> MSR_TYPE_W);
>
> +   if (!msr_write_intercepted_l01(vcpu, MSR_IA32_PKRS))
> +   nested_vmx_disable_intercept_for_msr(
> +   msr_bitmap_l1, msr_bitmap_l0,
> +   MSR_IA32_PKRS,
> +   MSR_TYPE_R | MSR_TYPE_W);

What if L1 intercepts only *reads* of MSR_IA32_PKRS?

> kvm_vcpu_unmap(vcpu, &to_vmx(vcpu)->nested.msr_bitmap_map, false);
>
> return true;

> @@ -2509,6 +2519,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
> struct vmcs12 *vmcs12,
> if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
> !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
> vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
> +
> +   if (kvm_cpu_cap_has(X86_FEATURE_PKS) &&

Is the above check superfluous? I would assume that the L1 guest can't
set VM_ENTRY_LOAD_IA32_PKRS unless this is true.

> +   (!vmx->nested.nested_run_pending ||
> +!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PKRS)))
> +   vmcs_write64(GUEST_IA32_PKRS, vmx->nested.vmcs01_guest_pkrs);

This doesn't seem right to me. On the target of a live migration, with
L2 active at the time the snapshot was taken (i.e.,
vmx->nested.nested_run_pending=0), it looks like we're going to try to
overwrite the current L2 PKRS value with L1's PKRS value (except that
in this situation, vmx->nested.vmcs01_guest_pkrs should actually be
0). Am I missing something?

> vmx_set_rflags(vcpu, vmcs12->guest_rflags);
>
> /* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the


> @@ -3916,6 +3943,8 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu 
> *vcpu,
> vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
> if (kvm_mpx_supported())
> vmcs12->guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
> +   if (kvm_cpu_cap_has(X86_FEATURE_PKS))

Shouldn't we be checking to see if the *virtual* CPU supports PKS
before writing anything into vmcs12->guest_ia32_pkrs?

> +   vmcs12->guest_ia32_pkrs = vmcs_read64(GUEST_IA32_PKRS);
>
> vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false;
>  }


Re: [RFC 1/7] KVM: VMX: Introduce PKS VMCS fields

2020-08-10 Thread Jim Mattson
On Fri, Aug 7, 2020 at 1:46 AM Chenyi Qiang  wrote:
>
> PKS(Protection Keys for Supervisor Pages) is a feature that extends the
> Protection Key architecture to support thread-specific permission
> restrictions on supervisor pages.
>
> A new PKS MSR(PKRS) is defined in kernel to support PKS, which holds a
> set of permissions associated with each protection domian.
>
> Two VMCS fields {HOST,GUEST}_IA32_PKRS are introduced in
> {host,guest}-state area to store the value of PKRS.
>
> Every VM exit saves PKRS into guest-state area.
> If VM_EXIT_LOAD_IA32_PKRS = 1, VM exit loads PKRS from the host-state
> area.
> If VM_ENTRY_LOAD_IA32_PKRS = 1, VM entry loads PKRS from the guest-state
> area.
>
> Signed-off-by: Chenyi Qiang 
Reviewed-by: Jim Mattson 


Re: [PATCH v4 12/12] KVM:SVM: Enable INVPCID feature on AMD

2020-08-10 Thread Jim Mattson
On Thu, Aug 6, 2020 at 5:47 PM Babu Moger  wrote:
>
> The following intercept bit has been added to support VMEXIT
> for INVPCID instruction:
> CodeNameCause
> A2h VMEXIT_INVPCID  INVPCID instruction
>
> The following bit has been added to the VMCB layout control area
> to control intercept of INVPCID:
> Byte Offset Bit(s)Function
> 14h 2 intercept INVPCID
>
> Enable the interceptions when the the guest is running with shadow
> page table enabled and handle the tlbflush based on the invpcid
> instruction type.
>
> For the guests with nested page table (NPT) support, the INVPCID
> feature works as running it natively. KVM does not need to do any
> special handling in this case.
>
> AMD documentation for INVPCID feature is available at "AMD64
> Architecture Programmer’s Manual Volume 2: System Programming,
> Pub. 24593 Rev. 3.34(or later)"
>
> The documentation can be obtained at the links below:
> Link: https://www.amd.com/system/files/TechDocs/24593.pdf
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>
> Signed-off-by: Babu Moger 
Reviewed-by: Jim Mattson 


Re: [PATCH] KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled

2020-08-06 Thread Jim Mattson
On Wed, Aug 5, 2020 at 12:04 AM Maxim Levitsky  wrote:
>
> On Mon, 2020-07-13 at 18:57 -0700, Sean Christopherson wrote:
> > Don't attempt to load PDPTRs if EFER.LME=1, i.e. if 64-bit mode is
> > enabled.  A recent change to reload the PDTPRs when CR0.CD or CR0.NW is
> > toggled botched the EFER.LME handling and sends KVM down the PDTPR path
> > when is_paging() is true, i.e. when the guest toggles CD/NW in 64-bit
> > mode.
> >
> > Split the CR0 checks for 64-bit vs. 32-bit PAE into separate paths.  The
> > 64-bit path is specifically checking state when paging is toggled on,
> > i.e. CR0.PG transititions from 0->1.  The PDPTR path now needs to run if
> > the new CR0 state has paging enabled, irrespective of whether paging was
> > already enabled.  Trying to shave a few cycles to make the PDPTR path an
> > "else if" case is a mess.
> >
> > Fixes: d42e3fae6faed ("kvm: x86: Read PDPTEs on CR0.CD and CR0.NW changes")
> > Cc: Jim Mattson 
> > Cc: Oliver Upton 
> > Cc: Peter Shier 
> > Signed-off-by: Sean Christopherson 
> > ---
> >
> > The other way to fix this, with a much smaller diff stat, is to simply
> > move the !is_page(vcpu) check inside (vcpu->arch.efer & EFER_LME).  But
> > that results in a ridiculous amount of nested conditionals for what is a
> > very straightforward check e.g.
> >
> >   if (cr0 & X86_CR0_PG) {
> >   if (vcpu->arch.efer & EFER_LME) }
> >   if (!is_paging(vcpu)) {
> >   ...
> >   }
> >   }
> >   }
> >
> > Since this doesn't need to be backported anywhere, I didn't see any value
> > in having an intermediate step.
> >
> >  arch/x86/kvm/x86.c | 24 
> >  1 file changed, 12 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 95ef629228691..5f526d94c33f3 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -819,22 +819,22 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long 
> > cr0)
> >   if ((cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PE))
> >   return 1;
> >
> > - if (cr0 & X86_CR0_PG) {
> >  #ifdef CONFIG_X86_64
> > - if (!is_paging(vcpu) && (vcpu->arch.efer & EFER_LME)) {
> > - int cs_db, cs_l;
> > + if ((vcpu->arch.efer & EFER_LME) && !is_paging(vcpu) &&
> > + (cr0 & X86_CR0_PG)) {
> > + int cs_db, cs_l;
> >
> > - if (!is_pae(vcpu))
> > - return 1;
> > - kvm_x86_ops.get_cs_db_l_bits(vcpu, _db, _l);
> > - if (cs_l)
> > - return 1;
> > - } else
> > -#endif
> > - if (is_pae(vcpu) && ((cr0 ^ old_cr0) & pdptr_bits) &&
> > - !load_pdptrs(vcpu, vcpu->arch.walk_mmu, 
> > kvm_read_cr3(vcpu)))
> > + if (!is_pae(vcpu))
> > + return 1;
> > + kvm_x86_ops.get_cs_db_l_bits(vcpu, _db, _l);
> > + if (cs_l)
> >   return 1;
> >   }
> > +#endif
> > + if (!(vcpu->arch.efer & EFER_LME) && (cr0 & X86_CR0_PG) &&
> > + is_pae(vcpu) && ((cr0 ^ old_cr0) & pdptr_bits) &&
> > + !load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu)))
> > + return 1;

It might be worth commenting on the subtlety of the test below being
skipped if the PDPTEs were loaded above. I'm assuming that the PDPTEs
shouldn't be loaded if the instruction faults.

> >   if (!(cr0 & X86_CR0_PG) && kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))
> >   return 1;
>
> I also investigated this issue (also same thing, OVMF doesn't boot),
> and after looking at the intel and amd's PRM, this looks like correct 
> solution.
> I also tested this and it works.
>
>
> Reviewed-by: Maxim Levitsky 
>
> Best regards,
> Maxim Levitsky
>


Re: [PATCH 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory

2020-08-06 Thread Jim Mattson
On Wed, Aug 5, 2020 at 5:18 PM Michael S. Tsirkin  wrote:
>
> On Wed, Aug 05, 2020 at 10:05:40AM -0700, Jim Mattson wrote:
> > On Tue, Jul 28, 2020 at 7:38 AM Vitaly Kuznetsov  
> > wrote:
> > >
> > > PCIe config space can (depending on the configuration) be quite big but
> > > usually is sparsely populated. Guest may scan it by accessing individual
> > > device's page which, when device is missing, is supposed to have 'pci
> > > hole' semantics: reads return '0xff' and writes get discarded. Compared
> > > to the already existing KVM_MEM_READONLY, VMM doesn't need to allocate
> > > real memory and stuff it with '0xff'.
> >
> > Note that the bus error semantics described should apply to *any*
> > unbacked guest physical addresses, not just addresses in the PCI hole.
> > (Typically, this also applies to the standard local APIC page
> > (0xfee00xxx) when the local APIC is either disabled or in x2APIC mode,
> > which is an area that kvm has had trouble with in the past.)
>
> Well ATM from KVM's POV unbacked -> exit to userspace, right?
> Not sure what you are suggesting here ...

Sometimes, maybe. That's not the way the emulation of most VMX
instructions works, should they access unbacked memory. Perhaps that's
just a whole slew of bugs to be fixed. :-)


Re: [PATCH 2/3] KVM: x86: introduce KVM_MEM_PCI_HOLE memory

2020-08-05 Thread Jim Mattson
On Tue, Jul 28, 2020 at 7:38 AM Vitaly Kuznetsov  wrote:
>
> PCIe config space can (depending on the configuration) be quite big but
> usually is sparsely populated. Guest may scan it by accessing individual
> device's page which, when device is missing, is supposed to have 'pci
> hole' semantics: reads return '0xff' and writes get discarded. Compared
> to the already existing KVM_MEM_READONLY, VMM doesn't need to allocate
> real memory and stuff it with '0xff'.

Note that the bus error semantics described should apply to *any*
unbacked guest physical addresses, not just addresses in the PCI hole.
(Typically, this also applies to the standard local APIC page
(0xfee00xxx) when the local APIC is either disabled or in x2APIC mode,
which is an area that kvm has had trouble with in the past.)


Re: [PATCH] KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled

2020-08-04 Thread Jim Mattson
On Tue, Aug 4, 2020 at 11:41 AM Sean Christopherson
 wrote:

> Ping.  This really needs to be in the initial pull for 5.9, as is kvm/queue
> has a 100% fatality rate for me.

I agree completely, but I am curious what guest you have that toggles
CD/NW in 64-bit mode.


Re: [PATCH v3 1/3] KVM: x86: Deflect unknown MSR accesses to user space

2020-07-31 Thread Jim Mattson
On Fri, Jul 31, 2020 at 2:50 PM Alexander Graf  wrote:
>
> MSRs are weird. Some of them are normal control registers, such as EFER.
> Some however are registers that really are model specific, not very
> interesting to virtualization workloads, and not performance critical.
> Others again are really just windows into package configuration.
>
> Out of these MSRs, only the first category is necessary to implement in
> kernel space. Rarely accessed MSRs, MSRs that should be fine tunes against
> certain CPU models and MSRs that contain information on the package level
> are much better suited for user space to process. However, over time we have
> accumulated a lot of MSRs that are not the first category, but still handled
> by in-kernel KVM code.
>
> This patch adds a generic interface to handle WRMSR and RDMSR from user
> space. With this, any future MSR that is part of the latter categories can
> be handled in user space.
>
> Furthermore, it allows us to replace the existing "ignore_msrs" logic with
> something that applies per-VM rather than on the full system. That way you
> can run productive VMs in parallel to experimental ones where you don't care
> about proper MSR handling.
>
> Signed-off-by: Alexander Graf 
>
> ---
>
> v1 -> v2:
>
>   - s/ETRAP_TO_USER_SPACE/ENOENT/g
>   - deflect all #GP injection events to user space, not just unknown MSRs.
> That was we can also deflect allowlist errors later
>   - fix emulator case
>
> v2 -> v3:
>
>   - return r if r == X86EMUL_IO_NEEDED
>   - s/KVM_EXIT_RDMSR/KVM_EXIT_X86_RDMSR/g
>   - s/KVM_EXIT_WRMSR/KVM_EXIT_X86_WRMSR/g
>   - Use complete_userspace_io logic instead of reply field
>   - Simplify trapping code
> ---
>  Documentation/virt/kvm/api.rst  |  62 +++
>  arch/x86/include/asm/kvm_host.h |   6 ++
>  arch/x86/kvm/emulate.c  |  18 +-
>  arch/x86/kvm/x86.c  | 106 ++--
>  include/trace/events/kvm.h  |   2 +-
>  include/uapi/linux/kvm.h|  10 +++
>  6 files changed, 197 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 320788f81a05..79c3e2fdfae4 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst

The new exit reasons should probably be mentioned here (around line 4866):

.. note::

  For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
  KVM_EXIT_EPR the corresponding

operations are complete (and guest state is consistent) only after userspace
has re-entered the kernel with KVM_RUN.  The kernel side will first finish
incomplete operations and then check for pending signals.  Userspace
can re-enter the guest with an unmasked signal pending to complete
pending operations.

Other than that, my remaining comments are all nits. Feel free to ignore them.

> +static int kvm_get_msr_user_space(struct kvm_vcpu *vcpu, u32 index)

Return bool rather than int?

> +{
> +   if (!vcpu->kvm->arch.user_space_msr_enabled)
> +   return 0;
> +
> +   vcpu->run->exit_reason = KVM_EXIT_X86_RDMSR;
> +   vcpu->run->msr.error = 0;

Should we clear 'pad' in case anyone can think of a reason to use this
space to extend the API in the future?

> +   vcpu->run->msr.index = index;
> +   vcpu->arch.pending_user_msr = true;
> +   vcpu->arch.complete_userspace_io = complete_emulated_rdmsr;

complete_userspace_io could perhaps be renamed to
complete_userspace_emulation (in a separate commit).

> +
> +   return 1;
> +}
> +
> +static int kvm_set_msr_user_space(struct kvm_vcpu *vcpu, u32 index, u64 data)

Return bool rather than int?

> +{
> +   if (!vcpu->kvm->arch.user_space_msr_enabled)
> +   return 0;
> +
> +   vcpu->run->exit_reason = KVM_EXIT_X86_WRMSR;
> +   vcpu->run->msr.error = 0;

Same question about 'pad' as above.

> +   vcpu->run->msr.index = index;
> +   vcpu->run->msr.data = data;
> +   vcpu->arch.pending_user_msr = true;
> +   vcpu->arch.complete_userspace_io = complete_emulated_wrmsr;
> +
> +   return 1;
> +}
> +

Reviewed-by: Jim Mattson 


Re: [PATCH v2 1/3] KVM: x86: Deflect unknown MSR accesses to user space

2020-07-30 Thread Jim Mattson
On Thu, Jul 30, 2020 at 4:53 PM Jim Mattson  wrote:
>
> On Thu, Jul 30, 2020 at 4:08 PM Alexander Graf  wrote:
> > Do you have a particular situation in mind where that would not be the
> > case and where we would still want to actually complete an MSR operation
> > after the environment changed?
>
> As far as userspace is concerned, if it has replied with error=0, the
> instruction has completed and retired. If the kernel executes a
> different instruction at CS:RIP, the state is certainly inconsistent
> for WRMSR exits. It would also be inconsistent for RDMSR exits if the
> RDMSR emulation on the userspace side had any side-effects.

Actually, I think there's a potential problem with interrupt delivery
even if the instruction bytes are the same. On the second pass, an
interrupt could be delivered on the CS:IP of a WRMSR, even though
userspace has already emulated the WRMSR instruction. This could be
particularly awkward if the WRMSR was to the x2APIC TPR register, and
in fact lowered the TPR sufficiently to allow a pending interrupt to
be delivered.


Re: [PATCH v2 1/3] KVM: x86: Deflect unknown MSR accesses to user space

2020-07-30 Thread Jim Mattson
On Thu, Jul 30, 2020 at 4:08 PM Alexander Graf  wrote:
>
>
>
> On 31.07.20 00:42, Jim Mattson wrote:
> >
> > On Wed, Jul 29, 2020 at 4:59 PM Alexander Graf  wrote:
> >>
> >> MSRs are weird. Some of them are normal control registers, such as EFER.
> >> Some however are registers that really are model specific, not very
> >> interesting to virtualization workloads, and not performance critical.
> >> Others again are really just windows into package configuration.
> >>
> >> Out of these MSRs, only the first category is necessary to implement in
> >> kernel space. Rarely accessed MSRs, MSRs that should be fine tunes against
> >> certain CPU models and MSRs that contain information on the package level
> >> are much better suited for user space to process. However, over time we 
> >> have
> >> accumulated a lot of MSRs that are not the first category, but still 
> >> handled
> >> by in-kernel KVM code.
> >>
> >> This patch adds a generic interface to handle WRMSR and RDMSR from user
> >> space. With this, any future MSR that is part of the latter categories can
> >> be handled in user space.
> >>
> >> Furthermore, it allows us to replace the existing "ignore_msrs" logic with
> >> something that applies per-VM rather than on the full system. That way you
> >> can run productive VMs in parallel to experimental ones where you don't 
> >> care
> >> about proper MSR handling.
> >>
> >> Signed-off-by: Alexander Graf 
> >
> > Can we just drop em_wrmsr and em_rdmsr? The in-kernel emulator is
> > already incomplete, and I don't think there is ever a good reason for
> > kvm to emulate RDMSR or WRMSR if the VM-exit was for some other reason
> > (and we shouldn't end up here if the VM-exit was for RDMSR or WRMSR).
> > Am I missing something?
>
> On certain combinations of CPUs and guest modes, such as real mode on
> pre-Nehalem(?) at least, we are running all guest code through the
> emulator and thus may encounter a RDMSR or WRMSR instruction. I *think*
> we also do so for big real mode on more modern CPUs, but I'm not 100% sure.

Oh, gag me with a spoon! (BTW, we shouldn't have to emulate big real
mode if the CPU supports unrestricted guest mode. If we do, something
is probably wrong.)

> > You seem to be assuming that the instruction at CS:IP will still be
> > RDMSR (or WRMSR) after returning from userspace, and we will come
> > through kvm_{get,set}_msr_user_space again at the next KVM_RUN. That
> > isn't necessarily the case, for a variety of reasons. I think the
>
> Do you have a particular situation in mind where that would not be the
> case and where we would still want to actually complete an MSR operation
> after the environment changed?

As far as userspace is concerned, if it has replied with error=0, the
instruction has completed and retired. If the kernel executes a
different instruction at CS:RIP, the state is certainly inconsistent
for WRMSR exits. It would also be inconsistent for RDMSR exits if the
RDMSR emulation on the userspace side had any side-effects.

> > 'completion' of the userspace instruction emulation should be done
> > with the complete_userspace_io [sic] mechanism instead.
>
> Hm, that would avoid a roundtrip into guest mode, but add a cycle
> through the in-kernel emulator. I'm not sure that's a net win quite yet.
>
> >
> > I'd really like to see this mechanism apply only in the case of
> > invalid/unknown MSRs, and not for illegal reads/writes as well.
>
> Why? Any #GP inducing MSR access will be on the slow path. What's the
> problem if you get a few more of them in user space that you just bounce
> back as failing, so they actually do inject a fault?

I'm not concerned about the performance. I think I'm just biased
because of what we have today. But since we're planning on dropping
that anyway, I take it back. IIRC, the plumbing to make the
distinction is a little painful, and I don't want to ask you to go
there.


Re: [PATCH v2 1/3] KVM: x86: Deflect unknown MSR accesses to user space

2020-07-30 Thread Jim Mattson
On Wed, Jul 29, 2020 at 4:59 PM Alexander Graf  wrote:
>
> MSRs are weird. Some of them are normal control registers, such as EFER.
> Some however are registers that really are model specific, not very
> interesting to virtualization workloads, and not performance critical.
> Others again are really just windows into package configuration.
>
> Out of these MSRs, only the first category is necessary to implement in
> kernel space. Rarely accessed MSRs, MSRs that should be fine tunes against
> certain CPU models and MSRs that contain information on the package level
> are much better suited for user space to process. However, over time we have
> accumulated a lot of MSRs that are not the first category, but still handled
> by in-kernel KVM code.
>
> This patch adds a generic interface to handle WRMSR and RDMSR from user
> space. With this, any future MSR that is part of the latter categories can
> be handled in user space.
>
> Furthermore, it allows us to replace the existing "ignore_msrs" logic with
> something that applies per-VM rather than on the full system. That way you
> can run productive VMs in parallel to experimental ones where you don't care
> about proper MSR handling.
>
> Signed-off-by: Alexander Graf 

Can we just drop em_wrmsr and em_rdmsr? The in-kernel emulator is
already incomplete, and I don't think there is ever a good reason for
kvm to emulate RDMSR or WRMSR if the VM-exit was for some other reason
(and we shouldn't end up here if the VM-exit was for RDMSR or WRMSR).
Am I missing something?

You seem to be assuming that the instruction at CS:IP will still be
RDMSR (or WRMSR) after returning from userspace, and we will come
through kvm_{get,set}_msr_user_space again at the next KVM_RUN. That
isn't necessarily the case, for a variety of reasons. I think the
'completion' of the userspace instruction emulation should be done
with the complete_userspace_io [sic] mechanism instead.
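
For reference, that pattern looks roughly like this (hypothetical
completion callback; a sketch, not the proposed patch):

/* Exit to userspace now; finish the RDMSR only on the next KVM_RUN,
 * after userspace has filled in run->msr.data / run->msr.error. */
static int complete_userspace_rdmsr(struct kvm_vcpu *vcpu)
{
	if (vcpu->run->msr.error) {
		kvm_inject_gp(vcpu, 0);
		return 1;
	}

	kvm_rax_write(vcpu, (u32)vcpu->run->msr.data);
	kvm_rdx_write(vcpu, vcpu->run->msr.data >> 32);
	return kvm_skip_emulated_instruction(vcpu);
}

/* At the exit site (msr_index is the MSR being read): */
	vcpu->run->exit_reason = KVM_EXIT_X86_RDMSR;
	vcpu->run->msr.index = msr_index;
	vcpu->arch.complete_userspace_io = complete_userspace_rdmsr;
	return 0;	/* bounce out to userspace */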

I'd really like to see this mechanism apply only in the case of
invalid/unknown MSRs, and not for illegal reads/writes as well.


Re: [PATCH v3 11/11] KVM:SVM: Enable INVPCID feature on AMD

2020-07-29 Thread Jim Mattson
On Tue, Jul 28, 2020 at 4:39 PM Babu Moger  wrote:
>
> The following intercept bit has been added to support VMEXIT
> for INVPCID instruction:
> CodeNameCause
> A2h VMEXIT_INVPCID  INVPCID instruction
>
> The following bit has been added to the VMCB layout control area
> to control intercept of INVPCID:
> Byte Offset Bit(s)Function
> 14h 2 intercept INVPCID
>
> Enable the interceptions when the the guest is running with shadow
> page table enabled and handle the tlbflush based on the invpcid
> instruction type.
>
> For the guests with nested page table (NPT) support, the INVPCID
> feature works as running it natively. KVM does not need to do any
> special handling in this case.
>
> AMD documentation for INVPCID feature is available at "AMD64
> Architecture Programmer’s Manual Volume 2: System Programming,
> Pub. 24593 Rev. 3.34(or later)"
>
> The documentation can be obtained at the links below:
> Link: https://www.amd.com/system/files/TechDocs/24593.pdf
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
>
> Signed-off-by: Babu Moger 
> ---
>  arch/x86/include/uapi/asm/svm.h |2 +
>  arch/x86/kvm/svm/svm.c  |   64 
> +++
>  2 files changed, 66 insertions(+)
>
> diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
> index 2e8a30f06c74..522d42dfc28c 100644
> --- a/arch/x86/include/uapi/asm/svm.h
> +++ b/arch/x86/include/uapi/asm/svm.h
> @@ -76,6 +76,7 @@
>  #define SVM_EXIT_MWAIT_COND0x08c
>  #define SVM_EXIT_XSETBV0x08d
>  #define SVM_EXIT_RDPRU 0x08e
> +#define SVM_EXIT_INVPCID   0x0a2
>  #define SVM_EXIT_NPF   0x400
>  #define SVM_EXIT_AVIC_INCOMPLETE_IPI   0x401
>  #define SVM_EXIT_AVIC_UNACCELERATED_ACCESS 0x402
> @@ -171,6 +172,7 @@
> { SVM_EXIT_MONITOR, "monitor" }, \
> { SVM_EXIT_MWAIT,   "mwait" }, \
> { SVM_EXIT_XSETBV,  "xsetbv" }, \
> +   { SVM_EXIT_INVPCID, "invpcid" }, \
> { SVM_EXIT_NPF, "npf" }, \
> { SVM_EXIT_AVIC_INCOMPLETE_IPI, "avic_incomplete_ipi" }, \
> { SVM_EXIT_AVIC_UNACCELERATED_ACCESS,   "avic_unaccelerated_access" 
> }, \
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 99cc9c285fe6..6b099e0b28c0 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -813,6 +813,11 @@ static __init void svm_set_cpu_caps(void)
> if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) ||
> boot_cpu_has(X86_FEATURE_AMD_SSBD))
> kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD);
> +
> +   /* Enable INVPCID if both PCID and INVPCID enabled */
> +   if (boot_cpu_has(X86_FEATURE_PCID) &&
> +   boot_cpu_has(X86_FEATURE_INVPCID))
> +   kvm_cpu_cap_set(X86_FEATURE_INVPCID);
>  }

Why is PCID required? Can't this just be
'kvm_cpu_cap_check_and_set(X86_FEATURE_INVPCID);'?

>  static __init int svm_hardware_setup(void)
> @@ -1099,6 +1104,18 @@ static void init_vmcb(struct vcpu_svm *svm)
> clr_intercept(svm, INTERCEPT_PAUSE);
> }
>
> +   /*
> +* Intercept INVPCID instruction only if shadow page table is
> +* enabled. Interception is not required with nested page table
> +* enabled.
> +*/
> +   if (boot_cpu_has(X86_FEATURE_INVPCID)) {

Shouldn't this be 'kvm_cpu_cap_has(X86_FEATURE_INVPCID),' so that it
is consistent with the code above?

> +   if (!npt_enabled)
> +   set_intercept(svm, INTERCEPT_INVPCID);
> +   else
> +   clr_intercept(svm, INTERCEPT_INVPCID);
> +   }
> +
> if (kvm_vcpu_apicv_active(&svm->vcpu))
> avic_init_vmcb(svm);
>
> @@ -2715,6 +2732,43 @@ static int mwait_interception(struct vcpu_svm *svm)
> return nop_interception(svm);
>  }
>
> +static int invpcid_interception(struct vcpu_svm *svm)
> +{
> +   struct kvm_vcpu *vcpu = &svm->vcpu;
> +   struct x86_exception e;
> +   unsigned long type;
> +   gva_t gva;
> +   struct {
> +   u64 pcid;
> +   u64 gla;
> +   } operand;
> +
> +   if (!guest_cpuid_has(vcpu, X86_FEATURE_INVPCID)) {
> +   kvm_queue_exception(vcpu, UD_VECTOR);
> +   return 1;
> +   }
> +
> +   /*
> +* For an INVPCID intercept:
> +* EXITINFO1 provides the linear address of the memory operand.
> +* EXITINFO2 provides the contents of the register operand.
> +*/
> +   type = svm->vmcb->control.exit_info_2;
> +   gva = svm->vmcb->control.exit_info_1;
> +
> +   if (type > 3) {
> +   kvm_inject_gp(vcpu, 0);
> +   return 1;
> +   }
> +
> +   if (kvm_read_guest_virt(vcpu, gva, &operand, sizeof(operand), &e)) {
> +   kvm_inject_emulated_page_fault(vcpu, &e);
> +   return 1;
> +   }

The emulated page fault is not always correct. See 

Re: [PATCH v3 10/11] KVM: X86: Move handling of INVPCID types to x86

2020-07-29 Thread Jim Mattson
On Tue, Jul 28, 2020 at 4:38 PM Babu Moger  wrote:
>
> INVPCID instruction handling is mostly same across both VMX and
> SVM. So, move the code to common x86.c.
>
> Signed-off-by: Babu Moger 
Reviewed-by: Jim Mattson 


Re: [PATCH v3 09/11] KVM: SVM: Remove set_exception_intercept and clr_exception_intercept

2020-07-29 Thread Jim Mattson
On Tue, Jul 28, 2020 at 4:38 PM Babu Moger  wrote:
>
> Remove set_exception_intercept and clr_exception_intercept.
> Replace with generic set_intercept and clr_intercept for these calls.
>
> Signed-off-by: Babu Moger 
Reviewed-by: Jim Mattson 

