Re: [PATCH 1/2] KVM: x86: Add emulation support for #GP triggered by VM instructions

2021-01-12 Thread Bandan Das
Sean Christopherson  writes:
...
>> -if ((emulation_type & EMULTYPE_VMWARE_GP) &&
>> -!is_vmware_backdoor_opcode(ctxt)) {
>> -kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
>> -return 1;
>> +if (emulation_type & EMULTYPE_PARAVIRT_GP) {
>> +vminstr = is_vm_instr_opcode(ctxt);
>> +if (!vminstr && !is_vmware_backdoor_opcode(ctxt)) {
>> +kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
>> +return 1;
>> +}
>> +if (vminstr)
>> +return vminstr;
>
> I'm pretty sure this doesn't correctly handle a VM-instr in L2 that hits a bad
> L0 GPA and that L1 wants to intercept.  The intercept bitmap isn't checked
> until x86_emulate_insn(), and the vm*_interception() helpers expect nested
> VM-Exits to be handled further up the stack.
>
So, the condition is: L2 executes a vmload that #GPs on a reserved address and
exits to L0, but L0 doesn't check whether L1 has asked for the instruction to
be intercepted and goes on with emulating vmload and returning back to L2?
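
For illustration, the ordering being questioned would look roughly like the
sketch below; the helper names guest_vm_instr_intercepted(),
nested_forward_exit_to_l1() and emulate_vm_instr() are made up for this
sketch and are not the actual KVM functions:

	/*
	 * Rough sketch only, not KVM code: the nested-intercept check has to
	 * happen before L0 emulates the VM instruction on L2's behalf.
	 */
	static int handle_vm_instr_gp(struct kvm_vcpu *vcpu)
	{
		/* L2 is running and L1 asked to intercept this instruction? */
		if (is_guest_mode(vcpu) && guest_vm_instr_intercepted(vcpu))
			return nested_forward_exit_to_l1(vcpu); /* reflect to L1 */

		/* only if L1 does not care should L0 emulate and resume L2 */
		return emulate_vm_instr(vcpu);
	}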

>>  }
>>  
>>  /*
>> -- 
>> 2.27.0
>> 



Re: [PATCH 1/2] KVM: x86: Add emulation support for #GP triggered by VM instructions

2021-01-12 Thread Bandan Das
Andy Lutomirski  writes:
...
> #endif
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6d16481aa29d..c5c4aaf01a1a 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -50,6 +50,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "trace.h"
>
> extern bool itlb_multihit_kvm_mitigation;
>
> @@ -5675,6 +5676,12 @@ void kvm_mmu_slot_set_dirty(struct kvm *kvm,
>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_slot_set_dirty);
>
> +bool kvm_is_host_reserved_region(u64 gpa)
> +{
> +	return e820__mapped_raw_any(gpa-1, gpa+1, E820_TYPE_RESERVED);
> +}
 While _e820__mapped_any()'s doc says '.. checks if any part of the range
 is mapped ..' it seems to me that the real check is [start, end) so we
 should use 'gpa' instead of 'gpa-1', no?
>>>  Why do you need to check GPA at all?
>>> 
>> To reduce the scope of the workaround.
>> 
>> The errata only happens when you use one of SVM instructions in the
>> guest with EAX that happens to be inside one of the host reserved
>> memory regions (for example SMM).
>
> This code reduces the scope of the workaround at the cost of
> increasing the complexity of the workaround and adding a nonsensical
> coupling between KVM and host details and adding an export that really
> doesn’t deserve to be exported.
>
> Is there an actual concrete benefit to this check?

Besides reducing the scope, my intention for the check was that we should
find out if such exceptions occur for any other, undiscovered reasons with
other memory types, rather than hiding them under this workaround.
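
To make the [start, end) point from the quoted e820 discussion concrete,
here is a minimal standalone sketch of half-open overlap semantics;
range_overlaps() and gpa_in_reserved() are illustrative helpers, not the
e820 code itself:

	/* Half-open intervals: does [start, end) overlap [r_start, r_end)? */
	static bool range_overlaps(u64 start, u64 end, u64 r_start, u64 r_end)
	{
		return start < r_end && r_start < end;
	}

	/*
	 * With these semantics, passing 'gpa' as the start already covers the
	 * byte at 'gpa'; 'gpa - 1' additionally matches a region that ends
	 * exactly at 'gpa', which is the off-by-one questioned above.
	 */
	static bool gpa_in_reserved(u64 gpa, u64 r_start, u64 r_end)
	{
		return range_overlaps(gpa, gpa + 1, r_start, r_end);
	}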

Bandan





Re: [PATCH] KVM: nVMX: fixes for preemption timer migration

2020-07-09 Thread Bandan Das
Jim Mattson  writes:

> On Thu, Jul 9, 2020 at 10:15 AM Paolo Bonzini  wrote:
>>
>> Commit 850448f35aaf ("KVM: nVMX: Fix VMX preemption timer migration",
>> 2020-06-01) accidentally broke nVMX live migration from older version
>> by changing the userspace ABI.  Restore it and, while at it, ensure
>> that vmx->nested.has_preemption_timer_deadline is always initialized
>> according to the KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE flag.
>>
>> Cc: Makarand Sonare 
>> Fixes: 850448f35aaf ("KVM: nVMX: Fix VMX preemption timer migration")
>> Signed-off-by: Paolo Bonzini 
>> ---
>>  arch/x86/include/uapi/asm/kvm.h | 5 +++--
>>  arch/x86/kvm/vmx/nested.c   | 3 ++-
>>  2 files changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/include/uapi/asm/kvm.h 
>> b/arch/x86/include/uapi/asm/kvm.h
>> index 17c5a038f42d..0780f97c1850 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -408,14 +408,15 @@ struct kvm_vmx_nested_state_data {
>>  };
>>
>>  struct kvm_vmx_nested_state_hdr {
>> -   __u32 flags;
>> __u64 vmxon_pa;
>> __u64 vmcs12_pa;
>> -   __u64 preemption_timer_deadline;
>>
>> struct {
>> __u16 flags;
>> } smm;
>> +
>> +   __u32 flags;
>> +   __u64 preemption_timer_deadline;
>>  };
>>
Oops!

>>  struct kvm_svm_nested_state_data {
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index b26655104d4a..3fc2411edc92 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -6180,7 +6180,8 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu,
>> vmx->nested.has_preemption_timer_deadline = true;
>> vmx->nested.preemption_timer_deadline =
>> kvm_state->hdr.vmx.preemption_timer_deadline;
>> -   }
>> +   } else
>> +   vmx->nested.has_preemption_timer_deadline = false;
>
> Doesn't the coding standard require braces around the else clause?
>
I think so... for if/else where at least one of them is multiline.
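
For reference, the fully braced form would look roughly like this
(reconstructed from the hunk above, so treat it as a sketch rather than the
final commit):

	if (kvm_state->hdr.vmx.flags & KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE) {
		vmx->nested.has_preemption_timer_deadline = true;
		vmx->nested.preemption_timer_deadline =
			kvm_state->hdr.vmx.preemption_timer_deadline;
	} else {
		vmx->nested.has_preemption_timer_deadline = false;
	}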

> Reviewed-by: Jim Mattson 

Looks good to me, 
Reviewed-by: Bandan Das 



Re: Linux 5.3-rc7

2019-09-09 Thread Bandan Das
Linus Torvalds  writes:

> On Sat, Sep 7, 2019 at 12:17 PM Linus Torvalds
>  wrote:
>>
>> I'm really not clear on why it's a good idea to clear the LDR bits on
>> shutdown, and commit 558682b52919 ("x86/apic: Include the LDR when
>> clearing out APIC registers") just looks pointless. And now it has
>> proven to break some machines.
>>
>> So why wouldn't we just revert it?
>
> Side note: looking around for the discussion about this patch, at
> least one version of the patch from Bandan had
>
> +   if (!x2apic_enabled) {
>
> rather than
>
> +   if (!x2apic_enabled()) {
>

I believe this crept in by accident when I was preparing the series. My
testing was with x2apic_enabled(), but I didn't test CPU hotplug - only the
kdump path with a 32-bit guest. In hindsight, I should have been more careful
with testing; sorry about that.
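
As a side note, the difference boils down to taking the function's address
versus calling it; the snippet below is an illustration only (clear_ldr() is
a made-up stand-in), not kernel code:

	extern int x2apic_enabled(void);  /* real helper: is x2APIC mode on? */
	extern void clear_ldr(void);      /* hypothetical stand-in for the LDR clearing */

	static void broken(void)
	{
		if (!x2apic_enabled)      /* function address: never NULL, branch is dead */
			clear_ldr();      /* never runs, so the tested patch was a no-op */
	}

	static void fixed(void)
	{
		if (!x2apic_enabled())    /* return value: clear the LDR only in xAPIC mode */
			clear_ldr();
	}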

Bandan

> which meant that whatever Bandan tested at that point was actually a
> complete no-op, since "!x2apic_enabled" is never true (it tests a
> function pointer against NULL, which it won't be).
>
> Then that was fixed by the time it hit -tip (and eventually my tree),
> but it kind of shows how the patch history of this is all
> questionable. Further strengthened by a quote from that discussion:
>
>  "this is really a KVM bug but it doesn't hurt to clear out the LDR in
> the guest and then, it wouldn't need a hypervisor fix"
>
> and clearly it *does* hurt to clear the LDR in the guest, making the
> whole thinking behind the patch wrong and broken. The kernel clearly
> _does_ depend on LDR having the right contents.
>
> Now, I still suspect the boot problem then comes from our
> cpu0_logical_apicid use mentioned in that previous email, but at this
> point I think the proper fix is "revert for now, and we can look at
> this as a cleanup with the cpu0_logical_apicid thing for 5.4 instead".
>
> Hmm?
>
>Linus


Re: linux-next: Fixes tag needs some work in the tip tree

2019-08-29 Thread Bandan Das
Stephen Rothwell  writes:

> Hi all,
>
> In commit
>
>   bae3a8d3308e ("x86/apic: Do not initialize LDR and DFR for bigsmp")
>
> Fixes tag
>
>   Fixes: db7b9e9f26b8 ("[PATCH] Clustered APIC setup for >8 CPU systems")
>
> has these problem(s):
>
>   - Target SHA1 does not exist
>

I tried to dig this up and I believe that this is from pre-git.
I went back as far as commit 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
Author: Linus Torvalds 
Date:   Sat Apr 16 15:20:36 2005 -0700

Linux-2.6.12-rc2

Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!

which adds init_apic_ldr() in include/asm-i386/mach-bigsmp/mach_apic.h with
the following:

+static inline void init_apic_ldr(void)
+{
+   unsigned long val;
+
+   apic_write_around(APIC_DFR, APIC_DFR_VALUE);
+   val = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
+   val = calculate_ldr(val);
+   apic_write_around(APIC_LDR, val);
+}
...


So, the bug seems to be present here as well...

Bandan

> I could not quickly find an obvious match.


Re: [tip:x86/urgent 3/3] arch/x86/kernel/apic/apic.c:1182:6: warning: the address of 'x2apic_enabled' will always evaluate as 'true'

2019-08-26 Thread Bandan Das
Thomas Gleixner  writes:

> On Tue, 27 Aug 2019, Bandan Das wrote:
>> kbuild test robot  writes:
>> 
>> > tree:   
>> > https://kernel.googlesource.com/pub/scm/linux/kernel/git/tip/tip.git 
>> > x86/urgent
>> > head:   cfa16294b1c5b320c0a0e1cac37c784b92366c87
>> > commit: cfa16294b1c5b320c0a0e1cac37c784b92366c87 [3/3] x86/apic: Include 
>> > the LDR when clearing out APIC registers
>> > config: i386-defconfig (attached as .config)
>> > compiler: gcc-7 (Debian 7.4.0-10) 7.4.0
>> > reproduce:
>> > git checkout cfa16294b1c5b320c0a0e1cac37c784b92366c87
>> > # save the attached .config to linux build tree
>> > make ARCH=i386 
>> >
>> > If you fix the issue, kindly add following tag
>> > Reported-by: kbuild test robot 
>> >
>> > All warnings (new ones prefixed by >>):
>> >
>> >arch/x86/kernel/apic/apic.c: In function 'clear_local_APIC':
>> >>> arch/x86/kernel/apic/apic.c:1182:6: warning: the address of 
>> >>> 'x2apic_enabled' will always evaluate as 'true' [-Waddress]
>> >  if (!x2apic_enabled) {
>> >  ^
>> Thomas, I apologize for the typo here. This is the x2apic_enabled() function.
>> Should I respin ?
>
>   
> https://lkml.kernel.org/r/156684295076.23440.2192639697586451635.tip-bot2@tip-bot2
>
> You have that mail in your inbox ..

Ah, thanks!


Re: [tip:x86/urgent 3/3] arch/x86/kernel/apic/apic.c:1182:6: warning: the address of 'x2apic_enabled' will always evaluate as 'true'

2019-08-26 Thread Bandan Das
kbuild test robot  writes:

> tree:   https://kernel.googlesource.com/pub/scm/linux/kernel/git/tip/tip.git 
> x86/urgent
> head:   cfa16294b1c5b320c0a0e1cac37c784b92366c87
> commit: cfa16294b1c5b320c0a0e1cac37c784b92366c87 [3/3] x86/apic: Include the 
> LDR when clearing out APIC registers
> config: i386-defconfig (attached as .config)
> compiler: gcc-7 (Debian 7.4.0-10) 7.4.0
> reproduce:
> git checkout cfa16294b1c5b320c0a0e1cac37c784b92366c87
> # save the attached .config to linux build tree
> make ARCH=i386 
>
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot 
>
> All warnings (new ones prefixed by >>):
>
>arch/x86/kernel/apic/apic.c: In function 'clear_local_APIC':
>>> arch/x86/kernel/apic/apic.c:1182:6: warning: the address of 
>>> 'x2apic_enabled' will always evaluate as 'true' [-Waddress]
>  if (!x2apic_enabled) {
>  ^
Thomas, I apologize for the typo here. This is the x2apic_enabled() function.
Should I respin ?

>
> vim +1182 arch/x86/kernel/apic/apic.c
>
>   1142
>   1143/*
>   1144 * Local APIC start and shutdown
>   1145 */
>   1146
>   1147/**
>   1148 * clear_local_APIC - shutdown the local APIC
>   1149 *
>   1150 * This is called, when a CPU is disabled and before rebooting, 
> so the state of
>   1151 * the local APIC has no dangling leftovers. Also used to 
> cleanout any BIOS
>   1152 * leftovers during boot.
>   1153 */
>   1154void clear_local_APIC(void)
>   1155{
>   1156int maxlvt;
>   1157u32 v;
>   1158
>   1159/* APIC hasn't been mapped yet */
>   1160if (!x2apic_mode && !apic_phys)
>   1161return;
>   1162
>   1163maxlvt = lapic_get_maxlvt();
>   1164/*
>   1165 * Masking an LVT entry can trigger a local APIC error
>   1166 * if the vector is zero. Mask LVTERR first to prevent 
> this.
>   1167 */
>   1168if (maxlvt >= 3) {
>   1169v = ERROR_APIC_VECTOR; /* any non-zero vector 
> will do */
>   1170apic_write(APIC_LVTERR, v | APIC_LVT_MASKED);
>   1171}
>   1172/*
>   1173 * Careful: we have to set masks only first to deassert
>   1174 * any level-triggered sources.
>   1175 */
>   1176v = apic_read(APIC_LVTT);
>   1177apic_write(APIC_LVTT, v | APIC_LVT_MASKED);
>   1178v = apic_read(APIC_LVT0);
>   1179apic_write(APIC_LVT0, v | APIC_LVT_MASKED);
>   1180v = apic_read(APIC_LVT1);
>   1181apic_write(APIC_LVT1, v | APIC_LVT_MASKED);
>> 1182 if (!x2apic_enabled) {
>   1183v = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
>   1184apic_write(APIC_LDR, v);
>   1185}
>   1186if (maxlvt >= 4) {
>   1187v = apic_read(APIC_LVTPC);
>   1188apic_write(APIC_LVTPC, v | APIC_LVT_MASKED);
>   1189}
>   1190
>   1191/* lets not touch this if we didn't frob it */
>   1192#ifdef CONFIG_X86_THERMAL_VECTOR
>   1193if (maxlvt >= 5) {
>   1194v = apic_read(APIC_LVTTHMR);
>   1195apic_write(APIC_LVTTHMR, v | APIC_LVT_MASKED);
>   1196}
>   1197#endif
>   1198#ifdef CONFIG_X86_MCE_INTEL
>   1199if (maxlvt >= 6) {
>   1200v = apic_read(APIC_LVTCMCI);
>   1201if (!(v & APIC_LVT_MASKED))
>   1202apic_write(APIC_LVTCMCI, v | 
> APIC_LVT_MASKED);
>   1203}
>   1204#endif
>   1205
>   1206/*
>   1207 * Clean APIC state for other OSs:
>   1208 */
>   1209apic_write(APIC_LVTT, APIC_LVT_MASKED);
>   1210apic_write(APIC_LVT0, APIC_LVT_MASKED);
>   1211apic_write(APIC_LVT1, APIC_LVT_MASKED);
>   1212if (maxlvt >= 3)
>   1213apic_write(APIC_LVTERR, APIC_LVT_MASKED);
>   1214if (maxlvt >= 4)
>   1215apic_write(APIC_LVTPC, APIC_LVT_MASKED);
>   1216
>   1217/* Integrated APIC (!82489DX) ? */
>   1218if (lapic_is_integrated()) {
>   1219if (maxlvt > 3)
>   1220/* Clear ESR due to Pentium errata 3AP 
> and 11AP */
>   1221apic_write(APIC_ESR, 0);
>   1222apic_read(APIC_ESR);
>   1223}
>   1224 

[tip: x86/urgent] x86/apic: Do not initialize LDR and DFR for bigsmp

2019-08-26 Thread tip-bot2 for Bandan Das
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: bae3a8d3308ee69a7dbdf145911b18dfda8ade0d
Gitweb: https://git.kernel.org/tip/bae3a8d3308ee69a7dbdf145911b18dfda8ade0d
Author: Bandan Das 
AuthorDate: Mon, 26 Aug 2019 06:15:12 -04:00
Committer: Thomas Gleixner 
CommitterDate: Mon, 26 Aug 2019 20:00:56 +02:00

x86/apic: Do not initialize LDR and DFR for bigsmp

Legacy apic init uses bigsmp for smp systems with 8 or more CPUs. The
bigsmp APIC implementation uses physical destination mode, but it
nevertheless initializes LDR and DFR. The LDR even ends up incorrectly with
multiple bits being set.

This does not cause a functional problem because LDR and DFR are ignored
when physical destination mode is active, but it triggered a problem on a
32-bit KVM guest which jumps into a kdump kernel.

The multiple bits set unearthed a bug in the KVM APIC implementation. The
code which creates the logical destination map for VCPUs ignores the
disabled state of the APIC and ends up overwriting an existing valid entry
and as a result, APIC calibration hangs in the guest during kdump
initialization.

Remove the bogus LDR/DFR initialization.

This is not intended to work around the KVM APIC bug. The LDR/DFR
initialization is wrong on its own.

The issue goes back into the pre git history. The fixes tag is the commit
in the bitkeeper import which introduced bigsmp support in 2003.

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git

Fixes: db7b9e9f26b8 ("[PATCH] Clustered APIC setup for >8 CPU systems")
Suggested-by: Thomas Gleixner 
Signed-off-by: Bandan Das 
Signed-off-by: Thomas Gleixner 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20190826101513.5080-2-...@redhat.com


---
 arch/x86/kernel/apic/bigsmp_32.c | 24 ++--
 1 file changed, 2 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/apic/bigsmp_32.c b/arch/x86/kernel/apic/bigsmp_32.c
index afee386..caedd8d 100644
--- a/arch/x86/kernel/apic/bigsmp_32.c
+++ b/arch/x86/kernel/apic/bigsmp_32.c
@@ -38,32 +38,12 @@ static int bigsmp_early_logical_apicid(int cpu)
return early_per_cpu(x86_cpu_to_apicid, cpu);
 }
 
-static inline unsigned long calculate_ldr(int cpu)
-{
-   unsigned long val, id;
-
-   val = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
-   id = per_cpu(x86_bios_cpu_apicid, cpu);
-   val |= SET_APIC_LOGICAL_ID(id);
-
-   return val;
-}
-
 /*
- * Set up the logical destination ID.
- *
- * Intel recommends to set DFR, LDR and TPR before enabling
- * an APIC.  See e.g. "AP-388 82489DX User's Manual" (Intel
- * document number 292116).  So here it goes...
+ * bigsmp enables physical destination mode
+ * and doesn't use LDR and DFR
  */
 static void bigsmp_init_apic_ldr(void)
 {
-   unsigned long val;
-   int cpu = smp_processor_id();
-
-   apic_write(APIC_DFR, APIC_DFR_FLAT);
-   val = calculate_ldr(cpu);
-   apic_write(APIC_LDR, val);
 }
 
 static void bigsmp_setup_apic_routing(void)


[tip: x86/urgent] x86/apic: Include the LDR when clearing out APIC registers

2019-08-26 Thread tip-bot2 for Bandan Das
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 558682b5291937a70748d36fd9ba757fb25b99ae
Gitweb: https://git.kernel.org/tip/558682b5291937a70748d36fd9ba757fb25b99ae
Author: Bandan Das 
AuthorDate: Mon, 26 Aug 2019 06:15:13 -04:00
Committer: Thomas Gleixner 
CommitterDate: Mon, 26 Aug 2019 20:00:57 +02:00

x86/apic: Include the LDR when clearing out APIC registers

Although APIC initialization will typically clear out the LDR before
setting it, the APIC cleanup code should reset the LDR.

This was discovered with a 32-bit KVM guest jumping into a kdump
kernel. The stale bits in the LDR triggered a bug in the KVM APIC
implementation which caused the destination mapping for VCPUs to be
corrupted.

Note that this isn't intended to paper over the KVM APIC bug. The kernel
has to clear the LDR when resetting the APIC registers except when X2APIC
is enabled.

This lacks a Fixes tag because the failure to clear the LDR goes way back
into pre-git history.

[ tglx: Made x2apic_enabled a function call as required ]

Signed-off-by: Bandan Das 
Signed-off-by: Thomas Gleixner 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20190826101513.5080-3-...@redhat.com

---
 arch/x86/kernel/apic/apic.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index aa5495d..dba2828 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1179,6 +1179,10 @@ void clear_local_APIC(void)
apic_write(APIC_LVT0, v | APIC_LVT_MASKED);
v = apic_read(APIC_LVT1);
apic_write(APIC_LVT1, v | APIC_LVT_MASKED);
+   if (!x2apic_enabled()) {
+   v = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
+   apic_write(APIC_LDR, v);
+   }
if (maxlvt >= 4) {
v = apic_read(APIC_LVTPC);
apic_write(APIC_LVTPC, v | APIC_LVT_MASKED);


[tip: x86/urgent] x86/apic: Do not initialize LDR and DFR for bigsmp

2019-08-26 Thread tip-bot2 for Bandan Das
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 9cfe98a6dbfb2a72ae29831e57b406eab7668da8
Gitweb: https://git.kernel.org/tip/9cfe98a6dbfb2a72ae29831e57b406eab7668da8
Author: Bandan Das 
AuthorDate: Mon, 26 Aug 2019 06:15:12 -04:00
Committer: Thomas Gleixner 
CommitterDate: Mon, 26 Aug 2019 17:45:22 +02:00

x86/apic: Do not initialize LDR and DFR for bigsmp

Legacy apic init uses bigsmp for smp systems with 8 or more CPUs. The
bigsmp APIC implementation uses physical destination mode, but it
nevertheless initializes LDR and DFR. The LDR even ends up incorrectly with
multiple bits being set.

This does not cause a functional problem because LDR and DFR are ignored
when physical destination mode is active, but it triggered a problem on a
32-bit KVM guest which jumps into a kdump kernel.

The multiple bits set unearthed a bug in the KVM APIC implementation. The
code which creates the logical destination map for VCPUs ignores the
disabled state of the APIC and ends up overwriting an existing valid entry
and as a result, APIC calibration hangs in the guest during kdump
initialization.

Remove the bogus LDR/DFR initialization.

This is not intended to work around the KVM APIC bug. The LDR/DFR
initialization is wrong on its own.

The issue goes back into the pre git history. The fixes tag is the commit
in the bitkeeper import which introduced bigsmp support in 2003.

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git

Fixes: db7b9e9f26b8 ("[PATCH] Clustered APIC setup for >8 CPU systems")
Suggested-by: Thomas Gleixner 
Signed-off-by: Bandan Das 
Signed-off-by: Thomas Gleixner 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20190826101513.5080-2-...@redhat.com

---
 arch/x86/kernel/apic/bigsmp_32.c | 24 ++--
 1 file changed, 2 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/apic/bigsmp_32.c b/arch/x86/kernel/apic/bigsmp_32.c
index afee386..caedd8d 100644
--- a/arch/x86/kernel/apic/bigsmp_32.c
+++ b/arch/x86/kernel/apic/bigsmp_32.c
@@ -38,32 +38,12 @@ static int bigsmp_early_logical_apicid(int cpu)
return early_per_cpu(x86_cpu_to_apicid, cpu);
 }
 
-static inline unsigned long calculate_ldr(int cpu)
-{
-   unsigned long val, id;
-
-   val = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
-   id = per_cpu(x86_bios_cpu_apicid, cpu);
-   val |= SET_APIC_LOGICAL_ID(id);
-
-   return val;
-}
-
 /*
- * Set up the logical destination ID.
- *
- * Intel recommends to set DFR, LDR and TPR before enabling
- * an APIC.  See e.g. "AP-388 82489DX User's Manual" (Intel
- * document number 292116).  So here it goes...
+ * bigsmp enables physical destination mode
+ * and doesn't use LDR and DFR
  */
 static void bigsmp_init_apic_ldr(void)
 {
-   unsigned long val;
-   int cpu = smp_processor_id();
-
-   apic_write(APIC_DFR, APIC_DFR_FLAT);
-   val = calculate_ldr(cpu);
-   apic_write(APIC_LDR, val);
 }
 
 static void bigsmp_setup_apic_routing(void)


[tip: x86/urgent] x86/apic: Include the LDR when clearing out APIC registers

2019-08-26 Thread tip-bot2 for Bandan Das
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: cfa16294b1c5b320c0a0e1cac37c784b92366c87
Gitweb: https://git.kernel.org/tip/cfa16294b1c5b320c0a0e1cac37c784b92366c87
Author: Bandan Das 
AuthorDate: Mon, 26 Aug 2019 06:15:13 -04:00
Committer: Thomas Gleixner 
CommitterDate: Mon, 26 Aug 2019 17:47:24 +02:00

x86/apic: Include the LDR when clearing out APIC registers

Although APIC initialization will typically clear out the LDR before
setting it, the APIC cleanup code should reset the LDR.

This was discovered with a 32-bit KVM guest jumping into a kdump
kernel. The stale bits in the LDR triggered a bug in the KVM APIC
implementation which caused the destination mapping for VCPUs to be
corrupted.

Note that this isn't intended to paper over the KVM APIC bug. The kernel
has to clear the LDR when resetting the APIC registers except when X2APIC
is enabled.

This lacks a Fixes tag because the failure to clear the LDR goes way back
into pre-git history.

Signed-off-by: Bandan Das 
Signed-off-by: Thomas Gleixner 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20190826101513.5080-3-...@redhat.com
---
 arch/x86/kernel/apic/apic.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index aa5495d..e75f378 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1179,6 +1179,10 @@ void clear_local_APIC(void)
apic_write(APIC_LVT0, v | APIC_LVT_MASKED);
v = apic_read(APIC_LVT1);
apic_write(APIC_LVT1, v | APIC_LVT_MASKED);
+   if (!x2apic_enabled) {
+   v = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
+   apic_write(APIC_LDR, v);
+   }
if (maxlvt >= 4) {
v = apic_read(APIC_LVTPC);
apic_write(APIC_LVTPC, v | APIC_LVT_MASKED);


[PATCH v2 2/2] x86/apic: include the LDR when clearing out apic registers

2019-08-26 Thread Bandan Das
Although apic initialization will typically clear out the LDR before
setting it, the apic cleanup code should reset the LDR.

This was discovered with a 32 bit kvm guest loading the kdump kernel.
Stale bits in the LDR exposed a bug in the kvm lapic code that creates
logical destination maps for vcpus. If multiple bits are set, kvm
could potentially overwrite a valid logical destination with an
invalid one.

Note that this fix isn't intended to paper over the kvm lapic bug;
clear_local_APIC() should correctly clear out any set bits in the LDR
when resetting apic registers.

Signed-off-by: Bandan Das 
---
 arch/x86/kernel/apic/apic.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index aa5495d0f478..e75f3782b915 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1179,6 +1179,10 @@ void clear_local_APIC(void)
apic_write(APIC_LVT0, v | APIC_LVT_MASKED);
v = apic_read(APIC_LVT1);
apic_write(APIC_LVT1, v | APIC_LVT_MASKED);
+   if (!x2apic_enabled) {
+   v = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
+   apic_write(APIC_LDR, v);
+   }
if (maxlvt >= 4) {
v = apic_read(APIC_LVTPC);
apic_write(APIC_LVTPC, v | APIC_LVT_MASKED);
-- 
2.20.1



[PATCH v2 1/2] x86/apic: Do not initialize LDR and DFR for bigsmp

2019-08-26 Thread Bandan Das
Legacy apic init uses bigsmp for > 8 smp systems.
In these cases, PhysFlat will invariably be used and there
is no point in initializing apic LDR and DFR. Furthermore,
calculate_ldr() helper function was incorrectly setting multiple
bits in the LDR.

This was discovered with a 32 bit KVM guest loading the kdump kernel.
Owing to the multiple bits being incorrectly set in the LDR, KVM hit a
buggy "if" condition in the kvm lapic implementation that creates the
logical destination map for vcpus - it ends up overwriting an
existing valid entry and as a result, apic calibration hangs in the
guest during kdump initialization.

Note that this change isn't intended to workaround the kvm lapic bug;
bigsmp should correctly stay away from initializing LDR.

Suggested-by: Thomas Gleixner 
Signed-off-by: Bandan Das 
---
 arch/x86/kernel/apic/bigsmp_32.c | 24 ++--
 1 file changed, 2 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/apic/bigsmp_32.c b/arch/x86/kernel/apic/bigsmp_32.c
index afee386ff711..caedd8d60d36 100644
--- a/arch/x86/kernel/apic/bigsmp_32.c
+++ b/arch/x86/kernel/apic/bigsmp_32.c
@@ -38,32 +38,12 @@ static int bigsmp_early_logical_apicid(int cpu)
return early_per_cpu(x86_cpu_to_apicid, cpu);
 }
 
-static inline unsigned long calculate_ldr(int cpu)
-{
-   unsigned long val, id;
-
-   val = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
-   id = per_cpu(x86_bios_cpu_apicid, cpu);
-   val |= SET_APIC_LOGICAL_ID(id);
-
-   return val;
-}
-
 /*
- * Set up the logical destination ID.
- *
- * Intel recommends to set DFR, LDR and TPR before enabling
- * an APIC.  See e.g. "AP-388 82489DX User's Manual" (Intel
- * document number 292116).  So here it goes...
+ * bigsmp enables physical destination mode
+ * and doesn't use LDR and DFR
  */
 static void bigsmp_init_apic_ldr(void)
 {
-   unsigned long val;
-   int cpu = smp_processor_id();
-
-   apic_write(APIC_DFR, APIC_DFR_FLAT);
-   val = calculate_ldr(cpu);
-   apic_write(APIC_LDR, val);
 }
 
 static void bigsmp_setup_apic_routing(void)
-- 
2.20.1



[PATCH v2 0/2] x86/apic: reset LDR in clear_local_APIC

2019-08-26 Thread Bandan Das
v2:
   1/2: clear out the bogus initialization in bigsmp_init_apic_ldr
   2/2: reword commit message as suggested by Thomas
v1 posted at https://lkml.org/lkml/2019/8/14/1

On a 32 bit RHEL6 guest with greater than 8 cpus, the
kdump kernel hangs when calibrating apic. This happens
because when apic initializes bigsmp, it also initializes LDR
even though it probably wouldn't be used.

When booting into kdump, KVM apic incorrectly reads the stale LDR
values from the guest while building the logical destination map
even for inactive vcpus. While KVM apic can be fixed to ignore apics
that haven't been enabled, a simple guest only change can be to
just clear out the LDR.
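
For context, the KVM-side fix mentioned above goes in roughly this
direction; the loop below is only a sketch (apic_sw_enabled_for() is an
illustrative name, not the actual lapic.c helper):

	/*
	 * When rebuilding the logical destination map, skip APICs that are
	 * not present and software-enabled, so stale LDR bits left behind by
	 * inactive vcpus are ignored.
	 */
	kvm_for_each_vcpu(i, vcpu, kvm) {
		struct kvm_lapic *apic = vcpu->arch.apic;

		if (!kvm_apic_present(vcpu) || !apic_sw_enabled_for(apic))
			continue;

		/* ... add this APIC's LDR bits to the logical map ... */
	}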

Bandan Das (2):
  x86/apic: Do not initialize LDR and DFR for bigsmp
  x86/apic: include the LDR when clearing out apic registers

 arch/x86/kernel/apic/apic.c  |  4 ++++
 arch/x86/kernel/apic/bigsmp_32.c | 24 ++--
 2 files changed, 6 insertions(+), 22 deletions(-)

-- 
2.20.1



Re: [PATCH] x86/apic: reset LDR in clear_local_APIC

2019-08-21 Thread Bandan Das
Thomas Gleixner  writes:

> Bandan,
>
> On Wed, 21 Aug 2019, Bandan Das wrote:
>> Thomas Gleixner  writes:
>> So, in KVM: if we make sure that the logical destination map isn't filled
>> up if the virtual apic is not enabled by software, it really doesn't matter
>> whether the LDR for an inactive CPU has a stale value.
>>
>> In x86/apic: if we make sure that the LDR is 0 or reset,
>> recalculate_apic_map() will never consider including this cpu in the
>> logical map.
> ?
>> In short, as I mentioned in the patch description, this is really a KVM
>> bug but it doesn't hurt to clear out the LDR in the guest and then, it
>> wouldn't need a hypervisor fix.
>
> It still needs a hypervisor fix. Taking disabled APICs into account is a bug
> which also has other consequences than that particular one. So please don't
> claim that. It's wrong.
>
> If that prevents the APIC bug from triggering on unfixed hypervisors, then
> this is a nice side effect, but not a solution.
>
Agreed and fwiw, the kvm fix has been queued already.

>> Is this better ?
>
> That's way better.
>
> So can you please create two patches:
>
>1) Make that bogus bigsmp ldr init empty
>
>   That one wants a changelog along these lines:
>
>   - Setting LDR for physical destination mode is pointless
>   - Setting multiple bits in the LDR is wrong
>
>   Mention how this was discovered and caused the KVM APIC bug to be
>   triggered. Also mention that the change is not there to paper over
>   the KVM APIC bug. The change fixes a bug in the bigsmp APIC code.
>
>2) Clear LDR in in that apic reset function
>
>   That one wants a changelog along these lines:
>
>   - Except for x2apic the LDR should be cleared as any other APIC
>   register
>
>   Mention how this was discovered. Again the change is not there to
>   paper over the KVM APIC bug. It's for correctness sake and valid on
>   its own.
>
> Thanks,
>
Will do as you suggested. Thank you for the review.

Bandan
>   tglx
>
>   


Re: [PATCH] x86/apic: reset LDR in clear_local_APIC

2019-08-21 Thread Bandan Das
Thomas Gleixner  writes:

> Bandan,
>
> On Mon, 19 Aug 2019, Bandan Das wrote:
>> Thomas Gleixner  writes:
>> > On Wed, 14 Aug 2019, Bandan Das wrote:
>> >> On a 32 bit RHEL6 guest with greater than 8 cpus, the
>> >> kdump kernel hangs when calibrating apic. This happens
>> >> because when apic initializes bigsmp, it also initializes LDR
>> >> even though it probably wouldn't be used.
>> >
>> > 'It probably wouldn't be used' is a not really a useful technical
>> > statement.
>> >
>> > Either it is used, then it needs to be handled. Or it is unused then why is
>> > it written in the first place?
>> >
>> > The bigsmp APIC uses physical destination mode because the logical flat
>> > model only supports 8 APICs. So clearly bigsmp_init_apic_ldr() is bogus and
>> > should be empty.
>> >
>> 
>> Your note above is what I meant by "it probably wouldn't be used" because
>> I don't have much insight into the history behind why LDR is being
>> initialized in the first place. The only evidence I found is a comment in
>> apic.c that states:
>>  /*
>>   * Intel recommends to set DFR, LDR and TPR before enabling
>>   * an APIC.  See e.g. "AP-388 82489DX User's Manual" (Intel
>>   * document number 292116).  So here it goes...
>>   */
>
> The physflat stuff is documented in the SDM and in the APIC code
> (apic_flat_64.c):
>
> static void physflat_init_apic_ldr(void)
> {
> /*
>  * LDR and DFR are not involved in physflat mode, rather:
>  * "In physical destination mode, the destination processor is
>  * specified by its local APIC ID [...]." (Intel SDM, 10.6.2.1)
>  */
> }
>
> Why is LDR initialized in the bigsmp code? Probably histerical raisins and
> I'm just too tired to consult the history git trees for an answer.
>
>> That said, not initalizing the ldr in bigsmp_init_apic_ldr() should be
>> enough to fix this. Would you prefer that change instead ?
>
> That's surely something we want to get rid off. But for sanity sake we
> should clear LDR as well after understanding it completely.
>
>> >> When booting into kdump, KVM apic incorrectly reads the stale LDR
>> >> values from the guest while building the logical destination map
>> >> even for inactive vcpus. While KVM apic can be fixed to ignore apics
>> >> that haven't been enabled, a simple guest only change can be to
>> >> just clear out the LDR.
>> >
>> > This does not make much sense either. What has KVM to do with logical
>> > destination maps while booting the kdump kernel? The kdump kernel is not
>> 
>> This is the guest kernel and KVM takes care of injecting the interrupt to
>> the right vcpu (recalculate_apic_map() in lapic.c).
>
> Yeah. I know that KVM injects interrupts. Still that does not explain the
> issue properly.
>
> The point is that when the kdump kernel boots in the guest and uses logical
> destination mode then it will overwrite LDR _BEFORE_ the local APIC timer
> calibration takes place. So no, I'm not buying this. Just because it makes
It will, but only for 1 CPU (the boot CPU), since kdump will usually be run
with nr_cpus=1.

> your problem disappear does not mean it's the proper explanation.
>
Let me try this again -

1. We boot a 16 vcpu guest; the guest kernel writes the LDR for the CPUs but
of course, PhysFlat is always used.

2. We force a kdump crash - the kdump kernel boots with nr_cpus=1 but that
does not make KVM forget about the inactive vcpus. They are still in the
vcpu list but not active.

3. Before jumping into the kdump kernel, the guest kernel does not clear the
LDR bits.

4. In the kdump kernel, the guest only overwrites the LDR for the boot cpu,
i.e. from KVM's point of view, the stale LDR values are still around for the
inactive vcpus.

5. recalculate_apic_map() in its previous form (before the kvm patch linked
above) did not check whether the virtual apic is actually enabled; rather,
if it finds any value in the LDR, it will keep the cpu in the mapping table
it maintains. However, it makes the assumption that only one bit in the LDR
is set, which is not true for bigsmp_init_apic_ldr() - the way it
initializes the LDR, multiple bits can be set! Since recalculate_apic_map()
uses ffs() it just checks for the first set bit, assumes no other bits are
set, and potentially overwrites an existing entry.

For example, let's assume the kdump kernel boots on CPU 1. So, in
recalculate_apic_map(), mask is 1 and ffs(mask) - 1 is

Re: [PATCH] x86/apic: reset LDR in clear_local_APIC

2019-08-19 Thread Bandan Das
Hi Thomas,

Thomas Gleixner  writes:

> Bandan,
>
> On Wed, 14 Aug 2019, Bandan Das wrote:
>> On a 32 bit RHEL6 guest with greater than 8 cpus, the
>> kdump kernel hangs when calibrating apic. This happens
>> because when apic initializes bigsmp, it also initializes LDR
>> even though it probably wouldn't be used.
>
> 'It probably wouldn't be used' is a not really a useful technical
> statement.
>
> Either it is used, then it needs to be handled. Or it is unused then why is
> it written in the first place?
>
> The bigsmp APIC uses physical destination mode because the logical flat
> model only supports 8 APICs. So clearly bigsmp_init_apic_ldr() is bogus and
> should be empty.
>

Your note above is what I meant by "it probably wouldn't be used" because
I don't have much insight into the history behind why LDR is being initialized
in the first place. The only evidence I found is a comment in apic.c that 
states:
/*
 * Intel recommends to set DFR, LDR and TPR before enabling
 * an APIC.  See e.g. "AP-388 82489DX User's Manual" (Intel
 * document number 292116).  So here it goes...
 */
That said, not initalizing the ldr in bigsmp_init_apic_ldr() should be
enough to fix this. Would you prefer that change instead ?

>> When booting into kdump, KVM apic incorrectly reads the stale LDR
>> values from the guest while building the logical destination map
>> even for inactive vcpus. While KVM apic can be fixed to ignore apics
>> that haven't been enabled, a simple guest only change can be to
>> just clear out the LDR.
>
> This does not make much sense either. What has KVM to do with logical
> destination maps while booting the kdump kernel? The kdump kernel is not

This is the guest kernel and KVM takes care of injecting the interrupt to
the right vcpu (recalculate_apic_map() in lapic.c).

For the KVM side change, please take a look at
https://lore.kernel.org/kvm/aee50952-144d-78da-9036-045bd3838...@redhat.com/

> going through the regular cold/warm boot process, so KVM does not even know
> that the crashing kernel jumped into the kdump one.
>
> What builds the logical destination maps and what has LDR and the KVM APIC
> to do with that?
>
> I'm not opposed to the change per se, but I'm not accepting change logs
> which have the fairy tale smell.
>
Heh, no it's not.

> Thanks,
>
>   tglx


[PATCH] x86/apic: reset LDR in clear_local_APIC

2019-08-13 Thread Bandan Das


On a 32 bit RHEL6 guest with greater than 8 cpus, the
kdump kernel hangs when calibrating apic. This happens
because when apic initializes bigsmp, it also initializes LDR
even though it probably wouldn't be used.

When booting into kdump, KVM apic incorrectly reads the stale LDR
values from the guest while building the logical destination map
even for inactive vcpus. While KVM apic can be fixed to ignore apics
that haven't been enabled, a simple guest only change can be to
just clear out the LDR.

Signed-off-by: Bandan Das 
---
 arch/x86/kernel/apic/apic.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index f5291362da1a..6b9511dc9157 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1141,6 +1141,10 @@ void clear_local_APIC(void)
apic_write(APIC_LVT0, v | APIC_LVT_MASKED);
v = apic_read(APIC_LVT1);
apic_write(APIC_LVT1, v | APIC_LVT_MASKED);
+   if (!x2apic_enabled()) {
+   v = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
+   apic_write(APIC_LDR, v);
+   }
if (maxlvt >= 4) {
v = apic_read(APIC_LVTPC);
apic_write(APIC_LVTPC, v | APIC_LVT_MASKED);
-- 
2.20.1



[PATCH v2] perf/x86: descriptive failure messages for PMU init

2019-04-17 Thread Bandan Das


There's a default warning message that gets printed, however,
there are various failure conditions:
 - a msr read can fail
 - a msr write can fail
 - a msr has an unexpected value
 - all msrs have unexpected values (disable PMU)

Lastly, use %llx to silence checkpatch

Signed-off-by: Bandan Das 
---
v2:
 Remove virt specific pr_debugs
 Change the default warning message

 arch/x86/events/core.c | 53 +++---
 1 file changed, 40 insertions(+), 13 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 81911e11a15d..52b0893da78b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -192,9 +192,16 @@ static void release_pmc_hardware(void) {}
 
 static bool check_hw_exists(void)
 {
-   u64 val, val_fail = -1, val_new= ~0;
-   int i, reg, reg_fail = -1, ret = 0;
-   int bios_fail = 0;
+   u64 val = -1, val_fail = -1, val_new = ~0;
+   int i, reg = -1, reg_fail = -1, ret = 0;
+
+   enum {
+ READ_FAIL=1,
+ WRITE_FAIL   =2,
+ PMU_FAIL =3,
+ BIOS_FAIL=4,
+   };
+   int status = 0;
int reg_safe = -1;
 
/*
@@ -204,10 +211,13 @@ static bool check_hw_exists(void)
for (i = 0; i < x86_pmu.num_counters; i++) {
reg = x86_pmu_config_addr(i);
ret = rdmsrl_safe(reg, &val);
-   if (ret)
+   if (ret) {
+   status = READ_FAIL;
goto msr_fail;
+   }
+
if (val & ARCH_PERFMON_EVENTSEL_ENABLE) {
-   bios_fail = 1;
+   status = BIOS_FAIL;
val_fail = val;
reg_fail = reg;
} else {
@@ -218,11 +228,13 @@ static bool check_hw_exists(void)
if (x86_pmu.num_counters_fixed) {
reg = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
ret = rdmsrl_safe(reg, &val);
-   if (ret)
+   if (ret) {
+   status = READ_FAIL;
goto msr_fail;
+   }
for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
if (val & (0x03 << i*4)) {
-   bios_fail = 1;
+   status = BIOS_FAIL;
val_fail = val;
reg_fail = reg;
}
@@ -236,7 +248,7 @@ static bool check_hw_exists(void)
 */
 
if (reg_safe == -1) {
-   reg = reg_safe;
+   status = PMU_FAIL;
goto msr_fail;
}
 
@@ -246,18 +258,22 @@ static bool check_hw_exists(void)
 * (qemu/kvm) that don't trap on the MSR access and always return 0s.
 */
reg = x86_pmu_event_addr(reg_safe);
-   if (rdmsrl_safe(reg, &val))
+   if (rdmsrl_safe(reg, &val)) {
+   status = READ_FAIL;
goto msr_fail;
+   }
val ^= 0xffffUL;
ret = wrmsrl_safe(reg, val);
ret |= rdmsrl_safe(reg, &val_new);
-   if (ret || val != val_new)
+   if (ret || val != val_new) {
+   status = WRITE_FAIL;
goto msr_fail;
+   }
 
/*
 * We still allow the PMU driver to operate:
 */
-   if (bios_fail) {
+   if (status == BIOS_FAIL) {
pr_cont("Broken BIOS detected, complain to your hardware 
vendor.\n");
pr_err(FW_BUG "the BIOS has corrupted hw-PMU resources (MSR %x 
is %Lx)\n",
  reg_fail, val_fail);
@@ -270,8 +286,19 @@ static bool check_hw_exists(void)
pr_cont("PMU not available due to virtualization, using 
software events only.\n");
} else {
pr_cont("Broken PMU hardware detected, using software events 
only.\n");
-   pr_err("Failed to access perfctr msr (MSR %x is %Lx)\n",
-  reg, val_new);
+   }
+   switch (status) {
+   case READ_FAIL:
+   pr_err("Failed to read perfctr msr (MSR %x)\n", reg);
+   break;
+   case WRITE_FAIL:
+   pr_err("Failed to write perfctr msr (MSR %x, wrote: %llx, read: 
%llx)\n",
+  reg, val, val_new);
+   break;
+   case PMU_FAIL:
+   /* fall through for default message */
+   default:
+   pr_err(FW_BUG "the BIOS has corrupted hw-PMU resources.\n");
}
 
return false;
-- 
2.19.2



Re: [PATCH] perf/x86: descriptive failure messages for PMU init

2019-04-15 Thread Bandan Das
Hi Peter,

Peter Zijlstra  writes:

> On Fri, Apr 12, 2019 at 03:09:17PM -0400, Bandan Das wrote:
>> 
>> There's a default warning message that gets printed, however,
>> there are various failure conditions:
>>  - a msr read can fail
>>  - a msr write can fail
>>  - a msr has an unexpected value
>>  - all msrs have unexpected values (disable PMU)
>> 
>> Also, commit commit 005bd0077a79 ("perf/x86: Modify error message in
>> virtualized environment") completely removed printing the msr in
>> question but these messages could be helpful for debugging vPMUs as
>> well. Add them back and change them to pr_debugs, this keeps the
>> behavior the same for baremetal.
>> 
>> Lastly, use %llx to silence checkpatch
>
> Yuck... if you're debugging a hypervisor, you can bloody well run your
> own kernel with additional print slattered around.
>
> The whole make an exception for virt bullshit was already pushing it,
> this is just insane.
>

The only virt specific parts are the pr_debugs which I can remove and
replace with unconditional pr_err()s as suggested by Jiri. Is that ok ?

Bandan

>> @@ -266,12 +282,30 @@ static bool check_hw_exists(void)
>>  return true;
>>  
>>  msr_fail:
>> -if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
>> +if (virt)
>>  pr_cont("PMU not available due to virtualization, using 
>> software events only.\n");
>> -} else {
>> -pr_cont("Broken PMU hardware detected, using software events 
>> only.\n");
>> -pr_err("Failed to access perfctr msr (MSR %x is %Lx)\n",
>> -   reg, val_new);
>> +switch (status) {
>> +case READ_FAIL:
>> +if (virt)
>> +pr_debug("Failed to read perfctr msr (MSR %x)\n", reg);
>> +else
>> +pr_err("Failed to read perfctr msr (MSR %x)\n", reg);
>> +break;
>> +case WRITE_FAIL:
>> +if (virt)
>> +pr_debug("Failed to write perfctr msr (MSR %x, wrote: 
>> %llx, read: %llx)\n",
>> + reg, val, val_new);
>> +else
>> +pr_err("Failed to write perfctr msr (MSR %x, wrote: 
>> %llx, read: %llx)\n",
>> + reg, val, val_new);
>> +break;
>> +case PMU_FAIL:
>> +/* fall through for default message */
>> +default:
>> +if (virt)
>> +pr_debug("Broken PMU hardware detected, using software 
>> events only.\n");
>> +else
>> +pr_cont("Broken PMU hardware detected, using software 
>> events only.\n");
>>  }


[PATCH] perf/x86: descriptive failure messages for PMU init

2019-04-12 Thread Bandan Das


There's a default warning message that gets printed, however,
there are various failure conditions:
 - a msr read can fail
 - a msr write can fail
 - a msr has an unexpected value
 - all msrs have unexpected values (disable PMU)

Also, commit commit 005bd0077a79 ("perf/x86: Modify error message in
virtualized environment") completely removed printing the msr in
question but these messages could be helpful for debugging vPMUs as
well. Add them back and change them to pr_debugs, this keeps the
behavior the same for baremetal.

Lastly, use %llx to silence checkpatch

Signed-off-by: Bandan Das 
---
 arch/x86/events/core.c | 66 --
 1 file changed, 50 insertions(+), 16 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e2b1447192a8..786e03893a0c 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -192,9 +192,16 @@ static void release_pmc_hardware(void) {}
 
 static bool check_hw_exists(void)
 {
-   u64 val, val_fail = -1, val_new= ~0;
-   int i, reg, reg_fail = -1, ret = 0;
-   int bios_fail = 0;
+   u64 val = -1, val_fail = -1, val_new = ~0;
+   int i, reg = -1, reg_fail = -1, ret = 0;
+   bool virt = boot_cpu_has(X86_FEATURE_HYPERVISOR) ? true : false;
+   enum {
+ READ_FAIL=1,
+ WRITE_FAIL   =2,
+ PMU_FAIL =3,
+ BIOS_FAIL=4,
+   };
+   int status = 0;
int reg_safe = -1;
 
/*
@@ -204,10 +211,13 @@ static bool check_hw_exists(void)
for (i = 0; i < x86_pmu.num_counters; i++) {
reg = x86_pmu_config_addr(i);
ret = rdmsrl_safe(reg, &val);
-   if (ret)
+   if (ret) {
+   status = READ_FAIL;
goto msr_fail;
+   }
+
if (val & ARCH_PERFMON_EVENTSEL_ENABLE) {
-   bios_fail = 1;
+   status = BIOS_FAIL;
val_fail = val;
reg_fail = reg;
} else {
@@ -218,11 +228,13 @@ static bool check_hw_exists(void)
if (x86_pmu.num_counters_fixed) {
reg = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
ret = rdmsrl_safe(reg, &val);
-   if (ret)
+   if (ret) {
+   status = READ_FAIL;
goto msr_fail;
+   }
for (i = 0; i < x86_pmu.num_counters_fixed; i++) {
if (val & (0x03 << i*4)) {
-   bios_fail = 1;
+   status = BIOS_FAIL;
val_fail = val;
reg_fail = reg;
}
@@ -236,7 +248,7 @@ static bool check_hw_exists(void)
 */
 
if (reg_safe == -1) {
-   reg = reg_safe;
+   status = PMU_FAIL;
goto msr_fail;
}
 
@@ -246,18 +258,22 @@ static bool check_hw_exists(void)
 * (qemu/kvm) that don't trap on the MSR access and always return 0s.
 */
reg = x86_pmu_event_addr(reg_safe);
-   if (rdmsrl_safe(reg, &val))
+   if (rdmsrl_safe(reg, &val)) {
+   status = READ_FAIL;
goto msr_fail;
+   }
val ^= 0xffffUL;
ret = wrmsrl_safe(reg, val);
ret |= rdmsrl_safe(reg, &val_new);
-   if (ret || val != val_new)
+   if (ret || val != val_new) {
+   status = WRITE_FAIL;
goto msr_fail;
+   }
 
/*
 * We still allow the PMU driver to operate:
 */
-   if (bios_fail) {
+   if (status == BIOS_FAIL) {
pr_cont("Broken BIOS detected, complain to your hardware 
vendor.\n");
pr_err(FW_BUG "the BIOS has corrupted hw-PMU resources (MSR %x 
is %Lx)\n",
  reg_fail, val_fail);
@@ -266,12 +282,30 @@ static bool check_hw_exists(void)
return true;
 
 msr_fail:
-   if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
+   if (virt)
pr_cont("PMU not available due to virtualization, using 
software events only.\n");
-   } else {
-   pr_cont("Broken PMU hardware detected, using software events 
only.\n");
-   pr_err("Failed to access perfctr msr (MSR %x is %Lx)\n",
-  reg, val_new);
+   switch (status) {
+   case READ_FAIL:
+   if (virt)
+   pr_debug("Failed to read perfctr msr (MSR %x)\n", reg);
+   else
+   pr_err("Failed to read perfctr msr (MSR %x)\n", reg);
+   break;
+   case WRITE_FAIL:
+   if (virt)
+   pr_debug("Failed to write perfctr msr (MSR %x, wrote: 
%l

Re: [PATCH] KVM: vmx: speed up MSR bitmap merge

2017-12-18 Thread Bandan Das
David Hildenbrand  writes:
...
>>  vmx->nested.cached_vmcs12 = kmalloc(VMCS12_SIZE, GFP_KERNEL);
>> @@ -10325,36 +10321,43 @@ static inline bool 
>> nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
>>  /* This shortcut is ok because we support only x2APIC MSRs so far. */
>>  if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
>>  return false;
>> +if (WARN_ON_ONCE(!cpu_has_vmx_msr_bitmap()))
>> +return false;
>
> IMHO it would be nicer to always call nested_vmx_merge_msr_bitmap() and
> make calling code less ugly:
>
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index ee214b4112af..d4f06fc643ae 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -10238,11 +10238,7 @@ static void nested_get_vmcs12_pages(struct
> kvm_vcpu *vcpu,
> (unsigned long)(vmcs12->posted_intr_desc_addr &
> (PAGE_SIZE - 1)));
> }
> -   if (cpu_has_vmx_msr_bitmap() &&
> -   nested_cpu_has(vmcs12, CPU_BASED_USE_MSR_BITMAPS) &&
> -   nested_vmx_merge_msr_bitmap(vcpu, vmcs12))
> -   ;
> -   else
> +   if (!nested_vmx_merge_msr_bitmap(vcpu, vmcs12))
> vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
> CPU_BASED_USE_MSR_BITMAPS);
>  }
> @@ -10318,6 +10314,10 @@ static inline bool
> nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
> unsigned long *msr_bitmap_l1;
> unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.msr_bitmap;
>
> +   if (!cpu_has_vmx_msr_bitmap())
> +   return false;
> +   if (!nested_cpu_has(vmcs12, CPU_BASED_USE_MSR_BITMAPS))
> +   return false;

This looks good, otherwise the WARN_ON_ONCE just seems like an unnecessary
check since the function is only called once.

Bandan

> /* This shortcut is ok because we support only x2APIC MSRs so
> far. */
> if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
> return false;
>
>
>
>>  
>>  page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->msr_bitmap);
>>  if (is_error_page(page))
>>  return false;
>> -msr_bitmap_l1 = (unsigned long *)kmap(page);
>>  
>> -memset(msr_bitmap_l0, 0xff, PAGE_SIZE);
>> +msr_bitmap_l1 = (unsigned long *)kmap(page);
>
>
> Wouldn't it be easier to simply set everything to 0xff as before and
> then only handle the one special case where you don't do that? e.g. the
> complete else part would be gone.
>
>> +if (nested_cpu_has_apic_reg_virt(vmcs12)) {
>> +/* Disable read intercept for all MSRs between 0x800 and 0x8ff. 
>>  */
>> +for (msr = 0x800; msr <= 0x8ff; msr += BITS_PER_LONG) {
>> +unsigned word = msr / BITS_PER_LONG;
>> +msr_bitmap_l0[word] = msr_bitmap_l1[word];
>> +msr_bitmap_l0[word + (0x800 / sizeof(long))] = ~0;
>> +}
>> +} else {
>> +for (msr = 0x800; msr <= 0x8ff; msr += BITS_PER_LONG) {
>> +unsigned word = msr / BITS_PER_LONG;
>> +msr_bitmap_l0[word] = ~0;
>> +msr_bitmap_l0[word + (0x800 / sizeof(long))] = ~0;
>> +}
>> +}
>>  
>> -if (nested_cpu_has_virt_x2apic_mode(vmcs12)) {
>> -if (nested_cpu_has_apic_reg_virt(vmcs12))
>> -for (msr = 0x800; msr <= 0x8ff; msr++)
>> -nested_vmx_disable_intercept_for_msr(
>> -msr_bitmap_l1, msr_bitmap_l0,
>> -msr, MSR_TYPE_R);
>> +nested_vmx_disable_intercept_for_msr(
>> +msr_bitmap_l1, msr_bitmap_l0,
>> +APIC_BASE_MSR + (APIC_TASKPRI >> 4),
>> +MSR_TYPE_W);
>
> I'd vote for indenting the parameters properly (even though we exceed 80
> chars by 1 then :) )
>
>>  
>> +if (nested_cpu_has_vid(vmcs12)) {
>>  nested_vmx_disable_intercept_for_msr(
>> -msr_bitmap_l1, msr_bitmap_l0,
>> -APIC_BASE_MSR + (APIC_TASKPRI >> 4),
>> -MSR_TYPE_R | MSR_TYPE_W);
>> -
>> -if (nested_cpu_has_vid(vmcs12)) {
>> -nested_vmx_disable_intercept_for_msr(
>> -msr_bitmap_l1, msr_bitmap_l0,
>> -APIC_BASE_MSR + (APIC_EOI >> 4),
>> -MSR_TYPE_W);
>> -nested_vmx_disable_intercept_for_msr(
>> -msr_bitmap_l1, msr_bitmap_l0,
>> -APIC_BASE_MSR + (APIC_SELF_IPI >> 4),
>> -MSR_TYPE_W);
>> -}
>> +msr_bitmap_l1, msr_bitmap_l0,
>> +APIC_BASE_MSR + (APIC_EOI >> 4),
>> +MSR_TYPE_W);
>> +nested_vmx_disable_intercept_for_msr(
>> +msr_bitmap_l1, msr_bitmap_l0,
>> +APIC_

Re: [PATCH] x86/pci: Add a break condition when enabling BAR

2017-12-07 Thread Bandan Das
Christian König  writes:

> Hi Bandan,
>
> thanks for the patch, but this is a known issue with a fix already on
> the way into the next -rc.

Oh great! Thank you - do you have a pointer to the patch so that I can test?

> Regards,
> Christian.
>
> Am 07.12.2017 um 09:00 schrieb Bandan Das:
>> On an old flaky system with AMD Opteron 6320, boot hangs
>> with the following trace since commit fa564ad9:
>>
>> [   28.181012] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 09/03/2014
>> [   28.184022] RIP: 0010:lock_acquire+0xd5/0x1e0
>> [   28.185010] RSP: 0018:b7ad818c39a8 EFLAGS: 0246 ORIG_RAX: 
>> ff11
>> [   28.187010] RAX: a074fb39b140 RBX: 0246 RCX: 
>> 
>> [   28.189014] RDX: b20a55a9 RSI: 00040009 RDI: 
>> 0246
>> [   28.191012] RBP:  R08: 0006 R09: 
>> 
>> [   28.193020] R10: 0001 R11: dac664b5 R12: 
>> 
>> [   28.196013] R13:  R14: 0001 R15: 
>> 
>> [   28.197011] FS:  () GS:a074fbd0() 
>> knlGS:
>> [   28.201014] CS:  0010 DS:  ES:  CR0: 80050033
>> [   28.201014] CR2:  CR3: 0003b6e1 CR4: 
>> 000406e0
>> [   28.205008] Call Trace:
>> [   28.205013]  ? request_resource_conflict+0x19/0x40
>> [   28.207013]  _raw_write_lock+0x2e/0x40
>> [   28.209008]  ? request_resource_conflict+0x19/0x40
>> [   28.209010]  request_resource_conflict+0x19/0x40
>> [   28.212013]  pci_amd_enable_64bit_bar+0x103/0x1a0
>> [   28.213025]  pci_fixup_device+0xd4/0x210
>> [   28.213025]  pci_setup_device+0x193/0x570
>> [   28.215010]  ? get_device+0x13/0x20
>> [   28.217008]  pci_scan_single_device+0x98/0xd0
>> [   28.217011]  pci_scan_slot+0x90/0x130
>> [   28.219010]  pci_scanild_bus_extend+0x3a/0x270
>> [   28.321008]  acpi_pci_root_create+0x1a9/0x210
>> [   28.321014]  ? pci_acpi_scan_root+0x135/0x1b0
>> [   28.324013]  pci_acpi_scan_root+0x15f/0x1b0
>> [   28.325008]  acpi_pci_root_add+0x283/0x560
>> [   28.325014]  ? acpi_match_device_ids+0xc/0x20
>> [   28.327013]  acpi_bus_attach+0xf9/0x1c0
>> [   28.329008]  acpi_bus_attach+0x82/0x1c0
>> [   28.329044]  acpi_bus_attach+0x82/0x1c0
>> [   28.331010]  acpi_bus_scan+0x47/0xa0
>> [   28.333008]  acpi_scan_init+0x12d/0x28d
>> [   28.333013]  ? bus_register+0x208/0x280
>> [   28.333013]  acpi_init+0x30f/0x36f
>> [   28.335010]  ? acpi_sleep_proc_init+0x24/0x24
>> [   28.337013]  do_one_initcall+0x4d/0x19c
>> [   28.337013]  ? do_early_param+0x29/0x86
>> [   28.340013]  kernel_init_freeable+0x209/0x2a4
>> [   28.341008]  ? set_debug_rodata+0x11/0x11
>> [   28.341011]  ? rest_init+0xc0/0xc0
>> [   28.343013]  kernel_init+0xa/0x104
>> [   28.345008]  ret_from_fork+0x24/0x30
>> [   28.345010] Code: 24 08 49 c1 e9 09 49 83 f1 01 41 83 e1 01 e8 73
>> e4 ff ff 65 48 8b 04 25 c0 d4 00 00 48 89 df c7 80 fc 0c 00 00 00 00
>> 00 00 57 9d <0f> 1f 44 00 00 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f
>> c3 65
>>
>> Since request_resource() will unconditionally return a conflict for invalid
>> regions, there will be no way to break out of the loop when enabling 64bit
>> BAR. Add checks and exit the loop in these cases without attempting to
>> enable BAR.
>>
>> Signed-off-by: Bandan Das 
>> ---
>>   arch/x86/pci/fixup.c | 7 ++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
>> index 1e996df..8933a1b 100644
>> --- a/arch/x86/pci/fixup.c
>> +++ b/arch/x86/pci/fixup.c
>> @@ -696,8 +696,13 @@ static void pci_amd_enable_64bit_bar(struct pci_dev 
>> *dev)
>>  res->end = 0xfd00000000ull - 1;
>>  /* Just grab the free area behind system memory for this */
>> -while ((conflict = request_resource_conflict(&iomem_resource, res)))
>> +while ((conflict = request_resource_conflict(&iomem_resource, res))) {
>> +if ((res->start > res->end) ||
>> +(res->start < iomem_resource.start) ||
>> +(res->end > iomem_resource.end))
>> +break;
>>  res->start = conflict->end + 1;
>> +}
>>  dev_info(&dev->dev, "adding root bus resource %pR\n", res);
>>   


[PATCH] x86/pci: Add a break condition when enabling BAR

2017-12-07 Thread Bandan Das

On an old flaky system with AMD Opteron 6320, boot hangs
with the following trace since commit fa564ad9:

[   28.181012] Hardware name: HP ProLiant DL385p Gen8, BIOS A28 09/03/2014
[   28.184022] RIP: 0010:lock_acquire+0xd5/0x1e0
[   28.185010] RSP: 0018:b7ad818c39a8 EFLAGS: 0246 ORIG_RAX: 
ff11
[   28.187010] RAX: a074fb39b140 RBX: 0246 RCX: 
[   28.189014] RDX: b20a55a9 RSI: 00040009 RDI: 0246
[   28.191012] RBP:  R08: 0006 R09: 
[   28.193020] R10: 0001 R11: dac664b5 R12: 
[   28.196013] R13:  R14: 0001 R15: 
[   28.197011] FS:  () GS:a074fbd0() 
knlGS:
[   28.201014] CS:  0010 DS:  ES:  CR0: 80050033
[   28.201014] CR2:  CR3: 0003b6e1 CR4: 000406e0
[   28.205008] Call Trace:
[   28.205013]  ? request_resource_conflict+0x19/0x40
[   28.207013]  _raw_write_lock+0x2e/0x40
[   28.209008]  ? request_resource_conflict+0x19/0x40
[   28.209010]  request_resource_conflict+0x19/0x40
[   28.212013]  pci_amd_enable_64bit_bar+0x103/0x1a0
[   28.213025]  pci_fixup_device+0xd4/0x210
[   28.213025]  pci_setup_device+0x193/0x570
[   28.215010]  ? get_device+0x13/0x20
[   28.217008]  pci_scan_single_device+0x98/0xd0
[   28.217011]  pci_scan_slot+0x90/0x130
[   28.219010]  pci_scanild_bus_extend+0x3a/0x270
[   28.321008]  acpi_pci_root_create+0x1a9/0x210
[   28.321014]  ? pci_acpi_scan_root+0x135/0x1b0
[   28.324013]  pci_acpi_scan_root+0x15f/0x1b0
[   28.325008]  acpi_pci_root_add+0x283/0x560
[   28.325014]  ? acpi_match_device_ids+0xc/0x20
[   28.327013]  acpi_bus_attach+0xf9/0x1c0
[   28.329008]  acpi_bus_attach+0x82/0x1c0
[   28.329044]  acpi_bus_attach+0x82/0x1c0
[   28.331010]  acpi_bus_scan+0x47/0xa0
[   28.333008]  acpi_scan_init+0x12d/0x28d
[   28.333013]  ? bus_register+0x208/0x280
[   28.333013]  acpi_init+0x30f/0x36f
[   28.335010]  ? acpi_sleep_proc_init+0x24/0x24
[   28.337013]  do_one_initcall+0x4d/0x19c
[   28.337013]  ? do_early_param+0x29/0x86
[   28.340013]  kernel_init_freeable+0x209/0x2a4
[   28.341008]  ? set_debug_rodata+0x11/0x11
[   28.341011]  ? rest_init+0xc0/0xc0
[   28.343013]  kernel_init+0xa/0x104
[   28.345008]  ret_from_fork+0x24/0x30
[   28.345010] Code: 24 08 49 c1 e9 09 49 83 f1 01 41 83 e1 01 e8 73
e4 ff ff 65 48 8b 04 25 c0 d4 00 00 48 89 df c7 80 fc 0c 00 00 00 00
00 00 57 9d <0f> 1f 44 00 00 48 83 c4 30 5b 5d 41 5c 41 5d 41 5e 41 5f
c3 65

Since request_resource() will unconditionally return a conflict for invalid
regions, there will be no way to break out of the loop when enabling 64bit BAR.
Add checks and exit the loop in these cases without attempting to enable
BAR.

Signed-off-by: Bandan Das 
---
 arch/x86/pci/fixup.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
index 1e996df..8933a1b 100644
--- a/arch/x86/pci/fixup.c
+++ b/arch/x86/pci/fixup.c
@@ -696,8 +696,13 @@ static void pci_amd_enable_64bit_bar(struct pci_dev *dev)
res->end = 0xfdull - 1;
 
/* Just grab the free area behind system memory for this */
-   while ((conflict = request_resource_conflict(&iomem_resource, res)))
+   while ((conflict = request_resource_conflict(&iomem_resource, res))) {
+   if ((res->start > res->end) ||
+   (res->start < iomem_resource.start) ||
+   (res->end > iomem_resource.end))
+   break;
res->start = conflict->end + 1;
+   }
 
dev_info(&dev->dev, "adding root bus resource %pR\n", res);
 
-- 
2.9.4



Re: [PATCH v7 0/3] Expose VMFUNC to the nested hypervisor

2017-08-04 Thread Bandan Das
David Hildenbrand  writes:
...
>> v1:
>>  https://lkml.org/lkml/2017/6/29/958
>> 
>> Bandan Das (3):
>>   KVM: vmx: Enable VMFUNCs
>>   KVM: nVMX: Enable VMFUNC for the L1 hypervisor
>>   KVM: nVMX: Emulate EPTP switching for the L1 hypervisor
>> 
>>  arch/x86/include/asm/vmx.h |   9 +++
>>  arch/x86/kvm/vmx.c | 185 
>> -
>>  2 files changed, 192 insertions(+), 2 deletions(-)
>> 
>
> Acked-by: David Hildenbrand 
>
> (not 100% confident for a r-b, not because of your patches but because
> of the involved complexity (flushes, MMU ...))

You and Radim both contributed major revisions and changes to these patches.
I would be 100% confident of an R-b tag from you.

Bandan


[PATCH v7 1/3] KVM: vmx: Enable VMFUNCs

2017-08-03 Thread Bandan Das
Enable VMFUNC in the secondary execution controls.  This simplifies the
changes necessary to expose it to nested hypervisors.  VMFUNCs still
cause #UD when invoked.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |  3 +++
 arch/x86/kvm/vmx.c | 22 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 35cd06f..da5375e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -72,6 +72,7 @@
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_RDRAND  0x0800
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
+#define SECONDARY_EXEC_ENABLE_VMFUNC0x2000
 #define SECONDARY_EXEC_SHADOW_VMCS  0x4000
 #define SECONDARY_EXEC_RDSEED  0x0001
 #define SECONDARY_EXEC_ENABLE_PML   0x0002
@@ -187,6 +188,8 @@ enum vmcs_field {
APIC_ACCESS_ADDR_HIGH   = 0x2015,
POSTED_INTR_DESC_ADDR   = 0x2016,
POSTED_INTR_DESC_ADDR_HIGH  = 0x2017,
+   VM_FUNCTION_CONTROL = 0x2018,
+   VM_FUNCTION_CONTROL_HIGH= 0x2019,
EPT_POINTER = 0x201a,
EPT_POINTER_HIGH= 0x201b,
EOI_EXIT_BITMAP0= 0x201c,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 39a6222..b8969da 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1318,6 +1318,12 @@ static inline bool cpu_has_vmx_tsc_scaling(void)
SECONDARY_EXEC_TSC_SCALING;
 }
 
+static inline bool cpu_has_vmx_vmfunc(void)
+{
+   return vmcs_config.cpu_based_2nd_exec_ctrl &
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+}
+
 static inline bool report_flexpriority(void)
 {
return flexpriority_enabled;
@@ -3607,7 +3613,8 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
SECONDARY_EXEC_SHADOW_VMCS |
SECONDARY_EXEC_XSAVES |
SECONDARY_EXEC_ENABLE_PML |
-   SECONDARY_EXEC_TSC_SCALING;
+   SECONDARY_EXEC_TSC_SCALING |
+   SECONDARY_EXEC_ENABLE_VMFUNC;
if (adjust_vmx_controls(min2, opt2,
MSR_IA32_VMX_PROCBASED_CTLS2,
&_cpu_based_2nd_exec_control) < 0)
@@ -5303,6 +5310,9 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */
 #endif
 
+   if (cpu_has_vmx_vmfunc())
+   vmcs_write64(VM_FUNCTION_CONTROL, 0);
+
vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
@@ -7793,6 +7803,12 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static int handle_vmfunc(struct kvm_vcpu *vcpu)
+{
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -7843,6 +7859,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
kvm_vcpu *vcpu) = {
[EXIT_REASON_XSAVES]  = handle_xsaves,
[EXIT_REASON_XRSTORS] = handle_xrstors,
[EXIT_REASON_PML_FULL]= handle_pml_full,
+   [EXIT_REASON_VMFUNC]  = handle_vmfunc,
[EXIT_REASON_PREEMPTION_TIMER]= handle_preemption_timer,
 };
 
@@ -8164,6 +8181,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
case EXIT_REASON_PML_FULL:
/* We emulate PML support to L1. */
return false;
+   case EXIT_REASON_VMFUNC:
+   /* VM functions are emulated through L2->L0 vmexits. */
+   return false;
default:
return true;
}
-- 
2.9.4



[PATCH v7 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-08-03 Thread Bandan Das
When L2 uses vmfunc, L0 utilizes the associated vmexit to
emulate a switching of the ept pointer by reloading the
guest MMU.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |   6 +++
 arch/x86/kvm/vmx.c | 124 ++---
 2 files changed, 124 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index da5375e..5f63a2e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -115,6 +115,10 @@
 #define VMX_MISC_SAVE_EFER_LMA 0x0020
 #define VMX_MISC_ACTIVITY_HLT  0x0040
 
+/* VMFUNC functions */
+#define VMX_VMFUNC_EPTP_SWITCHING   0x0001
+#define VMFUNC_EPTP_ENTRIES  512
+
 static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
 {
return vmx_basic & GENMASK_ULL(30, 0);
@@ -200,6 +204,8 @@ enum vmcs_field {
EOI_EXIT_BITMAP2_HIGH   = 0x2021,
EOI_EXIT_BITMAP3= 0x2022,
EOI_EXIT_BITMAP3_HIGH   = 0x2023,
+   EPTP_LIST_ADDRESS   = 0x2024,
+   EPTP_LIST_ADDRESS_HIGH  = 0x2025,
VMREAD_BITMAP   = 0x2026,
VMWRITE_BITMAP  = 0x2028,
XSS_EXIT_BITMAP = 0x202C,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 042ea88..61f7fe5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -249,6 +249,7 @@ struct __packed vmcs12 {
u64 eoi_exit_bitmap1;
u64 eoi_exit_bitmap2;
u64 eoi_exit_bitmap3;
+   u64 eptp_list_address;
u64 xss_exit_bitmap;
u64 guest_physical_address;
u64 vmcs_link_pointer;
@@ -774,6 +775,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
FIELD64(EOI_EXIT_BITMAP2, eoi_exit_bitmap2),
FIELD64(EOI_EXIT_BITMAP3, eoi_exit_bitmap3),
+   FIELD64(EPTP_LIST_ADDRESS, eptp_list_address),
FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap),
FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
@@ -1406,6 +1408,13 @@ static inline bool nested_cpu_has_vmfunc(struct vmcs12 
*vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
 }
 
+static inline bool nested_cpu_has_eptp_switching(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has_vmfunc(vmcs12) &&
+   (vmcs12->vm_function_control &
+VMX_VMFUNC_EPTP_SWITCHING);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2818,7 +2827,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
if (cpu_has_vmx_vmfunc()) {
vmx->nested.nested_vmx_secondary_ctls_high |=
SECONDARY_EXEC_ENABLE_VMFUNC;
-   vmx->nested.nested_vmx_vmfunc_controls = 0;
+   /*
+* Advertise EPTP switching unconditionally
+* since we emulate it
+*/
+   vmx->nested.nested_vmx_vmfunc_controls =
+   VMX_VMFUNC_EPTP_SWITCHING;
}
 
/*
@@ -7820,6 +7834,88 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static bool valid_ept_address(struct kvm_vcpu *vcpu, u64 address)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   u64 mask = address & 0x7;
+   int maxphyaddr = cpuid_maxphyaddr(vcpu);
+
+   /* Check for memory type validity */
+   switch (mask) {
+   case 0:
+   if (!(vmx->nested.nested_vmx_ept_caps & VMX_EPTP_UC_BIT))
+   return false;
+   break;
+   case 6:
+   if (!(vmx->nested.nested_vmx_ept_caps & VMX_EPTP_WB_BIT))
+   return false;
+   break;
+   default:
+   return false;
+   }
+
+   /* Bits 5:3 must be 3 */
+   if (((address >> VMX_EPT_GAW_EPTP_SHIFT) & 0x7) != VMX_EPT_DEFAULT_GAW)
+   return false;
+
+   /* Reserved bits should not be set */
+   if (address >> maxphyaddr || ((address >> 7) & 0x1f))
+   return false;
+
+   /* AD, if set, should be supported */
+   if ((address & VMX_EPT_AD_ENABLE_BIT)) {
+   if (!(vmx->nested.nested_vmx_ept_caps & VMX_EPT_AD_BIT))
+   return false;
+   }
+
+   return true;
+}
+
+static int nested_vmx_eptp_switching(struct kvm_vcpu *vcpu,
+struct vmcs12 *vmcs12)
+{
+   u32 index = vcpu->arch.regs[VCPU_REGS_RCX];
+   u64 address;
+   bool accessed_dirty;
+   struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
+
+   if (!nested_cpu_has_eptp_switching(vmcs

[PATCH v7 2/3] KVM: nVMX: Enable VMFUNC for the L1 hypervisor

2017-08-03 Thread Bandan Das
Expose VMFUNC in MSRs and VMCS fields. No actual VMFUNCs are enabled.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/kvm/vmx.c | 53 +++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b8969da..042ea88 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -243,6 +243,7 @@ struct __packed vmcs12 {
u64 virtual_apic_page_addr;
u64 apic_access_addr;
u64 posted_intr_desc_addr;
+   u64 vm_function_control;
u64 ept_pointer;
u64 eoi_exit_bitmap0;
u64 eoi_exit_bitmap1;
@@ -484,6 +485,7 @@ struct nested_vmx {
u64 nested_vmx_cr4_fixed0;
u64 nested_vmx_cr4_fixed1;
u64 nested_vmx_vmcs_enum;
+   u64 nested_vmx_vmfunc_controls;
 };
 
 #define POSTED_INTR_ON  0
@@ -766,6 +768,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr),
+   FIELD64(VM_FUNCTION_CONTROL, vm_function_control),
FIELD64(EPT_POINTER, ept_pointer),
FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0),
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
@@ -1398,6 +1401,11 @@ static inline bool nested_cpu_has_posted_intr(struct 
vmcs12 *vmcs12)
return vmcs12->pin_based_vm_exec_control & PIN_BASED_POSTED_INTR;
 }
 
+static inline bool nested_cpu_has_vmfunc(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2807,6 +2815,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
} else
vmx->nested.nested_vmx_ept_caps = 0;
 
+   if (cpu_has_vmx_vmfunc()) {
+   vmx->nested.nested_vmx_secondary_ctls_high |=
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+   vmx->nested.nested_vmx_vmfunc_controls = 0;
+   }
+
/*
 * Old versions of KVM use the single-context version without
 * checking for support, so declare that it is supported even
@@ -3176,6 +3190,9 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 
msr_index, u64 *pdata)
*pdata = vmx->nested.nested_vmx_ept_caps |
((u64)vmx->nested.nested_vmx_vpid_caps << 32);
break;
+   case MSR_IA32_VMX_VMFUNC:
+   *pdata = vmx->nested.nested_vmx_vmfunc_controls;
+   break;
default:
return 1;
}
@@ -7805,7 +7822,29 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
 
 static int handle_vmfunc(struct kvm_vcpu *vcpu)
 {
-   kvm_queue_exception(vcpu, UD_VECTOR);
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct vmcs12 *vmcs12;
+   u32 function = vcpu->arch.regs[VCPU_REGS_RAX];
+
+   /*
+* VMFUNC is only supported for nested guests, but we always enable the
+* secondary control for simplicity; for non-nested mode, fake that we
+* didn't by injecting #UD.
+*/
+   if (!is_guest_mode(vcpu)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   vmcs12 = get_vmcs12(vcpu);
+   if ((vmcs12->vm_function_control & (1 << function)) == 0)
+   goto fail;
+   WARN_ONCE(1, "VMCS12 VM function control should have been zero");
+
+fail:
+   nested_vmx_vmexit(vcpu, vmx->exit_reason,
+ vmcs_read32(VM_EXIT_INTR_INFO),
+ vmcs_readl(EXIT_QUALIFICATION));
return 1;
 }
 
@@ -10133,7 +10172,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12,
exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
  SECONDARY_EXEC_RDTSCP |
  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
- SECONDARY_EXEC_APIC_REGISTER_VIRT);
+ SECONDARY_EXEC_APIC_REGISTER_VIRT |
+ SECONDARY_EXEC_ENABLE_VMFUNC);
if (nested_cpu_has(vmcs12,
   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) {
vmcs12_exec_ctrl = vmcs12->secondary_vm_exec_control &
@@ -10141,6 +10181,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12,
exec_control |= vmcs12_exec_ctrl;
}
 
+   /* All VMFUNCs are currently emulated through L0 vmexits.  */
+   if (exec_control & SECONDARY_EXEC_ENABLE_VMFUNC)
+   vmcs_write64(VM_FUNCTION_CO

Re: [PATCH 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-08-03 Thread Bandan Das
Paolo Bonzini  writes:

> On 03/08/2017 13:39, David Hildenbrand wrote:
>>> +   /* AD, if set, should be supported */
>>> +   if ((address & VMX_EPT_AD_ENABLE_BIT)) {
>>> +   if (!enable_ept_ad_bits)
>>> +   return false;
>> In theory (I guess) we would have to check here if
>> (vmx->nested.nested_vmx_ept_caps & VMX_EPT_AD_BIT)
>
> Yes, that's a more correct check than enable_ept_ad_bits.
>
>>>
>>> +   page = nested_get_page(vcpu, vmcs12->eptp_list_address);
>>> +   if (!page)
>>> +   return 1;
>>> +
>>> +   l1_eptp_list = kmap(page);
>>> +   address = l1_eptp_list[index];
>>> +   accessed_dirty = !!(address & VMX_EPT_AD_ENABLE_BIT);
>> 
>> Minor nit: Can't you directly do
>> 
>> kunmap(page);
>> nested_release_page_clean(page);
>> 
>> at this point?
>> 
>> We can fix this up later.
>
> You actually can do simply kvm_vcpu_read_guest_page(vcpu,
> vmcs12->eptp_list_address >> PAGE_SHIFT, &address, index * 8, 8).

Thanks Paolo, for the interesting tip. David, I sent a new version with the
correct check for AD and using this instead of kmap(page).
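For reference, the lookup then collapses to roughly the following (sketch only,
error handling elided; a non-zero return from kvm_vcpu_read_guest_page() is
treated like an invalid entry):

	u64 address;

	/* Read one 8-byte EPTP entry straight out of L1's EPTP list
	 * instead of mapping the whole page with kmap().
	 */
	if (kvm_vcpu_read_guest_page(vcpu,
				     vmcs12->eptp_list_address >> PAGE_SHIFT,
				     &address, index * 8, 8))
		return 1;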

> Paolo


[PATCH v7 0/3] Expose VMFUNC to the nested hypervisor

2017-08-03 Thread Bandan Das
v7:
 3/3:
  Fix check for AD
  Use kvm_vcpu_read_guest_page()

v6:
 https://lkml.org/lkml/2017/8/1/1015
 3/3:
   Fix check for memory type in address
   Change check function name as requested in the review
   Move setting of mmu->ept_ad to after calling mmu_unload
   and also reset base_role.ad_disabled appropriately
   Replace IS_ALIGN with page_address_valid()

v5:
 https://lkml.org/lkml/2017/7/28/621
 1/3 and 2/3 are unchanged but some changes in 3/3. I left
 the mmu_load failure path untouched because I am not sure what's
 the right thing to do here.
 3/3:
Move the eptp switching logic to a different function
Add check for EPTP_ADDRESS in check_vmentry_prereq
Add check for validity of ept pointer
Check if AD bit is set and set ept_ad
Add TODO item about mmu_unload failure

v4:
 https://lkml.org/lkml/2017/7/10/705
 2/3:  Use WARN_ONCE to avoid logging dos

v3:
 https://lkml.org/lkml/2017/7/10/684
 3/3: Add missing nested_release_page_clean() and check the
 eptp as mentioned in SDM 24.6.14

v2:
 https://lkml.org/lkml/2017/7/6/813
 1/3: Patch to enable vmfunc on the host but cause a #UD if
  L1 tries to use it directly. (new)
 2/3: Expose vmfunc to the nested hypervisor, but no vm functions
  are exposed and L0 emulates a vmfunc vmexit to L1. 
 3/3: Force a vmfunc vmexit when L2 tries to use vmfunc and emulate
  eptp switching. Unconditionally expose EPTP switching to the
  L1 hypervisor since L0 fakes eptp switching via a mmu reload.

These patches expose eptp switching/vmfunc to the nested hypervisor.
vmfunc is enabled in the secondary controls for the host and is
exposed to the nested hypervisor. However, if the nested hypervisor
decides to use eptp switching, L0 emulates it.
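For context, nothing changes from L1's point of view in how the function is
invoked: EAX selects the VM function (0 for EPTP switching) and ECX the index
into the EPTP list, which is what handle_vmfunc()/nested_vmx_eptp_switching()
read back out of RAX/RCX. A minimal, purely illustrative guest-side helper
(not part of this series) would look something like:

	/* Hypothetical L1-side helper, for illustration only. */
	static inline void vmfunc_eptp_switch(unsigned int index)
	{
		/* VMFUNC is encoded as 0f 01 d4; leaf 0 (EAX = 0) switches
		 * to entry 'index' (ECX) of the EPTP list.
		 */
		asm volatile(".byte 0x0f, 0x01, 0xd4"
			     : : "a" (0), "c" (index)
			     : "memory");
	}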

v1:
 https://lkml.org/lkml/2017/6/29/958

Bandan Das (3):
  KVM: vmx: Enable VMFUNCs
  KVM: nVMX: Enable VMFUNC for the L1 hypervisor
  KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

 arch/x86/include/asm/vmx.h |   9 +++
 arch/x86/kvm/vmx.c | 185 -
 2 files changed, 192 insertions(+), 2 deletions(-)

-- 
2.9.4



[PATCH v6 1/3] KVM: vmx: Enable VMFUNCs

2017-08-01 Thread Bandan Das
Enable VMFUNC in the secondary execution controls.  This simplifies the
changes necessary to expose it to nested hypervisors.  VMFUNCs still
cause #UD when invoked.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |  3 +++
 arch/x86/kvm/vmx.c | 22 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 35cd06f..da5375e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -72,6 +72,7 @@
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_RDRAND  0x0800
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
+#define SECONDARY_EXEC_ENABLE_VMFUNC0x2000
 #define SECONDARY_EXEC_SHADOW_VMCS  0x4000
 #define SECONDARY_EXEC_RDSEED  0x0001
 #define SECONDARY_EXEC_ENABLE_PML   0x0002
@@ -187,6 +188,8 @@ enum vmcs_field {
APIC_ACCESS_ADDR_HIGH   = 0x2015,
POSTED_INTR_DESC_ADDR   = 0x2016,
POSTED_INTR_DESC_ADDR_HIGH  = 0x2017,
+   VM_FUNCTION_CONTROL = 0x2018,
+   VM_FUNCTION_CONTROL_HIGH= 0x2019,
EPT_POINTER = 0x201a,
EPT_POINTER_HIGH= 0x201b,
EOI_EXIT_BITMAP0= 0x201c,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 39a6222..b8969da 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1318,6 +1318,12 @@ static inline bool cpu_has_vmx_tsc_scaling(void)
SECONDARY_EXEC_TSC_SCALING;
 }
 
+static inline bool cpu_has_vmx_vmfunc(void)
+{
+   return vmcs_config.cpu_based_2nd_exec_ctrl &
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+}
+
 static inline bool report_flexpriority(void)
 {
return flexpriority_enabled;
@@ -3607,7 +3613,8 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
SECONDARY_EXEC_SHADOW_VMCS |
SECONDARY_EXEC_XSAVES |
SECONDARY_EXEC_ENABLE_PML |
-   SECONDARY_EXEC_TSC_SCALING;
+   SECONDARY_EXEC_TSC_SCALING |
+   SECONDARY_EXEC_ENABLE_VMFUNC;
if (adjust_vmx_controls(min2, opt2,
MSR_IA32_VMX_PROCBASED_CTLS2,
&_cpu_based_2nd_exec_control) < 0)
@@ -5303,6 +5310,9 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */
 #endif
 
+   if (cpu_has_vmx_vmfunc())
+   vmcs_write64(VM_FUNCTION_CONTROL, 0);
+
vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
@@ -7793,6 +7803,12 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static int handle_vmfunc(struct kvm_vcpu *vcpu)
+{
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -7843,6 +7859,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
kvm_vcpu *vcpu) = {
[EXIT_REASON_XSAVES]  = handle_xsaves,
[EXIT_REASON_XRSTORS] = handle_xrstors,
[EXIT_REASON_PML_FULL]= handle_pml_full,
+   [EXIT_REASON_VMFUNC]  = handle_vmfunc,
[EXIT_REASON_PREEMPTION_TIMER]= handle_preemption_timer,
 };
 
@@ -8164,6 +8181,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
case EXIT_REASON_PML_FULL:
/* We emulate PML support to L1. */
return false;
+   case EXIT_REASON_VMFUNC:
+   /* VM functions are emulated through L2->L0 vmexits. */
+   return false;
default:
return true;
}
-- 
2.9.4



[PATCH 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-08-01 Thread Bandan Das
When L2 uses vmfunc, L0 utilizes the associated vmexit to
emulate a switching of the ept pointer by reloading the
guest MMU.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |   6 +++
 arch/x86/kvm/vmx.c | 130 ++---
 2 files changed, 130 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index da5375e..5f63a2e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -115,6 +115,10 @@
 #define VMX_MISC_SAVE_EFER_LMA 0x0020
 #define VMX_MISC_ACTIVITY_HLT  0x0040
 
+/* VMFUNC functions */
+#define VMX_VMFUNC_EPTP_SWITCHING   0x0001
+#define VMFUNC_EPTP_ENTRIES  512
+
 static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
 {
return vmx_basic & GENMASK_ULL(30, 0);
@@ -200,6 +204,8 @@ enum vmcs_field {
EOI_EXIT_BITMAP2_HIGH   = 0x2021,
EOI_EXIT_BITMAP3= 0x2022,
EOI_EXIT_BITMAP3_HIGH   = 0x2023,
+   EPTP_LIST_ADDRESS   = 0x2024,
+   EPTP_LIST_ADDRESS_HIGH  = 0x2025,
VMREAD_BITMAP   = 0x2026,
VMWRITE_BITMAP  = 0x2028,
XSS_EXIT_BITMAP = 0x202C,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 042ea88..7235e9a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -249,6 +249,7 @@ struct __packed vmcs12 {
u64 eoi_exit_bitmap1;
u64 eoi_exit_bitmap2;
u64 eoi_exit_bitmap3;
+   u64 eptp_list_address;
u64 xss_exit_bitmap;
u64 guest_physical_address;
u64 vmcs_link_pointer;
@@ -774,6 +775,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
FIELD64(EOI_EXIT_BITMAP2, eoi_exit_bitmap2),
FIELD64(EOI_EXIT_BITMAP3, eoi_exit_bitmap3),
+   FIELD64(EPTP_LIST_ADDRESS, eptp_list_address),
FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap),
FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
@@ -1406,6 +1408,13 @@ static inline bool nested_cpu_has_vmfunc(struct vmcs12 
*vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
 }
 
+static inline bool nested_cpu_has_eptp_switching(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has_vmfunc(vmcs12) &&
+   (vmcs12->vm_function_control &
+VMX_VMFUNC_EPTP_SWITCHING);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2818,7 +2827,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
if (cpu_has_vmx_vmfunc()) {
vmx->nested.nested_vmx_secondary_ctls_high |=
SECONDARY_EXEC_ENABLE_VMFUNC;
-   vmx->nested.nested_vmx_vmfunc_controls = 0;
+   /*
+* Advertise EPTP switching unconditionally
+* since we emulate it
+*/
+   vmx->nested.nested_vmx_vmfunc_controls =
+   VMX_VMFUNC_EPTP_SWITCHING;
}
 
/*
@@ -7820,6 +7834,94 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static bool valid_ept_address(struct kvm_vcpu *vcpu, u64 address)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   u64 mask = address & 0x7;
+   int maxphyaddr = cpuid_maxphyaddr(vcpu);
+
+   /* Check for memory type validity */
+   switch (mask) {
+   case 0:
+   if (!(vmx->nested.nested_vmx_ept_caps & VMX_EPTP_UC_BIT))
+   return false;
+   break;
+   case 6:
+   if (!(vmx->nested.nested_vmx_ept_caps & VMX_EPTP_WB_BIT))
+   return false;
+   break;
+   default:
+   return false;
+   }
+
+   /* Bits 5:3 must be 3 */
+   if (((address >> VMX_EPT_GAW_EPTP_SHIFT) & 0x7) != VMX_EPT_DEFAULT_GAW)
+   return false;
+
+   /* Reserved bits should not be set */
+   if (address >> maxphyaddr || ((address >> 7) & 0x1f))
+   return false;
+
+   /* AD, if set, should be supported */
+   if ((address & VMX_EPT_AD_ENABLE_BIT)) {
+   if (!enable_ept_ad_bits)
+   return false;
+   }
+
+   return true;
+}
+
+static int nested_vmx_eptp_switching(struct kvm_vcpu *vcpu,
+struct vmcs12 *vmcs12)
+{
+   u32 index = vcpu->arch.regs[VCPU_REGS_RCX];
+   u64 *l1_eptp_list, address;
+   struct page *page;
+   bool accessed_dirty;
+   struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
+
+   if (!nested_cpu_has_eptp_switching(vmcs12) ||

[PATCH v6 2/3] KVM: nVMX: Enable VMFUNC for the L1 hypervisor

2017-08-01 Thread Bandan Das
Expose VMFUNC in MSRs and VMCS fields. No actual VMFUNCs are enabled.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/kvm/vmx.c | 53 +++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b8969da..042ea88 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -243,6 +243,7 @@ struct __packed vmcs12 {
u64 virtual_apic_page_addr;
u64 apic_access_addr;
u64 posted_intr_desc_addr;
+   u64 vm_function_control;
u64 ept_pointer;
u64 eoi_exit_bitmap0;
u64 eoi_exit_bitmap1;
@@ -484,6 +485,7 @@ struct nested_vmx {
u64 nested_vmx_cr4_fixed0;
u64 nested_vmx_cr4_fixed1;
u64 nested_vmx_vmcs_enum;
+   u64 nested_vmx_vmfunc_controls;
 };
 
 #define POSTED_INTR_ON  0
@@ -766,6 +768,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr),
+   FIELD64(VM_FUNCTION_CONTROL, vm_function_control),
FIELD64(EPT_POINTER, ept_pointer),
FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0),
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
@@ -1398,6 +1401,11 @@ static inline bool nested_cpu_has_posted_intr(struct 
vmcs12 *vmcs12)
return vmcs12->pin_based_vm_exec_control & PIN_BASED_POSTED_INTR;
 }
 
+static inline bool nested_cpu_has_vmfunc(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2807,6 +2815,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
} else
vmx->nested.nested_vmx_ept_caps = 0;
 
+   if (cpu_has_vmx_vmfunc()) {
+   vmx->nested.nested_vmx_secondary_ctls_high |=
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+   vmx->nested.nested_vmx_vmfunc_controls = 0;
+   }
+
/*
 * Old versions of KVM use the single-context version without
 * checking for support, so declare that it is supported even
@@ -3176,6 +3190,9 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 
msr_index, u64 *pdata)
*pdata = vmx->nested.nested_vmx_ept_caps |
((u64)vmx->nested.nested_vmx_vpid_caps << 32);
break;
+   case MSR_IA32_VMX_VMFUNC:
+   *pdata = vmx->nested.nested_vmx_vmfunc_controls;
+   break;
default:
return 1;
}
@@ -7805,7 +7822,29 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
 
 static int handle_vmfunc(struct kvm_vcpu *vcpu)
 {
-   kvm_queue_exception(vcpu, UD_VECTOR);
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct vmcs12 *vmcs12;
+   u32 function = vcpu->arch.regs[VCPU_REGS_RAX];
+
+   /*
+* VMFUNC is only supported for nested guests, but we always enable the
+* secondary control for simplicity; for non-nested mode, fake that we
+* didn't by injecting #UD.
+*/
+   if (!is_guest_mode(vcpu)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   vmcs12 = get_vmcs12(vcpu);
+   if ((vmcs12->vm_function_control & (1 << function)) == 0)
+   goto fail;
+   WARN_ONCE(1, "VMCS12 VM function control should have been zero");
+
+fail:
+   nested_vmx_vmexit(vcpu, vmx->exit_reason,
+ vmcs_read32(VM_EXIT_INTR_INFO),
+ vmcs_readl(EXIT_QUALIFICATION));
return 1;
 }
 
@@ -10133,7 +10172,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12,
exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
  SECONDARY_EXEC_RDTSCP |
  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
- SECONDARY_EXEC_APIC_REGISTER_VIRT);
+ SECONDARY_EXEC_APIC_REGISTER_VIRT |
+ SECONDARY_EXEC_ENABLE_VMFUNC);
if (nested_cpu_has(vmcs12,
   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) {
vmcs12_exec_ctrl = vmcs12->secondary_vm_exec_control &
@@ -10141,6 +10181,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12,
exec_control |= vmcs12_exec_ctrl;
}
 
+   /* All VMFUNCs are currently emulated through L0 vmexits.  */
+   if (exec_control & SECONDARY_EXEC_ENABLE_VMFUNC)
+   vmcs_write64(VM_FUNCTION_CO

[PATCH v6 0/3] Expose VMFUNC to the nested hypervisor

2017-08-01 Thread Bandan Das
v6:
 3/3:
   Fix check for memory type in address
   Change check function name as requested in the review
   Move setting of mmu->ept_ad to after calling mmu_unload
   and also reset base_role.ad_disabled appropriately
   Replace IS_ALIGN with page_address_valid()

v5:
 https://lkml.org/lkml/2017/7/28/621
 1/3 and 2/3 are unchanged but some changes in 3/3. I left
 the mmu_load failure path untouched because I am not sure what's
 the right thing to do here.
 3/3:
Move the eptp switching logic to a different function
Add check for EPTP_ADDRESS in check_vmentry_prereq
Add check for validity of ept pointer
Check if AD bit is set and set ept_ad
Add TODO item about mmu_unload failure

v4:
 https://lkml.org/lkml/2017/7/10/705
 2/3:  Use WARN_ONCE to avoid logging dos

v3:
 https://lkml.org/lkml/2017/7/10/684
 3/3: Add missing nested_release_page_clean() and check the
 eptp as mentioned in SDM 24.6.14

v2:
 https://lkml.org/lkml/2017/7/6/813
 1/3: Patch to enable vmfunc on the host but cause a #UD if
  L1 tries to use it directly. (new)
 2/3: Expose vmfunc to the nested hypervisor, but no vm functions
  are exposed and L0 emulates a vmfunc vmexit to L1. 
 3/3: Force a vmfunc vmexit when L2 tries to use vmfunc and emulate
  eptp switching. Unconditionally expose EPTP switching to the
  L1 hypervisor since L0 fakes eptp switching via a mmu reload.

These patches expose eptp switching/vmfunc to the nested hypervisor.
vmfunc is enabled in the secondary controls for the host and is
exposed to the nested hypervisor. However, if the nested hypervisor
decides to use eptp switching, L0 emulates it.

v1:
 https://lkml.org/lkml/2017/6/29/958

Bandan Das (3):
  KVM: vmx: Enable VMFUNCs
  KVM: nVMX: Enable VMFUNC for the L1 hypervisor
  KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

 arch/x86/include/asm/vmx.h |   9 +++
 arch/x86/kvm/vmx.c | 191 -
 2 files changed, 198 insertions(+), 2 deletions(-)

-- 
2.9.4



Re: [PATCH v5 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-08-01 Thread Bandan Das
Radim Krčmář  writes:

> 2017-07-28 15:52-0400, Bandan Das:
>> When L2 uses vmfunc, L0 utilizes the associated vmexit to
>> emulate a switching of the ept pointer by reloading the
>> guest MMU.
>> 
>> Signed-off-by: Paolo Bonzini 
>> Signed-off-by: Bandan Das 
>> ---
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> @@ -7767,6 +7781,85 @@ static int handle_preemption_timer(struct kvm_vcpu 
>> *vcpu)
>>  return 1;
>>  }
>>  
>> +static bool check_ept_address_valid(struct kvm_vcpu *vcpu, u64 address)
>> +{
>> +struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +u64 mask = VMX_EPT_RWX_MASK;
>> +int maxphyaddr = cpuid_maxphyaddr(vcpu);
>> +struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
>> +
>> +/* Check for execute_only validity */
>> +if ((address & mask) == VMX_EPT_EXECUTABLE_MASK) {
>> +if (!(vmx->nested.nested_vmx_ept_caps &
>> +  VMX_EPT_EXECUTE_ONLY_BIT))
>> +return false;
>> +}
>
> This check looks wrong ... bits 0:2 define the memory type:
>
>   0 = Uncacheable (UC)
>   6 = Write-back (WB)

Oops, sorry, I badly messed this up! I will incorporate these
changes and David's suggestions into a new version.

> If those are supported MSR IA32_VMX_EPT_VPID_CAP, so I think it should
> return false when
>
>   (address & 0x7) == 0 && !(vmx->nested.nested_vmx_ept_caps & 
> VMX_EPTP_UC_BIT))
>
> the same for 6 and VMX_EPTP_WB_BIT and unconditionally for the remaining
> types.
>
> Btw. when is TLB flushed after EPTP switching?

From what I understand, mmu_sync_roots() calls kvm_mmu_flush_or_zap(),
which sets KVM_REQ_TLB_FLUSH.

Bandan

>> @@ -10354,10 +10456,20 @@ static int check_vmentry_prereqs(struct kvm_vcpu 
>> *vcpu, struct vmcs12 *vmcs12)
>>  vmx->nested.nested_vmx_entry_ctls_high))
>>  return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
>>  
>> -if (nested_cpu_has_vmfunc(vmcs12) &&
>> -(vmcs12->vm_function_control &
>> - ~vmx->nested.nested_vmx_vmfunc_controls))
>> -return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
>> +if (nested_cpu_has_vmfunc(vmcs12)) {
>> +if (vmcs12->vm_function_control &
>> +~vmx->nested.nested_vmx_vmfunc_controls)
>> +return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
>> +
>> +if (nested_cpu_has_eptp_switching(vmcs12)) {
>> +if (!nested_cpu_has_ept(vmcs12) ||
>> +(vmcs12->eptp_list_address >>
>> + cpuid_maxphyaddr(vcpu)) ||
>> +!IS_ALIGNED(vmcs12->eptp_list_address, 4096))
>
> page_address_valid() would make this check a bit nicer,
>
> thanks.
>
>> +return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
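Yes, will switch to that. Assuming page_address_valid() just wraps the
alignment + maxphyaddr test, the prereq check shrinks to roughly this
(untested sketch):

		if (nested_cpu_has_eptp_switching(vmcs12)) {
			if (!nested_cpu_has_ept(vmcs12) ||
			    !page_address_valid(vcpu,
						vmcs12->eptp_list_address))
				return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
		}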


Re: [PATCH v5 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-31 Thread Bandan Das
Hi David,

David Hildenbrand  writes:

>> +static inline bool nested_cpu_has_eptp_switching(struct vmcs12 *vmcs12)
>> +{
>> +return nested_cpu_has_vmfunc(vmcs12) &&
>> +(vmcs12->vm_function_control &
>> + VMX_VMFUNC_EPTP_SWITCHING);
>> +}
>> +
>>  static inline bool is_nmi(u32 intr_info)
>>  {
>>  return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
>> @@ -2791,7 +2800,12 @@ static void nested_vmx_setup_ctls_msrs(struct 
>> vcpu_vmx *vmx)
>>  if (cpu_has_vmx_vmfunc()) {
>>  vmx->nested.nested_vmx_secondary_ctls_high |=
>>  SECONDARY_EXEC_ENABLE_VMFUNC;
>> -vmx->nested.nested_vmx_vmfunc_controls = 0;
>> +/*
>> + * Advertise EPTP switching unconditionally
>> + * since we emulate it
>> + */
>> +vmx->nested.nested_vmx_vmfunc_controls =
>> +VMX_VMFUNC_EPTP_SWITCHING;
>
> Should this only be advertised, if enable_ept is set (if the guest also
> sees/can use SECONDARY_EXEC_ENABLE_EPT)?

This represents the function control MSR, which on the hardware is
a read-only value. The checks for enable_ept and such are done elsewhere.

>>  }
>>  
>>  /*
>> @@ -7767,6 +7781,85 @@ static int handle_preemption_timer(struct kvm_vcpu 
>> *vcpu)
>>  return 1;
>>  }
>>  
>> +static bool check_ept_address_valid(struct kvm_vcpu *vcpu, u64 address)
>
> check_..._valid -> valid_ept_address() ?

I think either of the names is fine and I would prefer not
to respin unless you feel really strongly about it :) 

>
>> +{
>> +struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +u64 mask = VMX_EPT_RWX_MASK;
>> +int maxphyaddr = cpuid_maxphyaddr(vcpu);
>> +struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
>> +
>> +/* Check for execute_only validity */
>> +if ((address & mask) == VMX_EPT_EXECUTABLE_MASK) {
>> +if (!(vmx->nested.nested_vmx_ept_caps &
>> +  VMX_EPT_EXECUTE_ONLY_BIT))
>> +return false;
>> +}
>> +
>> +/* Bits 5:3 must be 3 */
>> +if (((address >> VMX_EPT_GAW_EPTP_SHIFT) & 0x7) != VMX_EPT_DEFAULT_GAW)
>> +return false;
>> +
>> +/* Reserved bits should not be set */
>> +if (address >> maxphyaddr || ((address >> 7) & 0x1f))
>> +return false;
>> +
>> +/* AD, if set, should be supported */
>> +if ((address & VMX_EPT_AD_ENABLE_BIT)) {
>> +if (!enable_ept_ad_bits)
>> +return false;
>> +mmu->ept_ad = true;
>> +} else
>> +mmu->ept_ad = false;
>
> I wouldn't expect a "check" function to modify the mmu. Can you move
> modifying the mmu outside of this function (leaving the
> enable_ept_ad_bits check in place)? (and maybe even set mmu->ept_ad
> _after_ the kvm_mmu_unload(vcpu)?, just when setting vmcs12->ept_pointer?)
>

Well, the correct thing to do is to have a wrapper around it in mmu.c
instead of calling it directly from here, and to call this function before
nested_mmu is initialized. I am working on a separate patch for this, btw.
It seems to me that setting mmu->ept_ad after kvm_mmu_unload() is unnecessary,
since it's only set if everything else succeeds. kvm_mmu_unload() isn't
affected by the setting of this flag, if I understand correctly.
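Roughly what I have in mind for the wrapper, as a sketch (name and placement
hypothetical, not in any posted version yet):

	/* in mmu.c, so vmx.c doesn't poke MMU state directly */
	void kvm_mmu_set_ept_ad(struct kvm_vcpu *vcpu, bool accessed_dirty)
	{
		struct kvm_mmu *mmu = vcpu->arch.walk_mmu;

		mmu->ept_ad = accessed_dirty;
		mmu->base_role.ad_disabled = !accessed_dirty;
	}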

>> +
>> +return true;
>> +}
>> +
>> +static int nested_vmx_eptp_switching(struct kvm_vcpu *vcpu,
>> + struct vmcs12 *vmcs12)
>> +{
>> +u32 index = vcpu->arch.regs[VCPU_REGS_RCX];
>> +u64 *l1_eptp_list, address;
>> +struct page *page;
>> +
>> +if (!nested_cpu_has_eptp_switching(vmcs12) ||
>> +!nested_cpu_has_ept(vmcs12))
>> +return 1;
>> +
>> +if (index >= VMFUNC_EPTP_ENTRIES)
>> +return 1;
>> +
>> +page = nested_get_page(vcpu, vmcs12->eptp_list_address);
>> +if (!page)
>> +return 1;
>> +
>> +l1_eptp_list = kmap(page);
>> +address = l1_eptp_list[index];
>> +
>> +/*
>> + * If the (L2) guest does a vmfunc to the currently
>> + * active ept pointer, we don't have to do anything else
>> + */
>> +if (vmcs12->ept_pointer != address) {
>> +if (!check_ept_address_valid(vcpu, address)) {
>> +kunmap(page);
>> +nested_release_page_clean(page);
>> +return 1;
>> +}
>> +kvm_mmu_unload(vcpu);
>> +vmcs12->ept_pointer = address;
>> +/*
>> + * TODO: Check what's the correct approach in case
>> + * mmu reload fails. Currently, we just let the next
>> + * reload potentially fail
>> + */
>> +kvm_mmu_reload(vcpu);
>
> So, what actually happens if this generates a triple fault? I guess we
> will kill the (nested) hypervisor?

Yes. Not sure what the right thing to do is, though...

Bandan

>> +}
>> +
>> +kunmap(page);
>> +nested_release_page_clean(page);
>> +ret

Re: [RFC PATCH v2 00/38] Nested Virtualization on KVM/ARM

2017-07-28 Thread Bandan Das
Jintack Lim  writes:
...
>>
>> I'll share my experiment setup shortly.
>
> I summarized my experiment setup here.
>
> https://github.com/columbia/nesting-pub/wiki/Nested-virtualization-on-ARM-setup

Thanks Jintack! I was able to test L2 boot up with these instructions.

Next, I will try to run some simple tests. Any suggestions on reducing the L2
bootup time in my test setup? I think I will try to make the L2 kernel print
fewer messages, and maybe just get rid of some of the userspace services.
I also applied the patch to reduce the timer frequency, btw.

Bandan

>>
>> Even though this work has some limitations and TODOs, I'd appreciate early
>> feedback on this RFC. Specifically, I'm interested in:
>>
>> - Overall design to manage vcpu context for the virtual EL2
>> - Verifying correct EL2 register configurations such as HCR_EL2, CPTR_EL2
>>   (Patch 30 and 32)
>> - Patch organization and coding style
>
> I also wonder if the hardware and/or KVM do not support nested
> virtualization but the userspace uses nested virtualization option,
> which one is better: giving an error or launching a regular VM
> silently.
>
>>
>> This patch series is based on kvm/next d38338e.
>> The whole patch series including memory, VGIC, and timer patches is available
>> here:
>>
>> g...@github.com:columbia/nesting-pub.git rfc-v2
>>
>> Limitations:
>> - There are some cases that the target exception level of a VM is ambiguous 
>> when
>>   emulating eret instruction. I'm discussing this issue with Christoffer and
>>   Marc. Meanwhile, I added a temporary patch (not included in this
>>   series. f1beaba in the repo) and used 4.10.0 kernel when testing the guest
>>   hypervisor with VHE.
>> - Recursive nested virtualization is not tested yet.
>> - Other hypervisors (such as Xen) on KVM are not tested.
>>
>> TODO:
>> - Submit memory, VGIC, and timer patches
>> - Evaluate regular VM performance to see if there's a negative impact.
>> - Test other hypervisors such as Xen on KVM
>> - Test recursive nested virtualization
>>
>> v1-->v2:
>> - Added support for the virtual EL2 with VHE
>> - Rewrote commit messages and comments from the perspective of supporting
>>   execution environments to VMs, rather than from the perspective of the 
>> guest
>>   hypervisor running in them.
>> - Fixed a few bugs to make it run on the FastModel.
>> - Tested on ARMv8.3 with four configurations. (host/guest. with/without VHE.)
>> - Rebased to kvm/next
>>
>> [1] 
>> https://www.community.arm.com/processors/b/blog/posts/armv8-a-architecture-2016-additions
>>
>> Christoffer Dall (7):
>>   KVM: arm64: Add KVM nesting feature
>>   KVM: arm64: Allow userspace to set PSR_MODE_EL2x
>>   KVM: arm64: Add vcpu_mode_el2 primitive to support nesting
>>   KVM: arm/arm64: Add a framework to prepare virtual EL2 execution
>>   arm64: Add missing TCR hw defines
>>   KVM: arm64: Create shadow EL1 registers
>>   KVM: arm64: Trap EL1 VM register accesses in virtual EL2
>>
>> Jintack Lim (31):
>>   arm64: Add ARM64_HAS_NESTED_VIRT feature
>>   KVM: arm/arm64: Enable nested virtualization via command-line
>>   KVM: arm/arm64: Check if nested virtualization is in use
>>   KVM: arm64: Add EL2 system registers to vcpu context
>>   KVM: arm64: Add EL2 special registers to vcpu context
>>   KVM: arm64: Add the shadow context for virtual EL2 execution
>>   KVM: arm64: Set vcpu context depending on the guest exception level
>>   KVM: arm64: Synchronize EL1 system registers on virtual EL2 entry and
>> exit
>>   KVM: arm64: Move exception macros and enums to a common file
>>   KVM: arm64: Support to inject exceptions to the virtual EL2
>>   KVM: arm64: Trap SPSR_EL1, ELR_EL1 and VBAR_EL1 from virtual EL2
>>   KVM: arm64: Trap CPACR_EL1 access in virtual EL2
>>   KVM: arm64: Handle eret instruction traps
>>   KVM: arm64: Set a handler for the system instruction traps
>>   KVM: arm64: Handle PSCI call via smc from the guest
>>   KVM: arm64: Inject HVC exceptions to the virtual EL2
>>   KVM: arm64: Respect virtual HCR_EL2.TWX setting
>>   KVM: arm64: Respect virtual CPTR_EL2.TFP setting
>>   KVM: arm64: Add macros to support the virtual EL2 with VHE
>>   KVM: arm64: Add EL2 registers defined in ARMv8.1 to vcpu context
>>   KVM: arm64: Emulate EL12 register accesses from the virtual EL2
>>   KVM: arm64: Support a VM with VHE considering EL0 of the VHE host
>>   KVM: arm64: Allow the virtual EL2 to access EL2 states without trap
>>   KVM: arm64: Manage the shadow states when virtual E2H bit enabled
>>   KVM: arm64: Trap and emulate CPTR_EL2 accesses via CPACR_EL1 from the
>> virtual EL2 with VHE
>>   KVM: arm64: Emulate appropriate VM control system registers
>>   KVM: arm64: Respect the virtual HCR_EL2.NV bit setting
>>   KVM: arm64: Respect the virtual HCR_EL2.NV bit setting for EL12
>> register traps
>>   KVM: arm64: Respect virtual HCR_EL2.TVM and TRVM settings
>>   KVM: arm64: Respect the virtual HCR_EL2.NV1 bit setting
>>   KVM: arm64: Respect the virtual CPTR_EL2.TCPAC s

[PATCH v5 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-28 Thread Bandan Das
When L2 uses vmfunc, L0 utilizes the associated vmexit to
emulate a switching of the ept pointer by reloading the
guest MMU.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |   6 +++
 arch/x86/kvm/vmx.c | 124 ++---
 2 files changed, 124 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index da5375e..5f63a2e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -115,6 +115,10 @@
 #define VMX_MISC_SAVE_EFER_LMA 0x0020
 #define VMX_MISC_ACTIVITY_HLT  0x0040
 
+/* VMFUNC functions */
+#define VMX_VMFUNC_EPTP_SWITCHING   0x0001
+#define VMFUNC_EPTP_ENTRIES  512
+
 static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
 {
return vmx_basic & GENMASK_ULL(30, 0);
@@ -200,6 +204,8 @@ enum vmcs_field {
EOI_EXIT_BITMAP2_HIGH   = 0x2021,
EOI_EXIT_BITMAP3= 0x2022,
EOI_EXIT_BITMAP3_HIGH   = 0x2023,
+   EPTP_LIST_ADDRESS   = 0x2024,
+   EPTP_LIST_ADDRESS_HIGH  = 0x2025,
VMREAD_BITMAP   = 0x2026,
VMWRITE_BITMAP  = 0x2028,
XSS_EXIT_BITMAP = 0x202C,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index fe8f5fc..f1ab783 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -246,6 +246,7 @@ struct __packed vmcs12 {
u64 eoi_exit_bitmap1;
u64 eoi_exit_bitmap2;
u64 eoi_exit_bitmap3;
+   u64 eptp_list_address;
u64 xss_exit_bitmap;
u64 guest_physical_address;
u64 vmcs_link_pointer;
@@ -771,6 +772,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
FIELD64(EOI_EXIT_BITMAP2, eoi_exit_bitmap2),
FIELD64(EOI_EXIT_BITMAP3, eoi_exit_bitmap3),
+   FIELD64(EPTP_LIST_ADDRESS, eptp_list_address),
FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap),
FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
@@ -1402,6 +1404,13 @@ static inline bool nested_cpu_has_vmfunc(struct vmcs12 
*vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
 }
 
+static inline bool nested_cpu_has_eptp_switching(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has_vmfunc(vmcs12) &&
+   (vmcs12->vm_function_control &
+VMX_VMFUNC_EPTP_SWITCHING);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2791,7 +2800,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
if (cpu_has_vmx_vmfunc()) {
vmx->nested.nested_vmx_secondary_ctls_high |=
SECONDARY_EXEC_ENABLE_VMFUNC;
-   vmx->nested.nested_vmx_vmfunc_controls = 0;
+   /*
+* Advertise EPTP switching unconditionally
+* since we emulate it
+*/
+   vmx->nested.nested_vmx_vmfunc_controls =
+   VMX_VMFUNC_EPTP_SWITCHING;
}
 
/*
@@ -7767,6 +7781,85 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static bool check_ept_address_valid(struct kvm_vcpu *vcpu, u64 address)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   u64 mask = VMX_EPT_RWX_MASK;
+   int maxphyaddr = cpuid_maxphyaddr(vcpu);
+   struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
+
+   /* Check for execute_only validity */
+   if ((address & mask) == VMX_EPT_EXECUTABLE_MASK) {
+   if (!(vmx->nested.nested_vmx_ept_caps &
+ VMX_EPT_EXECUTE_ONLY_BIT))
+   return false;
+   }
+
+   /* Bits 5:3 must be 3 */
+   if (((address >> VMX_EPT_GAW_EPTP_SHIFT) & 0x7) != VMX_EPT_DEFAULT_GAW)
+   return false;
+
+   /* Reserved bits should not be set */
+   if (address >> maxphyaddr || ((address >> 7) & 0x1f))
+   return false;
+
+   /* AD, if set, should be supported */
+   if ((address & VMX_EPT_AD_ENABLE_BIT)) {
+   if (!enable_ept_ad_bits)
+   return false;
+   mmu->ept_ad = true;
+   } else
+   mmu->ept_ad = false;
+
+   return true;
+}
+
+static int nested_vmx_eptp_switching(struct kvm_vcpu *vcpu,
+struct vmcs12 *vmcs12)
+{
+   u32 index = vcpu->arch.regs[VCPU_REGS_RCX];
+   u64 *l1_eptp_list, address;
+   struct page *page;
+
+   if (!nested_cpu_has_eptp_switching(vmcs12) ||
+   !nested_cpu_has_ept(vmcs12))
+   return 1;
+
+   if (index >= VMFUNC_EPTP_ENTRIES)
+ 

[PATCH v5 2/3] KVM: nVMX: Enable VMFUNC for the L1 hypervisor

2017-07-28 Thread Bandan Das
Expose VMFUNC in MSRs and VMCS fields. No actual VMFUNCs are enabled.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/kvm/vmx.c | 53 +++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a483b49..fe8f5fc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -240,6 +240,7 @@ struct __packed vmcs12 {
u64 virtual_apic_page_addr;
u64 apic_access_addr;
u64 posted_intr_desc_addr;
+   u64 vm_function_control;
u64 ept_pointer;
u64 eoi_exit_bitmap0;
u64 eoi_exit_bitmap1;
@@ -481,6 +482,7 @@ struct nested_vmx {
u64 nested_vmx_cr4_fixed0;
u64 nested_vmx_cr4_fixed1;
u64 nested_vmx_vmcs_enum;
+   u64 nested_vmx_vmfunc_controls;
 };
 
 #define POSTED_INTR_ON  0
@@ -763,6 +765,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr),
+   FIELD64(VM_FUNCTION_CONTROL, vm_function_control),
FIELD64(EPT_POINTER, ept_pointer),
FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0),
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
@@ -1394,6 +1397,11 @@ static inline bool nested_cpu_has_posted_intr(struct 
vmcs12 *vmcs12)
return vmcs12->pin_based_vm_exec_control & PIN_BASED_POSTED_INTR;
 }
 
+static inline bool nested_cpu_has_vmfunc(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2780,6 +2788,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
} else
vmx->nested.nested_vmx_ept_caps = 0;
 
+   if (cpu_has_vmx_vmfunc()) {
+   vmx->nested.nested_vmx_secondary_ctls_high |=
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+   vmx->nested.nested_vmx_vmfunc_controls = 0;
+   }
+
/*
 * Old versions of KVM use the single-context version without
 * checking for support, so declare that it is supported even
@@ -3149,6 +3163,9 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 
msr_index, u64 *pdata)
*pdata = vmx->nested.nested_vmx_ept_caps |
((u64)vmx->nested.nested_vmx_vpid_caps << 32);
break;
+   case MSR_IA32_VMX_VMFUNC:
+   *pdata = vmx->nested.nested_vmx_vmfunc_controls;
+   break;
default:
return 1;
}
@@ -7752,7 +7769,29 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
 
 static int handle_vmfunc(struct kvm_vcpu *vcpu)
 {
-   kvm_queue_exception(vcpu, UD_VECTOR);
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct vmcs12 *vmcs12;
+   u32 function = vcpu->arch.regs[VCPU_REGS_RAX];
+
+   /*
+* VMFUNC is only supported for nested guests, but we always enable the
+* secondary control for simplicity; for non-nested mode, fake that we
+* didn't by injecting #UD.
+*/
+   if (!is_guest_mode(vcpu)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   vmcs12 = get_vmcs12(vcpu);
+   if ((vmcs12->vm_function_control & (1 << function)) == 0)
+   goto fail;
+   WARN_ONCE(1, "VMCS12 VM function control should have been zero");
+
+fail:
+   nested_vmx_vmexit(vcpu, vmx->exit_reason,
+ vmcs_read32(VM_EXIT_INTR_INFO),
+ vmcs_readl(EXIT_QUALIFICATION));
return 1;
 }
 
@@ -10053,7 +10092,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12,
exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
  SECONDARY_EXEC_RDTSCP |
  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
- SECONDARY_EXEC_APIC_REGISTER_VIRT);
+ SECONDARY_EXEC_APIC_REGISTER_VIRT |
+ SECONDARY_EXEC_ENABLE_VMFUNC);
if (nested_cpu_has(vmcs12,
   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) {
vmcs12_exec_ctrl = vmcs12->secondary_vm_exec_control &
@@ -10061,6 +10101,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12,
exec_control |= vmcs12_exec_ctrl;
}
 
+   /* All VMFUNCs are currently emulated through L0 vmexits.  */
+   if (exec_control & SECONDARY_EXEC_ENABLE_VMFUNC)
+   vmcs_write64(VM_FUNCTION_CO

[PATCH v5 0/3] Expose VMFUNC to the nested hypervisor

2017-07-28 Thread Bandan Das
v5:
 1/3 and 2/3 are unchanged but some changes in 3/3. I left
 the mmu_load failure path untouched because I am not sure what's
 the right thing to do here.
 3/3:
Move the eptp switching logic to a different function
Add check for EPTP_ADDRESS in check_vmentry_prereq
Add check for validity of ept pointer
Check if AD bit is set and set ept_ad
Add TODO item about mmu_unload failure
 
v4:
 https://lkml.org/lkml/2017/7/10/705
 2/3:  Use WARN_ONCE to avoid logging dos

v3:
 https://lkml.org/lkml/2017/7/10/684
 3/3: Add missing nested_release_page_clean() and check the
 eptp as mentioned in SDM 24.6.14

v2:
 https://lkml.org/lkml/2017/7/6/813
 1/3: Patch to enable vmfunc on the host but cause a #UD if
  L1 tries to use it directly. (new)
 2/3: Expose vmfunc to the nested hypervisor, but no vm functions
  are exposed and L0 emulates a vmfunc vmexit to L1. 
 3/3: Force a vmfunc vmexit when L2 tries to use vmfunc and emulate
  eptp switching. Unconditionally expose EPTP switching to the
  L1 hypervisor since L0 fakes eptp switching via a mmu reload.

These patches expose eptp switching/vmfunc to the nested hypervisor.
vmfunc is enabled in the secondary controls for the host and is
exposed to the nested hypervisor. However, if the nested hypervisor
decides to use eptp switching, L0 emulates it.

v1:
 https://lkml.org/lkml/2017/6/29/958

Bandan Das (3):
  KVM: vmx: Enable VMFUNCs
  KVM: nVMX: Enable VMFUNC for the L1 hypervisor
  KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

 arch/x86/include/asm/vmx.h |   9 +++
 arch/x86/kvm/vmx.c | 185 -
 2 files changed, 192 insertions(+), 2 deletions(-)

-- 
2.9.4



[PATCH v5 1/3] KVM: vmx: Enable VMFUNCs

2017-07-28 Thread Bandan Das
Enable VMFUNC in the secondary execution controls.  This simplifies the
changes necessary to expose it to nested hypervisors.  VMFUNCs still
cause #UD when invoked.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |  3 +++
 arch/x86/kvm/vmx.c | 22 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 35cd06f..da5375e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -72,6 +72,7 @@
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_RDRAND  0x0800
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
+#define SECONDARY_EXEC_ENABLE_VMFUNC0x2000
 #define SECONDARY_EXEC_SHADOW_VMCS  0x4000
 #define SECONDARY_EXEC_RDSEED  0x0001
 #define SECONDARY_EXEC_ENABLE_PML   0x0002
@@ -187,6 +188,8 @@ enum vmcs_field {
APIC_ACCESS_ADDR_HIGH   = 0x2015,
POSTED_INTR_DESC_ADDR   = 0x2016,
POSTED_INTR_DESC_ADDR_HIGH  = 0x2017,
+   VM_FUNCTION_CONTROL = 0x2018,
+   VM_FUNCTION_CONTROL_HIGH= 0x2019,
EPT_POINTER = 0x201a,
EPT_POINTER_HIGH= 0x201b,
EOI_EXIT_BITMAP0= 0x201c,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ca5d2b9..a483b49 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1314,6 +1314,12 @@ static inline bool cpu_has_vmx_tsc_scaling(void)
SECONDARY_EXEC_TSC_SCALING;
 }
 
+static inline bool cpu_has_vmx_vmfunc(void)
+{
+   return vmcs_config.cpu_based_2nd_exec_ctrl &
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+}
+
 static inline bool report_flexpriority(void)
 {
return flexpriority_enabled;
@@ -3575,7 +3581,8 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
SECONDARY_EXEC_SHADOW_VMCS |
SECONDARY_EXEC_XSAVES |
SECONDARY_EXEC_ENABLE_PML |
-   SECONDARY_EXEC_TSC_SCALING;
+   SECONDARY_EXEC_TSC_SCALING |
+   SECONDARY_EXEC_ENABLE_VMFUNC;
if (adjust_vmx_controls(min2, opt2,
MSR_IA32_VMX_PROCBASED_CTLS2,
&_cpu_based_2nd_exec_control) < 0)
@@ -5233,6 +5240,9 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */
 #endif
 
+   if (cpu_has_vmx_vmfunc())
+   vmcs_write64(VM_FUNCTION_CONTROL, 0);
+
vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
@@ -7740,6 +7750,12 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static int handle_vmfunc(struct kvm_vcpu *vcpu)
+{
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -7790,6 +7806,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
kvm_vcpu *vcpu) = {
[EXIT_REASON_XSAVES]  = handle_xsaves,
[EXIT_REASON_XRSTORS] = handle_xrstors,
[EXIT_REASON_PML_FULL]= handle_pml_full,
+   [EXIT_REASON_VMFUNC]  = handle_vmfunc,
[EXIT_REASON_PREEMPTION_TIMER]= handle_preemption_timer,
 };
 
@@ -8111,6 +8128,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
case EXIT_REASON_PML_FULL:
/* We emulate PML support to L1. */
return false;
+   case EXIT_REASON_VMFUNC:
+   /* VM functions are emulated through L2->L0 vmexits. */
+   return false;
default:
return true;
}
-- 
2.9.4



Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-19 Thread Bandan Das
Radim Krčmář  writes:

> 2017-07-17 13:58-0400, Bandan Das:
>> Radim Krčmář  writes:
>> ...
>>>> > and no other mentions of a VM exit, so I think that the VM exit happens
>>>> > only under these conditions:
>>>> >
>>>> >   — The EPT memory type (bits 2:0) must be a value supported by the
>>>> > processor as indicated in the IA32_VMX_EPT_VPID_CAP MSR (see
>>>> > Appendix A.10).
>>>> >   — Bits 5:3 (1 less than the EPT page-walk length) must be 3, indicating
>>>> > an EPT page-walk length of 4; see Section 28.2.2.
>>>> >   — Bit 6 (enable bit for accessed and dirty flags for EPT) must be 0 if
>>>> > bit 21 of the IA32_VMX_EPT_VPID_CAP MSR (see Appendix A.10) is read
>>>> > as 0, indicating that the processor does not support accessed and
>>>> > dirty flags for EPT.
>>>> >   — Reserved bits 11:7 and 63:N (where N is the processor’s
>>>> > physical-address width) must all be 0.
>>>> >
>>>> > And it looks like we need parts of nested_ept_init_mmu_context() to
>>>> > properly handle VMX_EPT_AD_ENABLE_BIT.
>>>> 
>>>> I completely ignored AD and the #VE sections. I will add a TODO item
>>>> in the comment section.
>>>
>>> AFAIK, we don't support #VE, but AD would be nice to handle from the
>>> beginning.  (I think that calling nested_ept_init_mmu_context() as-is
>>> isn't that bad.)
>> 
>> I went back to the spec to take a look at the AD handling. It doesn't look
>> like anything needs to be done since nested_ept_init_mmu_context() is already
>> being called with the correct eptp in prepare_vmcs02 ? Anything else that
>> needs to be done for AD handling in vmfunc context ?
>
> AD is decided by EPTP bit 6, so it can be toggled by EPTP switching and
> we don't call prepare_vmcs02() after emulating VMFUNC vm exit.
> We want to forward the new AD configuration to KVM's MMU.

Thanks, I had incorrectly assumed that prepare_vmcs02 will be called after
an exit unconditionally. I will work something up soon.
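Roughly, I'd expect the direction discussed above to look like the sketch
below (hypothetical names, not the hunk that was eventually posted): re-derive
the nested EPT context from the new pointer while emulating the switch, since
prepare_vmcs02() never runs on this path and the AD setting (EPTP bit 6)
travels with the EPTP value.

static void nested_switch_eptp(struct kvm_vcpu *vcpu,
			       struct vmcs12 *vmcs12, u64 new_eptp)
{
	kvm_mmu_unload(vcpu);
	vmcs12->ept_pointer = new_eptp;
	/*
	 * Assumption for this sketch: nested_ept_init_mmu_context() picks
	 * up accessed/dirty support from the updated vmcs12->ept_pointer
	 * (bit 6) when it rebuilds the nested EPT MMU.
	 */
	nested_ept_init_mmu_context(vcpu);
	kvm_mmu_reload(vcpu);
}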

Bandan


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-17 Thread Bandan Das
Radim Krčmář  writes:
...
>> > and no other mentions of a VM exit, so I think that the VM exit happens
>> > only under these conditions:
>> >
>> >   — The EPT memory type (bits 2:0) must be a value supported by the
>> > processor as indicated in the IA32_VMX_EPT_VPID_CAP MSR (see
>> > Appendix A.10).
>> >   — Bits 5:3 (1 less than the EPT page-walk length) must be 3, indicating
>> > an EPT page-walk length of 4; see Section 28.2.2.
>> >   — Bit 6 (enable bit for accessed and dirty flags for EPT) must be 0 if
>> > bit 21 of the IA32_VMX_EPT_VPID_CAP MSR (see Appendix A.10) is read
>> > as 0, indicating that the processor does not support accessed and
>> > dirty flags for EPT.
>> >   — Reserved bits 11:7 and 63:N (where N is the processor’s
>> > physical-address width) must all be 0.
>> >
>> > And it looks like we need parts of nested_ept_init_mmu_context() to
>> > properly handle VMX_EPT_AD_ENABLE_BIT.
>> 
>> I completely ignored AD and the #VE sections. I will add a TODO item
>> in the comment section.
>
> AFAIK, we don't support #VE, but AD would be nice to handle from the
> beginning.  (I think that calling nested_ept_init_mmu_context() as-is
> isn't that bad.)

I went back to the spec to take a look at the AD handling. It doesn't look
like anything needs to be done since nested_ept_init_mmu_context() is already
being called with the correct eptp in prepare_vmcs02 ? Anything else that
needs to be done for AD handling in vmfunc context ?

Thanks,
Bandan


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-13 Thread Bandan Das
David Hildenbrand  writes:

 +  /*
 +   * If the (L2) guest does a vmfunc to the currently
 +   * active ept pointer, we don't have to do anything else
 +   */
 +  if (vmcs12->ept_pointer != address) {
 +  if (address >> cpuid_maxphyaddr(vcpu) ||
 +  !IS_ALIGNED(address, 4096))
>>>
>>> Couldn't the pfn still be invalid and make kvm_mmu_reload() fail?
>>> (triggering a KVM_REQ_TRIPLE_FAULT)
>> 
>> If there's a triple fault, I think it's a good idea to inject it
>> back. Basically, there's no need to take care of damage control
>> that L1 is intentionally doing.
>
> I quickly rushed over the massive amount of comments. Sounds like you'll
> be preparing a v5. Would be great if you could add some comments that
> were the result of this discussion (for parts that are not that obvious
> - triple faults) - thanks!

Will do. Basically, we agreed that we don't need to do anything with
mmu_reload() failures, because the invalid eptp that mmu_unload will write
to root_hpa will result in an ept violation.

Bandan


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-12 Thread Bandan Das
Radim Krčmář  writes:
...
>> Why do you think it's a bug ?
>
> SDM defines a different behavior and hardware doesn't do that either.
> There are only two reasons for a VMFUNC VM exit from EPTP switching:
>
>  1) ECX > 0
>  2) EPTP would cause VM entry to fail if in VMCS.EPT_POINTER
>
> KVM can fail for other reasons because of its bugs, but that should be
> notified to the guest in another way.  Rebooting the guest is kind of
> acceptable in that case.
>
>>   The eptp switching function really didn't
>> succeed as far as our emulation goes when kvm_mmu_reload() fails.
>> And as such, the generic vmexit failure event should be a vmfunc vmexit.
>
> I interpret it as two separate events -- at first, the vmfunc succeeds
> and when it later tries to access memory through the new EPTP (valid,
> but not pointing into backed memory), it results in a EPT_MISCONFIG VM
> exit.
>
>> We cannot strictly follow the spec here, the spec doesn't even mention a way
>> to emulate eptp switching.  If setting up the switching succeeded and the
>> new root pointer is invalid or whatever, I really don't care what happens
>> next but this is not the case. We fail to get a new root pointer and without
>> that, we can't even make a switch!
>
> We just make it behave exactly how the spec says that it behaves.  We do
> have a value (we read 'address') to put into VMCS.EPT_POINTER, which is
> all we need for the emulation.
> The function doesn't dereference that pointer, it just looks at its
> value to decide whether it is valid or not.  (btw. we should check that
> properly, because we cannot depend on VM entry failure pass-through like
> the normal case.)
>
> The dereference done in kvm_mmu_reload() should happen after EPTP
> switching finishes, because the spec doesn't mention a VM exit for other
> reason than invalid EPT_POINTER value.
>
>>> just keep the original bug -- we want to eventually fix it and it's no
>>> worse till then.
>> 
>> Anyway, can you please confirm again what is the behavior that you
>> are expecting if kvm_mmu_reload fails ? This would be a rarely used
>> branch and I am actually fine diverging from what I think is right if
>> I can get the reviewers to agree on a common thing.
>
> kvm_mmu_reload() fails when mmu_check_root() is false, which means that
> the pointed physical address is not backed.  We've hit this corner-case
> in the past -- Jim said that the chipset returns all 1s if a read is not
> claimed.
>
> So in theory, KVM should not fail kvm_mmu_reload(), but behave as if the
> pointer pointed to a memory of all 1s, which would likely result in
> EPT_MISCONFIG when the guest does a memory access.

As much as I would like to disagree with you, I have already spent way more
time on this than I want. Please let's just leave it here, then? The mmu unload
will make sure there's an invalid root hpa and whatever happens next, happens.

> It is a mishandled corner case, but turning it into VM exit would only
> confuse an OS that receives the impossible VM exit and potentially
> confuse reader of the KVM logic.
>
> I think that not using kvm_mmu_reload() directly in EPTP switching is
> best.  The bug is not really something we care about.


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-12 Thread Bandan Das
Radim Krčmář  writes:
...
>> > Thanks, we're not here to judge the guest, but to provide a bare-metal
>> > experience. :)
>> 
>> There are certain cases where we do. For example, when L2 instruction emulation
>> fails, we decide to kill L2 instead of injecting the error to L1 and letting it
>> handle that. Anyway, that's a different topic; I was just trying to point out
>> that there are cases where kvm makes a somewhat policy-based decision...
>
> Emulation failure is a KVM bug and we are too lazy to implement the
> bare-metal behavior correctly, but avoiding the EPTP list bug is
> actually easier than introducing it.  You can make KVM simpler and
> improve bare-metal emulation at the same time.

We are just talking past each other here, trying to impose points of view.
Checking for 0 makes KVM simpler. As I said before, a 0 list_address means
that the hypervisor forgot to initialize it. Feel free to show me examples
where the hypervisor does indeed use a 0 address for eptp list address or
anything vm specific. You disagreed and I am fine with it.


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-11 Thread Bandan Das
Radim Krčmář  writes:

> 2017-07-11 16:34-0400, Bandan Das:
>> Radim Krčmář  writes:
>> 
>> > 2017-07-11 15:50-0400, Bandan Das:
>> >> Radim Krčmář  writes:
>> >> > 2017-07-11 14:24-0400, Bandan Das:
>> >> >> Bandan Das  writes:
>> >> >> > If there's a triple fault, I think it's a good idea to inject it
>> >> >> > back. Basically, there's no need to take care of damage control
>> >> >> > that L1 is intentionally doing.
>> >> >> >
>> >> >> >>> +  goto fail;
>> >> >> >>> +  kvm_mmu_unload(vcpu);
>> >> >> >>> +  vmcs12->ept_pointer = address;
>> >> >> >>> +  kvm_mmu_reload(vcpu);
>> >> >> >>
>> >> >> >> I was thinking about something like this:
>> >> >> >>
>> >> >> >> kvm_mmu_unload(vcpu);
>> >> >> >> old = vmcs12->ept_pointer;
>> >> >> >> vmcs12->ept_pointer = address;
>> >> >> >> if (kvm_mmu_reload(vcpu)) {
>> >> >> >> /* pointer invalid, restore previous state */
>> >> >> >> kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
>> >> >> >> vmcs12->ept_pointer = old;
>> >> >> >> kvm_mmu_reload(vcpu);
>> >> >> >> goto fail;
>> >> >> >> }
>> >> >> >>
>> >> >> >> Then you can inherit the checks from mmu_check_root().
>> >> >> 
>> >> >> Actually, thinking about this a bit more, I agree with you. Any fault
>> >> >> with a vmfunc operation should end with a vmfunc vmexit, so this
>> >> >> is a good thing to have. Thank you for this idea! :)
>> >> >
>> >> > SDM says
>> >> >
>> >> >   IF tent_EPTP is not a valid EPTP value (would cause VM entry to fail
>> >> >   if in EPTP) THEN VMexit;
>> >> 
>> >> This section here:
>> >> As noted in Section 25.5.5.2, an execution of the
>> >> EPTP-switching VM function that causes a VM exit (as specified
>> >> above), uses the basic exit reason 59, indicating “VMFUNC”.
>> >> The length of the VMFUNC instruction is saved into the
>> >> VM-exit instruction-length field. No additional VM-exit
>> >> information is provided.
>> >> 
>> >> Although, it adds (as specified above), from testing, any vmexit that
>> >> happens as a result of the execution of the vmfunc instruction always
>> >> has exit reason 59.
>> >> 
>> >> IMO, the case David pointed out comes under "as a result of the
>> >> execution of the vmfunc instruction", so I would prefer exiting
>> >> with reason 59.
>> >
>> > Right, the exit reason is 59 for reasons that trigger a VM exit
>> > (i.e. invalid EPTP value, the four below), but kvm_mmu_reload() checks
>> > unrelated stuff.
>> >
>> > If the EPTP value is correct, then the switch should succeed.
>> > If the EPTP is correct, but bogus, then the guest should get
>> > EPT_MISCONFIG VM exit on its first access (when reading the
>> > instruction).  Source: I added
>> 
>> My point is that we are using kvm_mmu_reload() to emulate eptp
>> switching. If that emulation of vmfunc fails, it should exit with reason
>> 59.
>
> Yeah, we just disagree on what is a vmfunc failure.
>
>> >   vmcs_write64(EPT_POINTER, vmcs_read64(EPT_POINTER) | (1ULL << 40));
>> >
>> > shortly before a VMLAUNCH on L0. :)
>> 
>> What happens if this ept pointer is actually in the eptp list and the guest
>> switches to it using vmfunc ? I think it will exit with reason 59.
>
> I think otherwise, because it doesn't cause a VM entry failure on
> bare-metal (and SDM says that we get a VM exit only if there would be a
> VM entry failure).
> I expect the vmfunc to succeed and to get a EPT_MISCONFIG right after.
> (Experiment pending :])
>
>> > I think that we might be emulating this case incorrectly and throwing
>> > triple faults when it should be VM exits in vcpu_run().
>> 
>> No, I agree with not throwing a triple fault. We should clear it out.
>> But we should emulate a vmfunc vmexit back to L1 when kvm_mmu_load fails.
>
> Here we disagree.  I think that it's a bug do the VM exit, so we can

Why do you think it's a bug ? The eptp switching function really didn't
succeed as far as our emulation goes when kvm_mmu_reload() fails.
And as such, the generic vmexit failure event should be a vmfunc vmexit.
We cannot strictly follow the spec here, the spec doesn't even mention a way
to emulate eptp switching. If setting up the switching succeeded and the
new root pointer is invalid or whatever, I really don't care what happens
next but this is not the case. We fail to get a new root pointer and without
that, we can't even make a switch!

> just keep the original bug -- we want to eventually fix it and it's no
> worse till then.

Anyway, can you please confirm again what is the behavior that you
are expecting if kvm_mmu_reload fails ? This would be a rarely used
branch and I am actually fine diverging from what I think is right if
I can get the reviewers to agree on a common thing.

(Thanks for giving this a closer look, Radim. I really appreciate it.)

Bandan



Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-11 Thread Bandan Das
Radim Krčmář  writes:

> 2017-07-11 15:38-0400, Bandan Das:
>> Radim Krčmář  writes:
>> 
>> > 2017-07-11 14:35-0400, Bandan Das:
>> >> Jim Mattson  writes:
>> >> ...
>> >> >>> I can find the definition for an vmexit in case of index >=
>> >> >>> VMFUNC_EPTP_ENTRIES, but not for !vmcs12->eptp_list_address in the 
>> >> >>> SDM.
>> >> >>>
>> >> >>> Can you give me a hint?
>> >> >>
>> >> >> I don't think there is. Since, we are basically emulating eptp 
>> >> >> switching
>> >> >> for L2, this is a good check to have.
>> >> >
>> >> > There is nothing wrong with a hypervisor using physical page 0 for
>> >> > whatever purpose it likes, including an EPTP list.
>> >> 
>> >> Right, but of all the things, an L1 hypervisor wanting page 0 for an eptp list
>> >> address most likely means it forgot to initialize it. Whatever damage it does
>> >> will still end up with a vmfunc vmexit anyway.
>> >
>> > Most likely, but not certainly.  I also don't see a reason to diverge from the
>> > spec here.
>> 
>> Actually, this is a specific case where I would like to diverge from the 
>> spec.
>> But then again, it's L1 shooting itself in the foot and this would be a 
>> rarely
>> used code path, so, I am fine removing it.
>
> Thanks, we're not here to judge the guest, but to provide a bare-metal
> experience. :)

There are certain cases where we do. For example, when L2 instruction emulation
fails, we decide to kill L2 instead of injecting the error to L1 and letting it
handle that. Anyway, that's a different topic; I was just trying to point out
that there are cases where kvm makes a somewhat policy-based decision...


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-11 Thread Bandan Das
Radim Krčmář  writes:

> 2017-07-11 15:50-0400, Bandan Das:
>> Radim Krčmář  writes:
>> > 2017-07-11 14:24-0400, Bandan Das:
>> >> Bandan Das  writes:
>> >> > If there's a triple fault, I think it's a good idea to inject it
>> >> > back. Basically, there's no need to take care of damage control
>> >> > that L1 is intentionally doing.
>> >> >
>> >> >>> + goto fail;
>> >> >>> + kvm_mmu_unload(vcpu);
>> >> >>> + vmcs12->ept_pointer = address;
>> >> >>> + kvm_mmu_reload(vcpu);
>> >> >>
>> >> >> I was thinking about something like this:
>> >> >>
>> >> >> kvm_mmu_unload(vcpu);
>> >> >> old = vmcs12->ept_pointer;
>> >> >> vmcs12->ept_pointer = address;
>> >> >> if (kvm_mmu_reload(vcpu)) {
>> >> >>/* pointer invalid, restore previous state */
>> >> >>kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
>> >> >>vmcs12->ept_pointer = old;
>> >> >>kvm_mmu_reload(vcpu);
>> >> >>goto fail;
>> >> >> }
>> >> >>
>> >> >> Then you can inherit the checks from mmu_check_root().
>> >> 
>> >> Actually, thinking about this a bit more, I agree with you. Any fault
>> >> with a vmfunc operation should end with a vmfunc vmexit, so this
>> >> is a good thing to have. Thank you for this idea! :)
>> >
>> > SDM says
>> >
>> >   IF tent_EPTP is not a valid EPTP value (would cause VM entry to fail
>> >   if in EPTP) THEN VMexit;
>> 
>> This section here:
>> As noted in Section 25.5.5.2, an execution of the
>> EPTP-switching VM function that causes a VM exit (as specified
>> above), uses the basic exit reason 59, indicating “VMFUNC”.
>> The length of the VMFUNC instruction is saved into the
>> VM-exit instruction-length field. No additional VM-exit
>> information is provided.
>> 
>> Although, it adds (as specified above), from testing, any vmexit that
>> happens as a result of the execution of the vmfunc instruction always
>> has exit reason 59.
>> 
>> IMO, the case David pointed out comes under "as a result of the
>> execution of the vmfunc instruction", so I would prefer exiting
>> with reason 59.
>
> Right, the exit reason is 59 for reasons that trigger a VM exit
> (i.e. invalid EPTP value, the four below), but kvm_mmu_reload() checks
> unrelated stuff.
>
> If the EPTP value is correct, then the switch should succeed.
> If the EPTP is correct, but bogus, then the guest should get
> EPT_MISCONFIG VM exit on its first access (when reading the
> instruction).  Source: I added

My point is that we are using kvm_mmu_reload() to emulate eptp
switching. If that emulation of vmfunc fails, it should exit with reason
59.

>   vmcs_write64(EPT_POINTER, vmcs_read64(EPT_POINTER) | (1ULL << 40));
>
> shortly before a VMLAUNCH on L0. :)

What happens if this ept pointer is actually in the eptp list and the guest
switches to it using vmfunc ? I think it will exit with reason 59.

> I think that we might be emulating this case incorrectly and throwing
> triple faults when it should be VM exits in vcpu_run().

No, I agree with not throwing a triple fault. We should clear it out.
But we should emulate a vmfunc vmexit back to L1 when kvm_mmu_load fails.

>> > and no other mentions of a VM exit, so I think that the VM exit happens
>> > only under these conditions:
>> >
>> >   — The EPT memory type (bits 2:0) must be a value supported by the
>> > processor as indicated in the IA32_VMX_EPT_VPID_CAP MSR (see
>> > Appendix A.10).
>> >   — Bits 5:3 (1 less than the EPT page-walk length) must be 3, indicating
>> > an EPT page-walk length of 4; see Section 28.2.2.
>> >   — Bit 6 (enable bit for accessed and dirty flags for EPT) must be 0 if
>> > bit 21 of the IA32_VMX_EPT_VPID_CAP MSR (see Appendix A.10) is read
>> > as 0, indicating that the processor does not support accessed and
>> > dirty flags for EPT.
>> >   — Reserved bits 11:7 and 63:N (where N is the processor’s
>> > physical-address width) must all be 0.
>> >
>> > And it looks like we need parts of nested_ept_init_mmu_context() to
>> > properly handle VMX_EPT_AD_ENABLE_BIT.
>> 
>> I completely ignored AD and the #VE sections. I will add a TODO item
>> in the comment section.
>
> AFAIK, we don't support #VE, but AD would be nice to handle from the

Nevertheless, it's good to have the nested hypervisor be able to use it
just like vmfunc.

> beginning.  (I think that calling nested_ept_init_mmu_context() as-is
> isn't that bad.)

Ok, I will take a look.


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-11 Thread Bandan Das
Radim Krčmář  writes:

> 2017-07-11 14:24-0400, Bandan Das:
>> Bandan Das  writes:
>> > If there's a triple fault, I think it's a good idea to inject it
>> > back. Basically, there's no need to take care of damage control
>> > that L1 is intentionally doing.
>> >
>> >>> +goto fail;
>> >>> +kvm_mmu_unload(vcpu);
>> >>> +vmcs12->ept_pointer = address;
>> >>> +kvm_mmu_reload(vcpu);
>> >>
>> >> I was thinking about something like this:
>> >>
>> >> kvm_mmu_unload(vcpu);
>> >> old = vmcs12->ept_pointer;
>> >> vmcs12->ept_pointer = address;
>> >> if (kvm_mmu_reload(vcpu)) {
>> >>   /* pointer invalid, restore previous state */
>> >>   kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
>> >>   vmcs12->ept_pointer = old;
>> >>   kvm_mmu_reload(vcpu);
>> >>   goto fail;
>> >> }
>> >>
>> >> Then you can inherit the checks from mmu_check_root().
>> 
>> Actually, thinking about this a bit more, I agree with you. Any fault
>> with a vmfunc operation should end with a vmfunc vmexit, so this
>> is a good thing to have. Thank you for this idea! :)
>
> SDM says
>
>   IF tent_EPTP is not a valid EPTP value (would cause VM entry to fail
>   if in EPTP) THEN VMexit;

This section here:
As noted in Section 25.5.5.2, an execution of the
EPTP-switching VM function that causes a VM exit (as specified
above), uses the basic exit reason 59, indicating “VMFUNC”.
The length of the VMFUNC instruction is saved into the
VM-exit instruction-length field. No additional VM-exit
information is provided.

Although, it adds (as specified above), from testing, any vmexit that
happens as a result of the execution of the vmfunc instruction always
has exit reason 59.

IMO, the case David pointed out comes under "as a result of the
execution of the vmfunc instruction", so I would prefer exiting
with reason 59.

> and no other mentions of a VM exit, so I think that the VM exit happens
> only under these conditions:
>
>   — The EPT memory type (bits 2:0) must be a value supported by the
> processor as indicated in the IA32_VMX_EPT_VPID_CAP MSR (see
> Appendix A.10).
>   — Bits 5:3 (1 less than the EPT page-walk length) must be 3, indicating
> an EPT page-walk length of 4; see Section 28.2.2.
>   — Bit 6 (enable bit for accessed and dirty flags for EPT) must be 0 if
> bit 21 of the IA32_VMX_EPT_VPID_CAP MSR (see Appendix A.10) is read
> as 0, indicating that the processor does not support accessed and
> dirty flags for EPT.
>   — Reserved bits 11:7 and 63:N (where N is the processor’s
> physical-address width) must all be 0.
>
> And it looks like we need parts of nested_ept_init_mmu_context() to
> properly handle VMX_EPT_AD_ENABLE_BIT.

I completely ignored AD and the #VE sections. I will add a TODO item
in the comment section.
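(For the record, the four conditions quoted above come out roughly as the
standalone check below. The supported memory types and the AD capability
would really be read from IA32_VMX_EPT_VPID_CAP, so UC/WB and the parameter
names here are just assumptions for the sketch.)

#include <stdbool.h>
#include <stdint.h>

static bool eptp_would_be_valid(uint64_t eptp, unsigned int maxphyaddr,
				bool ad_supported)
{
	uint64_t memtype  = eptp & 0x7;		/* bits 2:0 */
	uint64_t walk_len = (eptp >> 3) & 0x7;	/* bits 5:3 */

	if (memtype != 0 && memtype != 6)	/* assumed: only UC/WB supported */
		return false;
	if (walk_len != 3)			/* must encode a 4-level walk */
		return false;
	if ((eptp & (1ULL << 6)) && !ad_supported)	/* AD bit */
		return false;
	if (eptp & (0x1fULL << 7))		/* reserved bits 11:7 */
		return false;
	if (maxphyaddr < 64 && (eptp >> maxphyaddr))	/* reserved bits 63:N */
		return false;
	return true;
}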

> The KVM_REQ_TRIPLE_FAULT can be handled by kvm_mmu_reload in vcpu_run if
> we just invalidate the MMU.


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-11 Thread Bandan Das
Radim Krčmář  writes:

> 2017-07-11 14:35-0400, Bandan Das:
>> Jim Mattson  writes:
>> ...
>> >>> I can find the definition for an vmexit in case of index >=
>> >>> VMFUNC_EPTP_ENTRIES, but not for !vmcs12->eptp_list_address in the SDM.
>> >>>
>> >>> Can you give me a hint?
>> >>
>> >> I don't think there is. Since, we are basically emulating eptp switching
>> >> for L2, this is a good check to have.
>> >
>> > There is nothing wrong with a hypervisor using physical page 0 for
>> > whatever purpose it likes, including an EPTP list.
>> 
>> Right, but of all the things, an L1 hypervisor wanting page 0 for an eptp list
>> address most likely means it forgot to initialize it. Whatever damage it does
>> will still end up with a vmfunc vmexit anyway.
>
> Most likely, but not certainly.  I also don't see a reason to diverge from the
> spec here.

Actually, this is a specific case where I would like to diverge from the spec.
But then again, it's L1 shooting itself in the foot and this would be a rarely
used code path, so, I am fine removing it.


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-11 Thread Bandan Das
Radim Krčmář  writes:

> 2017-07-11 14:05-0400, Bandan Das:
>> Radim Krčmář  writes:
>> 
>> > [David did a great review, so I'll just point out things I noticed.]
>> >
>> > 2017-07-11 09:51+0200, David Hildenbrand:
>> >> On 10.07.2017 22:49, Bandan Das wrote:
>> >> > When L2 uses vmfunc, L0 utilizes the associated vmexit to
>> >> > emulate a switching of the ept pointer by reloading the
>> >> > guest MMU.
>> >> > 
>> >> > Signed-off-by: Paolo Bonzini 
>> >> > Signed-off-by: Bandan Das 
>> >> > ---
>> >> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> >> > @@ -7784,11 +7801,46 @@ static int handle_vmfunc(struct kvm_vcpu *vcpu)
>> >> > }
>> >> >  
>> >> > vmcs12 = get_vmcs12(vcpu);
>> >> > -   if ((vmcs12->vm_function_control & (1 << function)) == 0)
>> >> > +   if (((vmcs12->vm_function_control & (1 << function)) == 0) ||
>> >> > +   WARN_ON_ONCE(function))
>> >> 
>> >> "... instruction causes a VM exit if the bit at position EAX is 0 in the
>> >> VM-function controls (the selected VM function is
>> >> not enabled)."
>> >> 
>> >> So g2 can trigger this WARN_ON_ONCE, no? I think we should drop it then
>> >> completely.
>> >
>> > It assumes that vm_function_control is not > 1, which is (should be)
>> > guaranteed by VM entry check, because the nested_vmx_vmfunc_controls MSR
>> > is 1.
>> >
>> >> > +   goto fail;
>> >
>> > The rest of the code assumes that the function is
>> > VMX_VMFUNC_EPTP_SWITCHING, so some WARN_ON_ONCE is reasonable.
>> >
>> > Writing it as
>> >
>> >   WARN_ON_ONCE(function != VMX_VMFUNC_EPTP_SWITCHING)
>> >
>> > would be clearer and I'd prefer to move the part that handles
>> > VMX_VMFUNC_EPTP_SWITCHING into a new function. (Imagine that Intel is
>> > going to add more than one VM FUNC. :])
>> 
>> IMO, for now, this should be fine because we are not even passing through the
>> hardware's eptp switching. Even if there are other vm functions, they
>> won't be available for the nested case and cause any conflict.
>
> Yeah, it is fine function-wise, I was just pointing out that it looks
> ugly to me.

Ok, lemme switch this to a switch statement style handler function. That way,
future additions would be easier.
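Something along these lines, say (just a sketch; nested_vmx_eptp_switching()
is an assumed helper name, not necessarily what will end up in the series):

static int handle_vmfunc(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	struct vmcs12 *vmcs12;
	u32 function = vcpu->arch.regs[VCPU_REGS_RAX];

	if (!is_guest_mode(vcpu)) {
		kvm_queue_exception(vcpu, UD_VECTOR);
		return 1;
	}

	vmcs12 = get_vmcs12(vcpu);
	if ((vmcs12->vm_function_control & (1 << function)) == 0)
		goto fail;

	switch (function) {
	case 0:	/* VMX_VMFUNC_EPTP_SWITCHING is bit 0, i.e. function 0 */
		if (nested_vmx_eptp_switching(vcpu, vmcs12))
			goto fail;
		break;
	default:
		goto fail;
	}
	return kvm_skip_emulated_instruction(vcpu);

fail:
	/* anything we can't (or shouldn't) handle becomes a VMFUNC exit to L1 */
	nested_vmx_vmexit(vcpu, vmx->exit_reason,
			  vmcs_read32(VM_EXIT_INTR_INFO),
			  vmcs_readl(EXIT_QUALIFICATION));
	return 1;
}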

> Btw. have you looked what we'd need to do for the hardware pass-through?
> I'd expect big changes to MMU. :)

Yes, the first version was actually using vmfunc 0 directly. Well, not exactly:
the first time would go through this path and then the next time the processor
would handle it directly. Paolo pointed out an issue that I was ready to fix but
he wasn't comfortable with the idea. I actually agree with him, it's too much
untested code for something that would be rarely used.

>> >> > +   if (!nested_cpu_has_ept(vmcs12) ||
>> >> > +   !nested_cpu_has_eptp_switching(vmcs12))
>> >> > +   goto fail;
>> >
>> > This brings me to a missing vm-entry check:
>> >
>> >  If “EPTP switching” VM-function control is 1, the “enable EPT”
>> >  VM-execution control must also be 1. In addition, the EPTP-list address
>> >  must satisfy the following checks:
>> >  • Bits 11:0 of the address must be 0.
>> >  • The address must not set any bits beyond the processor’s
>> >physical-address width.
>> >
>> > so this one could be
>> >
>> >   if (!nested_cpu_has_eptp_switching(vmcs12) ||
>> >   WARN_ON_ONCE(!nested_cpu_has_ept(vmcs12)))
>> 
>> I will reverse the order here but the vm entry check is unnecessary because
>> the check on the list address is already being done in this function.
>
> Here is too late, the nested VM-entry should have failed, never letting
> this situation happen.  We want an equivalent of
>
>   if (nested_cpu_has_eptp_switching(vmcs12) && !nested_cpu_has_ept(vmcs12))
>   return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
>
> in nested controls checks, right next to the reserved fields check.
> And then also the EPTP-list check.  All of them only checked when
> nested_cpu_has_vmfunc(vmcs12).

Actually, I misread 25.5.5.3. There are two checks. Here, the list entry
needs to be checked so that eptp won't cause a vmentry failure. The vmentry
needs to check the eptp list address itself. I will add that check for the
list address in the next version.
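i.e., something like the following in the prereq checks (rough form only, to
show the intent; the exact hunk may differ):

	if (nested_cpu_has_vmfunc(vmcs12)) {
		if (vmcs12->vm_function_control &
		    ~vmx->nested.nested_vmx_vmfunc_controls)
			return VMXERR_ENTRY_INVALID_CONTROL_FIELD;

		/* EPTP switching needs EPT and a sane EPTP-list address */
		if (nested_cpu_has_eptp_switching(vmcs12) &&
		    (!nested_cpu_has_ept(vmcs12) ||
		     !IS_ALIGNED(vmcs12->eptp_list_address, 4096) ||
		     (vmcs12->eptp_list_address >> cpuid_maxphyaddr(vcpu))))
			return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
	}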

Bandan


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-11 Thread Bandan Das
Jim Mattson  writes:
...
>>> I can find the definition for an vmexit in case of index >=
>>> VMFUNC_EPTP_ENTRIES, but not for !vmcs12->eptp_list_address in the SDM.
>>>
>>> Can you give me a hint?
>>
>> I don't think there is. Since, we are basically emulating eptp switching
>> for L2, this is a good check to have.
>
> There is nothing wrong with a hypervisor using physical page 0 for
> whatever purpose it likes, including an EPTP list.

Right, but of all the things, an L1 hypervisor wanting page 0 for an eptp list
address most likely means it forgot to initialize it. Whatever damage it does
will still end up with a vmfunc vmexit anyway.

Bandan


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-11 Thread Bandan Das
Bandan Das  writes:

>>> +   /*
>>> +* If the (L2) guest does a vmfunc to the currently
>>> +* active ept pointer, we don't have to do anything else
>>> +*/
>>> +   if (vmcs12->ept_pointer != address) {
>>> +   if (address >> cpuid_maxphyaddr(vcpu) ||
>>> +   !IS_ALIGNED(address, 4096))
>>
>> Couldn't the pfn still be invalid and make kvm_mmu_reload() fail?
>> (triggering a KVM_REQ_TRIPLE_FAULT)
>
> If there's a triple fault, I think it's a good idea to inject it
> back. Basically, there's no need to take care of damage control
> that L1 is intentionally doing.
>
>>> +   goto fail;
>>> +   kvm_mmu_unload(vcpu);
>>> +   vmcs12->ept_pointer = address;
>>> +   kvm_mmu_reload(vcpu);
>>
>> I was thinking about something like this:
>>
>> kvm_mmu_unload(vcpu);
>> old = vmcs12->ept_pointer;
>> vmcs12->ept_pointer = address;
>> if (kvm_mmu_reload(vcpu)) {
>>  /* pointer invalid, restore previous state */
>>  kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
>>  vmcs12->ept_pointer = old;
>>  kvm_mmu_reload(vcpu);
>>  goto fail;
>> }
>>
>> Then you can inherit the checks from mmu_check_root().

Actually, thinking about this a bit more, I agree with you. Any fault
with a vmfunc operation should end with a vmfunc vmexit, so this
is a good thing to have. Thank you for this idea! :)

Bandan


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-11 Thread Bandan Das
Radim Krčmář  writes:

> [David did a great review, so I'll just point out things I noticed.]
>
> 2017-07-11 09:51+0200, David Hildenbrand:
>> On 10.07.2017 22:49, Bandan Das wrote:
>> > When L2 uses vmfunc, L0 utilizes the associated vmexit to
>> > emulate a switching of the ept pointer by reloading the
>> > guest MMU.
>> > 
>> > Signed-off-by: Paolo Bonzini 
>> > Signed-off-by: Bandan Das 
>> > ---
>> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> > @@ -7784,11 +7801,46 @@ static int handle_vmfunc(struct kvm_vcpu *vcpu)
>> >}
>> >  
>> >vmcs12 = get_vmcs12(vcpu);
>> > -  if ((vmcs12->vm_function_control & (1 << function)) == 0)
>> > +  if (((vmcs12->vm_function_control & (1 << function)) == 0) ||
>> > +  WARN_ON_ONCE(function))
>> 
>> "... instruction causes a VM exit if the bit at position EAX is 0 in the
>> VM-function controls (the selected VM function is
>> not enabled)."
>> 
>> So g2 can trigger this WARN_ON_ONCE, no? I think we should drop it then
>> completely.
>
> It assumes that vm_function_control is not > 1, which is (should be)
> guaranteed by VM entry check, because the nested_vmx_vmfunc_controls MSR
> is 1.
>
>> > +  goto fail;
>
> The rest of the code assumes that the function is
> VMX_VMFUNC_EPTP_SWITCHING, so some WARN_ON_ONCE is reasonable.
>
> Writing it as
>
>   WARN_ON_ONCE(function != VMX_VMFUNC_EPTP_SWITCHING)
>
> would be cleared and I'd prefer to move the part that handles
> VMX_VMFUNC_EPTP_SWITCHING into a new function. (Imagine that Intel is
> going to add more than one VM FUNC. :])

IMO, for now, this should be fine because we are not even passing through the
hardware's eptp switching. Even if there are other vm functions, they
won't be available for the nested case and cause any conflict.

>> > +  if (!nested_cpu_has_ept(vmcs12) ||
>> > +  !nested_cpu_has_eptp_switching(vmcs12))
>> > +  goto fail;
>
> This brings me to a missing vm-entry check:
>
>  If “EPTP switching” VM-function control is 1, the “enable EPT”
>  VM-execution control must also be 1. In addition, the EPTP-list address
>  must satisfy the following checks:
>  • Bits 11:0 of the address must be 0.
>  • The address must not set any bits beyond the processor’s
>physical-address width.
>
> so this one could be
>
>   if (!nested_cpu_has_eptp_switching(vmcs12) ||
>   WARN_ON_ONCE(!nested_cpu_has_ept(vmcs12)))

I will reverse the order here but the vm entry check is unnecessary because
the check on the list address is already being done in this function.

> after adding the check.


Re: [PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-11 Thread Bandan Das
David Hildenbrand  writes:

> On 10.07.2017 22:49, Bandan Das wrote:
>> When L2 uses vmfunc, L0 utilizes the associated vmexit to
>> emulate a switching of the ept pointer by reloading the
>> guest MMU.
>> 
>> Signed-off-by: Paolo Bonzini 
>> Signed-off-by: Bandan Das 
>> ---
>>  arch/x86/include/asm/vmx.h |  6 +
>>  arch/x86/kvm/vmx.c | 58 
>> +++---
>>  2 files changed, 61 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>> index da5375e..5f63a2e 100644
>> --- a/arch/x86/include/asm/vmx.h
>> +++ b/arch/x86/include/asm/vmx.h
>> @@ -115,6 +115,10 @@
>>  #define VMX_MISC_SAVE_EFER_LMA  0x0020
>>  #define VMX_MISC_ACTIVITY_HLT   0x0040
>>  
>> +/* VMFUNC functions */
>> +#define VMX_VMFUNC_EPTP_SWITCHING   0x0001
>> +#define VMFUNC_EPTP_ENTRIES  512
>> +
>>  static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
>>  {
>>  return vmx_basic & GENMASK_ULL(30, 0);
>> @@ -200,6 +204,8 @@ enum vmcs_field {
>>  EOI_EXIT_BITMAP2_HIGH   = 0x2021,
>>  EOI_EXIT_BITMAP3= 0x2022,
>>  EOI_EXIT_BITMAP3_HIGH   = 0x2023,
>> +EPTP_LIST_ADDRESS   = 0x2024,
>> +EPTP_LIST_ADDRESS_HIGH  = 0x2025,
>>  VMREAD_BITMAP   = 0x2026,
>>  VMWRITE_BITMAP  = 0x2028,
>>  XSS_EXIT_BITMAP = 0x202C,
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index fe8f5fc..0a969fb 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -246,6 +246,7 @@ struct __packed vmcs12 {
>>  u64 eoi_exit_bitmap1;
>>  u64 eoi_exit_bitmap2;
>>  u64 eoi_exit_bitmap3;
>> +u64 eptp_list_address;
>>  u64 xss_exit_bitmap;
>>  u64 guest_physical_address;
>>  u64 vmcs_link_pointer;
>> @@ -771,6 +772,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
>>  FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
>>  FIELD64(EOI_EXIT_BITMAP2, eoi_exit_bitmap2),
>>  FIELD64(EOI_EXIT_BITMAP3, eoi_exit_bitmap3),
>> +FIELD64(EPTP_LIST_ADDRESS, eptp_list_address),
>>  FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap),
>>  FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
>>  FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
>> @@ -1402,6 +1404,13 @@ static inline bool nested_cpu_has_vmfunc(struct vmcs12 *vmcs12)
>>  return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
>>  }
>>  
>> +static inline bool nested_cpu_has_eptp_switching(struct vmcs12 *vmcs12)
>> +{
>> +return nested_cpu_has_vmfunc(vmcs12) &&
>> +(vmcs12->vm_function_control &
>
> I wonder if it makes sense to rename vm_function_control to
> - vmfunc_control
> - vmfunc_controls (so it matches nested_vmx_vmfunc_controls)
> - vmfunc_ctrl

I tend to follow the SDM names because it's easy to look for them.

>> + VMX_VMFUNC_EPTP_SWITCHING);
>> +}
>> +
>>  static inline bool is_nmi(u32 intr_info)
>>  {
>>  return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
>> @@ -2791,7 +2800,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
>>  if (cpu_has_vmx_vmfunc()) {
>>  vmx->nested.nested_vmx_secondary_ctls_high |=
>>  SECONDARY_EXEC_ENABLE_VMFUNC;
>> -vmx->nested.nested_vmx_vmfunc_controls = 0;
>> +/*
>> + * Advertise EPTP switching unconditionally
>> + * since we emulate it
>> + */
>> +vmx->nested.nested_vmx_vmfunc_controls =
>> +VMX_VMFUNC_EPTP_SWITCHING;
>> }
>>  
>>  /*
>> @@ -7772,6 +7786,9 @@ static int handle_vmfunc(struct kvm_vcpu *vcpu)
>>  struct vcpu_vmx *vmx = to_vmx(vcpu);
>>  struct vmcs12 *vmcs12;
>>  u32 function = vcpu->arch.regs[VCPU_REGS_RAX];
>> +u32 index = vcpu->arch.regs[VCPU_REGS_RCX];
>> +struct page *page = NULL;
>> +u64 *l1_eptp_list, address;
>>  
>>  /*
>>   * VMFUNC is only supported for nested guests, but we always enable the
>> @@ -7784,11 +7801,46 @@ static int handle_vmfunc(struct kvm_vcpu *vcpu)
>>  }
>>  
>>  vmcs12 = get_vmcs12(vcpu);
>> -if ((vmcs12->

[PATCH v4 1/3] KVM: vmx: Enable VMFUNCs

2017-07-10 Thread Bandan Das
Enable VMFUNC in the secondary execution controls.  This simplifies the
changes necessary to expose it to nested hypervisors.  VMFUNCs still
cause #UD when invoked.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |  3 +++
 arch/x86/kvm/vmx.c | 22 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 35cd06f..da5375e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -72,6 +72,7 @@
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_RDRAND  0x0800
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
+#define SECONDARY_EXEC_ENABLE_VMFUNC0x2000
 #define SECONDARY_EXEC_SHADOW_VMCS  0x4000
 #define SECONDARY_EXEC_RDSEED  0x0001
 #define SECONDARY_EXEC_ENABLE_PML   0x0002
@@ -187,6 +188,8 @@ enum vmcs_field {
APIC_ACCESS_ADDR_HIGH   = 0x2015,
POSTED_INTR_DESC_ADDR   = 0x2016,
POSTED_INTR_DESC_ADDR_HIGH  = 0x2017,
+   VM_FUNCTION_CONTROL = 0x2018,
+   VM_FUNCTION_CONTROL_HIGH= 0x2019,
EPT_POINTER = 0x201a,
EPT_POINTER_HIGH= 0x201b,
EOI_EXIT_BITMAP0= 0x201c,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ca5d2b9..a483b49 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1314,6 +1314,12 @@ static inline bool cpu_has_vmx_tsc_scaling(void)
SECONDARY_EXEC_TSC_SCALING;
 }
 
+static inline bool cpu_has_vmx_vmfunc(void)
+{
+   return vmcs_config.cpu_based_2nd_exec_ctrl &
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+}
+
 static inline bool report_flexpriority(void)
 {
return flexpriority_enabled;
@@ -3575,7 +3581,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
SECONDARY_EXEC_SHADOW_VMCS |
SECONDARY_EXEC_XSAVES |
SECONDARY_EXEC_ENABLE_PML |
-   SECONDARY_EXEC_TSC_SCALING;
+   SECONDARY_EXEC_TSC_SCALING |
+   SECONDARY_EXEC_ENABLE_VMFUNC;
if (adjust_vmx_controls(min2, opt2,
MSR_IA32_VMX_PROCBASED_CTLS2,
&_cpu_based_2nd_exec_control) < 0)
@@ -5233,6 +5240,9 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */
 #endif
 
+   if (cpu_has_vmx_vmfunc())
+   vmcs_write64(VM_FUNCTION_CONTROL, 0);
+
vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
@@ -7740,6 +7750,12 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static int handle_vmfunc(struct kvm_vcpu *vcpu)
+{
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -7790,6 +7806,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[EXIT_REASON_XSAVES]  = handle_xsaves,
[EXIT_REASON_XRSTORS] = handle_xrstors,
[EXIT_REASON_PML_FULL]= handle_pml_full,
+   [EXIT_REASON_VMFUNC]  = handle_vmfunc,
[EXIT_REASON_PREEMPTION_TIMER]= handle_preemption_timer,
 };
 
@@ -8111,6 +8128,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
case EXIT_REASON_PML_FULL:
/* We emulate PML support to L1. */
return false;
+   case EXIT_REASON_VMFUNC:
+   /* VM functions are emulated through L2->L0 vmexits. */
+   return false;
default:
return true;
}
-- 
2.9.4



[PATCH v4 0/3] Expose VMFUNC to the nested hypervisor

2017-07-10 Thread Bandan Das
v4:
 2/3:  Use WARN_ONCE to avoid logging dos

v3:
 https://lkml.org/lkml/2017/7/10/684
 3/3: Add missing nested_release_page_clean() and check the
 eptp as mentioned in SDM 24.6.14
 
v2:
 https://lkml.org/lkml/2017/7/6/813
 1/3: Patch to enable vmfunc on the host but cause a #UD if
  L1 tries to use it directly. (new)
 2/3: Expose vmfunc to the nested hypervisor, but no vm functions
  are exposed and L0 emulates a vmfunc vmexit to L1. 
 3/3: Force a vmfunc vmexit when L2 tries to use vmfunc and emulate
  eptp switching. Unconditionally expose EPTP switching to the
  L1 hypervisor since L0 fakes eptp switching via a mmu reload.

These patches expose eptp switching/vmfunc to the nested hypervisor.
vmfunc is enabled in the secondary controls for the host and is
exposed to the nested hypervisor. However, if the nested hypervisor
decides to use eptp switching, L0 emulates it.
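For context, this is roughly how an L1 hypervisor would invoke the
EPTP-switching VM function that gets advertised here (an illustration only,
not code from the series): EAX selects function 0 and ECX the index into the
EPTP list.

static inline void vmfunc_eptp_switch(unsigned int index)
{
	/* VMFUNC is encoded as 0f 01 d4; EAX = 0 selects EPTP switching */
	asm volatile(".byte 0x0f, 0x01, 0xd4"
		     : : "a" (0), "c" (index) : "memory");
}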

v1:
 https://lkml.org/lkml/2017/6/29/958

Bandan Das (3):
  KVM: vmx: Enable VMFUNCs
  KVM: nVMX: Enable VMFUNC for the L1 hypervisor
  KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

 arch/x86/include/asm/vmx.h |   9 
 arch/x86/kvm/vmx.c | 125 -
 2 files changed, 132 insertions(+), 2 deletions(-)

-- 
2.9.4



[PATCH v4 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-10 Thread Bandan Das
When L2 uses vmfunc, L0 utilizes the associated vmexit to
emulate a switching of the ept pointer by reloading the
guest MMU.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |  6 +
 arch/x86/kvm/vmx.c | 58 +++---
 2 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index da5375e..5f63a2e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -115,6 +115,10 @@
 #define VMX_MISC_SAVE_EFER_LMA 0x0020
 #define VMX_MISC_ACTIVITY_HLT  0x0040
 
+/* VMFUNC functions */
+#define VMX_VMFUNC_EPTP_SWITCHING   0x0001
+#define VMFUNC_EPTP_ENTRIES  512
+
 static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
 {
return vmx_basic & GENMASK_ULL(30, 0);
@@ -200,6 +204,8 @@ enum vmcs_field {
EOI_EXIT_BITMAP2_HIGH   = 0x2021,
EOI_EXIT_BITMAP3= 0x2022,
EOI_EXIT_BITMAP3_HIGH   = 0x2023,
+   EPTP_LIST_ADDRESS   = 0x2024,
+   EPTP_LIST_ADDRESS_HIGH  = 0x2025,
VMREAD_BITMAP   = 0x2026,
VMWRITE_BITMAP  = 0x2028,
XSS_EXIT_BITMAP = 0x202C,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index fe8f5fc..0a969fb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -246,6 +246,7 @@ struct __packed vmcs12 {
u64 eoi_exit_bitmap1;
u64 eoi_exit_bitmap2;
u64 eoi_exit_bitmap3;
+   u64 eptp_list_address;
u64 xss_exit_bitmap;
u64 guest_physical_address;
u64 vmcs_link_pointer;
@@ -771,6 +772,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
FIELD64(EOI_EXIT_BITMAP2, eoi_exit_bitmap2),
FIELD64(EOI_EXIT_BITMAP3, eoi_exit_bitmap3),
+   FIELD64(EPTP_LIST_ADDRESS, eptp_list_address),
FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap),
FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
@@ -1402,6 +1404,13 @@ static inline bool nested_cpu_has_vmfunc(struct vmcs12 *vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
 }
 
+static inline bool nested_cpu_has_eptp_switching(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has_vmfunc(vmcs12) &&
+   (vmcs12->vm_function_control &
+VMX_VMFUNC_EPTP_SWITCHING);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2791,7 +2800,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
if (cpu_has_vmx_vmfunc()) {
vmx->nested.nested_vmx_secondary_ctls_high |=
SECONDARY_EXEC_ENABLE_VMFUNC;
-   vmx->nested.nested_vmx_vmfunc_controls = 0;
+   /*
+* Advertise EPTP switching unconditionally
+* since we emulate it
+*/
+   vmx->nested.nested_vmx_vmfunc_controls =
+   VMX_VMFUNC_EPTP_SWITCHING;
}
 
/*
@@ -7772,6 +7786,9 @@ static int handle_vmfunc(struct kvm_vcpu *vcpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmcs12 *vmcs12;
u32 function = vcpu->arch.regs[VCPU_REGS_RAX];
+   u32 index = vcpu->arch.regs[VCPU_REGS_RCX];
+   struct page *page = NULL;
+   u64 *l1_eptp_list, address;
 
/*
 * VMFUNC is only supported for nested guests, but we always enable the
@@ -7784,11 +7801,46 @@ static int handle_vmfunc(struct kvm_vcpu *vcpu)
}
 
vmcs12 = get_vmcs12(vcpu);
-   if ((vmcs12->vm_function_control & (1 << function)) == 0)
+   if (((vmcs12->vm_function_control & (1 << function)) == 0) ||
+   WARN_ON_ONCE(function))
+   goto fail;
+
+   if (!nested_cpu_has_ept(vmcs12) ||
+   !nested_cpu_has_eptp_switching(vmcs12))
+   goto fail;
+
+   if (!vmcs12->eptp_list_address || index >= VMFUNC_EPTP_ENTRIES)
+   goto fail;
+
+   page = nested_get_page(vcpu, vmcs12->eptp_list_address);
+   if (!page)
goto fail;
-   WARN_ONCE(1, "VMCS12 VM function control should have been zero");
+
+   l1_eptp_list = kmap(page);
+   address = l1_eptp_list[index];
+   if (!address)
+   goto fail;
+   /*
+* If the (L2) guest does a vmfunc to the currently
+* active ept pointer, we don't have to do anything else
+*/
+   if (vmcs12->ept_pointer != address) {
+   if (address >> cpuid_maxphyaddr(vcpu) ||
+   !IS_ALIGNED(address, 4096))
+

[PATCH v4 2/3] KVM: nVMX: Enable VMFUNC for the L1 hypervisor

2017-07-10 Thread Bandan Das
Expose VMFUNC in MSRs and VMCS fields. No actual VMFUNCs are enabled.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
Reviewed-by: David Hildenbrand 
---
 arch/x86/kvm/vmx.c | 53 +++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a483b49..fe8f5fc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -240,6 +240,7 @@ struct __packed vmcs12 {
u64 virtual_apic_page_addr;
u64 apic_access_addr;
u64 posted_intr_desc_addr;
+   u64 vm_function_control;
u64 ept_pointer;
u64 eoi_exit_bitmap0;
u64 eoi_exit_bitmap1;
@@ -481,6 +482,7 @@ struct nested_vmx {
u64 nested_vmx_cr4_fixed0;
u64 nested_vmx_cr4_fixed1;
u64 nested_vmx_vmcs_enum;
+   u64 nested_vmx_vmfunc_controls;
 };
 
 #define POSTED_INTR_ON  0
@@ -763,6 +765,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr),
+   FIELD64(VM_FUNCTION_CONTROL, vm_function_control),
FIELD64(EPT_POINTER, ept_pointer),
FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0),
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
@@ -1394,6 +1397,11 @@ static inline bool nested_cpu_has_posted_intr(struct vmcs12 *vmcs12)
return vmcs12->pin_based_vm_exec_control & PIN_BASED_POSTED_INTR;
 }
 
+static inline bool nested_cpu_has_vmfunc(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2780,6 +2788,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
} else
vmx->nested.nested_vmx_ept_caps = 0;
 
+   if (cpu_has_vmx_vmfunc()) {
+   vmx->nested.nested_vmx_secondary_ctls_high |=
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+   vmx->nested.nested_vmx_vmfunc_controls = 0;
+   }
+
/*
 * Old versions of KVM use the single-context version without
 * checking for support, so declare that it is supported even
@@ -3149,6 +3163,9 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
*pdata = vmx->nested.nested_vmx_ept_caps |
((u64)vmx->nested.nested_vmx_vpid_caps << 32);
break;
+   case MSR_IA32_VMX_VMFUNC:
+   *pdata = vmx->nested.nested_vmx_vmfunc_controls;
+   break;
default:
return 1;
}
@@ -7752,7 +7769,29 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
 
 static int handle_vmfunc(struct kvm_vcpu *vcpu)
 {
-   kvm_queue_exception(vcpu, UD_VECTOR);
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct vmcs12 *vmcs12;
+   u32 function = vcpu->arch.regs[VCPU_REGS_RAX];
+
+   /*
+* VMFUNC is only supported for nested guests, but we always enable the
+* secondary control for simplicity; for non-nested mode, fake that we
+* didn't by injecting #UD.
+*/
+   if (!is_guest_mode(vcpu)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   vmcs12 = get_vmcs12(vcpu);
+   if ((vmcs12->vm_function_control & (1 << function)) == 0)
+   goto fail;
+   WARN_ONCE(1, "VMCS12 VM function control should have been zero");
+
+fail:
+   nested_vmx_vmexit(vcpu, vmx->exit_reason,
+ vmcs_read32(VM_EXIT_INTR_INFO),
+ vmcs_readl(EXIT_QUALIFICATION));
return 1;
 }
 
@@ -10053,7 +10092,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
  SECONDARY_EXEC_RDTSCP |
  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
- SECONDARY_EXEC_APIC_REGISTER_VIRT);
+ SECONDARY_EXEC_APIC_REGISTER_VIRT |
+ SECONDARY_EXEC_ENABLE_VMFUNC);
if (nested_cpu_has(vmcs12,
   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) {
vmcs12_exec_ctrl = vmcs12->secondary_vm_exec_control &
@@ -10061,6 +10101,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
exec_control |= vmcs12_exec_ctrl;
}
 
+   /* All VMFUNCs are currently emulated through L0 vmexits.  */
+   if (exec_control & SECONDARY_EXEC_ENABLE_VMFUNC)
+   vmcs_write64(VM_FUNCTION_CO

Re: [PATCH v3 2/3] KVM: nVMX: Enable VMFUNC for the L1 hypervisor

2017-07-10 Thread Bandan Das
David Hildenbrand  writes:

>> -kvm_queue_exception(vcpu, UD_VECTOR);
>> +struct vcpu_vmx *vmx = to_vmx(vcpu);
>> +struct vmcs12 *vmcs12;
>> +u32 function = vcpu->arch.regs[VCPU_REGS_RAX];
>> +
>> +/*
>> + * VMFUNC is only supported for nested guests, but we always enable the
>> + * secondary control for simplicity; for non-nested mode, fake that we
>> + * didn't by injecting #UD.
>> + */
>> +if (!is_guest_mode(vcpu)) {
>> +kvm_queue_exception(vcpu, UD_VECTOR);
>> +return 1;
>> +}
>> +
>> +vmcs12 = get_vmcs12(vcpu);
>> +if ((vmcs12->vm_function_control & (1 << function)) == 0)
>> +goto fail;
>> +WARN(1, "VMCS12 VM function control should have been zero");
>
> Should this be a WARN_ONCE?

Even though this line gets removed in patch 3, I agree, it's a
good idea to use WARN_ONCE.

>> +
>> +fail:
>> +nested_vmx_vmexit(vcpu, vmx->exit_reason,
>> +  vmcs_read32(VM_EXIT_INTR_INFO),
>> +  vmcs_readl(EXIT_QUALIFICATION));
>>  return 1;
>>  }
>>  
>> @@ -10053,7 +10092,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>>  exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
>>SECONDARY_EXEC_RDTSCP |
>>SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
>> -  SECONDARY_EXEC_APIC_REGISTER_VIRT);
>> +  SECONDARY_EXEC_APIC_REGISTER_VIRT |
>> +  SECONDARY_EXEC_ENABLE_VMFUNC);
>>  if (nested_cpu_has(vmcs12,
>> CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) {
>>  vmcs12_exec_ctrl = vmcs12->secondary_vm_exec_control &
>> @@ -10061,6 +10101,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>>  exec_control |= vmcs12_exec_ctrl;
>>  }
>>  
>> +/* All VMFUNCs are currently emulated through L0 vmexits.  */
>> +if (exec_control & SECONDARY_EXEC_ENABLE_VMFUNC)
>> +vmcs_write64(VM_FUNCTION_CONTROL, 0);
>> +
>>  if (exec_control & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) {
>>  vmcs_write64(EOI_EXIT_BITMAP0,
>>  vmcs12->eoi_exit_bitmap0);
>> @@ -10310,6 +10354,11 @@ static int check_vmentry_prereqs(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>>  vmx->nested.nested_vmx_entry_ctls_high))
>>  return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
>>  
>> +if (nested_cpu_has_vmfunc(vmcs12) &&
>> +(vmcs12->vm_function_control &
>> + ~vmx->nested.nested_vmx_vmfunc_controls))
>
> I'd prefer the second part on one line, although it will violate 80
> chars. (these variable names really start to get too lengthy to be useful)

Yeah, I had to split it up for that.

Thank you for the quick review!

Bandan

>> +return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
>> +
>>  if (vmcs12->cr3_target_count > nested_cpu_vmx_misc_cr3_count(vcpu))
>>  return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
>>  
>> 
>
> Feel free to ignore my comments.
>
> Reviewed-by: David Hildenbrand 


[PATCH v3 0/3] Expose VMFUNC to the nested hypervisor

2017-07-10 Thread Bandan Das

v3:
 3/3: Add missing nested_release_page_clean() and check the
 eptp as mentioned in SDM 24.6.14
 
v2:
 https://lkml.org/lkml/2017/7/6/813
 1/3: Patch to enable vmfunc on the host but cause a #UD if
  L1 tries to use it directly. (new)
 2/3: Expose vmfunc to the nested hypervisor, but no vm functions
  are exposed and L0 emulates a vmfunc vmexit to L1. 
 3/3: Force a vmfunc vmexit when L2 tries to use vmfunc and emulate
  eptp switching. Unconditionally expose EPTP switching to the
  L1 hypervisor since L0 fakes eptp switching via a mmu reload.

These patches expose eptp switching/vmfunc to the nested hypervisor.
vmfunc is enabled in the secondary controls for the host and is
exposed to the nested hypervisor. However, if the nested hypervisor
decides to use eptp switching, L0 emulates it.

v1:
 https://lkml.org/lkml/2017/6/29/958

Bandan Das (3):
  KVM: vmx: Enable VMFUNCs
  KVM: nVMX: Enable VMFUNC for the L1 hypervisor
  KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

 arch/x86/include/asm/vmx.h |   9 
 arch/x86/kvm/vmx.c | 125 -
 2 files changed, 132 insertions(+), 2 deletions(-)

-- 
2.9.4



[PATCH v3 2/3] KVM: nVMX: Enable VMFUNC for the L1 hypervisor

2017-07-10 Thread Bandan Das
Expose VMFUNC in MSRs and VMCS fields. No actual VMFUNCs are enabled.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/kvm/vmx.c | 53 +++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a483b49..7364678 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -240,6 +240,7 @@ struct __packed vmcs12 {
u64 virtual_apic_page_addr;
u64 apic_access_addr;
u64 posted_intr_desc_addr;
+   u64 vm_function_control;
u64 ept_pointer;
u64 eoi_exit_bitmap0;
u64 eoi_exit_bitmap1;
@@ -481,6 +482,7 @@ struct nested_vmx {
u64 nested_vmx_cr4_fixed0;
u64 nested_vmx_cr4_fixed1;
u64 nested_vmx_vmcs_enum;
+   u64 nested_vmx_vmfunc_controls;
 };
 
 #define POSTED_INTR_ON  0
@@ -763,6 +765,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr),
+   FIELD64(VM_FUNCTION_CONTROL, vm_function_control),
FIELD64(EPT_POINTER, ept_pointer),
FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0),
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
@@ -1394,6 +1397,11 @@ static inline bool nested_cpu_has_posted_intr(struct vmcs12 *vmcs12)
return vmcs12->pin_based_vm_exec_control & PIN_BASED_POSTED_INTR;
 }
 
+static inline bool nested_cpu_has_vmfunc(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2780,6 +2788,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
} else
vmx->nested.nested_vmx_ept_caps = 0;
 
+   if (cpu_has_vmx_vmfunc()) {
+   vmx->nested.nested_vmx_secondary_ctls_high |=
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+   vmx->nested.nested_vmx_vmfunc_controls = 0;
+   }
+
/*
 * Old versions of KVM use the single-context version without
 * checking for support, so declare that it is supported even
@@ -3149,6 +3163,9 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
*pdata = vmx->nested.nested_vmx_ept_caps |
((u64)vmx->nested.nested_vmx_vpid_caps << 32);
break;
+   case MSR_IA32_VMX_VMFUNC:
+   *pdata = vmx->nested.nested_vmx_vmfunc_controls;
+   break;
default:
return 1;
}
@@ -7752,7 +7769,29 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
 
 static int handle_vmfunc(struct kvm_vcpu *vcpu)
 {
-   kvm_queue_exception(vcpu, UD_VECTOR);
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct vmcs12 *vmcs12;
+   u32 function = vcpu->arch.regs[VCPU_REGS_RAX];
+
+   /*
+* VMFUNC is only supported for nested guests, but we always enable the
+* secondary control for simplicity; for non-nested mode, fake that we
+* didn't by injecting #UD.
+*/
+   if (!is_guest_mode(vcpu)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   vmcs12 = get_vmcs12(vcpu);
+   if ((vmcs12->vm_function_control & (1 << function)) == 0)
+   goto fail;
+   WARN(1, "VMCS12 VM function control should have been zero");
+
+fail:
+   nested_vmx_vmexit(vcpu, vmx->exit_reason,
+ vmcs_read32(VM_EXIT_INTR_INFO),
+ vmcs_readl(EXIT_QUALIFICATION));
return 1;
 }
 
@@ -10053,7 +10092,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12,
exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
  SECONDARY_EXEC_RDTSCP |
  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
- SECONDARY_EXEC_APIC_REGISTER_VIRT);
+ SECONDARY_EXEC_APIC_REGISTER_VIRT |
+ SECONDARY_EXEC_ENABLE_VMFUNC);
if (nested_cpu_has(vmcs12,
   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) {
vmcs12_exec_ctrl = vmcs12->secondary_vm_exec_control &
@@ -10061,6 +10101,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12,
exec_control |= vmcs12_exec_ctrl;
}
 
+   /* All VMFUNCs are currently emulated through L0 vmexits.  */
+   if (exec_control & SECONDARY_EXEC_ENABLE_VMFUNC)
+   vmcs_write64(VM_FUNCTION_CO

[PATCH v3 3/3] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-10 Thread Bandan Das
When L2 uses vmfunc, L0 utilizes the associated vmexit to
emulate a switching of the ept pointer by reloading the
guest MMU.
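
In rough terms, the L0 side of the emulation boils down to the sketch
below (an illustration only, not the code added by this patch; the use
of kvm_mmu_reload() is an assumption about how the reload is driven,
and the validation in the real handler differs slightly):

	/* Sketch: emulate VMFUNC leaf 0 (EPTP switching) on behalf of L1 */
	static int nested_eptp_switch_sketch(struct kvm_vcpu *vcpu,
					     struct vmcs12 *vmcs12, u64 new_eptp)
	{
		if (vmcs12->ept_pointer == new_eptp)
			return 0;	/* already active, nothing to do */

		if (!IS_ALIGNED(new_eptp, 4096) ||
		    new_eptp >> cpuid_maxphyaddr(vcpu))
			return 1;	/* let the caller reflect a failure */

		vmcs12->ept_pointer = new_eptp;
		kvm_mmu_unload(vcpu);		/* drop the old shadow roots */
		return kvm_mmu_reload(vcpu);	/* build roots for the new EPTP */
	}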

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |  6 +
 arch/x86/kvm/vmx.c | 58 +++---
 2 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index da5375e..5f63a2e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -115,6 +115,10 @@
 #define VMX_MISC_SAVE_EFER_LMA 0x0020
 #define VMX_MISC_ACTIVITY_HLT  0x0040
 
+/* VMFUNC functions */
+#define VMX_VMFUNC_EPTP_SWITCHING   0x0001
+#define VMFUNC_EPTP_ENTRIES  512
+
 static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
 {
return vmx_basic & GENMASK_ULL(30, 0);
@@ -200,6 +204,8 @@ enum vmcs_field {
EOI_EXIT_BITMAP2_HIGH   = 0x2021,
EOI_EXIT_BITMAP3= 0x2022,
EOI_EXIT_BITMAP3_HIGH   = 0x2023,
+   EPTP_LIST_ADDRESS   = 0x2024,
+   EPTP_LIST_ADDRESS_HIGH  = 0x2025,
VMREAD_BITMAP   = 0x2026,
VMWRITE_BITMAP  = 0x2028,
XSS_EXIT_BITMAP = 0x202C,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7364678..0a969fb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -246,6 +246,7 @@ struct __packed vmcs12 {
u64 eoi_exit_bitmap1;
u64 eoi_exit_bitmap2;
u64 eoi_exit_bitmap3;
+   u64 eptp_list_address;
u64 xss_exit_bitmap;
u64 guest_physical_address;
u64 vmcs_link_pointer;
@@ -771,6 +772,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
FIELD64(EOI_EXIT_BITMAP2, eoi_exit_bitmap2),
FIELD64(EOI_EXIT_BITMAP3, eoi_exit_bitmap3),
+   FIELD64(EPTP_LIST_ADDRESS, eptp_list_address),
FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap),
FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
@@ -1402,6 +1404,13 @@ static inline bool nested_cpu_has_vmfunc(struct vmcs12 
*vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
 }
 
+static inline bool nested_cpu_has_eptp_switching(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has_vmfunc(vmcs12) &&
+   (vmcs12->vm_function_control &
+VMX_VMFUNC_EPTP_SWITCHING);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2791,7 +2800,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
if (cpu_has_vmx_vmfunc()) {
vmx->nested.nested_vmx_secondary_ctls_high |=
SECONDARY_EXEC_ENABLE_VMFUNC;
-   vmx->nested.nested_vmx_vmfunc_controls = 0;
+   /*
+* Advertise EPTP switching unconditionally
+* since we emulate it
+*/
+   vmx->nested.nested_vmx_vmfunc_controls =
+   VMX_VMFUNC_EPTP_SWITCHING;
}
 
/*
@@ -7772,6 +7786,9 @@ static int handle_vmfunc(struct kvm_vcpu *vcpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmcs12 *vmcs12;
u32 function = vcpu->arch.regs[VCPU_REGS_RAX];
+   u32 index = vcpu->arch.regs[VCPU_REGS_RCX];
+   struct page *page = NULL;
+   u64 *l1_eptp_list, address;
 
/*
 * VMFUNC is only supported for nested guests, but we always enable the
@@ -7784,11 +7801,46 @@ static int handle_vmfunc(struct kvm_vcpu *vcpu)
}
 
vmcs12 = get_vmcs12(vcpu);
-   if ((vmcs12->vm_function_control & (1 << function)) == 0)
+   if (((vmcs12->vm_function_control & (1 << function)) == 0) ||
+   WARN_ON_ONCE(function))
+   goto fail;
+
+   if (!nested_cpu_has_ept(vmcs12) ||
+   !nested_cpu_has_eptp_switching(vmcs12))
+   goto fail;
+
+   if (!vmcs12->eptp_list_address || index >= VMFUNC_EPTP_ENTRIES)
+   goto fail;
+
+   page = nested_get_page(vcpu, vmcs12->eptp_list_address);
+   if (!page)
goto fail;
-   WARN(1, "VMCS12 VM function control should have been zero");
+
+   l1_eptp_list = kmap(page);
+   address = l1_eptp_list[index];
+   if (!address)
+   goto fail;
+   /*
+* If the (L2) guest does a vmfunc to the currently
+* active ept pointer, we don't have to do anything else
+*/
+   if (vmcs12->ept_pointer != address) {
+   if (address >> cpuid_maxphyaddr(vcpu) ||
+   !IS_ALIGNED(address, 4096))
+

[PATCH v3 1/3] KVM: vmx: Enable VMFUNCs

2017-07-10 Thread Bandan Das
Enable VMFUNC in the secondary execution controls.  This simplifies the
changes necessary to expose it to nested hypervisors.  VMFUNCs still
cause #UD when invoked.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
Reviewed-by: David Hildenbrand 
---
 arch/x86/include/asm/vmx.h |  3 +++
 arch/x86/kvm/vmx.c | 22 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 35cd06f..da5375e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -72,6 +72,7 @@
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_RDRAND  0x0800
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
+#define SECONDARY_EXEC_ENABLE_VMFUNC0x2000
 #define SECONDARY_EXEC_SHADOW_VMCS  0x4000
 #define SECONDARY_EXEC_RDSEED  0x0001
 #define SECONDARY_EXEC_ENABLE_PML   0x0002
@@ -187,6 +188,8 @@ enum vmcs_field {
APIC_ACCESS_ADDR_HIGH   = 0x2015,
POSTED_INTR_DESC_ADDR   = 0x2016,
POSTED_INTR_DESC_ADDR_HIGH  = 0x2017,
+   VM_FUNCTION_CONTROL = 0x2018,
+   VM_FUNCTION_CONTROL_HIGH= 0x2019,
EPT_POINTER = 0x201a,
EPT_POINTER_HIGH= 0x201b,
EOI_EXIT_BITMAP0= 0x201c,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ca5d2b9..a483b49 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1314,6 +1314,12 @@ static inline bool cpu_has_vmx_tsc_scaling(void)
SECONDARY_EXEC_TSC_SCALING;
 }
 
+static inline bool cpu_has_vmx_vmfunc(void)
+{
+   return vmcs_config.cpu_based_2nd_exec_ctrl &
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+}
+
 static inline bool report_flexpriority(void)
 {
return flexpriority_enabled;
@@ -3575,7 +3581,8 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
SECONDARY_EXEC_SHADOW_VMCS |
SECONDARY_EXEC_XSAVES |
SECONDARY_EXEC_ENABLE_PML |
-   SECONDARY_EXEC_TSC_SCALING;
+   SECONDARY_EXEC_TSC_SCALING |
+   SECONDARY_EXEC_ENABLE_VMFUNC;
if (adjust_vmx_controls(min2, opt2,
MSR_IA32_VMX_PROCBASED_CTLS2,
&_cpu_based_2nd_exec_control) < 0)
@@ -5233,6 +5240,9 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */
 #endif
 
+   if (cpu_has_vmx_vmfunc())
+   vmcs_write64(VM_FUNCTION_CONTROL, 0);
+
vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
@@ -7740,6 +7750,12 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static int handle_vmfunc(struct kvm_vcpu *vcpu)
+{
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -7790,6 +7806,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
kvm_vcpu *vcpu) = {
[EXIT_REASON_XSAVES]  = handle_xsaves,
[EXIT_REASON_XRSTORS] = handle_xrstors,
[EXIT_REASON_PML_FULL]= handle_pml_full,
+   [EXIT_REASON_VMFUNC]  = handle_vmfunc,
[EXIT_REASON_PREEMPTION_TIMER]= handle_preemption_timer,
 };
 
@@ -8111,6 +8128,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
case EXIT_REASON_PML_FULL:
/* We emulate PML support to L1. */
return false;
+   case EXIT_REASON_VMFUNC:
+   /* VM functions are emulated through L2->L0 vmexits. */
+   return false;
default:
return true;
}
-- 
2.9.4



[PATCH 0/3 v2] Expose VMFUNC to the nested hypervisor

2017-07-06 Thread Bandan Das
v2:
 1/3: Patch to enable vmfunc on the host but cause a #UD if
  L1 tries to use it directly. (new)
 2/3: Expose vmfunc to the nested hypervisor, but no vm functions
  are exposed and L0 emulates a vmfunc vmexit to L1. 
 3/3: Force a vmfunc vmexit when L2 tries to use vmfunc and emulate
  eptp switching. Unconditionally expose EPTP switching to the
  L1 hypervisor since L0 fakes eptp switching via a mmu reload.

These patches expose eptp switching/vmfunc to the nested hypervisor.
vmfunc is enabled in the secondary controls for the host and is
exposed to the nested hypervisor. However, if the nested hypervisor
decides to use eptp switching, L0 emulates it.
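
For reference, these are roughly the vmcs12 fields an L1 hypervisor has
to program (via VMWRITE on real hardware) before its L2 can use the
feature. The helper below is only an illustrative sketch written against
KVM's vmcs12 layout, not code from this series:

	/* Sketch: state L1 sets up so that L2 may execute VMFUNC leaf 0 */
	static void l1_enable_eptp_switching_sketch(struct vmcs12 *vmcs12,
						    u64 eptp_list_pa)
	{
		/* "enable VM functions" secondary execution control */
		vmcs12->secondary_vm_exec_control |= SECONDARY_EXEC_ENABLE_VMFUNC;
		/* bit 0 of the VM-function controls selects EPTP switching */
		vmcs12->vm_function_control |= VMX_VMFUNC_EPTP_SWITCHING;
		/* 4K-aligned page holding up to 512 EPT pointers */
		vmcs12->eptp_list_address = eptp_list_pa;
	}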

v1:
 https://lkml.org/lkml/2017/6/29/958

Bandan Das (3):
  KVM: vmx: Enable VMFUNCs
  KVM: nVMX: Enable VMFUNC for the L1 hypervisor
  KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

 arch/x86/include/asm/vmx.h |   9 
 arch/x86/kvm/vmx.c | 122 -
 2 files changed, 129 insertions(+), 2 deletions(-)

-- 
2.9.4



[PATCH 1/3 v2] KVM: vmx: Enable VMFUNCs

2017-07-06 Thread Bandan Das
Enable VMFUNC in the secondary execution controls.  This simplifies the
changes necessary to expose it to nested hypervisors.  VMFUNCs still
cause #UD when invoked.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |  3 +++
 arch/x86/kvm/vmx.c | 22 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 35cd06f..da5375e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -72,6 +72,7 @@
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_RDRAND  0x0800
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
+#define SECONDARY_EXEC_ENABLE_VMFUNC0x2000
 #define SECONDARY_EXEC_SHADOW_VMCS  0x4000
 #define SECONDARY_EXEC_RDSEED  0x0001
 #define SECONDARY_EXEC_ENABLE_PML   0x0002
@@ -187,6 +188,8 @@ enum vmcs_field {
APIC_ACCESS_ADDR_HIGH   = 0x2015,
POSTED_INTR_DESC_ADDR   = 0x2016,
POSTED_INTR_DESC_ADDR_HIGH  = 0x2017,
+   VM_FUNCTION_CONTROL = 0x2018,
+   VM_FUNCTION_CONTROL_HIGH= 0x2019,
EPT_POINTER = 0x201a,
EPT_POINTER_HIGH= 0x201b,
EOI_EXIT_BITMAP0= 0x201c,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ca5d2b9..a483b49 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1314,6 +1314,12 @@ static inline bool cpu_has_vmx_tsc_scaling(void)
SECONDARY_EXEC_TSC_SCALING;
 }
 
+static inline bool cpu_has_vmx_vmfunc(void)
+{
+   return vmcs_config.cpu_based_2nd_exec_ctrl &
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+}
+
 static inline bool report_flexpriority(void)
 {
return flexpriority_enabled;
@@ -3575,7 +3581,8 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
SECONDARY_EXEC_SHADOW_VMCS |
SECONDARY_EXEC_XSAVES |
SECONDARY_EXEC_ENABLE_PML |
-   SECONDARY_EXEC_TSC_SCALING;
+   SECONDARY_EXEC_TSC_SCALING |
+   SECONDARY_EXEC_ENABLE_VMFUNC;
if (adjust_vmx_controls(min2, opt2,
MSR_IA32_VMX_PROCBASED_CTLS2,
&_cpu_based_2nd_exec_control) < 0)
@@ -5233,6 +5240,9 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */
 #endif
 
+   if (cpu_has_vmx_vmfunc())
+   vmcs_write64(VM_FUNCTION_CONTROL, 0);
+
vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host));
@@ -7740,6 +7750,12 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static int handle_vmfunc(struct kvm_vcpu *vcpu)
+{
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -7790,6 +7806,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
kvm_vcpu *vcpu) = {
[EXIT_REASON_XSAVES]  = handle_xsaves,
[EXIT_REASON_XRSTORS] = handle_xrstors,
[EXIT_REASON_PML_FULL]= handle_pml_full,
+   [EXIT_REASON_VMFUNC]  = handle_vmfunc,
[EXIT_REASON_PREEMPTION_TIMER]= handle_preemption_timer,
 };
 
@@ -8111,6 +8128,9 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
case EXIT_REASON_PML_FULL:
/* We emulate PML support to L1. */
return false;
+   case EXIT_REASON_VMFUNC:
+   /* VM functions are emulated through L2->L0 vmexits. */
+   return false;
default:
return true;
}
-- 
2.9.4



[PATCH 3/3 v2] KVM: nVMX: Emulate EPTP switching for the L1 hypervisor

2017-07-06 Thread Bandan Das
When L2 uses vmfunc, L0 utilizes the associated vmexit to
emulate a switching of the ept pointer by reloading the
guest MMU.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |  6 +
 arch/x86/kvm/vmx.c | 55 +++---
 2 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index da5375e..5f63a2e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -115,6 +115,10 @@
 #define VMX_MISC_SAVE_EFER_LMA 0x0020
 #define VMX_MISC_ACTIVITY_HLT  0x0040
 
+/* VMFUNC functions */
+#define VMX_VMFUNC_EPTP_SWITCHING   0x0001
+#define VMFUNC_EPTP_ENTRIES  512
+
 static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
 {
return vmx_basic & GENMASK_ULL(30, 0);
@@ -200,6 +204,8 @@ enum vmcs_field {
EOI_EXIT_BITMAP2_HIGH   = 0x2021,
EOI_EXIT_BITMAP3= 0x2022,
EOI_EXIT_BITMAP3_HIGH   = 0x2023,
+   EPTP_LIST_ADDRESS   = 0x2024,
+   EPTP_LIST_ADDRESS_HIGH  = 0x2025,
VMREAD_BITMAP   = 0x2026,
VMWRITE_BITMAP  = 0x2028,
XSS_EXIT_BITMAP = 0x202C,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7364678..3a4aa68 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -246,6 +246,7 @@ struct __packed vmcs12 {
u64 eoi_exit_bitmap1;
u64 eoi_exit_bitmap2;
u64 eoi_exit_bitmap3;
+   u64 eptp_list_address;
u64 xss_exit_bitmap;
u64 guest_physical_address;
u64 vmcs_link_pointer;
@@ -771,6 +772,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
FIELD64(EOI_EXIT_BITMAP2, eoi_exit_bitmap2),
FIELD64(EOI_EXIT_BITMAP3, eoi_exit_bitmap3),
+   FIELD64(EPTP_LIST_ADDRESS, eptp_list_address),
FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap),
FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
@@ -1402,6 +1404,13 @@ static inline bool nested_cpu_has_vmfunc(struct vmcs12 
*vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
 }
 
+static inline bool nested_cpu_has_eptp_switching(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has_vmfunc(vmcs12) &&
+   (vmcs12->vm_function_control &
+VMX_VMFUNC_EPTP_SWITCHING);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2791,7 +2800,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
if (cpu_has_vmx_vmfunc()) {
vmx->nested.nested_vmx_secondary_ctls_high |=
SECONDARY_EXEC_ENABLE_VMFUNC;
-   vmx->nested.nested_vmx_vmfunc_controls = 0;
+   /*
+* Advertise EPTP switching unconditionally
+* since we emulate it
+*/
+   vmx->nested.nested_vmx_vmfunc_controls =
+   VMX_VMFUNC_EPTP_SWITCHING;
}
 
/*
@@ -7772,6 +7786,9 @@ static int handle_vmfunc(struct kvm_vcpu *vcpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);
struct vmcs12 *vmcs12;
u32 function = vcpu->arch.regs[VCPU_REGS_RAX];
+   u32 index = vcpu->arch.regs[VCPU_REGS_RCX];
+   struct page *page = NULL;
+   u64 *l1_eptp_list;
 
/*
 * VMFUNC is only supported for nested guests, but we always enable the
@@ -7784,11 +7801,43 @@ static int handle_vmfunc(struct kvm_vcpu *vcpu)
}
 
vmcs12 = get_vmcs12(vcpu);
-   if ((vmcs12->vm_function_control & (1 << function)) == 0)
+   if (((vmcs12->vm_function_control & (1 << function)) == 0) ||
+   WARN_ON_ONCE(function))
+   goto fail;
+
+   if (!nested_cpu_has_ept(vmcs12) ||
+   !nested_cpu_has_eptp_switching(vmcs12))
+   goto fail;
+
+   if (!vmcs12->eptp_list_address || index >= VMFUNC_EPTP_ENTRIES)
+   goto fail;
+
+   page = nested_get_page(vcpu, vmcs12->eptp_list_address);
+   if (!page)
+   goto fail;
+
+   l1_eptp_list = kmap(page);
+   if (!l1_eptp_list[index])
goto fail;
-   WARN(1, "VMCS12 VM function control should have been zero");
+
+   /*
+* If the (L2) guest does a vmfunc to the currently
+* active ept pointer, we don't have to do anything else
+*/
+   if (vmcs12->ept_pointer != l1_eptp_list[index]) {
+   kvm_mmu_unload(vcpu);
+   /*
+* TODO: Verify that guest ept satisfies vmentry prereqs
+*/
+  

[PATCH 2/3 v2] KVM: nVMX: Enable VMFUNC for the L1 hypervisor

2017-07-06 Thread Bandan Das
Expose VMFUNC in MSRs and VMCS fields. No actual VMFUNCs are enabled.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Bandan Das 
---
 arch/x86/kvm/vmx.c | 53 +++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a483b49..7364678 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -240,6 +240,7 @@ struct __packed vmcs12 {
u64 virtual_apic_page_addr;
u64 apic_access_addr;
u64 posted_intr_desc_addr;
+   u64 vm_function_control;
u64 ept_pointer;
u64 eoi_exit_bitmap0;
u64 eoi_exit_bitmap1;
@@ -481,6 +482,7 @@ struct nested_vmx {
u64 nested_vmx_cr4_fixed0;
u64 nested_vmx_cr4_fixed1;
u64 nested_vmx_vmcs_enum;
+   u64 nested_vmx_vmfunc_controls;
 };
 
 #define POSTED_INTR_ON  0
@@ -763,6 +765,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr),
+   FIELD64(VM_FUNCTION_CONTROL, vm_function_control),
FIELD64(EPT_POINTER, ept_pointer),
FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0),
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
@@ -1394,6 +1397,11 @@ static inline bool nested_cpu_has_posted_intr(struct 
vmcs12 *vmcs12)
return vmcs12->pin_based_vm_exec_control & PIN_BASED_POSTED_INTR;
 }
 
+static inline bool nested_cpu_has_vmfunc(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2780,6 +2788,12 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
} else
vmx->nested.nested_vmx_ept_caps = 0;
 
+   if (cpu_has_vmx_vmfunc()) {
+   vmx->nested.nested_vmx_secondary_ctls_high |=
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+   vmx->nested.nested_vmx_vmfunc_controls = 0;
+   }
+
/*
 * Old versions of KVM use the single-context version without
 * checking for support, so declare that it is supported even
@@ -3149,6 +3163,9 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 
msr_index, u64 *pdata)
*pdata = vmx->nested.nested_vmx_ept_caps |
((u64)vmx->nested.nested_vmx_vpid_caps << 32);
break;
+   case MSR_IA32_VMX_VMFUNC:
+   *pdata = vmx->nested.nested_vmx_vmfunc_controls;
+   break;
default:
return 1;
}
@@ -7752,7 +7769,29 @@ static int handle_preemption_timer(struct kvm_vcpu *vcpu)
 
 static int handle_vmfunc(struct kvm_vcpu *vcpu)
 {
-   kvm_queue_exception(vcpu, UD_VECTOR);
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   struct vmcs12 *vmcs12;
+   u32 function = vcpu->arch.regs[VCPU_REGS_RAX];
+
+   /*
+* VMFUNC is only supported for nested guests, but we always enable the
+* secondary control for simplicity; for non-nested mode, fake that we
+* didn't by injecting #UD.
+*/
+   if (!is_guest_mode(vcpu)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   vmcs12 = get_vmcs12(vcpu);
+   if ((vmcs12->vm_function_control & (1 << function)) == 0)
+   goto fail;
+   WARN(1, "VMCS12 VM function control should have been zero");
+
+fail:
+   nested_vmx_vmexit(vcpu, vmx->exit_reason,
+ vmcs_read32(VM_EXIT_INTR_INFO),
+ vmcs_readl(EXIT_QUALIFICATION));
return 1;
 }
 
@@ -10053,7 +10092,8 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12,
exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
  SECONDARY_EXEC_RDTSCP |
  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
- SECONDARY_EXEC_APIC_REGISTER_VIRT);
+ SECONDARY_EXEC_APIC_REGISTER_VIRT |
+ SECONDARY_EXEC_ENABLE_VMFUNC);
if (nested_cpu_has(vmcs12,
   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) {
vmcs12_exec_ctrl = vmcs12->secondary_vm_exec_control &
@@ -10061,6 +10101,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12,
exec_control |= vmcs12_exec_ctrl;
}
 
+   /* All VMFUNCs are currently emulated through L0 vmexits.  */
+   if (exec_control & SECONDARY_EXEC_ENABLE_VMFUNC)
+   vmcs_write64(VM_FUNCTION_CO

Re: [PATCH 0/2] Expose VMFUNC to the nested hypervisor

2017-06-30 Thread Bandan Das
Jim Mattson  writes:

> Isn't McAfee DeepSAFE defunct? Are there any other consumers of EPTP 
> switching?

I don't know of any real users but I think we should be providing this 
functionality to
the L1 hypervisor :)

IIRC, Xen lets you use EPTP switching as part of VM introspection ?

Bandan


> On Thu, Jun 29, 2017 at 4:29 PM, Bandan Das  wrote:
>> These patches expose eptp switching/vmfunc to the nested hypervisor. Testing 
>> with
>> kvm-unit-tests seems to work ok.
>>
>> If the guest hypervisor enables vmfunc/eptp switching, a "shadow" eptp list
>> address page is written to the VMCS. Initially, it would be unpopulated which
>> would result in a vmexit with exit reason 59. This hooks to handle_vmfunc()
>> to rewrite vmcs12->ept_pointer to reload the mmu and get a new root hpa.
>> This new shadow ept pointer is written to the shadow eptp list in the given
>> index. A next vmfunc call to switch to the given index would succeed without
>> an exit.
>>
>> Bandan Das (2):
>>   KVM: nVMX: Implement EPTP switching for the L1 hypervisor
>>   KVM: nVMX: Advertise VMFUNC to L1 hypervisor
>>
>>  arch/x86/include/asm/vmx.h |   9 
>>  arch/x86/kvm/vmx.c | 122 
>> +
>>  2 files changed, 131 insertions(+)
>>
>> --
>> 2.9.4
>>


Re: [PATCH 1/2] KVM: nVMX: Implement EPTP switching for the L1 hypervisor

2017-06-30 Thread Bandan Das
Hi Paolo,

Paolo Bonzini  writes:

> - Original Message -
>> From: "Bandan Das" 
>> To: k...@vger.kernel.org
>> Cc: pbonz...@redhat.com, linux-kernel@vger.kernel.org
>> Sent: Friday, June 30, 2017 1:29:55 AM
>> Subject: [PATCH 1/2] KVM: nVMX: Implement EPTP switching for the L1 
>> hypervisor
>> 
>> This is a mix of emulation/passthrough to implement EPTP
>> switching for the nested hypervisor.
>> 
>> If the shadow EPT are absent, a vmexit occurs with reason 59.
>> L0 can then create shadow structures based on the entry that the
>> guest calls with to obtain a new root_hpa that can be written to
>> the shadow list and subsequently, reload the mmu to resume L2.
>> On the next vmfunc(0, index) however, the processor will load the
>> entry without an exit.
>
> What happens if the root_hpa is dropped by the L0 MMU?  I'm not sure
> what removes it from the shadow EPT list.

That would result in a vmfunc vmexit, which will jump to handle_vmfunc()
and eventually to a call to mmu_alloc_shadow_roots(), which will overwrite
the shadow eptp entry with the new one.

I believe part of your question is also about the case where root_hpa is
valid but no longer tracking the current L1 eptp; the processor could then
jump to some other guest's page tables, which is a problem. In that case, I
think it should be possible to invalidate that entry in the shadow eptp list.
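
Something along these lines is what I have in mind; purely a hypothetical
sketch, the helper and its callers don't exist in this series:

	/*
	 * Hypothetical: drop a stale entry from the shadow eptp list so that
	 * the next vmfunc on that index takes an exit (reason 59) again and
	 * the entry gets repopulated by handle_vmfunc().
	 */
	static void nested_invalidate_shadow_eptp(struct vcpu_vmx *vmx, int index)
	{
		u64 *shadow_eptp_list;

		if (!vmx->nested.shadow_eptp_list || index >= VMFUNC_EPTP_ENTRIES)
			return;

		shadow_eptp_list = kmap(vmx->nested.shadow_eptp_list);
		shadow_eptp_list[index] = 0;
		kunmap(vmx->nested.shadow_eptp_list);
	}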

> For a first version of the patch, I would prefer a less optimized
> version that always goes through L0 when L2 executes VMFUNC.
> To achieve this effect, you can copy "enable VM functions" secondary
> execution control from vmcs12 to vmcs02, but leave the VM-function
> controls to 0 in vmcs02.

Is the current approach prone to other undesired corner cases like the one
you pointed out above? :) I would be uncomfortable having this in if you feel
that letting the cpu jump to a new eptp is an interesting exploitable
interface; however, I think it would be nice to have L2 execute vmfunc without
exiting to L0.

Thanks for the quick review!

> Paolo
>
>> Signed-off-by: Bandan Das 
>> ---
>>  arch/x86/include/asm/vmx.h |   5 +++
>>  arch/x86/kvm/vmx.c | 104
>>  +
>>  2 files changed, 109 insertions(+)
>> 
>> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>> index 35cd06f..e06783e 100644
>> --- a/arch/x86/include/asm/vmx.h
>> +++ b/arch/x86/include/asm/vmx.h
>> @@ -72,6 +72,7 @@
>>  #define SECONDARY_EXEC_PAUSE_LOOP_EXITING   0x0400
>>  #define SECONDARY_EXEC_RDRAND   0x0800
>>  #define SECONDARY_EXEC_ENABLE_INVPCID   0x1000
>> +#define SECONDARY_EXEC_ENABLE_VMFUNC0x2000
>>  #define SECONDARY_EXEC_SHADOW_VMCS  0x4000
>>  #define SECONDARY_EXEC_RDSEED   0x0001
>>  #define SECONDARY_EXEC_ENABLE_PML   0x0002
>> @@ -114,6 +115,10 @@
>>  #define VMX_MISC_SAVE_EFER_LMA  0x0020
>>  #define VMX_MISC_ACTIVITY_HLT   0x0040
>>  
>> +/* VMFUNC functions */
>> +#define VMX_VMFUNC_EPTP_SWITCHING   0x0001
>> +#define VMFUNC_EPTP_ENTRIES  512
>> +
>>  static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
>>  {
>>  return vmx_basic & GENMASK_ULL(30, 0);
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index ca5d2b9..75049c0 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -240,11 +240,13 @@ struct __packed vmcs12 {
>>  u64 virtual_apic_page_addr;
>>  u64 apic_access_addr;
>>  u64 posted_intr_desc_addr;
>> +u64 vm_function_control;
>>  u64 ept_pointer;
>>  u64 eoi_exit_bitmap0;
>>  u64 eoi_exit_bitmap1;
>>  u64 eoi_exit_bitmap2;
>>  u64 eoi_exit_bitmap3;
>> +u64 eptp_list_address;
>>  u64 xss_exit_bitmap;
>>  u64 guest_physical_address;
>>  u64 vmcs_link_pointer;
>> @@ -441,6 +443,7 @@ struct nested_vmx {
>>  struct page *apic_access_page;
>>  struct page *virtual_apic_page;
>>  struct page *pi_desc_page;
>> +struct page *shadow_eptp_list;
>>  struct pi_desc *pi_desc;
>>  bool pi_pending;
>>  u16 posted_intr_nv;
>> @@ -481,6 +484,7 @@ struct nested_vmx {
>>  u64 nested_vmx_cr4_fixed0;
>>  u64 nested_vmx_cr4_fixed1;
>>  u64 nested_vmx_vmcs_enum;
>> +u64 nested_vmx_vmfunc_controls;
>>  };
>>  
>>  #define POSTED_INTR_ON  0
>> @@ -1314,6 +1318,22 @@ static inline bo

[PATCH 0/2] Expose VMFUNC to the nested hypervisor

2017-06-29 Thread Bandan Das
These patches expose eptp switching/vmfunc to the nested hypervisor. Testing 
with
kvm-unit-tests seems to work ok.

If the guest hypervisor enables vmfunc/eptp switching, a "shadow" eptp list
address page is written to the VMCS. Initially it is unpopulated, which
results in a vmexit with exit reason 59. This exit is handled in
handle_vmfunc(), which rewrites vmcs12->ept_pointer, reloads the mmu and
obtains a new root hpa. The new shadow ept pointer is written to the shadow
eptp list at the given index, so the next vmfunc call switching to that
index succeeds without an exit.
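
The guest side of the test is essentially just the instruction itself;
something along these lines (a sketch, not the actual kvm-unit-tests
code; 0f 01 d4 is the VMFUNC opcode, EAX selects function 0 = EPTP
switching and ECX the list index):

	/* Executed in L2: switch to the EPT hierarchy at 'index' */
	static inline void vmfunc_eptp_switch(unsigned int index)
	{
		asm volatile(".byte 0x0f, 0x01, 0xd4"	/* VMFUNC */
			     : : "a" (0), "c" (index) : "memory");
	}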

Bandan Das (2):
  KVM: nVMX: Implement EPTP switching for the L1 hypervisor
  KVM: nVMX: Advertise VMFUNC to L1 hypervisor

 arch/x86/include/asm/vmx.h |   9 
 arch/x86/kvm/vmx.c | 122 +
 2 files changed, 131 insertions(+)

-- 
2.9.4



[PATCH 2/2] KVM: nVMX: Advertise VMFUNC to L1 hypervisor

2017-06-29 Thread Bandan Das
Advertise VMFUNC and EPTP switching function to the L1
hypervisor. Change nested_vmx_exit_handled() to return false
for VMFUNC so L0 can handle it.

Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |  4 
 arch/x86/kvm/vmx.c | 18 ++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index e06783e..5f63a2e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -192,6 +192,8 @@ enum vmcs_field {
APIC_ACCESS_ADDR_HIGH   = 0x2015,
POSTED_INTR_DESC_ADDR   = 0x2016,
POSTED_INTR_DESC_ADDR_HIGH  = 0x2017,
+   VM_FUNCTION_CONTROL = 0x2018,
+   VM_FUNCTION_CONTROL_HIGH= 0x2019,
EPT_POINTER = 0x201a,
EPT_POINTER_HIGH= 0x201b,
EOI_EXIT_BITMAP0= 0x201c,
@@ -202,6 +204,8 @@ enum vmcs_field {
EOI_EXIT_BITMAP2_HIGH   = 0x2021,
EOI_EXIT_BITMAP3= 0x2022,
EOI_EXIT_BITMAP3_HIGH   = 0x2023,
+   EPTP_LIST_ADDRESS   = 0x2024,
+   EPTP_LIST_ADDRESS_HIGH  = 0x2025,
VMREAD_BITMAP   = 0x2026,
VMWRITE_BITMAP  = 0x2028,
XSS_EXIT_BITMAP = 0x202C,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 75049c0..bf06bef 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -767,11 +767,13 @@ static const unsigned short vmcs_field_to_offset_table[] 
= {
FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr),
+   FIELD64(VM_FUNCTION_CONTROL, vm_function_control),
FIELD64(EPT_POINTER, ept_pointer),
FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0),
FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
FIELD64(EOI_EXIT_BITMAP2, eoi_exit_bitmap2),
FIELD64(EOI_EXIT_BITMAP3, eoi_exit_bitmap3),
+   FIELD64(EPTP_LIST_ADDRESS, eptp_list_address),
FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap),
FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
@@ -2806,6 +2808,13 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
} else
vmx->nested.nested_vmx_ept_caps = 0;
 
+   if (cpu_has_vmx_vmfunc()) {
+   vmx->nested.nested_vmx_secondary_ctls_high |=
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+   vmx->nested.nested_vmx_vmfunc_controls =
+   vmx_vmfunc_controls() & 1;
+   }
+
/*
 * Old versions of KVM use the single-context version without
 * checking for support, so declare that it is supported even
@@ -8215,6 +8224,8 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
case EXIT_REASON_PML_FULL:
/* We emulate PML support to L1. */
return false;
+   case EXIT_REASON_VMFUNC:
+   return false;
default:
return true;
}
@@ -10309,6 +10320,13 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12,
vmx_flush_tlb_ept_only(vcpu);
}
 
+   if (nested_cpu_has_eptp_switching(vmcs12)) {
+   vmcs_write64(VM_FUNCTION_CONTROL,
+vmcs12->vm_function_control & 1);
+   vmcs_write64(EPTP_LIST_ADDRESS,
+page_to_phys(vmx->nested.shadow_eptp_list));
+   }
+
/*
 * This sets GUEST_CR0 to vmcs12->guest_cr0, possibly modifying those
 * bits which we consider mandatory enabled.
-- 
2.9.4



[PATCH 1/2] KVM: nVMX: Implement EPTP switching for the L1 hypervisor

2017-06-29 Thread Bandan Das
This is a mix of emulation/passthrough to implement EPTP
switching for the nested hypervisor.

If the shadow EPT structures are absent, a vmexit occurs with reason 59.
L0 can then create shadow structures based on the entry the guest
called with, obtain a new root_hpa, write it to the shadow list and
subsequently reload the mmu to resume L2.
On the next vmfunc(0, index), however, the processor will load the
entry without an exit.

Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/vmx.h |   5 +++
 arch/x86/kvm/vmx.c | 104 +
 2 files changed, 109 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 35cd06f..e06783e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -72,6 +72,7 @@
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_RDRAND  0x0800
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
+#define SECONDARY_EXEC_ENABLE_VMFUNC0x2000
 #define SECONDARY_EXEC_SHADOW_VMCS  0x4000
 #define SECONDARY_EXEC_RDSEED  0x0001
 #define SECONDARY_EXEC_ENABLE_PML   0x0002
@@ -114,6 +115,10 @@
 #define VMX_MISC_SAVE_EFER_LMA 0x0020
 #define VMX_MISC_ACTIVITY_HLT  0x0040
 
+/* VMFUNC functions */
+#define VMX_VMFUNC_EPTP_SWITCHING   0x0001
+#define VMFUNC_EPTP_ENTRIES  512
+
 static inline u32 vmx_basic_vmcs_revision_id(u64 vmx_basic)
 {
return vmx_basic & GENMASK_ULL(30, 0);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ca5d2b9..75049c0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -240,11 +240,13 @@ struct __packed vmcs12 {
u64 virtual_apic_page_addr;
u64 apic_access_addr;
u64 posted_intr_desc_addr;
+   u64 vm_function_control;
u64 ept_pointer;
u64 eoi_exit_bitmap0;
u64 eoi_exit_bitmap1;
u64 eoi_exit_bitmap2;
u64 eoi_exit_bitmap3;
+   u64 eptp_list_address;
u64 xss_exit_bitmap;
u64 guest_physical_address;
u64 vmcs_link_pointer;
@@ -441,6 +443,7 @@ struct nested_vmx {
struct page *apic_access_page;
struct page *virtual_apic_page;
struct page *pi_desc_page;
+   struct page *shadow_eptp_list;
struct pi_desc *pi_desc;
bool pi_pending;
u16 posted_intr_nv;
@@ -481,6 +484,7 @@ struct nested_vmx {
u64 nested_vmx_cr4_fixed0;
u64 nested_vmx_cr4_fixed1;
u64 nested_vmx_vmcs_enum;
+   u64 nested_vmx_vmfunc_controls;
 };
 
 #define POSTED_INTR_ON  0
@@ -1314,6 +1318,22 @@ static inline bool cpu_has_vmx_tsc_scaling(void)
SECONDARY_EXEC_TSC_SCALING;
 }
 
+static inline bool cpu_has_vmx_vmfunc(void)
+{
+   return vmcs_config.cpu_based_exec_ctrl &
+   SECONDARY_EXEC_ENABLE_VMFUNC;
+}
+
+static inline u64 vmx_vmfunc_controls(void)
+{
+   u64 controls = 0;
+
+   if (cpu_has_vmx_vmfunc())
+   rdmsrl(MSR_IA32_VMX_VMFUNC, controls);
+
+   return controls;
+}
+
 static inline bool report_flexpriority(void)
 {
return flexpriority_enabled;
@@ -1388,6 +1408,18 @@ static inline bool nested_cpu_has_posted_intr(struct 
vmcs12 *vmcs12)
return vmcs12->pin_based_vm_exec_control & PIN_BASED_POSTED_INTR;
 }
 
+static inline bool nested_cpu_has_vmfunc(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VMFUNC);
+}
+
+static inline bool nested_cpu_has_eptp_switching(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has_vmfunc(vmcs12) &&
+   (vmcs12->vm_function_control &
+VMX_VMFUNC_EPTP_SWITCHING);
+}
+
 static inline bool is_nmi(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -3143,6 +3175,9 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 
msr_index, u64 *pdata)
*pdata = vmx->nested.nested_vmx_ept_caps |
((u64)vmx->nested.nested_vmx_vpid_caps << 32);
break;
+   case MSR_IA32_VMX_VMFUNC:
+   *pdata = vmx->nested.nested_vmx_vmfunc_controls;
+   break;
default:
return 1;
}
@@ -6959,6 +6994,14 @@ static int enter_vmx_operation(struct kvm_vcpu *vcpu)
vmx->vmcs01.shadow_vmcs = shadow_vmcs;
}
 
+   if (vmx_vmfunc_controls() & 1) {
+   struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+
+   if (!page)
+   goto out_shadow_vmcs;
+   vmx->nested.shadow_eptp_list = page;
+   }
+
INIT_LIST_HEAD(&(vmx->nested.vmcs02_pool));
vmx->nested.vmcs02_num = 0;
 
@@ -7128,6 +7171,11 @@ static void free_nested(struct vcpu_vmx *vmx)
vmx->vmcs01.shadow_vmcs = NUL

Re: [RFC 11/55] KVM: arm64: Emulate taking an exception to the guest hypervisor

2017-06-07 Thread Bandan Das
Jintack Lim  writes:

> Compilation on 32bit arm architecture will fail without them.
...
>> It seems these functions are
>> defined separately in 32/64 bit specific header files. Or is it that
>> 64 bit compilation also depends on the 32 bit header file ?
>
> It's only for 32bit architecture. For example, kvm_inject_nested_irq()
> is called in virt/kvm/arm/vgic/vgic.c which is shared between 32 and
> 64 bit.

Ah, that's the catch! Thanks for clearing this up!

>>
>> Bandan
>>
>>> Thanks,
>>> Jintack
>>>

 Bandan

>  static inline void kvm_arm_setup_shadow_state(struct kvm_vcpu *vcpu) { };
>  static inline void kvm_arm_restore_shadow_state(struct kvm_vcpu *vcpu) { 
> };
>  static inline void kvm_arm_init_cpu_context(kvm_cpu_context_t *cpu_ctxt) 
> { };
> diff --git a/arch/arm64/include/asm/kvm_emulate.h 
> b/arch/arm64/include/asm/kvm_emulate.h
> index 8892c82..0987ee4 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -42,6 +42,25 @@
>  void kvm_inject_dabt(struct kvm_vcpu *vcpu, unsigned long addr);
>  void kvm_inject_pabt(struct kvm_vcpu *vcpu, unsigned long addr);
>
> +#ifdef CONFIG_KVM_ARM_NESTED_HYP
> +int kvm_inject_nested_sync(struct kvm_vcpu *vcpu, u64 esr_el2);
> +int kvm_inject_nested_irq(struct kvm_vcpu *vcpu);
> +#else
> +static inline int kvm_inject_nested_sync(struct kvm_vcpu *vcpu, u64 
> esr_el2)
> +{
> + kvm_err("Unexpected call to %s for the non-nesting configuration\n",
> +  __func__);
> + return -EINVAL;
> +}
> +
> +static inline int kvm_inject_nested_irq(struct kvm_vcpu *vcpu)
> +{
> + kvm_err("Unexpected call to %s for the non-nesting configuration\n",
> +  __func__);
> + return -EINVAL;
> +}
> +#endif
> +
>  void kvm_arm_setup_shadow_state(struct kvm_vcpu *vcpu);
>  void kvm_arm_restore_shadow_state(struct kvm_vcpu *vcpu);
>  void kvm_arm_init_cpu_context(kvm_cpu_context_t *cpu_ctxt);
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 7811d27..b342bdd 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -34,3 +34,5 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic/vgic-its.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/irqchip.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arch_timer.o
>  kvm-$(CONFIG_KVM_ARM_PMU) += $(KVM)/arm/pmu.o
> +
> +kvm-$(CONFIG_KVM_ARM_NESTED_HYP) += emulate-nested.o
> diff --git a/arch/arm64/kvm/emulate-nested.c 
> b/arch/arm64/kvm/emulate-nested.c
> new file mode 100644
> index 000..59d147f
> --- /dev/null
> +++ b/arch/arm64/kvm/emulate-nested.c
> @@ -0,0 +1,66 @@
> +/*
> + * Copyright (C) 2016 - Columbia University
> + * Author: Jintack Lim 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +
> +#include 
> +#include 
> +
> +#include 
> +
> +#include "trace.h"
> +
> +#define  EL2_EXCEPT_SYNC_OFFSET  0x400
> +#define  EL2_EXCEPT_ASYNC_OFFSET 0x480
> +
> +
> +/*
> + *  Emulate taking an exception. See ARM ARM J8.1.2 
> AArch64.TakeException()
> + */
> +static int kvm_inject_nested(struct kvm_vcpu *vcpu, u64 esr_el2,
> +  int exception_offset)
> +{
> + int ret = 1;
> + kvm_cpu_context_t *ctxt = &vcpu->arch.ctxt;
> +
> + /* We don't inject an exception recursively to virtual EL2 */
> + if (vcpu_mode_el2(vcpu))
> + BUG();
> +
> + ctxt->el2_regs[SPSR_EL2] = *vcpu_cpsr(vcpu);
> + ctxt->el2_regs[ELR_EL2] = *vcpu_pc(vcpu);
> + ctxt->el2_regs[ESR_EL2] = esr_el2;
> +
> + /* On an exception, PSTATE.SP = 1 */
> + *vcpu_cpsr(vcpu) = PSR_MODE_EL2h;
> + *vcpu_cpsr(vcpu) |=  (PSR_A_BIT | PSR_F_BIT | PSR_I_BIT | 
> PSR_D_BIT);
> + *vcpu_pc(vcpu) = ctxt->el2_regs[VBAR_EL2] + exception_offset;
> +
> + trace_kvm_inject_nested_exception(vcpu, esr_el2, *vcpu_pc(vcpu));
> +
> + return ret;
> +}
> +
> +int kvm_inject_nested_sync(struct kvm_vcpu *vcpu, u64 esr_el2)
> +{
> + return kvm_inject_nested(vcpu, esr_el2, EL2_EXCEPT_

Re: [RFC 11/55] KVM: arm64: Emulate taking an exception to the guest hypervisor

2017-06-06 Thread Bandan Das
Hi Jintack,

Jintack Lim  writes:

> Hi Bandan,
>
> On Tue, Jun 6, 2017 at 4:21 PM, Bandan Das  wrote:
>> Jintack Lim  writes:
>>
>>> Emulate taking an exception to the guest hypervisor running in the
>>> virtual EL2 as described in ARM ARM AArch64.TakeException().
>>
>> ARM newbie here, I keep thinking of ARM ARM as a typo ;)
>
> ARM ARM means ARM Architecture Reference Manual :)
>
>> ...
>>> +static inline int kvm_inject_nested_sync(struct kvm_vcpu *vcpu, u64 
>>> esr_el2)
>>> +{
>>> + kvm_err("Unexpected call to %s for the non-nesting configuration\n",
>>> +  __func__);
>>> + return -EINVAL;
>>> +}
>>> +
>>> +static inline int kvm_inject_nested_irq(struct kvm_vcpu *vcpu)
>>> +{
>>> + kvm_err("Unexpected call to %s for the non-nesting configuration\n",
>>> +  __func__);
>>> + return -EINVAL;
>>> +}
>>> +
>>
>> I see these function stubs for aarch32 in the patches. I don't see how they
>> can actually be called though. Is this because eventually, there will be
>> a virtual el2 mode for aarch32 ?
>
> Current RFC doesn't support nested virtualization on 32bit arm
> architecture and those functions will never be called. Those functions
> are there for the compilation.

Do you mean that compilation will fail ? It seems these functions are
defined separately in 32/64 bit specific header files. Or is it that
64 bit compilation also depends on the 32 bit header file ?

Bandan

> Thanks,
> Jintack
>
>>
>> Bandan
>>
>>>  static inline void kvm_arm_setup_shadow_state(struct kvm_vcpu *vcpu) { };
>>>  static inline void kvm_arm_restore_shadow_state(struct kvm_vcpu *vcpu) { };
>>>  static inline void kvm_arm_init_cpu_context(kvm_cpu_context_t *cpu_ctxt) { 
>>> };
>>> diff --git a/arch/arm64/include/asm/kvm_emulate.h 
>>> b/arch/arm64/include/asm/kvm_emulate.h
>>> index 8892c82..0987ee4 100644
>>> --- a/arch/arm64/include/asm/kvm_emulate.h
>>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>>> @@ -42,6 +42,25 @@
>>>  void kvm_inject_dabt(struct kvm_vcpu *vcpu, unsigned long addr);
>>>  void kvm_inject_pabt(struct kvm_vcpu *vcpu, unsigned long addr);
>>>
>>> +#ifdef CONFIG_KVM_ARM_NESTED_HYP
>>> +int kvm_inject_nested_sync(struct kvm_vcpu *vcpu, u64 esr_el2);
>>> +int kvm_inject_nested_irq(struct kvm_vcpu *vcpu);
>>> +#else
>>> +static inline int kvm_inject_nested_sync(struct kvm_vcpu *vcpu, u64 
>>> esr_el2)
>>> +{
>>> + kvm_err("Unexpected call to %s for the non-nesting configuration\n",
>>> +  __func__);
>>> + return -EINVAL;
>>> +}
>>> +
>>> +static inline int kvm_inject_nested_irq(struct kvm_vcpu *vcpu)
>>> +{
>>> + kvm_err("Unexpected call to %s for the non-nesting configuration\n",
>>> +  __func__);
>>> + return -EINVAL;
>>> +}
>>> +#endif
>>> +
>>>  void kvm_arm_setup_shadow_state(struct kvm_vcpu *vcpu);
>>>  void kvm_arm_restore_shadow_state(struct kvm_vcpu *vcpu);
>>>  void kvm_arm_init_cpu_context(kvm_cpu_context_t *cpu_ctxt);
>>> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
>>> index 7811d27..b342bdd 100644
>>> --- a/arch/arm64/kvm/Makefile
>>> +++ b/arch/arm64/kvm/Makefile
>>> @@ -34,3 +34,5 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic/vgic-its.o
>>>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/irqchip.o
>>>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arch_timer.o
>>>  kvm-$(CONFIG_KVM_ARM_PMU) += $(KVM)/arm/pmu.o
>>> +
>>> +kvm-$(CONFIG_KVM_ARM_NESTED_HYP) += emulate-nested.o
>>> diff --git a/arch/arm64/kvm/emulate-nested.c 
>>> b/arch/arm64/kvm/emulate-nested.c
>>> new file mode 100644
>>> index 000..59d147f
>>> --- /dev/null
>>> +++ b/arch/arm64/kvm/emulate-nested.c
>>> @@ -0,0 +1,66 @@
>>> +/*
>>> + * Copyright (C) 2016 - Columbia University
>>> + * Author: Jintack Lim 
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License version 2 as
>>> + * published by the Free Software Foundation.
>>> + *
>>> + * This program is distributed in the hope that it will be useful,
>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> + * MERCH

Re: [RFC 11/55] KVM: arm64: Emulate taking an exception to the guest hypervisor

2017-06-06 Thread Bandan Das
Jintack Lim  writes:

> Emulate taking an exception to the guest hypervisor running in the
> virtual EL2 as described in ARM ARM AArch64.TakeException().

ARM newbie here, I keep thinking of ARM ARM as a typo ;)
...
> +static inline int kvm_inject_nested_sync(struct kvm_vcpu *vcpu, u64 esr_el2)
> +{
> + kvm_err("Unexpected call to %s for the non-nesting configuration\n",
> +  __func__);
> + return -EINVAL;
> +}
> +
> +static inline int kvm_inject_nested_irq(struct kvm_vcpu *vcpu)
> +{
> + kvm_err("Unexpected call to %s for the non-nesting configuration\n",
> +  __func__);
> + return -EINVAL;
> +}
> +

I see these function stubs for aarch32 in the patches. I don't see how they
can actually be called though. Is this because eventually, there will be
a virtual el2 mode for aarch32 ?

Bandan

>  static inline void kvm_arm_setup_shadow_state(struct kvm_vcpu *vcpu) { };
>  static inline void kvm_arm_restore_shadow_state(struct kvm_vcpu *vcpu) { };
>  static inline void kvm_arm_init_cpu_context(kvm_cpu_context_t *cpu_ctxt) { };
> diff --git a/arch/arm64/include/asm/kvm_emulate.h 
> b/arch/arm64/include/asm/kvm_emulate.h
> index 8892c82..0987ee4 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -42,6 +42,25 @@
>  void kvm_inject_dabt(struct kvm_vcpu *vcpu, unsigned long addr);
>  void kvm_inject_pabt(struct kvm_vcpu *vcpu, unsigned long addr);
>  
> +#ifdef CONFIG_KVM_ARM_NESTED_HYP
> +int kvm_inject_nested_sync(struct kvm_vcpu *vcpu, u64 esr_el2);
> +int kvm_inject_nested_irq(struct kvm_vcpu *vcpu);
> +#else
> +static inline int kvm_inject_nested_sync(struct kvm_vcpu *vcpu, u64 esr_el2)
> +{
> + kvm_err("Unexpected call to %s for the non-nesting configuration\n",
> +  __func__);
> + return -EINVAL;
> +}
> +
> +static inline int kvm_inject_nested_irq(struct kvm_vcpu *vcpu)
> +{
> + kvm_err("Unexpected call to %s for the non-nesting configuration\n",
> +  __func__);
> + return -EINVAL;
> +}
> +#endif
> +
>  void kvm_arm_setup_shadow_state(struct kvm_vcpu *vcpu);
>  void kvm_arm_restore_shadow_state(struct kvm_vcpu *vcpu);
>  void kvm_arm_init_cpu_context(kvm_cpu_context_t *cpu_ctxt);
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 7811d27..b342bdd 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -34,3 +34,5 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic/vgic-its.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/irqchip.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arch_timer.o
>  kvm-$(CONFIG_KVM_ARM_PMU) += $(KVM)/arm/pmu.o
> +
> +kvm-$(CONFIG_KVM_ARM_NESTED_HYP) += emulate-nested.o
> diff --git a/arch/arm64/kvm/emulate-nested.c b/arch/arm64/kvm/emulate-nested.c
> new file mode 100644
> index 000..59d147f
> --- /dev/null
> +++ b/arch/arm64/kvm/emulate-nested.c
> @@ -0,0 +1,66 @@
> +/*
> + * Copyright (C) 2016 - Columbia University
> + * Author: Jintack Lim 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +
> +#include 
> +#include 
> +
> +#include 
> +
> +#include "trace.h"
> +
> +#define  EL2_EXCEPT_SYNC_OFFSET  0x400
> +#define  EL2_EXCEPT_ASYNC_OFFSET 0x480
> +
> +
> +/*
> + *  Emulate taking an exception. See ARM ARM J8.1.2 AArch64.TakeException()
> + */
> +static int kvm_inject_nested(struct kvm_vcpu *vcpu, u64 esr_el2,
> +  int exception_offset)
> +{
> + int ret = 1;
> + kvm_cpu_context_t *ctxt = &vcpu->arch.ctxt;
> +
> + /* We don't inject an exception recursively to virtual EL2 */
> + if (vcpu_mode_el2(vcpu))
> + BUG();
> +
> + ctxt->el2_regs[SPSR_EL2] = *vcpu_cpsr(vcpu);
> + ctxt->el2_regs[ELR_EL2] = *vcpu_pc(vcpu);
> + ctxt->el2_regs[ESR_EL2] = esr_el2;
> +
> + /* On an exception, PSTATE.SP = 1 */
> + *vcpu_cpsr(vcpu) = PSR_MODE_EL2h;
> + *vcpu_cpsr(vcpu) |=  (PSR_A_BIT | PSR_F_BIT | PSR_I_BIT | PSR_D_BIT);
> + *vcpu_pc(vcpu) = ctxt->el2_regs[VBAR_EL2] + exception_offset;
> +
> + trace_kvm_inject_nested_exception(vcpu, esr_el2, *vcpu_pc(vcpu));
> +
> + return ret;
> +}
> +
> +int kvm_inject_nested_sync(struct kvm_vcpu *vcpu, u64 esr_el2)
> +{
> + return kvm_inject_nested(vcpu, esr_el2, EL2_EXCEPT_SYNC_OFFSET);
> +}
> +
> +int kvm_inject_nested_irq(struct kvm_vcpu *vcpu)
> +{
> + u64 esr_el2 = kvm_vcpu_get_hsr(vcpu);
> + /* We supports only IRQ and FI

Re: [RFC 10/55] KVM: arm64: Synchronize EL1 system registers on virtual EL2 entry and exit

2017-06-06 Thread Bandan Das
Jintack Lim  writes:

> From: Christoffer Dall 
>
> When running in virtual EL2 we use the shadow EL1 system register array
> for the save/restore process, so that hardware and especially the memory
> subsystem behaves as code written for EL2 expects while really running
> in EL1.
>
> This works great for EL1 system register accesses that we trap, because
> these accesses will be written into the virtual state for the EL1 system
> registers used when eventually switching the VCPU mode to EL1.
>
> However, there was a collection of EL1 system registers which we do not
> trap, and as a consequence all save/restore operations of these
> registers were happening locally in the shadow array, with no benefit to
> software actually running in virtual EL1 at all.
>
> To fix this, simply synchronize the shadow and real EL1 state for these
> registers on entry/exit to/from virtual EL2 state.
>
> Signed-off-by: Christoffer Dall 
> Signed-off-by: Jintack Lim 
> ---
>  arch/arm64/kvm/context.c | 47 +++
>  1 file changed, 47 insertions(+)
>
> diff --git a/arch/arm64/kvm/context.c b/arch/arm64/kvm/context.c
> index 2e9e386..0025dd9 100644
> --- a/arch/arm64/kvm/context.c
> +++ b/arch/arm64/kvm/context.c
> @@ -88,6 +88,51 @@ static void create_shadow_el1_sysregs(struct kvm_vcpu 
> *vcpu)
>   s_sys_regs[CPACR_EL1] = cptr_el2_to_cpacr_el1(el2_regs[CPTR_EL2]);
>  }
>  
> +/*
> + * List of EL1 registers which we allow the virtual EL2 mode to access
> + * directly without trapping and which haven't been paravirtualized.
> + *
> + * Probably CNTKCTL_EL1 should not be copied but be accessed via trap. 
> Because,
> + * the guest hypervisor running in EL1 can be affected by event streams
> + * configured via CNTKCTL_EL1, which it does not expect. We don't have a
> + * mechanism to trap on CNTKCTL_EL1 as of now (v8.3), keep it in here 
> instead.
> + */
> +static const int el1_non_trap_regs[] = {
> + CNTKCTL_EL1,
> + CSSELR_EL1,
> + PAR_EL1,
> + TPIDR_EL0,
> + TPIDR_EL1,
> + TPIDRRO_EL0
> +};
> +

Do we trap on all register accesses in the non-nested case, plus
all accesses to the memory access registers? I am trying to
understand how we decide which registers to trap on. For example,
shouldn't accesses to CSSELR_EL1, the cache size selection register,
be trapped?

Bandan


> +/**
> + * sync_shadow_el1_state - Going to/from the virtual EL2 state, sync state
> + * @vcpu:The VCPU pointer
> + * @setup:   True, if on the way to the guest (called from setup)
> + *   False, if returning from the guest (called from restore)
> + *
> + * Some EL1 registers are accessed directly by the virtual EL2 mode because
> + * they in no way affect execution state in virtual EL2.   However, we must
> + * still ensure that virtual EL2 observes the same state of the EL1 registers
> + * as the normal VM's EL1 mode, so copy this state as needed on 
> setup/restore.
> + */
> +static void sync_shadow_el1_state(struct kvm_vcpu *vcpu, bool setup)
> +{
> + u64 *sys_regs = vcpu->arch.ctxt.sys_regs;
> + u64 *s_sys_regs = vcpu->arch.ctxt.shadow_sys_regs;
> + int i;
> +
> + for (i = 0; i < ARRAY_SIZE(el1_non_trap_regs); i++) {
> + const int sr = el1_non_trap_regs[i];
> +
> + if (setup)
> + s_sys_regs[sr] = sys_regs[sr];
> + else
> + sys_regs[sr] = s_sys_regs[sr];
> + }
> +}
> +
>  /**
>   * kvm_arm_setup_shadow_state -- prepare shadow state based on emulated mode
>   * @vcpu: The VCPU pointer
> @@ -107,6 +152,7 @@ void kvm_arm_setup_shadow_state(struct kvm_vcpu *vcpu)
>   else
>   ctxt->hw_pstate |= PSR_MODE_EL1t;
>  
> + sync_shadow_el1_state(vcpu, true);
>   create_shadow_el1_sysregs(vcpu);
>   ctxt->hw_sys_regs = ctxt->shadow_sys_regs;
>   ctxt->hw_sp_el1 = ctxt->el2_regs[SP_EL2];
> @@ -125,6 +171,7 @@ void kvm_arm_restore_shadow_state(struct kvm_vcpu *vcpu)
>  {
>   struct kvm_cpu_context *ctxt = &vcpu->arch.ctxt;
>   if (unlikely(vcpu_mode_el2(vcpu))) {
> + sync_shadow_el1_state(vcpu, false);
>   *vcpu_cpsr(vcpu) &= PSR_MODE_MASK;
>   *vcpu_cpsr(vcpu) |= ctxt->hw_pstate & ~PSR_MODE_MASK;
>   ctxt->el2_regs[SP_EL2] = ctxt->hw_sp_el1;


Re: [RFC 07/55] KVM: arm/arm64: Add virtual EL2 state emulation framework

2017-06-02 Thread Bandan Das
Christoffer Dall  writes:

> On Fri, Jun 02, 2017 at 01:36:23PM -0400, Bandan Das wrote:
>> Christoffer Dall  writes:
>> 
>> > On Thu, Jun 01, 2017 at 04:05:49PM -0400, Bandan Das wrote:
>> >> Jintack Lim  writes:
>> >> ...
>> >> > +/**
>> >> > + * kvm_arm_setup_shadow_state -- prepare shadow state based on 
>> >> > emulated mode
>> >> > + * @vcpu: The VCPU pointer
>> >> > + */
>> >> > +void kvm_arm_setup_shadow_state(struct kvm_vcpu *vcpu)
>> >> > +{
>> >> > +   struct kvm_cpu_context *ctxt = &vcpu->arch.ctxt;
>> >> > +
>> >> > +   ctxt->hw_pstate = *vcpu_cpsr(vcpu);
>> >> > +   ctxt->hw_sys_regs = ctxt->sys_regs;
>> >> > +   ctxt->hw_sp_el1 = ctxt->gp_regs.sp_el1;
>> >> > +}
>> >> > +
>> >> > +/**
>> >> > + * kvm_arm_restore_shadow_state -- write back shadow state from guest
>> >> > + * @vcpu: The VCPU pointer
>> >> > + */
>> >> > +void kvm_arm_restore_shadow_state(struct kvm_vcpu *vcpu)
>> >> > +{
>> >> > +   struct kvm_cpu_context *ctxt = &vcpu->arch.ctxt;
>> >> > +
>> >> > +   *vcpu_cpsr(vcpu) = ctxt->hw_pstate;
>> >> > +   ctxt->gp_regs.sp_el1 = ctxt->hw_sp_el1;
>> >> > +}
>> >> > +
>> >> > +void kvm_arm_init_cpu_context(kvm_cpu_context_t *cpu_ctxt)
>> >> > +{
>> >> > +   cpu_ctxt->hw_sys_regs = &cpu_ctxt->sys_regs[0];
>> >> > +}
>> >> 
>> >> 
>> >> IIUC, the *_shadow_state() functions will set hw_* pointers to
>> >> either point to the "real" state or the shadow state to manage L2 ?
>> >> Maybe, it might make sense to make these function names a little more
>> >> generic since they are not dealing with setting the shadow state
>> >> alone.
>> >> 
>> >
>> > The notion of 'shadow state' is borrowed from shadow page tables, in
>> > which you always load some 'shadow copy' of the 'real value' into the
>> > hardware, so the shadow state is the one that's used for execution by
>> > the hardware.
>> >
>> > The shadow state may be the same as the VCPU's EL1 state, for example,
>> > or it may be a modified version of the VCPU's EL2 state, for example.
>> 
>> Yes, it can be the same. Although, as you said above, "shadow" conventionally
>> refers to the latter.
>
> That's not what I said.  I said shadow is the thing you use in the
> hardware, which may be the same, and may be something different.  The
> important point being, that it is what gets used by the hardware, and
> that it's decoupled, not necessarily different, from the virtual
> state.

I was referring to your first paragraph. And conventionally, in the context of
shadow page tables, it is always different.

>> When it's pointing to EL1 state, it's not really
>> shadow state anymore.
>> 
>
> You can argue it both ways, in the end, all that's important is whether
> or not it's clear what the functions do.
>
>> > If you have better suggestions for naming, we're open to that though.
>> >
>> 
>> Oh nothing specifically, I just felt like "shadow" in the function name
>> could be confusing. Borrowing from kvm_arm_init_cpu_context(), 
>> how about kvm_arm_setup/restore_cpu_context()  ?
>
> I have no objection to these names.
>
>> 
>> BTW, on a separate note, we might as well get away with the typedef and
>> call struct kvm_cpu_context directly.
>> 
> I don't think it's worth changing the code just for that, but if you
> feel it's a significant cleanup, you can send a patch with a good
> argument for why it's worth changing in the commit message.

Sure! The cleanup is not part of the series but sticking to either one
of them in this patch is. As for the argument, typedefs for structs are
discouraged as part of the coding style.

> Thanks,
> -Christoffer


Re: [RFC 07/55] KVM: arm/arm64: Add virtual EL2 state emulation framework

2017-06-02 Thread Bandan Das
Christoffer Dall  writes:

> On Thu, Jun 01, 2017 at 04:05:49PM -0400, Bandan Das wrote:
>> Jintack Lim  writes:
>> ...
>> > +/**
>> > + * kvm_arm_setup_shadow_state -- prepare shadow state based on emulated 
>> > mode
>> > + * @vcpu: The VCPU pointer
>> > + */
>> > +void kvm_arm_setup_shadow_state(struct kvm_vcpu *vcpu)
>> > +{
>> > +  struct kvm_cpu_context *ctxt = &vcpu->arch.ctxt;
>> > +
>> > +  ctxt->hw_pstate = *vcpu_cpsr(vcpu);
>> > +  ctxt->hw_sys_regs = ctxt->sys_regs;
>> > +  ctxt->hw_sp_el1 = ctxt->gp_regs.sp_el1;
>> > +}
>> > +
>> > +/**
>> > + * kvm_arm_restore_shadow_state -- write back shadow state from guest
>> > + * @vcpu: The VCPU pointer
>> > + */
>> > +void kvm_arm_restore_shadow_state(struct kvm_vcpu *vcpu)
>> > +{
>> > +  struct kvm_cpu_context *ctxt = &vcpu->arch.ctxt;
>> > +
>> > +  *vcpu_cpsr(vcpu) = ctxt->hw_pstate;
>> > +  ctxt->gp_regs.sp_el1 = ctxt->hw_sp_el1;
>> > +}
>> > +
>> > +void kvm_arm_init_cpu_context(kvm_cpu_context_t *cpu_ctxt)
>> > +{
>> > +  cpu_ctxt->hw_sys_regs = &cpu_ctxt->sys_regs[0];
>> > +}
>> 
>> 
>> IIUC, the *_shadow_state() functions will set hw_* pointers to
>> either point to the "real" state or the shadow state to manage L2 ?
>> Maybe, it might make sense to make these function names a little more
>> generic since they are not dealing with setting the shadow state
>> alone.
>> 
>
> The notion of 'shadow state' is borrowed from shadow page tables, in
> which you always load some 'shadow copy' of the 'real value' into the
> hardware, so the shadow state is the one that's used for execution by
> the hardware.
>
> The shadow state may be the same as the VCPU's EL1 state, for example,
> or it may be a modified version of the VCPU's EL2 state, for example.

Yes, it can be the same. Although, as you said above, "shadow" conventionally
refers to the latter. When it's pointing to EL1 state, it's not really
shadow state anymore.

> If you have better suggestions for naming, we're open to that though.
>

Oh nothing specifically, I just felt like "shadow" in the function name
could be confusing. Borrowing from kvm_arm_init_cpu_context(), 
how about kvm_arm_setup/restore_cpu_context()  ?

BTW, on a separate note, we might as well get away with the typedef and
call struct kvm_cpu_context directly.

> Thanks,
> -Christoffer


Re: [RFC 08/55] KVM: arm64: Set virtual EL2 context depending on the guest exception level

2017-06-01 Thread Bandan Das
Jintack Lim  writes:

> From: Christoffer Dall 
>
> Set up virtual EL2 context to hardware if the guest exception level is
> EL2.
>
> Signed-off-by: Christoffer Dall 
> Signed-off-by: Jintack Lim 
> ---
>  arch/arm64/kvm/context.c | 32 ++--
>  1 file changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/kvm/context.c b/arch/arm64/kvm/context.c
> index 320afc6..acb4b1e 100644
> --- a/arch/arm64/kvm/context.c
> +++ b/arch/arm64/kvm/context.c
> @@ -25,10 +25,25 @@
>  void kvm_arm_setup_shadow_state(struct kvm_vcpu *vcpu)
>  {
>   struct kvm_cpu_context *ctxt = &vcpu->arch.ctxt;
> + if (unlikely(vcpu_mode_el2(vcpu))) {
> + ctxt->hw_pstate = *vcpu_cpsr(vcpu) & ~PSR_MODE_MASK;
>  
> - ctxt->hw_pstate = *vcpu_cpsr(vcpu);
> - ctxt->hw_sys_regs = ctxt->sys_regs;
> - ctxt->hw_sp_el1 = ctxt->gp_regs.sp_el1;
> + /*
> +  * We emulate virtual EL2 mode in hardware EL1 mode using the
> +  * same stack pointer mode as the guest expects.
> +  */
> + if ((*vcpu_cpsr(vcpu) & PSR_MODE_MASK) == PSR_MODE_EL2h)
> + ctxt->hw_pstate |= PSR_MODE_EL1h;
> + else
> + ctxt->hw_pstate |= PSR_MODE_EL1t;
> +

I see vcpu_mode_el2() does
return mode == PSR_MODE_EL2h || mode == PSR_MODE_EL2t;

I can't seem to find this explained anywhere; what's the difference
between the PSR_MODE_EL2h and PSR_MODE_EL2t modes?

Bandan

> + ctxt->hw_sys_regs = ctxt->shadow_sys_regs;
> + ctxt->hw_sp_el1 = ctxt->el2_regs[SP_EL2];
> + } else {
> + ctxt->hw_pstate = *vcpu_cpsr(vcpu);
> + ctxt->hw_sys_regs = ctxt->sys_regs;
> + ctxt->hw_sp_el1 = ctxt->gp_regs.sp_el1;
> + }
>  }
>  
>  /**
> @@ -38,9 +53,14 @@ void kvm_arm_setup_shadow_state(struct kvm_vcpu *vcpu)
>  void kvm_arm_restore_shadow_state(struct kvm_vcpu *vcpu)
>  {
>   struct kvm_cpu_context *ctxt = &vcpu->arch.ctxt;
> -
> - *vcpu_cpsr(vcpu) = ctxt->hw_pstate;
> - ctxt->gp_regs.sp_el1 = ctxt->hw_sp_el1;
> + if (unlikely(vcpu_mode_el2(vcpu))) {
> + *vcpu_cpsr(vcpu) &= PSR_MODE_MASK;
> + *vcpu_cpsr(vcpu) |= ctxt->hw_pstate & ~PSR_MODE_MASK;
> + ctxt->el2_regs[SP_EL2] = ctxt->hw_sp_el1;
> + } else {
> + *vcpu_cpsr(vcpu) = ctxt->hw_pstate;
> + ctxt->gp_regs.sp_el1 = ctxt->hw_sp_el1;
> + }
>  }
>  
>  void kvm_arm_init_cpu_context(kvm_cpu_context_t *cpu_ctxt)
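
On the EL2h/EL2t question above: in the AArch64 SPSR M[3:0] encodings the 't'
forms run with SP_EL0 selected and the 'h' forms with SP_ELx selected, which
is why the hunk maps virtual EL2h to hardware EL1h and EL2t to EL1t. A small
stand-alone sketch of that mapping, using the architectural M[3:0] values
(everything else here is illustrative, not the kernel code):

#include <stdint.h>
#include <stdio.h>

/* AArch64 SPSR M[3:0]: 't' = SP_EL0 selected, 'h' = SP_ELx selected. */
#define PSR_MODE_EL1t 0x4u
#define PSR_MODE_EL1h 0x5u
#define PSR_MODE_EL2t 0x8u
#define PSR_MODE_EL2h 0x9u
#define PSR_MODE_MASK 0xfu

/*
 * Map a virtual-EL2 pstate onto a hardware EL1 mode while preserving the
 * guest's stack-pointer selection, as the quoted patch does.
 */
static uint64_t hw_pstate_for_virtual_el2(uint64_t guest_cpsr)
{
        uint64_t pstate = guest_cpsr & ~(uint64_t)PSR_MODE_MASK;

        if ((guest_cpsr & PSR_MODE_MASK) == PSR_MODE_EL2h)
                return pstate | PSR_MODE_EL1h;  /* keep using SP_ELx */
        return pstate | PSR_MODE_EL1t;          /* keep using SP_EL0 */
}

int main(void)
{
        printf("EL2h -> %#llx\n",
               (unsigned long long)hw_pstate_for_virtual_el2(PSR_MODE_EL2h));
        printf("EL2t -> %#llx\n",
               (unsigned long long)hw_pstate_for_virtual_el2(PSR_MODE_EL2t));
        return 0;
}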


Re: [RFC 07/55] KVM: arm/arm64: Add virtual EL2 state emulation framework

2017-06-01 Thread Bandan Das
Jintack Lim  writes:
...
> +/**
> + * kvm_arm_setup_shadow_state -- prepare shadow state based on emulated mode
> + * @vcpu: The VCPU pointer
> + */
> +void kvm_arm_setup_shadow_state(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_cpu_context *ctxt = &vcpu->arch.ctxt;
> +
> + ctxt->hw_pstate = *vcpu_cpsr(vcpu);
> + ctxt->hw_sys_regs = ctxt->sys_regs;
> + ctxt->hw_sp_el1 = ctxt->gp_regs.sp_el1;
> +}
> +
> +/**
> + * kvm_arm_restore_shadow_state -- write back shadow state from guest
> + * @vcpu: The VCPU pointer
> + */
> +void kvm_arm_restore_shadow_state(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_cpu_context *ctxt = &vcpu->arch.ctxt;
> +
> + *vcpu_cpsr(vcpu) = ctxt->hw_pstate;
> + ctxt->gp_regs.sp_el1 = ctxt->hw_sp_el1;
> +}
> +
> +void kvm_arm_init_cpu_context(kvm_cpu_context_t *cpu_ctxt)
> +{
> + cpu_ctxt->hw_sys_regs = &cpu_ctxt->sys_regs[0];
> +}


IIUC, the *_shadow_state() functions will set hw_* pointers to
either point to the "real" state or the shadow state to manage L2 ?
Maybe, it might make sense to make these function names a little more
generic since they are not dealing with setting the shadow state
alone.

> diff --git a/arch/arm64/kvm/hyp/sysreg-sr.c b/arch/arm64/kvm/hyp/sysreg-sr.c
> index 9341376..f2a1b32 100644
> --- a/arch/arm64/kvm/hyp/sysreg-sr.c
> +++ b/arch/arm64/kvm/hyp/sysreg-sr.c
> @@ -19,6 +19,7 @@
>  #include 
>  
>  #include 
> +#include 
>  #include 
>  
>  /* Yes, this does nothing, on purpose */
> @@ -33,37 +34,41 @@ static void __hyp_text __sysreg_do_nothing(struct 
> kvm_cpu_context *ctxt) { }
>  
>  static void __hyp_text __sysreg_save_common_state(struct kvm_cpu_context 
> *ctxt)
>  {
> - ctxt->sys_regs[ACTLR_EL1]   = read_sysreg(actlr_el1);
> - ctxt->sys_regs[TPIDR_EL0]   = read_sysreg(tpidr_el0);
> - ctxt->sys_regs[TPIDRRO_EL0] = read_sysreg(tpidrro_el0);
> - ctxt->sys_regs[TPIDR_EL1]   = read_sysreg(tpidr_el1);
> - ctxt->sys_regs[MDSCR_EL1]   = read_sysreg(mdscr_el1);
> + u64 *sys_regs = kern_hyp_va(ctxt->hw_sys_regs);
> +
> + sys_regs[ACTLR_EL1] = read_sysreg(actlr_el1);
> + sys_regs[TPIDR_EL0] = read_sysreg(tpidr_el0);
> + sys_regs[TPIDRRO_EL0]   = read_sysreg(tpidrro_el0);
> + sys_regs[TPIDR_EL1] = read_sysreg(tpidr_el1);
> + sys_regs[MDSCR_EL1] = read_sysreg(mdscr_el1);
>   ctxt->gp_regs.regs.sp   = read_sysreg(sp_el0);
>   ctxt->gp_regs.regs.pc   = read_sysreg_el2(elr);
> - ctxt->gp_regs.regs.pstate   = read_sysreg_el2(spsr);
> + ctxt->hw_pstate = read_sysreg_el2(spsr);
>  }
>  
>  static void __hyp_text __sysreg_save_state(struct kvm_cpu_context *ctxt)
>  {
> - ctxt->sys_regs[MPIDR_EL1]   = read_sysreg(vmpidr_el2);
> - ctxt->sys_regs[CSSELR_EL1]  = read_sysreg(csselr_el1);
> - ctxt->sys_regs[SCTLR_EL1]   = read_sysreg_el1(sctlr);
> - ctxt->sys_regs[CPACR_EL1]   = read_sysreg_el1(cpacr);
> - ctxt->sys_regs[TTBR0_EL1]   = read_sysreg_el1(ttbr0);
> - ctxt->sys_regs[TTBR1_EL1]   = read_sysreg_el1(ttbr1);
> - ctxt->sys_regs[TCR_EL1] = read_sysreg_el1(tcr);
> - ctxt->sys_regs[ESR_EL1] = read_sysreg_el1(esr);
> - ctxt->sys_regs[AFSR0_EL1]   = read_sysreg_el1(afsr0);
> - ctxt->sys_regs[AFSR1_EL1]   = read_sysreg_el1(afsr1);
> - ctxt->sys_regs[FAR_EL1] = read_sysreg_el1(far);
> - ctxt->sys_regs[MAIR_EL1]= read_sysreg_el1(mair);
> - ctxt->sys_regs[VBAR_EL1]= read_sysreg_el1(vbar);
> - ctxt->sys_regs[CONTEXTIDR_EL1]  = read_sysreg_el1(contextidr);
> - ctxt->sys_regs[AMAIR_EL1]   = read_sysreg_el1(amair);
> - ctxt->sys_regs[CNTKCTL_EL1] = read_sysreg_el1(cntkctl);
> - ctxt->sys_regs[PAR_EL1] = read_sysreg(par_el1);
> -
> - ctxt->gp_regs.sp_el1= read_sysreg(sp_el1);
> + u64 *sys_regs = kern_hyp_va(ctxt->hw_sys_regs);
> +
> + sys_regs[MPIDR_EL1] = read_sysreg(vmpidr_el2);
> + sys_regs[CSSELR_EL1]= read_sysreg(csselr_el1);
> + sys_regs[SCTLR_EL1] = read_sysreg_el1(sctlr);
> + sys_regs[CPACR_EL1] = read_sysreg_el1(cpacr);
> + sys_regs[TTBR0_EL1] = read_sysreg_el1(ttbr0);
> + sys_regs[TTBR1_EL1] = read_sysreg_el1(ttbr1);
> + sys_regs[TCR_EL1]   = read_sysreg_el1(tcr);
> + sys_regs[ESR_EL1]   = read_sysreg_el1(esr);
> + sys_regs[AFSR0_EL1] = read_sysreg_el1(afsr0);
> + sys_regs[AFSR1_EL1] = read_sysreg_el1(afsr1);
> + sys_regs[FAR_EL1]   = read_sysreg_el1(far);
> + sys_regs[MAIR_EL1]  = read_sysreg_el1(mair);
> + sys_regs[VBAR_EL1]  = read_sysreg_el1(vbar);
> + sys_regs[CONTEXTIDR_EL1]= read_sysreg_el1(contextidr);
> + sys_regs[AMAIR_EL1] = read_sysreg_el1(amair);
> + sys_regs[CNTKCTL_EL1]   = read_sysreg_el1(cntkctl);
> + sys_regs[PAR_EL1]   = read_sysreg(par_el1);
> +
> + ctxt->hw_sp_el1

Re: [PATCH v2 1/3] kvm: x86: Add a hook for arch specific dirty logging emulation

2017-05-11 Thread Bandan Das
"Huang, Kai"  writes:

...
> Hi Bandan,
>
> I was just suggesting. You and Paolo still make the decision :)

Sure Kai, I don't mind the name change at all.
The maintainer has already picked this up and I don't think
the function name change is worth submitting a follow up.
Thank you very much for the review! :)

Bandan

> Thanks,
> -Kai
>>
>> Bandan
>>
>>> Thanks,
>>> -Kai
>>>
 +
/* pmu operations of sub-arch */
const struct kvm_pmu_ops *pmu_ops;

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 5586765..5d3376f 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1498,6 +1498,21 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct 
 kvm *kvm,
kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
  }

 +/**
 + * kvm_arch_write_log_dirty - emulate dirty page logging
 + * @vcpu: Guest mode vcpu
 + *
 + * Emulate arch specific page modification logging for the
 + * nested hypervisor
 + */
 +int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
 +{
 +  if (kvm_x86_ops->write_log_dirty)
 +  return kvm_x86_ops->write_log_dirty(vcpu);
 +
 +  return 0;
 +}
>>>
>>> kvm_nested_arch_write_log_dirty?
>>>
 +
  bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
struct kvm_memory_slot *slot, u64 gfn)
  {
 diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
 index d8ccb32..2797580 100644
 --- a/arch/x86/kvm/mmu.h
 +++ b/arch/x86/kvm/mmu.h
 @@ -202,4 +202,5 @@ void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot 
 *slot, gfn_t gfn);
  void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
  bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
struct kvm_memory_slot *slot, u64 gfn);
 +int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
  #endif
 diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
 index 314d207..5624174 100644
 --- a/arch/x86/kvm/paging_tmpl.h
 +++ b/arch/x86/kvm/paging_tmpl.h
 @@ -226,6 +226,10 @@ static int FNAME(update_accessed_dirty_bits)(struct 
 kvm_vcpu *vcpu,
if (level == walker->level && write_fault &&
!(pte & PT_GUEST_DIRTY_MASK)) {
trace_kvm_mmu_set_dirty_bit(table_gfn, index, 
 sizeof(pte));
 +#if PTTYPE == PTTYPE_EPT
 +  if (kvm_arch_write_log_dirty(vcpu))
 +  return -EINVAL;
 +#endif
pte |= PT_GUEST_DIRTY_MASK;
}
if (pte == orig_pte)

>>


Re: [PATCH v2 2/3] nVMX: Implement emulated Page Modification Logging

2017-05-10 Thread Bandan Das
Paolo Bonzini  writes:
...
>> Is the purpose of returning 1 to make upper layer code to inject PML
>> full VMEXIt to L1 in nested_ept_inject_page_fault?
>
> Yes, it triggers a fault
>>> +
>>> +gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS) & ~0xFFFull;
>>> +
>>> +page = nested_get_page(vcpu, vmcs12->pml_address);
>>> +if (!page)
>>> +return 0;
>> 
>> If PML is enabled in L1, I think nested_get_page should never return a
>> NULL PML page (unless L1 does something wrong)? Probably better to
>> return 1 rather than 0, and handle error in nested_ept_inject_page_fault
>> according to vmcs12->pml_address?
>
> This happens if the PML address is invalid (where on real hardware, the
> write would just be "eaten") or MMIO (where we expect to diverge from

Yes, that was my motivation. On real hardware, the hypervisor would still
run except that the PML buffer is corrupt.

Bandan

> real hardware behavior).
>
>>> +
>>> +pml_address = kmap(page);
>>> +pml_address[vmcs12->guest_pml_index--] = gpa;
>> 
>> This gpa is L2 guest's GPA. Do we also need to mark L1's GPA (which is
>> related to L2 guest's GPA above) in to dirty-log? Or has this already
>> been done?
>
> L1's PML contains L1 host physical addresses, i.e. L0 guest physical
> addresses.  This GPA comes from vmcs02 and hence it is L0's GPA.
>
> L0's HPA is marked by hardware through PML, as usual.  If L0 has EPT A/D
> but not PML, it can still provide emulated PML to L1, but L0's HPA will
> be marked as dirty via write protection.
>
> Paolo
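
To make the flow easier to follow, here is a minimal user-space model of the
emulated PML write path under discussion: when the dirty bit is set on behalf
of L2, the GPA is logged into the L1-provided buffer and the index is
decremented; once the index underflows, a PML-full event is flagged so the
next exit can be reflected to L1. Field names mirror the patch, but the
buffer is a plain array here rather than the guest page mapped via
nested_get_page()/kmap(), and the rest is illustrative.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PML_ENTITY_NUM 512

struct emulated_pml {
        uint64_t buffer[PML_ENTITY_NUM];  /* stands in for the page at vmcs12->pml_address */
        uint16_t guest_pml_index;         /* vmcs12->guest_pml_index */
        bool     pml_full;                /* vmx->nested.pml_full */
};

/*
 * Returns 0 if the GPA was logged, 1 if a PML-full exit has to be injected
 * into L1 first (mirroring the write_log_dirty() return value).
 */
static int nested_write_pml(struct emulated_pml *pml, uint64_t gpa)
{
        if (pml->guest_pml_index >= PML_ENTITY_NUM) {  /* index underflowed */
                pml->pml_full = true;
                return 1;
        }
        pml->buffer[pml->guest_pml_index--] = gpa & ~0xfffull;
        return 0;
}

int main(void)
{
        struct emulated_pml pml = { .guest_pml_index = PML_ENTITY_NUM - 1 };
        uint64_t gpa;

        for (gpa = 0; gpa < 513 * 0x1000; gpa += 0x1000)
                if (nested_write_pml(&pml, gpa))
                        printf("PML full at gpa %#llx\n", (unsigned long long)gpa);
        return 0;
}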


Re: [PATCH v2 1/3] kvm: x86: Add a hook for arch specific dirty logging emulation

2017-05-10 Thread Bandan Das
Hi Kai,

"Huang, Kai"  writes:

> On 5/6/2017 7:25 AM, Bandan Das wrote:
>> When KVM updates accessed/dirty bits, this hook can be used
>> to invoke an arch specific function that implements/emulates
>> dirty logging such as PML.
>>
>> Signed-off-by: Bandan Das 
>> ---
>>  arch/x86/include/asm/kvm_host.h |  2 ++
>>  arch/x86/kvm/mmu.c  | 15 +++
>>  arch/x86/kvm/mmu.h  |  1 +
>>  arch/x86/kvm/paging_tmpl.h  |  4 
>>  4 files changed, 22 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h 
>> b/arch/x86/include/asm/kvm_host.h
>> index f5bddf92..9c761fe 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1020,6 +1020,8 @@ struct kvm_x86_ops {
>>  void (*enable_log_dirty_pt_masked)(struct kvm *kvm,
>> struct kvm_memory_slot *slot,
>> gfn_t offset, unsigned long mask);
>> +int (*write_log_dirty)(struct kvm_vcpu *vcpu);
>
> Hi,
>
> Thanks for adding PML to nested support!
>
> IMHO this callback is only used for write L2's dirty gpa to L1's PML
> buffer, so probably it's better to change the name to something like:
> nested_write_log_dirty.

The name was meant to signify what it does, i.e. write the dirty log, rather
than where in the hierarchy it's being used :) But I guess a nested_ prefix
doesn't hurt either.

Bandan

> Thanks,
> -Kai
>
>> +
>>  /* pmu operations of sub-arch */
>>  const struct kvm_pmu_ops *pmu_ops;
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 5586765..5d3376f 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -1498,6 +1498,21 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct 
>> kvm *kvm,
>>  kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
>>  }
>>
>> +/**
>> + * kvm_arch_write_log_dirty - emulate dirty page logging
>> + * @vcpu: Guest mode vcpu
>> + *
>> + * Emulate arch specific page modification logging for the
>> + * nested hypervisor
>> + */
>> +int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
>> +{
>> +if (kvm_x86_ops->write_log_dirty)
>> +return kvm_x86_ops->write_log_dirty(vcpu);
>> +
>> +return 0;
>> +}
>
> kvm_nested_arch_write_log_dirty?
>
>> +
>>  bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>>  struct kvm_memory_slot *slot, u64 gfn)
>>  {
>> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>> index d8ccb32..2797580 100644
>> --- a/arch/x86/kvm/mmu.h
>> +++ b/arch/x86/kvm/mmu.h
>> @@ -202,4 +202,5 @@ void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot 
>> *slot, gfn_t gfn);
>>  void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
>>  bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
>>  struct kvm_memory_slot *slot, u64 gfn);
>> +int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
>>  #endif
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 314d207..5624174 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -226,6 +226,10 @@ static int FNAME(update_accessed_dirty_bits)(struct 
>> kvm_vcpu *vcpu,
>>  if (level == walker->level && write_fault &&
>>  !(pte & PT_GUEST_DIRTY_MASK)) {
>>  trace_kvm_mmu_set_dirty_bit(table_gfn, index, 
>> sizeof(pte));
>> +#if PTTYPE == PTTYPE_EPT
>> +if (kvm_arch_write_log_dirty(vcpu))
>> +return -EINVAL;
>> +#endif
>>  pte |= PT_GUEST_DIRTY_MASK;
>>  }
>>  if (pte == orig_pte)
>>


Re: [PATCH v2 0/3] nVMX: Emulated Page Modification Logging for Nested Virtualization

2017-05-09 Thread Bandan Das
Paolo Bonzini  writes:

> On 05/05/2017 21:25, Bandan Das wrote:
>> v2:
>> 2/3: Clear out all bits except bit 12
>> 3/3: Slightly modify an existing comment, honor L0's
>> PML setting when clearing it for L1
>> 
>> v1:
>> http://www.spinics.net/lists/kvm/msg149247.html
>> 
>> These patches implement PML on top of EPT A/D emulation
>> (ae1e2d1082ae).
>> 
>> When dirty bit is being set, we write the gpa to the
>> buffer provided by L1. If the index overflows, we just
>> change the exit reason before running L1.
>
> I tested this with api/dirty-log-perf, and nested PML is more than 3
> times faster than pml=0.  I want to do a few more tests because I don't
> see any PML full exits in the L1 trace, but it seems to be a nice
> improvement!

Thanks for testing! Regarding the PML full exits, I did notice their
absence. I induced it artificially in my testing with a lower index
and it seemed to work fine.

> Paolo
>
>> Bandan Das (3):
>>   kvm: x86: Add a hook for arch specific dirty logging emulation
>>   nVMX: Implement emulated Page Modification Logging
>>   nVMX: Advertise PML to L1 hypervisor
>> 
>>  arch/x86/include/asm/kvm_host.h |  2 +
>>  arch/x86/kvm/mmu.c  | 15 +++
>>  arch/x86/kvm/mmu.h  |  1 +
>>  arch/x86/kvm/paging_tmpl.h  |  4 ++
>>  arch/x86/kvm/vmx.c  | 97 
>> ++---
>>  5 files changed, 112 insertions(+), 7 deletions(-)
>> 


[PATCH v2 1/3] kvm: x86: Add a hook for arch specific dirty logging emulation

2017-05-05 Thread Bandan Das
When KVM updates accessed/dirty bits, this hook can be used
to invoke an arch specific function that implements/emulates
dirty logging such as PML.

Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.c  | 15 +++
 arch/x86/kvm/mmu.h  |  1 +
 arch/x86/kvm/paging_tmpl.h  |  4 
 4 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f5bddf92..9c761fe 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1020,6 +1020,8 @@ struct kvm_x86_ops {
void (*enable_log_dirty_pt_masked)(struct kvm *kvm,
   struct kvm_memory_slot *slot,
   gfn_t offset, unsigned long mask);
+   int (*write_log_dirty)(struct kvm_vcpu *vcpu);
+
/* pmu operations of sub-arch */
const struct kvm_pmu_ops *pmu_ops;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5586765..5d3376f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1498,6 +1498,21 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm 
*kvm,
kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
+/**
+ * kvm_arch_write_log_dirty - emulate dirty page logging
+ * @vcpu: Guest mode vcpu
+ *
+ * Emulate arch specific page modification logging for the
+ * nested hypervisor
+ */
+int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
+{
+   if (kvm_x86_ops->write_log_dirty)
+   return kvm_x86_ops->write_log_dirty(vcpu);
+
+   return 0;
+}
+
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
struct kvm_memory_slot *slot, u64 gfn)
 {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index d8ccb32..2797580 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -202,4 +202,5 @@ void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot 
*slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
struct kvm_memory_slot *slot, u64 gfn);
+int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 #endif
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 314d207..5624174 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -226,6 +226,10 @@ static int FNAME(update_accessed_dirty_bits)(struct 
kvm_vcpu *vcpu,
if (level == walker->level && write_fault &&
!(pte & PT_GUEST_DIRTY_MASK)) {
trace_kvm_mmu_set_dirty_bit(table_gfn, index, 
sizeof(pte));
+#if PTTYPE == PTTYPE_EPT
+   if (kvm_arch_write_log_dirty(vcpu))
+   return -EINVAL;
+#endif
pte |= PT_GUEST_DIRTY_MASK;
}
if (pte == orig_pte)
-- 
2.9.3



[PATCH v2 2/3] nVMX: Implement emulated Page Modification Logging

2017-05-05 Thread Bandan Das
With EPT A/D enabled, processor access to L2 guest
paging structures will result in a write violation.
When this happens, write the GUEST_PHYSICAL_ADDRESS
to the PML buffer provided by L1 if the access is a
write and the dirty bit is being set.

This patch also adds necessary checks during VMEntry if L1
has enabled PML. If the PML index overflows, we change the
exit reason and run L1 to simulate a PML full event.

Signed-off-by: Bandan Das 
---
 arch/x86/kvm/vmx.c | 81 --
 1 file changed, 79 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2211697..8b9e942 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -248,6 +248,7 @@ struct __packed vmcs12 {
u64 xss_exit_bitmap;
u64 guest_physical_address;
u64 vmcs_link_pointer;
+   u64 pml_address;
u64 guest_ia32_debugctl;
u64 guest_ia32_pat;
u64 guest_ia32_efer;
@@ -369,6 +370,7 @@ struct __packed vmcs12 {
u16 guest_ldtr_selector;
u16 guest_tr_selector;
u16 guest_intr_status;
+   u16 guest_pml_index;
u16 host_es_selector;
u16 host_cs_selector;
u16 host_ss_selector;
@@ -407,6 +409,7 @@ struct nested_vmx {
/* Has the level1 guest done vmxon? */
bool vmxon;
gpa_t vmxon_ptr;
+   bool pml_full;
 
/* The guest-physical address of the current VMCS L1 keeps for L2 */
gpa_t current_vmptr;
@@ -742,6 +745,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD(GUEST_LDTR_SELECTOR, guest_ldtr_selector),
FIELD(GUEST_TR_SELECTOR, guest_tr_selector),
FIELD(GUEST_INTR_STATUS, guest_intr_status),
+   FIELD(GUEST_PML_INDEX, guest_pml_index),
FIELD(HOST_ES_SELECTOR, host_es_selector),
FIELD(HOST_CS_SELECTOR, host_cs_selector),
FIELD(HOST_SS_SELECTOR, host_ss_selector),
@@ -767,6 +771,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap),
FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
+   FIELD64(PML_ADDRESS, pml_address),
FIELD64(GUEST_IA32_DEBUGCTL, guest_ia32_debugctl),
FIELD64(GUEST_IA32_PAT, guest_ia32_pat),
FIELD64(GUEST_IA32_EFER, guest_ia32_efer),
@@ -1349,6 +1354,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 
*vmcs12)
vmx_xsaves_supported();
 }
 
+static inline bool nested_cpu_has_pml(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_PML);
+}
+
 static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
 {
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
@@ -9368,13 +9378,20 @@ static void nested_ept_inject_page_fault(struct 
kvm_vcpu *vcpu,
struct x86_exception *fault)
 {
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 exit_reason;
+   unsigned long exit_qualification = vcpu->arch.exit_qualification;
 
-   if (fault->error_code & PFERR_RSVD_MASK)
+   if (vmx->nested.pml_full) {
+   exit_reason = EXIT_REASON_PML_FULL;
+   vmx->nested.pml_full = false;
+   exit_qualification &= INTR_INFO_UNBLOCK_NMI;
+   } else if (fault->error_code & PFERR_RSVD_MASK)
exit_reason = EXIT_REASON_EPT_MISCONFIG;
else
exit_reason = EXIT_REASON_EPT_VIOLATION;
-   nested_vmx_vmexit(vcpu, exit_reason, 0, vcpu->arch.exit_qualification);
+
+   nested_vmx_vmexit(vcpu, exit_reason, 0, exit_qualification);
vmcs12->guest_physical_address = fault->address;
 }
 
@@ -9717,6 +9734,22 @@ static int nested_vmx_check_msr_switch_controls(struct 
kvm_vcpu *vcpu,
return 0;
 }
 
+static int nested_vmx_check_pml_controls(struct kvm_vcpu *vcpu,
+struct vmcs12 *vmcs12)
+{
+   u64 address = vmcs12->pml_address;
+   int maxphyaddr = cpuid_maxphyaddr(vcpu);
+
+   if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_PML)) {
+   if (!nested_cpu_has_ept(vmcs12) ||
+   !IS_ALIGNED(address, 4096)  ||
+   address >> maxphyaddr)
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static int nested_vmx_msr_check_common(struct kvm_vcpu *vcpu,
   struct vmx_msr_entry *e)
 {
@@ -10252,6 +10285,9 @@ static int check_vmentry_prereqs(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12)
if (nested_vmx_check_msr_switch_controls(vcpu, vmcs12))
return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
 
+   if (nested_vmx_check_pml_controls(vcpu, vmcs12))
+   return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
+
if (!vmx_control_verify(vmcs12->cpu_based_vm_e

[PATCH v2 0/3] nVMX: Emulated Page Modification Logging for Nested Virtualization

2017-05-05 Thread Bandan Das
v2:
2/3: Clear out all bits except bit 12
3/3: Slightly modify an existing comment, honor L0's
PML setting when clearing it for L1

v1:
http://www.spinics.net/lists/kvm/msg149247.html

These patches implement PML on top of EPT A/D emulation
(ae1e2d1082ae).

When dirty bit is being set, we write the gpa to the
buffer provided by L1. If the index overflows, we just
change the exit reason before running L1.

Bandan Das (3):
  kvm: x86: Add a hook for arch specific dirty logging emulation
  nVMX: Implement emulated Page Modification Logging
  nVMX: Advertise PML to L1 hypervisor

 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/mmu.c  | 15 +++
 arch/x86/kvm/mmu.h  |  1 +
 arch/x86/kvm/paging_tmpl.h  |  4 ++
 arch/x86/kvm/vmx.c  | 97 ++---
 5 files changed, 112 insertions(+), 7 deletions(-)

-- 
2.9.3



[PATCH v2 3/3] nVMX: Advertise PML to L1 hypervisor

2017-05-05 Thread Bandan Das
Advertise the PML bit in vmcs12 but don't try to enable
it in hardware when running L2 since L0 is emulating it. Also,
preserve L0's settings for PML since it may still
want to log writes.

Signed-off-by: Bandan Das 
---
 arch/x86/kvm/vmx.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8b9e942..a5f6054 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2763,8 +2763,11 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
vmx->nested.nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT |
VMX_EPT_EXTENT_CONTEXT_BIT | VMX_EPT_2MB_PAGE_BIT |
VMX_EPT_1GB_PAGE_BIT;
-  if (enable_ept_ad_bits)
+   if (enable_ept_ad_bits) {
+   vmx->nested.nested_vmx_secondary_ctls_high |=
+   SECONDARY_EXEC_ENABLE_PML;
   vmx->nested.nested_vmx_ept_caps |= VMX_EPT_AD_BIT;
+   }
} else
vmx->nested.nested_vmx_ept_caps = 0;
 
@@ -8128,7 +8131,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
case EXIT_REASON_PREEMPTION_TIMER:
return false;
case EXIT_REASON_PML_FULL:
-   /* We don't expose PML support to L1. */
+   /* We emulate PML support to L1. */
return false;
default:
return true;
@@ -9923,7 +9926,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12,
  bool from_vmentry, u32 *entry_failure_code)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
-   u32 exec_control;
+   u32 exec_control, vmcs12_exec_ctrl;
 
vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
@@ -10054,8 +10057,11 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12,
  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
  SECONDARY_EXEC_APIC_REGISTER_VIRT);
if (nested_cpu_has(vmcs12,
-   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
-   exec_control |= vmcs12->secondary_vm_exec_control;
+  CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) {
+   vmcs12_exec_ctrl = vmcs12->secondary_vm_exec_control &
+   ~SECONDARY_EXEC_ENABLE_PML;
+   exec_control |= vmcs12_exec_ctrl;
+   }
 
if (exec_control & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) {
vmcs_write64(EOI_EXIT_BITMAP0,
-- 
2.9.3



Re: [PATCH 3/3] nVMX: Advertise PML to L1 hypervisor

2017-05-04 Thread Bandan Das
Paolo Bonzini  writes:

> On 04/05/2017 00:14, Bandan Das wrote:
>> Advertise the PML bit in vmcs12 but clear it out
>> before running L2 since we don't depend on hardware support
>> for PML emulation.
>> 
>> Signed-off-by: Bandan Das 
>> ---
>>  arch/x86/kvm/vmx.c | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 5e5abb7..df71116 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2763,8 +2763,11 @@ static void nested_vmx_setup_ctls_msrs(struct 
>> vcpu_vmx *vmx)
>>  vmx->nested.nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT |
>>  VMX_EPT_EXTENT_CONTEXT_BIT | VMX_EPT_2MB_PAGE_BIT |
>>  VMX_EPT_1GB_PAGE_BIT;
>> -   if (enable_ept_ad_bits)
>> +if (enable_ept_ad_bits) {
>> +vmx->nested.nested_vmx_secondary_ctls_high |=
>> +SECONDARY_EXEC_ENABLE_PML;
>> vmx->nested.nested_vmx_ept_caps |= VMX_EPT_AD_BIT;
>> +}
>>  } else
>>  vmx->nested.nested_vmx_ept_caps = 0;
>>  
>> @@ -10080,6 +10083,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
>> struct vmcs12 *vmcs12,
>>  if (exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)
>>  vmcs_write64(APIC_ACCESS_ADDR, -1ull);
>>  
>> +exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
>>  vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
>
> L0 is still using its own page modification log when running L2, so you
> have to clear the bit here instead:
>
> exec_control |= vmcs12->secondary_vm_exec_control;
>

Oops, good catch, thank you!

> and set up PML_ADDRESS and GUEST_PML_INDEX.  Though, the lack of
> PML_ADDRESS and GUEST_PML_INDEX initialization is a pre-existing bug.

A little further down I see that these fields are being reset as part of
commit 1fb883bb827:
...
if (enable_pml) {
/*
 * Conceptually we want to copy the PML address and index from
 * vmcs01 here, and then back to vmcs01 on nested vmexit. But,
 * since we always flush the log on each vmexit, this happens
 * to be equivalent to simply resetting the fields in vmcs02.
 */
ASSERT(vmx->pml_pg);
vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
}

Or are you referring to a different place where these fields need to be set?


> Paolo
>
>>  }
>>  
>> 


Re: [PATCH 2/3] nVMX: Implement emulated Page Modification Logging

2017-05-04 Thread Bandan Das
Paolo Bonzini  writes:

> On 04/05/2017 00:14, Bandan Das wrote:
>> +if (vmx->nested.pml_full) {
>> +exit_reason = EXIT_REASON_PML_FULL;
>> +vmx->nested.pml_full = false;
>> +} else if (fault->error_code & PFERR_RSVD_MASK)
>>  exit_reason = EXIT_REASON_EPT_MISCONFIG;
>>  else
>>  exit_reason = EXIT_REASON_EPT_VIOLATION;
>> +/*
>> + * The definition of bit 12 for EPT violations and PML
>> + * full event is the same, so pass it through since
>> + * the rest of the bits are undefined.
>> + */
>
> Please zero all other bits instead.  It's as easy as adding an "u64
> exit_qualification" local variable.

Will do, thanks for the review.

Bandan

> Paolo
>
>>  nested_vmx_vmexit(vcpu, exit_reason, 0, vcpu->arch.exit_qualification);


[PATCH 1/3] kvm: x86: Add a hook for arch specific dirty logging emulation

2017-05-03 Thread Bandan Das
When KVM updates accessed/dirty bits, this hook can be used
to invoke an arch specific function that implements/emulates
dirty logging such as PML.

Signed-off-by: Bandan Das 
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.c  | 15 +++
 arch/x86/kvm/mmu.h  |  1 +
 arch/x86/kvm/paging_tmpl.h  |  4 
 4 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f5bddf92..9c761fe 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1020,6 +1020,8 @@ struct kvm_x86_ops {
void (*enable_log_dirty_pt_masked)(struct kvm *kvm,
   struct kvm_memory_slot *slot,
   gfn_t offset, unsigned long mask);
+   int (*write_log_dirty)(struct kvm_vcpu *vcpu);
+
/* pmu operations of sub-arch */
const struct kvm_pmu_ops *pmu_ops;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5586765..5d3376f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1498,6 +1498,21 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm 
*kvm,
kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
 }
 
+/**
+ * kvm_arch_write_log_dirty - emulate dirty page logging
+ * @vcpu: Guest mode vcpu
+ *
+ * Emulate arch specific page modification logging for the
+ * nested hypervisor
+ */
+int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu)
+{
+   if (kvm_x86_ops->write_log_dirty)
+   return kvm_x86_ops->write_log_dirty(vcpu);
+
+   return 0;
+}
+
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
struct kvm_memory_slot *slot, u64 gfn)
 {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index d8ccb32..2797580 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -202,4 +202,5 @@ void kvm_mmu_gfn_disallow_lpage(struct kvm_memory_slot 
*slot, gfn_t gfn);
 void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
struct kvm_memory_slot *slot, u64 gfn);
+int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 #endif
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 314d207..5624174 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -226,6 +226,10 @@ static int FNAME(update_accessed_dirty_bits)(struct 
kvm_vcpu *vcpu,
if (level == walker->level && write_fault &&
!(pte & PT_GUEST_DIRTY_MASK)) {
trace_kvm_mmu_set_dirty_bit(table_gfn, index, 
sizeof(pte));
+#if PTTYPE == PTTYPE_EPT
+   if (kvm_arch_write_log_dirty(vcpu))
+   return -EINVAL;
+#endif
pte |= PT_GUEST_DIRTY_MASK;
}
if (pte == orig_pte)
-- 
2.9.3



[PATCH 2/3] nVMX: Implement emulated Page Modification Logging

2017-05-03 Thread Bandan Das
With EPT A/D enabled, processor access to L2 guest
paging structures will result in a write violation.
When this happens, write the GUEST_PHYSICAL_ADDRESS
to the PML buffer provided by L1 if the access is a
write and the dirty bit is being set.

This patch also adds necessary checks during VMEntry if L1
has enabled PML. If the PML index overflows, we change the
exit reason and run L1 to simulate a PML full event.

Signed-off-by: Bandan Das 
---
 arch/x86/kvm/vmx.c | 81 +-
 1 file changed, 80 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 29940b6..5e5abb7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -248,6 +248,7 @@ struct __packed vmcs12 {
u64 xss_exit_bitmap;
u64 guest_physical_address;
u64 vmcs_link_pointer;
+   u64 pml_address;
u64 guest_ia32_debugctl;
u64 guest_ia32_pat;
u64 guest_ia32_efer;
@@ -369,6 +370,7 @@ struct __packed vmcs12 {
u16 guest_ldtr_selector;
u16 guest_tr_selector;
u16 guest_intr_status;
+   u16 guest_pml_index;
u16 host_es_selector;
u16 host_cs_selector;
u16 host_ss_selector;
@@ -407,6 +409,7 @@ struct nested_vmx {
/* Has the level1 guest done vmxon? */
bool vmxon;
gpa_t vmxon_ptr;
+   bool pml_full;
 
/* The guest-physical address of the current VMCS L1 keeps for L2 */
gpa_t current_vmptr;
@@ -742,6 +745,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD(GUEST_LDTR_SELECTOR, guest_ldtr_selector),
FIELD(GUEST_TR_SELECTOR, guest_tr_selector),
FIELD(GUEST_INTR_STATUS, guest_intr_status),
+   FIELD(GUEST_PML_INDEX, guest_pml_index),
FIELD(HOST_ES_SELECTOR, host_es_selector),
FIELD(HOST_CS_SELECTOR, host_cs_selector),
FIELD(HOST_SS_SELECTOR, host_ss_selector),
@@ -767,6 +771,7 @@ static const unsigned short vmcs_field_to_offset_table[] = {
FIELD64(XSS_EXIT_BITMAP, xss_exit_bitmap),
FIELD64(GUEST_PHYSICAL_ADDRESS, guest_physical_address),
FIELD64(VMCS_LINK_POINTER, vmcs_link_pointer),
+   FIELD64(PML_ADDRESS, pml_address),
FIELD64(GUEST_IA32_DEBUGCTL, guest_ia32_debugctl),
FIELD64(GUEST_IA32_PAT, guest_ia32_pat),
FIELD64(GUEST_IA32_EFER, guest_ia32_efer),
@@ -1349,6 +1354,11 @@ static inline bool nested_cpu_has_xsaves(struct vmcs12 
*vmcs12)
vmx_xsaves_supported();
 }
 
+static inline bool nested_cpu_has_pml(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_PML);
+}
+
 static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
 {
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
@@ -9368,12 +9378,21 @@ static void nested_ept_inject_page_fault(struct 
kvm_vcpu *vcpu,
struct x86_exception *fault)
 {
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 exit_reason;
 
-   if (fault->error_code & PFERR_RSVD_MASK)
+   if (vmx->nested.pml_full) {
+   exit_reason = EXIT_REASON_PML_FULL;
+   vmx->nested.pml_full = false;
+   } else if (fault->error_code & PFERR_RSVD_MASK)
exit_reason = EXIT_REASON_EPT_MISCONFIG;
else
exit_reason = EXIT_REASON_EPT_VIOLATION;
+   /*
+* The definition of bit 12 for EPT violations and PML
+* full event is the same, so pass it through since
+* the rest of the bits are undefined.
+*/
nested_vmx_vmexit(vcpu, exit_reason, 0, vcpu->arch.exit_qualification);
vmcs12->guest_physical_address = fault->address;
 }
@@ -9717,6 +9736,22 @@ static int nested_vmx_check_msr_switch_controls(struct 
kvm_vcpu *vcpu,
return 0;
 }
 
+static int nested_vmx_check_pml_controls(struct kvm_vcpu *vcpu,
+struct vmcs12 *vmcs12)
+{
+   u64 address = vmcs12->pml_address;
+   int maxphyaddr = cpuid_maxphyaddr(vcpu);
+
+   if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_PML)) {
+   if (!nested_cpu_has_ept(vmcs12) ||
+   !IS_ALIGNED(address, 4096)  ||
+   address >> maxphyaddr)
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static int nested_vmx_msr_check_common(struct kvm_vcpu *vcpu,
   struct vmx_msr_entry *e)
 {
@@ -10240,6 +10275,9 @@ static int check_vmentry_prereqs(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12)
if (nested_vmx_check_msr_switch_controls(vcpu, vmcs12))
return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
 
+   if (nested_vmx_check_pml_controls(vcpu, vmcs12))
+   return VMXERR_ENTRY_INVALID_CONTROL_FIELD;
+
if (!vmx_control_verify(vmcs12->cpu_based_vm_exec_con
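
As a stand-alone illustration of the VMEntry sanity check described in the
commit message, the L1-provided PML address must be 4K aligned and must fit
within the guest's physical-address width (MAXPHYADDR); the EPT-enabled check
is omitted and the values in main() are made up for the example.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool pml_address_valid(uint64_t address, int maxphyaddr)
{
        if (address & 0xfffull)        /* must be page aligned */
                return false;
        if (address >> maxphyaddr)     /* must not exceed MAXPHYADDR */
                return false;
        return true;
}

int main(void)
{
        printf("%d\n", pml_address_valid(0x123000ull, 36));   /* 1: OK */
        printf("%d\n", pml_address_valid(0x123010ull, 36));   /* 0: unaligned */
        printf("%d\n", pml_address_valid(1ull << 40, 36));    /* 0: beyond MAXPHYADDR */
        return 0;
}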

[PATCH 3/3] nVMX: Advertise PML to L1 hypervisor

2017-05-03 Thread Bandan Das
Advertise the PML bit in vmcs12 but clear it out
before running L2 since we don't depend on hardware support
for PML emulation.

Signed-off-by: Bandan Das 
---
 arch/x86/kvm/vmx.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e5abb7..df71116 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2763,8 +2763,11 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
vmx->nested.nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT |
VMX_EPT_EXTENT_CONTEXT_BIT | VMX_EPT_2MB_PAGE_BIT |
VMX_EPT_1GB_PAGE_BIT;
-  if (enable_ept_ad_bits)
+   if (enable_ept_ad_bits) {
+   vmx->nested.nested_vmx_secondary_ctls_high |=
+   SECONDARY_EXEC_ENABLE_PML;
   vmx->nested.nested_vmx_ept_caps |= VMX_EPT_AD_BIT;
+   }
} else
vmx->nested.nested_vmx_ept_caps = 0;
 
@@ -10080,6 +10083,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12,
if (exec_control & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)
vmcs_write64(APIC_ACCESS_ADDR, -1ull);
 
+   exec_control &= ~SECONDARY_EXEC_ENABLE_PML;
vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
}
 
-- 
2.9.3



[PATCH 0/3] nVMX: Emulated Page Modification Logging for Nested Virtualization

2017-05-03 Thread Bandan Das
These patches implement PML on top of EPT A/D emulation
(ae1e2d1082ae).

When dirty bit is being set, we write the gpa to the
buffer provided by L1. If the index overflows, we just
change the exit reason before running L1.

Bandan Das (3):
  kvm: x86: Add a hook for arch specific dirty logging emulation
  nVMX: Implement emulated Page Modification Logging
  nVMX: Advertise PML to L1 hypervisor

 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/mmu.c  | 15 +++
 arch/x86/kvm/mmu.h  |  1 +
 arch/x86/kvm/paging_tmpl.h  |  4 ++
 arch/x86/kvm/vmx.c  | 87 -
 5 files changed, 107 insertions(+), 2 deletions(-)

-- 
2.9.3



Re: [PATCH 4/6] kvm: nVMX: support EPT accessed/dirty bits

2017-04-12 Thread Bandan Das
Paolo Bonzini  writes:

> - Original Message -
>> From: "Bandan Das" 
>> To: "Paolo Bonzini" 
>> Cc: linux-kernel@vger.kernel.org, k...@vger.kernel.org, da...@redhat.com
>> Sent: Wednesday, April 12, 2017 7:35:16 AM
>> Subject: Re: [PATCH 4/6] kvm: nVMX: support EPT accessed/dirty bits
>> 
>> Paolo Bonzini  writes:
>> ...
>> >accessed_dirty = have_ad ? PT_GUEST_ACCESSED_MASK : 0;
>> > +
>> > +  /*
>> > +   * FIXME: on Intel processors, loads of the PDPTE registers for PAE
>> > paging
>> > +   * by the MOV to CR instruction are treated as reads and do not cause 
>> > the
>> > +   * processor to set the dirty flag in tany EPT paging-structure entry.
>> > +   */
>> 
>> Minor typo: "in any EPT paging-structure entry".
>> 
>> > +  nested_access = (have_ad ? PFERR_WRITE_MASK : 0) | PFERR_USER_MASK;
>> > +
>> >pt_access = pte_access = ACC_ALL;
>> >++walker->level;
>> >  
>> > @@ -338,7 +337,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker
>> > *walker,
>> >walker->pte_gpa[walker->level - 1] = pte_gpa;
>> >  
>> >real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
>> > -PFERR_USER_MASK|PFERR_WRITE_MASK,
>> > +nested_access,
>> >  &walker->fault);
>> 
>> I can't seem to understand the significance of this change (or for that
>> matter what was before this change).
>> 
>> mmu->translate_gpa() just returns gfn_to_gpa(table_gfn), right ?
>
> For EPT it is, you're right it's fishy.  The "nested_access" should be
> computed in translate_nested_gpa, which is where kvm->arch.nested_mmu
> (non-EPT) requests to access kvm->arch.mmu (EPT).

Thanks for the clarification. Is that the case when L1 runs L2 without
EPT? I can't figure out the case where translate_nested_gpa will actually
be called. FNAME(walk_addr_nested) calls walk_addr_generic
with &vcpu->arch.nested_mmu and init_kvm_nested_mmu() sets gva_to_gpa()
with the appropriate "_nested" functions. But the gva_to_gpa() pointers
don't seem to get invoked at all for the nested case.

BTW, just noticed that setting PFERR_USER_MASK is redundant since
translate_nested_gpa does it too.

Bandan

> In practice we need to define a new function
> vcpu->arch.mmu.gva_to_gpa_nested that computes the nested_access
> and calls cpu->arch.mmu.gva_to_gpa.
>
> Thanks,
>
> Paolo
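
For context, a small stand-alone sketch of how the walker's nested_access is
derived in the hunk being discussed: with EPT A/D enabled, the hardware may
set accessed/dirty bits in the guest paging structures during the walk, so
the access to those structures has to be treated as a write. The PFERR_* bit
values below are assumptions for illustration only.

#include <stdint.h>
#include <stdio.h>

#define PFERR_WRITE_MASK (1u << 1)     /* assumed bit positions */
#define PFERR_USER_MASK  (1u << 2)

static uint32_t nested_access_for_walk(int have_ad)
{
        /*
         * If A/D bits are enabled, walking the guest paging structures can
         * dirty them, so the nested (EPT) translation must allow a write.
         */
        return (have_ad ? PFERR_WRITE_MASK : 0) | PFERR_USER_MASK;
}

int main(void)
{
        printf("A/D disabled: %#x\n", nested_access_for_walk(0));
        printf("A/D enabled:  %#x\n", nested_access_for_walk(1));
        return 0;
}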


Re: [PATCH 4/6] kvm: nVMX: support EPT accessed/dirty bits

2017-04-11 Thread Bandan Das
Paolo Bonzini  writes:
...
>   accessed_dirty = have_ad ? PT_GUEST_ACCESSED_MASK : 0;
> +
> + /*
> +  * FIXME: on Intel processors, loads of the PDPTE registers for PAE 
> paging
> +  * by the MOV to CR instruction are treated as reads and do not cause 
> the
> +  * processor to set the dirty flag in tany EPT paging-structure entry.
> +  */

Minor typo: "in any EPT paging-structure entry".

> + nested_access = (have_ad ? PFERR_WRITE_MASK : 0) | PFERR_USER_MASK;
> +
>   pt_access = pte_access = ACC_ALL;
>   ++walker->level;
>  
> @@ -338,7 +337,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker 
> *walker,
>   walker->pte_gpa[walker->level - 1] = pte_gpa;
>  
>   real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
> -   PFERR_USER_MASK|PFERR_WRITE_MASK,
> +   nested_access,
> &walker->fault);

I can't seem to understand the significance of this change (or for that matter
what was before this change).

mmu->translate_gpa() just returns gfn_to_gpa(table_gfn), right ?

Bandan

>   /*
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 1c372600a962..6aaecc78dd71 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2767,6 +2767,8 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
> *vmx)
>   vmx->nested.nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT |
>   VMX_EPT_EXTENT_CONTEXT_BIT | VMX_EPT_2MB_PAGE_BIT |
>   VMX_EPT_1GB_PAGE_BIT;
> +if (enable_ept_ad_bits)
> +vmx->nested.nested_vmx_ept_caps |= VMX_EPT_AD_BIT;
>   } else
>   vmx->nested.nested_vmx_ept_caps = 0;
>  
> @@ -6211,6 +6213,18 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  
>   exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>  
> + if (is_guest_mode(vcpu)
> + && !(exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)) {
> + /*
> +  * Fix up exit_qualification according to whether guest
> +  * page table accesses are reads or writes.
> +  */
> + u64 eptp = nested_ept_get_cr3(vcpu);
> + exit_qualification &= ~EPT_VIOLATION_ACC_WRITE;
> + if (eptp & VMX_EPT_AD_ENABLE_BIT)
> + exit_qualification |= EPT_VIOLATION_ACC_WRITE;
> + }
> +
>   /*
>* EPT violation happened while executing iret from NMI,
>* "blocked by NMI" bit has to be set before next VM entry.
> @@ -9416,17 +9430,26 @@ static unsigned long nested_ept_get_cr3(struct 
> kvm_vcpu *vcpu)
>   return get_vmcs12(vcpu)->ept_pointer;
>  }
>  
> -static void nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
> +static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
>  {
> + u64 eptp;
> +
>   WARN_ON(mmu_is_nested(vcpu));
> + eptp = nested_ept_get_cr3(vcpu);
> + if ((eptp & VMX_EPT_AD_ENABLE_BIT) && !enable_ept_ad_bits)
> + return 1;
> +
> + kvm_mmu_unload(vcpu);
>   kvm_init_shadow_ept_mmu(vcpu,
>   to_vmx(vcpu)->nested.nested_vmx_ept_caps &
> - VMX_EPT_EXECUTE_ONLY_BIT);
> + VMX_EPT_EXECUTE_ONLY_BIT,
> + eptp & VMX_EPT_AD_ENABLE_BIT);
>   vcpu->arch.mmu.set_cr3   = vmx_set_cr3;
>   vcpu->arch.mmu.get_cr3   = nested_ept_get_cr3;
>   vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
>  
>   vcpu->arch.walk_mmu  = &vcpu->arch.nested_mmu;
> + return 0;
>  }
>  
>  static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu)
> @@ -10188,8 +10211,10 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, 
> struct vmcs12 *vmcs12,
>   }
>  
>   if (nested_cpu_has_ept(vmcs12)) {
> - kvm_mmu_unload(vcpu);
> - nested_ept_init_mmu_context(vcpu);
> + if (nested_ept_init_mmu_context(vcpu)) {
> + *entry_failure_code = ENTRY_FAIL_DEFAULT;
> + return 1;
> + }
>   } else if (nested_cpu_has2(vmcs12,
>  SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) {
>   vmx_flush_tlb_ept_only(vcpu);


Re: [PATCH] KVM: x86: remove code for lazy FPU handling

2017-02-17 Thread Bandan Das
Paolo Bonzini  writes:

> On 17/02/2017 01:45, Bandan Das wrote:
>> Paolo Bonzini  writes:
>> 
>>> The FPU is always active now when running KVM.
>> 
>> The lazy code was a performance optimization, correct ?
>> Is this just dormant code and being removed ? Maybe
>> mentioning the reasoning in a little more detail is a good
>> idea.
>
> Lazy FPU support was removed completely from arch/x86.  Apparently,
> things such as SSE-optimized mem* and str* functions made it much less
> useful.  At this point the KVM code is unnecessary too.

Thanks! If it's not too late, please include the above in the commit
message.

Bandan

> Paolo
>
>> The removal itself looks clean. I was really hoping that you
>> would have forgotten to remove "fpu_active" from struct kvm_vcpu()
>> but you hadn't ;)
>> 
>> Bandan
>> 
>>> Signed-off-by: Paolo Bonzini 
>>> ---
>>>  arch/x86/include/asm/kvm_host.h |   3 --
>>>  arch/x86/kvm/cpuid.c|   2 -
>>>  arch/x86/kvm/svm.c  |  43 ++-
>>>  arch/x86/kvm/vmx.c  | 112 
>>> ++--
>>>  arch/x86/kvm/x86.c  |   7 +--
>>>  include/linux/kvm_host.h|   1 -
>>>  6 files changed, 19 insertions(+), 149 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h 
>>> b/arch/x86/include/asm/kvm_host.h
>>> index e4f13e714bcf..74ef58c8ff53 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -55,7 +55,6 @@
>>>  #define KVM_REQ_TRIPLE_FAULT  10
>>>  #define KVM_REQ_MMU_SYNC  11
>>>  #define KVM_REQ_CLOCK_UPDATE  12
>>> -#define KVM_REQ_DEACTIVATE_FPU13
>>>  #define KVM_REQ_EVENT 14
>>>  #define KVM_REQ_APF_HALT  15
>>>  #define KVM_REQ_STEAL_UPDATE  16
>>> @@ -936,8 +935,6 @@ struct kvm_x86_ops {
>>> unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
>>> void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
>>> u32 (*get_pkru)(struct kvm_vcpu *vcpu);
>>> -   void (*fpu_activate)(struct kvm_vcpu *vcpu);
>>> -   void (*fpu_deactivate)(struct kvm_vcpu *vcpu);
>>>  
>>> void (*tlb_flush)(struct kvm_vcpu *vcpu);
>>>  
>>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>>> index c0e2036217ad..1d155cc56629 100644
>>> --- a/arch/x86/kvm/cpuid.c
>>> +++ b/arch/x86/kvm/cpuid.c
>>> @@ -123,8 +123,6 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
>>> if (best && (best->eax & (F(XSAVES) | F(XSAVEC
>>> best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
>>>  
>>> -   kvm_x86_ops->fpu_activate(vcpu);
>>> -
>>> /*
>>>  * The existing code assumes virtual address is 48-bit in the canonical
>>>  * address checks; exit if it is ever changed.
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index 4e5905a1ce70..d1efe2c62b3f 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -1157,7 +1157,6 @@ static void init_vmcb(struct vcpu_svm *svm)
>>> struct vmcb_control_area *control = &svm->vmcb->control;
>>> struct vmcb_save_area *save = &svm->vmcb->save;
>>>  
>>> -   svm->vcpu.fpu_active = 1;
>>> svm->vcpu.arch.hflags = 0;
>>>  
>>> set_cr_intercept(svm, INTERCEPT_CR0_READ);
>>> @@ -1899,15 +1898,12 @@ static void update_cr0_intercept(struct vcpu_svm 
>>> *svm)
>>> ulong gcr0 = svm->vcpu.arch.cr0;
>>> u64 *hcr0 = &svm->vmcb->save.cr0;
>>>  
>>> -   if (!svm->vcpu.fpu_active)
>>> -   *hcr0 |= SVM_CR0_SELECTIVE_MASK;
>>> -   else
>>> -   *hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
>>> -   | (gcr0 & SVM_CR0_SELECTIVE_MASK);
>>> +   *hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
>>> +   | (gcr0 & SVM_CR0_SELECTIVE_MASK);
>>>  
>>> mark_dirty(svm->vmcb, VMCB_CR);
>>>  
>>> -   if (gcr0 == *hcr0 && svm->vcpu.fpu_active) {
>>> +   if (gcr0 == *hcr0) {
>>> clr_cr_intercept(svm, INTERCEPT_CR0_READ);
>>> clr_cr_intercept(svm, INTERCEPT_CR0_WRITE);
>>> } else {
>>> @@ -1938,8 +1934,6 @@ static void svm_set_cr0(struct kvm_vcpu *vcpu, 

Re: [PATCH] KVM: VMX: use vmcs_set/clear_bits for CPU-based execution controls

2017-02-17 Thread Bandan Das
Paolo Bonzini  writes:

> - Original Message -
>> From: "Bandan Das" 
>> To: "Paolo Bonzini" 
>> Cc: linux-kernel@vger.kernel.org, k...@vger.kernel.org
>> Sent: Friday, February 17, 2017 1:04:14 AM
>> Subject: Re: [PATCH] KVM: VMX: use vmcs_set/clear_bits for CPU-based 
>> execution controls
>> 
>> Paolo Bonzini  writes:
>> 
>> > Signed-off-by: Paolo Bonzini 
>> > ---
>> 
>> I took a quick look and found these two potential
>> consumers of these set/clear wrappers.
>> 
>> vmcs_set_secondary_exec_control()
>> vmx_set_virtual_x2apic_mode()
>> 
>> Since this has been reviewed already,
>> we can just have them later in a follow up
>> (unless you left them out intentionally).
>
> Both of these can both set and clear bits, so they could be the
> consumer of a new function
>
> void vmcs_write_bits(u16 field, u32 value, u32 mask)
>
> but I don't see much benefit in introducing it; the cognitive
> load is higher than vmcs_{set,clear}_bits.

Yes, agreed. Thanks!

> Paolo
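
To make the trade-off concrete, here is a stand-alone sketch of the
vmcs_write_bits() helper described above, built on toy
vmcs_read32()/vmcs_write32() stand-ins backed by a plain array (the real
accessors wrap VMREAD/VMWRITE); the field encoding in main() is purely
illustrative.

#include <stdint.h>
#include <stdio.h>

static uint32_t fake_vmcs[0x10000];    /* toy backing store for the example */

static uint32_t vmcs_read32(uint16_t field)
{
        return fake_vmcs[field];
}

static void vmcs_write32(uint16_t field, uint32_t value)
{
        fake_vmcs[field] = value;
}

static void vmcs_set_bits(uint16_t field, uint32_t mask)
{
        vmcs_write32(field, vmcs_read32(field) | mask);
}

static void vmcs_clear_bits(uint16_t field, uint32_t mask)
{
        vmcs_write32(field, vmcs_read32(field) & ~mask);
}

/* The combined helper discussed in the thread: update only the masked bits. */
static void vmcs_write_bits(uint16_t field, uint32_t value, uint32_t mask)
{
        vmcs_write32(field, (vmcs_read32(field) & ~mask) | (value & mask));
}

int main(void)
{
        uint16_t field = 0x4002;       /* illustrative field encoding */

        vmcs_write32(field, 0x00ff00ffu);
        vmcs_set_bits(field, 0x0000ff00u);
        vmcs_clear_bits(field, 0x000000f0u);
        vmcs_write_bits(field, 0xa0000000u, 0xf0000000u);
        printf("%#x\n", vmcs_read32(field));   /* 0xa0ffff0f */
        return 0;
}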

