Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

2024-05-29 Thread Sean Christopherson
On Wed, May 29, 2024, Yu Zhao wrote:
> On Wed, May 29, 2024 at 3:59 PM Sean Christopherson  wrote:
> >
> > On Wed, May 29, 2024, Yu Zhao wrote:
> > > On Wed, May 29, 2024 at 12:05 PM James Houghton  
> > > wrote:
> > > >
> > > > Secondary MMUs are currently consulted for access/age information at
> > > > eviction time, but before then, we don't get accurate age information.
> > > > That is, pages that are mostly accessed through a secondary MMU (like
> > > > guest memory, used by KVM) will always just proceed down to the oldest
> > > > generation, and then at eviction time, if KVM reports the page to be
> > > > young, the page will be activated/promoted back to the youngest
> > > > generation.
> > >
> > > Correct, and as I explained offline, this is the only reasonable
> > > behavior if we can't locklessly walk secondary MMUs.
> > >
> > > Just for the record, the (crude) analogy I used was:
> > > Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> > > but you are only allowed to pick up 10 of them (and put them in your
> > > pocket). A smart move would be to survey the room *first and then*
> > > pick up the largest ones. But if you are carrying a 500 lbs backpack,
> > > you would just want to pick up whichever one is in front of you rather
> > > than walk the entire room.
> > >
> > > MGLRU should only scan (or lookaround) secondary MMUs if it can be
> > > done lockless. Otherwise, it should just fall back to the existing
> > > approach, which existed in previous versions but is removed in this
> > > version.
> >
> > IIUC, by "existing approach" you mean completely ignore secondary MMUs that
> > don't implement a lockless walk?
> 
> No, the existing approach only checks secondary MMUs for LRU folios,
> i.e., those at the end of the LRU list. It might not find the best
> candidates (the coldest ones) on the entire list, but it doesn't pay
> as much for the locking. MGLRU can *optionally* scan MMUs (secondary
> included) to find the best candidates, but it can only be a win if the
> scanning incurs a relatively low overhead, e.g., done locklessly for
> the secondary MMU. IOW, this is a balance between the cost of
> reclaiming not-so-cold (warm) folios and that of finding the coldest
> folios.

Gotcha.

I tend to agree with Yu, driving the behavior via a Kconfig may generate simpler
_code_, but I think it increases the overall system complexity.  E.g. distros
will likely enable the Kconfig, and in my experience people using KVM with a
distro kernel usually aren't kernel experts, i.e. likely won't know that there's
even a decision to be made, let alone be able to make an informed decision.

Having an mmu_notifier hook that is conditionally implemented doesn't seem overly
complex, e.g. even if there's a runtime aspect at play, it'd be easy enough for
KVM to nullify its mmu_notifier hook during initialization.  The hardest part is
likely going to be figuring out the threshold for how much overhead is too much.
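
For illustration only, a sketch of what such a conditionally implemented aging
hook might look like.  kvm_arch_has_lockless_aging() and the hook's name and
return convention are invented for this example; they are not existing KVM APIs.

static int kvm_mmu_notifier_test_clear_young_fast_only(struct mmu_notifier *mn,
						       struct mm_struct *mm,
						       unsigned long start,
						       unsigned long end,
						       bool clear)
{
	/*
	 * If aging would have to take mmu_lock, report "not participating" so
	 * that MGLRU falls back to checking pages at eviction time.
	 */
	if (!kvm_arch_has_lockless_aging())
		return 0;

	return kvm_handle_hva_range_no_flush(mn, start, end,
					     clear ? kvm_age_gfn : kvm_test_age_gfn);
}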


Re: [PATCH v4 2/7] mm: multi-gen LRU: Have secondary MMUs participate in aging

2024-05-29 Thread Sean Christopherson
On Wed, May 29, 2024, Yu Zhao wrote:
> On Wed, May 29, 2024 at 12:05 PM James Houghton  wrote:
> >
> > Secondary MMUs are currently consulted for access/age information at
> > eviction time, but before then, we don't get accurate age information.
> > That is, pages that are mostly accessed through a secondary MMU (like
> > guest memory, used by KVM) will always just proceed down to the oldest
> > generation, and then at eviction time, if KVM reports the page to be
> > young, the page will be activated/promoted back to the youngest
> > generation.
> 
> Correct, and as I explained offline, this is the only reasonable
> behavior if we can't locklessly walk secondary MMUs.
> 
> Just for the record, the (crude) analogy I used was:
> Imagine a large room with many bills ($1, $5, $10, ...) on the floor,
> but you are only allowed to pick up 10 of them (and put them in your
> pocket). A smart move would be to survey the room *first and then*
> pick up the largest ones. But if you are carrying a 500 lbs backpack,
> you would just want to pick up whichever one is in front of you rather
> than walk the entire room.
> 
> MGLRU should only scan (or lookaround) secondary MMUs if it can be
> done lockless. Otherwise, it should just fall back to the existing
> approach, which existed in previous versions but is removed in this
> version.

IIUC, by "existing approach" you mean completely ignore secondary MMUs that 
don't
implement a lockless walk?


Re: [PATCH v4 4/7] KVM: Move MMU lock acquisition for test/clear_young to architecture

2024-05-29 Thread Sean Christopherson
On Wed, May 29, 2024, James Houghton wrote:
> For implementing mmu_notifier_{test,clear}_young, the KVM memslot
> walker used to take the MMU lock for us. Now make the architectures
> take it themselves.

Hmm, *forcing* architectures to take mmu_lock is a step backwards.  Rather than
add all of this churn, what about adding CONFIG_KVM_MMU_NOTIFIER_LOCKLESS, e.g.

static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
 unsigned long start,
 unsigned long end,
 gfn_handler_t handler)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
const struct kvm_mmu_notifier_range range = {
.start  = start,
.end= end,
.handler= handler,
.on_lock= (void *)kvm_null_fn,
.flush_on_ret   = false,
.may_block  = false,
.lockless   = IS_ENABLED(CONFIG_KVM_MMU_NOTIFIER_LOCKLESS),
};

return __kvm_handle_hva_range(kvm, &range).ret;
}
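
And as a rough guess at how the flag would be consumed, the lazy-locking logic
inside __kvm_handle_hva_range()'s memslot loop could simply be skipped for
lockless walks.  This is a sketch of the shape of the code, not the actual
patch; the SRCU-only protection of the lockless path is assumed:

		if (!range->lockless && !r.found_memslot) {
			r.found_memslot = true;
			KVM_MMU_LOCK(kvm);
			if (!IS_KVM_NULL_FN(range->on_lock))
				range->on_lock(kvm);
		}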


Re: [PATCH v4 3/7] KVM: Add lockless memslot walk to KVM

2024-05-29 Thread Sean Christopherson
On Wed, May 29, 2024, James Houghton wrote:
> @@ -686,10 +694,12 @@ static __always_inline int kvm_handle_hva_range(struct 
> mmu_notifier *mn,
>   return __kvm_handle_hva_range(kvm, &range).ret;
>  }
>  
> -static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn,
> -  unsigned long start,
> -  unsigned long end,
> -  gfn_handler_t handler)
> +static __always_inline int kvm_handle_hva_range_no_flush(
> + struct mmu_notifier *mn,
> + unsigned long start,
> + unsigned long end,
> + gfn_handler_t handler,
> + bool lockless)

Unnecessary and unwanted style change.

>  {
>   struct kvm *kvm = mmu_notifier_to_kvm(mn);
>   const struct kvm_mmu_notifier_range range = {
> @@ -699,6 +709,7 @@ static __always_inline int 
> kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
>   .on_lock= (void *)kvm_null_fn,
>   .flush_on_ret   = false,
>   .may_block  = false,
> + .lockless   = lockless,

Why add @lockless to kvm_handle_hva_range_no_flush()?  Both callers immediately
pass %false, and conceptually, locking is always optional for a "no flush"
variant.

>   };
>  
>   return __kvm_handle_hva_range(kvm, &range).ret;
> @@ -889,7 +900,8 @@ static int kvm_mmu_notifier_clear_young(struct 
> mmu_notifier *mn,
>* cadence. If we find this inaccurate, we might come up with a
>* more sophisticated heuristic later.
>*/
> - return kvm_handle_hva_range_no_flush(mn, start, end, kvm_age_gfn);
> + return kvm_handle_hva_range_no_flush(mn, start, end,
> +  kvm_age_gfn, false);
>  }
>  
>  static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
> @@ -899,7 +911,7 @@ static int kvm_mmu_notifier_test_young(struct 
> mmu_notifier *mn,
>   trace_kvm_test_age_hva(address);
>  
>   return kvm_handle_hva_range_no_flush(mn, address, address + 1,
> -  kvm_test_age_gfn);
> +  kvm_test_age_gfn, false);
>  }
>  
>  static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
> -- 
> 2.45.1.288.g0e0cd299f1-goog
> 


Re: [PATCH v2 3/6] KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()

2024-05-29 Thread Sean Christopherson
On Wed, May 29, 2024, Kai Huang wrote:
> I am not familiar with SVM, but it seems the relevant parts are:
> 
>   control->pause_filter_count;
>   vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
> 
> And it seems they are directly related to programming the hardware, i.e.,
> they got automatically loaded to hardware during VMRUN.

"control" is the control area of the VMCB, i.e. the above pause_filter_count is
equivalent to a VMCS field.

> They need to be updated in the SVM specific code when @ple_window_dirty is
> true in the relevant code path.
> 
> Anyway, even if it is feasible and worth doing, we should do it in a separate
> patchset.

Ya.


Re: [PATCH v2 3/6] KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()

2024-05-28 Thread Sean Christopherson
On Fri, May 24, 2024, Kai Huang wrote:
> > @@ -1548,6 +1548,9 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int 
> > cpu)
> > struct vcpu_svm *svm = to_svm(vcpu);
> > struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, cpu);
> > +   if (vcpu->scheduled_out && !kvm_pause_in_guest(vcpu->kvm))
> > +   shrink_ple_window(vcpu);
> > +
> 
> [...]
> 
> > @@ -1517,6 +1517,9 @@ void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >   {
> > struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +   if (vcpu->scheduled_out && !kvm_pause_in_guest(vcpu->kvm))
> > +   shrink_ple_window(vcpu);
> > +
> 
> Nit:  Perhaps we need a kvm_x86_ops::shrink_ple_window()?  :-)

Heh, that duplicate code annoys me too.  The problem is the "old" window value
comes from the VMCS/VMCB, so either we'd end up with multiple kvm_x86_ops, or
we'd only be able to consolidate the scheduled_out + kvm_pause_in_guest() code,
which isn't all that interesting.

Aha!  Actually, VMX already open codes the functionality provided by VCPU_EXREG_*,
e.g. has vmx->ple_window_dirty.  If we add VCPU_EXREG_PLE_WINDOW, then the info
can be made available to common x86 code without having to add new hooks.  And
that would also allow moving the guts of handle_pause()/pause_interception() to
common code, i.e. will also allow deduplicating the "grow" side of things.
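
Purely as a sketch of where that could lead, a common-x86 shrink helper built on
hypothetical kvm_get_ple_window()/kvm_set_ple_window() accessors.  The accessors,
VCPU_EXREG_PLE_WINDOW itself, and common-code visibility of the ple_window*
module params are all assumptions here, not something this series adds:

static void kvm_shrink_ple_window(struct kvm_vcpu *vcpu)
{
	unsigned int old = kvm_get_ple_window(vcpu);
	unsigned int new = __shrink_ple_window(old, ple_window,
					       ple_window_shrink, ple_window);

	if (new != old) {
		/* Would mark the hypothetical VCPU_EXREG_PLE_WINDOW dirty. */
		kvm_set_ple_window(vcpu, new);
		trace_kvm_ple_window_update(vcpu->vcpu_id, new, old);
	}
}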


[PATCH v2 2/6] KVM: VMX: Move PLE grow/shrink helpers above vmx_vcpu_load()

2024-05-21 Thread Sean Christopherson
Move VMX's {grow,shrink}_ple_window() above vmx_vcpu_load() in preparation
of moving the sched_in logic, which handles shrinking the PLE window, into
vmx_vcpu_load().

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/vmx/vmx.c | 64 +-
 1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 51b2cd13250a..07a4d6a3a43e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1410,6 +1410,38 @@ static void vmx_write_guest_kernel_gs_base(struct 
vcpu_vmx *vmx, u64 data)
 }
 #endif
 
+static void grow_ple_window(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   unsigned int old = vmx->ple_window;
+
+   vmx->ple_window = __grow_ple_window(old, ple_window,
+   ple_window_grow,
+   ple_window_max);
+
+   if (vmx->ple_window != old) {
+   vmx->ple_window_dirty = true;
+   trace_kvm_ple_window_update(vcpu->vcpu_id,
+   vmx->ple_window, old);
+   }
+}
+
+static void shrink_ple_window(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   unsigned int old = vmx->ple_window;
+
+   vmx->ple_window = __shrink_ple_window(old, ple_window,
+ ple_window_shrink,
+ ple_window);
+
+   if (vmx->ple_window != old) {
+   vmx->ple_window_dirty = true;
+   trace_kvm_ple_window_update(vcpu->vcpu_id,
+   vmx->ple_window, old);
+   }
+}
+
 void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
struct loaded_vmcs *buddy)
 {
@@ -5889,38 +5921,6 @@ int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
return 1;
 }
 
-static void grow_ple_window(struct kvm_vcpu *vcpu)
-{
-   struct vcpu_vmx *vmx = to_vmx(vcpu);
-   unsigned int old = vmx->ple_window;
-
-   vmx->ple_window = __grow_ple_window(old, ple_window,
-   ple_window_grow,
-   ple_window_max);
-
-   if (vmx->ple_window != old) {
-   vmx->ple_window_dirty = true;
-   trace_kvm_ple_window_update(vcpu->vcpu_id,
-   vmx->ple_window, old);
-   }
-}
-
-static void shrink_ple_window(struct kvm_vcpu *vcpu)
-{
-   struct vcpu_vmx *vmx = to_vmx(vcpu);
-   unsigned int old = vmx->ple_window;
-
-   vmx->ple_window = __shrink_ple_window(old, ple_window,
- ple_window_shrink,
- ple_window);
-
-   if (vmx->ple_window != old) {
-   vmx->ple_window_dirty = true;
-   trace_kvm_ple_window_update(vcpu->vcpu_id,
-   vmx->ple_window, old);
-   }
-}
-
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
-- 
2.45.0.215.g3402c0e53f-goog



[PATCH v2 6/6] KVM: x86: Drop now-superfluous setting of l1tf_flush_l1d in vcpu_run()

2024-05-21 Thread Sean Christopherson
Now that KVM unconditionally sets l1tf_flush_l1d in kvm_arch_vcpu_load(),
drop the redundant store from vcpu_run().  The flag is cleared only when
VM-Enter is imminent, deep below vcpu_run(), i.e. barring a KVM bug, it's
impossible for l1tf_flush_l1d to be cleared between loading the vCPU and
calling vcpu_run().

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/vmx/vmx.c | 7 ---
 arch/x86/kvm/x86.c | 1 -
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index da2f95385a12..552b6a9887a5 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6672,9 +6672,10 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
bool flush_l1d;
 
/*
-* Clear the per-vcpu flush bit, it gets set again
-* either from vcpu_run() or from one of the unsafe
-* VMEXIT handlers.
+* Clear the per-vcpu flush bit, it gets set again if the vCPU
+* is reloaded, i.e. if the vCPU is scheduled out or if KVM
+* exits to userspace, or if KVM reaches one of the unsafe
+* VMEXIT handlers, e.g. if KVM calls into the emulator.
 */
flush_l1d = vcpu->arch.l1tf_flush_l1d;
vcpu->arch.l1tf_flush_l1d = false;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 60fea297f91f..86ae7392cc59 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11264,7 +11264,6 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
int r;
 
vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
-   vcpu->arch.l1tf_flush_l1d = true;
 
for (;;) {
/*
-- 
2.45.0.215.g3402c0e53f-goog



[PATCH v2 4/6] KVM: Delete the now unused kvm_arch_sched_in()

2024-05-21 Thread Sean Christopherson
Delete kvm_arch_sched_in() now that all implementations are nops.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h | 1 -
 arch/loongarch/include/asm/kvm_host.h | 1 -
 arch/mips/include/asm/kvm_host.h  | 1 -
 arch/powerpc/include/asm/kvm_host.h   | 1 -
 arch/riscv/include/asm/kvm_host.h | 1 -
 arch/s390/include/asm/kvm_host.h  | 1 -
 arch/x86/kvm/pmu.c| 6 +++---
 arch/x86/kvm/x86.c| 5 -
 include/linux/kvm_host.h  | 2 --
 virt/kvm/kvm_main.c   | 1 -
 10 files changed, 3 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 8170c04fde91..615e7a2e5590 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1225,7 +1225,6 @@ static inline bool kvm_system_needs_idmapped_vectors(void)
 }
 
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 
 void kvm_arm_init_debug(void);
 void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
diff --git a/arch/loongarch/include/asm/kvm_host.h 
b/arch/loongarch/include/asm/kvm_host.h
index c87b6ea0ec47..4162a252cdf6 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -261,7 +261,6 @@ static inline bool kvm_is_ifetch_fault(struct kvm_vcpu_arch 
*arch)
 static inline void kvm_arch_hardware_unsetup(void) {}
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 179f320cc231..6743a57c1ab4 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -890,7 +890,6 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_free_memslot(struct kvm *kvm,
 struct kvm_memory_slot *slot) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 8abac532146e..c4fb6a27fb92 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -897,7 +897,6 @@ struct kvm_vcpu_arch {
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 
diff --git a/arch/riscv/include/asm/kvm_host.h 
b/arch/riscv/include/asm/kvm_host.h
index d96281278586..dd77c2db6819 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -286,7 +286,6 @@ struct kvm_vcpu_arch {
 };
 
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 
 #define KVM_RISCV_GSTAGE_TLB_MIN_ORDER 12
 
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 95990461888f..e9fcaf4607a6 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -1045,7 +1045,6 @@ extern int kvm_s390_gisc_register(struct kvm *kvm, u32 
gisc);
 extern int kvm_s390_gisc_unregister(struct kvm *kvm, u32 gisc);
 
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_free_memslot(struct kvm *kvm,
 struct kvm_memory_slot *slot) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index a593b03c9aed..f9149c9fc275 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -521,9 +521,9 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
}
 
/*
-* Unused perf_events are only released if the corresponding MSRs
-* weren't accessed during the last vCPU time slice. kvm_arch_sched_in
-* triggers KVM_REQ_PMU if cleanup is needed.
+* Release unused perf_events if the corresponding guest MSRs weren't
+* accessed during the last vCPU time slice

[PATCH v2 3/6] KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()

2024-05-21 Thread Sean Christopherson
Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying
off the recently added kvm_vcpu.scheduled_out as appropriate.

Note, there is a very slight functional change, as PLE shrink updates will
now happen after blasting WBINVD, but that is quite uninteresting as the
two operations do not interact in any way.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 -
 arch/x86/include/asm/kvm_host.h|  2 --
 arch/x86/kvm/svm/svm.c | 11 +++
 arch/x86/kvm/vmx/main.c|  2 --
 arch/x86/kvm/vmx/vmx.c |  9 +++--
 arch/x86/kvm/vmx/x86_ops.h |  1 -
 arch/x86/kvm/x86.c | 17 ++---
 7 files changed, 16 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h 
b/arch/x86/include/asm/kvm-x86-ops.h
index 566d19b02483..5a8b74c2e6c4 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -103,7 +103,6 @@ KVM_X86_OP(write_tsc_multiplier)
 KVM_X86_OP(get_exit_info)
 KVM_X86_OP(check_intercept)
 KVM_X86_OP(handle_exit_irqoff)
-KVM_X86_OP(sched_in)
 KVM_X86_OP_OPTIONAL(update_cpu_dirty_logging)
 KVM_X86_OP_OPTIONAL(vcpu_blocking)
 KVM_X86_OP_OPTIONAL(vcpu_unblocking)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aabf1648a56a..0df4d14db896 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1750,8 +1750,6 @@ struct kvm_x86_ops {
   struct x86_exception *exception);
void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu);
 
-   void (*sched_in)(struct kvm_vcpu *vcpu, int cpu);
-
/*
 * Size of the CPU's dirty log buffer, i.e. VMX's PML buffer.  A zero
 * value indicates CPU dirty logging is unsupported or disabled.
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3d0549ca246f..51a5eb31aee5 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1548,6 +1548,9 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
struct vcpu_svm *svm = to_svm(vcpu);
struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, cpu);
 
+   if (vcpu->scheduled_out && !kvm_pause_in_guest(vcpu->kvm))
+   shrink_ple_window(vcpu);
+
if (sd->current_vmcb != svm->vmcb) {
sd->current_vmcb = svm->vmcb;
 
@@ -4572,12 +4575,6 @@ static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu)
vcpu->arch.at_instruction_boundary = true;
 }
 
-static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu)
-{
-   if (!kvm_pause_in_guest(vcpu->kvm))
-   shrink_ple_window(vcpu);
-}
-
 static void svm_setup_mce(struct kvm_vcpu *vcpu)
 {
/* [63:9] are reserved. */
@@ -5046,8 +5043,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.check_intercept = svm_check_intercept,
.handle_exit_irqoff = svm_handle_exit_irqoff,
 
-   .sched_in = svm_sched_in,
-
.nested_ops = &svm_nested_ops,
 
.deliver_interrupt = svm_deliver_interrupt,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 7c546ad3e4c9..4fee9a8cc5a1 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -121,8 +121,6 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.check_intercept = vmx_check_intercept,
.handle_exit_irqoff = vmx_handle_exit_irqoff,
 
-   .sched_in = vmx_sched_in,
-
.cpu_dirty_log_size = PML_ENTITY_NUM,
.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 07a4d6a3a43e..da2f95385a12 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1517,6 +1517,9 @@ void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+   if (vcpu->scheduled_out && !kvm_pause_in_guest(vcpu->kvm))
+   shrink_ple_window(vcpu);
+
vmx_vcpu_load_vmcs(vcpu, cpu, NULL);
 
vmx_vcpu_pi_load(vcpu, cpu);
@@ -8171,12 +8174,6 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
 }
 #endif
 
-void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
-{
-   if (!kvm_pause_in_guest(vcpu->kvm))
-   shrink_ple_window(vcpu);
-}
-
 void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 502704596c83..3cb0be94e779 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -112,7 +112,6 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
 void vmx_write_tsc_offset(struct kvm_vcpu *vcpu);
 void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu);
 void vmx_request_immediate_exit(struct kvm_vcpu *vcpu);
-void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu);
 void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_X86_64
 int vmx_set_hv_timer(struct kvm_vcpu *vc

[PATCH v2 5/6] KVM: x86: Unconditionally set l1tf_flush_l1d during vCPU load

2024-05-21 Thread Sean Christopherson
Always set l1tf_flush_l1d during kvm_arch_vcpu_load() instead of setting
it only when the vCPU is being scheduled back in.  The flag is processed
only when VM-Enter is imminent, and KVM obviously needs to load the vCPU
before VM-Enter, so attempting to precisely set l1tf_flush_l1d provides no
meaningful value.  I.e. the flag _will_ be set either way, it's simply a
matter of when.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 59aa772af755..60fea297f91f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5006,12 +5006,11 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
 
-   if (vcpu->scheduled_out) {
-   vcpu->arch.l1tf_flush_l1d = true;
-   if (pmu->version && unlikely(pmu->event_count)) {
-   pmu->need_cleanup = true;
-   kvm_make_request(KVM_REQ_PMU, vcpu);
-   }
+   vcpu->arch.l1tf_flush_l1d = true;
+
+   if (vcpu->scheduled_out && pmu->version && pmu->event_count) {
+   pmu->need_cleanup = true;
+   kvm_make_request(KVM_REQ_PMU, vcpu);
}
 
/* Address WBINVD may be executed by guest */
-- 
2.45.0.215.g3402c0e53f-goog



[PATCH v2 1/6] KVM: Add a flag to track if a loaded vCPU is scheduled out

2024-05-21 Thread Sean Christopherson
Add a kvm_vcpu.scheduled_out flag to track if a vCPU is in the process of
being scheduled out (vCPU put path), or if the vCPU is being reloaded
after being scheduled out (vCPU load path).  In the short term, this will
allow dropping kvm_arch_sched_in(), as arch code can query scheduled_out
during kvm_arch_vcpu_load().

Longer term, scheduled_out opens up other potential optimizations, without
creating subtle/brittle dependencies.  E.g. it allows KVM to keep guest
state (that is managed via kvm_arch_vcpu_{load,put}()) loaded across
kvm_sched_{out,in}(), if KVM knows the state isn't accessed by the host
kernel.  Forcing arch code to coordinate between kvm_arch_sched_{in,out}()
and kvm_arch_vcpu_{load,put}() is awkward, not reusable, and relies on the
exact ordering of calls into arch code.

Adding scheduled_out also obviates the need for a kvm_arch_sched_out()
hook, e.g. if arch code needs to do something novel when putting vCPU
state.

And even if KVM never uses scheduled_out for anything beyond dropping
kvm_arch_sched_in(), just being able to remove all of the arch stubs makes
it worth adding the flag.

Link: https://lore.kernel.org/all/20240430224431.490139-1-sea...@google.com
Cc: Oliver Upton 
Signed-off-by: Sean Christopherson 
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/kvm_main.c  | 4 
 2 files changed, 5 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7b57878c8c18..bde69f74b031 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -380,6 +380,7 @@ struct kvm_vcpu {
 #endif
bool preempted;
bool ready;
+   bool scheduled_out;
struct kvm_vcpu_arch arch;
struct kvm_vcpu_stat stat;
char stats_id[KVM_STATS_NAME_SIZE];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a1756d5077ee..7ecea573d121 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -6288,6 +6288,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int 
cpu)
__this_cpu_write(kvm_running_vcpu, vcpu);
kvm_arch_sched_in(vcpu, cpu);
kvm_arch_vcpu_load(vcpu, cpu);
+
+   WRITE_ONCE(vcpu->scheduled_out, false);
 }
 
 static void kvm_sched_out(struct preempt_notifier *pn,
@@ -6295,6 +6297,8 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 {
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
 
+   WRITE_ONCE(vcpu->scheduled_out, true);
+
if (current->on_rq) {
WRITE_ONCE(vcpu->preempted, true);
WRITE_ONCE(vcpu->ready, true);
-- 
2.45.0.215.g3402c0e53f-goog



[PATCH v2 0/6] KVM: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()

2024-05-21 Thread Sean Christopherson
Drop kvm_arch_sched_in() and instead add and use kvm_vcpu.scheduled_out
to communicate to kvm_arch_vcpu_load() that the vCPU is being scheduling
back in.

While fiddling with an idea for optimizing state management on AMD CPUs,
I wanted to skip re-saving certain host state when a vCPU is scheduled back
in, as the state (theoretically) shouldn't change for the task while it's
scheduled out.  Actually doing that was annoying and unnecessarily brittle
due to having a separate API for the kvm_sched_in() case (the state save
needed to be in kvm_arch_vcpu_load() for the common path).

The other motivation for this is to avoid yet another arch hook, and more
arbitrary ordering, if there's a future need to hook kvm_sched_out() (we've
come close on the x86 side several times).  E.g. kvm_arch_vcpu_put() can
simply check kvm_vcpu.scheduled_out if it needs to do something specific for
the vCPU being scheduled out.
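
As a purely illustrative example of that last point (both helpers below are
invented names, not functions this series adds), an arch put hook could key off
the flag directly:

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
	if (vcpu->scheduled_out) {
		/* Preempted: minimal save, the vCPU will be reloaded shortly. */
		kvm_arch_save_sched_out_state(vcpu);
		return;
	}

	/* Full put, e.g. the vCPU ioctl is returning to userspace. */
	kvm_arch_save_full_state(vcpu);
}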

v2:
 - Add scheduled_out flag instead of passing a bool to kvm_arch_vcpu_load().
   [Oliver]
 - Tack on patches to clean up x86's setting of l1tf_flush_l1d in
   kvm_arch_vcpu_load() (the code looked slightly less weird when the flag
   was being set by kvm_arch_sched_in()).

v1: https://lore.kernel.org/all/20240430193157.419425-1-sea...@google.com

Sean Christopherson (6):
  KVM: Add a flag to track if a loaded vCPU is scheduled out
  KVM: VMX: Move PLE grow/shrink helpers above vmx_vcpu_load()
  KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()
  KVM: Delete the now unused kvm_arch_sched_in()
  KVM: x86: Unconditionally set l1tf_flush_l1d during vCPU load
  KVM: x86: Drop now-superfluous setting of l1tf_flush_l1d in vcpu_run()

 arch/arm64/include/asm/kvm_host.h |  1 -
 arch/loongarch/include/asm/kvm_host.h |  1 -
 arch/mips/include/asm/kvm_host.h  |  1 -
 arch/powerpc/include/asm/kvm_host.h   |  1 -
 arch/riscv/include/asm/kvm_host.h |  1 -
 arch/s390/include/asm/kvm_host.h  |  1 -
 arch/x86/include/asm/kvm-x86-ops.h|  1 -
 arch/x86/include/asm/kvm_host.h   |  2 -
 arch/x86/kvm/pmu.c|  6 +-
 arch/x86/kvm/svm/svm.c| 11 +---
 arch/x86/kvm/vmx/main.c   |  2 -
 arch/x86/kvm/vmx/vmx.c| 80 +--
 arch/x86/kvm/vmx/x86_ops.h|  1 -
 arch/x86/kvm/x86.c| 22 +++-
 include/linux/kvm_host.h  |  3 +-
 virt/kvm/kvm_main.c   |  5 +-
 16 files changed, 59 insertions(+), 80 deletions(-)


base-commit: 4aad0b1893a141f114ba40ed509066f3c9bc24b0
-- 
2.45.0.215.g3402c0e53f-goog



Re: [PATCH 0/4] KVM: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()

2024-05-01 Thread Sean Christopherson
On Wed, May 01, 2024, Oliver Upton wrote:
> On Tue, Apr 30, 2024 at 12:31:53PM -0700, Sean Christopherson wrote:
> > Drop kvm_arch_sched_in() and instead pass a @sched_in boolean to
> > kvm_arch_vcpu_load().
> > 
> > While fiddling with an idea for optimizing state management on AMD CPUs,
> > I wanted to skip re-saving certain host state when a vCPU is scheduled back
> > in, as the state (theoretically) shouldn't change for the task while it's
> > scheduled out.  Actually doing that was annoying and unnecessarily brittle
> > due to having a separate API for the kvm_sched_in() case (the state save
> > needed to be in kvm_arch_vcpu_load() for the common path).
> > 
> > E.g. I could have set a "temporary"-ish flag somewhere in kvm_vcpu, but (a)
> > that's gross and (b) it would rely on the arbitrary ordering between
> > sched_in() and vcpu_load() staying the same.
> 
> Another option would be to change the rules around kvm_arch_sched_in()
> where the callee is expected to load the vCPU context.
> 
> The default implementation could just call kvm_arch_vcpu_load() directly
> and the x86 implementation can order things the way it wants before
> kvm_arch_vcpu_load().
> 
> I say this because ...
> 
> > The only real downside I see is that arm64 and riscv end up having to pass
> > "false" for their direct usage of kvm_arch_vcpu_load(), and passing boolean
> > literals isn't ideal.  But that can be solved by adding an inner helper that
> > omits the @sched_in param (I almost added a patch to do that, but I couldn't
> > convince myself it was necessary).
> 
> Needing to pass @sched_in for other usage of kvm_arch_vcpu_load() hurts
> readability, especially when no other architecture besides x86 cares
> about it.

Yeah, that bothers me too.

I tried your suggestion of having x86's kvm_arch_sched_in() do kvm_arch_vcpu_load(),
and even with an added kvm_arch_sched_out() to provide symmetry, the x86 code is
kludgy, and even the common code is a bit confusing as it's not super obvious
that kvm_sched_{in,out}() is really just kvm_arch_vcpu_{load,put}().

Staring a bit more at the vCPU flags we have, adding a "bool scheduled_out" isn't
terribly gross if it's done in common code and persists across load() and put(),
i.e. isn't so blatantly a temporary field.  And because it's easy, it could be
set with WRITE_ONCE() so that it can be read cross-task if there's ever a
reason to do so.

The x86 code ends up being less ugly, and adding future arch/vendor code for
sched_in() *or* sched_out() requires minimal churn, e.g. arch code doesn't need
to override kvm_arch_sched_in().

The only weird part is that vcpu->preempted and vcpu->ready have slightly
different behavior, as they are cleared before kvm_arch_vcpu_load().  But the
weirdness is really with those flags not having symmetry, not with scheduled_out
itself.

Thoughts?

static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
{
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

WRITE_ONCE(vcpu->preempted, false);
WRITE_ONCE(vcpu->ready, false);

__this_cpu_write(kvm_running_vcpu, vcpu);
kvm_arch_vcpu_load(vcpu, cpu);

WRITE_ONCE(vcpu->scheduled_out, false);
}

static void kvm_sched_out(struct preempt_notifier *pn,
  struct task_struct *next)
{
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

WRITE_ONCE(vcpu->scheduled_out, true);

if (current->on_rq) {
WRITE_ONCE(vcpu->preempted, true);
WRITE_ONCE(vcpu->ready, true);
}
kvm_arch_vcpu_put(vcpu);
__this_cpu_write(kvm_running_vcpu, NULL);
}


[PATCH 4/4] KVM: Delete the now unused kvm_arch_sched_in()

2024-04-30 Thread Sean Christopherson
Delete kvm_arch_sched_in() now that all implementations are nops.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h | 1 -
 arch/loongarch/include/asm/kvm_host.h | 1 -
 arch/mips/include/asm/kvm_host.h  | 1 -
 arch/powerpc/include/asm/kvm_host.h   | 1 -
 arch/riscv/include/asm/kvm_host.h | 1 -
 arch/s390/include/asm/kvm_host.h  | 1 -
 arch/x86/kvm/pmu.c| 6 +++---
 arch/x86/kvm/x86.c| 5 -
 include/linux/kvm_host.h  | 2 --
 virt/kvm/kvm_main.c   | 1 -
 10 files changed, 3 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 9e8a496fb284..a12d3bb0b590 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1180,7 +1180,6 @@ static inline bool kvm_system_needs_idmapped_vectors(void)
 }
 
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 
 void kvm_arm_init_debug(void);
 void kvm_arm_vcpu_init_debug(struct kvm_vcpu *vcpu);
diff --git a/arch/loongarch/include/asm/kvm_host.h 
b/arch/loongarch/include/asm/kvm_host.h
index 69305441f40d..64ca60a3ce24 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -228,7 +228,6 @@ static inline bool kvm_is_ifetch_fault(struct kvm_vcpu_arch 
*arch)
 static inline void kvm_arch_hardware_unsetup(void) {}
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 179f320cc231..6743a57c1ab4 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -890,7 +890,6 @@ static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_free_memslot(struct kvm *kvm,
 struct kvm_memory_slot *slot) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 8abac532146e..c4fb6a27fb92 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -897,7 +897,6 @@ struct kvm_vcpu_arch {
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
 static inline void kvm_arch_flush_shadow_all(struct kvm *kvm) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 
diff --git a/arch/riscv/include/asm/kvm_host.h 
b/arch/riscv/include/asm/kvm_host.h
index 484d04a92fa6..6cd7a576ef14 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -272,7 +272,6 @@ struct kvm_vcpu_arch {
 };
 
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 
 #define KVM_RISCV_GSTAGE_TLB_MIN_ORDER 12
 
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 95990461888f..e9fcaf4607a6 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -1045,7 +1045,6 @@ extern int kvm_s390_gisc_register(struct kvm *kvm, u32 
gisc);
 extern int kvm_s390_gisc_unregister(struct kvm *kvm, u32 gisc);
 
 static inline void kvm_arch_sync_events(struct kvm *kvm) {}
-static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void kvm_arch_free_memslot(struct kvm *kvm,
 struct kvm_memory_slot *slot) {}
 static inline void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen) {}
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index c397b28e3d1b..75346a588e13 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -521,9 +521,9 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
}
 
/*
-* Unused perf_events are only released if the corresponding MSRs
-* weren't accessed during the last vCPU time slice. kvm_arch_sched_in
-* triggers KVM_REQ_PMU if cleanup is needed.
+* Release unused perf_events if the corresponding guest MSRs weren't
+* accessed during the last vCPU time slice

[PATCH 3/4] KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()

2024-04-30 Thread Sean Christopherson
Fold the guts of kvm_arch_sched_in() into kvm_arch_vcpu_load(), keying
off the recently added @sched_in as appropriate.

Note, there is a very slight functional change, as PLE shrink updates will
now happen after blasting WBINVD, but that is quite uninteresting.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 -
 arch/x86/include/asm/kvm_host.h|  4 +---
 arch/x86/kvm/svm/svm.c | 13 -
 arch/x86/kvm/vmx/main.c|  2 --
 arch/x86/kvm/vmx/vmx.c | 11 ---
 arch/x86/kvm/vmx/x86_ops.h |  3 +--
 arch/x86/kvm/x86.c | 19 +++
 7 files changed, 21 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h 
b/arch/x86/include/asm/kvm-x86-ops.h
index 5187fcf4b610..910d06cdb86b 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -103,7 +103,6 @@ KVM_X86_OP(write_tsc_multiplier)
 KVM_X86_OP(get_exit_info)
 KVM_X86_OP(check_intercept)
 KVM_X86_OP(handle_exit_irqoff)
-KVM_X86_OP(sched_in)
 KVM_X86_OP_OPTIONAL(update_cpu_dirty_logging)
 KVM_X86_OP_OPTIONAL(vcpu_blocking)
 KVM_X86_OP_OPTIONAL(vcpu_unblocking)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 01c69840647e..9fd1ec82303d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1624,7 +1624,7 @@ struct kvm_x86_ops {
void (*vcpu_reset)(struct kvm_vcpu *vcpu, bool init_event);
 
void (*prepare_switch_to_guest)(struct kvm_vcpu *vcpu);
-   void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
+   void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu, bool sched_in);
void (*vcpu_put)(struct kvm_vcpu *vcpu);
 
void (*update_exception_bitmap)(struct kvm_vcpu *vcpu);
@@ -1746,8 +1746,6 @@ struct kvm_x86_ops {
   struct x86_exception *exception);
void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu);
 
-   void (*sched_in)(struct kvm_vcpu *vcpu, int cpu);
-
/*
 * Size of the CPU's dirty log buffer, i.e. VMX's PML buffer.  A zero
 * value indicates CPU dirty logging is unsupported or disabled.
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0f3b59da0d4a..6d9763dc4fed 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1539,11 +1539,14 @@ static void svm_prepare_host_switch(struct kvm_vcpu 
*vcpu)
to_svm(vcpu)->guest_state_loaded = false;
 }
 
-static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool sched_in)
 {
struct vcpu_svm *svm = to_svm(vcpu);
struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, cpu);
 
+   if (sched_in && !kvm_pause_in_guest(vcpu->kvm))
+   shrink_ple_window(vcpu);
+
if (sd->current_vmcb != svm->vmcb) {
sd->current_vmcb = svm->vmcb;
 
@@ -4548,12 +4551,6 @@ static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu)
vcpu->arch.at_instruction_boundary = true;
 }
 
-static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu)
-{
-   if (!kvm_pause_in_guest(vcpu->kvm))
-   shrink_ple_window(vcpu);
-}
-
 static void svm_setup_mce(struct kvm_vcpu *vcpu)
 {
/* [63:9] are reserved. */
@@ -5013,8 +5010,6 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
.check_intercept = svm_check_intercept,
.handle_exit_irqoff = svm_handle_exit_irqoff,
 
-   .sched_in = svm_sched_in,
-
.nested_ops = &svm_nested_ops,
 
.deliver_interrupt = svm_deliver_interrupt,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 7c546ad3e4c9..4fee9a8cc5a1 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -121,8 +121,6 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
.check_intercept = vmx_check_intercept,
.handle_exit_irqoff = vmx_handle_exit_irqoff,
 
-   .sched_in = vmx_sched_in,
-
.cpu_dirty_log_size = PML_ENTITY_NUM,
.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cb36db7b6140..ccea594187c7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1505,10 +1505,13 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
  */
-void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool sched_in)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+   if (sched_in && !kvm_pause_in_guest(vcpu->kvm))
+   shrink_ple_window(vcpu);
+
vmx_vcpu_load_vmcs(vcpu, cpu, NULL);
 
vmx_vcpu_pi_load(vcpu, cpu);
@@ -8093,12 +8096,6 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
 }
 #endif
 
-void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
-

[PATCH 2/4] KVM: VMX: Move PLE grow/shrink helpers above vmx_vcpu_load()

2024-04-30 Thread Sean Christopherson
Move VMX's {grow,shrink}_ple_window() above vmx_vcpu_load() in preparation
of moving the sched_in logic, which handles shrinking the PLE window, into
vmx_vcpu_load().

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/vmx/vmx.c | 64 +-
 1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6780313914f8..cb36db7b6140 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1402,6 +1402,38 @@ static void vmx_write_guest_kernel_gs_base(struct 
vcpu_vmx *vmx, u64 data)
 }
 #endif
 
+static void grow_ple_window(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   unsigned int old = vmx->ple_window;
+
+   vmx->ple_window = __grow_ple_window(old, ple_window,
+   ple_window_grow,
+   ple_window_max);
+
+   if (vmx->ple_window != old) {
+   vmx->ple_window_dirty = true;
+   trace_kvm_ple_window_update(vcpu->vcpu_id,
+   vmx->ple_window, old);
+   }
+}
+
+static void shrink_ple_window(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   unsigned int old = vmx->ple_window;
+
+   vmx->ple_window = __shrink_ple_window(old, ple_window,
+ ple_window_shrink,
+ ple_window);
+
+   if (vmx->ple_window != old) {
+   vmx->ple_window_dirty = true;
+   trace_kvm_ple_window_update(vcpu->vcpu_id,
+   vmx->ple_window, old);
+   }
+}
+
 void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
struct loaded_vmcs *buddy)
 {
@@ -5871,38 +5903,6 @@ int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
return 1;
 }
 
-static void grow_ple_window(struct kvm_vcpu *vcpu)
-{
-   struct vcpu_vmx *vmx = to_vmx(vcpu);
-   unsigned int old = vmx->ple_window;
-
-   vmx->ple_window = __grow_ple_window(old, ple_window,
-   ple_window_grow,
-   ple_window_max);
-
-   if (vmx->ple_window != old) {
-   vmx->ple_window_dirty = true;
-   trace_kvm_ple_window_update(vcpu->vcpu_id,
-   vmx->ple_window, old);
-   }
-}
-
-static void shrink_ple_window(struct kvm_vcpu *vcpu)
-{
-   struct vcpu_vmx *vmx = to_vmx(vcpu);
-   unsigned int old = vmx->ple_window;
-
-   vmx->ple_window = __shrink_ple_window(old, ple_window,
- ple_window_shrink,
- ple_window);
-
-   if (vmx->ple_window != old) {
-   vmx->ple_window_dirty = true;
-   trace_kvm_ple_window_update(vcpu->vcpu_id,
-   vmx->ple_window, old);
-   }
-}
-
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
-- 
2.45.0.rc0.197.gbae5840b3b-goog



[PATCH 1/4] KVM: Plumb in a @sched_in flag to kvm_arch_vcpu_load()

2024-04-30 Thread Sean Christopherson
Add a @sched_in flag to kvm_arch_vcpu_load() to note that the vCPU is
being (re)loaded by kvm_sched_in(), i.e. after the vCPU was previously
scheduled out.  KVM x86 currently uses a dedicated kvm_arch_sched_in()
hook, but that's unnecessarily brittle as the behavior of the arch hook
heavily depends on the arbitrary order of the two arch calls.

A separate hook also makes it unnecessarily difficult to do something
unique when re-loading vCPU during kvm_sched_in(), e.g. to optimize vCPU
loading if KVM knows that some CPU state couldn't have changed while the
vCPU was scheduled out.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/kvm/arm.c| 2 +-
 arch/arm64/kvm/emulate-nested.c | 4 ++--
 arch/arm64/kvm/reset.c  | 2 +-
 arch/loongarch/kvm/vcpu.c   | 2 +-
 arch/mips/kvm/mmu.c | 2 +-
 arch/powerpc/kvm/powerpc.c  | 2 +-
 arch/riscv/kvm/vcpu.c   | 4 ++--
 arch/s390/kvm/kvm-s390.c| 2 +-
 arch/x86/kvm/x86.c  | 2 +-
 include/linux/kvm_host.h| 2 +-
 virt/kvm/kvm_main.c | 4 ++--
 11 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index c4a0a35e02c7..30ea103bfacb 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -428,7 +428,7 @@ void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
 
 }
 
-void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool sched_in)
 {
struct kvm_s2_mmu *mmu;
int *last_ran;
diff --git a/arch/arm64/kvm/emulate-nested.c b/arch/arm64/kvm/emulate-nested.c
index 4697ba41b3a9..ad5458c47e5e 100644
--- a/arch/arm64/kvm/emulate-nested.c
+++ b/arch/arm64/kvm/emulate-nested.c
@@ -2193,7 +2193,7 @@ void kvm_emulate_nested_eret(struct kvm_vcpu *vcpu)
*vcpu_pc(vcpu) = elr;
*vcpu_cpsr(vcpu) = spsr;
 
-   kvm_arch_vcpu_load(vcpu, smp_processor_id());
+   kvm_arch_vcpu_load(vcpu, smp_processor_id(), false);
preempt_enable();
 }
 
@@ -2274,7 +2274,7 @@ static int kvm_inject_nested(struct kvm_vcpu *vcpu, u64 
esr_el2,
 */
__kvm_adjust_pc(vcpu);
 
-   kvm_arch_vcpu_load(vcpu, smp_processor_id());
+   kvm_arch_vcpu_load(vcpu, smp_processor_id(), false);
preempt_enable();
 
return 1;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 68d1d05672bd..654cf09c81e9 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -262,7 +262,7 @@ void kvm_reset_vcpu(struct kvm_vcpu *vcpu)
kvm_timer_vcpu_reset(vcpu);
 
if (loaded)
-   kvm_arch_vcpu_load(vcpu, smp_processor_id());
+   kvm_arch_vcpu_load(vcpu, smp_processor_id(), false);
preempt_enable();
 }
 
diff --git a/arch/loongarch/kvm/vcpu.c b/arch/loongarch/kvm/vcpu.c
index 3a8779065f73..61d549c4f8d1 100644
--- a/arch/loongarch/kvm/vcpu.c
+++ b/arch/loongarch/kvm/vcpu.c
@@ -1050,7 +1050,7 @@ static int _kvm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
return 0;
 }
 
-void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool sched_in)
 {
unsigned long flags;
 
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index c17157e700c0..6797799f3f32 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -682,7 +682,7 @@ static void kvm_mips_migrate_count(struct kvm_vcpu *vcpu)
 }
 
 /* Restore ASID once we are scheduled back after preemption */
-void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool sched_in)
 {
unsigned long flags;
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index d32abe7fe6ab..8de620716875 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -826,7 +826,7 @@ int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
return kvmppc_core_pending_dec(vcpu);
 }
 
-void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool sched_in)
 {
 #ifdef CONFIG_BOOKE
/*
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index b5ca9f2e98ac..a7b7f172fa61 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -87,7 +87,7 @@ static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)
 
/* Reset the guest CSRs for hotplug usecase */
if (loaded)
-   kvm_arch_vcpu_load(vcpu, smp_processor_id());
+   kvm_arch_vcpu_load(vcpu, smp_processor_id(), false);
put_cpu();
 }
 
@@ -507,7 +507,7 @@ static void kvm_riscv_vcpu_setup_config(struct kvm_vcpu 
*vcpu)
}
 }
 
-void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu, bool sched_in)
 {
struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
struct kvm_vcpu_config *cfg = &vcpu->arch.cfg;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 5147b9

[PATCH 0/4] KVM: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()

2024-04-30 Thread Sean Christopherson
Drop kvm_arch_sched_in() and instead pass a @sched_in boolean to
kvm_arch_vcpu_load().

While fiddling with an idea for optimizing state management on AMD CPUs,
I wanted to skip re-saving certain host state when a vCPU is scheduled back
in, as the state (theoretically) shouldn't change for the task while it's
scheduled out.  Actually doing that was annoying and unnecessarily brittle
due to having a separate API for the kvm_sched_in() case (the state save
needed to be in kvm_arch_vcpu_load() for the common path).

E.g. I could have set a "temporary"-ish flag somewhere in kvm_vcpu, but (a)
that's gross and (b) it would rely on the arbitrary ordering between
sched_in() and vcpu_load() staying the same.

The only real downside I see is that arm64 and riscv end up having to pass
"false" for their direct usage of kvm_arch_vcpu_load(), and passing boolean
literals isn't ideal.  But that can be solved by adding an inner helper that
omits the @sched_in param (I almost added a patch to do that, but I couldn't
convince myself it was necessary).

The other motivation for this is to avoid yet another arch hook, and more
arbitrary ordering, if there's a future need to hook kvm_sched_out() (we've
come close on the x86 side several times).

Sean Christopherson (4):
  KVM: Plumb in a @sched_in flag to kvm_arch_vcpu_load()
  KVM: VMX: Move PLE grow/shrink helpers above vmx_vcpu_load()
  KVM: x86: Fold kvm_arch_sched_in() into kvm_arch_vcpu_load()
  KVM: Delete the now unused kvm_arch_sched_in()

 arch/arm64/include/asm/kvm_host.h |  1 -
 arch/arm64/kvm/arm.c  |  2 +-
 arch/arm64/kvm/emulate-nested.c   |  4 +-
 arch/arm64/kvm/reset.c|  2 +-
 arch/loongarch/include/asm/kvm_host.h |  1 -
 arch/loongarch/kvm/vcpu.c |  2 +-
 arch/mips/include/asm/kvm_host.h  |  1 -
 arch/mips/kvm/mmu.c   |  2 +-
 arch/powerpc/include/asm/kvm_host.h   |  1 -
 arch/powerpc/kvm/powerpc.c|  2 +-
 arch/riscv/include/asm/kvm_host.h |  1 -
 arch/riscv/kvm/vcpu.c |  4 +-
 arch/s390/include/asm/kvm_host.h  |  1 -
 arch/s390/kvm/kvm-s390.c  |  2 +-
 arch/x86/include/asm/kvm-x86-ops.h|  1 -
 arch/x86/include/asm/kvm_host.h   |  4 +-
 arch/x86/kvm/pmu.c|  6 +--
 arch/x86/kvm/svm/svm.c| 13 ++---
 arch/x86/kvm/vmx/main.c   |  2 -
 arch/x86/kvm/vmx/vmx.c| 75 +--
 arch/x86/kvm/vmx/x86_ops.h|  3 +-
 arch/x86/kvm/x86.c| 26 +-
 include/linux/kvm_host.h  |  4 +-
 virt/kvm/kvm_main.c   |  5 +-
 24 files changed, 70 insertions(+), 95 deletions(-)


base-commit: a96cb3bf390eebfead5fc7a2092f8452a7997d1b
-- 
2.45.0.rc0.197.gbae5840b3b-goog



Re: [PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2

2024-04-25 Thread Sean Christopherson
On Thu, Apr 25, 2024, Shuah Khan wrote:
> On 4/25/24 08:12, Dan Carpenter wrote:
> > On Fri, Oct 27, 2023 at 11:22:07AM -0700, Sean Christopherson wrote:
> > > Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
> > > support for guest private memory can be added without needing an entirely
> > > separate set of helpers.
> > > 
> > > Note, this obviously makes selftests backwards-incompatible with older KVM
> >
> > ^^
> > > versions from this point forward.
> >^
> > 
> > Is there a way we could disable the tests on older kernels instead of
> > making them fail?  Check uname or something?  There is probably a
> > standard way to do this...  It's these tests which fail.
> 
> They shouldn't fail - the tests should be skipped on older kernels.

Ah, that makes sense.  Except for a few outliers that aren't all that interesting,
all KVM selftests create memslots, so I'm tempted to just make it a hard requirement
to spare us headache, e.g.

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index b2262b5fad9e..4b2038b1f11f 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2306,6 +2306,9 @@ void __attribute((constructor)) kvm_selftest_init(void)
/* Tell stdout not to buffer its content. */
setbuf(stdout, NULL);
 
+   __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
+  "KVM selftests from v6.8+ require 
KVM_SET_USER_MEMORY_REGION2");
+
kvm_selftest_arch_init();
 }

--

but it's also easy enough to be more precise and skip only those that actually
create memslots.

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index b2262b5fad9e..b21152adf448 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -944,6 +944,9 @@ int __vm_set_user_memory_region2(struct kvm_vm *vm, 
uint32_t slot, uint32_t flag
.guest_memfd_offset = guest_memfd_offset,
};
 
+   __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
+  "KVM selftests from v6.8+ require 
KVM_SET_USER_MEMORY_REGION2");
+
return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION2, &region);
 }
 
@@ -970,6 +973,9 @@ void vm_mem_add(struct kvm_vm *vm, enum 
vm_mem_backing_src_type src_type,
size_t mem_size = npages * vm->page_size;
size_t alignment;
 
+   __TEST_REQUIRE(kvm_has_cap(KVM_CAP_USER_MEMORY2),
+  "KVM selftests from v6.8+ require 
KVM_SET_USER_MEMORY_REGION2");
+
TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
"Number of guest pages is not compatible with the host. "
"Try npages=%d", vm_adjust_num_guest_pages(vm->mode, npages));
--


Re: [PATCH 1/3] x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n

2024-04-19 Thread Sean Christopherson
On Fri, Apr 19, 2024, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 07:31:23AM -0700, Sean Christopherson wrote:
> > On Mon, Apr 15, 2024, Geert Uytterhoeven wrote:
> > Oof.  I completely missed that "cpu_mitigations" wasn't x86-only.  I can't think
> > of a better solution than an on-by-default generic Kconfig, though can't it
> > more simply be:
> > 
> > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> > index 2b8fd6bb7da0..5930cb56ee29 100644
> > --- a/drivers/base/Kconfig
> > +++ b/drivers/base/Kconfig
> > @@ -191,6 +191,9 @@ config GENERIC_CPU_AUTOPROBE
> >  config GENERIC_CPU_VULNERABILITIES
> > bool
> >  
> > +config SPECULATION_MITIGATIONS
> > +   def_bool !X86
> > +
> >  config SOC_BUS
> > bool
> > select GLOB
> 
> I can't see this in -next yet. Do you plan to post it as a proper patch
> to collect acks etc?

Sorry, I neglected to Cc everyone.

https://lore.kernel.org/all/20240417001507.2264512-2-sea...@google.com


Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-19 Thread Sean Christopherson
On Fri, Apr 19, 2024, Will Deacon wrote:
> > @@ -663,10 +669,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > break;
> > }
> > r.ret |= range->handler(kvm, &gfn_range);
> > +
> > +  /*
> > +   * Use a precise gfn-based TLB flush when possible, as
> > +   * most mmu_notifier events affect a small-ish range.
> > +   * Fall back to a full TLB flush if the gfn-based flush
> > +   * fails, and don't bother trying the gfn-based flush
> > +   * if a full flush is already pending.
> > +   */
> > +  if (range->flush_on_ret && !need_flush && r.ret &&
> > +  kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> > +   gfn_range.end - gfn_range.start + 1))
> 
> What's that '+ 1' needed for here?

 (a) To see if you're paying attention.
 (b) Because more is always better.
 (c) Because math is hard.
 (d) Because I haven't tested this.
 (e) Both (c) and (d).


Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-18 Thread Sean Christopherson
On Thu, Apr 18, 2024, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> > On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson 
> > >  wrote:
> > > > 
> > > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  
> > > > > wrote:
> > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > > would be really great to change the .clear_flush_young() callback so
> > > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > > moment,
> > > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > > 'flush_on_ret'
> > > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > > lighter-weight and targetted TLBI in the architecture page-table 
> > > > > > code
> > > > > > when we actually update the ptes for small ranges.
> > > > > 
> > > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > > costly consequences).
> > > > > 
> > > > > In general, it feels like the TLB invalidation should stay with the
> > > > > code that deals with the page tables, as it has a pretty good idea of
> > > > > what needs to be invalidated and how -- specially on architectures
> > > > > that have a HW-broadcast facility like arm64.
> > > > 
> > > > Would this be roughly on par with an in-line flush on arm64?  The 
> > > > simpler, more
> > > > straightforward solution would be to let architectures override 
> > > > flush_on_ret,
> > > > but I would prefer something like the below as x86 can also utilize a 
> > > > range-based
> > > > flush when running as a nested hypervisor.
> > 
> > ...
> > 
> > > I think this works for us on HW that has range invalidation, which
> > > would already be a positive move.
> > > 
> > > For the lesser HW that isn't range capable, it also gives the
> > > opportunity to perform the iteration ourselves or go for the nuclear
> > > option if the range is larger than some arbitrary constant (though
> > > this is additional work).
> > > 
> > > But this still considers the whole range as being affected by
> > > range->handler(). It'd be interesting to try and see whether more
> > > precise tracking is (or isn't) generally beneficial.
> > 
> > I assume the idea would be to let arch code do single-page invalidations of
> > stage-2 entries for each gfn?
> 
> Right, as it's the only code which knows which ptes actually ended up
> being aged.
> 
> > Unless I'm having a brain fart, x86 can't make use of that functionality.  
> > Intel
> > doesn't provide any way to do targeted invalidation of stage-2 mappings.  
> > AMD
> > provides an instruction to do broadcast invalidations, but it takes a 
> > virtual
> > address, i.e. a stage-1 address.  I can't tell if it's a host virtual 
> > address or
> > a guest virtual address, but it's a moot point because KVM doesn't have the 
> > guest
> > virtual address, and if it's a host virtual address, there would need to be 
> > valid
> > mappings in the host page tables for it to work, which KVM can't guarantee.
> 
> Ah, so it sounds like it would need to be an arch opt-in then.

Even if x86 (or some other arch code) could use the precise tracking, I think it
would make sense to have the behavior be arch specific.  Adding infrastructure
to get information from arch code, only to turn around and give it back to arch
code would be odd.

Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE,
the best/easiest solution would be to let arm64 opt out of the common TLB flush
when a SPTE is made young.

With the range-based flushing bundled in, this?

---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c  | 40 +---
 2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index afbc99264ffa..8fe5f5e16919 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2010,6 +2010,8 @@ extern const struct k
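
For reference, a rough sketch (not the truncated patch above, and the hook name
is invented) of the simpler "opt out" alternative: arm64 would override the hook
to return false and issue its own targeted TLBI from its stage-2 page-table
code, while every other architecture keeps today's behavior.

#ifndef kvm_arch_flush_tlbs_on_aging
/* Default: flush the whole VM's TLBs whenever an aging handler found a young PTE. */
static inline bool kvm_arch_flush_tlbs_on_aging(struct kvm *kvm)
{
	return true;
}
#endif

	/* ...and in __kvm_handle_hva_range(), only for the aging notifiers: */
	if (range->flush_on_ret && r.ret && kvm_arch_flush_tlbs_on_aging(kvm))
		kvm_flush_remote_tlbs(kvm);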

Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-15 Thread Sean Christopherson
On Sat, Apr 13, 2024, Marc Zyngier wrote:
> On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson  
> wrote:
> > 
> > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  wrote:
> > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > would be really great to change the .clear_flush_young() callback so
> > > > that the architecture could handle the TLB invalidation. At the moment,
> > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > lighter-weight and targetted TLBI in the architecture page-table code
> > > > when we actually update the ptes for small ranges.
> > > 
> > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > costly consequences).
> > > 
> > > In general, it feels like the TLB invalidation should stay with the
> > > code that deals with the page tables, as it has a pretty good idea of
> > > what needs to be invalidated and how -- specially on architectures
> > > that have a HW-broadcast facility like arm64.
> > 
> > Would this be roughly on par with an in-line flush on arm64?  The simpler, 
> > more
> > straightforward solution would be to let architectures override 
> > flush_on_ret,
> > but I would prefer something like the below as x86 can also utilize a 
> > range-based
> > flush when running as a nested hypervisor.

...

> I think this works for us on HW that has range invalidation, which
> would already be a positive move.
> 
> For the lesser HW that isn't range capable, it also gives the
> opportunity to perform the iteration ourselves or go for the nuclear
> option if the range is larger than some arbitrary constant (though
> this is additional work).
> 
> But this still considers the whole range as being affected by
> range->handler(). It'd be interesting to try and see whether more
> precise tracking is (or isn't) generally beneficial.

I assume the idea would be to let arch code do single-page invalidations of
stage-2 entries for each gfn?

Unless I'm having a brain fart, x86 can't make use of that functionality.  Intel
doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
provides an instruction to do broadcast invalidations, but it takes a virtual
address, i.e. a stage-1 address.  I can't tell if it's a host virtual address or
a guest virtual address, but it's a moot point because KVM doesn't have the guest
virtual address, and if it's a host virtual address, there would need to be 
valid
mappings in the host page tables for it to work, which KVM can't guarantee.


Re: [PATCH 1/3] x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n

2024-04-15 Thread Sean Christopherson
On Mon, Apr 15, 2024, Geert Uytterhoeven wrote:
> Hi Michael,
> 
> On Sat, Apr 13, 2024 at 11:38 AM Michael Ellerman  wrote:
> > Michael Ellerman  writes:
> > > Stephen Rothwell  writes:
> > ...
> > >> On Tue,  9 Apr 2024 10:51:05 -0700 Sean Christopherson 
> > >>  wrote:
> > ...
> > >>> diff --git a/kernel/cpu.c b/kernel/cpu.c
> > >>> index 8f6affd051f7..07ad53b7f119 100644
> > >>> --- a/kernel/cpu.c
> > >>> +++ b/kernel/cpu.c
> > >>> @@ -3207,7 +3207,8 @@ enum cpu_mitigations {
> > >>>  };
> > >>>
> > >>>  static enum cpu_mitigations cpu_mitigations __ro_after_init =
> > >>> -   CPU_MITIGATIONS_AUTO;
> > >>> +   IS_ENABLED(CONFIG_SPECULATION_MITIGATIONS) ? CPU_MITIGATIONS_AUTO :
> > >>> +CPU_MITIGATIONS_OFF;
> > >>>
> > >>>  static int __init mitigations_parse_cmdline(char *arg)
> > >>>  {
> >
> > I think a minimal workaround/fix would be:
> >
> > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> > index 2b8fd6bb7da0..290be2f9e909 100644
> > --- a/drivers/base/Kconfig
> > +++ b/drivers/base/Kconfig
> > @@ -191,6 +191,10 @@ config GENERIC_CPU_AUTOPROBE
> >  config GENERIC_CPU_VULNERABILITIES
> > bool
> >
> > +config SPECULATION_MITIGATIONS
> > +   def_bool y
> > +   depends on !X86
> > +
> >  config SOC_BUS
> > bool
> > select GLOB
> 
> Thanks, that works for me (on arm64), so
> Tested-by: Geert Uytterhoeven 

Oof.  I completely missed that "cpu_mitigations" wasn't x86-only.  I can't think
of a better solution than an on-by-default generic Kconfig, though can't it
more simply be:

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 2b8fd6bb7da0..5930cb56ee29 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -191,6 +191,9 @@ config GENERIC_CPU_AUTOPROBE
 config GENERIC_CPU_VULNERABILITIES
bool
 
+config SPECULATION_MITIGATIONS
+   def_bool !X86
+
 config SOC_BUS
bool
select GLOB


Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-12 Thread Sean Christopherson
On Fri, Apr 12, 2024, Marc Zyngier wrote:
> On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  wrote:
> > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > Also, if you're in the business of hacking the MMU notifier code, it
> > would be really great to change the .clear_flush_young() callback so
> > that the architecture could handle the TLB invalidation. At the moment,
> > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > being set by kvm_handle_hva_range(), whereas we could do a much
> > lighter-weight and targetted TLBI in the architecture page-table code
> > when we actually update the ptes for small ranges.
> 
> Indeed, and I was looking at this earlier this week as it has a pretty
> devastating effect with NV (it blows the shadow S2 for that VMID, with
> costly consequences).
> 
> In general, it feels like the TLB invalidation should stay with the
> code that deals with the page tables, as it has a pretty good idea of
> what needs to be invalidated and how -- specially on architectures
> that have a HW-broadcast facility like arm64.

Would this be roughly on par with an in-line flush on arm64?  The simpler, more
straightforward solution would be to let architectures override flush_on_ret,
but I would prefer something like the below as x86 can also utilize a 
range-based
flush when running as a nested hypervisor.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff0a20565f90..b65116294efe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -601,6 +601,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
struct kvm_gfn_range gfn_range;
struct kvm_memory_slot *slot;
struct kvm_memslots *slots;
+   bool need_flush = false;
int i, idx;
 
if (WARN_ON_ONCE(range->end <= range->start))
@@ -653,10 +654,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
break;
}
r.ret |= range->handler(kvm, &gfn_range);
+
+   /*
+* Use a precise gfn-based TLB flush when possible, as
+* most mmu_notifier events affect a small-ish range.
+* Fall back to a full TLB flush if the gfn-based flush
+* fails, and don't bother trying the gfn-based flush
+* if a full flush is already pending.
+*/
+   if (range->flush_on_ret && !need_flush && r.ret &&
+   kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
+    gfn_range.end - gfn_range.start + 1))
+   need_flush = true;
}
}
 
-   if (range->flush_on_ret && r.ret)
+   if (need_flush)
kvm_flush_remote_tlbs(kvm);
 
if (r.found_memslot)



Re: [PATCH -fixes v2] RISC-V: KVM: Require HAVE_KVM

2024-01-18 Thread Sean Christopherson
On Thu, Jan 18, 2024, Anup Patel wrote:
> On Thu, Jan 4, 2024 at 6:07 PM Andrew Jones  wrote:
> >
> > KVM requires EVENTFD, which is selected by HAVE_KVM. Other KVM
> > supporting architectures select HAVE_KVM and then their KVM
> > Kconfigs ensure its there with a depends on HAVE_KVM. Make RISCV
> > consistent with that approach which fixes configs which have KVM
> > but not EVENTFD, as was discovered with a randconfig test.
> >
> > Fixes: 99cdc6c18c2d ("RISC-V: Add initial skeletal KVM support")
> > Reported-by: Randy Dunlap 
> > Closes: 
> > https://lore.kernel.org/all/44907c6b-c5bd-4e4a-a921-e4d382553...@infradead.org/
> > Signed-off-by: Andrew Jones 
> 
> Queued this patch for Linux-6.8

That should be unnecessary.  Commit caadf876bb74 ("KVM: introduce 
CONFIG_KVM_COMMON"),
which is in Paolo's pull request for 6.8, addresses the EVENTFD issue.  And the
rest of Paolo's series[*], which presumably will get queued for 6.9, eliminates
HAVE_KVM entirely.

[*] https://lore.kernel.org/all/20240108124740.114453-6-pbonz...@redhat.com


Re: [PATCH v4 10/12] KVM: x86: never write to memory from kvm_vcpu_check_block()

2023-12-13 Thread Sean Christopherson
On Thu, Dec 14, 2023, Maxim Levitsky wrote:
> On Tue, 2023-12-12 at 07:28 -0800, Sean Christopherson wrote:
> > On Sun, Dec 10, 2023, Jim Mattson wrote:
> > > On Thu, Dec 7, 2023 at 8:21 AM Sean Christopherson  
> > > wrote:
> > > > Doh.  We got the less obvious cases and missed the obvious one.
> > > > 
> > > > Ugh, and we also missed a related mess in 
> > > > kvm_guest_apic_has_interrupt().  That
> > > > thing should really be folded into vmx_has_nested_events().
> > > > 
> > > > Good gravy.  And vmx_interrupt_blocked() does the wrong thing because 
> > > > that
> > > > specifically checks if L1 interrupts are blocked.
> > > > 
> > > > Compile tested only, and definitely needs to be chunked into multiple 
> > > > patches,
> > > > but I think something like this mess?
> > > 
> > > The proposed patch does not fix the problem. In fact, it messes things
> > > up so much that I don't get any test results back.
> > 
> > Drat.
> > 
> > > Google has an internal K-U-T test that demonstrates the problem. I
> > > will post it soon.
> > 
> > Received, I'll dig in soonish, though "soonish" might unfortunately mean
> > 2024.
> > 
> 
> Hi,
> 
> So this is what I think:
> 
> KVM does have kvm_guest_apic_has_interrupt() for this exact purpose,
> to check if nested APICv has a pending interrupt before halting.

For all intents and purposes, so was nested_ops->has_events().  I don't see
any reason to have two APIs that do the same thing, and the call to
kvm_guest_apic_has_interrupt() is wrong in that it doesn't verify that IRQs are
enabled for _L2_.  That's why my preference is to fold the two together.

> However the problem is bigger - with APICv we have in essence 2 pending
> interrupt bitmaps - the PIR and the IRR, and to know if the guest has a
> pending interrupt one has in theory to copy PIR to IRR, then see if the max
> is larger then the current PPR.

Yeah, this is what my untested hack-a-patch tried to do.

> Since we don't want to write to guest memory,

The changelog is misleading/wrong.  Writing guest memory is ok, what isn't safe
is blocking or sleeping, i.e. KVM must not trigger a host page fault due to
accessing a page that's been swapped out.  Read vs. write doesn't matter.

So KVM can safely read and write guest memory so long as it is already mapped by 
kvm_vcpu_map() (or I suppose if we wrapped an access with pagefault_disable(),
but I can't think of a sane reason to do that).  E.g. nVMX can access a vCPU's
PID mapping, but synthesizing a nested VM-Exit will cause explosions on nSVM.

> and the IRR here resides in the guest memory, I guess we have to do a
> 'dry-run' version of 'vmx_complete_nested_posted_interrupt' and call it from
> kvm_guest_apic_has_interrupt().

nested_ops->has_events() is the much better fit, e.g. the naming won't get weird
and we can gate the whole thing on is_guest_mode().  Though we probably need a
wrapper to handle any commonalities between nVMX and nSVM.

> What do you think? I can prepare a patch for this.

As above, this is what I tried to do, sort of.  Though it's obviously broken.  
We
don't need a full dry-run because KVM only needs to detect events that are 
unique
to L2, e.g. nVMX's preemption timer, MTF, and pending virtual interrupts (hmm,
I suspect nSVM's vNMI is broken too).  Things like INIT and SMI don't require
nested virtualization awareness because the event itself is tracked for the vCPU
as a whole.
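
To illustrate the "detect without committing" part for pending virtual interrupts,
a hedged sketch (not the patch above, helper name invented): scan the posted-interrupt
descriptor's PIR for the highest pending vector and compare its priority class against
the PPR, without ever copying PIR into the IRR.  It assumes the caller has already
mapped/read the PID, e.g. via the existing kvm_vcpu_map()'d mapping, and only answers
"is something deliverable", it doesn't deliver anything.

#include <linux/bitops.h>
#include <linux/find.h>

/* 256 vectors => four longs on 64-bit hosts. */
static bool pir_has_deliverable_vector(const unsigned long *pir, u8 ppr)
{
	unsigned long vec = find_last_bit(pir, 256);

	if (vec >= 256)
		return false;			/* no bits set in the PIR */

	/* Standard APIC rule: deliverable if the priority class exceeds PPR's. */
	return (vec >> 4) > (ppr >> 4);
}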


Re: [PATCH 05/26] vfio: KVM: Pass get/put helpers from KVM to VFIO, don't do circular lookup

2023-12-12 Thread Sean Christopherson
On Sun, Dec 03, 2023, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2023 at 04:51:55PM -0800, Sean Christopherson wrote:
> 
> > There's one more wrinkle: this patch is buggy in that it doesn't ensure the 
> > liveliness
> > of KVM-the-module, i.e. nothing prevents userspace from unloading kvm.ko 
> > while VFIO
> > still holds a reference to a kvm structure, and so invoking ->put_kvm() 
> > could jump
> > into freed code.  To fix that, KVM would also need to pass along a module 
> > pointer :-(
> 
> Maybe we should be refcounting the struct file not the struct kvm?
> 
> Then we don't need special helpers and it keeps the module alive correctly.

Huh.  It took my brain a while to catch up, but this seems comically obvious in
hindsight.  I *love* this approach, both conceptually and from a code 
perspective.

Handing VFIO (and any other external entities) a file makes it so that KVM 
effectively
interacts with users via files, regardless of whether the user lives in 
userspace
or the kernel.  That makes it easier to reason about the safety of operations,
e.g. in addition to ensuring KVM-the-module is pinned, having a file pointer 
allows
KVM to verify that the incoming pointer does indeed represent a VM.  Which isn't
necessary by any means, but it's a nice sanity check.

From a code perspective, it's far cleaner than manually grabbing module references,
and having only a file pointer makes it a wee bit harder for non-KVM code to
poke into KVM, e.g. keeps us honest.
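
E.g. a rough sketch of the VFIO side (names are illustrative, this is not the v2
code): holding a reference on the VM's file pins both the VM and kvm.ko, and
dropping it is a plain fput().

#include <linux/file.h>
#include <linux/fs.h>

struct vfio_kvm_ref {
	struct file *kvm_file;	/* pins the VM and, transitively, kvm.ko */
};

static void vfio_take_kvm(struct vfio_kvm_ref *ref, struct file *kvm_file)
{
	ref->kvm_file = get_file(kvm_file);	/* grab a reference on the VM's file */
}

static void vfio_drop_kvm(struct vfio_kvm_ref *ref)
{
	if (ref->kvm_file)
		fput(ref->kvm_file);	/* last put lets KVM destroy the VM and kvm.ko unload */
	ref->kvm_file = NULL;
}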

Assuming nothing blows up in testing, I'll go this route for v2.

Thanks!


Re: [PATCH v4 10/12] KVM: x86: never write to memory from kvm_vcpu_check_block()

2023-12-12 Thread Sean Christopherson
On Sun, Dec 10, 2023, Jim Mattson wrote:
> On Thu, Dec 7, 2023 at 8:21 AM Sean Christopherson  wrote:
> > Doh.  We got the less obvious cases and missed the obvious one.
> >
> > Ugh, and we also missed a related mess in kvm_guest_apic_has_interrupt().  
> > That
> > thing should really be folded into vmx_has_nested_events().
> >
> > Good gravy.  And vmx_interrupt_blocked() does the wrong thing because that
> > specifically checks if L1 interrupts are blocked.
> >
> > Compile tested only, and definitely needs to be chunked into multiple 
> > patches,
> > but I think something like this mess?
> 
> The proposed patch does not fix the problem. In fact, it messes things
> up so much that I don't get any test results back.

Drat.

> Google has an internal K-U-T test that demonstrates the problem. I
> will post it soon.

Received, I'll dig in soonish, though "soonish" might unfortunately mean
2024.


Re: [PATCH v4 10/12] KVM: x86: never write to memory from kvm_vcpu_check_block()

2023-12-07 Thread Sean Christopherson
On Wed, Dec 06, 2023, Jim Mattson wrote:
> kvm_vcpu_check_block() is called while not in TASK_RUNNING, and therefore
> it cannot sleep.  Writing to guest memory is therefore forbidden, but it
> can happen on AMD processors if kvm_check_nested_events() causes a vmexit.
> 
> Fortunately, all events that are caught by kvm_check_nested_events() are
> also recognized by kvm_vcpu_has_events() through vendor callbacks such as
> kvm_x86_interrupt_allowed() or kvm_x86_ops.nested_ops->has_events(), so
> remove the call and postpone the actual processing to vcpu_block().
> 
> Opportunistically honor the return of kvm_check_nested_events().  KVM
> punted on the check in kvm_vcpu_running() because the only error path is
> if vmx_complete_nested_posted_interrupt() fails, in which case KVM exits
> to userspace with "internal error" i.e. the VM is likely dead anyways so
> it wasn't worth overloading the return of kvm_vcpu_running().
> 
> Add the check mostly so that KVM is consistent with itself; the return of
> the call via kvm_apic_accept_events()=>kvm_check_nested_events() that
> immediately follows  _is_ checked.
> 
> Reported-by: Maxim Levitsky 
> Signed-off-by: Paolo Bonzini 
> [sean: check and handle return of kvm_check_nested_events()]
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/x86.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index dcc675d4e44b..8aeacbc2bff9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10815,6 +10815,17 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>   return 1;
>   }
>  
> + /*
> +  * Evaluate nested events before exiting the halted state.  This allows
> +  * the halt state to be recorded properly in the VMCS12's activity
> +  * state field (AMD does not have a similar field and a VM-Exit always
> +  * causes a spurious wakeup from HLT).
> +  */
> + if (is_guest_mode(vcpu)) {
> + if (kvm_check_nested_events(vcpu) < 0)
> + return 0;
> + }
> +
>   if (kvm_apic_accept_events(vcpu) < 0)
>   return 0;
>   switch(vcpu->arch.mp_state) {
> @@ -10837,9 +10848,6 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>  
>  static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
>  {
> - if (is_guest_mode(vcpu))
> - kvm_check_nested_events(vcpu);
> -
>   return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
>   !vcpu->arch.apf.halted);
>  }
> 
> This commit breaks delivery of a (virtualized) posted interrupt from
> an L1 vCPU to a halted L2 vCPU.
> 
> Looking back at commit e6c67d8cf117 ("KVM: nVMX: Wake blocked vCPU in
> guest-mode if pending interrupt in virtual APICv"), Liran wrote:
> 
> Note that this also handles the case of nested posted-interrupt by the
> fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
> called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
> kvm_vcpu_running() -> vmx_check_nested_events() ->
> vmx_complete_nested_posted_interrupt().
> 
> Clearly, that is no longer the case.

Doh.  We got the less obvious cases and missed the obvious one.

Ugh, and we also missed a related mess in kvm_guest_apic_has_interrupt().  That
thing should really be folded into vmx_has_nested_events().

Good gravy.  And vmx_interrupt_blocked() does the wrong thing because that
specifically checks if L1 interrupts are blocked.

Compile tested only, and definitely needs to be chunked into multiple patches,
but I think something like this mess?

---
 arch/x86/include/asm/kvm-x86-ops.h |  1 -
 arch/x86/include/asm/kvm_host.h|  1 -
 arch/x86/kvm/lapic.c   | 20 ++---
 arch/x86/kvm/lapic.h   | 12 ++
 arch/x86/kvm/vmx/nested.c  | 36 --
 arch/x86/kvm/vmx/vmx.c | 34 
 arch/x86/kvm/vmx/vmx.h |  1 +
 arch/x86/kvm/x86.c | 10 +
 8 files changed, 59 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 378ed944b849..6f81774c1dd0 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -85,7 +85,6 @@ KVM_X86_OP_OPTIONAL(update_cr8_intercept)
 KVM_X86_OP(refresh_apicv_exec_ctrl)
 KVM_X86_OP_OPTIONAL(hwapic_irr_update)
 KVM_X86_OP_OPTIONAL(hwapic_isr_update)
-KVM_X86_OP_OPTIONAL_RET0(guest_apic_has_interrupt)
 KVM_X86_OP_OPTIONAL(load_eoi_exitmap)
 KVM_X86_OP_OPTIONAL(set_virtual_apic_mode)
 KVM_X86_OP_OPTIONAL(set_apic_access_page_addr)
diff --git a/arch/x

Re: [PATCH 05/26] vfio: KVM: Pass get/put helpers from KVM to VFIO, don't do circular lookup

2023-12-01 Thread Sean Christopherson
On Mon, Sep 18, 2023, Jason Gunthorpe wrote:
> On Mon, Sep 18, 2023 at 08:49:57AM -0700, Sean Christopherson wrote:
> > On Mon, Sep 18, 2023, Jason Gunthorpe wrote:
> > > On Fri, Sep 15, 2023 at 05:30:57PM -0700, Sean Christopherson wrote:
> > > > Explicitly pass KVM's get/put helpers to VFIO when attaching a VM to
> > > > VFIO instead of having VFIO do a symbol lookup back into KVM.  Having 
> > > > both
> > > > KVM and VFIO do symbol lookups increases the overall complexity and 
> > > > places
> > > > an unnecessary dependency on KVM (from VFIO) without adding any value.
> > > > 
> > > > Signed-off-by: Sean Christopherson 
> > > > ---
> > > >  drivers/vfio/vfio.h  |  2 ++
> > > >  drivers/vfio/vfio_main.c | 74 +++-
> > > >  include/linux/vfio.h |  4 ++-
> > > >  virt/kvm/vfio.c  |  9 +++--
> > > >  4 files changed, 47 insertions(+), 42 deletions(-)
> > > 
> > > I don't mind this, but Christoph had disliked my prior attempt to do
> > > this with function pointers..
> > > 
> > > The get can be inlined, IIRC, what about putting a pointer to the put
> > > inside the kvm struct?
> > 
> > That wouldn't allow us to achieve our goal, which is to hide the details of
> > "struct kvm" from VFIO (and the rest of the kernel).
> 
> > What's the objection to handing VFIO a function pointer?
> 
> Hmm, looks like it was this thread:
> 
>  
> https://lore.kernel.org/r/0-v1-33906a626da1+16b0-vfio_kvm_no_group_...@nvidia.com
> 
> Your rational looks a little better to me.
> 
> > > Then the normal kvm get/put don't have to be exported symbols at all?
> > 
> > The export of kvm_get_kvm_safe() can go away (I forgot to do that in this 
> > series),
> > but kvm_get_kvm() will hang around as it's needed by KVM sub-modules (PPC 
> > and x86),
> > KVMGT (x86), and drivers/s390/crypto/vfio_ap_ops.c (no idea what to call 
> > that beast).
> 
> My thought would be to keep it as an inline, there should be some way
> to do that without breaking your desire to hide the bulk of the kvm
> struct content. Like put the refcount as the first element in the
> struct and just don't ifdef it away?.

That doesn't work because of the need to invoke kvm_destroy_vm() when the last
reference is put, i.e. all of kvm_destroy_vm() would need to be inlined (LOL) or
VFIO would need to do a symbol lookup on kvm_destroy_vm(), which puts back us at
square one.

There's one more wrinkle: this patch is buggy in that it doesn't ensure the 
liveliness
of KVM-the-module, i.e. nothing prevents userspace from unloading kvm.ko while 
VFIO
still holds a reference to a kvm structure, and so invoking ->put_kvm() could 
jump
into freed code.  To fix that, KVM would also need to pass along a module 
pointer :-(
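
For completeness, a sketch of that "pass along a module pointer" variant (struct and
helper names are made up here): VFIO would pin kvm.ko via try_module_get() for as long
as it holds the callbacks, so ->put_kvm() can never jump into unloaded code.

#include <linux/module.h>

struct kvm_vfio_ops {
	struct module *owner;			/* kvm.ko */
	void (*put_kvm)(struct kvm *kvm);
};

static bool vfio_pin_kvm_module(struct kvm_vfio_ops *ops)
{
	return try_module_get(ops->owner);	/* fails if kvm.ko is being unloaded */
}

static void vfio_put_kvm(struct kvm_vfio_ops *ops, struct kvm *kvm)
{
	ops->put_kvm(kvm);
	module_put(ops->owner);			/* kvm.ko may now unload */
}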

One thought would be to have vac.ko (tentative name), which is the "base" module
that will hold the KVM/virtualization bits that need to be singletons, i.e. 
can't
be per-KVM, provide the symbols needed for VFIO to manage references.  But that
just ends up moving the module reference trickiness into VAC+KVM, e.g. vac.ko 
would
still need to be handed a function pointer in order to call back into the 
correct
kvm.ko code.

Hrm, but I suspect the vac.ko <=> kvm.ko interactions will need to deal with
module shenanigans anyways, and the shenanigans would only affect crazy people
like us that actually want multiple KVM modules.

I'll plan on going that route.  The very worst case scenario is that it just punts
this conversation down to a possible future.  Dropping this patch and the previous
prep patch won't meaningfully affect the goals of this series, as only 
kvm_get_kvm_safe()
and kvm_get_kvm() would need to be exposed outside of #ifdef __KVM__.  Then we 
can
figure out what to do with them if/when the whole multi-KVM thing comes along.


Re: Ping? Re: [PATCH rc] kvm: Prevent compiling virt/kvm/vfio.c unless VFIO is selected

2023-11-29 Thread Sean Christopherson
On Wed, Nov 29, 2023, Jason Gunthorpe wrote:
> On Wed, Nov 29, 2023 at 05:07:45PM -0800, Sean Christopherson wrote:
> > On Wed, Nov 29, 2023, Michael Ellerman wrote:
> > > Sean Christopherson  writes:
> > > > On Fri, Nov 10, 2023, Michael Ellerman wrote:
> > > >> Jason Gunthorpe  writes:
> > > >> > There are a bunch of reported randconfig failures now because of 
> > > >> > this,
> > > >> > something like:
> > > >> >
> > > >> >>> arch/powerpc/kvm/../../../virt/kvm/vfio.c:89:7: warning: attribute 
> > > >> >>> declaration must precede definition [-Wignored-attributes]
> > > >> >fn = symbol_get(vfio_file_iommu_group);
> > > >> > ^
> > > >> >include/linux/module.h:805:60: note: expanded from macro 
> > > >> > 'symbol_get'
> > > >> >#define symbol_get(x) ({ extern typeof(x) x 
> > > >> > __attribute__((weak,visibility("hidden"))); &(x); })
> > > >> >
> > > >> > It happens because the arch forces KVM_VFIO without knowing if VFIO 
> > > >> > is
> > > >> > even enabled.
> > > >> 
> > > >> This is still breaking some builds. Can we get this fix in please?
> > > >> 
> > > >> cheers
> > > >> 
> > > >> > Split the kconfig so the arch selects the usual HAVE_KVM_ARCH_VFIO 
> > > >> > and
> > > >> > then KVM_VFIO is only enabled if the arch wants it and VFIO is 
> > > >> > turned on.
> > > >
> > > > Heh, so I was trying to figure out why things like vfio_file_set_kvm() 
> > > > aren't
> > > > problematic, i.e. why the existing mess didn't cause failures.  I can't 
> > > > repro the
> > > > warning (requires clang-16?), but IIUC the reason only the group code 
> > > > is problematic
> > > > is that vfio.h creates a stub for vfio_file_iommu_group() and thus 
> > > > there's no symbol,
> > > > whereas vfio.h declares vfio_file_set_kvm() unconditionally.
> > > 
> > > That warning I'm unsure about.
> > 
> > Ah, it's the same warning, I just missed the CONFIG_MODULES=n requirement.
> 
> Oh, wait, doesn't that mean the approach won't work? IIRC doesn't
> symbol_get turn into just  when non-modular turning this into a
> link failure without the kconfig part?

Yes, but it doesn't cause linker errors.  IIUC, because the extern declaration
is tagged "weak", a dummy default is used.  E.g. on x86, this is what is 
generated
with VFIO=y

fn = symbol_get(vfio_file_is_valid);
if (!fn)
   0x810396c5 <+5>: mov$0x81829230,%rax
   0x810396cc <+12>:test   %rax,%rax

whereas VFIO=n gets

fn = symbol_get(vfio_file_is_valid);
if (!fn)
   0x810396c5 <+5>: mov$0x0,%rax
   0x810396cc <+12>:test   %rax,%rax

I have no idea if the fact that symbol_get() generates '0', i.e. the !NULL 
checks
work as expected, is intentional or if KVM works by sheer dumb luck.  I'm not
entirely sure I want to find out, though it's definitely extra motivation to get
the KVM mess fixed sooner than later. 
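
In other words, with CONFIG_MODULES=n the symbol_get() above boils down to taking the
address of a weak, hidden extern declaration; with no definition linked in, that
address resolves to zero, which is why the !NULL check happens to work.  A stripped
down illustration of the expansion (per the macro quoted earlier):

/* What symbol_get(vfio_file_is_valid) effectively becomes when CONFIG_MODULES=n. */
extern typeof(vfio_file_is_valid) vfio_file_is_valid
	__attribute__((weak, visibility("hidden")));

bool (*fn)(struct file *file) = &vfio_file_is_valid;	/* NULL when VFIO=n */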


Re: Ping? Re: [PATCH rc] kvm: Prevent compiling virt/kvm/vfio.c unless VFIO is selected

2023-11-29 Thread Sean Christopherson
On Wed, Nov 29, 2023, Michael Ellerman wrote:
> Sean Christopherson  writes:
> > On Fri, Nov 10, 2023, Michael Ellerman wrote:
> >> Jason Gunthorpe  writes:
> >> > There are a bunch of reported randconfig failures now because of this,
> >> > something like:
> >> >
> >> >>> arch/powerpc/kvm/../../../virt/kvm/vfio.c:89:7: warning: attribute 
> >> >>> declaration must precede definition [-Wignored-attributes]
> >> >fn = symbol_get(vfio_file_iommu_group);
> >> > ^
> >> >include/linux/module.h:805:60: note: expanded from macro 'symbol_get'
> >> >#define symbol_get(x) ({ extern typeof(x) x 
> >> > __attribute__((weak,visibility("hidden"))); &(x); })
> >> >
> >> > It happens because the arch forces KVM_VFIO without knowing if VFIO is
> >> > even enabled.
> >> 
> >> This is still breaking some builds. Can we get this fix in please?
> >> 
> >> cheers
> >> 
> >> > Split the kconfig so the arch selects the usual HAVE_KVM_ARCH_VFIO and
> >> > then KVM_VFIO is only enabled if the arch wants it and VFIO is turned on.
> >
> > Heh, so I was trying to figure out why things like vfio_file_set_kvm() 
> > aren't
> > problematic, i.e. why the existing mess didn't cause failures.  I can't 
> > repro the
> > warning (requires clang-16?), but IIUC the reason only the group code is 
> > problematic
> > is that vfio.h creates a stub for vfio_file_iommu_group() and thus there's 
> > no symbol,
> > whereas vfio.h declares vfio_file_set_kvm() unconditionally.
> 
> That warning I'm unsure about.

Ah, it's the same warning, I just missed the CONFIG_MODULES=n requirement.


Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-29 Thread Sean Christopherson
On Mon, Nov 27, 2023, Vlastimil Babka wrote:
> On 11/2/23 16:46, Paolo Bonzini wrote:
> > On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson  
> > wrote:
> >> Actually, looking that this again, there's not actually a hard dependency 
> >> on THP.
> >> A THP-enabled kernel _probably_  gives a higher probability of using 
> >> hugepages,
> >> but mostly because THP selects COMPACTION, and I suppose because using THP 
> >> for
> >> other allocations reduces overall fragmentation.
> > 
> > Yes, that's why I didn't even bother enabling it unless THP is
> > enabled, but it makes even more sense to just try.
> > 
> >> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I 
> >> think
> >> we should do the below (I verified KVM can create hugepages with THP=n).  
> >> We'll
> >> need another capability, but (a) we probably should have that anyways and 
> >> (b) it
> >> provides a cleaner path to adding PUD-sized hugepage support in the future.
> > 
> > I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This
> > should be a generic kernel API and in fact the sizes are available in
> > a not-so-friendly format in /sys/kernel/mm/hugepages.
> > 
> > We should just add /sys/kernel/mm/hugepages/sizes that contains
> > "2097152 1073741824" on x86 (only the former if 1G pages are not
> > supported).
> > 
> > Plus: is this the best API if we need something else for 1G pages?
> > 
> > Let's drop *this* patch and proceed incrementally. (Again, this is
> > what I want to do with this final review: identify places that are
> > stil sticky, and don't let them block the rest).
> > 
> > Coincidentially we have an open spot next week at plumbers. Let's
> > extend Fuad's section to cover more guestmem work.
> 
> Hi,
> 
> was there any outcome wrt this one?

No, we punted on hugepage support for the initial guest_memfd merge.  We 
definitely
plan on adding hugepage support sooner than later, but we haven't yet agreed on
exactly what that will look like.

> Based on my experience with THP's it would be best if userspace didn't have
> to opt-in, nor care about the supported size. If the given size is unaligned,
> provide a mix of large pages up to an aligned size, and for the rest fallback
> to base pages, which should be better than -EINVAL on creation (is it
> possible with the current implementation? I'd hope so so?).

guest_memfd serves a different use case than THP.  For modern VMs, and 
especially
for slice-of-hardware VMs that are one of the main targets for guest_memfd, if 
not
_the_ main target, guest memory should _always_ be backed by hugepages in the
physical domain.  The actual guest mappings might not be huge, e.g. x86 needs to
do partial mappings to skip over (legacy) memory holes, but KVM already 
gracefully
handles that.

In other words, for most guest_memfd use cases, if userspace wants hugepages but
KVM can't provide hugepages, then it is much more desirable to return an error
than to silently fall back to small pages.

I 100% agree that having to opt-in is suboptimal, but IMO providing "error on an
incompatible configuration" semantics without requiring userspace to opt-in is 
an
even worse experience for userspace.

> A way to opt-out from huge pages could be useful although there's always the
> risk of some initial troubles resulting in various online sources cargo-cult
> recommending to opt-out forever.


Re: Ping? Re: [PATCH rc] kvm: Prevent compiling virt/kvm/vfio.c unless VFIO is selected

2023-11-29 Thread Sean Christopherson
On Wed, Nov 29, 2023, Jason Gunthorpe wrote:
> On Tue, Nov 28, 2023 at 06:21:42PM -0800, Sean Christopherson wrote:
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index 454e9295970c..a65b2513f8cd 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -289,16 +289,12 @@ void vfio_combine_iova_ranges(struct rb_root_cached *root, u32 cur_nodes,
> >  /*
> >   * External user API
> >   */
> > -#if IS_ENABLED(CONFIG_VFIO_GROUP)
> >  struct iommu_group *vfio_file_iommu_group(struct file *file);
> > +
> > +#if IS_ENABLED(CONFIG_VFIO_GROUP)
> >  bool vfio_file_is_group(struct file *file);
> >  bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
> >  #else
> > -static inline struct iommu_group *vfio_file_iommu_group(struct file *file)
> > -{
> > -   return NULL;
> > -}
> > -
> >  static inline bool vfio_file_is_group(struct file *file)
> >  {
> > return false;
> > 
> 
> So you symbol get on a symbol that can never be defined? Still says to
> me the kconfig needs fixing :|

Yeah, I completely agree, and if KVM didn't already rely on this horrific
behavior and there wasn't a more complete overhaul in-flight, I wouldn't suggest
this.

I'll send the KVM Kconfig/Makefile cleanups from my "Hide KVM internals from 
others"
series separately (which is still the bulk of the series) so as to prioritize
getting the cleanups landed.


Re: Ping? Re: [PATCH rc] kvm: Prevent compiling virt/kvm/vfio.c unless VFIO is selected

2023-11-28 Thread Sean Christopherson
On Fri, Nov 10, 2023, Michael Ellerman wrote:
> Jason Gunthorpe  writes:
> > There are a bunch of reported randconfig failures now because of this,
> > something like:
> >
> >>> arch/powerpc/kvm/../../../virt/kvm/vfio.c:89:7: warning: attribute 
> >>> declaration must precede definition [-Wignored-attributes]
> >fn = symbol_get(vfio_file_iommu_group);
> > ^
> >include/linux/module.h:805:60: note: expanded from macro 'symbol_get'
> >#define symbol_get(x) ({ extern typeof(x) x 
> > __attribute__((weak,visibility("hidden"))); &(x); })
> >
> > It happens because the arch forces KVM_VFIO without knowing if VFIO is
> > even enabled.
> 
> This is still breaking some builds. Can we get this fix in please?
> 
> cheers
> 
> > Split the kconfig so the arch selects the usual HAVE_KVM_ARCH_VFIO and
> > then KVM_VFIO is only enabled if the arch wants it and VFIO is turned on.

Heh, so I was trying to figure out why things like vfio_file_set_kvm() aren't
problematic, i.e. why the existing mess didn't cause failures.  I can't repro 
the
warning (requires clang-16?), but IIUC the reason only the group code is 
problematic
is that vfio.h creates a stub for vfio_file_iommu_group() and thus there's no 
symbol,
whereas vfio.h declares vfio_file_set_kvm() unconditionally.

Because KVM is doing symbol_get() and not taking a direct dependency, the lack 
of
an exported symbol doesn't cause problems, i.e. simply declaring the symbol 
makes
the compiler happy.

Given that the vfio_file_iommu_group() stub shouldn't exist (KVM is the only 
user,
and so if I'm correct the stub is worthless), what about this as a temporary 
"fix"?

I'm 100% on-board with fixing KVM properly, my motivation is purely to minimize
the total amount of churn.  E.g. if this works, then the only extra churn is to
move the declaration of vfio_file_iommu_group() back under the #if, versus 
having
to churn all of the KVM Kconfigs twice (once now, and again for the full 
cleanup).

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 454e9295970c..a65b2513f8cd 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -289,16 +289,12 @@ void vfio_combine_iova_ranges(struct rb_root_cached *root, u32 cur_nodes,
 /*
  * External user API
  */
-#if IS_ENABLED(CONFIG_VFIO_GROUP)
 struct iommu_group *vfio_file_iommu_group(struct file *file);
+
+#if IS_ENABLED(CONFIG_VFIO_GROUP)
 bool vfio_file_is_group(struct file *file);
 bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
 #else
-static inline struct iommu_group *vfio_file_iommu_group(struct file *file)
-{
-   return NULL;
-}
-
 static inline bool vfio_file_is_group(struct file *file)
 {
return false;



Re: [PATCH 15/34] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-10 Thread Sean Christopherson
On Fri, Nov 10, 2023, Xiaoyao Li wrote:
> On 11/6/2023 12:30 AM, Paolo Bonzini wrote:
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 68a144cb7dbc..a6de526c0426 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -589,8 +589,20 @@ struct kvm_memory_slot {
> > u32 flags;
> > short id;
> > u16 as_id;
> > +
> > +#ifdef CONFIG_KVM_PRIVATE_MEM
> > +   struct {
> > +   struct file __rcu *file;
> > +   pgoff_t pgoff;
> > +   } gmem;
> > +#endif
> >   };
> > +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> > +{
> > +   return slot && (slot->flags & KVM_MEM_GUEST_MEMFD);
> > +}
> > +
> 
> maybe we can move this block and ...
> 
> 
> 
> > @@ -2355,6 +2379,30 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> > struct kvm_gfn_range *range);
> >   bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> >  struct kvm_gfn_range *range);
> > +
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > +   return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
> > +  kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > +}
> > +#else
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > +   return false;
> > +}
> >   #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
> 
> this block to Patch 18?

It would work, but my vote is to keep them here to minimize the changes to 
common
KVM code in the x86 enabling.  It's not a strong preference though.  Of course,
at this point, fiddling with this sort of thing is probably a bad idea in terms
of landing guest_memfd.

> > @@ -4844,6 +4875,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> >   #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> > case KVM_CAP_MEMORY_ATTRIBUTES:
> > return kvm_supported_mem_attributes(kvm);
> > +#endif
> > +#ifdef CONFIG_KVM_PRIVATE_MEM
> > +   case KVM_CAP_GUEST_MEMFD:
> > +   return !kvm || kvm_arch_has_private_mem(kvm);
> >   #endif
> > default:
> > break;
> > @@ -5277,6 +5312,18 @@ static long kvm_vm_ioctl(struct file *filp,
> > case KVM_GET_STATS_FD:
> > r = kvm_vm_ioctl_get_stats_fd(kvm);
> > break;
> > +#ifdef CONFIG_KVM_PRIVATE_MEM
> > +   case KVM_CREATE_GUEST_MEMFD: {
> > +   struct kvm_create_guest_memfd guest_memfd;
> 
> Do we need a guard of below?
> 
>   r = -EINVAL;
>   if (!kvm_arch_has_private_mem(kvm))
>   goto out;

Argh, yeah, that's weird since KVM_CAP_GUEST_MEMFD says "not supported" if the
VM doesn't support private memory.

Enforcing that would break guest_memfd_test.c though.  And having to create a
"special" VM just to test basic guest_memfd functionality would be quite
annoying.

So my vote is to do:

case KVM_CAP_GUEST_MEMFD:
return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM);

There's no harm to KVM if userspace creates a file it can't use, and at some
point KVM will hopefully support guest_memfd irrespective of private memory.


Re: [PATCH 25/34] KVM: selftests: Add helpers to convert guest memory b/w private and shared

2023-11-06 Thread Sean Christopherson
On Mon, Nov 06, 2023, Fuad Tabba wrote:
> On Sun, Nov 5, 2023 at 4:34 PM Paolo Bonzini  wrote:
> > +void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,
> > +   bool punch_hole)
> > +{
> > +   const int mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? FALLOC_FL_PUNCH_HOLE : 0);
> > +   struct userspace_mem_region *region;
> > +   uint64_t end = base + size;
> > +   uint64_t gpa, len;
> > +   off_t fd_offset;
> > +   int ret;
> > +
> > +   for (gpa = base; gpa < end; gpa += len) {
> > +   uint64_t offset;
> > +
> > +   region = userspace_mem_region_find(vm, gpa, gpa);
> > +   TEST_ASSERT(region && region->region.flags & KVM_MEM_GUEST_MEMFD,
> > +   "Private memory region not found for GPA 0x%lx", gpa);
> > +
> > +   offset = (gpa - region->region.guest_phys_addr);
> 
> nit: why the parentheses?

I simply forgot to remove them when I changed the function to support spanning
multiple memslots, i.e. when the code went from this

fd_offset = region->region.gmem_offset +
(gpa - region->region.guest_phys_addr);

to what you see above.


Re: [PATCH 27/34] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type

2023-11-06 Thread Sean Christopherson
On Mon, Nov 06, 2023, Fuad Tabba wrote:
> On Sun, Nov 5, 2023 at 4:34 PM Paolo Bonzini  wrote:
> >
> > From: Sean Christopherson 
> >
> > Add a "vm_shape" structure to encapsulate the selftests-defined "mode",
> > along with the KVM-defined "type" for use when creating a new VM.  "mode"
> > tracks physical and virtual address properties, as well as the preferred
> > backing memory type, while "type" corresponds to the VM type.
> >
> > Taking the VM type will allow adding tests for KVM_CREATE_GUEST_MEMFD,
> > a.k.a. guest private memory, without needing an entirely separate set of
> > helpers.  Guest private memory is effectively usable only by confidential
> > VM types, and it's expected that x86 will double down and require unique
> > VM types for TDX and SNP guests.
> >
> > Signed-off-by: Sean Christopherson 
> > Message-Id: <20231027182217.3615211-30-sea...@google.com>
> > Signed-off-by: Paolo Bonzini 
> > ---
> 
> nit: as in a prior selftest commit messages, references in the commit
> message to guest _private_ memory. Should these be changed to just
> guest memory?

Hmm, no, "private" is mostly appropriate here.  At this point in time, only x86
supports KVM_CREATE_GUEST_MEMFD, and x86 only supports it for private memory.
And the purpose of letting x86 selftests specify KVM_X86_SW_PROTECTED_VM, i.e.
the reason this patch exists, is purely to get private memory.

Maybe tweak the second paragraph to this?

Taking the VM type will allow adding tests for KVM_CREATE_GUEST_MEMFD
without needing an entirely separate set of helpers.  At this time,
guest_memfd is effectively usable only by confidential VM types in the
form of guest private memory, and it's expected that x86 will double down
and require unique VM types for TDX and SNP guests.


Re: [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory

2023-11-06 Thread Sean Christopherson
On Mon, Nov 06, 2023, Xu Yilun wrote:
> On Sun, Nov 05, 2023 at 05:19:36PM +0100, Paolo Bonzini wrote:
> > On Sun, Nov 5, 2023 at 2:04 PM Xu Yilun  wrote:
> > >
> > > > +static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > > > +   struct kvm_page_fault 
> > > > *fault)
> > > > +{
> > > > + kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> > > > +   PAGE_SIZE, fault->write, fault->exec,
> > > > +   fault->is_private);
> > > > +}
> > > > +
> > > > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > > > +struct kvm_page_fault *fault)
> > > > +{
> > > > + int max_order, r;
> > > > +
> > > > + if (!kvm_slot_can_be_private(fault->slot)) {
> > > > + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > > + return -EFAULT;
> > > > + }
> > > > +
> > > > + r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
> > > > +  &max_order);
> > > > + if (r) {
> > > > + kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > > + return r;
> > >
> > > Why report KVM_EXIT_MEMORY_FAULT here? even with a ret != -EFAULT?
> > 
> > The cases are EFAULT, EHWPOISON (which can report
> > KVM_EXIT_MEMORY_FAULT) and ENOMEM. I think it's fine
> > that even -ENOMEM can return KVM_EXIT_MEMORY_FAULT,
> > and it doesn't violate the documentation.  The docs tell you "what
> > can you do if error if EFAULT or EHWPOISON?"; they don't
> > exclude that other errnos result in KVM_EXIT_MEMORY_FAULT,
> > it's just that you're not supposed to look at it
> 
> Thanks, it's OK for ENOMEM + KVM_EXIT_MEMORY_FAULT.
> 
> Another concern is, now 3 places to report EFAULT + KVM_EXIT_MEMORY_FAULT:
> 
>   if (!kvm_slot_can_be_private(fault->slot)) {
>   kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
>   return -EFAULT;
>   }
> 
>   file = kvm_gmem_get_file(slot);
>   if (!file)
>   return -EFAULT;
> 
>   if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
>   kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
>   return -EFAULT;
>   }
> 
> They are different cases, and seems userspace should handle them
> differently, but not enough information to distinguish them.

For the first, the memory_fault exit will inform userspace that the guest wants
to map memory as private, and userspace will see that the memslot isn't 
configured
to support private mappings.  Userspace may not even need to query memslots, 
e.g.
if the gfn in question has been enumerated to the guest as something that can 
only
be mapped shared.

For the second (no valid guest_memfd file), userspace put the last reference to
the guest_memfd file without informing the guest or creating a memslot.  That's
firmly a userspace bug.

For the third and last, userspace will see that the guest is requesting a 
private
mapping but the gfn is configured for shared mappings.

In all cases, userspace has the necessary information to resolve the issue, 
where
"resolving the issue" may mean terminating the guest.  If userspace isn't 
tracking
memslots or the private attribute, then userspace has far bigger problems.
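
For example, a VMM's exit handler might look roughly like the sketch below (hedged:
the vmm_*() helpers are stand-ins for the VMM's own memslot/attribute tracking, and
the field/flag names follow the uAPI proposed in this series).  The second case, a
dangling guest_memfd reference, isn't shown because it's flatly a userspace bug.

#include <stdbool.h>
#include <stdint.h>
#include <linux/kvm.h>

/* Hypothetical VMM-side helpers, not KVM APIs. */
extern bool vmm_gpa_has_private_capable_slot(uint64_t gpa);
extern int  vmm_terminate_guest(const char *why);
extern int  vmm_convert_range(uint64_t gpa, uint64_t size, bool to_private);

static int handle_memory_fault(struct kvm_run *run)
{
	uint64_t gpa  = run->memory_fault.gpa;
	uint64_t size = run->memory_fault.size;
	bool priv = run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

	/* Guest wants a private mapping but the memslot can't provide one. */
	if (priv && !vmm_gpa_has_private_capable_slot(gpa))
		return vmm_terminate_guest("private access to shared-only memory");

	/* Otherwise it's an attribute mismatch: convert and re-enter the guest. */
	return vmm_convert_range(gpa, size, priv);
}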


Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-06 Thread Sean Christopherson
On Sat, Nov 04, 2023, Xu Yilun wrote:
> > +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION 
> > that
> > +allows mapping guest_memfd memory into a guest.  All fields shared with
> > +KVM_SET_USER_MEMORY_REGION identically.  Userspace can set KVM_MEM_PRIVATE 
> > in
> > +flags to have KVM bind the memory region to a given guest_memfd range of
> > +[guest_memfd_offset, guest_memfd_offset + memory_size].  The target 
> > guest_memfd
> ^
> The range end should be exclusive, is it?

Yes, that should be a ')', not a ']'.

> > +static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
> > +{
> > +   const char *anon_name = "[kvm-gmem]";
> > +   struct kvm_gmem *gmem;
> > +   struct inode *inode;
> > +   struct file *file;
> > +   int fd, err;
> > +
> > +   fd = get_unused_fd_flags(0);
> > +   if (fd < 0)
> > +   return fd;
> > +
> > +   gmem = kzalloc(sizeof(*gmem), GFP_KERNEL);
> > +   if (!gmem) {
> > +   err = -ENOMEM;
> > +   goto err_fd;
> > +   }
> > +
> > +   /*
> > +* Use the so called "secure" variant, which creates a unique inode
> > +* instead of reusing a single inode.  Each guest_memfd instance needs
> > +* its own inode to track the size, flags, etc.
> > +*/
> > +   file = anon_inode_getfile_secure(anon_name, &kvm_gmem_fops, gmem,
> > +O_RDWR, NULL);
> > +   if (IS_ERR(file)) {
> > +   err = PTR_ERR(file);
> > +   goto err_gmem;
> > +   }
> > +
> > +   file->f_flags |= O_LARGEFILE;
> > +
> > +   inode = file->f_inode;
> > +   WARN_ON(file->f_mapping != inode->i_mapping);
> 
> Just curious, why should we check the mapping fields which is garanteed in
> other subsystem?

Mostly to document the behavior.  The vast majority of folks that read this code
will be KVM developers, not file systems developers, and will likely have no 
clue
about the relationship between f_mapping and i_mapping.  And in the extremely
unlikely scenario that anon_inode_getfile_secure() no longer sets f_mapping, a
WARN detects the issue whereas a comment does not.


Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-03 Thread Sean Christopherson
On Thu, Nov 02, 2023, Fuad Tabba wrote:
> On Wed, Nov 1, 2023 at 9:55 PM Sean Christopherson  wrote:
> > E.g. a misbehaving userspace could prematurely delete a memslot.  And the 
> > more
> > fun example is intrahost migration, where the plan is to allow pointing 
> > multiple
> > guest_memfd files at a single guest_memfd inode:
> > https://lore.kernel.org/all/cover.1691446946.git.ackerley...@google.com
> >
> > There was a lot of discussion for this, but it's scattered all over the 
> > place.
> > The TL;DR is is that the inode will represent physical memory, and a file 
> > will
> > represent a given "struct kvm" instance's view of that memory.  And so the 
> > memory
> > isn't reclaimed until the inode is truncated/punched.
> >
> > I _think_ this reflects the most recent plan from the guest_memfd side:
> > https://lore.kernel.org/all/1233d749211c08d51f9ca5d427938d47f008af1f.1689893403.git.isaku.yamah...@intel.com

Doh, sitting in my TODO folder...

https://lore.kernel.org/all/20231016115028.996656-1-michael.r...@amd.com

> Thanks for pointing that out. I think this might be the way to go.
> I'll have a closer look at this and see how to get it to work with
> pKVM.


Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, David Matlack wrote:
> On Thu, Nov 2, 2023 at 9:03 AM Sean Christopherson  wrote:
> >
> > On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> > > On 10/31/23 23:39, David Matlack wrote:
> > > > > > Maybe can you sketch out how you see this proposal being extensible 
> > > > > > to
> > > > > > using guest_memfd for shared mappings?
> > > > > For in-place conversions, e.g. pKVM, no additional guest_memfd is 
> > > > > needed.  What's
> > > > > missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM 
> > > > > needs to
> > > > > ensure there are no outstanding references when converting back to 
> > > > > private.
> > > > >
> > > > > For TDX/SNP, assuming we don't find a performant and robust way to do 
> > > > > in-place
> > > > > conversions, a second fd+offset pair would be needed.
> > > > Is there a way to support non-in-place conversions within a single 
> > > > guest_memfd?
> > >
> > > For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest
> > > memory.  The hook would invalidate now-private parts if they have a VMA,
> > > causing a SIGSEGV/EFAULT if the host touches them.
> > >
> > > It would forbid mappings from multiple gfns to a single offset of the
> > > guest_memfd, because then the shared vs. private attribute would be tied 
> > > to
> > > the offset.  This should not be a problem; for example, in the case of 
> > > SNP,
> > > the RMP already requires a single mapping from host physical address to
> > > guest physical address.
> >
> > I don't see how this can work.  It's not a M:1 scenario (where M is 
> > multiple gfns),
> > it's a 1:N scenario (where N is multiple offsets).  The *gfn* doesn't 
> > change on
> > a conversion, what needs to change to do non-in-place conversion is the 
> > pfn, which
> > is effectively the guest_memfd+offset pair.
> >
> > So yes, we *could* support non-in-place conversions within a single 
> > guest_memfd,
> > but it would require a second offset,
> 
> Why can't KVM free the existing page at guest_memfd+offset and
> allocate a new one when doing non-in-place conversions?

Oh, I see what you're suggesting.  Eww.

It's certainly possible, but it would largely defeat the purpose of why we are
adding guest_memfd in the first place.

For TDX and SNP, the goal is to provide a simple, robust mechanism for isolating
guest private memory so that it's all but impossible for the host to access 
private
memory.  As things stand, memory for a given guest_memfd is either private or 
shared
(assuming we support a second guest_memfd per memslot).  I.e. there's no need to
track whether a given page/folio in the guest_memfd is private vs. shared.

We could use memory attributes, but that further complicates things when 
intrahost
migration (and potentially other multi-user scenarios) comes along, i.e. when 
KVM
supports linking multiple guest_memfd files to a single inode.  We'd have to 
ensure
that all "struct kvm" instances have identical PRIVATE attributes for a given
*offset* in the inode.  I'm not even sure how feasible that is for intrahost
migration, and that's the *easy* case, because IIRC it's already a hard 
requirement
that the source and destination have identical gfn=>guest_memfd bindings, i.e. 
KVM
can somewhat easily reason about gfn attributes.

But even then, that only helps with the actual migration of the VM, e.g. we'd 
still
have to figure out how to deal with .mmap() and other shared vs. private actions
when linking a new guest_memfd file against an existing inode.

I haven't seen the pKVM patches for supporting .mmap(), so maybe this is already
a solved problem, but I'd honestly be quite surprised if it all works correctly
if/when KVM supports multiple files per inode.

And I don't see what value non-in-place conversions would add.  The value added
by in-place conversions, aside from the obvious preservation of data, which 
isn't
relevant to TDX/SNP, is that it doesn't require freeing and reallocating memory
to avoid double-allocating for private vs. shared.  That's especially quite nice
when hugepages are being used because reconstituting a hugepage "only" requires
zapping SPTEs.

But if KVM is freeing the private page, it's the same as punching a hole, 
probably
quite literally, when mapping the gfn as shared.  In every way I can think of, 
it's
worse.  E.g. it's more complex for KVM, and the PUNCH_HOLE => allocation 
operations
must be serialized.

Regarding double-allocating, I really, really think we should solve that in the
guest.  I.e. teach Linux-as-a-guest to aggressively convert at 2MiB granularity
and avoid 4KiB conversions.  4KiB conversions aren't just a memory utilization
problem, they're also a performance problem, e.g. shatters hugepages (which KVM
doesn't yet support recovering) and increases TLB pressure for both stage-1 and
stage-2 mappings.


Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> On 10/31/23 23:39, David Matlack wrote:
> > > > Maybe can you sketch out how you see this proposal being extensible to
> > > > using guest_memfd for shared mappings?
> > > For in-place conversions, e.g. pKVM, no additional guest_memfd is needed. 
> > >  What's
> > > missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM 
> > > needs to
> > > ensure there are no outstanding references when converting back to 
> > > private.
> > > 
> > > For TDX/SNP, assuming we don't find a performant and robust way to do 
> > > in-place
> > > conversions, a second fd+offset pair would be needed.
> > Is there a way to support non-in-place conversions within a single 
> > guest_memfd?
> 
> For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest
> memory.  The hook would invalidate now-private parts if they have a VMA,
> causing a SIGSEGV/EFAULT if the host touches them.
> 
> It would forbid mappings from multiple gfns to a single offset of the
> guest_memfd, because then the shared vs. private attribute would be tied to
> the offset.  This should not be a problem; for example, in the case of SNP,
> the RMP already requires a single mapping from host physical address to
> guest physical address.

I don't see how this can work.  It's not an M:1 scenario (where M is multiple 
gfns),
it's a 1:N scenario (where N is multiple offsets).  The *gfn* doesn't change on
a conversion, what needs to change to do non-in-place conversion is the pfn, 
which
is effectively the guest_memfd+offset pair.

So yes, we *could* support non-in-place conversions within a single guest_memfd,
but it would require a second offset, at which point it makes sense to add a
second file descriptor as well.  Userspace could still use a single guest_memfd
instance, i.e. pass in the same file descriptor but different offsets.


Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> On 11/2/23 10:35, Huang, Kai wrote:
> > IIUC KVM can already handle the case of poisoned
> > page by sending signal to user app:
> > 
> > static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > {
> > ...
> > 
> > if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> > kvm_send_hwpoison_signal(fault->slot, fault->gfn);

No, this doesn't work, because that signals the host virtual address

unsigned long hva = gfn_to_hva_memslot(slot, gfn);

send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current);

which is the *shared* page.

> > return RET_PF_RETRY;
> > }
> > }
> 
> EHWPOISON is not implemented by this series, so it should be left out of the
> documentation.

EHWPOISON *is* implemented.  kvm_gmem_get_pfn() returns -EHWPOISON as 
appropriate,
and kvm_faultin_pfn() returns that directly without going through 
kvm_handle_error_pfn().

  kvm_faultin_pfn_private()
  |
  |-> kvm_gmem_get_pfn()
  |
  |-> if (folio_test_hwpoison(folio)) {
r = -EHWPOISON;
goto out_unlock;
  }

  |
  |->   r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
                             &max_order);
if (r) {
kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
return r;
}

|
|-> ret = __kvm_faultin_pfn(vcpu, fault);
if (ret != RET_PF_CONTINUE)
return ret;

if (unlikely(is_error_pfn(fault->pfn)))
return kvm_handle_error_pfn(vcpu, fault);


Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, Xiaoyao Li wrote:
> On 11/2/2023 1:36 AM, Sean Christopherson wrote:
> > > KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function 
> > > to
> > > ?
> > I'd prefer to keep it in generic code, as it's highly likely to end up there
> > sooner than later.  There's a known use case for ARM (exit to userspace on 
> > missing
> > userspace mapping[*]), and I'm guessing pKVM (also ARM) will also utilize 
> > this API.
> > 
> > [*]https://lore.kernel.org/all/20230908222905.1321305-8-amoor...@google.com
> 
> I wonder how this CAP is supposed to be checked in userspace, for guest
> memfd case? 

It's basically useless for guest_memfd.

>   if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO) &&
>   run->exit_reason == KVM_EXIT_MEMORY_FAULT)
>   abort("unexpected KVM_EXIT_MEMORY_FAULT");
> 
> In my implementation of QEMU patches, I find it's unnecessary. When
> userspace gets an exit with KVM_EXIT_MEMORY_FAULT, it implies
> "KVM_CAP_MEMORY_FAULT_INFO".
> 
> So I don't see how it is necessary in this series. Whether it's necessary or
> not for [*], I don't have the answer but we can leave the discussion to that
> patch series.

It's not strictly necessary there either.

However, Oliver felt (and presumably still feels) quite strongly, and I agree,
that reporting extra information shouldn't be tightly coupled to
KVM_CAP_EXIT_ON_MISSING or KVM_CAP_GUEST_MEMFD.

E.g. if userspace develops a "standalone" use case for 
KVM_CAP_MEMORY_FAULT_INFO,
userspace should be able to check for support without having to take a 
dependency
on KVM_CAP_GUEST_MEMFD, especially since KVM_CAP_GUEST_MEMFD may not be
supported, i.e. userspace should be able to do:

if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO))
abort("KVM_CAP_MEMORY_FAULT_INFO required for fancy feature 
XYZ");




Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, Kai Huang wrote:
> On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> > On Wed, Nov 01, 2023, Kai Huang wrote:
> > > 
> > > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > > +--
> > > > +
> > > > +:Architectures: x86
> > > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > > +
> > > > +The presence of this capability indicates that KVM_RUN will fill
> > > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, 
> > > > e.g. if
> > > > +there is a valid memslot but no backing VMA for the corresponding host 
> > > > virtual
> > > > +address.
> > > > +
> > > > +The information in kvm_run.memory_fault is valid if and only if 
> > > > KVM_RUN returns
> > > > +an error with errno=EFAULT or errno=EHWPOISON *and* 
> > > > kvm_run.exit_reason is set
> > > > +to KVM_EXIT_MEMORY_FAULT.
> > > 
> > > IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> > > implementation.
> > 
> > The errno that is returned to userspace is ABI.  In KVM, it's a _very_ 
> > poorly
> > defined ABI for the vast majority of ioctls(), but it's still technically 
> > ABI.
> > KVM gets away with being cavalier with errno because the vast majority of 
> > errors
> > are considered fatal by userspace, i.e. in most cases, userspace simply 
> > doesn't
> > care about the exact errno.
> > 
> > A good example is KVM_RUN with -EINTR; if KVM were to return something 
> > other than
> > -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would 
> > fall over.
> > 
> > > Is it better to relax the validity of kvm_run.memory_fault when
> > > KVM_RUN returns any -errno?
> > 
> > Not unless there's a need to do so, and if there is then we can update the
> > documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is 
> > valid
> > for any errno, then KVM would need to purge kvm_run.exit_reason super early 
> > in
> > KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
> > misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason super 
> > early is
> > subtly tricky because KVM's (again, poorly documented) ABI is that *some* 
> > exit
> > reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or 
> > with a
> > pending signal).
> > 
> > https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com
> > 
> > 
> 
> Agreed with not to relax to any errno.  However using -EFAULT as part of ABI
> definition seems a little bit dangerous, e.g., someone could accidentally or
> mistakenly return -EFAULT in KVM_RUN at early time and/or in a completely
> different code path, etc.  -EINTR has well defined meaning, but -EFAULT (which
> is "Bad address") seems doesn't but I am not sure either. :-)

KVM has returned -EFAULT since forever, i.e. it's effectively already part of 
the
ABI.  I doubt there's a userspace that relies precisely on -EFAULT, but 
userspace
definitely will be confused if KVM returns '0' where KVM used to return -EFAULT.
And so if we want to return '0', it needs to be opt-in, which means forcing
userspace to enable a capability *and* requires code in KVM to conditionally 
return
'0' instead of -EFAULT/-EHWPOISON.
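
To make the opt-in idea concrete, a rough sketch (the capability and the per-VM
flag below are purely hypothetical, invented for illustration; nothing like this
exists in the series):

	/* Hypothetical: only if userspace enabled something like KVM_CAP_MEMORY_FAULT_RET_ZERO. */
	static int kvm_memory_fault_ret(struct kvm *kvm, int err)
	{
		/* err is -EFAULT or -EHWPOISON, exit_reason is already KVM_EXIT_MEMORY_FAULT */
		if (kvm->memory_fault_ret_zero)
			return 0;
		return err;
	}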

> One example is, for backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
> KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when SGX vepc
> fault handler failed to allocate EPC) and kvm_handle_error_pfn() will just
> return -EFAULT.  If kvm_run.exit_reason isn't purged early then is it possible
> to have some issue here?

Well, yeah, but that's exactly why this series has a patch to reset exit_reason.
The solution to "if KVM is buggy then bad things happen" is to not have KVM 
bugs :-)


Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> On Wed, Nov 1, 2023 at 11:35 PM Sean Christopherson  wrote:
> >
> > On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> > > On 11/1/23 17:36, Sean Christopherson wrote:
> > > > Can you post a fixup patch?  It's not clear to me exactly what behavior 
> > > > you intend
> > > > to end up with.
> > >
> > > Sure, just this:
> > >
> > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > index 7d1a33c2ad42..34fd070e03d9 100644
> > > --- a/virt/kvm/guest_memfd.c
> > > +++ b/virt/kvm/guest_memfd.c
> > > @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct 
> > > kvm_create_guest_memfd *args)
> > >  {
> > >   loff_t size = args->size;
> > >   u64 flags = args->flags;
> > > - u64 valid_flags = 0;
> > > -
> > > - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > > - valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > > + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > >   if (flags & ~valid_flags)
> > >   return -EINVAL;
> > > @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct 
> > > kvm_create_guest_memfd *args)
> > >   if (size < 0 || !PAGE_ALIGNED(size))
> > >   return -EINVAL;
> > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >   if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> > >   !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> > >   return -EINVAL;
> > > -#endif
> >
> > That won't work, HPAGE_PMD_SIZE is valid only for 
> > CONFIG_TRANSPARENT_HUGEPAGE=y.
> >
> > #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> > #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> > #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> > #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
> 
> Would have caught it when actually testing it, I guess. :) It has to
> be PMD_SIZE, possibly with
> 
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE);
> #endif

Yeah, that works for me.

Actually, looking that this again, there's not actually a hard dependency on 
THP.
A THP-enabled kernel _probably_  gives a higher probability of using hugepages,
but mostly because THP selects COMPACTION, and I suppose because using THP for
other allocations reduces overall fragmentation.

So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think
we should do the below (I verified KVM can create hugepages with THP=n).  We'll
need another capability, but (a) we probably should have that anyways and (b) it
provides a cleaner path to adding PUD-sized hugepage support in the future.

And then adjust the tests like so:

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index c15de9852316..c9f449718fce 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -201,6 +201,10 @@ int main(int argc, char *argv[])
 
TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
 
+   if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && thp_configured())
+           TEST_ASSERT_EQ(get_trans_hugepagesz(),
+                          kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE));
+
page_size = getpagesize();
total_size = page_size * 4;
 
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
index be311944e90a..245901587ed2 100644
--- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -396,7 +396,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 
vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
 
-   if (backing_src_can_be_huge(src_type))
+   if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE))
memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
else
memfd_flags = 0;

--
From: Sean Christopherson 
Date: Wed, 25 Oct 2023 16:26:41 -0700
Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest
 memory

Extend guest_memfd to allow backing guest memory with hugepages.  For now,
make hugepage utilization best-effort, i.e. fall back to non-huge mappings
if a hugepage can't be allocated.  Guaranteeing hugepages would require a
dedicated memory pool and significantly more complexity and churn.

Require userspace to opt-in via a flag even though it's unlikely real use
cases will ever want to use order-0 pages, e.g. 

Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson  wrote:
> >
> > Add flags to "struct kvm_gfn_range" to let notifier events target only
> > shared and only private mappings, and write up the existing mmu_notifier
> > events to be shared-only (private memory is never associated with a
> > userspace virtual address, i.e. can't be reached via mmu_notifiers).
> >
> > Add two flags so that KVM can handle the three possibilities (shared,
> > private, and shared+private) without needing something like a tri-state
> > enum.
> >
> > Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com
> > Signed-off-by: Sean Christopherson 
> > ---
> >  include/linux/kvm_host.h | 2 ++
> >  virt/kvm/kvm_main.c  | 7 +++
> >  2 files changed, 9 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 96aa930536b1..89c1a991a3b8 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -263,6 +263,8 @@ struct kvm_gfn_range {
> > gfn_t start;
> > gfn_t end;
> > union kvm_mmu_notifier_arg arg;
> > +   bool only_private;
> > +   bool only_shared;
> 
> If these flags aren't used in this patch series, should this patch be
> moved to the other series?

If *both* TDX and SNP need this patch, then I think it's probably worth applying
it now to make their lives easier.  But if only one needs the support, then I
completely agree this should be punted to whichever series needs it (this also
came up in v11, but we didn't force the issue).

Mike, Isaku?

> Also, if shared+private is a possibility, doesn't the prefix "only_"
> confuse things a bit? I.e., what is shared+private, is it when both
> are 0 or when both are 1? I assume it's the former (both are 0), but
> it might be clearer.

Heh, I was hoping that "only_private && only_shared" would be obviously 
nonsensical.

The only alternative I can think would be to add an enum, e.g.

enum {
PROCESS_PRIVATE_AND_SHARED,
PROCESS_ONLY_PRIVATE,
PROCESS_ONLY_SHARED,
};

because every other way of expressing the flags either results in more confusion
or an unsafe default.  I.e. I want zapping only private or only shared to 
require
the caller to explicitly set a non-zero value, which is how I ended up with
"only_{private,shared}" as opposed to "process_{private,shared}".


Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-01 Thread Sean Christopherson
On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> On 11/1/23 17:36, Sean Christopherson wrote:
> > > > "Allow" isn't perfect, e.g. I would much prefer a straight 
> > > > KVM_GUEST_MEMFD_USE_HUGEPAGES
> > > > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that 
> > > > KVM doesn't
> > > > (yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is 
> > > > stronger than
> > > > a hint, but weaker than a requirement.  And if/when KVM supports a 
> > > > dedicated memory
> > > > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
> > > I think that the current patch is fine, but I will adjust it to always
> > > allow the flag, and to make the size check even if 
> > > !CONFIG_TRANSPARENT_HUGEPAGE.
> > > If hugepages are not guaranteed, and (theoretically) you could have no
> > > hugepage at all in the result, it's okay to get this result even if THP 
> > > is not
> > > available in the kernel.
> > Can you post a fixup patch?  It's not clear to me exactly what behavior you 
> > intend
> > to end up with.
> 
> Sure, just this:
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 7d1a33c2ad42..34fd070e03d9 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct 
> kvm_create_guest_memfd *args)
>  {
>   loff_t size = args->size;
>   u64 flags = args->flags;
> - u64 valid_flags = 0;
> -
> - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> - valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
>   if (flags & ~valid_flags)
>   return -EINVAL;
> @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct 
> kvm_create_guest_memfd *args)
>   if (size < 0 || !PAGE_ALIGNED(size))
>   return -EINVAL;
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
>   !IS_ALIGNED(size, HPAGE_PMD_SIZE))
>   return -EINVAL;
> -#endif

That won't work, HPAGE_PMD_SIZE is valid only for CONFIG_TRANSPARENT_HUGEPAGE=y.

#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })

...

>   return __kvm_gmem_create(kvm, size, flags);
>  }
> 
> Paolo
> 


Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-01 Thread Sean Christopherson
On Wed, Nov 01, 2023, Fuad Tabba wrote:
> > > > @@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct 
> > > > kvm_memory_slot *memslot)
> > > >  /* This does not remove the slot from struct kvm_memslots data 
> > > > structures */
> > > >  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot 
> > > > *slot)
> > > >  {
> > > > +   if (slot->flags & KVM_MEM_PRIVATE)
> > > > +   kvm_gmem_unbind(slot);
> > > > +
> > >
> > > Should this be called after kvm_arch_free_memslot()? Arch-specific ode
> > > might need some of the data before the unbinding, something I thought
> > > might be necessary at one point for the pKVM port when deleting a
> > > memslot, but realized later that kvm_invalidate_memslot() ->
> > > kvm_arch_guest_memory_reclaimed() was the more logical place for it.
> > > Also, since that seems to be the pattern for arch-specific handlers in
> > > KVM.
> >
> > Maybe?  But only if we care about symmetry between the allocation and free 
> > paths.
> > I really don't think kvm_arch_free_memslot() should be doing anything 
> > beyond a
> > "pure" free.  E.g. kvm_arch_free_memslot() is also called after moving a 
> > memslot,
> > which hopefully we never actually have to allow for guest_memfd, but any 
> > code in
> > kvm_arch_free_memslot() would bring about "what if" questions regarding 
> > memslot
> > movement.  I.e. the API is intended to be a "free arch metadata associated 
> > with
> > the memslot".
> >
> > Out of curiosity, what does pKVM need to do at 
> > kvm_arch_guest_memory_reclaimed()?
> 
> It's about the host reclaiming ownership of guest memory when tearing
> down a protected guest. In pKVM, we currently teardown the guest and
> reclaim its memory when kvm_arch_destroy_vm() is called. The problem
> with guestmem is that kvm_gmem_unbind() could get called before that
> happens, after which the host might try to access the unbound guest
> memory. Since the host hasn't reclaimed ownership of the guest memory
> from hyp, hilarity ensues (it crashes).
> 
> Initially, I hooked reclaim guest memory to kvm_free_memslot(), but
> then I needed to move the unbind later in the function. I realized
> later that kvm_arch_guest_memory_reclaimed() gets called earlier (at
> the right time), and is more aptly named.

Aha!  I suspected that might be the case.

TDX and SNP also need to solve the same problem of "reclaiming" memory before it
can be safely accessed by the host.  The plan is to add an arch hook (or two?)
into guest_memfd that is invoked when memory is freed from guest_memfd.

Hooking kvm_arch_guest_memory_reclaimed() isn't completely correct as deleting a
memslot doesn't *guarantee* that guest memory is actually reclaimed (which 
reminds
me, we need to figure out a better name for that thing before introducing
kvm_arch_gmem_invalidate()).
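
As a rough sketch of what such a free-path hook could look like (the hook name
is taken from this discussion; everything else is illustrative only and may not
match what TDX/SNP ultimately post):

	static void kvm_gmem_invalidate_folio(struct folio *folio)
	{
		kvm_pfn_t start = folio_pfn(folio);
		kvm_pfn_t end = start + folio_nr_pages(folio);

		/* Arch reclaims ownership only when the backing pages are truly freed. */
		kvm_arch_gmem_invalidate(start, end);
	}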

The effective false positives aren't fatal for the current usage because the 
hook
is used only for x86 SEV guests to flush caches.  An unnecessary flush can cause
performance issues, but it doesn't affect correctness. For TDX and SNP, and IIUC
pKVM, false positives are fatal because KVM could assign memory back to the host
that is still owned by guest_memfd.

E.g. a misbehaving userspace could prematurely delete a memslot.  And the more
fun example is intrahost migration, where the plan is to allow pointing multiple
guest_memfd files at a single guest_memfd inode:
https://lore.kernel.org/all/cover.1691446946.git.ackerley...@google.com

There was a lot of discussion for this, but it's scattered all over the place.
The TL;DR is that the inode will represent physical memory, and a file will
represent a given "struct kvm" instance's view of that memory.  And so the 
memory
isn't reclaimed until the inode is truncated/punched.

I _think_ this reflects the most recent plan from the guest_memfd side:
https://lore.kernel.org/all/1233d749211c08d51f9ca5d427938d47f008af1f.1689893403.git.isaku.yamah...@intel.com


Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-01 Thread Sean Christopherson
On Wed, Nov 01, 2023, Kai Huang wrote:
> 
> > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > +--
> > +
> > +:Architectures: x86
> > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > +
> > +The presence of this capability indicates that KVM_RUN will fill
> > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, 
> > e.g. if
> > +there is a valid memslot but no backing VMA for the corresponding host 
> > virtual
> > +address.
> > +
> > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN 
> > returns
> > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is 
> > set
> > +to KVM_EXIT_MEMORY_FAULT.
> 
> IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> implementation.

The errno that is returned to userspace is ABI.  In KVM, it's a _very_ poorly
defined ABI for the vast majority of ioctls(), but it's still technically ABI.
KVM gets away with being cavalier with errno because the vast majority of errors
are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
care about the exact errno.

A good example is KVM_RUN with -EINTR; if KVM were to return something other 
than
-EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall 
over.
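
From the userspace side, the ABI being described boils down to something like
the sketch below (handle_memory_fault() is a made-up placeholder; kvm_run.memory_fault
and KVM_EXIT_MEMORY_FAULT come from this series, so a matching uapi header is
assumed):

	#include <err.h>
	#include <errno.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static void handle_memory_fault(__u64 gpa, __u64 size) { /* placeholder */ }

	static void run_vcpu(int vcpu_fd, struct kvm_run *run)
	{
		for (;;) {
			if (!ioctl(vcpu_fd, KVM_RUN, NULL))
				continue;	/* "normal" exits elided */

			if (errno == EINTR)	/* pending signal or immediate_exit */
				continue;

			if ((errno == EFAULT || errno == EHWPOISON) &&
			    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
				handle_memory_fault(run->memory_fault.gpa,
						    run->memory_fault.size);
				continue;
			}

			err(1, "KVM_RUN");	/* anything else is fatal in this sketch */
		}
	}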

> Is it better to relax the validity of kvm_run.memory_fault when
> KVM_RUN returns any -errno?

Not unless there's a need to do so, and if there is then we can update the
documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is valid
for any errno, then KVM would need to purge kvm_run.exit_reason super early in
KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason super early is
subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
pending signal).

https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com

> [...]
> 
> 
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2327,4 +2327,15 @@ static inline void kvm_account_pgtable_pages(void 
> > *virt, int nr)
> >  /* Max number of entries allowed for each kvm dirty ring */
> >  #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
> >  
> > +static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> > +gpa_t gpa, gpa_t size)
> > +{
> > +   vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +   vcpu->run->memory_fault.gpa = gpa;
> > +   vcpu->run->memory_fault.size = size;
> > +
> > +   /* Flags are not (yet) defined or communicated to userspace. */
> > +   vcpu->run->memory_fault.flags = 0;
> > +}
> > +
> 
> KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function to
> ?

I'd prefer to keep it in generic code, as it's highly likely to end up there
sooner than later.  There's a known use case for ARM (exit to userspace on 
missing
userspace mapping[*]), and I'm guessing pKVM (also ARM) will also utilize this 
API.

[*] https://lore.kernel.org/all/20230908222905.1321305-8-amoor...@google.com


Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-01 Thread Sean Christopherson
On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> On Wed, Nov 1, 2023 at 2:41 PM Sean Christopherson  wrote:
> >
> > On Wed, Nov 01, 2023, Xiaoyao Li wrote:
> > > On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> > > > On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> > > > > On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > > But it's different than MADV_HUGEPAGE, in a way. Per my understanding, the
> > > failure of MADV_HUGEPAGE is not fatal, user space can ignore it and
> > > continue.
> > >
> > > However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which 
> > > leads
> > > to failure of guest memfd creation.
> >
> > Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different
> > action from userspace, i.e. instead of ignoring the error, userspace could 
> > redo
> > KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0.
> >
> > We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we 
> > could
> > extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to 
> > let
> > userspace provide advice/hints after creating a guest_memfd.  But I suspect 
> > that
> > guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a 
> > post-creation hint
> > is actually less desirable.
> >
> > KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a
> > compatible size or the kernel doesn't support THP.  An incompatible size is 
> > likely
> > a userspace bug, and for most setups that want to utilize guest_memfd, lack 
> > of THP
> > support is likely a configuration bug.  I.e. many/most uses *want* failures 
> > due to
> > KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal.
> >
> > > For current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE
> > > fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? or maybe *PREFER*?
> >
> > Why?  Verbs like "prefer" and "desire" aren't a good fit IMO because they 
> > suggest
> > the flag is a hint, and hints are usually best effort only, i.e. are 
> > ignored if
> > there is a fundamental incompatibility.
> >
> > "Allow" isn't perfect, e.g. I would much prefer a straight 
> > KVM_GUEST_MEMFD_USE_HUGEPAGES
> > or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM 
> > doesn't
> > (yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger 
> > than
> > a hint, but weaker than a requirement.  And if/when KVM supports a 
> > dedicated memory
> > pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.
> 
> I think that the current patch is fine, but I will adjust it to always
> allow the flag, and to make the size check even if 
> !CONFIG_TRANSPARENT_HUGEPAGE.
> If hugepages are not guaranteed, and (theoretically) you could have no
> hugepage at all in the result, it's okay to get this result even if THP is not
> available in the kernel.

Can you post a fixup patch?  It's not clear to me exactly what behavior you 
intend
to end up with.


Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-01 Thread Sean Christopherson
On Wed, Nov 01, 2023, Xiaoyao Li wrote:
> On 10/31/2023 10:16 PM, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> > > On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > > > Extended guest_memfd to allow backing guest memory with transparent
> > > > hugepages. Require userspace to opt-in via a flag even though there's no
> > > > known/anticipated use case for forcing small pages as THP is optional,
> > > > i.e. to avoid ending up in a situation where userspace is unaware that
> > > > KVM can't provide hugepages.
> > > 
> > > Personally, it seems not so "transparent" if requiring userspace to 
> > > opt-in.
> > > 
> > > People need to 1) check if the kernel built with TRANSPARENT_HUGEPAGE
> > > support, or check is the sysfs of transparent hugepage exists; 2)get the
> > > maximum support hugepage size 3) ensure the size satisfies the alignment;
> > > before opt-in it.
> > > 
> > > Even simpler, userspace can blindly try to create guest memfd with
> > > transparent hugapage flag. If getting error, fallback to create without 
> > > the
> > > transparent hugepage flag.
> > > 
> > > However, it doesn't look transparent to me.
> > 
> > The "transparent" part is referring to the underlying kernel mechanism, 
> > it's not
> > saying anything about the API.  The "transparent" part of THP is that the 
> > kernel
> > doesn't guarantee hugepages, i.e. whether or not hugepages are actually 
> > used is
> > (mostly) transparent to userspace.
> > 
> > Paolo also isn't the biggest fan[*], but there are also downsides to always
> > allowing hugepages, e.g. silent failure due to lack of THP or unaligned 
> > size,
> > and there's precedent in the form of MADV_HUGEPAGE.
> > 
> > [*] 
> > https://lore.kernel.org/all/84a908ae-04c7-51c7-c9a8-119e1933a...@redhat.com
> 
> But it's different than MADV_HUGEPAGE, in a way. Per my understanding, the
> failure of MADV_HUGEPAGE is not fatal, user space can ignore it and
> continue.
>
> However, the failure of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is fatal, which leads
> to failure of guest memfd creation.

Failing KVM_CREATE_GUEST_MEMFD isn't truly fatal, it just requires different
action from userspace, i.e. instead of ignoring the error, userspace could redo
KVM_CREATE_GUEST_MEMFD with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE=0.
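
I.e. something like this on the userspace side (a sketch; assumes the
KVM_CREATE_GUEST_MEMFD ioctl and struct kvm_create_guest_memfd from this series):

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Try hugepages first, retry without the flag if the kernel refuses. */
	static int create_gmem_with_fallback(int vm_fd, uint64_t size)
	{
		struct kvm_create_guest_memfd args = {
			.size  = size,
			.flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
		};
		int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);

		if (fd < 0) {
			/* e.g. THP=n kernel or size not hugepage-aligned */
			args.flags = 0;
			fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
		}
		return fd;
	}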

We could make the behavior more like MADV_HUGEPAGE, e.g. theoretically we could
extend fadvise() with FADV_HUGEPAGE, or add a guest_memfd knob/ioctl() to let
userspace provide advice/hints after creating a guest_memfd.  But I suspect that
guest_memfd would be the only user of FADV_HUGEPAGE, and IMO a post-creation 
hint
is actually less desirable.

KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will fail only if userspace didn't provide a
compatible size or the kernel doesn't support THP.  An incompatible size is 
likely
a userspace bug, and for most setups that want to utilize guest_memfd, lack of 
THP
support is likely a configuration bug.  I.e. many/most uses *want* failures due 
to
KVM_GUEST_MEMFD_ALLOW_HUGEPAGE to be fatal.

> For current implementation, I think maybe KVM_GUEST_MEMFD_DESIRE_HUGEPAGE
> fits better than KVM_GUEST_MEMFD_ALLOW_HUGEPAGE? or maybe *PREFER*?

Why?  Verbs like "prefer" and "desire" aren't a good fit IMO because they 
suggest
the flag is a hint, and hints are usually best effort only, i.e. are ignored if
there is a fundamental incompatibility.

"Allow" isn't perfect, e.g. I would much prefer a straight 
KVM_GUEST_MEMFD_USE_HUGEPAGES
or KVM_GUEST_MEMFD_HUGEPAGES flag, but I wanted the name to convey that KVM 
doesn't
(yet) guarantee hugepages.  I.e. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is stronger than
a hint, but weaker than a requirement.  And if/when KVM supports a dedicated 
memory
pool of some kind, then we can add KVM_GUEST_MEMFD_REQUIRE_HUGEPAGE.


Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-10-31 Thread Sean Christopherson
On Tue, Oct 31, 2023, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson  wrote:
> 
> ...
> 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index e2252c748fd6..e82c69d5e755 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6079,6 +6079,15 @@ applied.
> >  :Parameters: struct kvm_userspace_memory_region2 (in)
> >  :Returns: 0 on success, -1 on error
> >
> > +KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION 
> > that
> > +allows mapping guest_memfd memory into a guest.  All fields shared with
> > +KVM_SET_USER_MEMORY_REGION identically.  Userspace can set KVM_MEM_PRIVATE 
> > in
> > +flags to have KVM bind the memory region to a given guest_memfd range of
> > +[guest_memfd_offset, guest_memfd_offset + memory_size].  The target 
> > guest_memfd
> > +must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, 
> > and
> > +the target range must not be bound to any other memory region.  All 
> > standard
> > +bounds checks apply (use common sense).
> > +
> 
> Bikeshedding here: Not sure if KVM_MEM_PRIVATE is the best name for
> this. It gets confusing with KVM_MEMORY_ATTRIBUTE_PRIVATE, i.e., that
> a region marked as KVM_MEM_PRIVATE is only potentially private. It did
> confuse the rest of the team when I walked them through a previous
> version of this code once. Would something like KVM_MEM_GUESTMEM make
> more sense?

Heh, deja vu.  We discussed this back in v7[*], and I came to the conclusion 
that
choosing a name that wasn't explicitly tied to private memory wasn't justified.
But that was before a KVM-owned guest_memfd was even an idea, and thus before we
had anything close to a real use case.

Since we now know that at least pKVM will use guest_memfd for shared memory, and
odds are quite good that "regular" VMs will also do the same, i.e. will want
guest_memfd with the concept of private memory, I agree that we should avoid
PRIVATE.

Though I vote for KVM_MEM_GUEST_MEMFD (or KVM_MEM_GUEST_MEMFD_VALID or
KVM_MEM_USE_GUEST_MEMFD).  I.e. do our best to avoid ambiguity between referring
to "guest memory" at-large and guest_memfd.

Copying a few relevant points from v7 to save a click or three.

 : I don't have a concrete use case (this is a recent idea on my end), but 
since we're
 : already adding fd-based memory, I can't think of a good reason not to make it 
more generic
 : for not much extra cost.  And there are definitely classes of VMs for which 
fd-based
 : memory would Just Work, e.g. large VMs that are never oversubscribed on 
memory don't
 : need to support reclaim, so the fact that fd-based memslots won't support 
page aging
 : (among other things) right away is a non-issue.

...

 : Hrm, but basing private memory on top of a generic FD_VALID would 
effectively require
 : shared memory to use hva-based memslots for confidential VMs.  That'd yield 
a very
 : weird API, e.g. non-confidential VMs could be backed entirely by fd-based 
memslots,
 : but confidential VMs would be forced to use hva-based memslots.
 : 
 : Ignore this idea for now.  If there's an actual use case for generic 
fd-based memory
 : then we'll want a separate flag, fd, and offset, i.e. that support could be 
added
 : independent of KVM_MEM_PRIVATE.

...

 : One alternative would be to call it KVM_MEM_PROTECTED.  That shouldn't cause
 : problems for the known use of "private" (TDX and SNP), and it gives us a 
little
 : wiggle room, e.g. if we ever get a use case where VMs can share memory that 
is
 : otherwise protected.
 : 
 : That's a pretty big "if" though, and odds are good we'd need more memslot 
flags and
 : fd+offset pairs to allow differentiating "private" vs. "protected-shared" 
without
 : forcing userspace to punch holes in memslots, so I don't know that hedging 
now will
 : buy us anything.
 : 
 : So I'd say that if people think KVM_MEM_PRIVATE brings additional and 
meaningful
 : clarity over KVM_MEM_PROTECTECD, then lets go with PRIVATE.  But if 
PROTECTED is
 : just as good, go with PROTECTED as it gives us a wee bit of wiggle room for 
the
 : future.

[*] https://lore.kernel.org/all/yuh0ikhoh+tck...@google.com
 
> > -See KVM_SET_USER_MEMORY_REGION.
> > +A KVM_MEM_PRIVATE region _must_ have a valid guest_memfd (private memory) 
> > and
> > +userspace_addr (shared memory).  However, "valid" for userspace_addr simply
> > +means that the address itself must be a legal userspace address.  The 
> > backing
> > +mapping for userspace_addr is not required to be valid/populated at the 
> > time of
> > +KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily 
> > mapped/allo

Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-10-31 Thread Sean Christopherson
On Tue, Oct 31, 2023, David Matlack wrote:
> On 2023-10-27 11:21 AM, Sean Christopherson wrote:
> > Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
> > memory that is tied to a specific KVM virtual machine and whose primary
> > purpose is to serve guest memory.
> > 
> > A guest-first memory subsystem allows for optimizations and enhancements
> > that are kludgy or outright infeasible to implement/support in a generic
> > memory subsystem.  With guest_memfd, guest protections and mapping sizes
> > are fully decoupled from host userspace mappings.   E.g. KVM currently
> > doesn't support mapping memory as writable in the guest without it also
> > being writable in host userspace, as KVM's ABI uses VMA protections to
> > define the allow guest protection.  Userspace can fudge this by
> > establishing two mappings, a writable mapping for the guest and readable
> > one for itself, but that’s suboptimal on multiple fronts.
> > 
> > Similarly, KVM currently requires the guest mapping size to be a strict
> > subset of the host userspace mapping size, e.g. KVM doesn’t support
> > creating a 1GiB guest mapping unless userspace also has a 1GiB guest
> > mapping.  Decoupling the mappings sizes would allow userspace to precisely
> > map only what is needed without impacting guest performance, e.g. to
> > harden against unintentional accesses to guest memory.
> > 
> > Decoupling guest and userspace mappings may also allow for a cleaner
> > alternative to high-granularity mappings for HugeTLB, which has reached a
> > bit of an impasse and is unlikely to ever be merged.
> > 
> > A guest-first memory subsystem also provides clearer line of sight to
> > things like a dedicated memory pool (for slice-of-hardware VMs) and
> > elimination of "struct page" (for offload setups where userspace _never_
> > needs to mmap() guest memory).
> 
> All of these use-cases involve using guest_memfd for shared pages, but
> this entire series sets up KVM to only use guest_memfd for private
> pages.
> 
> For example, the per-page attributes are a property of a KVM VM, not the
> underlying guest_memfd. So that implies we will need separate
> guest_memfds for private and shared pages. But a given memslot can have
> a mix of private and shared pages. So that implies a memslot will need
> to support 2 guest_memfds?

Yes, someday this may be true.  Allowing guest_memfd (it was probably called
something else at that point) for "regular" memory was discussed in I think v10?
We made a conscious decision to defer supporting 2 guest_memfds because it isn't 
strictly
necessary to support the TDX/SNP use cases for which all of this was initially
designed, and adding a second guest_memfd and the infrastructure needed to let
userspace map a guest_memfd can be done on top with minimal overhead.

> But the UAPI only allows 1 and uses the HVA for shared mappings.
> 
> My initial reaction after reading through this series is that the
> per-page private/shared should be a property of the guest_memfd, not the
> VM. Maybe it would even be cleaner in the long-run to make all memory
> attributes a property of the guest_memfd. That way we can scope the
> support to only guest_memfds and not have to worry about making per-page
> attributes work with "legacy" HVA-based memslots.

Making the private vs. shared state a property of the guest_memfd doesn't work
for TDX and SNP.  We (upstream x86 and KVM maintainers) have taken a hard stance
that in-place conversion will not be allowed for TDX/SNP due to the ease with
which a misbehaving userspace and/or guest can crash the host.

We'd also be betting that there would *never* be a use case for per-gfn 
attributes
for non-standard memory, e.g. virtio-gpu buffers, any kind of device memory, 
etc.

We'd also effectively be signing up to either support swap and page migration in
guest_memfd, or make those mutually exclusive with per-gfn attributes too.

guest_memfd is only intended for guest DRAM, and if I get my way, will never 
support
swap (page migration is less scary).  I.e. guest_memfd isn't intended to be a
one-size-fits-all solution, nor is it intended to wholesale replace memslots,
which is effectively what we'd be doing by deprecating hva-based guest memory.

And ignoring all that, the ABI would end up being rather bizarre due to the way 
guest_memfd
interacts with memslots.  guest_memfd itself has no real notion of gfns, i.e. 
the
shared vs. private state would be tied to a file offset, not a gfn.  That's a 
solvable
problem, e.g. we could make a gfn:offset binding "sticky", but that would add 
extra
complexity to the ABI, and AFAICT wouldn't buy us that much, if anything.

> Maybe can you sketch out how you see this propos

Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-10-31 Thread Sean Christopherson
On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > Extended guest_memfd to allow backing guest memory with transparent
> > hugepages. Require userspace to opt-in via a flag even though there's no
> > known/anticipated use case for forcing small pages as THP is optional,
> > i.e. to avoid ending up in a situation where userspace is unaware that
> > KVM can't provide hugepages.
> 
> Personally, it seems not so "transparent" if requiring userspace to opt-in.
> 
> People need to 1) check if the kernel built with TRANSPARENT_HUGEPAGE
> support, or check is the sysfs of transparent hugepage exists; 2)get the
> maximum support hugepage size 3) ensure the size satisfies the alignment;
> before opt-in it.
> 
> Even simpler, userspace can blindly try to create guest memfd with
> transparent hugepage flag. If getting error, fallback to create without the
> transparent hugepage flag.
> 
> However, it doesn't look transparent to me.

The "transparent" part is referring to the underlying kernel mechanism, it's not
saying anything about the API.  The "transparent" part of THP is that the kernel
doesn't guarantee hugepages, i.e. whether or not hugepages are actually used is
(mostly) transparent to userspace.

Paolo also isn't the biggest fan[*], but there are also downsides to always
allowing hugepages, e.g. silent failure due to lack of THP or unaligned size,
and there's precedent in the form of MADV_HUGEPAGE.

[*] https://lore.kernel.org/all/84a908ae-04c7-51c7-c9a8-119e1933a...@redhat.com


Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-10-31 Thread Sean Christopherson
On Tue, Oct 31, 2023, Chao Gao wrote:
> >+int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> >+{
> >+loff_t size = args->size;
> >+u64 flags = args->flags;
> >+u64 valid_flags = 0;
> >+
> >+if (flags & ~valid_flags)
> >+return -EINVAL;
> >+
> >+if (size < 0 || !PAGE_ALIGNED(size))
> >+return -EINVAL;
> 
> is size == 0 a valid case?

Nope, this is a bug.
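
Presumably the fix is just to also reject a zero size, e.g. (sketch, the actual
fixup may differ):

	if (size <= 0 || !PAGE_ALIGNED(size))
		return -EINVAL;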

> >+if (!xa_empty(&gmem->bindings) &&
> >+xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
> >+filemap_invalidate_unlock(inode->i_mapping);
> >+goto err;
> >+}
> >+
> >+/*
> >+ * No synchronize_rcu() needed, any in-flight readers are guaranteed to
> >+ * be see either a NULL file or this new file, no need for them to go
> >+ * away.
> >+ */
> >+rcu_assign_pointer(slot->gmem.file, file);
> >+slot->gmem.pgoff = start;
> >+
> >+xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
> >+filemap_invalidate_unlock(inode->i_mapping);
> >+
> >+/*
> >+ * Drop the reference to the file, even on success.  The file pins KVM,
> >+ * not the other way 'round.  Active bindings are invalidated if the
> >+ * file is closed before memslots are destroyed.
> >+ */
> >+fput(file);
> >+return 0;
> >+
> >+err:
> >+fput(file);
> >+return -EINVAL;
> 
> The cleanup, i.e., filemap_invalidate_unlock() and fput(), is common. So, I 
> think it
> may be slightly better to consolidate the common part e.g.,

I would prefer to keep this as-is.  Only goto needs the unlock, and I find it 
easier
to understand the success vs. error paths with explicit returns.  But it's not a
super strong preference.


Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

2023-10-31 Thread Sean Christopherson
On Tue, Oct 31, 2023, Xiaoyao Li wrote:
> On 10/28/2023 2:21 AM, Sean Christopherson wrote:
> > Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
> > information can be supplied without setting userspace up to fail.  The
> > padding in the new kvm_userspace_memory_region2 structure will be used to
> > pass a file descriptor in addition to the userspace_addr, i.e. allow
> > userspace to point at a file descriptor and map memory into a guest that
> > is NOT mapped into host userspace.
> > 
> > Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
> > without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
> > makes detection of bad flags a bit more robust, e.g. if the new fd field
> > is guarded only by a flag and not a new ioctl(), then a userspace bug
> > (setting a "bad" flag) would generate out-of-bounds access instead of an
> > -EINVAL error.
> > 
> > Cc: Jarkko Sakkinen 
> > Reviewed-by: Paolo Bonzini 
> > Reviewed-by: Xiaoyao Li 
> > Signed-off-by: Sean Christopherson 
> > ---
> >   Documentation/virt/kvm/api.rst | 21 +++
> >   arch/x86/kvm/x86.c |  2 +-
> >   include/linux/kvm_host.h   |  4 ++--
> >   include/uapi/linux/kvm.h   | 13 
> >   virt/kvm/kvm_main.c| 38 +++---
> >   5 files changed, 67 insertions(+), 11 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 21a7578142a1..ace984acc125 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6070,6 +6070,27 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers 
> > using the SET_ONE_REG
> >   interface. No error will be returned, but the resulting offset will not be
> >   applied.
> > +4.139 KVM_SET_USER_MEMORY_REGION2
> > +-
> > +
> > +:Capability: KVM_CAP_USER_MEMORY2
> > +:Architectures: all
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_userspace_memory_region2 (in)
> > +:Returns: 0 on success, -1 on error
> > +
> > +::
> > +
> > +  struct kvm_userspace_memory_region2 {
> > +   __u32 slot;
> > +   __u32 flags;
> > +   __u64 guest_phys_addr;
> > +   __u64 memory_size; /* bytes */
> > +   __u64 userspace_addr; /* start of the userspace allocated memory */
> 
> missing
> 
>   __u64 pad[16];

I can't even copy+paste correctly :-(


Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

2023-10-30 Thread Sean Christopherson
On Tue, Oct 31, 2023, Paolo Bonzini wrote:
> On 10/30/23 21:25, Sean Christopherson wrote:
> > > Probably worth adding a check on valid flags here.
> > 
> > Definitely needed.  There's a very real bug here.  But rather than 
> > duplicate flags
> > checking or plumb @ioctl all the way to __kvm_set_memory_region(), now that 
> > we
> > have the fancy guard(mutex) and there are no internal calls to 
> > kvm_set_memory_region(),
> > what if we:
> > 
> >1. Acquire/release slots_lock in __kvm_set_memory_region()
> >2. Call kvm_set_memory_region() from x86 code for the internal memslots
> >3. Disallow *any* flags for internal memslots
> >4. Open code check_memory_region_flags in 
> > kvm_vm_ioctl_set_memory_region()
> 
> I dislike this step, there is a clear point where all paths meet
> (ioctl/internal, locked/unlocked) and that's __kvm_set_memory_region().
> I think that's the place where flags should be checked.  (I don't mind
> the restriction on internal memslots; it's just that to me it's not a
> particularly natural way to structure the checks).

Yeah, I just don't like the discrepancy it causes where some flags are 
explicitly
checked and allowed, allowed and then later disallowed.

> On the other hand, the place where to protect from out-of-bounds
> accesses, is the place where you stop caring about struct
> kvm_userspace_memory_region vs kvm_userspace_memory_region2 (and
> your code gets it right, by dropping "ioctl" as soon as possible).
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 87f45aa91ced..fe5a2af14fff 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1635,6 +1635,14 @@ bool __weak kvm_arch_dirty_log_supported(struct kvm 
> *kvm)
>   return true;
>  }
> +/*
> + * Flags that do not access any of the extra space of struct
> + * kvm_userspace_memory_region2.  KVM_SET_USER_MEMORY_REGION_FLAGS
> + * only allows these.
> + */
> +#define KVM_SET_USER_MEMORY_REGION_FLAGS \

Can we name this KVM_SET_USER_MEMORY_REGION_LEGACY_FLAGS, or something equally
horrific?  As is, this sounds way too much like a generic "allowed flags for any
memory region".

Or maybe invert the macro?  I.e. something to make it more obvious that it's
effectively a versioning check, not a generic "what's supported?" check.

#define KVM_SET_USER_MEMORY_FLAGS_V2_ONLY \
(~(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY))


> + (KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
> +
>  static int check_memory_region_flags(struct kvm *kvm,
>const struct kvm_userspace_memory_region2 
> *mem)
>  {
> @@ -5149,10 +5149,16 @@ static long kvm_vm_ioctl(struct file *filp,
>   struct kvm_userspace_memory_region2 mem;
>   unsigned long size;
> - if (ioctl == KVM_SET_USER_MEMORY_REGION)
> + if (ioctl == KVM_SET_USER_MEMORY_REGION) {
> + /*
> +  * Fields beyond struct kvm_userspace_memory_region 
> shouldn't be
> +  * accessed, but avoid leaking kernel memory in case of 
> a bug.
> +  */
> + memset(&mem, 0, sizeof(mem));
>   size = sizeof(struct kvm_userspace_memory_region);
> - else
> + } else {
>   size = sizeof(struct kvm_userspace_memory_region2);
> + }
>   /* Ensure the common parts of the two structs are identical. */
>   SANITY_CHECK_MEM_REGION_FIELD(slot);
> @@ -5165,6 +5167,11 @@ static long kvm_vm_ioctl(struct file *filp,
>   if (copy_from_user(&mem, argp, size))
>   goto out;
> + r = -EINVAL;
> + if (ioctl == KVM_SET_USER_MEMORY_REGION &&
> + (mem->flags & ~KVM_SET_USER_MEMORY_REGION_FLAGS))
> + goto out;
> +
>   r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>   break;
>   }
> 
> 
> That's a kind of patch that you can't really get wrong (though I have
> the brown paper bag ready).
> 
> Maintainance-wise it's fine, since flags are being added at a pace of
> roughly one every five years,

Heh, true.

> and anyway it's also future proof: I placed the #define near
> check_memory_region_flags so that in five years we remember to keep it up to
> date.  But worst case, the new flags will only be allowed by
> KVM_SET_USER_MEMORY_REGION2 unnecessarily; there are no security issues
> waiting to bite us.
>
> In sum, this is exactly the only kind of fix that should be in the v13->v14
> delta.

Boiling the ocean can be fun too ;-)


Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

2023-10-30 Thread Sean Christopherson
On Mon, Oct 30, 2023, Sean Christopherson wrote:
> On Mon, Oct 30, 2023, Paolo Bonzini wrote:
> > On 10/27/23 20:21, Sean Christopherson wrote:
> > > 
> > > + if (ioctl == KVM_SET_USER_MEMORY_REGION)
> > > + size = sizeof(struct kvm_userspace_memory_region);
> > 
> > This also needs a memset(&mem, 0, sizeof(mem)), otherwise the out-of-bounds
> > access of the commit message becomes a kernel stack read.
> 
> Ouch.  There's some irony.  Might be worth doing memset(&mem, -1, sizeof(mem))
> though as '0' is a valid file descriptor and a valid file offset.
> 
> > Probably worth adding a check on valid flags here.
> 
> Definitely needed.  There's a very real bug here.  But rather than duplicate 
> flags
> checking or plumb @ioctl all the way to __kvm_set_memory_region(), now that we
> have the fancy guard(mutex) and there are no internal calls to 
> kvm_set_memory_region(),
> what if we:
> 
>   1. Acquire/release slots_lock in __kvm_set_memory_region()

Gah, this won't work with x86's usage, which acquire slots_lock quite far away
from __kvm_set_memory_region() in several places.  The rest of the idea still
works, kvm_vm_ioctl_set_memory_region() will just need to take slots_lock 
manually.

>   2. Call kvm_set_memory_region() from x86 code for the internal memslots
>   3. Disallow *any* flags for internal memslots
>   4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()
>   5. Pass @ioctl to kvm_vm_ioctl_set_memory_region() and allow KVM_MEM_PRIVATE
>  only for KVM_SET_USER_MEMORY_REGION2


Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-10-30 Thread Sean Christopherson
On Mon, Oct 30, 2023, Paolo Bonzini wrote:
> On 10/27/23 20:21, Sean Christopherson wrote:
> > @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t 
> > __kvm_handle_hva_range(struct kvm *kvm,
> >  * the second or later invocation of the handler).
> >  */
> > gfn_range.arg = range->arg;
> > +
> > +   /*
> > +* HVA-based notifications aren't relevant to private
> > +* mappings as they don't have a userspace mapping.
> 
> It's confusing who "they" is.  Maybe
> 
>* HVA-based notifications provide a userspace address,
>* and as such are only relevant for shared mappings.

Works for me.


Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes

2023-10-30 Thread Sean Christopherson
On Mon, Oct 30, 2023, Sean Christopherson wrote:
> On Mon, Oct 30, 2023, Chao Gao wrote:
> > On Fri, Oct 27, 2023 at 11:21:55AM -0700, Sean Christopherson wrote:
> > >From: Chao Peng 
> > >
> > >In confidential computing usages, whether a page is private or shared is
> > >necessary information for KVM to perform operations like page fault
> > >handling, page zapping etc. There are other potential use cases for
> > >per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > >or exec-only, etc.) without having to modify memslots.
> > >
> > >Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > >userspace to operate on the per-page memory attributes.
> > >  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > >a guest memory range.
> > 
> > >  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > >memory attributes.
> > 
> > This ioctl() is already removed. So, the changelog is out-of-date and needs
> > an update.
> 
> Doh, I lost track of this and the fixup for KVM_CAP_MEMORY_ATTRIBUTES below.
> 
> > >+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > >+:Architectures: x86
> > >+:Type: vm ioctl
> > >+:Parameters: struct kvm_memory_attributes(in)
> > 
> >^ add one space here?
> 
> Ah, yeah, that does appear to be the standard.
> > 
> > 
> > >+static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
> > >+struct kvm_gfn_range *range)
> > >+{
> > >+  /*
> > >+   * Unconditionally add the range to the invalidation set, regardless of
> > >+   * whether or not the arch callback actually needs to zap SPTEs.  E.g.
> > >+   * if KVM supports RWX attributes in the future and the attributes are
> > >+   * going from R=>RW, zapping isn't strictly necessary.  Unconditionally
> > >+   * adding the range allows KVM to require that MMU invalidations add at
> > >+   * least one range between begin() and end(), e.g. allows KVM to detect
> > >+   * bugs where the add() is missed.  Rexlaing the rule *might* be safe,
> > 
> >  Relaxing
> > 
> > >@@ -4640,6 +4850,17 @@ static int 
> > >kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > >   case KVM_CAP_BINARY_STATS_FD:
> > >   case KVM_CAP_SYSTEM_EVENT_DATA:
> > >   return 1;
> > >+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> > >+  case KVM_CAP_MEMORY_ATTRIBUTES:
> > >+  u64 attrs = kvm_supported_mem_attributes(kvm);
> > >+
> > >+  r = -EFAULT;
> > >+  if (copy_to_user(argp, &attrs, sizeof(attrs)))
> > >+  goto out;
> > >+  r = 0;
> > >+  break;
> > 
> > This cannot work, e.g., no @argp in this function and is fixed by a later 
> > commit:
> > 
> > fcbef1e5e5d2 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for 
> > guest-specific backing memory")
> 
> I'll post a fixup patch for all of these, thanks much!

Heh, that was an -ENOCOFFEE.  Fixup patches for a changelog goof and an 
ephemeral
bug are going to be hard to post.

Paolo, do you want to take care of all of these fixups and typos, or would you
prefer that I start a v14 branch and then hand it off to you at some point?


Re: [PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

2023-10-30 Thread Sean Christopherson
On Mon, Oct 30, 2023, Paolo Bonzini wrote:
> On 10/27/23 20:21, Sean Christopherson wrote:
> > 
> > +   if (ioctl == KVM_SET_USER_MEMORY_REGION)
> > +   size = sizeof(struct kvm_userspace_memory_region);
> 
> This also needs a memset(&mem, 0, sizeof(mem)), otherwise the out-of-bounds
> access of the commit message becomes a kernel stack read.

Ouch.  There's some irony.  Might be worth doing memset(&mem, -1, sizeof(mem))
though as '0' is a valid file descriptor and a valid file offset.

> Probably worth adding a check on valid flags here.

Definitely needed.  There's a very real bug here.  But rather than duplicate 
flags
checking or plumb @ioctl all the way to __kvm_set_memory_region(), now that we
have the fancy guard(mutex) and there are no internal calls to 
kvm_set_memory_region(),
what if we:

  1. Acquire/release slots_lock in __kvm_set_memory_region()
  2. Call kvm_set_memory_region() from x86 code for the internal memslots
  3. Disallow *any* flags for internal memslots
  4. Open code check_memory_region_flags in kvm_vm_ioctl_set_memory_region()
  5. Pass @ioctl to kvm_vm_ioctl_set_memory_region() and allow KVM_MEM_PRIVATE
 only for KVM_SET_USER_MEMORY_REGION2

E.g. this over ~5 patches

---
 arch/x86/kvm/x86.c   |  2 +-
 include/linux/kvm_host.h |  4 +--
 virt/kvm/kvm_main.c  | 65 +---
 3 files changed, 29 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e3eb608b6692..dd3e2017366c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12478,7 +12478,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, 
int id, gpa_t gpa,
m.guest_phys_addr = gpa;
m.userspace_addr = hva;
m.memory_size = size;
-   r = __kvm_set_memory_region(kvm, &m);
+   r = kvm_set_memory_region(kvm, &m);
if (r < 0)
return ERR_PTR_USR(r);
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 687589ce9f63..fbb98efe8200 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1170,7 +1170,7 @@ static inline bool kvm_memslot_iter_is_valid(struct 
kvm_memslot_iter *iter, gfn_
  *   -- just change its flags
  *
  * Since flags can be changed by some of these operations, the following
- * differentiation is the best we can do for __kvm_set_memory_region():
+ * differentiation is the best we can do for __kvm_set_memory_region().
  */
 enum kvm_mr_change {
KVM_MR_CREATE,
@@ -1181,8 +1181,6 @@ enum kvm_mr_change {
 
 int kvm_set_memory_region(struct kvm *kvm,
  const struct kvm_userspace_memory_region2 *mem);
-int __kvm_set_memory_region(struct kvm *kvm,
-   const struct kvm_userspace_memory_region2 *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 23633984142f..39ceee2f67f2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1608,28 +1608,6 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
 }
 
-static int check_memory_region_flags(struct kvm *kvm,
-const struct kvm_userspace_memory_region2 
*mem)
-{
-   u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
-
-   if (kvm_arch_has_private_mem(kvm))
-   valid_flags |= KVM_MEM_PRIVATE;
-
-   /* Dirty logging private memory is not currently supported. */
-   if (mem->flags & KVM_MEM_PRIVATE)
-   valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
-
-#ifdef __KVM_HAVE_READONLY_MEM
-   valid_flags |= KVM_MEM_READONLY;
-#endif
-
-   if (mem->flags & ~valid_flags)
-   return -EINVAL;
-
-   return 0;
-}
-
 static void kvm_swap_active_memslots(struct kvm *kvm, int as_id)
 {
struct kvm_memslots *slots = kvm_get_inactive_memslots(kvm, as_id);
@@ -2014,11 +1992,9 @@ static bool kvm_check_memslot_overlap(struct 
kvm_memslots *slots, int id,
  * space.
  *
  * Discontiguous memory is allowed, mostly for framebuffers.
- *
- * Must be called holding kvm->slots_lock for write.
  */
-int __kvm_set_memory_region(struct kvm *kvm,
-   const struct kvm_userspace_memory_region2 *mem)
+static int __kvm_set_memory_region(struct kvm *kvm,
+  const struct kvm_userspace_memory_region2 
*mem)
 {
struct kvm_memory_slot *old, *new;
struct kvm_memslots *slots;
@@ -2028,9 +2004,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
int as_id, id;
int r;
 
-   r = check_memory_region_flags(kvm, mem);
-   if (r)
-   return r;
+   guard(mutex)(&kvm->slots_lock);
 
as_id = mem->slot >> 16;
id = (u16)mem->s
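
(The rest of the diff is cut off by the archive.  Purely as an illustration of
points 3-5 above, not the actual remainder of the patch, the ioctl-handler side
could end up looking something along these lines:)

	/* Sketch only: open-coded flags check, with KVM_MEM_PRIVATE allowed
	 * solely when the memslot comes in via KVM_SET_USER_MEMORY_REGION2. */
	static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm, unsigned int ioctl,
						  struct kvm_userspace_memory_region2 *mem)
	{
		u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

		/* KVM_MEM_READONLY would be OR'ed in under __KVM_HAVE_READONLY_MEM. */
		if (ioctl == KVM_SET_USER_MEMORY_REGION2 && kvm_arch_has_private_mem(kvm))
			valid_flags |= KVM_MEM_PRIVATE;

		/* Dirty logging of private memory is not supported. */
		if (mem->flags & KVM_MEM_PRIVATE)
			valid_flags &= ~KVM_MEM_LOG_DIRTY_PAGES;

		if (mem->flags & ~valid_flags)
			return -EINVAL;

		return __kvm_set_memory_region(kvm, mem);
	}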

Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes

2023-10-30 Thread Sean Christopherson
On Mon, Oct 30, 2023, Chao Gao wrote:
> On Fri, Oct 27, 2023 at 11:21:55AM -0700, Sean Christopherson wrote:
> >From: Chao Peng 
> >
> >In confidential computing usages, whether a page is private or shared is
> >necessary information for KVM to perform operations like page fault
> >handling, page zapping etc. There are other potential use cases for
> >per-page memory attributes, e.g. to make memory read-only (or no-exec,
> >or exec-only, etc.) without having to modify memslots.
> >
> >Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> >userspace to operate on the per-page memory attributes.
> >  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> >a guest memory range.
> 
> >  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> >memory attributes.
> 
> This ioctl() is already removed. So, the changelog is out-of-date and needs
> an update.

Doh, I lost track of this and the fixup for KVM_CAP_MEMORY_ATTRIBUTES below.

> >+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> >+:Architectures: x86
> >+:Type: vm ioctl
> >+:Parameters: struct kvm_memory_attributes(in)
> 
>  ^ add one space here?

Ah, yeah, that does appear to be the standard.
> 
> 
> >+static bool kvm_pre_set_memory_attributes(struct kvm *kvm,
> >+  struct kvm_gfn_range *range)
> >+{
> >+/*
> >+ * Unconditionally add the range to the invalidation set, regardless of
> >+ * whether or not the arch callback actually needs to zap SPTEs.  E.g.
> >+ * if KVM supports RWX attributes in the future and the attributes are
> >+ * going from R=>RW, zapping isn't strictly necessary.  Unconditionally
> >+ * adding the range allows KVM to require that MMU invalidations add at
> >+ * least one range between begin() and end(), e.g. allows KVM to detect
> >+ * bugs where the add() is missed.  Rexlaing the rule *might* be safe,
> 
>    Relaxing
> 
> >@@ -4640,6 +4850,17 @@ static int 
> >kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> > case KVM_CAP_BINARY_STATS_FD:
> > case KVM_CAP_SYSTEM_EVENT_DATA:
> > return 1;
> >+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> >+case KVM_CAP_MEMORY_ATTRIBUTES:
> >+u64 attrs = kvm_supported_mem_attributes(kvm);
> >+
> >+r = -EFAULT;
> >+if (copy_to_user(argp, &attrs, sizeof(attrs)))
> >+goto out;
> >+r = 0;
> >+break;
> 
> This cannot work, e.g., no @argp in this function and is fixed by a later 
> commit:
> 
>   fcbef1e5e5d2 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for 
> guest-specific backing memory")

I'll post a fixup patch for all of these, thanks much!
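
FWIW, one way to avoid needing @argp at all (a sketch, not the actual fixup) is
to return the bitmap straight from the check-extension path, since the supported
attributes comfortably fit in the ioctl's return value today:

	case KVM_CAP_MEMORY_ATTRIBUTES:
		return kvm_supported_mem_attributes(kvm);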


[PATCH v13 35/35] KVM: selftests: Test KVM exit behavior for private memory/access

2023-10-27 Thread Sean Christopherson
From: Ackerley Tng 

"Testing private access when memslot gets deleted" tests the behavior
of KVM when a private memslot gets deleted while the VM is using the
private memslot. When KVM looks up the deleted (slot = NULL) memslot,
KVM should exit to userspace with KVM_EXIT_MEMORY_FAULT.

In the second test, upon a private access to non-private memslot, KVM
should also exit to userspace with KVM_EXIT_MEMORY_FAULT.

Intentionally don't take a requirement on KVM_CAP_GUEST_MEMFD,
KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_ATTRIBUTE_PRIVATE, etc., as it's a
KVM bug to advertise KVM_X86_SW_PROTECTED_VM without its prerequisites.

Signed-off-by: Ackerley Tng 
[sean: call out the similarities with set_memory_region_test]
Signed-off-by: Sean Christopherson 
---
 tools/testing/selftests/kvm/Makefile  |   1 +
 .../kvm/x86_64/private_mem_kvm_exits_test.c   | 120 ++
 2 files changed, 121 insertions(+)
 create mode 100644 
tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index 2b1ef809d73a..f7fdd8244547 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -82,6 +82,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
 TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c 
b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
new file mode 100644
index ..7f7ca4475745
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#include 
+#include 
+#include 
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+
+/* Arbitrarily selected to avoid overlaps with anything else */
+#define EXITS_TEST_GVA 0xc000
+#define EXITS_TEST_GPA EXITS_TEST_GVA
+#define EXITS_TEST_NPAGES 1
+#define EXITS_TEST_SIZE (EXITS_TEST_NPAGES * PAGE_SIZE)
+#define EXITS_TEST_SLOT 10
+
+static uint64_t guest_repeatedly_read(void)
+{
+   volatile uint64_t value;
+
+   while (true)
+   value = *((uint64_t *) EXITS_TEST_GVA);
+
+   return value;
+}
+
+static uint32_t run_vcpu_get_exit_reason(struct kvm_vcpu *vcpu)
+{
+   int r;
+
+   r = _vcpu_run(vcpu);
+   if (r) {
+   TEST_ASSERT(errno == EFAULT, KVM_IOCTL_ERROR(KVM_RUN, r));
+   TEST_ASSERT_EQ(vcpu->run->exit_reason, KVM_EXIT_MEMORY_FAULT);
+   }
+   return vcpu->run->exit_reason;
+}
+
+const struct vm_shape protected_vm_shape = {
+   .mode = VM_MODE_DEFAULT,
+   .type = KVM_X86_SW_PROTECTED_VM,
+};
+
+static void test_private_access_memslot_deleted(void)
+{
+   struct kvm_vm *vm;
+   struct kvm_vcpu *vcpu;
+   pthread_t vm_thread;
+   void *thread_return;
+   uint32_t exit_reason;
+
+   vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+  guest_repeatedly_read);
+
+   vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+   EXITS_TEST_GPA, EXITS_TEST_SLOT,
+   EXITS_TEST_NPAGES,
+   KVM_MEM_PRIVATE);
+
+   virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
+
+   /* Request to access page privately */
+   vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+
+   pthread_create(&vm_thread, NULL,
+  (void *(*)(void *))run_vcpu_get_exit_reason,
+  (void *)vcpu);
+
+   vm_mem_region_delete(vm, EXITS_TEST_SLOT);
+
+   pthread_join(vm_thread, &thread_return);
+   exit_reason = (uint32_t)(uint64_t)thread_return;
+
+   TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+   TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, 
KVM_MEMORY_EXIT_FLAG_PRIVATE);
+   TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+   TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+
+   kvm_vm_free(vm);
+}
+
+static void test_private_access_memslot_not_private(void)
+{
+   struct kvm_vm *vm;
+   struct kvm_vcpu *vcpu;
+   uint32_t exit_reason;
+
+   vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+  guest_repeatedly_read);
+
+   /* Add a non-private memslot (flags = 0) */
+   vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOU
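
(The diff is truncated by the archive.  Based purely on the changelog above,
not on the missing patch text, the non-private-memslot case is expected to wrap
up roughly like this sketch:)

	/* Illustration only: a private access to a memslot created without
	 * KVM_MEM_PRIVATE should exit to userspace with KVM_EXIT_MEMORY_FAULT
	 * and the PRIVATE flag set. */
	virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
	vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);

	exit_reason = run_vcpu_get_exit_reason(vcpu);

	TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
	TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);

	kvm_vm_free(vm);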

[PATCH v13 34/35] KVM: selftests: Add basic selftest for guest_memfd()

2023-10-27 Thread Sean Christopherson
From: Chao Peng 

Add a selftest to verify the basic functionality of guest_memfd():

+ file descriptor created with the guest_memfd() ioctl does not allow
  read/write/mmap operations
+ file size and block size as returned from fstat are as expected
+ fallocate on the fd checks that offset/length on
  fallocate(FALLOC_FL_PUNCH_HOLE) should be page aligned
+ invalid inputs (misaligned size, invalid flags) are rejected
+ file size and inode are unique (the innocuous-sounding
  anon_inode_getfile() backs all files with a single inode...)

Signed-off-by: Chao Peng 
Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
Co-developed-by: Paolo Bonzini 
Signed-off-by: Paolo Bonzini 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 tools/testing/selftests/kvm/Makefile  |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 221 ++
 2 files changed, 222 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index b709a52d5cdb..2b1ef809d73a 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -124,6 +124,7 @@ TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += guest_memfd_test
 TEST_GEN_PROGS_x86_64 += guest_print_test
 TEST_GEN_PROGS_x86_64 += hardware_disable_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c 
b/tools/testing/selftests/kvm/guest_memfd_test.c
new file mode 100644
index ..c15de9852316
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -0,0 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright Intel Corporation, 2023
+ *
+ * Author: Chao Peng 
+ */
+
+#define _GNU_SOURCE
+#include "test_util.h"
+#include "kvm_util_base.h"
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static void test_file_read_write(int fd)
+{
+   char buf[64];
+
+   TEST_ASSERT(read(fd, buf, sizeof(buf)) < 0,
+   "read on a guest_mem fd should fail");
+   TEST_ASSERT(write(fd, buf, sizeof(buf)) < 0,
+   "write on a guest_mem fd should fail");
+   TEST_ASSERT(pread(fd, buf, sizeof(buf), 0) < 0,
+   "pread on a guest_mem fd should fail");
+   TEST_ASSERT(pwrite(fd, buf, sizeof(buf), 0) < 0,
+   "pwrite on a guest_mem fd should fail");
+}
+
+static void test_mmap(int fd, size_t page_size)
+{
+   char *mem;
+
+   mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+   TEST_ASSERT_EQ(mem, MAP_FAILED);
+}
+
+static void test_file_size(int fd, size_t page_size, size_t total_size)
+{
+   struct stat sb;
+   int ret;
+
+   ret = fstat(fd, &sb);
+   TEST_ASSERT(!ret, "fstat should succeed");
+   TEST_ASSERT_EQ(sb.st_size, total_size);
+   TEST_ASSERT_EQ(sb.st_blksize, page_size);
+}
+
+static void test_fallocate(int fd, size_t page_size, size_t total_size)
+{
+   int ret;
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, total_size);
+   TEST_ASSERT(!ret, "fallocate with aligned offset and size should 
succeed");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   page_size - 1, page_size);
+   TEST_ASSERT(ret, "fallocate with unaligned offset should fail");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size, page_size);
+   TEST_ASSERT(ret, "fallocate beginning at total_size should fail");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + page_size, 
page_size);
+   TEST_ASSERT(ret, "fallocate beginning after total_size should fail");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   total_size, page_size);
+   TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) at total_size should succeed");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   total_size + page_size, page_size);
+   TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) after total_size should 
succeed");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   page_size, page_size - 1);
+   TEST_ASSERT(ret, "fallocate with unaligned size should fail");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   page_size, page_size);
+   TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) with aligned offset and size 
should succeed");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SI
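
(The patch is cut off by the archive.  As an illustration of the "file size and
inode are unique" check called out in the changelog, the idea is something like
the sketch below; the VM handle and page_size are placeholders:)

	static void test_guest_memfd_inodes(struct kvm_vm *vm, size_t page_size)
	{
		struct stat st1, st2;
		int fd1 = vm_create_guest_memfd(vm, page_size, 0);
		int fd2 = vm_create_guest_memfd(vm, page_size, 0);

		TEST_ASSERT(!fstat(fd1, &st1) && !fstat(fd2, &st2), "fstat should succeed");
		/* anon_inode_getfile() would back every fd with one shared inode. */
		TEST_ASSERT(st1.st_ino != st2.st_ino,
			    "guest_memfd files should have unique inodes");

		close(fd2);
		close(fd1);
	}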

[PATCH v13 33/35] KVM: selftests: Expand set_memory_region_test to validate guest_memfd()

2023-10-27 Thread Sean Christopherson
From: Chao Peng 

Expand set_memory_region_test to exercise various positive and negative
testcases for private memory.

 - Non-guest_memfd() file descriptor for private memory
 - guest_memfd() from different VM
 - Overlapping bindings
 - Unaligned bindings

Signed-off-by: Chao Peng 
Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
[sean: trim the testcases to remove duplicate coverage]
Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/kvm_util_base.h |  10 ++
 .../selftests/kvm/set_memory_region_test.c| 100 ++
 2 files changed, 110 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 8ec122f5fcc8..e4d2cd9218b2 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -819,6 +819,16 @@ static inline struct kvm_vm *vm_create_barebones(void)
return vm_create(VM_SHAPE_DEFAULT);
 }
 
+static inline struct kvm_vm *vm_create_barebones_protected_vm(void)
+{
+   const struct vm_shape shape = {
+   .mode = VM_MODE_DEFAULT,
+   .type = KVM_X86_SW_PROTECTED_VM,
+   };
+
+   return vm_create(shape);
+}
+
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
 {
return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0);
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c 
b/tools/testing/selftests/kvm/set_memory_region_test.c
index b32960189f5f..ca83e3307a98 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -385,6 +385,98 @@ static void test_add_max_memory_regions(void)
kvm_vm_free(vm);
 }
 
+
+static void test_invalid_guest_memfd(struct kvm_vm *vm, int memfd,
+size_t offset, const char *msg)
+{
+   int r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, 
KVM_MEM_PRIVATE,
+MEM_REGION_GPA, MEM_REGION_SIZE,
+0, memfd, offset);
+   TEST_ASSERT(r == -1 && errno == EINVAL, "%s", msg);
+}
+
+static void test_add_private_memory_region(void)
+{
+   struct kvm_vm *vm, *vm2;
+   int memfd, i;
+
+   pr_info("Testing ADD of KVM_MEM_PRIVATE memory regions\n");
+
+   vm = vm_create_barebones_protected_vm();
+
+   test_invalid_guest_memfd(vm, vm->kvm_fd, 0, "KVM fd should fail");
+   test_invalid_guest_memfd(vm, vm->fd, 0, "VM's fd should fail");
+
+   memfd = kvm_memfd_alloc(MEM_REGION_SIZE, false);
+   test_invalid_guest_memfd(vm, memfd, 0, "Regular memfd() should fail");
+   close(memfd);
+
+   vm2 = vm_create_barebones_protected_vm();
+   memfd = vm_create_guest_memfd(vm2, MEM_REGION_SIZE, 0);
+   test_invalid_guest_memfd(vm, memfd, 0, "Other VM's guest_memfd() should 
fail");
+
+   vm_set_user_memory_region2(vm2, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+  MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 
0);
+   close(memfd);
+   kvm_vm_free(vm2);
+
+   memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE, 0);
+   for (i = 1; i < PAGE_SIZE; i++)
+   test_invalid_guest_memfd(vm, memfd, i, "Unaligned offset should 
fail");
+
+   vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+  MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 
0);
+   close(memfd);
+
+   kvm_vm_free(vm);
+}
+
+static void test_add_overlapping_private_memory_regions(void)
+{
+   struct kvm_vm *vm;
+   int memfd;
+   int r;
+
+   pr_info("Testing ADD of overlapping KVM_MEM_PRIVATE memory regions\n");
+
+   vm = vm_create_barebones_protected_vm();
+
+   memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE * 4, 0);
+
+   vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+  MEM_REGION_GPA, MEM_REGION_SIZE * 2, 0, 
memfd, 0);
+
+   vm_set_user_memory_region2(vm, MEM_REGION_SLOT + 1, KVM_MEM_PRIVATE,
+  MEM_REGION_GPA * 2, MEM_REGION_SIZE * 2,
+  0, memfd, MEM_REGION_SIZE * 2);
+
+   /*
+* Delete the first memslot, and then attempt to recreate it except
+* with a "bad" offset that results in overlap in the guest_memfd().
+*/
+   vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+  MEM_REGION_GPA, 0, NULL, -1, 0);
+
+   /* Overlap the front half of the other slot. */
+   r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+MEM_REGION_GPA * 2 - MEM_REGION_SIZE,
+MEM_REGION_SIZE * 2,

[PATCH v13 32/35] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper

2023-10-27 Thread Sean Christopherson
From: Chao Peng 

Add helpers to invoke KVM_SET_USER_MEMORY_REGION2 directly so that tests
can validate features that are unique to "version 2" of "set user
memory region", e.g. do negative testing on gmem_fd and gmem_offset.

Provide a raw version as well as an assert-success version to reduce
the amount of boilerplate code needed for basic usage.

Signed-off-by: Chao Peng 
Signed-off-by: Ackerley Tng 
Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/kvm_util_base.h |  7 +
 tools/testing/selftests/kvm/lib/kvm_util.c| 29 +++
 2 files changed, 36 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 157508c071f3..8ec122f5fcc8 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -522,6 +522,13 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t 
slot, uint32_t flags,
   uint64_t gpa, uint64_t size, void *hva);
 int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
uint64_t gpa, uint64_t size, void *hva);
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
+   uint64_t gpa, uint64_t size, void *hva,
+   uint32_t guest_memfd, uint64_t 
guest_memfd_offset);
+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
+uint64_t gpa, uint64_t size, void *hva,
+uint32_t guest_memfd, uint64_t 
guest_memfd_offset);
+
 void vm_userspace_mem_region_add(struct kvm_vm *vm,
enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 52b131e3aca5..1620452c1cf7 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -873,6 +873,35 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t 
slot, uint32_t flags,
errno, strerror(errno));
 }
 
+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
+uint64_t gpa, uint64_t size, void *hva,
+uint32_t guest_memfd, uint64_t 
guest_memfd_offset)
+{
+   struct kvm_userspace_memory_region2 region = {
+   .slot = slot,
+   .flags = flags,
+   .guest_phys_addr = gpa,
+   .memory_size = size,
+   .userspace_addr = (uintptr_t)hva,
+   .guest_memfd = guest_memfd,
+   .guest_memfd_offset = guest_memfd_offset,
+   };
+
+   return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION2, &region);
+}
+
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
+   uint64_t gpa, uint64_t size, void *hva,
+   uint32_t guest_memfd, uint64_t 
guest_memfd_offset)
+{
+   int ret = __vm_set_user_memory_region2(vm, slot, flags, gpa, size, hva,
+  guest_memfd, guest_memfd_offset);
+
+   TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed, errno = %d (%s)",
+   errno, strerror(errno));
+}
+
+
 /* FIXME: This thing needs to be ripped apart and rewritten. */
 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
-- 
2.42.0.820.g83a721a137-goog



[PATCH v13 31/35] KVM: selftests: Add x86-only selftest for private memory conversions

2023-10-27 Thread Sean Christopherson
From: Vishal Annapurve 

Add a selftest to exercise implicit/explicit conversion functionality
within KVM and verify:

 - Shared memory is visible to host userspace
 - Private memory is not visible to host userspace
 - Host userspace and guest can communicate over shared memory
 - Data in shared backing is preserved across conversions (test's
   host userspace doesn't free the data)
 - Private memory is bound to the lifetime of the VM

Ideally, KVM's selftests infrastructure would be reworked to allow backing
a single region of guest memory with multiple memslots for _all_ backing
types and shapes, i.e. ideally the code for using a single backing fd
across multiple memslots would work for "regular" memory as well.  But
sadly, support for KVM_CREATE_GUEST_MEMFD has languished for far too long,
and overhauling selftests' memslots infrastructure would likely open a can
of worms, i.e. delay things even further.

In addition to the more obvious tests, verify that PUNCH_HOLE actually
frees memory.  Directly verifying that KVM frees memory is impractical, if
it's even possible, so instead indirectly verify memory is freed by
asserting that the guest reads zeroes after a PUNCH_HOLE.  E.g. if KVM
zaps SPTEs but doesn't actually punch a hole in the inode, the subsequent
read will still see the previous value.  And obviously punching a hole
shouldn't cause explosions.

Let the user specify the number of memslots in the private mem conversion
test, i.e. don't require the number of memslots to be '1' or "nr_vcpus".
Creating more memslots than vCPUs is particularly interesting, e.g. it can
result in a single KVM_SET_MEMORY_ATTRIBUTES spanning multiple memslots.
To keep the math reasonable, align each vCPU's chunk to at least 2MiB (the
size is 2MiB+4KiB), and require the total size to be cleanly divisible by
the number of memslots.  The goal is to be able to validate that KVM plays
nice with multiple memslots, being able to create a truly arbitrary number
of memslots doesn't add meaningful value, i.e. isn't worth the cost.

Intentionally don't take a requirement on KVM_CAP_GUEST_MEMFD,
KVM_CAP_MEMORY_FAULT_INFO, KVM_MEMORY_ATTRIBUTE_PRIVATE, etc., as it's a
KVM bug to advertise KVM_X86_SW_PROTECTED_VM without its prerequisites.

Signed-off-by: Vishal Annapurve 
Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 tools/testing/selftests/kvm/Makefile  |   1 +
 .../kvm/x86_64/private_mem_conversions_test.c | 487 ++
 2 files changed, 488 insertions(+)
 create mode 100644 
tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index a3bb36fb3cfc..b709a52d5cdb 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -81,6 +81,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test
 TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c 
b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
new file mode 100644
index ..be311944e90a
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -0,0 +1,487 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#define BASE_DATA_SLOT 10
+#define BASE_DATA_GPA  ((uint64_t)(1ull << 32))
+#define PER_CPU_DATA_SIZE  ((uint64_t)(SZ_2M + PAGE_SIZE))
+
+/* Horrific macro so that the line info is captured accurately :-( */
+#define memcmp_g(gpa, pattern,  size)  
\
+do {   
\
+   uint8_t *mem = (uint8_t *)gpa;  
\
+   size_t i;   
\
+   
\
+   for (i = 0; i < size; i++)  
\
+   __GUEST_ASSERT(mem[i] == pattern,   
\
+  "Guest expected 0x%

[PATCH v13 30/35] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data

2023-10-27 Thread Sean Christopherson
Add GUEST_SYNC[1-6]() so that tests can pass the maximum amount of
information supported via ucall(), without needing to resort to shared
memory.

Signed-off-by: Sean Christopherson 
---
 tools/testing/selftests/kvm/include/ucall_common.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/ucall_common.h 
b/tools/testing/selftests/kvm/include/ucall_common.h
index ce33d306c2cb..0fb472a5a058 100644
--- a/tools/testing/selftests/kvm/include/ucall_common.h
+++ b/tools/testing/selftests/kvm/include/ucall_common.h
@@ -52,6 +52,17 @@ int ucall_nr_pages_required(uint64_t page_size);
 #define GUEST_SYNC_ARGS(stage, arg1, arg2, arg3, arg4) \
ucall(UCALL_SYNC, 6, "hello", stage, arg1, 
arg2, arg3, arg4)
 #define GUEST_SYNC(stage)  ucall(UCALL_SYNC, 2, "hello", stage)
+#define GUEST_SYNC1(arg0)  ucall(UCALL_SYNC, 1, arg0)
+#define GUEST_SYNC2(arg0, arg1)ucall(UCALL_SYNC, 2, arg0, arg1)
+#define GUEST_SYNC3(arg0, arg1, arg2) \
+   ucall(UCALL_SYNC, 3, arg0, arg1, arg2)
+#define GUEST_SYNC4(arg0, arg1, arg2, arg3) \
+   ucall(UCALL_SYNC, 4, arg0, arg1, arg2, arg3)
+#define GUEST_SYNC5(arg0, arg1, arg2, arg3, arg4) \
+   ucall(UCALL_SYNC, 5, arg0, arg1, arg2, arg3, 
arg4)
+#define GUEST_SYNC6(arg0, arg1, arg2, arg3, arg4, arg5) \
+   ucall(UCALL_SYNC, 6, arg0, arg1, arg2, arg3, 
arg4, arg5)
+
 #define GUEST_PRINTF(_fmt, _args...) ucall_fmt(UCALL_PRINTF, _fmt, ##_args)
 #define GUEST_DONE()   ucall(UCALL_DONE, 0)
 
-- 
2.42.0.820.g83a721a137-goog
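
Hypothetical usage, just to show the plumbing (gpa/size/stage/flags and the
expected_* values are placeholders, not taken from the series):

	/* Guest: report a GPA range plus a stage marker in a single sync point. */
	GUEST_SYNC4(gpa, size, stage, flags);

	/* Host: the values come back as ucall arguments. */
	struct ucall uc;

	TEST_ASSERT_EQ(get_ucall(vcpu, &uc), UCALL_SYNC);
	TEST_ASSERT_EQ(uc.args[0], expected_gpa);
	TEST_ASSERT_EQ(uc.args[1], expected_size);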



[PATCH v13 29/35] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type

2023-10-27 Thread Sean Christopherson
Add a "vm_shape" structure to encapsulate the selftests-defined "mode",
along with the KVM-defined "type" for use when creating a new VM.  "mode"
tracks physical and virtual address properties, as well as the preferred
backing memory type, while "type" corresponds to the VM type.

Taking the VM type will allow adding tests for KVM_CREATE_GUEST_MEMFD,
a.k.a. guest private memory, without needing an entirely separate set of
helpers.  Guest private memory is effectively usable only by confidential
VM types, and it's expected that x86 will double down and require unique
VM types for TDX and SNP guests.

Signed-off-by: Sean Christopherson 
---
 tools/testing/selftests/kvm/dirty_log_test.c  |  2 +-
 .../selftests/kvm/include/kvm_util_base.h | 54 +++
 .../selftests/kvm/kvm_page_table_test.c   |  2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c| 43 +++
 tools/testing/selftests/kvm/lib/memstress.c   |  3 +-
 .../kvm/x86_64/ucna_injection_test.c  |  2 +-
 6 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index 936f3a8d1b83..6cbecf499767 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -699,7 +699,7 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, 
struct kvm_vcpu **vcpu,
 
pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode));
 
-   vm = __vm_create(mode, 1, extra_mem_pages);
+   vm = __vm_create(VM_SHAPE(mode), 1, extra_mem_pages);
 
log_mode_create_vm_done(vm);
*vcpu = vm_vcpu_add(vm, 0, guest_code);
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 1441fca6c273..157508c071f3 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -188,6 +188,23 @@ enum vm_guest_mode {
NUM_VM_MODES,
 };
 
+struct vm_shape {
+   enum vm_guest_mode mode;
+   unsigned int type;
+};
+
+#define VM_TYPE_DEFAULT0
+
+#define VM_SHAPE(__mode)   \
+({ \
+   struct vm_shape shape = {   \
+   .mode = (__mode),   \
+   .type = VM_TYPE_DEFAULT \
+   };  \
+   \
+   shape;  \
+})
+
 #if defined(__aarch64__)
 
 extern enum vm_guest_mode vm_mode_default;
@@ -220,6 +237,8 @@ extern enum vm_guest_mode vm_mode_default;
 
 #endif
 
+#define VM_SHAPE_DEFAULT   VM_SHAPE(VM_MODE_DEFAULT)
+
 #define MIN_PAGE_SIZE  (1U << MIN_PAGE_SHIFT)
 #define PTES_PER_MIN_PAGE  ptes_per_page(MIN_PAGE_SIZE)
 
@@ -784,21 +803,21 @@ vm_paddr_t vm_alloc_page_table(struct kvm_vm *vm);
  * __vm_create() does NOT create vCPUs, @nr_runnable_vcpus is used purely to
  * calculate the amount of memory needed for per-vCPU data, e.g. stacks.
  */
-struct kvm_vm *vm_create(enum vm_guest_mode mode);
-struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
+struct kvm_vm *vm_create(struct vm_shape shape);
+struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
   uint64_t nr_extra_pages);
 
 static inline struct kvm_vm *vm_create_barebones(void)
 {
-   return vm_create(VM_MODE_DEFAULT);
+   return vm_create(VM_SHAPE_DEFAULT);
 }
 
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
 {
-   return __vm_create(VM_MODE_DEFAULT, nr_runnable_vcpus, 0);
+   return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0);
 }
 
-struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t 
nr_vcpus,
+struct kvm_vm *__vm_create_with_vcpus(struct vm_shape shape, uint32_t nr_vcpus,
  uint64_t extra_mem_pages,
  void *guest_code, struct kvm_vcpu 
*vcpus[]);
 
@@ -806,17 +825,27 @@ static inline struct kvm_vm 
*vm_create_with_vcpus(uint32_t nr_vcpus,
  void *guest_code,
  struct kvm_vcpu *vcpus[])
 {
-   return __vm_create_with_vcpus(VM_MODE_DEFAULT, nr_vcpus, 0,
+   return __vm_create_with_vcpus(VM_SHAPE_DEFAULT, nr_vcpus, 0,
  guest_code, vcpus);
 }
 
+
+struct kvm_vm *__vm_create_shape_with_one_vcpu(struct vm_shape shape,
+  struct kvm_vcpu **vcpu,
+  uint64_t extra_mem_pages,
+  void *guest_code);
+
 /*
  * Create a VM with a single vCPU with reasonable defaults a

[PATCH v13 28/35] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86)

2023-10-27 Thread Sean Christopherson
From: Vishal Annapurve 

Add helpers for x86 guests to invoke the KVM_HC_MAP_GPA_RANGE hypercall,
which KVM will forward to userspace and thus can be used by tests to
coordinate private<=>shared conversions between host userspace code and
guest code.

Signed-off-by: Vishal Annapurve 
[sean: drop shared/private helpers (let tests specify flags)]
Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/x86_64/processor.h  | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h 
b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 25bc61dac5fb..a84863503fcb 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 
+#include 
 #include 
 
 #include "../kvm_util.h"
@@ -1194,6 +1195,20 @@ uint64_t kvm_hypercall(uint64_t nr, uint64_t a0, 
uint64_t a1, uint64_t a2,
 uint64_t __xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
 void xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
 
+static inline uint64_t __kvm_hypercall_map_gpa_range(uint64_t gpa,
+uint64_t size, uint64_t 
flags)
+{
+   return kvm_hypercall(KVM_HC_MAP_GPA_RANGE, gpa, size >> PAGE_SHIFT, 
flags, 0);
+}
+
+static inline void kvm_hypercall_map_gpa_range(uint64_t gpa, uint64_t size,
+  uint64_t flags)
+{
+   uint64_t ret = __kvm_hypercall_map_gpa_range(gpa, size, flags);
+
+   GUEST_ASSERT(!ret);
+}
+
 void __vm_xsave_require_permission(uint64_t xfeature, const char *name);
 
 #define vm_xsave_require_permission(xfeature)  \
-- 
2.42.0.820.g83a721a137-goog
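
Example guest-side usage (a sketch; the KVM_MAP_GPA_RANGE_* flag names come from
the generic hypercall ABI in include/uapi/linux/kvm_para.h, not from this patch,
and gpa/size are placeholders):

	/* Ask the host to treat [gpa, gpa + size) as shared/decrypted. */
	kvm_hypercall_map_gpa_range(gpa, size,
				    KVM_MAP_GPA_RANGE_DECRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);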



[PATCH v13 27/35] KVM: selftests: Add helpers to convert guest memory b/w private and shared

2023-10-27 Thread Sean Christopherson
From: Vishal Annapurve 

Add helpers to convert memory between private and shared via KVM's
memory attributes, as well as helpers to free/allocate guest_memfd memory
via fallocate().  Userspace, i.e. tests, is NOT required to do fallocate()
when converting memory, as the attributes are the single source of truth.
Provide allocate() helpers so that tests can mimic a userspace that frees
private memory on conversion, e.g. to prioritize memory usage over
performance.

Signed-off-by: Vishal Annapurve 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/kvm_util_base.h | 48 +++
 tools/testing/selftests/kvm/lib/kvm_util.c| 28 +++
 2 files changed, 76 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 9f861182c02a..1441fca6c273 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -333,6 +333,54 @@ static inline void vm_enable_cap(struct kvm_vm *vm, 
uint32_t cap, uint64_t arg0)
vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
 }
 
+static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
+   uint64_t size, uint64_t attributes)
+{
+   struct kvm_memory_attributes attr = {
+   .attributes = attributes,
+   .address = gpa,
+   .size = size,
+   .flags = 0,
+   };
+
+   /*
+* KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes.  These flows
+* need significant enhancements to support multiple attributes.
+*/
+   TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
+   "Update me to support multiple attributes!");
+
+   vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
+}
+
+
+static inline void vm_mem_set_private(struct kvm_vm *vm, uint64_t gpa,
+ uint64_t size)
+{
+   vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
+uint64_t size)
+{
+   vm_set_memory_attributes(vm, gpa, size, 0);
+}
+
+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
+   bool punch_hole);
+
+static inline void vm_guest_mem_punch_hole(struct kvm_vm *vm, uint64_t gpa,
+  uint64_t size)
+{
+   vm_guest_mem_fallocate(vm, gpa, size, true);
+}
+
+static inline void vm_guest_mem_allocate(struct kvm_vm *vm, uint64_t gpa,
+uint64_t size)
+{
+   vm_guest_mem_fallocate(vm, gpa, size, false);
+}
+
 void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
 const char *vm_guest_mode_string(uint32_t i);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 45050f54701a..a140aee8d0f5 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1176,6 +1176,34 @@ void vm_mem_region_delete(struct kvm_vm *vm, uint32_t 
slot)
__vm_mem_region_delete(vm, memslot2region(vm, slot), true);
 }
 
+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,
+   bool punch_hole)
+{
+   const int mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? 
FALLOC_FL_PUNCH_HOLE : 0);
+   struct userspace_mem_region *region;
+   uint64_t end = base + size;
+   uint64_t gpa, len;
+   off_t fd_offset;
+   int ret;
+
+   for (gpa = base; gpa < end; gpa += len) {
+   uint64_t offset;
+
+   region = userspace_mem_region_find(vm, gpa, gpa);
+   TEST_ASSERT(region && region->region.flags & KVM_MEM_PRIVATE,
+   "Private memory region not found for GPA 0x%lx", 
gpa);
+
+   offset = (gpa - region->region.guest_phys_addr);
+   fd_offset = region->region.guest_memfd_offset + offset;
+   len = min_t(uint64_t, end - gpa, region->region.memory_size - 
offset);
+
+   ret = fallocate(region->region.guest_memfd, mode, fd_offset, 
len);
+   TEST_ASSERT(!ret, "fallocate() failed to %s at %lx (len = %lu), 
fd = %d, mode = %x, offset = %lx\n",
+   punch_hole ? "punch hole" : "allocate", gpa, len,
+   region->region.guest_memfd, mode, fd_offset);
+   }
+}
+
 /* Returns the size of a vCPU's kvm_run structure. */
 static int vcpu_mmap_sz(void)
 {
-- 
2.42.0.820.g83a721a137-goog
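
Typical host-side usage built on these helpers (sketch, with gpa and size as
placeholders):

	/* Make the range private; the shared backing is left untouched here. */
	vm_mem_set_private(vm, gpa, size);

	/* Convert back to shared and punch a hole in the guest_memfd backing,
	 * mimicking a userspace that frees private memory on conversion. */
	vm_mem_set_shared(vm, gpa, size);
	vm_guest_mem_punch_hole(vm, gpa, size);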



[PATCH v13 26/35] KVM: selftests: Add support for creating private memslots

2023-10-27 Thread Sean Christopherson
Add support for creating "private" memslots via KVM_CREATE_GUEST_MEMFD and
KVM_SET_USER_MEMORY_REGION2.  Make vm_userspace_mem_region_add() a wrapper
to its effective replacement, vm_mem_add(), so that private memslots are
fully opt-in, i.e. don't require updating all tests that add memory regions.

Pivot on the KVM_MEM_PRIVATE flag instead of the validity of the "gmem"
file descriptor so that simple tests can let vm_mem_add() do the heavy
lifting of creating the guest memfd, but also allow the caller to pass in
an explicit fd+offset so that fancier tests can do things like back
multiple memslots with a single file.  If the caller passes in a fd, dup()
the fd so that (a) __vm_mem_region_delete() can close the fd associated
with the memory region without needing yet another flag, and (b) so that
the caller can safely close its copy of the fd without having to first
destroy memslots.

Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/kvm_util_base.h | 23 +
 .../testing/selftests/kvm/include/test_util.h |  5 ++
 tools/testing/selftests/kvm/lib/kvm_util.c| 85 ---
 3 files changed, 82 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 9f144841c2ee..9f861182c02a 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -431,6 +431,26 @@ static inline uint64_t vm_get_stat(struct kvm_vm *vm, 
const char *stat_name)
 
 void vm_create_irqchip(struct kvm_vm *vm);
 
+static inline int __vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+   uint64_t flags)
+{
+   struct kvm_create_guest_memfd guest_memfd = {
+   .size = size,
+   .flags = flags,
+   };
+
+   return __vm_ioctl(vm, KVM_CREATE_GUEST_MEMFD, &guest_memfd);
+}
+
+static inline int vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+   uint64_t flags)
+{
+   int fd = __vm_create_guest_memfd(vm, size, flags);
+
+   TEST_ASSERT(fd >= 0, KVM_IOCTL_ERROR(KVM_CREATE_GUEST_MEMFD, fd));
+   return fd;
+}
+
 void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
   uint64_t gpa, uint64_t size, void *hva);
 int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
@@ -439,6 +459,9 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
uint32_t flags);
+void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
+   uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+   uint32_t flags, int guest_memfd_fd, uint64_t 
guest_memfd_offset);
 
 void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags);
 void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa);
diff --git a/tools/testing/selftests/kvm/include/test_util.h 
b/tools/testing/selftests/kvm/include/test_util.h
index 7e614adc6cf4..7257f2243ab9 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -142,6 +142,11 @@ static inline bool backing_src_is_shared(enum 
vm_mem_backing_src_type t)
return vm_mem_backing_src_alias(t)->flag & MAP_SHARED;
 }
 
+static inline bool backing_src_can_be_huge(enum vm_mem_backing_src_type t)
+{
+   return t != VM_MEM_SRC_ANONYMOUS && t != VM_MEM_SRC_SHMEM;
+}
+
 /* Aligns x up to the next multiple of size. Size must be a power of 2. */
 static inline uint64_t align_up(uint64_t x, uint64_t size)
 {
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 3676b37bea38..45050f54701a 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -669,6 +669,8 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("munmap()", ret));
close(region->fd);
}
+   if (region->region.guest_memfd >= 0)
+   close(region->region.guest_memfd);
 
free(region);
 }
@@ -870,36 +872,15 @@ void vm_set_user_memory_region(struct kvm_vm *vm, 
uint32_t slot, uint32_t flags,
errno, strerror(errno));
 }
 
-/*
- * VM Userspace Memory Region Add
- *
- * Input Args:
- *   vm - Virtual Machine
- *   src_type - Storage source for this region.
- *  NULL to use anonymous memory.
- *   guest_paddr - Starting guest physical address
- *   slot - KVM region slot
- *   npages - Number of physical pages
- *   flags - KVM memory region flags (e.g. KVM_MEM_LOG_DIRTY_PAGES)
- *
- * Output Arg
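
(The diff is cut off by the archive.  The changelog's "back multiple memslots
with a single file" usage would look roughly like the sketch below; gpa, slot,
npages and size are placeholders.  vm_mem_add() dup()s the fd, so the caller's
copy can be closed right away.)

	int gmem = vm_create_guest_memfd(vm, 2 * size, 0);

	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, gpa, slot, npages,
		   KVM_MEM_PRIVATE, gmem, 0);
	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, gpa + size, slot + 1, npages,
		   KVM_MEM_PRIVATE, gmem, size);
	close(gmem);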

[PATCH v13 25/35] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2

2023-10-27 Thread Sean Christopherson
Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so that
support for guest private memory can be added without needing an entirely
separate set of helpers.

Note, this obviously makes selftests backwards-incompatible with older KVM
versions from this point forward.

Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/kvm_util_base.h |  2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c| 19 ++-
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 967eaaeacd75..9f144841c2ee 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -44,7 +44,7 @@ typedef uint64_t vm_paddr_t; /* Virtual Machine (Guest) 
physical address */
 typedef uint64_t vm_vaddr_t; /* Virtual Machine (Guest) virtual address */
 
 struct userspace_mem_region {
-   struct kvm_userspace_memory_region region;
+   struct kvm_userspace_memory_region2 region;
struct sparsebit *unused_phy_pages;
int fd;
off_t offset;
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index f09295d56c23..3676b37bea38 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -453,8 +453,9 @@ void kvm_vm_restart(struct kvm_vm *vmp)
vm_create_irqchip(vmp);
 
hash_for_each(vmp->regions.slot_hash, ctr, region, slot_node) {
-   int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION, 
&region->region);
-   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL 
failed,\n"
+   int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION2, 
&region->region);
+
+   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL 
failed,\n"
"  rc: %i errno: %i\n"
"  slot: %u flags: 0x%x\n"
"  guest_phys_addr: 0x%llx size: 0x%llx",
@@ -657,7 +658,7 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
}
 
region->region.memory_size = 0;
-   vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+   vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
sparsebit_free(&region->unused_phy_pages);
ret = munmap(region->mmap_start, region->mmap_size);
@@ -1014,8 +1015,8 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
region->region.guest_phys_addr = guest_paddr;
region->region.memory_size = npages * vm->page_size;
region->region.userspace_addr = (uintptr_t) region->host_mem;
-   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
-   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
+   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
"  rc: %i errno: %i\n"
"  slot: %u flags: 0x%x\n"
"  guest_phys_addr: 0x%lx size: 0x%lx",
@@ -1097,9 +1098,9 @@ void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t 
slot, uint32_t flags)
 
region->region.flags = flags;
 
-   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
-   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
"  rc: %i errno: %i slot: %u flags: 0x%x",
ret, errno, slot, flags);
 }
@@ -1127,9 +1128,9 @@ void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, 
uint64_t new_gpa)
 
region->region.guest_phys_addr = new_gpa;
 
-   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
-   TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION failed\n"
+   TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed\n"
"ret: %i errno: %i slot: %u new_gpa: 0x%lx",
ret, errno, slot, new_gpa);
 }
-- 
2.42.0.820.g83a721a137-goog



[PATCH v13 24/35] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper

2023-10-27 Thread Sean Christopherson
Drop kvm_userspace_memory_region_find(), it's unused and a terrible API
(probably why it's unused).  If anything outside of kvm_util.c needs to
get at the memslot, userspace_mem_region_find() can be exposed to give
others full access to all memory region/slot information.

Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/kvm_util_base.h |  4 ---
 tools/testing/selftests/kvm/lib/kvm_util.c| 29 ---
 2 files changed, 33 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index a18db6a7b3cf..967eaaeacd75 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -776,10 +776,6 @@ vm_adjust_num_guest_pages(enum vm_guest_mode mode, 
unsigned int num_guest_pages)
return n;
 }
 
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-uint64_t end);
-
 #define sync_global_to_guest(vm, g) ({ \
typeof(g) *_p = addr_gva2hva(vm, (vm_vaddr_t)&(g)); \
memcpy(_p, &(g), sizeof(g));\
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 7a8af1821f5d..f09295d56c23 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -590,35 +590,6 @@ userspace_mem_region_find(struct kvm_vm *vm, uint64_t 
start, uint64_t end)
return NULL;
 }
 
-/*
- * KVM Userspace Memory Region Find
- *
- * Input Args:
- *   vm - Virtual Machine
- *   start - Starting VM physical address
- *   end - Ending VM physical address, inclusive.
- *
- * Output Args: None
- *
- * Return:
- *   Pointer to overlapping region, NULL if no such region.
- *
- * Public interface to userspace_mem_region_find. Allows tests to look up
- * the memslot datastructure for a given range of guest physical memory.
- */
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-uint64_t end)
-{
-   struct userspace_mem_region *region;
-
-   region = userspace_mem_region_find(vm, start, end);
-   if (!region)
-   return NULL;
-
-   return &region->region;
-}
-
 __weak void vcpu_arch_free(struct kvm_vcpu *vcpu)
 {
 
-- 
2.42.0.820.g83a721a137-goog



[PATCH v13 23/35] KVM: x86: Add support for "protected VMs" that can utilize private memory

2023-10-27 Thread Sean Christopherson
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development
and testing vehicle for Confidential (CoCo) VMs, and potentially to even
become a "real" product in the distant future, e.g. a la pKVM.

The private memory support in KVM x86 is aimed at AMD's SEV-SNP and
Intel's TDX, but those technologies are extremely complex (understatement),
difficult to debug, don't support running as nested guests, and require
hardware that isn't universally accessible.  I.e. relying on SEV-SNP or TDX
for maintaining guest private memory isn't a realistic option.

At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of
selftests for guest_memfd and private memory support without requiring
unique hardware.

Signed-off-by: Sean Christopherson 
---
 Documentation/virt/kvm/api.rst  | 32 
 arch/x86/include/asm/kvm_host.h | 15 +--
 arch/x86/include/uapi/asm/kvm.h |  3 +++
 arch/x86/kvm/Kconfig| 12 
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 arch/x86/kvm/x86.c  | 16 +++-
 include/uapi/linux/kvm.h|  1 +
 virt/kvm/Kconfig|  5 +
 8 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 38dc1fda4f45..00029436ac5b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -147,10 +147,29 @@ described as 'basic' will be available.
 The new VM has no virtual cpus and no memory.
 You probably want to use 0 as machine type.
 
+X86:
+
+
+Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
+
+S390:
+^
+
 In order to create user controlled virtual machines on S390, check
 KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
 privileged user (CAP_SYS_ADMIN).
 
+MIPS:
+^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^
+
 On arm64, the physical address size for a VM (IPA Size limit) is limited
 to 40bits by default. The limit can be configured if the host supports the
 extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -8650,6 +8669,19 @@ block sizes is exposed in 
KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
 64-bit bitmap (each bit describing a block size). The default value is
 0, to disable the eager page splitting.
 
+8.41 KVM_CAP_VM_TYPES
+-
+
+:Capability: KVM_CAP_VM_TYPES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of supported VM types.  The 1-setting of bit @n
+means the VM type with value @n is supported.  Possible values of @n are::
+
+  #define KVM_X86_DEFAULT_VM   0
+  #define KVM_X86_SW_PROTECTED_VM  1
+
 9. Known KVM API problems
 =
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f9e8d5642069..dff10051e9b6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1244,6 +1244,7 @@ enum kvm_apicv_inhibit {
 };
 
 struct kvm_arch {
+   unsigned long vm_type;
unsigned long n_used_mmu_pages;
unsigned long n_requested_mmu_pages;
unsigned long n_max_mmu_pages;
@@ -2077,6 +2078,12 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t 
new_pgd);
 void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
   int tdp_max_root_level, int tdp_huge_page_level);
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != 
KVM_X86_DEFAULT_VM)
+#else
+#define kvm_arch_has_private_mem(kvm) false
+#endif
+
 static inline u16 kvm_read_ldt(void)
 {
u16 ldt;
@@ -2125,14 +2132,10 @@ enum {
 #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
 
 # define KVM_MAX_NR_ADDRESS_SPACES 2
+/* SMM is currently unsupported for guests with private memory. */
+# define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 
2)
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 
1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
-
-static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
-{
-   return KVM_MAX_NR_ADDRESS_SPACES;
-}
-
 #else
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
 #endif
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 1a6a1f987949..a448d0964fc0 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -562,4 +562,7 @@ struct kvm_pmu_event_filter {
 /* x86-specific KVM_EXIT_HYPERCALL flags. */
 #define KVM_EXIT_HYPERCALL_LONG_MODE   BIT(0)
 
+#define KVM_X86_DEFAULT_VM 0
+#define KVM_X86_SW_PROTECTED_VM1
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 091b74599c22..
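
For reference, a rough userspace sketch of consuming the new capability and VM
type; the open()/error handling is illustrative only and it assumes a
linux/kvm.h that carries the uapi additions above:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: create a KVM_X86_SW_PROTECTED_VM if the host advertises it. */
static int create_sw_protected_vm(void)
{
	int kvm_fd, types, vm_fd;

	kvm_fd = open("/dev/kvm", O_RDWR);
	if (kvm_fd < 0)
		return -1;

	/* Bitmap of supported VM types; bit N set => type N is supported. */
	types = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_VM_TYPES);
	if (types < 0 || !(types & (1 << KVM_X86_SW_PROTECTED_VM)))
		return -1;

	/* KVM_CREATE_VM takes the VM type as its sole argument. */
	vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_SW_PROTECTED_VM);
	return vm_fd;
}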

[PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM

2023-10-27 Thread Sean Christopherson
Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs.  Confidential VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).

Signed-off-by: Sean Christopherson 
---
 arch/powerpc/kvm/book3s_hv.c|  2 +-
 arch/x86/include/asm/kvm_host.h |  8 +++-
 arch/x86/kvm/debugfs.c  |  2 +-
 arch/x86/kvm/mmu/mmu.c  |  6 +++---
 arch/x86/kvm/x86.c  |  2 +-
 include/linux/kvm_host.h| 17 +++--
 virt/kvm/dirty_ring.c   |  2 +-
 virt/kvm/kvm_main.c | 26 ++
 8 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..9b0eaa17275a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6084,7 +6084,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
}
 
srcu_idx = srcu_read_lock(&kvm->srcu);
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
struct kvm_memory_slot *memslot;
struct kvm_memslots *slots = __kvm_memslots(kvm, i);
int bkt;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6702f795c862..f9e8d5642069 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2124,9 +2124,15 @@ enum {
 #define HF_SMM_MASK(1 << 1)
 #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
 
-# define KVM_ADDRESS_SPACE_NUM 2
+# define KVM_MAX_NR_ADDRESS_SPACES 2
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 
1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
+
+static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
+{
+   return KVM_MAX_NR_ADDRESS_SPACES;
+}
+
 #else
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
 #endif
diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index ee8c4c3496ed..42026b3f3ff3 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -111,7 +111,7 @@ static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void 
*v)
mutex_lock(&kvm->slots_lock);
write_lock(&kvm->mmu_lock);
 
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
int bkt;
 
slots = __kvm_memslots(kvm, i);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c4e758f0aebb..baeba8fc1c38 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3755,7 +3755,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
kvm_page_track_write_tracking_enabled(kvm))
goto out_success;
 
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);
kvm_for_each_memslot(slot, bkt, slots) {
/*
@@ -6294,7 +6294,7 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t 
gfn_start, gfn_t gfn_e
if (!kvm_memslots_have_rmaps(kvm))
return flush;
 
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);
 
kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, 
gfn_end) {
@@ -6791,7 +6791,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 
gen)
 * modifier prior to checking for a wrap of the MMIO generation so
 * that a wrap in any address space is detected.
 */
-   gen &= ~((u64)KVM_ADDRESS_SPACE_NUM - 1);
+   gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1);
 
/*
 * The very rare case: if the MMIO generation number has wrapped,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 824b58b44382..c4d17727b199 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12456,7 +12456,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, 
int id, gpa_t gpa,
hva = slot->userspace_addr;
}
 
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
struct kvm_userspace_memory_region2 m;
 
m.slot = id | (i << 16);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c3cfe08b1300..687589ce9f63 100644
--- a/include/linux/kvm_host.h
+++ b/include/l

[PATCH v13 21/35] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro

2023-10-27 Thread Sean Christopherson
Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
KVM_ADDRESS_SPACE_NUM.

No functional change intended.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h | 1 -
 include/linux/kvm_host.h| 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8d60e4745e8b..6702f795c862 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2124,7 +2124,6 @@ enum {
 #define HF_SMM_MASK(1 << 1)
 #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
 
-# define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
 # define KVM_ADDRESS_SPACE_NUM 2
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 
1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e3223cafd7db..c3cfe08b1300 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -692,7 +692,7 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
 #define KVM_MEM_SLOTS_NUM SHRT_MAX
 #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
 
-#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
+#if KVM_ADDRESS_SPACE_NUM == 1
 static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 {
return 0;
-- 
2.42.0.820.g83a721a137-goog



[PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory

2023-10-27 Thread Sean Christopherson
From: Chao Peng 

Add support for resolving page faults on guest private memory for VMs
that differentiate between "shared" and "private" memory.  For such VMs,
KVM_MEM_PRIVATE memslots can include both fd-based private memory and
hva-based shared memory, and KVM needs to map in the "correct" variant,
i.e. KVM needs to map the gfn shared/private as appropriate based on the
current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.

For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
shared vs. private via a bit in the guest page tables, i.e. what the guest
wants may conflict with the current memory attributes.  To support such
"implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
to forward the request to userspace.  Add a new flag for memory faults,
KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
map memory as shared vs. private.

Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
exit on missing mappings when handling guest page fault VM-Exits.  In
that case, userspace will want to know RWX information in order to
correctly/precisely resolve the fault.

Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
always come from the host userspace page tables, and private mappings
always come from a guest_memfd instance.

Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 Documentation/virt/kvm/api.rst  |   8 ++-
 arch/x86/kvm/mmu/mmu.c  | 101 ++--
 arch/x86/kvm/mmu/mmu_internal.h |   1 +
 include/linux/kvm_host.h|   8 ++-
 include/uapi/linux/kvm.h|   1 +
 5 files changed, 110 insertions(+), 9 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7f00c310c24a..38dc1fda4f45 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6837,6 +6837,7 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
 
/* KVM_EXIT_MEMORY_FAULT */
struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
__u64 flags;
__u64 gpa;
__u64 size;
@@ -6845,8 +6846,11 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
 KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
 could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
 guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
-describes properties of the faulting access that are likely pertinent.
-Currently, no flags are defined.
+describes properties of the faulting access that are likely pertinent:
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
+   on a private memory access.  When clear, indicates the fault occurred on a
+   shared access.
 
 Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
 accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4167d557c577..c4e758f0aebb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3147,9 +3147,9 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t 
gfn,
return level;
 }
 
-int kvm_mmu_max_mapping_level(struct kvm *kvm,
- const struct kvm_memory_slot *slot, gfn_t gfn,
- int max_level)
+static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
+  const struct kvm_memory_slot *slot,
+  gfn_t gfn, int max_level, bool 
is_private)
 {
struct kvm_lpage_info *linfo;
int host_level;
@@ -3161,6 +3161,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
break;
}
 
+   if (is_private)
+   return max_level;
+
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
 
@@ -3168,6 +3171,16 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
return min(host_level, max_level);
 }
 
+int kvm_mmu_max_mapping_level(struct kvm *kvm,
+ const struct kvm_memory_slot *slot, gfn_t gfn,
+ int max_level)
+{
+   bool is_private = kvm_slot_can_be_private(slot) &&
+ kvm_mem_is_private(kvm, gfn);
+
+   return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, 
is_private);
+}
+
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault 
*fault)
 {
struct kvm_memory_slot *slot = fault->slot;
@@ -3188,8 +3201,9 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcp
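
To make the implicit-conversion flow above concrete, a rough userspace sketch:
when KVM_RUN fails with EFAULT and reports KVM_EXIT_MEMORY_FAULT, flip the
range's PRIVATE attribute to match what the guest asked for and re-enter the
guest.  The run-structure member name and the "just convert" policy are
assumptions of this sketch, not mandated by the series:

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Sketch: on an implicit conversion request, set the gfn range's PRIVATE
 * attribute to whatever the guest asked for, then re-enter the guest.
 */
static int handle_implicit_conversion(int vm_fd, struct kvm_run *run)
{
	struct kvm_memory_attributes attrs = {
		.address = run->memory_fault.gpa,
		.size = run->memory_fault.size,
		.attributes = (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
			      KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
	};

	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}

static int vcpu_run_loop(int vm_fd, int vcpu_fd, struct kvm_run *run)
{
	for (;;) {
		if (!ioctl(vcpu_fd, KVM_RUN, 0)) {
			/* ... handle run->exit_reason for successful exits ... */
			continue;
		}

		if (errno == EFAULT && run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			if (handle_implicit_conversion(vm_fd, run))
				return -1;
			continue;
		}
		return -1;
	}
}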

[PATCH v13 19/35] KVM: x86: Disallow hugepages when memory attributes are mixed

2023-10-27 Thread Sean Christopherson
From: Chao Peng 

Disallow creating hugepages with mixed memory attributes, e.g. shared
versus private, as mapping a hugepage in this case would allow the guest
to access memory with the wrong attributes, e.g. overlaying private memory
with a shared hugepage.

Track whether or not attributes are mixed via the existing
disallow_lpage field, but use the most significant bit in 'disallow_lpage'
to indicate a hugepage has mixed attributes instead of using the normal
refcounting.  Whether or not attributes are mixed is binary; either they
are or they aren't.  Attempting to squeeze that info into the refcount is
unnecessarily complex as it would require knowing the previous state of
the mixed count when updating attributes.  Using a flag means KVM just
needs to ensure the current status is reflected in the memslots.

Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/kvm/mmu/mmu.c  | 154 +++-
 arch/x86/kvm/x86.c  |   4 +
 3 files changed, 159 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 31e84668014e..8d60e4745e8b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1836,6 +1836,9 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu);
 void kvm_mmu_init_vm(struct kvm *kvm);
 void kvm_mmu_uninit_vm(struct kvm *kvm);
 
+void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
+   struct kvm_memory_slot *slot);
+
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu);
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d33657d61d80..4167d557c577 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -795,16 +795,26 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
return &slot->arch.lpage_info[level - 2][idx];
 }
 
+/*
+ * The most significant bit in disallow_lpage tracks whether or not memory
+ * attributes are mixed, i.e. not identical for all gfns at the current level.
+ * The lower order bits are used to refcount other cases where a hugepage is
+ * disallowed, e.g. if KVM has shadowed a page table at the gfn.
+ */
+#define KVM_LPAGE_MIXED_FLAG   BIT(31)
+
 static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
gfn_t gfn, int count)
 {
struct kvm_lpage_info *linfo;
-   int i;
+   int old, i;
 
for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
+
+   old = linfo->disallow_lpage;
linfo->disallow_lpage += count;
-   WARN_ON_ONCE(linfo->disallow_lpage < 0);
+   WARN_ON_ONCE((old ^ linfo->disallow_lpage) & 
KVM_LPAGE_MIXED_FLAG);
}
 }
 
@@ -7161,3 +7171,143 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
if (kvm->arch.nx_huge_page_recovery_thread)
kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
+
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+   int level)
+{
+   return lpage_info_slot(gfn, slot, level)->disallow_lpage & 
KVM_LPAGE_MIXED_FLAG;
+}
+
+static void hugepage_clear_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+int level)
+{
+   lpage_info_slot(gfn, slot, level)->disallow_lpage &= 
~KVM_LPAGE_MIXED_FLAG;
+}
+
+static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+  int level)
+{
+   lpage_info_slot(gfn, slot, level)->disallow_lpage |= 
KVM_LPAGE_MIXED_FLAG;
+}
+
+static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
+  gfn_t gfn, int level, unsigned long attrs)
+{
+   const unsigned long start = gfn;
+   const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
+
+   if (level == PG_LEVEL_2M)
+   return kvm_range_has_memory_attributes(kvm, start, end, attrs);
+
+   for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
+   if (hugepage_test_mixed(slot, gfn, level - 1) ||
+   attrs != kvm_get_memory_attributes(kvm, gfn))
+   return false;
+   }
+   return true;
+}
+
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+struct kvm_gfn_range *range)
+{
+   unsigned long attrs = range->arg.attributes;
+   struct kvm_memory_slot *slot = range->slot;
+   int level;
+
+   lockdep_assert_held_write(&kvm->mmu_lock);
+   lockdep_assert_held(&kvm->slots_lock);
+
+   /*
+   

[PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN

2023-10-27 Thread Sean Christopherson
Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
the probability of exiting to userspace with a stale run->exit_reason that
*appears* to be valid.

To support fd-based guest memory (guest memory without a corresponding
userspace virtual address), KVM will exit to userspace for various memory
related errors, which userspace *may* be able to resolve, instead of using
e.g. BUS_MCEERR_AR.  And in the more distant future, KVM will also likely
utilize the same functionality to let userspace "intercept" and handle
memory faults when the userspace mapping is missing, i.e. when fast gup()
fails.

Because many of KVM's internal APIs related to guest memory use '0' to
indicate "success, continue on" and not "exit to userspace", reporting
memory faults/errors to userspace will set run->exit_reason and
corresponding fields in the run structure in conjunction with a
non-zero, negative return code, e.g. -EFAULT or -EHWPOISON.  And because
KVM already returns -EFAULT in many paths, there's a relatively high
probability that KVM could return -EFAULT without setting run->exit_reason,
in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
whatever exit reason happened to be in the run structure.

Note, KVM must wait until after run->immediate_exit is serviced to
sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
preserved across KVM_RUN when run->immediate_exit is true.

Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoor...@google.com
Link: https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ee3cd8c3c0ef..f41dbb1465a0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10963,6 +10963,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 {
int r;
 
+   vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
vcpu->arch.l1tf_flush_l1d = true;
 
for (;;) {
-- 
2.42.0.820.g83a721a137-goog



[PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-10-27 Thread Sean Christopherson
Extend guest_memfd to allow backing guest memory with transparent
hugepages.  Require userspace to opt-in via a flag even though there's no
known/anticipated use case for forcing small pages as THP is optional,
i.e. to avoid ending up in a situation where userspace is unaware that
KVM can't provide hugepages.

For simplicity, require the guest_memfd size to be a multiple of the
hugepage size, e.g. so that KVM doesn't need to do bounds checking when
deciding whether or not to allocate a huge folio.

When reporting the max order when KVM gets a pfn from guest_memfd, force
order-0 pages if the hugepage is not fully contained by the memslot
binding, e.g. if userspace requested hugepages but punches a hole in the
memslot bindings in order to emulate x86's VGA hole.

Signed-off-by: Sean Christopherson 
---
 Documentation/virt/kvm/api.rst |  7 
 include/uapi/linux/kvm.h   |  2 +
 virt/kvm/guest_memfd.c | 73 ++
 3 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e82c69d5e755..7f00c310c24a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6176,6 +6176,8 @@ and cannot be resized  (guest_memfd files do however 
support PUNCH_HOLE).
__u64 reserved[6];
   };
 
+  #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0)
+
 Conceptually, the inode backing a guest_memfd file represents physical memory,
 i.e. is coupled to the virtual machine as a thing, not to a "struct kvm".  The
 file itself, which is bound to a "struct kvm", is that instance's view of the
@@ -6192,6 +6194,11 @@ most one mapping per page, i.e. binding multiple memory 
regions to a single
 guest_memfd range is not allowed (any number of memory regions can be bound to
 a single guest_memfd file, but the bound ranges must not overlap).
 
+If KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set in flags, KVM will attempt to allocate
+and map hugepages for the guest_memfd file.  This is currently best effort.  If
+KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set, the size must be aligned to the maximum
+transparent hugepage size supported by the kernel.
+
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
 5. The kvm_run structure
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 25caee8d1a80..33d542de0a61 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2303,4 +2303,6 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
 };
 
+#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0)
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 98a12da80214..94bc478c26f3 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -13,14 +13,47 @@ struct kvm_gmem {
struct list_head entry;
 };
 
+static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t 
index)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   unsigned long huge_index = round_down(index, HPAGE_PMD_NR);
+   unsigned long flags = (unsigned long)inode->i_private;
+   struct address_space *mapping  = inode->i_mapping;
+   gfp_t gfp = mapping_gfp_mask(mapping);
+   struct folio *folio;
+
+   if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE))
+   return NULL;
+
+   if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT,
+  (huge_index + HPAGE_PMD_NR - 1) << 
PAGE_SHIFT))
+   return NULL;
+
+   folio = filemap_alloc_folio(gfp, HPAGE_PMD_ORDER);
+   if (!folio)
+   return NULL;
+
+   if (filemap_add_folio(mapping, folio, huge_index, gfp)) {
+   folio_put(folio);
+   return NULL;
+   }
+
+   return folio;
+#else
+   return NULL;
+#endif
+}
+
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 {
struct folio *folio;
 
-   /* TODO: Support huge pages. */
-   folio = filemap_grab_folio(inode->i_mapping, index);
-   if (IS_ERR_OR_NULL(folio))
-   return NULL;
+   folio = kvm_gmem_get_huge_folio(inode, index);
+   if (!folio) {
+   folio = filemap_grab_folio(inode->i_mapping, index);
+   if (IS_ERR_OR_NULL(folio))
+   return NULL;
+   }
 
/*
 * Use the up-to-date flag to track whether or not the memory has been
@@ -373,6 +406,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, 
u64 flags)
inode->i_mode |= S_IFREG;
inode->i_size = size;
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+   mapping_set_large_folios(inode->i_mapping);
mapping_set_unmovable(inode->i_mapping);
/* Unmovable mappings are supposed to be marked unevictable as well. */
WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
@@ -398,12 +432,2
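
A small userspace sketch of opting in to hugepages; the hard-coded 2MiB PMD
size is an assumption for brevity (a real VMM should query the kernel's
supported THP size instead):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: create a guest_memfd that opts in to transparent hugepages. */
static int create_thp_guest_memfd(int vm_fd, __u64 size)
{
	const __u64 hpage_size = 2UL << 20;	/* assumed PMD size */
	struct kvm_create_guest_memfd gmem = {
		.size = size,	/* must be a multiple of the hugepage size */
		.flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
	};

	if (size & (hpage_size - 1))
		return -1;

	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
}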

[PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-10-27 Thread Sean Christopherson
erwick 
Cc: Isaku Yamahata 
Co-developed-by: Kirill A. Shutemov 
Signed-off-by: Kirill A. Shutemov 
Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Co-developed-by: Chao Peng 
Signed-off-by: Chao Peng 
Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
Co-developed-by: Isaku Yamahata 
Signed-off-by: Isaku Yamahata 
Co-developed-by: Paolo Bonzini 
Signed-off-by: Paolo Bonzini 
Co-developed-by: Michael Roth 
Signed-off-by: Michael Roth 
Signed-off-by: Sean Christopherson 
---
 Documentation/virt/kvm/api.rst |  69 -
 include/linux/kvm_host.h   |  48 +++
 include/uapi/linux/kvm.h   |  15 +-
 virt/kvm/Kconfig   |   4 +
 virt/kvm/Makefile.kvm  |   1 +
 virt/kvm/guest_memfd.c | 548 +
 virt/kvm/kvm_main.c|  68 +++-
 virt/kvm/kvm_mm.h  |  26 ++
 8 files changed, 764 insertions(+), 15 deletions(-)
 create mode 100644 virt/kvm/guest_memfd.c

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e2252c748fd6..e82c69d5e755 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6079,6 +6079,15 @@ applied.
 :Parameters: struct kvm_userspace_memory_region2 (in)
 :Returns: 0 on success, -1 on error
 
+KVM_SET_USER_MEMORY_REGION2 is an extension to KVM_SET_USER_MEMORY_REGION that
+allows mapping guest_memfd memory into a guest.  All fields shared with
+KVM_SET_USER_MEMORY_REGION behave identically.  Userspace can set KVM_MEM_PRIVATE in
+flags to have KVM bind the memory region to a given guest_memfd range of
+[guest_memfd_offset, guest_memfd_offset + memory_size].  The target guest_memfd
+must point at a file created via KVM_CREATE_GUEST_MEMFD on the current VM, and
+the target range must not be bound to any other memory region.  All standard
+bounds checks apply (use common sense).
+
 ::
 
   struct kvm_userspace_memory_region2 {
@@ -6087,9 +6096,24 @@ applied.
__u64 guest_phys_addr;
__u64 memory_size; /* bytes */
__u64 userspace_addr; /* start of the userspace allocated memory */
+  __u64 guest_memfd_offset;
+   __u32 guest_memfd;
+   __u32 pad1;
+   __u64 pad2[14];
   };
 
-See KVM_SET_USER_MEMORY_REGION.
+A KVM_MEM_PRIVATE region _must_ have a valid guest_memfd (private memory) and
+userspace_addr (shared memory).  However, "valid" for userspace_addr simply
+means that the address itself must be a legal userspace address.  The backing
+mapping for userspace_addr is not required to be valid/populated at the time of
+KVM_SET_USER_MEMORY_REGION2, e.g. shared memory can be lazily mapped/allocated
+on-demand.
+
+When mapping a gfn into the guest, KVM selects shared vs. private, i.e consumes
+userspace_addr vs. guest_memfd, based on the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE
+state.  At VM creation time, all memory is shared, i.e. the PRIVATE attribute
+is '0' for all gfns.  Userspace can control whether memory is shared/private by
+toggling KVM_MEMORY_ATTRIBUTE_PRIVATE via KVM_SET_MEMORY_ATTRIBUTES as needed.
 
 4.140 KVM_SET_MEMORY_ATTRIBUTES
 ---
@@ -6127,6 +6151,49 @@ the state of a gfn/page as needed.
 
 The "flags" field is reserved for future extensions and must be '0'.
 
+4.141 KVM_CREATE_GUEST_MEMFD
+
+
+:Capability: KVM_CAP_GUEST_MEMFD
+:Architectures: none
+:Type: vm ioctl
+:Parameters: struct kvm_create_guest_memfd (in)
+:Returns: 0 on success, <0 on error
+
+KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor
+that refers to it.  guest_memfd files are roughly analogous to files created
+via memfd_create(), e.g. guest_memfd files live in RAM, have volatile storage,
+and are automatically released when the last reference is dropped.  Unlike
+"regular" memfd_create() files, guest_memfd files are bound to their owning
+virtual machine (see below), cannot be mapped, read, or written by userspace,
+and cannot be resized  (guest_memfd files do however support PUNCH_HOLE).
+
+::
+
+  struct kvm_create_guest_memfd {
+   __u64 size;
+   __u64 flags;
+   __u64 reserved[6];
+  };
+
+Conceptually, the inode backing a guest_memfd file represents physical memory,
+i.e. is coupled to the virtual machine as a thing, not to a "struct kvm".  The
+file itself, which is bound to a "struct kvm", is that instance's view of the
+underlying memory, e.g. effectively provides the translation of guest addresses
+to host memory.  This allows for use cases where multiple KVM structures are
+used to manage a single virtual machine, e.g. when performing intrahost
+migration of a virtual machine.
+
+KVM currently only supports mapping guest_memfd via 
KVM_SET_USER_MEMORY_REGION2,
+and more specifically via the guest_memfd and guest_memfd_offset fields in
+"struct kvm_userspace_memory_region2", where guest_memfd_offset is the offset
+into the guest_memfd instance.  For a given guest_memfd file,
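
A rough end-to-end userspace sketch of the binding described above; the
anonymous mmap() for the shared side and the lack of cleanup on error are
illustrative only:

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

/* Sketch: create a guest_memfd and bind it to a KVM_MEM_PRIVATE memslot. */
static int add_private_memslot(int vm_fd, __u32 slot, __u64 gpa, __u64 size)
{
	struct kvm_create_guest_memfd gmem = { .size = size };
	struct kvm_userspace_memory_region2 region;
	void *shared;
	int gmem_fd;

	gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	if (gmem_fd < 0)
		return -1;

	/* hva-based shared backing; can be populated lazily. */
	shared = mmap(NULL, size, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (shared == MAP_FAILED)
		return -1;

	region = (struct kvm_userspace_memory_region2) {
		.slot = slot,
		.flags = KVM_MEM_PRIVATE,
		.guest_phys_addr = gpa,
		.memory_size = size,
		.userspace_addr = (unsigned long)shared,
		.guest_memfd = gmem_fd,
		.guest_memfd_offset = 0,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}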

[PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM

2023-10-27 Thread Sean Christopherson
Export anon_inode_getfile_secure() so that it can be used by KVM to create
and manage file-based guest memory without needing a full-blown filesystem.
The "standard" anon_inode_getfd() doesn't work for KVM's use case as KVM
needs a unique inode for each file, e.g. to be able to independently
manage the size and lifecycle of a given file.

Note, KVM doesn't need a "secure" version, just unique inodes, i.e. ignore
the name.

Signed-off-by: Sean Christopherson 
---
 fs/anon_inodes.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 24192a7667ed..4190336180ee 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -176,6 +176,7 @@ struct file *anon_inode_getfile_secure(const char *name,
return __anon_inode_getfile(name, fops, priv, flags,
context_inode, true);
 }
+EXPORT_SYMBOL_GPL(anon_inode_getfile_secure);
 
 static int __anon_inode_getfd(const char *name,
  const struct file_operations *fops,
-- 
2.42.0.820.g83a721a137-goog
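
A hedged kernel-side sketch of the kind of caller this export enables: each
call returns a file backed by its own inode, so i_size and the mapping can be
managed per file.  Everything except the exported helper (the fops, naming and
sizing) is illustrative:

#include <linux/anon_inodes.h>
#include <linux/fcntl.h>
#include <linux/fs.h>
#include <linux/module.h>

static const struct file_operations example_fops = {
	.owner = THIS_MODULE,
};

static struct file *example_create_file(void *priv, loff_t size)
{
	struct file *file;

	file = anon_inode_getfile_secure("[example]", &example_fops, priv,
					 O_RDWR, NULL);
	if (IS_ERR(file))
		return file;

	file_inode(file)->i_size = size;	/* private inode, safe to size */
	return file;
}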



[PATCH v13 14/35] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

2023-10-27 Thread Sean Christopherson
Add an "unmovable" flag for mappings that cannot be migrated under any
circumstance.  KVM will use the flag for its upcoming GUEST_MEMFD support,
which will not support compaction/migration, at least not in the
foreseeable future.

Test AS_UNMOVABLE under folio lock as already done for the async
compaction/dirty folio case, as the mapping can be removed by truncation
while compaction is running.  To avoid having to lock every folio with a
mapping, assume/require that unmovable mappings are also unevictable, and
have mapping_set_unmovable() also set AS_UNEVICTABLE.

Cc: Matthew Wilcox 
Co-developed-by: Vlastimil Babka 
Signed-off-by: Vlastimil Babka 
Signed-off-by: Sean Christopherson 
---
 include/linux/pagemap.h | 19 +-
 mm/compaction.c | 43 +
 mm/migrate.c|  2 ++
 3 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 351c3b7f93a1..82c9bf506b79 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -203,7 +203,8 @@ enum mapping_flags {
/* writeback related tags are not used */
AS_NO_WRITEBACK_TAGS = 5,
AS_LARGE_FOLIO_SUPPORT = 6,
-   AS_RELEASE_ALWAYS,  /* Call ->release_folio(), even if no private 
data */
+   AS_RELEASE_ALWAYS = 7,  /* Call ->release_folio(), even if no private 
data */
+   AS_UNMOVABLE= 8,/* The mapping cannot be moved, ever */
 };
 
 /**
@@ -289,6 +290,22 @@ static inline void mapping_clear_release_always(struct 
address_space *mapping)
clear_bit(AS_RELEASE_ALWAYS, &mapping->flags);
 }
 
+static inline void mapping_set_unmovable(struct address_space *mapping)
+{
+   /*
+* It's expected unmovable mappings are also unevictable. Compaction
+* migrate scanner (isolate_migratepages_block()) relies on this to
+* reduce page locking.
+*/
+   set_bit(AS_UNEVICTABLE, &mapping->flags);
+   set_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
+static inline bool mapping_unmovable(struct address_space *mapping)
+{
+   return test_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
return mapping->gfp_mask;
diff --git a/mm/compaction.c b/mm/compaction.c
index 38c8d216c6a3..12b828aed7c8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -883,6 +883,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
/* Time to isolate some pages for migration */
for (; low_pfn < end_pfn; low_pfn++) {
+   bool is_dirty, is_unevictable;
 
if (skip_on_failure && low_pfn >= next_skip_pfn) {
/*
@@ -1080,8 +1081,10 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if (!folio_test_lru(folio))
goto isolate_fail_put;
 
+   is_unevictable = folio_test_unevictable(folio);
+
/* Compaction might skip unevictable pages but CMA takes them */
-   if (!(mode & ISOLATE_UNEVICTABLE) && 
folio_test_unevictable(folio))
+   if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
goto isolate_fail_put;
 
/*
@@ -1093,26 +1096,42 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if ((mode & ISOLATE_ASYNC_MIGRATE) && 
folio_test_writeback(folio))
goto isolate_fail_put;
 
-   if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
-   bool migrate_dirty;
+   is_dirty = folio_test_dirty(folio);
+
+   if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) ||
+   (mapping && is_unevictable)) {
+   bool migrate_dirty = true;
+   bool is_unmovable;
 
/*
 * Only folios without mappings or that have
-* a ->migrate_folio callback are possible to
-* migrate without blocking.  However, we may
-* be racing with truncation, which can free
-* the mapping.  Truncation holds the folio lock
-* until after the folio is removed from the page
-* cache so holding it ourselves is sufficient.
+* a ->migrate_folio callback are possible to migrate
+* without blocking.
+*
+* Folios from unmovable mappings are not migratable.
+*
+* However, we can be racing with truncation, which can
+* free the mapping that we need to check. Truncation
+*
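
A minimal kernel-side sketch of the intended pairing, modeled on the
guest_memfd usage later in this series; the surrounding inode setup is
illustrative:

#include <linux/pagemap.h>

/*
 * Sketch: an owner whose pages must never be migrated marks its mapping
 * unmovable while configuring the inode.  mapping_set_unmovable() also sets
 * AS_UNEVICTABLE, which the sanity check below relies on.
 */
static void example_setup_unmovable_mapping(struct inode *inode)
{
	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
	mapping_set_unmovable(inode->i_mapping);
	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
}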

[PATCH v13 13/35] KVM: Introduce per-page memory attributes

2023-10-27 Thread Sean Christopherson
From: Chao Peng 

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
userspace to operate on the per-page memory attributes.
  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
a guest memory range.
  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
memory attributes.

Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.

Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
attributes/protections in the future, e.g. to give userspace fine-grained
control over read, write, and execute protections for guest memory.

Provide arch hooks for handling attribute changes before and after common
code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
relevant mappings, and the "post" hook to track whether or not hugepages
can be used to map the range.

To simplify the implementation wrap the entire sequence with
kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
guaranteed to be an invalidation.  For the initial use case, x86 *will*
always invalidate memory, and preventing arch code from creating new
mappings while the attributes are in flux makes it much easier to reason
about the correctness of consuming attributes.

It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.

Suggested-by: Sean Christopherson 
Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
Cc: Fuad Tabba 
Cc: Xu Yilun 
Cc: Mickaël Salaün 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 Documentation/virt/kvm/api.rst |  36 +
 include/linux/kvm_host.h   |  18 +++
 include/uapi/linux/kvm.h   |  13 ++
 virt/kvm/Kconfig   |   4 +
 virt/kvm/kvm_main.c| 233 +
 5 files changed, 304 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 860216536810..e2252c748fd6 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6091,6 +6091,42 @@ applied.
 
 See KVM_SET_USER_MEMORY_REGION.
 
+4.140 KVM_SET_MEMORY_ATTRIBUTES
+---
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes(in)
+:Returns: 0 on success, <0 on error
+
+KVM_SET_MEMORY_ATTRIBUTES allows userspace to set memory attributes for a range
+of guest physical memory.
+
+::
+
+  struct kvm_memory_attributes {
+   __u64 address;
+   __u64 size;
+   __u64 attributes;
+   __u64 flags;
+  };
+
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE   (1ULL << 3)
+
+The address and size must be page aligned.  The supported attributes can be
+retrieved via ioctl(KVM_CHECK_EXTENSION) on KVM_CAP_MEMORY_ATTRIBUTES.  If
+executed on a VM, KVM_CAP_MEMORY_ATTRIBUTES precisely returns the attributes
+supported by that VM.  If executed at system scope, KVM_CAP_MEMORY_ATTRIBUTES
+returns all attributes supported by KVM.  The only attribute defined at this
+time is KVM_MEMORY_ATTRIBUTE_PRIVATE, which marks the associated gfn as being
+guest private memory.
+
+Note, there is no "get" API.  Userspace is responsible for explicitly tracking
+the state of a gfn/page as needed.
+
+The "flags" field is reserved for future extensions and must be '0'.
+
 5. The kvm_run structure
 
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 89c1a991a3b8..df573229651b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,6 +256,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
 union kvm_mmu_notifier_arg {
pte_t pte;
+   unsigned long attributes;
 };
 
 struct kvm_gfn_range {
@@ -808,6 +809,9 @@ struct kvm {
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
struct notifier_block pm_notifier;
+#endif
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+   struct xarray mem_attr_array;
 #endif
char stats_id[KVM_STATS_NAME_SIZE];
 };
@@ -2340,4 +2344,18 @@ static inline void kvm_prepare_memory_fault_exit(struct 
kvm_vcpu *vcpu,
vcpu->run->memory_fault.flags = 0;
 }
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+static inline unsigned long kvm_get_memory_attributes(struct kvm 
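
A small userspace sketch of the two ioctls described above; it assumes a
linux/kvm.h carrying these definitions:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Sketch: check that the VM supports the PRIVATE attribute, then mark a
 * page-aligned range of guest memory private.
 */
static int set_range_private(int vm_fd, __u64 gpa, __u64 size)
{
	struct kvm_memory_attributes attrs = {
		.address = gpa,		/* must be page aligned */
		.size = size,		/* must be page aligned */
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		.flags = 0,		/* reserved, must be zero */
	};
	int supported;

	supported = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MEMORY_ATTRIBUTES);
	if (supported < 0 || !(supported & KVM_MEMORY_ATTRIBUTE_PRIVATE))
		return -1;

	return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}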

[PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-10-27 Thread Sean Christopherson
Add flags to "struct kvm_gfn_range" to let notifier events target only
shared and only private mappings, and write up the existing mmu_notifier
events to be shared-only (private memory is never associated with a
userspace virtual address, i.e. can't be reached via mmu_notifiers).

Add two flags so that KVM can handle the three possibilities (shared,
private, and shared+private) without needing something like a tri-state
enum.

Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com
Signed-off-by: Sean Christopherson 
---
 include/linux/kvm_host.h | 2 ++
 virt/kvm/kvm_main.c  | 7 +++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 96aa930536b1..89c1a991a3b8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -263,6 +263,8 @@ struct kvm_gfn_range {
gfn_t start;
gfn_t end;
union kvm_mmu_notifier_arg arg;
+   bool only_private;
+   bool only_shared;
bool may_block;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb9376833c18..302ccb87b4c1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
 * the second or later invocation of the handler).
 */
gfn_range.arg = range->arg;
+
+   /*
+* HVA-based notifications aren't relevant to private
+* mappings as they don't have a userspace mapping.
+*/
+   gfn_range.only_private = false;
+   gfn_range.only_shared = true;
gfn_range.may_block = range->may_block;
 
/*
-- 
2.42.0.820.g83a721a137-goog



[PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook

2023-10-27 Thread Sean Christopherson
Drop the .on_unlock() mmu_notifier hook now that it's no longer used for
notifying arch code that memory has been reclaimed.  Adding .on_unlock()
and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
resulted in .on_lock() and .on_unlock() having divergent and asymmetric
behavior, and set future developers up for failure, i.e. all but asked for
bugs where KVM relied on using .on_unlock() to try to run a callback while
holding mmu_lock.

Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
guard against future bugs of this nature.

Reported-by: Isaku Yamahata 
Link: https://lore.kernel.org/all/20230802203119.gb2021...@ls.amr.corp.intel.com
Signed-off-by: Sean Christopherson 
---
 virt/kvm/kvm_main.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2bc04c8ae1f4..cb9376833c18 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -544,7 +544,6 @@ static inline struct kvm *mmu_notifier_to_kvm(struct 
mmu_notifier *mn)
 typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm);
-typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
 struct kvm_mmu_notifier_range {
/*
@@ -556,7 +555,6 @@ struct kvm_mmu_notifier_range {
union kvm_mmu_notifier_arg arg;
gfn_handler_t handler;
on_lock_fn_t on_lock;
-   on_unlock_fn_t on_unlock;
bool flush_on_ret;
bool may_block;
 };
@@ -663,11 +661,8 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
if (range->flush_on_ret && r.ret)
kvm_flush_remote_tlbs(kvm);
 
-   if (r.found_memslot) {
+   if (r.found_memslot)
KVM_MMU_UNLOCK(kvm);
-   if (!IS_KVM_NULL_FN(range->on_unlock))
-   range->on_unlock(kvm);
-   }
 
srcu_read_unlock(&kvm->srcu, idx);
 
@@ -687,7 +682,6 @@ static __always_inline int kvm_handle_hva_range(struct 
mmu_notifier *mn,
.arg= arg,
.handler= handler,
.on_lock= (void *)kvm_null_fn,
-   .on_unlock  = (void *)kvm_null_fn,
.flush_on_ret   = true,
.may_block  = false,
};
@@ -706,7 +700,6 @@ static __always_inline int 
kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
.end= end,
.handler= handler,
.on_lock= (void *)kvm_null_fn,
-   .on_unlock  = (void *)kvm_null_fn,
.flush_on_ret   = false,
.may_block  = false,
};
@@ -813,7 +806,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct 
mmu_notifier *mn,
.end= range->end,
.handler= kvm_mmu_unmap_gfn_range,
.on_lock= kvm_mmu_invalidate_begin,
-   .on_unlock  = (void *)kvm_null_fn,
.flush_on_ret   = true,
.may_block  = mmu_notifier_range_blockable(range),
};
@@ -858,6 +850,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct 
mmu_notifier *mn,
 
 void kvm_mmu_invalidate_end(struct kvm *kvm)
 {
+   lockdep_assert_held_write(&kvm->mmu_lock);
+
/*
 * This sequence increase will notify the kvm page fault that
 * the page that is going to be mapped in the spte could have
@@ -889,7 +883,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct 
mmu_notifier *mn,
.end= range->end,
.handler= (void *)kvm_null_fn,
.on_lock= kvm_mmu_invalidate_end,
-   .on_unlock  = (void *)kvm_null_fn,
.flush_on_ret   = false,
.may_block  = mmu_notifier_range_blockable(range),
};
-- 
2.42.0.820.g83a721a137-goog



[PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory

2023-10-27 Thread Sean Christopherson
Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
__kvm_handle_hva_range() return whether or not an overlapping memslot
was found, i.e. mmu_lock was acquired.  Using the .on_unlock() hook
works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.

Use a small struct to return the tuple of the notifier-specific return,
plus whether or not overlap was found.  Because the iteration helpers are
__always_inlined, practically speaking, the struct will never actually be
returned from a function call (not to mention the size of the struct will
be two bytes in practice).

Signed-off-by: Sean Christopherson 
---
 virt/kvm/kvm_main.c | 53 +++--
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3f5b7c2c5327..2bc04c8ae1f4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -561,6 +561,19 @@ struct kvm_mmu_notifier_range {
bool may_block;
 };
 
+/*
+ * The inner-most helper returns a tuple containing the return value from the
+ * arch- and action-specific handler, plus a flag indicating whether or not at
+ * least one memslot was found, i.e. if the handler found guest memory.
+ *
+ * Note, most notifiers are averse to booleans, so even though KVM tracks the
+ * return from arch code as a bool, outer helpers will cast it to an int. :-(
+ */
+typedef struct kvm_mmu_notifier_return {
+   bool ret;
+   bool found_memslot;
+} kvm_mn_ret_t;
+
 /*
  * Use a dedicated stub instead of NULL to indicate that there is no callback
  * function/handler.  The compiler technically can't guarantee that a real
@@ -582,22 +595,25 @@ static const union kvm_mmu_notifier_arg 
KVM_MMU_NOTIFIER_NO_ARG;
 node;   \
 node = interval_tree_iter_next(node, start, last))  \
 
-static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
- const struct 
kvm_mmu_notifier_range *range)
+static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
+  const struct 
kvm_mmu_notifier_range *range)
 {
-   bool ret = false, locked = false;
+   struct kvm_mmu_notifier_return r = {
+   .ret = false,
+   .found_memslot = false,
+   };
struct kvm_gfn_range gfn_range;
struct kvm_memory_slot *slot;
struct kvm_memslots *slots;
int i, idx;
 
if (WARN_ON_ONCE(range->end <= range->start))
-   return 0;
+   return r;
 
/* A null handler is allowed if and only if on_lock() is provided. */
if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
 IS_KVM_NULL_FN(range->handler)))
-   return 0;
+   return r;
 
idx = srcu_read_lock(&kvm->srcu);
 
@@ -631,8 +647,8 @@ static __always_inline int __kvm_handle_hva_range(struct 
kvm *kvm,
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE 
- 1, slot);
gfn_range.slot = slot;
 
-   if (!locked) {
-   locked = true;
+   if (!r.found_memslot) {
+   r.found_memslot = true;
KVM_MMU_LOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_lock))
range->on_lock(kvm);
@@ -640,14 +656,14 @@ static __always_inline int __kvm_handle_hva_range(struct 
kvm *kvm,
if (IS_KVM_NULL_FN(range->handler))
break;
}
-   ret |= range->handler(kvm, &gfn_range);
+   r.ret |= range->handler(kvm, &gfn_range);
}
}
 
-   if (range->flush_on_ret && ret)
+   if (range->flush_on_ret && r.ret)
kvm_flush_remote_tlbs(kvm);
 
-   if (locked) {
+   if (r.found_memslot) {
KVM_MMU_UNLOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_unlock))
range->on_unlock(kvm);
@@ -655,8 +671,7 @@ static __always_inline int __kvm_handle_hva_range(struct 
kvm *kvm,
 
srcu_read_unlock(&kvm->srcu, idx);
 
-   /* The notifiers are averse to booleans. :-( */
-   return (int)ret;
+   return r;
 }
 
 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
@@ -677,7 +692,7 @@ static __always_inline int kvm_handle_hva_range(struct 
mmu_notifier *mn,
.may_block  = false,
};
 
-   return __kvm_handle_hva_range(kvm, &range);
+   return __kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static __always_inline int kvm_handle
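
A standalone illustration of the struct-return pattern described above,
independent of KVM's actual types; the names here are invented for the
example:

#include <stdbool.h>
#include <stdio.h>

/*
 * Return a tiny two-flag struct from an inlined walker so callers get both
 * the handler result and whether anything was found, without out-parameters.
 */
struct walk_ret {
	bool ret;
	bool found;
};

static inline struct walk_ret walk(const int *vals, int n)
{
	struct walk_ret r = { .ret = false, .found = false };
	int i;

	for (i = 0; i < n; i++) {
		r.found = true;
		r.ret |= vals[i] < 0;	/* stand-in for the per-range handler */
	}
	return r;
}

int main(void)
{
	int vals[] = { 1, -2, 3 };
	struct walk_ret r = walk(vals, 3);

	printf("found=%d ret=%d\n", r.found, r.ret);
	return 0;
}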

[PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-10-27 Thread Sean Christopherson
From: Chao Peng 

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory.  With guest private memory,
there will be two kind of memory conversions:

  - explicit conversion: happens when the guest explicitly calls into KVM
to map a range (as private or shared)

  - implicit conversion: happens when the guest attempts to access a gfn
that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!  Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.

Report the gpa+size instead of a single gfn even though the initial usage
is expected to always report single pages.  It's entirely possible, likely
even, that KVM will someday support sub-page granularity faults, e.g.
Intel's sub-page protection feature allows for additional protections at
128-byte granularity.

Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoor...@google.com
Link: https://lore.kernel.org/all/zq3amlo2syv3d...@google.com
Cc: Anish Moorthy 
Cc: David Matlack 
Suggested-by: Sean Christopherson 
Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 Documentation/virt/kvm/api.rst | 41 ++
 arch/x86/kvm/x86.c |  1 +
 include/linux/kvm_host.h   | 11 +
 include/uapi/linux/kvm.h   |  8 +++
 4 files changed, 61 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ace984acc125..860216536810 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6723,6 +6723,26 @@ array field represents return values. The userspace 
should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+   /* KVM_EXIT_MEMORY_FAULT */
+   struct {
+   __u64 flags;
+   __u64 gpa;
+   __u64 size;
+   } memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
+describes properties of the faulting access that are likely pertinent.
+Currently, no flags are defined.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT; userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
 ::
 
 /* KVM_EXIT_NOTIFY */
@@ -7757,6 +,27 @@ This capability is aimed to mitigate the threat that 
malicious VMs can
 cause CPU stuck (due to event windows don't open up) and make the CPU
 unavailable to host or other VMs.
 
+7.34 KVM_CAP_MEMORY_FAULT_INFO
+--
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN are encouraged to guard against repeatedly receiving the same
+error/annotated fault.
+
+See KVM_EXIT_MEMORY_FAULT 
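
A small userspace sketch of that validity rule; it must be called immediately
after KVM_RUN so errno still holds the value the ioctl set:

#include <errno.h>
#include <stdbool.h>
#include <linux/kvm.h>

/*
 * kvm_run.memory_fault is meaningful only when KVM_RUN failed with
 * EFAULT/EHWPOISON *and* exit_reason says so; for any other errno,
 * exit_reason may be stale.
 */
static bool memory_fault_is_valid(int kvm_run_ret, const struct kvm_run *run)
{
	if (kvm_run_ret >= 0)
		return false;
	if (errno != EFAULT && errno != EHWPOISON)
		return false;
	return run->exit_reason == KVM_EXIT_MEMORY_FAULT;
}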

[PATCH v13 08/35] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

2023-10-27 Thread Sean Christopherson
Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
information can be supplied without setting userspace up to fail.  The
padding in the new kvm_userspace_memory_region2 structure will be used to
pass a file descriptor in addition to the userspace_addr, i.e. allow
userspace to point at a file descriptor and map memory into a guest that
is NOT mapped into host userspace.

Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
makes detection of bad flags a bit more robust, e.g. if the new fd field
is guarded only by a flag and not a new ioctl(), then a userspace bug
(setting a "bad" flag) would generate out-of-bounds access instead of an
-EINVAL error.

Cc: Jarkko Sakkinen 
Reviewed-by: Paolo Bonzini 
Reviewed-by: Xiaoyao Li 
Signed-off-by: Sean Christopherson 
---
 Documentation/virt/kvm/api.rst | 21 +++
 arch/x86/kvm/x86.c |  2 +-
 include/linux/kvm_host.h   |  4 ++--
 include/uapi/linux/kvm.h   | 13 
 virt/kvm/kvm_main.c| 38 +++---
 5 files changed, 67 insertions(+), 11 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 21a7578142a1..ace984acc125 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6070,6 +6070,27 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using 
the SET_ONE_REG
 interface. No error will be returned, but the resulting offset will not be
 applied.
 
+4.139 KVM_SET_USER_MEMORY_REGION2
+-
+
+:Capability: KVM_CAP_USER_MEMORY2
+:Architectures: all
+:Type: vm ioctl
+:Parameters: struct kvm_userspace_memory_region2 (in)
+:Returns: 0 on success, -1 on error
+
+::
+
+  struct kvm_userspace_memory_region2 {
+   __u32 slot;
+   __u32 flags;
+   __u64 guest_phys_addr;
+   __u64 memory_size; /* bytes */
+   __u64 userspace_addr; /* start of the userspace allocated memory */
+  };
+
+See KVM_SET_USER_MEMORY_REGION.
+
 5. The kvm_run structure
 
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 41cce5031126..6409914428ca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12455,7 +12455,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, 
int id, gpa_t gpa,
}
 
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-   struct kvm_userspace_memory_region m;
+   struct kvm_userspace_memory_region2 m;
 
m.slot = id | (i << 16);
m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5faba69403ac..4e741ff27af3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1146,9 +1146,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_userspace_memory_region2 *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-   const struct kvm_userspace_memory_region *mem);
+   const struct kvm_userspace_memory_region2 *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 13065dd96132..bd1abe067f28 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+/* for KVM_SET_USER_MEMORY_REGION2 */
+struct kvm_userspace_memory_region2 {
+   __u32 slot;
+   __u32 flags;
+   __u64 guest_phys_addr;
+   __u64 memory_size;
+   __u64 userspace_addr;
+   __u64 pad[16];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
  * userspace, other bits are reserved for kvm internal use which are defined
@@ -1192,6 +1202,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_USER_MEMORY2 230
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1473,6 +1484,8 @@ struct kvm_vfio_spapr_tce {
struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR  _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
+#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
+struct kvm_userspace_memory_region2)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6e708017064d..3f5b7c2c5327 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/

[PATCH v13 07/35] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER

2023-10-27 Thread Sean Christopherson
Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior.  Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.

Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared.  PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.

  bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.
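
For reference, the common Kconfig symbol is expected to look roughly like the
below (the virt/kvm/Kconfig hunk is not shown in this excerpt, so treat this
as a sketch rather than a quote of the patch):

  config KVM_GENERIC_MMU_NOTIFIER
          select MMU_NOTIFIER
          bool

Each architecture then selects KVM_GENERIC_MMU_NOTIFIER instead of selecting
MMU_NOTIFIER directly, as the arch Kconfig hunks below do, and common code
keys its #ifdefs off CONFIG_KVM_GENERIC_MMU_NOTIFIER.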

Acked-by: Anup Patel 
Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h   |  2 --
 arch/arm64/kvm/Kconfig  |  2 +-
 arch/mips/include/asm/kvm_host.h|  2 --
 arch/mips/kvm/Kconfig   |  2 +-
 arch/powerpc/include/asm/kvm_host.h |  2 --
 arch/powerpc/kvm/Kconfig|  8 
 arch/powerpc/kvm/powerpc.c  |  4 +---
 arch/riscv/include/asm/kvm_host.h   |  2 --
 arch/riscv/kvm/Kconfig  |  2 +-
 arch/x86/include/asm/kvm_host.h |  2 --
 arch/x86/kvm/Kconfig|  2 +-
 include/linux/kvm_host.h|  6 +++---
 include/linux/kvm_types.h   |  1 +
 virt/kvm/Kconfig|  4 
 virt/kvm/kvm_main.c | 10 +-
 15 files changed, 22 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index af06ccb7ee34..9e046b64847a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -921,8 +921,6 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
 int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
  struct kvm_vcpu_events *events);
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 83c1e09be42e..1a15199f 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -22,7 +22,7 @@ menuconfig KVM
bool "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
select KVM_GENERIC_HARDWARE_ENABLING
-   select MMU_NOTIFIER
+   select KVM_GENERIC_MMU_NOTIFIER
select PREEMPT_NOTIFIERS
select HAVE_KVM_CPU_RELAX_INTERCEPT
select KVM_MMIO
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 54a85f1d4f2c..179f320cc231 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -810,8 +810,6 @@ int kvm_mips_mkclean_gpa_pt(struct kvm *kvm, gfn_t 
start_gfn, gfn_t end_gfn);
 pgd_t *kvm_pgd_alloc(void);
 void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 /* Emulation */
 enum emulation_result update_pc(struct kvm_vcpu *vcpu, u32 cause);
 int kvm_get_badinstr(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index a8cdba75f98d..c04987d2ed2e 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -25,7 +25,7 @@ config KVM
select HAVE_KVM_EVENTFD
select HAVE_KVM_VCPU_ASYNC_IOCTL
select KVM_MMIO
-   select MMU_NOTIFIER
+   select KVM_GENERIC_MMU_NOTIFIER
select INTERVAL_TREE
select KVM_GENERIC_HARDWARE_ENABLING
help
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..4b5c3f2acf78 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -62,8 +62,6 @@
 
 #include 
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 #define HPTEG_CACHE_NUM		(1 << 15)
 #define HPTEG_HASH_BITS_PTE		13
 #define HPTEG_HASH_BITS_PTE_LONG   12
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 902611954200..b33358ee6424 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -42,7 +42,7 @@ config KVM_BOOK3S_64_HANDLER
 config KVM_BOOK3S_PR_POSSIBLE
bool
select KVM_MMIO
-   select MMU_NOTIFIER
+   select KVM_GENERIC_MMU_NOTIFIER
 
 config KVM_BOOK3S_HV_POSSIBLE
bool
@@ -85,7 +85,7 @@ config KVM_BOOK3S_64_HV
tristate "KVM for POWER7 and later using hypervisor mode in host"
depends on KVM_BOOK3S_64 && PPC_POWERNV
select KVM_BOOK3S_HV_POSSIBLE
-   select MMU_NOTIFIER
+   select KVM_GENERIC_MMU_NOTIFIER
select CMA
help
  Support running unmodified book3s_64 guest kernels in
@@ -194,7 +194,7 @@

[PATCH v13 06/35] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU

2023-10-27 Thread Sean Christopherson
Advertise that KVM's MMU is synchronized with the primary MMU for all
flavors of PPC KVM support, i.e. advertise that the MMU is synchronized
when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y but the VM is not using hypervisor
mode (a.k.a. PR VMs).  PR VMs, via kvm_unmap_gfn_range_pr(), do the right
thing for mmu_notifier invalidation events, and more tellingly, KVM
returns '1' for KVM_CAP_SYNC_MMU when CONFIG_KVM_BOOK3S_HV_POSSIBLE=n
and CONFIG_KVM_BOOK3S_PR_POSSIBLE=y, i.e. KVM already advertises a
synchronized MMU for PR VMs, just not when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y.

Suggested-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/powerpc/kvm/powerpc.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b0a512ede764..8d3ec483bc2b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -635,11 +635,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
BUILD_BUG();
 #endif
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-   r = hv_enabled;
-#else
r = 1;
-#endif
break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
case KVM_CAP_PPC_HTAB_FD:
-- 
2.42.0.820.g83a721a137-goog



[PATCH v13 05/35] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER

2023-10-27 Thread Sean Christopherson
Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are
defined when KVM is enabled, and return '1' unconditionally for the
CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path.  All flavors of PPC support for KVM
select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally
defined by arch/powerpc/include/asm/kvm_host.h.

Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a
future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e.
will allow combining all of the

  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)

checks into a single

  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER

without having to worry about PPC's "bare" usage of
KVM_ARCH_WANT_MMU_NOTIFIER.

Signed-off-by: Sean Christopherson 
---
 arch/powerpc/kvm/powerpc.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 7197c8256668..b0a512ede764 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -632,12 +632,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
break;
 #endif
case KVM_CAP_SYNC_MMU:
+#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+   BUILD_BUG();
+#endif
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
r = hv_enabled;
-#elif defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-   r = 1;
 #else
-   r = 0;
+   r = 1;
 #endif
break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-- 
2.42.0.820.g83a721a137-goog



[PATCH v13 04/35] KVM: WARN if there are dangling MMU invalidations at VM destruction

2023-10-27 Thread Sean Christopherson
Add an assertion that there are no in-progress MMU invalidations when a
VM is being destroyed, with the exception of the scenario where KVM
unregisters its MMU notifier between an .invalidate_range_start() call and
the corresponding .invalidate_range_end().

KVM can't detect unpaired calls from the mmu_notifier due to the above
exception waiver, but the assertion can detect KVM bugs, e.g. such as the
bug that *almost* escaped initial guest_memfd development.

Link: 
https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5...@linux.intel.com
Signed-off-by: Sean Christopherson 
---
 virt/kvm/kvm_main.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1a577a25de47..4dba682586ee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1356,9 +1356,16 @@ static void kvm_destroy_vm(struct kvm *kvm)
 * No threads can be waiting in kvm_swap_active_memslots() as the
 * last reference on KVM has been dropped, but freeing
 * memslots would deadlock without this manual intervention.
+*
+* If the count isn't unbalanced, i.e. KVM did NOT unregister its MMU
+* notifier between a start() and end(), then there shouldn't be any
+* in-progress invalidations.
 */
	WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
-   kvm->mn_active_invalidate_count = 0;
+   if (kvm->mn_active_invalidate_count)
+   kvm->mn_active_invalidate_count = 0;
+   else
+   WARN_ON(kvm->mmu_invalidate_in_progress);
 #else
kvm_flush_shadow_all(kvm);
 #endif
-- 
2.42.0.820.g83a721a137-goog



[PATCH v13 03/35] KVM: Use gfn instead of hva for mmu_notifier_retry

2023-10-27 Thread Sean Christopherson
From: Chao Peng 

Currently, in the mmu_notifier invalidation path, the hva range is recorded
and then checked against by mmu_invalidate_retry_hva() in the page fault
handling path. However, for the soon-to-be-introduced private memory, a page
fault may not have an associated hva, so checking the gfn (gpa) makes more
sense.

For existing hva-based shared memory, the gfn is expected to work as well.
The only downside is that when multiple gfns alias a single hva, the current
algorithm of checking multiple ranges could result in a much larger range
being rejected. Such aliasing should be uncommon, so the impact is expected
to be small.
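
As a sketch of the retry protocol with the gfn-based check (simplified:
look_up_pfn() and the surrounding variables are stand-ins for the real
fault-handler code, see the x86 hunks below for actual usage):

  mmu_seq = vcpu->kvm->mmu_invalidate_seq;
  smp_rmb();                      /* pairs with the invalidation side */

  pfn = look_up_pfn(slot, gfn);   /* may race with an invalidation */

  read_lock(&vcpu->kvm->mmu_lock);
  if (mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, gfn)) {
          /* An invalidation covering this gfn raced with the lookup. */
          read_unlock(&vcpu->kvm->mmu_lock);
          goto retry;
  }
  /* ... install the mapping for gfn ... */
  read_unlock(&vcpu->kvm->mmu_lock);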

Suggested-by: Sean Christopherson 
Cc: Xu Yilun 
Signed-off-by: Chao Peng 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/mmu/mmu.c   | 10 ++
 arch/x86/kvm/vmx/vmx.c   | 11 +--
 include/linux/kvm_host.h | 33 
 virt/kvm/kvm_main.c  | 41 +++-
 4 files changed, 64 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f7901cb4d2fa..d33657d61d80 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3056,7 +3056,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, 
u64 *sptep)
  *
  * There are several ways to safely use this helper:
  *
- * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before
+ * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before
  *   consuming it.  In this case, mmu_lock doesn't need to be held during the
  *   lookup, but it does need to be held while checking the MMU notifier.
  *
@@ -4358,7 +4358,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
return true;
 
return fault->slot &&
-  mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
+  mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault 
*fault)
@@ -6245,7 +6245,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, 
gfn_t gfn_end)
 
	write_lock(&kvm->mmu_lock);
 
-   kvm_mmu_invalidate_begin(kvm, 0, -1ul);
+   kvm_mmu_invalidate_begin(kvm);
+
+   kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
 
flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6255,7 +6257,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, 
gfn_t gfn_end)
if (flush)
kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - 
gfn_start);
 
-   kvm_mmu_invalidate_end(kvm, 0, -1ul);
+   kvm_mmu_invalidate_end(kvm);
 
	write_unlock(&kvm->mmu_lock);
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 72e3943f3693..6e502ba93141 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6757,10 +6757,10 @@ static void vmx_set_apic_access_page_addr(struct 
kvm_vcpu *vcpu)
return;
 
/*
-* Grab the memslot so that the hva lookup for the mmu_notifier retry
-* is guaranteed to use the same memslot as the pfn lookup, i.e. rely
-* on the pfn lookup's validation of the memslot to ensure a valid hva
-* is used for the retry check.
+* Explicitly grab the memslot using KVM's internal slot ID to ensure
+* KVM doesn't unintentionally grab a userspace memslot.  It _should_
+* be impossible for userspace to create a memslot for the APIC when
+* APICv is enabled, but paranoia won't hurt in this case.
 */
slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
@@ -6785,8 +6785,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu 
*vcpu)
return;
 
	read_lock(&vcpu->kvm->mmu_lock);
-   if (mmu_invalidate_retry_hva(kvm, mmu_seq,
-gfn_to_hva_memslot(slot, gfn))) {
+   if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
	read_unlock(&vcpu->kvm->mmu_lock);
goto out;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb6c6109fdca..11d091688346 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -787,8 +787,8 @@ struct kvm {
struct mmu_notifier mmu_notifier;
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
-   unsigned long mmu_invalidate_range_start;
-   unsigned long mmu_invalidate_range_end;
+   gfn_t mmu_invalidate_range_start;
+   gfn_t mmu_invalidate_range_end;
 #endif
struct list_head devices;
u64 manual_dirty_log_protect;
@@ -1392,10 +1392,9 @@ void kvm_mmu_free_memory_cache(struct 
kvm_mmu_memory_cache *mc);
 void *

[PATCH v13 02/35] KVM: Assert that mmu_invalidate_in_progress *never* goes negative

2023-10-27 Thread Sean Christopherson
Move the assertion on the in-progress invalidation count from the primary
MMU's notifier path to KVM's common notification path, i.e. assert that
the count doesn't go negative even when the invalidation is coming from
KVM itself.

Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only
the affected VM, not the entire kernel.  A corrupted count is fatal to the
VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry()
to block any and all attempts to install new mappings.  But it's far from
guaranteed that an end() without a start() is fatal or even problematic to
anything other than the target VM, e.g. the underlying bug could simply be
a duplicate call to end().  And it's much more likely that a missed
invalidation, i.e. a potential use-after-free, would manifest as no
notification whatsoever, not an end() without a start().

Signed-off-by: Sean Christopherson 
---
 virt/kvm/kvm_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0524933856d4..5a97e6c7d9c2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -833,6 +833,7 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long 
start,
 * in conjunction with the smp_rmb in mmu_invalidate_retry().
 */
kvm->mmu_invalidate_in_progress--;
+   KVM_BUG_ON(kvm->mmu_invalidate_in_progress < 0, kvm);
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
@@ -863,8 +864,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct 
mmu_notifier *mn,
 */
if (wake)
	rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait);
-
-   BUG_ON(kvm->mmu_invalidate_in_progress < 0);
 }
 
 static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
-- 
2.42.0.820.g83a721a137-goog


