Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-11-02 Thread Binbin Wu




On 10/28/2023 2:21 AM, Sean Christopherson wrote:

Add flags to "struct kvm_gfn_range" to let notifier events target only
shared and only private mappings, and wire up the existing mmu_notifier
events to be shared-only (private memory is never associated with a
userspace virtual address, i.e. can't be reached via mmu_notifiers).

Add two flags so that KVM can handle the three possibilities (shared,
private, and shared+private) without needing something like a tri-state
enum.
I see the two flags are set/cleared in __kvm_handle_hva_range() in this patch
and in kvm_handle_gfn_range() in the later patch 13/35, but unless I missed
something, they are never read anywhere in this patch series.  How are they
supposed to be used in KVM?
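
For illustration, a hypothetical arch-side consumer of the two flags might look
like the sketch below; zap_private_mappings() and zap_shared_mappings() are
invented names, not helpers from this series:

static bool kvm_arch_zap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
	bool flush = false;

	/* Skip whichever half of the mappings the caller excluded. */
	if (!range->only_shared)
		flush |= zap_private_mappings(kvm, range->start, range->end);

	if (!range->only_private)
		flush |= zap_shared_mappings(kvm, range->start, range->end);

	return flush;
}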




Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com
Signed-off-by: Sean Christopherson 
---
  include/linux/kvm_host.h | 2 ++
  virt/kvm/kvm_main.c  | 7 +++
  2 files changed, 9 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 96aa930536b1..89c1a991a3b8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -263,6 +263,8 @@ struct kvm_gfn_range {
gfn_t start;
gfn_t end;
union kvm_mmu_notifier_arg arg;
+   bool only_private;
+   bool only_shared;
bool may_block;
  };
  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cb9376833c18..302ccb87b4c1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
 * the second or later invocation of the handler).
 */
gfn_range.arg = range->arg;
+
+   /*
+* HVA-based notifications aren't relevant to private
+* mappings as they don't have a userspace mapping.
+*/
+   gfn_range.only_private = false;
+   gfn_range.only_shared = true;
gfn_range.may_block = range->may_block;
  
  			/*




Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-01 Thread Binbin Wu




On 10/28/2023 2:21 AM, Sean Christopherson wrote:

From: Chao Peng 

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory.  With guest private memory,
there will be two kind of memory conversions:

   - explicit conversion: happens when the guest explicitly calls into KVM
 to map a range (as private or shared)

   - implicit conversion: happens when the guest attempts to access a gfn
 that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!

Is the "-EHWPOISON" case not considered unresolved, and is that why it is not
mentioned here?


Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.

Report the gpa+size instead of a single gfn even though the initial usage
is expected to always report single pages.  It's entirely possible, likely
even, that KVM will someday support sub-page granularity faults, e.g.
Intel's sub-page protection feature allows for additional protections at
128-byte granularity.
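
For illustration, a minimal userspace-side handler consistent with the contract
described above might look like the sketch below.  The field name follows the
"kvm_run.memory_fault" wording used in the documentation; vcpu_fd, run (the
mmap()ed vcpu state), handle_memory_fault() and handle_run_error() are all
assumed/invented:

	if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
		if ((errno == EFAULT || errno == EHWPOISON) &&
		    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			/*
			 * gpa/size describe the unresolved range; resolving it
			 * (e.g. converting between shared and private) is
			 * entirely up to the VMM.
			 */
			handle_memory_fault(run->memory_fault.gpa,
					    run->memory_fault.size,
					    run->memory_fault.flags);
		} else {
			/* exit_reason is stale/undefined for other errors. */
			handle_run_error(errno);
		}
	}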

Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoor...@google.com
Link: https://lore.kernel.org/all/zq3amlo2syv3d...@google.com
Cc: Anish Moorthy 
Cc: David Matlack 
Suggested-by: Sean Christopherson 
Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
  Documentation/virt/kvm/api.rst | 41 ++
  arch/x86/kvm/x86.c |  1 +
  include/linux/kvm_host.h   | 11 +
  include/uapi/linux/kvm.h   |  8 +++
  4 files changed, 61 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ace984acc125..860216536810 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6723,6 +6723,26 @@ array field represents return values. The userspace 
should update the return
  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
  spec refer, https://github.com/riscv/riscv-sbi-doc.
  
+::

+
+   /* KVM_EXIT_MEMORY_FAULT */
+   struct {
+   __u64 flags;
+   __u64 gpa;
+   __u64 size;
+   } memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
+describes properties of the faulting access that are likely pertinent.
+Currently, no flags are defined.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT, userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
  ::
  
  /* KVM_EXIT_NOTIFY */

@@ -7757,6 +,27 @@ This capability is aimed to mitigate the threat that 
malicious VMs can
  cause CPU stuck (due to event windows don't open up) and make the CPU
  unavailable to host or other VMs.
  
+7.34 KVM_CAP_MEMORY_FAULT_INFO

+--
+
+:Architectures: x86
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that KVM_RUN will fill
+kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
+there is a valid memslot but no backing VMA for the corresponding host virtual
+address.
+
+The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
+an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
+to KVM_EXIT_MEMORY_FAULT.
+
+Note: Userspaces which attempt to resolve memory faults so that they can retry
+KVM_RUN 

Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes

2023-09-26 Thread Binbin Wu




On 9/21/2023 5:03 AM, Sean Christopherson wrote:

On Mon, Sep 18, 2023, Binbin Wu wrote:


On 9/14/2023 9:55 AM, Sean Christopherson wrote:

From: Chao Peng 

[...]

+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+/*
+ * Returns true if _all_ gfns in the range [@start, @end) have attributes
+ * matching @attrs.
+ */
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+unsigned long attrs)
+{
+   XA_STATE(xas, &kvm->mem_attr_array, start);
+   unsigned long index;
+   bool has_attrs;
+   void *entry;
+
+   rcu_read_lock();
+
+   if (!attrs) {
+   has_attrs = !xas_find(&xas, end);

IIUIC, xas_find() is inclusive for "end", so it should be "end - 1" here?

Yes, that does appear to be the case.  Inclusive vs. exclusive on gfn ranges is
the bane of my existence.


Seems this one is not included in the "KVM: guest_memfd fixes" patch series?
https://lore.kernel.org/kvm/20230921203331.3746712-1-sea...@google.com/
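
For reference, the xarray search helpers take an inclusive maximum index, so
the exclusive [start, end) range has to be converted at the call site.  A
sketch of the fix suggested above, assuming the quoted
kvm_range_has_memory_attributes():

	if (!attrs) {
		/*
		 * xas_find() searches up to and *including* its max argument,
		 * so pass 'end - 1' to cover the exclusive range [start, end).
		 */
		has_attrs = !xas_find(&xas, end - 1);
		goto out;
	}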





Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-09-20 Thread Binbin Wu




On 9/20/2023 10:24 PM, Sean Christopherson wrote:

On Tue, Sep 19, 2023, Binbin Wu wrote:


On 9/14/2023 9:55 AM, Sean Christopherson wrote:
[...]

+
+static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
+ pgoff_t end)
+{
+   struct kvm_memory_slot *slot;
+   struct kvm *kvm = gmem->kvm;
+   unsigned long index;
+   bool flush = false;
+
+   KVM_MMU_LOCK(kvm);
+
+   kvm_mmu_invalidate_begin(kvm);
+
+   xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+   pgoff_t pgoff = slot->gmem.pgoff;
+
+   struct kvm_gfn_range gfn_range = {
+   .start = slot->base_gfn + max(pgoff, start) - pgoff,
+   .end = slot->base_gfn + min(pgoff + slot->npages, end) 
- pgoff,
+   .slot = slot,
+   .may_block = true,
+   };
+
+   flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+   }
+
+   if (flush)
+   kvm_flush_remote_tlbs(kvm);
+
+   KVM_MMU_UNLOCK(kvm);
+}
+
+static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
+   pgoff_t end)
+{
+   struct kvm *kvm = gmem->kvm;
+
+   KVM_MMU_LOCK(kvm);
+   if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT))
+   kvm_mmu_invalidate_end(kvm);

kvm_mmu_invalidate_begin() is called unconditionally in
kvm_gmem_invalidate_begin(), but kvm_mmu_invalidate_end() is not called
unconditionally here.
This makes the kvm_gmem_invalidate_{begin, end}() calls asymmetric.

Another ouch :-(

And there should be no need to acquire mmu_lock() unconditionally, the inode's
mutex protects the bindings, not mmu_lock.

I'll get a fix posted today.  I think KVM can also add a sanity check to detect
unresolved invalidations, e.g.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7ba1ab1832a9..2a2d18070856 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1381,8 +1381,13 @@ static void kvm_destroy_vm(struct kvm *kvm)
  * No threads can be waiting in kvm_swap_active_memslots() as the
  * last reference on KVM has been dropped, but freeing
  * memslots would deadlock without this manual intervention.
+*
+* If the count isn't unbalanced, i.e. KVM did NOT unregister between
+* a start() and end(), then there shouldn't be any in-progress
+* invalidations.
  */
 WARN_ON(rcuwait_active(&kvm->mn_memslots_update_rcuwait));
+   WARN_ON(!kvm->mn_active_invalidate_count && kvm->mmu_invalidate_in_progress);
 kvm->mn_active_invalidate_count = 0;
  #else
 kvm_flush_shadow_all(kvm);


or an alternative style

if (kvm->mn_active_invalidate_count)
	kvm->mn_active_invalidate_count = 0;
else
	WARN_ON(kvm->mmu_invalidate_in_progress);


+   KVM_MMU_UNLOCK(kvm);
+}
+
+static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+{
+   struct list_head *gmem_list = &inode->i_mapping->private_list;
+   pgoff_t start = offset >> PAGE_SHIFT;
+   pgoff_t end = (offset + len) >> PAGE_SHIFT;
+   struct kvm_gmem *gmem;
+
+   /*
+* Bindings must be stable across invalidation to ensure the start+end
+* are balanced.
+*/
+   filemap_invalidate_lock(inode->i_mapping);
+
+   list_for_each_entry(gmem, gmem_list, entry) {
+   kvm_gmem_invalidate_begin(gmem, start, end);
+   kvm_gmem_invalidate_end(gmem, start, end);
+   }

Why loop over each gmem in gmem_list here?

IIUIC, offset is the offset into the inode, so it is only meaningful to the
inode passed in, i.e. only to the gmem bound to that inode, not to the others.

The code is structured to allow for multiple gmem instances per inode.  This
isn't actually possible in the initial code base, but it's on the horizon[*].
I included the list-based infrastructure in this initial series to ensure that
guest_memfd can actually support multiple files per inode, and to minimize the
churn when the "link" support comes along.

[*] https://lore.kernel.org/all/cover.1691446946.git.ackerley...@google.com

Got it, thanks for the explanation!





Re: [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory

2023-09-20 Thread Binbin Wu




On 9/15/2023 10:26 PM, Sean Christopherson wrote:

On Fri, Sep 15, 2023, Yan Zhao wrote:

On Wed, Sep 13, 2023 at 06:55:16PM -0700, Sean Christopherson wrote:


+static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+   kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
+ PAGE_SIZE, fault->write, fault->exec,
+ fault->is_private);
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+  struct kvm_page_fault *fault)
+{
+   int max_order, r;
+
+   if (!kvm_slot_can_be_private(fault->slot)) {
+   kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+   return -EFAULT;
+   }
+
+   r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
+                        &max_order);
+   if (r) {
+   kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+   return r;
+   }
+
+   fault->max_level = min(kvm_max_level_for_order(max_order),
+  fault->max_level);
+   fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
+
+   return RET_PF_CONTINUE;
+}
+
  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault 
*fault)
  {
struct kvm_memory_slot *slot = fault->slot;
@@ -4293,6 +4356,14 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, 
struct kvm_page_fault *fault
return RET_PF_EMULATE;
}
  
+	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {

In patch 21, fault->is_private is set as:
".is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT)",
so the inequality here means the memory attribute has been updated after the
last check.  So why is an exit to userspace for conversion required instead of
a mere retry?

Or, is it because the way .is_private is assigned in patch 21 is subject to
change in the future?

This.  Retrying on SNP or TDX would hang the guest.  I suppose we could special
case VMs where .is_private is derived from the memory attributes, but the
SW_PROTECTED_VM type is primarily a development vehicle at this point.  I'd like
to have it mimic SNP/TDX as much as possible; performance is a secondary concern.
So when .is_private is derived from the memory attributes, and if I didn't miss
anything, there is no explicit conversion mechanism introduced so far.  Does it
mean that for a pure sw-protected VM (without SNP/TDX), the page fault will be
handled according to the memory attributes set up by the host/user VMM, and no
implicit conversion will be triggered?




E.g. userspace needs to be prepared for "spurious" exits due to races on SNP and
TDX, which this can theoretically exercise.  Though the window is quite small so
I doubt that'll actually happen in practice; which of course also makes it less
important to retry instead of exiting.
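
A hypothetical way for userspace to tolerate such spurious exits is to make the
conversion idempotent, e.g. consult the VMM's own tracking before issuing the
conversion.  In the sketch below, vm_fd, gpa, size, to_private and
vmm_range_is_private() are all assumed; only KVM_SET_MEMORY_ATTRIBUTES and
KVM_MEMORY_ATTRIBUTE_PRIVATE come from the series:

	struct kvm_memory_attributes attrs = {
		.address    = gpa,
		.size       = size,
		.attributes = to_private ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
	};

	/*
	 * If the range is already in the requested state, skip the conversion
	 * and simply re-enter the guest.
	 */
	if (vmm_range_is_private(gpa, size) != to_private)
		ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);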




Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-09-19 Thread Binbin Wu




On 9/14/2023 9:55 AM, Sean Christopherson wrote:
[...]

+
+static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
+ pgoff_t end)
+{
+   struct kvm_memory_slot *slot;
+   struct kvm *kvm = gmem->kvm;
+   unsigned long index;
+   bool flush = false;
+
+   KVM_MMU_LOCK(kvm);
+
+   kvm_mmu_invalidate_begin(kvm);
+
+   xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
+   pgoff_t pgoff = slot->gmem.pgoff;
+
+   struct kvm_gfn_range gfn_range = {
+   .start = slot->base_gfn + max(pgoff, start) - pgoff,
+   .end = slot->base_gfn + min(pgoff + slot->npages, end) 
- pgoff,
+   .slot = slot,
+   .may_block = true,
+   };
+
+   flush |= kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
+   }
+
+   if (flush)
+   kvm_flush_remote_tlbs(kvm);
+
+   KVM_MMU_UNLOCK(kvm);
+}
+
+static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
+   pgoff_t end)
+{
+   struct kvm *kvm = gmem->kvm;
+
+   KVM_MMU_LOCK(kvm);
+   if (xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT))
+   kvm_mmu_invalidate_end(kvm);
kvm_mmu_invalidate_begin() is called unconditionally in
kvm_gmem_invalidate_begin(), but kvm_mmu_invalidate_end() is not called
unconditionally here.
This makes the kvm_gmem_invalidate_{begin, end}() calls asymmetric.



+   KVM_MMU_UNLOCK(kvm);
+}
+
+static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
+{
+   struct list_head *gmem_list = &inode->i_mapping->private_list;
+   pgoff_t start = offset >> PAGE_SHIFT;
+   pgoff_t end = (offset + len) >> PAGE_SHIFT;
+   struct kvm_gmem *gmem;
+
+   /*
+* Bindings must be stable across invalidation to ensure the start+end
+* are balanced.
+*/
+   filemap_invalidate_lock(inode->i_mapping);
+
+   list_for_each_entry(gmem, gmem_list, entry) {
+   kvm_gmem_invalidate_begin(gmem, start, end);
+   kvm_gmem_invalidate_end(gmem, start, end);
+   }

Why loop over each gmem in gmem_list here?

IIUIC, offset is the offset into the inode, so it is only meaningful to the
inode passed in, i.e. only to the gmem bound to that inode, not to the others.



+
+   filemap_invalidate_unlock(inode->i_mapping);
+
+   return 0;
+}
+

[...]


Re: [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes

2023-09-18 Thread Binbin Wu




On 9/14/2023 9:55 AM, Sean Christopherson wrote:

From: Chao Peng 

[...]
  
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES

+/*
+ * Returns true if _all_ gfns in the range [@start, @end) have attributes
+ * matching @attrs.
+ */
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+unsigned long attrs)
+{
+   XA_STATE(xas, &kvm->mem_attr_array, start);
+   unsigned long index;
+   bool has_attrs;
+   void *entry;
+
+   rcu_read_lock();
+
+   if (!attrs) {
+   has_attrs = !xas_find(&xas, end);

IIUIC, xas_find() is inclusive for "end", so it should be "end - 1" here?



+   goto out;
+   }
+
+   has_attrs = true;
+   for (index = start; index < end; index++) {
+   do {
+   entry = xas_next(&xas);
+   } while (xas_retry(&xas, entry));
+
+   if (xas.xa_index != index || xa_to_value(entry) != attrs) {
+   has_attrs = false;
+   break;
+   }
+   }
+
+out:
+   rcu_read_unlock();
+   return has_attrs;
+}
+


[...]


Re: [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events

2023-09-17 Thread Binbin Wu




On 9/14/2023 9:55 AM, Sean Christopherson wrote:

Add flags to "struct kvm_gfn_range" to let notifier events target only
shared and only private mappings, and wire up the existing mmu_notifier
events to be shared-only (private memory is never associated with a
userspace virtual address, i.e. can't be reached via mmu_notifiers).

Add two flags so that KVM can handle the three possibilities (shared,
private, and shared+private) without needing something like a tri-state
enum.


How should the word "stage" in the shortlog be understood?




Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com
Signed-off-by: Sean Christopherson 
---
  include/linux/kvm_host.h | 2 ++
  virt/kvm/kvm_main.c  | 7 +++
  2 files changed, 9 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d8c6ce6c8211..b5373cee2b08 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -263,6 +263,8 @@ struct kvm_gfn_range {
gfn_t start;
gfn_t end;
union kvm_mmu_notifier_arg arg;
+   bool only_private;
+   bool only_shared;
bool may_block;
  };
  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 174de2789657..a41f8658dfe0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
 * the second or later invocation of the handler).
 */
gfn_range.arg = range->arg;
+
+   /*
+* HVA-based notifications aren't relevant to private
+* mappings as they don't have a userspace mapping.
+*/
+   gfn_range.only_private = false;
+   gfn_range.only_shared = true;
gfn_range.may_block = range->may_block;
  
  			/*




Re: [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry

2023-09-13 Thread Binbin Wu




On 9/14/2023 9:55 AM, Sean Christopherson wrote:

From: Chao Peng 

Currently in mmu_notifier invalidate path, hva range is recorded and
then checked against by mmu_notifier_retry_hva() in the page fault
handling path. However, for the to be introduced private memory, a page
fault may not have a hva associated, checking gfn(gpa) makes more sense.

For existing hva based shared memory, gfn is expected to also work. The
only downside is when aliasing multiple gfns to a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected small.
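
For context, the gfn-based check is consumed by the page fault path roughly as
in the sketch below (heavily simplified; mmu_invalidate_retry_gfn() is assumed
to be the name of the gfn-based helper this conversion introduces, and
RET_PF_RETRY is x86-specific):

	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	/* ... translate the fault to a pfn, possibly sleeping ... */

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, fault->gfn)) {
		/* The gfn may have been invalidated, redo the fault. */
		r = RET_PF_RETRY;
		goto out_unlock;
	}
	/* ... install the mapping under mmu_lock ... */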

Suggested-by: Sean Christopherson 
Signed-off-by: Chao Peng 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson 
---
  arch/x86/kvm/mmu/mmu.c   | 10 ++
  arch/x86/kvm/vmx/vmx.c   | 11 +--
  include/linux/kvm_host.h | 33 +
  virt/kvm/kvm_main.c  | 40 +++-
  4 files changed, 63 insertions(+), 31 deletions(-)


[...]
  
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,

- unsigned long end)
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
  {
+   lockdep_assert_held_write(&kvm->mmu_lock);
/*
 * The count increase must become visible at unlock time as no
 * spte can be established without taking the mmu_lock and
 * count is also read inside the mmu_lock critical section.
 */
kvm->mmu_invalidate_in_progress++;
+
+   if (likely(kvm->mmu_invalidate_in_progress == 1))
+   kvm->mmu_invalidate_range_start = INVALID_GPA;
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+   lockdep_assert_held_write(&kvm->mmu_lock);
+
+   WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
if (likely(kvm->mmu_invalidate_in_progress == 1)) {
kvm->mmu_invalidate_range_start = start;
kvm->mmu_invalidate_range_end = end;
@@ -771,6 +781,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned 
long start,
}
  }
  
+static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)

+{
+   kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+   return kvm_unmap_gfn_range(kvm, range);
+}
+
  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
  {
@@ -778,7 +794,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct 
mmu_notifier *mn,
const struct kvm_mmu_notifier_range hva_range = {
.start  = range->start,
.end= range->end,
-   .handler= kvm_unmap_gfn_range,
+   .handler= kvm_mmu_unmap_gfn_range,
.on_lock= kvm_mmu_invalidate_begin,
.on_unlock  = kvm_arch_guest_memory_reclaimed,
.flush_on_ret   = true,
@@ -817,8 +833,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct 
mmu_notifier *mn,
return 0;
  }
  
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,

-   unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm)
  {
/*
 * This sequence increase will notify the kvm page fault that
@@ -833,6 +848,13 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long 
start,
 * in conjunction with the smp_rmb in mmu_invalidate_retry().
 */
kvm->mmu_invalidate_in_progress--;
+
+   /*
+* Assert that at least one range must be added between start() and
+* end().  Not adding a range isn't fatal, but it is a KVM bug.
+*/
+   WARN_ON_ONCE(kvm->mmu_invalidate_in_progress &&
+                kvm->mmu_invalidate_range_start == INVALID_GPA);
Should the check happen before the decrement of kvm->mmu_invalidate_in_progress?
Otherwise, if KVM just calls kvm_mmu_invalidate_begin() and then
kvm_mmu_invalidate_end() without adding a range, the check will not take effect
(see the sketch below).


  }
  
  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
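
A sketch of the rearrangement implied by the comment above (purely illustrative,
not necessarily how this was resolved upstream): perform the sanity check while
mmu_invalidate_in_progress still reflects this invalidation, i.e. before the
decrement.

	/*
	 * Check before dropping mmu_invalidate_in_progress so that a lone
	 * begin()/end() pair that never added a range still trips the assert.
	 */
	WARN_ON_ONCE(kvm->mmu_invalidate_in_progress == 1 &&
		     kvm->mmu_invalidate_range_start == INVALID_GPA);
	kvm->mmu_invalidate_in_progress--;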




Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-08-31 Thread Binbin Wu




On 8/31/2023 12:44 AM, Ackerley Tng wrote:

Binbin Wu  writes:




+static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
+{
+   struct address_space *mapping = inode->i_mapping;
+   pgoff_t start, index, end;
+   int r;
+
+   /* Dedicated guest is immutable by default. */
+   if (offset + len > i_size_read(inode))
+   return -EINVAL;
+
+   filemap_invalidate_lock_shared(mapping);
+
+   start = offset >> PAGE_SHIFT;
+   end = (offset + len) >> PAGE_SHIFT;
+
+   r = 0;
+   for (index = start; index < end; ) {
+   struct folio *folio;
+
+   if (signal_pending(current)) {
+   r = -EINTR;
+   break;
+   }
+
+   folio = kvm_gmem_get_folio(inode, index);
+   if (!folio) {
+   r = -ENOMEM;
+   break;
+   }
+
+   index = folio_next_index(folio);
+
+   folio_unlock(folio);
+   folio_put(folio);

Maybe a dumb question, but why do we get the folio and then put it immediately?
Will that release the folio back to the page allocator?


I was wondering this too, but it is correct.

In filemap_grab_folio(), the refcount is incremented in three places:

+ When the folio is created in filemap_alloc_folio(), it is given a
   refcount of 1 in

 filemap_alloc_folio() -> folio_alloc() -> __folio_alloc_node() ->
 __folio_alloc() -> __alloc_pages() -> get_page_from_freelist() ->
 prep_new_page() -> post_alloc_hook() -> set_page_refcounted()

+ Then, in filemap_add_folio(), the refcount is incremented twice:

 + The first is from the filemap (1 refcount per page if this is a
   hugepage):

 filemap_add_folio() -> __filemap_add_folio() -> folio_ref_add()

 + The second is a refcount from the lru list

 filemap_add_folio() -> folio_add_lru() -> folio_get() ->
 folio_ref_inc()

In the other path, if the folio exists in the page cache (filemap), the
refcount is also incremented through

 filemap_grab_folio() -> __filemap_get_folio() -> filemap_get_entry()
 -> folio_try_get_rcu()

I believe all the branches in kvm_gmem_get_folio() are taking a refcount
on the folio while the kernel does some work on the folio like clearing
the folio in clear_highpage() or getting the next index, and then when
done, the kernel does folio_put().

This pattern is also used in shmem and hugetlb. :)


Thanks for your explanation. It helps a lot.



I'm not sure whose refcount the folio_put() in kvm_gmem_allocate() is
dropping though:

+ The refcount for the filemap depends on whether this is a hugepage or
   not, but folio_put() strictly drops a refcount of 1.
+ The refcount for the lru list is just 1, but doesn't the page still
   remain in the lru list?


I guess the refcount dropped here is the one taken on the fresh allocation.
Now that the filemap has grabbed the folio, the lifecycle of the folio is
decided by the filemap/inode?





+
+   /* 64-bit only, wrapping the index should be impossible. */
+   if (WARN_ON_ONCE(!index))
+   break;
+
+   cond_resched();
+   }
+
+   filemap_invalidate_unlock_shared(mapping);
+
+   return r;
+}
+
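
To make the refcounting discussion concrete, the general grab/work/put pattern
(also used by shmem and hugetlbfs) looks roughly like the sketch below; the
reference(s) held by the filemap keep the folio alive after the temporary
reference from filemap_grab_folio() is dropped:

	folio = filemap_grab_folio(mapping, index);
	if (IS_ERR(folio))
		return PTR_ERR(folio);

	/* Work on the folio while holding the temporary reference... */
	clear_highpage(folio_page(folio, 0));
	folio_mark_uptodate(folio);

	/*
	 * ...then drop the lock and the temporary reference.  The folio is
	 * NOT returned to the page allocator; the filemap still holds its
	 * own reference(s).
	 */
	folio_unlock(folio);
	folio_put(folio);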






Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-08-30 Thread Binbin Wu




On 7/19/2023 7:44 AM, Sean Christopherson wrote:

[...]

+
+static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
+{
+   struct folio *folio;
+
+   /* TODO: Support huge pages. */
+   folio = filemap_grab_folio(file->f_mapping, index);
+   if (!folio)

Should use if (IS_ERR(folio)) instead.


+   return NULL;
+
+   /*
+* Use the up-to-date flag to track whether or not the memory has been
+* zeroed before being handed off to the guest.  There is no backing
+* storage for the memory, so the folio will remain up-to-date until
+* it's removed.
+*
+* TODO: Skip clearing pages when trusted firmware will do it when
+* assigning memory to the guest.
+*/
+   if (!folio_test_uptodate(folio)) {
+   unsigned long nr_pages = folio_nr_pages(folio);
+   unsigned long i;
+
+   for (i = 0; i < nr_pages; i++)
+   clear_highpage(folio_page(folio, i));
+
+   folio_mark_uptodate(folio);
+   }
+
+   /*
+* Ignore accessed, referenced, and dirty flags.  The memory is
+* unevictable and there is no storage to write back to.
+*/
+   return folio;
+}

[...]

+
+static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
+{
+   struct address_space *mapping = inode->i_mapping;
+   pgoff_t start, index, end;
+   int r;
+
+   /* Dedicated guest is immutable by default. */
+   if (offset + len > i_size_read(inode))
+   return -EINVAL;
+
+   filemap_invalidate_lock_shared(mapping);
+
+   start = offset >> PAGE_SHIFT;
+   end = (offset + len) >> PAGE_SHIFT;
+
+   r = 0;
+   for (index = start; index < end; ) {
+   struct folio *folio;
+
+   if (signal_pending(current)) {
+   r = -EINTR;
+   break;
+   }
+
+   folio = kvm_gmem_get_folio(inode, index);
+   if (!folio) {
+   r = -ENOMEM;
+   break;
+   }
+
+   index = folio_next_index(folio);
+
+   folio_unlock(folio);
+   folio_put(folio);

Maybe a dumb question, but why do we get the folio and then put it immediately?
Will that release the folio back to the page allocator?


+
+   /* 64-bit only, wrapping the index should be impossible. */
+   if (WARN_ON_ONCE(!index))
+   break;
+
+   cond_resched();
+   }
+
+   filemap_invalidate_unlock_shared(mapping);
+
+   return r;
+}
+

[...]

+
+int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
+ unsigned int fd, loff_t offset)
+{
+   loff_t size = slot->npages << PAGE_SHIFT;
+   unsigned long start, end, flags;
+   struct kvm_gmem *gmem;
+   struct inode *inode;
+   struct file *file;
+
+   BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
+
+   file = fget(fd);
+   if (!file)
+   return -EINVAL;
+
+   if (file->f_op != &kvm_gmem_fops)
+   goto err;
+
+   gmem = file->private_data;
+   if (gmem->kvm != kvm)
+   goto err;
+
+   inode = file_inode(file);
+   flags = (unsigned long)inode->i_private;
+
+   /*
+* For simplicity, require the offset into the file and the size of the
+* memslot to be aligned to the largest possible page size used to back
+* the file (same as the size of the file itself).
+*/
+   if (!kvm_gmem_is_valid_size(offset, flags) ||
+   !kvm_gmem_is_valid_size(size, flags))
+   goto err;
+
+   if (offset + size > i_size_read(inode))
+   goto err;
+
+   filemap_invalidate_lock(inode->i_mapping);
+
+   start = offset >> PAGE_SHIFT;
+   end = start + slot->npages;
+
+   if (!xa_empty(&gmem->bindings) &&
+       xa_find(&gmem->bindings, &start, end - 1, XA_PRESENT)) {
+   filemap_invalidate_unlock(inode->i_mapping);
+   goto err;
+   }
+
+   /*
+* No synchronize_rcu() needed, any in-flight readers are guaranteed to
+* see either a NULL file or this new file, no need for them to go
+* away.
+*/
+   rcu_assign_pointer(slot->gmem.file, file);
+   slot->gmem.pgoff = start;
+
+   xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
+   filemap_invalidate_unlock(inode->i_mapping);
+
+   /*
+* Drop the reference to the file, even on success.  The file pins KVM,
+* not the other way 'round.  Active bindings are invalidated if the

Is there an extra ' here, or should it maybe be "around"?



+* file is closed before memslots are destroyed.
+*/
+   fput(file);
+   return 0;
+
+err:
+   fput(file);
+   return -EINVAL;
+}
+

[...]





Re: [RFC PATCH v11 08/29] KVM: Introduce per-page memory attributes

2023-08-14 Thread Binbin Wu




On 7/19/2023 7:44 AM, Sean Christopherson wrote:

From: Chao Peng 

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
userspace to operate on the per-page memory attributes.
   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
 a guest memory range.
   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
 memory attributes.

Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.

Because setting memory attributes is roughly analogous to mprotect() on
memory that is mapped into the guest, zap existing mappings prior to
updating the memory attributes.  Opportunistically provide an arch hook
for the post-set path (needed to complete invalidation anyways) in

s/anyways/anyway


anticipation of x86 needing the hook to update metadata related to
determining whether or not a given gfn can be backed with various sizes
of hugepages.

It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.

Suggested-by: Sean Christopherson 
Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
Cc: Fuad Tabba 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
  Documentation/virt/kvm/api.rst |  60 
  include/linux/kvm_host.h   |  14 +++
  include/uapi/linux/kvm.h   |  14 +++
  virt/kvm/Kconfig   |   4 +
  virt/kvm/kvm_main.c| 170 +
  5 files changed, 262 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 34d4ce66e0c8..0ca8561775ac 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6068,6 +6068,56 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using 
the SET_ONE_REG
  interface. No error will be returned, but the resulting offset will not be
  applied.
  
+4.139 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES

+-
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: u64 memory attributes bitmask(out)
+:Returns: 0 on success, <0 on error
+
+Returns supported memory attributes bitmask. Supported memory attributes will
+have the corresponding bits set in u64 memory attributes bitmask.
+
+The following memory attributes are defined::
+
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE   (1ULL << 3)
+
+4.140 KVM_SET_MEMORY_ATTRIBUTES
+-
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes(in/out)
+:Returns: 0 on success, <0 on error
+
+Sets memory attributes for pages in a guest memory range. Parameters are
+specified via the following structure::
+
+  struct kvm_memory_attributes {
+   __u64 address;
+   __u64 size;
+   __u64 attributes;
+   __u64 flags;
+  };
+
+The user sets the per-page memory attributes to a guest memory range indicated
+by address/size, and in return KVM adjusts address and size to reflect the
+actual pages of the memory range that have been successfully set to the
+attributes.
+If the call returns 0, "address" is updated to the last successful address + 1
+and "size" is updated to the remaining address size that has not been set
+successfully. The user should check the return value as well as the size to
+decide if the operation succeeded for the whole range or not. The user may want
+to retry the operation with the returned address/size if the previous range was
+partially successful.
+
+Both address and size should be page aligned and the supported attributes can
+be retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
+
+The "flags" field may be used for future extensions and should be set to 0s.
+
  5. The kvm_run structure
  
  
@@ -8494,6 +8544,16 @@ block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a

  64-bit bitmap (each bit describing a block size). The default value is
  0, to disable the eager page splitting.
  
+8.41 KVM_CAP_MEMORY_ATTRIBUTES

+--
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm
+
+This capability indicates KVM supports per-page memory attributes and ioctls
+KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
+
  9. Known KVM API problems
  =
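
For illustration, a hypothetical userspace caller of the two ioctls documented
above (error handling trimmed; vm_fd, gpa and size are assumed, and the retry
loop follows the partial-success semantics described for
KVM_SET_MEMORY_ATTRIBUTES):

	__u64 supported = 0;

	if (ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &supported) < 0)
		err(1, "KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES");

	if (supported & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
		struct kvm_memory_attributes attrs = {
			.address    = gpa,
			.size       = size,
			.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
			.flags      = 0,
		};

		/*
		 * On success, address/size are advanced to the not-yet-set
		 * remainder of the range, so retry until everything is set.
		 */
		do {
			if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
				err(1, "KVM_SET_MEMORY_ATTRIBUTES");
		} while (attrs.size);
	}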