Re: [PATCH 15/34] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-12 Thread Xiaoyao Li

On 11/11/2023 2:22 AM, Sean Christopherson wrote:
> On Fri, Nov 10, 2023, Xiaoyao Li wrote:
>> On 11/6/2023 12:30 AM, Paolo Bonzini wrote:
>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>>> index 68a144cb7dbc..a6de526c0426 100644
>>> --- a/include/linux/kvm_host.h
>>> +++ b/include/linux/kvm_host.h
>>> @@ -589,8 +589,20 @@ struct kvm_memory_slot {
>>>  	u32 flags;
>>>  	short id;
>>>  	u16 as_id;
>>> +
>>> +#ifdef CONFIG_KVM_PRIVATE_MEM
>>> +	struct {
>>> +		struct file __rcu *file;
>>> +		pgoff_t pgoff;
>>> +	} gmem;
>>> +#endif
>>>  };
>>>
>>> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
>>> +{
>>> +	return slot && (slot->flags & KVM_MEM_GUEST_MEMFD);
>>> +}
>>> +
>>
>> maybe we can move this block and ...
>>
>>> @@ -2355,6 +2379,30 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>>>  					struct kvm_gfn_range *range);
>>>  bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>>>  					 struct kvm_gfn_range *range);
>>> +
>>> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>>> +{
>>> +	return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
>>> +	       kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
>>> +}
>>> +#else
>>> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
>>> +{
>>> +	return false;
>>> +}
>>>  #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
>>
>> this block to Patch 18?
>
> It would work, but my vote is to keep them here to minimize the changes to
> common KVM code in the x86 enabling.  It's not a strong preference though.
> Of course, at this point, fiddling with this sort of thing is probably a bad
> idea in terms of landing guest_memfd.

Indeed. It's OK then.

>>> @@ -4844,6 +4875,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>>>  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>>>  	case KVM_CAP_MEMORY_ATTRIBUTES:
>>>  		return kvm_supported_mem_attributes(kvm);
>>> +#endif
>>> +#ifdef CONFIG_KVM_PRIVATE_MEM
>>> +	case KVM_CAP_GUEST_MEMFD:
>>> +		return !kvm || kvm_arch_has_private_mem(kvm);
>>>  #endif
>>>  	default:
>>>  		break;
>>> @@ -5277,6 +5312,18 @@ static long kvm_vm_ioctl(struct file *filp,
>>>  	case KVM_GET_STATS_FD:
>>>  		r = kvm_vm_ioctl_get_stats_fd(kvm);
>>>  		break;
>>> +#ifdef CONFIG_KVM_PRIVATE_MEM
>>> +	case KVM_CREATE_GUEST_MEMFD: {
>>> +		struct kvm_create_guest_memfd guest_memfd;
>>
>> Do we need a guard of below?
>>
>> 	r = -EINVAL;
>> 	if (!kvm_arch_has_private_mem(kvm))
>> 		goto out;
>
> Argh, yeah, that's weird since KVM_CAP_GUEST_MEMFD says "not supported" if the
> VM doesn't support private memory.
>
> Enforcing that would break guest_memfd_test.c though.  And having to create a
> "special" VM just to test basic guest_memfd functionality would be quite
> annoying.
>
> So my vote is to do:
>
> 	case KVM_CAP_GUEST_MEMFD:
> 		return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM);

I'm fine with it.

> There's no harm to KVM if userspace creates a file it can't use, and at some
> point KVM will hopefully support guest_memfd irrespective of private memory.
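
The kvm_mem_is_private() helper quoted above is keyed off the per-gfn memory
attributes rather than the memslot, so the userspace side of that handshake is
KVM_SET_MEMORY_ATTRIBUTES.  A minimal sketch, assuming the memory-attributes
uAPI added earlier in this series; the helper name is illustrative, not taken
from the patch:

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Mark [gpa, gpa + size) private; this is the state that
     * kvm_mem_is_private() consults on the KVM side. */
    static int set_range_private(int vm_fd, __u64 gpa, __u64 size)
    {
            struct kvm_memory_attributes attr = {
                    .address    = gpa,
                    .size       = size,  /* must be page-aligned */
                    .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
            };

            return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);
    }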




Re: [PATCH 15/34] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-10 Thread Sean Christopherson
On Fri, Nov 10, 2023, Xiaoyao Li wrote:
> On 11/6/2023 12:30 AM, Paolo Bonzini wrote:
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 68a144cb7dbc..a6de526c0426 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -589,8 +589,20 @@ struct kvm_memory_slot {
> > u32 flags;
> > short id;
> > u16 as_id;
> > +
> > +#ifdef CONFIG_KVM_PRIVATE_MEM
> > +   struct {
> > +   struct file __rcu *file;
> > +   pgoff_t pgoff;
> > +   } gmem;
> > +#endif
> >   };
> > +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> > +{
> > +   return slot && (slot->flags & KVM_MEM_GUEST_MEMFD);
> > +}
> > +
> 
> maybe we can move this block and ...
> 
> 
> 
> > @@ -2355,6 +2379,30 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> > struct kvm_gfn_range *range);
> >   bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> >  struct kvm_gfn_range *range);
> > +
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > +   return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
> > +  kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > +}
> > +#else
> > +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > +   return false;
> > +}
> >   #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
> 
> this block to Patch 18?

It would work, but my vote is to keep them here to minimize the changes to
common KVM code in the x86 enabling.  It's not a strong preference though.
Of course, at this point, fiddling with this sort of thing is probably a bad
idea in terms of landing guest_memfd.

> > @@ -4844,6 +4875,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
> >   #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> > case KVM_CAP_MEMORY_ATTRIBUTES:
> > return kvm_supported_mem_attributes(kvm);
> > +#endif
> > +#ifdef CONFIG_KVM_PRIVATE_MEM
> > +   case KVM_CAP_GUEST_MEMFD:
> > +   return !kvm || kvm_arch_has_private_mem(kvm);
> >   #endif
> > default:
> > break;
> > @@ -5277,6 +5312,18 @@ static long kvm_vm_ioctl(struct file *filp,
> > case KVM_GET_STATS_FD:
> > r = kvm_vm_ioctl_get_stats_fd(kvm);
> > break;
> > +#ifdef CONFIG_KVM_PRIVATE_MEM
> > +   case KVM_CREATE_GUEST_MEMFD: {
> > +   struct kvm_create_guest_memfd guest_memfd;
> 
> Do we need a guard of below?
> 
>   r = -EINVAL;
>   if (!kvm_arch_has_private_mem(kvm))
>   goto out;

Argh, yeah, that's weird since KVM_CAP_GUEST_MEMFD says "not supported" if the
VM doesn't support private memory.

Enforcing that would break guest_memfd_test.c though.  And having to create a
"special" VM just to test basic guest_memfd functionality would be quite
annoying.

So my vote is to do:

case KVM_CAP_GUEST_MEMFD:
return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM);

There's no harm to KVM if userspace creates a file it can't use, and at some
point KVM will hopefully support guest_memfd irrespective of private memory.
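
With the capability semantics proposed above, a userspace probe only answers
whether guest_memfd is compiled in, and works on the system fd or on any VM fd
alike.  A hedged sketch (the helper name is illustrative):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Nonzero if this kernel's KVM was built with guest_memfd support. */
    static int have_guest_memfd(int kvm_or_vm_fd)
    {
            /* KVM_CHECK_EXTENSION returns 0 for unsupported capabilities. */
            return ioctl(kvm_or_vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GUEST_MEMFD) > 0;
    }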



Re: [PATCH 15/34] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-06 Thread Fuad Tabba
Hi,

>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 083ed507e200..6d681f45969e 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst

...
>
> +4.142 KVM_CREATE_GUEST_MEMFD
> +----------------------------
> +
> +:Capability: KVM_CAP_GUEST_MEMFD
> +:Architectures: none
> +:Type: vm ioctl
> +:Parameters: struct kvm_create_guest_memfd(in)

nit: space before (in)

Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 

Cheers,
/fuad

> +:Returns: 0 on success, <0 on error
> +
> +KVM_CREATE_GUEST_MEMFD creates an anonymous file and returns a file descriptor
> +that refers to it.  guest_memfd files are roughly analogous to files created
> +via memfd_create(), e.g. guest_memfd files live in RAM, have volatile storage,
> +and are automatically released when the last reference is dropped.  Unlike
> +"regular" memfd_create() files, guest_memfd files are bound to their owning
> +virtual machine (see below), cannot be mapped, read, or written by userspace,
> +and cannot be resized (guest_memfd files do however support PUNCH_HOLE).
> +
> +::
> +
> +  struct kvm_create_guest_memfd {
> +   __u64 size;
> +   __u64 flags;
> +   __u64 reserved[6];
> +  };
> +
> +Conceptually, the inode backing a guest_memfd file represents physical memory,
> +i.e. is coupled to the virtual machine as a thing, not to a "struct kvm".  The
> +file itself, which is bound to a "struct kvm", is that instance's view of the
> +underlying memory, e.g. effectively provides the translation of guest addresses
> +to host memory.  This allows for use cases where multiple KVM structures are
> +used to manage a single virtual machine, e.g. when performing intrahost
> +migration of a virtual machine.
> +
> +KVM currently only supports mapping guest_memfd via KVM_SET_USER_MEMORY_REGION2,
> +and more specifically via the guest_memfd and guest_memfd_offset fields in
> +"struct kvm_userspace_memory_region2", where guest_memfd_offset is the offset
> +into the guest_memfd instance.  For a given guest_memfd file, there can be at
> +most one mapping per page, i.e. binding multiple memory regions to a single
> +guest_memfd range is not allowed (any number of memory regions can be bound to
> +a single guest_memfd file, but the bound ranges must not overlap).
> +
> +See KVM_SET_USER_MEMORY_REGION2 for additional details.
> +
>  5. The kvm_run structure
>  ========================
>
> diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> index 3d4a27f8b4fe..6f3d31b4d1e3 100644
> --- a/fs/anon_inodes.c
> +++ b/fs/anon_inodes.c
> @@ -181,6 +181,7 @@ struct file *anon_inode_create_getfile(const char *name,
> return __anon_inode_getfile(name, fops, priv, flags,
> context_inode, true);
>  }
> +EXPORT_SYMBOL_GPL(anon_inode_create_getfile);
>
>  static int __anon_inode_getfd(const char *name,
>   const struct file_operations *fops,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 68a144cb7dbc..a6de526c0426 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -589,8 +589,20 @@ struct kvm_memory_slot {
> u32 flags;
> short id;
> u16 as_id;
> +
> +#ifdef CONFIG_KVM_PRIVATE_MEM
> +   struct {
> +   struct file __rcu *file;
> +   pgoff_t pgoff;
> +   } gmem;
> +#endif
>  };
>
> +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
> +{
> +   return slot && (slot->flags & KVM_MEM_GUEST_MEMFD);
> +}
> +
>  static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot)
>  {
> return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
> @@ -685,6 +697,17 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  }
>  #endif
>
> +/*
> + * Arch code must define kvm_arch_has_private_mem if support for private memory
> + * is enabled.
> + */
> +#if !defined(kvm_arch_has_private_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM)
> +static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> +   return false;
> +}
> +#endif
> +
>  struct kvm_memslots {
> u64 generation;
> atomic_long_t last_used_slot;
> @@ -1400,6 +1423,7 @@ void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  void kvm_mmu_invalidate_begin(struct kvm *kvm);
>  void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
>  void kvm_mmu_invalidate_end(struct kvm *kvm);
> +bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
>
>  long kvm_arch_dev_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg);
> @@ -2355,6 +2379,30 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> 					struct kvm_gfn_range *range);
>  bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>  struct 
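
Putting the documentation above into code, the expected userspace flow is
roughly as follows.  A sketch assuming <linux/kvm.h> from a kernel with this
series applied; names and error handling are illustrative:

    #define _GNU_SOURCE          /* for fallocate() and FALLOC_FL_* */
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Create a guest_memfd on the VM and bind it to memslot 0 at gpa. */
    static int create_and_bind_gmem(int vm_fd, __u64 gpa, __u64 size)
    {
            struct kvm_create_guest_memfd gmem = { .size = size };
            int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

            if (gmem_fd < 0)
                    return -1;

            struct kvm_userspace_memory_region2 region = {
                    .slot            = 0,
                    .flags           = KVM_MEM_GUEST_MEMFD,
                    .guest_phys_addr = gpa,
                    .memory_size     = size,
                    /* .userspace_addr would supply the shared view; it is
                     * left zero in this private-only sketch. */
                    .guest_memfd        = gmem_fd,
                    .guest_memfd_offset = 0,  /* bound ranges must not overlap */
            };

            if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region) < 0)
                    return -1;

            /* guest_memfd cannot be resized, but PUNCH_HOLE is supported,
             * e.g. to discard the backing for the first 4KiB of the range: */
            fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 4096);

            return gmem_fd;
    }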

[PATCH 15/34] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-05 Thread Paolo Bonzini
From: Sean Christopherson 

Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.

A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem.  With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings.  E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allowed guest protection.  Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and readable
one for itself, but that’s suboptimal on multiple fronts.

Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn’t support
creating a 1GiB guest mapping unless userspace also has a 1GiB guest
mapping.  Decoupling the mapping sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.

Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.

A guest-first memory subsystem also provides clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).

More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd.  While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption.  And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must prevent host
userspace from accessing guest memory irrespective of hardware behavior.

Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.

Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping.  And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.

Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory.  That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to its demise.

Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem.  I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.

Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay.  Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.

Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shute...@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shute...@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-sea...@google.com
Link: https://lore.kernel.org/all/2021141352.26311-1-chao.p.p...@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.p...@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd63...@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/zem5zq8oo+xna...@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/zii1p8zhlhaq3...@casper.infradead.org
Cc: Fuad Tabba 
Cc: Vishal Annapurve 
Cc: Ackerley Tng 
Cc: Jarkko Sakkinen 
Cc: Maciej Szmigiero 
Cc: Vlastimil Babka 
Cc: David Hildenbrand 
Cc: Quentin Perret 
Cc: Michael Roth 
Cc: