Re: [PATCH v14 00/34] KVM: guest_memfd() and per-page attributes

2023-11-13 Thread Paolo Bonzini

On 11/5/23 17:30, Paolo Bonzini wrote:

The "development cycle" for this version is going to be very short;
ideally, next week I will merge it as is in kvm/next, taking this through
the KVM tree for 6.8 immediately after the end of the merge window.
The series is still based on 6.6 (plus KVM changes for 6.7) so it
will require a small fixup for changes to get_file_rcu() introduced in
6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU").
The fixup will be done as part of the merge commit, and most of the text
above will become the commit message for the merge.


The changes from review are small enough and entirely in tests, so
I went ahead and pushed it to kvm/next, together with "selftests:
kvm/s390x: use vm_create_barebones()", which also fixed test case
failures (similar to the aarch64/page_fault_test.c hunk below).

The guestmemfd branch on kvm.git was force-pushed, and can be used for further
development if you don't want to run 6.7-rc1 for whatever reason.

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 38882263278d..926241e23aeb 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1359,7 +1359,6 @@ yet and must be cleared on entry.
 	__u64 guest_phys_addr;
 	__u64 memory_size; /* bytes */
 	__u64 userspace_addr; /* start of the userspace allocated memory */
-	__u64 pad[16];
   };
 
   /* for kvm_userspace_memory_region::flags */

diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
index eb4217b7c768..08a5ca5bed56 100644
--- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
+++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
@@ -705,7 +705,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 
 	print_test_banner(mode, p);
 
-	vm = vm_create(mode);
+	vm = vm_create(VM_SHAPE(mode));
 	setup_memslots(vm, p);
 	kvm_vm_elf_load(vm, program_invocation_name);
 	setup_ucall(vm);
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ea0ae7e25330..fd389663c49b 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -6,14 +6,6 @@
  */
 
 #define _GNU_SOURCE
-#include "test_util.h"
-#include "kvm_util_base.h"
-#include 
-#include 
-#include 
-#include 
-#include 
-
 #include 
 #include 
 #include 
@@ -21,6 +13,15 @@
 #include 
 #include 
 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "test_util.h"
+#include "kvm_util_base.h"
+
 static void test_file_read_write(int fd)
 {
 	char buf[64];
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index e4d2cd9218b2..1b58f943562f 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -819,6 +819,7 @@ static inline struct kvm_vm *vm_create_barebones(void)
 	return vm_create(VM_SHAPE_DEFAULT);
 }
 
+#ifdef __x86_64__
 static inline struct kvm_vm *vm_create_barebones_protected_vm(void)
 {
 	const struct vm_shape shape = {
@@ -828,6 +829,7 @@ static inline struct kvm_vm *vm_create_barebones_protected_vm(void)
 
 	return vm_create(shape);
 }
+#endif
 
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
 {
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index d05d95cc3693..9b29cbf49476 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1214,7 +1214,7 @@ void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,
 		TEST_ASSERT(region && region->region.flags & KVM_MEM_GUEST_MEMFD,
 			    "Private memory region not found for GPA 0x%lx", gpa);
 
-		offset = (gpa - region->region.guest_phys_addr);
+		offset = gpa - region->region.guest_phys_addr;
 		fd_offset = region->region.guest_memfd_offset + offset;
 		len = min_t(uint64_t, end - gpa, region->region.memory_size - offset);
 
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index 343e807043e1..1efee1cfcff0 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -433,6 +433,7 @@ static void test_add_max_memory_regions(void)
 }
 
 
+#ifdef __x86_64__
 static void test_invalid_guest_memfd(struct kvm_vm *vm, int memfd,
 				     size_t offset, const char *msg)
 {
@@ -523,14 +524,13 @@ static void test_add_overlapping_private_memory_regions(void)
 	close(memfd);
 	kvm_vm_free(vm);
 }
+#endif
 
 int main(int argc, char *argv[])
 {
 #ifdef __x86_64__
 	int i, loops;
-#endif
 
-#ifdef __x86_64__

[PATCH v14 00/34] KVM: guest_memfd() and per-page attributes

2023-11-05 Thread Paolo Bonzini
[If the introduction below is not enough, go read
 https://lwn.net/SubscriberLink/949277/118520c1248ace63/ and subscribe to LWN]

Introduce several new KVM uAPIs to ultimately create a guest-first memory
subsystem within KVM, a.k.a. guest_memfd.  Guest-first memory allows KVM
to provide features, enhancements, and optimizations that are kludgy
or outright impossible to implement in a generic memory subsystem.

The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which,
similar to the generic memfd_create(), creates an anonymous file and
returns a file descriptor that refers to it.  Again like "regular"
memfd files, guest_memfd files live in RAM, have volatile storage,
and are automatically released when the last reference is dropped.
The key differences from memfd files (and every other memory subsystem)
are that guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be resized.
guest_memfd files do however support PUNCH_HOLE, which can be used to
convert a guest memory area between the shared and guest-private states.
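
As a rough sketch of how this looks from userspace (the ioctl and struct
names below are the ones this series adds; the helper name, vm_fd, sizes,
and GPAs are purely illustrative, and error handling is omitted):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <linux/falloc.h>
  #include <linux/kvm.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  /* vm_fd is a VM file descriptor obtained from KVM_CREATE_VM. */
  static int setup_private_memslot(int vm_fd)
  {
  	/* 4 MiB of guest-private, volatile, unmappable RAM. */
  	struct kvm_create_guest_memfd gmem = { .size = 0x400000 };
  	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

  	/* The shared half still comes from ordinary host-mapped memory. */
  	void *shared = mmap(NULL, 0x400000, PROT_READ | PROT_WRITE,
  			    MAP_SHARED | MAP_ANONYMOUS, -1, 0);

  	/* Bind both halves to one memslot, tying the fd to this VM. */
  	struct kvm_userspace_memory_region2 region = {
  		.slot            = 0,
  		.flags           = KVM_MEM_GUEST_MEMFD,
  		.guest_phys_addr = 0,
  		.memory_size     = 0x400000,
  		.userspace_addr  = (unsigned long)shared,
  		.guest_memfd     = gmem_fd,
  	};
  	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);

  	/* PUNCH_HOLE releases the private backing of the first page. */
  	fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
  		  0, 0x1000);
  	return gmem_fd;
  }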

A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to
specify attributes for a given page of guest memory.  In the long term,
it will likely be extended to allow userspace to specify per-gfn RWX
protections, including allowing memory to be writable in the guest
without it also being writable in host userspace.
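
For instance, marking the first page of that slot private ahead of a
conversion might look like this (same caveats as the sketch above; an
attributes value of 0 converts the range back to shared):

  	/* Assumes the includes and vm_fd from the previous sketch. */
  	struct kvm_memory_attributes attr = {
  		.address    = 0,		/* GPA of the range start */
  		.size       = 0x1000,		/* one 4 KiB page */
  		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
  	};
  	ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr);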

The immediate and driving use case for guest_memfd is Confidential
Computing (CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM.
For such use cases, being able to map memory into KVM guests without
requiring said memory to be mapped into the host is a hard requirement.
While SEV+ and TDX prevent untrusted software from reading guest private
data by encrypting guest memory, pKVM provides confidentiality and
integrity *without* relying on memory encryption.  In addition, with
SEV-SNP and especially TDX, accessing guest private memory can be fatal
to the host, i.e. KVM must prevent host userspace from accessing
guest memory irrespective of hardware behavior.

Long term, guest_memfd may be useful for use cases beyond CoCo VMs,
for example hardening userspace against unintentional accesses to guest
memory.  As mentioned earlier, KVM's ABI uses userspace VMA protections to
define the allowed guest protections (with an exception granted to mapping
guest memory executable), and similarly KVM currently requires the guest
mapping size to be a strict subset of the host userspace mapping size.
Decoupling the mapping sizes would allow userspace to precisely map
only what is needed and with the required permissions, without impacting
guest performance.

A guest-first memory subsystem also provides a clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to DMA from or into guest memory).

guest_memfd is the result of 3+ years of development and exploration;
taking on memory management responsibilities in KVM was not the first,
second, or even third choice for supporting CoCo VMs.  But after many
failed attempts to avoid KVM-specific backing memory, and looking at
where things ended up, it is quite clear that of all approaches tried,
guest_memfd is the simplest, most robust, and most extensible, and the
right thing to do for KVM and the kernel at large.

The "development cycle" for this version is going to be very short;
ideally, next week I will merge it as is in kvm/next, taking this through
the KVM tree for 6.8 immediately after the end of the merge window.
The series is still based on 6.6 (plus KVM changes for 6.7) so it
will require a small fixup for changes to get_file_rcu() introduced in
6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU").
The fixup will be done as part of the merge commit, and most of the text
above will become the commit message for the merge.

Because of this, the only two commits that had substantial remarks in v13
(depending on your definition of substantial) are *not* officially part of
this series and will not be merged:

  KVM: Prepare for handling only shared mappings in mmu_notifier events
  KVM: Add transparent hugepage support for dedicated guest memory

Pending post-merge work includes:
- looking into using the restrictedmem framework for guest memory
- introducing a testing mechanism to poison memory, possibly using
  the same memory attributes introduced here
- SNP and TDX support

Non-KVM people, you may want to explicitly ACK two patches buried in the
middle of this series:

  fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()
  mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

The first is small and mostly suggested-by Christian Brauner; the second
is a bit less so, but it was written by an mm person (Vlastimil Babka).
Note, adding AS_UNMOVABLE isn't strictly required as it's "just" an