Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-04-17 Thread Chao Peng
On Tue, Jan 24, 2023 at 01:27:50AM +, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > On Thu, Jan 19, 2023 at 03:25:08PM +,
> > Sean Christopherson  wrote:
> > 
> > > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > > On Sat, Jan 14, 2023 at 12:37:59AM +,
> > > > Sean Christopherson  wrote:
> > > > 
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > This patch series implements KVM guest private memory for 
> > > > > > confidential
> > > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > > TDX-protected guest memory, machine check can happen which can 
> > > > > > further
> > > > > > crash the running host system, this is terrible for multi-tenant
> > > > > > configurations. The host accesses include those from KVM userspace 
> > > > > > like
> > > > > > QEMU. This series addresses KVM userspace induced crash by 
> > > > > > introducing
> > > > > > new mm and KVM interfaces so KVM userspace can still manage guest 
> > > > > > memory
> > > > > > via a fd-based approach, but it can never access the guest memory
> > > > > > content.
> > > > > > 
> > > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any 
> > > > > > other
> > > > > > reviews are always welcome.
> > > > > >   - 01: mm change, target for mm tree
> > > > > >   - 02-09: KVM change, target for KVM tree
> > > > > 
> > > > > A version with all of my feedback, plus reworked versions of Vishal's 
> > > > > selftest,
> > > > > is available here:
> > > > > 
> > > > >   git@github.com:sean-jc/linux.git x86/upm_base_support
> > > > > 
> > > > > It compiles and passes the selftest, but it's otherwise barely 
> > > > > tested.  There are
> > > > > a few todos (2 I think?) and many of the commits need changelogs, 
> > > > > i.e. it's still
> > > > > a WIP.
> > > > > 
> > > > > As for next steps, can you (handwaving all of the TDX folks) take a 
> > > > > look at what
> > > > > I pushed and see if there's anything horrifically broken, and that it 
> > > > > still works
> > > > > for TDX?
> > > > > 
> > > > > Fuad (and pKVM folks) same ask for you with respect to pKVM.  
> > > > > Absolutely no rush
> > > > > (and I mean that).
> > > > > 
> > > > > On my side, the two things on my mind are (a) tests and (b) 
> > > > > downstream dependencies
> > > > > (SEV and TDX).  For tests, I want to build a lists of tests that are 
> > > > > required for
> > > > > merging so that the criteria for merging are clear, and so that if 
> > > > > the list is large
> > > > > (haven't thought much yet), the work of writing and running tests can 
> > > > > be distributed.
> > > > > 
> > > > > Regarding downstream dependencies, before this lands, I want to pull 
> > > > > in all the
> > > > > TDX and SNP series and see how everything fits together.  
> > > > > Specifically, I want to
> > > > > make sure that we don't end up with a uAPI that necessitates ugly 
> > > > > code, and that we
> > > > > don't miss an opportunity to make things simpler.  The patches in the 
> > > > > SNP series to
> > > > > add "legacy" SEV support for UPM in particular made me slightly 
> > > > > rethink some minor
> > > > > details.  Nothing remotely major, but something that needs attention 
> > > > > since it'll
> > > > > be uAPI.
> > > > 
> > > > Although I'm still debugging with TDX KVM, I needed the following.
> > > > kvm_faultin_pfn() is called without mmu_lock held.  The race to change
> > > > private/shared is handled by mmu_seq.  Maybe a dedicated function only for
> > > > kvm_faultin_pfn().
> > > 
> > > Gah, you're not on the other thread where this was discussed[*].  Simply 
> > > deleting
> > > the

Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-03-28 Thread Chao Peng
On Fri, Mar 24, 2023 at 10:29:25AM +0800, Xiaoyao Li wrote:
> On 3/24/2023 10:10 AM, Chao Peng wrote:
> > On Wed, Mar 22, 2023 at 05:41:31PM -0700, Isaku Yamahata wrote:
> > > On Wed, Mar 08, 2023 at 03:40:26PM +0800,
> > > Chao Peng  wrote:
> > > 
> > > > On Wed, Mar 08, 2023 at 12:13:24AM +, Ackerley Tng wrote:
> > > > > Chao Peng  writes:
> > > > > 
> > > > > > On Sat, Jan 14, 2023 at 12:01:01AM +, Sean Christopherson wrote:
> > > > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > ...
> > > > > > > Strongly prefer to use similar logic to existing code that 
> > > > > > > detects wraps:
> > > > > 
> > > > > > >   mem->restricted_offset + mem->memory_size < 
> > > > > > > mem->restricted_offset
> > > > > 
> > > > > > > This is also where I'd like to add the "gfn is aligned to offset"
> > > > > > > check, though
> > > > > > > my brain is too fried to figure that out right now.
> > > > > 
> > > > > > Used count_trailing_zeros() for this TODO, unsure we have other 
> > > > > > better
> > > > > > approach.
> > > > > 
> > > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > > index afc8c26fa652..fd34c5f7cd2f 100644
> > > > > > --- a/virt/kvm/kvm_main.c
> > > > > > +++ b/virt/kvm/kvm_main.c
> > > > > > @@ -56,6 +56,7 @@
> > > > > >#include 
> > > > > >#include 
> > > > > >#include 
> > > > > > +#include <linux/count_zeros.h>
> > > > > 
> > > > > >#include "coalesced_mmio.h"
> > > > > >#include "async_pf.h"
> > > > > > @@ -2087,6 +2088,19 @@ static bool kvm_check_memslot_overlap(struct
> > > > > > kvm_memslots *slots, int id,
> > > > > > return false;
> > > > > >}
> > > > > 
> > > > > > +/*
> > > > > > + * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
> > > > > > + */
> > > > > > +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
> > > > > > +{
> > > > > > +   if (!offset)
> > > > > > +   return true;
> > > > > > +   if (!gpa)
> > > > > > +   return false;
> > > > > > +
> > > > > > +   return !!(count_trailing_zeros(offset) >= 
> > > > > > count_trailing_zeros(gpa));
> > > 
> > > This check doesn't work as expected. For example, with offset = 2GB, gpa = 4GB,
> > > this check fails.
> > 
> > This case is expected to fail as Sean initially suggested[*]:
> >I would rather reject memslot if the gfn has lesser alignment than
> >the offset. I'm totally ok with this approach _if_ there's a use case.
> >Until such a use case presents itself, I would rather be conservative
> >from a uAPI perspective.
> > 
> > I understand that we put a tighter restriction on this, but if you see that
> > such a restriction is really a big issue for real usage, instead of a
> > theoretical problem, then we can loosen the check here. But at that point the
> > below code is kind of x86-specific and may need improvement.
> > 
> > BTW, in the latest code, I replaced count_trailing_zeros() with fls64():
> >return !!(fls64(offset) >= fls64(gpa));
> 
> wouldn't it be !!(ffs64(offset) <= ffs64(gpa)) ?

As the function comment explains, here we want to return true when
ALIGNMENT(offset) >= ALIGNMENT(gpa), so '>=' is what we need.

It's worth clarifying that in Sean's original suggestion he actually
mentioned the opposite. He said 'reject memslot if the gfn has lesser
alignment than the offset', but I wonder whether that is what he intended,
since if ALIGNMENT(offset) < ALIGNMENT(gpa) it would be impossible to map
the page as a largepage. Consider the following config:

  gpa=2M, offset=1M

In this case KVM tries to map the gpa at 2M as a 2M hugepage, but the
physical page at the offset (1M) in private_fd cannot provide a 2M page
due to misalignment.
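
For concreteness, here is a small userspace model of the
count_trailing_zeros()-based check from the quoted
kvm_check_rmem_offset_alignment() hunk, evaluated for configs like the one
above (illustration only; the kernel helper is modeled here with
__builtin_ctzll()):

#include <stdio.h>
#include <stdint.h>

static int ctz64(uint64_t v) { return v ? __builtin_ctzll(v) : 64; }

/* Mirrors the kernel-side check: offset must be at least as aligned as gpa. */
static int offset_alignment_ok(uint64_t offset, uint64_t gpa)
{
	if (!offset)
		return 1;
	if (!gpa)
		return 0;
	return ctz64(offset) >= ctz64(gpa);
}

int main(void)
{
	/* gpa = 2M, offset = 1M: rejected, the offset is only 1M-aligned. */
	printf("%d\n", offset_alignment_ok(1ULL << 20, 2ULL << 20));	/* 0 */
	/* gpa = 2M, offset = 4M: accepted, offset alignment >= gpa alignment. */
	printf("%d\n", offset_alignment_ok(4ULL << 20, 2ULL << 20));	/* 1 */
	return 0;
}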

But as we discussed in the off-list thread, here we do find a real use
case indicating this check is too strict. i.e. QEMU immediately fails
when launch a guest > 2G memory. For this case QEMU splits guest memory
space into two slots:

  Slot#1(ram_below_4G):

Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-03-24 Thread Chao Peng
On Wed, Mar 22, 2023 at 08:27:37PM -0500, Michael Roth wrote:
> On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > > Hi Sean,
> > > 
> > > We've rebased the SEV+SNP support onto your updated UPM base support
> > > tree and things seem to be working okay, but we needed some fixups on
> > > top of the base support get things working, along with 1 workaround
> > > for an issue that hasn't been root-caused yet:
> > > 
> > >   https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> > > 
> > >   *stash (upm_base_support): mm: restrictedmem: Kirill's pinning 
> > > implementation
> > >   *workaround (use_base_support): mm: restrictedmem: loosen exclusivity 
> > > check
> > 
> > What I'm seeing is Slot#3 gets added first and then deleted. When it
> > gets added, Slot#0 already has the same range bound to restrictedmem,
> > which triggers the exclusive check. This check is exactly what the
> > current code is for.
> 
> With the following change in QEMU, we no longer trigger this check:
> 
>   diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
>   index 20da121374..849b5de469 100644
>   --- a/hw/pci-host/q35.c
>   +++ b/hw/pci-host/q35.c
>   @@ -588,9 +588,9 @@ static void mch_realize(PCIDevice *d, Error **errp)
>memory_region_init_alias(&mch->open_high_smram, OBJECT(mch), 
> "smram-open-high",
> mch->ram_memory, MCH_HOST_BRIDGE_SMRAM_C_BASE,
> MCH_HOST_BRIDGE_SMRAM_C_SIZE);
>   +memory_region_set_enabled(&mch->open_high_smram, false);
>memory_region_add_subregion_overlap(mch->system_memory, 0xfeda,
>&mch->open_high_smram, 1);
>   -memory_region_set_enabled(&mch->open_high_smram, false);
> 
> I'm not sure if QEMU is actually doing something wrong here though or if
> this check is putting tighter restrictions on userspace than what was
> expected before. Will look into it more.

I don't think the above QEMU change is acceptable upstream. It may break
functionality for 'normal' VMs.

The UPM check does put a tighter restriction in place, namely that you
can't bind the same fd range to more than one memslot. SMRAM in QEMU,
however, violates this restriction. The right 'fix' is disabling SMM in
QEMU for UPM usages rather than trying to work around it. There is more
discussion in the link below:

  https://lore.kernel.org/all/y8bob7vuvisxo...@google.com/

Chao

> 
> > 
> > >   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem 
> > > binding/unbinding
> > >   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for 
> > > issuing invalidations
> > 
> > As many kernel APIs treat 'end' as exclusive, I would rather keep using
> > an exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind
> > and the notifier callbacks) but fix it up internally in restrictedmem, e.g.
> > in all the places where the xarray API needs a 'last'/'max' we use 'end - 1'.
> > See below for the change.
> 
> Yes I did feel like I was fighting the kernel a bit on that; your
> suggestion seems like it would be a better fit.
> 
> > 
> > >   *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations
> > 
> > Subtracting slot->restrictedmem.index for start/end in
> > restrictedmem_get_gfn_range() is the correct fix.
> > 
> > >   *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
> > > 
> > > We plan to post an updated RFC for v8 soon, but also wanted to share
> > > the staging tree in case you end up looking at the UPM integration aspects
> > > before then.
> > > 
> > > -Mike
> > 
> > This is the restrictedmem fix to solve 'end' being stored and checked in 
> > xarray:
> 
> Looks good.
> 
> Thanks!
> 
> -Mike
> 
> > 
> > --- a/mm/restrictedmem.c
> > +++ b/mm/restrictedmem.c
> > @@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct 
> > restrictedmem *rm, int mode,
> >  */
> > down_read(&rm->lock);
> >  
> > -   xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > +   xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> > notifier->ops->invalidate_start(notifier, start, end);
> >  
> > ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> >  
> > -   xa_for_each_range(&rm->bindings, index, notifier, start, end)
> > +   xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
> > no

Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-03-23 Thread Chao Peng
On Wed, Mar 22, 2023 at 05:41:31PM -0700, Isaku Yamahata wrote:
> On Wed, Mar 08, 2023 at 03:40:26PM +0800,
> Chao Peng  wrote:
> 
> > On Wed, Mar 08, 2023 at 12:13:24AM +, Ackerley Tng wrote:
> > > Chao Peng  writes:
> > > 
> > > > On Sat, Jan 14, 2023 at 12:01:01AM +, Sean Christopherson wrote:
> > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > ...
> > > > > Strongly prefer to use similar logic to existing code that detects 
> > > > > wraps:
> > > 
> > > > >   mem->restricted_offset + mem->memory_size < 
> > > > > mem->restricted_offset
> > > 
> > > > > This is also where I'd like to add the "gfn is aligned to offset"
> > > > > check, though
> > > > > my brain is too fried to figure that out right now.
> > > 
> > > > Used count_trailing_zeros() for this TODO, unsure we have other better
> > > > approach.
> > > 
> > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > index afc8c26fa652..fd34c5f7cd2f 100644
> > > > --- a/virt/kvm/kvm_main.c
> > > > +++ b/virt/kvm/kvm_main.c
> > > > @@ -56,6 +56,7 @@
> > > >   #include 
> > > >   #include 
> > > >   #include 
> > > > +#include <linux/count_zeros.h>
> > > 
> > > >   #include "coalesced_mmio.h"
> > > >   #include "async_pf.h"
> > > > @@ -2087,6 +2088,19 @@ static bool kvm_check_memslot_overlap(struct
> > > > kvm_memslots *slots, int id,
> > > > return false;
> > > >   }
> > > 
> > > > +/*
> > > > + * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
> > > > + */
> > > > +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
> > > > +{
> > > > +   if (!offset)
> > > > +   return true;
> > > > +   if (!gpa)
> > > > +   return false;
> > > > +
> > > > +   return !!(count_trailing_zeros(offset) >= 
> > > > count_trailing_zeros(gpa));
> 
> This check doesn't work as expected. For example, with offset = 2GB, gpa = 4GB,
> this check fails.

This case is expected to fail as Sean initially suggested[*]:
  I would rather reject memslot if the gfn has lesser alignment than
  the offset. I'm totally ok with this approach _if_ there's a use case.
  Until such a use case presents itself, I would rather be conservative
  from a uAPI perspective.

I understand that we put a tighter restriction on this, but if you see that
such a restriction is really a big issue for real usage, instead of a
theoretical problem, then we can loosen the check here. But at that point
the below code is kind of x86-specific and may need improvement.

BTW, in the latest code, I replaced count_trailing_zeros() with fls64():
  return !!(fls64(offset) >= fls64(gpa));

[*] https://lore.kernel.org/all/y8hldehbrw+oo...@google.com/

Chao
> I come up with the following.
> 
> From ec87e25082f0497431b732702fae82c6a05071bf Mon Sep 17 00:00:00 2001
> Message-Id: 
> 
> From: Isaku Yamahata 
> Date: Wed, 22 Mar 2023 15:32:56 -0700
> Subject: [PATCH] KVM: Relax alignment check for restricted mem
> 
> kvm_check_rmem_offset_alignment() only checks based on offset alignment
> and GPA alignment.  However, the actual alignment required for the offset
> depends on the architecture.  For the x86 case, it can be 1G, 2M or 4K.  So
> even if the GPA is aligned to 1G or more, only 1G alignment is required for
> the offset.
> 
> Without this patch, gpa=4G, offset=2G results in failure of memory slot
> creation.
> 
> Fixes: edc8814b2c77 ("KVM: Require gfn be aligned with restricted offset")
> Signed-off-by: Isaku Yamahata 
> ---
>  arch/x86/include/asm/kvm_host.h | 15 +++
>  virt/kvm/kvm_main.c |  9 -
>  2 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 88e11dd3afde..03af44650f24 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/count_zeros.h>
>  
>  #include 
>  #include 
> @@ -143,6 +144,20 @@
>  #define KVM_HPAGE_MASK(x)(~(KVM_HPAGE_SIZE(x) - 1))
>  #define KVM_PAGES_PER_HPAGE(x)   (KVM_HPAGE_SIZE(x) / PAGE_SIZE)
>  
> +#define kvm_arch_required_alignment  kvm_arch_required_alignment
> +static inline int kvm_arch_required_alignment(u64 gpa)
> +{
> + int zeros = count_trailing_zeros(gpa);
> +
> +  

Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-03-07 Thread Chao Peng
On Wed, Mar 08, 2023 at 12:13:24AM +, Ackerley Tng wrote:
> Chao Peng  writes:
> 
> > On Sat, Jan 14, 2023 at 12:01:01AM +, Sean Christopherson wrote:
> > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > ...
> > > Strongly prefer to use similar logic to existing code that detects wraps:
> 
> > >   mem->restricted_offset + mem->memory_size < 
> > > mem->restricted_offset
> 
> > > This is also where I'd like to add the "gfn is aligned to offset"
> > > check, though
> > > my brain is too fried to figure that out right now.
> 
> > Used count_trailing_zeros() for this TODO, unsure we have other better
> > approach.
> 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index afc8c26fa652..fd34c5f7cd2f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -56,6 +56,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include <linux/count_zeros.h>
> 
> >   #include "coalesced_mmio.h"
> >   #include "async_pf.h"
> > @@ -2087,6 +2088,19 @@ static bool kvm_check_memslot_overlap(struct
> > kvm_memslots *slots, int id,
> > return false;
> >   }
> 
> > +/*
> > + * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
> > + */
> > +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
> > +{
> > +   if (!offset)
> > +   return true;
> > +   if (!gpa)
> > +   return false;
> > +
> > +   return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
> 
> Perhaps we could do something like
> 
> #define lowest_set_bit(val) (val & -val)
> 
> and use
> 
> return lowest_set_bit(offset) >= lowest_set_bit(gpa);

I see the kernel already has fls64(); that looks like what we need ;)

> 
> Please help me to understand: why must ALIGNMENT(offset) >=
> ALIGNMENT(gpa)? Why is it not sufficient to have both gpa and offset be
> aligned to PAGE_SIZE?

Yes, it's sufficient. Here we just want to be conservative on the uAPI,
as Sean explained at [1]:

  I would rather reject memslot if the gfn has lesser alignment than the
  offset. I'm totally ok with this approach _if_ there's a use case. 
  Until such a use case presents itself, I would rather be conservative
  from a uAPI perspective.

[1] https://lore.kernel.org/all/y8hldehbrw+oo...@google.com/

Chao
> 
> > +}
> > +
> >   /*
> >* Allocate some memory and give it an address in the guest physical
> > address
> >* space.
> > @@ -2128,7 +2142,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > if (mem->flags & KVM_MEM_PRIVATE &&
> > (mem->restrictedmem_offset & (PAGE_SIZE - 1) ||
> >  mem->restrictedmem_offset + mem->memory_size <
> > mem->restrictedmem_offset ||
> > -0 /* TODO: require gfn be aligned with restricted offset */))
> > +!kvm_check_rmem_offset_alignment(mem->restrictedmem_offset,
> > + mem->guest_phys_addr)))
> > return -EINVAL;
> > if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM)
> > return -EINVAL;



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2023-02-23 Thread Chao Peng
> > int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
> >struct restrictedmem_notifier *notifier, bool exclusive)
> > {
> > struct restrictedmem *rm = file->f_mapping->private_data;
> > int ret = -EINVAL;
> > 
> > down_write(&rm->lock);
> > 
> > /* Non-exclusive mappings are not yet implemented. */
> > if (!exclusive)
> > goto out_unlock;
> > 
> > if (!xa_empty(&rm->bindings)) {
> > if (exclusive != rm->exclusive)
> > goto out_unlock;
> > 
> > if (exclusive && xa_find(&rm->bindings, &start, end, 
> > XA_PRESENT))
> > goto out_unlock;
> > }
> > 
> > xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
> 
> 
> || ld: mm/restrictedmem.o: in function `restrictedmem_bind':
> mm/restrictedmem.c|295| undefined reference to `xa_store_range'

Right, xa_store_range() is only available for XARRAY_MULTI.

> 
> 
> This is missing:
> ===
> diff --git a/mm/Kconfig b/mm/Kconfig
> index f952d0172080..03aca542c0da 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1087,6 +1087,7 @@ config SECRETMEM
>  config RESTRICTEDMEM
> bool
> depends on TMPFS
> +   select XARRAY_MULTI
> ===
> 
> Thanks,
> 
> 
> 
> > rm->exclusive = exclusive;
> > ret = 0;
> > out_unlock:
> > up_write(&rm->lock);
> > return ret;
> > }
> > EXPORT_SYMBOL_GPL(restrictedmem_bind);
> > 
> > void restrictedmem_unbind(struct file *file, pgoff_t start, pgoff_t end,
> >   struct restrictedmem_notifier *notifier)
> > {
> > struct restrictedmem *rm = file->f_mapping->private_data;
> > 
> > down_write(&rm->lock);
> > xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
> > synchronize_rcu();
> > up_write(&rm->lock);
> > }
> > EXPORT_SYMBOL_GPL(restrictedmem_unbind);
> 
> -- 
> Alexey



Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-02-21 Thread Chao Peng
> Hi Sean,
> 
> We've rebased the SEV+SNP support onto your updated UPM base support
> tree and things seem to be working okay, but we needed some fixups on
> top of the base support get things working, along with 1 workaround
> for an issue that hasn't been root-caused yet:
> 
>   https://github.com/mdroth/linux/commits/upmv10b-host-snp-v8-wip
> 
>   *stash (upm_base_support): mm: restrictedmem: Kirill's pinning 
> implementation
>   *workaround (use_base_support): mm: restrictedmem: loosen exclusivity check

What I'm seeing is Slot#3 gets added first and then deleted. When it
gets added, Slot#0 already has the same range bound to restrictedmem,
which triggers the exclusive check. This check is exactly what the
current code is for.

>   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem 
> binding/unbinding
>   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for 
> issuing invalidations

As many kernel APIs treat 'end' as exclusive, I would rather keep using
an exclusive 'end' for these APIs (restrictedmem_bind/restrictedmem_unbind
and the notifier callbacks) but fix it up internally in restrictedmem, e.g.
in all the places where the xarray API needs a 'last'/'max' we use 'end - 1'.
See below for the change.

>   *fixup (upm_base_support): KVM: fix restrictedmem GFN range calculations

Subtracting slot->restrictedmem.index for start/end in
restrictedmem_get_gfn_range() is the correct fix.

>   *fixup (upm_base_support): KVM: selftests: CoCo compilation fixes
> 
> We plan to post an updated RFC for v8 soon, but also wanted to share
> the staging tree in case you end up looking at the UPM integration aspects
> before then.
> 
> -Mike

This is the restrictedmem fix to solve 'end' being stored and checked in xarray:

--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -46,12 +46,12 @@ static long restrictedmem_punch_hole(struct restrictedmem 
*rm, int mode,
 */
down_read(&rm->lock);
 
-   xa_for_each_range(&rm->bindings, index, notifier, start, end)
+   xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
notifier->ops->invalidate_start(notifier, start, end);
 
ret = memfd->f_op->fallocate(memfd, mode, offset, len);
 
-   xa_for_each_range(&rm->bindings, index, notifier, start, end)
+   xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
notifier->ops->invalidate_end(notifier, start, end);
 
up_read(&rm->lock);
@@ -224,7 +224,7 @@ static int restricted_error_remove_page(struct 
address_space *mapping,
}
spin_unlock(&inode->i_lock);
 
-   xa_for_each_range(&rm->bindings, index, notifier, start, end)
+   xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
notifier->ops->error(notifier, start, end);
break;
}
@@ -301,11 +301,12 @@ int restrictedmem_bind(struct file *file, pgoff_t start, 
pgoff_t end,
if (exclusive != rm->exclusive)
goto out_unlock;
 
-   if (exclusive && xa_find(&rm->bindings, &start, end, XA_PRESENT))
+   if (exclusive &&
+   xa_find(&rm->bindings, &start, end - 1, XA_PRESENT))
goto out_unlock;
}
 
-   xa_store_range(&rm->bindings, start, end, notifier, GFP_KERNEL);
+   xa_store_range(&rm->bindings, start, end - 1, notifier, GFP_KERNEL);
rm->exclusive = exclusive;
ret = 0;
 out_unlock:
@@ -320,7 +321,7 @@ void restrictedmem_unbind(struct file *file, pgoff_t start, 
pgoff_t end,
struct restrictedmem *rm = file->f_mapping->private_data;
 
down_write(&rm->lock);
-   xa_store_range(&rm->bindings, start, end, NULL, GFP_KERNEL);
+   xa_store_range(&rm->bindings, start, end - 1, NULL, GFP_KERNEL);
synchronize_rcu();
up_write(&rm->lock);
 }



Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-01-28 Thread Chao Peng
On Sat, Jan 14, 2023 at 12:01:01AM +, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
... 
> Strongly prefer to use similar logic to existing code that detects wraps:
> 
>   mem->restricted_offset + mem->memory_size < 
> mem->restricted_offset
> 
> This is also where I'd like to add the "gfn is aligned to offset" check, 
> though
> my brain is too fried to figure that out right now.

Used count_trailing_zeros() for this TODO; unsure whether we have a better
approach.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index afc8c26fa652..fd34c5f7cd2f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include <linux/count_zeros.h>
 
 #include "coalesced_mmio.h"
 #include "async_pf.h"
@@ -2087,6 +2088,19 @@ static bool kvm_check_memslot_overlap(struct 
kvm_memslots *slots, int id,
return false;
 }
 
+/*
+ * Return true when ALIGNMENT(offset) >= ALIGNMENT(gpa).
+ */
+static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
+{
+   if (!offset)
+   return true;
+   if (!gpa)
+   return false;
+
+   return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
+}
+
 /*
  * Allocate some memory and give it an address in the guest physical address
  * space.
@@ -2128,7 +2142,8 @@ int __kvm_set_memory_region(struct kvm *kvm,
if (mem->flags & KVM_MEM_PRIVATE &&
(mem->restrictedmem_offset & (PAGE_SIZE - 1) ||
 mem->restrictedmem_offset + mem->memory_size < 
mem->restrictedmem_offset ||
-0 /* TODO: require gfn be aligned with restricted offset */))
+!kvm_check_rmem_offset_alignment(mem->restrictedmem_offset,
+ mem->guest_phys_addr)))
return -EINVAL;
if (as_id >= kvm_arch_nr_memslot_as_ids(kvm) || id >= KVM_MEM_SLOTS_NUM)
return -EINVAL;
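
As a quick aside on the wrap check above, a worked example with made-up
values (illustrative only, not from the thread):

	/*
	 * restrictedmem_offset = 0xFFFFFFFFFFFFF000 (u64)
	 * memory_size          = 0x2000
	 *
	 * offset + size wraps around 2^64 and yields 0x1000, which is less
	 * than restrictedmem_offset, so the "offset + size < offset" test
	 * catches the overflow and the memslot is rejected with -EINVAL.
	 */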




Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed

2023-01-28 Thread Chao Peng
On Fri, Jan 13, 2023 at 11:16:27PM +, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9a07380f8d3c..5aefcff614d2 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm 
> > *kvm,
> > if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 
> > 1))
> > linfo[lpages - 1].disallow_lpage = 1;
> > ugfn = slot->userspace_addr >> PAGE_SHIFT;
> > +   if (kvm_slot_can_be_private(slot))
> > +   ugfn |= slot->restricted_offset >> PAGE_SHIFT;
> > /*
> >  * If the gfn and userspace address are not aligned wrt each
> >  * other, disable large page support for this slot.
> 
> Forgot to talk about the bug.  This code needs to handle the scenario where a
> memslot is created with existing, non-uniform attributes.  It might be a bit 
> ugly
> (I didn't even try to write the code), but it's definitely possible, and since
> memslot updates are already slow I think it's best to handle things here.
> 
> In the meantime, I added this so we don't forget to fix it before merging.
> 
> #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>   pr_crit_once("FIXME: Walk the memory attributes of the slot and set the 
> mixed status appropriately");
> #endif

Here is the code to fix (based on your latest github repo).

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e552374f2357..609ff1cba9c5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2195,4 +2195,9 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
unsigned long npages);
 KVM_X86_QUIRK_FIX_HYPERCALL_INSN | \
 KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS)
 
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+void kvm_memory_attributes_create_memslot(struct kvm *kvm,
+ struct kvm_memory_slot *slot);
+#endif
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index eda615f3951c..8833d7201e41 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7201,10 +7201,11 @@ static bool has_mixed_attrs(struct kvm *kvm, struct 
kvm_memory_slot *slot,
return false;
 }
 
-void kvm_arch_set_memory_attributes(struct kvm *kvm,
-   struct kvm_memory_slot *slot,
-   unsigned long attrs,
-   gfn_t start, gfn_t end)
+static void kvm_update_lpage_mixed_flag(struct kvm *kvm,
+   struct kvm_memory_slot *slot,
+   bool set_attrs,
+   unsigned long attrs,
+   gfn_t start, gfn_t end)
 {
unsigned long pages, mask;
gfn_t gfn, gfn_end, first, last;
@@ -7231,25 +7232,53 @@ void kvm_arch_set_memory_attributes(struct kvm *kvm,
first = start & mask;
last = (end - 1) & mask;
 
-   /*
-* We only need to scan the head and tail page, for middle pages
-* we know they will not be mixed.
-*/
+   /* head page */
gfn = max(first, slot->base_gfn);
gfn_end = min(first + pages, slot->base_gfn + slot->npages);
+   if (!set_attrs)
+   attrs = kvm_get_memory_attributes(kvm, gfn);
mixed = has_mixed_attrs(kvm, slot, level, attrs, gfn, gfn_end);
linfo_update_mixed(gfn, slot, level, mixed);
 
if (first == last)
return;
 
-   for (gfn = first + pages; gfn < last; gfn += pages)
-   linfo_update_mixed(gfn, slot, level, false);
+   /* middle pages */
+   for (gfn = first + pages; gfn < last; gfn += pages) {
+   if (set_attrs) {
+   mixed = false;
+   } else {
+   gfn_end = gfn + pages;
+   attrs = kvm_get_memory_attributes(kvm, gfn);
+   mixed = has_mixed_attrs(kvm, slot, level, attrs,
+   gfn, gfn_end);
+   }
+   linfo_update_mixed(gfn, slot, level, mixed);
+   }
 
+   /* tail page */
gfn = last;
gfn_end = min(last + pages, slot->base_gfn + slot->npages);
+   if (!set_attrs)
+   attrs = kv

Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-01-18 Thread Chao Peng
On Tue, Jan 17, 2023 at 07:35:58PM +, Sean Christopherson wrote:
> On Tue, Jan 17, 2023, Chao Peng wrote:
> > On Sat, Jan 14, 2023 at 12:01:01AM +, Sean Christopherson wrote:
> > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu 
> > > > *vcpu)
> > > >  
> > > > if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, 
> > > > vcpu))
> > > > 
> > > > static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> > > > +
> > > > +   if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> > > > +   vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> > > 
> > > Synthesizing triple fault shutdown is not the right approach.  Even with 
> > > TDX's
> > > MCE "architecture" (heavy sarcasm), it's possible that host userspace and 
> > > the
> > > guest have a paravirt interface for handling memory errors without 
> > > killing the
> > > host.
> > 
> > Agree that shutdown is not the correct choice. I see you made the below change:
> > 
> > send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current)
> > 
> > The MCE may happen in a thread other than the KVM thread; sending a signal
> > to the 'current' thread may not be the expected behavior.
> 
> This is already true today, e.g. a #MC in memory that is mapped into the 
> guest can
> be triggered by a host access.  Hrm, but in this case we actually have a KVM
> instance, and we know that the #MC is relevant to the KVM instance, so I agree
> that signaling 'current' is kludgy.
> 
> >  Also, how can userspace tell whether the MCE is on a shared page or a
> >  private page? Do we care?
> 
> We care.  I was originally thinking we could require userspace to keep track 
> of
> things, but that's quite prescriptive and flawed, e.g. could race with 
> conversions.
> 
> One option would be to KVM_EXIT_MEMORY_FAULT, and then wire up a generic (not 
> x86
> specific) KVM request to exit to userspace, e.g.
> 
>   /* KVM_EXIT_MEMORY_FAULT */
>   struct {
> #define KVM_MEMORY_EXIT_FLAG_PRIVATE  (1ULL << 3)
> #define KVM_MEMORY_EXIT_FLAG_HW_ERROR (1ULL << 4)
>   __u64 flags;
>   __u64 gpa;
>   __u64 size;
>   } memory;
> 
> But I'm not sure that's the correct approach.  It kinda feels like we're 
> reinventing
> the wheel.  It seems like restrictedmem_get_page() _must_ be able to reject 
> attempts
> to get a poisoned page, i.e. restrictedmem_get_page() should yield 
> KVM_PFN_ERR_HWPOISON.

Yes, I see there is -EHWPOISON handling in hva_to_pfn() for shared
memory. It makes sense to do something similar for private pages.
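
A rough sketch of what that could look like in the series'
kvm_restricted_mem_get_pfn() helper, assuming restrictedmem_get_page()
reports poisoned pages with -EHWPOISON (an illustration, not the actual
series code):

static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
					     gfn_t gfn, kvm_pfn_t *pfn,
					     int *order)
{
	pgoff_t index = gfn - slot->base_gfn +
			(slot->restricted_offset >> PAGE_SHIFT);
	struct page *page;
	int ret;

	ret = restrictedmem_get_page(slot->restricted_file, index,
				     &page, order);
	if (ret == -EHWPOISON) {
		/* Surface the poison to the fault path, like hva_to_pfn(). */
		*pfn = KVM_PFN_ERR_HWPOISON;
		return 0;
	}
	if (ret)
		return ret;

	*pfn = page_to_pfn(page);
	return 0;
}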

> Assuming that's the case, then I believe KVM simply needs to zap SPTEs in 
> response
> to an error notification in order to force vCPUs to fault on the poisoned 
> page.

Agreed, this is what we should do anyway.

> 
> > > > +   return -EINVAL;
> > > > if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> > > > return -EINVAL;
> > > > if (mem->guest_phys_addr + mem->memory_size < 
> > > > mem->guest_phys_addr)
> > > > @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > > > if ((kvm->nr_memslot_pages + npages) < 
> > > > kvm->nr_memslot_pages)
> > > > return -EINVAL;
> > > > } else { /* Modify an existing slot. */
> > > > +   /* Private memslots are immutable, they can only be 
> > > > deleted. */
> > > 
> > > I'm 99% certain I suggested this, but if we're going to make these 
> > > memslots
> > > immutable, then we should straight up disallow dirty logging, otherwise 
> > > we'll
> > > end up with a bizarre uAPI.
> > 
> > But in my mind dirty logging will be needed fairly soon, once live
> > migration gets supported?
> 
> Ya, but if/when live migration support is added, private memslots will no 
> longer
> be immutable as userspace will want to enable dirty logging only when a VM is
> being migrated, i.e. something will need to change.
> 
> Given that it looks like we have clear line of sight to SEV+UPM guests, my
> preference would be to allow toggling dirty logging from the get-go.  It 
> doesn't
> necessarily have to be in the first patch, e.g. KV

Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2023-01-18 Thread Chao Peng
On Tue, Jan 17, 2023 at 04:34:15PM +, Sean Christopherson wrote:
> On Tue, Jan 17, 2023, Chao Peng wrote:
> > On Fri, Jan 13, 2023 at 09:54:41PM +, Sean Christopherson wrote:
> > > > +   list_for_each_entry(notifier, &data->notifiers, list) {
> > > > +   notifier->ops->invalidate_start(notifier, start, end);
> > > 
> > > Two major design issues that we overlooked long ago:
> > > 
> > >   1. Blindly invoking notifiers will not scale.  E.g. if userspace 
> > > configures a
> > >  VM with a large number of convertible memslots that are all backed 
> > > by a
> > >  single large restrictedmem instance, then converting a single page 
> > > will
> > >  result in a linear walk through all memslots.  I don't expect anyone 
> > > to
> > >  actually do something silly like that, but I also never expected 
> > > there to be
> > >  a legitimate usecase for thousands of memslots.
> > > 
> > >   2. This approach fails to provide the ability for KVM to ensure a guest 
> > > has
> > >  exclusive access to a page.  As discussed in the past, the kernel 
> > > can rely
> > >  on hardware (and maybe ARM's pKVM implementation?) for those 
> > > guarantees, but
> > >  only for SNP and TDX VMs.  For VMs where userspace is trusted to 
> > > some extent,
> > >  e.g. SEV, there is value in ensuring a 1:1 association.
> > > 
> > >  And probably more importantly, relying on hardware for SNP and TDX 
> > > yields a
> > >  poor ABI and complicates KVM's internals.  If the kernel doesn't 
> > > guarantee a
> > >  page is exclusive to a guest, i.e. if userspace can hand out the 
> > > same page
> > >  from a restrictedmem instance to multiple VMs, then failure will 
> > > occur only
> > >  when KVM tries to assign the page to the second VM.  That will 
> > > happen deep
> > >  in KVM, which means KVM needs to gracefully handle such errors, and 
> > > it means
> > >  that KVM's ABI effectively allows plumbing garbage into its memslots.
> > 
> > It may not be a valid usage, but in my TDX environment I do hit the below
> > issue.
> > 
> > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x8000 
> > ua=0x7fe1ebfff000 ret=0
> > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc0 
> > size=0x40 ua=0x7fe271579000 ret=0
> > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda 
> > size=0x2 ua=0x7fe1ec09f000 ret=-22
> > 
> > Slot#2 ('SMRAM') is actually an alias into system memory (Slot#0) in QEMU,
> > and slot#2 fails due to the below exclusive check.
> > 
> > Currently I changed the QEMU code to mark these alias slots as shared
> > instead of private, but I'm not 100% confident this is the correct fix.
> 
> That's a QEMU bug of sorts.  SMM is mutually exclusive with TDX, QEMU 
> shouldn't
> be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.

Thanks for the confirmation. As long as we only bind one notifier for
each address, using xarray does make things simple.

Chao



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2023-01-17 Thread Chao Peng
On Tue, Jan 17, 2023 at 11:21:10AM +0800, Binbin Wu wrote:
> 
> On 12/2/2022 2:13 PM, Chao Peng wrote:
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> > 
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> >- KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> >  a guest memory range.
> >- KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> >  memory attributes.
> > 
> > KVM internally uses xarray to store the per-page memory attributes.
> > 
> > Suggested-by: Sean Christopherson 
> > Signed-off-by: Chao Peng 
> > Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com/
> > ---
> >   Documentation/virt/kvm/api.rst | 63 
> >   arch/x86/kvm/Kconfig   |  1 +
> >   include/linux/kvm_host.h   |  3 ++
> >   include/uapi/linux/kvm.h   | 17 
> 
> Should the changes introduced in this file also need to be added in
> tools/include/uapi/linux/kvm.h ?

Yes, I think so. But I'm hesitant about whether to include that in this
patch or not. I see many commits sync the kernel kvm.h to the tools copy.
It looks like that is done periodically and with a 'pull' model.

Chao
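
As an aside, a minimal sketch of how userspace might drive the two ioctls
described in the quoted commit message (struct layout and flag names follow
the series as understood here and should be treated as illustrative, not
authoritative):

	struct kvm_memory_attributes attrs = {
		.address    = 0x100000,		/* example GPA */
		.size       = 0x200000,		/* example length (2M) */
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
	};
	__u64 supported = 0;

	/* Ask KVM which attribute bits it supports. */
	ioctl(vm_fd, KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES, &supported);

	/* Mark the example range as private. */
	if (supported & KVM_MEMORY_ATTRIBUTE_PRIVATE)
		ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);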



Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-01-17 Thread Chao Peng
On Sat, Jan 14, 2023 at 12:37:59AM +, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > This patch series implements KVM guest private memory for confidential
> > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > TDX-protected guest memory, machine check can happen which can further
> > crash the running host system, this is terrible for multi-tenant
> > configurations. The host accesses include those from KVM userspace like
> > QEMU. This series addresses KVM userspace induced crash by introducing
> > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > via a fd-based approach, but it can never access the guest memory
> > content.
> > 
> > The patch series touches both core mm and KVM code. I appreciate
> > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > reviews are always welcome.
> >   - 01: mm change, target for mm tree
> >   - 02-09: KVM change, target for KVM tree
> 
> A version with all of my feedback, plus reworked versions of Vishal's 
> selftest,
> is available here:
> 
>   git@github.com:sean-jc/linux.git x86/upm_base_support
> 
> It compiles and passes the selftest, but it's otherwise barely tested.  There 
> are
> a few todos (2 I think?) and many of the commits need changelogs, i.e. it's 
> still
> a WIP.

Thanks very much for doing this. Almost all of your comments are well
received, except for two cases that need more discussion, which I have
replied to individually.

> 
> As for next steps, can you (handwaving all of the TDX folks) take a look at 
> what
> I pushed and see if there's anything horrifically broken, and that it still 
> works
> for TDX?

I have integrated this into my local TDX repo, with some changes (as I
replied individually); the new code basically still works with TDX.

I have also asked other TDX folks to take a look.

> 
> Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no 
> rush
> (and I mean that).
> 
> On my side, the two things on my mind are (a) tests and (b) downstream 
> dependencies
> (SEV and TDX).  For tests, I want to build a lists of tests that are required 
> for
> merging so that the criteria for merging are clear, and so that if the list 
> is large
> (haven't thought much yet), the work of writing and running tests can be 
> distributed.
> 
> Regarding downstream dependencies, before this lands, I want to pull in all 
> the
> TDX and SNP series and see how everything fits together.  Specifically, I 
> want to
> make sure that we don't end up with a uAPI that necessitates ugly code, and 
> that we
> don't miss an opportunity to make things simpler.  The patches in the SNP 
> series to
> add "legacy" SEV support for UPM in particular made me slightly rethink some 
> minor
> details.  Nothing remotely major, but something that needs attention since 
> it'll
> be uAPI.
> 
> I'm off Monday, so it'll be at least Tuesday before I make any more progress 
> on
> my side.

Appreciate your effort. As for the next steps, if you see something we
can do in parallel, feel free to let me know.

Thanks,
Chao



Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-01-17 Thread Chao Peng
On Sat, Jan 14, 2023 at 12:01:01AM +, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  
> > if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
> > static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> > +
> > +   if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> > +   vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> 
> Synthesizing triple fault shutdown is not the right approach.  Even with TDX's
> MCE "architecture" (heavy sarcasm), it's possible that host userspace and the
> guest have a paravirt interface for handling memory errors without killing the
> host.

Agree that shutdown is not the correct choice. I see you made the below change:

send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current)

The MCE may happen in a thread other than the KVM thread; sending a signal
to the 'current' thread may not be the expected behavior. Also, how can
userspace tell whether the MCE is on a shared page or a private page? Do
we care?

> 
> > +   r = 0;
> > +   goto out;
> > +   }
> > }
> 
> 
> > @@ -1982,6 +2112,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
> >  !access_ok((void __user *)(unsigned long)mem->userspace_addr,
> > mem->memory_size))
> > return -EINVAL;
> > +   if (mem->flags & KVM_MEM_PRIVATE &&
> > +   (mem->restricted_offset & (PAGE_SIZE - 1) ||
> 
> Align indentation.
> 
> > +mem->restricted_offset > U64_MAX - mem->memory_size))
> 
> Strongly prefer to use similar logic to existing code that detects wraps:
> 
>   mem->restricted_offset + mem->memory_size < 
> mem->restricted_offset
> 
> This is also where I'd like to add the "gfn is aligned to offset" check, 
> though
> my brain is too fried to figure that out right now.
> 
> > +   return -EINVAL;
> > if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> > return -EINVAL;
> > if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> > @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> > return -EINVAL;
> > } else { /* Modify an existing slot. */
> > +   /* Private memslots are immutable, they can only be deleted. */
> 
> I'm 99% certain I suggested this, but if we're going to make these memslots
> immutable, then we should straight up disallow dirty logging, otherwise we'll
> end up with a bizarre uAPI.

But in my mind dirty logging will be needed fairly soon, once live
migration gets supported?

> 
> > +   if (mem->flags & KVM_MEM_PRIVATE)
> > +   return -EINVAL;
> > if ((mem->userspace_addr != old->userspace_addr) ||
> > (npages != old->npages) ||
> > ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> > @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > new->npages = npages;
> > new->flags = mem->flags;
> > new->userspace_addr = mem->userspace_addr;
> > +   if (mem->flags & KVM_MEM_PRIVATE) {
> > +   new->restricted_file = fget(mem->restricted_fd);
> > +   if (!new->restricted_file ||
> > +   !file_is_restrictedmem(new->restricted_file)) {
> > +   r = -EINVAL;
> > +   goto out;
> > +   }
> > +   new->restricted_offset = mem->restricted_offset;

I see you changed the slot->restricted_offset type from loff_t to gfn_t and
used pgoff_t when doing the restrictedmem_bind/unbind(). Using a page
index is reasonable inside KVM and sounds simpler than loff_t. But we
also need to initialize it to a page index here, as well as make changes
in another two cases. This is needed when restricted_offset != 0.

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 547b92215002..49e375e78f30 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2364,8 +2364,7 @@ static inline int kvm_restricted_mem_get_pfn(struct 
kvm_memory_slot *slot,
 gfn_t gfn, kvm_pfn_t *pfn,
 int *order)
 {
-   pgoff_t index = gfn - slot->base_gfn +
-   (slot->restricted_offset >> PAGE_SHIFT);
+   
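
(The quoted hunk is cut off above.)  As a rough sketch of the kind of
page-index conversion being described, under the assumption that
restricted_offset now stores a page index rather than a byte offset (a
reconstruction for illustration, not the actual series code):

	/* In __kvm_set_memory_region(), when creating a private memslot: */
	new->restricted_offset = mem->restricted_offset >> PAGE_SHIFT;

	/* ...so kvm_restricted_mem_get_pfn() can use it without shifting: */
	pgoff_t index = gfn - slot->base_gfn + slot->restricted_offset;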

Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2023-01-17 Thread Chao Peng
On Fri, Jan 13, 2023 at 10:37:39PM +, Sean Christopherson wrote:
> On Tue, Jan 10, 2023, Chao Peng wrote:
> > On Mon, Jan 09, 2023 at 07:32:05PM +, Sean Christopherson wrote:
> > > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > > On Thu, Jan 05, 2023 at 11:23:01AM +, Jarkko Sakkinen wrote:
> > > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > > To make future maintenance easy, internally use a binary compatible
> > > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > > '_ext' variants.
> > > > > 
> > > > > Feels bit hacky IMHO, and more like a completely new feature than
> > > > > an extension.
> > > > > 
> > > > > Why not just add a new ioctl? The commit message does not address
> > > > > the most essential design here.
> > > > 
> > > > Yes, people can always choose to add a new ioctl for this kind of change
> > > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > > the functionalities are similar.  The '_ext' variant reuses all the
> > > > existing fields in the 'normal' variant and most importantly KVM
> > > > internally can reuse most of the code. I certainly can add some words in
> > > > the commit message to explain this design choice.
> > > 
> > > After seeing the userspace side of this, I agree with Jarkko; overloading
> > > KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up 
> > > being
> > > bogus, and userspace ends up abusing unions or implementing 
> > > kvm_user_mem_region
> > > itself.
> > 
> > How is the size validation being bogus? I don't quite follow.
> 
> The ioctl() magic embeds the size of the payload (struct 
> kvm_userspace_memory_region
> in this case) in the ioctl() number, and that information is visible to 
> userspace
> via _IOCTL_SIZE().  Attempting to take a larger size can mess up sanity 
> checks,
> e.g. KVM selftests get tripped up on this assert if 
> KVM_SET_USER_MEMORY_REGION is
> passed an "extended" struct.
> 
>   #define kvm_do_ioctl(fd, cmd, arg)  
> \
>   ({  
> \
>   kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == 
> _IOC_SIZE(cmd));   \
>   ioctl(fd, cmd, arg);
> \
>   })

Got it. Thanks for the explanation.

Chao



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2023-01-17 Thread Chao Peng
On Fri, Jan 13, 2023 at 09:54:41PM +, Sean Christopherson wrote:
> On Fri, Dec 02, 2022, Chao Peng wrote:
> > The system call is currently wired up for x86 arch.
> 
> Building on other architectures (except for arm64 for some reason) yields:
> 
>   CALL/.../scripts/checksyscalls.sh
>   :1565:2: warning: #warning syscall memfd_restricted not implemented 
> [-Wcpp]
> 
> Do we care?  It's the only such warning, which makes me think we either need 
> to
> wire this up for all architectures, or explicitly document that it's 
> unsupported.

I'm a bit conservative and prefer enabling it only on x86, where we know
the exact use case. We can get rid of the warning by changing
scripts/checksyscalls.sh, just like __IGNORE_memfd_secret:

https://lkml.kernel.org/r/20210518072034.31572-7-r...@kernel.org
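
For reference, a sketch of the kind of change being referred to, assuming
the same __IGNORE_ mechanism the memfd_secret commit used (illustrative
only, not a patch from this thread):

	--- a/scripts/checksyscalls.sh
	+++ b/scripts/checksyscalls.sh
	@@ ... @@
	+#define __IGNORE_memfd_restricted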

> 
> > Signed-off-by: Kirill A. Shutemov 
> > Signed-off-by: Chao Peng 
> > ---
> 
> ...
> 
> > diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> > new file mode 100644
> > index ..c2700c5daa43
> > --- /dev/null
> > +++ b/include/linux/restrictedmem.h
> > @@ -0,0 +1,71 @@
> > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> > +#ifndef _LINUX_RESTRICTEDMEM_H
> 
> Missing
> 
>  #define _LINUX_RESTRICTEDMEM_H
> 
> which causes fireworks if restrictedmem.h is included more than once.
> 
> > +#include 
> > +#include 
> > +#include 
> 
> ...
> 
> > +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +struct page **pagep, int *order)
> > +{
> > +   return -1;
> 
> This should be a proper -errno, though in the current incarnation of things 
> it's
> a moot point because no stub is needed.  KVM can (and should) easily provide 
> its
> own stub for this one.
> 
> > +}
> > +
> > +static inline bool file_is_restrictedmem(struct file *file)
> > +{
> > +   return false;
> > +}
> > +
> > +static inline void restrictedmem_error_page(struct page *page,
> > +   struct address_space *mapping)
> > +{
> > +}
> > +
> > +#endif /* CONFIG_RESTRICTEDMEM */
> > +
> > +#endif /* _LINUX_RESTRICTEDMEM_H */
> 
> ...
> 
> > diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> > new file mode 100644
> > index ..56953c204e5c
> > --- /dev/null
> > +++ b/mm/restrictedmem.c
> > @@ -0,0 +1,318 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +struct restrictedmem_data {
> 
> Any objection to simply calling this "restrictedmem"?  And then using either 
> "rm"
> or "rmem" for local variable names?  I kept reading "data" as the underyling 
> data
> being written to the page, as opposed to the metadata describing the 
> restrictedmem
> instance.
> 
> > +   struct mutex lock;
> > +   struct file *memfd;
> > +   struct list_head notifiers;
> > +};
> > +
> > +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> > +  pgoff_t start, pgoff_t end)
> > +{
> > +   struct restrictedmem_notifier *notifier;
> > +
> > > > +   mutex_lock(&data->lock);
> 
> This can be a r/w semaphore instead of a mutex, that way punching holes at 
> multiple
> points in the file can at least run the notifiers in parallel.  The actual 
> allocation
> by shmem will still be serialized, but I think it's worth the simple 
> optimization
> since zapping and flushing in KVM may be somewhat slow.
> 
> > > > +   list_for_each_entry(notifier, &data->notifiers, list) {
> > +   notifier->ops->invalidate_start(notifier, start, end);
> 
> Two major design issues that we overlooked long ago:
> 
>   1. Blindly invoking notifiers will not scale.  E.g. if userspace configures 
> a
>  VM with a large number of convertible memslots that are all backed by a
>  single large restrictedmem instance, then converting a single page will
>  result in a linear walk through all memslots.  I don't expect anyone to
>  actually do something silly like that, but I also never expected there 
> to be
>  a legitimate usecase for thousands of memslots.
> 
>   2. This approach fails to provide the ability for KVM to ensure a guest has
>  exclusive access to a page.  As discussed in the past

Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2023-01-10 Thread Chao Peng
On Mon, Jan 09, 2023 at 07:32:05PM +, Sean Christopherson wrote:
> On Fri, Jan 06, 2023, Chao Peng wrote:
> > On Thu, Jan 05, 2023 at 11:23:01AM +, Jarkko Sakkinen wrote:
> > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > > 
> > > Feels bit hacky IMHO, and more like a completely new feature than
> > > an extension.
> > > 
> > > Why not just add a new ioctl? The commit message does not address
> > > the most essential design here.
> > 
> > Yes, people can always choose to add a new ioctl for this kind of change
> > and the balance point here is we want to also avoid 'too many ioctls' if
> > the functionalities are similar.  The '_ext' variant reuses all the
> > existing fields in the 'normal' variant and most importantly KVM
> > internally can reuse most of the code. I certainly can add some words in
> > the commit message to explain this design choice.
> 
> After seeing the userspace side of this, I agree with Jarkko; overloading
> KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
> bogus, and userspace ends up abusing unions or implementing 
> kvm_user_mem_region
> itself.

How is the size validation being bogus? I don't quite follow. Then we
will use kvm_userspace_memory_region2 as the KVM internal alias, right?
I see similar examples using different functions to handle different
versions, but it does look easier if we use an alias for this function.

> 
> It feels absolutely ridiculous, but I think the best option is to do:
> 
> #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
>struct kvm_userspace_memory_region2)

Just curious, is 0x49 a safe number we can use? 

> 
> /* for KVM_SET_USER_MEMORY_REGION2 */
> struct kvm_user_mem_region2 {
>   __u32 slot;
>   __u32 flags;
>   __u64 guest_phys_addr;
>   __u64 memory_size;
>   __u64 userspace_addr;
>   __u64 restricted_offset;
>   __u32 restricted_fd;
>   __u32 pad1;
>   __u64 pad2[14];
> }
> 
> And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.

Okay, I agree that from the KVM userspace API perspective this is more
consistent with similar existing examples. I see several of them.

I think we will also need a CAP_KVM_SET_USER_MEMORY_REGION2 for this new
ioctl.

> 
> Regarding the userspace side of things, please include Vishal's selftests in 
> v11,
> it's impossible to properly review the uAPI changes without seeing the 
> userspace
> side of things.  I'm in the process of reviewing Vishal's v2[*], I'll try to
> massage it into a set of patches that you can incorporate into your series.

Previously I included Vishal's selftests in the github repo, but did not
include them in this patch series. It's OK for me to incorporate them
directly into this series and review them together, if Vishal is fine
with that.

Chao
> 
> [*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapu...@google.com



Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2023-01-06 Thread Chao Peng
On Thu, Jan 05, 2023 at 11:23:01AM +, Jarkko Sakkinen wrote:
> On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. There is little value in allowing userspace to access guest
> > private memory, and doing so can sometimes cause problems. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> > 
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> > 
> > The extended memslot can still have the userspace_addr(hva). When use, a
> > single memslot can maintain both private memory through restricted_fd
> > and shared memory through userspace_addr. Whether the private or shared
> > part is visible to guest is maintained by other KVM code.
> > 
> > A restrictedmem_notifier field is also added to the memslot structure to
> > allow the restricted_fd's backing store to notify KVM the memory change,
> > KVM then can invalidate its page table entries or handle memory errors.
> > 
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> > 
> > To make future maintenance easy, internally use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
> 
> Feels bit hacky IMHO, and more like a completely new feature than
> an extension.
> 
> Why not just add a new ioctl? The commit message does not address
> the most essential design here.

Yes, people can always choose to add a new ioctl for this kind of change,
and the balance point here is that we also want to avoid 'too many ioctls'
when the functionalities are similar. The '_ext' variant reuses all the
existing fields of the 'normal' variant and, most importantly, KVM can
internally reuse most of the code. I can certainly add some words to the
commit message to explain this design choice.

Thanks,
Chao
> 
> BR, Jarkko



Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-01-05 Thread Chao Peng
On Thu, Jan 05, 2023 at 12:38:30PM -0800, Vishal Annapurve wrote:
> On Thu, Dec 1, 2022 at 10:20 PM Chao Peng  wrote:
> >
> > +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> > +static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
> > +pgoff_t start, pgoff_t end,
> > +gfn_t *gfn_start, gfn_t *gfn_end)
> > +{
> > +   unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
> > +
> > +   if (start > base_pgoff)
> > +   *gfn_start = slot->base_gfn + start - base_pgoff;
> 
> There should be a check for overflow here in case start is a very big
> value. Additional check can look like:
> if (start >= base_pgoff + slot->npages)
>return false;
> 
> > +   else
> > +   *gfn_start = slot->base_gfn;
> > +
> > +   if (end < base_pgoff + slot->npages)
> > +   *gfn_end = slot->base_gfn + end - base_pgoff;
> 
> If "end" is smaller than base_pgoff, this can cause overflow and
> return the range as valid. There should be additional check:
> if (end < base_pgoff)
>  return false;

Thanks! Both are good catches. The improved code:

static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
                                         pgoff_t start, pgoff_t end,
                                         gfn_t *gfn_start, gfn_t *gfn_end)
{
        unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;

        if (start >= base_pgoff + slot->npages)
                return false;
        else if (start <= base_pgoff)
                *gfn_start = slot->base_gfn;
        else
                *gfn_start = start - base_pgoff + slot->base_gfn;

        if (end <= base_pgoff)
                return false;
        else if (end >= base_pgoff + slot->npages)
                *gfn_end = slot->base_gfn + slot->npages;
        else
                *gfn_end = end - base_pgoff + slot->base_gfn;

        if (*gfn_start >= *gfn_end)
                return false;

        return true;
}

Thanks,
Chao
> 
> 
> > +   else
> > +   *gfn_end = slot->base_gfn + slot->npages;
> > +
> > +   if (*gfn_start >= *gfn_end)
> > +   return false;
> > +
> > +   return true;
> > +}
> > +



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2023-01-04 Thread Chao Peng
On Tue, Jan 03, 2023 at 11:06:37PM +, Sean Christopherson wrote:
> On Tue, Jan 03, 2023, Wang, Wei W wrote:
> > On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > > > Because guest memory defaults to private, and now this patch stores
> > > > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> > > _SHARED,
> > > > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > > > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > > > mem attr in advance.
> > > 
> > > KVM defaults to 'shared' because this ioctl can also be potentially used 
> > > by
> > > normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> > > confidential VMs. 
> > 
> > Do you mean a normal VM could have pages marked private? What's the usage?
> > (If all the pages are just marked shared for normal VMs, then why do we 
> > need it)
> 
> No, there are potential use cases for per-page attribute/permissions, e.g. to
> make select pages read-only, exec-only, no-exec, etc...

Right, normal VMs are unlikely to use the private/shared bit. I'm not
sure about pKVM, but perhaps we shouldn't call those 'normal' VMs in this
context. Still, since the ioctl can be used by normal VMs for other bits
(read-only, exec-only, no-exec, etc.), a default of 'private' looks
strange for them. That's why I default it to 'shared'; for a confidential
guest, userspace can issue another call to this ioctl to set all the
memory to 'private' before guest boot, if a 'private' default is needed
for the guest.

As Wei mentioned, it's also possible to make the default dependent on
vm_type, but that looks awkward to me from both the API definition and
the implementation perspective, and vm_type has not been introduced at
this point anyway.
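
Something like the following, as a rough userspace sketch of that
'override' (the struct and the KVM_MEMORY_ATTRIBUTE_PRIVATE bit mirror this
series, and the ioctl number is a placeholder; none of this is final uAPI):

#include <linux/types.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

/* Mirrors the uAPI proposed in this series; not in upstream headers yet. */
struct kvm_memory_attributes {
        __u64 address;
        __u64 size;
        __u64 attributes;
        __u64 flags;
};
#define KVM_MEMORY_ATTRIBUTE_PRIVATE    (1ULL << 3)
/* The ioctl number may differ in the series; treat it as a placeholder. */
#define KVM_SET_MEMORY_ATTRIBUTES       _IOW(KVMIO, 0xd2, struct kvm_memory_attributes)

/* Before the first vCPU runs, mark the whole guest range as private. */
static int set_all_memory_private(int vm_fd, uint64_t gpa, uint64_t size)
{
        struct kvm_memory_attributes attrs = {
                .address = gpa,
                .size = size,
                .attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
        };

        return ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
}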

> 
> > > As for more KVM_EXIT_MEMORY_FAULT exits during the
> > > booting time, yes, setting all memory to 'private' for confidential VMs 
> > > through
> > > this ioctl in userspace before guest launch is an approach for KVM 
> > > userspace to
> > > 'override' the KVM default and reduce the number of implicit conversions.
> > 
> > Most pages of a confidential VM are likely to be private pages. It seems 
> > more efficient
> > (and not difficult to check vm_type) to have KVM defaults to "private" for 
> > confidential VMs
> > and defaults to "shared" for normal VMs.
> 
> If done right, the default shouldn't matter all that much for efficiency.  KVM
> needs to be able to effeciently track large ranges regardless of the default,
> otherwise the memory overhead and the presumably cost of lookups will be 
> painful.
> E.g. converting a 1GiB chunk to shared should ideally require one entry, not 
> 256k
> entries.

I agree, KVM should have the ability to track large ranges efficiently.

> 
> Looks like that behavior was changed in v8 in response to feedback[*] that 
> doing
> xa_store_range() on a subset of an existing range (entry) would overwrite the
> entire existing range (entry), not just the smaller subset.  xa_store_range() 
> does
> appear to be too simplistic for this use case, but looking at 
> __filemap_add_folio(),
> splitting an existing entry isn't super complex.

Yes, xa_store_range() initially looked like a perfect match for us, but
the 'overwriting the entire entry' behavior makes it incorrect for us
when storing a subset of an existing large entry. The xarray lib has
utilities for splitting; the hard part is merging existing entries, as
you also said below. Thanks for pointing out the __filemap_add_folio()
example, splitting does look not too complex.
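
Just to make the problem concrete, a kernel-style sketch of the behavior we
tripped over (illustrative only, not code from this series):

#include <linux/xarray.h>

static DEFINE_XARRAY(attr_array);

static void xarray_overwrite_demo(void)
{
        /* Mark a 1GiB range of 4KiB pages (256k indices) as private (1). */
        xa_store_range(&attr_array, 0, 0x3ffff, xa_mk_value(1), GFP_KERNEL);

        /*
         * Convert a single page in the middle of that range to shared (2).
         * Because index 0x100 is covered by the existing large entry, this
         * store replaces the *entire* existing entry, so every other index
         * in 0...0x3ffff loses its 'private' value instead of keeping it.
         * What we actually need is to first split the old entry and only
         * then update the one index, similar to __filemap_add_folio().
         */
        xa_store_range(&attr_array, 0x100, 0x100, xa_mk_value(2), GFP_KERNEL);
}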

> 
> Using xa_store() for the very initial implementation is ok, and probably a 
> good
> idea since it's more obviously correct and will give us a bisection point.  
> But
> we definitely want a more performant implementation sooner than later.  The 
> hardest
> part will likely be merging existing entries, but that can be done separately 
> too,
> and is probably lower priority.
> 
> E.g. (1) use xa_store() and always track at 4KiB granularity, (2) support 
> storing
> metadata in multi-index entries, and finally (3) support merging adjacent 
> entries
> with identical values.

This path looks good to me.

Thanks,
Chao
> 
> [*] 
> https://lore.kernel.org/all/CAGtprH9xyw6bt4=rbwf6-v2cspabocpkq5rpz+e-9co7eis...@mail.gmail.com



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2023-01-02 Thread Chao Peng
On Wed, Dec 28, 2022 at 04:28:01PM +0800, Chenyi Qiang wrote:
...
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +  struct kvm_memory_attributes *attrs)
> > +{
> > +   gfn_t start, end;
> > +   unsigned long i;
> > +   void *entry;
> > +   u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +   /* flags is currently not used. */
> > +   if (attrs->flags)
> > +   return -EINVAL;
> > +   if (attrs->attributes & ~supported_attrs)
> > +   return -EINVAL;
> > +   if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +   return -EINVAL;
> > +   if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +   return -EINVAL;
> > +
> > +   start = attrs->address >> PAGE_SHIFT;
> > +   end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +   entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> 
> Because guest memory defaults to private, and now this patch stores the
> attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of _SHARED, it
> would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of boot
> time. Maybe it can be optimized somehow in other places? e.g. set mem
> attr in advance.

KVM defaults to 'shared' because this ioctl can also potentially be used
by normal VMs, and 'shared' is a value that is meaningful for both normal
VMs and confidential VMs. As for the additional KVM_EXIT_MEMORY_FAULT
exits during boot time: yes, setting all memory to 'private' for
confidential VMs through this ioctl in userspace before guest launch is a
way for KVM userspace to 'override' the KVM default and reduce the number
of implicit conversions.

Thanks,
Chao
> 
> > +   mutex_lock(&kvm->lock);
> > +   for (i = start; i < end; i++)
> > +   if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +   GFP_KERNEL_ACCOUNT)))
> > +   break;
> > +   mutex_unlock(&kvm->lock);
> > +
> > +   attrs->address = i << PAGE_SHIFT;
> > +   attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > +   return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> >  struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
> >  {
> > return __gfn_to_memslot(kvm_memslots(kvm), gfn);



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2022-12-23 Thread Chao Peng
On Thu, Dec 22, 2022 at 06:15:24PM +, Sean Christopherson wrote:
> On Wed, Dec 21, 2022, Chao Peng wrote:
> > On Tue, Dec 20, 2022 at 08:33:05AM +, Huang, Kai wrote:
> > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > On Mon, Dec 19, 2022 at 08:48:10AM +, Huang, Kai wrote:
> > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > > refcount after setting up mapping in the secondary mmu, otherwise the 
> > > page will
> > > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> > 
> > That's true. Actually even true for restrictedmem case, most likely we
> > will still need the kvm_release_pfn_clean() for KVM generic code. On one
> > side, other restrictedmem users like pKVM may not require page pinning
> > at all. On the other side, see below.
> > 
> > > 
> > > So what we are expecting is: for KVM if the page comes from restricted 
> > > mem, then
> > > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM 
> > > should.
> 
> No, requiring the user (KVM) to guard against lack of support for page 
> migration
> in restricted mem is a terrible API.  It's totally fine for restricted mem to 
> not
> support page migration until there's a use case, but punting the problem to 
> KVM
> is not acceptable.  Restricted mem itself doesn't yet support page migration,
> e.g. explosions would occur even if KVM wanted to allow migration since there 
> is
> no notification to invalidate existing mappings.
> 
> > I argue that this page pinning (or page migration prevention) is not
> > tied to where the page comes from, instead related to how the page will
> > be used. Whether the page is restrictedmem backed or GUP() backed, once
> > it's used by current version of TDX then the page pinning is needed. So
> > such page migration prevention is really TDX thing, even not KVM generic
> > thing (that's why I think we don't need change the existing logic of
> > kvm_release_pfn_clean()). Wouldn't better to let TDX code (or who
> > requires that) to increase/decrease the refcount when it populates/drops
> > the secure EPT entries? This is exactly what the current TDX code does:
> 
> I agree that whether or not migration is supported should be controllable by 
> the
> user, but I strongly disagree on punting refcount management to KVM (or TDX).
> The whole point of restricted mem is to support technologies like TDX and SNP,
> accomodating their special needs for things like page migration should be 
> part of
> the API, not some footnote in the documenation.

I never doubted that page migration should be part of the restrictedmem
API, but that's not part of the initial implementation, as we all agreed,
right? Then, before that API is introduced, we need to find a solution to
prevent page migration for TDX. Other than refcount management, do we
have any other workable solution?

> 
> It's not difficult to let the user communicate support for page migration, 
> e.g.
> if/when restricted mem gains support, add a hook to restrictedmem_notifier_ops
> to signal support (or lack thereof) for page migration.  NULL == no migration,
> non-NULL == migration allowed.

I know.

> 
> We know that supporting page migration in TDX and SNP is possible, and we know
> that page migration will require a dedicated API since the backing store can't
> memcpy() the page.  I don't see any reason to ignore that eventuality.

No, I'm not ignoring it. It's just about the short-term page migration
prevention needed before that dedicated API is introduced.

> 
> But again, unless I'm missing something, that's a future problem because 
> restricted
> mem doesn't yet support page migration regardless of the downstream user.

It's true that page migration support itself is a future problem, but
page migration prevention is not a future problem, since TDX pages need
to be pinned until page migration gets supported.

Thanks,
Chao



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2022-12-23 Thread Chao Peng
On Thu, Dec 22, 2022 at 12:37:19AM +, Huang, Kai wrote:
> On Wed, 2022-12-21 at 21:39 +0800, Chao Peng wrote:
> > > On Tue, Dec 20, 2022 at 08:33:05AM +, Huang, Kai wrote:
> > > > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > > > > On Mon, Dec 19, 2022 at 08:48:10AM +, Huang, Kai wrote:
> > > > > > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > [...]
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > + /*
> > > > > > > > > > > > > > > +  * These pages are currently unmovable so don't 
> > > > > > > > > > > > > > > place them into
> > > > > > > > > > > > > > > movable
> > > > > > > > > > > > > > > +  * pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > > > > > > > > > > > +  */
> > > > > > > > > > > > > > > + mapping = memfd->f_mapping;
> > > > > > > > > > > > > > > + mapping_set_unevictable(mapping);
> > > > > > > > > > > > > > > + mapping_set_gfp_mask(mapping,
> > > > > > > > > > > > > > > +  mapping_gfp_mask(mapping) 
> > > > > > > > > > > > > > > & ~__GFP_MOVABLE);
> > > > > > > > > > > > > 
> > > > > > > > > > > > > But, IIUC removing __GFP_MOVABLE flag here only makes 
> > > > > > > > > > > > > page allocation from
> > > > > > > > > > > > > non-
> > > > > > > > > > > > > movable zones, but doesn't necessarily prevent page 
> > > > > > > > > > > > > from being migrated.  My
> > > > > > > > > > > > > first glance is you need to implement either 
> > > > > > > > > > > > > a_ops->migrate_folio() or just
> > > > > > > > > > > > > get_page() after faulting in the page to prevent.
> > > > > > > > > > > 
> > > > > > > > > > > The current api restrictedmem_get_page() already does 
> > > > > > > > > > > this, after the
> > > > > > > > > > > caller calling it, it holds a reference to the page. The 
> > > > > > > > > > > caller then
> > > > > > > > > > > decides when to call put_page() appropriately.
> > > > > > > > > 
> > > > > > > > > I tried to dig some history. Perhaps I am missing something, 
> > > > > > > > > but it seems Kirill
> > > > > > > > > said in v9 that this code doesn't prevent page migration, and 
> > > > > > > > > we need to
> > > > > > > > > increase page refcount in restrictedmem_get_page():
> > > > > > > > > 
> > > > > > > > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47q...@box.shutemov.name/
> > > > > > > > > 
> > > > > > > > > But looking at this series it seems restrictedmem_get_page() 
> > > > > > > > > in this v10 is
> > > > > > > > > identical to the one in v9 (except v10 uses 'folio' instead 
> > > > > > > > > of 'page')?
> > > > > > > 
> > > > > > > restrictedmem_get_page() increases page refcount several versions 
> > > > > > > ago so
> > > > > > > no change in v10 is needed. You probably missed my reply:
> > > > > > > 
> > > > > > > https://lore.kernel.org/linux-mm/20221129135844.ga902...@chaop.bj.intel.com/
> > > > > 
> > > > > But for non-restricted-mem case, it is correct for KVM to decrease 
> > > > > page's
> > > > > refcount after setting up mapping in the secondary mmu, otherwise the 
> > > > > page will be pinned by KVM for normal VM (since KVM uses GUP to get the page).

Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2022-12-21 Thread Chao Peng
On Tue, Dec 20, 2022 at 10:55:44AM +0100, Borislav Petkov wrote:
> On Tue, Dec 20, 2022 at 03:43:18PM +0800, Chao Peng wrote:
> > RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.
> 
> Which basically means that RESTRICTEDMEM should simply depend on KVM.
> Because you can't know upfront whether KVM will run a TDX guest or a SNP
> guest and so on.
> 
> Which then means that RESTRICTEDMEM will practically end up always
> enabled in KVM HV configs.

That's right, CONFIG_RESTRICTEDMEM is always selected for supported KVM
architectures (currently x86_64).

> 
> > The only reason to add another HAVE_KVM_RESTRICTED_MEM is some code only
> > works for 64bit[*] and CONFIG_RESTRICTEDMEM is not sufficient to enforce
> > that.
> 
> This is what I mean with "we have too many Kconfig items". :-\

Yes, I agree. One way to remove this is probably to additionally check
CONFIG_64BIT instead.

Thanks,
Chao
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2022-12-21 Thread Chao Peng
On Tue, Dec 20, 2022 at 08:33:05AM +, Huang, Kai wrote:
> On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > On Mon, Dec 19, 2022 at 08:48:10AM +, Huang, Kai wrote:
> > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > +
> > > > > > +   /*
> > > > > > +* These pages are currently unmovable so don't place them into
> > > > > > movable
> > > > > > +* pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > > +*/
> > > > > > +   mapping = memfd->f_mapping;
> > > > > > +   mapping_set_unevictable(mapping);
> > > > > > +   mapping_set_gfp_mask(mapping,
> > > > > > +    mapping_gfp_mask(mapping) & 
> > > > > > ~__GFP_MOVABLE);
> > > > > 
> > > > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation 
> > > > > from
> > > > > non-
> > > > > movable zones, but doesn't necessarily prevent page from being 
> > > > > migrated.  My
> > > > > first glance is you need to implement either a_ops->migrate_folio() 
> > > > > or just
> > > > > get_page() after faulting in the page to prevent.
> > > > 
> > > > The current api restrictedmem_get_page() already does this, after the
> > > > caller calling it, it holds a reference to the page. The caller then
> > > > decides when to call put_page() appropriately.
> > > 
> > > I tried to dig some history. Perhaps I am missing something, but it seems 
> > > Kirill
> > > said in v9 that this code doesn't prevent page migration, and we need to
> > > increase page refcount in restrictedmem_get_page():
> > > 
> > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47q...@box.shutemov.name/
> > > 
> > > But looking at this series it seems restrictedmem_get_page() in this v10 
> > > is
> > > identical to the one in v9 (except v10 uses 'folio' instead of 'page')?
> > 
> > restrictedmem_get_page() increases page refcount several versions ago so
> > no change in v10 is needed. You probably missed my reply:
> > 
> > https://lore.kernel.org/linux-mm/20221129135844.ga902...@chaop.bj.intel.com/
> 
> But for non-restricted-mem case, it is correct for KVM to decrease page's
> refcount after setting up mapping in the secondary mmu, otherwise the page 
> will
> be pinned by KVM for normal VM (since KVM uses GUP to get the page).

That's true. Actually it's even true for the restrictedmem case: most
likely we will still need kvm_release_pfn_clean() in the KVM generic
code. On one hand, other restrictedmem users like pKVM may not require
page pinning at all. On the other hand, see below.

> 
> So what we are expecting is: for KVM if the page comes from restricted mem, 
> then
> KVM cannot decrease the refcount, otherwise for normal page via GUP KVM 
> should.

I argue that this page pinning (or page migration prevention) is not
tied to where the page comes from, but rather to how the page will be
used. Whether the page is restrictedmem-backed or GUP()-backed, once it's
used by the current version of TDX, page pinning is needed. So such page
migration prevention is really a TDX thing, not even a KVM-generic thing
(which is why I think we don't need to change the existing logic of
kvm_release_pfn_clean()). Wouldn't it be better to let the TDX code (or
whoever requires it) increase/decrease the refcount when it
populates/drops the secure EPT entries? This is exactly what the current
TDX code does:

get_page():
https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217

put_page():
https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334
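
In other words, roughly the following pattern on the TDX side (just a
sketch; the function names here are made up, the real code is at the links
above):

/* Pin the page for as long as it is mapped into the secure EPT. */
static int tdx_populate_private_spte(kvm_pfn_t pfn)
{
        get_page(pfn_to_page(pfn));     /* taken when the secure EPT entry is created */
        /* ... SEAMCALL to add the page to the TD ... */
        return 0;
}

static void tdx_drop_private_spte(kvm_pfn_t pfn)
{
        /* ... SEAMCALL to remove the page from the TD ... */
        put_page(pfn_to_page(pfn));     /* dropped when the secure EPT entry goes away */
}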

Thanks,
Chao
> 
> > 
> > The current solution is clear: unless we have better approach, we will
> > let restrictedmem user (KVM in this case) to hold the refcount to
> > prevent page migration.
> > 
> 
> OK.  Will leave to others :)
> 



Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2022-12-19 Thread Chao Peng
On Mon, Dec 19, 2022 at 03:36:28PM +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
> 
> valueless?
> 
> I can't parse that.

It's unnecessary and ...

> 
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> 
> bookmarked?

userspace is restricted from accessing the memory content in the fd.

> 
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> > 
> > The extended memslot can still have the userspace_addr(hva). When use, a
> 
> "When un use, ..."

When both userspace_addr and restricted_fd/offset are used, ...

> 
> ...
> 
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index a8e379a3afee..690cb21010e7 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -50,6 +50,8 @@ config KVM
> > select INTERVAL_TREE
> > select HAVE_KVM_PM_NOTIFIER if PM
> > select HAVE_KVM_MEMORY_ATTRIBUTES
> > +   select HAVE_KVM_RESTRICTED_MEM if X86_64
> > +   select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> 
> Those deps here look weird.
> 
> RESTRICTEDMEM should be selected by TDX_GUEST as it can't live without
> it.

RESTRICTEDMEM is needed by TDX_HOST, not TDX_GUEST.

> 
> Then you don't have to select HAVE_KVM_RESTRICTED_MEM simply because of
> X86_64 - you need that functionality when the respective guest support
> is enabled in KVM.

Letting the actual feature (e.g. TDX or pKVM) select it or add the
dependency sounds like a viable and clearer solution. Sean, let me know
your opinion.

> 
> Then, looking forward into your patchset, I'm not sure you even
> need HAVE_KVM_RESTRICTED_MEM - you could make it all depend on
> CONFIG_RESTRICTEDMEM. But that's KVM folks call - I'd always aim for
> less Kconfig items because we have waay too many.

The only reason to add another HAVE_KVM_RESTRICTED_MEM is that some code
only works for 64-bit[*], and CONFIG_RESTRICTEDMEM alone is not
sufficient to enforce that.

[*] https://lore.kernel.org/all/ykjlfu98hzovt...@google.com/

Thanks,
Chao
> 
> Thx.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2022-12-19 Thread Chao Peng
On Mon, Dec 19, 2022 at 11:17:22AM +0100, Borislav Petkov wrote:
> On Mon, Dec 19, 2022 at 04:15:32PM +0800, Chao Peng wrote:
> > Tamping down with error number a bit:
> > 
> > if (attrs->flags)
> > return -ENXIO;
> > if (attrs->attributes & ~supported_attrs)
> > return -EOPNOTSUPP;
> > if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size) ||
> > attrs->size == 0)
> > return -EINVAL;
> > if (attrs->address + attrs->size < attrs->address)
> > return -E2BIG;
> 
> Yap, better.
> 
> I guess you should add those to the documentation of the ioctl too
> so that people can find out why it fails. Or, well, they can look
> at the code directly too but still... imagine some blurb about
> user-friendliness here...

Thanks for the reminder. Yes, the KVM API documentation is the right
place to put this.

Thanks,
Chao
> 
> :-)
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2022-12-19 Thread Chao Peng
On Mon, Dec 19, 2022 at 08:48:10AM +, Huang, Kai wrote:
> On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > 
> > > [...]
> > > 
> > > > +
> > > > +   /*
> > > > +* These pages are currently unmovable so don't place them into
> > > > movable
> > > > +* pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > +*/
> > > > +   mapping = memfd->f_mapping;
> > > > +   mapping_set_unevictable(mapping);
> > > > +   mapping_set_gfp_mask(mapping,
> > > > +    mapping_gfp_mask(mapping) & 
> > > > ~__GFP_MOVABLE);
> > > 
> > > But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from
> > > non-
> > > movable zones, but doesn't necessarily prevent page from being migrated.  
> > > My
> > > first glance is you need to implement either a_ops->migrate_folio() or 
> > > just
> > > get_page() after faulting in the page to prevent.
> > 
> > The current api restrictedmem_get_page() already does this, after the
> > caller calling it, it holds a reference to the page. The caller then
> > decides when to call put_page() appropriately.
> 
> I tried to dig some history. Perhaps I am missing something, but it seems 
> Kirill
> said in v9 that this code doesn't prevent page migration, and we need to
> increase page refcount in restrictedmem_get_page():
> 
> https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47q...@box.shutemov.name/
> 
> But looking at this series it seems restrictedmem_get_page() in this v10 is
> identical to the one in v9 (except v10 uses 'folio' instead of 'page')?

restrictedmem_get_page() has increased the page refcount since several
versions ago, so no change is needed in v10. You probably missed my reply:

https://lore.kernel.org/linux-mm/20221129135844.ga902...@chaop.bj.intel.com/

The current solution is clear: unless we have a better approach, we will
let the restrictedmem user (KVM in this case) hold the refcount to
prevent page migration.

Thanks,
Chao
> 
> Anyway if this is not fixed, then it should be fixed.  Otherwise, a comment at
> the place where page refcount is increased will be helpful to help people
> understand page migration is actually prevented.
> 



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2022-12-19 Thread Chao Peng
On Fri, Dec 16, 2022 at 04:09:06PM +0100, Borislav Petkov wrote:
> On Fri, Dec 02, 2022 at 02:13:40PM +0800, Chao Peng wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 1782c4555d94..7f0f5e9f2406 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1150,6 +1150,9 @@ static struct kvm *kvm_create_vm(unsigned long type, 
> > const char *fdname)
> > spin_lock_init(&kvm->mn_invalidate_lock);
> > rcuwait_init(&kvm->mn_memslots_update_rcuwait);
> > xa_init(&kvm->vcpu_array);
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +   xa_init(&kvm->mem_attr_array);
> > +#endif
> 
>   if (IS_ENABLED(CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES))
>   ...
> 
> would at least remove the ugly ifdeffery.
> 
> Or you could create wrapper functions for that xa_init() and
> xa_destroy() and put the ifdeffery in there.

Agreed.
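
E.g. something along these lines (a sketch; the helper names are made up):

static inline void kvm_init_mem_attr_array(struct kvm *kvm)
{
#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
        xa_init(&kvm->mem_attr_array);
#endif
}

static inline void kvm_destroy_mem_attr_array(struct kvm *kvm)
{
#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
        xa_destroy(&kvm->mem_attr_array);
#endif
}

Then kvm_create_vm()/kvm_destroy_vm() can call these unconditionally and
the ifdeffery stays in one place.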

> 
> > @@ -2323,6 +2329,49 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm 
> > *kvm,
> >  }
> >  #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
> >  
> > +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> > +static u64 kvm_supported_mem_attributes(struct kvm *kvm)
> 
> I guess that function should have a verb in the name:
> 
> kvm_get_supported_mem_attributes()

Right!
> 
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +  struct kvm_memory_attributes *attrs)
> > +{
> > +   gfn_t start, end;
> > +   unsigned long i;
> > +   void *entry;
> > +   u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +   /* flags is currently not used. */
> > +   if (attrs->flags)
> > +   return -EINVAL;
> > +   if (attrs->attributes & ~supported_attrs)
> > +   return -EINVAL;
> > +   if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +   return -EINVAL;
> > +   if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +   return -EINVAL;
> 
> Dunno, shouldn't those issue some sort of an error message so that the
> caller knows where it failed? Or at least return different retvals which
> signal what the problem is?

Tightening up the error numbers a bit:

        if (attrs->flags)
                return -ENXIO;
        if (attrs->attributes & ~supported_attrs)
                return -EOPNOTSUPP;
        if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size) ||
            attrs->size == 0)
                return -EINVAL;
        if (attrs->address + attrs->size < attrs->address)
                return -E2BIG;

Chao
> 
> > +   start = attrs->address >> PAGE_SHIFT;
> > +   end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +   entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > +   mutex_lock(&kvm->lock);
> > +   for (i = start; i < end; i++)
> > +   if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +   GFP_KERNEL_ACCOUNT)))
> > +   break;
> > +   mutex_unlock(&kvm->lock);
> > +
> > +   attrs->address = i << PAGE_SHIFT;
> > +   attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > +   return 0;
> > +}
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2022-12-19 Thread Chao Peng
On Tue, Dec 13, 2022 at 08:04:14PM +0800, Xiaoyao Li wrote:
> On 12/8/2022 7:30 PM, Chao Peng wrote:
> > On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
> > > On 12/2/2022 2:13 PM, Chao Peng wrote:
> > > 
> > > ..
> > > 
> > > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > > and right now it is selected on X86_64 only.
> > > > 
> > > 
> > >  From the patch implementation, I have no idea why 
> > > HAVE_KVM_RESTRICTED_MEM is
> > > needed.
> > 
> > The reason is we want KVM further controls the feature enabling. An
> > opt-in CONFIG_RESTRICTEDMEM can cause problem if user sets that for
> > unsupported architectures.
> 
> HAVE_KVM_RESTRICTED_MEM is not used in this patch. It's better to introduce
> it in the patch that actually uses it.

It's being 'used' in this patch by reverse-selecting RESTRICTEDMEM in
arch/x86/kvm/Kconfig; this gives people a sense of where
restrictedmem_notifier comes from. Introducing the config together with
the other private/restricted memslot pieces can also help future
supporting architectures better identify what they need to do. But those
points are trivial, and moving it to patch 08 sounds good to me too.

Thanks,
Chao
> 
> > Here is the original discussion:
> > https://lore.kernel.org/all/ykjlfu98hzovt...@google.com/
> > 
> > Thanks,
> > Chao



Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

2022-12-18 Thread Chao Peng
On Tue, Dec 13, 2022 at 11:51:25PM +, Huang, Kai wrote:
> On Fri, 2022-12-02 at 14:13 +0800, Chao Peng wrote:
> >  
> > -   /* flags is currently not used. */
> > +   /* 'flags' is currently not used. */
> >     if (attrs->flags)
> >     return -EINVAL;
> 
> Unintended code change.

Yeah!

Chao



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2022-12-18 Thread Chao Peng
On Tue, Dec 13, 2022 at 11:49:13PM +, Huang, Kai wrote:
> > 
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> > 
> > 
> I didn't dig full histroy, but I interpret this as we don't support page
> migration and swapping for restricted memfd for now.  IMHO "page marked as
> unmovable" can be confused with PageMovable(), which is a different thing from
> this series.  It's better to just say something like "those pages cannot be
> migrated and swapped".

Yes, if that helps clarify things.

> 
> [...]
> 
> > +
> > +   /*
> > +* These pages are currently unmovable so don't place them into movable
> > +* pageblocks (e.g. CMA and ZONE_MOVABLE).
> > +*/
> > +   mapping = memfd->f_mapping;
> > +   mapping_set_unevictable(mapping);
> > +   mapping_set_gfp_mask(mapping,
> > +mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> 
> But, IIUC removing __GFP_MOVABLE flag here only makes page allocation from 
> non-
> movable zones, but doesn't necessarily prevent page from being migrated.  My
> first glance is you need to implement either a_ops->migrate_folio() or just
> get_page() after faulting in the page to prevent.

The current API restrictedmem_get_page() already does this: after the
caller calls it, the caller holds a reference to the page. The caller
then decides when to call put_page() appropriately.
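
I.e. the expected caller pattern is roughly the following (a sketch; the
restrictedmem_get_page() signature is as I recall it from patch 01, and
'memfd'/'offset' are placeholders):

        struct page *page;
        int order, ret;

        ret = restrictedmem_get_page(memfd, offset, &page, &order);
        if (ret)
                return ret;

        /*
         * From here the caller holds a reference on the page, which also
         * keeps it from being migrated, until the caller drops it again.
         */
        /* ... map the pfn into the secondary MMU ... */

        put_page(page);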

> 
> So I think the comment also needs improvement -- IMHO we can just call out
> currently those pages cannot be migrated and swapped, which is clearer (and 
> the
> latter justifies mapping_set_unevictable() clearly).

Good to me.

Thanks,
Chao
> 
> 



Re: [PATCH v10 8/9] KVM: Handle page fault for private memory

2022-12-11 Thread Chao Peng
On Fri, Dec 09, 2022 at 09:01:04AM +, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng  wrote:
> >
> > A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> > hva-based shared memory. Architecture code (like TDX code) can tell
> > whether the on-going fault is private or not. This patch adds a
> > 'is_private' field to kvm_page_fault to indicate this and architecture
> > code is expected to set it.
> >
> > To handle page fault for such memslot, the handling logic is different
> > depending on whether the fault is private or shared. KVM checks if
> > 'is_private' matches the host's view of the page (maintained in
> > mem_attr_array).
> >   - For a successful match, private pfn is obtained with
> > restrictedmem_get_page() and shared pfn is obtained with existing
> > get_user_pages().
> >   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > userspace. Userspace then can convert memory between private/shared
> > in host's view and retry the fault.
> >
> > Co-developed-by: Yu Zhang 
> > Signed-off-by: Yu Zhang 
> > Signed-off-by: Chao Peng 
> > ---
> >  arch/x86/kvm/mmu/mmu.c  | 63 +++--
> >  arch/x86/kvm/mmu/mmu_internal.h | 14 +++-
> >  arch/x86/kvm/mmu/mmutrace.h |  1 +
> >  arch/x86/kvm/mmu/tdp_mmu.c  |  2 +-
> >  include/linux/kvm_host.h| 30 
> >  5 files changed, 105 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2190fd8c95c0..b1953ebc012e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, 
> > gfn_t gfn,
> >
> >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >   const struct kvm_memory_slot *slot, gfn_t gfn,
> > - int max_level)
> > + int max_level, bool is_private)
> >  {
> > struct kvm_lpage_info *linfo;
> > int host_level;
> > @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > break;
> > }
> >
> > +   if (is_private)
> > +   return max_level;
> > +
> > if (max_level == PG_LEVEL_4K)
> > return PG_LEVEL_4K;
> >
> > @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, 
> > struct kvm_page_fault *fault
> >  * level, which will be used to do precise, accurate accounting.
> >  */
> > fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > -fault->gfn, 
> > fault->max_level);
> > +fault->gfn, 
> > fault->max_level,
> > +fault->is_private);
> > if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> > return;
> >
> > @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu 
> > *vcpu, struct kvm_async_pf *work)
> > kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> >  }
> >
> > +static inline u8 order_to_level(int order)
> > +{
> > +   BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > +   if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > +   return PG_LEVEL_1G;
> > +
> > +   if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > +   return PG_LEVEL_2M;
> > +
> > +   return PG_LEVEL_4K;
> > +}
> > +
> > +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> > +   struct kvm_page_fault *fault)
> > +{
> > +   vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +   if (fault->is_private)
> > +   vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > +   else
> > +   vcpu->run->memory.flags = 0;
> > +   vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> 
> nit: As in previous patches, use helpers (for this and other similar
> shifts in this patch)?

Agreed.
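
E.g. for the gpa, something like the following, using the existing
gfn_to_gpa() helper instead of the open-coded shift:

        vcpu->run->memory.gpa = gfn_to_gpa(fault->gfn);
        vcpu->run->memory.size = PAGE_SIZE;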

> 
> > +   vcpu->run->memory.size = PAGE_SIZE;
> > +   return RET_PF_USER;
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_vcpu *vc

Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

2022-12-11 Thread Chao Peng
On Fri, Dec 09, 2022 at 08:57:31AM +, Fuad Tabba wrote:
> Hi,
> 
> On Thu, Dec 8, 2022 at 11:18 AM Chao Peng  wrote:
> >
> > On Wed, Dec 07, 2022 at 05:16:34PM +, Fuad Tabba wrote:
> > > Hi,
> > >
> > > On Fri, Dec 2, 2022 at 6:19 AM Chao Peng  
> > > wrote:
> > > >
> > > > Unmap the existing guest mappings when memory attribute is changed
> > > > between shared and private. This is needed because shared pages and
> > > > private pages are from different backends, unmapping existing ones
> > > > gives a chance for page fault handler to re-populate the mappings
> > > > according to the new attribute.
> > > >
> > > > Only architecture has private memory support needs this and the
> > > > supported architecture is expected to rewrite the weak
> > > > kvm_arch_has_private_mem().
> > >
> > > This kind of ties into the discussion of being able to share memory in
> > > place. For pKVM for example, shared and private memory would have the
> > > same backend, and the unmapping wouldn't be needed.
> > >
> > > So I guess that, instead of kvm_arch_has_private_mem(), can the check
> > > be done differently, e.g., with a different function, say
> > > kvm_arch_private_notify_attribute_change() (but maybe with a more
> > > friendly name than what I suggested :) )?
> >
> > Besides controlling the unmapping here, kvm_arch_has_private_mem() is
> > also used to gate the memslot KVM_MEM_PRIVATE flag in patch09. I know
> > unmapping is confirmed unnecessary for pKVM, but how about
> > KVM_MEM_PRIVATE? Will pKVM add its own flag or reuse KVM_MEM_PRIVATE?
> > If the answer is the latter, then yes we should use a different check
> > which only works for confidential usages here.
> 
> I think it makes sense for pKVM to use the same flag (KVM_MEM_PRIVATE)
> and not to add another one.

Thanks for the reply.
Chao
> 
> Thank you,
> /fuad
> 
> 
> 
> >
> > Thanks,
> > Chao
> > >
> > > Thanks,
> > > /fuad
> > >
> > > >
> > > > Also, during memory attribute changing and the unmapping time frame,
> > > > page fault handler may happen in the same memory range and can cause
> > > > incorrect page state, invoke kvm_mmu_invalidate_* helpers to let the
> > > > page fault handler retry during this time frame.
> > > >
> > > > Signed-off-by: Chao Peng 
> > > > ---
> > > >  include/linux/kvm_host.h |   7 +-
> > > >  virt/kvm/kvm_main.c  | 168 ++-
> > > >  2 files changed, 116 insertions(+), 59 deletions(-)
> > > >
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index 3d69484d2704..3331c0c92838 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, 
> > > > gpa_t cr2_or_gpa,
> > > >  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > > >  #endif
> > > >
> > > > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > >  struct kvm_gfn_range {
> > > > struct kvm_memory_slot *slot;
> > > > gfn_t start;
> > > > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > > > bool may_block;
> > > >  };
> > > >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > > > +
> > > > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > > >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > > > @@ -785,11 +786,12 @@ struct kvm {
> > > >
> > > >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > > struct mmu_notifier mmu_notifier;
> > > > +#endif
> > > > unsigned long mmu_invalidate_seq;
> > > > long mmu_invalidate_in_progress;
> > > > gfn_t mmu_invalidate_range_start;
> > > > gfn_t mmu_invalidate_range_end;
> > > > -#endif
> > > > +
> > > > struct list_head devices;
> > > > u64 manual_dirty_log_protect;
> > > > struct dentry *debugfs_dentry;
> > >

Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry

2022-12-08 Thread Chao Peng
On Tue, Dec 06, 2022 at 03:48:50PM +, Fuad Tabba wrote:
...
 > >
> > > >  */
> > > > -   if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > > -   hva >= kvm->mmu_invalidate_range_start &&
> > > > -   hva < kvm->mmu_invalidate_range_end)
> > > > -   return 1;
> > > > +   if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > > +   /*
> > > > +* Dropping mmu_lock after bumping 
> > > > mmu_invalidate_in_progress
> > > > +* but before updating the range is a KVM bug.
> > > > +*/
> > > > +   if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == 
> > > > INVALID_GPA ||
> > > > +kvm->mmu_invalidate_range_end == 
> > > > INVALID_GPA))
> > >
> > > INVALID_GPA is an x86-specific define in
> > > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > > architectures. The obvious fix is to move it to
> > > include/linux/kvm_host.h.
> >
> > Hmm, INVALID_GPA is defined as ZERO for x86, not 100% confident this is
> > correct choice for other architectures, but after search it has not been
> > used for other architectures, so should be safe to make it common.

As Yu posted a patch:
https://lore.kernel.org/all/20221209023622.274715-1-yu.c.zh...@linux.intel.com/

There is a GPA_INVALID in include/linux/kvm_types.h, and I see ARM has
already been using it, so it sounds like that is exactly what I need.

Chao
> 
> With this fixed,
> 
> Reviewed-by: Fuad Tabba 
> And the necessary work to port to arm64 (on qemu/arm64):
> Tested-by: Fuad Tabba 
> 
> Cheers,
> /fuad



Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2022-12-08 Thread Chao Peng
On Thu, Dec 08, 2022 at 04:37:03PM +0800, Xiaoyao Li wrote:
> On 12/2/2022 2:13 PM, Chao Peng wrote:
> 
> ..
> 
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> > 
> 
> From the patch implementation, I have no idea why HAVE_KVM_RESTRICTED_MEM is
> needed.

The reason is that we want KVM to have further control over enabling the
feature. An opt-in CONFIG_RESTRICTEDMEM could cause problems if a user
set it for unsupported architectures.

Here is the original discussion:
https://lore.kernel.org/all/ykjlfu98hzovt...@google.com/

Thanks,
Chao



Re: [PATCH v10 8/9] KVM: Handle page fault for private memory

2022-12-08 Thread Chao Peng
On Thu, Dec 08, 2022 at 10:29:18AM +0800, Yuan Yao wrote:
> On Fri, Dec 02, 2022 at 02:13:46PM +0800, Chao Peng wrote:
> > A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
> > hva-based shared memory. Architecture code (like TDX code) can tell
> > whether the on-going fault is private or not. This patch adds a
> > 'is_private' field to kvm_page_fault to indicate this and architecture
> > code is expected to set it.
> >
> > To handle page fault for such memslot, the handling logic is different
> > depending on whether the fault is private or shared. KVM checks if
> > 'is_private' matches the host's view of the page (maintained in
> > mem_attr_array).
> >   - For a successful match, private pfn is obtained with
> > restrictedmem_get_page() and shared pfn is obtained with existing
> > get_user_pages().
> >   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > userspace. Userspace then can convert memory between private/shared
> > in host's view and retry the fault.
> >
> > Co-developed-by: Yu Zhang 
> > Signed-off-by: Yu Zhang 
> > Signed-off-by: Chao Peng 
> > ---
> >  arch/x86/kvm/mmu/mmu.c  | 63 +++--
> >  arch/x86/kvm/mmu/mmu_internal.h | 14 +++-
> >  arch/x86/kvm/mmu/mmutrace.h |  1 +
> >  arch/x86/kvm/mmu/tdp_mmu.c  |  2 +-
> >  include/linux/kvm_host.h| 30 
> >  5 files changed, 105 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 2190fd8c95c0..b1953ebc012e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, 
> > gfn_t gfn,
> >
> >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >   const struct kvm_memory_slot *slot, gfn_t gfn,
> > - int max_level)
> > + int max_level, bool is_private)
> >  {
> > struct kvm_lpage_info *linfo;
> > int host_level;
> > @@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > break;
> > }
> >
> > +   if (is_private)
> > +   return max_level;
> 
> lpage mixed information already saved, so is that possible
> to query info->disallow_lpage without care 'is_private' ?

Actually we already queried info->disallow_lpage just before this
statement. The check is needed because later in the function we call
host_pfn_mapping_level(), which is specific to shared memory.

Thanks,
Chao
> 
> > +
> > if (max_level == PG_LEVEL_4K)
> > return PG_LEVEL_4K;
> >
> > @@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, 
> > struct kvm_page_fault *fault
> >  * level, which will be used to do precise, accurate accounting.
> >  */
> > fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > -fault->gfn, 
> > fault->max_level);
> > +fault->gfn, 
> > fault->max_level,
> > +fault->is_private);
> > if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> > return;
> >
> > @@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu 
> > *vcpu, struct kvm_async_pf *work)
> > kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> >  }
> >
> > +static inline u8 order_to_level(int order)
> > +{
> > +   BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > +   if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > +   return PG_LEVEL_1G;
> > +
> > +   if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > +   return PG_LEVEL_2M;
> > +
> > +   return PG_LEVEL_4K;
> > +}
> > +
> > +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> > +   struct kvm_page_fault *fault)
> > +{
> > +   vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > +   if (fault->is_private)
> > +   vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > +   else
> > +   vcpu->run->memory.flags = 0;
> > +   vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > +   vcpu->run->memory.size = PAGE_SIZE;
> > +   return RET_PF_USER;
> > +}
> > +
> > +sta

Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

2022-12-08 Thread Chao Peng
On Wed, Dec 07, 2022 at 04:13:14PM +0800, Yuan Yao wrote:
> On Fri, Dec 02, 2022 at 02:13:44PM +0800, Chao Peng wrote:
> > Unmap the existing guest mappings when memory attribute is changed
> > between shared and private. This is needed because shared pages and
> > private pages are from different backends, unmapping existing ones
> > gives a chance for page fault handler to re-populate the mappings
> > according to the new attribute.
> >
> > Only architecture has private memory support needs this and the
> > supported architecture is expected to rewrite the weak
> > kvm_arch_has_private_mem().
> >
> > Also, during memory attribute changing and the unmapping time frame,
> > page fault handler may happen in the same memory range and can cause
> > incorrect page state, invoke kvm_mmu_invalidate_* helpers to let the
> > page fault handler retry during this time frame.
> >
> > Signed-off-by: Chao Peng 
> > ---
> >  include/linux/kvm_host.h |   7 +-
> >  virt/kvm/kvm_main.c  | 168 ++-
> >  2 files changed, 116 insertions(+), 59 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3d69484d2704..3331c0c92838 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t 
> > cr2_or_gpa,
> >  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> >  #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> >  struct kvm_gfn_range {
> > struct kvm_memory_slot *slot;
> > gfn_t start;
> > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > bool may_block;
> >  };
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > +
> > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > @@ -785,11 +786,12 @@ struct kvm {
> >
> >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > struct mmu_notifier mmu_notifier;
> > +#endif
> > unsigned long mmu_invalidate_seq;
> > long mmu_invalidate_in_progress;
> > gfn_t mmu_invalidate_range_start;
> > gfn_t mmu_invalidate_range_end;
> > -#endif
> > +
> > struct list_head devices;
> > u64 manual_dirty_log_protect;
> > struct dentry *debugfs_dentry;
> > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct 
> > kvm_vcpu *vcpu);
> >  int kvm_arch_post_init_vm(struct kvm *kvm);
> >  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> >  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> >
> >  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> >  /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ad55dfbc75d7..4e1e1e113bf0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> >
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > +{
> > +   /*
> > +* The count increase must become visible at unlock time as no
> > +* spte can be established without taking the mmu_lock and
> > +* count is also read inside the mmu_lock critical section.
> > +*/
> > +   kvm->mmu_invalidate_in_progress++;
> > +
> > +   if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > +   kvm->mmu_invalidate_range_start = INVALID_GPA;
> > +   kvm->mmu_invalidate_range_end = INVALID_GPA;
> > +   }
> > +}
> > +
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
> > +{
> > +   WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
> > +
> > +   if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > +   kvm->mmu_invalidate_range_start = start;
> > +   kvm->mmu_invalidate_range_end = end;
> > +   } else {
> > +   /*
> > +* Fully tracking multiple concurrent ranges has diminishing
> > +* returns. Keep things simple and just find the minimal range
> > +* which includes the current and new ranges. As there won't be
> > +* enough information to subtract a range after its invalidate
> > +* complete

Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed

2022-12-08 Thread Chao Peng
On Tue, Dec 06, 2022 at 10:42:24PM -0800, Isaku Yamahata wrote:
> On Tue, Dec 06, 2022 at 08:02:24PM +0800,
> Chao Peng  wrote:
> 
> > On Mon, Dec 05, 2022 at 02:49:59PM -0800, Isaku Yamahata wrote:
> > > On Fri, Dec 02, 2022 at 02:13:45PM +0800,
> > > Chao Peng  wrote:
> > > 
> > > > A large page with mixed private/shared subpages can't be mapped as large
> > > > page since its sub private/shared pages are from different memory
> > > > backends and may also treated by architecture differently. When
> > > > private/shared memory are mixed in a large page, the current lpage_info
> > > > is not sufficient to decide whether the page can be mapped as large page
> > > > or not and additional private/shared mixed information is needed.
> > > > 
> > > > Tracking this 'mixed' information with the current 'count' like
> > > > disallow_lpage is a bit challenge so reserve a bit in 'disallow_lpage'
> > > > to indicate a large page has mixed private/share subpages and update
> > > > this 'mixed' bit whenever the memory attribute is changed between
> > > > private and shared.
> > > > 
> > > > Signed-off-by: Chao Peng 
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h |   8 ++
> > > >  arch/x86/kvm/mmu/mmu.c  | 134 +++-
> > > >  arch/x86/kvm/x86.c  |   2 +
> > > >  include/linux/kvm_host.h|  19 +
> > > >  virt/kvm/kvm_main.c |   9 ++-
> > > >  5 files changed, 169 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/kvm_host.h 
> > > > b/arch/x86/include/asm/kvm_host.h
> > > > index 283cbb83d6ae..7772ab37ac89 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -38,6 +38,7 @@
> > > >  #include 
> > > >  
> > > >  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> > > > +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> > > >  
> > > >  #define KVM_MAX_VCPUS 1024
> > > >  
> > > > @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
> > > >  #endif
> > > >  };
> > > >  
> > > > +/*
> > > > + * Use a bit in disallow_lpage to indicate private/shared pages mixed 
> > > > at the
> > > > + * level. The remaining bits are used as a reference count.
> > > > + */
> > > > +#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
> > > > +#define KVM_LPAGE_COUNT_MAX((1U << 31) - 1)
> > > > +
> > > >  struct kvm_lpage_info {
> > > > int disallow_lpage;
> > > >  };
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index e2c70b5afa3e..2190fd8c95c0 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const 
> > > > struct kvm_memory_slot *slot,
> > > >  {
> > > > struct kvm_lpage_info *linfo;
> > > > int i;
> > > > +   int disallow_count;
> > > >  
> > > > for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> > > > linfo = lpage_info_slot(gfn, slot, i);
> > > > +
> > > > +   disallow_count = linfo->disallow_lpage & 
> > > > KVM_LPAGE_COUNT_MAX;
> > > > +   WARN_ON(disallow_count + count < 0 ||
> > > > +   disallow_count > KVM_LPAGE_COUNT_MAX - count);
> > > > +
> > > > linfo->disallow_lpage += count;
> > > > -   WARN_ON(linfo->disallow_lpage < 0);
> > > > }
> > > >  }
> > > >  
> > > > @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > > > if (kvm->arch.nx_huge_page_recovery_thread)
> > > > kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> > > >  }
> > > > +
> > > > +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> > > > +{
> > > > +   return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > > > +}
> > > > +
> > > > +static void linfo_set_mi

Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

2022-12-08 Thread Chao Peng
On Wed, Dec 07, 2022 at 05:16:34PM +, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng  wrote:
> >
> > Unmap the existing guest mappings when memory attribute is changed
> > between shared and private. This is needed because shared pages and
> > private pages are from different backends, unmapping existing ones
> > gives a chance for page fault handler to re-populate the mappings
> > according to the new attribute.
> >
> > Only architecture has private memory support needs this and the
> > supported architecture is expected to rewrite the weak
> > kvm_arch_has_private_mem().
> 
> This kind of ties into the discussion of being able to share memory in
> place. For pKVM for example, shared and private memory would have the
> same backend, and the unmapping wouldn't be needed.
> 
> So I guess that, instead of kvm_arch_has_private_mem(), can the check
> be done differently, e.g., with a different function, say
> kvm_arch_private_notify_attribute_change() (but maybe with a more
> friendly name than what I suggested :) )?

Besides controlling the unmapping here, kvm_arch_has_private_mem() is
also used to gate the memslot KVM_MEM_PRIVATE flag in patch 09. I
understand the unmapping is confirmed unnecessary for pKVM, but what
about KVM_MEM_PRIVATE? Will pKVM add its own flag or reuse
KVM_MEM_PRIVATE? If the answer is the latter, then yes, we should use a
different check here that applies only to the confidential usages.

Thanks,
Chao
> 
> Thanks,
> /fuad
> 
> >
> > Also, during memory attribute changing and the unmapping time frame,
> > page fault handler may happen in the same memory range and can cause
> > incorrect page state, invoke kvm_mmu_invalidate_* helpers to let the
> > page fault handler retry during this time frame.
> >
> > Signed-off-by: Chao Peng 
> > ---
> >  include/linux/kvm_host.h |   7 +-
> >  virt/kvm/kvm_main.c  | 168 ++-
> >  2 files changed, 116 insertions(+), 59 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 3d69484d2704..3331c0c92838 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t 
> > cr2_or_gpa,
> >  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> >  #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> >  struct kvm_gfn_range {
> > struct kvm_memory_slot *slot;
> > gfn_t start;
> > @@ -264,6 +263,8 @@ struct kvm_gfn_range {
> > bool may_block;
> >  };
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > +
> > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > @@ -785,11 +786,12 @@ struct kvm {
> >
> >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > struct mmu_notifier mmu_notifier;
> > +#endif
> > unsigned long mmu_invalidate_seq;
> > long mmu_invalidate_in_progress;
> > gfn_t mmu_invalidate_range_start;
> > gfn_t mmu_invalidate_range_end;
> > -#endif
> > +
> > struct list_head devices;
> > u64 manual_dirty_log_protect;
> > struct dentry *debugfs_dentry;
> > @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct 
> > kvm_vcpu *vcpu);
> >  int kvm_arch_post_init_vm(struct kvm *kvm);
> >  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
> >  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> > +bool kvm_arch_has_private_mem(struct kvm *kvm);
> >
> >  #ifndef __KVM_HAVE_ARCH_VM_ALLOC
> >  /*
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ad55dfbc75d7..4e1e1e113bf0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
> >
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm)
> > +{
> > +   /*
> > +* The count increase must become visible at unlock time as no
> > +* spte can be established without taking the mmu_lock and
> > +* count is also read inside the mmu_lock critical section.
> > +*/
> > +   kvm->mmu_invalidate_in_progress++;
> > +
> > +  

Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry

2022-12-07 Thread Chao Peng
On Tue, Dec 06, 2022 at 10:34:11PM -0800, Isaku Yamahata wrote:
> On Tue, Dec 06, 2022 at 07:56:23PM +0800,
> Chao Peng  wrote:
> 
> > > > -   if (unlikely(kvm->mmu_invalidate_in_progress) &&
> > > > -   hva >= kvm->mmu_invalidate_range_start &&
> > > > -   hva < kvm->mmu_invalidate_range_end)
> > > > -   return 1;
> > > > +   if (unlikely(kvm->mmu_invalidate_in_progress)) {
> > > > +   /*
> > > > +* Dropping mmu_lock after bumping 
> > > > mmu_invalidate_in_progress
> > > > +* but before updating the range is a KVM bug.
> > > > +*/
> > > > +   if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == 
> > > > INVALID_GPA ||
> > > > +kvm->mmu_invalidate_range_end == 
> > > > INVALID_GPA))
> > > 
> > > INVALID_GPA is an x86-specific define in
> > > arch/x86/include/asm/kvm_host.h, so this doesn't build on other
> > > architectures. The obvious fix is to move it to
> > > include/linux/kvm_host.h.
> > 
> > Hmm, INVALID_GPA is defined as ZERO for x86, not 100% confident this is
> > correct choice for other architectures, but after search it has not been
> > used for other architectures, so should be safe to make it common.
> 
> INVALID_GPA is defined as all bit 1.  Please notice "~" (tilde).
> 
> #define INVALID_GPA (~(gpa_t)0)

Thanks for pointing that out. It still looks right to move it to include/linux/kvm_host.h.
Chao
> -- 
> Isaku Yamahata 



Re: [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-12-07 Thread Chao Peng
On Tue, Dec 06, 2022 at 03:47:20PM +, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng  wrote:
> >
> > This new KVM exit allows userspace to handle memory-related errors. It
> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> > The flags includes additional information for userspace to handle the
> > error. Currently bit 0 is defined as 'private memory' where '1'
> > indicates error happens due to private memory access and '0' indicates
> > error happens due to shared memory access.
> >
> > When private memory is enabled, this new exit will be used for KVM to
> > exit to userspace for shared <-> private memory conversion in memory
> > encryption usage. In such usage, typically there are two kind of memory
> > conversions:
> >   - explicit conversion: happens when guest explicitly calls into KVM
> > to map a range (as private or shared), KVM then exits to userspace
> > to perform the map/unmap operations.
> >   - implicit conversion: happens in KVM page fault handler where KVM
> > exits to userspace for an implicit conversion when the page is in a
> > different state than requested (private or shared).
> >
> > Suggested-by: Sean Christopherson 
> > Co-developed-by: Yu Zhang 
> > Signed-off-by: Yu Zhang 
> > Signed-off-by: Chao Peng 
> > Reviewed-by: Fuad Tabba 
> > ---
> >  Documentation/virt/kvm/api.rst | 22 ++
> >  include/uapi/linux/kvm.h   |  8 
> >  2 files changed, 30 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 99352170c130..d9edb14ce30b 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6634,6 +6634,28 @@ array field represents return values. The userspace 
> > should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >
> > +::
> > +
> > +   /* KVM_EXIT_MEMORY_FAULT */
> > +   struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)
> > +   __u64 flags;
> 
> I see you've removed the padding and increased the flag size.

Yes, Sean suggested this and it also looks good to me.

Chao
> 
> Reviewed-by: Fuad Tabba 
> Tested-by: Fuad Tabba 
> 
> Cheers,
> /fuad
> 
> 
> 
> 
> > +   __u64 gpa;
> > +   __u64 size;
> > +   } memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set. Otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The 
> > userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> >  ::
> >
> >  /* KVM_EXIT_NOTIFY */
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 13bff963b8b0..c7e9d375a902 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >  #define KVM_EXIT_RISCV_SBI35
> >  #define KVM_EXIT_RISCV_CSR36
> >  #define KVM_EXIT_NOTIFY   37
> > +#define KVM_EXIT_MEMORY_FAULT 38
> >
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -541,6 +542,13 @@ struct kvm_run {
> >  #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
> > __u32 flags;
> > } notify;
> > +   /* KVM_EXIT_MEMORY_FAULT */
> > +   struct {
> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1ULL << 0)
> > +   __u64 flags;
> > +   __u64 gpa;
> > +   __u64 size;
> > +   } memory;
> > /* Fix the size of the union. */
> > char padding[256];
> > };
> > --
> > 2.25.1
> >



Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2022-12-07 Thread Chao Peng
On Tue, Dec 06, 2022 at 12:39:18PM +, Fuad Tabba wrote:
> Hi Chao,
> 
> On Tue, Dec 6, 2022 at 11:58 AM Chao Peng  wrote:
> >
> > On Mon, Dec 05, 2022 at 09:03:11AM +, Fuad Tabba wrote:
> > > Hi Chao,
> > >
> > > On Fri, Dec 2, 2022 at 6:18 AM Chao Peng  
> > > wrote:
> > > >
> > > > In memory encryption usage, guest memory may be encrypted with special
> > > > key and can be accessed only by the guest itself. We call such memory
> > > > private memory. It's valueless and sometimes can cause problem to allow
> > > > userspace to access guest private memory. This new KVM memslot extension
> > > > allows guest private memory being provided through a restrictedmem
> > > > backed file descriptor(fd) and userspace is restricted to access the
> > > > bookmarked memory in the fd.
> > > >
> > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > > > userspace to instruct KVM to provide guest memory through restricted_fd.
> > > > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > > > and the size is 'memory_size'.
> > > >
> > > > The extended memslot can still have the userspace_addr(hva). When use, a
> > > > single memslot can maintain both private memory through restricted_fd
> > > > and shared memory through userspace_addr. Whether the private or shared
> > > > part is visible to guest is maintained by other KVM code.
> > > >
> > > > A restrictedmem_notifier field is also added to the memslot structure to
> > > > allow the restricted_fd's backing store to notify KVM the memory change,
> > > > KVM then can invalidate its page table entries or handle memory errors.
> > > >
> > > > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > > > and right now it is selected on X86_64 only.
> > > >
> > > > To make future maintenance easy, internally use a binary compatible
> > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > '_ext' variants.
> > > >
> > > > Co-developed-by: Yu Zhang 
> > > > Signed-off-by: Yu Zhang 
> > > > Signed-off-by: Chao Peng 
> > > > Reviewed-by: Fuad Tabba 
> > > > Tested-by: Fuad Tabba 
> > >
> > > V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> > > patch series anymore. Any reason you removed it, or is it just an
> > > omission?
> >
> > We had some discussion in v9 [1] to add generic memory attributes ioctls
> > and KVM_CAP_PRIVATE_MEM can be implemented as a new
> > KVM_MEMORY_ATTRIBUTE_PRIVATE flag via KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES()
> > ioctl [2]. The api doc has been updated:
> >
> > +- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
> > +  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …
> >
> >
> > [1] https://lore.kernel.org/linux-mm/y2wb48kd0j4vg...@google.com/
> > [2]
> > https://lore.kernel.org/linux-mm/20221202061347.1070246-3-chao.p.p...@linux.intel.com/
> 
> I see. I just retested it with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES,
> and my Reviewed/Tested-by still apply.

Thanks for the info.

Chao
> 
> Cheers,
> /fuad
> 
> >
> > Thanks,
> > Chao
> > >
> > > [*] 
> > > https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.p...@linux.intel.com/
> > >
> > > Thanks,
> > > /fuad
> > >
> > > > ---
> > > >  Documentation/virt/kvm/api.rst | 40 ++-
> > > >  arch/x86/kvm/Kconfig   |  2 ++
> > > >  arch/x86/kvm/x86.c |  2 +-
> > > >  include/linux/kvm_host.h   |  8 --
> > > >  include/uapi/linux/kvm.h   | 28 +++
> > > >  virt/kvm/Kconfig   |  3 +++
> > > >  virt/kvm/kvm_main.c| 49 --
> > > >  7 files changed, 114 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/Documentation/virt/kvm/api.rst 
> > > > b/Documentation/virt/kvm/api.rst
> > > > index bb2f709c0900..99352170c130 100644
> > > > --- a/Documentation/virt/kvm/api.rst
> > > > +++ b/Documentation/virt/kvm/api.rst
> > > > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.

Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2022-12-07 Thread Chao Peng
On Tue, Dec 06, 2022 at 03:07:27PM +, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng  wrote:
> >
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> >
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> >   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > a guest memory range.
> >   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > memory attributes.
> >
> > KVM internally uses xarray to store the per-page memory attributes.
> >
> > Suggested-by: Sean Christopherson 
> > Signed-off-by: Chao Peng 
> > Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com/
> > ---
> >  Documentation/virt/kvm/api.rst | 63 
> >  arch/x86/kvm/Kconfig   |  1 +
> >  include/linux/kvm_host.h   |  3 ++
> >  include/uapi/linux/kvm.h   | 17 
> >  virt/kvm/Kconfig   |  3 ++
> >  virt/kvm/kvm_main.c| 76 ++
> >  6 files changed, 163 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 5617bc4f899f..bb2f709c0900 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> >  The "pad" and "reserved" fields may be used for future extensions and 
> > should be
> >  set to 0s by userspace.
> >
> > +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> > +-
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: u64 memory attributes bitmask(out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Returns supported memory attributes bitmask. Supported memory attributes 
> > will
> > +have the corresponding bits set in u64 memory attributes bitmask.
> > +
> > +The following memory attributes are defined::
> > +
> > +  #define KVM_MEMORY_ATTRIBUTE_READ  (1ULL << 0)
> > +  #define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> > +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE   (1ULL << 2)
> > +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE   (1ULL << 3)
> > +
> > +4.139 KVM_SET_MEMORY_ATTRIBUTES
> > +-
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_memory_attributes(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Sets memory attributes for pages in a guest memory range. Parameters are
> > +specified via the following structure::
> > +
> > +  struct kvm_memory_attributes {
> > +   __u64 address;
> > +   __u64 size;
> > +   __u64 attributes;
> > +   __u64 flags;
> > +  };
> > +
> > +The user sets the per-page memory attributes to a guest memory range 
> > indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range have been successfully set to the 
> > attributes.
> > +If the call returns 0, "address" is updated to the last successful address 
> > + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully. The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may 
> > want
> > +to retry the operation with the returned address/size if the previous 
> > range was
> > +partially successful.
> > +
> > +Both address and size should be page aligned and the supported attributes 
> > can be
> > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > +
> > +The "flags" field may be used for future extensions and should be set to 
> > 0s.
> > +
> >  5. The kvm_run structure
> >  
> >
> > @@ -8270,6 +8323,16 @@ structure.
> >  When getting the Modified C

Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2022-12-07 Thread Chao Peng
On Tue, Dec 06, 2022 at 10:34:32AM -0300, Fabiano Rosas wrote:
> Chao Peng  writes:
> 
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> >
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> >   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > a guest memory range.
> >   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > memory attributes.
> >
> > KVM internally uses xarray to store the per-page memory attributes.
> >
> > Suggested-by: Sean Christopherson 
> > Signed-off-by: Chao Peng 
> > Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com/
> > ---
> >  Documentation/virt/kvm/api.rst | 63 
> >  arch/x86/kvm/Kconfig   |  1 +
> >  include/linux/kvm_host.h   |  3 ++
> >  include/uapi/linux/kvm.h   | 17 
> >  virt/kvm/Kconfig   |  3 ++
> >  virt/kvm/kvm_main.c| 76 ++
> >  6 files changed, 163 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 5617bc4f899f..bb2f709c0900 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
> >  The "pad" and "reserved" fields may be used for future extensions and 
> > should be
> >  set to 0s by userspace.
> >  
> > +4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
> > +-
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: u64 memory attributes bitmask(out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Returns supported memory attributes bitmask. Supported memory attributes 
> > will
> > +have the corresponding bits set in u64 memory attributes bitmask.
> > +
> > +The following memory attributes are defined::
> > +
> > +  #define KVM_MEMORY_ATTRIBUTE_READ  (1ULL << 0)
> > +  #define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> > +  #define KVM_MEMORY_ATTRIBUTE_EXECUTE   (1ULL << 2)
> > +  #define KVM_MEMORY_ATTRIBUTE_PRIVATE   (1ULL << 3)
> > +
> > +4.139 KVM_SET_MEMORY_ATTRIBUTES
> > +-
> > +
> > +:Capability: KVM_CAP_MEMORY_ATTRIBUTES
> > +:Architectures: x86
> > +:Type: vm ioctl
> > +:Parameters: struct kvm_memory_attributes(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Sets memory attributes for pages in a guest memory range. Parameters are
> > +specified via the following structure::
> > +
> > +  struct kvm_memory_attributes {
> > +   __u64 address;
> > +   __u64 size;
> > +   __u64 attributes;
> > +   __u64 flags;
> > +  };
> > +
> > +The user sets the per-page memory attributes to a guest memory range 
> > indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range have been successfully set to the 
> > attributes.
> 
> This wording could cause some confusion, what about a simpler:
> 
> "reflect the range of pages that had its attributes successfully set"

Thanks, this is much better.

> 
> > +If the call returns 0, "address" is updated to the last successful address 
> > + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully.
> 
> "address + 1 page" or "subsequent page" perhaps.
> 
> In fact, wouldn't this all become simpler if size were number of pages 
> instead?

It would indeed be simpler if the size were a number of pages and the
address a gfn, but I don't think we want to imply a 4K page size to
userspace.

> 
> > The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may 
> > want
> > +to retry the operation with the returned address/size if the previous 
> > range was
>
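
A minimal sketch of the retry semantics being discussed (assuming the
uapi from this series and a hypothetical vm_fd; a retry bound is left
out for brevity):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Sketch: set attributes on [address, address + size) and retry with the
 * adjusted address/size that KVM writes back on partial success, as the
 * documentation above describes.
 */
static int set_memory_attributes(int vm_fd, __u64 address, __u64 size,
				 __u64 attributes)
{
	struct kvm_memory_attributes attrs = {
		.address    = address,
		.size       = size,
		.attributes = attributes,
	};

	while (attrs.size) {
		if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) < 0)
			return -errno;
		/* On return, .address/.size describe what is still left. */
	}

	return 0;
}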

Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2022-12-07 Thread Chao Peng
On Tue, Dec 06, 2022 at 02:57:04PM +, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng  wrote:
> >
> > From: "Kirill A. Shutemov" 
> >
> > Introduce 'memfd_restricted' system call with the ability to create
> > memory areas that are restricted from userspace access through ordinary
> > MMU operations (e.g. read/write/mmap). The memory content is expected to
> > be used through the new in-kernel interface by a third kernel module.
> >
> > memfd_restricted() is useful for scenarios where a file descriptor(fd)
> > can be used as an interface into mm but want to restrict userspace's
> > ability on the fd. Initially it is designed to provide protections for
> > KVM encrypted guest memory.
> >
> > Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> > (e.g. QEMU) and then using the mmaped virtual address to setup the
> > mapping in the KVM secondary page table (e.g. EPT). With confidential
> > computing technologies like Intel TDX, the memfd memory may be encrypted
> > with special key for special software domain (e.g. KVM guest) and is not
> > expected to be directly accessed by userspace. Precisely, userspace
> > access to such encrypted memory may lead to host crash so should be
> > prevented.
> >
> > memfd_restricted() provides semantics required for KVM guest encrypted
> > memory support that a fd created with memfd_restricted() is going to be
> > used as the source of guest memory in confidential computing environment
> > and KVM can directly interact with core-mm without the need to expose
> > the memoy content into KVM userspace.
> 
> nit: memory

Ya!

> 
> >
> > KVM userspace is still in charge of the lifecycle of the fd. It should
> > pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> > obtain the physical memory page and then uses it to populate the KVM
> > secondary page table entries.
> >
> > The userspace restricted memfd can be fallocate-ed or hole-punched
> > from userspace. When hole-punched, KVM can get notified through
> > invalidate_start/invalidate_end() callbacks, KVM then gets chance to
> > remove any mapped entries of the range in the secondary page tables.
> >
> > Machine check can happen for memory pages in the restricted memfd,
> > instead of routing this directly to userspace, we call the error()
> > callback that KVM registered. KVM then gets chance to handle it
> > correctly.
> >
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> >
> > By default memfd_restricted() prevents userspace read, write and mmap.
> > By defining new bit in the 'flags', it can be extended to support other
> > restricted semantics in the future.
> >
> > The system call is currently wired up for x86 arch.
> 
> Reviewed-by: Fuad Tabba 
> After wiring the system call for arm64 (on qemu/arm64):
> Tested-by: Fuad Tabba 

Thanks.
Chao
> 
> Cheers,
> /fuad
> 
> 
> 
> >
> > Signed-off-by: Kirill A. Shutemov 
> > Signed-off-by: Chao Peng 
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
> >  include/linux/restrictedmem.h  |  71 ++
> >  include/linux/syscalls.h   |   1 +
> >  include/uapi/asm-generic/unistd.h  |   5 +-
> >  include/uapi/linux/magic.h |   1 +
> >  kernel/sys_ni.c|   3 +
> >  mm/Kconfig |   4 +
> >  mm/Makefile|   1 +
> >  mm/memory-failure.c|   3 +
> >  mm/restrictedmem.c | 318 +
> >  11 files changed, 408 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/restrictedmem.h
> >  create mode 100644 mm/restrictedmem.c
> >
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> > b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 320480a8db4f..dc70ba90247e 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -455,3 +455,4 @@
> >  448   i386   process_mrelease        sys_process_mrelease
> >  449   i386   futex_waitv             sys_futex_waitv
> >  450   i386   set_mempolicy_home_node sys_set_mempolicy_home_node
> > +451   i386
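
As a rough userspace sketch of the lifecycle described in the commit
message: create the restricted memfd, allocate backing with fallocate(),
then punch a hole, which is what triggers the invalidate_start()/
invalidate_end() notifications into KVM.  The syscall number is the one
wired up for x86 in this patch and is an assumption elsewhere:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted 451	/* number used by this series on x86 */
#endif

/* Sketch: create a restricted memfd and manage its backing pages. */
static int restricted_memfd_example(off_t slot_size, off_t offset, off_t len)
{
	int fd = syscall(__NR_memfd_restricted, 0);	/* flags must be 0 */

	if (fd < 0)
		return -1;

	/* Allocate backing pages; userspace cannot read/write/mmap them. */
	if (fallocate(fd, 0, 0, slot_size) < 0)
		goto err;

	/* Free a sub-range again; this is what notifies KVM to unmap it. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      offset, len) < 0)
		goto err;

	return fd;
err:
	close(fd);
	return -1;
}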

Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed

2022-12-06 Thread Chao Peng
On Mon, Dec 05, 2022 at 02:49:59PM -0800, Isaku Yamahata wrote:
> On Fri, Dec 02, 2022 at 02:13:45PM +0800,
> Chao Peng  wrote:
> 
> > A large page with mixed private/shared subpages can't be mapped as large
> > page since its sub private/shared pages are from different memory
> > backends and may also treated by architecture differently. When
> > private/shared memory are mixed in a large page, the current lpage_info
> > is not sufficient to decide whether the page can be mapped as large page
> > or not and additional private/shared mixed information is needed.
> > 
> > Tracking this 'mixed' information with the current 'count' like
> > disallow_lpage is a bit challenge so reserve a bit in 'disallow_lpage'
> > to indicate a large page has mixed private/share subpages and update
> > this 'mixed' bit whenever the memory attribute is changed between
> > private and shared.
> > 
> > Signed-off-by: Chao Peng 
> > ---
> >  arch/x86/include/asm/kvm_host.h |   8 ++
> >  arch/x86/kvm/mmu/mmu.c  | 134 +++-
> >  arch/x86/kvm/x86.c  |   2 +
> >  include/linux/kvm_host.h|  19 +
> >  virt/kvm/kvm_main.c |   9 ++-
> >  5 files changed, 169 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 283cbb83d6ae..7772ab37ac89 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -38,6 +38,7 @@
> >  #include 
> >  
> >  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> > +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
> >  
> >  #define KVM_MAX_VCPUS 1024
> >  
> > @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
> >  #endif
> >  };
> >  
> > +/*
> > + * Use a bit in disallow_lpage to indicate private/shared pages mixed at 
> > the
> > + * level. The remaining bits are used as a reference count.
> > + */
> > +#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
> > +#define KVM_LPAGE_COUNT_MAX((1U << 31) - 1)
> > +
> >  struct kvm_lpage_info {
> > int disallow_lpage;
> >  };
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e2c70b5afa3e..2190fd8c95c0 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const 
> > struct kvm_memory_slot *slot,
> >  {
> > struct kvm_lpage_info *linfo;
> > int i;
> > +   int disallow_count;
> >  
> > for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> > linfo = lpage_info_slot(gfn, slot, i);
> > +
> > +   disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> > +   WARN_ON(disallow_count + count < 0 ||
> > +   disallow_count > KVM_LPAGE_COUNT_MAX - count);
> > +
> > linfo->disallow_lpage += count;
> > -   WARN_ON(linfo->disallow_lpage < 0);
> > }
> >  }
> >  
> > @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > if (kvm->arch.nx_huge_page_recovery_thread)
> > kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
> >  }
> > +
> > +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> > +{
> > +   return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
> > +   int level, bool mixed)
> > +{
> > +   struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
> > +
> > +   if (mixed)
> > +   linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +   else
> > +   linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static bool is_expected_attr_entry(void *entry, unsigned long 
> > expected_attrs)
> > +{
> > +   bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> > +
> > +   if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
> > +   if (!expect_private)
> > +   return false;
> > +   } else if (expect_private)
> > +   return false;
> > +
> > +   return true;
> > +}
> > +
> > +static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
> > +  gfn_t start, 

Re: [PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry

2022-12-06 Thread Chao Peng
On Mon, Dec 05, 2022 at 09:23:49AM +, Fuad Tabba wrote:
> Hi Chao,
> 
> On Fri, Dec 2, 2022 at 6:19 AM Chao Peng  wrote:
> >
> > Currently in mmu_notifier invalidate path, hva range is recorded and
> > then checked against by mmu_notifier_retry_hva() in the page fault
> > handling path. However, for the to be introduced private memory, a page
> > fault may not have a hva associated, checking gfn(gpa) makes more sense.
> >
> > For existing hva based shared memory, gfn is expected to also work. The
> > only downside is when aliasing multiple gfns to a single hva, the
> > current algorithm of checking multiple ranges could result in a much
> > larger range being rejected. Such aliasing should be uncommon, so the
> > impact is expected small.
> >
> > Suggested-by: Sean Christopherson 
> > Signed-off-by: Chao Peng 
> > ---
> >  arch/x86/kvm/mmu/mmu.c   |  8 +---
> >  include/linux/kvm_host.h | 33 +
> >  virt/kvm/kvm_main.c  | 32 +++-
> >  3 files changed, 49 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 4736d7849c60..e2c70b5afa3e 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> > return true;
> >
> > return fault->slot &&
> > -  mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> > +  mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> >  }
> >
> >  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault 
> > *fault)
> > @@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t 
> > gfn_start, gfn_t gfn_end)
> >
> > write_lock(&kvm->mmu_lock);
> >
> > -   kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> > +   kvm_mmu_invalidate_begin(kvm);
> > +
> > +   kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
> >
> > flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> >
> > @@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t 
> > gfn_start, gfn_t gfn_end)
> > kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
> >gfn_end - gfn_start);
> >
> > -   kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> > +   kvm_mmu_invalidate_end(kvm);
> >
> > write_unlock(&kvm->mmu_lock);
> >  }
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 02347e386ea2..3d69484d2704 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -787,8 +787,8 @@ struct kvm {
> > struct mmu_notifier mmu_notifier;
> > unsigned long mmu_invalidate_seq;
> > long mmu_invalidate_in_progress;
> > -   unsigned long mmu_invalidate_range_start;
> > -   unsigned long mmu_invalidate_range_end;
> > +   gfn_t mmu_invalidate_range_start;
> > +   gfn_t mmu_invalidate_range_end;
> >  #endif
> > struct list_head devices;
> > u64 manual_dirty_log_protect;
> > @@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct 
> > kvm_mmu_memory_cache *mc);
> >  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  #endif
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > - unsigned long end);
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > -   unsigned long end);
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm);
> > +void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
> > +void kvm_mmu_invalidate_end(struct kvm *kvm);
> >
> >  long kvm_arch_dev_ioctl(struct file *filp,
> > unsigned int ioctl, unsigned long arg);
> > @@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm 
> > *kvm, unsigned long mmu_seq)
> > return 0;
> >  }
> >
> > -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> >unsigned long mmu_seq,
> > -  unsigned long hva)
> > +  gfn_t gfn)
> >  {
> > lockdep_assert_held
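
For context, a sketch of the consumer side of this protocol in a fault
path, mirroring the is_page_fault_stale()/direct_page_fault() usage in
the hunk above; the example_* helpers are illustrative stand-ins, not
KVM functions:

/*
 * Sketch: snapshot the sequence, resolve the pfn outside mmu_lock, then
 * bail out and retry if an invalidation touched this gfn in the meantime.
 */
static int example_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	struct kvm *kvm = vcpu->kvm;
	unsigned long mmu_seq;
	int r;

	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();	/* pairs with the smp_wmb() in kvm_mmu_invalidate_end() */

	r = example_faultin_pfn(vcpu, fault);	/* may sleep, no mmu_lock held */
	if (r)
		return r;

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry_gfn(kvm, mmu_seq, fault->gfn)) {
		r = -EAGAIN;	/* illustrative; KVM uses RET_PF_RETRY here */
		goto out_unlock;
	}

	r = example_map(vcpu, fault);		/* install the SPTE */
out_unlock:
	write_unlock(&kvm->mmu_lock);
	return r;
}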

Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2022-12-06 Thread Chao Peng
On Mon, Dec 05, 2022 at 09:03:11AM +, Fuad Tabba wrote:
> Hi Chao,
> 
> On Fri, Dec 2, 2022 at 6:18 AM Chao Peng  wrote:
> >
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided through a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> >
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> >
> > The extended memslot can still have the userspace_addr(hva). When use, a
> > single memslot can maintain both private memory through restricted_fd
> > and shared memory through userspace_addr. Whether the private or shared
> > part is visible to guest is maintained by other KVM code.
> >
> > A restrictedmem_notifier field is also added to the memslot structure to
> > allow the restricted_fd's backing store to notify KVM the memory change,
> > KVM then can invalidate its page table entries or handle memory errors.
> >
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only.
> >
> > To make future maintenance easy, internally use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
> >
> > Co-developed-by: Yu Zhang 
> > Signed-off-by: Yu Zhang 
> > Signed-off-by: Chao Peng 
> > Reviewed-by: Fuad Tabba 
> > Tested-by: Fuad Tabba 
> 
> V9 of this patch [*] had KVM_CAP_PRIVATE_MEM, but it's not in this
> patch series anymore. Any reason you removed it, or is it just an
> omission?

We had some discussion in v9 [1] about adding generic memory attribute
ioctls, and KVM_CAP_PRIVATE_MEM can be implemented as a new
KVM_MEMORY_ATTRIBUTE_PRIVATE flag reported via the
KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES() ioctl [2]. The API doc has been
updated:

+- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
+  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl) …


[1] https://lore.kernel.org/linux-mm/y2wb48kd0j4vg...@google.com/
[2]
https://lore.kernel.org/linux-mm/20221202061347.1070246-3-chao.p.p...@linux.intel.com/

Thanks,
Chao
> 
> [*] 
> https://lore.kernel.org/linux-mm/20221025151344.3784230-3-chao.p.p...@linux.intel.com/
> 
> Thanks,
> /fuad
> 
> > ---
> >  Documentation/virt/kvm/api.rst | 40 ++-
> >  arch/x86/kvm/Kconfig   |  2 ++
> >  arch/x86/kvm/x86.c |  2 +-
> >  include/linux/kvm_host.h   |  8 --
> >  include/uapi/linux/kvm.h   | 28 +++
> >  virt/kvm/Kconfig   |  3 +++
> >  virt/kvm/kvm_main.c| 49 --
> >  7 files changed, 114 insertions(+), 18 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index bb2f709c0900..99352170c130 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> >  :Capability: KVM_CAP_USER_MEMORY
> >  :Architectures: all
> >  :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> >  :Returns: 0 on success, -1 on error
> >
> >  ::
> > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > __u64 userspace_addr; /* start of the userspace allocated memory */
> >};
> >
> > +  struct kvm_userspace_memory_region_ext {
> > +   struct kvm_userspace_memory_region region;
> > +   __u64 restricted_offset;
> > +   __u32 restricted_fd;
> > +   __u32 pad1;
> > +   __u64 pad2[14];
> > +  };
> > +
> >/* for kvm_memory_region::flags */
> >#define KVM_MEM_LOG_DIRTY_PAGES  (1UL << 0)
> >#define KVM_MEM_READONLY (1UL << 1)
> > +  #define KVM_MEM_PRIVATE  (1UL << 2)
> >
> >  This ioctl allows the user to create, modify or delete a guest physical
> >  memory slot.  Bits 0-15 of "slot" specify the slot id and
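
To tie the pieces together, a userspace sketch of registering a
private-capable memslot with the extended region (a sketch only,
assuming the uapi from this series and that the '_ext' struct is passed
through KVM_SET_USER_MEMORY_REGION as documented above; the restricted
fd and the hva mapping are created elsewhere):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Sketch: the shared part of the slot is backed by an ordinary hva
 * mapping, the private part by a restricted memfd, selected by the
 * KVM_MEM_PRIVATE flag.
 */
static int add_private_slot(int vm_fd, __u32 slot, __u64 gpa, __u64 size,
			    void *hva, int restricted_fd,
			    __u64 restricted_offset)
{
	struct kvm_userspace_memory_region_ext region = {
		.region = {
			.slot            = slot,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = gpa,
			.memory_size     = size,
			.userspace_addr  = (__u64)(unsigned long)hva,
		},
		.restricted_fd     = restricted_fd,
		.restricted_offset = restricted_offset,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}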

Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-12-01 Thread Chao Peng
On Thu, Dec 01, 2022 at 06:16:46PM -0800, Vishal Annapurve wrote:
> On Tue, Oct 25, 2022 at 8:18 AM Chao Peng  wrote:
> >
...
> > +}
> > +
> > +SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> > +{
> 
> Looking at the underlying shmem implementation, there seems to be no
> way to enable transparent huge pages specifically for restricted memfd
> files.
> 
> Michael discussed earlier about tweaking
> /sys/kernel/mm/transparent_hugepage/shmem_enabled setting to allow
> hugepages to be used while backing restricted memfd. Such a change
> will affect the rest of the shmem usecases as well. Even setting the
> shmem_enabled policy to "advise" wouldn't help unless file based
> advise for hugepage allocation is implemented.

I had a look at fadvise() and it looks like it does not support HUGEPAGE
for any filesystem yet.

> 
> Does it make sense to provide a flag here to allow creating restricted
> memfds backed possibly by huge pages to give a more granular control?

We do have an unused 'flags' that can be extended for such usage, but I
would let Kirill take a further look; perhaps it needs more discussion.

Chao
> 
> > +   struct file *file, *restricted_file;
> > +   int fd, err;
> > +
> > +   if (flags)
> > +   return -EINVAL;
> > +
> > +   fd = get_unused_fd_flags(0);
> > +   if (fd < 0)
> > +   return fd;
> > +
> > +   file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> > +   if (IS_ERR(file)) {
> > +   err = PTR_ERR(file);
> > +   goto err_fd;
> > +   }
> > +   file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> > +   file->f_flags |= O_LARGEFILE;
> > +
> > +   restricted_file = restrictedmem_file_create(file);
> > +   if (IS_ERR(restricted_file)) {
> > +   err = PTR_ERR(restricted_file);
> > +   fput(file);
> > +   goto err_fd;
> > +   }
> > +
> > +   fd_install(fd, restricted_file);
> > +   return fd;
> > +err_fd:
> > +   put_unused_fd(fd);
> > +   return err;
> > +}
> > +
> > +void restrictedmem_register_notifier(struct file *file,
> > +struct restrictedmem_notifier 
> > *notifier)
> > +{
> > +   struct restrictedmem_data *data = file->f_mapping->private_data;
> > +
> > +   mutex_lock(&data->lock);
> > +   list_add(&notifier->list, &data->notifiers);
> > +   mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_register_notifier);
> > +
> > +void restrictedmem_unregister_notifier(struct file *file,
> > +  struct restrictedmem_notifier 
> > *notifier)
> > +{
> > +   struct restrictedmem_data *data = file->f_mapping->private_data;
> > +
> > +   mutex_lock(&data->lock);
> > +   list_del(&notifier->list);
> > +   mutex_unlock(&data->lock);
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_unregister_notifier);
> > +
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +  struct page **pagep, int *order)
> > +{
> > +   struct restrictedmem_data *data = file->f_mapping->private_data;
> > +   struct file *memfd = data->memfd;
> > +   struct page *page;
> > +   int ret;
> > +
> > +   ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > +   if (ret)
> > +   return ret;
> > +
> > +   *pagep = page;
> > +   if (order)
> > +   *order = thp_order(compound_head(page));
> > +
> > +   SetPageUptodate(page);
> > +   unlock_page(page);
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > --
> > 2.25.1
> >
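
For context, a sketch of how a consumer such as KVM can turn a gfn into
a pfn with restrictedmem_get_page(); the memslot fields come from the
KVM side of the series, and the offset arithmetic here is an assumption
spelled out in the comment, not a quote of the KVM patch:

/*
 * Sketch: map a gfn in a private-capable memslot to the corresponding
 * page offset in the restricted memfd, then resolve it to a pfn.
 */
static int example_restricted_pfn(struct kvm_memory_slot *slot, gfn_t gfn,
				  kvm_pfn_t *pfn, int *order)
{
	pgoff_t index = gfn - slot->base_gfn +
			(slot->restricted_offset >> PAGE_SHIFT);
	struct page *page;
	int ret;

	ret = restrictedmem_get_page(slot->restricted_file, index,
				     &page, order);
	if (ret)
		return ret;

	*pfn = page_to_pfn(page);
	return 0;
}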



[PATCH v10 2/9] KVM: Introduce per-page memory attributes

2022-12-01 Thread Chao Peng
In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
userspace to operate on the per-page memory attributes.
  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
a guest memory range.
  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
memory attributes.

KVM internally uses xarray to store the per-page memory attributes.

Suggested-by: Sean Christopherson 
Signed-off-by: Chao Peng 
Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com/
---
 Documentation/virt/kvm/api.rst | 63 
 arch/x86/kvm/Kconfig   |  1 +
 include/linux/kvm_host.h   |  3 ++
 include/uapi/linux/kvm.h   | 17 
 virt/kvm/Kconfig   |  3 ++
 virt/kvm/kvm_main.c| 76 ++
 6 files changed, 163 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 5617bc4f899f..bb2f709c0900 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5952,6 +5952,59 @@ delivery must be provided via the "reg_aen" struct.
 The "pad" and "reserved" fields may be used for future extensions and should be
 set to 0s by userspace.
 
+4.138 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
+-
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: u64 memory attributes bitmask(out)
+:Returns: 0 on success, <0 on error
+
+Returns supported memory attributes bitmask. Supported memory attributes will
+have the corresponding bits set in u64 memory attributes bitmask.
+
+The following memory attributes are defined::
+
+  #define KVM_MEMORY_ATTRIBUTE_READ  (1ULL << 0)
+  #define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
+  #define KVM_MEMORY_ATTRIBUTE_EXECUTE   (1ULL << 2)
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE   (1ULL << 3)
+
+4.139 KVM_SET_MEMORY_ATTRIBUTES
+-
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes(in/out)
+:Returns: 0 on success, <0 on error
+
+Sets memory attributes for pages in a guest memory range. Parameters are
+specified via the following structure::
+
+  struct kvm_memory_attributes {
+   __u64 address;
+   __u64 size;
+   __u64 attributes;
+   __u64 flags;
+  };
+
+The user sets the per-page memory attributes to a guest memory range indicated
+by address/size, and in return KVM adjusts address and size to reflect the
+actual pages of the memory range have been successfully set to the attributes.
+If the call returns 0, "address" is updated to the last successful address + 1
+and "size" is updated to the remaining address size that has not been set
+successfully. The user should check the return value as well as the size to
+decide if the operation succeeded for the whole range or not. The user may want
+to retry the operation with the returned address/size if the previous range was
+partially successful.
+
+Both address and size should be page aligned and the supported attributes can 
be
+retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
+
+The "flags" field may be used for future extensions and should be set to 0s.
+
 5. The kvm_run structure
 
 
@@ -8270,6 +8323,16 @@ structure.
 When getting the Modified Change Topology Report value, the attr->addr
 must point to a byte where the value will be stored or retrieved from.
 
+8.40 KVM_CAP_MEMORY_ATTRIBUTES
+--
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm
+
+This capability indicates KVM supports per-page memory attributes and ioctls
+KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
+
 9. Known KVM API problems
 =
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index fbeaa9ddef59..a8e379a3afee 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -49,6 +49,7 @@ config KVM
select SRCU
select INTERVAL_TREE
select HAVE_KVM_PM_NOTIFIER if PM
+   select HAVE_KVM_MEMORY_ATTRIBUTES
help
  Support hosting fully virtualized guest machines using hardware
  virtualization extensions.  You will need a fairly recent
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8f874a964313..a784e2b06625 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -800,6 +800,9 @@ struct kvm {
 
 

[PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed

2022-12-01 Thread Chao Peng
A large page with mixed private/shared subpages can't be mapped as a
large page since its private/shared subpages come from different memory
backends and may also be treated differently by the architecture. When
private and shared memory are mixed in a large page, the current
lpage_info is not sufficient to decide whether the page can be mapped as
a large page or not, and additional private/shared 'mixed' information
is needed.

Tracking this 'mixed' information with the current count-like
disallow_lpage is a bit of a challenge, so reserve a bit in
'disallow_lpage' to indicate a large page has mixed private/shared
subpages and update this 'mixed' bit whenever the memory attribute is
changed between private and shared.

Signed-off-by: Chao Peng 
---
 arch/x86/include/asm/kvm_host.h |   8 ++
 arch/x86/kvm/mmu/mmu.c  | 134 +++-
 arch/x86/kvm/x86.c  |   2 +
 include/linux/kvm_host.h|  19 +
 virt/kvm/kvm_main.c |   9 ++-
 5 files changed, 169 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 283cbb83d6ae..7772ab37ac89 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -38,6 +38,7 @@
 #include 
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES
 
 #define KVM_MAX_VCPUS 1024
 
@@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
 #endif
 };
 
+/*
+ * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
+ * level. The remaining bits are used as a reference count.
+ */
+#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
#define KVM_LPAGE_COUNT_MAX    ((1U << 31) - 1)
+
 struct kvm_lpage_info {
int disallow_lpage;
 };
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e2c70b5afa3e..2190fd8c95c0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const struct 
kvm_memory_slot *slot,
 {
struct kvm_lpage_info *linfo;
int i;
+   int disallow_count;
 
for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
+
+   disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
+   WARN_ON(disallow_count + count < 0 ||
+   disallow_count > KVM_LPAGE_COUNT_MAX - count);
+
linfo->disallow_lpage += count;
-   WARN_ON(linfo->disallow_lpage < 0);
}
 }
 
@@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
if (kvm->arch.nx_huge_page_recovery_thread)
kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
+
+static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
+{
+   return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
+   int level, bool mixed)
+{
+   struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
+
+   if (mixed)
+   linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
+   else
+   linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
+{
+   bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+   if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
+   if (!expect_private)
+   return false;
+   } else if (expect_private)
+   return false;
+
+   return true;
+}
+
+static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
+  gfn_t start, gfn_t end)
+{
+   XA_STATE(xas, &kvm->mem_attr_array, start);
+   gfn_t gfn = start;
+   void *entry;
+   bool mixed = false;
+
+   rcu_read_lock();
+   entry = xas_load(&xas);
+   while (gfn < end) {
+   if (xas_retry(&xas, entry))
+   continue;
+
+   KVM_BUG_ON(gfn != xas.xa_index, kvm);
+
+   if (!is_expected_attr_entry(entry, attrs)) {
+   mixed = true;
+   break;
+   }
+
+   entry = xas_next(&xas);
+   gfn++;
+   }
+
+   rcu_read_unlock();
+   return mixed;
+}
+
+static bool mem_attrs_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
+   int level, unsigned long attrs,
+   gfn_t start, gfn_t end)
+{
+   unsigned long gfn;
+
+   if (level == PG_LEVEL_2M)
+   return mem_attrs_mixed_2m(kvm, attrs, start, end);
+
+   for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1))
+   if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)
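
A side note on the encoding: because the new 'mixed' bit lives inside
the same disallow_lpage word as the existing count, a check of the whole
int keeps disallowing the huge mapping for either reason.  A tiny
illustrative reader (not a quote of the patch):

/* Sketch: non-zero disallow_lpage blocks the huge mapping, whether the
 * cause is the legacy count or private/shared mixing. */
static bool example_lpage_disallowed(const struct kvm_lpage_info *linfo)
{
	return linfo->disallow_lpage != 0;
}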

[PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2022-12-01 Thread Chao Peng
Register/unregister private memslot to fd-based memory backing store
restrictedmem and implement the callbacks for restrictedmem_notifier:
  - invalidate_start()/invalidate_end() to zap the existing memory
mappings in the KVM page table.
  - error() to request KVM_REQ_MEMORY_MCE and later exit to userspace
with KVM_EXIT_SHUTDOWN.

Expose KVM_MEM_PRIVATE for memslots and KVM_MEMORY_ATTRIBUTE_PRIVATE for
KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to userspace, but both are gated by
kvm_arch_has_private_mem(), which should be overridden by architecture
code.

Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
Reviewed-by: Fuad Tabba 
---
 arch/x86/include/asm/kvm_host.h |   1 +
 arch/x86/kvm/x86.c  |  13 +++
 include/linux/kvm_host.h|   3 +
 virt/kvm/kvm_main.c | 179 +++-
 4 files changed, 191 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7772ab37ac89..27ef31133352 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -114,6 +114,7 @@
KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_HV_TLB_FLUSH \
KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_MEMORY_MCE KVM_ARCH_REQ(33)
 
 #define CR0_RESERVED_BITS   \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5aefcff614d2..c67e22f3e2ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6587,6 +6587,13 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long 
state)
 }
 #endif /* CONFIG_HAVE_KVM_PM_NOTIFIER */
 
+#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
+void kvm_arch_memory_mce(struct kvm *kvm)
+{
+   kvm_make_all_cpus_request(kvm, KVM_REQ_MEMORY_MCE);
+}
+#endif
+
 static int kvm_vm_ioctl_get_clock(struct kvm *kvm, void __user *argp)
 {
struct kvm_clock_data data = { 0 };
@@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
+
+   if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
+   vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
+   r = 0;
+   goto out;
+   }
}
 
if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win ||
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 153842bb33df..f032d878e034 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -590,6 +590,7 @@ struct kvm_memory_slot {
struct file *restricted_file;
loff_t restricted_offset;
struct restrictedmem_notifier notifier;
+   struct kvm *kvm;
 };
 
 static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
@@ -2363,6 +2364,8 @@ static inline int kvm_restricted_mem_get_pfn(struct 
kvm_memory_slot *slot,
*pfn = page_to_pfn(page);
return ret;
 }
+
+void kvm_arch_memory_mce(struct kvm *kvm);
 #endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
 
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e107afea32f0..ac835fc77273 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -936,6 +936,121 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
+#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
+pgoff_t start, pgoff_t end,
+gfn_t *gfn_start, gfn_t *gfn_end)
+{
+   unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
+
+   if (start > base_pgoff)
+   *gfn_start = slot->base_gfn + start - base_pgoff;
+   else
+   *gfn_start = slot->base_gfn;
+
+   if (end < base_pgoff + slot->npages)
+   *gfn_end = slot->base_gfn + end - base_pgoff;
+   else
+   *gfn_end = slot->base_gfn + slot->npages;
+
+   if (*gfn_start >= *gfn_end)
+   return false;
+
+   return true;
+}
+
+static void kvm_restrictedmem_invalidate_begin(struct restrictedmem_notifier 
*notifier,
+  pgoff_t start, pgoff_t end)
+{
+   struct kvm_memory_slot *slot = container_of(notifier,
+   struct kvm_memory_slot,
+   notifier);
+   struct kvm *kvm = slot->kvm;
+   gfn_t gfn_start, gfn_end;
+   struct kvm_gfn_range gfn_range;
+   int idx;
+
+   if (!restrictedmem_range_is_valid(slot, start, end,
+
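
A standalone worked example of the pgoff-to-gfn clamping done by
restrictedmem_range_is_valid() above, with 4 KiB pages assumed
(PAGE_SHIFT == 12); plain C so the arithmetic can be checked directly:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch: mirror the clamping of a restrictedmem pgoff range to gfns. */
static bool range_to_gfns(uint64_t base_gfn, uint64_t npages,
			  uint64_t restricted_offset,	/* bytes */
			  uint64_t start, uint64_t end,	/* pgoff */
			  uint64_t *gfn_start, uint64_t *gfn_end)
{
	uint64_t base_pgoff = restricted_offset >> 12;

	*gfn_start = start > base_pgoff ?
		     base_gfn + start - base_pgoff : base_gfn;
	*gfn_end = end < base_pgoff + npages ?
		   base_gfn + end - base_pgoff : base_gfn + npages;

	return *gfn_start < *gfn_end;
}

int main(void)
{
	uint64_t s, e;

	/* Slot at gfn 0x100000, 1024 pages, restricted_offset = 2 MiB. */
	assert(range_to_gfns(0x100000, 1024, 2ULL << 20, 600, 700, &s, &e));
	assert(s == 0x100058 && e == 0x1000bc);
	return 0;
}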

[PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

2022-12-01 Thread Chao Peng
Unmap the existing guest mappings when the memory attribute is changed
between shared and private. This is needed because shared pages and
private pages are from different backends; unmapping the existing ones
gives the page fault handler a chance to re-populate the mappings
according to the new attribute.

Only architectures with private memory support need this, and a
supporting architecture is expected to override the weak
kvm_arch_has_private_mem().

Also, during the attribute change and the unmapping time frame, a page
fault may happen in the same memory range and could leave the page in an
incorrect state; invoke the kvm_mmu_invalidate_* helpers so that the
page fault handler retries during this time frame.

Signed-off-by: Chao Peng 
---
 include/linux/kvm_host.h |   7 +-
 virt/kvm/kvm_main.c  | 168 ++-
 2 files changed, 116 insertions(+), 59 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3d69484d2704..3331c0c92838 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -255,7 +255,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t 
cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
@@ -264,6 +263,8 @@ struct kvm_gfn_range {
bool may_block;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+
+#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
@@ -785,11 +786,12 @@ struct kvm {
 
 #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
struct mmu_notifier mmu_notifier;
+#endif
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
gfn_t mmu_invalidate_range_start;
gfn_t mmu_invalidate_range_end;
-#endif
+
struct list_head devices;
u64 manual_dirty_log_protect;
struct dentry *debugfs_dentry;
@@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu 
*vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_has_private_mem(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ad55dfbc75d7..4e1e1e113bf0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -520,6 +520,62 @@ void kvm_destroy_vcpus(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_destroy_vcpus);
 
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
+{
+   /*
+* The count increase must become visible at unlock time as no
+* spte can be established without taking the mmu_lock and
+* count is also read inside the mmu_lock critical section.
+*/
+   kvm->mmu_invalidate_in_progress++;
+
+   if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+   kvm->mmu_invalidate_range_start = INVALID_GPA;
+   kvm->mmu_invalidate_range_end = INVALID_GPA;
+   }
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+   WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
+   if (likely(kvm->mmu_invalidate_in_progress == 1)) {
+   kvm->mmu_invalidate_range_start = start;
+   kvm->mmu_invalidate_range_end = end;
+   } else {
+   /*
+* Fully tracking multiple concurrent ranges has diminishing
+* returns. Keep things simple and just find the minimal range
+* which includes the current and new ranges. As there won't be
+* enough information to subtract a range after its invalidate
+* completes, any ranges invalidated concurrently will
+* accumulate and persist until all outstanding invalidates
+* complete.
+*/
+   kvm->mmu_invalidate_range_start =
+   min(kvm->mmu_invalidate_range_start, start);
+   kvm->mmu_invalidate_range_end =
+   max(kvm->mmu_invalidate_range_end, end);
+   }
+}
+
+void kvm_mmu_invalidate_end(struct kvm *kvm)
+{
+   /*
+* This sequence increase will notify the kvm page fault that
+* the page that is going to be mapped in the spte could have
+* been freed.
+*/
+   kvm->mmu_invalidate_seq++;
+   smp_wmb();
+   /*
+* The above sequence increase must be visible before the
+* below count decrease, which is ensured by the smp_wmb above
+* in conjunction with the smp_rmb in mmu_invalidate_retry().
+*/
+   kvm->mmu_invalidate_in_progress--;
+}
+
 #if defined(CONFIG_M

[PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2022-12-01 Thread Chao Peng
In memory encryption usage, guest memory may be encrypted with a special
key and can be accessed only by the guest itself. We call such memory
private memory. Allowing userspace to access guest private memory has no
value and can sometimes cause problems. This new KVM memslot extension
allows guest private memory to be provided through a restrictedmem-backed
file descriptor (fd), while restricting userspace access to the memory
backing the fd.

This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
additional KVM memslot fields restricted_fd/restricted_offset to allow
userspace to instruct KVM to provide guest memory through restricted_fd.
'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
and the size is 'memory_size'.

The extended memslot can still have the userspace_addr (hva). When used,
a single memslot can maintain both private memory through restricted_fd
and shared memory through userspace_addr. Whether the private or shared
part is visible to the guest is maintained by other KVM code.

A restrictedmem_notifier field is also added to the memslot structure to
allow the restricted_fd's backing store to notify KVM of memory changes,
so KVM can then invalidate its page table entries or handle memory errors.

Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
and right now it is selected on X86_64 only.

To make future maintenance easy, internally use a binary compatible
alias struct kvm_user_mem_region to handle both the normal and the
'_ext' variants.
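
For reference, a minimal userspace sketch of registering such a slot
(vm_fd, shared_hva and restricted_fd are placeholders; the fd is assumed
to come from memfd_restricted(), and the addresses/sizes are arbitrary):

  struct kvm_userspace_memory_region_ext region = {
          .region = {
                  .slot            = 0,
                  .flags           = KVM_MEM_PRIVATE,
                  .guest_phys_addr = 0x100000000ULL,
                  .memory_size     = 0x10000000ULL,
                  .userspace_addr  = (__u64)shared_hva, /* shared part (hva) */
          },
          .restricted_offset = 0,
          .restricted_fd     = restricted_fd,           /* from memfd_restricted() */
  };

  if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region))
          err(1, "KVM_SET_USER_MEMORY_REGION");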

Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
---
 Documentation/virt/kvm/api.rst | 40 ++-
 arch/x86/kvm/Kconfig   |  2 ++
 arch/x86/kvm/x86.c |  2 +-
 include/linux/kvm_host.h   |  8 --
 include/uapi/linux/kvm.h   | 28 +++
 virt/kvm/Kconfig   |  3 +++
 virt/kvm/kvm_main.c| 49 --
 7 files changed, 114 insertions(+), 18 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index bb2f709c0900..99352170c130 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+   struct kvm_userspace_memory_region region;
+   __u64 restricted_offset;
+   __u32 restricted_fd;
+   __u32 pad1;
+   __u64 pad2[14];
+  };
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES  (1UL << 0)
   #define KVM_MEM_READONLY (1UL << 1)
+  #define KVM_MEM_PRIVATE  (1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1365,12 +1374,29 @@ It is recommended that the lower 21 bits of 
guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+The kvm_userspace_memory_region_ext struct includes all fields of the
+kvm_userspace_memory_region struct, while also adding additional fields for
+some other features. See the description of the flags field below for more
+information. It's recommended to use kvm_userspace_memory_region_ext in new
+userspace code.
+
+The flags field supports following flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
+  within the slot. For more details, see KVM_GET_DIRTY_LOG ioctl.
+
+- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
+  read-only. In this case, writes to this memory will be posted to userspace as
+  KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE, if KVM_MEMORY_ATTRIBUTE_PRIVATE is supported (see
+  KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES ioctl), to indicate a new slot has
+  private memory backed by a file descriptor (fd) and userspace access to
+  the fd may be restricted. Userspace should use restricted_fd and
+  restricted_offset in the kvm_userspace_memory_region_ext to instruct KVM
+  to provide private memory to the guest. Userspace should guarantee not to map the

[PATCH v10 8/9] KVM: Handle page fault for private memory

2022-12-01 Thread Chao Peng
A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
hva-based shared memory. Architecture code (like TDX code) can tell
whether the ongoing fault is private or not. This patch adds an
'is_private' field to kvm_page_fault to indicate this, and architecture
code is expected to set it.

To handle a page fault for such a memslot, the logic differs depending
on whether the fault is private or shared. KVM checks whether
'is_private' matches the host's view of the page (maintained in
mem_attr_array).
  - For a successful match, a private pfn is obtained with
    restrictedmem_get_page() and a shared pfn is obtained with the
    existing get_user_pages().
  - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
    userspace. Userspace can then convert the memory between private and
    shared in the host's view and retry the fault.
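
As a purely illustrative sketch of the architecture side (the GPA bit and
helper below are hypothetical, TDX-style assumptions and not part of this
patch):

	/* Hypothetical: this hardware encodes shared vs. private in GPA bit 51. */
	#define GPA_SHARED_BIT	BIT_ULL(51)

	static bool fault_is_private(gpa_t gpa)
	{
		return !(gpa & GPA_SHARED_BIT);
	}

	/* ... in the arch fault entry path ... */
	fault.is_private = fault_is_private(cr2_or_gpa);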

Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
---
 arch/x86/kvm/mmu/mmu.c  | 63 +++--
 arch/x86/kvm/mmu/mmu_internal.h | 14 +++-
 arch/x86/kvm/mmu/mmutrace.h |  1 +
 arch/x86/kvm/mmu/tdp_mmu.c  |  2 +-
 include/linux/kvm_host.h| 30 
 5 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2190fd8c95c0..b1953ebc012e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3058,7 +3058,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t 
gfn,
 
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
  const struct kvm_memory_slot *slot, gfn_t gfn,
- int max_level)
+ int max_level, bool is_private)
 {
struct kvm_lpage_info *linfo;
int host_level;
@@ -3070,6 +3070,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
break;
}
 
+   if (is_private)
+   return max_level;
+
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
 
@@ -3098,7 +3101,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, 
struct kvm_page_fault *fault
 * level, which will be used to do precise, accurate accounting.
 */
fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
-fault->gfn, 
fault->max_level);
+fault->gfn, 
fault->max_level,
+fault->is_private);
if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
return;
 
@@ -4178,6 +4182,49 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, 
struct kvm_async_pf *work)
kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
 }
 
+static inline u8 order_to_level(int order)
+{
+   BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+   if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+   return PG_LEVEL_1G;
+
+   if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+   return PG_LEVEL_2M;
+
+   return PG_LEVEL_4K;
+}
+
+static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
+   struct kvm_page_fault *fault)
+{
+   vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+   if (fault->is_private)
+   vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+   else
+   vcpu->run->memory.flags = 0;
+   vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+   vcpu->run->memory.size = PAGE_SIZE;
+   return RET_PF_USER;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+  struct kvm_page_fault *fault)
+{
+   int order;
+   struct kvm_memory_slot *slot = fault->slot;
+
+   if (!kvm_slot_can_be_private(slot))
+   return kvm_do_memory_fault_exit(vcpu, fault);
+
+   if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
+   return RET_PF_RETRY;
+
+   fault->max_level = min(order_to_level(order), fault->max_level);
+   fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
+   return RET_PF_CONTINUE;
+}
+
 static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
struct kvm_memory_slot *slot = fault->slot;
@@ -4210,6 +4257,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct 
kvm_page_fault *fault)
return RET_PF_EMULATE;
}
 
+   if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
+   return kvm_do_memory_fault_exit(vcpu, fault);
+
+   if (fault->is_private)
+   return kvm_faultin_pfn_private(vcpu, fault);
+
async = false;
fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
   

[PATCH v10 5/9] KVM: Use gfn instead of hva for mmu_notifier_retry

2022-12-01 Thread Chao Peng
Currently in the mmu_notifier invalidate path, the hva range is recorded
and then checked against by mmu_invalidate_retry_hva() in the page fault
handling path. However, for the to-be-introduced private memory, a page
fault may not have an associated hva, so checking the gfn (gpa) makes
more sense.

For existing hva-based shared memory, gfn is expected to also work. The
only downside is that when aliasing multiple gfns to a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected to be small.
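
For context, the retry pattern this plugs into looks roughly like the
following (simplified from the x86 fault path; error handling and other
details omitted):

	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();

	/* ... resolve fault->pfn outside mmu_lock ... */

	write_lock(&vcpu->kvm->mmu_lock);
	if (mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn)) {
		/* an invalidation raced with this fault; unwind and retry */
		write_unlock(&vcpu->kvm->mmu_lock);
		return RET_PF_RETRY;
	}
	/* ... install the SPTE ... */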

Suggested-by: Sean Christopherson 
Signed-off-by: Chao Peng 
---
 arch/x86/kvm/mmu/mmu.c   |  8 +---
 include/linux/kvm_host.h | 33 +
 virt/kvm/kvm_main.c  | 32 +++-
 3 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4736d7849c60..e2c70b5afa3e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
return true;
 
return fault->slot &&
-  mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+  mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault 
*fault)
@@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, 
gfn_t gfn_end)
 
write_lock(&kvm->mmu_lock);
 
-   kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
+   kvm_mmu_invalidate_begin(kvm);
+
+   kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
 
flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6112,7 +6114,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, 
gfn_t gfn_end)
kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
   gfn_end - gfn_start);
 
-   kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
+   kvm_mmu_invalidate_end(kvm);
 
write_unlock(&kvm->mmu_lock);
 }
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 02347e386ea2..3d69484d2704 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -787,8 +787,8 @@ struct kvm {
struct mmu_notifier mmu_notifier;
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
-   unsigned long mmu_invalidate_range_start;
-   unsigned long mmu_invalidate_range_end;
+   gfn_t mmu_invalidate_range_start;
+   gfn_t mmu_invalidate_range_end;
 #endif
struct list_head devices;
u64 manual_dirty_log_protect;
@@ -1389,10 +1389,9 @@ void kvm_mmu_free_memory_cache(struct 
kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
- unsigned long end);
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-   unsigned long end);
+void kvm_mmu_invalidate_begin(struct kvm *kvm);
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_invalidate_end(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg);
@@ -1963,9 +1962,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, 
unsigned long mmu_seq)
return 0;
 }
 
-static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
+static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
   unsigned long mmu_seq,
-  unsigned long hva)
+  gfn_t gfn)
 {
lockdep_assert_held(&kvm->mmu_lock);
/*
@@ -1974,10 +1973,20 @@ static inline int mmu_invalidate_retry_hva(struct kvm 
*kvm,
 * that might be being invalidated. Note that it may include some false
 * positives, due to shortcuts when handing concurrent invalidations.
 */
-   if (unlikely(kvm->mmu_invalidate_in_progress) &&
-   hva >= kvm->mmu_invalidate_range_start &&
-   hva < kvm->mmu_invalidate_range_end)
-   return 1;
+   if (unlikely(kvm->mmu_invalidate_in_progress)) {
+   /*
+* Dropping mmu_lock after bumping mmu_invalidate_in_progress
+* but before updating the range is a KVM bug.
+*/
+   if (WARN_ON_ONCE(kvm->mmu_invalidate_range_start == INVALID_GPA 
||
+kvm->mmu_invalidate_range_end == INVALID_GPA))
+   return 1;
+
+   if (gfn >= kvm->mmu_invalidate_range_start &&
+   gfn < kvm->mmu_invalidate_range_end)

[PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-12-01 Thread Chao Peng
This new KVM exit allows userspace to handle memory-related errors. It
indicates that an error happened in KVM at the guest memory range
[gpa, gpa+size). The 'flags' field includes additional information for
userspace to handle the error. Currently bit 0 is defined as 'private
memory', where '1' indicates the error happened due to a private memory
access and '0' indicates it happened due to a shared memory access.

When private memory is enabled, this new exit will be used for KVM to
exit to userspace for shared <-> private memory conversion in memory
encryption usage. In such usage, typically there are two kinds of memory
conversions:
  - explicit conversion: happens when the guest explicitly calls into
    KVM to map a range (as private or shared); KVM then exits to
    userspace to perform the map/unmap operations.
  - implicit conversion: happens in the KVM page fault handler where KVM
    exits to userspace for an implicit conversion when the page is in a
    different state than requested (private or shared).
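
A hedged sketch of the userspace side of such a conversion (the
attributes ioctl and struct follow the per-page memory attributes patch
in this series; vm_fd and the exact VMM plumbing are assumptions):

  struct kvm_memory_attributes attrs = {
          .address    = run->memory.gpa,
          .size       = run->memory.size,
          .attributes = (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
                        KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
  };

  /* Convert the range to the state the guest asked for, then re-enter. */
  if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
          err(1, "KVM_SET_MEMORY_ATTRIBUTES");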

Suggested-by: Sean Christopherson 
Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
Reviewed-by: Fuad Tabba 
---
 Documentation/virt/kvm/api.rst | 22 ++
 include/uapi/linux/kvm.h   |  8 
 2 files changed, 30 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 99352170c130..d9edb14ce30b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6634,6 +6634,28 @@ array field represents return values. The userspace 
should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+   /* KVM_EXIT_MEMORY_FAULT */
+   struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 0)
+   __u64 flags;
+   __u64 gpa;
+   __u64 size;
+   } memory;
+
+If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
+encountered a memory error which is not handled by the KVM kernel module and
+which userspace may choose to handle. The 'flags' field indicates the memory
+properties of the exit.
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
+   private memory access when the bit is set. Otherwise the memory error is
+   caused by shared memory access when the bit is clear.
+
+'gpa' and 'size' indicate the memory range at which the error occurred.
+Userspace may handle the error and return to KVM to retry the previous memory
+access.
+
 ::
 
 /* KVM_EXIT_NOTIFY */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 13bff963b8b0..c7e9d375a902 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -300,6 +300,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI35
 #define KVM_EXIT_RISCV_CSR36
 #define KVM_EXIT_NOTIFY   37
+#define KVM_EXIT_MEMORY_FAULT 38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -541,6 +542,13 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
__u32 flags;
} notify;
+   /* KVM_EXIT_MEMORY_FAULT */
+   struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1ULL << 0)
+   __u64 flags;
+   __u64 gpa;
+   __u64 size;
+   } memory;
/* Fix the size of the union. */
char padding[256];
};
-- 
2.25.1




[PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2022-12-01 Thread Chao Peng
eate() to force
setting F_SEAL_INACCESSIBLE when the fd is created.
  - KVM: add the shared part of the memslot back to make private/shared
pages live in one memslot.

Reference
=========
[1] Intel TDX:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Kirill's implementation:
https://lore.kernel.org/all/20210416154106.23721-1-kirill.shute...@linux.intel.com/T/
 
[3] Original design proposal:
https://lore.kernel.org/all/20210824005248.200037-1-sea...@google.com/  
[4] Selftest:
https://lore.kernel.org/all/2022014244.1714148-1-vannapu...@google.com/


Chao Peng (8):
  KVM: Introduce per-page memory attributes
  KVM: Extend the memslot to support fd-based private memory
  KVM: Add KVM_EXIT_MEMORY_FAULT exit
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Unmap existing mappings when changing memory attributes
  KVM: Update lpage info when private/shared memory are mixed
  KVM: Handle page fault for private memory
  KVM: Enable and expose KVM_MEM_PRIVATE

Kirill A. Shutemov (1):
  mm: Introduce memfd_restricted system call to create restricted user
memory

 Documentation/virt/kvm/api.rst | 125 ++-
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 arch/x86/include/asm/kvm_host.h|   9 +
 arch/x86/kvm/Kconfig   |   3 +
 arch/x86/kvm/mmu/mmu.c | 205 ++-
 arch/x86/kvm/mmu/mmu_internal.h|  14 +-
 arch/x86/kvm/mmu/mmutrace.h|   1 +
 arch/x86/kvm/mmu/tdp_mmu.c |   2 +-
 arch/x86/kvm/x86.c |  17 +-
 include/linux/kvm_host.h   | 103 +-
 include/linux/restrictedmem.h  |  71 
 include/linux/syscalls.h   |   1 +
 include/uapi/asm-generic/unistd.h  |   5 +-
 include/uapi/linux/kvm.h   |  53 +++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c|   3 +
 mm/Kconfig |   4 +
 mm/Makefile|   1 +
 mm/memory-failure.c|   3 +
 mm/restrictedmem.c | 318 +
 virt/kvm/Kconfig   |   6 +
 virt/kvm/kvm_main.c| 469 +
 23 files changed, 1323 insertions(+), 93 deletions(-)
 create mode 100644 include/linux/restrictedmem.h
 create mode 100644 mm/restrictedmem.c


base-commit: df0bb47baa95aad133820b149851d5b94cbc6790
-- 
2.25.1




[PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2022-12-01 Thread Chao Peng
From: "Kirill A. Shutemov" 

Introduce the 'memfd_restricted' system call with the ability to create
memory areas that are restricted from userspace access through ordinary
MMU operations (e.g. read/write/mmap). The memory content is expected to
be used through the new in-kernel interface by a third kernel module.

memfd_restricted() is useful for scenarios where a file descriptor (fd)
can be used as an interface into mm but we want to restrict userspace's
ability to operate on the fd. Initially it is designed to provide
protection for KVM encrypted guest memory.

Normally KVM uses memfd memory by mmapping the memfd into KVM userspace
(e.g. QEMU) and then using the mmaped virtual address to set up the
mapping in the KVM secondary page table (e.g. EPT). With confidential
computing technologies like Intel TDX, the memfd memory may be encrypted
with a special key for a specific software domain (e.g. a KVM guest) and
is not expected to be directly accessed by userspace. More precisely,
userspace access to such encrypted memory may lead to a host crash and
so should be prevented.

memfd_restricted() provides the semantics required for KVM guest
encrypted memory support: an fd created with memfd_restricted() is going
to be used as the source of guest memory in a confidential computing
environment, and KVM can directly interact with core mm without the need
to expose the memory content to KVM userspace.

KVM userspace is still in charge of the lifecycle of the fd. It should
pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
obtain the physical memory page and then uses it to populate the KVM
secondary page table entries.

The userspace restricted memfd can be fallocate-ed or hole-punched from
userspace. When hole-punched, KVM is notified through the
invalidate_start/invalidate_end() callbacks and then gets a chance to
remove any mapped entries of the range from the secondary page tables.

A machine check can happen for memory pages in the restricted memfd.
Instead of routing this directly to userspace, we call the error()
callback that KVM registered, giving KVM a chance to handle it
correctly.

memfd_restricted() itself is implemented as a shim layer on top of real
memory file systems (currently tmpfs). Pages in restrictedmem are marked
as unmovable and unevictable; this is required for current confidential
usage but might change in the future.

By default memfd_restricted() prevents userspace read, write and mmap.
By defining new bits in 'flags', it can be extended to support other
restricted semantics in the future.

The system call is currently wired up for x86 arch.
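
A minimal userspace sketch (the syscall number matches the tables wired
up below; wiring the returned fd into a KVM memslot is assumed to follow
the rest of this series):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <err.h>

  #ifndef __NR_memfd_restricted
  #define __NR_memfd_restricted 451
  #endif

  int main(void)
  {
          int fd = syscall(__NR_memfd_restricted, 0);
          if (fd < 0)
                  err(1, "memfd_restricted");

          /* Optionally reserve backing space; read/write/mmap on fd fail. */
          if (fallocate(fd, 0, 0, 2 * 1024 * 1024))
                  err(1, "fallocate");

          /* The fd would then be passed to KVM as restricted_fd. */
          close(fd);
          return 0;
  }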

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Chao Peng 
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/restrictedmem.h  |  71 ++
 include/linux/syscalls.h   |   1 +
 include/uapi/asm-generic/unistd.h  |   5 +-
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c|   3 +
 mm/Kconfig |   4 +
 mm/Makefile|   1 +
 mm/memory-failure.c|   3 +
 mm/restrictedmem.c | 318 +
 11 files changed, 408 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/restrictedmem.h
 create mode 100644 mm/restrictedmem.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..dc70ba90247e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@
 448i386process_mreleasesys_process_mrelease
 449i386futex_waitv sys_futex_waitv
 450i386set_mempolicy_home_node sys_set_mempolicy_home_node
+451i386memfd_restrictedsys_memfd_restricted
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..06516abc8318 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448common  process_mreleasesys_process_mrelease
 449common  futex_waitv sys_futex_waitv
 450common  set_mempolicy_home_node sys_set_mempolicy_home_node
+451common  memfd_restrictedsys_memfd_restricted
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
new file mode 100644
index ..c2700c5daa43
--- /dev/null
+++ b/include/linux/restrictedmem.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_RESTRICTEDMEM_H
+
+#include 
+#include 
+#include 
+
+struct restrictedmem_notifier;
+
+struct restrictedmem_notifier_ops {
+   void (*invalidate_start)(struct restrictedmem_notifier *notifier,
+pgoff_t start, pgoff_t end);
+ 

Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-30 Thread Chao Peng
On Tue, Nov 29, 2022 at 01:18:15PM -0600, Michael Roth wrote:
> On Tue, Nov 29, 2022 at 01:06:58PM -0600, Michael Roth wrote:
> > On Tue, Nov 29, 2022 at 10:06:15PM +0800, Chao Peng wrote:
> > > On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> > > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > ...
> > > > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > > > + loff_t offset, loff_t len)
> > > > > +{
> > > > > + struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > > + struct file *memfd = data->memfd;
> > > > > + int ret;
> > > > > +
> > > > > + if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > + return -EINVAL;
> > > > > + }
> > > > > +
> > > > > + restrictedmem_notifier_invalidate(data, offset, offset + len, 
> > > > > true);
> > > > 
> > > > The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> > > > loff_t. For SNP we've made this strange as part of the following patch
> > > > and it seems to produce the expected behavior:
> > > 
> > > That's correct. Thanks.
> > > 
> > > > 
> > > >   
> > > > https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> > > > 
> > > > > + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > + restrictedmem_notifier_invalidate(data, offset, offset + len, 
> > > > > false);
> > > > > + return ret;
> > > > > +}
> > > > > +
> > > > 
> > > > 
> > > > 
> > > > > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > > > > +struct page **pagep, int *order)
> > > > > +{
> > > > > + struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > > + struct file *memfd = data->memfd;
> > > > > + struct page *page;
> > > > > + int ret;
> > > > > +
> > > > > + ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > > > 
> > > > This will result in KVM allocating pages that userspace hasn't necessary
> > > > fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> > > > up the RMP entries when restrictedmem invalidations are issued for a GFN
> > > > range.
> > > 
> > > Yes fallocate() is unnecessary unless someone wants to reserve some
> > > space (e.g. for determination or performance purpose), this matches its
> > > semantics perfectly at:
> > > https://www.man7.org/linux/man-pages/man2/fallocate.2.html
> > > 
> > > > 
> > > > If the guest supports lazy-acceptance however, these pages may not have
> > > > been faulted in yet, and if the VMM defers actually fallocate()'ing 
> > > > space
> > > > until the guest actually tries to issue a shared->private for that GFN
> > > > (to support lazy-pinning), then there may never be a need to allocate
> > > > pages for these backends.
> > > > 
> > > > However, the restrictedmem invalidations are for GFN ranges so there's
> > > > no way to know inadvance whether it's been allocated yet or not. The
> > > > xarray is one option but currently it defaults to 'private' so that
> > > > doesn't help us here. It might if we introduced a 'uninitialized' state
> > > > or something along tha

Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-29 Thread Chao Peng
On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
...
> > +static long restrictedmem_fallocate(struct file *file, int mode,
> > +   loff_t offset, loff_t len)
> > +{
> > +   struct restrictedmem_data *data = file->f_mapping->private_data;
> > +   struct file *memfd = data->memfd;
> > +   int ret;
> > +
> > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +   return -EINVAL;
> > +   }
> > +
> > +   restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> 
> The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> loff_t. For SNP we've made this strange as part of the following patch
> and it seems to produce the expected behavior:

That's correct. Thanks.

> 
>   
> https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> 
> > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +   restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > +   return ret;
> > +}
> > +
> 
> 
> 
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +  struct page **pagep, int *order)
> > +{
> > +   struct restrictedmem_data *data = file->f_mapping->private_data;
> > +   struct file *memfd = data->memfd;
> > +   struct page *page;
> > +   int ret;
> > +
> > +   ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> 
> This will result in KVM allocating pages that userspace hasn't necessary
> fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> up the RMP entries when restrictedmem invalidations are issued for a GFN
> range.

Yes, fallocate() is unnecessary unless someone wants to reserve some
space (e.g. for determinism or performance purposes); this matches its
semantics perfectly, see:
https://www.man7.org/linux/man-pages/man2/fallocate.2.html

> 
> If the guest supports lazy-acceptance however, these pages may not have
> been faulted in yet, and if the VMM defers actually fallocate()'ing space
> until the guest actually tries to issue a shared->private for that GFN
> (to support lazy-pinning), then there may never be a need to allocate
> pages for these backends.
> 
> However, the restrictedmem invalidations are for GFN ranges so there's
> no way to know inadvance whether it's been allocated yet or not. The
> xarray is one option but currently it defaults to 'private' so that
> doesn't help us here. It might if we introduced a 'uninitialized' state
> or something along that line instead of just the binary
> 'shared'/'private' though...

How about if we change the default to 'shared' as we discussed at
https://lore.kernel.org/all/y35gi0l8gmt9+...@google.com/?
> 
> But for now we added a restrictedmem_get_page_noalloc() that uses
> SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> of memory as part of guest shutdown, and a
> kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> default, and we just propagate an error to userspace if they didn't
> fallocate() in advance?

This (making fallocate() a hard requirement) not only complicates
userspace but also forces lazy-faulting to go through a long path of
exiting to userspace. Unless we have no other options I would not go
this way.

Chao
> 
> -Mike
> 
> > +   if (ret)
> > +   return ret;
> > +
> > +   *pagep = page;
> > +   if (order)
> > +   *order = thp_order(compound_head(page));
> > +
> > +   SetPageUptodate(page);
> > +   unlock_page(page);
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > -- 
> > 2.25.1
> > 



Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-29 Thread Chao Peng
On Tue, Nov 29, 2022 at 12:39:06PM +0100, David Hildenbrand wrote:
> On 29.11.22 12:21, Kirill A. Shutemov wrote:
> > On Mon, Nov 28, 2022 at 06:06:32PM -0600, Michael Roth wrote:
> > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > > From: "Kirill A. Shutemov" 
> > > > 
> > > 
> > > 
> > > 
> > > > +static struct file *restrictedmem_file_create(struct file *memfd)
> > > > +{
> > > > +   struct restrictedmem_data *data;
> > > > +   struct address_space *mapping;
> > > > +   struct inode *inode;
> > > > +   struct file *file;
> > > > +
> > > > +   data = kzalloc(sizeof(*data), GFP_KERNEL);
> > > > +   if (!data)
> > > > +   return ERR_PTR(-ENOMEM);
> > > > +
> > > > +   data->memfd = memfd;
> > > > +   mutex_init(&data->lock);
> > > > +   INIT_LIST_HEAD(&data->notifiers);
> > > > +
> > > > +   inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > > > +   if (IS_ERR(inode)) {
> > > > +   kfree(data);
> > > > +   return ERR_CAST(inode);
> > > > +   }
> > > > +
> > > > +   inode->i_mode |= S_IFREG;
> > > > +   inode->i_op = _iops;
> > > > +   inode->i_mapping->private_data = data;
> > > > +
> > > > +   file = alloc_file_pseudo(inode, restrictedmem_mnt,
> > > > +"restrictedmem", O_RDWR,
> > > > +&restrictedmem_fops);
> > > > +   if (IS_ERR(file)) {
> > > > +   iput(inode);
> > > > +   kfree(data);
> > > > +   return ERR_CAST(file);
> > > > +   }
> > > > +
> > > > +   file->f_flags |= O_LARGEFILE;
> > > > +
> > > > +   mapping = memfd->f_mapping;
> > > > +   mapping_set_unevictable(mapping);
> > > > +   mapping_set_gfp_mask(mapping,
> > > > +mapping_gfp_mask(mapping) & 
> > > > ~__GFP_MOVABLE);
> > > 
> > > Is this supposed to prevent migration of pages being used for
> > > restrictedmem/shmem backend?
> > 
> > Yes, my bad. I expected it to prevent migration, but it is not true.
> 
> Maybe add a comment that these pages are not movable and we don't want to
> place them into movable pageblocks (including CMA and ZONE_MOVABLE). That's
> the primary purpose of the GFP mask here.

Yes I can do that.

Chao
> 
> -- 
> Thanks,
> 
> David / dhildenb



Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-29 Thread Chao Peng
On Tue, Nov 29, 2022 at 02:21:39PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 28, 2022 at 06:06:32PM -0600, Michael Roth wrote:
> > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > From: "Kirill A. Shutemov" 
> > > 
> > 
> > 
> > 
> > > +static struct file *restrictedmem_file_create(struct file *memfd)
> > > +{
> > > + struct restrictedmem_data *data;
> > > + struct address_space *mapping;
> > > + struct inode *inode;
> > > + struct file *file;
> > > +
> > > + data = kzalloc(sizeof(*data), GFP_KERNEL);
> > > + if (!data)
> > > + return ERR_PTR(-ENOMEM);
> > > +
> > > + data->memfd = memfd;
> > > + mutex_init(&data->lock);
> > > + INIT_LIST_HEAD(&data->notifiers);
> > > +
> > > + inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> > > + if (IS_ERR(inode)) {
> > > + kfree(data);
> > > + return ERR_CAST(inode);
> > > + }
> > > +
> > > + inode->i_mode |= S_IFREG;
> > > + inode->i_op = &restrictedmem_iops;
> > > + inode->i_mapping->private_data = data;
> > > +
> > > + file = alloc_file_pseudo(inode, restrictedmem_mnt,
> > > +  "restrictedmem", O_RDWR,
> > > +  &restrictedmem_fops);
> > > + if (IS_ERR(file)) {
> > > + iput(inode);
> > > + kfree(data);
> > > + return ERR_CAST(file);
> > > + }
> > > +
> > > + file->f_flags |= O_LARGEFILE;
> > > +
> > > + mapping = memfd->f_mapping;
> > > + mapping_set_unevictable(mapping);
> > > + mapping_set_gfp_mask(mapping,
> > > +  mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > 
> > Is this supposed to prevent migration of pages being used for
> > restrictedmem/shmem backend?
> 
> Yes, my bad. I expected it to prevent migration, but it is not true.
> 
> Looks like we need to bump refcount in restrictedmem_get_page() and reduce
> it back when KVM is no longer use it.

restrictedmem_get_page() does take a reference, but KVM later drops it
via kvm_release_pfn_clean() after populating the secondary page table
entry. One option would be to let the user feature (e.g. TDX/SEV) do the
get_page()/put_page() while populating the secondary page table entry;
AFAICS this requirement also comes from those features.

Chao
> 
> Chao, could you adjust it?
> 
> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov



Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-11-22 Thread Chao Peng
On Fri, Nov 18, 2022 at 03:59:12PM +, Sean Christopherson wrote:
> On Fri, Nov 18, 2022, Alex Bennée wrote:
> > 
> > Chao Peng  writes:
> > 
> > > On Thu, Nov 17, 2022 at 03:08:17PM +, Alex Bennée wrote:
> > >> >> I think this should be explicit rather than implied by the absence of
> > >> >> another flag. Sean suggested you might want flags for RWX failures so
> > >> >> maybe something like:
> > >> >> 
> > >> >>   KVM_MEMORY_EXIT_SHARED_FLAG_READ(1 << 0)
> > >> >>   KVM_MEMORY_EXIT_SHARED_FLAG_WRITE   (1 << 1)
> > >> >>   KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE (1 << 2)
> > >> >> KVM_MEMORY_EXIT_FLAG_PRIVATE(1 << 3)
> > >> >
> > >> > Yes, but I would not add 'SHARED' to RWX, they are not share memory
> > >> > specific, private memory can also set them once introduced.
> > >> 
> > >> OK so how about:
> > >> 
> > >>  KVM_MEMORY_EXIT_FLAG_READ   (1 << 0)
> > >>  KVM_MEMORY_EXIT_FLAG_WRITE  (1 << 1)
> > >>  KVM_MEMORY_EXIT_FLAG_EXECUTE(1 << 2)
> > >> KVM_MEMORY_EXIT_FLAG_SHARED (1 << 3)
> > >> KVM_MEMORY_EXIT_FLAG_PRIVATE(1 << 4)
> > >
> > > We don't actually need a new bit, the opposite side of private is
> > > shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
> > > 'shared'.
> > 
> > If that is always true and we never expect a 3rd type of memory that is
> > fine. But given we are leaving room for expansion having an explicit bit
> > allows for that as well as making cases of forgetting to set the flags
> > more obvious.
> 
> Hrm, I'm on the fence.
> 
> A dedicated flag isn't strictly needed, e.g. even if we end up with 3+ types 
> in
> this category, the baseline could always be "private".

The baseline for the current code is actually "shared".

> 
> I do like being explicit, and adding a PRIVATE flag costs KVM practically 
> nothing
> to implement and maintain, but evetually we'll up with flags that are paired 
> with
> an implicit state, e.g. see the many #PF error codes in x86.  In other words,
> inevitably KVM will need to define the default/base state of the access, at 
> which
> point the base state for SHARED vs. PRIVATE is "undefined".  

The current memory conversion for confidential usage is bi-directional,
so we already need both private and shared states. If we use one bit for
both "shared" and "private" then we have to define the default state;
e.g. currently the default state is "shared" when we define

  KVM_MEMORY_EXIT_FLAG_PRIVATE  (1 << 0)

> 
> The RWX bits are in the same boat, e.g. the READ flag isn't strictly 
> necessary.
> I was thinking more of the KVM_SET_MEMORY_ATTRIBUTES ioctl(), which does need
> the full RWX gamut, when I typed out that response.

For KVM_SET_MEMORY_ATTRIBUTES it's reasonable to add RWX bits and match
them to the permission bit definitions in the EPT entry.

> 
> So I would say if we add an explicit READ flag, then we might as well add an 
> explicit
> PRIVATE flag too.  But if we omit PRIVATE, then we should omit READ too.

Since we assume the default state is shared, we actually only need a
PRIVATE flag; there is no SHARED flag, and we will ignore RWX for now.

Chao



Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-11-17 Thread Chao Peng
On Thu, Nov 17, 2022 at 03:08:17PM +, Alex Bennée wrote:
> 
> Chao Peng  writes:
> 
> > On Wed, Nov 16, 2022 at 07:03:49PM +, Alex Bennée wrote:
> >> 
> >> Chao Peng  writes:
> >> 
> >> > On Tue, Nov 15, 2022 at 04:56:12PM +, Alex Bennée wrote:
> >> >> 
> >> >> Chao Peng  writes:
> >> >> 
> >> >> > This new KVM exit allows userspace to handle memory-related errors. It
> >> >> > indicates an error happens in KVM at guest memory range [gpa, 
> >> >> > gpa+size).
> >> >> > The flags includes additional information for userspace to handle the
> >> >> > error. Currently bit 0 is defined as 'private memory' where '1'
> >> >> > indicates error happens due to private memory access and '0' indicates
> >> >> > error happens due to shared memory access.
> >> >> >
> >> >> > When private memory is enabled, this new exit will be used for KVM to
> >> >> > exit to userspace for shared <-> private memory conversion in memory
> >> >> > encryption usage. In such usage, typically there are two kind of 
> >> >> > memory
> >> >> > conversions:
> >> >> >   - explicit conversion: happens when guest explicitly calls into KVM
> >> >> > to map a range (as private or shared), KVM then exits to userspace
> >> >> > to perform the map/unmap operations.
> >> >> >   - implicit conversion: happens in KVM page fault handler where KVM
> >> >> > exits to userspace for an implicit conversion when the page is in 
> >> >> > a
> >> >> > different state than requested (private or shared).
> >> >> >
> >> >> > Suggested-by: Sean Christopherson 
> >> >> > Co-developed-by: Yu Zhang 
> >> >> > Signed-off-by: Yu Zhang 
> >> >> > Signed-off-by: Chao Peng 
> >> >> > ---
> >> >> >  Documentation/virt/kvm/api.rst | 23 +++
> >> >> >  include/uapi/linux/kvm.h   |  9 +
> >> >> >  2 files changed, 32 insertions(+)
> >> >> >
> >> >> > diff --git a/Documentation/virt/kvm/api.rst 
> >> >> > b/Documentation/virt/kvm/api.rst
> >> >> > index f3fa75649a78..975688912b8c 100644
> >> >> > --- a/Documentation/virt/kvm/api.rst
> >> >> > +++ b/Documentation/virt/kvm/api.rst
> >> >> > @@ -6537,6 +6537,29 @@ array field represents return values. The 
> >> >> > userspace should update the return
> >> >> >  values of SBI call before resuming the VCPU. For more details on 
> >> >> > RISC-V SBI
> >> >> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >> >> >  
> >> >> > +::
> >> >> > +
> >> >> > + /* KVM_EXIT_MEMORY_FAULT */
> >> >> > + struct {
> >> >> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1 << 0)
> >> >> > + __u32 flags;
> >> >> > + __u32 padding;
> >> >> > + __u64 gpa;
> >> >> > + __u64 size;
> >> >> > + } memory;
> >> >> > +
> >> >> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the 
> >> >> > VCPU has
> >> >> > +encountered a memory error which is not handled by KVM kernel module 
> >> >> > and
> >> >> > +userspace may choose to handle it. The 'flags' field indicates the 
> >> >> > memory
> >> >> > +properties of the exit.
> >> >> > +
> >> >> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is 
> >> >> > caused by
> >> >> > +   private memory access when the bit is set. Otherwise the memory 
> >> >> > error is
> >> >> > +   caused by shared memory access when the bit is clear.
> >> >> 
> >> >> What does a shared memory access failure entail?
> >> >
> >> > In the context of confidential computing usages, guest can issue a
> >> > shared memory access while the memory is actually private from the host
> >> > point 

Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-11-17 Thread Chao Peng
On Wed, Nov 16, 2022 at 09:40:23AM +, Alex Bennée wrote:
> 
> Chao Peng  writes:
> 
> > On Mon, Nov 14, 2022 at 11:43:37AM +, Alex Bennée wrote:
> >> 
> >> Chao Peng  writes:
> >> 
> >> 
> >> > Introduction
> >> > 
> >> > KVM userspace being able to crash the host is horrible. Under current
> >> > KVM architecture, all guest memory is inherently accessible from KVM
> >> > userspace and is exposed to the mentioned crash issue. The goal of this
> >> > series is to provide a solution to align mm and KVM, on a userspace
> >> > inaccessible approach of exposing guest memory. 
> >> >
> >> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> >> > virtual address (hva) from core mm page table (e.g. x86 userspace page
> >> > table). This requires guest memory being mmaped into KVM userspace, but
> >> > this is also the source where the mentioned crash issue can happen. In
> >> > theory, apart from those 'shared' memory for device emulation etc, guest
> >> > memory doesn't have to be mmaped into KVM userspace.
> >> >
> >> > This series introduces fd-based guest memory which will not be mmaped
> >> > into KVM userspace. KVM populates secondary page table by using a
> >> > fd/offset pair backed by a memory file system. The fd can be created
> >> > from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> >> > directly interact with them with newly introduced in-kernel interface,
> >> > therefore remove the KVM userspace from the path of accessing/mmaping
> >> > the guest memory. 
> >> >
> >> > Kirill had a patch [2] to address the same issue in a different way. It
> >> > tracks guest encrypted memory at the 'struct page' level and relies on
> >> > HWPOISON to reject the userspace access. The patch has been discussed in
> >> > several online and offline threads and resulted in a design document [3]
> >> > which is also the original proposal for this series. Later this patch
> >> > series evolved as more comments received in community but the major
> >> > concepts in [3] still hold true so recommend reading.
> >> >
> >> > The patch series may also be useful for other usages, for example, pure
> >> > software approach may use it to harden itself against unintentional
> >> > access to guest memory. This series is designed with these usages in
> >> > mind but doesn't have code directly support them and extension might be
> >> > needed.
> >> 
> >> There are a couple of additional use cases where having a consistent
> >> memory interface with the kernel would be useful.
> >
> > Thanks very much for the info. But I'm not so confident that the current
> > memfd_restricted() implementation can be useful for all these usages. 
> >
> >> 
> >>   - Xen DomU guests providing other domains with VirtIO backends
> >> 
> >>   Xen by default doesn't give other domains special access to a domains
> >>   memory. The guest can grant access to regions of its memory to other
> >>   domains for this purpose. 
> >
> > I'm trying to form my understanding on how this could work and what's
> > the benefit for a DomU guest to provide memory through memfd_restricted().
> > AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
> > but I assume VirtIO backends are still in DomU userspace and need to
> > access that memory, right?
> 
> They need access to parts of the memory. At the moment you run your
> VirtIO domains in the Dom0 and give them access to the whole of a DomU's
> address space - however the Xen model is by default the guests memory is
> inaccessible to other domains on the system. The DomU guest uses the Xen
> grant model to expose portions of its address space to other domains -
> namely for the VirtIO queues themselves and any pages containing buffers
> involved in the VirtIO transaction. My thought was that looks like a
> guest memory interface which is mostly inaccessible (private) with some
> holes in it where memory is being explicitly shared with other domains.

Yes, similar in conception. For KVM, memfd_restricted() is used by the
host OS; the guest issues conversions between private and shared for its
memory ranges. This is similar to a Xen DomU guest granting its memory
to other domains. Similarly, I guess that to make memfd_restricted()
really useful for Xen, it should be run on the VirtIO backend domain (e.g

Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-11-17 Thread Chao Peng
On Wed, Nov 16, 2022 at 07:03:49PM +, Alex Bennée wrote:
> 
> Chao Peng  writes:
> 
> > On Tue, Nov 15, 2022 at 04:56:12PM +, Alex Bennée wrote:
> >> 
> >> Chao Peng  writes:
> >> 
> >> > This new KVM exit allows userspace to handle memory-related errors. It
> >> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> >> > The flags includes additional information for userspace to handle the
> >> > error. Currently bit 0 is defined as 'private memory' where '1'
> >> > indicates error happens due to private memory access and '0' indicates
> >> > error happens due to shared memory access.
> >> >
> >> > When private memory is enabled, this new exit will be used for KVM to
> >> > exit to userspace for shared <-> private memory conversion in memory
> >> > encryption usage. In such usage, typically there are two kind of memory
> >> > conversions:
> >> >   - explicit conversion: happens when guest explicitly calls into KVM
> >> > to map a range (as private or shared), KVM then exits to userspace
> >> > to perform the map/unmap operations.
> >> >   - implicit conversion: happens in KVM page fault handler where KVM
> >> > exits to userspace for an implicit conversion when the page is in a
> >> > different state than requested (private or shared).
> >> >
> >> > Suggested-by: Sean Christopherson 
> >> > Co-developed-by: Yu Zhang 
> >> > Signed-off-by: Yu Zhang 
> >> > Signed-off-by: Chao Peng 
> >> > ---
> >> >  Documentation/virt/kvm/api.rst | 23 +++
> >> >  include/uapi/linux/kvm.h   |  9 +
> >> >  2 files changed, 32 insertions(+)
> >> >
> >> > diff --git a/Documentation/virt/kvm/api.rst 
> >> > b/Documentation/virt/kvm/api.rst
> >> > index f3fa75649a78..975688912b8c 100644
> >> > --- a/Documentation/virt/kvm/api.rst
> >> > +++ b/Documentation/virt/kvm/api.rst
> >> > @@ -6537,6 +6537,29 @@ array field represents return values. The 
> >> > userspace should update the return
> >> >  values of SBI call before resuming the VCPU. For more details on RISC-V 
> >> > SBI
> >> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >> >  
> >> > +::
> >> > +
> >> > +/* KVM_EXIT_MEMORY_FAULT */
> >> > +struct {
> >> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE  (1 << 0)
> >> > +__u32 flags;
> >> > +__u32 padding;
> >> > +__u64 gpa;
> >> > +__u64 size;
> >> > +} memory;
> >> > +
> >> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU 
> >> > has
> >> > +encountered a memory error which is not handled by KVM kernel module and
> >> > +userspace may choose to handle it. The 'flags' field indicates the 
> >> > memory
> >> > +properties of the exit.
> >> > +
> >> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused 
> >> > by
> >> > +   private memory access when the bit is set. Otherwise the memory 
> >> > error is
> >> > +   caused by shared memory access when the bit is clear.
> >> 
> >> What does a shared memory access failure entail?
> >
> > In the context of confidential computing usages, guest can issue a
> > shared memory access while the memory is actually private from the host
> > point of view. This exit with bit 0 cleared gives userspace a chance to
> > convert the private memory to shared memory on host.
> 
> I think this should be explicit rather than implied by the absence of
> another flag. Sean suggested you might want flags for RWX failures so
> maybe something like:
> 
>   KVM_MEMORY_EXIT_SHARED_FLAG_READ(1 << 0)
>   KVM_MEMORY_EXIT_SHARED_FLAG_WRITE   (1 << 1)
>   KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE (1 << 2)
> KVM_MEMORY_EXIT_FLAG_PRIVATE(1 << 3)

Yes, but I would not add 'SHARED' to RWX; they are not shared-memory
specific, and private memory can also set them once introduced.

Thanks,
Chao
> 
> which would allow you to signal the various failure modes of the shared
> region, or that you had accessed private memory.
> 
> >

Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-11-17 Thread Chao Peng
On Wed, Nov 16, 2022 at 06:48:43PM +, Sean Christopherson wrote:
> On Wed, Nov 16, 2022, Andy Lutomirski wrote:
> > 
> > 
> > On Tue, Oct 25, 2022, at 8:13 AM, Chao Peng wrote:
> > > diff --git a/Documentation/virt/kvm/api.rst 
> > > b/Documentation/virt/kvm/api.rst
> > > index f3fa75649a78..975688912b8c 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -6537,6 +6537,29 @@ array field represents return values. The 
> > > userspace should update the return
> > >  values of SBI call before resuming the VCPU. For more details on 
> > > RISC-V SBI
> > >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> > > 
> > > +::
> > > +
> > > + /* KVM_EXIT_MEMORY_FAULT */
> > > + struct {
> > > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1 << 0)
> > > + __u32 flags;
> > > + __u32 padding;
> > > + __u64 gpa;
> > > + __u64 size;
> > > + } memory;
> > > +
> > 
> > Would it make sense to also have a field for the access type (read, write,
> > execute, etc)?  I realize that shared <-> private conversion doesn't 
> > strictly
> > need this, but it seems like it could be useful for logging failures and 
> > also
> > for avoiding a second immediate fault if the type gets converted but doesn't
> > have the right protection yet.
> 
> I don't think a separate field is necessary, that info can be conveyed via 
> flags.
> Though maybe we should go straight to a u64 for flags.

Yeah, I can do that.

> Hmm, and maybe avoid bits
> 0-3 so that if/when RWX info is conveyed the flags can align with
> PROT_{READ,WRITE,EXEC} and the EPT flags, e.g.

You mean avoiding bits 0-2, right? Bit 3 is not so special and can be
used for KVM_MEMORY_EXIT_FLAG_PRIVATE.

Chao
> 
>   KVM_MEMORY_EXIT_FLAG_READ   (1 << 0)
>   KVM_MEMORY_EXIT_FLAG_WRITE  (1 << 1)
>   KVM_MEMORY_EXIT_FLAG_EXECUTE(1 << 2)
> 
> > (Obviously, if this were changed, KVM would need the ability to report that
> > it doesn't actually know the mode.)
> > 
> > --Andy



Re: [PATCH v9 7/8] KVM: Handle page fault for private memory

2022-11-17 Thread Chao Peng
On Wed, Nov 16, 2022 at 10:13:07PM +, Sean Christopherson wrote:
> On Wed, Nov 16, 2022, Ackerley Tng wrote:
> > >@@ -4173,6 +4203,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, 
> > >struct kvm_page_fault *fault)
> > >   return RET_PF_EMULATE;
> > >   }
> > >
> > >+  if (kvm_slot_can_be_private(slot) &&
> > >+  fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > >+  vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> > >+  if (fault->is_private)
> > >+  vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> > >+  else
> > >+  vcpu->run->memory.flags = 0;
> > >+  vcpu->run->memory.padding = 0;
> > >+  vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> > >+  vcpu->run->memory.size = PAGE_SIZE;
> > >+  return RET_PF_USER;
> > >+  }
> > >+
> > >+  if (fault->is_private)
> > >+  return kvm_faultin_pfn_private(fault);
> > >+
> > 
> > Since each memslot may also not be backed by restricted memory, we
> > should also check if the memslot has been set up for private memory
> > with
> > 
> > if (fault->is_private && kvm_slot_can_be_private(slot))
> > return kvm_faultin_pfn_private(fault);
> > 
> > Without this check, restrictedmem_get_page will get called with NULL
> > in slot->restricted_file, which causes a NULL pointer dereference.
> 
> Hmm, silently skipping the faultin would result in KVM faulting in the shared
> portion of the memslot, and I believe would end up mapping that pfn as 
> private,
> i.e. would map a non-UPM PFN as a private mapping.  For TDX and SNP, that 
> would
> be double ungood as it would let the host access memory that is mapped 
> private,
> i.e. lead to #MC or #PF(RMP) in the host.

That's correct.

> 
> I believe the correct solution is to drop the "can be private" check from the
> above check, and instead handle that in kvm_faultin_pfn_private().  That 
> would fix
> another bug, e.g. if the fault is shared, the slot can't be private, but for
> whatever reason userspace marked the gfn as private.  Even though KVM might be
> able service the fault, the correct thing to do in that case is to exit to 
> userspace.

It makes sense to me.

Chao
> 
> E.g.
> 
> ---
>  arch/x86/kvm/mmu/mmu.c | 36 ++--
>  1 file changed, 22 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 10017a9f26ee..e2ac8873938e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4158,11 +4158,29 @@ static inline u8 order_to_level(int order)
>   return PG_LEVEL_4K;
>  }
>  
> -static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
> +static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault)
> +{
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> + if (fault->is_private)
> + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> + else
> + vcpu->run->memory.flags = 0;
> + vcpu->run->memory.padding = 0;
> + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> + vcpu->run->memory.size = PAGE_SIZE;
> + return RET_PF_USER;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> +struct kvm_page_fault *fault)
>  {
>   int order;
>   struct kvm_memory_slot *slot = fault->slot;
>  
> + if (!kvm_slot_can_be_private(slot))
> + return kvm_do_memory_fault_exit(vcpu, fault);
> +
>   if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
>   return RET_PF_RETRY;
>  
> @@ -4203,21 +4221,11 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, 
> struct kvm_page_fault *fault)
>   return RET_PF_EMULATE;
>   }
>  
> - if (kvm_slot_can_be_private(slot) &&
> - fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> - vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> - if (fault->is_private)
> - vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> - else
> - vcpu->run->memory.flags = 0;
> - vcpu->run->memory.padding = 0;
> - vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> - vcpu->run->memory.size = PAGE_SIZE;
> - return RET_PF_USER;
> - }
> + if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> + return kvm_do_memory_fault_exit(vcpu, fault);
>  
>   if (fault->is_private)
> - return kvm_faultin_pfn_private(fault);
> + return kvm_faultin_pfn_private(vcpu, fault);
>  
>   async = false;
>   fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,
> 
> base-commit: 969d761bb7b8654605937f31ae76123dcb7f15a3
> -- 



Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

2022-11-17 Thread Chao Peng
On Wed, Nov 16, 2022 at 10:24:11PM +, Sean Christopherson wrote:
> On Tue, Oct 25, 2022, Chao Peng wrote:
> > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t 
> > size,
> > +bool is_private)
> > +{
> > +   gfn_t start, end;
> > +   unsigned long i;
> > +   void *entry;
> > +   int idx;
> > +   int r = 0;
> > +
> > +   if (size == 0 || gpa + size < gpa)
> > +   return -EINVAL;
> > +   if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> > +   return -EINVAL;
> > +
> > +   start = gpa >> PAGE_SHIFT;
> > +   end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +   /*
> > +* Guest memory defaults to private, kvm->mem_attr_array only stores
> > +* shared memory.
> > +*/
> > +   entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> > +
> > +   idx = srcu_read_lock(&kvm->srcu);
> > +   KVM_MMU_LOCK(kvm);
> > +   kvm_mmu_invalidate_begin(kvm, start, end);
> > +
> > +   for (i = start; i < end; i++) {
> > +   r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +   GFP_KERNEL_ACCOUNT));
> > +   if (r)
> > +   goto err;
> > +   }
> > +
> > +   kvm_unmap_mem_range(kvm, start, end);
> > +
> > +   goto ret;
> > +err:
> > +   for (; i > start; i--)
> > +   xa_erase(&kvm->mem_attr_array, i);
> 
> I don't think deleting previous entries is correct.  To unwind, the correct 
> thing
> to do is restore the original values.  E.g. if userspace space is mapping a 
> large
> range as shared, and some of the previous entries were shared, deleting them 
> would
> incorrectly "convert" those entries to private.

Ah, right!

> 
> Tracking the previous state likely isn't the best approach, e.g. it would 
> require
> speculatively allocating extra memory for a rare condition that is likely 
> going to
> lead to OOM anyways.

Agree.

> 
> Instead of trying to unwind, what about updating the ioctl() params such that
> retrying with the updated addr+size would Just Work?  E.g.

Looks good to me. Thanks!
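
A rough userspace-side sketch of how the retry could then look (assuming a
partial failure is reported as an -ENOMEM style error and that KVM rewrites
addr/size to the not-yet-converted remainder before returning):

  struct kvm_enc_region region = { .addr = gpa, .size = size };
  int r;

  do {
          r = ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
          /* On failure, region.addr/size now describe the remaining
           * range, so simply retrying makes forward progress. */
  } while (r < 0 && errno == ENOMEM);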

Chao
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 55b07aae67cc..f1de592a1a06 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1015,15 +1015,12 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, 
> gpa_t gpa, gpa_t size,
>  
> kvm_unmap_mem_range(kvm, start, end, attr);
>  
> -   goto ret;
> -err:
> -   for (; i > start; i--)
> -   xa_erase(&kvm->mem_attr_array, i);
> -ret:
> kvm_mmu_invalidate_end(kvm, start, end);
> KVM_MMU_UNLOCK(kvm);
> srcu_read_unlock(&kvm->srcu, idx);
>  
> +   
> +
> return r;
>  }
>  #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
> @@ -4989,6 +4986,8 @@ static long kvm_vm_ioctl(struct file *filp,
>  
> r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
>   region.size, set);
> +   if (copy_to_user(argp, &region, sizeof(region)) && !r)
> +   r = -EFAULT;
> break;
> }
>  #endif



Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-11-15 Thread Chao Peng
On Mon, Nov 14, 2022 at 11:43:37AM +, Alex Bennée wrote:
> 
> Chao Peng  writes:
> 
> 
> > Introduction
> > 
> > KVM userspace being able to crash the host is horrible. Under current
> > KVM architecture, all guest memory is inherently accessible from KVM
> > userspace and is exposed to the mentioned crash issue. The goal of this
> > series is to provide a solution to align mm and KVM, on a userspace
> > inaccessible approach of exposing guest memory. 
> >
> > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > table). This requires guest memory being mmaped into KVM userspace, but
> > this is also the source where the mentioned crash issue can happen. In
> > theory, apart from those 'shared' memory for device emulation etc, guest
> > memory doesn't have to be mmaped into KVM userspace.
> >
> > This series introduces fd-based guest memory which will not be mmaped
> > into KVM userspace. KVM populates secondary page table by using a
> > fd/offset pair backed by a memory file system. The fd can be created
> > from a supported memory filesystem like tmpfs/hugetlbfs and KVM can
> > directly interact with them with newly introduced in-kernel interface,
> > therefore remove the KVM userspace from the path of accessing/mmaping
> > the guest memory. 
> >
> > Kirill had a patch [2] to address the same issue in a different way. It
> > tracks guest encrypted memory at the 'struct page' level and relies on
> > HWPOISON to reject the userspace access. The patch has been discussed in
> > several online and offline threads and resulted in a design document [3]
> > which is also the original proposal for this series. Later this patch
> > series evolved as more comments received in community but the major
> > concepts in [3] still hold true so recommend reading.
> >
> > The patch series may also be useful for other usages, for example, pure
> > software approach may use it to harden itself against unintentional
> > access to guest memory. This series is designed with these usages in
> > mind but doesn't have code directly support them and extension might be
> > needed.
> 
> There are a couple of additional use cases where having a consistent
> memory interface with the kernel would be useful.

Thanks very much for the info. But I'm not so confident that the current
memfd_restricted() implementation can be useful for all these usages. 

> 
>   - Xen DomU guests providing other domains with VirtIO backends
> 
>   Xen by default doesn't give other domains special access to a domains
>   memory. The guest can grant access to regions of its memory to other
>   domains for this purpose. 

I'm trying to form my understanding of how this could work and what the
benefit is for a DomU guest to provide memory through memfd_restricted().
AFAICS, memfd_restricted() can help to hide the memory from DomU userspace,
but I assume the VirtIO backends are still in DomU userspace and need to
access that memory, right?

> 
>   - pKVM on ARM
> 
>   Similar to Xen, pKVM moves the management of the page tables into the
>   hypervisor and again doesn't allow those domains to share memory by
>   default.

Right, we already had some discussions on this in the past versions.

> 
>   - VirtIO loopback
> 
>   This allows for VirtIO devices for the host kernel to be serviced by
>   backends running in userspace. Obviously the memory userspace is
>   allowed to access is strictly limited to the buffers and queues
>   because giving userspace unrestricted access to the host kernel would
>   have consequences.

Okay, but normal memfd_create() should work for it, right?
memfd_restricted(), on the other hand, may not work since it unmaps the
memory from userspace.

> 
> All of these VirtIO backends work with vhost-user which uses memfds to
> pass references to guest memory from the VMM to the backend
> implementation.

These sound to me like the places where normal memfd_create() fits well.
VirtIO backends work on mmap-ed memory, which is currently not possible
with memfd_restricted(). memfd_restricted() has a different design
purpose: it unmaps the memory from userspace and provides kernel
callbacks so that other kernel modules can make use of the memory through
those callbacks instead of through a userspace virtual address.
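
To make the contrast concrete, a rough sketch of the two usages (the '_ext'
memslot fields follow this series; the exact syscall invocation and the
final memslot ioctl should be treated as assumptions, not the final uAPI):

  /* vhost-user style: the memory stays mmap-able in userspace */
  int fd = memfd_create("guest-ram", MFD_CLOEXEC);
  ftruncate(fd, size);
  void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  /* hva (or the fd itself) can then be handed to a userspace backend */

  /* restrictedmem style: no mmap(); the fd is only handed to KVM */
  int rfd = syscall(__NR_memfd_restricted, 0);
  struct kvm_userspace_memory_region_ext ext = {
          .region = { .slot = 0, .flags = KVM_MEM_PRIVATE,
                      .guest_phys_addr = gpa, .memory_size = size,
                      .userspace_addr = (unsigned long)shared_hva },
          .restricted_fd = rfd,
          .restricted_offset = 0,
  };
  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);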

Chao
> 
> > mm change
> > =
> > Introduces a new memfd_restricted system call which can create memory
> > file that is restricted from userspace access via normal MMU operations
> > like read(), write() or mmap() etc and the only way to use it is
> > passing it to a third kernel module like KVM a

Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-11-15 Thread Chao Peng
On Tue, Nov 15, 2022 at 04:56:12PM +, Alex Bennée wrote:
> 
> Chao Peng  writes:
> 
> > This new KVM exit allows userspace to handle memory-related errors. It
> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> > The flags includes additional information for userspace to handle the
> > error. Currently bit 0 is defined as 'private memory' where '1'
> > indicates error happens due to private memory access and '0' indicates
> > error happens due to shared memory access.
> >
> > When private memory is enabled, this new exit will be used for KVM to
> > exit to userspace for shared <-> private memory conversion in memory
> > encryption usage. In such usage, typically there are two kind of memory
> > conversions:
> >   - explicit conversion: happens when guest explicitly calls into KVM
> > to map a range (as private or shared), KVM then exits to userspace
> > to perform the map/unmap operations.
> >   - implicit conversion: happens in KVM page fault handler where KVM
> > exits to userspace for an implicit conversion when the page is in a
> > different state than requested (private or shared).
> >
> > Suggested-by: Sean Christopherson 
> > Co-developed-by: Yu Zhang 
> > Signed-off-by: Yu Zhang 
> > Signed-off-by: Chao Peng 
> > ---
> >  Documentation/virt/kvm/api.rst | 23 +++
> >  include/uapi/linux/kvm.h   |  9 +
> >  2 files changed, 32 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f3fa75649a78..975688912b8c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace 
> > should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >  
> > +::
> > +
> > +   /* KVM_EXIT_MEMORY_FAULT */
> > +   struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> > +   __u32 flags;
> > +   __u32 padding;
> > +   __u64 gpa;
> > +   __u64 size;
> > +   } memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set. Otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> 
> What does a shared memory access failure entail?

In the context of confidential computing usages, the guest can issue a
shared memory access while the memory is actually private from the host's
point of view. This exit, with bit 0 cleared, gives userspace a chance to
convert the private memory to shared memory on the host.
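
A minimal sketch of the VMM side of that conversion (the conversion ioctls
here are the KVM_MEMORY_ENCRYPT_(UN)REG_REGION pair this series reuses; the
surrounding vcpu-run loop is only outlined):

  case KVM_EXIT_MEMORY_FAULT: {
          struct kvm_enc_region region = {
                  .addr = run->memory.gpa,
                  .size = run->memory.size,
          };
          /* bit 0 set: guest accessed it as private -> convert to private;
           * bit 0 clear: guest accessed it as shared -> convert to shared. */
          int req = (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
                    KVM_MEMORY_ENCRYPT_REG_REGION :
                    KVM_MEMORY_ENCRYPT_UNREG_REGION;

          ioctl(vm_fd, req, &region);
          break;  /* re-enter the guest so the access is retried */
  }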

> 
> If you envision any other failure modes it might be worth making it
> explicit with additional flags.

Sean mentioned some more usages[1][2] other than the memory conversion for
confidential computing. But I would leave those flags to be added in the
future, after those usages have been well discussed.

[1] https://lkml.kernel.org/r/20200617230052.gb27...@linux.intel.com
[2] https://lore.kernel.org/all/ykxjlcg%2fwompe...@google.com

> I also wonder if a bitmask makes sense if
> there can only be one reason for a failure? Maybe all that is needed is
> a reason enum?

Though we only have one reason right now, we still want to leave room for
future extension. An enum can only express a single value at a time, while
a bitmask makes it possible to express multiple orthogonal flags.

Chao
> 
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The 
> > userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> >  ::
> >  
> >  /* KVM_EXIT_NOTIFY */
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index f1ae45c10c94..fa60b032a405 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >  #define KVM_EXIT_RISCV_SBI35
> >  #define KVM_EXIT_RISCV_CSR36
> >  #define KVM_EXIT_NOTIFY   37
> > +#define KVM_EXIT_MEMORY_FAULT 38
> >  
> >  /* For KVM_EXIT_INTERNAL_ERROR *

Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-15 Thread Chao Peng
On Mon, Nov 14, 2022 at 04:16:32PM -0600, Michael Roth wrote:
> On Mon, Nov 14, 2022 at 06:28:43PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Nov 14, 2022 at 03:02:37PM +0100, Vlastimil Babka wrote:
> > > On 11/1/22 16:19, Michael Roth wrote:
> > > > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > > >> > 
> > > >> >   1) restoring kernel directmap:
> > > >> > 
> > > >> >  Currently SNP (and I believe TDX) need to either split or 
> > > >> > remove kernel
> > > >> >  direct mappings for restricted PFNs, since there is no 
> > > >> > guarantee that
> > > >> >  other PFNs within a 2MB range won't be used for non-restricted
> > > >> >  (which will cause an RMP #PF in the case of SNP since the 2MB
> > > >> >  mapping overlaps with guest-owned pages)
> > > >> 
> > > >> Has the splitting and restoring been a well-discussed direction? I'm
> > > >> just curious whether there is other options to solve this issue.
> > > > 
> > > > For SNP it's been discussed for quite some time, and either splitting or
> > > > removing private entries from directmap are the well-discussed way I'm
> > > > aware of to avoid RMP violations due to some other kernel process using
> > > > a 2MB mapping to access shared memory if there are private pages that
> > > > happen to be within that range.
> > > > 
> > > > In both cases the issue of how to restore directmap as 2M becomes a
> > > > problem.
> > > > 
> > > > I was also under the impression TDX had similar requirements. If so,
> > > > do you know what the plan is for handling this for TDX?
> > > > 
> > > > There are also 2 potential alternatives I'm aware of, but these haven't
> > > > been discussed in much detail AFAIK:
> > > > 
> > > > a) Ensure confidential guests are backed by 2MB pages. shmem has a way 
> > > > to
> > > >request 2MB THP pages, but I'm not sure how reliably we can guarantee
> > > >that enough THPs are available, so if we went that route we'd 
> > > > probably
> > > >be better off requiring the use of hugetlbfs as the backing store. 
> > > > But
> > > >obviously that's a bit limiting and it would be nice to have the 
> > > > option
> > > >of using normal pages as well. One nice thing with invalidation
> > > >scheme proposed here is that this would "Just Work" if implement
> > > >hugetlbfs support, so an admin that doesn't want any directmap
> > > >splitting has this option available, otherwise it's done as a
> > > >best-effort.
> > > > 
> > > > b) Implement general support for restoring directmap as 2M even when
> > > >subpages might be in use by other kernel threads. This would be the
> > > >most flexible approach since it requires no special handling during
> > > >invalidations, but I think it's only possible if all the CPA
> > > >attributes for the 2M range are the same at the time the mapping is
> > > >restored/unsplit, so some potential locking issues there and still
> > > >chance for splitting directmap over time.
> > > 
> > > I've been hoping that
> > > 
> > > c) using a mechanism such as [1] [2] where the goal is to group together
> > > these small allocations that need to increase directmap granularity so
> > > maximum number of large mappings are preserved.
> > 
> > As I mentioned in the other thread the restricted memfd can be backed by
> > secretmem instead of plain memfd. It already handles directmap with care.
> 
> It looks like it would handle direct unmapping/cleanup nicely, but it
> seems to lack fallocate(PUNCH_HOLE) support which we'd probably want to
> avoid additional memory requirements. I think once we added that we'd
> still end up needing some sort of handling for the invalidations.
> 
> Also, I know Chao has been considering hugetlbfs support, I assume by
> leveraging the support that already exists in shmem. Ideally SNP would
> be able to make use of that support as well, but relying on a separate
> backend seems likely to result in more complications getting there
> later.
> 
> > 
> > But I don't think it has to be part of initial restricted memfd
> > implementation. It is SEV-specific requirement and AM

Re: [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory

2022-11-15 Thread Chao Peng
On Mon, Nov 14, 2022 at 04:04:59PM +, Alex Bennée wrote:
> 
> Chao Peng  writes:
> 
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided though a restrictedmem
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> >
> 
> > To make code maintenance easy, internally we use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
> 
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 0d5d4419139a..f1ae45c10c94 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region {
> > __u64 userspace_addr; /* start of the userspace allocated memory */
> >  };
> >  
> > +struct kvm_userspace_memory_region_ext {
> > +   struct kvm_userspace_memory_region region;
> > +   __u64 restricted_offset;
> > +   __u32 restricted_fd;
> > +   __u32 pad1;
> > +   __u64 pad2[14];
> > +};
> > +
> > +#ifdef __KERNEL__
> > +/*
> > + * kvm_user_mem_region is a kernel-only alias of 
> > kvm_userspace_memory_region_ext
> > + * that "unpacks" kvm_userspace_memory_region so that KVM can directly 
> > access
> > + * all fields from the top-level "extended" region.
> > + */
> > +struct kvm_user_mem_region {
> > +   __u32 slot;
> > +   __u32 flags;
> > +   __u64 guest_phys_addr;
> > +   __u64 memory_size;
> > +   __u64 userspace_addr;
> > +   __u64 restricted_offset;
> > +   __u32 restricted_fd;
> > +   __u32 pad1;
> > +   __u64 pad2[14];
> > +};
> > +#endif
> 
> I'm not sure I buy the argument this makes the code maintenance easier
> because you now have multiple places to update if you extend the field.
> Was this simply to avoid changing:
> 
>   foo->slot to foo->region.slot
> 
> in the underlying code?

That is one of the reasons. By doing this we can also avoid the confusion
of dealing with the '_ext' and the 'base' struct in different functions
spread across KVM code. No doubt I now need to update every place where
the 'base' struct is being used, but that makes future maintenance easier,
e.g. adding another new field, or even extending the memslot structure
again, would just require changes to the flat struct here and the places
where the new field is actually used.
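
For illustration, roughly the kind of uniformity this buys in the ioctl
handler (a sketch only; the details of the real patch are approximated):

  struct kvm_user_mem_region mem;
  unsigned long size;
  u32 flags;

  /* Peek at flags first to know which layout userspace passed. */
  if (get_user(flags, (u32 __user *)(argp + offsetof(typeof(mem), flags))))
          return -EFAULT;

  if (flags & KVM_MEM_PRIVATE)
          size = sizeof(struct kvm_userspace_memory_region_ext);
  else
          size = sizeof(struct kvm_userspace_memory_region);

  if (copy_from_user(&mem, argp, size))
          return -EFAULT;

  /* From here on there is only one flat layout to reason about. */
  r = kvm_vm_ioctl_set_memory_region(kvm, &mem);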

> 
> > +
> >  /*
> >   * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for 
> > userspace,
> >   * other bits are reserved for kvm internal use which are defined in
> > @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region {
> >   */
> >  #define KVM_MEM_LOG_DIRTY_PAGES(1UL << 0)
> >  #define KVM_MEM_READONLY   (1UL << 1)
> > +#define KVM_MEM_PRIVATE(1UL << 2)
> >  
> >  /* for KVM_IRQ_LINE */
> >  struct kvm_irq_level {
> > @@ -1178,6 +1206,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_S390_ZPCI_OP 221
> >  #define KVM_CAP_S390_CPU_TOPOLOGY 222
> >  #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
> > +#define KVM_CAP_PRIVATE_MEM 224
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> > index 800f9470e36b..9ff164c7e0cc 100644
> > --- a/virt/kvm/Kconfig
> > +++ b/virt/kvm/Kconfig
> > @@ -86,3 +86,6 @@ config KVM_XFER_TO_GUEST_WORK
> >  
> >  config HAVE_KVM_PM_NOTIFIER
> > bool
> > +
> > +config HAVE_KVM_RESTRICTED_MEM
> > +   bool
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index e30f1b4ecfa5..8dace78a0278 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1526,7 +1526,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
> > }
> >  }
> >  
> > -static int check_memory_region_flags(const struct 
> > kvm_userspace_memory_region *mem)
> > +static int check_memory_region_flags(const struct kvm_user_mem_region *mem)
> >  {
> > u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> >  
> > @@ -1920,7 +1920,7 @@ static bool kvm_check_memslot_overlap(struct 
> > kvm_memslots *slots, int id,
> >   * Must be called holding kvm->slots_lock for write.
> >   */
> >  int __kvm_set_memory_region

Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry

2022-11-11 Thread Chao Peng
On Thu, Nov 10, 2022 at 08:06:33PM +, Sean Christopherson wrote:
> On Tue, Oct 25, 2022, Chao Peng wrote:
> > @@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct 
> > mmu_notifier *mn,
> > kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
> >  }
> >  
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > - unsigned long end)
> > +static inline
> 
> Don't tag static functions with "inline" unless they're in headers, in which 
> case
> the inline is effectively required.  In pretty much every scenario, the 
> compiler
> can do a better job of optimizing inline vs. non-inline, i.e. odds are very 
> good
> the compiler would inline this helper anyways, and if not, there would likely 
> be
> a good reason not to inline it.

Yep, I know the rationale behind it; I made a mistake.

> 
> It'll be a moot point in this case (more below), but this would also reduce 
> the
> line length and avoid the wrap.
> 
> > void update_invalidate_range(struct kvm *kvm, gfn_t start,
> > +   gfn_t end)
> 
> I appreciate the effort to make this easier to read, but making such a big 
> divergence
> from the kernel's preferred formatting is often counter-productive, e.g. I 
> blinked a
> few times when first reading this code.
> 
> Again, moot point this time (still below ;-) ), but for future reference, 
> better
> options are to either let the line poke out or simply wrap early to get the
> bundling of parameters that you want, e.g.
> 
>   static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, 
> gfn_t end)
> 
> or 
> 
>   static inline void update_invalidate_range(struct kvm *kvm,
>gfn_t start, gfn_t end)

Fully agreed.

> 
> >  {
> > -   /*
> > -* The count increase must become visible at unlock time as no
> > -* spte can be established without taking the mmu_lock and
> > -* count is also read inside the mmu_lock critical section.
> > -*/
> > -   kvm->mmu_invalidate_in_progress++;
> > if (likely(kvm->mmu_invalidate_in_progress == 1)) {
> > kvm->mmu_invalidate_range_start = start;
> > kvm->mmu_invalidate_range_end = end;
> > @@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, 
> > unsigned long start,
> > }
> >  }
> >  
> > +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, 
> > gfn_t end)
> 
> Splitting the helpers this way yields a weird API overall, e.g. it's possible
> (common, actually) to have an "end" without a "begin".
> 
> Taking the range in the "end" is also dangerous/misleading/imbalanced, 
> because _if_
> there are multiple ranges in a batch, each range would need to be unwound
> independently, e.g. the invocation of the "end" helper in
> kvm_mmu_notifier_invalidate_range_end() is flat out wrong, it just doesn't 
> cause
> problems because KVM doesn't (currently) try to unwind regions (and probably 
> never
> will, but that's beside the point).

I actually also don't feel good about the existing code (taking the range
in both "start" and "end"), but I didn't go further to find a better solution.

> 
> Rather than shunt what is effectively the "begin" into a separate helper, 
> provide
> three separate APIs, e.g. begin, range_add, end.  That way, begin+end don't 
> take a
> range and thus are symmetrical, always paired, and can't screw up unwinding 
> since
> they don't have a range to unwind.

This looks much better to me.
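
A sketch of how the three calls would pair up in a notifier (assuming the
middle helper ends up named something like kvm_mmu_invalidate_range_add()):

  kvm_mmu_invalidate_begin(kvm);
  kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
  /* ... further ranges in the same batch add themselves here ... */
  kvm_unmap_gfn_range(kvm, range);
  ...
  kvm_mmu_invalidate_end(kvm);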

> 
> It'll require three calls in every case, but that's not the end of the world 
> since
> none of these flows are super hot paths.
> 
> > +{
> > +   /*
> > +* The count increase must become visible at unlock time as no
> > +* spte can be established without taking the mmu_lock and
> > +* count is also read inside the mmu_lock critical section.
> > +*/
> > +   kvm->mmu_invalidate_in_progress++;
> 
> This should invalidate (ha!) mmu_invalidate_range_{start,end}, and then WARN 
> in
> mmu_invalidate_retry() if the range isn't valid.  And the "add" helper should
> WARN if mmu_invalidate_in_progress == 0.
> 
> > +}
> > +
> > +static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range 
> > *range)
> 
> "handle" is wy too generic.  Just match kvm_unmap_gfn_range() and call it
> kvm_mmu_unmap_gfn_range().  This is a local function so it's unlikely to 

Re: [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed

2022-11-08 Thread Chao Peng
On Tue, Nov 08, 2022 at 08:08:05PM +0800, Yuan Yao wrote:
> On Tue, Oct 25, 2022 at 11:13:42PM +0800, Chao Peng wrote:
> > When private/shared memory are mixed in a large page, the lpage_info may
> > not be accurate and should be updated with this mixed info. A large page
> > has mixed pages can't be really mapped as large page since its
> > private/shared pages are from different physical memory.
> >
> > Update lpage_info when private/shared memory attribute is changed. If
> > both private and shared pages are within a large page region, it can't
> > be mapped as large page. It's a bit challenge to track the mixed
> > info in a 'count' like variable, this patch instead reserves a bit in
> > 'disallow_lpage' to indicate a large page has mixed private/share pages.
> >
> > Signed-off-by: Chao Peng 
> > ---
> >  arch/x86/include/asm/kvm_host.h |   8 +++
> >  arch/x86/kvm/mmu/mmu.c  | 112 +++-
> >  arch/x86/kvm/x86.c  |   2 +
> >  include/linux/kvm_host.h|  19 ++
> >  virt/kvm/kvm_main.c |  16 +++--
> >  5 files changed, 152 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 7551b6f9c31c..db811a54e3fd 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -37,6 +37,7 @@
> >  #include 
> >
> >  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> > +#define __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
> >
> >  #define KVM_MAX_VCPUS 1024
> >
> > @@ -952,6 +953,13 @@ struct kvm_vcpu_arch {
> >  #endif
> >  };
> >
> > +/*
> > + * Use a bit in disallow_lpage to indicate private/shared pages mixed at 
> > the
> > + * level. The remaining bits are used as a reference count.
> > + */
> > +#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
> > +#define KVM_LPAGE_COUNT_MAX((1U << 31) - 1)
> > +
> >  struct kvm_lpage_info {
> > int disallow_lpage;
> >  };
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 33b1aec44fb8..67a9823a8c35 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -762,11 +762,16 @@ static void update_gfn_disallow_lpage_count(const 
> > struct kvm_memory_slot *slot,
> >  {
> > struct kvm_lpage_info *linfo;
> > int i;
> > +   int disallow_count;
> >
> > for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> > linfo = lpage_info_slot(gfn, slot, i);
> > +
> > +   disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> > +   WARN_ON(disallow_count + count < 0 ||
> > +   disallow_count > KVM_LPAGE_COUNT_MAX - count);
> > +
> > linfo->disallow_lpage += count;
> > -   WARN_ON(linfo->disallow_lpage < 0);
> > }
> >  }
> >
> > @@ -6910,3 +6915,108 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > if (kvm->arch.nx_lpage_recovery_thread)
> > kthread_stop(kvm->arch.nx_lpage_recovery_thread);
> >  }
> > +
> > +static inline bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> > +{
> > +   return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static inline void linfo_update_mixed(struct kvm_lpage_info *linfo, bool 
> > mixed)
> > +{
> > +   if (mixed)
> > +   linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +   else
> > +   linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static bool mem_attr_is_mixed_2m(struct kvm *kvm, unsigned int attr,
> > +gfn_t start, gfn_t end)
> > +{
> > +   XA_STATE(xas, &kvm->mem_attr_array, start);
> > +   gfn_t gfn = start;
> > +   void *entry;
> > +   bool shared = attr == KVM_MEM_ATTR_SHARED;
> > +   bool mixed = false;
> > +
> > +   rcu_read_lock();
> > +   entry = xas_load(&xas);
> > +   while (gfn < end) {
> > +   if (xas_retry(&xas, entry))
> > +   continue;
> > +
> > +   KVM_BUG_ON(gfn != xas.xa_index, kvm);
> > +
> > +   if ((entry && !shared) || (!entry && shared)) {
> > +   mixed = true;
> > +   goto out;
> > +   }
> > +
> > +   entry = xas_next(&xas);
> >

Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

2022-11-08 Thread Chao Peng
On Tue, Nov 08, 2022 at 09:35:06AM +0800, Yuan Yao wrote:
> On Tue, Oct 25, 2022 at 11:13:41PM +0800, Chao Peng wrote:
> > Introduce generic private memory register/unregister by reusing existing
> > SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from SEV case
> > by treating address in the region as gpa instead of hva. Which cases
> > should these ioctls go is determined by the kvm_arch_has_private_mem().
> > Architecture which supports KVM_PRIVATE_MEM should override this function.
> >
> > KVM internally defaults all guest memory as private memory and maintain
> > the shared memory in 'mem_attr_array'. The above ioctls operate on this
> > field and unmap existing mappings if any.
> >
> > Signed-off-by: Chao Peng 
> > ---
> >  Documentation/virt/kvm/api.rst |  17 ++-
> >  arch/x86/kvm/Kconfig   |   1 +
> >  include/linux/kvm_host.h   |  10 +-
> >  virt/kvm/Kconfig   |   4 +
> >  virt/kvm/kvm_main.c| 227 +
> >  5 files changed, 198 insertions(+), 61 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 975688912b8c..08253cf498d1 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -4717,10 +4717,19 @@ 
> > Documentation/virt/kvm/x86/amd-memory-encryption.rst.
> >  This ioctl can be used to register a guest memory region which may
> >  contain encrypted data (e.g. guest RAM, SMRAM etc).
> >
> > -It is used in the SEV-enabled guest. When encryption is enabled, a guest
> > -memory region may contain encrypted data. The SEV memory encryption
> > -engine uses a tweak such that two identical plaintext pages, each at
> > -different locations will have differing ciphertexts. So swapping or
> > +Currently this ioctl supports registering memory regions for two usages:
> > +private memory and SEV-encrypted memory.
> > +
> > +When private memory is enabled, this ioctl is used to register guest 
> > private
> > +memory region and the addr/size of kvm_enc_region represents guest physical
> > +address (GPA). In this usage, this ioctl zaps the existing guest memory
> > +mappings in KVM that fallen into the region.
> > +
> > +When SEV-encrypted memory is enabled, this ioctl is used to register guest
> > +memory region which may contain encrypted data for a SEV-enabled guest. The
> > +addr/size of kvm_enc_region represents userspace address (HVA). The SEV
> > +memory encryption engine uses a tweak such that two identical plaintext 
> > pages,
> > +each at different locations will have differing ciphertexts. So swapping or
> >  moving ciphertext of those pages will not result in plaintext being
> >  swapped. So relocating (or migrating) physical backing pages for the SEV
> >  guest will require some additional steps.
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index 8d2bd455c0cd..73fdfa429b20 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -51,6 +51,7 @@ config KVM
> > select HAVE_KVM_PM_NOTIFIER if PM
> > select HAVE_KVM_RESTRICTED_MEM if X86_64
> > select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
> > +   select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM
> > help
> >   Support hosting fully virtualized guest machines using hardware
> >   virtualization extensions.  You will need a fairly recent
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 79e5cbc35fcf..4ce98fa0153c 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t 
> > cr2_or_gpa,
> >  int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> >  #endif
> >
> > -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> > +
> > +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || 
> > defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
> >  struct kvm_gfn_range {
> > struct kvm_memory_slot *slot;
> > gfn_t start;
> > @@ -254,6 +255,9 @@ struct kvm_gfn_range {
> > bool may_block;
> >  };
> >  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> > +#endif
> > +
> > +#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
> >  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> >  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> > @@ -794,6 +798,9 @@ struct kvm {
> > stru

Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

2022-11-08 Thread Chao Peng
On Fri, Nov 04, 2022 at 09:19:31PM +, Sean Christopherson wrote:
> Paolo, any thoughts before I lead things further astray?
> 
> On Fri, Nov 04, 2022, Chao Peng wrote:
> > On Thu, Nov 03, 2022 at 11:04:53PM +, Sean Christopherson wrote:
> > > On Tue, Oct 25, 2022, Chao Peng wrote:
> > > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > > > r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > > > break;
> > > > }
> > > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > > +   case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > > +   case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > > 
> > > I'm having second thoughts about usurping 
> > > KVM_MEMORY_ENCRYPT_(UN)REG_REGION.  Aside
> > > from the fact that restricted/protected memory may not be encrypted, 
> > > there are
> > > other potential use cases for per-page memory attributes[*], e.g. to make 
> > > memory
> > > read-only (or no-exec, or exec-only, etc...) without having to modify 
> > > memslots.
> > > 
> > > Any paravirt use case where the attributes of a page are effectively 
> > > dictated by
> > > the guest is going to run into the exact same performance problems with 
> > > memslots,
> > > which isn't suprising in hindsight since shared vs. private is really 
> > > just an
> > > attribute, albeit with extra special semantics.
> > > 
> > > And if we go with a brand new ioctl(), maybe someday in the very distant 
> > > future
> > > we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION.
> > > 
> > > Switching to a new ioctl() should be a minor change, i.e. shouldn't throw 
> > > too big
> > > of a wrench into things.
> > > 
> > > Something like:
> > > 
> > >   KVM_SET_MEMORY_ATTRIBUTES
> > > 
> > >   struct kvm_memory_attributes {
> > >   __u64 address;
> > >   __u64 size;
> > >   __u64 flags;
> 
> Oh, this is half-baked.  I lost track of which flags were which.  What I 
> intended
> was a separate, initially-unused flags, e.g.

That makes sense.

> 
>  struct kvm_memory_attributes {
>   __u64 address;
>   __u64 size;
>   __u64 attributes;
>   __u64 flags;
>   }
> 
> so that KVM can tweak behavior and/or extend the effective size of the struct.
> 
> > I like the idea of adding a new ioctl(). But putting all attributes into
> > a flags in uAPI sounds not good to me, e.g. forcing userspace to set all
> > attributes in one call can cause pain for userspace, probably for KVM
> > implementation as well. For private<->shared memory conversion, we
> > actually only care the KVM_MEM_ATTR_SHARED or KVM_MEM_ATTR_PRIVATE bit,
> 
> Not necessarily, e.g. I can see pKVM wanting to convert from RW+PRIVATE => 
> RO+SHARED
> or even RW+PRIVATE => NONE+SHARED so that the guest can't write/access the 
> memory
> while it's accessible from the host.
> 
> And if this does extend beyond shared/private, dropping from RWX=>R, i.e. 
> dropping
> WX permissions, would also be a common operation.
> 
> Hmm, typing that out makes me think that if we do end up supporting other 
> "attributes",
> i.e. protections, we should go straight to full RWX protections instead of 
> doing
> things piecemeal, i.e. add individual protections instead of combinations like
> NO_EXEC and READ_ONLY.  The protections would have to be inverted for 
> backwards
> compatibility, but that's easy enough to handle.  The semantics could be like
> protection keys, which also have inverted persmissions, where the final 
> protections
> are the combination of memslot+attributes, i.e. a read-only memslot couldn't 
> be made
> writable via attributes.
> 
> E.g. userspace could do "NO_READ | NO_WRITE | NO_EXEC" to temporarily block 
> access
> to memory without needing to delete the memslot.  KVM would need to disallow
> unsupported combinations, e.g. disallowed effective protections would be:
> 
>   - W or WX [unless there's an arch that supports write-only memory]
>   - R or RW [until KVM plumbs through support for no-exec, or it's 
> unsupported in hardware]
>   - X   [until KVM plumbs through support for exec-only, or it's 
> unsupported in hardware]
> 
> Anyways, that's all future work...
> 
> > but we force userspace to set other irrelevant bits as well if use this
> > API.
> 
> They aren't irrelevant though, as the memory attributes are all describing the
> allowed protectio

Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry

2022-11-07 Thread Chao Peng
On Fri, Nov 04, 2022 at 10:29:48PM +, Sean Christopherson wrote:
> On Fri, Nov 04, 2022, Chao Peng wrote:
> > On Thu, Oct 27, 2022 at 11:29:14AM +0100, Fuad Tabba wrote:
> > > Hi,
> > > 
> > > On Tue, Oct 25, 2022 at 4:19 PM Chao Peng  
> > > wrote:
> > > >
> > > > Currently in mmu_notifier validate path, hva range is recorded and then
> > > > checked against in the mmu_notifier_retry_hva() of the page fault path.
> > > > However, for the to be introduced private memory, a page fault may not
> > > > have a hva associated, checking gfn(gpa) makes more sense.
> > > >
> > > > For existing non private memory case, gfn is expected to continue to
> > > > work. The only downside is when aliasing multiple gfns to a single hva,
> > > > the current algorithm of checking multiple ranges could result in a much
> > > > larger range being rejected. Such aliasing should be uncommon, so the
> > > > impact is expected small.
> > > >
> > > > It also fixes a bug in kvm_zap_gfn_range() which has already been using
> > > 
> > > nit: Now it's kvm_unmap_gfn_range().
> > 
> > Forgot to mention: the bug is still with kvm_zap_gfn_range(). It calls
> > kvm_mmu_invalidate_begin/end with a gfn range but before this series
> > kvm_mmu_invalidate_begin/end actually accept a hva range. Note it's
> > unrelated to whether we use kvm_zap_gfn_range() or kvm_unmap_gfn_range()
> > in the following patch (patch 05).
> 
> Grr, in the future, if you find an existing bug, please send a patch.  At the
> very least, report the bug.

Agreed, this can be sent out separately from this series.

> The APICv case that this was added for could very
> well be broken because of this, and the resulting failures would be an 
> absolute
> nightmare to debug.

Given that apicv_inhibit should be rare, the change looks good to me.
Just to be clear, you will send out this fix, right?

Chao

> 
> Compile tested only...
> 
> --
> From: Sean Christopherson 
> Date: Fri, 4 Nov 2022 22:20:33 +
> Subject: [PATCH] KVM: x86/mmu: Block all page faults during
>  kvm_zap_gfn_range()
> 
> When zapping a GFN range, pass 0 => ALL_ONES for the to-be-invalidated
> range to effectively block all page faults while the zap is in-progress.
> The invalidation helpers take a host virtual address, whereas zapping a
> GFN obviously provides a guest physical address and with the wrong unit
> of measurement (frame vs. byte).
> 
> Alternatively, KVM could walk all memslots to get the associated HVAs,
> but thanks to SMM, that would require multiple lookups.  And practically
> speaking, kvm_zap_gfn_range() usage is quite rare and not a hot path,
> e.g. MTRR and CR0.CD are almost guaranteed to be done only on vCPU0
> during boot, and APICv inhibits are similarly infrequent operations.
> 
> Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in 
> kvm_zap_gfn_range")
> Cc: sta...@vger.kernel.org
> Cc: Maxim Levitsky 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/mmu/mmu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f81539061d6..1ccb769f62af 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6056,7 +6056,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t 
> gfn_start, gfn_t gfn_end)
>  
>   write_lock(&kvm->mmu_lock);
>  
> - kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
> + kvm_mmu_invalidate_begin(kvm, 0, -1ul);
>  
>   flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>  
> @@ -6070,7 +6070,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t 
> gfn_start, gfn_t gfn_end)
>   kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
>  gfn_end - gfn_start);
>  
> - kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
> + kvm_mmu_invalidate_end(kvm, 0, -1ul);
>  
>   write_unlock(&kvm->mmu_lock);
>  }
> 
> base-commit: c12879206e47730ff5ab255bbf625b28ade4028f
> -- 



Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

2022-11-04 Thread Chao Peng
On Thu, Nov 03, 2022 at 11:04:53PM +, Sean Christopherson wrote:
> On Tue, Oct 25, 2022, Chao Peng wrote:
> > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
> > break;
> > }
> > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > +   case KVM_MEMORY_ENCRYPT_REG_REGION:
> > +   case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> 
> I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION.  
> Aside
> from the fact that restricted/protected memory may not be encrypted, there are
> other potential use cases for per-page memory attributes[*], e.g. to make 
> memory
> read-only (or no-exec, or exec-only, etc...) without having to modify 
> memslots.
> 
> Any paravirt use case where the attributes of a page are effectively dictated 
> by
> the guest is going to run into the exact same performance problems with 
> memslots,
> which isn't suprising in hindsight since shared vs. private is really just an
> attribute, albeit with extra special semantics.
> 
> And if we go with a brand new ioctl(), maybe someday in the very distant 
> future
> we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION.
> 
> Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too 
> big
> of a wrench into things.
> 
> Something like:
> 
>   KVM_SET_MEMORY_ATTRIBUTES
> 
>   struct kvm_memory_attributes {
>   __u64 address;
>   __u64 size;
>   __u64 flags;
>   }

I like the idea of adding a new ioctl(). But putting all attributes into a
single flags field in the uAPI does not sound good to me, e.g. forcing
userspace to set all attributes in one call can cause pain for userspace,
and probably for the KVM implementation as well. For private<->shared
memory conversion, we actually only care about the KVM_MEM_ATTR_SHARED or
KVM_MEM_ATTR_PRIVATE bit, but we would force userspace to set other
irrelevant bits as well if we used this API.

I looked at kvm_device_attr, sounds we can do similar:

  KVM_SET_MEMORY_ATTR

  struct kvm_memory_attr {
__u64 address;
__u64 size;
#define KVM_MEM_ATTR_SHARED BIT(0)
#define KVM_MEM_ATTR_READONLY   BIT(1)
#define KVM_MEM_ATTR_NOEXEC BIT(2)
__u32 attr;
__u32 pad;
  }

I'm not sure if we need KVM_GET_MEMORY_ATTR/KVM_HAS_MEMORY_ATTR as well,
but it sounds like we need a KVM_UNSET_MEMORY_ATTR.

Since we are exposing the attribute directly to userspace, I also think
we'd better treat shared memory as the default, so that even when private
memory is not used, the bit can still be meaningful. So define BIT(0) as
KVM_MEM_ATTR_PRIVATE instead of KVM_MEM_ATTR_SHARED.
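
With that, converting a range would be a single, self-describing call from
userspace, e.g. (a sketch using the PRIVATE naming suggested above; the
ioctl name follows my proposal and is not final):

  struct kvm_memory_attr attr = {
          .address = gpa,
          .size    = size,
          .attr    = KVM_MEM_ATTR_PRIVATE,  /* clear/unset it to go back to shared */
  };

  ioctl(vm_fd, KVM_SET_MEMORY_ATTR, &attr);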

Thanks,
Chao

> 
> [*] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com
> 
> > +   struct kvm_enc_region region;
> > +   bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > +
> > +   if (!kvm_arch_has_private_mem(kvm))
> > +   goto arch_vm_ioctl;
> > +
> > +   r = -EFAULT;
> > +   if (copy_from_user(&region, argp, sizeof(region)))
> > +   goto out;
> > +
> > +   r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> > + region.size, set);
> > +   break;
> > +   }
> > +#endif



Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry

2022-11-03 Thread Chao Peng
On Thu, Oct 27, 2022 at 11:29:14AM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Tue, Oct 25, 2022 at 4:19 PM Chao Peng  wrote:
> >
> > Currently in mmu_notifier validate path, hva range is recorded and then
> > checked against in the mmu_notifier_retry_hva() of the page fault path.
> > However, for the to be introduced private memory, a page fault may not
> > have a hva associated, checking gfn(gpa) makes more sense.
> >
> > For existing non private memory case, gfn is expected to continue to
> > work. The only downside is when aliasing multiple gfns to a single hva,
> > the current algorithm of checking multiple ranges could result in a much
> > larger range being rejected. Such aliasing should be uncommon, so the
> > impact is expected small.
> >
> > It also fixes a bug in kvm_zap_gfn_range() which has already been using
> 
> nit: Now it's kvm_unmap_gfn_range().

Forgot to mention: the bug is still with kvm_zap_gfn_range(). It calls
kvm_mmu_invalidate_begin/end with a gfn range but before this series
kvm_mmu_invalidate_begin/end actually accept a hva range. Note it's
unrelated to whether we use kvm_zap_gfn_range() or kvm_unmap_gfn_range()
in the following patch (patch 05).

Thanks,
Chao
> 
> > gfn when calling kvm_mmu_invalidate_begin/end() while these functions
> > accept hva in current code.
> >
> > Signed-off-by: Chao Peng 
> > ---
> 
> Based on reading this code and my limited knowledge of the x86 MMU code:
> Reviewed-by: Fuad Tabba 
> 
> Cheers,
> /fuad
> 
> 
> >  arch/x86/kvm/mmu/mmu.c   |  2 +-
> >  include/linux/kvm_host.h | 18 +++-
> >  virt/kvm/kvm_main.c  | 45 ++--
> >  3 files changed, 39 insertions(+), 26 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 6f81539061d6..33b1aec44fb8 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4217,7 +4217,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
> > return true;
> >
> > return fault->slot &&
> > -  mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
> > +  mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
> >  }
> >
> >  static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault 
> > *fault)
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 739a7562a1f3..79e5cbc35fcf 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -775,8 +775,8 @@ struct kvm {
> > struct mmu_notifier mmu_notifier;
> > unsigned long mmu_invalidate_seq;
> > long mmu_invalidate_in_progress;
> > -   unsigned long mmu_invalidate_range_start;
> > -   unsigned long mmu_invalidate_range_end;
> > +   gfn_t mmu_invalidate_range_start;
> > +   gfn_t mmu_invalidate_range_end;
> >  #endif
> > struct list_head devices;
> > u64 manual_dirty_log_protect;
> > @@ -1365,10 +1365,8 @@ void kvm_mmu_free_memory_cache(struct 
> > kvm_mmu_memory_cache *mc);
> >  void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >  #endif
> >
> > -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> > - unsigned long end);
> > -void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
> > -   unsigned long end);
> > +void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end);
> > +void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end);
> >
> >  long kvm_arch_dev_ioctl(struct file *filp,
> > unsigned int ioctl, unsigned long arg);
> > @@ -1937,9 +1935,9 @@ static inline int mmu_invalidate_retry(struct kvm 
> > *kvm, unsigned long mmu_seq)
> > return 0;
> >  }
> >
> > -static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
> > +static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
> >unsigned long mmu_seq,
> > -  unsigned long hva)
> > +  gfn_t gfn)
> >  {
> > lockdep_assert_held(&kvm->mmu_lock);
> > /*
> > @@ -1949,8 +1947,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm 
> > *kvm,
> >  * positives, due to shortcuts when handing concurrent 
> > invalidations.
> >  */
> > if 

Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-02 Thread Chao Peng
On Tue, Nov 01, 2022 at 02:30:58PM -0500, Michael Roth wrote:
> On Tue, Nov 01, 2022 at 10:19:44AM -0500, Michael Roth wrote:
> > On Tue, Nov 01, 2022 at 07:37:29PM +0800, Chao Peng wrote:
> > > On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> > > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > > 
> > > > 
> > > >   3) Potentially useful for hugetlbfs support:
> > > > 
> > > >  One issue with hugetlbfs is that we don't support splitting the
> > > >  hugepage in such cases, which was a big obstacle prior to UPM. Now
> > > >  however, we may have the option of doing "lazy" invalidations where
> > > >  fallocate(PUNCH_HOLE, ...) won't free a shmem-allocate page unless
> > > >  all the subpages within the 2M range are either hole-punched, or 
> > > > the
> > > >  guest is shut down, so in that way we never have to split it. Sean
> > > >  was pondering something similar in another thread:
> > > > 
> > > >
> > > > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flinux-mm%2FYyGLXXkFCmxBfu5U%40google.com%2Fdata=05%7C01%7CMichael.Roth%40amd.com%7C28ba5dbb51844f910dec08dabc1c99e6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638029128345507924%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7Csdata=bxcRfuJIgo1Z1G8HQ800HscE6y7RXRQwvWSkfc5M8Bs%3Dreserved=0
> > > > 
> > > >  Issuing invalidations with folio-granularity ties in fairly well
> > > >  with this sort of approach if we end up going that route.
> > > 
> > > There is semantics difference between the current one and the proposed
> > > one: The invalidation range is exactly what userspace passed down to the
> > > kernel (being fallocated) while the proposed one will be subset of that
> > > (if userspace-provided addr/size is not aligned to power of two), I'm
> > > not quite confident this difference has no side effect.
> > 
> > In theory userspace should not be allocating/hole-punching restricted
> > pages for GPA ranges that are already mapped as private in the xarray,
> > and KVM could potentially fail such requests (though it does currently).
> > 
> > But if we somehow enforced that, then we could rely on
> > KVM_MEMORY_ENCRYPT_REG_REGION to handle all the MMU invalidation stuff,
> > which would free up the restricted fd invalidation callbacks to be used
> > purely to handle doing things like RMP/directmap fixups prior to returning
> > restricted pages back to the host. So that was sort of my thinking why the
> > new semantics would still cover all the necessary cases.
> 
> Sorry, this explanation is if we rely on userspace to fallocate() on 2MB
> boundaries, and ignore any non-aligned requests in the kernel. But
> that's not how I actually ended up implementing things, so I'm not sure
> why answered that way...
> 
> In my implementation we actually do issue invalidations for fallocate()
> even for non-2M-aligned GPA/offset ranges. For instance (assuming
> restricted FD offset 0 corresponds to GPA 0), an fallocate() on GPA
> range 0x1000-0x402000 would result in the following invalidations being
> issued if everything was backed by a 2MB page:
> 
>   invalidate GPA: 0x001000-0x20, Page: pfn_to_page(I), order:9
>   invalidate GPA: 0x20-0x40, Page: pfn_to_page(J), order:9
>   invalidate GPA: 0x40-0x402000, Page: pfn_to_page(K), order:9

Only after seeing this do I understand what you are actually going to propose ;)

So the memory range (start/end) will still be there and covers exactly
what it should from the userspace point of view; the page+order (or just
folio) is really just a _hint_ for the invalidation callbacks. Looks
ugly though.
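
Roughly, the hint being discussed would extend only the start callback,
e.g. (the callback names follow the restrictedmem interface in this series;
the added page/order arguments are the assumption under discussion):

  struct restrictedmem_notifier_ops {
          void (*invalidate_start)(struct restrictedmem_notifier *notifier,
                                   pgoff_t start, pgoff_t end,
                                   struct page *page, unsigned int order);
          void (*invalidate_end)(struct restrictedmem_notifier *notifier,
                                 pgoff_t start, pgoff_t end);
  };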

In v9 we use a invalidate_start/ invalidate_end pair to solve a race
contention issue(https://lore.kernel.org/kvm/y1loe4jvntbfn...@google.com/).
To work with this, I believe we only need pass this hint info for
invalidate_start() since at the invalidate_end() time, the page has
already been discarded.

Another thing worth mentioning is that invalidate_start/end is not only
invoked for hole punching, but also for allocation (e.g. a default
fallocate). For allocation we can get the page only at invalidate_end()
time. But AFAICS, the reason invalidate() is called for fallocate
(allocation) is that previously we relied on a page's existence in the
memory backing store to tell whether it is private, and we needed to
notify KVM that the page was being converted from shared to private.
That is not true for the current code, and fallocate() is also not
mandatory since KVM can c

Re: [PATCH v9 7/8] KVM: Handle page fault for private memory

2022-11-01 Thread Chao Peng
On Mon, Oct 31, 2022 at 05:02:50PM -0700, Isaku Yamahata wrote:
> On Fri, Oct 28, 2022 at 02:55:45PM +0800,
> Chao Peng  wrote:
> 
> > On Wed, Oct 26, 2022 at 02:54:25PM -0700, Isaku Yamahata wrote:
> > > On Tue, Oct 25, 2022 at 11:13:43PM +0800,
> > > Chao Peng  wrote:
> > > 
> > > > A memslot with KVM_MEM_PRIVATE being set can include both fd-based
> > > > private memory and hva-based shared memory. Architecture code (like TDX
> > > > code) can tell whether the on-going fault is private or not. This patch
> > > > adds a 'is_private' field to kvm_page_fault to indicate this and
> > > > architecture code is expected to set it.
> > > > 
> > > > To handle page fault for such memslot, the handling logic is different
> > > > depending on whether the fault is private or shared. KVM checks if
> > > > 'is_private' matches the host's view of the page (maintained in
> > > > mem_attr_array).
> > > >   - For a successful match, private pfn is obtained with
> > > > restrictedmem_get_page () from private fd and shared pfn is obtained
> > > > with existing get_user_pages().
> > > >   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > > >     userspace. Userspace then can convert memory between private/shared
> > > > in host's view and retry the fault.
> > > > 
> > > > Co-developed-by: Yu Zhang 
> > > > Signed-off-by: Yu Zhang 
> > > > Signed-off-by: Chao Peng 
> > > > ---
> > > >  arch/x86/kvm/mmu/mmu.c  | 56 +++--
> > > >  arch/x86/kvm/mmu/mmu_internal.h | 14 -
> > > >  arch/x86/kvm/mmu/mmutrace.h |  1 +
> > > >  arch/x86/kvm/mmu/spte.h |  6 
> > > >  arch/x86/kvm/mmu/tdp_mmu.c  |  3 +-
> > > >  include/linux/kvm_host.h| 28 +
> > > >  6 files changed, 103 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 67a9823a8c35..10017a9f26ee 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -3030,7 +3030,7 @@ static int host_pfn_mapping_level(struct kvm 
> > > > *kvm, gfn_t gfn,
> > > >  
> > > >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > >   const struct kvm_memory_slot *slot, gfn_t 
> > > > gfn,
> > > > - int max_level)
> > > > + int max_level, bool is_private)
> > > >  {
> > > > struct kvm_lpage_info *linfo;
> > > > int host_level;
> > > > @@ -3042,6 +3042,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > > > break;
> > > > }
> > > >  
> > > > +   if (is_private)
> > > > +   return max_level;
> > > 
> > > Below PG_LEVEL_NUM is passed by zap_collapsible_spte_range().  It doesn't 
> > > make
> > > sense.
> > > 
> > > > +
> > > > if (max_level == PG_LEVEL_4K)
> > > > return PG_LEVEL_4K;
> > > >  
> > > > @@ -3070,7 +3073,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu 
> > > > *vcpu, struct kvm_page_fault *fault
> > > >  * level, which will be used to do precise, accurate accounting.
> > > >  */
> > > > fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > > > -fault->gfn, 
> > > > fault->max_level);
> > > > +fault->gfn, 
> > > > fault->max_level,
> > > > +fault->is_private);
> > > > if (fault->req_level == PG_LEVEL_4K || 
> > > > fault->huge_page_disallowed)
> > > > return;
> > > >  
> > > > @@ -4141,6 +4145,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu 
> > > > *vcpu, struct kvm_async_pf *work)
> > > > kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> > > >  }
> > > >  
> > > > +static inline u8 order_to_level(int order)
> > > > +{
> > > > +   BUILD_BUG_ON(KVM_MAX

Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-11-01 Thread Chao Peng
On Mon, Oct 31, 2022 at 12:47:38PM -0500, Michael Roth wrote:
> On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > From: "Kirill A. Shutemov" 
> > 
> > Introduce 'memfd_restricted' system call with the ability to create
> > memory areas that are restricted from userspace access through ordinary
> > MMU operations (e.g. read/write/mmap). The memory content is expected to
> > be used through a new in-kernel interface by a third kernel module.
> > 
> > memfd_restricted() is useful for scenarios where a file descriptor(fd)
> > can be used as an interface into mm but we want to restrict userspace's
> > ability on the fd. Initially it is designed to provide protections for
> > KVM encrypted guest memory.
> > 
> > Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
> > (e.g. QEMU) and then using the mmaped virtual address to setup the
> > mapping in the KVM secondary page table (e.g. EPT). With confidential
> > computing technologies like Intel TDX, the memfd memory may be encrypted
> > with special key for special software domain (e.g. KVM guest) and is not
> > expected to be directly accessed by userspace. Precisely, userspace
> > access to such encrypted memory may lead to host crash so should be
> > prevented.
> > 
> > memfd_restricted() provides semantics required for KVM guest encrypted
> > memory support that a fd created with memfd_restricted() is going to be
> > used as the source of guest memory in confidential computing environment
> > and KVM can directly interact with core-mm without the need to expose
> > the memory content into KVM userspace.
> > 
> > KVM userspace is still in charge of the lifecycle of the fd. It should
> > pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
> > obtain the physical memory page and then uses it to populate the KVM
> > secondary page table entries.
> > 
> > The userspace restricted memfd can be fallocate-ed or hole-punched
> > from userspace. When these operations happen, KVM can get notified
> > through restrictedmem_notifier, it then gets chance to remove any
> > mapped entries of the range in the secondary page tables.
> > 
> > memfd_restricted() itself is implemented as a shim layer on top of real
> > memory file systems (currently tmpfs). Pages in restrictedmem are marked
> > as unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> > 
> > By default memfd_restricted() prevents userspace read, write and mmap.
> > By defining new bit in the 'flags', it can be extended to support other
> > restricted semantics in the future.
> > 
> > The system call is currently wired up for x86 arch.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > Signed-off-by: Chao Peng 
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
> >  include/linux/restrictedmem.h  |  62 ++
> >  include/linux/syscalls.h   |   1 +
> >  include/uapi/asm-generic/unistd.h  |   5 +-
> >  include/uapi/linux/magic.h |   1 +
> >  kernel/sys_ni.c|   3 +
> >  mm/Kconfig |   4 +
> >  mm/Makefile|   1 +
> >  mm/restrictedmem.c | 250 +
> >  10 files changed, 328 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/restrictedmem.h
> >  create mode 100644 mm/restrictedmem.c
> > 
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> > b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 320480a8db4f..dc70ba90247e 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -455,3 +455,4 @@
> >  448i386process_mreleasesys_process_mrelease
> >  449i386futex_waitv sys_futex_waitv
> >  450i386set_mempolicy_home_node 
> > sys_set_mempolicy_home_node
> > +451i386memfd_restrictedsys_memfd_restricted
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index c84d12608cd2..06516abc8318 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -372,6 +372,7 @@
> >  448common  process_mreleasesys_process_mrelease
> >  449common  futex_waitv sys_futex_waitv
> >  450common 

Re: [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory

2022-10-31 Thread Chao Peng
On Fri, Oct 28, 2022 at 03:04:27PM +0800, Xiaoyao Li wrote:
> On 10/25/2022 11:13 PM, Chao Peng wrote:
> > In memory encryption usage, guest memory may be encrypted with special
> > key and can be accessed only by the guest itself. We call such memory
> > private memory. It's valueless and sometimes can cause problem to allow
> > userspace to access guest private memory. This new KVM memslot extension
> > allows guest private memory being provided though a restrictedmem
>  ^
> 
> typo

Thanks!

> 
> > backed file descriptor(fd) and userspace is restricted to access the
> > bookmarked memory in the fd.
> > 
> > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > additional KVM memslot fields restricted_fd/restricted_offset to allow
> > userspace to instruct KVM to provide guest memory through restricted_fd.
> > 'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
> > and the size is 'memory_size'.
> > 
> > The extended memslot can still have the userspace_addr (hva). When used, a
> > single memslot can maintain both private memory through restricted_fd
> > and shared memory through userspace_addr. Whether the private or the shared
> > part is visible to the guest is maintained by other KVM code.
> > 
> > A restrictedmem_notifier field is also added to the memslot structure to
> > allow the restricted_fd's backing store to notify KVM the memory change,
> > KVM then can invalidate its page table entries.
> > 
> > Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
> > and right now it is selected on X86_64 only. A KVM_CAP_PRIVATE_MEM is
> > also introduced to indicate KVM support for KVM_MEM_PRIVATE.
> > 
> > To make code maintenance easy, internally we use a binary compatible
> > alias struct kvm_user_mem_region to handle both the normal and the
> > '_ext' variants.
> > 
> > Co-developed-by: Yu Zhang 
> > Signed-off-by: Yu Zhang 
> > Signed-off-by: Chao Peng 
> > ---
> >   Documentation/virt/kvm/api.rst | 48 -
> >   arch/x86/kvm/Kconfig   |  2 ++
> >   arch/x86/kvm/x86.c |  2 +-
> >   include/linux/kvm_host.h   | 13 +++--
> >   include/uapi/linux/kvm.h   | 29 
> >   virt/kvm/Kconfig   |  3 +++
> >   virt/kvm/kvm_main.c| 49 --
> >   7 files changed, 128 insertions(+), 18 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index eee9f857a986..f3fa75649a78 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
> >   :Capability: KVM_CAP_USER_MEMORY
> >   :Architectures: all
> >   :Type: vm ioctl
> > -:Parameters: struct kvm_userspace_memory_region (in)
> > +:Parameters: struct kvm_userspace_memory_region(_ext) (in)
> >   :Returns: 0 on success, -1 on error
> >   ::
> > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > __u64 userspace_addr; /* start of the userspace allocated memory */
> > };
> > +  struct kvm_userspace_memory_region_ext {
> > +   struct kvm_userspace_memory_region region;
> > +   __u64 restricted_offset;
> > +   __u32 restricted_fd;
> > +   __u32 pad1;
> > +   __u64 pad2[14];
> > +  };
> > +
> > /* for kvm_memory_region::flags */
> > #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> > #define KVM_MEM_READONLY(1UL << 1)
> > +  #define KVM_MEM_PRIVATE  (1UL << 2)
> >   This ioctl allows the user to create, modify or delete a guest physical
> >   memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> > @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of 
> > guest_phys_addr and userspace_addr
> >   be identical.  This allows large pages in the guest to be backed by large
> >   pages in the host.
> > -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> > -writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how 
> > to
> > -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows 
> > it,
> > -to make a new slot read-only.  In this case, writes to this memory will be
> > -posted to userspace as KVM_EXIT_MMIO exits.
> > +kvm_userspace_memory_region_ext struct includes all fi

Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-10-28 Thread Chao Peng
On Wed, Oct 26, 2022 at 10:31:45AM -0700, Isaku Yamahata wrote:
> On Tue, Oct 25, 2022 at 11:13:37PM +0800,
> Chao Peng  wrote:
> 
> > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > +  struct page **pagep, int *order)
> > +{
> > +   struct restrictedmem_data *data = file->f_mapping->private_data;
> > +   struct file *memfd = data->memfd;
> > +   struct page *page;
> > +   int ret;
> > +
> > +   ret = shmem_getpage(file_inode(memfd), offset, , SGP_WRITE);
> 
> shmem_getpage() was removed.
> https://lkml.kernel.org/r/20220902194653.1739778-34-wi...@infradead.org

Thanks for pointing out. My current base(kvm/queue) has not included
this change yet so still use shmem_getpage().

Chao
> 
> I needed the following fix to compile.
> 
> thanks,
> 
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index e5bf8907e0f8..4694dd5609d6 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -231,13 +231,15 @@ int restrictedmem_get_page(struct file *file, pgoff_t 
> offset,
>  {
> struct restrictedmem_data *data = file->f_mapping->private_data;
> struct file *memfd = data->memfd;
> +   struct folio *folio = NULL;
> struct page *page;
> int ret;
>  
> -   ret = shmem_getpage(file_inode(memfd), offset, , SGP_WRITE);
> +   ret = shmem_get_folio(file_inode(memfd), offset, , SGP_WRITE);
> if (ret)
> return ret;
>  
> +   page = folio_file_page(folio, offset);
> *pagep = page;
> if (order)
> *order = thp_order(compound_head(page));
> -- 
> Isaku Yamahata 



Re: [PATCH v9 7/8] KVM: Handle page fault for private memory

2022-10-28 Thread Chao Peng
On Wed, Oct 26, 2022 at 02:54:25PM -0700, Isaku Yamahata wrote:
> On Tue, Oct 25, 2022 at 11:13:43PM +0800,
> Chao Peng  wrote:
> 
> > A memslot with KVM_MEM_PRIVATE being set can include both fd-based
> > private memory and hva-based shared memory. Architecture code (like TDX
> > code) can tell whether the on-going fault is private or not. This patch
> > adds a 'is_private' field to kvm_page_fault to indicate this and
> > architecture code is expected to set it.
> > 
> > To handle page fault for such memslot, the handling logic is different
> > depending on whether the fault is private or shared. KVM checks if
> > 'is_private' matches the host's view of the page (maintained in
> > mem_attr_array).
> >   - For a successful match, private pfn is obtained with
> > restrictedmem_get_page () from private fd and shared pfn is obtained
> > with existing get_user_pages().
> >   - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> > userspace. Userspace then can convert memory between private/shared
> > in host's view and retry the fault.
> > 
> > Co-developed-by: Yu Zhang 
> > Signed-off-by: Yu Zhang 
> > Signed-off-by: Chao Peng 
> > ---
> >  arch/x86/kvm/mmu/mmu.c  | 56 +++--
> >  arch/x86/kvm/mmu/mmu_internal.h | 14 -
> >  arch/x86/kvm/mmu/mmutrace.h |  1 +
> >  arch/x86/kvm/mmu/spte.h |  6 
> >  arch/x86/kvm/mmu/tdp_mmu.c  |  3 +-
> >  include/linux/kvm_host.h| 28 +
> >  6 files changed, 103 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 67a9823a8c35..10017a9f26ee 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3030,7 +3030,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, 
> > gfn_t gfn,
> >  
> >  int kvm_mmu_max_mapping_level(struct kvm *kvm,
> >   const struct kvm_memory_slot *slot, gfn_t gfn,
> > - int max_level)
> > + int max_level, bool is_private)
> >  {
> > struct kvm_lpage_info *linfo;
> > int host_level;
> > @@ -3042,6 +3042,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > break;
> > }
> >  
> > +   if (is_private)
> > +   return max_level;
> 
> Below PG_LEVEL_NUM is passed by zap_collapsible_spte_range().  It doesn't make
> sense.
> 
> > +
> > if (max_level == PG_LEVEL_4K)
> > return PG_LEVEL_4K;
> >  
> > @@ -3070,7 +3073,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, 
> > struct kvm_page_fault *fault
> >  * level, which will be used to do precise, accurate accounting.
> >  */
> > fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> > -fault->gfn, 
> > fault->max_level);
> > +fault->gfn, 
> > fault->max_level,
> > +fault->is_private);
> > if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
> > return;
> >  
> > @@ -4141,6 +4145,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu 
> > *vcpu, struct kvm_async_pf *work)
> > kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
> >  }
> >  
> > +static inline u8 order_to_level(int order)
> > +{
> > +   BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > +   if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > +   return PG_LEVEL_1G;
> > +
> > +   if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > +   return PG_LEVEL_2M;
> > +
> > +   return PG_LEVEL_4K;
> > +}
> > +
> > +static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
> > +{
> > +   int order;
> > +   struct kvm_memory_slot *slot = fault->slot;
> > +
> > +   if (kvm_restricted_mem_get_pfn(slot, fault->gfn, >pfn, ))
> > +   return RET_PF_RETRY;
> > +
> > +   fault->max_level = min(order_to_level(order), fault->max_level);
> > +   fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> > +   return RET_PF_CONTINUE;
> > +}
> > +
> >  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault 
> > *fault)
> >  {
> > struct kvm_memory_slot *slot = fault->slot;
> > @@ -4173,6 +42

Re: [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed

2022-10-28 Thread Chao Peng
On Wed, Oct 26, 2022 at 01:46:20PM -0700, Isaku Yamahata wrote:
> On Tue, Oct 25, 2022 at 11:13:42PM +0800,
> Chao Peng  wrote:
> 
> > When private/shared memory are mixed within a large page, the lpage_info may
> > not be accurate and should be updated with this mixed info. A large page
> > that has mixed pages can't really be mapped as a large page since its
> > private/shared pages come from different physical memory.
> > 
> > Update lpage_info when the private/shared memory attribute is changed. If
> > both private and shared pages are within a large page region, it can't
> > be mapped as a large page. It's a bit of a challenge to track the mixed
> > info in a 'count'-like variable, so this patch instead reserves a bit in
> > 'disallow_lpage' to indicate that a large page has mixed private/shared pages.
> > 
> > Signed-off-by: Chao Peng 
> > ---
> >  arch/x86/include/asm/kvm_host.h |   8 +++
> >  arch/x86/kvm/mmu/mmu.c  | 112 +++-
> >  arch/x86/kvm/x86.c  |   2 +
> >  include/linux/kvm_host.h|  19 ++
> >  virt/kvm/kvm_main.c |  16 +++--
> >  5 files changed, 152 insertions(+), 5 deletions(-)
> > 
> ...
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 33b1aec44fb8..67a9823a8c35 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> ...
> > @@ -6910,3 +6915,108 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
> > if (kvm->arch.nx_lpage_recovery_thread)
> > kthread_stop(kvm->arch.nx_lpage_recovery_thread);
> >  }
> > +
> > +static inline bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> > +{
> > +   return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static inline void linfo_update_mixed(struct kvm_lpage_info *linfo, bool 
> > mixed)
> > +{
> > +   if (mixed)
> > +   linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +   else
> > +   linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> > +}
> > +
> > +static bool mem_attr_is_mixed_2m(struct kvm *kvm, unsigned int attr,
> > +gfn_t start, gfn_t end)
> > +{
> > +   XA_STATE(xas, >mem_attr_array, start);
> > +   gfn_t gfn = start;
> > +   void *entry;
> > +   bool shared = attr == KVM_MEM_ATTR_SHARED;
> > +   bool mixed = false;
> > +
> > +   rcu_read_lock();
> > +   entry = xas_load();
> > +   while (gfn < end) {
> > +   if (xas_retry(, entry))
> > +   continue;
> > +
> > +   KVM_BUG_ON(gfn != xas.xa_index, kvm);
> > +
> > +   if ((entry && !shared) || (!entry && shared)) {
> > +   mixed = true;
> > +   goto out;
> 
> nitpick: goto isn't needed. break should work.

Thanks.

> 
> > +   }
> > +
> > +   entry = xas_next();
> > +   gfn++;
> > +   }
> > +out:
> > +   rcu_read_unlock();
> > +   return mixed;
> > +}
> > +
> > +static bool mem_attr_is_mixed(struct kvm *kvm, struct kvm_memory_slot 
> > *slot,
> > + int level, unsigned int attr,
> > + gfn_t start, gfn_t end)
> > +{
> > +   unsigned long gfn;
> > +   void *entry;
> > +
> > +   if (level == PG_LEVEL_2M)
> > +   return mem_attr_is_mixed_2m(kvm, attr, start, end);
> > +
> > +   entry = xa_load(>mem_attr_array, start);
> > +   for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
> > +   if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)))
> > +   return true;
> > +   if (xa_load(>mem_attr_array, gfn) != entry)
> > +   return true;
> > +   }
> > +   return false;
> > +}
> > +
> > +void kvm_arch_update_mem_attr(struct kvm *kvm, struct kvm_memory_slot 
> > *slot,
> > + unsigned int attr, gfn_t start, gfn_t end)
> > +{
> > +
> > +   unsigned long lpage_start, lpage_end;
> > +   unsigned long gfn, pages, mask;
> > +   int level;
> > +
> > +   WARN_ONCE(!(attr & (KVM_MEM_ATTR_PRIVATE | KVM_MEM_ATTR_SHARED)),
> > +   "Unsupported mem attribute.\n");
> > +
> > +   /*
> > +* The sequence matters here: we update the higher level basing on the
> > +* lower level's scanning result.
> > +*/
> > +   for (level = 

Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-10-28 Thread Chao Peng
On Thu, Oct 27, 2022 at 11:27:05AM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Tue, Oct 25, 2022 at 4:19 PM Chao Peng  wrote:
> >
> > This new KVM exit allows userspace to handle memory-related errors. It
> > indicates an error happens in KVM at guest memory range [gpa, gpa+size).
> > The flags includes additional information for userspace to handle the
> > error. Currently bit 0 is defined as 'private memory' where '1'
> > indicates error happens due to private memory access and '0' indicates
> > error happens due to shared memory access.
> >
> > When private memory is enabled, this new exit will be used for KVM to
> > exit to userspace for shared <-> private memory conversion in memory
> > encryption usage. In such usage, typically there are two kind of memory
> > conversions:
> >   - explicit conversion: happens when guest explicitly calls into KVM
> > to map a range (as private or shared), KVM then exits to userspace
> > to perform the map/unmap operations.
> >   - implicit conversion: happens in KVM page fault handler where KVM
> > exits to userspace for an implicit conversion when the page is in a
> > different state than requested (private or shared).
> >
> > Suggested-by: Sean Christopherson 
> > Co-developed-by: Yu Zhang 
> > Signed-off-by: Yu Zhang 
> > Signed-off-by: Chao Peng 
> > ---
> 
> Reviewed-by: Fuad Tabba 
> 
> I have tested the V8 version of this patch on arm64/qemu, and
> considering this hasn't changed:
> Tested-by: Fuad Tabba 

Appreciate your review and testing!

Chao
> 
> Cheers,
> /fuad
> 
> 
> 
> >  Documentation/virt/kvm/api.rst | 23 +++
> >  include/uapi/linux/kvm.h   |  9 +
> >  2 files changed, 32 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f3fa75649a78..975688912b8c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace 
> > should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >
> > +::
> > +
> > +   /* KVM_EXIT_MEMORY_FAULT */
> > +   struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> > +   __u32 flags;
> > +   __u32 padding;
> > +   __u64 gpa;
> > +   __u64 size;
> > +   } memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set. Otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The 
> > userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> >  ::
> >
> >  /* KVM_EXIT_NOTIFY */
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index f1ae45c10c94..fa60b032a405 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -300,6 +300,7 @@ struct kvm_xen_exit {
> >  #define KVM_EXIT_RISCV_SBI35
> >  #define KVM_EXIT_RISCV_CSR36
> >  #define KVM_EXIT_NOTIFY   37
> > +#define KVM_EXIT_MEMORY_FAULT 38
> >
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  /* Emulate instruction failed. */
> > @@ -538,6 +539,14 @@ struct kvm_run {
> >  #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
> > __u32 flags;
> > } notify;
> > +   /* KVM_EXIT_MEMORY_FAULT */
> > +   struct {
> > +#define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1 << 0)
> > +   __u32 flags;
> > +   __u32 padding;
> > +   __u64 gpa;
> > +   __u64 size;
> > +   } memory;
> > /* Fix the size of the union. */
> > char padding[256];
> > };
> > --
> > 2.25.1
> >



[PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

2022-10-25 Thread Chao Peng
From: "Kirill A. Shutemov" 

Introduce 'memfd_restricted' system call with the ability to create
memory areas that are restricted from userspace access through ordinary
MMU operations (e.g. read/write/mmap). The memory content is expected to
be used through a new in-kernel interface by a third kernel module.

memfd_restricted() is useful for scenarios where a file descriptor(fd)
can be used as an interface into mm but we want to restrict userspace's
ability on the fd. Initially it is designed to provide protections for
KVM encrypted guest memory.

Normally KVM uses memfd memory via mmapping the memfd into KVM userspace
(e.g. QEMU) and then using the mmaped virtual address to setup the
mapping in the KVM secondary page table (e.g. EPT). With confidential
computing technologies like Intel TDX, the memfd memory may be encrypted
with special key for special software domain (e.g. KVM guest) and is not
expected to be directly accessed by userspace. Precisely, userspace
access to such encrypted memory may lead to host crash so should be
prevented.

memfd_restricted() provides semantics required for KVM guest encrypted
memory support that a fd created with memfd_restricted() is going to be
used as the source of guest memory in confidential computing environment
and KVM can directly interact with core-mm without the need to expose
the memory content into KVM userspace.

KVM userspace is still in charge of the lifecycle of the fd. It should
pass the created fd to KVM. KVM uses the new restrictedmem_get_page() to
obtain the physical memory page and then uses it to populate the KVM
secondary page table entries.

The userspace restricted memfd can be fallocate-ed or hole-punched
from userspace. When these operations happen, KVM can get notified
through restrictedmem_notifier, it then gets chance to remove any
mapped entries of the range in the secondary page tables.
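
As a rough sketch of how a consumer hooks into these notifications (the
registration helper name is taken from how KVM uses it later in this
series; the callback bodies are placeholders):

static void my_invalidate_start(struct restrictedmem_notifier *notifier,
				pgoff_t start, pgoff_t end)
{
	/* zap any mappings derived from [start, end) of the memfd */
}

static void my_invalidate_end(struct restrictedmem_notifier *notifier,
			      pgoff_t start, pgoff_t end)
{
	/* allow the range to be faulted in again */
}

static const struct restrictedmem_notifier_ops my_notifier_ops = {
	.invalidate_start = my_invalidate_start,
	.invalidate_end   = my_invalidate_end,
};

static struct restrictedmem_notifier my_notifier = {
	.ops = &my_notifier_ops,
};

static void my_attach(struct file *file)
{
	/* 'file' is the struct file behind the memfd_restricted() fd */
	restrictedmem_register_notifier(file, &my_notifier);
}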

memfd_restricted() itself is implemented as a shim layer on top of real
memory file systems (currently tmpfs). Pages in restrictedmem are marked
as unmovable and unevictable, this is required for current confidential
usage. But in future this might be changed.

By default memfd_restricted() prevents userspace read, write and mmap.
By defining new bit in the 'flags', it can be extended to support other
restricted semantics in the future.

The system call is currently wired up for x86 arch.
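
A minimal userspace usage sketch (assuming the syscall number 451 wired
up in the tables below; there is no libc wrapper at this point):

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_memfd_restricted
#define __NR_memfd_restricted 451
#endif

int main(void)
{
	long fd = syscall(__NR_memfd_restricted, 0 /* flags */);

	if (fd < 0) {
		perror("memfd_restricted");
		return 1;
	}

	/*
	 * The fd cannot be read/written/mmapped from userspace; it is meant
	 * to be passed to KVM as the private memory backing of a memslot.
	 */
	printf("restricted memfd: %ld\n", fd);
	return 0;
}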

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Chao Peng 
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/restrictedmem.h  |  62 ++
 include/linux/syscalls.h   |   1 +
 include/uapi/asm-generic/unistd.h  |   5 +-
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c|   3 +
 mm/Kconfig |   4 +
 mm/Makefile|   1 +
 mm/restrictedmem.c | 250 +
 10 files changed, 328 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/restrictedmem.h
 create mode 100644 mm/restrictedmem.c

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..dc70ba90247e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@
 448i386process_mreleasesys_process_mrelease
 449i386futex_waitv sys_futex_waitv
 450i386set_mempolicy_home_node sys_set_mempolicy_home_node
+451i386memfd_restrictedsys_memfd_restricted
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..06516abc8318 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448common  process_mreleasesys_process_mrelease
 449common  futex_waitv sys_futex_waitv
 450common  set_mempolicy_home_node sys_set_mempolicy_home_node
+451common  memfd_restrictedsys_memfd_restricted
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
new file mode 100644
index ..9c37c3ea3180
--- /dev/null
+++ b/include/linux/restrictedmem.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_RESTRICTEDMEM_H
+
+#include 
+#include 
+#include 
+
+struct restrictedmem_notifier;
+
+struct restrictedmem_notifier_ops {
+   void (*invalidate_start)(struct restrictedmem_notifier *notifier,
+pgoff_t start, pgoff_t end);
+   void (*invalidate_end)(struct restrictedmem_notifier *notifier,
+  pgoff_t start, pgoff_t end);
+};
+
+struct restrictedmem_notifier {
+   struct list_head list;
+   const struct restrictedmem_notifier_ops *ops;
+};
+
+#ifdef CONFIG_RES

[PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry

2022-10-25 Thread Chao Peng
Currently in the mmu_notifier invalidate path, the hva range is recorded
and then checked against in mmu_notifier_retry_hva() in the page fault
path. However, for the to-be-introduced private memory, a page fault may
not have an hva associated, so checking the gfn (gpa) makes more sense.

For the existing non-private memory case, gfn is expected to continue to
work. The only downside is that when multiple gfns alias a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected to be small.

It also fixes a bug in kvm_zap_gfn_range(), which has already been using
gfn when calling kvm_mmu_invalidate_begin/end() while these functions
accept an hva in the current code.
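
For context, a simplified sketch of the fault-path pattern that this retry
check serves (names follow the diff below; the real code handles errors
and more cases):

static int sketch_map_gfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
{
	unsigned long mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	int r = RET_PF_RETRY;

	smp_rmb();

	/* resolve fault->pfn outside mmu_lock; this may sleep */

	write_lock(&vcpu->kvm->mmu_lock);
	if (mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn))
		goto out_unlock;	/* an invalidation raced with us: retry */

	/* install the SPTE for fault->gfn */
	r = RET_PF_FIXED;
out_unlock:
	write_unlock(&vcpu->kvm->mmu_lock);
	return r;
}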

Signed-off-by: Chao Peng 
---
 arch/x86/kvm/mmu/mmu.c   |  2 +-
 include/linux/kvm_host.h | 18 +++-
 virt/kvm/kvm_main.c  | 45 ++--
 3 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f81539061d6..33b1aec44fb8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4217,7 +4217,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
return true;
 
return fault->slot &&
-  mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+  mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault 
*fault)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 739a7562a1f3..79e5cbc35fcf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -775,8 +775,8 @@ struct kvm {
struct mmu_notifier mmu_notifier;
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
-   unsigned long mmu_invalidate_range_start;
-   unsigned long mmu_invalidate_range_end;
+   gfn_t mmu_invalidate_range_start;
+   gfn_t mmu_invalidate_range_end;
 #endif
struct list_head devices;
u64 manual_dirty_log_protect;
@@ -1365,10 +1365,8 @@ void kvm_mmu_free_memory_cache(struct 
kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
- unsigned long end);
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-   unsigned long end);
+void kvm_mmu_invalidate_begin(struct kvm *kvm, gfn_t start, gfn_t end);
+void kvm_mmu_invalidate_end(struct kvm *kvm, gfn_t start, gfn_t end);
 
 long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg);
@@ -1937,9 +1935,9 @@ static inline int mmu_invalidate_retry(struct kvm *kvm, 
unsigned long mmu_seq)
return 0;
 }
 
-static inline int mmu_invalidate_retry_hva(struct kvm *kvm,
+static inline int mmu_invalidate_retry_gfn(struct kvm *kvm,
   unsigned long mmu_seq,
-  unsigned long hva)
+  gfn_t gfn)
 {
lockdep_assert_held(>mmu_lock);
/*
@@ -1949,8 +1947,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm 
*kvm,
 * positives, due to shortcuts when handing concurrent invalidations.
 */
if (unlikely(kvm->mmu_invalidate_in_progress) &&
-   hva >= kvm->mmu_invalidate_range_start &&
-   hva < kvm->mmu_invalidate_range_end)
+   gfn >= kvm->mmu_invalidate_range_start &&
+   gfn < kvm->mmu_invalidate_range_end)
return 1;
if (kvm->mmu_invalidate_seq != mmu_seq)
return 1;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8dace78a0278..09c9cdeb773c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -540,8 +540,7 @@ static void kvm_mmu_notifier_invalidate_range(struct 
mmu_notifier *mn,
 
 typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
-typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
-unsigned long end);
+typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end);
 
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
@@ -628,7 +627,8 @@ static __always_inline int __kvm_handle_hva_range(struct 
kvm *kvm,
locked = true;
KVM_MMU_LOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_lock))
-   range->on_lock(kvm, range->start, 
range->end);
+   range->on_lock(kvm, gfn_range.start,
+   gfn_range.end);
if (IS_KVM_NULL_FN(range

[PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM

2022-10-25 Thread Chao Peng
 the private or the shared part of the memslot is visible to
guest.


Test

Ran two kinds of tests:
  - Selftests [4] from Vishal and VM boot tests in non-TDX environment
Code also in below repo: https://github.com/chao-p/linux/tree/privmem-v9

  - Functional tests in TDX capable environment
Tested the new functionalities in TDX environment. Code repos:
Linux: https://github.com/chao-p/linux/tree/privmem-v9-tdx
QEMU: https://github.com/chao-p/qemu/tree/privmem-v9

An example QEMU command line for TDX test:
-object tdx-guest,id=tdx,debug=off,sept-ve-disable=off \
-machine confidential-guest-support=tdx \
-object memory-backend-memfd-private,id=ram1,size=${mem} \
-machine memory-backend=ram1


TODO

  - Page accounting and limiting for encrypted memory
  - hugetlbfs support


Changelog
=
v9:
  - mm: move inaccessible memfd into separated syscall.
  - mm: return page instead of pfn_t for inaccessible_get_pfn and remove
inaccessible_put_pfn.
  - KVM: rename inaccessible/private to restricted and CONFIG change to
make the code friendly to pKVM.
  - KVM: add invalidate_begin/end pair to fix race contention and revise
the lock protection for invalidation path.
  - KVM: optimize setting lpage_info for > 2M level by direct accessing
lower level's result.
  - KVM: avoid load xarray in kvm_mmu_max_mapping_level() and instead let
the caller to pass in is_private.
  - KVM: API doc improvement.
v8:
  - mm: redesign mm part by introducing a shim layer(inaccessible_memfd)
in memfd to avoid touch the memory file systems directly.
  - mm: exclude F_SEAL_AUTO_ALLOCATE as it is for shared memory and
cause confusion in this series, will send out separately.
  - doc: exclude the man page change, it's not kernel patch and will
send out separately.
  - KVM: adapt to use the new mm inaccessible_memfd interface.
  - KVM: update lpage_info when setting mem_attr_array to support
large page.
  - KVM: change from xa_store_range to xa_store for mem_attr_array because
xa_store_range overrides all entries, which is not the intended
behavior for us.
  - KVM: refine the mmu_invalidate_retry_gfn mechanism for private page.
  - KVM: reorganize KVM_MEMORY_ENCRYPT_{UN,}REG_REGION and private page
handling code suggested by Sean.
v7:
  - mm: introduce F_SEAL_AUTO_ALLOCATE to avoid double allocation.
  - KVM: use KVM_MEMORY_ENCRYPT_{UN,}REG_REGION to record
private/shared info.
  - KVM: use similar sync mechanism between zap/page fault paths as
mmu_notifier for memfile_notifier based invalidation.
v6:
  - mm: introduce MEMFILE_F_* flags into memfile_node to allow checking
feature consistence among all memfile_notifier users and get rid of
internal flags like SHM_F_INACCESSIBLE.
  - mm: make pfn_ops callbacks being members of memfile_backing_store
and then refer to it directly in memfile_notifier.
  - mm: remove backing store unregister.
  - mm: remove RLIMIT_MEMLOCK based memory accounting and limiting.
  - KVM: reorganize patch sequence for page fault handling and private
memory enabling.
v5:
  - Add man page for MFD_INACCESSIBLE flag and improve KVM API do for
the new memslot extensions.
  - mm: introduce memfile_{un}register_backing_store to allow memory
backing store to register/unregister it from memfile_notifier.
  - mm: remove F_SEAL_INACCESSIBLE, use in-kernel flag
(SHM_F_INACCESSIBLE for shmem) instead. 
  - mm: add memory accounting and limiting (RLIMIT_MEMLOCK based) for
MFD_INACCESSIBLE memory.
  - KVM: remove the overlap check for mapping the same file+offset into
multiple gfns due to perf consideration, warned in document.
v4:
  - mm: rename memfd_ops to memfile_notifier and separate it from
memfd.c to standalone memfile-notifier.c.
  - KVM: move pfn_ops to per-memslot scope from per-vm scope and allow
registering multiple memslots to the same memory backing store.
  - KVM: add a 'kvm' reference in memslot so that we can recover kvm in
memfile_notifier handlers.
  - KVM: add 'private_' prefix for the new fields in memslot.
  - KVM: reshape the 'type' to 'flag' for kvm_memory_exit
v3:
  - Remove 'RFC' prefix.
  - Fix race condition between memfile_notifier handlers and kvm destroy.
  - mm: introduce MFD_INACCESSIBLE flag for memfd_create() to force
setting F_SEAL_INACCESSIBLE when the fd is created.
  - KVM: add the shared part of the memslot back to make private/shared
pages live in one memslot.

Reference
=
[1] Intel TDX:
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Kirill's implementation:
https://lore.kernel.org/all/20210416154106.23721-1-kirill.shute...@linux.intel.com/T/
 
[3] Original design proposal:
https://lore.kernel.org/all/20210824005248.200037-1-sea...@google.com/  
[4] Selftest:
https://lore.kernel.org/all/20220819174659.2427983-1-vannapu...@google.com/ 


Chao Peng (7):
  KVM: Extend the memslot to s

[PATCH v9 7/8] KVM: Handle page fault for private memory

2022-10-25 Thread Chao Peng
A memslot with KVM_MEM_PRIVATE being set can include both fd-based
private memory and hva-based shared memory. Architecture code (like TDX
code) can tell whether the on-going fault is private or not. This patch
adds a 'is_private' field to kvm_page_fault to indicate this and
architecture code is expected to set it.

To handle page fault for such memslot, the handling logic is different
depending on whether the fault is private or shared. KVM checks if
'is_private' matches the host's view of the page (maintained in
mem_attr_array).
  - For a successful match, private pfn is obtained with
restrictedmem_get_page () from private fd and shared pfn is obtained
with existing get_user_pages().
  - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
userspace. Userspace then can convert memory between private/shared
in host's view and retry the fault.
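
On the userspace side, handling that exit can be as simple as flipping the
range's attribute with the ioctls from this series and re-entering the
guest (a sketch, error handling omitted):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Called when run->exit_reason == KVM_EXIT_MEMORY_FAULT. */
static void handle_memory_fault(int vm_fd, struct kvm_run *run)
{
	struct kvm_enc_region region = {
		.addr = run->memory.gpa,	/* a GPA in this usage, not an HVA */
		.size = run->memory.size,
	};

	if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
		ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);   /* shared -> private */
	else
		ioctl(vm_fd, KVM_MEMORY_ENCRYPT_UNREG_REGION, &region); /* private -> shared */

	/* then KVM_RUN again; KVM retries the faulting access */
}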

Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
---
 arch/x86/kvm/mmu/mmu.c  | 56 +++--
 arch/x86/kvm/mmu/mmu_internal.h | 14 -
 arch/x86/kvm/mmu/mmutrace.h |  1 +
 arch/x86/kvm/mmu/spte.h |  6 
 arch/x86/kvm/mmu/tdp_mmu.c  |  3 +-
 include/linux/kvm_host.h| 28 +
 6 files changed, 103 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 67a9823a8c35..10017a9f26ee 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3030,7 +3030,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t 
gfn,
 
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
  const struct kvm_memory_slot *slot, gfn_t gfn,
- int max_level)
+ int max_level, bool is_private)
 {
struct kvm_lpage_info *linfo;
int host_level;
@@ -3042,6 +3042,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
break;
}
 
+   if (is_private)
+   return max_level;
+
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
 
@@ -3070,7 +3073,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, 
struct kvm_page_fault *fault
 * level, which will be used to do precise, accurate accounting.
 */
fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
-fault->gfn, 
fault->max_level);
+fault->gfn, 
fault->max_level,
+fault->is_private);
if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
return;
 
@@ -4141,6 +4145,32 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, 
struct kvm_async_pf *work)
kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
 }
 
+static inline u8 order_to_level(int order)
+{
+   BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+   if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+   return PG_LEVEL_1G;
+
+   if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+   return PG_LEVEL_2M;
+
+   return PG_LEVEL_4K;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
+{
+   int order;
+   struct kvm_memory_slot *slot = fault->slot;
+
+   if (kvm_restricted_mem_get_pfn(slot, fault->gfn, >pfn, ))
+   return RET_PF_RETRY;
+
+   fault->max_level = min(order_to_level(order), fault->max_level);
+   fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
+   return RET_PF_CONTINUE;
+}
+
 static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
struct kvm_memory_slot *slot = fault->slot;
@@ -4173,6 +4203,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct 
kvm_page_fault *fault)
return RET_PF_EMULATE;
}
 
+   if (kvm_slot_can_be_private(slot) &&
+   fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+   vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+   if (fault->is_private)
+   vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+   else
+   vcpu->run->memory.flags = 0;
+   vcpu->run->memory.padding = 0;
+   vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+   vcpu->run->memory.size = PAGE_SIZE;
+   return RET_PF_USER;
+   }
+
+   if (fault->is_private)
+   return kvm_faultin_pfn_private(fault);
+
async = false;
fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, ,
  fault->write, >map_writable,
@@ -5557,6 +5603,9 @@ int noinline kvm_mmu_page_fault(struct

[PATCH v9 8/8] KVM: Enable and expose KVM_MEM_PRIVATE

2022-10-25 Thread Chao Peng
Expose KVM_MEM_PRIVATE and the memslot fields restricted_fd/offset to
userspace. KVM registers/unregisters the private memslot with the fd-based
memory backing store and responds to invalidation events from
restrictedmem_notifier by zapping the existing memory mappings in the
secondary page table.

Whether KVM_MEM_PRIVATE is actually exposed to userspace is determined
by architecture code, which can turn it on by overriding the default
kvm_arch_has_private_mem().

A 'kvm' reference is added to the memslot structure since in the
restrictedmem_notifier callback we can only obtain a memslot reference,
but 'kvm' is needed to do the zapping.

Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
---
 include/linux/kvm_host.h |   3 +-
 virt/kvm/kvm_main.c  | 174 +--
 2 files changed, 171 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 69300fc6d572..e27d62c30484 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -246,7 +246,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
 
-#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || 
defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
+#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || 
defined(CONFIG_HAVE_KVM_RESTRICTED_MEM)
 struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
@@ -583,6 +583,7 @@ struct kvm_memory_slot {
struct file *restricted_file;
loff_t restricted_offset;
struct restrictedmem_notifier notifier;
+   struct kvm *kvm;
 };
 
 static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 13a37b4d9e97..dae6a2c196ad 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1028,6 +1028,111 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, 
gpa_t gpa, gpa_t size,
 }
 #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
 
+#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
+static bool restrictedmem_range_is_valid(struct kvm_memory_slot *slot,
+pgoff_t start, pgoff_t end,
+gfn_t *gfn_start, gfn_t *gfn_end)
+{
+   unsigned long base_pgoff = slot->restricted_offset >> PAGE_SHIFT;
+
+   if (start > base_pgoff)
+   *gfn_start = slot->base_gfn + start - base_pgoff;
+   else
+   *gfn_start = slot->base_gfn;
+
+   if (end < base_pgoff + slot->npages)
+   *gfn_end = slot->base_gfn + end - base_pgoff;
+   else
+   *gfn_end = slot->base_gfn + slot->npages;
+
+   if (*gfn_start >= *gfn_end)
+   return false;
+
+   return true;
+}
+
+static void kvm_restrictedmem_invalidate_begin(struct restrictedmem_notifier 
*notifier,
+  pgoff_t start, pgoff_t end)
+{
+   struct kvm_memory_slot *slot = container_of(notifier,
+   struct kvm_memory_slot,
+   notifier);
+   struct kvm *kvm = slot->kvm;
+   gfn_t gfn_start, gfn_end;
+   struct kvm_gfn_range gfn_range;
+   int idx;
+
+   if (!restrictedmem_range_is_valid(slot, start, end,
+   _start, _end))
+   return;
+
+   idx = srcu_read_lock(>srcu);
+   KVM_MMU_LOCK(kvm);
+
+   kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
+
+   gfn_range.start = gfn_start;
+   gfn_range.end = gfn_end;
+   gfn_range.slot = slot;
+   gfn_range.pte = __pte(0);
+   gfn_range.may_block = true;
+
+   if (kvm_unmap_gfn_range(kvm, _range))
+   kvm_flush_remote_tlbs(kvm);
+
+   KVM_MMU_UNLOCK(kvm);
+   srcu_read_unlock(>srcu, idx);
+}
+
+static void kvm_restrictedmem_invalidate_end(struct restrictedmem_notifier 
*notifier,
+pgoff_t start, pgoff_t end)
+{
+   struct kvm_memory_slot *slot = container_of(notifier,
+   struct kvm_memory_slot,
+   notifier);
+   struct kvm *kvm = slot->kvm;
+   gfn_t gfn_start, gfn_end;
+
+   if (!restrictedmem_range_is_valid(slot, start, end,
+   _start, _end))
+   return;
+
+   KVM_MMU_LOCK(kvm);
+   kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
+   KVM_MMU_UNLOCK(kvm);
+}
+
+static struct restrictedmem_notifier_ops kvm_restrictedmem_notifier_ops = {
+   .invalidate_start = kvm_restrictedmem_invalidate_begin,
+   .invalidate_end = kvm_restrictedmem_invalidate_end,
+};
+
+static inline void kvm_restrictedmem_register(struct kvm_memory_slot *slot)
+{
+   slot->notifier.ops = _restrictedmem_notifier_ops;
+   restrictedmem_register_notifier(slot->restricted_file, >notifie

[PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

2022-10-25 Thread Chao Peng
Introduce generic private memory register/unregister by reusing the
existing SEV ioctls KVM_MEMORY_ENCRYPT_{UN,}REG_REGION. It differs from
the SEV case by treating the address in the region as a gpa instead of an
hva. Which path these ioctls take is determined by
kvm_arch_has_private_mem(); architectures which support KVM_PRIVATE_MEM
should override this function.

KVM internally defaults all guest memory to private and maintains the
shared memory in 'mem_attr_array'. The above ioctls operate on this
field and unmap any existing mappings.
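
Roughly, the GPA-based path boils down to the following (a simplified
sketch of the semantics described above, not the code in the diff below,
which also brackets the zap with kvm_mmu_invalidate_begin/end and handles
errors and locking):

static int sketch_set_mem_attr(struct kvm *kvm, gfn_t start, gfn_t end,
			       bool shared)
{
	gfn_t gfn;

	for (gfn = start; gfn < end; gfn++) {
		if (shared)
			xa_store(&kvm->mem_attr_array, gfn,
				 xa_mk_value(KVM_MEM_ATTR_SHARED), GFP_KERNEL);
		else
			xa_erase(&kvm->mem_attr_array, gfn); /* private is the default */
	}

	/* zap existing mappings so the next fault re-checks the attribute */
	kvm_zap_gfn_range(kvm, start, end);
	return 0;
}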

Signed-off-by: Chao Peng 
---
 Documentation/virt/kvm/api.rst |  17 ++-
 arch/x86/kvm/Kconfig   |   1 +
 include/linux/kvm_host.h   |  10 +-
 virt/kvm/Kconfig   |   4 +
 virt/kvm/kvm_main.c| 227 +
 5 files changed, 198 insertions(+), 61 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 975688912b8c..08253cf498d1 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4717,10 +4717,19 @@ Documentation/virt/kvm/x86/amd-memory-encryption.rst.
 This ioctl can be used to register a guest memory region which may
 contain encrypted data (e.g. guest RAM, SMRAM etc).
 
-It is used in the SEV-enabled guest. When encryption is enabled, a guest
-memory region may contain encrypted data. The SEV memory encryption
-engine uses a tweak such that two identical plaintext pages, each at
-different locations will have differing ciphertexts. So swapping or
+Currently this ioctl supports registering memory regions for two usages:
+private memory and SEV-encrypted memory.
+
+When private memory is enabled, this ioctl is used to register guest private
+memory region and the addr/size of kvm_enc_region represents guest physical
+address (GPA). In this usage, this ioctl zaps the existing guest memory
+mappings in KVM that fallen into the region.
+
+When SEV-encrypted memory is enabled, this ioctl is used to register guest
+memory region which may contain encrypted data for a SEV-enabled guest. The
+addr/size of kvm_enc_region represents userspace address (HVA). The SEV
+memory encryption engine uses a tweak such that two identical plaintext pages,
+each at different locations will have differing ciphertexts. So swapping or
 moving ciphertext of those pages will not result in plaintext being
 swapped. So relocating (or migrating) physical backing pages for the SEV
 guest will require some additional steps.
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 8d2bd455c0cd..73fdfa429b20 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -51,6 +51,7 @@ config KVM
select HAVE_KVM_PM_NOTIFIER if PM
select HAVE_KVM_RESTRICTED_MEM if X86_64
select RESTRICTEDMEM if HAVE_KVM_RESTRICTED_MEM
+   select KVM_GENERIC_PRIVATE_MEM if HAVE_KVM_RESTRICTED_MEM
help
  Support hosting fully virtualized guest machines using hardware
  virtualization extensions.  You will need a fairly recent
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 79e5cbc35fcf..4ce98fa0153c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -245,7 +245,8 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t 
cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
-#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
+
+#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || 
defined(CONFIG_KVM_GENERIC_PRIVATE_MEM)
 struct kvm_gfn_range {
struct kvm_memory_slot *slot;
gfn_t start;
@@ -254,6 +255,9 @@ struct kvm_gfn_range {
bool may_block;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+#endif
+
+#ifdef KVM_ARCH_WANT_MMU_NOTIFIER
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
@@ -794,6 +798,9 @@ struct kvm {
struct notifier_block pm_notifier;
 #endif
char stats_id[KVM_STATS_NAME_SIZE];
+#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+   struct xarray mem_attr_array;
+#endif
 };
 
 #define kvm_err(fmt, ...) \
@@ -1453,6 +1460,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu 
*vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_has_private_mem(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 9ff164c7e0cc..69ca59e82149 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -89,3 +89,7 @@ config HAVE_KVM_PM_NOTIFIER
 
 config HAVE_KVM_RESTRICTED_MEM
bool
+
+config KVM_GENERIC_PRIVATE_MEM
+   bool
+   depends on HAVE_KVM_RESTRICTED_MEM
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 09c9cdeb773c..fc3835826ace 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c

[PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed

2022-10-25 Thread Chao Peng
When private/shared memory are mixed within a large page, the lpage_info may
not be accurate and should be updated with this mixed info. A large page
that has mixed pages can't really be mapped as a large page since its
private/shared pages come from different physical memory.

Update lpage_info when the private/shared memory attribute is changed. If
both private and shared pages are within a large page region, it can't
be mapped as a large page. It's a bit of a challenge to track the mixed
info in a 'count'-like variable, so this patch instead reserves a bit in
'disallow_lpage' to indicate that a large page has mixed private/shared pages.

Signed-off-by: Chao Peng 
---
 arch/x86/include/asm/kvm_host.h |   8 +++
 arch/x86/kvm/mmu/mmu.c  | 112 +++-
 arch/x86/kvm/x86.c  |   2 +
 include/linux/kvm_host.h|  19 ++
 virt/kvm/kvm_main.c |  16 +++--
 5 files changed, 152 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7551b6f9c31c..db811a54e3fd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -37,6 +37,7 @@
 #include 
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_HAVE_ARCH_UPDATE_MEM_ATTR
 
 #define KVM_MAX_VCPUS 1024
 
@@ -952,6 +953,13 @@ struct kvm_vcpu_arch {
 #endif
 };
 
+/*
+ * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
+ * level. The remaining bits are used as a reference count.
+ */
+#define KVM_LPAGE_PRIVATE_SHARED_MIXED (1U << 31)
+#define KVM_LPAGE_COUNT_MAX((1U << 31) - 1)
+
 struct kvm_lpage_info {
int disallow_lpage;
 };
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 33b1aec44fb8..67a9823a8c35 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -762,11 +762,16 @@ static void update_gfn_disallow_lpage_count(const struct 
kvm_memory_slot *slot,
 {
struct kvm_lpage_info *linfo;
int i;
+   int disallow_count;
 
for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
+
+   disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
+   WARN_ON(disallow_count + count < 0 ||
+   disallow_count > KVM_LPAGE_COUNT_MAX - count);
+
linfo->disallow_lpage += count;
-   WARN_ON(linfo->disallow_lpage < 0);
}
 }
 
@@ -6910,3 +6915,108 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
if (kvm->arch.nx_lpage_recovery_thread)
kthread_stop(kvm->arch.nx_lpage_recovery_thread);
 }
+
+static inline bool linfo_is_mixed(struct kvm_lpage_info *linfo)
+{
+   return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static inline void linfo_update_mixed(struct kvm_lpage_info *linfo, bool mixed)
+{
+   if (mixed)
+   linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
+   else
+   linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
+}
+
+static bool mem_attr_is_mixed_2m(struct kvm *kvm, unsigned int attr,
+gfn_t start, gfn_t end)
+{
+   XA_STATE(xas, >mem_attr_array, start);
+   gfn_t gfn = start;
+   void *entry;
+   bool shared = attr == KVM_MEM_ATTR_SHARED;
+   bool mixed = false;
+
+   rcu_read_lock();
+   entry = xas_load();
+   while (gfn < end) {
+   if (xas_retry(, entry))
+   continue;
+
+   KVM_BUG_ON(gfn != xas.xa_index, kvm);
+
+   if ((entry && !shared) || (!entry && shared)) {
+   mixed = true;
+   goto out;
+   }
+
+   entry = xas_next();
+   gfn++;
+   }
+out:
+   rcu_read_unlock();
+   return mixed;
+}
+
+static bool mem_attr_is_mixed(struct kvm *kvm, struct kvm_memory_slot *slot,
+ int level, unsigned int attr,
+ gfn_t start, gfn_t end)
+{
+   unsigned long gfn;
+   void *entry;
+
+   if (level == PG_LEVEL_2M)
+   return mem_attr_is_mixed_2m(kvm, attr, start, end);
+
+   entry = xa_load(&kvm->mem_attr_array, start);
+   for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
+   if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)))
+   return true;
+   if (xa_load(&kvm->mem_attr_array, gfn) != entry)
+   return true;
+   }
+   return false;
+}
+
+void kvm_arch_update_mem_attr(struct kvm *kvm, struct kvm_memory_slot *slot,
+ unsigned int attr, gfn_t start, gfn_t end)
+{
+
+   unsigned long lpage_start, lpage_end;
+   unsigned long gfn, pages, mask;
+   int level;
+
+   WARN_ONCE(!(attr & (KVM_

[PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory

2022-10-25 Thread Chao Peng
In memory encryption usage, guest memory may be encrypted with a special
key and can be accessed only by the guest itself. We call such memory
private memory. It is of little value, and can sometimes cause problems,
to allow userspace to access guest private memory. This new KVM memslot
extension allows guest private memory to be provided through a
restrictedmem-backed file descriptor (fd), while userspace is restricted
from accessing the bookmarked memory in the fd.

This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
additional KVM memslot fields, restricted_fd/restricted_offset, to allow
userspace to instruct KVM to provide guest memory through restricted_fd.
'guest_phys_addr' is mapped at the restricted_offset of restricted_fd
and the size is 'memory_size'.

The extended memslot can still have the userspace_addr (hva). When used,
a single memslot can maintain both private memory through restricted_fd
and shared memory through userspace_addr. Whether the private or the
shared part is visible to the guest is maintained by other KVM code.

A restrictedmem_notifier field is also added to the memslot structure to
allow the restricted_fd's backing store to notify KVM of memory changes;
KVM can then invalidate its page table entries.

Together with the change, a new config HAVE_KVM_RESTRICTED_MEM is added
and right now it is selected on X86_64 only. A KVM_CAP_PRIVATE_MEM is
also introduced to indicate KVM support for KVM_MEM_PRIVATE.

To make code maintenance easy, internally we use a binary-compatible
alias struct kvm_user_mem_region to handle both the normal and the
'_ext' variants.
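
[Editor's sketch, not part of the patch: how a VMM might register such a
slot, assuming the KVM_MEM_PRIVATE flag and the
kvm_userspace_memory_region_ext layout from this series are available in
<linux/kvm.h>; they are not in mainline as-is.]

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int set_private_slot(int vm_fd, __u32 slot, __u64 gpa, __u64 size,
			    __u64 hva, int restricted_fd,
			    __u64 restricted_offset)
{
	struct kvm_userspace_memory_region_ext ext;

	memset(&ext, 0, sizeof(ext));
	ext.region.slot            = slot;
	ext.region.flags           = KVM_MEM_PRIVATE;
	ext.region.guest_phys_addr = gpa;
	ext.region.memory_size     = size;
	/* The shared part is still reachable through the ordinary hva. */
	ext.region.userspace_addr  = hva;
	/* The private part comes from the restrictedmem fd at this offset. */
	ext.restricted_fd          = restricted_fd;
	ext.restricted_offset      = restricted_offset;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
}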

Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
---
 Documentation/virt/kvm/api.rst | 48 -
 arch/x86/kvm/Kconfig   |  2 ++
 arch/x86/kvm/x86.c |  2 +-
 include/linux/kvm_host.h   | 13 +++--
 include/uapi/linux/kvm.h   | 29 
 virt/kvm/Kconfig   |  3 +++
 virt/kvm/kvm_main.c| 49 --
 7 files changed, 128 insertions(+), 18 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index eee9f857a986..f3fa75649a78 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1319,7 +1319,7 @@ yet and must be cleared on entry.
 :Capability: KVM_CAP_USER_MEMORY
 :Architectures: all
 :Type: vm ioctl
-:Parameters: struct kvm_userspace_memory_region (in)
+:Parameters: struct kvm_userspace_memory_region(_ext) (in)
 :Returns: 0 on success, -1 on error
 
 ::
@@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
__u64 userspace_addr; /* start of the userspace allocated memory */
   };
 
+  struct kvm_userspace_memory_region_ext {
+   struct kvm_userspace_memory_region region;
+   __u64 restricted_offset;
+   __u32 restricted_fd;
+   __u32 pad1;
+   __u64 pad2[14];
+  };
+
   /* for kvm_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES  (1UL << 0)
   #define KVM_MEM_READONLY (1UL << 1)
+  #define KVM_MEM_PRIVATE  (1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
-writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
-to make a new slot read-only.  In this case, writes to this memory will be
-posted to userspace as KVM_EXIT_MMIO exits.
+The kvm_userspace_memory_region_ext struct includes all fields of the
+kvm_userspace_memory_region struct and adds additional fields for some other
+features. See the description of the flags field below for more information.
+It's recommended to use kvm_userspace_memory_region_ext in new userspace code.
+
+The flags field supports the following flags:
+
+- KVM_MEM_LOG_DIRTY_PAGES to instruct KVM to keep track of writes to memory
+  within the slot.  For more details, see KVM_GET_DIRTY_LOG ioctl.
+
+- KVM_MEM_READONLY, if KVM_CAP_READONLY_MEM allows, to make a new slot
+  read-only.  In this case, writes to this memory will be posted to userspace as
+  KVM_EXIT_MMIO exits.
+
+- KVM_MEM_PRIVATE, if KVM_CAP_PRIVATE_MEM allows, to indicate a new slot has
+  private memory backed by a file descriptor (fd) and userspace access to the
+  fd may be restricted. Userspace should use restricted_fd/restricted_offset in
+  kvm_userspace_memory_region_ext to instruct KVM to provide private memory
+  to the guest. Userspace should guarantee not to map the same pfn indicated by
+  restricted_fd/restric

[PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-10-25 Thread Chao Peng
This new KVM exit allows userspace to handle memory-related errors. It
indicates that an error happened in KVM at guest memory range
[gpa, gpa+size). The flags field includes additional information for
userspace to handle the error. Currently bit 0 is defined as 'private
memory', where '1' indicates the error happened due to a private memory
access and '0' indicates it happened due to a shared memory access.

When private memory is enabled, this new exit will be used for KVM to
exit to userspace for shared <-> private memory conversion in memory
encryption usage. In such usage, there are typically two kinds of memory
conversions (a minimal handling sketch follows the list below):
  - explicit conversion: happens when the guest explicitly calls into
    KVM to map a range (as private or shared); KVM then exits to
    userspace to perform the map/unmap operations.
  - implicit conversion: happens in the KVM page fault handler, where
    KVM exits to userspace for an implicit conversion when the page is
    in a different state than requested (private or shared).
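
[Editor's sketch, not part of the patch: a minimal VMM run-loop fragment
handling the new exit, assuming KVM_EXIT_MEMORY_FAULT and
KVM_MEMORY_EXIT_FLAG_PRIVATE from this series are in <linux/kvm.h>.
convert_range() is a hypothetical stand-in for whatever mechanism the
VMM uses to flip a range between private and shared before re-entering
the guest.]

#include <stdbool.h>
#include <linux/kvm.h>

static int convert_range(__u64 gpa, __u64 size, bool to_private)
{
	/* Hypothetical: VMM-specific conversion (e.g. hole punch / ioctl). */
	(void)gpa; (void)size; (void)to_private;
	return 0;
}

static int handle_exit(struct kvm_run *run)
{
	switch (run->exit_reason) {
	case KVM_EXIT_MEMORY_FAULT: {
		bool to_private = run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

		/* Convert the range to the state the guest asked for, then
		 * return to KVM_RUN so the faulting access is retried. */
		return convert_range(run->memory.gpa, run->memory.size,
				     to_private);
	}
	default:
		return -1;	/* unhandled exit reason */
	}
}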

Suggested-by: Sean Christopherson 
Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
---
 Documentation/virt/kvm/api.rst | 23 +++
 include/uapi/linux/kvm.h   |  9 +
 2 files changed, 32 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f3fa75649a78..975688912b8c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6537,6 +6537,29 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+   /* KVM_EXIT_MEMORY_FAULT */
+   struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
+   __u32 flags;
+   __u32 padding;
+   __u64 gpa;
+   __u64 size;
+   } memory;
+
+If the exit reason is KVM_EXIT_MEMORY_FAULT, it indicates that the VCPU has
+encountered a memory error which is not handled by the KVM kernel module and
+which userspace may choose to handle. The 'flags' field indicates the memory
+properties of the exit.
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by a
+   private memory access when the bit is set, and by a shared memory access
+   when the bit is clear.
+
+'gpa' and 'size' indicate the memory range where the error occurred. Userspace
+may handle the error and return to KVM to retry the previous memory access.
+
 ::
 
 /* KVM_EXIT_NOTIFY */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f1ae45c10c94..fa60b032a405 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -300,6 +300,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI35
 #define KVM_EXIT_RISCV_CSR36
 #define KVM_EXIT_NOTIFY   37
+#define KVM_EXIT_MEMORY_FAULT 38
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -538,6 +539,14 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0)
__u32 flags;
} notify;
+   /* KVM_EXIT_MEMORY_FAULT */
+   struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1 << 0)
+   __u32 flags;
+   __u32 padding;
+   __u64 gpa;
+   __u64 size;
+   } memory;
/* Fix the size of the union. */
char padding[256];
};
-- 
2.25.1




Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-21 Thread Chao Peng
On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
> On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov
>  wrote:
> >
> > On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > > On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
> > >  wrote:
> > > >
> > > > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > > > From: "Kirill A. Shutemov" 
> > > > > > > >
> > > > > > > > KVM can use memfd-provided memory for guest memory. For normal 
> > > > > > > > userspace
> > > > > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd 
> > > > > > > > into its
> > > > > > > > virtual address space and then tells KVM to use the virtual 
> > > > > > > > address to
> > > > > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > > > > >
> > > > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > > > memfd-provided memory may be encrypted with special key for 
> > > > > > > > special
> > > > > > > > software domain (e.g. KVM guest) and is not expected to be 
> > > > > > > > directly
> > > > > > > > accessed by userspace. Precisely, userspace access to such 
> > > > > > > > encrypted
> > > > > > > > memory may lead to host crash so it should be prevented.
> > > > > > > >
> > > > > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace 
> > > > > > > > through
> > > > > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed 
> > > > > > > > via
> > > > > > > > in-kernel interface so KVM can directly interact with core-mm 
> > > > > > > > without
> > > > > > > > the need to map the memory into KVM userspace.
> > > > > > > >
> > > > > > > > It provides semantics required for KVM guest private(encrypted) 
> > > > > > > > memory
> > > > > > > > support that a file descriptor with this flag set is going to 
> > > > > > > > be used as
> > > > > > > > the source of guest memory in confidential computing 
> > > > > > > > environments such
> > > > > > > > as Intel TDX/AMD SEV.
> > > > > > > >
> > > > > > > > KVM userspace is still in charge of the lifecycle of the memfd. 
> > > > > > > > It
> > > > > > > > should pass the opened fd to KVM. KVM uses the kernel APIs 
> > > > > > > > newly added
> > > > > > > > in this patch to obtain the physical memory address and then 
> > > > > > > > populate
> > > > > > > > the secondary page table entries.
> > > > > > > >
> > > > > > > > The userspace inaccessible memfd can be fallocate-ed and 
> > > > > > > > hole-punched
> > > > > > > > from userspace. When hole-punching happens, KVM can get 
> > > > > > > > notified through
> > > > > > > > inaccessible_notifier it then gets chance to remove any mapped 
> > > > > > > > entries
> > > > > > > > of the range in the secondary page tables.
> > > > > > > >
> > > > > > > > The userspace inaccessible memfd itself is implemented as a 
> > > > > > > > shim layer
> > > > > > > > on top of real memory file systems like tmpfs/hugetlbfs but 
> > > > > > > > this patch
> > > > > > > > only implemented tmpfs. The allocated memory is currently 
> > > > > > > > marked as
> > > > > > > > unmovable and unevictable, this is required for current 
>
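
[Editor's sketch, not part of the thread: a minimal userspace flow for
the inaccessible memfd described in the quoted patch. MFD_INACCESSIBLE
is the flag proposed there and is not in mainline; the fallocate-based
hole punch mentioned in the comment is what triggers the in-kernel
notifier described above.]

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/mman.h>
#include <linux/memfd.h>

static int create_private_backing(off_t size)
{
	/* Proposed flag: contents cannot be read/written/mmapped by userspace. */
	int fd = memfd_create("guest-private", MFD_INACCESSIBLE);

	if (fd < 0)
		return -1;

	/* Size the file; pages are served by the tmpfs-backed shim. */
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}

	/* Later, punching a hole releases memory and notifies KVM so it can
	 * zap any secondary page table entries covering the range:
	 *   fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len);
	 */
	return fd;
}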
