Re: Unmapping KVM Guest Memory from Host Kernel

2024-03-08 Thread Sean Christopherson
On Fri, Mar 08, 2024, James Gowans wrote:
> However, memfd_secret doesn’t work out of the box for KVM guest memory; the
> main reason seems to be that the GUP path is intentionally disabled for
> memfd_secret, so if we use a memfd_secret backed VMA for a memslot then
> KVM is not able to fault the memory in. If it’s been pre-faulted in by
> userspace then it seems to work.

Huh, that _shouldn't_ work.  The folio_is_secretmem() in gup_pte_range() is
supposed to prevent the "fast gup" path from getting secretmem pages.

Is this on an upstream kernel?  If so, and if you have bandwidth, can you figure
out why that isn't working?  At the very least, I suspect the memfd_secret
maintainers would be very interested to know that it's possible to fast gup
secretmem.
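
For reference, the guard in question looks roughly like this in
mm/gup.c:gup_pte_range() (a sketch with the surrounding context elided, so
double-check your tree):

	folio = try_grab_folio(page, 1, flags);
	if (!folio)
		goto pte_unmap;

	if (unlikely(folio_is_secretmem(folio))) {
		/* Drop the speculative ref and punt to the slow GUP path,
		 * which then refuses to GUP secretmem entirely. */
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}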

> There are a few other issues around when KVM accesses the guest memory.
> For example the KVM PV clock code goes directly to the PFN via the
> pfncache, and that also breaks if the PFN is not in the direct map, so
> we’d need to change that sort of thing, perhaps going via userspace
> addresses.
> 
> If we remove the memfd_secret check from the GUP path, and disable KVM’s
> pvclock from userspace via KVM_CPUID_FEATURES, we are able to boot a
> simple Linux initrd using a Firecracker VMM modified to use
> memfd_secret.
> 
> We are also aware of ongoing work on guest_memfd. The current
> implementation unmaps guest memory from VMM address space, but leaves it
> in the kernel’s direct map. We’re not looking at unmapping from VMM
> userspace yet; we still need guest RAM there for PV drivers like virtio
> to continue to work. So KVM’s gmem doesn’t seem like the right solution?

We (and by "we", I really mean the pKVM folks) are also working on allowing
userspace to mmap() guest_memfd[*].  pKVM aside, the long term vision I have for
guest_memfd is to be able to use it for non-CoCo VMs, precisely for the security
and robustness benefits it can bring.

What I am hoping to do with guest_memfd is get userspace to only map memory it
needs, e.g. for emulated/synthetic devices, on-demand.  I.e. to get to a state
where guest memory is mapped only when it needs to be.  More below.

> With this in mind, what’s the best way to solve getting guest RAM out of
> the direct map? Is memfd_secret integration with KVM the way to go, or
> should we build a solution on top of guest_memfd, for example via some
> flag that causes it to leave memory in the host userspace’s page tables,
> but removes it from the direct map? 

100% enhance guest_memfd.  If you're willing to wait long enough, pKVM might even
do all the work for you. :-)

The killer feature of guest_memfd is that it allows the guest mappings to be a
superset of the host userspace mappings.  Most obviously, it allows mapping
memory into the guest without first mapping the memory into the userspace page
tables.  More subtly, it also makes it easier (in theory) to do things like map
the memory with 1GiB hugepages for the guest, but selectively map at 4KiB
granularity in the host.  Or map memory as RWX in the guest, but RO in the host
(I don't have a concrete use case for this, just pointing out it'll be trivial
to do once guest_memfd supports mmap()).

Every attempt to allow mapping VMA-based memory into a guest without it being
accessible by host userspace has failed; it's literally why we ended up
implementing guest_memfd.  We could teach KVM to do the same with memfd_secret,
but we'd just end up re-implementing guest_memfd.

memfd_secret obviously gets you a PoC much faster, but in the long term I'm quite
sure you'll be fighting memfd_secret all the way.  E.g. it's not dumpable, it
deliberately allocates at 4KiB granularity (though I suspect the bug you found
means that it can be inadvertently mapped with 2MiB hugepages), it has no line
of sight to taking userspace out of the equation, etc.

With guest_memfd on the other hand, everyone contributing to and maintaining it
has goals that are *very* closely aligned with what you want to do.

[*] https://lore.kernel.org/all/20240222161047.402609-1-ta...@google.com



Re: Unmapping KVM Guest Memory from Host Kernel

2024-03-08 Thread Sean Christopherson
On Fri, Mar 08, 2024, David Woodhouse wrote:
> On Fri, 2024-03-08 at 09:35 -0800, David Matlack wrote:
> > I think what James is looking for (and what we are also interested
> > in), is _eliminating_ the ability to access guest memory from the
> > direct map entirely. And in general, eliminate the ability to access
> > guest memory in as many ways as possible.
> 
> Well, pKVM does that... 

Out-of-tree :-)

I'm not just being snarky; when pKVM lands this functionality upstream, I fully
expect zapping direct map entries to be generic guest_memfd functionality that
would be opt-in, either by the in-kernel technology, e.g. pKVM, or by userspace,
or by some combination of the two, e.g. I can see making it optional to nuke the
direct map when using guest_memfd for TDX guests so that rogue accesses from the
host generate synchronous #PFs instead of latent #MCs.
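
To sketch what that opt-in could look like: the guest_memfd hook below is
purely hypothetical (no such code exists upstream), but the mechanism would
mirror mm/secretmem.c, and set_direct_map_invalid_noflush() and
flush_tlb_kernel_range() are existing kernel APIs:

#include <linux/set_memory.h>

/* Hypothetical guest_memfd hook: pull a gmem folio out of the kernel's
 * direct map, memfd_secret-style. */
static int gmem_zap_direct_map(struct folio *folio)
{
	unsigned long addr = (unsigned long)folio_address(folio);
	long i;
	int r;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		r = set_direct_map_invalid_noflush(folio_page(folio, i));
		if (r)
			return r;
	}

	/* Stale TLB entries for the zapped range must be purged. */
	flush_tlb_kernel_range(addr, addr + folio_size(folio));
	return 0;
}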



Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE

2023-11-30 Thread Sean Christopherson
On Thu, Nov 30, 2023, David Hildenbrand wrote:
> On 30.11.23 08:32, Xiaoyao Li wrote:
> > On 11/20/2023 5:26 PM, David Hildenbrand wrote:
> > > 
> > > > > ... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)
> > > > 
> > > > Got caught.
> > > > 
> > > > > This should be factored out into a common helper.
> > > > 
> > > > Sure, will do it in next version.
> > > 
> > > Factor it out in a separate patch. Then, this patch gets small enough that
> > > you can just squash it into #2.
> > > 
> > > And my comment regarding "flags = 0" to patch #2 does no longer apply :)
> > > 
> > 
> > I see.
> > 
> > But it depends on if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will appear together
> > with initial guest memfd in linux (hopefully 6.8)
> > https://lore.kernel.org/all/CABgObfa=dh7fysbvif63os9svog_wt-aqygtuagkqny5biz...@mail.gmail.com/
> > 
> 
> Doesn't seem to be in -next if I am looking at the right tree:
> 
> https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next

Yeah, we punted on adding hugepage support for the initial guest_memfd merge so
as not to rush in kludgy uABI.  The internal KVM code isn't problematic, we just
haven't figured out exactly what the ABI should look like, e.g. should hugepages
be dependent on THP being enabled, and if not, how does userspace discover the
supported hugepage sizes?



Re: [RFC PATCH v2 00/21] QEMU gmem implemention

2023-09-14 Thread Sean Christopherson
On Thu, Sep 14, 2023, David Hildenbrand wrote:
> On 14.09.23 05:50, Xiaoyao Li wrote:
> > It's the v2 RFC of enabling KVM gmem[1] as the backend for private
> > memory.
> > 
> > For confidential computing, KVM provides gmem/guest_mem interfaces for
> > userspace, like QEMU, to allocate user-inaccessible private memory. This
> > series aims to add gmem support in QEMU's RAMBlock so that each RAM can
> > have both hva-based shared memory and gmem_fd based private memory. QEMU
> > does the shared-private conversion on KVM_MEMORY_EXIT and discards the
> > memory.
> > 
> > It chooses the design that adds a "private" property to the host memory
> > backend. If the "private" property is set, QEMU will allocate/create KVM
> > gmem when initializing the RAMBlock of the memory backend.
> > 
> > This series also introduces the first user of KVM gmem,
> > KVM_X86_SW_PROTECTED_VM. A KVM_X86_SW_PROTECTED_VM with private KVM gmem
> > can be created with
> > 
> >$qemu -object sw-protected-vm,id=sp-vm0 \
> > -object memory-backend-ram,id=mem0,size=1G,private=on \
> > -machine q35,kernel_irqchip=split,confidential-guest-support=sp-vm0,memory-backend=mem0 \
> > ...
> > 
> > Unfortunately this patch series fails the boot of OVMF at a very early
> > stage due to a triple fault, because KVM doesn't support emulating string IO
> > to private memory.
> 
> Is support being added? Or have we figured out what it would take to make it
> work?

Hrm, this isn't something I've thought deeply about.  The issue is that anything
that reaches any form of copy_{from,to}_user() will go kablooie because KVM will
always try to read/write the shared mappings.  The best case scenario is that the
shared mapping is invalid and the uaccess faults.  The worst case scenario is
that KVM read/writes the wrong memory and sends the guest into the weeds.  Eww.

And we (well, at least I) definitely want to support this so that gmem can be
used for "regular" VMs, i.e. for VMs where userspace is in the TCB, but for which
userspace doesn't have access to guest memory by default.

It shouldn't be too hard to support.  It's easy enough to wire up the hook
(thankfully there aren't _that_ many sites), and gmem only supports struct page at
the moment so we go straight to kmap.  E.g. something like this

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 54480655bcce..b500b0ce5ce3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3291,12 +3291,15 @@ static int next_segment(unsigned long len, int offset)
return len;
 }
 
-static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
-void *data, int offset, int len)
+static int __kvm_read_guest_page(struct kvm *kvm, struct kvm_memory_slot *slot,
+gfn_t gfn, void *data, int offset, int len)
 {
int r;
unsigned long addr;
 
+   if (kvm_mem_is_private(kvm, gfn))
+   return kvm_gmem_read(slot, gfn, data, offset, len);
+
addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
if (kvm_is_error_hva(addr))
return -EFAULT;
@@ -3309,9 +3312,8 @@ static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
 int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
int len)
 {
-   struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
-
-   return __kvm_read_guest_page(slot, gfn, data, offset, len);
+   return __kvm_read_guest_page(kvm, gfn_to_memslot(kvm, gfn), gfn, data,
+offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_read_guest_page);
 
@@ -3320,7 +3322,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data,
 {
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 
-   return __kvm_read_guest_page(slot, gfn, data, offset, len);
+   return __kvm_read_guest_page(vcpu->kvm, slot, gfn, data, offset, len);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page);
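
For completeness, a hypothetical kvm_gmem_read() to pair with the diff above;
kvm_gmem_get_page() is a made-up helper, and the kmap approach follows from
gmem being struct-page backed as noted above:

static int kvm_gmem_read(struct kvm_memory_slot *slot, gfn_t gfn,
			 void *data, int offset, int len)
{
	struct page *page;
	void *kaddr;

	/* Hypothetical helper: resolve gfn to the backing gmem page. */
	page = kvm_gmem_get_page(slot, gfn);
	if (IS_ERR(page))
		return -EFAULT;

	/* gmem is struct-page backed, so a local kmap suffices. */
	kaddr = kmap_local_page(page);
	memcpy(data, kaddr + offset, len);
	kunmap_local(kaddr);

	put_page(page);
	return 0;
}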
 
> > 2. hugepage support.
> > 
> > KVM gmem can be allocated from hugetlbfs. How does QEMU determine

Not yet it can't.  gmem only supports THP, hugetlbfs is a future thing, if it's
ever supported.  I wouldn't be at all surprised if we end up going down a slightly
different route and don't use hugetlbfs directly.

> > when to allocate KVM gmem with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE. The
> > easiest solution is create KVM gmem with KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
> > only when memory backend is HostMemoryBackendFile of hugetlbfs.
> 
> Good question.
> 
> Probably "if the memory backend uses huge pages, also use huge pages for the
> private gmem" makes sense.
> 
> ... but it becomes a mess with preallocation ... which is what people should
> actually be using with hugetlb. And eventual double memory-consumption ...
> but maybe that's all been taken care of already?
> 
> Probably it's best to leave hugetlb support as future work and start 

Re: [RFC PATCH 04/19] memory: Introduce memory_region_can_be_private()

2023-08-01 Thread Sean Christopherson
On Mon, Jul 31, 2023, Peter Xu wrote:
> On Mon, Jul 31, 2023 at 05:36:37PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 31, 2023 at 02:34:22PM -0700, Sean Christopherson wrote:
> > > On Mon, Jul 31, 2023, Peter Xu wrote:
> > > > On Mon, Jul 31, 2023 at 12:21:46PM -0400, Xiaoyao Li wrote:
> > > > > +bool memory_region_can_be_private(MemoryRegion *mr)
> > > > > +{
> > > > > +return mr->ram_block && mr->ram_block->gmem_fd >= 0;
> > > > > +}
> > > > 
> > > > This is not really MAP_PRIVATE, am I right?  If so, is there still chance
> > > > we rename it (it seems to be also in the kernel proposal all across..)?
> > > 
> > > Yes and yes.
> > > 
> > > > I worry it can be very confusing in the future against MAP_PRIVATE /
> > > > MAP_SHARED otherwise.
> > > 
> > > Heh, it's already quite confusing at times.  I'm definitely open to naming that
> > > doesn't collide with MAP_{PRIVATE,SHARED}, especially if someone can come up with
> > > a naming scheme that includes a succinct way to describe memory that is shared
> > > between two or more VMs, but is accessible to _only_ those VMs.
> > 
> > Standard solution is a technology specific prefix.
> > protect_shared, encrypt_shared etc.
> 
> Agreed, a prefix could definitely help (if nothing better comes at last..).
> If e.g. "encrypted" is too long to be applied everywhere in var names and
> functions, maybe it can also be "enc_{private|shared}".

FWIW, I would stay away from "encrypted", there is no requirement that the memory
actually be encrypted.



Re: [RFC PATCH 04/19] memory: Introduce memory_region_can_be_private()

2023-07-31 Thread Sean Christopherson
On Mon, Jul 31, 2023, Peter Xu wrote:
> On Mon, Jul 31, 2023 at 12:21:46PM -0400, Xiaoyao Li wrote:
> > +bool memory_region_can_be_private(MemoryRegion *mr)
> > +{
> > +return mr->ram_block && mr->ram_block->gmem_fd >= 0;
> > +}
> 
> This is not really MAP_PRIVATE, am I right?  If so, is there still chance
> we rename it (it seems to be also in the kernel proposal all across..)?

Yes and yes.

> I worry it can be very confusing in the future against MAP_PRIVATE /
> MAP_SHARED otherwise.

Heh, it's already quite confusing at times.  I'm definitely open to naming that
doesn't collide with MAP_{PRIVATE,SHARED}, especially if someone can come up with a
naming scheme that includes a succinct way to describe memory that is shared
between two or more VMs, but is accessible to _only_ those VMs.



Re: [ANNOUNCE] KVM Microconference at LPC 2023

2023-06-01 Thread Sean Christopherson
On Thu, Jun 01, 2023, Mickaël Salaün wrote:
> Hi,
> 
> What is the status of this microconference proposal? We'd be happy to talk
> about Heki [1] and potentially other hypervisor supports.

Proposal submitted (deadline is/was today), now we wait :-)  IIUC, we should find
out rather quickly whether or not the KVM MC is a go.



Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-31 Thread Sean Christopherson
On Tue, May 30, 2023, Rick P Edgecombe wrote:
> On Fri, 2023-05-26 at 17:22 +0200, Mickaël Salaün wrote:
> > > > Can the guest kernel ask the host VMM's emulated devices to DMA into
> > > > the protected data? It should go through the host userspace mappings I
> > > > think, which don't care about EPT permissions. Or did I miss where you
> > > > are protecting that another way? There are a lot of easy ways to ask
> > > > the host to write to guest memory that don't involve the EPT. You
> > > > probably need to protect the host userspace mappings, and also the
> > > > places in KVM that kmap a GPA provided by the guest.
> > > 
> > > Good point, I'll check this confused deputy attack. Extended KVM
> > > protections should indeed handle all ways to map guests' memory.  I'm
> > > wondering if current VMMs would gracefully handle such new restrictions
> > > though.
> > 
> > I guess the host could map arbitrary data to the guest, so that need to be
> > handled, but how could the VMM (not the host kernel) bypass/update EPT
> > initially used for the guest (and potentially later mapped to the host)?
> 
> Well traditionally both QEMU and KVM accessed guest memory via host
> mappings instead of the EPT. So I'm wondering what is stopping the
> guest from passing a protected gfn when setting up the DMA, and QEMU
> being enticed to write to it? The emulator as well would use these host
> userspace mappings and not consult the EPT IIRC.
> 
> I think Sean was suggesting host userspace should be more involved in
> this process, so perhaps it could protect its own alias of the
> protected memory, for example mprotect() it as read-only.

Ya, though "suggesting" is really "demanding, unless someone provides super strong
justification for handling this directly in KVM".  It's basically the same argument
that led to Linux Security Modules: I'm all for KVM providing the framework and
plumbing, but I don't want KVM to get involved in defining policy, threat models, etc.



Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-25 Thread Sean Christopherson
On Thu, May 25, 2023, Rick P Edgecombe wrote:
> I wonder if it might be a good idea to POC the guest side before
> settling on the KVM interface. Then you can also look at the whole
> thing and judge how much usage it would get for the different options
> of restrictions.

As I said earlier[*], IMO the control plane logic needs to live in host userspace.
I think any attempt to have KVM provide anything but the low level plumbing will
suffer the same fate as CR4 pinning and XO memory.  Iterating on an imperfect
solution to incrementally improve security is far, far easier to do in userspace,
and far more likely to get merged.

[*] https://lore.kernel.org/all/zfuyhpuhtmbyd...@google.com



Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*

2023-05-24 Thread Sean Christopherson
On Wed, May 24, 2023, Peter Zijlstra wrote:
> On Wed, May 24, 2023 at 01:16:03PM -0700, Sean Christopherson wrote:
> > Of course, the only accesses outside of mmu_lock are reads, so on x86 that
> > "atomic" access is just a READ_ONCE() load, but that's not the case for all
> > architectures.
> 
> This is true on *all* archs. atomic_set() and atomic_read() are no more
> and no less than WRITE_ONCE() / READ_ONCE().

Ah, I take it s390's handcoded assembly routines are just paranoid equivalents
and not truly special?  "l" and "st" do sound quite generic...

  commit 7657e41a0bd16c9d8b3cefe8fd5d6ac3c25ae4bf
  Author: Heiko Carstens 
  Date:   Thu Feb 17 13:13:58 2011 +0100

[S390] atomic: use inline asm

Use inline assemblies for atomic_read/set(). This way there shouldn't
be any questions or subtle volatile semantics left.

static inline int __atomic_read(const atomic_t *v)
{
int c;

asm volatile(
"   l   %0,%1\n"
: "=d" (c) : "R" (v->counter));
return c;
}

static inline void __atomic_set(atomic_t *v, int i)
{
asm volatile(
"   st  %1,%0\n"
: "=R" (v->counter) : "d" (i));
}



Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*

2023-05-24 Thread Sean Christopherson
On Wed, May 24, 2023, Kautuk Consul wrote:
> On 2023-05-23 07:19:43, Sean Christopherson wrote:
> > On Tue, May 23, 2023, Kautuk Consul wrote:
> > > On 2022-07-06 16:20:10, Chao Peng wrote:
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index e9153b54e2a4..c262ebb168a7 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -765,10 +765,10 @@ struct kvm {
> > > >  
> > > >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > > > struct mmu_notifier mmu_notifier;
> > > > -   unsigned long mmu_notifier_seq;
> > > > -   long mmu_notifier_count;
> > > > -   gfn_t mmu_notifier_range_start;
> > > > -   gfn_t mmu_notifier_range_end;
> > > > +   unsigned long mmu_updating_seq;
> > > > +   long mmu_updating_count;
> > > 
> > > Can we convert mmu_updating_seq and mmu_updating_count to atomic_t ?
> > 
> > Heh, can we?  Yes.  Should we?  No.
> > 
> > > I see that not all accesses to these are under the kvm->mmu_lock
> > > spinlock.
> > 
> > Ya, working as intended.  Ignoring gfn_to_pfn_cache for the moment, all accesses
> > to mmu_invalidate_in_progress (was mmu_notifier_count / mmu_updating_count above)
> > are done under mmu_lock.  And for mmu_notifier_seq (mmu_updating_seq above),
> > all writes and some reads are done under mmu_lock.  The only reads that are done
> > outside of mmu_lock are the initial snapshots of the sequence number.
> > 
> > gfn_to_pfn_cache uses a different locking scheme, the comments in
> > mmu_notifier_retry_cache() do a good job explaining the ordering.
> > 
> > > This will also remove the need for putting separate smp_wmb() and
> > > smp_rmb() memory barriers while accessing these structure members.
> > 
> > No, the memory barriers aren't there to provide any kind of atomicity.  The
> > barriers exist to ensure that stores and loads to/from the sequence and
> > invalidate in-progress counts are ordered relative to the invalidation (stores
> > to counts) and creation (loads) of SPTEs.  Making the counts atomic changes
> > nothing because atomic operations don't guarantee the necessary ordering.
> I'm not saying that the memory barriers provide atomicity.
> My comment was based on the assumption that "all atomic operations are
> implicit memory barriers". If that assumption is true then we won't need
> the memory barriers here if we use atomic operations for protecting
> these 2 structure members.

Atomics aren't memory barriers on all architectures, e.g. see the various
definitions of smp_mb__after_atomic().

Even if atomic operations did provide barriers, using an atomic would be overkill
and a net negative.  On strongly ordered architectures like x86, memory barriers
are just compiler barriers, whereas atomics may be more expensive.  Of course, the
only accesses outside of mmu_lock are reads, so on x86 that "atomic" access is just
a READ_ONCE() load, but that's not the case for all architectures.

Anyways, the point is that atomics and memory barriers are different things that
serve different purposes.



Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*

2023-05-23 Thread Sean Christopherson
On Tue, May 23, 2023, Kautuk Consul wrote:
> On 2022-07-06 16:20:10, Chao Peng wrote:
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index e9153b54e2a4..c262ebb168a7 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -765,10 +765,10 @@ struct kvm {
> >  
> >  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
> > struct mmu_notifier mmu_notifier;
> > -   unsigned long mmu_notifier_seq;
> > -   long mmu_notifier_count;
> > -   gfn_t mmu_notifier_range_start;
> > -   gfn_t mmu_notifier_range_end;
> > +   unsigned long mmu_updating_seq;
> > +   long mmu_updating_count;
> 
> Can we convert mmu_updating_seq and mmu_updating_count to atomic_t ?

Heh, can we?  Yes.  Should we?  No.

> I see that not all accesses to these are under the kvm->mmu_lock
> spinlock.

Ya, working as intended.  Ignoring gfn_to_pfn_cache for the moment, all accesses
to mmu_invalidate_in_progress (was mmu_notifier_count / mmu_updating_count above)
are done under mmu_lock.  And for mmu_notifier_seq (mmu_updating_seq above),
all writes and some reads are done under mmu_lock.  The only reads that are done
outside of mmu_lock are the initial snapshots of the sequence number.

gfn_to_pfn_cache uses a different locking scheme, the comments in
mmu_notifier_retry_cache() do a good job explaining the ordering.

> This will also remove the need for putting separate smp_wmb() and
> smp_rmb() memory barriers while accessing these structure members.

No, the memory barriers aren't there to provide any kind of atomicity.  The
barriers exist to ensure that stores and loads to/from the sequence and invalidate
in-progress counts are ordered relative to the invalidation (stores to counts)
and creation (loads) of SPTEs.  Making the counts atomic changes nothing because
atomic operations don't guarantee the necessary ordering.

E.g. when handling a page fault, KVM snapshots the sequence outside of mmu_lock
_before_ touching any state that is involved in resolving the host pfn, e.g.
primary MMU state (VMAs, host page tables, etc.).  After the page fault task
acquires mmu_lock, KVM checks that there are no in-progress invalidations and
that the sequence count is the same.  This ensures that if there is a concurrent
page fault and invalidation event, the page fault task will either acquire
mmu_lock and create SPTEs _before_ the invalidation is processed, or the page
fault task will observe either an elevated mmu_invalidate_in_progress or a
different sequence count, and thus retry the page fault, if the page fault task
acquires mmu_lock after the invalidation event.
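
Condensed into a sketch (the field names follow current upstream, but
resolve_host_pfn() and make_sptes() are stand-ins, not real KVM functions):

static int fault_in_gfn(struct kvm *kvm, gfn_t gfn)
{
	unsigned long mmu_seq;
	kvm_pfn_t pfn;

retry:
	/* Outside mmu_lock: snapshot the sequence _before_ touching any
	 * primary MMU state.  The smp_rmb() pairs with the smp_wmb() in
	 * the invalidation path. */
	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	pfn = resolve_host_pfn(kvm, gfn);  /* walks VMAs, host page tables */

	write_lock(&kvm->mmu_lock);
	if (kvm->mmu_invalidate_in_progress ||
	    kvm->mmu_invalidate_seq != mmu_seq) {
		/* Raced with an invalidation: drop the lock and retry. */
		write_unlock(&kvm->mmu_lock);
		goto retry;
	}
	make_sptes(kvm, gfn, pfn);         /* safe: no invalidation raced */
	write_unlock(&kvm->mmu_lock);
	return 0;
}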



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2023-05-19 Thread Sean Christopherson
On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> Hi Sean,
> 
> On Fri May 19, 2023 at 6:23 PM UTC, Sean Christopherson wrote:
> > On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> > > Hi,
> > >
> > > On Fri Dec 2, 2022 at 6:13 AM UTC, Chao Peng wrote:
> > >
> > > [...]
> > > > +The user sets the per-page memory attributes to a guest memory range indicated
> > > > +by address/size, and in return KVM adjusts address and size to reflect the
> > > > +actual pages of the memory range have been successfully set to the attributes.
> > > > +If the call returns 0, "address" is updated to the last successful address + 1
> > > > +and "size" is updated to the remaining address size that has not been set
> > > > +successfully. The user should check the return value as well as the size to
> > > > +decide if the operation succeeded for the whole range or not. The user may want
> > > > +to retry the operation with the returned address/size if the previous range was
> > > > +partially successful.
> > > > +
> > > > +Both address and size should be page aligned and the supported attributes can be
> > > > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > > > +
> > > > +The "flags" field may be used for future extensions and should be set to 0s.
> > >
> > > We have been looking into adding support for the Hyper-V VSM extensions
> > > which Windows uses to implement Credential Guard. This interface seems
> > > like a good fit for one of its underlying features. I just wanted to
> > > share a bit about it, and see if we can expand it to fit this use-case.
> > > Note that this was already briefly discussed between Sean and Alex some
> > > time ago[1].
> > >
> > > VSM introduces isolated guest execution contexts called Virtual Trust
> > > Levels (VTL) [2]. Each VTL has its own memory access protections,
> > > virtual processors states, interrupt controllers and overlay pages. VTLs
> > > are hierarchical and might enforce memory protections on less privileged
> > > VTLs. Memory protections are enforced on a per-GPA granularity.
> > >
> > > The list of possible protections is:
> > > - No access -- This needs a new memory attribute, I think.
> >
> > No, if KVM provides three bits for READ, WRITE, and EXECUTE, then userspace can
> > get all the possible combinations.  E.g. this is RWX=000b
> 
> That's not what the current implementation does; when attributes equals
> 0 it clears the entries from the xarray:
> 
> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>                                            struct kvm_memory_attributes *attrs)
> {
> 
>         entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
>         [...]
>         for (i = start; i < end; i++)
>                 if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
>                                     GFP_KERNEL_ACCOUNT)))
>                         break;
> }
> 
> From Documentation/core-api/xarray.rst:
> 
> "There is no difference between an entry that has never been stored to,
> one that has been erased and one that has most recently had ``NULL``
> stored to it."
> 
> The way I understood the series, there needs to be a differentiation
> between no attributes (regular page fault) and no-access.

Ah, I see what you're saying.  There are multiple ways to solve things without a
"no access" flag while still maintaining an empty xarray for the default case.
E.g. invert the flags to be DENY flags[*], have an internal-only "entry valid"
flag, etc.

[*] I vaguely recall suggesting a "deny" approach somewhere, but I may just be
making things up to make it look like I thought deeply about this ;-)
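
E.g. a purely illustrative DENY encoding (not the upstream uAPI) that keeps the
xarray empty by default:

/* Illustrative only: with inverted DENY semantics, an absent xarray
 * entry naturally means "no restrictions", and "no access" needs no
 * dedicated flag. */
#define KVM_MEMORY_ATTRIBUTE_DENY_READ		BIT(0)
#define KVM_MEMORY_ATTRIBUTE_DENY_WRITE		BIT(1)
#define KVM_MEMORY_ATTRIBUTE_DENY_EXEC		BIT(2)

/* RWX=000b, i.e. "no access", is simply all DENY bits set, which is a
 * non-NULL xarray value; RWX=111b is "no entry", keeping the xarray
 * empty for the common (unrestricted) case. */
#define KVM_MEMORY_ATTRIBUTE_NO_ACCESS		\
	(KVM_MEMORY_ATTRIBUTE_DENY_READ |	\
	 KVM_MEMORY_ATTRIBUTE_DENY_WRITE |	\
	 KVM_MEMORY_ATTRIBUTE_DENY_EXEC)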



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2023-05-19 Thread Sean Christopherson
On Fri, May 19, 2023, Nicolas Saenz Julienne wrote:
> Hi,
> 
> On Fri Dec 2, 2022 at 6:13 AM UTC, Chao Peng wrote:
> 
> [...]
> > +The user sets the per-page memory attributes to a guest memory range indicated
> > +by address/size, and in return KVM adjusts address and size to reflect the
> > +actual pages of the memory range have been successfully set to the attributes.
> > +If the call returns 0, "address" is updated to the last successful address + 1
> > +and "size" is updated to the remaining address size that has not been set
> > +successfully. The user should check the return value as well as the size to
> > +decide if the operation succeeded for the whole range or not. The user may want
> > +to retry the operation with the returned address/size if the previous range was
> > +partially successful.
> > +
> > +Both address and size should be page aligned and the supported attributes can be
> > +retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
> > +
> > +The "flags" field may be used for future extensions and should be set to 0s.
> 
> We have been looking into adding support for the Hyper-V VSM extensions
> which Windows uses to implement Credential Guard. This interface seems
> like a good fit for one of its underlying features. I just wanted to
> share a bit about it, and see if we can expand it to fit this use-case.
> Note that this was already briefly discussed between Sean and Alex some
> time ago[1].
> 
> VSM introduces isolated guest execution contexts called Virtual Trust
> Levels (VTL) [2]. Each VTL has its own memory access protections,
> virtual processors states, interrupt controllers and overlay pages. VTLs
> are hierarchical and might enforce memory protections on less privileged
> VTLs. Memory protections are enforced on a per-GPA granularity.
> 
> The list of possible protections is:
> - No access -- This needs a new memory attribute, I think.

No, if KVM provides three bits for READ, WRITE, and EXECUTE, then userspace can
get all the possible combinations.  E.g. this is RWX=000b

> - Read-only, no execute

RWX=100b (using my completely arbitrary ordering of RWX bits :-) )

> - Read-only, execute

RWX=101b

> - Read/write, no execute

RWX=110b

> - Read/write, execute

RWX=111b

> We implemented this in the past by using a separate address space per
> VTL and updating memory regions on protection changes. But having to
> update the memory slot layout for every permission change scales poorly,
> especially as we have to perform 100.000s of these operations at boot
> (see [1] for a little more context).
> 
> I believe the biggest barrier for us to use memory attributes is not
> having the ability to target specific address spaces, or to the very
> least having some mechanism to maintain multiple independent layers of
> attributes.

Can you elaborate on "specific address spaces"?  In KVM, that usually means SMM,
but the VTL comment above makes me think you're talking about something entirely
different.  E.g. can you provide a brief summary of the requirements/expectations?

> Also sorry for not posting our VSM patches. They are not ready for
> upstream review yet, but we're working on it.
> 
> Nicolas
> 
> [1] https://patchwork.kernel.org/comment/25054908/
> [2] See Chapter 15 of Microsoft's 'Hypervisor Top Level Functional Specification':
> 
> https://raw.githubusercontent.com/MicrosoftDocs/Virtualization-Documentation/main/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf



Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

2023-05-05 Thread Sean Christopherson
On Fri, May 05, 2023, Mickaël Salaün wrote:
> 
> On 05/05/2023 18:28, Sean Christopherson wrote:
> > I have no doubt that we'll need to solve performance and scaling issues with the
> > memory attributes implementation, e.g. to utilize xarray multi-range support
> > instead of storing information on a per-4KiB-page basis, but AFAICT, the core
> > idea is sound.  And a very big positive from a maintenance perspective is that
> > any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
> > benefit the other use case.
> > 
> > [1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com
> > [2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
> > [3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com
> 
> I agree, I used this mechanism because it was easier at first to rely on a
> previous work, but while I was working on the MBEC support, I realized that
> it's not the optimal way to do it.
> 
> I was thinking about using a new special EPT bit similar to
> EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you
> think?

On x86, SPTEs are even more ephemeral than memslots.  E.g. for historical reasons,
KVM zaps all SPTEs if _any_ memslot is deleted, which is problematic if the guest
is moving around BARs, using option ROMs, etc.

ARM's pKVM tracks metadata in its stage-2 PTEs, i.e. doesn't need an xarray to
track attributes, but that works only because pKVM is more privileged than the
host kernel, and the shared vs. private memory attribute that pKVM cares about
is very, very restricted in how it can be used and changed.

I tried shoehorning private vs. shared metadata into x86's SPTEs in the past, and
it ended up being a constant battle with the kernel, e.g. page migration, and with
KVM itself, e.g. the above memslot mess.



Re: [PATCH v1 4/9] KVM: x86: Add new hypercall to set EPT permissions

2023-05-05 Thread Sean Christopherson
On Fri, May 05, 2023, Mickaël Salaün wrote:
> 
> On 05/05/2023 18:44, Sean Christopherson wrote:
> > On Fri, May 05, 2023, Mickaël Salaün wrote:
> > > Add a new KVM_HC_LOCK_MEM_PAGE_RANGES hypercall that enables a guest to
> > > set EPT permissions on a set of page ranges.
> > 
> > IMO, manipulation of protections, both for memory (this patch) and CPU state
> > (control registers in the next patch) should come from userspace.  I have no
> > objection to KVM providing plumbing if necessary, but I think userspace needs
> > to have full control over the actual state.
> 
> By user space, do you mean the host user space or the guest user space?

Host userspace, a.k.a. the VMM.  Definitely not guest userspace.

> About the guest user space, I see several issues to delegate this kind of
> control:
> - These are restrictions only relevant to the kernel.
> - The threat model is to protect against user space as early as possible.
> - It would be more complex for no obvious gain.
> 
> This patch series is an extension of the kernel self-protections mechanisms,
> and they are not configured by user space.
> 
> 
> > 
> > One of the things that caused Intel's control register pinning series to stall
> > out was how to handle edge cases like kexec() and reboot.  Deferring to userspace
> > means the kernel doesn't need to define policy, e.g. when to unprotect memory,
> > and avoids questions like "should userspace be able to overwrite pinned control
> > registers".
> 
> The idea is to authenticate every changes. For kexec, the VMM (or something
> else) would have to authenticate the new kernel. Do you have something else
> in mind that could legitimately require such memory or CR changes?

I think we're on the same page, the VMM (host userspace) would need to ack any
changes.

FWIW, SMM is another wart as entry to SMM clobbers CRs.  Now that CONFIG_KVM_SMM
is a thing, the easiest solution would be to disallow coexistence with SMM, though
that might not be an option for many use cases (IIUC, QEMU-based deployments use
SMM to implement secure boot).

> > And like the confidential VM use case, keeping userspace in the loop is a big
> > benefit, e.g. the guest can't circumvent protections by coercing userspace into
> > writing to protected memory.
> 
> I don't understand this part. Are you talking about the host user space? How
> the guest could circumvent protections?

Host userspace.  Guest configures a device buffer in write-protected memory, gets
a host (synthetic) device to write into the memory.



Re: [PATCH v1 4/9] KVM: x86: Add new hypercall to set EPT permissions

2023-05-05 Thread Sean Christopherson
On Fri, May 05, 2023, Mickaël Salaün wrote:
> Add a new KVM_HC_LOCK_MEM_PAGE_RANGES hypercall that enables a guest to
> set EPT permissions on a set of page ranges.

IMO, manipulation of protections, both for memory (this patch) and CPU state
(control registers in the next patch) should come from userspace.  I have no
objection to KVM providing plumbing if necessary, but I think userspace needs
to have full control over the actual state.

One of the things that caused Intel's control register pinning series to stall
out was how to handle edge cases like kexec() and reboot.  Deferring to 
userspace
means the kernel doesn't need to define policy, e.g. when to unprotect memory,
and avoids questions like "should userspace be able to overwrite pinned control
registers".

And like the confidential VM use case, keeping userspace in the loop is a big
benefit, e.g. the guest can't circumvent protections by coercing userspace into
writing to protected memory.



Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

2023-05-05 Thread Sean Christopherson
On Fri, May 05, 2023, Mickaël Salaün wrote:
> diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
> index eb186bc57f6a..a7fb4ff888e6 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -3,6 +3,7 @@
>  #define _ASM_X86_KVM_PAGE_TRACK_H
>  
>  enum kvm_page_track_mode {
> + KVM_PAGE_TRACK_PREWRITE,

Heh, just when I decide to finally kill off support for multiple modes[1] :-)

My assessment from that changelog still holds true for this case:

  Drop "support" for multiple page-track modes, as there is no evidence
  that array-based and refcounted metadata is the optimal solution for
  other modes, nor is there any evidence that other use cases, e.g. for
  access-tracking, will be a good fit for the page-track machinery in
  general.
  
  E.g. one potential use case of access-tracking would be to prevent guest
  access to poisoned memory (from the guest's perspective).  In that case,
  the number of poisoned pages is likely to be a very small percentage of
  the guest memory, and there is no need to reference count the number of
  access-tracking users, i.e. expanding gfn_track[] for a new mode would be
  grossly inefficient.  And for poisoned memory, host userspace would also
  likely want to trap accesses, e.g. to inject #MC into the guest, and that
  isn't currently supported by the page-track framework.
  
  A better alternative for that poisoned page use case is likely a
  variation of the proposed per-gfn attributes overlay (linked), which
  would allow efficiently tracking the sparse set of poisoned pages, and by
  default would exit to userspace on access.

Of particular relevance:

  - Using the page-track machinery is inefficient because the guest is likely
going to write-protect a minority of its memory.  And this

  select KVM_EXTERNAL_WRITE_TRACKING if KVM

is particularly nasty because simply enabling HEKI in the Kconfig will cause
KVM to allocate rmaps and gfn tracking.

  - There's no need to reference count the protection, i.e. 15 of the 16 bits of
gfn_track are dead weight.

  - As proposed, adding a second "mode" would double the cost of gfn tracking.

  - Tying the protections to the memslots will create an impossible-to-maintain
    ABI because the protections will be lost if the owning memslot is deleted
    and recreated.

  - The page-track framework provides incomplete protection and will lead to an
    ongoing game of whack-a-mole, e.g. this patch catches the obvious cases by
    adding calls to kvm_page_track_prewrite(), but misses things like
    kvm_vcpu_map().

  - The scaling and maintenance issues will only get worse if/when someone tries
    to support dropping read and/or execute permissions, e.g. for execute-only.

  - The code is x86-only, and is likely to stay that way for the foreseeable
    future.

The proposed alternative is to piggyback the memory attributes implementation[2]
that is being added (if all goes according to plan) for confidential VMs.  This
use case (dropping permissions) came up not too long ago[3], which is why I have
a ready-made answer.

I have no doubt that we'll need to solve performance and scaling issues with the
memory attributes implementation, e.g. to utilize xarray multi-range support
instead of storing information on a per-4KiB-page basis, but AFAICT, the core
idea is sound.  And a very big positive from a maintenance perspective is that
any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
benefit the other use case.
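
E.g. a sketch of the multi-range idea using the existing xa_store_range() API
(illustrative only, not KVM code; requires CONFIG_XARRAY_MULTI):

#include <linux/xarray.h>

/* Store one multi-index entry covering a whole gfn range instead of
 * one entry per 4KiB page. */
static int set_attrs_for_range(struct xarray *attr_array,
			       unsigned long first_gfn,
			       unsigned long last_gfn,
			       unsigned long attrs)
{
	/* A lookup anywhere in [first_gfn, last_gfn] finds this entry. */
	return xa_err(xa_store_range(attr_array, first_gfn, last_gfn,
				     xa_mk_value(attrs), GFP_KERNEL));
}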

[1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com
[2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
[3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com



Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-04-25 Thread Sean Christopherson
On Tue, Apr 18, 2023, Ackerley Tng wrote:
> Sean Christopherson  writes:
> > I agree, a pure alignment check is too restrictive, and not really what I
> > intended despite past me literally saying that's what I wanted :-)  I think
> > I may have also inverted the "less alignment" statement, but luckily I
> > believe that ends up being a moot point.
> 
> > The goal is to avoid having to juggle scenarios where KVM wants to create a
> > hugepage, but restrictedmem can't provide one because of a misaligned file
> > offset.  I think the rule we want is that the offset must be aligned to the
> > largest page size allowed by the memslot _size_.  E.g. on x86, if the
> > memslot size is >=1GiB then the offset must be 1GiB or better, ditto for
> > >=2MiB and >=4KiB (ignoring that 4KiB is already a requirement).
> 
> > We could loosen that to say the largest size allowed by the memslot, but I
> > don't think that's worth the effort unless it's trivially easy to implement
> > in code, e.g. KVM could technically allow a 4KiB aligned offset if the
> > memslot is 2MiB sized but only 4KiB aligned on the GPA.  I doubt there's a
> > real use case for such a memslot, so I want to disallow that unless it's
> > super easy to implement.
> 
> Checking my understanding here about why we need this alignment check:
> 
> When KVM requests a page from restrictedmem, KVM will provide an offset
> into the file in terms of 4K pages.
> 
> When shmem is configured to use hugepages, shmem_get_folio() will round
> the requested offset down to the nearest hugepage-aligned boundary in
> shmem_alloc_hugefolio().
> 
> Example of problematic configuration provided to
> KVM_SET_USER_MEMORY_REGION2:
> 
> + shmem configured to use 1GB pages
> + restrictedmem_offset provided to KVM_SET_USER_MEMORY_REGION2: 0x4000
> + memory_size provided in KVM_SET_USER_MEMORY_REGION2: 1GB
> + KVM requests offset (pgoff_t) 0x8, which translates to offset 0x8000
> 
> restrictedmem_get_page() and shmem_get_folio() returns the page for
> offset 0x0 in the file, since rounding down 0x8000 to the nearest 1GB is
> 0x0. This is allocating outside the range that KVM is supposed to use,
> since the parameters provided in KVM_SET_USER_MEMORY_REGION2 is only
> supposed to be offset 0x4000 to (0x4000 + 1GB = 0x40004000) in the file.
> 
> IIUC shmem will actually just round down (0x4000 rounded down to nearest
> 1GB will be 0x0) and allocate without checking bounds, so if offset 0x0
> to 0x4000 in the file were supposed to be used by something else, there
> might be issues.
> 
> Hence, this alignment check ensures that rounding down of any offsets
> provided by KVM (based on page size configured in the backing file
> provided) to restrictedmem_get_page() must not go below the offset
> provided to KVM_SET_USER_MEMORY_REGION2.
> 
> Enforcing alignment of restrictedmem_offset based on the currently-set
> page size in the backing file (i.e. shmem) may not be effective, since
> the size of the pages in the backing file can be adjusted to a larger
> size after KVM_SET_USER_MEMORY_REGION2 succeeds. With that, we may still
> end up allocating outside the range that KVM was provided with.
> 
> Hence, to be safe, we should check alignment to the max page size across
> all backing filesystems, so the constraint is
> 
> rounding down restrictedmem_offset to
> min(max page size across all backing filesystems,
> max page size that fits in memory_size) == restrictedmem_offset
> 
> which is the same check as
> 
> restrictedmem_offset must be aligned to min(max page size across all
> backing filesystems, max page size that fits in memory_size)
> 
> which can safely reduce to
> 
> restrictedmem_offset must be aligned to max page size that fits in
> memory_size
> 
> since "max page size that fits in memory_size" is probably <= to "max
> page size across all backing filesystems", and if it's larger, it'll
> just be a tighter constraint.

Yes?  The alignment check isn't strictly required, KVM _could_ deal with the above
scenario, it's just a lot simpler and safer for KVM if the file offset needs to
be sanely aligned.
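
To make the rule concrete, a userspace model of the check (not KVM code),
applied to the 0x4000/1GiB example above:

#include <stdbool.h>
#include <stdint.h>

/* Model of the rule discussed above: the file offset must be aligned
 * to the largest (x86) page size that fits in the memslot. */
static bool offset_ok(uint64_t offset, uint64_t memslot_size)
{
	uint64_t page;

	if (memslot_size >= (1ULL << 30))
		page = 1ULL << 30;	/* 1GiB */
	else if (memslot_size >= (1ULL << 21))
		page = 1ULL << 21;	/* 2MiB */
	else
		page = 1ULL << 12;	/* 4KiB */

	return !(offset & (page - 1));
}

/* offset_ok(0x4000, 1ULL << 30) == false: a 16KiB-aligned offset can't
 * safely back a 1GiB memslot on a 1GiB-hugepage-backed file, matching
 * the rounding problem described above. */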

> If the above understanding is correct:
> 
> + We must enforce this in the KVM_SET_USER_MEMORY_REGION2 handler, since
>   IIUC shmem will just round down and allocate without checking bounds.
> 
> + I think this is okay because holes in the restrictedmem file (in
>   terms of offset) made to accommodate this constraint don't cost us
>   anything anyway(?) Are they just arbitrary offsets in a file? In
>   our case, this file is usually a new and empty file.
> 
> + In the c

Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2023-04-19 Thread Sean Christopherson
On Wed, Apr 19, 2023, Christian Brauner wrote:
> On Thu, Apr 13, 2023 at 03:28:43PM -0700, Sean Christopherson wrote:
> > > But if you want to preserve the inode number and device number of the
> > > relevant tmpfs instance but still report memfd restricted as your
> > > filesystem type
> > 
> > Unless I missed something along the way, reporting memfd_restricted as a distinct
> > filesystem is very much a non-goal.  AFAIK it's purely a side effect of the
> > proposed implementation.
> 
> In the current implementation you would have to put in effort to fake
> this. For example, you would need to also implement ->statfs
> super_operation where you'd need to fill in the details of the tmpfs
> instance. At that point all that memfd_restricted fs code that you've
> written is nothing but deadweight, I would reckon.

After digging a bit, I suspect the main reason Kirill implemented an overlay to
inode_operations was to prevent modifying the file size via ->setattr().  Relying
on shmem_setattr() to unmap entries in KVM's MMU wouldn't work because, by design,
the memory can't be mmap()'d into host userspace.

if (attr->ia_valid & ATTR_SIZE) {
if (memfd->f_inode->i_size)
return -EPERM;

if (!PAGE_ALIGNED(attr->ia_size))
return -EINVAL; 
}

But I think we can solve this particular problem by using F_SEAL_{GROW,SHRINK} or
SHMEM_LONGPIN.  For a variety of reasons, I'm leaning more and more toward making
this a KVM ioctl() instead of a dedicated syscall, at which point we can be both
more flexible and more draconian, e.g. let userspace provide the file size at the
time of creation, but make the size immutable, at least by default.
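
E.g. a userspace sketch of the sealing idea, using a plain memfd as a stand-in
for the restrictedmem fd:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Size the fd once at creation, then forbid any further resizing. */
static int create_fixed_size_memfd(size_t size)
{
	int fd = memfd_create("guest-mem", MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;

	if (ftruncate(fd, size) < 0 ||
	    fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}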

> > After giving myself a bit of a crash course in file systems, would something like
> > the below have any chance of (a) working, (b) getting merged, and (c) being
> > maintainable?
> > 
> > The idea is similar to a stacking filesystem, but instead of stacking, restrictedmem
> > hijacks a f_ops and a_ops to create a lightweight shim around tmpfs.  There are
> > undoubtedly issues and edge cases, I'm just looking for a quick "yes, this might
> > be doable" or a "no, that's absolutely bonkers, don't try it".
> 
> Maybe, but I think it's weird.

Yeah, agreed.

> _Replacing_ f_ops isn't something that's unprecedented. It happens every time
> a character device is opened (see fs/char_dev.c:chrdev_open()). And debugfs
> does a similar (much more involved) thing where it replaces it's proxy f_ops
> with the relevant subsystem's f_ops. The difference is that in both cases the
> replace happens at ->open() time; and the replace is done once. Afterwards
> only the newly added f_ops are relevant.
> 
> In your case you'd be keeping two sets of {f,a}_ops; one usable by
> userspace and another only usable by in-kernel consumers. And there are
> some concerns (non-exhaustive list), I think:
> 
> * {f,a}_ops weren't designed for this. IOW, one set of {f,a}_ops is
>   authoritative per @file and it is left to the individual subsystems to
>   maintain driver specific ops (see the sunrpc stuff or sockets).
> * lifetime management for the two sets of {f,a}_ops: If the ops belong
>   to a module then you need to make sure that the module can't get
>   unloaded while you're using the fops. Might not be a concern in this
>   case.

Ah, whereas I assume the owner of inode_operations is pinned by ??? (dentry?)
holding a reference to the inode?

> * brittleness: Not all f_ops for example deal with userspace
>   functionality some deal with cleanup when the file is closed like
>   ->release(). So it's delicate to override that functionality with
>   custom f_ops. Restricted memfds could easily forget to cleanup
>   resources.
> * Potential for confusion why there's two sets of {f,a}_ops.
> * f_ops specifically are generic across a vast amount of consumers and
>   are subject to change. If memfd_restricted() has specific requirements
>   because of this weird double-use they won't be taken into account.
> 
> I find this hard to navigate tbh and it feels like taking a shortcut to
> avoid building a proper api.

Agreed.  At the very least, it would be better to take an explicit dependency on
whatever APIs are being used instead of somewhat blindly bouncing through
->fallocate().  I think that gives us a clearer path to getting something merged
too, as we'll need Acks on making specific functions visible, i.e. will give MM
maintainers something concrete to react to.

> If you only care about a specific set of operations specific to memfd
> restricte that needs to be available to in-kernel consu

Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-04-17 Thread Sean Christopherson
On Mon, Apr 17, 2023, Chao Peng wrote:
> In case you started working on the code again, I have a branch [1]
> originally planned as v11 candidate which I believe I addressed all the
> discussions we had for v10 except the very latest one [2] and integrated
> all the newly added selftests from Ackerley and myself. The branch was
> based on your original upm_base_support and then rebased to your
> kvm-x86/mmu head. Feel free to take anything you think useful (most of
> them are trivial things but also some fixes for bugs).

Nice!  I am going to work on splicing together the various series this week, I'll
make sure to grab your work.

Thanks much! 



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2023-04-14 Thread Sean Christopherson
On Fri, Apr 14, 2023, Sean Christopherson wrote:
> On Fri, Apr 14, 2023, Ackerley Tng wrote:
> > Sean Christopherson  writes:
> > >   if (WARN_ON_ONCE(file->private_data)) {
> > >   err = -EEXIST;
> > >   goto err_fd;
> > >   }
> > 
> > Did you intend this as a check that the backing filesystem isn't using
> > the private_data field in the mapping?
> >
> > I think you meant file->f_mapping->private_data.
> 
> Ya, sounds right.  I should have added disclaimers that (a) I wrote this quite
> quickly and (b) it's compile tested only at this point.

FWIW, here's a very lightly tested version that doesn't explode on a basic selftest.

https://github.com/sean-jc/linux/tree/x86/upm_base_support



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2023-04-14 Thread Sean Christopherson
On Fri, Apr 14, 2023, Ackerley Tng wrote:
> Sean Christopherson  writes:
> 
> > On Thu, Apr 13, 2023, Christian Brauner wrote:
> > > * by a mount option to tmpfs that makes it act
> > >in this restricted manner then you don't need an ioctl() and can get
> > >away with regular open calls. Such a tmpfs instance would only create
> > >regular, restricted memfds.
> 
> > I'd prefer to not go this route, because IIUC, it would require relatively
> > invasive changes to shmem code, and IIUC would require similar changes to
> > other support backings in the future, e.g. hugetlbfs?  And as above, I
> > don't think any of the potential use cases need restrictedmem to be a
> > uniquely identifiable mount.
> 
> FWIW, I'm starting to look at extending restrictedmem to hugetlbfs and
> the separation that the current implementation has is very helpful. Also
> helps that hugetlbfs and tmpfs are structured similarly, I guess.
> 
> > One of the goals (hopefully not a pipe dream) is to design restrictmem in
> > such a way that extending it to support other backing types isn't terribly
> > difficult.  In case it's not obvious, most of us working on this stuff
> > aren't filesystems experts, and many of us aren't mm experts either.  The
> > more we (KVM folks for the most part) can leverage existing code to do the
> > heavy lifting, the better.
> 
> > After giving myself a bit of a crash course in file systems, would
> > something like the below have any chance of (a) working, (b) getting
> > merged, and (c) being maintainable?
> 
> > The idea is similar to a stacking filesystem, but instead of stacking,
> > restrictedmem hijacks a f_ops and a_ops to create a lightweight shim around
> > tmpfs.  There are undoubtedly issues and edge cases, I'm just looking for a
> > quick "yes, this might be doable" or a "no, that's absolutely bonkers,
> > don't try it".
> 
> Not an FS expert by any means, but I did think of approaching it this
> way as well!
> 
> "Hijacking" perhaps gives this approach a bit of a negative connotation.

Heh, commandeer then.

> I thought this is pretty close to subclassing (as in Object
> Oriented Programming). When some methods (e.g. fallocate) are called,
> restrictedmem does some work, and calls the same method in the
> superclass.
> 
> The existing restrictedmem code is more like instantiating a shmem
> object and keeping that object as a field within the restrictedmem
> object.
> 
> Some (maybe small) issues I can think of now:
> 
> (1)
> 
> One difficulty with this approach is that other functions may make
> assumptions about private_data being of a certain type, or functions may
> use private_data.
> 
> I checked and IIUC neither shmem nor hugetlbfs use the private_data
> field in the inode's i_mapping (also file's f_mapping).
> 
> But there's fs/buffer.c which uses private_data, although those
> functions seem to be used by FSes like ext4 and fat, not memory-backed
> FSes.
> 
> We can probably fix this if any backing filesystems of restrictedmem,
> like tmpfs and future ones use private_data.

Ya, if we go the route of poking into f_ops and stuff, I would want to add
WARN_ON_ONCE() hardening of everything that restrictedmem wants to "commandeer" ;-)

> > static int restrictedmem_file_create(struct file *file)
> > {
> > struct address_space *mapping = file->f_mapping;
> > struct restrictedmem *rm;
> 
> > rm = kzalloc(sizeof(*rm), GFP_KERNEL);
> > if (!rm)
> > return -ENOMEM;
> 
> > rm->backing_f_ops = file->f_op;
> > rm->backing_a_ops = mapping->a_ops;
> > rm->file = file;
> 
> We don't really need to do this, since rm->file is already the same as
> file, we could just pass the file itself when it's needed

Aha!  I was working on getting rid of it, but forgot to go back and do another
pass.

> > init_rwsem(&rm->lock);
> > xa_init(&rm->bindings);

> > file->f_flags |= O_LARGEFILE;

> > file->f_op = &restrictedmem_fops;
> > mapping->a_ops = &restrictedmem_aops;
> 
> I think we probably have to override inode_operations as well, because
> otherwise other methods would become available to a restrictedmem file
> (like link, unlink, mkdir, tmpfile). Or maybe that's a feature instead
> of a bug.

I think we want those?  What we want to restrict are operations that require
read/write/execute access to the file, everything else should be ok.  fallocate()
is a special case because restrictedmem needs to tell KVM to unmap the memory when
a hole is punched.  I assume ->setattr() needs similar treatment.

Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-04-14 Thread Sean Christopherson
On Tue, Mar 28, 2023, Chao Peng wrote:
> On Fri, Mar 24, 2023 at 10:29:25AM +0800, Xiaoyao Li wrote:
> > On 3/24/2023 10:10 AM, Chao Peng wrote:
> > > On Wed, Mar 22, 2023 at 05:41:31PM -0700, Isaku Yamahata wrote:
> > > > On Wed, Mar 08, 2023 at 03:40:26PM +0800,
> > > > Chao Peng  wrote:
> > > > 
> > > > > On Wed, Mar 08, 2023 at 12:13:24AM +, Ackerley Tng wrote:
> > > > > > Chao Peng  writes:
> > > > > > 
> > > > > > > On Sat, Jan 14, 2023 at 12:01:01AM +, Sean Christopherson 
> > > > > > > wrote:
> > > > > > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > > > +static bool kvm_check_rmem_offset_alignment(u64 offset, u64 gpa)
> > > > > > > +{
> > > > > > > + if (!offset)
> > > > > > > + return true;
> > > > > > > + if (!gpa)
> > > > > > > + return false;
> > > > > > > +
> > > > > > > + return !!(count_trailing_zeros(offset) >= count_trailing_zeros(gpa));
> > > > 
> > > > This check doesn't work as expected. For example, offset = 2GB, gpa=4GB;
> > > > this check fails.
> > > 
> > > This case is expected to fail as Sean initially suggested[*]:
> > >I would rather reject memslot if the gfn has lesser alignment than
> > >the offset. I'm totally ok with this approach _if_ there's a use case.
> > >Until such a use case presents itself, I would rather be conservative
> > >from a uAPI perspective.
> > > 
> > > I understand that we put a tighter restriction on this, but if such a
> > > restriction turns out to be a big issue for real usage, instead of a
> > > theoretical problem, then we can loosen the check here. But at that time
> > > the below code is kind of x86 specific and may need improvement.
> > > 
> > > BTW, in latest code, I replaced count_trailing_zeros() with fls64():
> > >return !!(fls64(offset) >= fls64(gpa));
> > 
> > wouldn't it be !!(ffs64(offset) <= ffs64(gpa)) ?
> 
> As the function document explains, here we want to return true when
> ALIGNMENT(offset) >= ALIGNMENT(gpa), so '>=' is what we need.
> 
> It's worth clarifying that in Sean's original suggestion he actually
> mentioned the opposite. He said 'reject memslot if the gfn has lesser
> alignment than the offset', but I wonder if that was his intent, since
> if ALIGNMENT(offset) < ALIGNMENT(gpa), we wouldn't be able to map
> the page as a largepage. Consider we have the below config:
> 
>   gpa=2M, offset=1M
> 
> In this case KVM tries to map gpa at 2M as 2M hugepage but the physical
> page at the offset(1M) in private_fd cannot provide the 2M page due to
> misalignment.
> 
> But as we discussed in the off-list thread, here we do find a real use
> case indicating this check is too strict, i.e. QEMU immediately fails
> when launching a guest with > 2G memory. For this case QEMU splits guest
> memory space into two slots:
> 
>   Slot#1(ram_below_4G): gpa=0x0, offset=0x0, size=2G
>   Slot#2(ram_above_4G): gpa=4G,  offset=2G,  size=totalsize-2G
> 
> This strict alignment check fails for slot#2 because offset(2G) has less
> alignment than gpa(4G). To allow this, one solution is to revert to my
> previous change in kvm_alloc_memslot_metadata(), which disallows hugepages
> only when the offset/gpa are not aligned to the related page size.
> 
> Sean, what do you think?

I agree, a pure alignment check is too restrictive, and not really what I intended
despite past me literally saying that's what I wanted :-)  I think I may have also
inverted the "less alignment" statement, but luckily I believe that ends up being
a moot point.

The goal is to avoid having to juggle scenarios where KVM wants to create a
hugepage, but restrictedmem can't provide one because of a misaligned file offset.
I think the rule we want is that the offset must be aligned to the largest page
size allowed by the memslot _size_.  E.g. on x86, if the memslot size is >=1GiB
then the offset must be 1GiB-aligned or better, ditto for >=2MiB and >=4KiB
(ignoring that 4KiB is already a requirement).

We could loosen that to say the largest size allowed by the memslot, but I don't
think that's worth the effort unless it's trivially easy to implement in code,
e.g. KVM could technically allow a 4KiB aligned offset if the memslot is 2MiB
sized but only 4KiB aligned on the GPA.  I doubt there's a real use case for such
a memslot, so I want to disallow that unless it's super easy to implement.
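
A sketch of that rule for x86 (illustrative only; the actual series code and
naming may differ):

static bool kvm_check_rmem_offset_alignment(u64 offset, u64 nr_pages)
{
	u64 size = nr_pages << PAGE_SHIFT;

	if (!offset)
		return true;

	/* The offset must be as aligned as the slot's largest page size. */
	if (size >= SZ_1G)
		return IS_ALIGNED(offset, SZ_1G);
	if (size >= SZ_2M)
		return IS_ALIGNED(offset, SZ_2M);

	/* 4KiB alignment is already required for all offsets. */
	return IS_ALIGNED(offset, SZ_4K);
}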



Re: [RFC PATCH 0/6] Setting memory policy for restrictedmem file

2023-04-14 Thread Sean Christopherson
On Fri, Apr 14, 2023, Michal Hocko wrote:
> On Fri 14-04-23 00:11:49, Ackerley Tng wrote:
> > 3. A more generic fbind(): it seems like this new functionality is
> >really only needed for restrictedmem files, hence a separate,
> >specific syscall was proposed to avoid complexities with handling
> >conflicting policies that may be specified via other syscalls like
> >mbind()
> 
> I do not think it is a good idea to make the syscall restrictedmem
> specific.

+1.  IMO, any uAPI that isn't directly related to the fundamental properties of
restricted memory, i.e. isn't truly unique to restrictedmem, should be added as
generic fd-based uAPI.

> History shows that users are much more creative when it comes
> to usecases than us. I do understand that the nature of restricted
> memory is that it is not mappable, but memory policies without a mapping
> are a reasonable concept in general. After all, this just tells where
> the memory should be allocated from. Do we need to implement that for
> any other fs? No, you can safely return EINVAL for anything but a
> memfd_restricted fd for now, but you shouldn't limit usecases upfront.

I would even go a step further and say that we should seriously reconsider the
design/implementation of memfd_restricted() if a generic fbind() needs explicit
handling from the restricted memory code.  One of the goals with memfd_restricted()
is to rely on the underlying backing store to handle all of the "normal" behaviors.



Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify mount for memfd_restricted

2023-04-13 Thread Sean Christopherson
On Thu, Apr 13, 2023, Ackerley Tng wrote:
> Christian Brauner  writes:
> > I'm curious, is there an LSFMM session for this?
> 
> As far as I know, there is no LSFMM session for this.

Correct, no LSFMM session.  In hindsight, that's obviously something we should
have pursued :-(



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2023-04-13 Thread Sean Christopherson
On Thu, Apr 13, 2023, Christian Brauner wrote:
> On Thu, Aug 18, 2022 at 04:24:21PM +0300, Kirill A . Shutemov wrote:
> > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > Here's what I would prefer, and imagine much easier for you to maintain;
> > > but I'm no system designer, and may be misunderstanding throughout.
> > > 
> > > QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
> > > the fallocate syscall interface itself) to allocate and free the memory,
> > > ioctl for initializing some of it too.  KVM in control of whether that
> > > fd can be read or written or mmap'ed or whatever, no need to prevent it
> > > in shmem.c, no need for flags, seals, notifications to and fro because
> > > KVM is already in control and knows the history.  If shmem actually has
> > > value, call into it underneath - somewhat like SysV SHM, and /dev/zero
> > > mmap, and i915/gem make use of it underneath.  If shmem has nothing to
> > > add, just allocate and free kernel memory directly, recorded in your
> > > own xarray.
> > 
> > I guess a shim layer on top of shmem *can* work. I don't see immediately why
> > it would not. But I'm not sure it is the right direction. We risk creating yet
> > another parallel VM with its own rules/locking/accounting that is opaque to
> > core-mm.
> 
> Sorry for necrobumping this thread but I've been reviewing the

No worries, I'm just stoked someone who actually knows what they're doing is
chiming in :-)

> memfd_restricted() extension that Ackerley is currently working on. I
> was pointed to this thread as this is what the extension is building
> on but I'll reply to both threads here.
> 
> From a glance at v10, memfd_restricted() is currently implemented as an
> in-kernel stacking filesystem. A call to memfd_restricted() creates a
> new restricted memfd file and a new unlinked tmpfs file and stashes the
> tmpfs file into the memfd file's private data member. It then uses the
> tmpfs file's f_ops and i_ops to perform the relevant file and inode
> operations. So it has the same callstack as a general stacking
> filesystem like overlayfs in some cases:
> 
> memfd_restricted->getattr()
> -> tmpfs->getattr()

...

> Since you're effectively acting like a stacking filesystem you should
> really use the device number of your memfd restricted filesystem. IOW,
> something like:
> 
> stat->dev = memfd_restricted_dentry->d_sb->s_dev;
> 
> But then you run into trouble if you want to go forward with Ackerley's
> extension that allows to explicitly pass in tmpfs fds to
> memfd_restricted(). Afaict, two tmpfs instances might allocate the same
> inode number. So now the inode and device number pair isn't unique
> anymore.
> 
> So you might best be served by allocating and reporting your own inode
> numbers as well.
> 
> But if you want to preserve the inode number and device number of the
> relevant tmpfs instance but still report memfd restricted as your
> filesystem type

Unless I missed something along the way, reporting memfd_restricted as a distinct
filesystem is very much a non-goal.  AFAIK it's purely a side effect of the
proposed implementation.

> then I think it's reasonable to ask whether a stacking implementation really
> makes sense here.
> 
> If you extend memfd_restricted() or even consider extending it in the
> future to take tmpfs file descriptors as arguments to identify the tmpfs
> instance in which to allocate the underlying tmpfs file for the new
> restricted memfd file you should really consider a tmpfs based
> implementation.
> 
> Because at that point it just feels like a pointless wrapper to get
> custom f_ops and i_ops. Plus it's wasteful because you allocate dentries
> and inodes that you don't really care about at all.
> 
> Just off the top of my hat you might be better served:
> * by a new ioctl() on tmpfs instances that
>   yield regular tmpfs file descriptors with restricted f_ops and i_ops.
>   That's not that different from btrfs subvolumes which effectively are
>   directories but are created through an ioctl().

I think this is more or less what we want to do, except via a dedicated syscall
instead of an ioctl() so that the primary interface isn't strictly tied to tmpfs,
e.g. so that it can be extended to other backing types in the future.
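
A hypothetical sketch of the userspace-facing shape being argued for (the syscall
number, the RMFD_USERMNT flag, and the final signature are illustrative, not
settled uAPI):

#include <sys/syscall.h>
#include <unistd.h>

static int create_restricted_memfd(int tmpfs_mount_fd)
{
	/* Base form: allocate from the default tmpfs instance. */
	if (tmpfs_mount_fd < 0)
		return syscall(__NR_memfd_restricted, 0);

	/* Possible extension: allocate from a caller-supplied tmpfs mount. */
	return syscall(__NR_memfd_restricted, RMFD_USERMNT, tmpfs_mount_fd);
}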

> * by a mount option to tmpfs that makes it act
>   in this restricted manner then you don't need an ioctl() and can get
>   away with regular open calls. Such a tmpfs instance would only create
>   regular, restricted memfds.

I'd prefer to not go this route, because IIUC it would require relatively invasive
changes to shmem code, and IIUC would require similar changes to other supported
backings in the future, e.g. hugetlbfs?  And as above, I don't think any of the
potential use cases need restrictedmem to be a uniquely identifiable mount.

One of the goals (hopefully not a pipe dream) is to design restrictedmem in such a
way that extending it to support other backing types isn't terribly difficult.

Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-04-12 Thread Sean Christopherson
On Wed, Jan 25, 2023, Kirill A. Shutemov wrote:
> On Wed, Jan 25, 2023 at 12:20:26AM +0000, Sean Christopherson wrote:
> > On Tue, Jan 24, 2023, Liam Merwick wrote:
> > > On 14/01/2023 00:37, Sean Christopherson wrote:
> > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > This patch series implements KVM guest private memory for confidential
> > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > crash the running host system, this is terrible for multi-tenant
> > > > > configurations. The host accesses include those from KVM userspace 
> > > > > like
> > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > new mm and KVM interfaces so KVM userspace can still manage guest 
> > > > > memory
> > > > > via a fd-based approach, but it can never access the guest memory
> > > > > content.
> > > > > 
> > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any 
> > > > > other
> > > > > reviews are always welcome.
> > > > >- 01: mm change, target for mm tree
> > > > >- 02-09: KVM change, target for KVM tree
> > > > 
> > > > A version with all of my feedback, plus reworked versions of Vishal's 
> > > > selftest,
> > > > is available here:
> > > > 
> > > >g...@github.com:sean-jc/linux.git x86/upm_base_support
> > > > 
> > > > It compiles and passes the selftest, but it's otherwise barely tested.  
> > > > There are
> > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. 
> > > > it's still
> > > > a WIP.
> > > > 
> > > 
> > > When running LTP (https://github.com/linux-test-project/ltp) on the v10
> > > bits (and also with Sean's branch above) I encounter the following NULL
> > > pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> > > (100% reproducible).
> > > 
> > > It appears that in restrictedmem_error_page()
> > > inode->i_mapping->private_data is NULL in the
> > > list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) but I
> > > don't know why.
> > 
> > Kirill, can you take a look?  Or pass the buck to someone who can? :-)
> 
> The patch below should help.
> 
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index 15c52301eeb9..39ada985c7c0 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -307,14 +307,29 @@ void restrictedmem_error_page(struct page *page, struct address_space *mapping)
>  
>   spin_lock(&sb->s_inode_list_lock);
>   list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> - struct restrictedmem *rm = inode->i_mapping->private_data;
>   struct restrictedmem_notifier *notifier;
> - struct file *memfd = rm->memfd;
> + struct restrictedmem *rm;
>   unsigned long index;
> + struct file *memfd;
>  
> - if (memfd->f_mapping != mapping)
> + if (atomic_read(&inode->i_count))

Kirill, should this be

if (!atomic_read(&inode->i_count))
continue;

i.e. skip unreferenced inodes, not skip referenced inodes?



Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-04-12 Thread Sean Christopherson
On Wed, Mar 22, 2023, Michael Roth wrote:
> On Tue, Feb 21, 2023 at 08:11:35PM +0800, Chao Peng wrote:
> > >   *fixup (upm_base_support): KVM: use inclusive ranges for restrictedmem 
> > > binding/unbinding
> > >   *fixup (upm_base_support): mm: restrictedmem: use inclusive ranges for 
> > > issuing invalidations
> > 
> > As many kernel APIs treat 'end' as exclusive, I would rather keep using
> > exclusive 'end' for these APIs(restrictedmem_bind/restrictedmem_unbind
> > and notifier callbacks) but fix it internally in the restrictedmem. E.g.
> > all the places where xarray API needs a 'last'/'max' we use 'end - 1'.
> > See below for the change.
> 
> Yes I did feel like I was fighting the kernel a bit on that; your
> suggestion seems like it would be a better fit.

Comically belated +1, XArray is the odd one here.
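
A tiny example of the convention being settled on: the uAPI and notifiers keep an
exclusive 'end', and only the XArray call sites convert to an inclusive 'last'
(field names taken from the restrictedmem snippets quoted earlier; illustrative):

static void restrictedmem_invalidate_start(struct restrictedmem *rm,
					   pgoff_t start, pgoff_t end)
{
	struct restrictedmem_notifier *notifier;
	unsigned long index;

	/* [start, end) from the caller becomes [start, end - 1] for XArray. */
	xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
		notifier->ops->invalidate_start(notifier, start, end);
}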



Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-03-07 Thread Sean Christopherson
Please trim your replies so that readers don't need to scan through a hundred or
so lines of quotes just to confirm there's nothing there.

On Tue, Mar 07, 2023, Ackerley Tng wrote:
> Chao Peng  writes:
> 
> > Register/unregister private memslot to fd-based memory backing store
> > restrictedmem and implement the callbacks for restrictedmem_notifier:
> >- invalidate_start()/invalidate_end() to zap the existing memory
> >  mappings in the KVM page table.
> >- error() to request KVM_REQ_MEMORY_MCE and later exit to userspace
> >  with KVM_EXIT_SHUTDOWN.
> 
> > Expose KVM_MEM_PRIVATE for memslot and KVM_MEMORY_ATTRIBUTE_PRIVATE for
> > KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to userspace; both are controlled by
> > kvm_arch_has_private_mem(), which should be overridden by architecture
> > code.
> 
> Could we perhaps rename KVM_MEM_PRIVATE to KVM_MEM_PROTECTED, to be in
> line with KVM_X86_PROTECTED_VM?
> 
> I feel that a memslot that has the KVM_MEM_PRIVATE flag need not always
> be private; it can sometimes be providing memory that is shared and
> also accessible from the host.
> 
> KVM_MEMORY_ATTRIBUTE_PRIVATE is fine as-is because this flag is set when
> the guest memory is meant to be backed by private memory.
> 
> KVM_MEMORY_EXIT_FLAG_PRIVATE is also okay because the flag is used to
> indicate when the memory error is caused by a private access (as opposed
> to a shared access).
> 
> kvm_slot_can_be_private() could perhaps be renamed kvm_is_protected_slot()?

No to this suggestion.  I agree that KVM_MEM_PRIVATE is a bad name, but
kvm_is_protected_slot() is just as wrong.  The _only_ thing that the flag controls
is whether or not the memslot has an fd that is bound to restricted memory.
The memslot itself is not protected in any way, and if the entire memslot is
mapped shared, then the data backed by the memslot isn't protected either.

What about KVM_MEM_CAN_BE_PRIVATE?  KVM_MEM_PRIVATIZABLE is more succinct, but
AFAICT that's a made up word, and IMO is unnecessarily fancy.



Re: Fortnightly KVM call for 2023-02-07

2023-03-07 Thread Sean Christopherson
On Tue, Feb 28, 2023, Juan Quintela wrote:
> Sean Christopherson  wrote:
> > On Tue, Jan 24, 2023, Juan Quintela wrote:
> >> Please, send any topic that you are interested in covering in the next
> >> call in 2 weeks.
> >> 
> >> We have already topics:
> >> - single qemu binary
> >>   People on previous call (today) asked if Markus, Paolo and Peter could
> >>   be there on next one to further discuss the topic.
> >> 
> >> - Huge Memory guests
> >> 
> >>   Will send a separate email with the questions that we want to discuss
> >>   later during the week.
> >> 
> >> After discussions on the QEMU Summit, we are going to have always open a
> >> KVM call where you can add topics.
> >
> > Hi Juan!
> >
> > I have a somewhat odd request: can I convince you to rename "KVM call" to
> > something like "QEMU+KVM call"?
> >
> > I would like to kickstart a recurring public meeting/forum that (almost)
> > exclusively targets internal KVM development, but I don't want to cause
> > confusion and definitely don't want to usurp your meeting.  The goal/purpose
> > of the KVM-specific meeting would be to do design reviews, syncs, etc. on
> > KVM internals and things like KVM selftests, while, IIUC, the current
> > "KVM call" is aimed at the entire KVM+QEMU+VFIO ecosystem.
> 
> I can do that.
> I would have told you that you could use our slots, but right now we are
> quite low on slots.
> 
> If nobody objects, I will change that for the next call.

Thanks, much appreciated!



Re: Fortnightly KVM call for 2023-02-07

2023-02-23 Thread Sean Christopherson
On Tue, Jan 24, 2023, Juan Quintela wrote:
> Please, send any topic that you are interested in covering in the next
> call in 2 weeks.
> 
> We have already topics:
> - single qemu binary
>   People on previous call (today) asked if Markus, Paolo and Peter could
>   be there on next one to further discuss the topic.
> 
> - Huge Memory guests
> 
>   Will send a separate email with the questions that we want to discuss
>   later during the week.
> 
> After discussions on the QEMU Summit, we are going to have always open a
> KVM call where you can add topics.

Hi Juan!

I have a somewhat odd request: can I convince you to rename "KVM call" to
something like "QEMU+KVM call"?

I would like to kickstart a recurring public meeting/forum that (almost)
exclusively targets internal KVM development, but I don't want to cause confusion
and definitely don't want to usurp your meeting.  The goal/purpose of the
KVM-specific meeting would be to do design reviews, syncs, etc. on KVM internals
and things like KVM selftests, while, IIUC, the current "KVM call" is aimed at
the entire KVM+QEMU+VFIO ecosystem.

Thanks!



Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-02-22 Thread Sean Christopherson
On Thu, Feb 16, 2023, David Hildenbrand wrote:
> On 16.02.23 06:13, Mike Rapoport wrote:
> > Hi,
> > 
> > On Fri, Dec 02, 2022 at 02:13:38PM +0800, Chao Peng wrote:
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > 
> > Sorry for jumping late.
> > 
> > Unless I'm missing something, hibernation will also cause a machine check
> > when there is TDX-protected memory in the system. When hibernation
> > creates a memory snapshot it essentially walks all physical pages and saves
> > their contents, so for TDX memory this will trigger a machine check, right?

For hibernation specifically, I think that should be handled elsewhere as
hibernation is simply incompatible with TDX, SNP, pKVM, etc. without
paravirtualizing the guest, as none of those technologies support auto-export
a la s390.  I suspect the right approach is to disallow hibernation if KVM is
running any protected guests.

> I recall bringing that up in the past (also memory access due to kdump,
> /proc/kcore) and was told that the main focus for now is preventing
> unprivileged users from crashing the system, that is, not mapping such
> memory into user space (e.g., QEMU). In the long run, we'll want to handle
> such pages properly in the other events where the kernel might access
> them.

Ya, unless someone strongly objects, the plan is to essentially treat "attacks"
from privileged users as out of scope for initial support, and then iterate
as needed to fix/enable more features.

FWIW, read accesses, e.g. kdump, should be ok for TDX and SNP as they both play
nice with "bad" reads.  pKVM is a different beast though, as I believe any access
to guest private memory will fault.  But my understanding is that this series
would be a big step forward for pKVM, which currently doesn't have any safeguards.



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2023-02-14 Thread Sean Christopherson
On Mon, Feb 13, 2023, Isaku Yamahata wrote:
> On Fri, Feb 10, 2023 at 12:35:30AM +,
> Sean Christopherson  wrote:
> 
> > On Wed, Feb 08, 2023, Isaku Yamahata wrote:
> > > On Fri, Dec 02, 2022 at 02:13:40PM +0800,
> > > Chao Peng  wrote:
> > > 
> > > > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > > > +  struct kvm_memory_attributes *attrs)
> > > > +{
> > > > +   gfn_t start, end;
> > > > +   unsigned long i;
> > > > +   void *entry;
> > > > +   u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > > > +
> > > > +   /* flags is currently not used. */
> > > > +   if (attrs->flags)
> > > > +   return -EINVAL;
> > > > +   if (attrs->attributes & ~supported_attrs)
> > > > +   return -EINVAL;
> > > > +   if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > > > +   return -EINVAL;
> > > > +   if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > > > +   return -EINVAL;
> > > > +
> > > > +   start = attrs->address >> PAGE_SHIFT;
> > > > +   end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > > > +
> > > > +   entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > > > +
> > > > +   mutex_lock(&kvm->lock);
> > > > +   for (i = start; i < end; i++)
> > > > +   if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > > > +   GFP_KERNEL_ACCOUNT)))
> > > > +   break;
> > > > +   mutex_unlock(&kvm->lock);
> > > > +
> > > > +   attrs->address = i << PAGE_SHIFT;
> > > > +   attrs->size = (end - i) << PAGE_SHIFT;
> > > > +
> > > > +   return 0;
> > > > +}
> > > > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > > > +
> > > 
> > > If the memslot isn't private, it should return an error if the private
> > > attribute is set.
> > 
> > Why?  I'd rather keep the two things separate.  If we enforce this sort of
> > thing at KVM_SET_MEMORY_ATTRIBUTES, then we also have to enforce it at
> > KVM_SET_USER_MEMORY_REGION.
> 
> For device assignment via shared GPA, non-private memory slot needs to be
> allowed.

That doesn't say anything about why setting attributes needs to poke into the
memslot.  The page fault path already kicks out to userspace if there's a
discrepancy between the attributes and the memslot, why is that insufficient?
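
A condensed sketch of that fault-path check (helper and field names are assumed
from this series' snippets elsewhere in the thread; illustrative, not the exact
code):

static int kvm_check_fault_vs_attributes(struct kvm_vcpu *vcpu,
					 struct kvm_page_fault *fault)
{
	bool private_attr = kvm_mem_is_private(vcpu->kvm, fault->gfn);

	if (fault->is_private == private_attr)
		return 0;

	/* Punt to userspace to convert the page or fix its memslots. */
	vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
	vcpu->run->memory.flags = private_attr ? KVM_MEMORY_EXIT_FLAG_PRIVATE : 0;
	vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
	vcpu->run->memory.size = PAGE_SIZE;
	return -EFAULT;
}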



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2023-02-09 Thread Sean Christopherson
On Wed, Feb 08, 2023, Isaku Yamahata wrote:
> On Fri, Dec 02, 2022 at 02:13:40PM +0800,
> Chao Peng  wrote:
> 
> > +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> > +  struct kvm_memory_attributes *attrs)
> > +{
> > +   gfn_t start, end;
> > +   unsigned long i;
> > +   void *entry;
> > +   u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> > +
> > +   /* flags is currently not used. */
> > +   if (attrs->flags)
> > +   return -EINVAL;
> > +   if (attrs->attributes & ~supported_attrs)
> > +   return -EINVAL;
> > +   if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> > +   return -EINVAL;
> > +   if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> > +   return -EINVAL;
> > +
> > +   start = attrs->address >> PAGE_SHIFT;
> > +   end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> > +
> > +   entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> > +
> > +   mutex_lock(&kvm->lock);
> > +   for (i = start; i < end; i++)
> > +   if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> > +   GFP_KERNEL_ACCOUNT)))
> > +   break;
> > +   mutex_unlock(&kvm->lock);
> > +
> > +   attrs->address = i << PAGE_SHIFT;
> > +   attrs->size = (end - i) << PAGE_SHIFT;
> > +
> > +   return 0;
> > +}
> > +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> > +
> 
> If the memslot isn't private, it should return an error if the private attribute is set.

Why?  I'd rather keep the two things separate.  If we enforce this sort of thing
at KVM_SET_MEMORY_ATTRIBUTES, then we also have to enforce it at
KVM_SET_USER_MEMORY_REGION.
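
For completeness, a hypothetical userspace flow for flipping a range to private
with the proposed ioctl (struct, flag, and ioctl names are from this series; the
surrounding code is illustrative):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void set_range_private(int vm_fd, __u64 gpa, __u64 size)
{
	struct kvm_memory_attributes attrs = {
		.address    = gpa,	/* must be page-aligned */
		.size       = size,	/* must be page-aligned and non-zero */
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		.flags      = 0,	/* must be zero for now */
	};

	if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
		perror("KVM_SET_MEMORY_ATTRIBUTES");
}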



Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-01-24 Thread Sean Christopherson
On Tue, Jan 24, 2023, Liam Merwick wrote:
> On 14/01/2023 00:37, Sean Christopherson wrote:
> > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > > 
> > > The patch series touches both core mm and KVM code. I appreciate
> > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > reviews are always welcome.
> > >- 01: mm change, target for mm tree
> > >- 02-09: KVM change, target for KVM tree
> > 
> > A version with all of my feedback, plus reworked versions of Vishal's 
> > selftest,
> > is available here:
> > 
> >g...@github.com:sean-jc/linux.git x86/upm_base_support
> > 
> > It compiles and passes the selftest, but it's otherwise barely tested.  
> > There are
> > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's 
> > still
> > a WIP.
> > 
> 
> When running LTP (https://github.com/linux-test-project/ltp) on the v10
> bits (and also with Sean's branch above) I encounter the following NULL
> pointer dereference with testcases/kernel/syscalls/madvise/madvise01
> (100% reproducible).
> 
> It appears that in restrictedmem_error_page() inode->i_mapping->private_data
> is NULL
> in the list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list)
> but I don't know why.

Kirill, can you take a look?  Or pass the buck to someone who can? :-)



Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-01-23 Thread Sean Christopherson
On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Thu, Jan 19, 2023 at 03:25:08PM +,
> Sean Christopherson  wrote:
> 
> > On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> > > On Sat, Jan 14, 2023 at 12:37:59AM +,
> > > Sean Christopherson  wrote:
> > > 
> > > > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > > > This patch series implements KVM guest private memory for confidential
> > > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > > TDX-protected guest memory, machine check can happen which can further
> > > > > crash the running host system, this is terrible for multi-tenant
> > > > > configurations. The host accesses include those from KVM userspace 
> > > > > like
> > > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > > new mm and KVM interfaces so KVM userspace can still manage guest 
> > > > > memory
> > > > > via a fd-based approach, but it can never access the guest memory
> > > > > content.
> > > > > 
> > > > > The patch series touches both core mm and KVM code. I appreciate
> > > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any 
> > > > > other
> > > > > reviews are always welcome.
> > > > >   - 01: mm change, target for mm tree
> > > > >   - 02-09: KVM change, target for KVM tree
> > > > 
> > > > A version with all of my feedback, plus reworked versions of Vishal's 
> > > > selftest,
> > > > is available here:
> > > > 
> > > >   g...@github.com:sean-jc/linux.git x86/upm_base_support
> > > > 
> > > > It compiles and passes the selftest, but it's otherwise barely tested.  
> > > > There are
> > > > a few todos (2 I think?) and many of the commits need changelogs, i.e. 
> > > > it's still
> > > > a WIP.
> > > > 
> > > > As for next steps, can you (handwaving all of the TDX folks) take a 
> > > > look at what
> > > > I pushed and see if there's anything horrifically broken, and that it 
> > > > still works
> > > > for TDX?
> > > > 
> > > > Fuad (and pKVM folks) same ask for you with respect to pKVM.  
> > > > Absolutely no rush
> > > > (and I mean that).
> > > > 
> > > > On my side, the two things on my mind are (a) tests and (b) downstream 
> > > > dependencies
> > > > (SEV and TDX).  For tests, I want to build a lists of tests that are 
> > > > required for
> > > > merging so that the criteria for merging are clear, and so that if the 
> > > > list is large
> > > > (haven't thought much yet), the work of writing and running tests can 
> > > > be distributed.
> > > > 
> > > > Regarding downstream dependencies, before this lands, I want to pull in 
> > > > all the
> > > > TDX and SNP series and see how everything fits together.  Specifically, 
> > > > I want to
> > > > make sure that we don't end up with a uAPI that necessitates ugly code, 
> > > > and that we
> > > > don't miss an opportunity to make things simpler.  The patches in the 
> > > > SNP series to
> > > > add "legacy" SEV support for UPM in particular made me slightly rethink 
> > > > some minor
> > > > details.  Nothing remotely major, but something that needs attention 
> > > > since it'll
> > > > be uAPI.
> > > 
> > > Although I'm still debuging with TDX KVM, I needed the following.
> > > kvm_faultin_pfn() is called without mmu_lock held.  the race to change
> > > private/shared is handled by mmu_seq.  Maybe dedicated function only for
> > > kvm_faultin_pfn().
> > 
> > Gah, you're not on the other thread where this was discussed[*].  Simply
> > deleting the lockdep assertion is safe; for guest types that rely on the
> > attributes to define shared vs. private, KVM rechecks the attributes under
> > the protection of mmu_seq.
> > 
> > I'll get a fixed version pushed out today.
> > 
> > [*] https://lore.kernel.org/all/y8gpl+lwsusgb...@google.com
> 
> Now I have tdx kvm working. I've uploaded at the followings.
> It's rebased to v6.2-rc3.
> g...@github.com:yamahata/linux.git tdx/upm
> g...@github.com:yamahata/qemu.git tdx/upm

And I finally got a working, building version updated and pushed out (again to):

  g...@github.com:sean-jc/linux.git x86/upm_base_support

Took longer than expected to get the memslot restrictions sussed out.  I'm done
working on the code for now; my plan is to come back to it+TDX+SNP in 2-3 weeks
to resolve any remaining todos (that no one else tackles) and to do the whole
"merge the world" exercise.

> kvm_mmu_do_page_fault() needs the following change.
> kvm_mem_is_private() queries mem_attr_array.  kvm_faultin_pfn() also uses
> kvm_mem_is_private(). So the shared-private check in kvm_faultin_pfn() doesn't
> make sense. This change would belong to TDX KVM patches, though.

Yeah, SNP needs similar treatment.  Sorting that out is high up on the todo 
list.



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2023-01-23 Thread Sean Christopherson
On Mon, Jan 23, 2023, Huang, Kai wrote:
> On Mon, 2023-01-23 at 15:03 +0100, Vlastimil Babka wrote:
> > On 12/22/22 01:37, Huang, Kai wrote:
> > > > > I argue that this page pinning (or page migration prevention) is not
> > > > > tied to where the page comes from, but rather related to how the page
> > > > > will be used. Whether the page is restrictedmem backed or GUP() backed,
> > > > > once it's used by the current version of TDX then the page pinning is
> > > > > needed. So such page migration prevention is really a TDX thing, not
> > > > > even a KVM generic thing (that's why I think we don't need to change
> > > > > the existing logic of kvm_release_pfn_clean()).
> > > > > 
> > > This essentially boils down to who "owns" page migration handling, and sadly,
> > > page migration is kinda "owned" by the core-kernel, i.e. KVM cannot handle
> > > page migration by itself -- it's just a passive receiver.
> > > 
> > > For normal pages, page migration is totally done by the core-kernel (i.e. it
> > > unmaps the page from the VMA, allocates a new page, and uses migrate_page()
> > > or a_ops->migrate_page() to actually migrate the page).
> > > In the sense of TDX, conceptually it should be done in the same way. The
> > > more important thing is: yes, KVM can use get_page() to prevent page
> > > migration, but when KVM wants to support it, KVM cannot just remove
> > > get_page(), as the core-kernel will still just do migrate_page() which
> > > won't work for TDX (given restricted_memfd doesn't have
> > > a_ops->migrate_page() implemented).
> > > 
> > > So I think the restricted_memfd filesystem should own page migration
> > > handling, (i.e. by implementing a_ops->migrate_page() to either just
> > > reject page migration or somehow support it).
> > 
> > While this thread seems to be settled on refcounts already, 
> > 
> 
> I am not sure but will let Sean/Paolo to decide.

My preference is whatever is most performant without being hideous :-)

> > just wanted
> > to point out that it wouldn't be ideal to prevent migrations by
> > a_ops->migrate_page() rejecting them. It would mean cputime wasted (i.e.
> > by memory compaction) by isolating the pages for migration and then
> > releasing them after the callback rejects it (at least we wouldn't waste
> > time creating and undoing migration entries in the userspace page tables
> > as there's no mmap). Elevated refcount on the other hand is detected
> > very early in compaction so no isolation is attempted, so from that
> > aspect it's optimal.
> 
> I am probably missing something,

Heh, me too, I could have sworn that using refcounts was the least efficient way
to block migration.

> but IIUC the checking of refcount happens at the very last stage of page
> migration too 



Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-01-19 Thread Sean Christopherson
On Thu, Jan 19, 2023, Isaku Yamahata wrote:
> On Sat, Jan 14, 2023 at 12:37:59AM +,
> Sean Christopherson  wrote:
> 
> > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > This patch series implements KVM guest private memory for confidential
> > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > TDX-protected guest memory, machine check can happen which can further
> > > crash the running host system, this is terrible for multi-tenant
> > > configurations. The host accesses include those from KVM userspace like
> > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > via a fd-based approach, but it can never access the guest memory
> > > content.
> > > 
> > > The patch series touches both core mm and KVM code. I appreciate
> > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > reviews are always welcome.
> > >   - 01: mm change, target for mm tree
> > >   - 02-09: KVM change, target for KVM tree
> > 
> > A version with all of my feedback, plus reworked versions of Vishal's 
> > selftest,
> > is available here:
> > 
> >   g...@github.com:sean-jc/linux.git x86/upm_base_support
> > 
> > It compiles and passes the selftest, but it's otherwise barely tested.  
> > There are
> > a few todos (2 I think?) and many of the commits need changelogs, i.e. it's 
> > still
> > a WIP.
> > 
> > As for next steps, can you (handwaving all of the TDX folks) take a look at 
> > what
> > I pushed and see if there's anything horrifically broken, and that it still 
> > works
> > for TDX?
> > 
> > Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no 
> > rush
> > (and I mean that).
> > 
> > On my side, the two things on my mind are (a) tests and (b) downstream 
> > dependencies
> > (SEV and TDX).  For tests, I want to build a lists of tests that are 
> > required for
> > merging so that the criteria for merging are clear, and so that if the list 
> > is large
> > (haven't thought much yet), the work of writing and running tests can be 
> > distributed.
> > 
> > Regarding downstream dependencies, before this lands, I want to pull in all 
> > the
> > TDX and SNP series and see how everything fits together.  Specifically, I 
> > want to
> > make sure that we don't end up with a uAPI that necessitates ugly code, and 
> > that we
> > don't miss an opportunity to make things simpler.  The patches in the SNP 
> > series to
> > add "legacy" SEV support for UPM in particular made me slightly rethink 
> > some minor
> > details.  Nothing remotely major, but something that needs attention since 
> > it'll
> > be uAPI.
> 
> Although I'm still debuging with TDX KVM, I needed the following.
> kvm_faultin_pfn() is called without mmu_lock held.  the race to change
> private/shared is handled by mmu_seq.  Maybe dedicated function only for
> kvm_faultin_pfn().

Gah, you're not on the other thread where this was discussed[*].  Simply deleting
the lockdep assertion is safe; for guest types that rely on the attributes to
define shared vs. private, KVM rechecks the attributes under the protection of
mmu_seq.
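
For readers who aren't steeped in KVM's MMU, the mmu_seq pattern being referenced
looks roughly like this (a condensed sketch; the real logic lives in KVM's page
fault handlers and uses more state than shown):

	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	/* Lockless work: read the attributes, resolve the pfn, etc. */
	fault->is_private = kvm_mem_is_private(kvm, fault->gfn);

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry(kvm, mmu_seq)) {
		/* The attributes may have changed underneath us; redo the fault. */
		write_unlock(&kvm->mmu_lock);
		return RET_PF_RETRY;
	}
	/* ...safe to install the SPTE under mmu_lock... */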

I'll get a fixed version pushed out today.

[*] https://lore.kernel.org/all/y8gpl+lwsusgb...@google.com



Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-01-17 Thread Sean Christopherson
On Tue, Jan 17, 2023, Chao Peng wrote:
> On Sat, Jan 14, 2023 at 12:01:01AM +0000, Sean Christopherson wrote:
> > On Fri, Dec 02, 2022, Chao Peng wrote:
> > > @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu 
> > > *vcpu)
> > >  
> > >   if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
> > >   static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> > > +
> > > + if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> > > + vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> > 
> > Synthesizing triple fault shutdown is not the right approach.  Even with TDX's
> > MCE "architecture" (heavy sarcasm), it's possible that host userspace and the
> > guest have a paravirt interface for handling memory errors without killing the
> > host.
> 
> Agree shutdown is not the correct choice. I see you made below change:
> 
> send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current)
> 
> The MCE may happen in a thread other than the KVM thread; sending a signal to
> the 'current' thread may not be the expected behavior.

This is already true today, e.g. a #MC in memory that is mapped into the guest
can be triggered by a host access.  Hrm, but in this case we actually have a KVM
instance, and we know that the #MC is relevant to the KVM instance, so I agree
that signaling 'current' is kludgy.

>  Also, how can userspace tell whether the MCE is on a shared page or a private
>  page?  Do we care?

We care.  I was originally thinking we could require userspace to keep track of
things, but that's quite prescriptive and flawed, e.g. could race with conversions.

One option would be KVM_EXIT_MEMORY_FAULT, wired up to a generic (not x86
specific) KVM request to exit to userspace, e.g.

/* KVM_EXIT_MEMORY_FAULT */
struct {
#define KVM_MEMORY_EXIT_FLAG_PRIVATE    (1ULL << 3)
#define KVM_MEMORY_EXIT_FLAG_HW_ERROR   (1ULL << 4)
	__u64 flags;
	__u64 gpa;
	__u64 size;
} memory;

But I'm not sure that's the correct approach.  It kinda feels like we're
reinventing the wheel.  It seems like restrictedmem_get_page() _must_ be able to
reject attempts to get a poisoned page, i.e. restrictedmem_get_page() should yield
KVM_PFN_ERR_HWPOISON.  Assuming that's the case, then I believe KVM simply needs
to zap SPTEs in response to an error notification in order to force vCPUs to
fault on the poisoned page.
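
A sketch of that idea (helper names and the notifier-in-memslot layout are
assumed, not from the series): the error notifier computes the affected GFN range
and zaps it so vCPUs must re-fault and observe the poison.

static void kvm_restrictedmem_error(struct restrictedmem_notifier *notifier,
				    pgoff_t start, pgoff_t end)
{
	struct kvm_memory_slot *slot = container_of(notifier,
						    struct kvm_memory_slot,
						    notifier);
	gfn_t base = slot->base_gfn - (slot->restricted_offset >> PAGE_SHIFT);

	/* Zap [start, end) so the next access refaults via restrictedmem. */
	kvm_zap_gfn_range(slot->kvm, base + start, base + end);
}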

> > > + return -EINVAL;
> > >   if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
> > >   return -EINVAL;
> > >   if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> > > @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
> > >   if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
> > >   return -EINVAL;
> > >   } else { /* Modify an existing slot. */
> > > + /* Private memslots are immutable, they can only be deleted. */
> > 
> > I'm 99% certain I suggested this, but if we're going to make these memslots
> > immutable, then we should straight up disallow dirty logging, otherwise 
> > we'll
> > end up with a bizarre uAPI.
> 
> But in my mind dirty logging will be needed in the near term, when
> live migration gets supported?

Ya, but if/when live migration support is added, private memslots will no longer
be immutable as userspace will want to enable dirty logging only when a VM is
being migrated, i.e. something will need to change.

Given that it looks like we have clear line of sight to SEV+UPM guests, my
preference would be to allow toggling dirty logging from the get-go.  It doesn't
necessarily have to be in the first patch, e.g. KVM could initially reject
KVM_MEM_LOG_DIRTY_PAGES + KVM_MEM_PRIVATE and then add support separately to make
the series easier to review, test, and bisect.

static int check_memory_region_flags(struct kvm *kvm,
                                     const struct kvm_userspace_memory_region2 *mem)
{
        u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;

        if (kvm_arch_has_private_mem(kvm) &&
            !(mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
                valid_flags |= KVM_MEM_PRIVATE;

        ...
}

> > > + if (mem->flags & KVM_MEM_PRIVATE)
> > > + return -EINVAL;
> > >   if ((mem->userspace_addr != old->userspace_addr) ||
> > >   (npages != old->npages) ||
> > >   ((mem->flags ^ old->flags) & KVM_MEM_

Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2023-01-17 Thread Sean Christopherson
On Tue, Jan 17, 2023, Chao Peng wrote:
> On Tue, Jan 17, 2023 at 11:21:10AM +0800, Binbin Wu wrote:
> > 
> > On 12/2/2022 2:13 PM, Chao Peng wrote:
> > > In confidential computing usages, whether a page is private or shared is
> > > necessary information for KVM to perform operations like page fault
> > > handling, page zapping etc. There are other potential use cases for
> > > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > > or exec-only, etc.) without having to modify memslots.
> > > 
> > > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > > userspace to operate on the per-page memory attributes.
> > >- KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> > >  a guest memory range.
> > >- KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> > >  memory attributes.
> > > 
> > > KVM internally uses xarray to store the per-page memory attributes.
> > > 
> > > Suggested-by: Sean Christopherson 
> > > Signed-off-by: Chao Peng 
> > > Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com/
> > > ---
> > >   Documentation/virt/kvm/api.rst | 63 
> > >   arch/x86/kvm/Kconfig   |  1 +
> > >   include/linux/kvm_host.h   |  3 ++
> > >   include/uapi/linux/kvm.h   | 17 
> > 
> > Should the changes introduced in this file also need to be added in
> > tools/include/uapi/linux/kvm.h ?
> 
> Yes I think.

I'm not sure how Paolo or others feel, but my preference is to never update KVM's
uapi headers in tools/ in KVM's tree.  Nothing KVM-related in tools/ actually
relies on the headers being copied into tools/, e.g. KVM selftests pulls KVM's
headers from the .../usr/include/ directory that's populated by `make
headers_install`.

Perf's tooling is what actually "needs" the headers to be copied into tools/, so
my preference is to let the tools/perf maintainers deal with the headache of
keeping everything up-to-date.

> But I'm hesitant about including it in this patch. I see many commits sync the
> kernel's kvm.h to tools' copy. Looks like that is done periodically and with a
> 'pull' model.



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2023-01-17 Thread Sean Christopherson
On Tue, Jan 17, 2023, Chao Peng wrote:
> On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> > > + list_for_each_entry(notifier, &data->notifiers, list) {
> > > + notifier->ops->invalidate_start(notifier, start, end);
> > 
> > Two major design issues that we overlooked long ago:
> > 
> >   1. Blindly invoking notifiers will not scale.  E.g. if userspace configures
> >  a VM with a large number of convertible memslots that are all backed by a
> >  single large restrictedmem instance, then converting a single page will
> >  result in a linear walk through all memslots.  I don't expect anyone to
> >  actually do something silly like that, but I also never expected there to
> >  be a legitimate usecase for thousands of memslots.
> > 
> >   2. This approach fails to provide the ability for KVM to ensure a guest has
> >  exclusive access to a page.  As discussed in the past, the kernel can rely
> >  on hardware (and maybe ARM's pKVM implementation?) for those guarantees,
> >  but only for SNP and TDX VMs.  For VMs where userspace is trusted to some
> >  extent, e.g. SEV, there is value in ensuring a 1:1 association.
> > 
> >  And probably more importantly, relying on hardware for SNP and TDX yields
> >  a poor ABI and complicates KVM's internals.  If the kernel doesn't
> >  guarantee a page is exclusive to a guest, i.e. if userspace can hand out
> >  the same page from a restrictedmem instance to multiple VMs, then failure
> >  will occur only when KVM tries to assign the page to the second VM.  That
> >  will happen deep in KVM, which means KVM needs to gracefully handle such
> >  errors, and it means that KVM's ABI effectively allows plumbing garbage
> >  into its memslots.
> 
> It may not be a valid usage, but in my TDX environment I do hit the below
> issue.
> 
> kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x8000 ua=0x7fe1ebfff000 ret=0
> kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc0 size=0x40 ua=0x7fe271579000 ret=0
> kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda size=0x2 ua=0x7fe1ec09f000 ret=-22
> 
> Slot#2('SMRAM') is actually an alias into system memory (Slot#0) in QEMU,
> and slot#2 fails due to the below exclusive check.
> 
> Currently I changed QEMU code to mark these alias slots as shared
> instead of private, but I'm not 100% confident this is the correct fix.

That's a QEMU bug of sorts.  SMM is mutually exclusive with TDX, QEMU shouldn't
be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.

Actually, KVM should enforce that by disallowing SMM memslots for TDX guests.
Ditto for SNP guests and UPM-backed SEV and SEV-ES guests.  I think it probably
even makes sense to introduce that restriction in the base UPM support, e.g.
something like the below.  That would unnecessarily prevent emulating SMM for
KVM_X86_PROTECTED_VM types that aren't encrypted, but IMO that's an acceptable
limitation until there's an actual use case for KVM_X86_PROTECTED_VM guests
beyond SEV (my thought is that KVM_X86_PROTECTED_VM will mostly be a vehicle for
selftests and UPM-based SEV and SEV-ES guests).

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 48b7bdad1e0a..0a8aac821cb0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4357,6 +4357,14 @@ bool kvm_arch_has_private_mem(struct kvm *kvm)
return kvm->arch.vm_type != KVM_X86_DEFAULT_VM;
 }
 
+int kvm_arch_nr_address_spaces(struct kvm *kvm)
+{
+   if (kvm->arch.vm_type != KVM_X86_DEFAULT_VM)
+   return 1;
+
+   return KVM_ADDRESS_SPACE_NUM;
+}
+
 static bool kvm_is_vm_type_supported(unsigned long type)
 {
return type == KVM_X86_DEFAULT_VM ||
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 97801d81ee42..e0a3fc819fe5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2126,7 +2126,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 mem->restricted_offset + mem->memory_size < mem->restricted_offset ||
 0 /* TODO: require gfn be aligned with restricted offset */))
return -EINVAL;
-   if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
+   if (as_id >= kvm_arch_nr_address_spaces(kvm) || id >= KVM_MEM_SLOTS_NUM)
return -EINVAL;
if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
return -EINVAL;




Re: [PATCH v10 0/9] KVM: mm: fd-based approach for supporting KVM

2023-01-13 Thread Sean Christopherson
On Fri, Dec 02, 2022, Chao Peng wrote:
> This patch series implements KVM guest private memory for confidential
> computing scenarios like Intel TDX[1]. If a TDX host accesses
> TDX-protected guest memory, machine check can happen which can further
> crash the running host system, this is terrible for multi-tenant
> configurations. The host accesses include those from KVM userspace like
> QEMU. This series addresses KVM userspace induced crash by introducing
> new mm and KVM interfaces so KVM userspace can still manage guest memory
> via a fd-based approach, but it can never access the guest memory
> content.
> 
> The patch series touches both core mm and KVM code. I appreciate
> Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> reviews are always welcome.
>   - 01: mm change, target for mm tree
>   - 02-09: KVM change, target for KVM tree

A version with all of my feedback, plus reworked versions of Vishal's selftest,
is available here:

  g...@github.com:sean-jc/linux.git x86/upm_base_support

It compiles and passes the selftest, but it's otherwise barely tested.  There are
a few todos (2 I think?) and many of the commits need changelogs, i.e. it's still
a WIP.

As for next steps, can you (handwaving all of the TDX folks) take a look at what
I pushed and see if there's anything horrifically broken, and that it still works
for TDX?

Fuad (and pKVM folks) same ask for you with respect to pKVM.  Absolutely no rush
(and I mean that).

On my side, the two things on my mind are (a) tests and (b) downstream
dependencies (SEV and TDX).  For tests, I want to build a list of tests that are
required for merging so that the criteria for merging are clear, and so that if
the list is large (haven't thought much yet), the work of writing and running
tests can be distributed.

Regarding downstream dependencies, before this lands, I want to pull in all the
TDX and SNP series and see how everything fits together.  Specifically, I want to
make sure that we don't end up with a uAPI that necessitates ugly code, and that
we don't miss an opportunity to make things simpler.  The patches in the SNP
series to add "legacy" SEV support for UPM in particular made me slightly rethink
some minor details.  Nothing remotely major, but something that needs attention
since it'll be uAPI.

I'm off Monday, so it'll be at least Tuesday before I make any more progress on
my side.

Thanks!



Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

2023-01-13 Thread Sean Christopherson
On Fri, Dec 02, 2022, Chao Peng wrote:
> @@ -10357,6 +10364,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  
>   if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu))
>   static_call(kvm_x86_update_cpu_dirty_logging)(vcpu);
> +
> + if (kvm_check_request(KVM_REQ_MEMORY_MCE, vcpu)) {
> + vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;

Synthesizing triple fault shutdown is not the right approach.  Even with TDX's
MCE "architecture" (heavy sarcasm), it's possible that host userspace and the
guest have a paravirt interface for handling memory errors without killing the
host.

> + r = 0;
> + goto out;
> + }
>   }


> @@ -1982,6 +2112,10 @@ int __kvm_set_memory_region(struct kvm *kvm,
>!access_ok((void __user *)(unsigned long)mem->userspace_addr,
>   mem->memory_size))
>   return -EINVAL;
> + if (mem->flags & KVM_MEM_PRIVATE &&
> + (mem->restricted_offset & (PAGE_SIZE - 1) ||

Align indentation.

> +  mem->restricted_offset > U64_MAX - mem->memory_size))

Strongly prefer to use similar logic to existing code that detects wraps:

mem->restricted_offset + mem->memory_size < mem->restricted_offset

This is also where I'd like to add the "gfn is aligned to offset" check, though
my brain is too fried to figure that out right now.

> + return -EINVAL;
>   if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
>   return -EINVAL;
>   if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> @@ -2020,6 +2154,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
>   return -EINVAL;
>   } else { /* Modify an existing slot. */
> + /* Private memslots are immutable, they can only be deleted. */

I'm 99% certain I suggested this, but if we're going to make these memslots
immutable, then we should straight up disallow dirty logging, otherwise we'll
end up with a bizarre uAPI.

> + if (mem->flags & KVM_MEM_PRIVATE)
> + return -EINVAL;
>   if ((mem->userspace_addr != old->userspace_addr) ||
>   (npages != old->npages) ||
>   ((mem->flags ^ old->flags) & KVM_MEM_READONLY))
> @@ -2048,10 +2185,28 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   new->npages = npages;
>   new->flags = mem->flags;
>   new->userspace_addr = mem->userspace_addr;
> + if (mem->flags & KVM_MEM_PRIVATE) {
> + new->restricted_file = fget(mem->restricted_fd);
> + if (!new->restricted_file ||
> + !file_is_restrictedmem(new->restricted_file)) {
> + r = -EINVAL;
> + goto out;
> + }
> + new->restricted_offset = mem->restricted_offset;
> + }
> +
> + new->kvm = kvm;

Set this above, just so that the code flows better.



Re: [PATCH v10 8/9] KVM: Handle page fault for private memory

2023-01-13 Thread Sean Christopherson
On Fri, Dec 02, 2022, Chao Peng wrote:
> @@ -5599,6 +5652,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, 
> gpa_t cr2_or_gpa, u64 err
>   return -EIO;
>   }
>  
> + if (r == RET_PF_USER)
> + return 0;
> +
>   if (r < 0)
>   return r;
>   if (r != RET_PF_EMULATE)
> @@ -6452,7 +6508,8 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm 
> *kvm,
>*/
>   if (sp->role.direct &&
>   sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, 
> sp->gfn,
> -PG_LEVEL_NUM)) {
> +PG_LEVEL_NUM,
> +false)) {

Passing %false is incorrect.  It might not cause problems because KVM currently
doesn't allow modifying private memslots (that likely needs to change to allow
dirty logging), but it's wrong since nothing guarantees KVM is operating on SPTEs
for shared memory.

One option would be to take the patches from the TDX series that add a "private"
flag to the shadow page role, but I'd rather not add the role until it's truly
necessary.

For now, I think we can do this without impacting performance of guests that
don't support private memory.

int kvm_mmu_max_mapping_level(struct kvm *kvm,
			      const struct kvm_memory_slot *slot, gfn_t gfn,
			      int max_level)
{
	bool is_private = kvm_slot_can_be_private(slot) &&
			  kvm_mem_is_private(kvm, gfn);

	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
}

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 25099c94e770..153842bb33df 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2335,4 +2335,34 @@ static inline void 
> kvm_arch_set_memory_attributes(struct kvm *kvm,
>  }
>  #endif /* __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES */
>  
> +#ifdef CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{

This code, i.e. the generic KVM changes, belongs in a separate patch.  It'll be
small, but I want to separate x86's page fault changes from the restrictedmem
support being added to common KVM.

This should also short-circuit based on CONFIG_HAVE_KVM_RESTRICTED_MEM, though
I would name that CONFIG_KVM_PRIVATE_MEMORY since in KVM's world, it's all about
private vs. shared at this time.

> + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)) &
> +KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +}
> +#else
> +static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
> +{
> + return false;
> +}
> +
> +#endif /* CONFIG_HAVE_KVM_MEMORY_ATTRIBUTES */
> +
> +#ifdef CONFIG_HAVE_KVM_RESTRICTED_MEM
> +static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
> + gfn_t gfn, kvm_pfn_t *pfn, int *order)
> +{
> + int ret;
> + struct page *page;
> + pgoff_t index = gfn - slot->base_gfn +
> + (slot->restricted_offset >> PAGE_SHIFT);
> +
> + ret = restrictedmem_get_page(slot->restricted_file, index,
> +  &page, order);

This needs to handle errors.  If "ret" is non-zero, "page" is garbage.
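I.e. something like (sketch):

	ret = restrictedmem_get_page(slot->restricted_file, index,
				     &page, order);
	if (ret)
		return ret;

	*pfn = page_to_pfn(page);
	return 0;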

> + *pfn = page_to_pfn(page);
> + return ret;
> +}
> +#endif /* CONFIG_HAVE_KVM_RESTRICTED_MEM */
> +
>  #endif
> -- 
> 2.25.1
> 



Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed

2023-01-13 Thread Sean Christopherson
On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a07380f8d3c..5aefcff614d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12362,6 +12362,8 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>   if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 
> 1))
>   linfo[lpages - 1].disallow_lpage = 1;
>   ugfn = slot->userspace_addr >> PAGE_SHIFT;
> + if (kvm_slot_can_be_private(slot))
> + ugfn |= slot->restricted_offset >> PAGE_SHIFT;
>   /*
>* If the gfn and userspace address are not aligned wrt each
>* other, disable large page support for this slot.

Forgot to talk about the bug.  This code needs to handle the scenario where a
memslot is created with existing, non-uniform attributes.  It might be a bit ugly
(I didn't even try to write the code), but it's definitely possible, and since
memslot updates are already slow I think it's best to handle things here.

In the meantime, I added this so we don't forget to fix it before merging.

#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
	pr_crit_once("FIXME: Walk the memory attributes of the slot and set the mixed status appropriately");
#endif




Re: [PATCH v10 4/9] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2023-01-13 Thread Sean Christopherson
On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 99352170c130..d9edb14ce30b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6634,6 +6634,28 @@ array field represents return values. The userspace 
> should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE   (1ULL << 0)

Unless there's a reason not to, we should use bit 3 to match the attributes.



Re: [PATCH v10 7/9] KVM: Update lpage info when private/shared memory are mixed

2023-01-13 Thread Sean Christopherson
On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 283cbb83d6ae..7772ab37ac89 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -38,6 +38,7 @@
>  #include 
>  
>  #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
> +#define __KVM_HAVE_ARCH_SET_MEMORY_ATTRIBUTES

No need for this, I think we should just make it mandatory to implement the
arch hook when CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES=y.  If another arch gains
support for mem attributes and doesn't need the hook, then we can simply add a
weak helper (or maybe add a #define then if we feel that's the way to go).
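E.g., if that day comes, a weak no-op default could look like this (sketch; the
parameter list beyond struct kvm is assumed for illustration):

	void __weak kvm_arch_set_memory_attributes(struct kvm *kvm,
						   struct kvm_memory_slot *slot,
						   unsigned long attrs,
						   gfn_t start, gfn_t end)
	{
	}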

>  #define KVM_MAX_VCPUS 1024
>  
> @@ -1011,6 +1012,13 @@ struct kvm_vcpu_arch {
>  #endif
>  };
>  
> +/*
> + * Use a bit in disallow_lpage to indicate private/shared pages mixed at the
> + * level. The remaining bits are used as a reference count.
> + */
> +#define KVM_LPAGE_PRIVATE_SHARED_MIXED   (1U << 31)

Similar to the need to unmap, I think we should just say "mixed" and ignore the
private vs. shared, i.e. make this a flag for all memory attributes.

> +#define KVM_LPAGE_COUNT_MAX  ((1U << 31) - 1)

"MAX" is technically correct, but it's more of a mask.  I think we can make it a
moot point though.  There's no need to mask the count, we just want to assert 
that
adjusting the counting doesn't change the flag.

I would also say throw these defines into mmu.c, at least pending the bug fix
for kvm_alloc_memslot_metadata() (more on that below).

>  struct kvm_lpage_info {
>   int disallow_lpage;
>  };
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e2c70b5afa3e..2190fd8c95c0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -763,11 +763,16 @@ static void update_gfn_disallow_lpage_count(const 
> struct kvm_memory_slot *slot,
>  {
>   struct kvm_lpage_info *linfo;
>   int i;
> + int disallow_count;
>  
>   for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
>   linfo = lpage_info_slot(gfn, slot, i);
> +
> + disallow_count = linfo->disallow_lpage & KVM_LPAGE_COUNT_MAX;
> + WARN_ON(disallow_count + count < 0 ||
> + disallow_count > KVM_LPAGE_COUNT_MAX - count);
> +
>   linfo->disallow_lpage += count;
> - WARN_ON(linfo->disallow_lpage < 0);

It's been a long week so don't trust my math, but I believe this can simply be:

		old = linfo->disallow_lpage;
		linfo->disallow_lpage += count;

		WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
>   }
>  }
>  
> @@ -6986,3 +6991,130 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>   if (kvm->arch.nx_huge_page_recovery_thread)
>   kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
>  }
> +
> +static bool linfo_is_mixed(struct kvm_lpage_info *linfo)
> +{
> + return linfo->disallow_lpage & KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static void linfo_set_mixed(gfn_t gfn, struct kvm_memory_slot *slot,
> + int level, bool mixed)
> +{
> + struct kvm_lpage_info *linfo = lpage_info_slot(gfn, slot, level);
> +
> + if (mixed)
> + linfo->disallow_lpage |= KVM_LPAGE_PRIVATE_SHARED_MIXED;
> + else
> + linfo->disallow_lpage &= ~KVM_LPAGE_PRIVATE_SHARED_MIXED;
> +}
> +
> +static bool is_expected_attr_entry(void *entry, unsigned long expected_attrs)
> +{
> + bool expect_private = expected_attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
> +
> + if (xa_to_value(entry) & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
> + if (!expect_private)
> + return false;
> + } else if (expect_private)
> + return false;

This is messy.  If we drop the private vs. shared specificity, this can go away if
we add a helper to get attributes:

	static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
	{
		return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
	}

and then we can do:

	if (KVM_BUG_ON(gfn != xas.xa_index, kvm) ||
	    attrs != kvm_get_memory_attributes(kvm, gfn)) {
		mixed = true;
		break;
	}

and

	if (linfo_is_mixed(lpage_info_slot(gfn, slot, level - 1)) ||
	    attrs != kvm_get_memory_attributes(kvm, gfn))
		return true;


> +
> + return true;
> +}
> +
> +static bool mem_attrs_mixed_2m(struct kvm *kvm, unsigned long attrs,
> +gfn_t start, gfn_t end)
> +{
> + XA_STATE(xas, &kvm->mem_attr_array, start);
> + gfn_t gfn = start;
> + void *entry;
> + bool mixed = false;
> +
> + rcu_read_lock();
> + entry = xas_load(&xas);
> + while (gfn < end) {
> + if (xas_retry(&xas, entry))
> + continue;
> +
> +  

Re: [PATCH v10 6/9] KVM: Unmap existing mappings when change the memory attributes

2023-01-13 Thread Sean Christopherson
On Fri, Dec 02, 2022, Chao Peng wrote:
> @@ -785,11 +786,12 @@ struct kvm {
>  
>  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)
>   struct mmu_notifier mmu_notifier;
> +#endif
>   unsigned long mmu_invalidate_seq;
>   long mmu_invalidate_in_progress;
>   gfn_t mmu_invalidate_range_start;
>   gfn_t mmu_invalidate_range_end;
> -#endif

Blech.  The existing code is a bit ugly, and trying to extend for this use case
makes things even worse.

Rather than use the base MMU_NOTIFIER Kconfig and an arbitrary define, I think we
should first add a proper Kconfig, e.g. KVM_GENERIC_MMU_NOTIFIER, to replace the
combination.  E.g.

config KVM_GENERIC_MMU_NOTIFIER
       select MMU_NOTIFIER
       bool

and then all architectures that currently #define KVM_ARCH_WANT_MMU_NOTIFIER can
simply select the Kconfig, which is everything except s390.  "GENERIC" again because
s390 does select MMU_NOTIFIER and actually registers its own notifier for s390's
version of protected VMs (at least, I think that's what its "pv" stands for).

And then later down the line in this series, when the attributes and private mem
needs to tie into the notifiers, we can do:


config KVM_GENERIC_MEMORY_ATTRIBUTES
   select KVM_GENERIC_MMU_NOTIFIER
   bool

I.e. that way this patch doesn't need to partially expose KVM's notifier stuff
and can instead just keep the soon-to-be-existing KVM_GENERIC_MMU_NOTIFIER.

Taking a dependency on KVM_GENERIC_MMU_NOTIFIER for KVM_GENERIC_MEMORY_ATTRIBUTES
makes sense, because AFAICT, changing any type of attribute, e.g. RWX bits, is
going to necessitate unmapping the affected gfn range.

>   struct list_head devices;
>   u64 manual_dirty_log_protect;
>   struct dentry *debugfs_dentry;
> @@ -1480,6 +1482,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu 
> *vcpu);
>  int kvm_arch_post_init_vm(struct kvm *kvm);
>  void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>  int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_has_private_mem(struct kvm *kvm);

The reference to private memory belongs in a later patch.  More below.

> +static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end)
> +{
> + struct kvm_gfn_range gfn_range;
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + struct kvm_memslot_iter iter;
> + int i;
> + int r = 0;

The return from kvm_unmap_gfn_range() is a bool, this should be:

bool flush = false;

> +
> + gfn_range.pte = __pte(0);
> + gfn_range.may_block = true;
> +
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + slots = __kvm_memslots(kvm, i);
> +
> + kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
> + slot = iter.slot;
> + gfn_range.start = max(start, slot->base_gfn);
> + gfn_range.end = min(end, slot->base_gfn + slot->npages);
> + if (gfn_range.start >= gfn_range.end)
> + continue;
> + gfn_range.slot = slot;
> +
> + r |= kvm_unmap_gfn_range(kvm, &gfn_range);
> + }
> + }
> +
> + if (r)
> + kvm_flush_remote_tlbs(kvm);
> +}
> +
>  static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>  struct kvm_memory_attributes *attrs)
>  {
>   gfn_t start, end;
>   unsigned long i;
>   void *entry;
> + int idx;
>   u64 supported_attrs = kvm_supported_mem_attributes(kvm);
>  
> - /* flags is currently not used. */
> + /* 'flags' is currently not used. */

Kind of a spurious change.

>   if (attrs->flags)
>   return -EINVAL;
>   if (attrs->attributes & ~supported_attrs)
> @@ -2372,6 +2409,13 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm 
> *kvm,
>  
>   entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
>  
> + if (kvm_arch_has_private_mem(kvm)) {

I think we should assume that any future attributes will necessitate unmapping
and invalidation, i.e. drop the private mem check.  That allows introducing
kvm_arch_has_private_mem() in a later patch that is more directly related to
private memory.

> + KVM_MMU_LOCK(kvm);
> + kvm_mmu_invalidate_begin(kvm);
> + kvm_mmu_invalidate_range_add(kvm, start, end);
> + KVM_MMU_UNLOCK(kvm);
> + }
> +
> + mutex_lock(&kvm->lock);
> + for (i = start; i < end; i++)
> + if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> @@ -2379,6 +2423,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>   break;
>   mutex_unlock(&kvm->lock);
>  
> + if (kvm_arch_has_private_mem(kvm)) {
> + idx = srcu_read_lock(&kvm->srcu);

Mostly for reference, this goes away if slots_lock is used instead of kvm->lock.

> + KVM_MMU_LOCK(kvm);
> +

Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2023-01-13 Thread Sean Christopherson
On Tue, Jan 10, 2023, Chao Peng wrote:
> On Mon, Jan 09, 2023 at 07:32:05PM +0000, Sean Christopherson wrote:
> > On Fri, Jan 06, 2023, Chao Peng wrote:
> > > On Thu, Jan 05, 2023 at 11:23:01AM +, Jarkko Sakkinen wrote:
> > > > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > > > To make future maintenance easy, internally use a binary compatible
> > > > > alias struct kvm_user_mem_region to handle both the normal and the
> > > > > '_ext' variants.
> > > > 
> > > > Feels bit hacky IMHO, and more like a completely new feature than
> > > > an extension.
> > > > 
> > > > Why not just add a new ioctl? The commit message does not address
> > > > the most essential design here.
> > > 
> > > Yes, people can always choose to add a new ioctl for this kind of change
> > > and the balance point here is we want to also avoid 'too many ioctls' if
> > > the functionalities are similar.  The '_ext' variant reuses all the
> > > existing fields in the 'normal' variant and most importantly KVM
> > > internally can reuse most of the code. I certainly can add some words in
> > > the commit message to explain this design choice.
> > 
> > After seeing the userspace side of this, I agree with Jarkko; overloading
> > KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up 
> > being
> > bogus, and userspace ends up abusing unions or implementing 
> > kvm_user_mem_region
> > itself.
> 
> How is the size validation being bogus? I don't quite follow.

The ioctl() magic embeds the size of the payload (struct kvm_userspace_memory_region
in this case) in the ioctl() number, and that information is visible to userspace
via _IOC_SIZE().  Attempting to take a larger size can mess up sanity checks,
e.g. KVM selftests get tripped up on this assert if KVM_SET_USER_MEMORY_REGION is
passed an "extended" struct.

#define kvm_do_ioctl(fd, cmd, arg)						\
({										\
	kvm_static_assert(!_IOC_SIZE(cmd) || sizeof(*arg) == _IOC_SIZE(cmd));	\
	ioctl(fd, cmd, arg);							\
})

> Then we will use kvm_userspace_memory_region2 as the KVM internal alias,
> right?

Yep.

> I see similar examples use different functions to handle different versions
> but it does look easier if we use alias for this function.
> 
> > 
> > It feels absolutely ridiculous, but I think the best option is to do:
> > 
> > #define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
> >  struct kvm_userspace_memory_region2)
> 
> Just interesting, is 0x49 a safe number we can use? 

Yes?  So long as it's not used by KVM, it's safe.  AFAICT, it's unused.



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2023-01-13 Thread Sean Christopherson
On Fri, Dec 02, 2022, Chao Peng wrote:
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index fbeaa9ddef59..a8e379a3afee 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -49,6 +49,7 @@ config KVM
>   select SRCU
>   select INTERVAL_TREE
>   select HAVE_KVM_PM_NOTIFIER if PM
> + select HAVE_KVM_MEMORY_ATTRIBUTES

I would prefer to call this KVM_GENERIC_MEMORY_ATTRIBUTES.  Similar to
KVM_GENERIC_HARDWARE_ENABLING, ARM does need/have hardware enabling, it just
doesn't want KVM's generic implementation.  In this case, pKVM does support memory
attributes, but uses stage-2 tables to track ownership and doesn't need/want the
overhead of the generic implementation.

>   help

...

> +#define KVM_MEMORY_ATTRIBUTE_READ  (1ULL << 0)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXECUTE   (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_PRIVATE   (1ULL << 3)

I think we should carve out bits 0-2 for RWX, but I don't think we should define
them until they're actually accepted by KVM.
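I.e. reserve the low bits now but define only what's actually supported (sketch):

	/* Bits 0-2 are reserved for future RWX attributes. */
	#define KVM_MEMORY_ATTRIBUTE_PRIVATE	(1ULL << 3)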

> +static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
> +struct kvm_memory_attributes *attrs)
> +{
> + gfn_t start, end;
> + unsigned long i;
> + void *entry;
> + u64 supported_attrs = kvm_supported_mem_attributes(kvm);
> +
> + /* flags is currently not used. */
> + if (attrs->flags)
> + return -EINVAL;
> + if (attrs->attributes & ~supported_attrs)

Nit, no need for "supported_attrs", just consume kvm_supported_mem_attributes()
directly.

> + return -EINVAL;
> + if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
> + return -EINVAL;
> + if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
> + return -EINVAL;
> +
> + start = attrs->address >> PAGE_SHIFT;
> + end = (attrs->address + attrs->size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + entry = attrs->attributes ? xa_mk_value(attrs->attributes) : NULL;
> +
> + mutex_lock(>lock);

Peeking forward multiple patches, this needs to take kvm->slots_lock, not kvm->lock.
There's a bug in the lpage_disallowed patch that I believe can most easily be
solved by making this mutually exclusive with memslot changes.

When a memslot is created, KVM needs to walk through the attributes to detect
whether or not the attributes are identical for the entire slot.  To avoid races,
that means taking slots_lock.

The alternative would be to query the attributes when adjusting the hugepage level
and avoid lpage_disallowed entirely, but in the (very brief) time I've thought
about this I haven't come up with a way to do that in a performant manner.

> + for (i = start; i < end; i++)

Curly braces needed on the for-loop.
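I.e. with braces (loop body taken from the same function quoted above):

	for (i = start; i < end; i++) {
		if (xa_err(xa_store(&kvm->mem_attr_array, i, entry,
				    GFP_KERNEL_ACCOUNT)))
			break;
	}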



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2023-01-13 Thread Sean Christopherson
On Fri, Dec 02, 2022, Chao Peng wrote:
> The system call is currently wired up for x86 arch.

Building on other architectures (except for arm64 for some reason) yields:

  CALL/.../scripts/checksyscalls.sh
  :1565:2: warning: #warning syscall memfd_restricted not implemented [-Wcpp]

Do we care?  It's the only such warning, which makes me think we either need to
wire this up for all architectures, or explicitly document that it's unsupported.

> Signed-off-by: Kirill A. Shutemov 
> Signed-off-by: Chao Peng 
> ---

...

> diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
> new file mode 100644
> index ..c2700c5daa43
> --- /dev/null
> +++ b/include/linux/restrictedmem.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _LINUX_RESTRICTEDMEM_H

Missing

 #define _LINUX_RESTRICTEDMEM_H

which causes fireworks if restrictedmem.h is included more than once.
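I.e. the header should be guarded as (sketch):

	/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
	#ifndef _LINUX_RESTRICTEDMEM_H
	#define _LINUX_RESTRICTEDMEM_H

	/* ... declarations ... */

	#endif /* _LINUX_RESTRICTEDMEM_H */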

> +#include 
> +#include 
> +#include 

...

> +static inline int restrictedmem_get_page(struct file *file, pgoff_t offset,
> +  struct page **pagep, int *order)
> +{
> + return -1;

This should be a proper -errno, though in the current incarnation of things it's
a moot point because no stub is needed.  KVM can (and should) easily provide its
own stub for this one.
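E.g. KVM could carry something like this next to its only caller (sketch; the
WARN is illustrative since the stub should be unreachable):

	static inline int kvm_restricted_mem_get_pfn(struct kvm_memory_slot *slot,
						     gfn_t gfn, kvm_pfn_t *pfn,
						     int *order)
	{
		WARN_ON_ONCE(1);
		return -EIO;
	}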

> +}
> +
> +static inline bool file_is_restrictedmem(struct file *file)
> +{
> + return false;
> +}
> +
> +static inline void restrictedmem_error_page(struct page *page,
> + struct address_space *mapping)
> +{
> +}
> +
> +#endif /* CONFIG_RESTRICTEDMEM */
> +
> +#endif /* _LINUX_RESTRICTEDMEM_H */

...

> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> new file mode 100644
> index ..56953c204e5c
> --- /dev/null
> +++ b/mm/restrictedmem.c
> @@ -0,0 +1,318 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct restrictedmem_data {

Any objection to simply calling this "restrictedmem"?  And then using either "rm"
or "rmem" for local variable names?  I kept reading "data" as the underlying data
being written to the page, as opposed to the metadata describing the restrictedmem
instance.

> + struct mutex lock;
> + struct file *memfd;
> + struct list_head notifiers;
> +};
> +
> +static void restrictedmem_invalidate_start(struct restrictedmem_data *data,
> +pgoff_t start, pgoff_t end)
> +{
> + struct restrictedmem_notifier *notifier;
> +
> + mutex_lock(&data->lock);

This can be a r/w semaphore instead of a mutex, that way punching holes at multiple
points in the file can at least run the notifiers in parallel.  The actual allocation
by shmem will still be serialized, but I think it's worth the simple optimization
since zapping and flushing in KVM may be somewhat slow.

> + list_for_each_entry(notifier, &data->notifiers, list) {
> + notifier->ops->invalidate_start(notifier, start, end);

Two major design issues that we overlooked long ago:

  1. Blindly invoking notifiers will not scale.  E.g. if userspace configures a
     VM with a large number of convertible memslots that are all backed by a
     single large restrictedmem instance, then converting a single page will
     result in a linear walk through all memslots.  I don't expect anyone to
     actually do something silly like that, but I also never expected there to be
     a legitimate usecase for thousands of memslots.

  2. This approach fails to provide the ability for KVM to ensure a guest has
     exclusive access to a page.  As discussed in the past, the kernel can rely
     on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
     only for SNP and TDX VMs.  For VMs where userspace is trusted to some extent,
     e.g. SEV, there is value in ensuring a 1:1 association.

     And probably more importantly, relying on hardware for SNP and TDX yields a
     poor ABI and complicates KVM's internals.  If the kernel doesn't guarantee a
     page is exclusive to a guest, i.e. if userspace can hand out the same page
     from a restrictedmem instance to multiple VMs, then failure will occur only
     when KVM tries to assign the page to the second VM.  That will happen deep
     in KVM, which means KVM needs to gracefully handle such errors, and it means
     that KVM's ABI effectively allows plumbing garbage into its memslots.

Rather than use a simple list of notifiers, this appears to be yet another
opportunity to use an xarray.  Supporting sharing of restrictedmem will be
non-trivial, but IMO we should punt that to the future since it's still unclear
exactly how sharing will work.

An xarray will solve #1 by notifying only the consumers (memslots) that are bound
to the affected range.
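A sketch of what that could look like, assuming each consumer is bound via a
multi-index xarray entry covering its range (the struct and field names here are
illustrative, not from the patch):

	static void restrictedmem_invalidate_start(struct restrictedmem *rm,
						   pgoff_t start, pgoff_t end)
	{
		struct restrictedmem_notifier *notifier;
		unsigned long index;

		/* Walk only the consumers bound to [start, end), not all of them. */
		xa_for_each_range(&rm->bindings, index, notifier, start, end - 1)
			notifier->ops->invalidate_start(notifier, start, end);
	}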

And for #2, it's relatively straightforward (knock 

Re: [PATCH v10 3/9] KVM: Extend the memslot to support fd-based private memory

2023-01-09 Thread Sean Christopherson
On Fri, Jan 06, 2023, Chao Peng wrote:
> On Thu, Jan 05, 2023 at 11:23:01AM +, Jarkko Sakkinen wrote:
> > On Fri, Dec 02, 2022 at 02:13:41PM +0800, Chao Peng wrote:
> > > To make future maintenance easy, internally use a binary compatible
> > > alias struct kvm_user_mem_region to handle both the normal and the
> > > '_ext' variants.
> > 
> > Feels bit hacky IMHO, and more like a completely new feature than
> > an extension.
> > 
> > Why not just add a new ioctl? The commit message does not address
> > the most essential design here.
> 
> Yes, people can always choose to add a new ioctl for this kind of change
> and the balance point here is we want to also avoid 'too many ioctls' if
> the functionalities are similar.  The '_ext' variant reuses all the
> existing fields in the 'normal' variant and most importantly KVM
> internally can reuse most of the code. I certainly can add some words in
> the commit message to explain this design choice.

After seeing the userspace side of this, I agree with Jarkko; overloading
KVM_SET_USER_MEMORY_REGION is a hack.  E.g. the size validation ends up being
bogus, and userspace ends up abusing unions or implementing kvm_user_mem_region
itself.

It feels absolutely ridiculous, but I think the best option is to do:

#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
 struct kvm_userspace_memory_region2)

/* for KVM_SET_USER_MEMORY_REGION2 */
struct kvm_user_mem_region2 {
__u32 slot;
__u32 flags;
__u64 guest_phys_addr;
__u64 memory_size;
__u64 userspace_addr;
__u64 restricted_offset;
__u32 restricted_fd;
__u32 pad1;
__u64 pad2[14];
}

And it's consistent with other KVM ioctls(), e.g. KVM_SET_CPUID2.

Regarding the userspace side of things, please include Vishal's selftests in v11,
it's impossible to properly review the uAPI changes without seeing the userspace
side of things.  I'm in the process of reviewing Vishal's v2[*], I'll try to
massage it into a set of patches that you can incorporate into your series.

[*] https://lore.kernel.org/all/20221205232341.4131240-1-vannapu...@google.com



Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2023-01-03 Thread Sean Christopherson
On Tue, Jan 03, 2023, Wang, Wei W wrote:
> On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > > Because guest memory defaults to private, and now this patch stores
> > > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> > _SHARED,
> > > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > > mem attr in advance.
> > 
> > KVM defaults to 'shared' because this ioctl can also be potentially used by
> > normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> > confidential VMs. 
> 
> Do you mean a normal VM could have pages marked private? What's the usage?
> (If all the pages are just marked shared for normal VMs, then why do we need 
> it)

No, there are potential use cases for per-page attribute/permissions, e.g. to
make select pages read-only, exec-only, no-exec, etc...

> > As for more KVM_EXIT_MEMORY_FAULT exits during the
> > booting time, yes, setting all memory to 'private' for confidential VMs 
> > through
> > this ioctl in userspace before guest launch is an approach for KVM 
> > userspace to
> > 'override' the KVM default and reduce the number of implicit conversions.
> 
> Most pages of a confidential VM are likely to be private pages. It seems more 
> efficient
> (and not difficult to check vm_type) to have KVM defaults to "private" for 
> confidential VMs
> and defaults to "shared" for normal VMs.

If done right, the default shouldn't matter all that much for efficiency.  KVM
needs to be able to efficiently track large ranges regardless of the default,
otherwise the memory overhead and, presumably, the cost of lookups will be painful.
E.g. converting a 1GiB chunk to shared should ideally require one entry, not 256k
entries.

Looks like that behavior was changed in v8 in response to feedback[*] that doing
xa_store_range() on a subset of an existing range (entry) would overwrite the
entire existing range (entry), not just the smaller subset.  xa_store_range() does
appear to be too simplistic for this use case, but looking at __filemap_add_folio(),
splitting an existing entry isn't super complex.

Using xa_store() for the very initial implementation is ok, and probably a good
idea since it's more obviously correct and will give us a bisection point.  But
we definitely want a more performant implementation sooner than later.  The hardest
part will likely be merging existing entries, but that can be done separately too,
and is probably lower priority.

E.g. (1) use xa_store() and always track at 4KiB granularity, (2) support storing
metadata in multi-index entries, and finally (3) support merging adjacent entries
with identical values.

[*] https://lore.kernel.org/all/CAGtprH9xyw6bt4=rbwf6-v2cspabocpkq5rpz+e-9co7eis...@mail.gmail.com



Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2022-12-22 Thread Sean Christopherson
On Wed, Dec 21, 2022, Chao Peng wrote:
> On Tue, Dec 20, 2022 at 08:33:05AM +, Huang, Kai wrote:
> > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > On Mon, Dec 19, 2022 at 08:48:10AM +, Huang, Kai wrote:
> > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > But for non-restricted-mem case, it is correct for KVM to decrease page's
> > refcount after setting up mapping in the secondary mmu, otherwise the page 
> > will
> > be pinned by KVM for normal VM (since KVM uses GUP to get the page).
> 
> That's true. Actually even true for restrictedmem case, most likely we
> will still need the kvm_release_pfn_clean() for KVM generic code. On one
> side, other restrictedmem users like pKVM may not require page pinning
> at all. On the other side, see below.
> 
> > 
> > So what we are expecting is: for KVM if the page comes from restricted mem, 
> > then
> > KVM cannot decrease the refcount, otherwise for normal page via GUP KVM 
> > should.

No, requiring the user (KVM) to guard against lack of support for page migration
in restricted mem is a terrible API.  It's totally fine for restricted mem to 
not
support page migration until there's a use case, but punting the problem to KVM
is not acceptable.  Restricted mem itself doesn't yet support page migration,
e.g. explosions would occur even if KVM wanted to allow migration since there is
no notification to invalidate existing mappings.

> I argue that this page pinning (or page migration prevention) is not
> tied to where the page comes from, instead related to how the page will
> be used. Whether the page is restrictedmem backed or GUP() backed, once
> it's used by current version of TDX then the page pinning is needed. So
> such page migration prevention is really TDX thing, even not KVM generic
> thing (that's why I think we don't need change the existing logic of
> kvm_release_pfn_clean()). Wouldn't better to let TDX code (or who
> requires that) to increase/decrease the refcount when it populates/drops
> the secure EPT entries? This is exactly what the current TDX code does:

I agree that whether or not migration is supported should be controllable by the
user, but I strongly disagree on punting refcount management to KVM (or TDX).
The whole point of restricted mem is to support technologies like TDX and SNP;
accommodating their special needs for things like page migration should be part of
the API, not some footnote in the documentation.

It's not difficult to let the user communicate support for page migration, e.g.
if/when restricted mem gains support, add a hook to restrictedmem_notifier_ops
to signal support (or lack thereof) for page migration.  NULL == no migration,
non-NULL == migration allowed.
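For illustration, a sketch of that ops-based signal (the invalidate callbacks
mirror the ones used elsewhere in this thread; the migrate hook and its exact
signature are assumptions):

	struct restrictedmem_notifier_ops {
		void (*invalidate_start)(struct restrictedmem_notifier *notifier,
					 pgoff_t start, pgoff_t end);
		void (*invalidate_end)(struct restrictedmem_notifier *notifier,
				       pgoff_t start, pgoff_t end);
		/* NULL == this consumer cannot tolerate page migration. */
		int (*migrate)(struct restrictedmem_notifier *notifier,
			       struct page *old, struct page *new);
	};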

We know that supporting page migration in TDX and SNP is possible, and we know
that page migration will require a dedicated API since the backing store can't
memcpy() the page.  I don't see any reason to ignore that eventuality.

But again, unless I'm missing something, that's a future problem because restricted
mem doesn't yet support page migration regardless of the downstream user.



Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-11-23 Thread Sean Christopherson
On Tue, Nov 22, 2022, Chao Peng wrote:
> On Fri, Nov 18, 2022 at 03:59:12PM +0000, Sean Christopherson wrote:
On Fri, Nov 18, 2022, Alex Bennée wrote:
> > > > We don't actually need a new bit, the opposite side of private is
> > > > shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
> > > > 'shared'.
> > > 
> > > If that is always true and we never expect a 3rd type of memory that is
> > > fine. But given we are leaving room for expansion having an explicit bit
> > > allows for that as well as making cases of forgetting to set the flags
> > > more obvious.
> > 
> > Hrm, I'm on the fence.
> > 
> > A dedicated flag isn't strictly needed, e.g. even if we end up with 3+ 
> > types in
> > this category, the baseline could always be "private".
> 
> The baseline for the current code is actually "shared".

Ah, right, the baseline needs to be "shared" so that legacy code doesn't end up
with impossible states.

> > I do like being explicit, and adding a PRIVATE flag costs KVM practically nothing
> > to implement and maintain, but eventually we'll end up with flags that are paired
> > with an implicit state, e.g. see the many #PF error codes in x86.  In other words,
> > inevitably KVM will need to define the default/base state of the access, at which
> > point the base state for SHARED vs. PRIVATE is "undefined".
> 
> Current memory conversion for confidential usage is bi-directional so we
> already need both private and shared states and if we use one bit for
> both "shared" and "private" then we will have to define the default
> state, e.g, currently the default state is "shared" when we define
> 
>   KVM_MEMORY_EXIT_FLAG_PRIVATE(1 << 0)

...

> > So I would say if we add an explicit READ flag, then we might as well add 
> > an explicit
> > PRIVATE flag too.  But if we omit PRIVATE, then we should omit READ too.
> 
> Since we assume the default state is shared, so we actually only need a
> PRIVATE flag, e.g. there is no SHARED flag and will ignore the RWX for now.

Yeah, I'm leaning towards "shared" being the implied default state.  Ditto for
"read" if/when we need to communicate write/execute information.  E.g. for VMs
that don't support guest private memory, the "shared" flag is in some ways
nonsensical.  Worst case scenario, e.g. if we end up with variations of "shared",
we'll need something like KVM_MEMORY_EXIT_FLAG_SHARED_RESTRICTIVE or whatever,
but the basic "shared" default will still work.



Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-11-18 Thread Sean Christopherson
On Fri, Nov 18, 2022, Alex Bennée wrote:
> 
> Chao Peng  writes:
> 
> > On Thu, Nov 17, 2022 at 03:08:17PM +, Alex Bennée wrote:
> >> >> I think this should be explicit rather than implied by the absence of
> >> >> another flag. Sean suggested you might want flags for RWX failures so
> >> >> maybe something like:
> >> >> 
> >> >> KVM_MEMORY_EXIT_SHARED_FLAG_READ(1 << 0)
> >> >> KVM_MEMORY_EXIT_SHARED_FLAG_WRITE   (1 << 1)
> >> >> KVM_MEMORY_EXIT_SHARED_FLAG_EXECUTE (1 << 2)
> >> >> KVM_MEMORY_EXIT_FLAG_PRIVATE(1 << 3)
> >> >
> >> > Yes, but I would not add 'SHARED' to RWX, they are not share memory
> >> > specific, private memory can also set them once introduced.
> >> 
> >> OK so how about:
> >> 
> >>KVM_MEMORY_EXIT_FLAG_READ   (1 << 0)
> >>KVM_MEMORY_EXIT_FLAG_WRITE  (1 << 1)
> >>KVM_MEMORY_EXIT_FLAG_EXECUTE(1 << 2)
> >> KVM_MEMORY_EXIT_FLAG_SHARED (1 << 3)
> >> KVM_MEMORY_EXIT_FLAG_PRIVATE(1 << 4)
> >
> > We don't actually need a new bit, the opposite side of private is
> > shared, i.e. flags with KVM_MEMORY_EXIT_FLAG_PRIVATE cleared expresses
> > 'shared'.
> 
> If that is always true and we never expect a 3rd type of memory that is
> fine. But given we are leaving room for expansion having an explicit bit
> allows for that as well as making cases of forgetting to set the flags
> more obvious.

Hrm, I'm on the fence.

A dedicated flag isn't strictly needed, e.g. even if we end up with 3+ types in
this category, the baseline could always be "private".

I do like being explicit, and adding a PRIVATE flag costs KVM practically nothing
to implement and maintain, but eventually we'll end up with flags that are paired
with an implicit state, e.g. see the many #PF error codes in x86.  In other words,
inevitably KVM will need to define the default/base state of the access, at which
point the base state for SHARED vs. PRIVATE is "undefined".

The RWX bits are in the same boat, e.g. the READ flag isn't strictly necessary.
I was thinking more of the KVM_SET_MEMORY_ATTRIBUTES ioctl(), which does need
the full RWX gamut, when I typed out that response.

So I would say if we add an explicit READ flag, then we might as well add an
explicit PRIVATE flag too.  But if we omit PRIVATE, then we should omit READ too.



Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

2022-11-16 Thread Sean Christopherson
On Tue, Oct 25, 2022, Chao Peng wrote:
> +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> +  bool is_private)
> +{
> + gfn_t start, end;
> + unsigned long i;
> + void *entry;
> + int idx;
> + int r = 0;
> +
> + if (size == 0 || gpa + size < gpa)
> + return -EINVAL;
> + if (gpa & (PAGE_SIZE - 1) || size & (PAGE_SIZE - 1))
> + return -EINVAL;
> +
> + start = gpa >> PAGE_SHIFT;
> + end = (gpa + size - 1 + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + /*
> +  * Guest memory defaults to private, kvm->mem_attr_array only stores
> +  * shared memory.
> +  */
> + entry = is_private ? NULL : xa_mk_value(KVM_MEM_ATTR_SHARED);
> +
> + idx = srcu_read_lock(&kvm->srcu);
> + KVM_MMU_LOCK(kvm);
> + kvm_mmu_invalidate_begin(kvm, start, end);
> +
> + for (i = start; i < end; i++) {
> + r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
> + GFP_KERNEL_ACCOUNT));
> + if (r)
> + goto err;
> + }
> +
> + kvm_unmap_mem_range(kvm, start, end);
> +
> + goto ret;
> +err:
> + for (; i > start; i--)
> + xa_erase(&kvm->mem_attr_array, i);

I don't think deleting previous entries is correct.  To unwind, the correct thing
to do is restore the original values.  E.g. if userspace is mapping a large range
as shared, and some of the previous entries were shared, deleting them would
incorrectly "convert" those entries to private.

Tracking the previous state likely isn't the best approach, e.g. it would require
speculatively allocating extra memory for a rare condition that is likely going to
lead to OOM anyways.

Instead of trying to unwind, what about updating the ioctl() params such that
retrying with the updated addr+size would Just Work?  E.g.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 55b07aae67cc..f1de592a1a06 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1015,15 +1015,12 @@ static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, 
gpa_t gpa, gpa_t size,
 
kvm_unmap_mem_range(kvm, start, end, attr);
 
-   goto ret;
-err:
-   for (; i > start; i--)
-   xa_erase(&kvm->mem_attr_array, i);
-ret:
kvm_mmu_invalidate_end(kvm, start, end);
KVM_MMU_UNLOCK(kvm);
srcu_read_unlock(>srcu, idx);
 
+   
+
return r;
 }
 #endif /* CONFIG_KVM_GENERIC_PRIVATE_MEM */
@@ -4989,6 +4986,8 @@ static long kvm_vm_ioctl(struct file *filp,
 
r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
  region.size, set);
+   if (copy_to_user(argp, &region, sizeof(region)) && !r)
+   r = -EFAULT;
break;
}
 #endif



Re: [PATCH v9 7/8] KVM: Handle page fault for private memory

2022-11-16 Thread Sean Christopherson
On Wed, Nov 16, 2022, Ackerley Tng wrote:
> >@@ -4173,6 +4203,22 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, 
> >struct kvm_page_fault *fault)
> > return RET_PF_EMULATE;
> > }
> >
> >+if (kvm_slot_can_be_private(slot) &&
> >+fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> >+vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> >+if (fault->is_private)
> >+vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> >+else
> >+vcpu->run->memory.flags = 0;
> >+vcpu->run->memory.padding = 0;
> >+vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> >+vcpu->run->memory.size = PAGE_SIZE;
> >+return RET_PF_USER;
> >+}
> >+
> >+if (fault->is_private)
> >+return kvm_faultin_pfn_private(fault);
> >+
> 
> Since each memslot may also not be backed by restricted memory, we
> should also check if the memslot has been set up for private memory
> with
> 
>   if (fault->is_private && kvm_slot_can_be_private(slot))
>   return kvm_faultin_pfn_private(fault);
> 
> Without this check, restrictedmem_get_page will get called with NULL
> in slot->restricted_file, which causes a NULL pointer dereference.

Hmm, silently skipping the faultin would result in KVM faulting in the shared
portion of the memslot, and I believe would end up mapping that pfn as private,
i.e. would map a non-UPM PFN as a private mapping.  For TDX and SNP, that would
be double ungood as it would let the host access memory that is mapped private,
i.e. lead to #MC or #PF(RMP) in the host.

I believe the correct solution is to drop the "can be private" check from the
above check, and instead handle that in kvm_faultin_pfn_private().  That would fix
another bug, e.g. if the fault is shared, the slot can't be private, but for
whatever reason userspace marked the gfn as private.  Even though KVM might be
able to service the fault, the correct thing to do in that case is to exit to
userspace.

E.g.

---
 arch/x86/kvm/mmu/mmu.c | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 10017a9f26ee..e2ac8873938e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4158,11 +4158,29 @@ static inline u8 order_to_level(int order)
return PG_LEVEL_4K;
 }
 
-static int kvm_faultin_pfn_private(struct kvm_page_fault *fault)
+static int kvm_do_memory_fault_exit(struct kvm_vcpu *vcpu,
+   struct kvm_page_fault *fault)
+{
+   vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
+   if (fault->is_private)
+   vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
+   else
+   vcpu->run->memory.flags = 0;
+   vcpu->run->memory.padding = 0;
+   vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
+   vcpu->run->memory.size = PAGE_SIZE;
+   return RET_PF_USER;
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+  struct kvm_page_fault *fault)
 {
int order;
struct kvm_memory_slot *slot = fault->slot;
 
+   if (!kvm_slot_can_be_private(slot))
+   return kvm_do_memory_fault_exit(vcpu, fault);
+
	if (kvm_restricted_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
return RET_PF_RETRY;
 
@@ -4203,21 +4221,11 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return RET_PF_EMULATE;
}
 
-   if (kvm_slot_can_be_private(slot) &&
-   fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
-   vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
-   if (fault->is_private)
-   vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
-   else
-   vcpu->run->memory.flags = 0;
-   vcpu->run->memory.padding = 0;
-   vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
-   vcpu->run->memory.size = PAGE_SIZE;
-   return RET_PF_USER;
-   }
+   if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
+   return kvm_do_memory_fault_exit(vcpu, fault);
 
if (fault->is_private)
-   return kvm_faultin_pfn_private(fault);
+   return kvm_faultin_pfn_private(vcpu, fault);
 
async = false;
	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async,

base-commit: 969d761bb7b8654605937f31ae76123dcb7f15a3
-- 




Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-11-16 Thread Sean Christopherson
On Wed, Nov 16, 2022, Andy Lutomirski wrote:
> 
> 
> On Tue, Oct 25, 2022, at 8:13 AM, Chao Peng wrote:
> > diff --git a/Documentation/virt/kvm/api.rst 
> > b/Documentation/virt/kvm/api.rst
> > index f3fa75649a78..975688912b8c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6537,6 +6537,29 @@ array field represents return values. The 
> > userspace should update the return
> >  values of SBI call before resuming the VCPU. For more details on 
> > RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> > 
> > +::
> > +
> > +   /* KVM_EXIT_MEMORY_FAULT */
> > +   struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> > +   __u32 flags;
> > +   __u32 padding;
> > +   __u64 gpa;
> > +   __u64 size;
> > +   } memory;
> > +
> 
> Would it make sense to also have a field for the access type (read, write,
> execute, etc)?  I realize that shared <-> private conversion doesn't strictly
> need this, but it seems like it could be useful for logging failures and also
> for avoiding a second immediate fault if the type gets converted but doesn't
> have the right protection yet.

I don't think a separate field is necessary, that info can be conveyed via flags.
Though maybe we should go straight to a u64 for flags.  Hmm, and maybe avoid bits
0-3 so that if/when RWX info is conveyed the flags can align with
PROT_{READ,WRITE,EXEC} and the EPT flags, e.g.

KVM_MEMORY_EXIT_FLAG_READ   (1 << 0)
KVM_MEMORY_EXIT_FLAG_WRITE  (1 << 1)
KVM_MEMORY_EXIT_FLAG_EXECUTE(1 << 2)

> (Obviously, if this were changed, KVM would need the ability to report that
> it doesn't actually know the mode.)
> 
> --Andy



Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry

2022-11-10 Thread Sean Christopherson
On Tue, Oct 25, 2022, Chao Peng wrote:
> @@ -715,15 +715,9 @@ static void kvm_mmu_notifier_change_pte(struct 
> mmu_notifier *mn,
>   kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
>  }
>  
> -void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
> -   unsigned long end)
> +static inline

Don't tag static functions with "inline" unless they're in headers, in which case
the inline is effectively required.  In pretty much every scenario, the compiler
can do a better job of optimizing inline vs. non-inline, i.e. odds are very good
the compiler would inline this helper anyways, and if not, there would likely be
a good reason not to inline it.

It'll be a moot point in this case (more below), but this would also reduce the
line length and avoid the wrap.

> void update_invalidate_range(struct kvm *kvm, gfn_t start,
> + gfn_t end)

I appreciate the effort to make this easier to read, but making such a big divergence
from the kernel's preferred formatting is often counter-productive, e.g. I blinked a
few times when first reading this code.

Again, moot point this time (still below ;-) ), but for future reference, better
options are to either let the line poke out or simply wrap early to get the
bundling of parameters that you want, e.g.

  static inline void update_invalidate_range(struct kvm *kvm, gfn_t start, gfn_t end)

or

  static inline void update_invalidate_range(struct kvm *kvm,
					     gfn_t start, gfn_t end)

>  {
> - /*
> -  * The count increase must become visible at unlock time as no
> -  * spte can be established without taking the mmu_lock and
> -  * count is also read inside the mmu_lock critical section.
> -  */
> - kvm->mmu_invalidate_in_progress++;
>   if (likely(kvm->mmu_invalidate_in_progress == 1)) {
>   kvm->mmu_invalidate_range_start = start;
>   kvm->mmu_invalidate_range_end = end;
> @@ -744,6 +738,28 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned 
> long start,
>   }
>  }
>  
> +static void mark_invalidate_in_progress(struct kvm *kvm, gfn_t start, gfn_t 
> end)

Splitting the helpers this way yields a weird API overall, e.g. it's possible
(common, actually) to have an "end" without a "begin".

Taking the range in the "end" is also dangerous/misleading/imbalanced, because _if_
there are multiple ranges in a batch, each range would need to be unwound
independently, e.g. the invocation of the "end" helper in
kvm_mmu_notifier_invalidate_range_end() is flat out wrong, it just doesn't cause
problems because KVM doesn't (currently) try to unwind regions (and probably never
will, but that's beside the point).

Rather than shunt what is effectively the "begin" into a separate helper, provide
three separate APIs, e.g. begin, range_add, end.  That way, begin+end don't take a
range and thus are symmetrical, always paired, and can't screw up unwinding since
they don't have a range to unwind.

It'll require three calls in every case, but that's not the end of the world since
none of these flows are super hot paths.
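I.e. the invalidation API would shrink to something like (sketch of the suggested
shape):

	void kvm_mmu_invalidate_begin(struct kvm *kvm);
	void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
	void kvm_mmu_invalidate_end(struct kvm *kvm);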

> +{
> + /*
> +  * The count increase must become visible at unlock time as no
> +  * spte can be established without taking the mmu_lock and
> +  * count is also read inside the mmu_lock critical section.
> +  */
> + kvm->mmu_invalidate_in_progress++;

This should invalidate (ha!) mmu_invalidate_range_{start,end}, and then WARN in
mmu_invalidate_retry() if the range isn't valid.  And the "add" helper should
WARN if mmu_invalidate_in_progress == 0.
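E.g. a sketch of the "begin" side, with the range poisoned until the first
range_add() (INVALID_GPA is used here purely as an illustrative poison value):

	static void kvm_mmu_invalidate_begin(struct kvm *kvm)
	{
		lockdep_assert_held_write(&kvm->mmu_lock);

		kvm->mmu_invalidate_in_progress++;
		if (likely(kvm->mmu_invalidate_in_progress == 1)) {
			/* Poison the range; range_add() must set it before use. */
			kvm->mmu_invalidate_range_start = INVALID_GPA;
			kvm->mmu_invalidate_range_end = INVALID_GPA;
		}
	}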

> +}
> +
> +static bool kvm_mmu_handle_gfn_range(struct kvm *kvm, struct kvm_gfn_range 
> *range)

"handle" is wy too generic.  Just match kvm_unmap_gfn_range() and call it
kvm_mmu_unmap_gfn_range().  This is a local function so it's unlikely to collide
with arch code, now or in the future.

> +{
> + update_invalidate_range(kvm, range->start, range->end);
> + return kvm_unmap_gfn_range(kvm, range);
> +}

Overall, this?  Compile tested only...

---
 arch/x86/kvm/mmu/mmu.c   |  8 +---
 include/linux/kvm_host.h | 33 +
 virt/kvm/kvm_main.c  | 30 +-
 3 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 93c389eaf471..d4b373e3e524 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4259,7 +4259,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
return true;
 
return fault->slot &&
-  mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, fault->hva);
+  mmu_invalidate_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -6098,7 +6098,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, 
gfn_t 

Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry

2022-11-10 Thread Sean Christopherson
On Tue, Nov 08, 2022, Chao Peng wrote:
> On Fri, Nov 04, 2022 at 10:29:48PM +0000, Sean Christopherson wrote:
> > The APICv case that this was added for could very well be broken because of
> > this, and the resulting failures would be an absolute nightmare to debug.
> 
> Given the apicv_inhibit should be rare, the change looks good to me.
> Just to be clear, your will send out this fix, right?

Ya, I'll post an official patch.



Re: [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry

2022-11-04 Thread Sean Christopherson
On Fri, Nov 04, 2022, Chao Peng wrote:
> On Thu, Oct 27, 2022 at 11:29:14AM +0100, Fuad Tabba wrote:
> > Hi,
> > 
> > On Tue, Oct 25, 2022 at 4:19 PM Chao Peng  
> > wrote:
> > >
> > > Currently in mmu_notifier validate path, hva range is recorded and then
> > > checked against in the mmu_notifier_retry_hva() of the page fault path.
> > > However, for the to be introduced private memory, a page fault may not
> > > have a hva associated, checking gfn(gpa) makes more sense.
> > >
> > > For existing non private memory case, gfn is expected to continue to
> > > work. The only downside is when aliasing multiple gfns to a single hva,
> > > the current algorithm of checking multiple ranges could result in a much
> > > larger range being rejected. Such aliasing should be uncommon, so the
> > > impact is expected small.
> > >
> > > It also fixes a bug in kvm_zap_gfn_range() which has already been using
> > 
> > nit: Now it's kvm_unmap_gfn_range().
> 
> Forgot to mention: the bug is still with kvm_zap_gfn_range(). It calls
> kvm_mmu_invalidate_begin/end with a gfn range but before this series
> kvm_mmu_invalidate_begin/end actually accept a hva range. Note it's
> unrelated to whether we use kvm_zap_gfn_range() or kvm_unmap_gfn_range()
> in the following patch (patch 05).

Grr, in the future, if you find an existing bug, please send a patch.  At the
very least, report the bug.  The APICv case that this was added for could very
well be broken because of this, and the resulting failures would be an absolute
nightmare to debug.

Compile tested only...

--
From: Sean Christopherson 
Date: Fri, 4 Nov 2022 22:20:33 +
Subject: [PATCH] KVM: x86/mmu: Block all page faults during
 kvm_zap_gfn_range()

When zapping a GFN range, pass 0 => ALL_ONES for the to-be-invalidated
range to effectively block all page faults while the zap is in-progress.
The invalidation helpers take a host virtual address, whereas zapping a
GFN obviously provides a guest physical address and with the wrong unit
of measurement (frame vs. byte).

Alternatively, KVM could walk all memslots to get the associated HVAs,
but thanks to SMM, that would require multiple lookups.  And practically
speaking, kvm_zap_gfn_range() usage is quite rare and not a hot path,
e.g. MTRR and CR0.CD are almost guaranteed to be done only on vCPU0
during boot, and APICv inhibits are similarly infrequent operations.

Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range")
Cc: sta...@vger.kernel.org
Cc: Maxim Levitsky 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f81539061d6..1ccb769f62af 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6056,7 +6056,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
write_lock(>mmu_lock);
 
-   kvm_mmu_invalidate_begin(kvm, gfn_start, gfn_end);
+   kvm_mmu_invalidate_begin(kvm, 0, -1ul);
 
flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6070,7 +6070,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
   gfn_end - gfn_start);
 
-   kvm_mmu_invalidate_end(kvm, gfn_start, gfn_end);
+   kvm_mmu_invalidate_end(kvm, 0, -1ul);
 
write_unlock(>mmu_lock);
 }

base-commit: c12879206e47730ff5ab255bbf625b28ade4028f
-- 




Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

2022-11-04 Thread Sean Christopherson
Paolo, any thoughts before I lead things further astray?

On Fri, Nov 04, 2022, Chao Peng wrote:
> On Thu, Nov 03, 2022 at 11:04:53PM +0000, Sean Christopherson wrote:
> > On Tue, Oct 25, 2022, Chao Peng wrote:
> > > @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
> > >   r = kvm_vm_ioctl_set_memory_region(kvm, );
> > >   break;
> > >   }
> > > +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> > > + case KVM_MEMORY_ENCRYPT_REG_REGION:
> > > + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
> > 
> > I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION.
> > Aside from the fact that restricted/protected memory may not be encrypted,
> > there are other potential use cases for per-page memory attributes[*], e.g.
> > to make memory read-only (or no-exec, or exec-only, etc...) without having
> > to modify memslots.
> > 
> > Any paravirt use case where the attributes of a page are effectively dictated
> > by the guest is going to run into the exact same performance problems with
> > memslots, which isn't surprising in hindsight since shared vs. private is
> > really just an attribute, albeit with extra special semantics.
> > 
> > And if we go with a brand new ioctl(), maybe someday in the very distant
> > future we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION.
> > 
> > Switching to a new ioctl() should be a minor change, i.e. shouldn't throw
> > too big of a wrench into things.
> > 
> > Something like:
> > 
> >   KVM_SET_MEMORY_ATTRIBUTES
> > 
> >   struct kvm_memory_attributes {
> > __u64 address;
> > __u64 size;
> > __u64 flags;

Oh, this is half-baked.  I lost track of which flags were which.  What I intended
was a separate, initially-unused "flags" field, e.g.

	struct kvm_memory_attributes {
		__u64 address;
		__u64 size;
		__u64 attributes;
		__u64 flags;
	}

so that KVM can tweak behavior and/or extend the effective size of the struct.

> I like the idea of adding a new ioctl(). But putting all attributes into
> a single flags field in the uAPI doesn't sound good to me, e.g. forcing
> userspace to set all attributes in one call can cause pain for userspace,
> and probably for the KVM implementation as well. For private<->shared
> memory conversion, we actually only care about the KVM_MEM_ATTR_SHARED or
> KVM_MEM_ATTR_PRIVATE bit,

Not necessarily, e.g. I can see pKVM wanting to convert from RW+PRIVATE => RO+SHARED
or even RW+PRIVATE => NONE+SHARED so that the guest can't write/access the memory
while it's accessible from the host.

And if this does extend beyond shared/private, dropping from RWX=>R, i.e. dropping
WX permissions, would also be a common operation.

Hmm, typing that out makes me think that if we do end up supporting other "attributes",
i.e. protections, we should go straight to full RWX protections instead of doing
things piecemeal, i.e. add individual protections instead of combinations like
NO_EXEC and READ_ONLY.  The protections would have to be inverted for backwards
compatibility, but that's easy enough to handle.  The semantics could be like
protection keys, which also have inverted permissions, where the final protections
are the combination of memslot+attributes, i.e. a read-only memslot couldn't be made
writable via attributes.

E.g. userspace could do "NO_READ | NO_WRITE | NO_EXEC" to temporarily block access
to memory without needing to delete the memslot.  KVM would need to disallow
unsupported combinations, e.g. disallowed effective protections would be:

  - W or WX [unless there's an arch that supports write-only memory]
  - R or RW [until KVM plumbs through support for no-exec, or it's unsupported in hardware]
  - X       [until KVM plumbs through support for exec-only, or it's unsupported in hardware]
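
To illustrate, combining memslot and (inverted) attribute protections could look
something like the below; the KVM_MEMORY_ATTRIBUTE_NO_* and KVM_PROT_* names are
all hypothetical, this is just a sketch of the inverted semantics:

  static u32 kvm_effective_prot(u32 slot_prot, u64 attrs)
  {
	u32 prot = slot_prot;	/* RWX allowed by the memslot */

	if (attrs & KVM_MEMORY_ATTRIBUTE_NO_READ)
		prot &= ~KVM_PROT_READ;
	if (attrs & KVM_MEMORY_ATTRIBUTE_NO_WRITE)
		prot &= ~KVM_PROT_WRITE;
	if (attrs & KVM_MEMORY_ATTRIBUTE_NO_EXEC)
		prot &= ~KVM_PROT_EXEC;

	/* Attributes can only subtract, so a read-only memslot stays read-only. */
	return prot;
  }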

Anyways, that's all future work...

> but this API would force userspace to set other, irrelevant bits as well.

They aren't irrelevant though, as the memory attributes are all describing the
allowed protections for a given page.  If there's a use case where userspace "can't"
keep track of the attributes for whatever reason, then userspace could do a RMW
to set/clear attributes.  Alternatively, the ioctl() could take an "operation" and
support WRITE/OR/AND to allow setting/clearing individual flags, e.g. tweak the
above to be:
 
 struct kvm_memory_attributes {
__u64 address;
__u64 size;
__u64 attributes;
__u32 operation;
__u32 flags;
  }
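
E.g. userspace setting a single attribute without clobbering the others might then
look like the below; the attribute and operation names are made up purely for
illustration:

  struct kvm_memory_attributes attrs = {
	.address    = gpa,
	.size       = size,
	.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,	/* hypothetical name */
	.operation  = KVM_MEMORY_ATTR_OP_OR,		/* set, don't overwrite */
  };

  if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs))
	err(1, "KVM_SET_MEMORY_ATTRIBUTES");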

> I looked at kvm_device_attr, sounds like we can do something similar:

The device attributes deal with isolated, arbitrary values, whereas memory attr

Re: [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions

2022-11-03 Thread Sean Christopherson
On Tue, Oct 25, 2022, Chao Peng wrote:
> @@ -4708,6 +4802,24 @@ static long kvm_vm_ioctl(struct file *filp,
>   r = kvm_vm_ioctl_set_memory_region(kvm, &mem);
>   break;
>   }
> +#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
> + case KVM_MEMORY_ENCRYPT_REG_REGION:
> + case KVM_MEMORY_ENCRYPT_UNREG_REGION: {

I'm having second thoughts about usurping KVM_MEMORY_ENCRYPT_(UN)REG_REGION.  Aside
from the fact that restricted/protected memory may not be encrypted, there are
other potential use cases for per-page memory attributes[*], e.g. to make memory
read-only (or no-exec, or exec-only, etc...) without having to modify memslots.

Any paravirt use case where the attributes of a page are effectively dictated by
the guest is going to run into the exact same performance problems with memslots,
which isn't surprising in hindsight since shared vs. private is really just an
attribute, albeit with extra special semantics.

And if we go with a brand new ioctl(), maybe someday in the very distant future
we can deprecate and delete KVM_MEMORY_ENCRYPT_(UN)REG_REGION.

Switching to a new ioctl() should be a minor change, i.e. shouldn't throw too big
of a wrench into things.

Something like:

  KVM_SET_MEMORY_ATTRIBUTES

  struct kvm_memory_attributes {
__u64 address;
__u64 size;
__u64 flags;
  }

[*] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com

> + struct kvm_enc_region region;
> + bool set = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> +
> + if (!kvm_arch_has_private_mem(kvm))
> + goto arch_vm_ioctl;
> +
> + r = -EFAULT;
> > + if (copy_from_user(&region, argp, sizeof(region)))
> + goto out;
> +
> + r = kvm_vm_ioctl_set_mem_attr(kvm, region.addr,
> +   region.size, set);
> + break;
> + }
> +#endif



Re: [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit

2022-10-25 Thread Sean Christopherson
On Tue, Oct 25, 2022, Peter Maydell wrote:
> On Tue, 25 Oct 2022 at 16:21, Chao Peng  wrote:
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index f3fa75649a78..975688912b8c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6537,6 +6537,29 @@ array field represents return values. The userspace 
> > should update the return
> >  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
> >  spec refer, https://github.com/riscv/riscv-sbi-doc.
> >
> > +::
> > +
> > +   /* KVM_EXIT_MEMORY_FAULT */
> > +   struct {
> > +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0)
> > +   __u32 flags;
> > +   __u32 padding;
> > +   __u64 gpa;
> > +   __u64 size;
> > +   } memory;
> > +
> > +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has
> > +encountered a memory error which is not handled by KVM kernel module and
> > +userspace may choose to handle it. The 'flags' field indicates the memory
> > +properties of the exit.
> > +
> > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by
> > +   private memory access when the bit is set. Otherwise the memory error is
> > +   caused by shared memory access when the bit is clear.
> > +
> > +'gpa' and 'size' indicate the memory range the error occurs at. The userspace
> > +may handle the error and return to KVM to retry the previous memory access.
> > +
> > +
> 
> What's the difference between this and a plain old MMIO exit ?
> Just that we can specify a wider size and some flags ?

KVM_EXIT_MMIO is purely for cases where there is no memslot.  KVM_EXIT_MEMORY_FAULT
will be used for scenarios where there is a valid memslot for a GPA, but for
whatever reason KVM cannot map the memslot into the guest.

In this series, the new exit type is used to handle guest-initiated conversions
between shared and private memory.  By design, conversion requires explicit action
from userspace, and so even though KVM has a valid memslot, KVM needs to exit to
userspace to effectively forward the conversion request to userspace.
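
E.g. a VMM's run loop could handle the request along these lines, where
convert_range() stands in for however the VMM adjusts its backing memory:

  if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
	bool to_private = run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

	convert_range(vm, run->memory.gpa, run->memory.size, to_private);
	/* Fall through to KVM_RUN; the guest retries the faulting access. */
  }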

Long term, I also hope to convert all guest-triggered -EFAULT paths to instead
return KVM_EXIT_MEMORY_FAULT.  At minimum, returning KVM_EXIT_MEMORY_FAULT instead
of -EFAULT will allow KVM to provide userspace with the "bad" GPA when something
goes sideways, e.g. if faulting in the page failed because there's no valid
userspace mapping.

There have also been two potential use cases[1][2], though they both appear to have
been abandoned, where userspace would do something more than just kill the guest
in response to KVM_EXIT_MEMORY_FAULT.

[1] https://lkml.kernel.org/r/20200617230052.gb27...@linux.intel.com
[2] https://lore.kernel.org/all/ykxjlcg%2fwompe...@google.com



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-21 Thread Sean Christopherson
On Fri, Oct 21, 2022, Chao Peng wrote:
> On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
> > On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov 
> >  wrote:
> > >
> > > On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > > > I think moving this notifier_invalidate before fallocate may not solve
> > > > the problem completely. Is it possible that between invalidate and
> > > > fallocate, KVM tries to handle the page fault for the guest VM from
> > > > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > > > hole punching here also update mem_attr first to say that KVM should
> > > > consider the corresponding gpa ranges to be no more backed by
> > > > inaccessible memfd?
> > >
> > > We rely on external synchronization to prevent this. See code around
> > > mmu_invalidate_retry_hva().
> > >
> > > --
> > >   Kiryl Shutsemau / Kirill A. Shutemov
> > 
> > IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
> > ranges that are being invalidated are retried till invalidation is
> > complete. In this case, is it possible that KVM tries to serve the
> > page fault after inaccessible_notifier_invalidate is complete but
> > before fallocate could punch hole into the files?

It's not just the page fault edge case.  In the more straightforward scenario
where the memory is already mapped into the guest, freeing pages back to the kernel
before they are removed from the guest will lead to use-after-free.

> > e.g.
> > inaccessible_notifier_invalidate(...)
> > ... (system event preempting this control flow, giving a window for
> > the guest to retry accessing the gfn range which was invalidated)
> > fallocate(.., PUNCH_HOLE..)
> 
> Looks like this is something that can happen.
> And it sounds to me like the solution needs to
> just follow the mmu_notifier's way of using an invalidate_start/end pair.
> 
>   invalidate_start()  --> kvm->mmu_invalidate_in_progress++;
>   zap KVM page table entries;
>   fallocate()
>   invalidate_end()  --> kvm->mmu_invalidate_in_progress--;
> 
> Then during invalidate_start/end time window mmu_invalidate_retry_gfn
> checks 'mmu_invalidate_in_progress' and prevent repopulating the same
> page in KVM page table.

Yes, if it's not safe to invalidate after making the change (fallocate()), then
the change needs to be bookended by a start+end pair.  The mmu_notifier's unpaired
invalidate() hook works by zapping the primary MMU's PTEs before invalidate(), but
frees the underlying physical page _after_ invalidate().

And the only reason the unpaired invalidate() exists is because there are secondary
MMUs that reuse the primary MMU's page tables, e.g. shared virtual addressing, in
which case bookending doesn't work because the secondary MMU can't remove PTEs, it
can only flush its TLBs.

For this case, the whole point is to not create PTEs in the primary MMU, so there
should never be a use case that _needs_ an unpaired invalidate().

TL;DR: a start+end pair is likely the simplest solution.
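
Roughly (names following the notifier naming in this series, details illustrative):

  inaccessible_notifier_invalidate_start(data, start, end);  /* in_progress++, zap */
  /* truncate/punch the pages, i.e. the actual fallocate() work */
  inaccessible_notifier_invalidate_end(data, start, end);    /* in_progress-- */

Faults on the affected range would observe the elevated in-progress count and
retry until _end() runs, exactly like the mmu_notifier start/end pairs.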



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-21 Thread Sean Christopherson
On Fri, Oct 21, 2022, Chao Peng wrote:
> > 
> > In the context of userspace inaccessible memfd, what would be a
> > suggested way to enforce NUMA memory policy for physical memory
> > allocation? mbind[1] won't work here in absence of virtual address
> > range.
> 
> How about set_mempolicy():
> https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html

Andy Lutomirski brought this up in an off-list discussion way back when the 
whole
private-fd thing was first being proposed.

  : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses.  If
  : we want to support them for TDX private memory, we either need TDX private
  : memory to have an HVA or we need file-based equivalents. Arguably we should add
  : fmove_pages and fbind syscalls anyway, since the current API is quite awkward
  : even for tools like numactl.



Re: [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions

2022-10-19 Thread Sean Christopherson
On Wed, Oct 19, 2022, Fuad Tabba wrote:
> > > > This sounds good. Thank you.
> > >
> > > I like the idea of a separate Kconfig, e.g. CONFIG_KVM_GENERIC_PRIVATE_MEM or
> > > something.  I highly doubt there will be any non-x86 users for multiple years,
> > > if ever, but it would allow testing the private memory stuff on ARM (and any other
> > > non-x86 arch) without needing full pKVM support and with only minor KVM
> > > modifications, e.g. the x86 support[*] to test UPM without TDX is shaping up to be
> > > trivial.
> >
> > CONFIG_KVM_GENERIC_PRIVATE_MEM looks good to me.
> 
> That sounds good to me, and just keeping the xarray isn't really an
> issue for pKVM.

The xarray won't exist for pKVM if the #ifdefs in this patch are changed from
CONFIG_HAVE_KVM_PRIVATE_MEM => CONFIG_KVM_GENERIC_PRIVATE_MEM.

> We could end up using it instead of some of the other
> structures we use for tracking.

I don't think pKVM should hijack the xarray for other purposes.  At best, it will
be confusing, at worst we'll end up with a mess if ARM ever supports the "generic"
implementation.



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-17 Thread Sean Christopherson
On Fri, Sep 30, 2022, Fuad Tabba wrote:
> > > > > pKVM would also need a way to make an fd accessible again
> > > > > when shared back, which I think isn't possible with this patch.
> > > >
> > > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > be the same issue.
> > >
> > > pKVM doesn't really need to unmap the memory. What is really important
> > > is that the memory is not GUP'able.
> >
> > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> >
> > The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
> > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > the end result is the same.
> >
> > Emphasis above because pKVM still needs to unmap the memory _somewhere_.  IIUC, the
> > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > context of the hypervisor.  Which is also the source of the gup() problems; the
> > untrusted kernel is blissfully unaware that the memory is inaccessible.
> >
> > Any approach that moves some of that information into the untrusted kernel so that
> > the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
> > all of guest memory becomes unguppable, but that's likely not a viable option.
> 
> Actually, for pKVM, there is no need for the guest memory to be GUP'able at
> all if we use the new inaccessible_get_pfn().

Ya, I was referring to pKVM without UPM / inaccessible memory.

Jumping back to blocking gup(), what about using the same tricks as secretmem to
block gup()?  E.g. compare vm_ops to block regular gup() and a_ops to block fast
gup() on struct page?  With a Kconfig that's selected by pKVM (which would also
need its own Kconfig), e.g. CONFIG_INACCESSIBLE_MAPPABLE_MEM, there would be zero
performance overhead for non-pKVM kernels, i.e. hooking gup() shouldn't be
controversial.
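
E.g. the regular gup() check could mirror vma_is_secretmem(), something like the
below; the Kconfig and the ops name are hypothetical:

  static inline bool vma_is_inaccessible_mappable(struct vm_area_struct *vma)
  {
  #ifdef CONFIG_INACCESSIBLE_MAPPABLE_MEM
	return vma->vm_ops == &inaccessible_vm_ops;
  #else
	return false;
  #endif
  }

with gup() bailing if the check returns true, same as it does for secretmem VMAs.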

I suspect the fast gup() path could even be optimized to avoid the page_mapping()
lookup by adding a PG_inaccessible flag that's defined iff the TBD Kconfig is
selected.  I'm guessing pKVM isn't expected to be deployed on massive NUMA systems
anytime soon, so there should be plenty of page flags to go around.

Blocking gup() instead of trying to play refcount games when converting back to
private would eliminate the need to put heavy restrictions on mapping, as the goal
of those were purely to simplify the KVM implementation, e.g. the "one mapping per
memslot" thing would go away entirely.

> This of course goes back to what I'd mentioned before in v7; it seems that
> representing the memslot memory as a file descriptor should be orthogonal to
> whether the memory is shared or private, rather than a private_fd for private
> memory and the userspace_addr for shared memory.

I also explored the idea of backing any guest memory with an fd, but came to
the conclusion that private memory needs a separate handle[1], at least on x86.

For SNP and TDX, even though the GPA is the same (ignoring the fact that SNP and
TDX steal GPA bits to differentiate private vs. shared), the two types need to be
treated as separate mappings[2].  Post-boot, converting is lossy in both directions,
so even conceptually they are two distinct pages that just happen to share (some)
GPA bits.

To allow conversions, i.e. changing which mapping to use, without memslot updates,
KVM needs to let userspace provide both mappings in a single memslot.  So while
fd-based memory is an orthogonal concept, e.g. we could add fd-based shared memory,
KVM would still need a dedicated private handle.
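
IIRC, that's exactly what the extended memslot in this series expresses, i.e. one
memslot, two handles (copied from memory, so don't quote me on the padding):

  struct kvm_userspace_memory_region_ext {
	struct kvm_userspace_memory_region region;  /* userspace_addr => shared */
	__u64 private_offset;
	__u32 private_fd;                           /* => private */
	__u32 pad1;
	__u64 pad2[14];
  };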

For pKVM, the fd doesn't strictly need to be mutually exclusive with the existing
userspace_addr, but since the private_fd is going to be added for x86, I think it
makes sense to use that instead of adding generic fd-based memory for pKVM's use
case (which is arguably still "private" memory but with special semantics).

[1] https://lore.kernel.org/all/yulth7bl4mwt5...@google.com
[2] https://lore.kernel.org/all/869622df-5bf6-0fbb-cac4-34c6ae7df...@kernel.org

>  The host can then map or unmap the shared/private memory using the fd, which
>  allows it more freedom in even choosing to unmap shared memory when not
>  needed, for example.



Re: [PATCH v8 5/8] KVM: Register/unregister the guest private memory regions

2022-10-17 Thread Sean Christopherson
On Mon, Oct 17, 2022, Fuad Tabba wrote:
> Hi,
> 
> > > > +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM
> > > > +#define KVM_MEM_ATTR_SHARED	0x0001
> > > > +static int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
> > > > +				     bool is_private)
> > > > +{
> > >
> > > I wonder if this ioctl should be implemented as an arch-specific
> > > ioctl. In this patch it performs some actions that pKVM might not need
> > > or might want to do differently.
> >
> > I think it's doable. We can provide the mem_attr_array kind thing in
> > common code and let arch code decide to use it or not. Currently
> > mem_attr_array is defined in the struct kvm, if those bytes are
> > unnecessary for pKVM it can even be moved to arch definition, but that
> > also loses the potential code sharing for confidential usages in other
> > non-architectures, e.g. if ARM also supports such usage. Or it can be
> > provided through a different CONFIG_ instead of
> > CONFIG_HAVE_KVM_PRIVATE_MEM.
> 
> This sounds good. Thank you.

I like the idea of a separate Kconfig, e.g. CONFIG_KVM_GENERIC_PRIVATE_MEM or
something.  I highly doubt there will be any non-x86 users for multiple years,
if ever, but it would allow testing the private memory stuff on ARM (and any other
non-x86 arch) without needing full pKVM support and with only minor KVM
modifications, e.g. the x86 support[*] to test UPM without TDX is shaping up to be
trivial.

[*] https://lore.kernel.org/all/y0mu1fkugnqg5...@google.com



Re: [PATCH v8 7/8] KVM: Handle page fault for private memory

2022-10-14 Thread Sean Christopherson
On Thu, Sep 15, 2022, Chao Peng wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a0f198cede3d..81ab20003824 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3028,6 +3028,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>   break;
>   }
>  
> + if (kvm_mem_is_private(kvm, gfn))

Rather than reload the Xarray info, which is unnecessary overhead, pass in
@is_private.  The caller must hold mmu_lock, i.e. invalidations from
private<->shared conversions will be stalled and will zap the new SPTE if the
state is changed.

E.g.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d68944f07b4b..44eea47697d8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3072,8 +3072,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 * Enforce the iTLB multihit workaround after capturing the requested
 	 * level, which will be used to do precise, accurate accounting.
 	 */
-	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
-						     fault->gfn, fault->max_level);
+	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot, fault->gfn,
+						     fault->max_level, fault->is_private);
if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
return;
 
@@ -6460,7 +6460,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 */
 	if (sp->role.direct &&
 	    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn,
-						       PG_LEVEL_NUM)) {
+						       PG_LEVEL_NUM, false)) {
kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
 
if (kvm_available_flush_tlb_with_range())
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 7670c13ce251..9acdf72537ce 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -315,6 +315,12 @@ static inline bool is_dirty_spte(u64 spte)
return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
 }
 
+static inline bool is_private_spte(u64 spte)
+{
+   /* FIXME: Query C-bit/S-bit for SEV/TDX. */
+   return false;
+}
+
 static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
int level)
 {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 672f0432d777..69ba00157e90 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1767,8 +1767,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 		if (iter.gfn < start || iter.gfn >= end)
 			continue;
 
-		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot,
-							      iter.gfn, PG_LEVEL_NUM);
+		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
+							      PG_LEVEL_NUM,
+							      is_private_spte(iter.old_spte));
 		if (max_mapping_level < iter.level)
 			continue;
 



Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory

2022-10-07 Thread Sean Christopherson
On Fri, Oct 07, 2022, Jarkko Sakkinen wrote:
> On Thu, Oct 06, 2022 at 03:34:58PM +0000, Sean Christopherson wrote:
> > On Thu, Oct 06, 2022, Jarkko Sakkinen wrote:
> > > On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> > > > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > > > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds 
> > > > > two
> > > > > additional KVM memslot fields private_fd/private_offset to allow
> > > > > userspace to specify that guest private memory provided from the
> > > > > private_fd and guest_phys_addr mapped at the private_offset of the
> > > > > private_fd, spanning a range of memory_size.
> > > > > 
> > > > > The extended memslot can still have the userspace_addr(hva). When 
> > > > > use, a
> > > > > single memslot can maintain both private memory through private
> > > > > fd(private_fd/private_offset) and shared memory through
> > > > > hva(userspace_addr). Whether the private or shared part is visible to
> > > > > guest is maintained by other KVM code.
> > > > 
> > > > What is anyway the appeal of private_offset field, instead of having 
> > > > just
> > > > 1:1 association between regions and files, i.e. one memfd per region?
> > 
> > Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM).
> > E.g. if a vCPU converts a single page, it will be forced to wait until all other
> > vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in
> > memory.  KVM's memslot updates also hold a mutex for the entire duration of the
> > update, i.e. conversions on different vCPUs would be fully serialized, exacerbating
> > the SRCU problem.
> > 
> > KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any
> > memslot is deleted.
> > 
> > Taking both a private_fd and a shared userspace address allows userspace to convert
> > between private and shared without having to manipulate memslots.
> 
> Right, this was really good explanation, thank you.
> 
> Still wondering could this possibly work (or not):
> 
> 1. Union userspace_addr and private_fd.

No, because userspace needs to be able to provide both userspace_addr (shared
memory) and private_fd (private memory) for a single memslot.

> 2. Instead of introducing private_offset, use guest_phys_addr as the
>offset.

No, because that would force userspace to use a single private_fd for all of guest
memory since it effectively means private_offset=0.  And userspace couldn't skip
over holes in guest memory, i.e. the size of the memfd would need to follow the
max guest gpa.  In other words, dropping private_offset could work, but it'd be
quite kludgy and not worth saving 8 bytes.
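
E.g. with private_offset, userspace can pack sparse guest RAM into a compact
memfd; illustrative only, the sizes, hole placement, and "memfd" handle are all
made up:

  struct kvm_userspace_memory_region_ext lo = {
	.region.guest_phys_addr = 0,
	.region.memory_size     = 2ull << 30,
	.private_fd             = memfd,
	.private_offset         = 0,
  };
  struct kvm_userspace_memory_region_ext hi = {
	.region.guest_phys_addr = 4ull << 30,	/* skip the PCI hole */
	.region.memory_size     = 2ull << 30,
	.private_fd             = memfd,	/* same memfd, no wasted hole */
	.private_offset         = 2ull << 30,
  };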



Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory

2022-10-06 Thread Sean Christopherson
On Thu, Oct 06, 2022, Jarkko Sakkinen wrote:
> On Thu, Oct 06, 2022 at 05:58:03PM +0300, Jarkko Sakkinen wrote:
> > On Thu, Sep 15, 2022 at 10:29:07PM +0800, Chao Peng wrote:
> > > This new extension, indicated by the new flag KVM_MEM_PRIVATE, adds two
> > > additional KVM memslot fields private_fd/private_offset to allow
> > > userspace to specify that guest private memory provided from the
> > > private_fd and guest_phys_addr mapped at the private_offset of the
> > > private_fd, spanning a range of memory_size.
> > > 
> > > The extended memslot can still have the userspace_addr(hva). When use, a
> > > single memslot can maintain both private memory through private
> > > fd(private_fd/private_offset) and shared memory through
> > > hva(userspace_addr). Whether the private or shared part is visible to
> > > guest is maintained by other KVM code.
> > 
> > What is anyway the appeal of private_offset field, instead of having just
> > 1:1 association between regions and files, i.e. one memfd per region?

Modifying memslots is slow, both in KVM and in QEMU (not sure about Google's VMM).
E.g. if a vCPU converts a single page, it will be forced to wait until all other
vCPUs drop SRCU, which can have severe latency spikes, e.g. if KVM is faulting in
memory.  KVM's memslot updates also hold a mutex for the entire duration of the
update, i.e. conversions on different vCPUs would be fully serialized, exacerbating
the SRCU problem.

KVM also has historical baggage where it "needs" to zap _all_ SPTEs when any
memslot is deleted.

Taking both a private_fd and a shared userspace address allows userspace to convert
between private and shared without having to manipulate memslots.

Paolo's original idea (was sent off-list):

  : The problem is that KVM_SET_USER_MEMORY_REGION and memslots in general
  : are designed around (S)RCU.  It is way too slow (in both QEMU and KVM)
  : to be called on every private<->shared conversion with 4K granularity,
  : and it tends naturally to have quadratic behavior (though, at least for
  : KVM, the in-progress "fast memslots" series would avoid that).
  : 
  : Since private PTEs are persistent, and userspace cannot access the memfd
  : in any other way, userspace could use fallocate() to map/unmap an
  : address range as private, and KVM can treat everything that userspace
  : hasn't mapped as shared.
  : 
  : This would be a new entry in struct guest_ops, called by fallocate(),
  : and the callback can take the mmu_lock for write to avoid racing with
  : page faults.  This doesn't add any more contention than
  : KVM_SET_USER_MEMORY_REGION, since the latter takes slots_lock.  If
  : there's something I'm missing then the mapping operation can use a
  : ioctl, while the unmapping can keep using FALLOC_FL_PUNCH_HOLE.
  : 
  : Then:
  : 
  : - for simplicity, mapping a private memslot fails if there are any
  : mappings (similar to the handling when F_SEAL_GUEST is set).
  : 
  : - for TDX, accessing a nonexistent private PTE will cause a userspace
  : exit for a shared->private conversion request.  For SNP, the guest will
  : do a page state change VMGEXIT to request an RMPUPDATE, which can cause
  : a userspace exit too; the consequent fallocate() on the private fd
  : invokes RMPUPDATE.
  : 
  : - trying to map a shared PTE where there's already a private PTE causes
  : a userspace exit for a private->shared conversion request.
  : kvm_faultin_pfn or handle_abnormal_pfn can query this in the private-fd
  : inode, which is essentially a single pagecache_get_page call.
  : 
  : - if userspace asks to map a private PTE where there's already a shared
  : PTE (which it can check because it has the mmu_lock taken for write),
  : KVM unmaps the shared PTE.

> > 
> > If this was the case, then an extended struct would not be needed in the
> > first place. A simple union inside the existing struct would do:
> > 
> > union {
> > 	__u64 userspace_addr;
> > 	__u64 private_fd;
> > };
> 
> Also, why is this mechanism just for fd's with MFD_INACCESSIBLE flag? I'd
> consider instead having KVM_MEM_FD flag. For generic KVM (if memfd does not
> have MFD_INACCESSIBLE set), KVM could just use the memory as it is using
> mapped memory. This would simplify user space code, as you can the use the
> same thing for both cases.

I explored this idea too[*].  Because we want to support specifying both the
private and shared backing stores in a single memslot, then we need two file
descriptors so that shared memory can also use fd-based memory.

[*] https://lore.kernel.org/all/yulth7bl4mwt5...@google.com



Re: [PATCH v8 2/8] KVM: Extend the memslot to support fd-based private memory

2022-09-29 Thread Sean Christopherson
On Thu, Sep 29, 2022, Isaku Yamahata wrote:
> On Thu, Sep 15, 2022 at 10:29:07PM +0800,
> Chao Peng  wrote:
> > @@ -4645,14 +4672,20 @@ static long kvm_vm_ioctl(struct file *filp,
> > break;
> > }
> > case KVM_SET_USER_MEMORY_REGION: {
> > -   struct kvm_userspace_memory_region kvm_userspace_mem;
> > +   struct kvm_user_mem_region mem;
> > +   unsigned long size = sizeof(struct kvm_userspace_memory_region);
> > +
> > +   kvm_sanity_check_user_mem_region_alias();
> >  
> > r = -EFAULT;
> > -	if (copy_from_user(&kvm_userspace_mem, argp,
> > -			   sizeof(kvm_userspace_mem)))
> > +	if (copy_from_user(&mem, argp, size))
> > +   goto out;
> > +
> > +   r = -EINVAL;
> > +   if (mem.flags & KVM_MEM_PRIVATE)
> > goto out;
> 
> Nit:  It's better to check if padding is zero.  Maybe rename it to reserved.
> 
> +   if (mem.pad1 || memchr_inv(mem.pad2, 0, sizeof(mem.pad2)))
> +   goto out;

No need, KVM has more or less settled on using flags instead of "reserving" bytes.
E.g. if/when another fancy feature comes along, we'll add another KVM_MEM_XYZ
and only consume the relevant fields when the flag is set.  Reserving bytes
doesn't work very well because it assumes that '0' is an invalid value, e.g. if
the future expansion is for a non-private file descriptor, then we'd need a new
flag even if KVM reserved bytes since fd=0 is valid.

The only reason to bother with pad2[14] at this time is to avoid having to define
yet another struct if/when the struct needs to expand again.  The struct definition
will still need to be changed, but at least we won't end up with struct
kvm_userspace_memory_region_really_extended.
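
I.e. the handler consumes the extended fields iff the flag is set, roughly:

  /* Sketch only: copy the base struct, then the extension iff the flag is set. */
  if (copy_from_user(&mem, argp, sizeof(struct kvm_userspace_memory_region)))
	goto out;

  if ((mem.flags & KVM_MEM_PRIVATE) &&
      copy_from_user(&mem, argp, sizeof(struct kvm_user_mem_region)))
	goto out;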



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-27 Thread Sean Christopherson
On Mon, Sep 26, 2022, David Hildenbrand wrote:
> On 26.09.22 16:48, Kirill A. Shutemov wrote:
> > On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> > > When using DAX, what happens with the shared <->private conversion? Which
> > > "type" is supposed to use dax, which not?
> > > 
> > > In other word, I'm missing too many details on the bigger picture of how
> > > this would work at all to see why it makes sense right now to prepare for
> > > that.
> > 
> > IIUC, KVM doesn't really care about pages or folios. They need PFN to
> > populate SEPT. Returning page/folio would make KVM do additional steps to
> > extract PFN and one more place to have a bug.
> 
> Fair enough. Smells KVM specific, though.

TL;DR: I'm good with either approach, though providing a "struct page" might 
avoid
   refactoring the API in the nearish future.

Playing devil's advocate for a second, the counter argument is that KVM is the
only user for the foreseeable future.

That said, it might make sense to return a "struct page" from the core API and
force KVM to do page_to_pfn().  KVM already does that for HVA-based memory, so
it's not exactly new code.

More importantly, KVM may actually need/want the "struct page" in the not-too-distant
future to support mapping non-refcounted "struct page" memory into the guest.  The
ChromeOS folks have a use case involving virtio-gpu blobs where KVM can get handed a
"struct page" that _isn't_ refcounted[*].  Once the lack of mmu_notifier integration
is fixed, the remaining issue is that KVM doesn't currently have a way to determine
whether or not it holds a reference to the page.  Instead, KVM assumes that if the
page is "normal", it's refcounted, e.g. see kvm_release_pfn_clean().

KVM's current workaround for this is to refuse to map these pages into the guest,
i.e. KVM simply forces its assumption that normal pages are refcounted to be true.
To remove that workaround, the likely solution will be to pass around a tuple of
page+pfn, where "page" is non-NULL if the pfn is a refcounted "struct page".

At that point, getting handed a "struct page" from the core API would be a good
thing as KVM wouldn't need to probe the PFN to determine whether or not it's a
refcounted page.
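
E.g. the tuple would likely take a shape along these lines (a sketch, not a
committed API):

  struct kvm_pfn_page {
	kvm_pfn_t pfn;
	struct page *page;	/* NULL if the pfn is not a refcounted page */
  };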

Note, I still want the order to be provided by the API so that KVM doesn't need
to run through a bunch of helpers to try and figure out the allowed mapping size.

[*] https://lore.kernel.org/all/CAD=HUj736L5oxkzeL2JoPV8g1S6Rugy_TquW=prt73ymfzp...@mail.gmail.com




Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-27 Thread Sean Christopherson
On Mon, Sep 26, 2022, Fuad Tabba wrote:
> Hi,
> 
> On Mon, Sep 26, 2022 at 3:28 PM Chao Peng  wrote:
> >
> > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > >
> > > >   1. Not be supported for TDX or SEV-SNP because they don't allow 
> > > > adding non-zero
> > > >  memory into the guest (after pre-boot phase).
> > > >
> > > >   2. Be mutually exclusive with shared<=>private conversions, and is 
> > > > allowed if
> > > >  and only if the entire gfn range of the associated memslot is 
> > > > shared.
> > >
> > > In general I think that this would work with pKVM. However, limiting
> > > private<->shared conversions to the granularity of a whole memslot
> > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > concept of memslots. For example, in pKVM right now, when a guest
> > > shares back its restricted DMA pool with the host it does so at the
> > > page-level.

Y'all are killing me :-)

Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt share at
granularity X"?  With KVM's newfangled scalable memslots and per-vCPU MRU slot,
X doesn't even have to be that high to get reasonable performance, e.g. assuming
the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
work just fine in KVM.

> > > pKVM would also need a way to make an fd accessible again
> > > when shared back, which I think isn't possible with this patch.
> >
> > But does pKVM really want to mmap/munmap a new region at the page-level,
> > that can cause VMA fragmentation if the conversion is frequent as I see.
> > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > be the same issue.
> 
> pKVM doesn't really need to unmap the memory. What is really important
> is that the memory is not GUP'able.

Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
otherwise KVM wouldn't be able to get the PFN to map into guest memory.

The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
the end result is the same.

Emphasis above because pKVM still needs to unmap the memory _somewhere_.  IIUC, the
current approach is to do that only in the stage-2 page tables, i.e. only in the
context of the hypervisor.  Which is also the source of the gup() problems; the
untrusted kernel is blissfully unaware that the memory is inaccessible.

Any approach that moves some of that information into the untrusted kernel so that
the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
all of guest memory becomes unguppable, but that's likely not a viable option.



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-22 Thread Sean Christopherson
On Thu, Sep 22, 2022, Wang, Wei W wrote:
> On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > +int *order)
> 
> Better to remove "order" from this interface?

Hard 'no'.

> Some callers only need to get pfn, and no need to bother with
> defining and inputting something unused. For callers who need the "order",
> can easily get it via thp_order(pfn_to_page(pfn)) on their own.

That requires (a) assuming the pfn is backed by struct page, and (b) assuming the
struct page is a transparent huge page.  That might be true for the current
implementation, but it most certainly will not always be true.

KVM originally did things like this, where there was dedicated code for THP vs.
HugeTLB, and it was a mess.  The goal here is very much to avoid repeating those
mistakes.  Have the backing store _tell_ KVM how big the mapping is, don't force
KVM to rediscover the info on its own.
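
E.g. the consumer side stays trivial if the backing store reports the order;
order_to_level() here is an illustrative helper, not an existing one:

  r = inaccessible_get_pfn(file, index, &pfn, &order);
  if (r)
	return r;

  /* The backing store, not KVM, dictates the maximum mapping size. */
  max_level = min(max_level, order_to_level(order));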



Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-19 Thread Sean Christopherson
+Will, Marc and Fuad (apologies if I missed other pKVM folks)

On Mon, Sep 19, 2022, David Hildenbrand wrote:
> On 15.09.22 16:29, Chao Peng wrote:
> > From: "Kirill A. Shutemov" 
> > 
> > KVM can use memfd-provided memory for guest memory. For normal userspace
> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > virtual address space and then tells KVM to use the virtual address to
> > setup the mapping in the secondary page table (e.g. EPT).
> > 
> > With confidential computing technologies like Intel TDX, the
> > memfd-provided memory may be encrypted with special key for special
> > software domain (e.g. KVM guest) and is not expected to be directly
> > accessed by userspace. Precisely, userspace access to such encrypted
> > memory may lead to host crash so it should be prevented.
> 
> Initially my thought was that this whole inaccessible thing is TDX specific
> and there is no need to force that on other mechanisms. That's why I
> suggested to not expose this to user space but handle the notifier
> requirements internally.
> 
> IIUC now, protected KVM has similar demands. Either access (read/write) of
> guest RAM would result in a fault and possibly crash the hypervisor (at
> least not the whole machine IIUC).

Yep.  The missing piece for pKVM is the ability to convert from shared to private
while preserving the contents, e.g. to hand off a large buffer (hundreds of MiB)
for processing in the protected VM.  Thoughts on this at the bottom.

> > This patch introduces userspace inaccessible memfd (created with
> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > in-kernel interface so KVM can directly interact with core-mm without
> > the need to map the memory into KVM userspace.
> 
> With secretmem we decided to not add such "concept switch" flags and instead
> use a dedicated syscall.
>

I have no personal preference whatsoever between a flag and a dedicated syscall,
but a dedicated syscall does seem like it would give the kernel a bit more
flexibility.

> What about memfd_inaccessible()? Especially, sealing and hugetlb are not
> even supported and it might take a while to support either.

Don't know about sealing, but hugetlb support for "inaccessible" memory needs to
come sooner than later.  "inaccessible" in quotes because we might want to choose
a less binary name, e.g. "restricted"?

Regarding pKVM's use case, with the shim approach I believe this can be done by
allowing userspace to mmap() the "hidden" memfd, but with a ton of restrictions
piled on top.

My first thought was to make the uAPI a set of KVM ioctls so that KVM could tightly
control usage without taking on too much complexity in the kernel, but working
through things, routing the behavior through the shim itself might not be all that
horrific.

IIRC, we discarded the idea of allowing userspace to map the "private" fd because
things got too complex, but with the shim it doesn't seem _that_ bad.

E.g. on the memfd side:

  1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
 mapping is all or nothing.

  2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
 the restricted memfd.

  3. Add notifier hooks to allow downstream users to further restrict things.

  4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
 one shot.

  5. Require that there are no outstanding references at munmap().  Or if this
 can't be guaranteed by userspace, maybe add some way for userspace to wait
 until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
 to do an expensive check every time.
 
  static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
  {
	if (vma->vm_pgoff)
		return -EINVAL;

	if ((vma->vm_end - vma->vm_start) != <size of memfd>)
		return -EINVAL;

	mutex_lock(&data->lock);

	if (data->has_mapping) {
		r = -EINVAL;
		goto err;
	}
	list_for_each_entry(notifier, &data->notifiers, list) {
		r = notifier->ops->mmap_start(notifier, ...);
		if (r)
			goto abort;
	}

	notifier->ops->mmap_end(notifier, ...);
	mutex_unlock(&data->lock);
	return 0;

  abort:
	list_for_each_entry_continue_reverse(notifier, &data->notifiers, list)
		notifier->ops->mmap_abort(notifier, ...);
  err:
	mutex_unlock(&data->lock);
	return r;
  }

  static void memfd_restricted_close(struct vm_area_struct *vma)
  {
	mutex_lock(...);

	/*
	 * Destroy the memfd and disable all future accesses if there are
	 * outstanding refcounts (or other unsatisfied restrictions?).
	 */
	if (<outstanding refcounts> || ???)
		memfd_restricted_destroy(...);
	else
		data->has_mapping = false;


Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-13 Thread Sean Christopherson
On Tue, Sep 13, 2022, Kirill A. Shutemov wrote:
> On Tue, Sep 13, 2022 at 02:53:25PM +0000, Sean Christopherson wrote:
> > > > Switching topics, what actually prevents mmap() on the shim?  I tried to follow,
> > > > but I don't know these areas well enough.
> > > 
> > > It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap().
> > > (I did not read the switch statement correctly at first. Note there are
> > > two 'fallthrough' there.)
> > 
> > Ah, validate_mmap_request().  Thought not implementing ->mmap() was the key, but
> > couldn't find the actual check.
> 
> validate_mmap_request() is in mm/nommu.c which is not relevant for real computers.
> 
> I was talking about this check:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mmap.c#n1495

Hence the comment about 'fallthrough'.  Thanks again!



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-13 Thread Sean Christopherson
On Tue, Sep 13, 2022, Kirill A. Shutemov wrote:
> On Tue, Sep 13, 2022 at 09:44:27AM +0000, Sean Christopherson wrote:
> > On Thu, Sep 08, 2022, Kirill A. Shutemov wrote:
> > > On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> > > > On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > > > > I will try next week to rework it as shim to top of shmem. Does it 
> > > > > > work
> > > > > > for you?
> > > > > 
> > > > > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > > > > case has no justification to use shmem at all, but doing it that way
> > > > > will help you with some of the infrastructure, and will probably be
> > > > > easiest for KVM to extend to other more relaxed fd cases later.
> > > > 
> > > > Okay, below is my take on the shim approach.
> > > > 
> > > > I don't hate how it turned out. It is easier to understand without
> > > > callback exchange thing.
> > > > 
> > > > The only caveat is I had to introduce external lock to protect against
> > > > race between lookup and truncate.
> > 
> > As before, I think this lock is unnecessary.  Or at least it's unnecessary to hold
> > the lock across get/put.  The ->invalidate() call will ensure that the pfn is
> > never actually used if get() races with truncation.
> 
> The updated version you're replying to does not use the lock to protect
> against truncation anymore. The lock protects the notifier list.

Gah, grabbed the patch when applying.

> > Switching topics, what actually prevents mmap() on the shim?  I tried to follow,
> > but I don't know these areas well enough.
> 
> It has no f_op->mmap, so mmap() will fail with -ENODEV. See do_mmap().
> (I did not read the switch statement correctly at first. Note there are
> two 'fallthrough' there.)

Ah, validate_mmap_request().  Thought not implementing ->mmap() was the key, but
couldn't find the actual check.

Thanks much!



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-09-13 Thread Sean Christopherson
On Thu, Sep 08, 2022, Kirill A. Shutemov wrote:
> On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> > On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > > I will try next week to rework it as shim to top of shmem. Does it work
> > > > for you?
> > > 
> > > Yes, please do, thanks.  It's a compromise between us: the initial TDX
> > > case has no justification to use shmem at all, but doing it that way
> > > will help you with some of the infrastructure, and will probably be
> > > easiest for KVM to extend to other more relaxed fd cases later.
> > 
> > Okay, below is my take on the shim approach.
> > 
> > I don't hate how it turned out. It is easier to understand without
> > callback exchange thing.
> > 
> > The only caveat is I had to introduce external lock to protect against
> > race between lookup and truncate.

As before, I think this lock is unnecessary.  Or at least it's unnecessary to hold
the lock across get/put.  The ->invalidate() call will ensure that the pfn is
never actually used if get() races with truncation.

Switching topics, what actually prevents mmap() on the shim?  I tried to follow,
but I don't know these areas well enough.



Re: [PATCH v1 15/40] i386/tdx: Add property sept-ve-disable for tdx-guest object

2022-09-02 Thread Sean Christopherson
On Fri, Sep 02, 2022, Gerd Hoffmann wrote:
> On Fri, Sep 02, 2022 at 02:52:25AM +0000, Sean Christopherson wrote:
> > On Fri, Sep 02, 2022, Xiaoyao Li wrote:
> > > On 8/26/2022 1:57 PM, Gerd Hoffmann wrote:
> > > >Hi,
> > > > > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. 
> > > > > E.g.,
> > > > > linux TD guest requires SEPT_VE to be disabled to avoid #VE on 
> > > > > syscall gap
> > > > > [1].
> > > > 
> > > > Why is that a problem for a TD guest kernel?  Installing exception
> > > > handlers is done quite early in the boot process, certainly before any
> > > > userspace code runs.  So I think we should never see a syscall without
> > > > a #VE handler being installed.  /me is confused.
> > > > 
> > > > Or do you want tell me linux has no #VE handler?
> > > 
> > > The problem is not "no #VE handler" and Linux does have #VE handler. The
> > > problem is Linux doesn't want any (or certain) exception occurrence in
> > > syscall gap, it's not specific to #VE. Frankly, I don't understand the
> > > reason clearly, it's something related to IST used in x86 Linux kernel.
> > 
> > The SYSCALL gap issue is that because SYSCALL doesn't load RSP, the first instruction
> > at the SYSCALL entry point runs with a userspace-controlled RSP.  With TDX, a
> > malicious hypervisor can induce a #VE on the SYSCALL page and thus get the kernel
> > to run the #VE handler with a userspace stack.
> > 
> > The "fix" is to use an IST for #VE so that a kernel-controlled RSP is loaded on #VE,
> > but ISTs are terrible because they don't play nice with re-entrancy (among other
> > reasons).  The RSP used for IST-based handlers is hardcoded, and so if a #VE
> > handler triggers another #VE at any point before IRET, the second #VE will clobber
> > the stack and hose the kernel.
> > 
> > It's possible to workaround this, e.g. change the IST entry at the very beginning
> > of the handler, but it's a maintenance burden.  Since the only reason to use an IST
> > is to guard against a malicious hypervisor, Linux decided it would be just as easy
> > and more beneficial to avoid unexpected #VEs due to unaccepted private pages entirely.
> 
> Hmm, ok, but shouldn't the SEPT_VE bit *really* be controlled by the guest then?
> 
> Having a hypervisor-controlled config bit to protect against a malicious
> hypervisor looks pointless to me ...

IIRC, all (most?) of the attributes are included in the attestation report, so a
guest/customer can refuse to provision secrets to the guest if the hypervisor is
misbehaving.

I'm guessing Intel made it an attribute and not a dynamic control knob to simplify
the TDX module implementation.



Re: [PATCH v1 15/40] i386/tdx: Add property sept-ve-disable for tdx-guest object

2022-09-01 Thread Sean Christopherson
On Fri, Sep 02, 2022, Xiaoyao Li wrote:
> On 8/26/2022 1:57 PM, Gerd Hoffmann wrote:
> >Hi,
> > > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. 
> > > E.g.,
> > > linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap
> > > [1].
> > 
> > Why is that a problem for a TD guest kernel?  Installing exception
> > handlers is done quite early in the boot process, certainly before any
> > userspace code runs.  So I think we should never see a syscall without
> > a #VE handler being installed.  /me is confused.
> > 
> > Or do you want tell me linux has no #VE handler?
> 
> The problem is not "no #VE handler" and Linux does have #VE handler. The
> problem is Linux doesn't want any (or certain) exception occurrence in
> syscall gap, it's not specific to #VE. Frankly, I don't understand the
> reason clearly, it's something related to IST used in x86 Linux kernel.

The SYSCALL gap issue is that because SYSCALL doesn't load RSP, the first instruction
at the SYSCALL entry point runs with a userspace-controlled RSP.  With TDX, a
malicious hypervisor can induce a #VE on the SYSCALL page and thus get the kernel
to run the #VE handler with a userspace stack.

The "fix" is to use an IST for #VE so that a kernel-controlled RSP is loaded on #VE,
but ISTs are terrible because they don't play nice with re-entrancy (among other
reasons).  The RSP used for IST-based handlers is hardcoded, and so if a #VE
handler triggers another #VE at any point before IRET, the second #VE will clobber
the stack and hose the kernel.

It's possible to workaround this, e.g. change the IST entry at the very beginning
of the handler, but it's a maintenance burden.  Since the only reason to use an IST
is to guard against a malicious hypervisor, Linux decided it would be just as easy
and more beneficial to avoid unexpected #VEs due to unaccepted private pages entirely.



Re: [PATCH v6 6/8] KVM: Handle page fault for private memory

2022-08-25 Thread Sean Christopherson
On Fri, Aug 19, 2022, Kirill A. Shutemov wrote:
> On Fri, Jun 17, 2022 at 09:30:53PM +0000, Sean Christopherson wrote:
> > > @@ -4088,7 +4144,12 @@ static int direct_page_fault(struct kvm_vcpu 
> > > *vcpu, struct kvm_page_fault *fault
> > >   read_unlock(>kvm->mmu_lock);
> > >   else
> > >   write_unlock(>kvm->mmu_lock);
> > > - kvm_release_pfn_clean(fault->pfn);
> > > +
> > > + if (fault->is_private)
> > > + kvm_private_mem_put_pfn(fault->slot, fault->pfn);
> > 
> > Why does the shmem path lock the page, and then unlock it here?
> 
> Lock is required to avoid a race with truncate / punch hole. Like if truncate
> happens after get_pfn(), but before it gets into SEPT, we are screwed.

Getting the PFN into the SPTE doesn't provide protection in and of itself.  The
protection against truncation and whatnot comes from KVM getting a notification
and either retrying the fault (notification acquires mmu_lock before
direct_page_fault()), or blocking the notification (truncate / punch hole) until
after KVM installs the SPTE.  I.e. KVM just needs to ensure it doesn't install a
SPTE _after_ getting notified.

If the API is similar to gup(), i.e. only elevates the refcount but doesn't lock
the page, then there's no need for a separate kvm_private_mem_put_pfn(), and in
fact no need for ->put_unlock_pfn() because can KVM do set_page_dirty() and
put_page() directly as needed using all of KVM's existing mechanisms.
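
For reference, KVM's standard pattern for HVA-based memory, which a gup()-like
API would slot straight into (a sketch, eliding the retry loop details):

  mmu_seq = kvm->mmu_invalidate_seq;
  smp_rmb();

  pfn = <gup()-like lookup>;	/* may race with truncate / punch hole */

  write_lock(&kvm->mmu_lock);
  if (mmu_invalidate_retry_hva(kvm, mmu_seq, hva)) {
	write_unlock(&kvm->mmu_lock);
	goto retry;	/* a notification beat us, don't install the SPTE */
  }
  /* ... install the SPTE ... */
  write_unlock(&kvm->mmu_lock);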



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-23 Thread Sean Christopherson
On Tue, Aug 23, 2022, David Hildenbrand wrote:
> On 19.08.22 05:38, Hugh Dickins wrote:
> > On Fri, 19 Aug 2022, Sean Christopherson wrote:
> >> On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> >>> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> >>>> On Wed, 6 Jul 2022, Chao Peng wrote:
> >>>> But since then, TDX in particular has forced an effort into preventing
> >>>> (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> >>>>
> >>>> Are any of the shmem.c mods useful to existing users of shmem.c? No.
> >>>> Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.
> >>
> >> But QEMU and other VMMs are users of shmem and memfd.  The new features certainly
> >> aren't useful for _all_ existing users, but I don't think it's fair to say that
> >> they're not useful for _any_ existing users.
> > 
> > Okay, I stand corrected: there exist some users of memfd_create()
> > who will also have use for "INACCESSIBLE" memory.
> 
> As raised in reply to the relevant patch, I'm not sure if we really have
> to/want to expose MFD_INACCESSIBLE to user space. I feel like this is a
> requirement of specific memfd_notifer (memfile_notifier) implementations
> -- such as TDX that will convert the memory and MCE-kill the machine on
> ordinary write access. We might be able to set/enforce this when
> registering a notifier internally instead, and fail notifier
> registration if a condition isn't met (e.g., existing mmap).
>
> So I'd be curious, which other users of shmem/memfd would benefit from
> (MMU)-"INACCESSIBLE" memory obtained via memfd_create()?

I agree that there's no need to expose the inaccessible behavior via uAPI.  Making
it a kernel-internal thing that's negotiated/resolved when KVM binds to the fd
would align INACCESSIBLE with the UNMOVABLE and UNRECLAIMABLE flags (and any other
flags that get added in the future).

AFAICT, the user-visible flag is a holdover from the early RFCs and doesn't provide
any unique functionality.

If we go that route, we might want to have shmem/memfd require INACCESSIBLE to be
set for the initial implementation.  I.e. disallow binding without INACCESSIBLE
until there's a use case.



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-19 Thread Sean Christopherson
On Thu, Aug 18, 2022, Hugh Dickins wrote:
> On Fri, 19 Aug 2022, Sean Christopherson wrote:
> > On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> > > On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > > > If your memory could be migrated, that would be some reason to use
> > > > filesystem page cache (because page migration happens to understand
> > > > that type of memory): but it cannot be migrated.
> > > 
> > > Migration support is in pipeline. It is part of TDX 1.5 [1]. 
> > 
> > And this isn't intended for just TDX (or SNP, or pKVM).  We're not _that_ far off
> > from being able to use UPM for "regular" VMs as a way to provide defense-in-depth
> 
> UPM? That's an acronym from your side of the fence, I spy references to
> it in the mail threads, but haven't tracked down a definition.  I'll
> just take it to mean the fd-based memory we're discussing.

Ya, sorry, UPM is what we came up with as shorthand for "Unmapping guest Private
Memory".  Your assumption is spot on, it's just a fancy way of saying "guest is
backed with inaccessible fd-based memory".

> > without having to take on the overhead of confidential VMs.  At that point,
> > migration and probably even swap are on the table.
> 
> Good, the more "flexible" that memory is, the better for competing users
> of memory.  But an fd supplied by KVM gives you freedom to change to a
> better implementation of allocation underneath, whenever it suits you.
> Maybe shmem beneath is good from the start, maybe not.

The main flaw with KVM providing the fd is that it forces KVM to get into the
memory management business, which us KVM folks really, really do not want to do.
And based on the types of bugs KVM has had in the past related to memory
management, it's a safe bet to say the mm folks don't want us getting involved
either :-)

The combination of gup()/follow_pte() and mmu_notifiers has worked very well.
KVM gets a set of (relatively) simple rules to follow and doesn't have to be
taught new things every time a new backing type comes along.  And from the
other side, KVM has very rarely had to go poke into other subsystems' code to
support exposing a new type of memory to guests.
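
For reference, those rules boil down to a snapshot-and-recheck pattern; a
simplified sketch (not verbatim KVM code) of what the page fault path does:

again:
        mmu_seq = vcpu->kvm->mmu_notifier_seq;
        smp_rmb();

        /* gup() under the hood; may sleep, so no locks held here. */
        pfn = gfn_to_pfn_memslot(slot, gfn);

        spin_lock(&vcpu->kvm->mmu_lock);
        if (mmu_seq != vcpu->kvm->mmu_notifier_seq) {
                /* An mmu_notifier invalidation raced with us; retry. */
                spin_unlock(&vcpu->kvm->mmu_lock);
                goto again;
        }
        /* Safe to install pfn in the secondary (EPT/NPT) page tables. */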

What we're trying to do with UPM/fd-based memory is establish a similar contract
between mm and KVM, but without requiring mm to also map memory into host
userspace.

The only way having KVM provide the fd works out in the long run is if KVM is
the only subsystem that ever wants to make use of memory that isn't accessible
from userspace and isn't tied to a specific backing type, _and_ if the set of
backing types that KVM ever supports is kept to an absolute minimum.



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-18 Thread Sean Christopherson
On Thu, Aug 18, 2022, Kirill A . Shutemov wrote:
> On Wed, Aug 17, 2022 at 10:40:12PM -0700, Hugh Dickins wrote:
> > On Wed, 6 Jul 2022, Chao Peng wrote:
> > But since then, TDX in particular has forced an effort into preventing
> > (by flags, seals, notifiers) almost everything that makes it shmem/tmpfs.
> > 
> > Are any of the shmem.c mods useful to existing users of shmem.c? No.
> > Is MFD_INACCESSIBLE useful or comprehensible to memfd_create() users? No.

But QEMU and other VMMs are users of shmem and memfd.  The new features
certainly aren't useful for _all_ existing users, but I don't think it's fair
to say that they're not useful for _any_ existing users.

> > What use do you have for a filesystem here?  Almost none.
> > IIUC, what you want is an fd through which QEMU can allocate kernel
> > memory, selectively free that memory, and communicate fd+offset+length
> > to KVM.  And perhaps an interface to initialize a little of that memory
> > from a template (presumably copied from a real file on disk somewhere).
> > 
> > You don't need shmem.c or a filesystem for that!
> > 
> > If your memory could be swapped, that would be enough of a good reason
> > to make use of shmem.c: but it cannot be swapped; and although there
> > are some references in the mailthreads to it perhaps being swappable
> > in future, I get the impression that will not happen soon if ever.
> > 
> > If your memory could be migrated, that would be some reason to use
> > filesystem page cache (because page migration happens to understand
> > that type of memory): but it cannot be migrated.
> 
> Migration support is in the pipeline. It is part of TDX 1.5 [1].

And this isn't intended for just TDX (or SNP, or pKVM).  We're not _that_ far
off from being able to use UPM for "regular" VMs as a way to provide
defense-in-depth without having to take on the overhead of confidential VMs.
At that point, migration and probably even swap are on the table.

> And swapping is theoretically possible, but I'm not aware of any plans as of
> now.

Ya, I highly doubt confidential VMs will ever bother with swap.

> > I'm afraid of the special demands you may make of memory allocation
> > later on - surprised that huge pages are not mentioned already;
> > gigantic contiguous extents? secretmem removed from direct map?
> 
> The design allows for extension to hugetlbfs if needed. Combination of
> MFD_INACCESSIBLE | MFD_HUGETLB should route this way. There should be zero
> implications for shmem. It is going to be a separate struct
> memfile_backing_store.
> 
> I'm not sure secretmem is a fit here as we want to extend MFD_INACCESSIBLE
> to be movable if platform supports it and secretmem is not migratable by
> design (without direct mapping fragmentations).

But secretmem _could_ be a fit.  If a use case wants to unmap guest private
memory from both userspace and the kernel then KVM should absolutely be able
to support that, but at the same time I don't want to have to update KVM to
enable secretmem (and I definitely don't want KVM poking into the directmap
itself).

MFD_INACCESSIBLE should only say "this memory can't be mapped into userspace";
any other properties should be completely separate.  E.g. the inability to
migrate pages is effectively a restriction from KVM (acting on behalf of
TDX/SNP), it's not a fundamental property of MFD_INACCESSIBLE.
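
To illustrate the layering (MFD_INACCESSIBLE and the MEMFILE_F_* flags are
from this series, not merged uAPI/kAPI; the comments are one reading of them):

/* Userspace asks only for "can't be mapped into userspace": */
int fd = memfd_create("guest-ram", MFD_INACCESSIBLE | MFD_HUGETLB);

/* Orthogonal, kernel-internal properties, negotiated per consumer
 * (e.g. KVM acting on behalf of TDX/SNP): */
#define MEMFILE_F_USER_INACCESSIBLE   BIT(0)  /* no mmap()/read()/write() */
#define MEMFILE_F_UNMOVABLE           BIT(1)  /* pages can't be migrated */
#define MEMFILE_F_UNRECLAIMABLE       BIT(2)  /* pages can't be swapped */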



Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-08-16 Thread Sean Christopherson
On Tue, Aug 16, 2022, Gupta, Pankaj wrote:
> 
> > > > Actually the current version allows you to delay the allocation to a
> > > > later time (e.g. page fault time) if you don't call fallocate() on the
> > > > private fd. fallocate() is necessary in previous versions because we
> > > > treat the existence in the fd as 'private' but in this version we track
> > > > private/shared info in KVM so we don't rely on that fact from memory
> > > > backstores.
> > > 
> > > Does this also mean reservation of guest physical memory with secure
> > > processor (both for SEV-SNP & TDX) will also happen at page fault time?
> > > 
> > > Do we plan to keep it this way?
> > 
> > If you are talking about accepting memory by the guest, it is initiated by
> > the guest and has nothing to do with page fault time vs fallocate()
> > allocation of host memory. I mean acceptance happens after host memory
> > allocation but they are not in lockstep, acceptance can happen much later.
> 
> No, I meant reserving guest physical memory range from hypervisor e.g with
> RMPUpdate for SEV-SNP or equivalent at TDX side (PAMTs?).

As proposed, RMP/PAMT updates will occur in the fault path, i.e. there is no way
for userspace to pre-map guest memory.

I think the best approach is to turn KVM_TDX_INIT_MEM_REGION into a generic
vCPU-scoped ioctl() that allows userspace to pre-map guest memory.  Supporting
initializing guest private memory with a source page can be implemented via a
flag.  That also gives KVM line of sight to in-place "conversion", e.g. another
flag could be added to say that the dest is also the source.

The TDX and SNP restrictions would then become additional restrictions on when
initializing with a source is allowed (and VMs that don't have guest private
memory wouldn't allow the flag at all).
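
As a strawman, such a generic ioctl() could look like the below; the struct
and flag names are invented here for illustration, not a proposed ABI:

/* Hypothetical vCPU-scoped ioctl() generalizing KVM_TDX_INIT_MEM_REGION. */
struct kvm_map_memory {
        __u64 gpa;      /* guest physical address to pre-map */
        __u64 size;     /* bytes, page-aligned */
        __u64 source;   /* userspace source, iff WITH_SOURCE is set */
        __u32 flags;
        __u32 pad;
};

#define KVM_MAP_MEMORY_WITH_SOURCE  (1u << 0)   /* initialize from @source */
#define KVM_MAP_MEMORY_IN_PLACE     (1u << 1)   /* dest is also the source */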



Re: [PATCH v7 03/14] mm: Introduce memfile_notifier

2022-08-10 Thread Sean Christopherson
+Will

On Wed, Aug 10, 2022, David Hildenbrand wrote:
> On 10.08.22 11:22, Chao Peng wrote:
> > On Fri, Aug 05, 2022 at 03:22:58PM +0200, David Hildenbrand wrote:
> >> On 06.07.22 10:20, Chao Peng wrote:
> >>> This patch introduces the memfile_notifier facility so existing memory
> >>> file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to allow
> >>> a third kernel component to make use of memory bookmarked in the memory
> >>> file and get notified when the pages in the memory file become
> >>> invalidated.
> >>
> >> Stupid question, but why is this called "memfile_notifier" and not
> >> "memfd_notifier". We're only dealing with memfd's after all ... which
> >> are anonymous files essentially. Or what am I missing? Are there any
> >> other plans for fs than plain memfd support that I am not aware of?
> > 
> > There were some discussions on this in v3.
> >   https://lkml.org/lkml/2021/12/28/484
> > Sean commented it's OK to abstract it from memfd, but he also wants the
> > kAPI (name) to not bind to memfd, to make room for future non-memfd
> > usages.
> 
> Sorry, but how is "memfile" any better? memfd abstracted to memfile?! :)

FWIW, I don't really like the memfile name either.

> I understand Sean's suggestion about abstracting, but if the new name
> makes it harder to grasp and there isn't really an alternative to memfd
> in sight, I'm not so sure I enjoy the tried abstraction here.

ARM's pKVM implementation is potentially (hopefully) going to switch to this API
(as a consumer) sooner than later.  If they anticipate being able to use memfd,
then there's unlikely to be a second backing type any time soon.

Quentin, Will?
 
> Otherwise we'd have to get creative now and discuss something like
> "file_population_notifier" or "mapping_population_notifier" and I am not
> sure that our time is well spent doing so right now.
> 
> ... as this is kernel-internal, we can always adjust the name as we
> please later, once we *actually* know what the abstraction should be.
> Until then I'd suggest to KIS and soft-glue this to memfd.
> 
> Or am I missing something important?

I don't think you're missing anything.  I'd still prefer a name that doesn't
couple KVM to memfd, but it's not a sticking point, and I've never been able
to come up with a better name...

With a little bit of cleverness I think we can keep the coupling in KVM to a
minimum, which is what I really care about.



Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions

2022-08-03 Thread Sean Christopherson
On Wed, Aug 03, 2022, Chao Peng wrote:
> On Tue, Aug 02, 2022 at 04:38:55PM +0000, Sean Christopherson wrote:
> > On Tue, Aug 02, 2022, Sean Christopherson wrote:
> > > I think we should avoid UNMAPPABLE even on the KVM side of things for the
> > > core memslots functionality and instead be very literal, e.g.
> > > 
> > >   KVM_HAS_FD_BASED_MEMSLOTS
> > >   KVM_MEM_FD_VALID
> > > 
> > > We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied
> > > directly to the memslot.  Decoupling the two things will require a bit of
> > > extra work, but the code impact should be quite small, e.g. explicitly
> > > query and propagate MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to
> > > track if a memslot can be private.  And unless I'm missing something, it
> > > won't require an additional memslot flag.  The biggest oddity (if we don't
> > > also add KVM_MEM_PRIVATE) is that KVM would effectively ignore the hva for
> > > fd-based memslots for VM types that don't support private memory, i.e.
> > > userspace can't opt out of using the fd-based backing, but that doesn't
> > > seem like a deal breaker.
> 
> I actually love this idea. I don't mind adding extra code for potential
> usage other than confidential VMs if we can have a workable solution for
> it.
> 
> > 
> > Hrm, but basing private memory on top of a generic FD_VALID would
> > effectively require shared memory to use hva-based memslots for
> > confidential VMs.  That'd yield a very weird API, e.g. non-confidential
> > VMs could be backed entirely by fd-based memslots, but confidential VMs
> > would be forced to use hva-based memslots.
> 
> It would work if we can treat userspace_addr as optional for
> KVM_MEM_FD_VALID, e.g. userspace can opt in to decide whether it needs the
> mappable part or not for a regular VM, and we can enforce fd-only in KVM
> for confidential VMs. But the u64 type of userspace_addr doesn't allow us
> to express a 'null' value, so it sounds like we will end up needing another
> flag anyway.
> 
> In concept, we could have three configurations here:
>   1. hva-only: without any flag and use userspace_addr;
>   2. fd-only:  another new flag is needed and use fd/offset;
>   3. hva/fd mixed: both userspace_addr and fd/offset is effective.
>  KVM_MEM_PRIVATE is a subset of it for confidential VMs. Not sure
>  regular VM also wants this.

My mental model breaks things down slightly differently, though the end result
is more or less the same.

After this series, there will be two types of memory: private and "regular"
(I'm trying to avoid "shared").  "Regular" memory is always hva-based
(userspace_addr), and private is always fd-based (fd+offset).

In the future, if we want to support fd-based memory for "regular" memory, then
as you said we'd need to add a new flag, and a new fd+offset pair.

At that point, we'd have two new (relatively to current) flags:

  KVM_MEM_PRIVATE_FD_VALID
  KVM_MEM_FD_VALID

along with two new pairs of fd+offset (private_* and "regular").  Mapping those
to your above list:
  
  1.  Neither *_FD_VALID flag set.
  2a. Both PRIVATE_FD_VALID and FD_VALID are set
  2b. FD_VALID is set and the VM doesn't support private memory
  3.  Only PRIVATE_FD_VALID is set (which requires private memory support in the VM).

Thus, "regular" VMs can't have a mix in a single memslot because they can't use
private memory.
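
Spelled out, the hypothetical end state with two independent fd+offset pairs
would look roughly like the below (illustrative only, extending this series'
_ext struct; not a proposed ABI):

struct kvm_userspace_memory_region_ext2 {
        struct kvm_userspace_memory_region region; /* hva-based portion */
        __u64 private_offset;
        __u32 private_fd;   /* valid iff KVM_MEM_PRIVATE_FD_VALID is set */
        __u32 fd;           /* valid iff KVM_MEM_FD_VALID is set */
        __u64 offset;
};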

> There is no direct relationship between unmappable and fd-based since
> even fd-based can also be mappable for regular VM?

Yep.

> > Ignore this idea for now.  If there's an actual use case for generic
> > fd-based memory then we'll want a separate flag, fd, and offset, i.e.
> > that support could be added independent of KVM_MEM_PRIVATE.
> 
> If we ignore this idea for now (which I'm also fine with), do you still
> think we need to change KVM_MEM_PRIVATE to KVM_MEM_USER_UNMAPPABLE?

Hmm, no.  After working through this, I think it's safe to say
KVM_MEM_USER_UNMAPPABLE is a bad name because we could end up with "regular"
memory that's backed by an inaccessible (unmappable) file.

One alternative would be to call it KVM_MEM_PROTECTED.  That shouldn't cause
problems for the known use of "private" (TDX and SNP), and it gives us a little
wiggle room, e.g. if we ever get a use case where VMs can share memory that is
otherwise protected.

That's a pretty big "if" though, and odds are good we'd need more memslot
flags and fd+offset pairs to allow differentiating "private" vs.
"protected-shared" with

Re: [PATCH v7 09/14] KVM: Extend the memslot to support fd-based private memory

2022-08-03 Thread Sean Christopherson
On Wed, Aug 03, 2022, Chao Peng wrote:
> On Fri, Jul 29, 2022 at 07:51:29PM +0000, Sean Christopherson wrote:
> > On Wed, Jul 06, 2022, Chao Peng wrote:
> > > @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
> > >   __u64 userspace_addr; /* start of the userspace allocated memory */
> > >};
> > >  
> > > +  struct kvm_userspace_memory_region_ext {
> > > + struct kvm_userspace_memory_region region;
> > > + __u64 private_offset;
> > > + __u32 private_fd;
> > > + __u32 pad1;
> > > + __u64 pad2[14];
> > > +};
> > > +
> > >/* for kvm_memory_region::flags */
>#define KVM_MEM_LOG_DIRTY_PAGES   (1UL << 0)
>#define KVM_MEM_READONLY          (1UL << 1)
> +  #define KVM_MEM_PRIVATE        (1UL << 2)
> > 
> > Very belatedly following up on prior feedback...
> > 
> >   | I think a flag is still needed, the problem is private_fd can be safely
> >   | accessed only when this flag is set, e.g. without this flag, we can't
> >   | copy_from_user these new fields since they don't exist for previous
> >   | kvm_userspace_memory_region callers.
> > 
> > I forgot about that aspect of things.  We don't technically need a
> > dedicated PRIVATE flag to handle that, but it does seem to be the least
> > awful solution.  We could either add a generic KVM_MEM_EXTENDED_REGION or
> > an entirely new ioctl(), e.g. KVM_SET_USER_MEMORY_REGION2, but in both
> > approaches there's a decent chance that we'll end up needing individual
> > "this field is valid" flags anyways.
> > 
> > E.g. if KVM requires pad1 and pad2 to be zero to carve out future
> > extensions, then we're right back here if some future extension needs to
> > treat '0' as a legal input.
> 
> I had such practice (always rejecting non-zero 'pad' values when
> introducing new user APIs) in other projects previously, but I rarely
> see that in KVM.

Ya, KVM often uses flags to indicate the validity of a field specifically so
that KVM doesn't misinterpret a '0' from an older userspace as an intended
value.
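
The pattern, as a sketch in the context of kvm_vm_ioctl() (struct and flag
names are from this series; the surrounding variables are assumed to be
declared as in the patch):

        /*
         * Only read the extended fields when the flag says they exist; an
         * older userspace never sets KVM_MEM_PRIVATE, so KVM never reads
         * past the legacy struct.
         */
        size = sizeof(struct kvm_userspace_memory_region);
        if (flags & KVM_MEM_PRIVATE)
                size = sizeof(struct kvm_userspace_memory_region_ext);

        if (copy_from_user(&mem, argp, size))
                goto out;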



Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions

2022-08-02 Thread Sean Christopherson
On Tue, Aug 02, 2022, Sean Christopherson wrote:
> I think we should avoid UNMAPPABLE even on the KVM side of things for the
> core memslots functionality and instead be very literal, e.g.
> 
>   KVM_HAS_FD_BASED_MEMSLOTS
>   KVM_MEM_FD_VALID
> 
> We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied
> directly to the memslot.  Decoupling the two things will require a bit of
> extra work, but the code impact should be quite small, e.g. explicitly
> query and propagate MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track
> if a memslot can be private.  And unless I'm missing something, it won't
> require an additional memslot flag.  The biggest oddity (if we don't also
> add KVM_MEM_PRIVATE) is that KVM would effectively ignore the hva for
> fd-based memslots for VM types that don't support private memory, i.e.
> userspace can't opt out of using the fd-based backing, but that doesn't
> seem like a deal breaker.

Hrm, but basing private memory on top of a generic FD_VALID would effectively
require shared memory to use hva-based memslots for confidential VMs.  That'd
yield a very weird API, e.g. non-confidential VMs could be backed entirely by
fd-based memslots, but confidential VMs would be forced to use hva-based
memslots.

Ignore this idea for now.  If there's an actual use case for generic fd-based
memory then we'll want a separate flag, fd, and offset, i.e. that support
could be added independent of KVM_MEM_PRIVATE.



Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions

2022-08-01 Thread Sean Christopherson
On Fri, Jul 29, 2022, Sean Christopherson wrote:
> On Mon, Jul 25, 2022, Chao Peng wrote:
> > On Thu, Jul 21, 2022 at 05:58:50PM +0000, Sean Christopherson wrote:
> > > On Thu, Jul 21, 2022, Chao Peng wrote:
> > > > On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
> > > > > 
> > > > > 
> > > > > On 7/21/22 00:21, Sean Christopherson wrote:
> > > > > Maybe you could tag it with cgs for all the confidential guest support
> > > > > related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
> > > > > 
> > > > > bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > > > > ...
> > > > > kvm_vm_ioctl_set_cgs_mem(, is_private)
> > > > 
> > > > If we plan to widely use such abbr. through KVM (e.g. it's well known),
> > > > I'm fine.
> > > 
> > > I'd prefer to stay away from "confidential guest", and away from any
> > > VM-scoped name for that matter.  User-unmappable memory has use cases
> > > beyond hiding guest state from the host, e.g. userspace could use
> > > inaccessible/unmappable memory to harden itself against unintentional
> > > access to guest memory.
> > > 
> > > > I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610
> > > > But I also don't quite like it, it's so generic and says nothing.
> > > > 
> > > > But I do want a name that can cover future usages other than just
> > > > private/shared (pKVM for example may have a third state).
> > > 
> > > I don't think there can be a third top-level state.  Memory is either
> > > private to the guest or it's not.  There can be sub-states, e.g. memory
> > > could be selectively shared or encrypted with a different key, in which
> > > case we'd need metadata to track that state.
> > > 
> > > Though that begs the question of whether or not private_fd is the correct
> > > terminology.  E.g. if guest memory is backed by a memfd that can't be
> > > mapped by userspace (currently F_SEAL_INACCESSIBLE), but something else
> > > in the kernel plugs that memory into a device or another VM, then
> > > arguably that memory is shared, especially in the multi-VM scenario.
> > > 
> > > For TDX and SNP "private vs. shared" is likely the correct terminology
> > > given the current specs, but for generic KVM it's probably better to
> > > align with whatever terminology is used for memfd.  "inaccessible_fd"
> > > and "user_inaccessible_fd" are a bit odd since the fd itself is
> > > accessible.
> > > 
> > > What about "user_unmappable"?  E.g.
> > > 
> > >   F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE,
> > >   KVM_HAS_USER_UNMAPPABLE_MEMORY, MEMFILE_F_USER_INACCESSIBLE,
> > >   user_unmappable_fd, etc...
> > 
> > For KVM I also think user_unmappable looks better than 'private', e.g.
> > user_unmappable_fd/KVM_HAS_USER_UNMAPPABLE_MEMORY sound like more
> > appropriate names. For memfd however, I don't feel that strongly about
> > changing it from the current 'inaccessible' to 'user_unmappable'; one
> > reason is it's not just about being unmappable, but actually also
> > inaccessible through direct syscalls like read()/write().
> 
> Heh, I _knew_ there had to be a catch.  I agree that INACCESSIBLE is better
> for memfd.
> memfd.

Thought about this some more...

I think we should avoid UNMAPPABLE even on the KVM side of things for the core
memslots functionality and instead be very literal, e.g.

  KVM_HAS_FD_BASED_MEMSLOTS
  KVM_MEM_FD_VALID

We'll still need KVM_HAS_USER_UNMAPPABLE_MEMORY, but it won't be tied directly
to the memslot.  Decoupling the two things will require a bit of extra work,
but the code impact should be quite small, e.g. explicitly query and propagate
MEMFILE_F_USER_INACCESSIBLE to kvm_memory_slot to track if a memslot can be
private.  And unless I'm missing something, it won't require an additional
memslot flag.  The biggest oddity (if we don't also add KVM_MEM_PRIVATE) is
that KVM would effectively ignore the hva for fd-based memslots for VM types
that don't support private memory, i.e. userspace can't opt out of using the
fd-based backing, but that doesn't seem like a deal breaker.

Decoupling private memory from fd-based memslots will allow using fd-based
memslots for backing VMs even if the memory is user mappable, which opens up
potentially interesting use

Re: [PATCH v7 12/14] KVM: Handle page fault for private memory

2022-07-29 Thread Sean Christopherson
On Wed, Jul 06, 2022, Chao Peng wrote:
> A page fault can carry the private/shared information for a
> KVM_MEM_PRIVATE memslot; this can be filled in by architecture code (like
> TDX code). To handle a page fault for such an access, KVM maps the page
> only when the private property matches the host's view of the page.
> 
> For a successful match, the private pfn is obtained with memfile_notifier
> callbacks from the private fd, and the shared pfn is obtained with the
> existing get_user_pages().
> 
> For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
> userspace. Userspace can then convert the memory between private/shared
> from the host's view and retry the access.
> 
> Co-developed-by: Yu Zhang 
> Signed-off-by: Yu Zhang 
> Signed-off-by: Chao Peng 
> ---
>  arch/x86/kvm/mmu/mmu.c          | 60 +++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/mmu/mmu_internal.h | 18 ++++++++++++++++++
>  arch/x86/kvm/mmu/mmutrace.h     |  1 +
>  include/linux/kvm_host.h        | 35 ++++++++++++++++++++++++++++++++++-
>  4 files changed, 112 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 545eb74305fe..27dbdd4fe8d1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3004,6 +3004,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>   if (max_level == PG_LEVEL_4K)
>   return PG_LEVEL_4K;
>  
> + if (kvm_mem_is_private(kvm, gfn))
> + return max_level;
> +
>   host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
>   return min(host_level, max_level);
>  }
> @@ -4101,10 +4104,52 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
>   kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
>  }
>  
> +static inline u8 order_to_level(int order)
> +{
> + enum pg_level level;
> +
> + for (level = KVM_MAX_HUGEPAGE_LEVEL; level > PG_LEVEL_4K; level--)

Curly braces needed for the for-loop.

And I think it makes sense to take in the fault->max_level, that way this is
slightly more performant when the guest mapping is smaller than the host, e.g.

for (level = max_level; level > PG_LEVEL_4K; level--)
...

return level;

Though I think I'd vote to avoid a loop entirely and do:

BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);

if (order > ???)
return PG_LEVEL_1G;

if (order > ???)
return PG_LEVEL_2M;

return PG_LEVEL_4K;
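
One plausible way to fill in the "???" thresholds, assuming x86's page table
geometry (this is illustrative, not code from the series):

static inline u8 order_to_level(int order)
{
        BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);

        if (order >= PUD_SHIFT - PAGE_SHIFT)    /* order >= 18 => 1GiB */
                return PG_LEVEL_1G;

        if (order >= PMD_SHIFT - PAGE_SHIFT)    /* order >= 9 => 2MiB */
                return PG_LEVEL_2M;

        return PG_LEVEL_4K;
}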


> +     if (order >= page_level_shift(level) - PAGE_SHIFT)
> +         return level;
> + return level;
> +}
> +
> +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> +struct kvm_page_fault *fault)
> +{
> + int order;
> + struct kvm_memory_slot *slot = fault->slot;
> + bool private_exist = kvm_mem_is_private(vcpu->kvm, fault->gfn);
> +
> + if (fault->is_private != private_exist) {
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> + if (fault->is_private)
> + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
> + else
> + vcpu->run->memory.flags = 0;
> + vcpu->run->memory.padding = 0;
> + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT;
> + vcpu->run->memory.size = PAGE_SIZE;
> + return RET_PF_USER;
> + }
> +
> + if (fault->is_private) {
> + if (kvm_private_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order))
> + return RET_PF_RETRY;
> + fault->max_level = min(order_to_level(order), fault->max_level);
> + fault->map_writable = !(slot->flags & KVM_MEM_READONLY);
> + return RET_PF_FIXED;
> + }
> +
> + /* Fault is shared, fallthrough. */
> + return RET_PF_CONTINUE;
> +}
> +
>  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>   struct kvm_memory_slot *slot = fault->slot;
>   bool async;
> + int r;
>  
>   /*
>* Retry the page fault if the gfn hit a memslot that is being deleted
> @@ -4133,6 +4178,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   return RET_PF_EMULATE;
>   }
>  
> + if (kvm_slot_can_be_private(slot)) {
> + r = kvm_faultin_pfn_private(vcpu, fault);
> + if (r != RET_PF_CONTINUE)
> + return r == RET_PF_FIXED ? RET_PF_CONTINUE : r;

I apologize if I've given you conflicting feedback in the past.  Now that this
returns RET_PF_* directly, I definitely think it makes sense to do:

if (kvm_slot_can_be_private(slot) &&
fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
if (fault->is_private)
vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE;
else
 

Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions

2022-07-29 Thread Sean Christopherson
On Mon, Jul 25, 2022, Chao Peng wrote:
> On Thu, Jul 21, 2022 at 05:58:50PM +0000, Sean Christopherson wrote:
> > On Thu, Jul 21, 2022, Chao Peng wrote:
> > > On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
> > > > 
> > > > 
> > > > On 7/21/22 00:21, Sean Christopherson wrote:
> > > > Maybe you could tag it with cgs for all the confidential guest support
> > > > related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
> > > > 
> > > > bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > > > ...
> > > > kvm_vm_ioctl_set_cgs_mem(, is_private)
> > > 
> > > If we plan to widely use such abbr. through KVM (e.g. it's well known),
> > > I'm fine.
> > 
> > I'd prefer to stay away from "confidential guest", and away from any
> > VM-scoped name for that matter.  User-unmappable memory has use cases
> > beyond hiding guest state from the host, e.g. userspace could use
> > inaccessible/unmappable memory to harden itself against unintentional
> > access to guest memory.
> > 
> > > I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610
> > > But I also don't quite like it, it's so generic and says nothing.
> > > 
> > > But I do want a name that can cover future usages other than just
> > > private/shared (pKVM for example may have a third state).
> > 
> > I don't think there can be a third top-level state.  Memory is either
> > private to the guest or it's not.  There can be sub-states, e.g. memory
> > could be selectively shared or encrypted with a different key, in which
> > case we'd need metadata to track that state.
> > 
> > Though that begs the question of whether or not private_fd is the correct
> > terminology.  E.g. if guest memory is backed by a memfd that can't be
> > mapped by userspace (currently F_SEAL_INACCESSIBLE), but something else
> > in the kernel plugs that memory into a device or another VM, then
> > arguably that memory is shared, especially in the multi-VM scenario.
> > 
> > For TDX and SNP "private vs. shared" is likely the correct terminology
> > given the current specs, but for generic KVM it's probably better to
> > align with whatever terminology is used for memfd.  "inaccessible_fd"
> > and "user_inaccessible_fd" are a bit odd since the fd itself is
> > accessible.
> > 
> > What about "user_unmappable"?  E.g.
> > 
> >   F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE,
> >   KVM_HAS_USER_UNMAPPABLE_MEMORY, MEMFILE_F_USER_INACCESSIBLE,
> >   user_unmappable_fd, etc...
> 
> For KVM I also think user_unmappable looks better than 'private', e.g.
> user_unmappable_fd/KVM_HAS_USER_UNMAPPABLE_MEMORY sound like more
> appropriate names. For memfd however, I don't feel that strongly about
> changing it from the current 'inaccessible' to 'user_unmappable'; one
> reason is it's not just about being unmappable, but actually also
> inaccessible through direct syscalls like read()/write().

Heh, I _knew_ there had to be a catch.  I agree that INACCESSIBLE is better
for memfd.



Re: [PATCH v7 09/14] KVM: Extend the memslot to support fd-based private memory

2022-07-29 Thread Sean Christopherson
On Wed, Jul 06, 2022, Chao Peng wrote:
> @@ -1332,9 +1332,18 @@ yet and must be cleared on entry.
>   __u64 userspace_addr; /* start of the userspace allocated memory */
>};
>  
> +  struct kvm_userspace_memory_region_ext {
> + struct kvm_userspace_memory_region region;
> + __u64 private_offset;
> + __u32 private_fd;
> + __u32 pad1;
> + __u64 pad2[14];
> +};
> +
>/* for kvm_memory_region::flags */
>#define KVM_MEM_LOG_DIRTY_PAGES   (1UL << 0)
>#define KVM_MEM_READONLY          (1UL << 1)
> +  #define KVM_MEM_PRIVATE        (1UL << 2)

Very belatedly following up on prior feedback...

  | I think a flag is still needed, the problem is private_fd can be safely
  | accessed only when this flag is set, e.g. without this flag, we can't
  | copy_from_user these new fields since they don't exist for previous
  | kvm_userspace_memory_region callers.

I forgot about that aspect of things.  We don't technically need a dedicated
PRIVATE flag to handle that, but it does seem to be the least awful solution.
We could either add a generic KVM_MEM_EXTENDED_REGION or an entirely new
ioctl(), e.g. KVM_SET_USER_MEMORY_REGION2, but in both approaches there's a
decent chance that we'll end up needing individual "this field is valid"
flags anyways.

E.g. if KVM requires pad1 and pad2 to be zero to carve out future extensions,
then we're right back here if some future extension needs to treat '0' as a
legal input.

TL;DR: adding KVM_MEM_PRIVATE still seems like the best approach.

> @@ -4631,14 +4658,35 @@ static long kvm_vm_ioctl(struct file *filp,
>   break;
>   }
>   case KVM_SET_USER_MEMORY_REGION: {
> - struct kvm_userspace_memory_region kvm_userspace_mem;
> + struct kvm_user_mem_region mem;
> + unsigned long size;
> + u32 flags;
> +
> + kvm_sanity_check_user_mem_region_alias();
> +
> + memset(&mem, 0, sizeof(mem));
>  
>   r = -EFAULT;
> - if (copy_from_user(&kvm_userspace_mem, argp,
> - sizeof(kvm_userspace_mem)))
> +
> + if (get_user(flags,
> + (u32 __user *)(argp + offsetof(typeof(mem), flags))))
> + goto out;


Indentation is funky.  It's hard to massage this into something short and
readable.  What about capturing the offset separately?  E.g.

struct kvm_user_mem_region mem;
unsigned int flags_offset = offsetof(typeof(mem), flags);
unsigned long size;
u32 flags;

kvm_sanity_check_user_mem_region_alias();

memset(&mem, 0, sizeof(mem));

r = -EFAULT;
if (get_user(flags, (u32 __user *)(argp + flags_offset)))
goto out;

But this can actually be punted until KVM_MEM_PRIVATE is fully supported.  As
of this patch, KVM doesn't read the extended size, so I believe the diff for
this patch can simply be:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index da263c370d00..5194beb7b52f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4640,6 +4640,10 @@ static long kvm_vm_ioctl(struct file *filp,
sizeof(kvm_userspace_mem)))
goto out;

+   r = -EINVAL;
+   if (mem.flags & KVM_MEM_PRIVATE)
+   goto out;
+
r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem);
break;
}




Re: [PATCH v7 08/14] KVM: Rename mmu_notifier_*

2022-07-29 Thread Sean Christopherson
On Wed, Jul 06, 2022, Chao Peng wrote:
> The sync mechanism between mmu_notifier and the page fault handler employs
> the fields mmu_notifier_seq/count and mmu_notifier_range_start/end. The
> to-be-added private memory needs the same mechanism, but does not rely on
> mmu_notifier (it uses the newly introduced memfile_notifier). This patch
> renames the existing fields and related helper functions to the neutral
> name mmu_updating_* so private memory can reuse them.

mmu_updating_* is too broad of a term, e.g. page faults and many other
operations also update the mmu.  Although the name most definitely came from
the mmu_notifier, it's not completely inaccurate for other sources, e.g. KVM's
MMU is still being notified of something, even if the source is not the actual
mmu_notifier.

If we really want a different name, I'd vote for nomenclature that captures
the invalidation aspect, which is really what the variables are all tracking,
e.g.

  mmu_invalidate_seq
  mmu_invalidate_in_progress
  mmu_invalidate_range_start
  mmu_invalidate_range_end
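
For context, a sketch of how those fields get consumed by the fault path
(mirroring KVM's existing retry logic, with the new names substituted; mmu_seq
is the sequence count snapshotted before the pfn was resolved):

        /*
         * Under mmu_lock: bail if an invalidation is in-progress and covers
         * this gfn, or if any invalidation completed since mmu_seq was
         * snapshotted.
         */
        if (kvm->mmu_invalidate_in_progress &&
            gfn >= kvm->mmu_invalidate_range_start &&
            gfn < kvm->mmu_invalidate_range_end)
                return RET_PF_RETRY;

        if (kvm->mmu_invalidate_seq != mmu_seq)
                return RET_PF_RETRY;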




Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

2022-07-21 Thread Sean Christopherson
On Thu, Jul 21, 2022, Gupta, Pankaj wrote:
> 
> Hi Sean, Chao,
> 
> While attempting to solve the pre-boot guest payload/firmware population
> into private memory for SEV-SNP, I retrieved this thread. I have a question
> below:
> 
> > > > Requirements & Gaps
> > > > -
> > > >   - Confidential computing (CC): TDX/SEV/CCA
> > > >     * Need to support both explicit/implicit conversions.
> > > >     * Need to support only destructive conversion at runtime.
> > > >     * The current patch should just work, but prefer to have pre-boot
> > > >       guest payload/firmware population into private memory for
> > > >       performance.
> > > 
> > > Not just performance in the case of SEV, it's needed there because
> > > firmware only supports in-place encryption of guest memory; there's no
> > > mechanism to provide a separate buffer to load into guest memory at
> > > pre-boot time. I think you're aware of this but wanted to point that out
> > > just in case.
> > 
> > I view it as a performance problem because nothing stops KVM from copying
> > from userspace into the private fd during the SEV ioctl().  What's missing
> > is the ability for userspace to directly initialize the private fd, which
> > may or may not avoid an extra memcpy() depending on how clever userspace
> > is.
> Can you please elaborate on what you see as a performance problem? And on
> possible ways to solve it?

Oh, I'm not saying there actually _is_ a performance problem.  What I'm saying
is that in-place encryption is not a functional requirement, which means it's
purely an optimization, and thus we should only bother supporting in-place
encryption _if_ it would solve a performance bottleneck.



Re: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions

2022-07-21 Thread Sean Christopherson
On Thu, Jul 21, 2022, Chao Peng wrote:
> On Thu, Jul 21, 2022 at 03:34:59PM +0800, Wei Wang wrote:
> > 
> > 
> > On 7/21/22 00:21, Sean Christopherson wrote:
> > Maybe you could tag it with cgs for all the confidential guest support
> > related stuff: e.g. kvm_vm_ioctl_set_cgs_mem()
> > 
> > bool is_private = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION;
> > ...
> > kvm_vm_ioctl_set_cgs_mem(, is_private)
> 
> If we plan to widely use such abbr. through KVM (e.g. it's well known),
> I'm fine.

I'd prefer to stay away from "confidential guest", and away from any VM-scoped
name for that matter.  User-unmappable memory has use cases beyond hiding
guest state from the host, e.g. userspace could use inaccessible/unmappable
memory to harden itself against unintentional access to guest memory.

> I actually use mem_attr in patch: https://lkml.org/lkml/2022/7/20/610
> But I also don't quite like it, it's so generic and says nothing.
> 
> But I do want a name that can cover future usages other than just
> private/shared (pKVM for example may have a third state).

I don't think there can be a third top-level state.  Memory is either private
to the guest or it's not.  There can be sub-states, e.g. memory could be
selectively shared or encrypted with a different key, in which case we'd need
metadata to track that state.

Though that begs the question of whether or not private_fd is the correct
terminology.  E.g. if guest memory is backed by a memfd that can't be mapped
by userspace (currently F_SEAL_INACCESSIBLE), but something else in the kernel
plugs that memory into a device or another VM, then arguably that memory is
shared, especially in the multi-VM scenario.

For TDX and SNP "private vs. shared" is likely the correct terminology given
the current specs, but for generic KVM it's probably better to align with
whatever terminology is used for memfd.  "inaccessible_fd" and
"user_inaccessible_fd" are a bit odd since the fd itself is accessible.

What about "user_unmappable"?  E.g.

  F_SEAL_USER_UNMAPPABLE, MFD_USER_UNMAPPABLE, KVM_HAS_USER_UNMAPPABLE_MEMORY,
  MEMFILE_F_USER_INACCESSIBLE, user_unmappable_fd, etc...

that gives us flexibility to map the memory from within the kernel, e.g. into
other VMs or devices.

Hmm, and then keep your original "mem_attr_array" name?  And probably 

 int kvm_vm_ioctl_set_mem_attr(struct kvm *kvm, gpa_t gpa, gpa_t size,
   bool is_user_mappable)

Then the x86/mmu code for TDX/SNP private faults could be:

is_private = !kvm_is_gpa_user_mappable();

if (fault->is_private != is_private) {

or if we want to avoid mixing up "user_mappable" and "user_unmappable":

is_private = kvm_is_gpa_user_unmappable();

if (fault->is_private != is_private) {

though a helper that returns a negative (not mappable) feels kludgy.  And I
like kvm_is_gpa_user_mappable() because then when there's no "special" memory,
it defaults to true, which is more intuitive IMO.

And then if the future needs more precision, e.g. user-unmappable memory isn't
necessarily guest-exclusive, the uAPI names still work even though KVM
internals will need to be reworked, but that's unavoidable.  E.g. piggybacking
KVM_MEMORY_ENCRYPT_(UN)REG_REGION doesn't allow for further differentiation,
so we'd need to _extend_ the uAPI, but the _existing_ uAPI would still be sane.


