subject:"\[PATCH v8 1\/8\] mm\/memfd\: Introduce userspace inaccessible memfd"

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-11-03 Thread Vishal Annapurve

On Mon, Oct 24, 2022 at 8:30 PM Kirill A . Shutemov
 wrote:
>
> On Fri, Oct 21, 2022 at 04:18:14PM +, Sean Christopherson wrote:
> > On Fri, Oct 21, 2022, Chao Peng wrote:
> > > >
> > > > In the context of userspace inaccessible memfd, what would be a
> > > > suggested way to enforce NUMA memory policy for physical memory
> > > > allocation? mbind[1] won't work here in absence of virtual address
> > > > range.
> > >
> > > How about set_mempolicy():
> > > https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
> >
> > Andy Lutomirski brought this up in an off-list discussion way back when the 
> > whole
> > private-fd thing was first being proposed.
> >
> >   : The current Linux NUMA APIs (mbind, move_pages) work on virtual 
> > addresses.  If
> >   : we want to support them for TDX private memory, we either need TDX 
> > private
> >   : memory to have an HVA or we need file-based equivalents. Arguably we 
> > should add
> >   : fmove_pages and fbind syscalls anyway, since the current API is quite 
> > awkward
> >   : even for tools like numactl.
>
> Yeah, we definitely have gaps in API wrt NUMA, but I don't think it be
> addressed in the initial submission.
>
> BTW, it is not regression comparing to old KVM slots, if the memory is
> backed by memfd or other file:
>
> MBIND(2)
>The  specified policy will be ignored for any MAP_SHARED mappings in 
> the
>specified memory range.  Rather the pages will be allocated according 
> to
>the  memory  policy  of the thread that caused the page to be 
> allocated.
>Again, this may not be the thread that called mbind().
>
> It is not clear how to define fbind(2) semantics, considering that multiple
> processes may compete for the same region of page cache.
>
> Should it be per-inode or per-fd? Or maybe per-range in inode/fd?
>

David's analysis on mempolicy with shmem seems to be right. set_policy
on virtual address range does seem to change the shared policy for the
inode irrespective of the mapping type.

Maybe having a way to set numa policy per-range in the inode would be
at par with what we can do today via mbind on virtual address ranges.



> fmove_pages(2) should be relatively straight forward, since it is
> best-effort and does not guarantee that the page will note be moved
> somewhare else just after return from the syscall.
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-24 Thread David Hildenbrand


On 24.10.22 16:59, Kirill A . Shutemov wrote:

On Fri, Oct 21, 2022 at 04:18:14PM +, Sean Christopherson wrote:

On Fri, Oct 21, 2022, Chao Peng wrote:


In the context of userspace inaccessible memfd, what would be a
suggested way to enforce NUMA memory policy for physical memory
allocation? mbind[1] won't work here in absence of virtual address
range.


How about set_mempolicy():
https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html


Andy Lutomirski brought this up in an off-list discussion way back when the 
whole
private-fd thing was first being proposed.

   : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses. 
 If
   : we want to support them for TDX private memory, we either need TDX private
   : memory to have an HVA or we need file-based equivalents. Arguably we 
should add
   : fmove_pages and fbind syscalls anyway, since the current API is quite 
awkward
   : even for tools like numactl.


Yeah, we definitely have gaps in API wrt NUMA, but I don't think it be
addressed in the initial submission.

BTW, it is not regression comparing to old KVM slots, if the memory is
backed by memfd or other file:

MBIND(2)
The  specified policy will be ignored for any MAP_SHARED mappings in the
specified memory range.  Rather the pages will be allocated according to
the  memory  policy  of the thread that caused the page to be allocated.
Again, this may not be the thread that called mbind().


IIRC, that documentation is imprecise/incorrect especially when it comes 
to memfd. Page faults in shared mappings will similarly obey the set 
mbind() policy when allocating new pages.


QEMU relies on that.

The "fun" begins when we have multiple mappings, and only some have a 
policy set ... or if we already, previously allocated the pages.


--
Thanks,

David / dhildenb

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-24 Thread Kirill A . Shutemov

On Fri, Oct 21, 2022 at 04:18:14PM +, Sean Christopherson wrote:
> On Fri, Oct 21, 2022, Chao Peng wrote:
> > > 
> > > In the context of userspace inaccessible memfd, what would be a
> > > suggested way to enforce NUMA memory policy for physical memory
> > > allocation? mbind[1] won't work here in absence of virtual address
> > > range.
> > 
> > How about set_mempolicy():
> > https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html
> 
> Andy Lutomirski brought this up in an off-list discussion way back when the 
> whole
> private-fd thing was first being proposed.
> 
>   : The current Linux NUMA APIs (mbind, move_pages) work on virtual 
> addresses.  If
>   : we want to support them for TDX private memory, we either need TDX private
>   : memory to have an HVA or we need file-based equivalents. Arguably we 
> should add
>   : fmove_pages and fbind syscalls anyway, since the current API is quite 
> awkward
>   : even for tools like numactl.

Yeah, we definitely have gaps in API wrt NUMA, but I don't think it be
addressed in the initial submission.

BTW, it is not regression comparing to old KVM slots, if the memory is
backed by memfd or other file:

MBIND(2)
   The  specified policy will be ignored for any MAP_SHARED mappings in the
   specified memory range.  Rather the pages will be allocated according to
   the  memory  policy  of the thread that caused the page to be allocated.
   Again, this may not be the thread that called mbind().

It is not clear how to define fbind(2) semantics, considering that multiple
processes may compete for the same region of page cache.

Should it be per-inode or per-fd? Or maybe per-range in inode/fd?

fmove_pages(2) should be relatively straight forward, since it is
best-effort and does not guarantee that the page will note be moved
somewhare else just after return from the syscall.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-21 Thread Sean Christopherson

On Fri, Oct 21, 2022, Chao Peng wrote:
> On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
> > On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov 
> >  wrote:
> > >
> > > On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > > > I think moving this notifier_invalidate before fallocate may not solve
> > > > the problem completely. Is it possible that between invalidate and
> > > > fallocate, KVM tries to handle the page fault for the guest VM from
> > > > another vcpu and uses the pages to be freed to back gpa ranges? Should
> > > > hole punching here also update mem_attr first to say that KVM should
> > > > consider the corresponding gpa ranges to be no more backed by
> > > > inaccessible memfd?
> > >
> > > We rely on external synchronization to prevent this. See code around
> > > mmu_invalidate_retry_hva().
> > >
> > > --
> > >   Kiryl Shutsemau / Kirill A. Shutemov
> > 
> > IIUC, mmu_invalidate_retry_hva/gfn ensures that page faults on gfn
> > ranges that are being invalidated are retried till invalidation is
> > complete. In this case, is it possible that KVM tries to serve the
> > page fault after inaccessible_notifier_invalidate is complete but
> > before fallocate could punch hole into the files?

It's not just the page fault edge case.  In the more straightforward scenario
where the memory is already mapped into the guest, freeing pages back to the 
kernel
before they are removed from the guest will lead to use-after-free.

> > e.g.
> > inaccessible_notifier_invalidate(...)
> > ... (system event preempting this control flow, giving a window for
> > the guest to retry accessing the gfn range which was invalidated)
> > fallocate(.., PUNCH_HOLE..)
> 
> Looks this is something can happen.
> And sounds to me the solution needs
> just follow the mmu_notifier's way of using a invalidate_start/end pair.
> 
>   invalidate_start()  --> kvm->mmu_invalidate_in_progress++;
>   zap KVM page table entries;
>   fallocate()
>   invalidate_end()  --> kvm->mmu_invalidate_in_progress--;
> 
> Then during invalidate_start/end time window mmu_invalidate_retry_gfn
> checks 'mmu_invalidate_in_progress' and prevent repopulating the same
> page in KVM page table.

Yes, if it's not safe to invalidate after making the change (fallocate()), then
the change needs to be bookended by a start+end pair.  The mmu_notifier's 
unpaired
invalidate() hook works by zapping the primary MMU's PTEs before invalidate(), 
but
frees the underlying physical page _after_ invalidate().

And the only reason the unpaired invalidate() exists is because there are 
secondary
MMUs that reuse the primary MMU's page tables, e.g. shared virtual addressing, 
in
which case bookending doesn't work because the secondary MMU can't remove PTEs, 
it
can only flush its TLBs.

For this case, the whole point is to not create PTEs in the primary MMU, so 
there
should never be a use case that _needs_ an unpaired invalidate().

TL;DR: a start+end pair is likely the simplest solution.

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-21 Thread Sean Christopherson

On Fri, Oct 21, 2022, Chao Peng wrote:
> > 
> > In the context of userspace inaccessible memfd, what would be a
> > suggested way to enforce NUMA memory policy for physical memory
> > allocation? mbind[1] won't work here in absence of virtual address
> > range.
> 
> How about set_mempolicy():
> https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html

Andy Lutomirski brought this up in an off-list discussion way back when the 
whole
private-fd thing was first being proposed.

  : The current Linux NUMA APIs (mbind, move_pages) work on virtual addresses.  
If
  : we want to support them for TDX private memory, we either need TDX private
  : memory to have an HVA or we need file-based equivalents. Arguably we should 
add
  : fmove_pages and fbind syscalls anyway, since the current API is quite 
awkward
  : even for tools like numactl.

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-21 Thread Chao Peng

On Thu, Oct 20, 2022 at 04:20:58PM +0530, Vishal Annapurve wrote:
> On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov
>  wrote:
> >
> > On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > > On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
> > >  wrote:
> > > >
> > > > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > > > From: "Kirill A. Shutemov" 
> > > > > > > >
> > > > > > > > KVM can use memfd-provided memory for guest memory. For normal 
> > > > > > > > userspace
> > > > > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd 
> > > > > > > > into its
> > > > > > > > virtual address space and then tells KVM to use the virtual 
> > > > > > > > address to
> > > > > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > > > > >
> > > > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > > > memfd-provided memory may be encrypted with special key for 
> > > > > > > > special
> > > > > > > > software domain (e.g. KVM guest) and is not expected to be 
> > > > > > > > directly
> > > > > > > > accessed by userspace. Precisely, userspace access to such 
> > > > > > > > encrypted
> > > > > > > > memory may lead to host crash so it should be prevented.
> > > > > > > >
> > > > > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace 
> > > > > > > > through
> > > > > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed 
> > > > > > > > via
> > > > > > > > in-kernel interface so KVM can directly interact with core-mm 
> > > > > > > > without
> > > > > > > > the need to map the memory into KVM userspace.
> > > > > > > >
> > > > > > > > It provides semantics required for KVM guest private(encrypted) 
> > > > > > > > memory
> > > > > > > > support that a file descriptor with this flag set is going to 
> > > > > > > > be used as
> > > > > > > > the source of guest memory in confidential computing 
> > > > > > > > environments such
> > > > > > > > as Intel TDX/AMD SEV.
> > > > > > > >
> > > > > > > > KVM userspace is still in charge of the lifecycle of the memfd. 
> > > > > > > > It
> > > > > > > > should pass the opened fd to KVM. KVM uses the kernel APIs 
> > > > > > > > newly added
> > > > > > > > in this patch to obtain the physical memory address and then 
> > > > > > > > populate
> > > > > > > > the secondary page table entries.
> > > > > > > >
> > > > > > > > The userspace inaccessible memfd can be fallocate-ed and 
> > > > > > > > hole-punched
> > > > > > > > from userspace. When hole-punching happens, KVM can get 
> > > > > > > > notified through
> > > > > > > > inaccessible_notifier it then gets chance to remove any mapped 
> > > > > > > > entries
> > > > > > > > of the range in the secondary page tables.
> > > > > > > >
> > > > > > > > The userspace inaccessible memfd itself is implemented as a 
> > > > > > > > shim layer
> > > > > > > > on top of real memory file systems like tmpfs/hugetlbfs but 
> > > > > > > > this patch
> > > > > > > > only implemented tmpfs. The allocated memory is currently 
> > > > > > > > marked as
> > > > > > > > unmovable and unevictable, this is required for current 
> > > > > > > > confidential
> > > > > > > > usage. But in future this might be changed.
> > > > > > > >
> > > > > > > > Signed-off-by: Kirill A. Shutemov 
> > > > > > > > 
> > > > > > > > Signed-off-by: Chao Peng 
> > > > > > > > ---
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > > > +  loff_t offset, loff_t len)
> > > > > > > > +{
> > > > > > > > +   struct inaccessible_data *data = 
> > > > > > > > file->f_mapping->private_data;
> > > > > > > > +   struct file *memfd = data->memfd;
> > > > > > > > +   int ret;
> > > > > > > > +
> > > > > > > > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > > > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > > > +   return -EINVAL;
> > > > > > > > +   }
> > > > > > > > +
> > > > > > > > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > > > +   inaccessible_notifier_invalidate(data, offset, offset + 
> > > > > > > > len);
> > > > > > >
> > > > > > > Wonder if invalidate should precede the actual hole punch, 
> > > > > > > otherwise we open
> > > > > > > a window where the page tables point to memory no longer valid?
> > > > > >
> > > > > > Yes, you are right. Thanks for catching this.
> > > > >
> > > > > I also noticed this. But then thought the memory would be anyways 
> > > > > zeroed
> > > > > (hole punched) before this call?
> > > >
> > > > Hole punching can

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-21 Thread Chao Peng

> 
> In the context of userspace inaccessible memfd, what would be a
> suggested way to enforce NUMA memory policy for physical memory
> allocation? mbind[1] won't work here in absence of virtual address
> range.

How about set_mempolicy():
https://www.man7.org/linux/man-pages/man2/set_mempolicy.2.html

Chao
> 
> [1] https://github.com/chao-p/qemu/blob/privmem-v8/backends/hostmem.c#L382

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-20 Thread Vishal Annapurve

On Wed, Oct 19, 2022 at 9:02 PM Kirill A . Shutemov
 wrote:
>
> On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> > On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
> >  wrote:
> > >
> > > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > > From: "Kirill A. Shutemov" 
> > > > > > >
> > > > > > > KVM can use memfd-provided memory for guest memory. For normal 
> > > > > > > userspace
> > > > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into 
> > > > > > > its
> > > > > > > virtual address space and then tells KVM to use the virtual 
> > > > > > > address to
> > > > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > > > >
> > > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > > memfd-provided memory may be encrypted with special key for 
> > > > > > > special
> > > > > > > software domain (e.g. KVM guest) and is not expected to be 
> > > > > > > directly
> > > > > > > accessed by userspace. Precisely, userspace access to such 
> > > > > > > encrypted
> > > > > > > memory may lead to host crash so it should be prevented.
> > > > > > >
> > > > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace 
> > > > > > > through
> > > > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > > > in-kernel interface so KVM can directly interact with core-mm 
> > > > > > > without
> > > > > > > the need to map the memory into KVM userspace.
> > > > > > >
> > > > > > > It provides semantics required for KVM guest private(encrypted) 
> > > > > > > memory
> > > > > > > support that a file descriptor with this flag set is going to be 
> > > > > > > used as
> > > > > > > the source of guest memory in confidential computing environments 
> > > > > > > such
> > > > > > > as Intel TDX/AMD SEV.
> > > > > > >
> > > > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly 
> > > > > > > added
> > > > > > > in this patch to obtain the physical memory address and then 
> > > > > > > populate
> > > > > > > the secondary page table entries.
> > > > > > >
> > > > > > > The userspace inaccessible memfd can be fallocate-ed and 
> > > > > > > hole-punched
> > > > > > > from userspace. When hole-punching happens, KVM can get notified 
> > > > > > > through
> > > > > > > inaccessible_notifier it then gets chance to remove any mapped 
> > > > > > > entries
> > > > > > > of the range in the secondary page tables.
> > > > > > >
> > > > > > > The userspace inaccessible memfd itself is implemented as a shim 
> > > > > > > layer
> > > > > > > on top of real memory file systems like tmpfs/hugetlbfs but this 
> > > > > > > patch
> > > > > > > only implemented tmpfs. The allocated memory is currently marked 
> > > > > > > as
> > > > > > > unmovable and unevictable, this is required for current 
> > > > > > > confidential
> > > > > > > usage. But in future this might be changed.
> > > > > > >
> > > > > > > Signed-off-by: Kirill A. Shutemov 
> > > > > > > 
> > > > > > > Signed-off-by: Chao Peng 
> > > > > > > ---
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > > +  loff_t offset, loff_t len)
> > > > > > > +{
> > > > > > > +   struct inaccessible_data *data = 
> > > > > > > file->f_mapping->private_data;
> > > > > > > +   struct file *memfd = data->memfd;
> > > > > > > +   int ret;
> > > > > > > +
> > > > > > > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > > +   return -EINVAL;
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > > +   inaccessible_notifier_invalidate(data, offset, offset + 
> > > > > > > len);
> > > > > >
> > > > > > Wonder if invalidate should precede the actual hole punch, 
> > > > > > otherwise we open
> > > > > > a window where the page tables point to memory no longer valid?
> > > > >
> > > > > Yes, you are right. Thanks for catching this.
> > > >
> > > > I also noticed this. But then thought the memory would be anyways zeroed
> > > > (hole punched) before this call?
> > >
> > > Hole punching can free pages, given that offset/len covers full page.
> > >
> > > --
> > >   Kiryl Shutsemau / Kirill A. Shutemov
> >
> > I think moving this notifier_invalidate before fallocate may not solve
> > the problem completely. Is it possible that between invalidate and
> > fallocate, KVM tries to handle the page fault for the

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-19 Thread Kirill A . Shutemov

On Tue, Oct 18, 2022 at 07:12:10PM +0530, Vishal Annapurve wrote:
> On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
>  wrote:
> >
> > On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > > From: "Kirill A. Shutemov" 
> > > > > >
> > > > > > KVM can use memfd-provided memory for guest memory. For normal 
> > > > > > userspace
> > > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into 
> > > > > > its
> > > > > > virtual address space and then tells KVM to use the virtual address 
> > > > > > to
> > > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > > >
> > > > > > With confidential computing technologies like Intel TDX, the
> > > > > > memfd-provided memory may be encrypted with special key for special
> > > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > > memory may lead to host crash so it should be prevented.
> > > > > >
> > > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > > in-kernel interface so KVM can directly interact with core-mm 
> > > > > > without
> > > > > > the need to map the memory into KVM userspace.
> > > > > >
> > > > > > It provides semantics required for KVM guest private(encrypted) 
> > > > > > memory
> > > > > > support that a file descriptor with this flag set is going to be 
> > > > > > used as
> > > > > > the source of guest memory in confidential computing environments 
> > > > > > such
> > > > > > as Intel TDX/AMD SEV.
> > > > > >
> > > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly 
> > > > > > added
> > > > > > in this patch to obtain the physical memory address and then 
> > > > > > populate
> > > > > > the secondary page table entries.
> > > > > >
> > > > > > The userspace inaccessible memfd can be fallocate-ed and 
> > > > > > hole-punched
> > > > > > from userspace. When hole-punching happens, KVM can get notified 
> > > > > > through
> > > > > > inaccessible_notifier it then gets chance to remove any mapped 
> > > > > > entries
> > > > > > of the range in the secondary page tables.
> > > > > >
> > > > > > The userspace inaccessible memfd itself is implemented as a shim 
> > > > > > layer
> > > > > > on top of real memory file systems like tmpfs/hugetlbfs but this 
> > > > > > patch
> > > > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > > > unmovable and unevictable, this is required for current confidential
> > > > > > usage. But in future this might be changed.
> > > > > >
> > > > > > Signed-off-by: Kirill A. Shutemov 
> > > > > > Signed-off-by: Chao Peng 
> > > > > > ---
> > > > >
> > > > > ...
> > > > >
> > > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > > +  loff_t offset, loff_t len)
> > > > > > +{
> > > > > > +   struct inaccessible_data *data = 
> > > > > > file->f_mapping->private_data;
> > > > > > +   struct file *memfd = data->memfd;
> > > > > > +   int ret;
> > > > > > +
> > > > > > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > > +   return -EINVAL;
> > > > > > +   }
> > > > > > +
> > > > > > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > > +   inaccessible_notifier_invalidate(data, offset, offset + 
> > > > > > len);
> > > > >
> > > > > Wonder if invalidate should precede the actual hole punch, otherwise 
> > > > > we open
> > > > > a window where the page tables point to memory no longer valid?
> > > >
> > > > Yes, you are right. Thanks for catching this.
> > >
> > > I also noticed this. But then thought the memory would be anyways zeroed
> > > (hole punched) before this call?
> >
> > Hole punching can free pages, given that offset/len covers full page.
> >
> > --
> >   Kiryl Shutsemau / Kirill A. Shutemov
> 
> I think moving this notifier_invalidate before fallocate may not solve
> the problem completely. Is it possible that between invalidate and
> fallocate, KVM tries to handle the page fault for the guest VM from
> another vcpu and uses the pages to be freed to back gpa ranges? Should
> hole punching here also update mem_attr first to say that KVM should
> consider the corresponding gpa ranges to be no more backed by
> inaccessible memfd?

We rely on external synchronization to prevent this. See code around
mmu_invalidate_retry_hva().

-- 
  Kiryl Shutsemau / Kirill

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-19 Thread Fuad Tabba

Hi,

On Tue, Oct 18, 2022 at 1:34 AM Sean Christopherson  wrote:
>
> On Fri, Sep 30, 2022, Fuad Tabba wrote:
> > > > > > pKVM would also need a way to make an fd accessible again
> > > > > > when shared back, which I think isn't possible with this patch.
> > > > >
> > > > > But does pKVM really want to mmap/munmap a new region at the 
> > > > > page-level,
> > > > > that can cause VMA fragmentation if the conversion is frequent as I 
> > > > > see.
> > > > > Even with a KVM ioctl for mapping as mentioned below, I think there 
> > > > > will
> > > > > be the same issue.
> > > >
> > > > pKVM doesn't really need to unmap the memory. What is really important
> > > > is that the memory is not GUP'able.
> > >
> > > Well, not entirely unguppable, just unguppable without a magic FOLL_* 
> > > flag,
> > > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> > >
> > > The problem is that gup() and "mapped" are tied together.  So yes, pKVM 
> > > doesn't
> > > strictly need to unmap memory _in the untrusted host_, but since 
> > > mapped==guppable,
> > > the end result is the same.
> > >
> > > Emphasis above because pKVM still needs unmap the memory _somehwere_.  
> > > IIUC, the
> > > current approach is to do that only in the stage-2 page tables, i.e. only 
> > > in the
> > > context of the hypervisor.  Which is also the source of the gup() 
> > > problems; the
> > > untrusted kernel is blissfully unaware that the memory is inaccessible.
> > >
> > > Any approach that moves some of that information into the untrusted 
> > > kernel so that
> > > the kernel can protect itself will incur fragmentation in the VMAs.  
> > > Well, unless
> > > all of guest memory becomes unguppable, but that's likely not a viable 
> > > option.
> >
> > Actually, for pKVM, there is no need for the guest memory to be GUP'able at
> > all if we use the new inaccessible_get_pfn().
>
> Ya, I was referring to pKVM without UPM / inaccessible memory.
>
> Jumping back to blocking gup(), what about using the same tricks as secretmem 
> to
> block gup()?  E.g. compare vm_ops to block regular gup() and a_ops to block 
> fast
> gup() on struct page?  With a Kconfig that's selected by pKVM (which would 
> also
> need its own Kconfig), e.g. CONFIG_INACCESSIBLE_MAPPABLE_MEM, there would be 
> zero
> performance overhead for non-pKVM kernels, i.e. hooking gup() shouldn't be
> controversial.
>
> I suspect the fast gup() path could even be optimized to avoid the 
> page_mapping()
> lookup by adding a PG_inaccessible flag that's defined iff the TBD Kconfig is
> selected.  I'm guessing pKVM isn't expected to be deployed on massivve NUMA 
> systems
> anytime soon, so there should be plenty of page flags to go around.
>
> Blocking gup() instead of trying to play refcount games when converting back 
> to
> private would eliminate the need to put heavy restrictions on mapping, as the 
> goal
> of those were purely to simplify the KVM implementation, e.g. the "one 
> mapping per
> memslot" thing would go away entirely.

My implementation of mmap for inaccessible_fops was setting VM_PFNMAP.
That said, I realized that that might be adding an unnecessary
restriction, and now have changed it to do it the secretmem way.
That's straightforward and works well.

> > This of course goes back to what I'd mentioned before in v7; it seems that
> > representing the memslot memory as a file descriptor should be orthogonal to
> > whether the memory is shared or private, rather than a private_fd for 
> > private
> > memory and the userspace_addr for shared memory.
>
> I also explored the idea of backing any guest memory with an fd, but came to
> the conclusion that private memory needs a separate handle[1], at least on 
> x86.
>
> For SNP and TDX, even though the GPA is the same (ignoring the fact that SNP 
> and
> TDX steal GPA bits to differentiate private vs. shared), the two types need 
> to be
> treated as separate mappings[2].  Post-boot, converting is lossy in both 
> directions,
> so even conceptually they are two disctint pages that just happen to share 
> (some)
> GPA bits.
>
> To allow conversions, i.e. changing which mapping to use, without memslot 
> updates,
> KVM needs to let userspace provide both mappings in a single memslot.  So 
> while
> fd-based memory is an orthogonal concept, e.g. we could add fd-based shared 
> memory,
> KVM would still need a dedicated private handle.
>
> For pKVM, the fd doesn't strictly need to be mutually exclusive with the 
> existing
> userspace_addr, but since the private_fd is going to be added for x86, I 
> think it
> makes sense to use that instead of adding generic fd-based memory for pKVM's 
> use
> case (which is arguably still "private" memory but with special semantics).
>
> [1] https://lore.kernel.org/all/yulth7bl4mwt5...@google.com
> [2] 
> https://lore.kernel.org/all/869622df-5bf6-0fbb-cac4-34c6ae7df...@kernel.org

As long as the API does not impose this limit, which would imply pKVM
is misusing it,

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-19 Thread Chao Peng

On Mon, Oct 17, 2022 at 08:05:10PM +0100, Fuad Tabba wrote:
> Hi,
> 
> > > > Using both private_fd and userspace_addr is only needed in TDX and other
> > > > confidential computing scenarios, pKVM may only use private_fd if the fd
> > > > can also be mmaped as a whole to userspace as Sean suggested.
> > >
> > > That does work in practice, for now at least, and is what I do in my
> > > current port. However, the naming and how the API is defined as
> > > implied by the name and the documentation. By calling the field
> > > private_fd, it does imply that it should not be mapped, which is also
> > > what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
> > > would be mis/ab-using this interface, and that future changes could
> > > cause unforeseen issues for pKVM.
> >
> > That is fairly enough. We can change the naming and the documents.
> >
> > >
> > > Maybe renaming this to something like "guest_fp", and specifying in
> > > the documentation that it can be restricted, e.g., instead of "the
> > > content of the private memory is invisible to userspace" something
> > > along the lines of  "the content of the guest memory may be restricted
> > > to userspace".
> >
> > Some other candidates in my mind:
> > - restricted_fd: to pair with the mm side restricted_memfd
> > - protected_fd: as Sean suggested before
> > - fd: how it's explained relies on the memslot.flag.
> 
> All these sound good, since they all capture the potential use cases.
> Restricted might be the most logical choice if that's going to also
> become the name for the mem_fd.

Thanks, I will use 'restricted' for them. e.g.:
- memfd_restricted() syscall
- restricted_fd
- restricted_offset

The memslot flags will still be KVM_MEM_PRIVATE, since I think pKVM will
create its own one?

Chao
> 
> Thanks,
> /fuad
> 
> > Thanks,
> > Chao
> > >
> > > What do you think?
> > >
> > > Cheers,
> > > /fuad
> > >
> > > >
> > > > Thanks,
> > > > Chao
> > > > >
> > > > > Cheers,
> > > > > /fuad

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-19 Thread Vishal Annapurve

On Thu, Sep 15, 2022 at 8:04 PM Chao Peng  wrote:
>
> From: "Kirill A. Shutemov" 
>
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
>
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with special key for special
> software domain (e.g. KVM guest) and is not expected to be directly
> accessed by userspace. Precisely, userspace access to such encrypted
> memory may lead to host crash so it should be prevented.
>
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> in-kernel interface so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.
>
> It provides semantics required for KVM guest private(encrypted) memory
> support that a file descriptor with this flag set is going to be used as
> the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV.
>
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
>
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM can get notified through
> inaccessible_notifier it then gets chance to remove any mapped entries
> of the range in the secondary page tables.
>
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs but this patch
> only implemented tmpfs. The allocated memory is currently marked as
> unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
>
> Signed-off-by: Kirill A. Shutemov 
> Signed-off-by: Chao Peng 
> ---
>  include/linux/memfd.h  |  24 
>  include/uapi/linux/magic.h |   1 +
>  include/uapi/linux/memfd.h |   1 +
>  mm/Makefile|   2 +-
>  mm/memfd.c |  25 -
>  mm/memfd_inaccessible.c| 219 +
>  6 files changed, 270 insertions(+), 2 deletions(-)
>  create mode 100644 mm/memfd_inaccessible.c
>
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..334ddff08377 100644
> --- a/include/linux/memfd.h
> +++ b/include/linux/memfd.h
> @@ -3,6 +3,7 @@
>  #define __LINUX_MEMFD_H
>
>  #include 
> +#include 
>
>  #ifdef CONFIG_MEMFD_CREATE
>  extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long 
> arg);
> @@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned 
> int c, unsigned long a)
>  }
>  #endif
>
> +struct inaccessible_notifier;
> +
> +struct inaccessible_notifier_ops {
> +   void (*invalidate)(struct inaccessible_notifier *notifier,
> +  pgoff_t start, pgoff_t end);
> +};
> +
> +struct inaccessible_notifier {
> +   struct list_head list;
> +   const struct inaccessible_notifier_ops *ops;
> +};
> +
> +void inaccessible_register_notifier(struct file *file,
> +   struct inaccessible_notifier *notifier);
> +void inaccessible_unregister_notifier(struct file *file,
> + struct inaccessible_notifier *notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +int *order);
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn);
> +
> +struct file *memfd_mkinaccessible(struct file *memfd);
> +
>  #endif /* __LINUX_MEMFD_H */
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..9d066be3d7e8 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC  0x444d4142  /* "DMAB" */
>  #define DEVMEM_MAGIC   0x454d444d  /* "DMEM" */
>  #define SECRETMEM_MAGIC0x5345434d  /* "SECM" */
> +#define INACCESSIBLE_MAGIC 0x494e4143  /* "INAC" */
>
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 7a8a26751c23..48750474b904 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -8,6 +8,7 @@
>  #define MFD_CLOEXEC0x0001U
>  #define MFD_ALLOW_SEALING  0x0002U
>  #define MFD_HUGETLB0x0004U
> +#define MFD_INACCESSIBLE   0x0008U
>
>  /*
>   * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f836403..f82e5d4b4388 100644
> ---

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-18 Thread Vishal Annapurve

On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov
 wrote:
>
> On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > From: "Kirill A. Shutemov" 
> > > > >
> > > > > KVM can use memfd-provided memory for guest memory. For normal 
> > > > > userspace
> > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > > virtual address space and then tells KVM to use the virtual address to
> > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > >
> > > > > With confidential computing technologies like Intel TDX, the
> > > > > memfd-provided memory may be encrypted with special key for special
> > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > memory may lead to host crash so it should be prevented.
> > > > >
> > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > > the need to map the memory into KVM userspace.
> > > > >
> > > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > > support that a file descriptor with this flag set is going to be used 
> > > > > as
> > > > > the source of guest memory in confidential computing environments such
> > > > > as Intel TDX/AMD SEV.
> > > > >
> > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > > in this patch to obtain the physical memory address and then populate
> > > > > the secondary page table entries.
> > > > >
> > > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > > from userspace. When hole-punching happens, KVM can get notified 
> > > > > through
> > > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > > of the range in the secondary page tables.
> > > > >
> > > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > > unmovable and unevictable, this is required for current confidential
> > > > > usage. But in future this might be changed.
> > > > >
> > > > > Signed-off-by: Kirill A. Shutemov 
> > > > > Signed-off-by: Chao Peng 
> > > > > ---
> > > >
> > > > ...
> > > >
> > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > +  loff_t offset, loff_t len)
> > > > > +{
> > > > > +   struct inaccessible_data *data = 
> > > > > file->f_mapping->private_data;
> > > > > +   struct file *memfd = data->memfd;
> > > > > +   int ret;
> > > > > +
> > > > > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > +   return -EINVAL;
> > > > > +   }
> > > > > +
> > > > > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > +   inaccessible_notifier_invalidate(data, offset, offset + len);
> > > >
> > > > Wonder if invalidate should precede the actual hole punch, otherwise we 
> > > > open
> > > > a window where the page tables point to memory no longer valid?
> > >
> > > Yes, you are right. Thanks for catching this.
> >
> > I also noticed this. But then thought the memory would be anyways zeroed
> > (hole punched) before this call?
>
> Hole punching can free pages, given that offset/len covers full page.
>
> --
>   Kiryl Shutsemau / Kirill A. Shutemov

I think moving this notifier_invalidate before fallocate may not solve
the problem completely. Is it possible that between invalidate and
fallocate, KVM tries to handle the page fault for the guest VM from
another vcpu and uses the pages to be freed to back gpa ranges? Should
hole punching here also update mem_attr first to say that KVM should
consider the corresponding gpa ranges to be no more backed by
inaccessible memfd?

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-17 Thread Sean Christopherson

On Fri, Sep 30, 2022, Fuad Tabba wrote:
> > > > > pKVM would also need a way to make an fd accessible again
> > > > > when shared back, which I think isn't possible with this patch.
> > > >
> > > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > be the same issue.
> > >
> > > pKVM doesn't really need to unmap the memory. What is really important
> > > is that the memory is not GUP'able.
> >
> > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> >
> > The problem is that gup() and "mapped" are tied together.  So yes, pKVM 
> > doesn't
> > strictly need to unmap memory _in the untrusted host_, but since 
> > mapped==guppable,
> > the end result is the same.
> >
> > Emphasis above because pKVM still needs unmap the memory _somehwere_.  
> > IIUC, the
> > current approach is to do that only in the stage-2 page tables, i.e. only 
> > in the
> > context of the hypervisor.  Which is also the source of the gup() problems; 
> > the
> > untrusted kernel is blissfully unaware that the memory is inaccessible.
> >
> > Any approach that moves some of that information into the untrusted kernel 
> > so that
> > the kernel can protect itself will incur fragmentation in the VMAs.  Well, 
> > unless
> > all of guest memory becomes unguppable, but that's likely not a viable 
> > option.
> 
> Actually, for pKVM, there is no need for the guest memory to be GUP'able at
> all if we use the new inaccessible_get_pfn().

Ya, I was referring to pKVM without UPM / inaccessible memory.

Jumping back to blocking gup(), what about using the same tricks as secretmem to
block gup()?  E.g. compare vm_ops to block regular gup() and a_ops to block fast
gup() on struct page?  With a Kconfig that's selected by pKVM (which would also
need its own Kconfig), e.g. CONFIG_INACCESSIBLE_MAPPABLE_MEM, there would be 
zero
performance overhead for non-pKVM kernels, i.e. hooking gup() shouldn't be
controversial.

I suspect the fast gup() path could even be optimized to avoid the 
page_mapping()
lookup by adding a PG_inaccessible flag that's defined iff the TBD Kconfig is
selected.  I'm guessing pKVM isn't expected to be deployed on massivve NUMA 
systems
anytime soon, so there should be plenty of page flags to go around.

Blocking gup() instead of trying to play refcount games when converting back to
private would eliminate the need to put heavy restrictions on mapping, as the 
goal
of those were purely to simplify the KVM implementation, e.g. the "one mapping 
per
memslot" thing would go away entirely.

> This of course goes back to what I'd mentioned before in v7; it seems that
> representing the memslot memory as a file descriptor should be orthogonal to
> whether the memory is shared or private, rather than a private_fd for private
> memory and the userspace_addr for shared memory.

I also explored the idea of backing any guest memory with an fd, but came to
the conclusion that private memory needs a separate handle[1], at least on x86.

For SNP and TDX, even though the GPA is the same (ignoring the fact that SNP and
TDX steal GPA bits to differentiate private vs. shared), the two types need to 
be
treated as separate mappings[2].  Post-boot, converting is lossy in both 
directions,
so even conceptually they are two disctint pages that just happen to share 
(some)
GPA bits.

To allow conversions, i.e. changing which mapping to use, without memslot 
updates,
KVM needs to let userspace provide both mappings in a single memslot.  So while
fd-based memory is an orthogonal concept, e.g. we could add fd-based shared 
memory,
KVM would still need a dedicated private handle.

For pKVM, the fd doesn't strictly need to be mutually exclusive with the 
existing
userspace_addr, but since the private_fd is going to be added for x86, I think 
it
makes sense to use that instead of adding generic fd-based memory for pKVM's use
case (which is arguably still "private" memory but with special semantics).

[1] https://lore.kernel.org/all/yulth7bl4mwt5...@google.com
[2] https://lore.kernel.org/all/869622df-5bf6-0fbb-cac4-34c6ae7df...@kernel.org

>  The host can then map or unmap the shared/private memory using the fd, which
>  allows it more freedom in even choosing to unmap shared memory when not
>  needed, for example.

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-17 Thread Kirill A . Shutemov

On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > On 9/15/22 16:29, Chao Peng wrote:
> > > > From: "Kirill A. Shutemov" 
> > > > 
> > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > virtual address space and then tells KVM to use the virtual address to
> > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > 
> > > > With confidential computing technologies like Intel TDX, the
> > > > memfd-provided memory may be encrypted with special key for special
> > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > memory may lead to host crash so it should be prevented.
> > > > 
> > > > This patch introduces userspace inaccessible memfd (created with
> > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > the need to map the memory into KVM userspace.
> > > > 
> > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > support that a file descriptor with this flag set is going to be used as
> > > > the source of guest memory in confidential computing environments such
> > > > as Intel TDX/AMD SEV.
> > > > 
> > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > in this patch to obtain the physical memory address and then populate
> > > > the secondary page table entries.
> > > > 
> > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > of the range in the secondary page tables.
> > > > 
> > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > unmovable and unevictable, this is required for current confidential
> > > > usage. But in future this might be changed.
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov 
> > > > Signed-off-by: Chao Peng 
> > > > ---
> > > 
> > > ...
> > > 
> > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > +  loff_t offset, loff_t len)
> > > > +{
> > > > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > > > +   struct file *memfd = data->memfd;
> > > > +   int ret;
> > > > +
> > > > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > +   return -EINVAL;
> > > > +   }
> > > > +
> > > > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > +   inaccessible_notifier_invalidate(data, offset, offset + len);
> > > 
> > > Wonder if invalidate should precede the actual hole punch, otherwise we 
> > > open
> > > a window where the page tables point to memory no longer valid?
> > 
> > Yes, you are right. Thanks for catching this.
> 
> I also noticed this. But then thought the memory would be anyways zeroed
> (hole punched) before this call?

Hole punching can free pages, given that offset/len covers full page.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-17 Thread Fuad Tabba

Hi,

> > > Using both private_fd and userspace_addr is only needed in TDX and other
> > > confidential computing scenarios, pKVM may only use private_fd if the fd
> > > can also be mmaped as a whole to userspace as Sean suggested.
> >
> > That does work in practice, for now at least, and is what I do in my
> > current port. However, the naming and how the API is defined as
> > implied by the name and the documentation. By calling the field
> > private_fd, it does imply that it should not be mapped, which is also
> > what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
> > would be mis/ab-using this interface, and that future changes could
> > cause unforeseen issues for pKVM.
>
> That is fairly enough. We can change the naming and the documents.
>
> >
> > Maybe renaming this to something like "guest_fp", and specifying in
> > the documentation that it can be restricted, e.g., instead of "the
> > content of the private memory is invisible to userspace" something
> > along the lines of  "the content of the guest memory may be restricted
> > to userspace".
>
> Some other candidates in my mind:
> - restricted_fd: to pair with the mm side restricted_memfd
> - protected_fd: as Sean suggested before
> - fd: how it's explained relies on the memslot.flag.

All these sound good, since they all capture the potential use cases.
Restricted might be the most logical choice if that's going to also
become the name for the mem_fd.

Thanks,
/fuad

> Thanks,
> Chao
> >
> > What do you think?
> >
> > Cheers,
> > /fuad
> >
> > >
> > > Thanks,
> > > Chao
> > > >
> > > > Cheers,
> > > > /fuad

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-17 Thread Gupta, Pankaj


On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:

On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:

On 9/15/22 16:29, Chao Peng wrote:

From: "Kirill A. Shutemov" 

KVM can use memfd-provided memory for guest memory. For normal userspace
accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
virtual address space and then tells KVM to use the virtual address to
setup the mapping in the secondary page table (e.g. EPT).

With confidential computing technologies like Intel TDX, the
memfd-provided memory may be encrypted with special key for special
software domain (e.g. KVM guest) and is not expected to be directly
accessed by userspace. Precisely, userspace access to such encrypted
memory may lead to host crash so it should be prevented.

This patch introduces userspace inaccessible memfd (created with
MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
ordinary MMU access (e.g. read/write/mmap) but can be accessed via
in-kernel interface so KVM can directly interact with core-mm without
the need to map the memory into KVM userspace.

It provides semantics required for KVM guest private(encrypted) memory
support that a file descriptor with this flag set is going to be used as
the source of guest memory in confidential computing environments such
as Intel TDX/AMD SEV.

KVM userspace is still in charge of the lifecycle of the memfd. It
should pass the opened fd to KVM. KVM uses the kernel APIs newly added
in this patch to obtain the physical memory address and then populate
the secondary page table entries.

The userspace inaccessible memfd can be fallocate-ed and hole-punched
from userspace. When hole-punching happens, KVM can get notified through
inaccessible_notifier it then gets chance to remove any mapped entries
of the range in the secondary page tables.

The userspace inaccessible memfd itself is implemented as a shim layer
on top of real memory file systems like tmpfs/hugetlbfs but this patch
only implemented tmpfs. The allocated memory is currently marked as
unmovable and unevictable, this is required for current confidential
usage. But in future this might be changed.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Chao Peng 
---


...


+static long inaccessible_fallocate(struct file *file, int mode,
+  loff_t offset, loff_t len)
+{
+   struct inaccessible_data *data = file->f_mapping->private_data;
+   struct file *memfd = data->memfd;
+   int ret;
+
+   if (mode & FALLOC_FL_PUNCH_HOLE) {
+   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+   return -EINVAL;
+   }
+
+   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
+   inaccessible_notifier_invalidate(data, offset, offset + len);


Wonder if invalidate should precede the actual hole punch, otherwise we open
a window where the page tables point to memory no longer valid?


Yes, you are right. Thanks for catching this.


I also noticed this. But then thought the memory would be anyways zeroed 
(hole punched) before this call?





+   return ret;
+}
+


...


+
+static struct file_system_type inaccessible_fs = {
+   .owner  = THIS_MODULE,
+   .name   = "[inaccessible]",


Dunno where exactly is this name visible, but shouldn't it better be
"[memfd:inaccessible]"?


Maybe. And skip brackets.




+   .init_fs_context = inaccessible_init_fs_context,
+   .kill_sb= kill_anon_super,
+};
+

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-17 Thread Kirill A . Shutemov

On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> On 9/15/22 16:29, Chao Peng wrote:
> > From: "Kirill A. Shutemov" 
> > 
> > KVM can use memfd-provided memory for guest memory. For normal userspace
> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > virtual address space and then tells KVM to use the virtual address to
> > setup the mapping in the secondary page table (e.g. EPT).
> > 
> > With confidential computing technologies like Intel TDX, the
> > memfd-provided memory may be encrypted with special key for special
> > software domain (e.g. KVM guest) and is not expected to be directly
> > accessed by userspace. Precisely, userspace access to such encrypted
> > memory may lead to host crash so it should be prevented.
> > 
> > This patch introduces userspace inaccessible memfd (created with
> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > in-kernel interface so KVM can directly interact with core-mm without
> > the need to map the memory into KVM userspace.
> > 
> > It provides semantics required for KVM guest private(encrypted) memory
> > support that a file descriptor with this flag set is going to be used as
> > the source of guest memory in confidential computing environments such
> > as Intel TDX/AMD SEV.
> > 
> > KVM userspace is still in charge of the lifecycle of the memfd. It
> > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > in this patch to obtain the physical memory address and then populate
> > the secondary page table entries.
> > 
> > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > from userspace. When hole-punching happens, KVM can get notified through
> > inaccessible_notifier it then gets chance to remove any mapped entries
> > of the range in the secondary page tables.
> > 
> > The userspace inaccessible memfd itself is implemented as a shim layer
> > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > only implemented tmpfs. The allocated memory is currently marked as
> > unmovable and unevictable, this is required for current confidential
> > usage. But in future this might be changed.
> > 
> > Signed-off-by: Kirill A. Shutemov 
> > Signed-off-by: Chao Peng 
> > ---
> 
> ...
> 
> > +static long inaccessible_fallocate(struct file *file, int mode,
> > +  loff_t offset, loff_t len)
> > +{
> > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > +   struct file *memfd = data->memfd;
> > +   int ret;
> > +
> > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +   return -EINVAL;
> > +   }
> > +
> > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > +   inaccessible_notifier_invalidate(data, offset, offset + len);
> 
> Wonder if invalidate should precede the actual hole punch, otherwise we open
> a window where the page tables point to memory no longer valid?

Yes, you are right. Thanks for catching this.

> > +   return ret;
> > +}
> > +
> 
> ...
> 
> > +
> > +static struct file_system_type inaccessible_fs = {
> > +   .owner  = THIS_MODULE,
> > +   .name   = "[inaccessible]",
> 
> Dunno where exactly is this name visible, but shouldn't it better be
> "[memfd:inaccessible]"?

Maybe. And skip brackets.

> 
> > +   .init_fs_context = inaccessible_init_fs_context,
> > +   .kill_sb= kill_anon_super,
> > +};
> > +
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-17 Thread Chao Peng

On Mon, Oct 17, 2022 at 11:31:19AM +0100, Fuad Tabba wrote:
> Hi,
> 
> > >
> > > Actually, for pKVM, there is no need for the guest memory to be
> > > GUP'able at all if we use the new inaccessible_get_pfn().
> >
> > If pKVM can use inaccessible_get_pfn() to get pfn and can avoid GUP (I
> > think that is the major concern?), do you see any other gap from
> > existing API?
> 
> Actually for this part no, there aren't any gaps and
> inaccessible_get_pfn() is sufficient.

Thanks for the confirmation.

> 
> > > This of
> > > course goes back to what I'd mentioned before in v7; it seems that
> > > representing the memslot memory as a file descriptor should be
> > > orthogonal to whether the memory is shared or private, rather than a
> > > private_fd for private memory and the userspace_addr for shared
> > > memory. The host can then map or unmap the shared/private memory using
> > > the fd, which allows it more freedom in even choosing to unmap shared
> > > memory when not needed, for example.
> >
> > Using both private_fd and userspace_addr is only needed in TDX and other
> > confidential computing scenarios, pKVM may only use private_fd if the fd
> > can also be mmaped as a whole to userspace as Sean suggested.
> 
> That does work in practice, for now at least, and is what I do in my
> current port. However, the naming and how the API is defined as
> implied by the name and the documentation. By calling the field
> private_fd, it does imply that it should not be mapped, which is also
> what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
> would be mis/ab-using this interface, and that future changes could
> cause unforeseen issues for pKVM.

That is fairly enough. We can change the naming and the documents.

> 
> Maybe renaming this to something like "guest_fp", and specifying in
> the documentation that it can be restricted, e.g., instead of "the
> content of the private memory is invisible to userspace" something
> along the lines of  "the content of the guest memory may be restricted
> to userspace".

Some other candidates in my mind:
- restricted_fd: to pair with the mm side restricted_memfd
- protected_fd: as Sean suggested before
- fd: how it's explained relies on the memslot.flag.

Thanks,
Chao
> 
> What do you think?
> 
> Cheers,
> /fuad
> 
> >
> > Thanks,
> > Chao
> > >
> > > Cheers,
> > > /fuad

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-17 Thread Vlastimil Babka

On 9/15/22 16:29, Chao Peng wrote:
> From: "Kirill A. Shutemov" 
> 
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
> 
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with special key for special
> software domain (e.g. KVM guest) and is not expected to be directly
> accessed by userspace. Precisely, userspace access to such encrypted
> memory may lead to host crash so it should be prevented.
> 
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> in-kernel interface so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.
> 
> It provides semantics required for KVM guest private(encrypted) memory
> support that a file descriptor with this flag set is going to be used as
> the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV.
> 
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
> 
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM can get notified through
> inaccessible_notifier it then gets chance to remove any mapped entries
> of the range in the secondary page tables.
> 
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs but this patch
> only implemented tmpfs. The allocated memory is currently marked as
> unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
> 
> Signed-off-by: Kirill A. Shutemov 
> Signed-off-by: Chao Peng 
> ---

...

> +static long inaccessible_fallocate(struct file *file, int mode,
> +loff_t offset, loff_t len)
> +{
> + struct inaccessible_data *data = file->f_mapping->private_data;
> + struct file *memfd = data->memfd;
> + int ret;
> +
> + if (mode & FALLOC_FL_PUNCH_HOLE) {
> + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> + return -EINVAL;
> + }
> +
> + ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> + inaccessible_notifier_invalidate(data, offset, offset + len);

Wonder if invalidate should precede the actual hole punch, otherwise we open
a window where the page tables point to memory no longer valid?

> + return ret;
> +}
> +

...

> +
> +static struct file_system_type inaccessible_fs = {
> + .owner  = THIS_MODULE,
> + .name   = "[inaccessible]",

Dunno where exactly is this name visible, but shouldn't it better be
"[memfd:inaccessible]"?

> + .init_fs_context = inaccessible_init_fs_context,
> + .kill_sb= kill_anon_super,
> +};
> +

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-17 Thread Fuad Tabba

Hi,

> >
> > Actually, for pKVM, there is no need for the guest memory to be
> > GUP'able at all if we use the new inaccessible_get_pfn().
>
> If pKVM can use inaccessible_get_pfn() to get pfn and can avoid GUP (I
> think that is the major concern?), do you see any other gap from
> existing API?

Actually for this part no, there aren't any gaps and
inaccessible_get_pfn() is sufficient.

> > This of
> > course goes back to what I'd mentioned before in v7; it seems that
> > representing the memslot memory as a file descriptor should be
> > orthogonal to whether the memory is shared or private, rather than a
> > private_fd for private memory and the userspace_addr for shared
> > memory. The host can then map or unmap the shared/private memory using
> > the fd, which allows it more freedom in even choosing to unmap shared
> > memory when not needed, for example.
>
> Using both private_fd and userspace_addr is only needed in TDX and other
> confidential computing scenarios, pKVM may only use private_fd if the fd
> can also be mmaped as a whole to userspace as Sean suggested.

That does work in practice, for now at least, and is what I do in my
current port. However, the naming and how the API is defined as
implied by the name and the documentation. By calling the field
private_fd, it does imply that it should not be mapped, which is also
what api.rst says in PATCH v8 5/8. My worry is that in that case pKVM
would be mis/ab-using this interface, and that future changes could
cause unforeseen issues for pKVM.

Maybe renaming this to something like "guest_fp", and specifying in
the documentation that it can be restricted, e.g., instead of "the
content of the private memory is invisible to userspace" something
along the lines of  "the content of the guest memory may be restricted
to userspace".

What do you think?

Cheers,
/fuad

>
> Thanks,
> Chao
> >
> > Cheers,
> > /fuad

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-13 Thread Chao Peng

On Fri, Sep 30, 2022 at 05:19:00PM +0100, Fuad Tabba wrote:
> Hi,
> 
> On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson  
> wrote:
> >
> > On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > > Hi,
> > >
> > > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng  
> > > wrote:
> > > >
> > > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > > >
> > > > > >   1. Not be supported for TDX or SEV-SNP because they don't allow 
> > > > > > adding non-zero
> > > > > >  memory into the guest (after pre-boot phase).
> > > > > >
> > > > > >   2. Be mutually exclusive with shared<=>private conversions, and 
> > > > > > is allowed if
> > > > > >  and only if the entire gfn range of the associated memslot is 
> > > > > > shared.
> > > > >
> > > > > In general I think that this would work with pKVM. However, limiting
> > > > > private<->shared conversions to the granularity of a whole memslot
> > > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > > shares back its restricted DMA pool with the host it does so at the
> > > > > page-level.
> >
> > Y'all are killing me :-)
> 
>  :D
> 
> > Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt 
> > share at
> > granularity X"?  With KVM's newfangled scalable memslots and per-vCPU MRU 
> > slot,
> > X doesn't even have to be that high to get reasonable performance, e.g. 
> > assuming
> > the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is 
> > supposed to
> > work just fine in KVM.
> 
> The guest is potentially enlightened, but the host doesn't necessarily
> know which memslot the guest might want to share back, since it
> doesn't know where the guest might want to place the DMA pool. If I
> understand this correctly, for this to work, all memslots would need
> to be the same size and sharing would always need to happen at that
> granularity.
> 
> Moreover, for something like a small DMA pool this might scale, but
> I'm not sure about potential future workloads (e.g., multimedia
> in-place sharing).
> 
> >
> > > > > pKVM would also need a way to make an fd accessible again
> > > > > when shared back, which I think isn't possible with this patch.
> > > >
> > > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > be the same issue.
> > >
> > > pKVM doesn't really need to unmap the memory. What is really important
> > > is that the memory is not GUP'able.
> >
> > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> >
> > The problem is that gup() and "mapped" are tied together.  So yes, pKVM 
> > doesn't
> > strictly need to unmap memory _in the untrusted host_, but since 
> > mapped==guppable,
> > the end result is the same.
> >
> > Emphasis above because pKVM still needs unmap the memory _somehwere_.  
> > IIUC, the
> > current approach is to do that only in the stage-2 page tables, i.e. only 
> > in the
> > context of the hypervisor.  Which is also the source of the gup() problems; 
> > the
> > untrusted kernel is blissfully unaware that the memory is inaccessible.
> >
> > Any approach that moves some of that information into the untrusted kernel 
> > so that
> > the kernel can protect itself will incur fragmentation in the VMAs.  Well, 
> > unless
> > all of guest memory becomes unguppable, but that's likely not a viable 
> > option.
> 
> Actually, for pKVM, there is no need for the guest memory to be
> GUP'able at all if we use the new inaccessible_get_pfn().

If pKVM can use inaccessible_get_pfn() to get pfn and can avoid GUP (I
think that is the major concern?), do you see any other gap from
existing API? 

> This of
> course goes back to what I'd mentioned before in v7; it seems that
> representing the memslot memory as a file descriptor should be
> orthogonal to whether the memory is shared or private, rather than a
> private_fd for private memory and the userspace_addr for shared
> memory. The host can then map or unmap the shared/private memory using
> the fd, which allows it more freedom in even choosing to unmap shared
> memory when not needed, for example.

Using both private_fd and userspace_addr is only needed in TDX and other
confidential computing scenarios, pKVM may only use private_fd if the fd
can also be mmaped as a whole to userspace as Sean suggested.

Thanks,
Chao
> 
> Cheers,
> /fuad

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-06 Thread Kirill A. Shutemov

On Thu, Oct 06, 2022 at 09:50:28AM +0100, Fuad Tabba wrote:
> Hi,
> 
> <...>
> 
> 
> > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > new file mode 100644
> > index ..2d33cbdd9282
> > --- /dev/null
> > +++ b/mm/memfd_inaccessible.c
> 
> <...>
> 
> > +struct file *memfd_mkinaccessible(struct file *memfd)
> > +{
> > +   struct inaccessible_data *data;
> > +   struct address_space *mapping;
> > +   struct inode *inode;
> > +   struct file *file;
> > +
> > +   data = kzalloc(sizeof(*data), GFP_KERNEL);
> > +   if (!data)
> > +   return ERR_PTR(-ENOMEM);
> > +
> > +   data->memfd = memfd;
> > +   mutex_init(>lock);
> > +   INIT_LIST_HEAD(>notifiers);
> > +
> > +   inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
> > +   if (IS_ERR(inode)) {
> > +   kfree(data);
> > +   return ERR_CAST(inode);
> > +   }
> > +
> > +   inode->i_mode |= S_IFREG;
> > +   inode->i_op = _iops;
> > +   inode->i_mapping->private_data = data;
> > +
> > +   file = alloc_file_pseudo(inode, inaccessible_mnt,
> > +"[memfd:inaccessible]", O_RDWR,
> > +_fops);
> > +   if (IS_ERR(file)) {
> > +   iput(inode);
> > +   kfree(data);
> 
> I think this might be missing a return at this point.

Good catch! Thanks!

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-06 Thread Fuad Tabba

Hi,

<...>


> diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> new file mode 100644
> index ..2d33cbdd9282
> --- /dev/null
> +++ b/mm/memfd_inaccessible.c

<...>

> +struct file *memfd_mkinaccessible(struct file *memfd)
> +{
> +   struct inaccessible_data *data;
> +   struct address_space *mapping;
> +   struct inode *inode;
> +   struct file *file;
> +
> +   data = kzalloc(sizeof(*data), GFP_KERNEL);
> +   if (!data)
> +   return ERR_PTR(-ENOMEM);
> +
> +   data->memfd = memfd;
> +   mutex_init(>lock);
> +   INIT_LIST_HEAD(>notifiers);
> +
> +   inode = alloc_anon_inode(inaccessible_mnt->mnt_sb);
> +   if (IS_ERR(inode)) {
> +   kfree(data);
> +   return ERR_CAST(inode);
> +   }
> +
> +   inode->i_mode |= S_IFREG;
> +   inode->i_op = _iops;
> +   inode->i_mapping->private_data = data;
> +
> +   file = alloc_file_pseudo(inode, inaccessible_mnt,
> +"[memfd:inaccessible]", O_RDWR,
> +_fops);
> +   if (IS_ERR(file)) {
> +   iput(inode);
> +   kfree(data);

I think this might be missing a return at this point.

> +   }
> +
> +   file->f_flags |= O_LARGEFILE;
> +
> +   mapping = memfd->f_mapping;
> +   mapping_set_unevictable(mapping);
> +   mapping_set_gfp_mask(mapping,
> +mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> +
> +   return file;
> +}

Thanks,
/fuad



> +
> +void inaccessible_register_notifier(struct file *file,
> +   struct inaccessible_notifier *notifier)
> +{
> +   struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +   mutex_lock(>lock);
> +   list_add(>list, >notifiers);
> +   mutex_unlock(>lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> +
> +void inaccessible_unregister_notifier(struct file *file,
> + struct inaccessible_notifier *notifier)
> +{
> +   struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +   mutex_lock(>lock);
> +   list_del(>list);
> +   mutex_unlock(>lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +int *order)
> +{
> +   struct inaccessible_data *data = file->f_mapping->private_data;
> +   struct file *memfd = data->memfd;
> +   struct page *page;
> +   int ret;
> +
> +   ret = shmem_getpage(file_inode(memfd), offset, , SGP_WRITE);
> +   if (ret)
> +   return ret;
> +
> +   *pfn = page_to_pfn_t(page);
> +   *order = thp_order(compound_head(page));
> +   SetPageUptodate(page);
> +   unlock_page(page);
> +
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> +   struct page *page = pfn_t_to_page(pfn);
> +
> +   if (WARN_ON_ONCE(!page))
> +   return;
> +
> +   put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> --
> 2.25.1
>

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-04 Thread Fuad Tabba

Hi,

On Mon, Oct 3, 2022 at 12:01 PM Kirill A. Shutemov  wrote:
>
> On Mon, Oct 03, 2022 at 08:33:13AM +0100, Fuad Tabba wrote:
> > > I think it is "don't do that" category. inaccessible_register_notifier()
> > > caller has to know what file it operates on, no?
> >
> > The thing is, you could oops the kernel from userspace. For that, all
> > you have to do is a memfd_create without the MFD_INACCESSIBLE,
> > followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
> > I ran into this using my port of this patch series to arm64.
>
> My point is that it has to be handled on a different level. KVM has to
> reject private_fd if it is now inaccessible. It should be trivial by
> checking file->f_inode->i_sb->s_magic.

Yes, that makes sense.

Thanks,
/fuad

> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-03 Thread Kirill A. Shutemov

On Mon, Oct 03, 2022 at 08:33:13AM +0100, Fuad Tabba wrote:
> > I think it is "don't do that" category. inaccessible_register_notifier()
> > caller has to know what file it operates on, no?
> 
> The thing is, you could oops the kernel from userspace. For that, all
> you have to do is a memfd_create without the MFD_INACCESSIBLE,
> followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
> I ran into this using my port of this patch series to arm64.

My point is that it has to be handled on a different level. KVM has to
reject private_fd if it is now inaccessible. It should be trivial by
checking file->f_inode->i_sb->s_magic.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-10-03 Thread Fuad Tabba

Hi

On Fri, Sep 30, 2022 at 5:23 PM Kirill A . Shutemov
 wrote:
>
> On Fri, Sep 30, 2022 at 05:14:00PM +0100, Fuad Tabba wrote:
> > Hi,
> >
> > <...>
> >
> > > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > > new file mode 100644
> > > index ..2d33cbdd9282
> > > --- /dev/null
> > > +++ b/mm/memfd_inaccessible.c
> > > @@ -0,0 +1,219 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include "linux/sbitmap.h"
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +struct inaccessible_data {
> > > +   struct mutex lock;
> > > +   struct file *memfd;
> > > +   struct list_head notifiers;
> > > +};
> > > +
> > > +static void inaccessible_notifier_invalidate(struct inaccessible_data 
> > > *data,
> > > +pgoff_t start, pgoff_t end)
> > > +{
> > > +   struct inaccessible_notifier *notifier;
> > > +
> > > +   mutex_lock(>lock);
> > > +   list_for_each_entry(notifier, >notifiers, list) {
> > > +   notifier->ops->invalidate(notifier, start, end);
> > > +   }
> > > +   mutex_unlock(>lock);
> > > +}
> > > +
> > > +static int inaccessible_release(struct inode *inode, struct file *file)
> > > +{
> > > +   struct inaccessible_data *data = inode->i_mapping->private_data;
> > > +
> > > +   fput(data->memfd);
> > > +   kfree(data);
> > > +   return 0;
> > > +}
> > > +
> > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > +  loff_t offset, loff_t len)
> > > +{
> > > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > > +   struct file *memfd = data->memfd;
> > > +   int ret;
> > > +
> > > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > +   return -EINVAL;
> > > +   }
> > > +
> > > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> >
> > I think that shmem_file_operations.fallocate is only set if
> > CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
> > initialization that fallocate is set, or maybe a config dependency, or
> > can we count on it always being enabled?
>
> It is already there:
>
> config MEMFD_CREATE
> def_bool TMPFS || HUGETLBFS
>
> And we reject inaccessible memfd_create() for HUGETLBFS.
>
> But if we go with a separate syscall, yes, we need the dependency.

I missed that, thanks.

>
> > > +   inaccessible_notifier_invalidate(data, offset, offset + len);
> > > +   return ret;
> > > +}
> > > +
> >
> > <...>
> >
> > > +void inaccessible_register_notifier(struct file *file,
> > > +   struct inaccessible_notifier 
> > > *notifier)
> > > +{
> > > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > > +
> > > +   mutex_lock(>lock);
> > > +   list_add(>list, >notifiers);
> > > +   mutex_unlock(>lock);
> > > +}
> > > +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> >
> > If the memfd wasn't marked as inaccessible, or more generally
> > speaking, if the file isn't a memfd_inaccessible file, this ends up
> > accessing an uninitialized pointer for the notifier list. Should there
> > be a check for that here, and have this function return an error if
> > that's not the case?
>
> I think it is "don't do that" category. inaccessible_register_notifier()
> caller has to know what file it operates on, no?

The thing is, you could oops the kernel from userspace. For that, all
you have to do is a memfd_create without the MFD_INACCESSIBLE,
followed by a KVM_SET_USER_MEMORY_REGION using that as the private_fd.
I ran into this using my port of this patch series to arm64.

Cheers,
/fuad


> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-30 Thread Fuad Tabba

Hi,

On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson  wrote:
>
> On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > Hi,
> >
> > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng  
> > wrote:
> > >
> > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > >
> > > > >   1. Not be supported for TDX or SEV-SNP because they don't allow 
> > > > > adding non-zero
> > > > >  memory into the guest (after pre-boot phase).
> > > > >
> > > > >   2. Be mutually exclusive with shared<=>private conversions, and is 
> > > > > allowed if
> > > > >  and only if the entire gfn range of the associated memslot is 
> > > > > shared.
> > > >
> > > > In general I think that this would work with pKVM. However, limiting
> > > > private<->shared conversions to the granularity of a whole memslot
> > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > shares back its restricted DMA pool with the host it does so at the
> > > > page-level.
>
> Y'all are killing me :-)

 :D

> Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt share 
> at
> granularity X"?  With KVM's newfangled scalable memslots and per-vCPU MRU 
> slot,
> X doesn't even have to be that high to get reasonable performance, e.g. 
> assuming
> the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed 
> to
> work just fine in KVM.

The guest is potentially enlightened, but the host doesn't necessarily
know which memslot the guest might want to share back, since it
doesn't know where the guest might want to place the DMA pool. If I
understand this correctly, for this to work, all memslots would need
to be the same size and sharing would always need to happen at that
granularity.

Moreover, for something like a small DMA pool this might scale, but
I'm not sure about potential future workloads (e.g., multimedia
in-place sharing).

>
> > > > pKVM would also need a way to make an fd accessible again
> > > > when shared back, which I think isn't possible with this patch.
> > >
> > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > be the same issue.
> >
> > pKVM doesn't really need to unmap the memory. What is really important
> > is that the memory is not GUP'able.
>
> Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> otherwise KVM wouldn't be able to get the PFN to map into guest memory.
>
> The problem is that gup() and "mapped" are tied together.  So yes, pKVM 
> doesn't
> strictly need to unmap memory _in the untrusted host_, but since 
> mapped==guppable,
> the end result is the same.
>
> Emphasis above because pKVM still needs unmap the memory _somehwere_.  IIUC, 
> the
> current approach is to do that only in the stage-2 page tables, i.e. only in 
> the
> context of the hypervisor.  Which is also the source of the gup() problems; 
> the
> untrusted kernel is blissfully unaware that the memory is inaccessible.
>
> Any approach that moves some of that information into the untrusted kernel so 
> that
> the kernel can protect itself will incur fragmentation in the VMAs.  Well, 
> unless
> all of guest memory becomes unguppable, but that's likely not a viable option.

Actually, for pKVM, there is no need for the guest memory to be
GUP'able at all if we use the new inaccessible_get_pfn(). This of
course goes back to what I'd mentioned before in v7; it seems that
representing the memslot memory as a file descriptor should be
orthogonal to whether the memory is shared or private, rather than a
private_fd for private memory and the userspace_addr for shared
memory. The host can then map or unmap the shared/private memory using
the fd, which allows it more freedom in even choosing to unmap shared
memory when not needed, for example.

Cheers,
/fuad

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-30 Thread Kirill A . Shutemov

On Fri, Sep 30, 2022 at 05:14:00PM +0100, Fuad Tabba wrote:
> Hi,
> 
> <...>
> 
> > diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> > new file mode 100644
> > index ..2d33cbdd9282
> > --- /dev/null
> > +++ b/mm/memfd_inaccessible.c
> > @@ -0,0 +1,219 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "linux/sbitmap.h"
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +struct inaccessible_data {
> > +   struct mutex lock;
> > +   struct file *memfd;
> > +   struct list_head notifiers;
> > +};
> > +
> > +static void inaccessible_notifier_invalidate(struct inaccessible_data 
> > *data,
> > +pgoff_t start, pgoff_t end)
> > +{
> > +   struct inaccessible_notifier *notifier;
> > +
> > +   mutex_lock(>lock);
> > +   list_for_each_entry(notifier, >notifiers, list) {
> > +   notifier->ops->invalidate(notifier, start, end);
> > +   }
> > +   mutex_unlock(>lock);
> > +}
> > +
> > +static int inaccessible_release(struct inode *inode, struct file *file)
> > +{
> > +   struct inaccessible_data *data = inode->i_mapping->private_data;
> > +
> > +   fput(data->memfd);
> > +   kfree(data);
> > +   return 0;
> > +}
> > +
> > +static long inaccessible_fallocate(struct file *file, int mode,
> > +  loff_t offset, loff_t len)
> > +{
> > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > +   struct file *memfd = data->memfd;
> > +   int ret;
> > +
> > +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> > +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > +   return -EINVAL;
> > +   }
> > +
> > +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> 
> I think that shmem_file_operations.fallocate is only set if
> CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
> initialization that fallocate is set, or maybe a config dependency, or
> can we count on it always being enabled?

It is already there:

config MEMFD_CREATE
def_bool TMPFS || HUGETLBFS

And we reject inaccessible memfd_create() for HUGETLBFS.

But if we go with a separate syscall, yes, we need the dependency.

> > +   inaccessible_notifier_invalidate(data, offset, offset + len);
> > +   return ret;
> > +}
> > +
> 
> <...>
> 
> > +void inaccessible_register_notifier(struct file *file,
> > +   struct inaccessible_notifier *notifier)
> > +{
> > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > +
> > +   mutex_lock(>lock);
> > +   list_add(>list, >notifiers);
> > +   mutex_unlock(>lock);
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);
> 
> If the memfd wasn't marked as inaccessible, or more generally
> speaking, if the file isn't a memfd_inaccessible file, this ends up
> accessing an uninitialized pointer for the notifier list. Should there
> be a check for that here, and have this function return an error if
> that's not the case?

I think it is "don't do that" category. inaccessible_register_notifier()
caller has to know what file it operates on, no?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-30 Thread Fuad Tabba

Hi,

<...>

> diff --git a/mm/memfd_inaccessible.c b/mm/memfd_inaccessible.c
> new file mode 100644
> index ..2d33cbdd9282
> --- /dev/null
> +++ b/mm/memfd_inaccessible.c
> @@ -0,0 +1,219 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/sbitmap.h"
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct inaccessible_data {
> +   struct mutex lock;
> +   struct file *memfd;
> +   struct list_head notifiers;
> +};
> +
> +static void inaccessible_notifier_invalidate(struct inaccessible_data *data,
> +pgoff_t start, pgoff_t end)
> +{
> +   struct inaccessible_notifier *notifier;
> +
> +   mutex_lock(>lock);
> +   list_for_each_entry(notifier, >notifiers, list) {
> +   notifier->ops->invalidate(notifier, start, end);
> +   }
> +   mutex_unlock(>lock);
> +}
> +
> +static int inaccessible_release(struct inode *inode, struct file *file)
> +{
> +   struct inaccessible_data *data = inode->i_mapping->private_data;
> +
> +   fput(data->memfd);
> +   kfree(data);
> +   return 0;
> +}
> +
> +static long inaccessible_fallocate(struct file *file, int mode,
> +  loff_t offset, loff_t len)
> +{
> +   struct inaccessible_data *data = file->f_mapping->private_data;
> +   struct file *memfd = data->memfd;
> +   int ret;
> +
> +   if (mode & FALLOC_FL_PUNCH_HOLE) {
> +   if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> +   return -EINVAL;
> +   }
> +
> +   ret = memfd->f_op->fallocate(memfd, mode, offset, len);

I think that shmem_file_operations.fallocate is only set if
CONFIG_TMPFS is enabled (shmem.c). Should there be a check at
initialization that fallocate is set, or maybe a config dependency, or
can we count on it always being enabled?

> +   inaccessible_notifier_invalidate(data, offset, offset + len);
> +   return ret;
> +}
> +

<...>

> +void inaccessible_register_notifier(struct file *file,
> +   struct inaccessible_notifier *notifier)
> +{
> +   struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +   mutex_lock(>lock);
> +   list_add(>list, >notifiers);
> +   mutex_unlock(>lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_register_notifier);

If the memfd wasn't marked as inaccessible, or more generally
speaking, if the file isn't a memfd_inaccessible file, this ends up
accessing an uninitialized pointer for the notifier list. Should there
be a check for that here, and have this function return an error if
that's not the case?

Thanks,
/fuad



> +
> +void inaccessible_unregister_notifier(struct file *file,
> + struct inaccessible_notifier *notifier)
> +{
> +   struct inaccessible_data *data = file->f_mapping->private_data;
> +
> +   mutex_lock(>lock);
> +   list_del(>list);
> +   mutex_unlock(>lock);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_unregister_notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +int *order)
> +{
> +   struct inaccessible_data *data = file->f_mapping->private_data;
> +   struct file *memfd = data->memfd;
> +   struct page *page;
> +   int ret;
> +
> +   ret = shmem_getpage(file_inode(memfd), offset, , SGP_WRITE);
> +   if (ret)
> +   return ret;
> +
> +   *pfn = page_to_pfn_t(page);
> +   *order = thp_order(compound_head(page));
> +   SetPageUptodate(page);
> +   unlock_page(page);
> +
> +   return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> +   struct page *page = pfn_t_to_page(pfn);
> +
> +   if (WARN_ON_ONCE(!page))
> +   return;
> +
> +   put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> --
> 2.25.1
>

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-28 Thread Kirill A. Shutemov

On Tue, Sep 27, 2022 at 11:23:24PM +, Sean Christopherson wrote:
> On Mon, Sep 26, 2022, David Hildenbrand wrote:
> > On 26.09.22 16:48, Kirill A. Shutemov wrote:
> > > On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> > > > When using DAX, what happens with the shared <->private conversion? 
> > > > Which
> > > > "type" is supposed to use dax, which not?
> > > > 
> > > > In other word, I'm missing too many details on the bigger picture of how
> > > > this would work at all to see why it makes sense right now to prepare 
> > > > for
> > > > that.
> > > 
> > > IIUC, KVM doesn't really care about pages or folios. They need PFN to
> > > populate SEPT. Returning page/folio would make KVM do additional steps to
> > > extract PFN and one more place to have a bug.
> > 
> > Fair enough. Smells KVM specific, though.
> 
> TL;DR: I'm good with either approach, though providing a "struct page" might 
> avoid
>refactoring the API in the nearish future.
> 
> Playing devil's advocate for a second, the counter argument is that KVM is the
> only user for the foreseeable future.
> 
> That said, it might make sense to return a "struct page" from the core API and
> force KVM to do page_to_pfn().  KVM already does that for HVA-based memory, so
> it's not exactly new code.

Core MM tries to move away from struct page in favour of struct folio. We
can make interface return folio.

But it would require more work on KVM side.

folio_pfn(folio) + offset % folio_nr_pages(folio) would give you PFN for
base-pagesize PFN for given offset. I guess it is not too hard.

It also gives KVM capability to populate multiple EPT entries for non-zero
order folio and save few cycles.

Does it work for you?

> More importantly, KVM may actually need/want the "struct page" in the 
> not-too-distant
> future to support mapping non-refcounted "struct page" memory into the guest. 
>  The
> ChromeOS folks have a use case involving virtio-gpu blobs where KVM can get 
> handed a
> "struct page" that _isn't_ refcounted[*].  Once the lack of mmu_notifier 
> integration
> is fixed, the remaining issue is that KVM doesn't currently have a way to 
> determine
> whether or not it holds a reference to the page.  Instead, KVM assumes that 
> if the
> page is "normal", it's refcounted, e.g. see kvm_release_pfn_clean().
> 
> KVM's current workaround for this is to refuse to map these pages into the 
> guest,
> i.e. KVM simply forces its assumption that normal pages are refcounted to be 
> true.
> To remove that workaround, the likely solution will be to pass around a tuple 
> of
> page+pfn, where "page" is non-NULL if the pfn is a refcounted "struct page".
> 
> At that point, getting handed a "struct page" from the core API would be a 
> good
> thing as KVM wouldn't need to probe the PFN to determine whether or not it's a
> refcounted page.
> 
> Note, I still want the order to be provided by the API so that KVM doesn't 
> need
> to run through a bunch of helpers to try and figure out the allowed mapping 
> size.
> 
> [*] 
> https://lore.kernel.org/all/CAD=HUj736L5oxkzeL2JoPV8g1S6Rugy_TquW=prt73ymfzp...@mail.gmail.com

These non-refcounted "struct page" confuses me.

IIUC (probably not), the idea is to share a buffer between host and guest
and avoid double buffering in page cache on the guest ("guest shadow
buffer" means page cache, right?). Don't we already have DAX interfaces to
bypass guest page cache?

And do you think it would need to be handled on inaccessible API lavel or
is it KVM-only thing that uses inaccessible API for some use-cases?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-27 Thread Sean Christopherson

On Mon, Sep 26, 2022, David Hildenbrand wrote:
> On 26.09.22 16:48, Kirill A. Shutemov wrote:
> > On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> > > When using DAX, what happens with the shared <->private conversion? Which
> > > "type" is supposed to use dax, which not?
> > > 
> > > In other word, I'm missing too many details on the bigger picture of how
> > > this would work at all to see why it makes sense right now to prepare for
> > > that.
> > 
> > IIUC, KVM doesn't really care about pages or folios. They need PFN to
> > populate SEPT. Returning page/folio would make KVM do additional steps to
> > extract PFN and one more place to have a bug.
> 
> Fair enough. Smells KVM specific, though.

TL;DR: I'm good with either approach, though providing a "struct page" might 
avoid
   refactoring the API in the nearish future.

Playing devil's advocate for a second, the counter argument is that KVM is the
only user for the foreseeable future.

That said, it might make sense to return a "struct page" from the core API and
force KVM to do page_to_pfn().  KVM already does that for HVA-based memory, so
it's not exactly new code.

More importantly, KVM may actually need/want the "struct page" in the 
not-too-distant
future to support mapping non-refcounted "struct page" memory into the guest.  
The
ChromeOS folks have a use case involving virtio-gpu blobs where KVM can get 
handed a
"struct page" that _isn't_ refcounted[*].  Once the lack of mmu_notifier 
integration
is fixed, the remaining issue is that KVM doesn't currently have a way to 
determine
whether or not it holds a reference to the page.  Instead, KVM assumes that if 
the
page is "normal", it's refcounted, e.g. see kvm_release_pfn_clean().

KVM's current workaround for this is to refuse to map these pages into the 
guest,
i.e. KVM simply forces its assumption that normal pages are refcounted to be 
true.
To remove that workaround, the likely solution will be to pass around a tuple of
page+pfn, where "page" is non-NULL if the pfn is a refcounted "struct page".

At that point, getting handed a "struct page" from the core API would be a good
thing as KVM wouldn't need to probe the PFN to determine whether or not it's a
refcounted page.

Note, I still want the order to be provided by the API so that KVM doesn't need
to run through a bunch of helpers to try and figure out the allowed mapping 
size.

[*] 
https://lore.kernel.org/all/CAD=HUj736L5oxkzeL2JoPV8g1S6Rugy_TquW=prt73ymfzp...@mail.gmail.com

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-27 Thread Sean Christopherson

On Mon, Sep 26, 2022, Fuad Tabba wrote:
> Hi,
> 
> On Mon, Sep 26, 2022 at 3:28 PM Chao Peng  wrote:
> >
> > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > >
> > > >   1. Not be supported for TDX or SEV-SNP because they don't allow 
> > > > adding non-zero
> > > >  memory into the guest (after pre-boot phase).
> > > >
> > > >   2. Be mutually exclusive with shared<=>private conversions, and is 
> > > > allowed if
> > > >  and only if the entire gfn range of the associated memslot is 
> > > > shared.
> > >
> > > In general I think that this would work with pKVM. However, limiting
> > > private<->shared conversions to the granularity of a whole memslot
> > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > concept of memslots. For example, in pKVM right now, when a guest
> > > shares back its restricted DMA pool with the host it does so at the
> > > page-level.

Y'all are killing me :-)

Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt share at
granularity X"?  With KVM's newfangled scalable memslots and per-vCPU MRU slot,
X doesn't even have to be that high to get reasonable performance, e.g. assuming
the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
work just fine in KVM.

> > > pKVM would also need a way to make an fd accessible again
> > > when shared back, which I think isn't possible with this patch.
> >
> > But does pKVM really want to mmap/munmap a new region at the page-level,
> > that can cause VMA fragmentation if the conversion is frequent as I see.
> > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > be the same issue.
> 
> pKVM doesn't really need to unmap the memory. What is really important
> is that the memory is not GUP'able.

Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
otherwise KVM wouldn't be able to get the PFN to map into guest memory.

The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
strictly need to unmap memory _in the untrusted host_, but since 
mapped==guppable,
the end result is the same.

Emphasis above because pKVM still needs unmap the memory _somehwere_.  IIUC, the
current approach is to do that only in the stage-2 page tables, i.e. only in the
context of the hypervisor.  Which is also the source of the gup() problems; the
untrusted kernel is blissfully unaware that the memory is inaccessible.

Any approach that moves some of that information into the untrusted kernel so 
that
the kernel can protect itself will incur fragmentation in the VMAs.  Well, 
unless
all of guest memory becomes unguppable, but that's likely not a viable option.

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-26 Thread Fuad Tabba

Hi,

On Mon, Sep 26, 2022 at 3:28 PM Chao Peng  wrote:
>
> On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > Regarding pKVM's use case, with the shim approach I believe this can be 
> > > done by
> > > allowing userspace mmap() the "hidden" memfd, but with a ton of 
> > > restrictions
> > > piled on top.
> > >
> > > My first thought was to make the uAPI a set of KVM ioctls so that KVM 
> > > could tightly
> > > tightly control usage without taking on too much complexity in the 
> > > kernel, but
> > > working through things, routing the behavior through the shim itself 
> > > might not be
> > > all that horrific.
> > >
> > > IIRC, we discarded the idea of allowing userspace to map the "private" fd 
> > > because
> > > things got too complex, but with the shim it doesn't seem _that_ bad.
> > >
> > > E.g. on the memfd side:
> > >
> > >   1. The entire memfd must be mapped, and at most one mapping is allowed, 
> > > i.e.
> > >  mapping is all or nothing.
> > >
> > >   2. Acquiring a reference via get_pfn() is disallowed if there's a 
> > > mapping for
> > >  the restricted memfd.
> > >
> > >   3. Add notifier hooks to allow downstream users to further restrict 
> > > things.
> > >
> > >   4. Disallow splitting VMAs, e.g. to force userspace to munmap() 
> > > everything in
> > >  one shot.
> > >
> > >   5. Require that there are no outstanding references at munmap().  Or if 
> > > this
> > >  can't be guaranteed by userspace, maybe add some way for userspace 
> > > to wait
> > >  until it's ok to convert to private?  E.g. so that get_pfn() doesn't 
> > > need
> > >  to do an expensive check every time.
> > >
> > >   static int memfd_restricted_mmap(struct file *file, struct 
> > > vm_area_struct *vma)
> > >   {
> > > if (vma->vm_pgoff)
> > > return -EINVAL;
> > >
> > > if ((vma->vm_end - vma->vm_start) != )
> > > return -EINVAL;
> > >
> > > mutex_lock(>lock);
> > >
> > > if (data->has_mapping) {
> > > r = -EINVAL;
> > > goto err;
> > > }
> > > list_for_each_entry(notifier, >notifiers, list) {
> > > r = notifier->ops->mmap_start(notifier, ...);
> > > if (r)
> > > goto abort;
> > > }
> > >
> > > notifier->ops->mmap_end(notifier, ...);
> > > mutex_unlock(>lock);
> > > return 0;
> > >
> > >   abort:
> > > list_for_each_entry_continue_reverse(notifier >notifiers, 
> > > list)
> > > notifier->ops->mmap_abort(notifier, ...);
> > >   err:
> > > mutex_unlock(>lock);
> > > return r;
> > >   }
> > >
> > >   static void memfd_restricted_close(struct vm_area_struct *vma)
> > >   {
> > > mutex_lock(...);
> > >
> > > /*
> > >  * Destroy the memfd and disable all future accesses if there are
> > >  * outstanding refcounts (or other unsatisfied restrictions?).
> > >  */
> > > if ( || ???)
> > > memfd_restricted_destroy(...);
> > > else
> > > data->has_mapping = false;
> > >
> > > mutex_unlock(...);
> > >   }
> > >
> > >   static int memfd_restricted_may_split(struct vm_area_struct *area, 
> > > unsigned long addr)
> > >   {
> > > return -EINVAL;
> > >   }
> > >
> > >   static int memfd_restricted_mapping_mremap(struct vm_area_struct 
> > > *new_vma)
> > >   {
> > > return -EINVAL;
> > >   }
> > >
> > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > >
> > >   1. Not be supported for TDX or SEV-SNP because they don't allow adding 
> > > non-zero
> > >  memory into the guest (after pre-boot phase).
> > >
> > >   2. Be mutually exclusive with shared<=>private conversions, and is 
> > > allowed if
> > >  and only if the entire gfn range of the associated memslot is shared.
> >
> > In general I think that this would work with pKVM. However, limiting
> > private<->shared conversions to the granularity of a whole memslot
> > might be difficult to handle in pKVM, since the guest doesn't have the
> > concept of memslots. For example, in pKVM right now, when a guest
> > shares back its restricted DMA pool with the host it does so at the
> > page-level. pKVM would also need a way to make an fd accessible again
> > when shared back, which I think isn't possible with this patch.
>
> But does pKVM really want to mmap/munmap a new region at the page-level,
> that can cause VMA fragmentation if the conversion is frequent as I see.
> Even with a KVM ioctl for mapping as mentioned below, I think there will
> be the same issue.

pKVM doesn't really need to unmap the memory. What is really important
is that the memory is not GUP'able. Having private memory mapped and
then accessed by a misbehaving/malicious process will reinject a fault
into the misbehaving process.

Cheers,
/fuad

> >
> > You were initially considering a KVM

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-26 Thread David Hildenbrand


On 26.09.22 16:48, Kirill A. Shutemov wrote:

On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:

On 23.09.22 02:58, Kirill A . Shutemov wrote:

On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..9d066be3d7e8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
#define DMA_BUF_MAGIC   0x444d4142  /* "DMAB" */
#define DEVMEM_MAGIC0x454d444d  /* "DMEM" */
#define SECRETMEM_MAGIC 0x5345434d  /* "SECM" */
+#define INACCESSIBLE_MAGIC 0x494e4143  /* "INAC" */



[...]


+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+int *order)
+{
+   struct inaccessible_data *data = file->f_mapping->private_data;
+   struct file *memfd = data->memfd;
+   struct page *page;
+   int ret;
+
+   ret = shmem_getpage(file_inode(memfd), offset, , SGP_WRITE);
+   if (ret)
+   return ret;
+
+   *pfn = page_to_pfn_t(page);
+   *order = thp_order(compound_head(page));
+   SetPageUptodate(page);
+   unlock_page(page);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
+
+void inaccessible_put_pfn(struct file *file, pfn_t pfn)
+{
+   struct page *page = pfn_t_to_page(pfn);
+
+   if (WARN_ON_ONCE(!page))
+   return;
+
+   put_page(page);
+}
+EXPORT_SYMBOL_GPL(inaccessible_put_pfn);


Sorry, I missed your reply regarding get/put interface.

https://lore.kernel.org/linux-mm/20220810092532.gd862...@chaop.bj.intel.com/

"We have a design assumption that somedays this can even support non-page
based backing stores."

As long as there is no such user in sight (especially how to get the memfd
from even allocating such memory which will require bigger changes), I
prefer to keep it simple here and work on pages/folios. No need to
over-complicate it for now.


Sean, Paolo , what is your take on this? Do you have conrete use case of
pageless backend for the mechanism in sight? Maybe DAX?


The problem I'm having with this is how to actually get such memory into the
memory backend (that triggers notifiers) and what the semantics are at all
with memory that is not managed by the buddy.

memfd with fixed PFNs doesn't make too much sense.


What do you mean by "fixed PFN". It is as fixed as struct page/folio, no?
PFN covers more possible backends.


For DAX, you usually bypass the buddy and map /dev/mem or a devdax. In 
contrast to ordinary memfd that allocates memory via the buddy. That's 
the difference I see -- and I wonder how it could work.





When using DAX, what happens with the shared <->private conversion? Which
"type" is supposed to use dax, which not?

In other word, I'm missing too many details on the bigger picture of how
this would work at all to see why it makes sense right now to prepare for
that.


IIUC, KVM doesn't really care about pages or folios. They need PFN to
populate SEPT. Returning page/folio would make KVM do additional steps to
extract PFN and one more place to have a bug.


Fair enough. Smells KVM specific, though.

--
Thanks,

David / dhildenb

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-26 Thread Chao Peng

On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > Regarding pKVM's use case, with the shim approach I believe this can be 
> > done by
> > allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> > piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM could 
> > tightly
> > tightly control usage without taking on too much complexity in the kernel, 
> > but
> > working through things, routing the behavior through the shim itself might 
> > not be
> > all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private" fd 
> > because
> > things got too complex, but with the shim it doesn't seem _that_ bad.
> >
> > E.g. on the memfd side:
> >
> >   1. The entire memfd must be mapped, and at most one mapping is allowed, 
> > i.e.
> >  mapping is all or nothing.
> >
> >   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping 
> > for
> >  the restricted memfd.
> >
> >   3. Add notifier hooks to allow downstream users to further restrict 
> > things.
> >
> >   4. Disallow splitting VMAs, e.g. to force userspace to munmap() 
> > everything in
> >  one shot.
> >
> >   5. Require that there are no outstanding references at munmap().  Or if 
> > this
> >  can't be guaranteed by userspace, maybe add some way for userspace to 
> > wait
> >  until it's ok to convert to private?  E.g. so that get_pfn() doesn't 
> > need
> >  to do an expensive check every time.
> >
> >   static int memfd_restricted_mmap(struct file *file, struct vm_area_struct 
> > *vma)
> >   {
> > if (vma->vm_pgoff)
> > return -EINVAL;
> >
> > if ((vma->vm_end - vma->vm_start) != )
> > return -EINVAL;
> >
> > mutex_lock(>lock);
> >
> > if (data->has_mapping) {
> > r = -EINVAL;
> > goto err;
> > }
> > list_for_each_entry(notifier, >notifiers, list) {
> > r = notifier->ops->mmap_start(notifier, ...);
> > if (r)
> > goto abort;
> > }
> >
> > notifier->ops->mmap_end(notifier, ...);
> > mutex_unlock(>lock);
> > return 0;
> >
> >   abort:
> > list_for_each_entry_continue_reverse(notifier >notifiers, 
> > list)
> > notifier->ops->mmap_abort(notifier, ...);
> >   err:
> > mutex_unlock(>lock);
> > return r;
> >   }
> >
> >   static void memfd_restricted_close(struct vm_area_struct *vma)
> >   {
> > mutex_lock(...);
> >
> > /*
> >  * Destroy the memfd and disable all future accesses if there are
> >  * outstanding refcounts (or other unsatisfied restrictions?).
> >  */
> > if ( || ???)
> > memfd_restricted_destroy(...);
> > else
> > data->has_mapping = false;
> >
> > mutex_unlock(...);
> >   }
> >
> >   static int memfd_restricted_may_split(struct vm_area_struct *area, 
> > unsigned long addr)
> >   {
> > return -EINVAL;
> >   }
> >
> >   static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
> >   {
> > return -EINVAL;
> >   }
> >
> > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> >
> >   1. Not be supported for TDX or SEV-SNP because they don't allow adding 
> > non-zero
> >  memory into the guest (after pre-boot phase).
> >
> >   2. Be mutually exclusive with shared<=>private conversions, and is 
> > allowed if
> >  and only if the entire gfn range of the associated memslot is shared.
> 
> In general I think that this would work with pKVM. However, limiting
> private<->shared conversions to the granularity of a whole memslot
> might be difficult to handle in pKVM, since the guest doesn't have the
> concept of memslots. For example, in pKVM right now, when a guest
> shares back its restricted DMA pool with the host it does so at the
> page-level. pKVM would also need a way to make an fd accessible again
> when shared back, which I think isn't possible with this patch.

But does pKVM really want to mmap/munmap a new region at the page-level,
that can cause VMA fragmentation if the conversion is frequent as I see.
Even with a KVM ioctl for mapping as mentioned below, I think there will
be the same issue.

> 
> You were initially considering a KVM ioctl for mapping, which might be
> better suited for this since KVM knows which pages are shared and
> which ones are private. So routing things through KVM might simplify
> things and allow it to enforce all the necessary restrictions (e.g.,
> private memory cannot be mapped). What do you think?
> 
> Thanks,
> /fuad

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-26 Thread Kirill A. Shutemov

On Mon, Sep 26, 2022 at 12:35:34PM +0200, David Hildenbrand wrote:
> On 23.09.22 02:58, Kirill A . Shutemov wrote:
> > On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
> > > > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > > > index 6325d1d0e90f..9d066be3d7e8 100644
> > > > --- a/include/uapi/linux/magic.h
> > > > +++ b/include/uapi/linux/magic.h
> > > > @@ -101,5 +101,6 @@
> > > >#define DMA_BUF_MAGIC0x444d4142  /* "DMAB" */
> > > >#define DEVMEM_MAGIC 0x454d444d  /* "DMEM" */
> > > >#define SECRETMEM_MAGIC  0x5345434d  /* "SECM" */
> > > > +#define INACCESSIBLE_MAGIC 0x494e4143  /* "INAC" */
> > > 
> > > 
> > > [...]
> > > 
> > > > +
> > > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > > +int *order)
> > > > +{
> > > > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > > > +   struct file *memfd = data->memfd;
> > > > +   struct page *page;
> > > > +   int ret;
> > > > +
> > > > +   ret = shmem_getpage(file_inode(memfd), offset, , 
> > > > SGP_WRITE);
> > > > +   if (ret)
> > > > +   return ret;
> > > > +
> > > > +   *pfn = page_to_pfn_t(page);
> > > > +   *order = thp_order(compound_head(page));
> > > > +   SetPageUptodate(page);
> > > > +   unlock_page(page);
> > > > +
> > > > +   return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> > > > +
> > > > +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> > > > +{
> > > > +   struct page *page = pfn_t_to_page(pfn);
> > > > +
> > > > +   if (WARN_ON_ONCE(!page))
> > > > +   return;
> > > > +
> > > > +   put_page(page);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> > > 
> > > Sorry, I missed your reply regarding get/put interface.
> > > 
> > > https://lore.kernel.org/linux-mm/20220810092532.gd862...@chaop.bj.intel.com/
> > > 
> > > "We have a design assumption that somedays this can even support non-page
> > > based backing stores."
> > > 
> > > As long as there is no such user in sight (especially how to get the memfd
> > > from even allocating such memory which will require bigger changes), I
> > > prefer to keep it simple here and work on pages/folios. No need to
> > > over-complicate it for now.
> > 
> > Sean, Paolo , what is your take on this? Do you have conrete use case of
> > pageless backend for the mechanism in sight? Maybe DAX?
> 
> The problem I'm having with this is how to actually get such memory into the
> memory backend (that triggers notifiers) and what the semantics are at all
> with memory that is not managed by the buddy.
> 
> memfd with fixed PFNs doesn't make too much sense.

What do you mean by "fixed PFN". It is as fixed as struct page/folio, no?
PFN covers more possible backends.

> When using DAX, what happens with the shared <->private conversion? Which
> "type" is supposed to use dax, which not?
> 
> In other word, I'm missing too many details on the bigger picture of how
> this would work at all to see why it makes sense right now to prepare for
> that.

IIUC, KVM doesn't really care about pages or folios. They need PFN to
populate SEPT. Returning page/folio would make KVM do additional steps to
extract PFN and one more place to have a bug.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-26 Thread David Hildenbrand


On 23.09.22 02:58, Kirill A . Shutemov wrote:

On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..9d066be3d7e8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
   #define DMA_BUF_MAGIC0x444d4142  /* "DMAB" */
   #define DEVMEM_MAGIC 0x454d444d  /* "DMEM" */
   #define SECRETMEM_MAGIC  0x5345434d  /* "SECM" */
+#define INACCESSIBLE_MAGIC 0x494e4143  /* "INAC" */



[...]


+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+int *order)
+{
+   struct inaccessible_data *data = file->f_mapping->private_data;
+   struct file *memfd = data->memfd;
+   struct page *page;
+   int ret;
+
+   ret = shmem_getpage(file_inode(memfd), offset, , SGP_WRITE);
+   if (ret)
+   return ret;
+
+   *pfn = page_to_pfn_t(page);
+   *order = thp_order(compound_head(page));
+   SetPageUptodate(page);
+   unlock_page(page);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
+
+void inaccessible_put_pfn(struct file *file, pfn_t pfn)
+{
+   struct page *page = pfn_t_to_page(pfn);
+
+   if (WARN_ON_ONCE(!page))
+   return;
+
+   put_page(page);
+}
+EXPORT_SYMBOL_GPL(inaccessible_put_pfn);


Sorry, I missed your reply regarding get/put interface.

https://lore.kernel.org/linux-mm/20220810092532.gd862...@chaop.bj.intel.com/

"We have a design assumption that somedays this can even support non-page
based backing stores."

As long as there is no such user in sight (especially how to get the memfd
from even allocating such memory which will require bigger changes), I
prefer to keep it simple here and work on pages/folios. No need to
over-complicate it for now.


Sean, Paolo , what is your take on this? Do you have conrete use case of
pageless backend for the mechanism in sight? Maybe DAX?


The problem I'm having with this is how to actually get such memory into 
the memory backend (that triggers notifiers) and what the semantics are 
at all with memory that is not managed by the buddy.


memfd with fixed PFNs doesn't make too much sense.

When using DAX, what happens with the shared <->private conversion? 
Which "type" is supposed to use dax, which not?


In other word, I'm missing too many details on the bigger picture of how 
this would work at all to see why it makes sense right now to prepare 
for that.


--
Thanks,

David / dhildenb

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-23 Thread Fuad Tabba

Hi,

On Fri, Sep 23, 2022 at 1:53 AM Kirill A . Shutemov
 wrote:
>
> On Thu, Sep 22, 2022 at 07:49:18PM +, Sean Christopherson wrote:
> > On Thu, Sep 22, 2022, Wang, Wei W wrote:
> > > On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > > +  int *order)
> > >
> > > Better to remove "order" from this interface?
> >
> > Hard 'no'.
> >
> > > Some callers only need to get pfn, and no need to bother with
> > > defining and inputting something unused. For callers who need the "order",
> > > can easily get it via thp_order(pfn_to_page(pfn)) on their own.
> >
> > That requires (a) assuming the pfn is backed by struct page, and (b) 
> > assuming the
> > struct page is a transparent huge page.  That might be true for the current
> > implementation, but it most certainly will not always be true.
> >
> > KVM originally did things like this, where there was dedicated code for THP 
> > vs.
> > HugeTLB, and it was a mess.  The goal here is very much to avoid repeating 
> > those
> > mistakes.  Have the backing store _tell_ KVM how big the mapping is, don't 
> > force
> > KVM to rediscover the info on its own.
>
> I guess we can allow order pointer to be NULL to cover caller that don't
> need to know the order. Is it useful?

I think that would be useful. In pKVM we don't need to know the order,
and I had to use a dummy variable when porting V7.

Cheers,
/fuad


> --
>   Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-23 Thread Fuad Tabba

Hi,

<...>

> > Regarding pKVM's use case, with the shim approach I believe this can be 
> > done by
> > allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> > piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > could tightly
> > tightly control usage without taking on too much complexity in the
> > kernel, but
> > working through things, routing the behavior through the shim itself
> > might not be
> > all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private"
> > fd because
> > things got too complex, but with the shim it doesn't seem _that_ bad.
>
> What's the exact use case?  Is it just to pre-populate the memory?

Prepopulate memory and access memory that could go back and forth from
being shared to being private.

Cheers,
/fuad



> >
> > E.g. on the memfd side:
> >
> >   1. The entire memfd must be mapped, and at most one mapping is allowed, 
> > i.e.
> >  mapping is all or nothing.
> >
> >   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping 
> > for
> >  the restricted memfd.
> >
> >   3. Add notifier hooks to allow downstream users to further restrict 
> > things.
> >
> >   4. Disallow splitting VMAs, e.g. to force userspace to munmap() 
> > everything in
> >  one shot.
> >
> >   5. Require that there are no outstanding references at munmap().  Or if 
> > this
> >  can't be guaranteed by userspace, maybe add some way for userspace to 
> > wait
> >  until it's ok to convert to private?  E.g. so that get_pfn() doesn't 
> > need
> >  to do an expensive check every time.
>
> Hmm.  I haven't looked at the code to see if this would really work, but I 
> think this could be done more in line with how the rest of the kernel works 
> by using the rmap infrastructure.  When the pKVM memfd is in not-yet-private 
> mode, just let it be mmapped as usual (but don't allow any form of GUP or 
> pinning).  Then have an ioctl to switch to to shared mode that takes locks or 
> sets flags so that no new faults can be serviced and does unmap_mapping_range.
>
> As long as the shim arranges to have its own vm_ops, I don't immediately see 
> any reason this can't work.

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-23 Thread Fuad Tabba

Hi,

On Mon, Sep 19, 2022 at 8:10 PM Sean Christopherson  wrote:
>
> +Will, Marc and Fuad (apologies if I missed other pKVM folks)
>
> On Mon, Sep 19, 2022, David Hildenbrand wrote:
> > On 15.09.22 16:29, Chao Peng wrote:
> > > From: "Kirill A. Shutemov" 
> > >
> > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > virtual address space and then tells KVM to use the virtual address to
> > > setup the mapping in the secondary page table (e.g. EPT).
> > >
> > > With confidential computing technologies like Intel TDX, the
> > > memfd-provided memory may be encrypted with special key for special
> > > software domain (e.g. KVM guest) and is not expected to be directly
> > > accessed by userspace. Precisely, userspace access to such encrypted
> > > memory may lead to host crash so it should be prevented.
> >
> > Initially my thaught was that this whole inaccessible thing is TDX specific
> > and there is no need to force that on other mechanisms. That's why I
> > suggested to not expose this to user space but handle the notifier
> > requirements internally.
> >
> > IIUC now, protected KVM has similar demands. Either access (read/write) of
> > guest RAM would result in a fault and possibly crash the hypervisor (at
> > least not the whole machine IIUC).
>
> Yep.  The missing piece for pKVM is the ability to convert from shared to 
> private
> while preserving the contents, e.g. to hand off a large buffer (hundreds of 
> MiB)
> for processing in the protected VM.  Thoughts on this at the bottom.

Just wanted to mention that for pKVM (arm64), this wouldn't crash the
hypervisor. A userspace access would crash the userspace process since
the hypervisor would inject a fault back. Because of that making it
inaccessible from userspace is good to have, but not really vital for
pKVM. What is important for pKVM is that the guest private memory is
not GUP'able by the host. This is because if it were, it might be
possible for a malicious userspace process (e.g., a malicious vmm) to
trick the host kernel into accessing guest private memory in a context
where it isn’t prepared to handle the fault injected by the
hypervisor. This of course might crash the host.

> > > This patch introduces userspace inaccessible memfd (created with
> > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > in-kernel interface so KVM can directly interact with core-mm without
> > > the need to map the memory into KVM userspace.
> >
> > With secretmem we decided to not add such "concept switch" flags and instead
> > use a dedicated syscall.
> >
>
> I have no personal preference whatsoever between a flag and a dedicated 
> syscall,
> but a dedicated syscall does seem like it would give the kernel a bit more
> flexibility.
>
> > What about memfd_inaccessible()? Especially, sealing and hugetlb are not
> > even supported and it might take a while to support either.
>
> Don't know about sealing, but hugetlb support for "inaccessible" memory needs 
> to
> come sooner than later.  "inaccessible" in quotes because we might want to 
> choose
> a less binary name, e.g. "restricted"?.
>
> Regarding pKVM's use case, with the shim approach I believe this can be done 
> by
> allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> piled on top.
>
> My first thought was to make the uAPI a set of KVM ioctls so that KVM could 
> tightly
> tightly control usage without taking on too much complexity in the kernel, but
> working through things, routing the behavior through the shim itself might 
> not be
> all that horrific.
>
> IIRC, we discarded the idea of allowing userspace to map the "private" fd 
> because
> things got too complex, but with the shim it doesn't seem _that_ bad.
>
> E.g. on the memfd side:
>
>   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
>  mapping is all or nothing.
>
>   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping 
> for
>  the restricted memfd.
>
>   3. Add notifier hooks to allow downstream users to further restrict things.
>
>   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything 
> in
>  one shot.
>
>   5. Require that there are no outstanding references at munmap().  Or if this
>  can't be guaranteed by userspace, maybe add some way for userspace to 
> wait
>  until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
>  to do an expensive check every time.
>
>   static int memfd_restricted_mmap(struct file *file, struct vm_area_struct 
> *vma)
>   {
> if (vma->vm_pgoff)
> return -EINVAL;
>
> if ((vma->vm_end - vma->vm_start) != )
> return -EINVAL;
>
> mutex_lock(>lock);
>
> if (data->has_mapping) {
> r = -EINVAL;
>

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-22 Thread Kirill A . Shutemov

On Mon, Sep 19, 2022 at 11:12:46AM +0200, David Hildenbrand wrote:
> > diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> > index 6325d1d0e90f..9d066be3d7e8 100644
> > --- a/include/uapi/linux/magic.h
> > +++ b/include/uapi/linux/magic.h
> > @@ -101,5 +101,6 @@
> >   #define DMA_BUF_MAGIC 0x444d4142  /* "DMAB" */
> >   #define DEVMEM_MAGIC  0x454d444d  /* "DMEM" */
> >   #define SECRETMEM_MAGIC   0x5345434d  /* "SECM" */
> > +#define INACCESSIBLE_MAGIC 0x494e4143  /* "INAC" */
> 
> 
> [...]
> 
> > +
> > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > +int *order)
> > +{
> > +   struct inaccessible_data *data = file->f_mapping->private_data;
> > +   struct file *memfd = data->memfd;
> > +   struct page *page;
> > +   int ret;
> > +
> > +   ret = shmem_getpage(file_inode(memfd), offset, , SGP_WRITE);
> > +   if (ret)
> > +   return ret;
> > +
> > +   *pfn = page_to_pfn_t(page);
> > +   *order = thp_order(compound_head(page));
> > +   SetPageUptodate(page);
> > +   unlock_page(page);
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> > +
> > +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> > +{
> > +   struct page *page = pfn_t_to_page(pfn);
> > +
> > +   if (WARN_ON_ONCE(!page))
> > +   return;
> > +
> > +   put_page(page);
> > +}
> > +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);
> 
> Sorry, I missed your reply regarding get/put interface.
> 
> https://lore.kernel.org/linux-mm/20220810092532.gd862...@chaop.bj.intel.com/
> 
> "We have a design assumption that somedays this can even support non-page
> based backing stores."
> 
> As long as there is no such user in sight (especially how to get the memfd
> from even allocating such memory which will require bigger changes), I
> prefer to keep it simple here and work on pages/folios. No need to
> over-complicate it for now.

Sean, Paolo , what is your take on this? Do you have conrete use case of
pageless backend for the mechanism in sight? Maybe DAX?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-22 Thread Kirill A . Shutemov

On Thu, Sep 22, 2022 at 07:49:18PM +, Sean Christopherson wrote:
> On Thu, Sep 22, 2022, Wang, Wei W wrote:
> > On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > > +  int *order)
> > 
> > Better to remove "order" from this interface?
> 
> Hard 'no'.
> 
> > Some callers only need to get pfn, and no need to bother with
> > defining and inputting something unused. For callers who need the "order",
> > can easily get it via thp_order(pfn_to_page(pfn)) on their own.
> 
> That requires (a) assuming the pfn is backed by struct page, and (b) assuming 
> the
> struct page is a transparent huge page.  That might be true for the current
> implementation, but it most certainly will not always be true.
> 
> KVM originally did things like this, where there was dedicated code for THP 
> vs.
> HugeTLB, and it was a mess.  The goal here is very much to avoid repeating 
> those
> mistakes.  Have the backing store _tell_ KVM how big the mapping is, don't 
> force
> KVM to rediscover the info on its own.

I guess we can allow order pointer to be NULL to cover caller that don't
need to know the order. Is it useful?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-22 Thread Sean Christopherson

On Thu, Sep 22, 2022, Wang, Wei W wrote:
> On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> > +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> > +int *order)
> 
> Better to remove "order" from this interface?

Hard 'no'.

> Some callers only need to get pfn, and no need to bother with
> defining and inputting something unused. For callers who need the "order",
> can easily get it via thp_order(pfn_to_page(pfn)) on their own.

That requires (a) assuming the pfn is backed by struct page, and (b) assuming 
the
struct page is a transparent huge page.  That might be true for the current
implementation, but it most certainly will not always be true.

KVM originally did things like this, where there was dedicated code for THP vs.
HugeTLB, and it was a mess.  The goal here is very much to avoid repeating those
mistakes.  Have the backing store _tell_ KVM how big the mapping is, don't force
KVM to rediscover the info on its own.

RE: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-22 Thread Wang, Wei W

On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +  int *order)

Better to remove "order" from this interface?
Some callers only need to get pfn, and no need to bother with
defining and inputting something unused. For callers who need the "order",
can easily get it via thp_order(pfn_to_page(pfn)) on their own.

RE: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-22 Thread Wang, Wei W

On Thursday, September 22, 2022 5:11 AM, Andy Lutomirski wrote:
> To: Christopherson,, Sean ; David Hildenbrand
> 
> Cc: Chao Peng ; kvm list
> ; Linux Kernel Mailing List
> ; linux...@kvack.org;
> linux-fsde...@vger.kernel.org; Linux API ;
> linux-...@vger.kernel.org; qemu-devel@nongnu.org; Paolo Bonzini
> ; Jonathan Corbet ; Vitaly
> Kuznetsov ; Wanpeng Li ;
> Jim Mattson ; Joerg Roedel ;
> Thomas Gleixner ; Ingo Molnar ;
> Borislav Petkov ; the arch/x86 maintainers ;
> H. Peter Anvin ; Hugh Dickins ; Jeff
> Layton ; J . Bruce Fields ; Andrew
> Morton ; Shuah Khan ;
> Mike Rapoport ; Steven Price ;
> Maciej S . Szmigiero ; Vlastimil Babka
> ; Vishal Annapurve ; Yu Zhang
> ; Kirill A. Shutemov
> ; Nakajima, Jun ;
> Hansen, Dave ; Andi Kleen ;
> aarca...@redhat.com; ddut...@redhat.com; dhild...@redhat.com; Quentin
> Perret ; Michael Roth ;
> Hocko, Michal ; Muchun Song
> ; Wang, Wei W ;
> Will Deacon ; Marc Zyngier ; Fuad Tabba
> 
> Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible
> memfd
> 
> (please excuse any formatting disasters.  my internet went out as I was
> composing this, and i did my best to rescue it.)
> 
> On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote:
> > +Will, Marc and Fuad (apologies if I missed other pKVM folks)
> >
> > On Mon, Sep 19, 2022, David Hildenbrand wrote:
> >> On 15.09.22 16:29, Chao Peng wrote:
> >> > From: "Kirill A. Shutemov" 
> >> >
> >> > KVM can use memfd-provided memory for guest memory. For normal
> >> > userspace accessible memory, KVM userspace (e.g. QEMU) mmaps the
> >> > memfd into its virtual address space and then tells KVM to use the
> >> > virtual address to setup the mapping in the secondary page table (e.g.
> EPT).
> >> >
> >> > With confidential computing technologies like Intel TDX, the
> >> > memfd-provided memory may be encrypted with special key for special
> >> > software domain (e.g. KVM guest) and is not expected to be directly
> >> > accessed by userspace. Precisely, userspace access to such
> >> > encrypted memory may lead to host crash so it should be prevented.
> >>
> >> Initially my thaught was that this whole inaccessible thing is TDX
> >> specific and there is no need to force that on other mechanisms.
> >> That's why I suggested to not expose this to user space but handle
> >> the notifier requirements internally.
> >>
> >> IIUC now, protected KVM has similar demands. Either access
> >> (read/write) of guest RAM would result in a fault and possibly crash
> >> the hypervisor (at least not the whole machine IIUC).
> >
> > Yep.  The missing piece for pKVM is the ability to convert from shared
> > to private while preserving the contents, e.g. to hand off a large
> > buffer (hundreds of MiB) for processing in the protected VM.  Thoughts
> > on this at the bottom.
> >
> >> > This patch introduces userspace inaccessible memfd (created with
> >> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace
> >> > through ordinary MMU access (e.g. read/write/mmap) but can be
> >> > accessed via in-kernel interface so KVM can directly interact with
> >> > core-mm without the need to map the memory into KVM userspace.
> >>
> >> With secretmem we decided to not add such "concept switch" flags and
> >> instead use a dedicated syscall.
> >>
> >
> > I have no personal preference whatsoever between a flag and a
> > dedicated syscall, but a dedicated syscall does seem like it would
> > give the kernel a bit more flexibility.
> 
> The third option is a device node, e.g. /dev/kvm_secretmem or
> /dev/kvm_tdxmem or similar.  But if we need flags or other details in the
> future, maybe this isn't ideal.
> 
> >
> >> What about memfd_inaccessible()? Especially, sealing and hugetlb are
> >> not even supported and it might take a while to support either.
> >
> > Don't know about sealing, but hugetlb support for "inaccessible"
> > memory needs to come sooner than later.  "inaccessible" in quotes
> > because we might want to choose a less binary name, e.g.
> > "restricted"?.
> >
> > Regarding pKVM's use case, with the shim approach I believe this can
> > be done by allowing userspace mmap() the "hidden" memfd, but with a
> > ton of restrictions piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > could tightly tight

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-21 Thread Andy Lutomirski

(please excuse any formatting disasters.  my internet went out as I was 
composing this, and i did my best to rescue it.)

On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote:
> +Will, Marc and Fuad (apologies if I missed other pKVM folks)
>
> On Mon, Sep 19, 2022, David Hildenbrand wrote:
>> On 15.09.22 16:29, Chao Peng wrote:
>> > From: "Kirill A. Shutemov" 
>> > 
>> > KVM can use memfd-provided memory for guest memory. For normal userspace
>> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
>> > virtual address space and then tells KVM to use the virtual address to
>> > setup the mapping in the secondary page table (e.g. EPT).
>> > 
>> > With confidential computing technologies like Intel TDX, the
>> > memfd-provided memory may be encrypted with special key for special
>> > software domain (e.g. KVM guest) and is not expected to be directly
>> > accessed by userspace. Precisely, userspace access to such encrypted
>> > memory may lead to host crash so it should be prevented.
>> 
>> Initially my thaught was that this whole inaccessible thing is TDX specific
>> and there is no need to force that on other mechanisms. That's why I
>> suggested to not expose this to user space but handle the notifier
>> requirements internally.
>> 
>> IIUC now, protected KVM has similar demands. Either access (read/write) of
>> guest RAM would result in a fault and possibly crash the hypervisor (at
>> least not the whole machine IIUC).
>
> Yep.  The missing piece for pKVM is the ability to convert from shared 
> to private
> while preserving the contents, e.g. to hand off a large buffer 
> (hundreds of MiB)
> for processing in the protected VM.  Thoughts on this at the bottom.
>
>> > This patch introduces userspace inaccessible memfd (created with
>> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
>> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
>> > in-kernel interface so KVM can directly interact with core-mm without
>> > the need to map the memory into KVM userspace.
>> 
>> With secretmem we decided to not add such "concept switch" flags and instead
>> use a dedicated syscall.
>>
>
> I have no personal preference whatsoever between a flag and a dedicated 
> syscall,
> but a dedicated syscall does seem like it would give the kernel a bit more
> flexibility.

The third option is a device node, e.g. /dev/kvm_secretmem or /dev/kvm_tdxmem 
or similar.  But if we need flags or other details in the future, maybe this 
isn't ideal.

>
>> What about memfd_inaccessible()? Especially, sealing and hugetlb are not
>> even supported and it might take a while to support either.
>
> Don't know about sealing, but hugetlb support for "inaccessible" memory 
> needs to
> come sooner than later.  "inaccessible" in quotes because we might want 
> to choose
> a less binary name, e.g. "restricted"?.
>
> Regarding pKVM's use case, with the shim approach I believe this can be done 
> by
> allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
> piled on top.
>
> My first thought was to make the uAPI a set of KVM ioctls so that KVM 
> could tightly
> tightly control usage without taking on too much complexity in the 
> kernel, but
> working through things, routing the behavior through the shim itself 
> might not be
> all that horrific.
>
> IIRC, we discarded the idea of allowing userspace to map the "private" 
> fd because
> things got too complex, but with the shim it doesn't seem _that_ bad.

What's the exact use case?  Is it just to pre-populate the memory?

>
> E.g. on the memfd side:
>
>   1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
>  mapping is all or nothing.
>
>   2. Acquiring a reference via get_pfn() is disallowed if there's a mapping 
> for
>  the restricted memfd.
>
>   3. Add notifier hooks to allow downstream users to further restrict things.
>
>   4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything 
> in
>  one shot.
>
>   5. Require that there are no outstanding references at munmap().  Or if this
>  can't be guaranteed by userspace, maybe add some way for userspace to 
> wait
>  until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
>  to do an expensive check every time.

Hmm.  I haven't looked at the code to see if this would really work, but I 
think this could be done more in line with how the rest of the kernel works by 
using the rmap infrastructure.  When the pKVM memfd is in not-yet-private mode, 
just let it be mmapped as usual (but don't allow any form of GUP or pinning).  
Then have an ioctl to switch to to shared mode that takes locks or sets flags 
so that no new faults can be serviced and does unmap_mapping_range.

As long as the shim arranges to have its own vm_ops, I don't immediately see 
any reason this can't work.

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-19 Thread Sean Christopherson

+Will, Marc and Fuad (apologies if I missed other pKVM folks)

On Mon, Sep 19, 2022, David Hildenbrand wrote:
> On 15.09.22 16:29, Chao Peng wrote:
> > From: "Kirill A. Shutemov" 
> > 
> > KVM can use memfd-provided memory for guest memory. For normal userspace
> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > virtual address space and then tells KVM to use the virtual address to
> > setup the mapping in the secondary page table (e.g. EPT).
> > 
> > With confidential computing technologies like Intel TDX, the
> > memfd-provided memory may be encrypted with special key for special
> > software domain (e.g. KVM guest) and is not expected to be directly
> > accessed by userspace. Precisely, userspace access to such encrypted
> > memory may lead to host crash so it should be prevented.
> 
> Initially my thaught was that this whole inaccessible thing is TDX specific
> and there is no need to force that on other mechanisms. That's why I
> suggested to not expose this to user space but handle the notifier
> requirements internally.
> 
> IIUC now, protected KVM has similar demands. Either access (read/write) of
> guest RAM would result in a fault and possibly crash the hypervisor (at
> least not the whole machine IIUC).

Yep.  The missing piece for pKVM is the ability to convert from shared to 
private
while preserving the contents, e.g. to hand off a large buffer (hundreds of MiB)
for processing in the protected VM.  Thoughts on this at the bottom.

> > This patch introduces userspace inaccessible memfd (created with
> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > in-kernel interface so KVM can directly interact with core-mm without
> > the need to map the memory into KVM userspace.
> 
> With secretmem we decided to not add such "concept switch" flags and instead
> use a dedicated syscall.
>

I have no personal preference whatsoever between a flag and a dedicated syscall,
but a dedicated syscall does seem like it would give the kernel a bit more
flexibility.

> What about memfd_inaccessible()? Especially, sealing and hugetlb are not
> even supported and it might take a while to support either.

Don't know about sealing, but hugetlb support for "inaccessible" memory needs to
come sooner than later.  "inaccessible" in quotes because we might want to 
choose
a less binary name, e.g. "restricted"?.

Regarding pKVM's use case, with the shim approach I believe this can be done by
allowing userspace mmap() the "hidden" memfd, but with a ton of restrictions
piled on top.

My first thought was to make the uAPI a set of KVM ioctls so that KVM could 
tightly
tightly control usage without taking on too much complexity in the kernel, but
working through things, routing the behavior through the shim itself might not 
be
all that horrific.

IIRC, we discarded the idea of allowing userspace to map the "private" fd 
because
things got too complex, but with the shim it doesn't seem _that_ bad.

E.g. on the memfd side:

  1. The entire memfd must be mapped, and at most one mapping is allowed, i.e.
 mapping is all or nothing.

  2. Acquiring a reference via get_pfn() is disallowed if there's a mapping for
 the restricted memfd.

  3. Add notifier hooks to allow downstream users to further restrict things.

  4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything in
 one shot.

  5. Require that there are no outstanding references at munmap().  Or if this
 can't be guaranteed by userspace, maybe add some way for userspace to wait
 until it's ok to convert to private?  E.g. so that get_pfn() doesn't need
 to do an expensive check every time.

  static int memfd_restricted_mmap(struct file *file, struct vm_area_struct 
*vma)
  {
if (vma->vm_pgoff)
return -EINVAL;

if ((vma->vm_end - vma->vm_start) != )
return -EINVAL;

mutex_lock(>lock);

if (data->has_mapping) {
r = -EINVAL;
goto err;
}
list_for_each_entry(notifier, >notifiers, list) {
r = notifier->ops->mmap_start(notifier, ...);
if (r)
goto abort;
}

notifier->ops->mmap_end(notifier, ...);
mutex_unlock(>lock);
return 0;

  abort:
list_for_each_entry_continue_reverse(notifier >notifiers, list)
notifier->ops->mmap_abort(notifier, ...);
  err:
mutex_unlock(>lock);
return r;
  }

  static void memfd_restricted_close(struct vm_area_struct *vma)
  {
mutex_lock(...);

/*
 * Destroy the memfd and disable all future accesses if there are
 * outstanding refcounts (or other unsatisfied restrictions?).
 */
if ( || ???)
memfd_restricted_destroy(...);
else
data->has_mapping = false;

Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-19 Thread David Hildenbrand


On 15.09.22 16:29, Chao Peng wrote:

From: "Kirill A. Shutemov" 

KVM can use memfd-provided memory for guest memory. For normal userspace
accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
virtual address space and then tells KVM to use the virtual address to
setup the mapping in the secondary page table (e.g. EPT).

With confidential computing technologies like Intel TDX, the
memfd-provided memory may be encrypted with special key for special
software domain (e.g. KVM guest) and is not expected to be directly
accessed by userspace. Precisely, userspace access to such encrypted
memory may lead to host crash so it should be prevented.


Initially my thaught was that this whole inaccessible thing is TDX 
specific and there is no need to force that on other mechanisms. That's 
why I suggested to not expose this to user space but handle the notifier 
requirements internally.


IIUC now, protected KVM has similar demands. Either access (read/write) 
of guest RAM would result in a fault and possibly crash the hypervisor 
(at least not the whole machine IIUC).




This patch introduces userspace inaccessible memfd (created with
MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
ordinary MMU access (e.g. read/write/mmap) but can be accessed via
in-kernel interface so KVM can directly interact with core-mm without
the need to map the memory into KVM userspace.


With secretmem we decided to not add such "concept switch" flags and 
instead use a dedicated syscall.


What about memfd_inaccessible()? Especially, sealing and hugetlb are not 
even supported and it might take a while to support either.





It provides semantics required for KVM guest private(encrypted) memory
support that a file descriptor with this flag set is going to be used as
the source of guest memory in confidential computing environments such
as Intel TDX/AMD SEV.

KVM userspace is still in charge of the lifecycle of the memfd. It
should pass the opened fd to KVM. KVM uses the kernel APIs newly added
in this patch to obtain the physical memory address and then populate
the secondary page table entries.

The userspace inaccessible memfd can be fallocate-ed and hole-punched
from userspace. When hole-punching happens, KVM can get notified through
inaccessible_notifier it then gets chance to remove any mapped entries
of the range in the secondary page tables.

The userspace inaccessible memfd itself is implemented as a shim layer
on top of real memory file systems like tmpfs/hugetlbfs but this patch
only implemented tmpfs. The allocated memory is currently marked as
unmovable and unevictable, this is required for current confidential
usage. But in future this might be changed.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Chao Peng 
---
  include/linux/memfd.h  |  24 
  include/uapi/linux/magic.h |   1 +
  include/uapi/linux/memfd.h |   1 +
  mm/Makefile|   2 +-
  mm/memfd.c |  25 -
  mm/memfd_inaccessible.c| 219 +
  6 files changed, 270 insertions(+), 2 deletions(-)
  create mode 100644 mm/memfd_inaccessible.c

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..334ddff08377 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -3,6 +3,7 @@
  #define __LINUX_MEMFD_H
  
  #include 

+#include 
  
  #ifdef CONFIG_MEMFD_CREATE

  extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long 
arg);
@@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int 
c, unsigned long a)
  }
  #endif
  
+struct inaccessible_notifier;

+
+struct inaccessible_notifier_ops {
+   void (*invalidate)(struct inaccessible_notifier *notifier,
+  pgoff_t start, pgoff_t end);
+};
+
+struct inaccessible_notifier {
+   struct list_head list;
+   const struct inaccessible_notifier_ops *ops;
+};
+
+void inaccessible_register_notifier(struct file *file,
+   struct inaccessible_notifier *notifier);
+void inaccessible_unregister_notifier(struct file *file,
+ struct inaccessible_notifier *notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+int *order);
+void inaccessible_put_pfn(struct file *file, pfn_t pfn);
+
+struct file *memfd_mkinaccessible(struct file *memfd);
+
  #endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..9d066be3d7e8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
  #define DMA_BUF_MAGIC 0x444d4142  /* "DMAB" */
  #define DEVMEM_MAGIC  0x454d444d  /* "DMEM" */
  #define SECRETMEM_MAGIC   0x5345434d  /* "SECM" */
+#define INACCESSIBLE_MAGIC 0x494e4143  /* "INAC" */



[...]


+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+

[PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-15 Thread Chao Peng

From: "Kirill A. Shutemov" 

KVM can use memfd-provided memory for guest memory. For normal userspace
accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
virtual address space and then tells KVM to use the virtual address to
setup the mapping in the secondary page table (e.g. EPT).

With confidential computing technologies like Intel TDX, the
memfd-provided memory may be encrypted with special key for special
software domain (e.g. KVM guest) and is not expected to be directly
accessed by userspace. Precisely, userspace access to such encrypted
memory may lead to host crash so it should be prevented.

This patch introduces userspace inaccessible memfd (created with
MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
ordinary MMU access (e.g. read/write/mmap) but can be accessed via
in-kernel interface so KVM can directly interact with core-mm without
the need to map the memory into KVM userspace.

It provides semantics required for KVM guest private(encrypted) memory
support that a file descriptor with this flag set is going to be used as
the source of guest memory in confidential computing environments such
as Intel TDX/AMD SEV.

KVM userspace is still in charge of the lifecycle of the memfd. It
should pass the opened fd to KVM. KVM uses the kernel APIs newly added
in this patch to obtain the physical memory address and then populate
the secondary page table entries.

The userspace inaccessible memfd can be fallocate-ed and hole-punched
from userspace. When hole-punching happens, KVM can get notified through
inaccessible_notifier it then gets chance to remove any mapped entries
of the range in the secondary page tables.

The userspace inaccessible memfd itself is implemented as a shim layer
on top of real memory file systems like tmpfs/hugetlbfs but this patch
only implemented tmpfs. The allocated memory is currently marked as
unmovable and unevictable, this is required for current confidential
usage. But in future this might be changed.

Signed-off-by: Kirill A. Shutemov 
Signed-off-by: Chao Peng 
---
 include/linux/memfd.h  |  24 
 include/uapi/linux/magic.h |   1 +
 include/uapi/linux/memfd.h |   1 +
 mm/Makefile|   2 +-
 mm/memfd.c |  25 -
 mm/memfd_inaccessible.c| 219 +
 6 files changed, 270 insertions(+), 2 deletions(-)
 create mode 100644 mm/memfd_inaccessible.c

diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index 4f1600413f91..334ddff08377 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -3,6 +3,7 @@
 #define __LINUX_MEMFD_H
 
 #include 
+#include 
 
 #ifdef CONFIG_MEMFD_CREATE
 extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long 
arg);
@@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int 
c, unsigned long a)
 }
 #endif
 
+struct inaccessible_notifier;
+
+struct inaccessible_notifier_ops {
+   void (*invalidate)(struct inaccessible_notifier *notifier,
+  pgoff_t start, pgoff_t end);
+};
+
+struct inaccessible_notifier {
+   struct list_head list;
+   const struct inaccessible_notifier_ops *ops;
+};
+
+void inaccessible_register_notifier(struct file *file,
+   struct inaccessible_notifier *notifier);
+void inaccessible_unregister_notifier(struct file *file,
+ struct inaccessible_notifier *notifier);
+
+int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
+int *order);
+void inaccessible_put_pfn(struct file *file, pfn_t pfn);
+
+struct file *memfd_mkinaccessible(struct file *memfd);
+
 #endif /* __LINUX_MEMFD_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..9d066be3d7e8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC  0x444d4142  /* "DMAB" */
 #define DEVMEM_MAGIC   0x454d444d  /* "DMEM" */
 #define SECRETMEM_MAGIC0x5345434d  /* "SECM" */
+#define INACCESSIBLE_MAGIC 0x494e4143  /* "INAC" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..48750474b904 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,7 @@
 #define MFD_CLOEXEC0x0001U
 #define MFD_ALLOW_SEALING  0x0002U
 #define MFD_HUGETLB0x0004U
+#define MFD_INACCESSIBLE   0x0008U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f836403..f82e5d4b4388 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -126,7 +126,7 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
-obj-$(CONFIG_MEMFD_CREATE) += memfd.o

50 matches

Mail list logo