Re: [PATCH v7 5/8] mm: Device exclusive memory access

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 01:20:05PM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 11:48:13 AM AEDT Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> > > On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > > > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > > > > 
> > > > > > > I guess that makes sense as the split could go either way at the
> > > > > > > moment but I should add a check to make sure this isn't used with
> > > > > > > pinned pages anyway.
> > > > > > 
> > > > > > Is it possible to have a pinned page under one of these things? If I
> > > > > > pin it before you migrate it then it remains pinned but hidden under
> > > > > > the swap entry?
> > > > > 
> > > > > At the moment yes. But I had planned (and this reminded me) to add a 
> check 
> > > to 
> > > > > prevent marking pinned pages for exclusive access. 
> > > > 
> > > > How do you even do that without races with GUP fast?
> > > 
> > > Unless I've missed something I think I've convinced myself it should be 
> safe 
> > > to do the pin check after make_device_exclusive() has replaced all the 
> PTEs 
> > > with exclusive entries.
> > > 
> > > GUP fast sequence:
> > > 1. Read PTE
> > > 2. Pin page
> > > 3. Check PTE
> > > 4. if PTE changed -> unpin and fallback
> > > 
> > > If make_device_exclusive() runs after (1) it will either succeed or see 
> the 
> > > pin from (2) and fail (as desired). GUP should always see the PTE change 
> and 
> > > fallback which will revoke the exclusive access.
> > 
> > AFAICT the user can trigger fork at that instant and fork will try to
> > copy the desposited migration entry before it has been checked
> 
> In that case the child will get a read-only exclusive entry and eventually a 
> page copy via do_wp_page() 

Having do_wp_page() do a copy is a security bug. We closed it with the
at-fork checks.

Jason


Re: [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-31 Thread Alistair Popple
On Thursday, 1 April 2021 11:48:13 AM AEDT Jason Gunthorpe wrote:
> On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> > On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > > > 
> > > > > > I guess that makes sense as the split could go either way at the
> > > > > > moment but I should add a check to make sure this isn't used with
> > > > > > pinned pages anyway.
> > > > > 
> > > > > Is it possible to have a pinned page under one of these things? If I
> > > > > pin it before you migrate it then it remains pinned but hidden under
> > > > > the swap entry?
> > > > 
> > > > At the moment yes. But I had planned (and this reminded me) to add a 
check 
> > to 
> > > > prevent marking pinned pages for exclusive access. 
> > > 
> > > How do you even do that without races with GUP fast?
> > 
> > Unless I've missed something I think I've convinced myself it should be 
safe 
> > to do the pin check after make_device_exclusive() has replaced all the 
PTEs 
> > with exclusive entries.
> > 
> > GUP fast sequence:
> > 1. Read PTE
> > 2. Pin page
> > 3. Check PTE
> > 4. if PTE changed -> unpin and fallback
> > 
> > If make_device_exclusive() runs after (1) it will either succeed or see 
the 
> > pin from (2) and fail (as desired). GUP should always see the PTE change 
and 
> > fallback which will revoke the exclusive access.
> 
> AFAICT the user can trigger fork at that instant and fork will try to
> copy the desposited migration entry before it has been checked

In that case the child will get a read-only exclusive entry and eventually a 
page copy via do_wp_page() and GUP will fallback (or fail in the case of fast 
only) so the parent's exclusive entry will get removed before the page can be 
pinned and therefore shouldn't split the wrong way.

But that is sounding rather complex, and I am not convinced I haven't missed a 
corner case. It also seems like it shouldn't be necessary to copy exclusive 
entries anyway. I could just remove them and restore the original entry, which 
would be far simpler.

> Jason
> 






Re: [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-31 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > > 
> > > > > I guess that makes sense as the split could go either way at the
> > > > > moment but I should add a check to make sure this isn't used with
> > > > > pinned pages anyway.
> > > > 
> > > > Is it possible to have a pinned page under one of these things? If I
> > > > pin it before you migrate it then it remains pinned but hidden under
> > > > the swap entry?
> > > 
> > > At the moment yes. But I had planned (and this reminded me) to add a 
> > > check 
> to 
> > > prevent marking pinned pages for exclusive access. 
> > 
> > How do you even do that without races with GUP fast?
> 
> Unless I've missed something I think I've convinced myself it should be safe 
> to do the pin check after make_device_exclusive() has replaced all the PTEs 
> with exclusive entries.
> 
> GUP fast sequence:
> 1. Read PTE
> 2. Pin page
> 3. Check PTE
> 4. if PTE changed -> unpin and fallback
> 
> If make_device_exclusive() runs after (1) it will either succeed or see the 
> pin from (2) and fail (as desired). GUP should always see the PTE change and 
> fallback which will revoke the exclusive access.

AFAICT the user can trigger fork at that instant and fork will try to
copy the desposited migration entry before it has been checked

Jason


Re: [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-31 Thread Alistair Popple
On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > 
> > > > I guess that makes sense as the split could go either way at the
> > > > moment but I should add a check to make sure this isn't used with
> > > > pinned pages anyway.
> > > 
> > > Is it possible to have a pinned page under one of these things? If I
> > > pin it before you migrate it then it remains pinned but hidden under
> > > the swap entry?
> > 
> > At the moment yes. But I had planned (and this reminded me) to add a check 
to 
> > prevent marking pinned pages for exclusive access. 
> 
> How do you even do that without races with GUP fast?

Unless I've missed something I think I've convinced myself it should be safe 
to do the pin check after make_device_exclusive() has replaced all the PTEs 
with exclusive entries.

GUP fast sequence:
1. Read PTE
2. Pin page
3. Check PTE
4. if PTE changed -> unpin and fallback

If make_device_exclusive() runs after (1) it will either succeed or see the 
pin from (2) and fail (as desired). GUP should always see the PTE change and 
fallback which will revoke the exclusive access.

 - Alistair

> Jason
> 






Re: [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-31 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > 
> > > I guess that makes sense as the split could go either way at the
> > > moment but I should add a check to make sure this isn't used with
> > > pinned pages anyway.
> > 
> > Is it possible to have a pinned page under one of these things? If I
> > pin it before you migrate it then it remains pinned but hidden under
> > the swap entry?
> 
> At the moment yes. But I had planned (and this reminded me) to add a check to 
> prevent marking pinned pages for exclusive access. 

How do you even do that without races with GUP fast?

Jason


Re: [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-31 Thread Alistair Popple
On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> 
> > I guess that makes sense as the split could go either way at the
> > moment but I should add a check to make sure this isn't used with
> > pinned pages anyway.
> 
> Is it possible to have a pinned page under one of these things? If I
> pin it before you migrate it then it remains pinned but hidden under
> the swap entry?

At the moment yes. But I had planned (and this reminded me) to add a check to 
prevent marking pinned pages for exclusive access. This check was in the 
original migration based implementation as I don't think it makes much sense 
to allow exclusive access to pinned pages given it indicates another device is 
possibly using it. 

> So the special logic is needed and the pinned page has to be copied
> and written as a normal pte, not dropped as a migration entry

Yep, if we end up allowing pinned pages to exist under these then that makes 
sense. Thanks for the clarification.

 - Alistair

> Jason
> 





Re: [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:

> I guess that makes sense as the split could go either way at the
> moment but I should add a check to make sure this isn't used with
> pinned pages anyway.

Is it possible to have a pinned page under one of these things? If I
pin it before you migrate it then it remains pinned but hidden under
the swap entry?

So the special logic is needed and the pinned page has to be copied
and written as a normal pte, not dropped as a migration entry

Jason


Re: [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-31 Thread Alistair Popple
On Wednesday, 31 March 2021 6:32:34 AM AEDT Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 11:08:02AM +1100, Alistair Popple wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3a5705cfc891..33d11527ef77 100644
> > +++ b/mm/memory.c
> > @@ -781,6 +781,27 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
> > pte = pte_swp_mkuffd_wp(pte);
> > set_pte_at(src_mm, addr, src_pte, pte);
> > }
> > +   } else if (is_device_exclusive_entry(entry)) {
> > +   page = pfn_swap_entry_to_page(entry);
> > +
> > +   get_page(page);
> > +   rss[mm_counter(page)]++;
> > +
> > +   if (is_writable_device_exclusive_entry(entry) &&
> > +   is_cow_mapping(vm_flags)) {
> > +   /*
> > +* COW mappings require pages in both
> > +* parent and child to be set to read.
> > +*/
> > +   entry = make_readable_device_exclusive_entry(
> > +   swp_offset(entry));
> > +   pte = swp_entry_to_pte(entry);
> > +   if (pte_swp_soft_dirty(*src_pte))
> > +   pte = pte_swp_mksoft_dirty(pte);
> > +   if (pte_swp_uffd_wp(*src_pte))
> > +   pte = pte_swp_mkuffd_wp(pte);
> > +   set_pte_at(src_mm, addr, src_pte, pte);
> > +   }
> 
> This needs to have the same logic as we now have in
> copy_present_page(). The page *is* present and we can't copy the PTE
> value hidden in a swap entry if we can't copy the PTE normally.

You're saying we need to use copy_present_page() to make sure the split goes 
the right way for pinned pages? I guess that makes sense as the split could go 
either way at the moment but I should add a check to make sure this isn't used 
with pinned pages anyway.

 - Alistair

> The code should be shared because nobody is going to remember about
> this corner case.
> 
> Jason
> 






Re: [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-30 Thread Jason Gunthorpe
On Fri, Mar 26, 2021 at 11:08:02AM +1100, Alistair Popple wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index 3a5705cfc891..33d11527ef77 100644
> +++ b/mm/memory.c
> @@ -781,6 +781,27 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct 
> mm_struct *src_mm,
>   pte = pte_swp_mkuffd_wp(pte);
>   set_pte_at(src_mm, addr, src_pte, pte);
>   }
> + } else if (is_device_exclusive_entry(entry)) {
> + page = pfn_swap_entry_to_page(entry);
> +
> + get_page(page);
> + rss[mm_counter(page)]++;
> +
> + if (is_writable_device_exclusive_entry(entry) &&
> + is_cow_mapping(vm_flags)) {
> + /*
> +  * COW mappings require pages in both
> +  * parent and child to be set to read.
> +  */
> + entry = make_readable_device_exclusive_entry(
> + swp_offset(entry));
> + pte = swp_entry_to_pte(entry);
> + if (pte_swp_soft_dirty(*src_pte))
> + pte = pte_swp_mksoft_dirty(pte);
> + if (pte_swp_uffd_wp(*src_pte))
> + pte = pte_swp_mkuffd_wp(pte);
> + set_pte_at(src_mm, addr, src_pte, pte);
> + }

This needs to have the same logic as we now have in
copy_present_page(). The page *is* present and we can't copy the PTE
value hidden in a swap entry if we can't copy the PTE normally.

The code should be shared because nobody is going to remember about
this corner case.

Jason


[PATCH v7 5/8] mm: Device exclusive memory access

2021-03-25 Thread Alistair Popple
Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
exclusive access by a device all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resovled by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 

---

v7:
* Added Christoph's Reviewed-by.
* Minor cosmetic cleanups suggested by Christoph.
* Replace mmu_notifier_range_init_migrate/exclusive with
  mmu_notifier_range_init_owner as suggested by Christoph.
* Replaced lock_page() with lock_page_retry() when handling faults.
* Restrict to anonymous pages for now.

v6:
* Fixed a bisectablity issue due to incorrectly applying the rename of
  migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.

v5:
* Renamed range->migrate_pgmap_owner to range->owner.
* Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
  allows notifiers called as a result of make_device_exclusive_range() to
  be ignored.
* Added a check to try_to_protect_one() to detect if the pages originally
  returned from get_user_pages() have been unmapped or not.
* Removed check_device_exclusive_range() as it is no longer required with
  the other changes.
* Documentation update.

v4:
* Add function to check that mappings are still valid and exclusive.
* s/long/unsigned long/ in make_device_exclusive_entry().
---
 Documentation/vm/hmm.rst  |  19 ++-
 drivers/gpu/drm/nouveau/nouveau_svm.c |   2 +-
 include/linux/mmu_notifier.h  |  26 ++--
 include/linux/rmap.h  |   4 +
 include/linux/swap.h  |   4 +-
 include/linux/swapops.h   |  44 +-
 lib/test_hmm.c|   2 +-
 mm/hmm.c  |   5 +
 mm/memory.c   | 108 -
 mm/migrate.c  |  10 +-
 mm/mprotect.c |   8 +
 mm/page_vma_mapped.c  |   9 +-
 mm/rmap.c | 210 ++
 13 files changed, 426 insertions(+), 25 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..a14c2938e7af 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@ between device driver specific code and shared common code:
walks to fill in the ``args->src`` array with PFNs to be migrated.
The ``invalidate_range_start()`` callback is passed a
``struct mmu_notifier_range`` with the ``event`` field set to
-   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
allows the device driver to skip the invalidation callback and only
invalidate device private MMU mappings that are actually migrating.
@@ -405,6 +405,23 @@ between device driver specific code and shared common code:
 
The lock can now be released.
 
+Exclusive access memory
+===
+
+Some devices have features such as atomic PTE bits that can be used to 
implement
+atomic access to system memory. To support atomic operations to a shared 
virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resovled by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may proceed as described.
+
 Memory cgroup (memcg) and rss accounting
 
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index f18bd53da052..94f841026c3b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
 * the invalidation is handled as part of