Re: [PATCH v6 5/8] mm: Device exclusive memory access

2021-03-22 Thread Alistair Popple
On Monday, 15 March 2021 6:42:45 PM AEDT Christoph Hellwig wrote:
> > +Not all devices support atomic access to system memory. To support atomic
> > +operations to a shared virtual memory page such a device needs access to that
> > +page which is exclusive of any userspace access from the CPU. The
> > +``make_device_exclusive_range()`` function can be used to make a memory range
> > +inaccessible from userspace.
> 
> s/Not all devices/Some devices/ ?

I will reword this. What I was trying to convey is that devices may have 
features which allow for atomics to be implemented with SW assistance.

> >  static inline int mm_has_notifiers(struct mm_struct *mm)
> > @@ -528,7 +534,17 @@ static inline void mmu_notifier_range_init_migrate(
> >  {
> > mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
> > start, end);
> > -   range->migrate_pgmap_owner = pgmap;
> > +   range->owner = pgmap;
> > +}
> > +
> > +static inline void mmu_notifier_range_init_exclusive(
> > +   struct mmu_notifier_range *range, unsigned int flags,
> > +   struct vm_area_struct *vma, struct mm_struct *mm,
> > +   unsigned long start, unsigned long end, void *owner)
> > +{
> > +   mmu_notifier_range_init(range, MMU_NOTIFY_EXCLUSIVE, flags, vma, mm,
> > +   start, end);
> > +   range->owner = owner;
> 
> Maybe just replace mmu_notifier_range_init_migrate with a
> mmu_notifier_range_init_owner helper that takes the owner but does
> not hard code a type?

Ok. That does result in a function which takes a fair number of arguments, but
I guess that's no worse than multiple functions hard-coding the different
types, and it does result in less code overall.
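
For reference, a rough sketch of what such a combined helper could look like
(the name and argument order here are assumptions rather than the final
interface):

static inline void mmu_notifier_range_init_owner(
			struct mmu_notifier_range *range,
			enum mmu_notifier_event event, unsigned int flags,
			struct vm_area_struct *vma, struct mm_struct *mm,
			unsigned long start, unsigned long end, void *owner)
{
	/* Same as mmu_notifier_range_init() but also records the owner
	 * cookie so notifier callbacks can filter events they triggered. */
	mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
	range->owner = owner;
}

Callers would then pass MMU_NOTIFY_MIGRATE or MMU_NOTIFY_EXCLUSIVE (and the
owner) directly instead of going through a per-event wrapper.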

> > }
> > +   } else if (is_device_exclusive_entry(entry)) {
> > +   page = pfn_swap_entry_to_page(entry);
> > +
> > +   get_page(page);
> > +   rss[mm_counter(page)]++;
> > +
> > +   if (is_writable_device_exclusive_entry(entry) &&
> > +   is_cow_mapping(vm_flags)) {
> > +   /*
> > +* COW mappings require pages in both
> > +* parent and child to be set to read.
> > +*/
> > +   entry = make_readable_device_exclusive_entry(
> > +   swp_offset(entry));
> > +   pte = swp_entry_to_pte(entry);
> > +   if (pte_swp_soft_dirty(*src_pte))
> > +   pte = pte_swp_mksoft_dirty(pte);
> > +   if (pte_swp_uffd_wp(*src_pte))
> > +   pte = pte_swp_mkuffd_wp(pte);
> > +   set_pte_at(src_mm, addr, src_pte, pte);
> > +   }
> 
> Just cosmetic, but I wonder if we should factor this code block into
> a little helper.

In that case there are arguably other bits of this function which should
be refactored into helpers as well. Unless you feel strongly about it I would
like to leave this as is and put together a future series to fix this and a
couple of other areas I've noticed that could do with some refactoring/clean-ups.
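
For illustration only, the sort of helper being suggested might look roughly
like this (hypothetical name, simply mirroring the quoted hunk above):

static void copy_device_exclusive_entry(struct mm_struct *src_mm,
		pte_t *src_pte, unsigned long addr, swp_entry_t entry)
{
	pte_t pte;

	/* Downgrade the entry so both parent and child map it read-only. */
	entry = make_readable_device_exclusive_entry(swp_offset(entry));
	pte = swp_entry_to_pte(entry);
	if (pte_swp_soft_dirty(*src_pte))
		pte = pte_swp_mksoft_dirty(pte);
	if (pte_swp_uffd_wp(*src_pte))
		pte = pte_swp_mkuffd_wp(pte);
	set_pte_at(src_mm, addr, src_pte, pte);
}

The caller (presumably copy_nonpresent_pte()) would invoke it under the
existing is_writable_device_exclusive_entry() && is_cow_mapping() check.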

> > +
> > +static bool try_to_protect_one(struct page *page, struct vm_area_struct *vma,
> > +   unsigned long address, void *arg)
> > +{
> > +   struct mm_struct *mm = vma->vm_mm;
> > +   struct page_vma_mapped_walk pvmw = {
> > +   .page = page,
> > +   .vma = vma,
> > +   .address = address,
> > +   };
> > +   struct ttp_args *ttp = (struct ttp_args *) arg;
> 
> This cast should not be needed.
> 
> > +   return ttp.valid && (!page_mapcount(page) ? true : false);
> 
> This can be simplified to:
> 
>   return ttp.valid && !page_mapcount(page);
> 
> > +   npages = get_user_pages_remote(mm, start, npages,
> > +  FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> > +  pages, NULL, NULL);
> > +   for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> > +   if (!trylock_page(pages[i])) {
> > +   put_page(pages[i]);
> > +   pages[i] = NULL;
> > +   continue;
> > +   }
> > +
> > +   if (!try_to_protect(pages[i], mm, start, arg)) {
> > +   unlock_page(pages[i]);
> > +   put_page(pages[i]);
> > +   pages[i] = NULL;
> > +   }
> 
> Should the trylock_page go into try_to_protect to simplify the loop
> a little?  Also I wonder if we need make_device_exclusive_range or
> should just open code the get_user_pages_remote + try_to_protect
> loop in the callers, as that might allow them to also deduce other
> information about the found pages.

This function has evolved over time and putting the trylock_page into
try_to_protect does simplify things nicely. I'm not sure what other
information a caller could deduce through open coding though, but I guess in
some

Re: [PATCH v6 5/8] mm: Device exclusive memory access

2021-03-15 Thread Christoph Hellwig
> +Not all devices support atomic access to system memory. To support atomic
> +operations to a shared virtual memory page such a device needs access to that
> +page which is exclusive of any userspace access from the CPU. The
> +``make_device_exclusive_range()`` function can be used to make a memory range
> +inaccessible from userspace.

s/Not all devices/Some devices/ ?

>  static inline int mm_has_notifiers(struct mm_struct *mm)
> @@ -528,7 +534,17 @@ static inline void mmu_notifier_range_init_migrate(
>  {
>   mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
>   start, end);
> - range->migrate_pgmap_owner = pgmap;
> + range->owner = pgmap;
> +}
> +
> +static inline void mmu_notifier_range_init_exclusive(
> + struct mmu_notifier_range *range, unsigned int flags,
> + struct vm_area_struct *vma, struct mm_struct *mm,
> + unsigned long start, unsigned long end, void *owner)
> +{
> + mmu_notifier_range_init(range, MMU_NOTIFY_EXCLUSIVE, flags, vma, mm,
> + start, end);
> + range->owner = owner;

Maybe just replace mmu_notifier_range_init_migrate with a
mmu_notifier_range_init_owner helper that takes the owner but does
not hard code a type?

>   }
> + } else if (is_device_exclusive_entry(entry)) {
> + page = pfn_swap_entry_to_page(entry);
> +
> + get_page(page);
> + rss[mm_counter(page)]++;
> +
> + if (is_writable_device_exclusive_entry(entry) &&
> + is_cow_mapping(vm_flags)) {
> + /*
> +  * COW mappings require pages in both
> +  * parent and child to be set to read.
> +  */
> + entry = make_readable_device_exclusive_entry(
> + swp_offset(entry));
> + pte = swp_entry_to_pte(entry);
> + if (pte_swp_soft_dirty(*src_pte))
> + pte = pte_swp_mksoft_dirty(pte);
> + if (pte_swp_uffd_wp(*src_pte))
> + pte = pte_swp_mkuffd_wp(pte);
> + set_pte_at(src_mm, addr, src_pte, pte);
> + }

Just cosmetic, but I wonder if we should factor this code block into
a little helper.

> +
> +static bool try_to_protect_one(struct page *page, struct vm_area_struct *vma,
> + unsigned long address, void *arg)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + struct page_vma_mapped_walk pvmw = {
> + .page = page,
> + .vma = vma,
> + .address = address,
> + };
> + struct ttp_args *ttp = (struct ttp_args *) arg;

This cast should not be needed.

> + return ttp.valid && (!page_mapcount(page) ? true : false);

This can be simplified to:

return ttp.valid && !page_mapcount(page);

> + npages = get_user_pages_remote(mm, start, npages,
> +FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> +pages, NULL, NULL);
> + for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> + if (!trylock_page(pages[i])) {
> + put_page(pages[i]);
> + pages[i] = NULL;
> + continue;
> + }
> +
> + if (!try_to_protect(pages[i], mm, start, arg)) {
> + unlock_page(pages[i]);
> + put_page(pages[i]);
> + pages[i] = NULL;
> + }

Should the trylock_page go into try_to_protect to simplify the loop
a little?  Also I wonder if we need make_device_exclusive_range or
should just open code the get_user_pages_remote + try_to_protect
loop in the callers, as that might allow them to also deduce other
information about the found pages.
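
As a very rough sketch (assuming try_to_protect() takes over the trylock and
drops the page reference itself when it fails), the loop could reduce to
something like:

	npages = get_user_pages_remote(mm, start, npages,
				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
				       pages, NULL, NULL);
	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
		/* try_to_protect() now locks the page and, on failure,
		 * unlocks it and releases the reference before returning. */
		if (!try_to_protect(pages[i], mm, start, arg))
			pages[i] = NULL;
	}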

Otherwise looks good:

Reviewed-by: Christoph Hellwig 


[PATCH v6 5/8] mm: Device exclusive memory access

2021-03-12 Thread Alistair Popple
Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When an SVM range needs to be marked for
exclusive access by a device, all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resolved by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called, which a driver uses
to update access permissions such as revoking atomic access. After the
notifiers have been called the device will no longer have exclusive
access to the region.
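
As a purely illustrative sketch of the intended flow (the exact signature,
return value and locking requirements should be taken from the patch below;
mm, addr and drm_dev are placeholders):

	struct page *page = NULL;
	int npages;

	mmap_read_lock(mm);
	/* Replace the CPU mapping of one page with an exclusive entry. */
	npages = make_device_exclusive_range(mm, addr, addr + PAGE_SIZE,
					     &page, drm_dev /* owner cookie */);
	mmap_read_unlock(mm);

	if (npages == 1 && page) {
		/* Program the device-side atomic mapping here. Exclusive
		 * access lasts until the page lock and reference are
		 * dropped, after which CPU faults restore the original
		 * mapping. */
		unlock_page(page);
		put_page(page);
	}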

Signed-off-by: Alistair Popple 

---

v6:
* Fixed a bisectability issue due to incorrectly applying the rename of
  migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.

v5:
* Renamed range->migrate_pgmap_owner to range->owner.
* Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
  allows notifiers called as a result of make_device_exclusive_range() to
  be ignored.
* Added a check to try_to_protect_one() to detect if the pages originally
  returned from get_user_pages() have been unmapped or not.
* Removed check_device_exclusive_range() as it is no longer required with
  the other changes.
* Documentation update.

v4:
* Add function to check that mappings are still valid and exclusive.
* s/long/unsigned long/ in make_device_exclusive_entry().
---
 Documentation/vm/hmm.rst  |  19 ++-
 drivers/gpu/drm/nouveau/nouveau_svm.c |   2 +-
 include/linux/mmu_notifier.h  |  25 +++-
 include/linux/rmap.h  |   4 +
 include/linux/swap.h  |   4 +-
 include/linux/swapops.h   |  44 +-
 lib/test_hmm.c|   2 +-
 mm/hmm.c  |   5 +
 mm/memory.c   | 107 +-
 mm/mprotect.c |   8 +
 mm/page_vma_mapped.c  |   9 +-
 mm/rmap.c | 203 ++
 12 files changed, 419 insertions(+), 13 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..a5fdee82c037 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@ between device driver specific code and shared common code:
walks to fill in the ``args->src`` array with PFNs to be migrated.
The ``invalidate_range_start()`` callback is passed a
``struct mmu_notifier_range`` with the ``event`` field set to
-   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
allows the device driver to skip the invalidation callback and only
invalidate device private MMU mappings that are actually migrating.
@@ -405,6 +405,23 @@ between device driver specific code and shared common code:
 
The lock can now be released.
 
+Exclusive access memory
+=======================
+
+Not all devices support atomic access to system memory. To support atomic
+operations to a shared virtual memory page such a device needs access to that
+page which is exclusive of any userspace access from the CPU. The
+``make_device_exclusive_range()`` function can be used to make a memory range
+inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resolved by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guaranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may proceed as described.
+
 Memory cgroup (memcg) and rss accounting
 
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index f18bd53da052..94f841026c3b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
 * the invalidation is handled as part of the migration process.
 */
if (update->event == MMU_NOTIFY_MIGRATE &&
-   update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev)
+   update->owner == svmm->vmm->cli->drm->dev)
goto out;
 
if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b8200782dede..455e269bf825 100644
---