Re: [PATCH v7 5/8] mm: Device exclusive memory access
On Thu, Apr 01, 2021 at 01:20:05PM +1100, Alistair Popple wrote: > On Thursday, 1 April 2021 11:48:13 AM AEDT Jason Gunthorpe wrote: > > On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote: > > > On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote: > > > > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote: > > > > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote: > > > > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote: > > > > > > > > > > > > > I guess that makes sense as the split could go either way at the > > > > > > > moment but I should add a check to make sure this isn't used with > > > > > > > pinned pages anyway. > > > > > > > > > > > > Is it possible to have a pinned page under one of these things? If I > > > > > > pin it before you migrate it then it remains pinned but hidden under > > > > > > the swap entry? > > > > > > > > > > At the moment yes. But I had planned (and this reminded me) to add a > check > > > to > > > > > prevent marking pinned pages for exclusive access. > > > > > > > > How do you even do that without races with GUP fast? > > > > > > Unless I've missed something I think I've convinced myself it should be > safe > > > to do the pin check after make_device_exclusive() has replaced all the > PTEs > > > with exclusive entries. > > > > > > GUP fast sequence: > > > 1. Read PTE > > > 2. Pin page > > > 3. Check PTE > > > 4. if PTE changed -> unpin and fallback > > > > > > If make_device_exclusive() runs after (1) it will either succeed or see > the > > > pin from (2) and fail (as desired). GUP should always see the PTE change > and > > > fallback which will revoke the exclusive access. > > > > AFAICT the user can trigger fork at that instant and fork will try to > > copy the desposited migration entry before it has been checked > > In that case the child will get a read-only exclusive entry and eventually a > page copy via do_wp_page() Having do_wp_page() do a copy is a security bug. We closed it with the at-fork checks. Jason
Re: [PATCH v7 5/8] mm: Device exclusive memory access
On Thursday, 1 April 2021 11:48:13 AM AEDT Jason Gunthorpe wrote: > On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote: > > On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote: > > > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote: > > > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote: > > > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote: > > > > > > > > > > > I guess that makes sense as the split could go either way at the > > > > > > moment but I should add a check to make sure this isn't used with > > > > > > pinned pages anyway. > > > > > > > > > > Is it possible to have a pinned page under one of these things? If I > > > > > pin it before you migrate it then it remains pinned but hidden under > > > > > the swap entry? > > > > > > > > At the moment yes. But I had planned (and this reminded me) to add a check > > to > > > > prevent marking pinned pages for exclusive access. > > > > > > How do you even do that without races with GUP fast? > > > > Unless I've missed something I think I've convinced myself it should be safe > > to do the pin check after make_device_exclusive() has replaced all the PTEs > > with exclusive entries. > > > > GUP fast sequence: > > 1. Read PTE > > 2. Pin page > > 3. Check PTE > > 4. if PTE changed -> unpin and fallback > > > > If make_device_exclusive() runs after (1) it will either succeed or see the > > pin from (2) and fail (as desired). GUP should always see the PTE change and > > fallback which will revoke the exclusive access. > > AFAICT the user can trigger fork at that instant and fork will try to > copy the desposited migration entry before it has been checked In that case the child will get a read-only exclusive entry and eventually a page copy via do_wp_page() and GUP will fallback (or fail in the case of fast only) so the parent's exclusive entry will get removed before the page can be pinned and therefore shouldn't split the wrong way. But that is sounding rather complex, and I am not convinced I haven't missed a corner case. It also seems like it shouldn't be necessary to copy exclusive entries anyway. I could just remove them and restore the original entry, which would be far simpler. > Jason >
Re: [PATCH v7 5/8] mm: Device exclusive memory access
On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote: > On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote: > > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote: > > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote: > > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote: > > > > > > > > > I guess that makes sense as the split could go either way at the > > > > > moment but I should add a check to make sure this isn't used with > > > > > pinned pages anyway. > > > > > > > > Is it possible to have a pinned page under one of these things? If I > > > > pin it before you migrate it then it remains pinned but hidden under > > > > the swap entry? > > > > > > At the moment yes. But I had planned (and this reminded me) to add a > > > check > to > > > prevent marking pinned pages for exclusive access. > > > > How do you even do that without races with GUP fast? > > Unless I've missed something I think I've convinced myself it should be safe > to do the pin check after make_device_exclusive() has replaced all the PTEs > with exclusive entries. > > GUP fast sequence: > 1. Read PTE > 2. Pin page > 3. Check PTE > 4. if PTE changed -> unpin and fallback > > If make_device_exclusive() runs after (1) it will either succeed or see the > pin from (2) and fail (as desired). GUP should always see the PTE change and > fallback which will revoke the exclusive access. AFAICT the user can trigger fork at that instant and fork will try to copy the desposited migration entry before it has been checked Jason
Re: [PATCH v7 5/8] mm: Device exclusive memory access
On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote: > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote: > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote: > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote: > > > > > > > I guess that makes sense as the split could go either way at the > > > > moment but I should add a check to make sure this isn't used with > > > > pinned pages anyway. > > > > > > Is it possible to have a pinned page under one of these things? If I > > > pin it before you migrate it then it remains pinned but hidden under > > > the swap entry? > > > > At the moment yes. But I had planned (and this reminded me) to add a check to > > prevent marking pinned pages for exclusive access. > > How do you even do that without races with GUP fast? Unless I've missed something I think I've convinced myself it should be safe to do the pin check after make_device_exclusive() has replaced all the PTEs with exclusive entries. GUP fast sequence: 1. Read PTE 2. Pin page 3. Check PTE 4. if PTE changed -> unpin and fallback If make_device_exclusive() runs after (1) it will either succeed or see the pin from (2) and fail (as desired). GUP should always see the PTE change and fallback which will revoke the exclusive access. - Alistair > Jason >
Re: [PATCH v7 5/8] mm: Device exclusive memory access
On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote: > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote: > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote: > > > > > I guess that makes sense as the split could go either way at the > > > moment but I should add a check to make sure this isn't used with > > > pinned pages anyway. > > > > Is it possible to have a pinned page under one of these things? If I > > pin it before you migrate it then it remains pinned but hidden under > > the swap entry? > > At the moment yes. But I had planned (and this reminded me) to add a check to > prevent marking pinned pages for exclusive access. How do you even do that without races with GUP fast? Jason
Re: [PATCH v7 5/8] mm: Device exclusive memory access
On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote: > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote: > > > I guess that makes sense as the split could go either way at the > > moment but I should add a check to make sure this isn't used with > > pinned pages anyway. > > Is it possible to have a pinned page under one of these things? If I > pin it before you migrate it then it remains pinned but hidden under > the swap entry? At the moment yes. But I had planned (and this reminded me) to add a check to prevent marking pinned pages for exclusive access. This check was in the original migration based implementation as I don't think it makes much sense to allow exclusive access to pinned pages given it indicates another device is possibly using it. > So the special logic is needed and the pinned page has to be copied > and written as a normal pte, not dropped as a migration entry Yep, if we end up allowing pinned pages to exist under these then that makes sense. Thanks for the clarification. - Alistair > Jason >
Re: [PATCH v7 5/8] mm: Device exclusive memory access
On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote: > I guess that makes sense as the split could go either way at the > moment but I should add a check to make sure this isn't used with > pinned pages anyway. Is it possible to have a pinned page under one of these things? If I pin it before you migrate it then it remains pinned but hidden under the swap entry? So the special logic is needed and the pinned page has to be copied and written as a normal pte, not dropped as a migration entry Jason
Re: [PATCH v7 5/8] mm: Device exclusive memory access
On Wednesday, 31 March 2021 6:32:34 AM AEDT Jason Gunthorpe wrote: > On Fri, Mar 26, 2021 at 11:08:02AM +1100, Alistair Popple wrote: > > diff --git a/mm/memory.c b/mm/memory.c > > index 3a5705cfc891..33d11527ef77 100644 > > +++ b/mm/memory.c > > @@ -781,6 +781,27 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, > > pte = pte_swp_mkuffd_wp(pte); > > set_pte_at(src_mm, addr, src_pte, pte); > > } > > + } else if (is_device_exclusive_entry(entry)) { > > + page = pfn_swap_entry_to_page(entry); > > + > > + get_page(page); > > + rss[mm_counter(page)]++; > > + > > + if (is_writable_device_exclusive_entry(entry) && > > + is_cow_mapping(vm_flags)) { > > + /* > > +* COW mappings require pages in both > > +* parent and child to be set to read. > > +*/ > > + entry = make_readable_device_exclusive_entry( > > + swp_offset(entry)); > > + pte = swp_entry_to_pte(entry); > > + if (pte_swp_soft_dirty(*src_pte)) > > + pte = pte_swp_mksoft_dirty(pte); > > + if (pte_swp_uffd_wp(*src_pte)) > > + pte = pte_swp_mkuffd_wp(pte); > > + set_pte_at(src_mm, addr, src_pte, pte); > > + } > > This needs to have the same logic as we now have in > copy_present_page(). The page *is* present and we can't copy the PTE > value hidden in a swap entry if we can't copy the PTE normally. You're saying we need to use copy_present_page() to make sure the split goes the right way for pinned pages? I guess that makes sense as the split could go either way at the moment but I should add a check to make sure this isn't used with pinned pages anyway. - Alistair > The code should be shared because nobody is going to remember about > this corner case. > > Jason >
Re: [PATCH v7 5/8] mm: Device exclusive memory access
On Fri, Mar 26, 2021 at 11:08:02AM +1100, Alistair Popple wrote: > diff --git a/mm/memory.c b/mm/memory.c > index 3a5705cfc891..33d11527ef77 100644 > +++ b/mm/memory.c > @@ -781,6 +781,27 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct > mm_struct *src_mm, > pte = pte_swp_mkuffd_wp(pte); > set_pte_at(src_mm, addr, src_pte, pte); > } > + } else if (is_device_exclusive_entry(entry)) { > + page = pfn_swap_entry_to_page(entry); > + > + get_page(page); > + rss[mm_counter(page)]++; > + > + if (is_writable_device_exclusive_entry(entry) && > + is_cow_mapping(vm_flags)) { > + /* > + * COW mappings require pages in both > + * parent and child to be set to read. > + */ > + entry = make_readable_device_exclusive_entry( > + swp_offset(entry)); > + pte = swp_entry_to_pte(entry); > + if (pte_swp_soft_dirty(*src_pte)) > + pte = pte_swp_mksoft_dirty(pte); > + if (pte_swp_uffd_wp(*src_pte)) > + pte = pte_swp_mkuffd_wp(pte); > + set_pte_at(src_mm, addr, src_pte, pte); > + } This needs to have the same logic as we now have in copy_present_page(). The page *is* present and we can't copy the PTE value hidden in a swap entry if we can't copy the PTE normally. The code should be shared because nobody is going to remember about this corner case. Jason
[PATCH v7 5/8] mm: Device exclusive memory access
Some devices require exclusive write access to shared virtual memory (SVM) ranges to perform atomic operations on that memory. This requires CPU page tables to be updated to deny access whilst atomic operations are occurring. In order to do this introduce a new swap entry type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for exclusive access by a device all page table mappings for the particular range are replaced with device exclusive swap entries. This causes any CPU access to the page to result in a fault. Faults are resovled by replacing the faulting entry with the original mapping. This results in MMU notifiers being called which a driver uses to update access permissions such as revoking atomic access. After notifiers have been called the device will no longer have exclusive access to the region. Signed-off-by: Alistair Popple Reviewed-by: Christoph Hellwig --- v7: * Added Christoph's Reviewed-by. * Minor cosmetic cleanups suggested by Christoph. * Replace mmu_notifier_range_init_migrate/exclusive with mmu_notifier_range_init_owner as suggested by Christoph. * Replaced lock_page() with lock_page_retry() when handling faults. * Restrict to anonymous pages for now. v6: * Fixed a bisectablity issue due to incorrectly applying the rename of migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test. v5: * Renamed range->migrate_pgmap_owner to range->owner. * Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which allows notifiers called as a result of make_device_exclusive_range() to be ignored. * Added a check to try_to_protect_one() to detect if the pages originally returned from get_user_pages() have been unmapped or not. * Removed check_device_exclusive_range() as it is no longer required with the other changes. * Documentation update. v4: * Add function to check that mappings are still valid and exclusive. * s/long/unsigned long/ in make_device_exclusive_entry(). --- Documentation/vm/hmm.rst | 19 ++- drivers/gpu/drm/nouveau/nouveau_svm.c | 2 +- include/linux/mmu_notifier.h | 26 ++-- include/linux/rmap.h | 4 + include/linux/swap.h | 4 +- include/linux/swapops.h | 44 +- lib/test_hmm.c| 2 +- mm/hmm.c | 5 + mm/memory.c | 108 - mm/migrate.c | 10 +- mm/mprotect.c | 8 + mm/page_vma_mapped.c | 9 +- mm/rmap.c | 210 ++ 13 files changed, 426 insertions(+), 25 deletions(-) diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst index 09e28507f5b2..a14c2938e7af 100644 --- a/Documentation/vm/hmm.rst +++ b/Documentation/vm/hmm.rst @@ -332,7 +332,7 @@ between device driver specific code and shared common code: walks to fill in the ``args->src`` array with PFNs to be migrated. The ``invalidate_range_start()`` callback is passed a ``struct mmu_notifier_range`` with the ``event`` field set to - ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to + ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is allows the device driver to skip the invalidation callback and only invalidate device private MMU mappings that are actually migrating. @@ -405,6 +405,23 @@ between device driver specific code and shared common code: The lock can now be released. +Exclusive access memory +=== + +Some devices have features such as atomic PTE bits that can be used to implement +atomic access to system memory. To support atomic operations to a shared virtual +memory page such a device needs access to that page which is exclusive of any +userspace access from the CPU. The ``make_device_exclusive_range()`` function +can be used to make a memory range inaccessible from userspace. + +This replaces all mappings for pages in the given range with special swap +entries. Any attempt to access the swap entry results in a fault which is +resovled by replacing the entry with the original mapping. A driver gets +notified that the mapping has been changed by MMU notifiers, after which point +it will no longer have exclusive access to the page. Exclusive access is +guranteed to last until the driver drops the page lock and page reference, at +which point any CPU faults on the page may proceed as described. + Memory cgroup (memcg) and rss accounting diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index f18bd53da052..94f841026c3b 100644 --- a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn, * the invalidation is handled as part of