Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-21 Thread Alistair Popple
On Tuesday, 18 May 2021 11:19:05 PM AEST Alistair Popple wrote:

[...]

> > > +/*
> > > + * Restore a potential device exclusive pte to a working pte entry
> > > + */
> > > +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > +{
> > > + struct page *page = vmf->page;
> > > + struct vm_area_struct *vma = vmf->vma;
> > > + struct page_vma_mapped_walk pvmw = {
> > > + .page = page,
> > > + .vma = vma,
> > > + .address = vmf->address,
> > > + .flags = PVMW_SYNC,
> > > + };
> > > + vm_fault_t ret = 0;
> > > + struct mmu_notifier_range range;
> > > +
> > > + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
> > > + return VM_FAULT_RETRY;
> > > + mmu_notifier_range_init(, MMU_NOTIFY_CLEAR, 0, vma,
> > > vma->vm_mm,
> > > + vmf->address & PAGE_MASK,
> > > + (vmf->address & PAGE_MASK) + PAGE_SIZE);
> > > + mmu_notifier_invalidate_range_start();
> > 
> > I looked at MMU_NOTIFIER_CLEAR document and it tells me:
> > 
> > * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like
> > * madvise() or replacing a page by another one, ...).
> > 
> > Does MMU_NOTIFIER_CLEAR suite for this case?  Normally I think for such a
> > case (existing pte is invalid) we don't need to notify at all.  However
> > from what I read from the whole series, this seems to be a critical point
> > where we'd like to kick the owner/driver to synchronously stop doing
> > atomic
> > operations from the device.  Not sure whether we'd like a new notifier
> > type, or maybe at least comment on why to use CLEAR?
> 
> Right, notifying the owner/driver when it no longer has exclusive access to
> the page and allowing it to stop atomic operations is the critical point and
> why it notifies when we ordinarily wouldn't (ie. invalid -> valid).
> 
> I did consider adding a new type, but in the driver implementation it ends
> up
> being treated the same as a CLEAR notification anyway so didn't think it was
> necessary. But I suppose adding a different type would allow other listening
> notifiers to filter these which might be worthwhile.
>
> > > +
> > > + while (page_vma_mapped_walk()) {
> > 
> > IIUC a while loop of page_vma_mapped_walk() handles thps, however here
> > it's
> > already in do_swap_page() so it's small pte only?  Meanwhile...
> > 
> > > + if (unlikely(!pte_same(*pvmw.pte, vmf->orig_pte))) {
> > > + page_vma_mapped_walk_done();
> > > + break;
> > > + }
> > > +
> > > + restore_exclusive_pte(vma, page, pvmw.address, pvmw.pte);
> > 
> > ... I'm not sure whether passing in page would work for thp after all, as
> > iiuc we may need to pass in the subpage rather than the head page if so.
> 
> Hmm, I need to check this and follow up.

Thanks Peter for pointing that out. I needed to follow this up because I had 
slightly misunderstood page_vma_mapped_walk(). As you say this is actually a 
small pte and restore_exclusive_pte() needs the actual page from the fault. So 
I should be able to drop the page_vma_mapped_walk() and use 
pte_offset_map_lock() to call restore_exclusive_pte directly.

 - Alistair





Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Jason Gunthorpe
> Sorry for the noise.

Not at all, it is good that more people understand things!

Jason


Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu
On Wed, May 19, 2021 at 10:28:42AM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 07:45:05PM -0400, Peter Xu wrote:
> > On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > > Logically during fork all these device exclusive pages should be
> > > reverted back to their CPU pages, write protected and the CPU page PTE
> > > copied to the fork.
> > > 
> > > We should not copy the device exclusive page PTE to the fork. I think
> > > I pointed to this on an earlier rev..
> > 
> > Agreed.  Though please see the question I posted in the other thread: now I 
> > am
> > not very sure whether we'll be able to mark a page as device exclusive if 
> > that
> > page has mapcount>1.
> 
> IMHO it is similar to write protect done by filesystems on shared
> mappings - all VMAs with a copy of the CPU page have to get switched
> to the device exclusive PTE. This is why the rmap stuff is involved in
> the migration helpers

Right, I think Alistair corrected me there that I missed the early COW
happening in GUP.

Actually even without that GUP triggering early COW it won't be a problem,
because as long as one child mm restored the pte from exclusive to normal
(before any further COW happens) device exclusiveness is broken in the mmu
notifiers, and after that point all previous-exclusive ptes actually becomes
the same as a very normal PageAnon.  Then it's very sane to even not have the
original page in parent process, because we know each COWed page will contain
all the device atomic modifications (so we don't really have the requirement to
return the original page to parent).

Sorry for the noise.

-- 
Peter Xu



Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu
On Wed, May 19, 2021 at 11:11:55PM +1000, Alistair Popple wrote:
> On Wednesday, 19 May 2021 10:15:41 PM AEST Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> > > Failing fork() because we couldn't take a lock doesn't seem like the right
> > > approach though, especially as there is already existing code that
> > > retries. I get this adds complexity though, so would be happy to take a
> > > look at cleaning copy_pte_range() up in future.
> > 
> > Yes, I proposed that as this one won't affect any existing applications
> > (unlike the existing ones) but only new userspace driver apps that will use
> > this new atomic feature.
> > 
> > IMHO it'll be a pity to add extra complexity and maintainance burden into
> > fork() if only for keeping the "logical correctness of fork()" however the
> > code never triggers. If we start with trylock we'll know whether people
> > will use it, since people will complain with a reason when needed; however
> > I still doubt whether a sane userspace device driver should fork() within
> > busy interaction with the device underneath..
> 
> I will refrain from commenting on the sanity or otherwise of doing that :-)
> 
> Agree such a scenario seems unlikely in practice (and possibly unreasonable). 
> Keeping the "logical correctness of fork()" still seems worthwhile to me, but 
> if the added complexity/maintenance burden for an admittedly fairly specific 
> feature is going to stop progress here I am happy to take the fail fork 
> approach. I could then possibly fix it up as a future clean up to 
> copy_pte_range(). Perhaps others have thoughts?

Yes, it's more about making this series easier to be accepted, and it'll be
great to have others' input.

Btw, just to mention that I don't even think fail fork() on failed trylock() is
against "logical correctness of fork()": IMHO it's still 100% correct just like
most syscalls can return with -EAGAIN, that suggests the userspace to try again
the syscall, and I hope that also works for fork().  I'd be more than glad to
be corrected too.

-- 
Peter Xu



Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Jason Gunthorpe
On Tue, May 18, 2021 at 07:45:05PM -0400, Peter Xu wrote:
> On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > Logically during fork all these device exclusive pages should be
> > reverted back to their CPU pages, write protected and the CPU page PTE
> > copied to the fork.
> > 
> > We should not copy the device exclusive page PTE to the fork. I think
> > I pointed to this on an earlier rev..
> 
> Agreed.  Though please see the question I posted in the other thread: now I am
> not very sure whether we'll be able to mark a page as device exclusive if that
> page has mapcount>1.

IMHO it is similar to write protect done by filesystems on shared
mappings - all VMAs with a copy of the CPU page have to get switched
to the device exclusive PTE. This is why the rmap stuff is involved in
the migration helpers

Jason


Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 10:15:41 PM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> > Failing fork() because we couldn't take a lock doesn't seem like the right
> > approach though, especially as there is already existing code that
> > retries. I get this adds complexity though, so would be happy to take a
> > look at cleaning copy_pte_range() up in future.
> 
> Yes, I proposed that as this one won't affect any existing applications
> (unlike the existing ones) but only new userspace driver apps that will use
> this new atomic feature.
> 
> IMHO it'll be a pity to add extra complexity and maintainance burden into
> fork() if only for keeping the "logical correctness of fork()" however the
> code never triggers. If we start with trylock we'll know whether people
> will use it, since people will complain with a reason when needed; however
> I still doubt whether a sane userspace device driver should fork() within
> busy interaction with the device underneath..

I will refrain from commenting on the sanity or otherwise of doing that :-)

Agree such a scenario seems unlikely in practice (and possibly unreasonable). 
Keeping the "logical correctness of fork()" still seems worthwhile to me, but 
if the added complexity/maintenance burden for an admittedly fairly specific 
feature is going to stop progress here I am happy to take the fail fork 
approach. I could then possibly fix it up as a future clean up to 
copy_pte_range(). Perhaps others have thoughts?

> In all cases, please still consider to keep them in copy_nonpresent_pte()
> (and if to rework, separating patches would be great).
>
> Thanks,
> 
> --
> Peter Xu






Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 10:21:08 PM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Wed, May 19, 2021 at 09:35:10PM +1000, Alistair Popple wrote:
> > I think the approach you are describing is similar to what
> > migrate_vma_collect()/migrate_vma_unamp() does now and I think it could be
> > made to work. I ended up going with the GUP+unmap approach in part because
> > Christoph suggested it but primarily because it required less code
> > especially given we needed to call something to fault the page in/break
> > COW anyway (or extend what was there to call handle_mm_fault(), etc.).
> 
> I see, thank. Would you mind add a short paragraph in the commit message
> talking about these two solutions and why we choose the GUP way?

Sure.

 - Alistair

> --
> Peter Xu






Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 10:24:27 PM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Wed, May 19, 2021 at 08:49:01PM +1000, Alistair Popple wrote:
> > On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> > > External email: Use caution opening links or attachments
> > > 
> > > 
> > > On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> > > 
> > > [...]
> > > 
> > > > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > > > +unsigned long address, void *arg)
> > > > +{
> > > > + struct ttp_args ttp = {
> > > > + .mm = mm,
> > > > + .address = address,
> > > > + .arg = arg,
> > > > + .valid = false,
> > > > + };
> > > > + struct rmap_walk_control rwc = {
> > > > + .rmap_one = try_to_protect_one,
> > > > + .done = page_not_mapped,
> > > > + .anon_lock = page_lock_anon_vma_read,
> > > > + .arg = ,
> > > > + };
> > > > +
> > > > + /*
> > > > +  * Restrict to anonymous pages for now to avoid potential
> > > > writeback
> > > > +  * issues.
> > > > +  */
> > > > + if (!PageAnon(page))
> > > > + return false;
> > > > +
> > > > + /*
> > > > +  * During exec, a temporary VMA is setup and later moved.
> > > > +  * The VMA is moved under the anon_vma lock but not the
> > > > +  * page tables leading to a race where migration cannot
> > > > +  * find the migration ptes. Rather than increasing the
> > > > +  * locking requirements of exec(), migration skips
> > > > +  * temporary VMAs until after exec() completes.
> > > > +  */
> > > > + if (!PageKsm(page) && PageAnon(page))
> > > > + rwc.invalid_vma = invalid_migration_vma;
> > > > +
> > > > + rmap_walk(page, );
> > > > +
> > > > + return ttp.valid && !page_mapcount(page);
> > > > +}
> > > 
> > > I raised a question in the other thread regarding fork():
> > > 
> > > https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/
> > > 
> > > While I suddenly noticed that we may have similar issues even if we
> > > fork()
> > > before creating the ptes.
> > > 
> > > In that case, we may see multiple read-only ptes pointing to the same
> > > page.
> > > We will convert all of them into device exclusive read ptes in
> > > rmap_walk()
> > > above, however how do we guarantee after all COW done in the parent and
> > > all
> > > the childs processes, the device owned page will be returned to the
> > > parent?
> > 
> > I assume you are talking about a fork() followed by a call to
> > make_device_exclusive()? I think this should be ok because
> > make_device_exclusive() always calls GUP with FOLL_WRITE both to break COW
> > and because a device performing atomic operations needs to write to the
> > page. I suppose a comment here highlighting the need to break COW to
> > avoid this scenario would be useful though.
> 
> Indeed, sorry for the false alarm!  Yes it would be great to mention that
> too.

No problem! Thanks for the comments.

> --
> Peter Xu






Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu
On Wed, May 19, 2021 at 08:49:01PM +1000, Alistair Popple wrote:
> On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> > 
> > [...]
> > 
> > > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > > +unsigned long address, void *arg)
> > > +{
> > > + struct ttp_args ttp = {
> > > + .mm = mm,
> > > + .address = address,
> > > + .arg = arg,
> > > + .valid = false,
> > > + };
> > > + struct rmap_walk_control rwc = {
> > > + .rmap_one = try_to_protect_one,
> > > + .done = page_not_mapped,
> > > + .anon_lock = page_lock_anon_vma_read,
> > > + .arg = ,
> > > + };
> > > +
> > > + /*
> > > +  * Restrict to anonymous pages for now to avoid potential writeback
> > > +  * issues.
> > > +  */
> > > + if (!PageAnon(page))
> > > + return false;
> > > +
> > > + /*
> > > +  * During exec, a temporary VMA is setup and later moved.
> > > +  * The VMA is moved under the anon_vma lock but not the
> > > +  * page tables leading to a race where migration cannot
> > > +  * find the migration ptes. Rather than increasing the
> > > +  * locking requirements of exec(), migration skips
> > > +  * temporary VMAs until after exec() completes.
> > > +  */
> > > + if (!PageKsm(page) && PageAnon(page))
> > > + rwc.invalid_vma = invalid_migration_vma;
> > > +
> > > + rmap_walk(page, );
> > > +
> > > + return ttp.valid && !page_mapcount(page);
> > > +}
> > 
> > I raised a question in the other thread regarding fork():
> > 
> > https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/
> > 
> > While I suddenly noticed that we may have similar issues even if we fork()
> > before creating the ptes.
> > 
> > In that case, we may see multiple read-only ptes pointing to the same page. 
> > We will convert all of them into device exclusive read ptes in rmap_walk()
> > above, however how do we guarantee after all COW done in the parent and all
> > the childs processes, the device owned page will be returned to the parent?
> 
> I assume you are talking about a fork() followed by a call to 
> make_device_exclusive()? I think this should be ok because 
> make_device_exclusive() always calls GUP with FOLL_WRITE both to break COW 
> and 
> because a device performing atomic operations needs to write to the page. I 
> suppose a comment here highlighting the need to break COW to avoid this 
> scenario would be useful though.

Indeed, sorry for the false alarm!  Yes it would be great to mention that too.

-- 
Peter Xu



Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu
On Wed, May 19, 2021 at 09:35:10PM +1000, Alistair Popple wrote:
> I think the approach you are describing is similar to what 
> migrate_vma_collect()/migrate_vma_unamp() does now and I think it could be 
> made to work. I ended up going with the GUP+unmap approach in part because 
> Christoph suggested it but primarily because it required less code especially 
> given we needed to call something to fault the page in/break COW anyway (or 
> extend what was there to call handle_mm_fault(), etc.).

I see, thank. Would you mind add a short paragraph in the commit message
talking about these two solutions and why we choose the GUP way?

-- 
Peter Xu



Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu
On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> Failing fork() because we couldn't take a lock doesn't seem like the right 
> approach though, especially as there is already existing code that retries. I 
> get this adds complexity though, so would be happy to take a look at cleaning 
> copy_pte_range() up in future.

Yes, I proposed that as this one won't affect any existing applications (unlike
the existing ones) but only new userspace driver apps that will use this new
atomic feature.

IMHO it'll be a pity to add extra complexity and maintainance burden into
fork() if only for keeping the "logical correctness of fork()" however the code
never triggers. If we start with trylock we'll know whether people will use it,
since people will complain with a reason when needed; however I still doubt
whether a sane userspace device driver should fork() within busy interaction
with the device underneath..

In all cases, please still consider to keep them in copy_nonpresent_pte() (and
if to rework, separating patches would be great).

Thanks,

-- 
Peter Xu



Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 3:27:42 AM AEST Peter Xu wrote:
> > > The odd part is the remote GUP should have walked the page table
> > > already, so since the target here is the vaddr to replace, the 1st page
> > > table walk should be able to both trylock/lock the page, then modify
> > > the pte with pgtable lock held, return the locked page, then walk the
> > > rmap again to remove all the rest of the ptes that are mapping to this
> > > page.  In that case before we call the rmap_walk() we know this must be
> > > the page we want to take care of, and no one will be able to restore
> > > the original mm pte either (as we're with the page lock).  Then we
> > > don't need this check, neither do we need ttp->address.
> > 
> > If I am understanding you correctly I think this would be similar to the
> > approach that was taken in v2. However it pretty much ended up being just
> > an open-coded version of gup which is useful anyway to fault the page in.
> I see.  For easier reference this is v2 patch 1:
> 
> https://lore.kernel.org/lkml/20210219020750.16444-2-apop...@nvidia.com/

Sorry, I should have been clearer and just included that reference for you.

> Indeed that looks like it, it's just that instead of grabbing the page only
> in hmm_exclusive_pmd() we can do the pte modification along the way to seal
> the whole thing (address/pte & page).  I saw Christoph and Jason commented
> in that patch, but not regarding to this approach.  So is there a reason
> that you switched?  Do you think it'll work?

I think the approach you are describing is similar to what 
migrate_vma_collect()/migrate_vma_unamp() does now and I think it could be 
made to work. I ended up going with the GUP+unmap approach in part because 
Christoph suggested it but primarily because it required less code especially 
given we needed to call something to fault the page in/break COW anyway (or 
extend what was there to call handle_mm_fault(), etc.).

> I have no strong opinion either, it's just not crystal clear why we'd need
> that ttp->address at all for a rmap walk along with that "valid" field.

It's purely to ensure the PTE pointing to the GUP page was replaced with an 
exclusive swap entry and that the mapping didn't change between calls.

> Meanwhile it should be slightly less efficient too to go with current
> approach, especially when the page array gets huge, I think: since there'll
> be longer time we do GUP before doing the rmap walk, so higher possibility
> that the GUPed pages got replaced for whatever reason.  Then the call to
> make_device_exclusive_range() will fail as a whole just for a single page
> replacement within the range.






Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 9:45:05 AM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > Logically during fork all these device exclusive pages should be
> > reverted back to their CPU pages, write protected and the CPU page PTE
> > copied to the fork.
> > 
> > We should not copy the device exclusive page PTE to the fork. I think
> > I pointed to this on an earlier rev..
> 
> Agreed.  Though please see the question I posted in the other thread: now I
> am not very sure whether we'll be able to mark a page as device exclusive
> if that page has mapcount>1.
>
> > We can optimize this into the various variants above, but logically
> > device exclusive stop existing during fork.
> 
> Makes sense, I think that's indeed what this patch did at least for the COW
> case, so I think Alistair did address that comment.  It's just that I think
> we need to drop the other !COW case (imho that should correspond to the
> changes in copy_nonpresent_pte()) in this patch to guarantee it.

Right. The main change from v7 -> v8 was to remove device exclusive entries on 
fork instead of copying them. The change in copy_nonpresent_pte() is for the
!COW case. I think what you are getting at is given exclusive entries are 
(currently) only supported for PageAnon pages is_cow_mapping() will always be 
true and therefore the change to copy_nonpresent_pte() can be dropped. That 
logic seems reasonable so I will change the exclusive case in 
copy_nonpresent_pte() to a VM_WARN_ON.

> I also hope we don't make copy_pte_range() even more complicated just to do
> the lock_page() right, so we could fail the fork() if the lock is hard to
> take.

Failing fork() because we couldn't take a lock doesn't seem like the right 
approach though, especially as there is already existing code that retries. I 
get this adds complexity though, so would be happy to take a look at cleaning 
copy_pte_range() up in future.

> --
> Peter Xu






Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> 
> [...]
> 
> > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > +unsigned long address, void *arg)
> > +{
> > + struct ttp_args ttp = {
> > + .mm = mm,
> > + .address = address,
> > + .arg = arg,
> > + .valid = false,
> > + };
> > + struct rmap_walk_control rwc = {
> > + .rmap_one = try_to_protect_one,
> > + .done = page_not_mapped,
> > + .anon_lock = page_lock_anon_vma_read,
> > + .arg = ,
> > + };
> > +
> > + /*
> > +  * Restrict to anonymous pages for now to avoid potential writeback
> > +  * issues.
> > +  */
> > + if (!PageAnon(page))
> > + return false;
> > +
> > + /*
> > +  * During exec, a temporary VMA is setup and later moved.
> > +  * The VMA is moved under the anon_vma lock but not the
> > +  * page tables leading to a race where migration cannot
> > +  * find the migration ptes. Rather than increasing the
> > +  * locking requirements of exec(), migration skips
> > +  * temporary VMAs until after exec() completes.
> > +  */
> > + if (!PageKsm(page) && PageAnon(page))
> > + rwc.invalid_vma = invalid_migration_vma;
> > +
> > + rmap_walk(page, );
> > +
> > + return ttp.valid && !page_mapcount(page);
> > +}
> 
> I raised a question in the other thread regarding fork():
> 
> https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/
> 
> While I suddenly noticed that we may have similar issues even if we fork()
> before creating the ptes.
> 
> In that case, we may see multiple read-only ptes pointing to the same page. 
> We will convert all of them into device exclusive read ptes in rmap_walk()
> above, however how do we guarantee after all COW done in the parent and all
> the childs processes, the device owned page will be returned to the parent?

I assume you are talking about a fork() followed by a call to 
make_device_exclusive()? I think this should be ok because 
make_device_exclusive() always calls GUP with FOLL_WRITE both to break COW and 
because a device performing atomic operations needs to write to the page. I 
suppose a comment here highlighting the need to break COW to avoid this 
scenario would be useful though.

> E.g., if parent accesses the page earlier than the children processes
> (actually, as long as not the last one), do_wp_page() will do COW for parent
> on this page because refcount(page)>1, then the page seems to get lost to a
> random child too..
>
> To resolve all these complexity, not sure whether try_to_protect() could
> enforce VM_DONTCOPY (needs madvise MADV_DONTFORK in the user app), meanwhile
> make sure mapcount(page)==1 before granting the page to the device, so that
> this will guarantee this mm owns this page forever, I think?  It'll
> simplify fork() too as a side effect, since VM_DONTCOPY vma go away when
> fork.
> 
> --
> Peter Xu






Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu
On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> Logically during fork all these device exclusive pages should be
> reverted back to their CPU pages, write protected and the CPU page PTE
> copied to the fork.
> 
> We should not copy the device exclusive page PTE to the fork. I think
> I pointed to this on an earlier rev..

Agreed.  Though please see the question I posted in the other thread: now I am
not very sure whether we'll be able to mark a page as device exclusive if that
page has mapcount>1.

> 
> We can optimize this into the various variants above, but logically
> device exclusive stop existing during fork.

Makes sense, I think that's indeed what this patch did at least for the COW
case, so I think Alistair did address that comment.  It's just that I think we
need to drop the other !COW case (imho that should correspond to the changes in
copy_nonpresent_pte()) in this patch to guarantee it.

I also hope we don't make copy_pte_range() even more complicated just to do the
lock_page() right, so we could fail the fork() if the lock is hard to take.

-- 
Peter Xu



Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Jason Gunthorpe
On Tue, May 18, 2021 at 04:29:14PM -0400, Peter Xu wrote:
> On Tue, May 18, 2021 at 04:45:09PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > > > Indeed it'll be odd for a COW page since for COW page then it means 
> > > > > after
> > > > > parent/child writting to the page it'll clone into two, then it's a 
> > > > > mistery on
> > > > > which one will be the one that "exclusived owned" by the device..
> > > > 
> > > > For COW pages it is like every other fork case.. We can't reliably
> > > > write-protect the device_exclusive page during fork so we must copy it
> > > > at fork time.
> > > > 
> > > > Thus three reasonable choices:
> > > >  - Copy to a new CPU page
> > > >  - Migrate back to a CPU page and write protect it
> > > >  - Copy to a new device exclusive page
> > > 
> > > IMHO the ownership question would really help us to answer this one..
> > 
> > I'm confused about what device ownership you are talking about
> 
> My question was more about the user scenario rather than anything related to
> the kernel code, nor does it related to page struct at all.
> 
> Let me try to be a little bit more verbose...
> 
> Firstly, I think one simple solution to handle fork() of device exclusive ptes
> is to do just like device private ptes: if COW we convert writable ptes into
> readable ptes.  Then when CPU access happens (in either parent/child) page
> restore triggers which will convert those readable ptes into read-only present
> ptes (with the original page backing it).  Then do_wp_page() will take care of
> page copy.

I suspect it doesn't work. This is much more like pinning than
anything, the data in the page is still under active use by a device
and if we cannot globally write write protect it, both from CPU and
device access, then we cannot do COW. IIRC the mm can't trigger a full
global write protect through the pgmap?
 
> Then here comes the ownership question: If we still want to have the parent
> process behave like before it fork()ed, IMHO we must make sure that original
> page (that exclusively owned by the device once) still belongs to the parent
> process not the child.  That's why I think if that's the case we'd do early 
> cow
> in fork(), because it guarantees that.

Logically during fork all these device exclusive pages should be
reverted back to their CPU pages, write protected and the CPU page PTE
copied to the fork.

We should not copy the device exclusive page PTE to the fork. I think
I pointed to this on an earlier rev..

We can optimize this into the various variants above, but logically
device exclusive stop existing during fork.

Jason


Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu
On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:

[...]

> +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> +unsigned long address, void *arg)
> +{
> + struct ttp_args ttp = {
> + .mm = mm,
> + .address = address,
> + .arg = arg,
> + .valid = false,
> + };
> + struct rmap_walk_control rwc = {
> + .rmap_one = try_to_protect_one,
> + .done = page_not_mapped,
> + .anon_lock = page_lock_anon_vma_read,
> + .arg = ,
> + };
> +
> + /*
> +  * Restrict to anonymous pages for now to avoid potential writeback
> +  * issues.
> +  */
> + if (!PageAnon(page))
> + return false;
> +
> + /*
> +  * During exec, a temporary VMA is setup and later moved.
> +  * The VMA is moved under the anon_vma lock but not the
> +  * page tables leading to a race where migration cannot
> +  * find the migration ptes. Rather than increasing the
> +  * locking requirements of exec(), migration skips
> +  * temporary VMAs until after exec() completes.
> +  */
> + if (!PageKsm(page) && PageAnon(page))
> + rwc.invalid_vma = invalid_migration_vma;
> +
> + rmap_walk(page, );
> +
> + return ttp.valid && !page_mapcount(page);
> +}

I raised a question in the other thread regarding fork():

https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/

While I suddenly noticed that we may have similar issues even if we fork()
before creating the ptes.

In that case, we may see multiple read-only ptes pointing to the same page.  We
will convert all of them into device exclusive read ptes in rmap_walk() above,
however how do we guarantee after all COW done in the parent and all the childs
processes, the device owned page will be returned to the parent?

E.g., if parent accesses the page earlier than the children processes
(actually, as long as not the last one), do_wp_page() will do COW for parent on
this page because refcount(page)>1, then the page seems to get lost to a random
child too..

To resolve all these complexity, not sure whether try_to_protect() could
enforce VM_DONTCOPY (needs madvise MADV_DONTFORK in the user app), meanwhile
make sure mapcount(page)==1 before granting the page to the device, so that
this will guarantee this mm owns this page forever, I think?  It'll simplify
fork() too as a side effect, since VM_DONTCOPY vma go away when fork.

-- 
Peter Xu



Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu
On Tue, May 18, 2021 at 04:45:09PM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > > Indeed it'll be odd for a COW page since for COW page then it means 
> > > > after
> > > > parent/child writting to the page it'll clone into two, then it's a 
> > > > mistery on
> > > > which one will be the one that "exclusived owned" by the device..
> > > 
> > > For COW pages it is like every other fork case.. We can't reliably
> > > write-protect the device_exclusive page during fork so we must copy it
> > > at fork time.
> > > 
> > > Thus three reasonable choices:
> > >  - Copy to a new CPU page
> > >  - Migrate back to a CPU page and write protect it
> > >  - Copy to a new device exclusive page
> > 
> > IMHO the ownership question would really help us to answer this one..
> 
> I'm confused about what device ownership you are talking about

My question was more about the user scenario rather than anything related to
the kernel code, nor does it related to page struct at all.

Let me try to be a little bit more verbose...

Firstly, I think one simple solution to handle fork() of device exclusive ptes
is to do just like device private ptes: if COW we convert writable ptes into
readable ptes.  Then when CPU access happens (in either parent/child) page
restore triggers which will convert those readable ptes into read-only present
ptes (with the original page backing it).  Then do_wp_page() will take care of
page copy.

However... if you see that also means parent/child have the equal opportunity
to reuse that original page: who access first will do COW because refcount>1
for that page (note! it's possible that mapcount==1 here, as we drop mapcount
when converting to device exclusive ptes; however with the most recent
do_wp_page change from Linus where we'll also check page_count(), we'll still
do COW just like when this page was GUPed by someone else).  While that matters
because the device is writting to that original page only, not the COWed one.

Then here comes the ownership question: If we still want to have the parent
process behave like before it fork()ed, IMHO we must make sure that original
page (that exclusively owned by the device once) still belongs to the parent
process not the child.  That's why I think if that's the case we'd do early cow
in fork(), because it guarantees that.

I can't say I fully understand the whole picture, so sorry if I missed
something important there.

Thanks,

-- 
Peter Xu



Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Jason Gunthorpe
On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > Indeed it'll be odd for a COW page since for COW page then it means after
> > > parent/child writting to the page it'll clone into two, then it's a 
> > > mistery on
> > > which one will be the one that "exclusived owned" by the device..
> > 
> > For COW pages it is like every other fork case.. We can't reliably
> > write-protect the device_exclusive page during fork so we must copy it
> > at fork time.
> > 
> > Thus three reasonable choices:
> >  - Copy to a new CPU page
> >  - Migrate back to a CPU page and write protect it
> >  - Copy to a new device exclusive page
> 
> IMHO the ownership question would really help us to answer this one..

I'm confused about what device ownership you are talking about

It is just a page and it is tied to some specific pgmap?

If the thing providing the migration stuff goes away then all
device_exclusive pages should revert back to CPU pages and destroy the
pgmap?

Jason


Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu
On Tue, May 18, 2021 at 02:33:34PM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 01:27:42PM -0400, Peter Xu wrote:
> 
> > I also have a pure and high level question regarding a process fork() when
> > there're device exclusive ptes: would the two processes then own the device
> > together?  Is this a real usecase?
> 
> If the pages are MAP_SHARED then yes. All VMAs should point at the
> same device_exclusive page and all VMA should migrate back to CPU
> pages together.

Makes sense.  If we keep the anonymous-only in this series (I think it's good
to separate these), maybe we can drop the !COW case, plus some proper installed
WARN_ON_ONCE()s.

> 
> > Indeed it'll be odd for a COW page since for COW page then it means after
> > parent/child writting to the page it'll clone into two, then it's a mistery 
> > on
> > which one will be the one that "exclusived owned" by the device..
> 
> For COW pages it is like every other fork case.. We can't reliably
> write-protect the device_exclusive page during fork so we must copy it
> at fork time.
> 
> Thus three reasonable choices:
>  - Copy to a new CPU page
>  - Migrate back to a CPU page and write protect it
>  - Copy to a new device exclusive page

IMHO the ownership question would really help us to answer this one..

If the device ownership should be kept in parent IMHO option (1) is the best
approach. To be explicit on page copy: we can do best-effort, even if the copy
is during a device atomic operation, perhaps?

If the ownership will be shared, seems option (3) will be easier as I don't see
a strong reason to do immediate restorinng of ptes; as long as we're careful on
the refcounting.

Thanks,

-- 
Peter Xu



Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Jason Gunthorpe
On Tue, May 18, 2021 at 01:27:42PM -0400, Peter Xu wrote:

> I also have a pure and high level question regarding a process fork() when
> there're device exclusive ptes: would the two processes then own the device
> together?  Is this a real usecase?

If the pages are MAP_SHARED then yes. All VMAs should point at the
same device_exclusive page and all VMA should migrate back to CPU
pages together.

> Indeed it'll be odd for a COW page since for COW page then it means after
> parent/child writting to the page it'll clone into two, then it's a mistery on
> which one will be the one that "exclusived owned" by the device..

For COW pages it is like every other fork case.. We can't reliably
write-protect the device_exclusive page during fork so we must copy it
at fork time.

Thus three reasonable choices:
 - Copy to a new CPU page
 - Migrate back to a CPU page and write protect it
 - Copy to a new device exclusive page

Jason


Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu
Alistair,

While I got one reply below to your previous email, I also looked at the rest
code (majorly restore and fork sides) today and added a few more comments.

On Tue, May 18, 2021 at 11:19:05PM +1000, Alistair Popple wrote:

[...]

> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 3a5705cfc891..556ff396f2e9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct
> > > *vma, unsigned long addr,> 
> > >  }
> > >  #endif
> > > 
> > > +static void restore_exclusive_pte(struct vm_area_struct *vma,
> > > +   struct page *page, unsigned long address,
> > > +   pte_t *ptep)
> > > +{
> > > + pte_t pte;
> > > + swp_entry_t entry;
> > > +
> > > + pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> > > + if (pte_swp_soft_dirty(*ptep))
> > > + pte = pte_mksoft_dirty(pte);
> > > +
> > > + entry = pte_to_swp_entry(*ptep);
> > > + if (pte_swp_uffd_wp(*ptep))
> > > + pte = pte_mkuffd_wp(pte);
> > > + else if (is_writable_device_exclusive_entry(entry))
> > > + pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > > +
> > > + set_pte_at(vma->vm_mm, address, ptep, pte);
> > > +
> > > + /*
> > > +  * No need to take a page reference as one was already
> > > +  * created when the swap entry was made.
> > > +  */
> > > + if (PageAnon(page))
> > > + page_add_anon_rmap(page, vma, address, false);
> > > + else
> > > + page_add_file_rmap(page, false);

This seems to be another leftover; maybe WARN_ON_ONCE(!PageAnon(page))?

> > > +
> > > + if (vma->vm_flags & VM_LOCKED)
> > > + mlock_vma_page(page);
> > > +
> > > + /*
> > > +  * No need to invalidate - it was non-present before. However
> > > +  * secondary CPUs may have mappings that need invalidating.
> > > +  */
> > > + update_mmu_cache(vma, address, ptep);
> > > +}

[...]

> > >  /*
> > >  
> > >   * copy one vm_area from one task to the other. Assumes the page tables
> > >   * already present in the new task to be cleared in the whole range
> > > 
> > > @@ -781,6 +859,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct
> > > mm_struct *src_mm,> 
> > >   pte = pte_swp_mkuffd_wp(pte);
> > >   
> > >   set_pte_at(src_mm, addr, src_pte, pte);
> > >   
> > >   }
> > > 
> > > + } else if (is_device_exclusive_entry(entry)) {
> > > + /* COW mappings should be dealt with by removing the entry
> > > */

Here the comment says "removing the entry" however I think it didn't remove the
pte, instead it keeps it (as there's no "return", so set_pte_at() will be
called below), so I got a bit confused.

> > > + VM_BUG_ON(is_cow_mapping(vm_flags));

Also here, if PageAnon() is the only case to support so far, doesn't that
easily satisfy is_cow_mapping()? Maybe I missed something..

I also have a pure and high level question regarding a process fork() when
there're device exclusive ptes: would the two processes then own the device
together?  Is this a real usecase?

Indeed it'll be odd for a COW page since for COW page then it means after
parent/child writting to the page it'll clone into two, then it's a mistery on
which one will be the one that "exclusived owned" by the device..

> > > + page = pfn_swap_entry_to_page(entry);
> > > + get_page(page);
> > > + rss[mm_counter(page)]++;
> > > 
> > >   }
> > >   set_pte_at(dst_mm, addr, dst_pte, pte);
> > >   return 0;
> > > 
> > > @@ -947,6 +1031,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct
> > > vm_area_struct *src_vma,> 
> > >   int rss[NR_MM_COUNTERS];
> > >   swp_entry_t entry = (swp_entry_t){0};
> > >   struct page *prealloc = NULL;
> > > 
> > > + struct page *locked_page = NULL;
> > > 
> > >  again:
> > >   progress = 0;
> > > 
> > > @@ -980,13 +1065,36 @@ copy_pte_range(struct vm_area_struct *dst_vma,
> > > struct vm_area_struct *src_vma,> 
> > >   continue;
> > >   
> > >   }
> > >   if (unlikely(!pte_present(*src_pte))) {
> > > 
> > > - entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> > > - dst_pte, src_pte,
> > > - src_vma, addr, rss);
> > > - if (entry.val)
> > > - break;
> > > - progress += 8;
> > > - continue;
> > > + swp_entry_t swp_entry = pte_to_swp_entry(*src_pte);

(Just a side note to all of us: this will be one more place that I'll need to
 look after in my uffd-wp series if this series lands first, as after that
 series we can only call 

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Alistair Popple
On Tuesday, 18 May 2021 12:08:34 PM AEST Peter Xu wrote:
> > v6:
> > * Fixed a bisectablity issue due to incorrectly applying the rename of
> > 
> >   migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.
> > 
> > v5:
> > * Renamed range->migrate_pgmap_owner to range->owner.
> 
> May be nicer to mention this rename in commit message (or make it a separate
> patch)?

Ok, I think if it needs to be explicitly called out in the commit message it 
should be a separate patch. Originally I thought the change was _just_ small 
enough to include here, but this patch has since grown so I'll split it out.
 


> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index 0e25d829f742..3a1ce4ef9276 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -193,6 +193,10 @@ int page_referenced(struct page *, int is_locked,
> > 
> >  bool try_to_migrate(struct page *page, enum ttu_flags flags);
> >  bool try_to_unmap(struct page *, enum ttu_flags flags);
> > 
> > +int make_device_exclusive_range(struct mm_struct *mm, unsigned long
> > start,
> > + unsigned long end, struct page **pages,
> > + void *arg);
> > +
> > 
> >  /* Avoid racy checks */
> >  #define PVMW_SYNC(1 << 0)
> >  /* Look for migarion entries rather than present PTEs */
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 516104b9334b..7a3c260146df 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -63,9 +63,11 @@ static inline int current_is_kswapd(void)
> > 
> >   * to a special SWP_DEVICE_* entry.
> >   */
> 
> Should we add another short description for the newly added two types?
> Otherwise the reader could get confused assuming the above comment is
> explaining all four types, while it is for SWP_DEVICE_{READ|WRITE} only?

Good idea, I can see how that could be confusing. Will add a short 
description.



> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3a5705cfc891..556ff396f2e9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct
> > *vma, unsigned long addr,> 
> >  }
> >  #endif
> > 
> > +static void restore_exclusive_pte(struct vm_area_struct *vma,
> > +   struct page *page, unsigned long address,
> > +   pte_t *ptep)
> > +{
> > + pte_t pte;
> > + swp_entry_t entry;
> > +
> > + pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> > + if (pte_swp_soft_dirty(*ptep))
> > + pte = pte_mksoft_dirty(pte);
> > +
> > + entry = pte_to_swp_entry(*ptep);
> > + if (pte_swp_uffd_wp(*ptep))
> > + pte = pte_mkuffd_wp(pte);
> > + else if (is_writable_device_exclusive_entry(entry))
> > + pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > +
> > + set_pte_at(vma->vm_mm, address, ptep, pte);
> > +
> > + /*
> > +  * No need to take a page reference as one was already
> > +  * created when the swap entry was made.
> > +  */
> > + if (PageAnon(page))
> > + page_add_anon_rmap(page, vma, address, false);
> > + else
> > + page_add_file_rmap(page, false);
> > +
> > + if (vma->vm_flags & VM_LOCKED)
> > + mlock_vma_page(page);
> > +
> > + /*
> > +  * No need to invalidate - it was non-present before. However
> > +  * secondary CPUs may have mappings that need invalidating.
> > +  */
> > + update_mmu_cache(vma, address, ptep);
> > +}
> > +
> > +/*
> > + * Tries to restore an exclusive pte if the page lock can be acquired
> > without + * sleeping. Returns 0 on success or -EBUSY if the page could
> > not be locked or + * the entry no longer points at locked_page in which
> > case locked_page should be + * locked before retrying the call.
> > + */
> > +static unsigned long
> > +try_restore_exclusive_pte(struct mm_struct *src_mm, pte_t *src_pte,
> > +   struct vm_area_struct *vma, unsigned long addr,
> > +   struct page **locked_page)
> > +{
> > + swp_entry_t entry = pte_to_swp_entry(*src_pte);
> > + struct page *page = pfn_swap_entry_to_page(entry);
> > +
> > + if (*locked_page) {
> > + /* The entry changed, retry */
> > + if (unlikely(*locked_page != page)) {
> > + unlock_page(*locked_page);
> > + put_page(*locked_page);
> > + *locked_page = page;
> > + return -EBUSY;
> > + }
> > + restore_exclusive_pte(vma, page, addr, src_pte);
> > + unlock_page(page);
> > + put_page(page);
> > + *locked_page = NULL;
> > + return 0;
> > + }
> > +
> > + if (trylock_page(page)) {
> > + restore_exclusive_pte(vma, page, addr, src_pte);
> > + unlock_page(page);
> > + return 0;
> > + 

Re: [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-17 Thread Peter Xu
Hi, Alistair,

The overall patch looks good to me, however I have a few comments or questions
inlined below.

On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> Some devices require exclusive write access to shared virtual
> memory (SVM) ranges to perform atomic operations on that memory. This
> requires CPU page tables to be updated to deny access whilst atomic
> operations are occurring.
> 
> In order to do this introduce a new swap entry
> type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
> exclusive access by a device all page table mappings for the particular
> range are replaced with device exclusive swap entries. This causes any
> CPU access to the page to result in a fault.
> 
> Faults are resovled by replacing the faulting entry with the original
> mapping. This results in MMU notifiers being called which a driver uses
> to update access permissions such as revoking atomic access. After
> notifiers have been called the device will no longer have exclusive
> access to the region.
> 
> Signed-off-by: Alistair Popple 
> Reviewed-by: Christoph Hellwig 
> 
> ---
> 
> v8:
> * Remove device exclusive entries on fork rather than copy them.
> 
> v7:
> * Added Christoph's Reviewed-by.
> * Minor cosmetic cleanups suggested by Christoph.
> * Replace mmu_notifier_range_init_migrate/exclusive with
>   mmu_notifier_range_init_owner as suggested by Christoph.
> * Replaced lock_page() with lock_page_retry() when handling faults.
> * Restrict to anonymous pages for now.
> 
> v6:
> * Fixed a bisectablity issue due to incorrectly applying the rename of
>   migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.
> 
> v5:
> * Renamed range->migrate_pgmap_owner to range->owner.

May be nicer to mention this rename in commit message (or make it a separate
patch)?

> * Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
>   allows notifiers called as a result of make_device_exclusive_range() to
>   be ignored.
> * Added a check to try_to_protect_one() to detect if the pages originally
>   returned from get_user_pages() have been unmapped or not.
> * Removed check_device_exclusive_range() as it is no longer required with
>   the other changes.
> * Documentation update.
> 
> v4:
> * Add function to check that mappings are still valid and exclusive.
> * s/long/unsigned long/ in make_device_exclusive_entry().
> ---
>  Documentation/vm/hmm.rst  |  19 ++-
>  drivers/gpu/drm/nouveau/nouveau_svm.c |   2 +-
>  include/linux/mmu_notifier.h  |  26 ++--
>  include/linux/rmap.h  |   4 +
>  include/linux/swap.h  |   4 +-
>  include/linux/swapops.h   |  44 +-
>  lib/test_hmm.c|   2 +-
>  mm/hmm.c  |   5 +
>  mm/memory.c   | 176 +++--
>  mm/migrate.c  |  10 +-
>  mm/mprotect.c |   8 +
>  mm/page_vma_mapped.c  |   9 +-
>  mm/rmap.c | 210 ++
>  13 files changed, 487 insertions(+), 32 deletions(-)
> 
> diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
> index 09e28507f5b2..a14c2938e7af 100644
> --- a/Documentation/vm/hmm.rst
> +++ b/Documentation/vm/hmm.rst
> @@ -332,7 +332,7 @@ between device driver specific code and shared common 
> code:
> walks to fill in the ``args->src`` array with PFNs to be migrated.
> The ``invalidate_range_start()`` callback is passed a
> ``struct mmu_notifier_range`` with the ``event`` field set to
> -   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
> +   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
> the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
> allows the device driver to skip the invalidation callback and only
> invalidate device private MMU mappings that are actually migrating.
> @@ -405,6 +405,23 @@ between device driver specific code and shared common 
> code:
>  
> The lock can now be released.
>  
> +Exclusive access memory
> +===
> +
> +Some devices have features such as atomic PTE bits that can be used to 
> implement
> +atomic access to system memory. To support atomic operations to a shared 
> virtual
> +memory page such a device needs access to that page which is exclusive of any
> +userspace access from the CPU. The ``make_device_exclusive_range()`` function
> +can be used to make a memory range inaccessible from userspace.
> +
> +This replaces all mappings for pages in the given range with special swap
> +entries. Any attempt to access the swap entry results in a fault which is
> +resovled by replacing the entry with the original mapping. A driver gets
> +notified that the mapping has been changed by MMU notifiers, after which 
> point
> +it will no longer have exclusive access to the page. Exclusive access is
> +guranteed to last until the 

[PATCH v8 5/8] mm: Device exclusive memory access

2021-04-07 Thread Alistair Popple
Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
exclusive access by a device all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resovled by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 

---

v8:
* Remove device exclusive entries on fork rather than copy them.

v7:
* Added Christoph's Reviewed-by.
* Minor cosmetic cleanups suggested by Christoph.
* Replace mmu_notifier_range_init_migrate/exclusive with
  mmu_notifier_range_init_owner as suggested by Christoph.
* Replaced lock_page() with lock_page_retry() when handling faults.
* Restrict to anonymous pages for now.

v6:
* Fixed a bisectablity issue due to incorrectly applying the rename of
  migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.

v5:
* Renamed range->migrate_pgmap_owner to range->owner.
* Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
  allows notifiers called as a result of make_device_exclusive_range() to
  be ignored.
* Added a check to try_to_protect_one() to detect if the pages originally
  returned from get_user_pages() have been unmapped or not.
* Removed check_device_exclusive_range() as it is no longer required with
  the other changes.
* Documentation update.

v4:
* Add function to check that mappings are still valid and exclusive.
* s/long/unsigned long/ in make_device_exclusive_entry().
---
 Documentation/vm/hmm.rst  |  19 ++-
 drivers/gpu/drm/nouveau/nouveau_svm.c |   2 +-
 include/linux/mmu_notifier.h  |  26 ++--
 include/linux/rmap.h  |   4 +
 include/linux/swap.h  |   4 +-
 include/linux/swapops.h   |  44 +-
 lib/test_hmm.c|   2 +-
 mm/hmm.c  |   5 +
 mm/memory.c   | 176 +++--
 mm/migrate.c  |  10 +-
 mm/mprotect.c |   8 +
 mm/page_vma_mapped.c  |   9 +-
 mm/rmap.c | 210 ++
 13 files changed, 487 insertions(+), 32 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..a14c2938e7af 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@ between device driver specific code and shared common code:
walks to fill in the ``args->src`` array with PFNs to be migrated.
The ``invalidate_range_start()`` callback is passed a
``struct mmu_notifier_range`` with the ``event`` field set to
-   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
allows the device driver to skip the invalidation callback and only
invalidate device private MMU mappings that are actually migrating.
@@ -405,6 +405,23 @@ between device driver specific code and shared common code:
 
The lock can now be released.
 
+Exclusive access memory
+===
+
+Some devices have features such as atomic PTE bits that can be used to 
implement
+atomic access to system memory. To support atomic operations to a shared 
virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resovled by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may proceed as described.
+
 Memory cgroup (memcg) and rss accounting
 
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index f18bd53da052..94f841026c3b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -265,7 +265,7 @@