Re: [Nouveau] [Intel-gfx] [PATCH 0/7] Per client engine busyness

2021-05-19 Thread Nieto, David M
[AMD Official Use Only]

Parsing over 550 processes for fdinfo is taking between 40-100ms single 
threaded in a 2GHz skylake IBRS within a VM using simple string comparisons and 
DIRent parsing. And that is pretty much the worst case scenario with some more 
optimized implementations.

David

From: Daniel Vetter 
Sent: Wednesday, May 19, 2021 11:23 AM
To: Tvrtko Ursulin 
Cc: Daniel Stone ; jhubb...@nvidia.com 
; nouveau@lists.freedesktop.org 
; Intel Graphics Development 
; Maling list - DRI developers 
; Simon Ser ; Koenig, 
Christian ; arit...@nvidia.com ; 
Nieto, David M 
Subject: Re: [Intel-gfx] [PATCH 0/7] Per client engine busyness

On Wed, May 19, 2021 at 6:16 PM Tvrtko Ursulin
 wrote:
>
>
> On 18/05/2021 10:40, Tvrtko Ursulin wrote:
> >
> > On 18/05/2021 10:16, Daniel Stone wrote:
> >> Hi,
> >>
> >> On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin
> >>  wrote:
> >>> I was just wondering if stat(2) and a chrdev major check would be a
> >>> solid criteria to more efficiently (compared to parsing the text
> >>> content) detect drm files while walking procfs.
> >>
> >> Maybe I'm missing something, but is the per-PID walk actually a
> >> measurable performance issue rather than just a bit unpleasant?
> >
> > Per pid and per each open fd.
> >
> > As said in the other thread what bothers me a bit in this scheme is that
> > the cost of obtaining GPU usage scales based on non-GPU criteria.
> >
> > For use case of a top-like tool which shows all processes this is a
> > smaller additional cost, but then for a gpu-top like tool it is somewhat
> > higher.
>
> To further expand, not only cost would scale per pid multiplies per open
> fd, but to detect which of the fds are DRM I see these three options:
>
> 1) Open and parse fdinfo.
> 2) Name based matching ie /dev/dri/.. something.
> 3) Stat the symlink target and check for DRM major.

stat with symlink following should be plenty fast.

> All sound quite sub-optimal to me.
>
> Name based matching is probably the least evil on system resource usage
> (Keeping the dentry cache too hot? Too many syscalls?), even though
> fundamentally I don't it is the right approach.
>
> What happens with dup(2) is another question.

We need benchmark numbers showing that on anything remotely realistic
it's an actual problem. Until we've demonstrated it's a real problem
we don't need to solve it.

E.g. top with any sorting enabled also parses way more than it
displays on every update. It seems to be doing Just Fine (tm).

> Does anyone have any feedback on the /proc//gpu idea at all?

When we know we have a problem to solve we can take a look at solutions.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2Fdata=04%7C01%7CDavid.Nieto%40amd.com%7Cf6aea97532cf41f916de08d91af32cc1%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637570453997158377%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=4CFrY9qWbJREcIcSzeO9KIn2P%2Fw6k%2BYdNlh6rdS%2BEh4%3Dreserved=0
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [Intel-gfx] [PATCH 0/7] Per client engine busyness

2021-05-19 Thread Daniel Vetter
On Wed, May 19, 2021 at 6:16 PM Tvrtko Ursulin
 wrote:
>
>
> On 18/05/2021 10:40, Tvrtko Ursulin wrote:
> >
> > On 18/05/2021 10:16, Daniel Stone wrote:
> >> Hi,
> >>
> >> On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin
> >>  wrote:
> >>> I was just wondering if stat(2) and a chrdev major check would be a
> >>> solid criteria to more efficiently (compared to parsing the text
> >>> content) detect drm files while walking procfs.
> >>
> >> Maybe I'm missing something, but is the per-PID walk actually a
> >> measurable performance issue rather than just a bit unpleasant?
> >
> > Per pid and per each open fd.
> >
> > As said in the other thread what bothers me a bit in this scheme is that
> > the cost of obtaining GPU usage scales based on non-GPU criteria.
> >
> > For use case of a top-like tool which shows all processes this is a
> > smaller additional cost, but then for a gpu-top like tool it is somewhat
> > higher.
>
> To further expand, not only cost would scale per pid multiplies per open
> fd, but to detect which of the fds are DRM I see these three options:
>
> 1) Open and parse fdinfo.
> 2) Name based matching ie /dev/dri/.. something.
> 3) Stat the symlink target and check for DRM major.

stat with symlink following should be plenty fast.

> All sound quite sub-optimal to me.
>
> Name based matching is probably the least evil on system resource usage
> (Keeping the dentry cache too hot? Too many syscalls?), even though
> fundamentally I don't it is the right approach.
>
> What happens with dup(2) is another question.

We need benchmark numbers showing that on anything remotely realistic
it's an actual problem. Until we've demonstrated it's a real problem
we don't need to solve it.

E.g. top with any sorting enabled also parses way more than it
displays on every update. It seems to be doing Just Fine (tm).

> Does anyone have any feedback on the /proc//gpu idea at all?

When we know we have a problem to solve we can take a look at solutions.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Jason Gunthorpe
> Sorry for the noise.

Not at all, it is good that more people understand things!

Jason
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu
On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> Failing fork() because we couldn't take a lock doesn't seem like the right 
> approach though, especially as there is already existing code that retries. I 
> get this adds complexity though, so would be happy to take a look at cleaning 
> copy_pte_range() up in future.

Yes, I proposed that as this one won't affect any existing applications (unlike
the existing ones) but only new userspace driver apps that will use this new
atomic feature.

IMHO it'll be a pity to add extra complexity and maintainance burden into
fork() if only for keeping the "logical correctness of fork()" however the code
never triggers. If we start with trylock we'll know whether people will use it,
since people will complain with a reason when needed; however I still doubt
whether a sane userspace device driver should fork() within busy interaction
with the device underneath..

In all cases, please still consider to keep them in copy_nonpresent_pte() (and
if to rework, separating patches would be great).

Thanks,

-- 
Peter Xu

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu
On Wed, May 19, 2021 at 09:35:10PM +1000, Alistair Popple wrote:
> I think the approach you are describing is similar to what 
> migrate_vma_collect()/migrate_vma_unamp() does now and I think it could be 
> made to work. I ended up going with the GUP+unmap approach in part because 
> Christoph suggested it but primarily because it required less code especially 
> given we needed to call something to fault the page in/break COW anyway (or 
> extend what was there to call handle_mm_fault(), etc.).

I see, thank. Would you mind add a short paragraph in the commit message
talking about these two solutions and why we choose the GUP way?

-- 
Peter Xu

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu
On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> Logically during fork all these device exclusive pages should be
> reverted back to their CPU pages, write protected and the CPU page PTE
> copied to the fork.
> 
> We should not copy the device exclusive page PTE to the fork. I think
> I pointed to this on an earlier rev..

Agreed.  Though please see the question I posted in the other thread: now I am
not very sure whether we'll be able to mark a page as device exclusive if that
page has mapcount>1.

> 
> We can optimize this into the various variants above, but logically
> device exclusive stop existing during fork.

Makes sense, I think that's indeed what this patch did at least for the COW
case, so I think Alistair did address that comment.  It's just that I think we
need to drop the other !COW case (imho that should correspond to the changes in
copy_nonpresent_pte()) in this patch to guarantee it.

I also hope we don't make copy_pte_range() even more complicated just to do the
lock_page() right, so we could fail the fork() if the lock is hard to take.

-- 
Peter Xu

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu
On Wed, May 19, 2021 at 08:49:01PM +1000, Alistair Popple wrote:
> On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> > 
> > [...]
> > 
> > > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > > +unsigned long address, void *arg)
> > > +{
> > > + struct ttp_args ttp = {
> > > + .mm = mm,
> > > + .address = address,
> > > + .arg = arg,
> > > + .valid = false,
> > > + };
> > > + struct rmap_walk_control rwc = {
> > > + .rmap_one = try_to_protect_one,
> > > + .done = page_not_mapped,
> > > + .anon_lock = page_lock_anon_vma_read,
> > > + .arg = ,
> > > + };
> > > +
> > > + /*
> > > +  * Restrict to anonymous pages for now to avoid potential writeback
> > > +  * issues.
> > > +  */
> > > + if (!PageAnon(page))
> > > + return false;
> > > +
> > > + /*
> > > +  * During exec, a temporary VMA is setup and later moved.
> > > +  * The VMA is moved under the anon_vma lock but not the
> > > +  * page tables leading to a race where migration cannot
> > > +  * find the migration ptes. Rather than increasing the
> > > +  * locking requirements of exec(), migration skips
> > > +  * temporary VMAs until after exec() completes.
> > > +  */
> > > + if (!PageKsm(page) && PageAnon(page))
> > > + rwc.invalid_vma = invalid_migration_vma;
> > > +
> > > + rmap_walk(page, );
> > > +
> > > + return ttp.valid && !page_mapcount(page);
> > > +}
> > 
> > I raised a question in the other thread regarding fork():
> > 
> > https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/
> > 
> > While I suddenly noticed that we may have similar issues even if we fork()
> > before creating the ptes.
> > 
> > In that case, we may see multiple read-only ptes pointing to the same page. 
> > We will convert all of them into device exclusive read ptes in rmap_walk()
> > above, however how do we guarantee after all COW done in the parent and all
> > the childs processes, the device owned page will be returned to the parent?
> 
> I assume you are talking about a fork() followed by a call to 
> make_device_exclusive()? I think this should be ok because 
> make_device_exclusive() always calls GUP with FOLL_WRITE both to break COW 
> and 
> because a device performing atomic operations needs to write to the page. I 
> suppose a comment here highlighting the need to break COW to avoid this 
> scenario would be useful though.

Indeed, sorry for the false alarm!  Yes it would be great to mention that too.

-- 
Peter Xu

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu
On Wed, May 19, 2021 at 11:11:55PM +1000, Alistair Popple wrote:
> On Wednesday, 19 May 2021 10:15:41 PM AEST Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> > > Failing fork() because we couldn't take a lock doesn't seem like the right
> > > approach though, especially as there is already existing code that
> > > retries. I get this adds complexity though, so would be happy to take a
> > > look at cleaning copy_pte_range() up in future.
> > 
> > Yes, I proposed that as this one won't affect any existing applications
> > (unlike the existing ones) but only new userspace driver apps that will use
> > this new atomic feature.
> > 
> > IMHO it'll be a pity to add extra complexity and maintainance burden into
> > fork() if only for keeping the "logical correctness of fork()" however the
> > code never triggers. If we start with trylock we'll know whether people
> > will use it, since people will complain with a reason when needed; however
> > I still doubt whether a sane userspace device driver should fork() within
> > busy interaction with the device underneath..
> 
> I will refrain from commenting on the sanity or otherwise of doing that :-)
> 
> Agree such a scenario seems unlikely in practice (and possibly unreasonable). 
> Keeping the "logical correctness of fork()" still seems worthwhile to me, but 
> if the added complexity/maintenance burden for an admittedly fairly specific 
> feature is going to stop progress here I am happy to take the fail fork 
> approach. I could then possibly fix it up as a future clean up to 
> copy_pte_range(). Perhaps others have thoughts?

Yes, it's more about making this series easier to be accepted, and it'll be
great to have others' input.

Btw, just to mention that I don't even think fail fork() on failed trylock() is
against "logical correctness of fork()": IMHO it's still 100% correct just like
most syscalls can return with -EAGAIN, that suggests the userspace to try again
the syscall, and I hope that also works for fork().  I'd be more than glad to
be corrected too.

-- 
Peter Xu

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu
On Wed, May 19, 2021 at 10:28:42AM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 07:45:05PM -0400, Peter Xu wrote:
> > On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > > Logically during fork all these device exclusive pages should be
> > > reverted back to their CPU pages, write protected and the CPU page PTE
> > > copied to the fork.
> > > 
> > > We should not copy the device exclusive page PTE to the fork. I think
> > > I pointed to this on an earlier rev..
> > 
> > Agreed.  Though please see the question I posted in the other thread: now I 
> > am
> > not very sure whether we'll be able to mark a page as device exclusive if 
> > that
> > page has mapcount>1.
> 
> IMHO it is similar to write protect done by filesystems on shared
> mappings - all VMAs with a copy of the CPU page have to get switched
> to the device exclusive PTE. This is why the rmap stuff is involved in
> the migration helpers

Right, I think Alistair corrected me there that I missed the early COW
happening in GUP.

Actually even without that GUP triggering early COW it won't be a problem,
because as long as one child mm restored the pte from exclusive to normal
(before any further COW happens) device exclusiveness is broken in the mmu
notifiers, and after that point all previous-exclusive ptes actually becomes
the same as a very normal PageAnon.  Then it's very sane to even not have the
original page in parent process, because we know each COWed page will contain
all the device atomic modifications (so we don't really have the requirement to
return the original page to parent).

Sorry for the noise.

-- 
Peter Xu

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Peter Xu
On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:

[...]

> +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> +unsigned long address, void *arg)
> +{
> + struct ttp_args ttp = {
> + .mm = mm,
> + .address = address,
> + .arg = arg,
> + .valid = false,
> + };
> + struct rmap_walk_control rwc = {
> + .rmap_one = try_to_protect_one,
> + .done = page_not_mapped,
> + .anon_lock = page_lock_anon_vma_read,
> + .arg = ,
> + };
> +
> + /*
> +  * Restrict to anonymous pages for now to avoid potential writeback
> +  * issues.
> +  */
> + if (!PageAnon(page))
> + return false;
> +
> + /*
> +  * During exec, a temporary VMA is setup and later moved.
> +  * The VMA is moved under the anon_vma lock but not the
> +  * page tables leading to a race where migration cannot
> +  * find the migration ptes. Rather than increasing the
> +  * locking requirements of exec(), migration skips
> +  * temporary VMAs until after exec() completes.
> +  */
> + if (!PageKsm(page) && PageAnon(page))
> + rwc.invalid_vma = invalid_migration_vma;
> +
> + rmap_walk(page, );
> +
> + return ttp.valid && !page_mapcount(page);
> +}

I raised a question in the other thread regarding fork():

https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/

While I suddenly noticed that we may have similar issues even if we fork()
before creating the ptes.

In that case, we may see multiple read-only ptes pointing to the same page.  We
will convert all of them into device exclusive read ptes in rmap_walk() above,
however how do we guarantee after all COW done in the parent and all the childs
processes, the device owned page will be returned to the parent?

E.g., if parent accesses the page earlier than the children processes
(actually, as long as not the last one), do_wp_page() will do COW for parent on
this page because refcount(page)>1, then the page seems to get lost to a random
child too..

To resolve all these complexity, not sure whether try_to_protect() could
enforce VM_DONTCOPY (needs madvise MADV_DONTFORK in the user app), meanwhile
make sure mapcount(page)==1 before granting the page to the device, so that
this will guarantee this mm owns this page forever, I think?  It'll simplify
fork() too as a side effect, since VM_DONTCOPY vma go away when fork.

-- 
Peter Xu

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH 0/7] Per client engine busyness

2021-05-19 Thread Tvrtko Ursulin



On 18/05/2021 10:40, Tvrtko Ursulin wrote:


On 18/05/2021 10:16, Daniel Stone wrote:

Hi,

On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin
 wrote:

I was just wondering if stat(2) and a chrdev major check would be a
solid criteria to more efficiently (compared to parsing the text
content) detect drm files while walking procfs.


Maybe I'm missing something, but is the per-PID walk actually a
measurable performance issue rather than just a bit unpleasant?


Per pid and per each open fd.

As said in the other thread what bothers me a bit in this scheme is that 
the cost of obtaining GPU usage scales based on non-GPU criteria.


For use case of a top-like tool which shows all processes this is a 
smaller additional cost, but then for a gpu-top like tool it is somewhat 
higher.


To further expand, not only cost would scale per pid multiplies per open 
fd, but to detect which of the fds are DRM I see these three options:


1) Open and parse fdinfo.
2) Name based matching ie /dev/dri/.. something.
3) Stat the symlink target and check for DRM major.

All sound quite sub-optimal to me.

Name based matching is probably the least evil on system resource usage 
(Keeping the dentry cache too hot? Too many syscalls?), even though 
fundamentally I don't it is the right approach.


What happens with dup(2) is another question.

Does anyone have any feedback on the /proc//gpu idea at all?

Regards,

Tvrtko
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Jason Gunthorpe
On Tue, May 18, 2021 at 07:45:05PM -0400, Peter Xu wrote:
> On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > Logically during fork all these device exclusive pages should be
> > reverted back to their CPU pages, write protected and the CPU page PTE
> > copied to the fork.
> > 
> > We should not copy the device exclusive page PTE to the fork. I think
> > I pointed to this on an earlier rev..
> 
> Agreed.  Though please see the question I posted in the other thread: now I am
> not very sure whether we'll be able to mark a page as device exclusive if that
> page has mapcount>1.

IMHO it is similar to write protect done by filesystems on shared
mappings - all VMAs with a copy of the CPU page have to get switched
to the device exclusive PTE. This is why the rmap stuff is involved in
the migration helpers

Jason
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 10:15:41 PM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Wed, May 19, 2021 at 09:04:53PM +1000, Alistair Popple wrote:
> > Failing fork() because we couldn't take a lock doesn't seem like the right
> > approach though, especially as there is already existing code that
> > retries. I get this adds complexity though, so would be happy to take a
> > look at cleaning copy_pte_range() up in future.
> 
> Yes, I proposed that as this one won't affect any existing applications
> (unlike the existing ones) but only new userspace driver apps that will use
> this new atomic feature.
> 
> IMHO it'll be a pity to add extra complexity and maintainance burden into
> fork() if only for keeping the "logical correctness of fork()" however the
> code never triggers. If we start with trylock we'll know whether people
> will use it, since people will complain with a reason when needed; however
> I still doubt whether a sane userspace device driver should fork() within
> busy interaction with the device underneath..

I will refrain from commenting on the sanity or otherwise of doing that :-)

Agree such a scenario seems unlikely in practice (and possibly unreasonable). 
Keeping the "logical correctness of fork()" still seems worthwhile to me, but 
if the added complexity/maintenance burden for an admittedly fairly specific 
feature is going to stop progress here I am happy to take the fail fork 
approach. I could then possibly fix it up as a future clean up to 
copy_pte_range(). Perhaps others have thoughts?

> In all cases, please still consider to keep them in copy_nonpresent_pte()
> (and if to rework, separating patches would be great).
>
> Thanks,
> 
> --
> Peter Xu




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 10:21:08 PM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Wed, May 19, 2021 at 09:35:10PM +1000, Alistair Popple wrote:
> > I think the approach you are describing is similar to what
> > migrate_vma_collect()/migrate_vma_unamp() does now and I think it could be
> > made to work. I ended up going with the GUP+unmap approach in part because
> > Christoph suggested it but primarily because it required less code
> > especially given we needed to call something to fault the page in/break
> > COW anyway (or extend what was there to call handle_mm_fault(), etc.).
> 
> I see, thank. Would you mind add a short paragraph in the commit message
> talking about these two solutions and why we choose the GUP way?

Sure.

 - Alistair

> --
> Peter Xu




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 10:24:27 PM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Wed, May 19, 2021 at 08:49:01PM +1000, Alistair Popple wrote:
> > On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> > > External email: Use caution opening links or attachments
> > > 
> > > 
> > > On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> > > 
> > > [...]
> > > 
> > > > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > > > +unsigned long address, void *arg)
> > > > +{
> > > > + struct ttp_args ttp = {
> > > > + .mm = mm,
> > > > + .address = address,
> > > > + .arg = arg,
> > > > + .valid = false,
> > > > + };
> > > > + struct rmap_walk_control rwc = {
> > > > + .rmap_one = try_to_protect_one,
> > > > + .done = page_not_mapped,
> > > > + .anon_lock = page_lock_anon_vma_read,
> > > > + .arg = ,
> > > > + };
> > > > +
> > > > + /*
> > > > +  * Restrict to anonymous pages for now to avoid potential
> > > > writeback
> > > > +  * issues.
> > > > +  */
> > > > + if (!PageAnon(page))
> > > > + return false;
> > > > +
> > > > + /*
> > > > +  * During exec, a temporary VMA is setup and later moved.
> > > > +  * The VMA is moved under the anon_vma lock but not the
> > > > +  * page tables leading to a race where migration cannot
> > > > +  * find the migration ptes. Rather than increasing the
> > > > +  * locking requirements of exec(), migration skips
> > > > +  * temporary VMAs until after exec() completes.
> > > > +  */
> > > > + if (!PageKsm(page) && PageAnon(page))
> > > > + rwc.invalid_vma = invalid_migration_vma;
> > > > +
> > > > + rmap_walk(page, );
> > > > +
> > > > + return ttp.valid && !page_mapcount(page);
> > > > +}
> > > 
> > > I raised a question in the other thread regarding fork():
> > > 
> > > https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/
> > > 
> > > While I suddenly noticed that we may have similar issues even if we
> > > fork()
> > > before creating the ptes.
> > > 
> > > In that case, we may see multiple read-only ptes pointing to the same
> > > page.
> > > We will convert all of them into device exclusive read ptes in
> > > rmap_walk()
> > > above, however how do we guarantee after all COW done in the parent and
> > > all
> > > the childs processes, the device owned page will be returned to the
> > > parent?
> > 
> > I assume you are talking about a fork() followed by a call to
> > make_device_exclusive()? I think this should be ok because
> > make_device_exclusive() always calls GUP with FOLL_WRITE both to break COW
> > and because a device performing atomic operations needs to write to the
> > page. I suppose a comment here highlighting the need to break COW to
> > avoid this scenario would be useful though.
> 
> Indeed, sorry for the false alarm!  Yes it would be great to mention that
> too.

No problem! Thanks for the comments.

> --
> Peter Xu




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 6:04:51 AM AEST Liam Howlett wrote:
> External email: Use caution opening links or attachments
> 
> * Alistair Popple  [210407 04:43]:
> > The behaviour of try_to_unmap_one() is difficult to follow because it
> > performs different operations based on a fairly large set of flags used
> > in different combinations.
> > 
> > TTU_MUNLOCK is one such flag. However it is exclusively used by
> > try_to_munlock() which specifies no other flags. Therefore rather than
> > overload try_to_unmap_one() with unrelated behaviour split this out into
> > it's own function and remove the flag.
> > 
> > Signed-off-by: Alistair Popple 
> > Reviewed-by: Ralph Campbell 
> > Reviewed-by: Christoph Hellwig 
> > 
> > ---
> > 
> > v8:
> > * Renamed try_to_munlock to page_mlock to better reflect what the
> > 
> >   function actually does.
> > 
> > * Removed the TODO from the documentation that this patch addresses.
> > 
> > v7:
> > * Added Christoph's Reviewed-by
> > 
> > v4:
> > * Removed redundant check for VM_LOCKED
> > ---
> > 
> >  Documentation/vm/unevictable-lru.rst | 33 ---
> >  include/linux/rmap.h |  3 +-
> >  mm/mlock.c   | 10 +++---
> >  mm/rmap.c| 48 +---
> >  4 files changed, 55 insertions(+), 39 deletions(-)
> > 
> > diff --git a/Documentation/vm/unevictable-lru.rst
> > b/Documentation/vm/unevictable-lru.rst index 0e1490524f53..eae3af17f2d9
> > 100644
> > --- a/Documentation/vm/unevictable-lru.rst
> > +++ b/Documentation/vm/unevictable-lru.rst
> > @@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone
> > statistics for the number of> 
> >  mlocked pages.  Note, however, that at this point we haven't checked
> >  whether the page is mapped by other VM_LOCKED VMAs.
> > 
> > -We can't call try_to_munlock(), the function that walks the reverse map
> > to
> > +We can't call page_mlock(), the function that walks the reverse map to
> > 
> >  check for other VM_LOCKED VMAs, without first isolating the page from the
> >  LRU.> 
> > -try_to_munlock() is a variant of try_to_unmap() and thus requires that
> > the page +page_mlock() is a variant of try_to_unmap() and thus requires
> > that the page> 
> >  not be on an LRU list [more on these below].  However, the call to
> > 
> > -isolate_lru_page() could fail, in which case we couldn't
> > try_to_munlock().  So, +isolate_lru_page() could fail, in which case we
> > can't call page_mlock().  So,> 
> >  we go ahead and clear PG_mlocked up front, as this might be the only
> >  chance we> 
> > -have.  If we can successfully isolate the page, we go ahead and
> > -try_to_munlock(), which will restore the PG_mlocked flag and update the
> > zone +have.  If we can successfully isolate the page, we go ahead and
> > call +page_mlock(), which will restore the PG_mlocked flag and update the
> > zone> 
> >  page statistics if it finds another VMA holding the page mlocked.  If we
> >  fail to isolate the page, we'll have left a potentially mlocked page on
> >  the LRU. This is fine, because we'll catch it later if and if vmscan
> >  tries to reclaim> 
> > @@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown
> > (munlock_vma_pages_all), reclaim,> 
> >  holepunching, and truncation of file pages and their anonymous COWed
> >  pages.
> > 
> > -try_to_munlock() Reverse Map Scan
> > +page_mlock() Reverse Map Scan
> > 
> >  -
> > 
> > -.. warning::
> > -   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to
> > the -   page_referenced() reverse map walker.
> > -
> > 
> >  When munlock_vma_page() [see section :ref:`munlock()/munlockall() System
> >  Call Handling ` above] tries to munlock a
> >  page, it needs to determine whether or not the page is mapped by any
> >  VM_LOCKED VMA without actually attempting to unmap all PTEs from the
> >  page.  For this purpose, the unevictable/mlock infrastructure
> > 
> > -introduced a variant of try_to_unmap() called try_to_munlock().
> > +introduced a variant of try_to_unmap() called page_mlock().
> > 
> > -try_to_munlock() calls the same functions as try_to_unmap() for anonymous
> > and -mapped file and KSM pages with a flag argument specifying unlock
> > versus unmap -processing.  Again, these functions walk the respective
> > reverse maps looking -for VM_LOCKED VMAs.  When such a VMA is found, as
> > in the try_to_unmap() case, -the functions mlock the page via
> > mlock_vma_page() and return SWAP_MLOCK.  This -undoes the pre-clearing of
> > the page's PG_mlocked done by munlock_vma_page. +page_mlock() walks the
> > respective reverse maps looking for VM_LOCKED VMAs. When +such a VMA is
> > found the page is mlocked via mlock_vma_page(). This undoes the
> > +pre-clearing of the page's PG_mlocked done by munlock_vma_page.
> > 
> > -Note that try_to_munlock()'s reverse map walk must visit every VMA in a
> > page's +Note that page_mlock()'s reverse map 

Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 3:27:42 AM AEST Peter Xu wrote:
> > > The odd part is the remote GUP should have walked the page table
> > > already, so since the target here is the vaddr to replace, the 1st page
> > > table walk should be able to both trylock/lock the page, then modify
> > > the pte with pgtable lock held, return the locked page, then walk the
> > > rmap again to remove all the rest of the ptes that are mapping to this
> > > page.  In that case before we call the rmap_walk() we know this must be
> > > the page we want to take care of, and no one will be able to restore
> > > the original mm pte either (as we're with the page lock).  Then we
> > > don't need this check, neither do we need ttp->address.
> > 
> > If I am understanding you correctly I think this would be similar to the
> > approach that was taken in v2. However it pretty much ended up being just
> > an open-coded version of gup which is useful anyway to fault the page in.
> I see.  For easier reference this is v2 patch 1:
> 
> https://lore.kernel.org/lkml/20210219020750.16444-2-apop...@nvidia.com/

Sorry, I should have been clearer and just included that reference for you.

> Indeed that looks like it, it's just that instead of grabbing the page only
> in hmm_exclusive_pmd() we can do the pte modification along the way to seal
> the whole thing (address/pte & page).  I saw Christoph and Jason commented
> in that patch, but not regarding to this approach.  So is there a reason
> that you switched?  Do you think it'll work?

I think the approach you are describing is similar to what 
migrate_vma_collect()/migrate_vma_unamp() does now and I think it could be 
made to work. I ended up going with the GUP+unmap approach in part because 
Christoph suggested it but primarily because it required less code especially 
given we needed to call something to fault the page in/break COW anyway (or 
extend what was there to call handle_mm_fault(), etc.).

> I have no strong opinion either, it's just not crystal clear why we'd need
> that ttp->address at all for a rmap walk along with that "valid" field.

It's purely to ensure the PTE pointing to the GUP page was replaced with an 
exclusive swap entry and that the mapping didn't change between calls.

> Meanwhile it should be slightly less efficient too to go with current
> approach, especially when the page array gets huge, I think: since there'll
> be longer time we do GUP before doing the rmap walk, so higher possibility
> that the GUPed pages got replaced for whatever reason.  Then the call to
> make_device_exclusive_range() will fail as a whole just for a single page
> replacement within the range.




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 9:45:05 AM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > Logically during fork all these device exclusive pages should be
> > reverted back to their CPU pages, write protected and the CPU page PTE
> > copied to the fork.
> > 
> > We should not copy the device exclusive page PTE to the fork. I think
> > I pointed to this on an earlier rev..
> 
> Agreed.  Though please see the question I posted in the other thread: now I
> am not very sure whether we'll be able to mark a page as device exclusive
> if that page has mapcount>1.
>
> > We can optimize this into the various variants above, but logically
> > device exclusive stop existing during fork.
> 
> Makes sense, I think that's indeed what this patch did at least for the COW
> case, so I think Alistair did address that comment.  It's just that I think
> we need to drop the other !COW case (imho that should correspond to the
> changes in copy_nonpresent_pte()) in this patch to guarantee it.

Right. The main change from v7 -> v8 was to remove device exclusive entries on 
fork instead of copying them. The change in copy_nonpresent_pte() is for the
!COW case. I think what you are getting at is given exclusive entries are 
(currently) only supported for PageAnon pages is_cow_mapping() will always be 
true and therefore the change to copy_nonpresent_pte() can be dropped. That 
logic seems reasonable so I will change the exclusive case in 
copy_nonpresent_pte() to a VM_WARN_ON.

> I also hope we don't make copy_pte_range() even more complicated just to do
> the lock_page() right, so we could fail the fork() if the lock is hard to
> take.

Failing fork() because we couldn't take a lock doesn't seem like the right 
approach though, especially as there is already existing code that retries. I 
get this adds complexity though, so would be happy to take a look at cleaning 
copy_pte_range() up in future.

> --
> Peter Xu




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Alistair Popple
On Wednesday, 19 May 2021 7:16:38 AM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Wed, Apr 07, 2021 at 06:42:35PM +1000, Alistair Popple wrote:
> 
> [...]
> 
> > +static bool try_to_protect(struct page *page, struct mm_struct *mm,
> > +unsigned long address, void *arg)
> > +{
> > + struct ttp_args ttp = {
> > + .mm = mm,
> > + .address = address,
> > + .arg = arg,
> > + .valid = false,
> > + };
> > + struct rmap_walk_control rwc = {
> > + .rmap_one = try_to_protect_one,
> > + .done = page_not_mapped,
> > + .anon_lock = page_lock_anon_vma_read,
> > + .arg = ,
> > + };
> > +
> > + /*
> > +  * Restrict to anonymous pages for now to avoid potential writeback
> > +  * issues.
> > +  */
> > + if (!PageAnon(page))
> > + return false;
> > +
> > + /*
> > +  * During exec, a temporary VMA is setup and later moved.
> > +  * The VMA is moved under the anon_vma lock but not the
> > +  * page tables leading to a race where migration cannot
> > +  * find the migration ptes. Rather than increasing the
> > +  * locking requirements of exec(), migration skips
> > +  * temporary VMAs until after exec() completes.
> > +  */
> > + if (!PageKsm(page) && PageAnon(page))
> > + rwc.invalid_vma = invalid_migration_vma;
> > +
> > + rmap_walk(page, );
> > +
> > + return ttp.valid && !page_mapcount(page);
> > +}
> 
> I raised a question in the other thread regarding fork():
> 
> https://lore.kernel.org/lkml/YKQjmtMo+YQGx%2FwZ@t490s/
> 
> While I suddenly noticed that we may have similar issues even if we fork()
> before creating the ptes.
> 
> In that case, we may see multiple read-only ptes pointing to the same page. 
> We will convert all of them into device exclusive read ptes in rmap_walk()
> above, however how do we guarantee after all COW done in the parent and all
> the childs processes, the device owned page will be returned to the parent?

I assume you are talking about a fork() followed by a call to 
make_device_exclusive()? I think this should be ok because 
make_device_exclusive() always calls GUP with FOLL_WRITE both to break COW and 
because a device performing atomic operations needs to write to the page. I 
suppose a comment here highlighting the need to break COW to avoid this 
scenario would be useful though.

> E.g., if parent accesses the page earlier than the children processes
> (actually, as long as not the last one), do_wp_page() will do COW for parent
> on this page because refcount(page)>1, then the page seems to get lost to a
> random child too..
>
> To resolve all these complexity, not sure whether try_to_protect() could
> enforce VM_DONTCOPY (needs madvise MADV_DONTFORK in the user app), meanwhile
> make sure mapcount(page)==1 before granting the page to the device, so that
> this will guarantee this mm owns this page forever, I think?  It'll
> simplify fork() too as a side effect, since VM_DONTCOPY vma go away when
> fork.
> 
> --
> Peter Xu




___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau