Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Jason Gunthorpe
On Tue, May 18, 2021 at 04:29:14PM -0400, Peter Xu wrote:
> On Tue, May 18, 2021 at 04:45:09PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > > > Indeed it'll be odd for a COW page since for COW page then it means
> > > > > after parent/child writing to the page it'll clone into two, then
> > > > > it's a mystery on which one will be the one that's "exclusively
> > > > > owned" by the device..
> > > > 
> > > > For COW pages it is like every other fork case.. We can't reliably
> > > > write-protect the device_exclusive page during fork so we must copy it
> > > > at fork time.
> > > > 
> > > > Thus three reasonable choices:
> > > >  - Copy to a new CPU page
> > > >  - Migrate back to a CPU page and write protect it
> > > >  - Copy to a new device exclusive page
> > > 
> > > IMHO the ownership question would really help us to answer this one..
> > 
> > I'm confused about what device ownership you are talking about
> 
> My question was more about the user scenario than anything related to the
> kernel code, nor is it related to the page struct at all.
> 
> Let me try to be a little bit more verbose...
> 
> Firstly, I think one simple solution to handle fork() of device exclusive ptes
> is to do it just like device private ptes: if COW we convert writable ptes into
> readable ptes.  Then when CPU access happens (in either parent or child) page
> restore triggers, which will convert those readable ptes into read-only present
> ptes (with the original page backing them).  Then do_wp_page() will take care
> of the page copy.

I suspect it doesn't work. This is much more like pinning than
anything: the data in the page is still under active use by a device,
and if we cannot globally write protect it, both from CPU and
device access, then we cannot do COW. IIRC the mm can't trigger a full
global write protect through the pgmap?
 
> Then here comes the ownership question: if we still want to have the parent
> process behave like before it fork()ed, IMHO we must make sure that the
> original page (the one that was once exclusively owned by the device) still
> belongs to the parent process, not the child.  That's why I think, if that's
> the case, we'd do early COW in fork(), because it guarantees that.

Logically during fork all these device exclusive pages should be
reverted to their CPU pages, write protected, and the CPU page PTE
copied to the fork.

We should not copy the device exclusive page PTE to the fork. I think
I pointed this out on an earlier rev..

We can optimize this into the various variants above, but logically
device exclusive pages stop existing during fork.
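
To make that concrete, here is a rough sketch of what reverting a
device-exclusive entry at fork time could look like (illustrative only,
not from the series; locking, rmap accounting, TLB flushing and
notifying the device via mmu notifiers are all omitted):

  /* Hypothetical handling in copy_nonpresent_pte(): turn the parent's
   * device-exclusive entry back into a normal CPU pte, write protect
   * it for COW, then copy that pte to the child. */
  if (is_device_exclusive_entry(entry)) {
          struct page *page = pfn_swap_entry_to_page(entry);
          pte_t pte = mk_pte(page, src_vma->vm_page_prot);

          if (is_cow_mapping(vm_flags))
                  pte = pte_wrprotect(pte);

          set_pte_at(src_mm, addr, src_pte, pte); /* parent reverted */
          set_pte_at(dst_mm, addr, dst_pte, pte); /* child gets CPU pte */
          get_page(page);
          rss[mm_counter(page)]++;
  }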

Jason


Re: [Nouveau] [PATCH v8 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-05-18 Thread Liam Howlett
* Alistair Popple  [210407 04:43]:
> The behaviour of try_to_unmap_one() is difficult to follow because it
> performs different operations based on a fairly large set of flags used
> in different combinations.
> 
> TTU_MUNLOCK is one such flag. However it is exclusively used by
> try_to_munlock() which specifies no other flags. Therefore rather than
> overload try_to_unmap_one() with unrelated behaviour, split this out into
> its own function and remove the flag.
> 
> Signed-off-by: Alistair Popple 
> Reviewed-by: Ralph Campbell 
> Reviewed-by: Christoph Hellwig 
> 
> ---
> 
> v8:
> * Renamed try_to_munlock to page_mlock to better reflect what the
>   function actually does.
> * Removed the TODO from the documentation that this patch addresses.
> 
> v7:
> * Added Christoph's Reviewed-by
> 
> v4:
> * Removed redundant check for VM_LOCKED
> ---
>  Documentation/vm/unevictable-lru.rst | 33 ---
>  include/linux/rmap.h |  3 +-
>  mm/mlock.c   | 10 +++---
>  mm/rmap.c| 48 +---
>  4 files changed, 55 insertions(+), 39 deletions(-)
> 
> diff --git a/Documentation/vm/unevictable-lru.rst 
> b/Documentation/vm/unevictable-lru.rst
> index 0e1490524f53..eae3af17f2d9 100644
> --- a/Documentation/vm/unevictable-lru.rst
> +++ b/Documentation/vm/unevictable-lru.rst
> @@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone 
> statistics for the number of
>  mlocked pages.  Note, however, that at this point we haven't checked whether
>  the page is mapped by other VM_LOCKED VMAs.
>  
> -We can't call try_to_munlock(), the function that walks the reverse map to
> +We can't call page_mlock(), the function that walks the reverse map to
>  check for other VM_LOCKED VMAs, without first isolating the page from the 
> LRU.
> -try_to_munlock() is a variant of try_to_unmap() and thus requires that the 
> page
> +page_mlock() is a variant of try_to_unmap() and thus requires that the page
>  not be on an LRU list [more on these below].  However, the call to
> -isolate_lru_page() could fail, in which case we couldn't try_to_munlock().  
> So,
> +isolate_lru_page() could fail, in which case we can't call page_mlock().  So,
>  we go ahead and clear PG_mlocked up front, as this might be the only chance 
> we
> -have.  If we can successfully isolate the page, we go ahead and
> -try_to_munlock(), which will restore the PG_mlocked flag and update the zone
> +have.  If we can successfully isolate the page, we go ahead and call
> +page_mlock(), which will restore the PG_mlocked flag and update the zone
>  page statistics if it finds another VMA holding the page mlocked.  If we fail
>  to isolate the page, we'll have left a potentially mlocked page on the LRU.
> This is fine, because we'll catch it later if and when vmscan tries to reclaim
> @@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown 
> (munlock_vma_pages_all), reclaim,
>  holepunching, and truncation of file pages and their anonymous COWed pages.
>  
>  
> -try_to_munlock() Reverse Map Scan
> +page_mlock() Reverse Map Scan
>  -
>  
> -.. warning::
> -   [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
> -   page_referenced() reverse map walker.
> -
>  When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call
>  Handling ` above] tries to munlock a
>  page, it needs to determine whether or not the page is mapped by any
>  VM_LOCKED VMA without actually attempting to unmap all PTEs from the
>  page.  For this purpose, the unevictable/mlock infrastructure
> -introduced a variant of try_to_unmap() called try_to_munlock().
> +introduced a variant of try_to_unmap() called page_mlock().
>  
> -try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
> -mapped file and KSM pages with a flag argument specifying unlock versus unmap
> -processing.  Again, these functions walk the respective reverse maps looking
> -for VM_LOCKED VMAs.  When such a VMA is found, as in the try_to_unmap() case,
> -the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK.  
> This
> -undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page.
> +page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. 
> When
> +such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the
> +pre-clearing of the page's PG_mlocked done by munlock_vma_page.
>  
> -Note that try_to_munlock()'s reverse map walk must visit every VMA in a 
> page's
> +Note that page_mlock()'s reverse map walk must visit every VMA in a page's
>  reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
>  However, the scan can terminate when it encounters a VM_LOCKED VMA.
> -Although try_to_munlock() might be called a great many times when munlocking 
> a
> +Although page_mlock() might be called a great many times when munlocking a
>  large region or 

Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu
On Tue, May 18, 2021 at 04:45:09PM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > > Indeed it'll be odd for a COW page since for COW page then it means
> > > > after parent/child writing to the page it'll clone into two, then it's
> > > > a mystery on which one will be the one that's "exclusively owned" by
> > > > the device..
> > > 
> > > For COW pages it is like every other fork case.. We can't reliably
> > > write-protect the device_exclusive page during fork so we must copy it
> > > at fork time.
> > > 
> > > Thus three reasonable choices:
> > >  - Copy to a new CPU page
> > >  - Migrate back to a CPU page and write protect it
> > >  - Copy to a new device exclusive page
> > 
> > IMHO the ownership question would really help us to answer this one..
> 
> I'm confused about what device ownership you are talking about

My question was more about the user scenario than anything related to the
kernel code, nor is it related to the page struct at all.

Let me try to be a little bit more verbose...

Firstly, I think one simple solution to handle fork() of device exclusive ptes
is to do it just like device private ptes: if COW we convert writable ptes into
readable ptes.  Then when CPU access happens (in either parent or child) page
restore triggers, which will convert those readable ptes into read-only present
ptes (with the original page backing them).  Then do_wp_page() will take care
of the page copy.

However... if you see, that also means parent and child have an equal
opportunity to reuse that original page: whoever accesses it first will do COW
because refcount>1 for that page (note! it's possible that mapcount==1 here,
as we drop the mapcount when converting to device exclusive ptes; however with
the most recent do_wp_page() change from Linus where we also check
page_count(), we'll still do COW just like when this page was GUPed by someone
else).  That matters because the device is writing to that original page only,
not the COWed one.
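
In condensed form, the reuse decision being referred to is roughly the
following (an illustrative paraphrase of the do_wp_page() path after
that change, not the exact kernel code):

  /* An anonymous page may only be reused in place when we hold the
   * sole reference.  The extra reference taken when the device
   * exclusive entry was created keeps page_count() > 1 even though
   * mapcount == 1, so a write from either parent or child copies the
   * page, exactly as if it had been GUPed. */
  if (PageAnon(page) && page_count(page) == 1)
          wp_page_reuse(vmf);     /* write into the original page */
  else
          wp_page_copy(vmf);      /* COW: the writer gets a new copy */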

Then here comes the ownership question: if we still want to have the parent
process behave like before it fork()ed, IMHO we must make sure that the
original page (the one that was once exclusively owned by the device) still
belongs to the parent process, not the child.  That's why I think, if that's
the case, we'd do early COW in fork(), because it guarantees that.
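
For illustration, the early-copy variant would be along the lines of the
existing copy_present_page() fallback (a sketch under that assumption,
not code from the series):

  /* Hypothetical early COW at fork: give the child its own copy up
   * front, so the original page (and the device's exclusive claim on
   * it) stays with the parent. */
  new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, addr);
  if (!new_page)
          return -ENOMEM;
  copy_user_highpage(new_page, page, addr, src_vma);
  __SetPageUptodate(new_page);
  page_add_new_anon_rmap(new_page, dst_vma, addr, false);
  lru_cache_add_inactive_or_unevictable(new_page, dst_vma);
  pte = mk_pte(new_page, dst_vma->vm_page_prot);
  set_pte_at(dst_mm, addr, dst_pte, maybe_mkwrite(pte_mkdirty(pte), dst_vma));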

I can't say I fully understand the whole picture, so sorry if I missed
something important there.

Thanks,

-- 
Peter Xu



Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu
On Tue, May 18, 2021 at 02:33:34PM -0300, Jason Gunthorpe wrote:
> On Tue, May 18, 2021 at 01:27:42PM -0400, Peter Xu wrote:
> 
> > I also have a pure and high-level question regarding a process fork() when
> > there're device exclusive ptes: would the two processes then own the device
> > together?  Is this a real use case?
> 
> If the pages are MAP_SHARED then yes. All VMAs should point at the
> same device_exclusive page and all VMA should migrate back to CPU
> pages together.

Makes sense.  If we keep the anonymous-only support in this series (I think
it's good to separate these), maybe we can drop the !COW case, plus add some
properly installed WARN_ON_ONCE()s.

> 
> > Indeed it'll be odd for a COW page since for COW page then it means after
> > parent/child writing to the page it'll clone into two, then it's a mystery
> > on which one will be the one that's "exclusively owned" by the device..
> 
> For COW pages it is like every other fork case.. We can't reliably
> write-protect the device_exclusive page during fork so we must copy it
> at fork time.
> 
> Thus three reasonable choices:
>  - Copy to a new CPU page
>  - Migrate back to a CPU page and write protect it
>  - Copy to a new device exclusive page

IMHO the ownership question would really help us to answer this one..

If the device ownership should be kept in the parent, IMHO option (1) is the
best approach. To be explicit on page copy: we can do best-effort, even if the
copy happens during a device atomic operation, perhaps?

If the ownership will be shared, option (3) seems easier, as I don't see a
strong reason to do immediate restoring of ptes, as long as we're careful on
the refcounting.

Thanks,

-- 
Peter Xu



Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Jason Gunthorpe
On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > Indeed it'll be odd for a COW page since for COW page then it means after
> > > parent/child writing to the page it'll clone into two, then it's a
> > > mystery on which one will be the one that's "exclusively owned" by the
> > > device..
> > 
> > For COW pages it is like every other fork case.. We can't reliably
> > write-protect the device_exclusive page during fork so we must copy it
> > at fork time.
> > 
> > Thus three reasonable choices:
> >  - Copy to a new CPU page
> >  - Migrate back to a CPU page and write protect it
> >  - Copy to a new device exclusive page
> 
> IMHO the ownership question would really help us to answer this one..

I'm confused about what device ownership you are talking about

It is just a page and it is tied to some specific pgmap?

If the thing providing the migration stuff goes away then all
device_exclusive pages should revert back to CPU pages and destroy the
pgmap?

Jason


Re: [Nouveau] [PATCH v8 1/8] mm: Remove special swap entry functions

2021-05-18 Thread Peter Xu
On Tue, May 18, 2021 at 09:58:09PM +1000, Alistair Popple wrote:
> On Tuesday, 18 May 2021 12:17:32 PM AEST Peter Xu wrote:
> > On Wed, Apr 07, 2021 at 06:42:31PM +1000, Alistair Popple wrote:
> > > +static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> > > +{
> > > + struct page *p = pfn_to_page(swp_offset(entry));
> > > +
> > > + /*
> > > +  * Any use of migration entries may only occur while the
> > > +  * corresponding page is locked
> > > +  */
> > > + BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> > > +
> > > + return p;
> > > +}
> > 
> > Would swap_pfn_entry_to_page() be slightly better?
> > 
> > The thing is it's very easy to read pfn_*() as a function to take a pfn as
> > parameter...
> > 
> > Since I'm also recently working on some swap-related new ptes [1], I'm
> > thinking whether we could name these swap entries as "swap XXX entries". 
> > Say, "swap hwpoison entry", "swap pfn entry" (which is a superset of "swap
> > migration entry", "swap device exclusive entry", ...).  That's where I came
> > with the above swap_pfn_entry_to_page(), then below will be naturally
> > is_swap_pfn_entry().
> 
> Equally though "hwpoison swap entry", "pfn swap entry", "migration swap 
> entry", etc. also makes sense (at least to me), but does that not fit in as 
> well with your series? I haven't looked too deeply at your series but have 
> been meaning to so thanks for the pointer.

Yeah it looks good too.  It's just to avoid starting with "pfn_" I guess, so at
least we know it's something related to swap, not one specific pfn.

I find the naming important, as e.g. in my series I introduced another idea
called "swap special pte" or "special swap pte" (yeah, so indeed my naming is
not that coherent either... :), and I noticed that without good naming I'm
afraid we can get lost more easily.

I have a description here in the commit message re the new swap special pte:

https://lore.kernel.org/lkml/20210427161317.50682-5-pet...@redhat.com/

In which the uffd-wp marker pte will be the first swap special pte.  Feel free
to ignore the details too, just want to mention some context, while it should
be orthogonal to this series.

I think yet another swap entry would suit my case too, but it'd be a waste as
in that pte I don't need to keep pfn information, only a marker (a single bit
would suffice), so I chose a single pte value (one for each arch; I only
defined it on x86 in my series, as a special pte with only one bit set) so as
not to pollute the swap entry address space.
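
In rough shape that's something like the following (names hypothetical;
see the linked commit for the real definition):

  /* A "swap special pte": not a swap entry at all, just one reserved
   * non-present pte value carrying a single bit of information, namely
   * that the pte was uffd write-protected before being zapped. */
  #define SWP_SPECIAL_UFFD_WP     __pte(_PAGE_SWP_UFFD_WP)

  static inline bool is_swap_special_pte(pte_t pte)
  {
          return pte_val(pte) == pte_val(SWP_SPECIAL_UFFD_WP);
  }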

> 
> > No strong opinion as this is already a v8 series (and sorry to chime in this
> > late), just to raise this up.
> 
> No worries, it's good timing as I was about to send a v9 which was just a
> rebase anyway. I am hoping to try and get this accepted for the next merge
> window but I will wait before sending v9 to see if anyone else has thoughts
> on the naming here.
> 
> I don't have a particularly strong opinion either, and your justification is 
> more thought than I gave to naming these originally so am happy to rename if 
> it's more readable or fits better with your series.

I'll leave the decision to you, especially in case you still prefer the old
naming.  Or feel free to wait until someone else shares their thoughts, so as
to avoid unnecessary rebase work.

Thanks,

-- 
Peter Xu



Re: [Nouveau] [PATCH v6 00/15] Restricted DMA

2021-05-18 Thread Claire Chang
v7: https://lore.kernel.org/patchwork/cover/1431031/

On Mon, May 10, 2021 at 5:50 PM Claire Chang  wrote:
>
> From: Claire Chang 
>
> This series implements mitigations for lack of DMA access control on
> systems without an IOMMU, which could result in the DMA accessing the
> system memory at unexpected times and/or unexpected addresses, possibly
> leading to data leakage or corruption.
>
> For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
> not behind an IOMMU. As PCI-e, by design, gives the device full access to
> system memory, a vulnerability in the Wi-Fi firmware could easily escalate
> to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
> full chain of exploits; [2], [3]).
>
> To mitigate the security concerns, we introduce restricted DMA. Restricted
> DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
> specially allocated region and does memory allocation from the same region.
> The feature on its own provides a basic level of protection against the DMA
> overwriting buffer contents at unexpected times. However, to protect
> against general data leakage and system memory corruption, the system needs
> to provide a way to restrict the DMA to a predefined memory region (this is
> usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).
>
> [1a] 
> https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
> [1b] 
> https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
> [2] https://blade.tencent.com/en/advisories/qualpwn/
> [3] 
> https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
> [4] 
> https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
>
> v6:
> Address the comments in v5
>
> v5:
> Rebase on latest linux-next
> https://lore.kernel.org/patchwork/cover/1416899/
>
> v4:
> - Fix spinlock bad magic
> - Use rmem->name for debugfs entry
> - Address the comments in v3
> https://lore.kernel.org/patchwork/cover/1378113/
>
> v3:
> Using only one reserved memory region for both streaming DMA and memory
> allocation.
> https://lore.kernel.org/patchwork/cover/1360992/
>
> v2:
> Building on top of swiotlb.
> https://lore.kernel.org/patchwork/cover/1280705/
>
> v1:
> Using dma_map_ops.
> https://lore.kernel.org/patchwork/cover/1271660/
>
> Claire Chang (15):
>   swiotlb: Refactor swiotlb init functions
>   swiotlb: Refactor swiotlb_create_debugfs
>   swiotlb: Add DMA_RESTRICTED_POOL
>   swiotlb: Add restricted DMA pool initialization
>   swiotlb: Add a new get_io_tlb_mem getter
>   swiotlb: Update is_swiotlb_buffer to add a struct device argument
>   swiotlb: Update is_swiotlb_active to add a struct device argument
>   swiotlb: Bounce data from/to restricted DMA pool if available
>   swiotlb: Move alloc_size to find_slots
>   swiotlb: Refactor swiotlb_tbl_unmap_single
>   dma-direct: Add a new wrapper __dma_direct_free_pages()
>   swiotlb: Add restricted DMA alloc/free support.
>   dma-direct: Allocate memory from restricted DMA pool if available
>   dt-bindings: of: Add restricted DMA pool
>   of: Add plumbing for restricted DMA pool
>
>  .../reserved-memory/reserved-memory.txt   |  27 ++
>  drivers/gpu/drm/i915/gem/i915_gem_internal.c  |   2 +-
>  drivers/gpu/drm/nouveau/nouveau_ttm.c |   2 +-
>  drivers/iommu/dma-iommu.c |  12 +-
>  drivers/of/address.c  |  25 ++
>  drivers/of/device.c   |   3 +
>  drivers/of/of_private.h   |   5 +
>  drivers/pci/xen-pcifront.c|   2 +-
>  drivers/xen/swiotlb-xen.c |   2 +-
>  include/linux/device.h|   4 +
>  include/linux/swiotlb.h   |  41 ++-
>  kernel/dma/Kconfig|  14 +
>  kernel/dma/direct.c   |  63 +++--
>  kernel/dma/direct.h   |   9 +-
>  kernel/dma/swiotlb.c  | 242 +-
>  15 files changed, 356 insertions(+), 97 deletions(-)
>
> --
> 2.31.1.607.g51e8a6a459-goog
>
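
As a rough picture of the per-device plumbing the series adds (a
paraphrase based on the patch titles above, not the final code):

  /* Each device may carry its own restricted io_tlb_mem; the DMA API
   * helpers consult the device rather than only the global pool. */
  static inline struct io_tlb_mem *get_io_tlb_mem(struct device *dev)
  {
  #ifdef CONFIG_DMA_RESTRICTED_POOL
          if (dev->dma_io_tlb_mem)
                  return dev->dma_io_tlb_mem;
  #endif
          return io_tlb_default_mem;
  }

  static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
  {
          struct io_tlb_mem *mem = get_io_tlb_mem(dev);

          return mem && paddr >= mem->start && paddr < mem->end;
  }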


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Peter Xu
Alistair,

While I got one reply below to your previous email, I also looked at the rest
of the code (mainly the restore and fork sides) today and added a few more
comments.

On Tue, May 18, 2021 at 11:19:05PM +1000, Alistair Popple wrote:

[...]

> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 3a5705cfc891..556ff396f2e9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct
> > > *vma, unsigned long addr,
> > >  }
> > >  #endif
> > > 
> > > +static void restore_exclusive_pte(struct vm_area_struct *vma,
> > > +   struct page *page, unsigned long address,
> > > +   pte_t *ptep)
> > > +{
> > > + pte_t pte;
> > > + swp_entry_t entry;
> > > +
> > > + pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> > > + if (pte_swp_soft_dirty(*ptep))
> > > + pte = pte_mksoft_dirty(pte);
> > > +
> > > + entry = pte_to_swp_entry(*ptep);
> > > + if (pte_swp_uffd_wp(*ptep))
> > > + pte = pte_mkuffd_wp(pte);
> > > + else if (is_writable_device_exclusive_entry(entry))
> > > + pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > > +
> > > + set_pte_at(vma->vm_mm, address, ptep, pte);
> > > +
> > > + /*
> > > +  * No need to take a page reference as one was already
> > > +  * created when the swap entry was made.
> > > +  */
> > > + if (PageAnon(page))
> > > + page_add_anon_rmap(page, vma, address, false);
> > > + else
> > > + page_add_file_rmap(page, false);

This seems to be another leftover; maybe WARN_ON_ONCE(!PageAnon(page))?
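
Something like this, perhaps (untested sketch):

  /* Device exclusive entries are only installed for anonymous pages so
   * far, so the file-backed branch should be dead code. */
  VM_WARN_ON_ONCE(!PageAnon(page));
  page_add_anon_rmap(page, vma, address, false);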

> > > +
> > > + if (vma->vm_flags & VM_LOCKED)
> > > + mlock_vma_page(page);
> > > +
> > > + /*
> > > +  * No need to invalidate - it was non-present before. However
> > > +  * secondary CPUs may have mappings that need invalidating.
> > > +  */
> > > + update_mmu_cache(vma, address, ptep);
> > > +}

[...]

> > >  /*
> > >  
> > >   * copy one vm_area from one task to the other. Assumes the page tables
> > >   * already present in the new task to be cleared in the whole range
> > > 
> > > @@ -781,6 +859,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct
> > > mm_struct *src_mm,
> > >   pte = pte_swp_mkuffd_wp(pte);
> > >   
> > >   set_pte_at(src_mm, addr, src_pte, pte);
> > >   
> > >   }
> > > 
> > > + } else if (is_device_exclusive_entry(entry)) {
> > > + /* COW mappings should be dealt with by removing the entry
> > > */

Here the comment says "removing the entry", however I think it doesn't remove
the pte; instead it keeps it (as there's no "return", so set_pte_at() will be
called below), so I got a bit confused.

> > > + VM_BUG_ON(is_cow_mapping(vm_flags));

Also here, if PageAnon() is the only case to support so far, doesn't that
easily satisfy is_cow_mapping()? Maybe I missed something..

I also have a pure and high-level question regarding a process fork() when
there're device exclusive ptes: would the two processes then own the device
together?  Is this a real use case?

Indeed it'll be odd for a COW page since for COW page then it means after
parent/child writing to the page it'll clone into two, then it's a mystery on
which one will be the one that's "exclusively owned" by the device..

> > > + page = pfn_swap_entry_to_page(entry);
> > > + get_page(page);
> > > + rss[mm_counter(page)]++;
> > > 
> > >   }
> > >   set_pte_at(dst_mm, addr, dst_pte, pte);
> > >   return 0;
> > > 
> > > @@ -947,6 +1031,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct
> > > vm_area_struct *src_vma,
> > >   int rss[NR_MM_COUNTERS];
> > >   swp_entry_t entry = (swp_entry_t){0};
> > >   struct page *prealloc = NULL;
> > > 
> > > + struct page *locked_page = NULL;
> > > 
> > >  again:
> > >   progress = 0;
> > > 
> > > @@ -980,13 +1065,36 @@ copy_pte_range(struct vm_area_struct *dst_vma,
> > > struct vm_area_struct *src_vma,
> > >   continue;
> > >   
> > >   }
> > >   if (unlikely(!pte_present(*src_pte))) {
> > > 
> > > - entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> > > - dst_pte, src_pte,
> > > - src_vma, addr, rss);
> > > - if (entry.val)
> > > - break;
> > > - progress += 8;
> > > - continue;
> > > + swp_entry_t swp_entry = pte_to_swp_entry(*src_pte);

(Just a side note to all of us: this will be one more place that I'll need to
 look after in my uffd-wp series if this series lands first, as after that
 series we can only call 

Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Jason Gunthorpe
On Tue, May 18, 2021 at 01:27:42PM -0400, Peter Xu wrote:

> I also have a pure and high-level question regarding a process fork() when
> there're device exclusive ptes: would the two processes then own the device
> together?  Is this a real use case?

If the pages are MAP_SHARED then yes. All VMAs should point at the
same device_exclusive page and all VMA should migrate back to CPU
pages together.

> Indeed it'll be odd for a COW page since for COW page then it means after
> parent/child writing to the page it'll clone into two, then it's a mystery on
> which one will be the one that's "exclusively owned" by the device..

For COW pages it is like every other fork case.. We can't reliably
write-protect the device_exclusive page during fork so we must copy it
at fork time.

Thus three reasonable choices:
 - Copy to a new CPU page
 - Migrate back to a CPU page and write protect it
 - Copy to a new device exclusive page

Jason


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Alistair Popple
On Tuesday, 18 May 2021 12:08:34 PM AEST Peter Xu wrote:
> > v6:
> > * Fixed a bisectablity issue due to incorrectly applying the rename of
> > 
> >   migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.
> > 
> > v5:
> > * Renamed range->migrate_pgmap_owner to range->owner.
> 
> May be nicer to mention this rename in commit message (or make it a separate
> patch)?

Ok, I think if it needs to be explicitly called out in the commit message it 
should be a separate patch. Originally I thought the change was _just_ small 
enough to include here, but this patch has since grown so I'll split it out.
 


> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index 0e25d829f742..3a1ce4ef9276 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -193,6 +193,10 @@ int page_referenced(struct page *, int is_locked,
> > 
> >  bool try_to_migrate(struct page *page, enum ttu_flags flags);
> >  bool try_to_unmap(struct page *, enum ttu_flags flags);
> > 
> > +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> > +				unsigned long end, struct page **pages,
> > +				void *arg);
> > +
> > 
> >  /* Avoid racy checks */
> >  #define PVMW_SYNC		(1 << 0)
> >  /* Look for migration entries rather than present PTEs */
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 516104b9334b..7a3c260146df 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -63,9 +63,11 @@ static inline int current_is_kswapd(void)
> > 
> >   * to a special SWP_DEVICE_* entry.
> >   */
> 
> Should we add another short description for the newly added two types?
> Otherwise the reader could get confused assuming the above comment is
> explaining all four types, while it is for SWP_DEVICE_{READ|WRITE} only?

Good idea, I can see how that could be confusing. Will add a short 
description.
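
Perhaps something along these lines (wording illustrative only):

  /*
   * SWP_DEVICE_{READ|WRITE} entries represent unaddressable device
   * memory, while SWP_DEVICE_{EXCLUSIVE_READ|EXCLUSIVE_WRITE} mark a
   * page that is temporarily mapped only from a device so it can make
   * atomic accesses; any CPU access faults and restores the original
   * pte.
   */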



> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3a5705cfc891..556ff396f2e9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct
> > *vma, unsigned long addr,
> >  }
> >  #endif
> > 
> > +static void restore_exclusive_pte(struct vm_area_struct *vma,
> > +   struct page *page, unsigned long address,
> > +   pte_t *ptep)
> > +{
> > + pte_t pte;
> > + swp_entry_t entry;
> > +
> > + pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> > + if (pte_swp_soft_dirty(*ptep))
> > + pte = pte_mksoft_dirty(pte);
> > +
> > + entry = pte_to_swp_entry(*ptep);
> > + if (pte_swp_uffd_wp(*ptep))
> > + pte = pte_mkuffd_wp(pte);
> > + else if (is_writable_device_exclusive_entry(entry))
> > + pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > +
> > + set_pte_at(vma->vm_mm, address, ptep, pte);
> > +
> > + /*
> > +  * No need to take a page reference as one was already
> > +  * created when the swap entry was made.
> > +  */
> > + if (PageAnon(page))
> > + page_add_anon_rmap(page, vma, address, false);
> > + else
> > + page_add_file_rmap(page, false);
> > +
> > + if (vma->vm_flags & VM_LOCKED)
> > + mlock_vma_page(page);
> > +
> > + /*
> > +  * No need to invalidate - it was non-present before. However
> > +  * secondary CPUs may have mappings that need invalidating.
> > +  */
> > + update_mmu_cache(vma, address, ptep);
> > +}
> > +
> > +/*
> > + * Tries to restore an exclusive pte if the page lock can be acquired
> > + * without sleeping. Returns 0 on success or -EBUSY if the page could
> > + * not be locked or the entry no longer points at locked_page in which
> > + * case locked_page should be locked before retrying the call.
> > + */
> > +static unsigned long
> > +try_restore_exclusive_pte(struct mm_struct *src_mm, pte_t *src_pte,
> > +   struct vm_area_struct *vma, unsigned long addr,
> > +   struct page **locked_page)
> > +{
> > + swp_entry_t entry = pte_to_swp_entry(*src_pte);
> > + struct page *page = pfn_swap_entry_to_page(entry);
> > +
> > + if (*locked_page) {
> > + /* The entry changed, retry */
> > + if (unlikely(*locked_page != page)) {
> > + unlock_page(*locked_page);
> > + put_page(*locked_page);
> > + *locked_page = page;
> > + return -EBUSY;
> > + }
> > + restore_exclusive_pte(vma, page, addr, src_pte);
> > + unlock_page(page);
> > + put_page(page);
> > + *locked_page = NULL;
> > + return 0;
> > + }
> > +
> > + if (trylock_page(page)) {
> > + restore_exclusive_pte(vma, page, addr, src_pte);
> > + unlock_page(page);
> > + return 0;
> > + 

Re: [Nouveau] [PATCH v8 1/8] mm: Remove special swap entry functions

2021-05-18 Thread Alistair Popple
On Tuesday, 18 May 2021 12:17:32 PM AEST Peter Xu wrote:
> On Wed, Apr 07, 2021 at 06:42:31PM +1000, Alistair Popple wrote:
> > +static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> > +{
> > + struct page *p = pfn_to_page(swp_offset(entry));
> > +
> > + /*
> > +  * Any use of migration entries may only occur while the
> > +  * corresponding page is locked
> > +  */
> > + BUG_ON(is_migration_entry(entry) && !PageLocked(p));
> > +
> > + return p;
> > +}
> 
> Would swap_pfn_entry_to_page() be slightly better?
> 
> The thing is it's very easy to read pfn_*() as a function to take a pfn as
> parameter...
> 
> Since I'm also recently working on some swap-related new ptes [1], I'm
> thinking whether we could name these swap entries as "swap XXX entries". 
> Say, "swap hwpoison entry", "swap pfn entry" (which is a superset of "swap
> migration entry", "swap device exclusive entry", ...).  That's where I came
> with the above swap_pfn_entry_to_page(), then below will be naturally
> is_swap_pfn_entry().

Equally though "hwpoison swap entry", "pfn swap entry", "migration swap 
entry", etc. also makes sense (at least to me), but does that not fit in as 
well with your series? I haven't looked too deeply at your series but have 
been meaning to so thanks for the pointer.

> No strong opinion as this is already a v8 series (and sorry to chime in this
> late), just to raise this up.

No worries, it's good timing as I was about to send a v9 which was just a 
rebase anyway. I am hoping to try and get this accepted for the next merge 
window but I will wait before sending v9 to see if anyone else has thoughts on 
the naming here.

I don't have a particularly strong opinion either, and your justification is 
more thought than I gave to naming these originally so am happy to rename if 
it's more readable or fits better with your series.

Thanks.

 - Alistair

> [1] https://lore.kernel.org/lkml/20210427161317.50682-1-pet...@redhat.com/
> 
> Thanks,
> 
> > +
> > +/*
> > + * A pfn swap entry is a special type of swap entry that always has a pfn
> > + * stored in the swap offset. They are used to represent unaddressable
> > + * device memory and to restrict access to a page undergoing migration.
> > + */
> > +static inline bool is_pfn_swap_entry(swp_entry_t entry)
> > +{
> > + return is_migration_entry(entry) || is_device_private_entry(entry);
> > +}
> 
> --
> Peter Xu






Re: [Nouveau] [PATCH 0/7] Per client engine busyness

2021-05-18 Thread Tvrtko Ursulin



On 18/05/2021 10:16, Daniel Stone wrote:
> Hi,
>
> On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin wrote:
>> I was just wondering if stat(2) and a chrdev major check would be a
>> solid criterion to more efficiently (compared to parsing the text
>> content) detect drm files while walking procfs.
>
> Maybe I'm missing something, but is the per-PID walk actually a
> measurable performance issue rather than just a bit unpleasant?

Per pid and per each open fd.

As said in the other thread, what bothers me a bit in this scheme is that the
cost of obtaining GPU usage scales based on non-GPU criteria.

For the use case of a top-like tool which shows all processes this is a
smaller additional cost, but for a gpu-top-like tool it is somewhat higher.


Regards,

Tvrtko


Re: [Nouveau] [PATCH 0/7] Per client engine busyness

2021-05-18 Thread Daniel Stone
Hi,

On Tue, 18 May 2021 at 10:09, Tvrtko Ursulin wrote:
> I was just wondering if stat(2) and a chrdev major check would be a
> solid criterion to more efficiently (compared to parsing the text
> content) detect drm files while walking procfs.

Maybe I'm missing something, but is the per-PID walk actually a
measurable performance issue rather than just a bit unpleasant?

Cheers,
Daniel


Re: [Nouveau] [PATCH 0/7] Per client engine busyness

2021-05-18 Thread Tvrtko Ursulin



On 17/05/2021 20:03, Simon Ser wrote:
> On Monday, May 17th, 2021 at 8:16 PM, Nieto, David M wrote:
>> Btw is DRM_MAJOR 226 considered uAPI? I don't see it in uapi headers.
>
> It's not in the headers, but it's de facto uAPI, as seen in libdrm:
>
>     > git grep 226
>     xf86drm.c
>     99:#define DRM_MAJOR 226 /* Linux */


I suspected it would be yes, thanks.

I was just wondering if stat(2) and a chrdev major check would be a
solid criterion to more efficiently (compared to parsing the text
content) detect drm files while walking procfs.
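
A minimal sketch of that check, assuming the DRM_MAJOR value quoted
from libdrm above:

  #include <stdio.h>
  #include <stdbool.h>
  #include <sys/stat.h>
  #include <sys/sysmacros.h>
  #include <sys/types.h>

  #define DRM_MAJOR 226   /* from libdrm's xf86drm.c, as quoted */

  /* True if /proc/<pid>/fd/<fd> refers to a DRM character device;
   * stat(2) follows the proc symlink to the underlying chrdev. */
  static bool fd_is_drm(pid_t pid, int fd)
  {
          char path[64];
          struct stat st;

          snprintf(path, sizeof(path), "/proc/%d/fd/%d", pid, fd);
          if (stat(path, &st) != 0)
                  return false;

          return S_ISCHR(st.st_mode) && major(st.st_rdev) == DRM_MAJOR;
  }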


Regards,

Tvrtko