Re: [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes

2020-09-26 Thread Linus Torvalds
On Sat, Sep 26, 2020 at 4:23 PM Jason Gunthorpe  wrote:
>
> Linus's version doesn't do pte_sw_mkyoung(), but looks OK to have it

I don't think it matters. But I don't think it should make it young,
since there's no access, though it's not like it's a big deal.

> > + pte = maybe_mkwrite(pte_mkdirty(pte), new);
>
> maybe_mkwrite() was not in Linus's version but it is wp_page_copy().

Actually, it is in my version too, just in a different form.

I did it using

if (vma->vm_flags & VM_WRITE)
*src_pte = pte_mkwrite(*src_pte);

instead, ie avoiding the write to src_pte if it wasn't a writable vma
(and I had checked that it was dirty and not done this at all if not,
so no need for the mkdirty).

> It seemed like mk_pte() should set the proper write
> bit already from the vm_page_prot?

No, vm_page_prot won't have the writable bit for a COW mapping.

The write bit gets set when the write happens (which may be on the
first access, of course), by the code that makes sure it's a private
copy.

> Perhaps this is harmless but redundant?

No, the pte_mkwrite() is required in some form, whether it's that
"maybe_mkwrite()" that looks at the vm flags, or in that explicit
form.

> > + page_add_new_anon_rmap(new_page, new, addr, false);
> > + rss[mm_counter(new_page)]++;
> > + set_pte_at(dst_mm, addr, dst_pte, pte);
>
> Linus's patch had a lru_cache_add_inactive_or_unevictable() here, like
> wp_page_copy()

Yeah, I do think that is needed so that we have the new page on the
LRU and it gets properly evicted under memory pressure.

   Linus


Re: [PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes

2020-09-26 Thread Jason Gunthorpe
On Fri, Sep 25, 2020 at 06:25:59PM -0400, Peter Xu wrote:
> -static inline void
> +/*
> + * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
> + * is required to copy this pte.
> + */
> +static inline int
>  copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
> - unsigned long addr, int *rss)
> + struct vm_area_struct *new,
> + unsigned long addr, int *rss, struct page **prealloc)
>  {
>   unsigned long vm_flags = vma->vm_flags;
>   pte_t pte = *src_pte;
>   struct page *page;
>  
> + page = vm_normal_page(vma, addr, pte);
> + if (page) {
> + if (is_cow_mapping(vm_flags)) {
> + bool is_write = pte_write(pte);

Very minor, but for readability I would have liked this chunk in a
'copy_normal_page' function, with the src/dst naming

> +
> + /*
> +  * We have a prealloc page, all good!  Take it
> +  * over and copy the page & arm it.
> +  */
> + *prealloc = NULL;
> + copy_user_highpage(new_page, page, addr, vma);
> + __SetPageUptodate(new_page);
> + pte = mk_pte(new_page, new->vm_page_prot);
> + pte = pte_sw_mkyoung(pte);

Linus's version doesn't do pte_sw_mkyoung(), but looks OK to have it

> + pte = maybe_mkwrite(pte_mkdirty(pte), new);

maybe_mkwrite() was not in Linus's version, but is in
wp_page_copy(). It seemed like mk_pte() should set the proper write
bit already from the vm_page_prot? Perhaps this is harmless but
redundant?

> + page_add_new_anon_rmap(new_page, new, addr, false);
> + rss[mm_counter(new_page)]++;
> + set_pte_at(dst_mm, addr, dst_pte, pte);

Linus's patch had a lru_cache_add_inactive_or_unevictable() here, like
wp_page_copy()

Didn't think of anything profound to say, looks good thanks!

I'll forward this for testing as well. There are some holidays next
week, so I may have been optimistic to expect results by Monday.

Jason


[PATCH v2 3/4] mm: Do early cow for pinned pages during fork() for ptes

2020-09-25 Thread Peter Xu
This allows copy_pte_range() to do early cow if the pages were pinned on the
source mm.  Currently we don't have an accurate way to know whether a page is
pinned or not; the only thing we have is page_maybe_dma_pinned().  However,
that's good enough for now, especially with the newly added mm->has_pinned
flag to make sure we won't affect processes that never pinned any pages.

It would be easier if we could do GFP_KERNEL allocation within copy_one_pte().
Unluckily, we can't, because we hold the page table locks for both the parent
and the child processes.  So the page allocation needs to be done outside
copy_one_pte().

There is some trickery in copy_present_pte(), mainly the wrprotect trick to
block concurrent fast-gup.  The comments in the function explain it in place.

Oleg Nesterov reported a (probably harmless) bug during review: we didn't
reset entry.val properly in copy_pte_range(), so there was a chance of calling
add_swap_count_continuation() multiple times on the same swp entry.  That
should be harmless, though, since even if it happens, the repeated call will
return directly after noticing that there is already enough space for the swp
counter.  So instead of a standalone stable patch, it is fixed up in this
patch directly.

Reference discussions:

  https://lore.kernel.org/lkml/20200914143829.ga1424...@nvidia.com/

Suggested-by: Linus Torvalds 
Signed-off-by: Peter Xu 
---
 mm/memory.c | 172 +++-
 1 file changed, 156 insertions(+), 16 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4c56d7b92b0e..92ad08616e60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -773,15 +773,109 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
return 0;
 }
 
-static inline void
+/*
+ * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
+ * is required to copy this pte.
+ */
+static inline int
 copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-   unsigned long addr, int *rss)
+   struct vm_area_struct *new,
+   unsigned long addr, int *rss, struct page **prealloc)
 {
unsigned long vm_flags = vma->vm_flags;
pte_t pte = *src_pte;
struct page *page;
 
+   page = vm_normal_page(vma, addr, pte);
+   if (page) {
+   if (is_cow_mapping(vm_flags)) {
+   bool is_write = pte_write(pte);
+
+   /*
+* The trick starts.
+*
+* What we want to do is to check whether this page may
+* have been pinned by the parent process.  If so,
+* instead of wrprotect the pte on both sides, we copy
+* the page immediately so that we'll always guarantee
+* the pinned page won't be randomly replaced in the
+* future.
+*
+* To achieve this, we do the following:
+*
+* 1. Write-protect the pte if it's writable.  This is
+*to protect concurrent write fast-gup with
+*FOLL_PIN, so that we'll fail the fast-gup with
+*the write bit removed.
+*
+* 2. Check page_maybe_dma_pinned() to see whether this
+*page may have been pinned.
+*
+* The order of these steps is important to serialize
+* against the fast-gup code (gup_pte_range()) on the
+* pte check and try_grab_compound_head(), so that
+* we'll make sure either we'll capture that fast-gup
+* so we'll copy the pinned page here, or we'll fail
+* that fast-gup.
+*/
+   if (is_write) {
+   ptep_set_wrprotect(src_mm, addr, src_pte);
+   /*
+* This is not needed for serializing fast-gup,
+* however always make it consistent with
+* src_pte, since we'll need it when current
+* page is not pinned.
+*/
+   pte = pte_wrprotect(pte);
+   }
+
+   if (atomic_read(&src_mm->has_pinned) &&
+   page_maybe_dma_pinned(page)) {
+   struct page *new_page = *prealloc;
+
+   /*
+* This is possibly pinned page, need to copy.
+