This allows copy_pte_range() to do early COW if the pages were pinned on
the source mm.  Currently we don't have an accurate way to know whether
a page is pinned or not; the only thing we have is
page_maybe_dma_pinned().  That's good enough for now though, especially
with the newly added mm->has_pinned flag, which makes sure we won't
affect processes that have never pinned any pages.
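For reference, the "maybe pinned" test that the rest of this patch
relies on boils down to something like the sketch below (the helper
name is made up for illustration; the patch open-codes the check):

	/*
	 * Conservative test: may report false positives (e.g. a page
	 * whose refcount is elevated for other reasons), but a page
	 * pinned with FOLL_PIN will always be caught.
	 */
	static inline bool page_needs_early_cow(struct mm_struct *src_mm,
						struct page *page)
	{
		/* Fast path: this mm has never pinned anything. */
		if (!atomic_read(&src_mm->has_pinned))
			return false;
		/* Heuristic: FOLL_PIN elevates the refcount by a large bias. */
		return page_maybe_dma_pinned(page);
	}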
It would be easier if we could do the GFP_KERNEL allocation within
copy_one_pte() itself.  Unfortunately we can't, because we hold the
page table locks for both the parent and the child processes there, so
the page allocation needs to be done outside copy_one_pte().
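Instead, the pattern is: try the copy under the locks, bail out with a
special return value when a new page is needed, allocate it with the
locks dropped, then retry.  A rough sketch of that control flow
(simplified from the real loop in the diff below):

	/* In copy_pte_range(), roughly: */
	struct page *prealloc = NULL;
	int ret;
again:
	/* ... take the src/dst pte locks and walk the ptes ... */
	ret = copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
			       vma, new, addr, rss, &prealloc);
	/* ... drop the pte locks ... */
	if (ret == -EAGAIN) {
		/* Safe to sleep here, so a GFP_KERNEL-style allocation works. */
		prealloc = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
		if (!prealloc)
			return -ENOMEM;
		goto again;
	}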
There is some trickery in copy_present_pte(), mainly the wrprotect
trick used to block concurrent fast-gup.  The comments in the function
explain it better in place.
Oleg Nesterov reported a (probably harmless) bug during review: we
didn't reset entry.val properly in copy_pte_range(), so there was a
chance of calling add_swap_count_continuation() multiple times on the
same swp entry.  That should be harmless though, because even if it
happens the same function (add_swap_count_continuation()) will return
directly after noticing that there is already enough space for the swp
counter.  So instead of a standalone stable patch, it is fixed up in
this patch directly.
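In other words, the touch-up amounts to clearing the local swp entry
once its continuation has been allocated, so that a later retry of the
copy loop can never redo it.  A rough sketch of the pattern (simplified,
not the exact hunk):

	swp_entry_t entry = (swp_entry_t){0};

	/* ... pte copy loop; copy_nonpresent_pte() may fill "entry" ... */

	if (entry.val) {
		if (add_swap_count_continuation(entry, GFP_KERNEL) < 0)
			return -ENOMEM;
		/* Reset, so a retry can't call the continuation twice. */
		entry.val = 0;
	}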
Reference discussions:
https://lore.kernel.org/lkml/20200914143829.ga1424...@nvidia.com/
Suggested-by: Linus Torvalds
Signed-off-by: Peter Xu
---
mm/memory.c | 172 +++-
1 file changed, 156 insertions(+), 16 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 4c56d7b92b0e..92ad08616e60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -773,15 +773,109 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	return 0;
 }
-static inline void
+/*
+ * Copy one pte. Returns 0 if succeeded, or -EAGAIN if one preallocated page
+ * is required to copy this pte.
+ */
+static inline int
 copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		 pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		 unsigned long addr, int *rss)
+		 struct vm_area_struct *new,
+		 unsigned long addr, int *rss, struct page **prealloc)
 {
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
 
+	page = vm_normal_page(vma, addr, pte);
+	if (page) {
+		if (is_cow_mapping(vm_flags)) {
+			bool is_write = pte_write(pte);
+
+			/*
+			 * The trick starts.
+			 *
+			 * What we want to do is to check whether this page may
+			 * have been pinned by the parent process.  If so,
+			 * instead of wrprotect the pte on both sides, we copy
+			 * the page immediately so that we'll always guarantee
+			 * the pinned page won't be randomly replaced in the
+			 * future.
+			 *
+			 * To achieve this, we do the following:
+			 *
+			 * 1. Write-protect the pte if it's writable.  This is
+			 *    to protect concurrent write fast-gup with
+			 *    FOLL_PIN, so that we'll fail the fast-gup with
+			 *    the write bit removed.
+			 *
+			 * 2. Check page_maybe_dma_pinned() to see whether this
+			 *    page may have been pinned.
+			 *
+			 * The order of these steps is important to serialize
+			 * against the fast-gup code (gup_pte_range()) on the
+			 * pte check and try_grab_compound_head(), so that
+			 * we'll make sure either we'll capture that fast-gup
+			 * so we'll copy the pinned page here, or we'll fail
+			 * that fast-gup.
+			 */
+			if (is_write) {
+				ptep_set_wrprotect(src_mm, addr, src_pte);
+				/*
+				 * This is not needed for serializing fast-gup,
+				 * however always make it consistent with
+				 * src_pte, since we'll need it when current
+				 * page is not pinned.
+				 */
+				pte = pte_wrprotect(pte);
+			}
+
+			if (atomic_read(&src_mm->has_pinned) &&
+			    page_maybe_dma_pinned(page)) {
+				struct page *new_page = *prealloc;
+
+				/*
+				 * This is possibly pinned page, need to copy.
+