Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

2023-08-22 Thread Hugh Dickins
On Tue, 22 Aug 2023, Matthew Wilcox wrote:
> On Tue, Aug 22, 2023 at 11:34:19AM -0700, Hugh Dickins wrote:
> > (Yes, the locking is a bit confusing: but mainly for the unrelated reason,
> > that with the split locking configs, we never quite know whether this lock
> > is the same as that lock or not, and so have to be rather careful.)
> 
> Is it time to remove the PTE split locking config option?  I believe all
> supported architectures have at least two levels of page tables, so if we
> have split ptlocks, ptl and pml are always different from each other (it's
> just that on two level machines, pmd == pud == p4d == pgd).  With huge
> thread counts now being the norm, it's hard to see why anybody would want
> to support SMP and !SPLIT_PTE_PTLOCKS.  To quote the documentation ...
> 
>   Split page table lock for PTE tables is enabled compile-time if
>   CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less or equal to NR_CPUS.
>   If split lock is disabled, all tables are guarded by mm->page_table_lock.
> 
> You can barely buy a wrist-watch without eight CPUs these days.

Whilst I'm still happy with my 0-CPU wrist-watch, I do think you're right:
that SPLIT_PTLOCK_CPUS business was really just a safety-valve for when
introducing split ptlock in the first place, 4 pulled out of a hat, and
the unsplit ptlock path quite under-tested.
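
For reference, the compile-time plumbing is roughly this (quoting from
memory of include/linux/mm_types_task.h around this time, so treat the
exact location and spelling as an assumption):

	/* split PTE ptlocks kick in once NR_CPUS reaches the threshold */
	#define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
	/* split PMD ptlocks additionally require the arch to opt in */
	#define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
					 IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))

so removing the CPU threshold would presumably mean enabling split PTE
ptlocks on any SMP build.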

But I'll leave it to someone else to do the job of removing it whenever.

Hugh


Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

2023-08-22 Thread Hugh Dickins
On Tue, 22 Aug 2023, Jann Horn wrote:
> On Tue, Aug 22, 2023 at 4:51 AM Hugh Dickins  wrote:
> > On Mon, 21 Aug 2023, Jann Horn wrote:
> > > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins  wrote:
> > > > Just for this case, take the pmd_lock() two steps earlier: not because
> > > > it gives any protection against this case itself, but because ptlock
> > > > nests inside it, and it's the dropping of ptlock which let the bug in.
> > > > In other cases, continue to minimize the pmd_lock() hold time.
> > >
> > > Special-casing userfaultfd like this makes me a bit uncomfortable; but
> > > I also can't find anything other than userfaultfd that would insert
> > > pages into regions that are khugepaged-compatible, so I guess this
> > > works?
> >
> > I'm as sure as I can be that it's solely because userfaultfd breaks
> > the usual rules here (and in fairness, IIRC Andrea did ask my permission
> > before making it behave that way on shmem, COWing without a source page).
> >
> > Perhaps something else will want that same behaviour in future (it's
> > tempting, but difficult to guarantee correctness); for now, it is just
> > userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> > expecting uffd to add more such exceptional modes in future).
> 
> Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
> wanted to make it possible to reliably install PTE markers with
> madvise() or something like that, which might be nice for allowing
> userspace to create guard pages without unnecessary extra VMAs...)

I see the mailthread has taken inspiration from your comment there,
and veered off in that direction: but I'll ignore those futures.

> 
> > > I guess an alternative would be to use a spin_trylock() instead of the
> > > current pmd_lock(), and if that fails, temporarily drop the page table
> > > lock and then restart from step 2 with both locks held - and at that
> > > point the page table scan should be fast since we expect it to usually
> > > be empty.
> >
> > That's certainly a good idea, if collapse on userfaultfd_armed private
> > is anything of a common case (I doubt, but I don't know).  It may be a
> > better idea anyway (saving a drop and retake of ptlock).
> 
> I was thinking it also has the advantage that it would still perform
> okay if we got rid of the userfaultfd_armed() condition at some point
> - though I realize that designing too much for hypothetical future
> features is an antipattern.
> 
> > I gave it a try, expecting to end up with something that would lead
> > me to say "I tried it, but it didn't work out well"; but actually it
> > looks okay to me.  I wouldn't say I prefer it, but it seems reasonable,
> > and no more complicated (as Peter rightly observes) than the original.
> >
> > It's up to you and Peter, and whoever has strong feelings about it,
> > to choose between them: I don't mind (but I shall be sad if someone
> > demands that I indent that comment deeper - I'm not a fan of long
> > multi-line comments near column 80).
> 
> I prefer this version because it would make it easier to remove the
> "userfaultfd_armed()" check in the future if we have to, but I guess
> we could also always change it later if that becomes necessary, so I
> don't really have strong feelings on it at this point.

Thanks for considering them both, Jann.  I do think your trylock way,
as in v2, is in principle superior, and we may well have good reason
to switch over to it in future; but I find it slightly more confusing,
so will follow your and Peter's "no strong feelings" for now, and ask
Andrew please to take the original (implicit v1).

Overriding reason: I realized overnight that v2 is not quite correct:
I was clever enough to realize that nr_ptes needed to be reset to 0
to get the accounting right with a recheck pass, but not clever enough
to realize that resetting it to 0 there would likely skip the abort
path's flush_tlb_mm(mm), when we actually had cleared entries on the
first pass.  It needs a separate bool to decide the flush_tlb_mm(mm),
or it needs that (ridiculously minor!) step 3 to be moved down.
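
To illustrate (just a sketch of the "separate bool" idea, not a patch
I'm proposing): the abort path would key the TLB flush off whether any
PTEs had actually been cleared, rather than off nr_ptes:

	bool cleared = false;	/* hypothetical: did step 2 clear anything? */
	...
	/* step 2: clear page table and adjust rmap */
		ptep_clear(mm, addr, pte);
		page_remove_rmap(page, vma, false);
		cleared = true;
		nr_ptes++;
	...
abort:
	if (cleared)	/* not nr_ptes, which a recheck pass resets to 0 */
		flush_tlb_mm(mm);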

But rather than reworking it, please let's just go with v1 for now.

Thanks,
Hugh

Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

2023-08-21 Thread Hugh Dickins
On Mon, 21 Aug 2023, Jann Horn wrote:
> On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins  wrote:
> > Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> > shmem mapping can add valid PTEs to a page table collapse_pte_mapped_thp()
> > thought it had emptied: page lock on the huge page is enough to protect
> > against WP faults (which find the PTE has been cleared), but not enough
> > to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
> >
> > retract_page_tables() protects against this by checking !vma->anon_vma;
> > but we know that MADV_COLLAPSE needs to be able to work on private shmem
> > mappings, even those with an anon_vma prepared for another part of the
> > mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> > mappings which are userfaultfd_armed().  Whether it needs to work on
> > private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> > but assume that it does.
> 
> I think we couldn't rely on anon_vma here anyway, since holding the
> mmap_lock in read mode doesn't prevent concurrent creation of an
> anon_vma?

We would have had to do the same as in retract_page_tables() (which
doesn't even have mmap_lock for read): recheck !vma->anon_vma after
finally acquiring ptlock.  But the !anon_vma limitation is certainly
not acceptable here anyway.

> 
> > Just for this case, take the pmd_lock() two steps earlier: not because
> > it gives any protection against this case itself, but because ptlock
> > nests inside it, and it's the dropping of ptlock which let the bug in.
> > In other cases, continue to minimize the pmd_lock() hold time.
> 
> Special-casing userfaultfd like this makes me a bit uncomfortable; but
> I also can't find anything other than userfaultfd that would insert
> pages into regions that are khugepaged-compatible, so I guess this
> works?

I'm as sure as I can be that it's solely because userfaultfd breaks
the usual rules here (and in fairness, IIRC Andrea did ask my permission
before making it behave that way on shmem, COWing without a source page).

Perhaps something else will want that same behaviour in future (it's
tempting, but difficult to guarantee correctness); for now, it is just
userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
expecting uffd to add more such exceptional modes in future).

> 
> I guess an alternative would be to use a spin_trylock() instead of the
> current pmd_lock(), and if that fails, temporarily drop the page table
> lock and then restart from step 2 with both locks held - and at that
> point the page table scan should be fast since we expect it to usually
> be empty.

That's certainly a good idea, if collapse on userfaultfd_armed private
is anything of a common case (I doubt, but I don't know).  It may be a
better idea anyway (saving a drop and retake of ptlock).

I gave it a try, expecting to end up with something that would lead
me to say "I tried it, but it didn't work out well"; but actually it
looks okay to me.  I wouldn't say I prefer it, but it seems reasonable,
and no more complicated (as Peter rightly observes) than the original.

It's up to you and Peter, and whoever has strong feelings about it,
to choose between them: I don't mind (but I shall be sad if someone
demands that I indent that comment deeper - I'm not a fan of long
multi-line comments near column 80).


[PATCH mm-unstable v2] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
shmem mapping can add valid PTEs to a page table collapse_pte_mapped_thp()
thought it had emptied: page lock on the huge page is enough to protect
against WP faults (which find the PTE has been cleared), but not enough
to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.

retract_page_tables() protects against this by checking !vma->anon_vma;
but we know that MADV_COLLAPSE needs to be able to work on private shmem
mappings, even those with an anon_vma prepared for another part of the
mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
mappings which are userfaultfd_armed().  Whether it needs to work on
private shmem mappings which are userfaultfd_armed(), I'm not so sure:
but assume that it does.

Now trylock pmd lock without dropping ptlock (suggested by jannh): if
that fails, drop and retake ptlock around taking pmd lock, and just in
the uffd private case, go back to recheck and empty the page table.

Reported-by: Jann Horn 
Closes: 
https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5fw-kyd5rg5xo_rdbm0ltt...@mail.gmail.com/
Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with 
mmap_read_lock()")
Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 39 +++
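
A rough sketch of the trylock flow described above (illustrative only -
the details below are assumed, not copied verbatim from the patch):

	/* step 2 runs with ptl held; now also take the pmd lock */
	pml = pmd_lockptr(mm, pmd);
	if (pml != ptl && !spin_trylock(pml)) {
		/* contended: drop ptl, retake both in pml -> ptl order */
		spin_unlock(ptl);
		spin_lock(pml);
		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
		/*
		 * ptl was dropped, so uffd may have refilled the table:
		 * redo the "step 2" scan before clearing anything more.
		 */
	}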

[PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

2023-08-21 Thread Hugh Dickins
Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
shmem mapping can add valid PTEs to a page table collapse_pte_mapped_thp()
thought it had emptied: page lock on the huge page is enough to protect
against WP faults (which find the PTE has been cleared), but not enough
to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.

retract_page_tables() protects against this by checking !vma->anon_vma;
but we know that MADV_COLLAPSE needs to be able to work on private shmem
mappings, even those with an anon_vma prepared for another part of the
mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
mappings which are userfaultfd_armed().  Whether it needs to work on
private shmem mappings which are userfaultfd_armed(), I'm not so sure:
but assume that it does.

Just for this case, take the pmd_lock() two steps earlier: not because
it gives any protection against this case itself, but because ptlock
nests inside it, and it's the dropping of ptlock which let the bug in.
In other cases, continue to minimize the pmd_lock() hold time.

Reported-by: Jann Horn 
Closes: 
https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5fw-kyd5rg5xo_rdbm0ltt...@mail.gmail.com/
Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with 
mmap_read_lock()")
Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 38 +-
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40d43eccdee8..d5650541083a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1476,7 +1476,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
struct page *hpage;
pte_t *start_pte, *pte;
pmd_t *pmd, pgt_pmd;
-   spinlock_t *pml, *ptl;
+   spinlock_t *pml = NULL, *ptl;
int nr_ptes = 0, result = SCAN_FAIL;
int i;
 
@@ -1572,9 +1572,25 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
haddr, haddr + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
notified = true;
-   start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+
+   /*
+* pmd_lock covers a wider range than ptl, and (if split from mm's
+* page_table_lock) ptl nests inside pml. The less time we hold pml,
+* the better; but userfaultfd's mfill_atomic_pte() on a private VMA
+* inserts a valid as-if-COWed PTE without even looking up page cache.
+* So page lock of hpage does not protect from it, so we must not drop
+* ptl before pgt_pmd is removed, so uffd private needs pml taken now.
+*/
+   if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
+   pml = pmd_lock(mm, pmd);
+
+   start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
if (!start_pte) /* mmap_lock + page lock should prevent this */
goto abort;
+   if (!pml)
+   spin_lock(ptl);
+   else if (ptl != pml)
+   spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
/* step 2: clear page table and adjust rmap */
for (i = 0, addr = haddr, pte = start_pte;
@@ -1608,7 +1624,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
nr_ptes++;
}
 
-   pte_unmap_unlock(start_pte, ptl);
+   pte_unmap(start_pte);
+   if (!pml)
+   spin_unlock(ptl);
 
/* step 3: set proper refcount and mm_counters. */
if (nr_ptes) {
@@ -1616,12 +1634,12 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
}
 
-   /* step 4: remove page table */
-
-   /* Huge page lock is still held, so page table must remain empty */
-   pml = pmd_lock(mm, pmd);
-   if (ptl != pml)
-   spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+   /* step 4: remove empty page table */
+   if (!pml) {
+   pml = pmd_lock(mm, pmd);
+   if (ptl != pml)
+   spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+   }
pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
pmdp_get_lockless_sync();
if (ptl != pml)
@@ -1648,6 +1666,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
}
if (start_pte)
pte_unmap_unlock(start_pte, ptl);
+   if (pml && pml != ptl)
+   spin_unlock(pml);
if (notified)
mmu_notifier_invalidate_range_end();
 drop_hpage:
-- 
2.35.3



Re: [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

2023-08-21 Thread Hugh Dickins
On Mon, 14 Aug 2023, Jann Horn wrote:
> On Wed, Jul 12, 2023 at 6:42 AM Hugh Dickins  wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> 
> We can still have a racing userfaultfd operation at the "/* step 4:
> remove page table */" point that installs a new PTE before the page
> table is removed.

And you've been very polite not to remind me that this is exactly
what you warned me about, in connection with retract_page_tables(),
nearly three months ago:

https://lore.kernel.org/linux-mm/cag48ez0af1rf1apsjn9ycnfyfq4yqsd4gqb6f2wfhf7jmdi...@mail.gmail.com/

> 
> To reproduce, patch a delay into the kernel like this:
> 
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9a6e0d507759..27cc8dfbf3a7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -1617,6 +1618,11 @@ int collapse_pte_mapped_thp(struct mm_struct
> *mm, unsigned long addr,
> }
> 
> /* step 4: remove page table */
> +   if (strcmp(current->comm, "DELAYME") == 0) {
> +   pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
> +   mdelay(5000);
> +   pr_warn("%s: END DELAY INJECTION\n", __func__);
> +   }
> 
> /* Huge page lock is still held, so page table must remain empty */
> pml = pmd_lock(mm, pmd);
> 
> 
> And then run the attached reproducer against mm/mm-everything. You
> should get this in dmesg:
> 
> [  206.578096] BUG: Bad rss-counter state mm:0942ebea
> type:MM_ANONPAGES val:1

Very helpful, thank you Jann.

I got a bit distracted when I then found mm's recent addition of
UFFDIO_POISON: thought I needed to change both collapse_pte_mapped_thp()
and retract_page_tables() now to cope with mfill_atomic_pte_poison()
inserting into even a userfaultfd_armed shared VMA.

But eventually, on second thoughts, realized that's only inserting a pte
marker, invalid, so won't cause any actual trouble.  A little untidy,
to leave that behind in a supposedly empty page table about to be freed,
but not worth refactoring these functions to avoid a non-bug.
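
(For concreteness: what mfill_atomic_pte_poison() leaves behind is just a
marker entry, roughly - a sketch, assuming the mm-unstable UFFDIO_POISON
naming:

	pte_t entry = make_pte_marker(PTE_MARKER_POISONED);
	/* !pte_none(entry), yet !pte_present(entry) either:
	   no page, no rss accounting, no rmap - only a marker */

so a page table "emptied" by collapse may still hold such an entry.)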

And though syzbot and JH may find some fun with it, I don't think any
real application would be inserting a PTE_MARKER_POISONED where a huge
page collapse is almost complete.

So I scaled back to a more proportionate fix, following.  Sorry, I've
slightly messed up applying the "DELAY INJECTION" patch above: not
intentional, honest!  (mdelay while holding the locks is still good.)

Hugh

Re: [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

2023-08-15 Thread Hugh Dickins
On Tue, 15 Aug 2023, David Hildenbrand wrote:
> On 15.08.23 08:34, Hugh Dickins wrote:
> > On Mon, 14 Aug 2023, Jann Horn wrote:
> >>
> >>  /* step 4: remove page table */
> >> +   if (strcmp(current->comm, "DELAYME") == 0) {
> >> +   pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
> >> +   mdelay(5000);
> >> +   pr_warn("%s: END DELAY INJECTION\n", __func__);
> >> +   }
> >>
> >>  /* Huge page lock is still held, so page table must remain empty
> >>  */
> >>  pml = pmd_lock(mm, pmd);
> >>
> >>
> >> And then run the attached reproducer against mm/mm-everything. You
> >> should get this in dmesg:
> >>
> >> [  206.578096] BUG: Bad rss-counter state mm:0942ebea
> >> type:MM_ANONPAGES val:1
> > 
> > Thanks a lot, Jann. I haven't thought about it at all yet; and just
> > tried to reproduce, but haven't yet got the "BUG: Bad rss-counter":
> > just see "Invalid argument" on the UFFDIO_COPY ioctl.
> > Will investigate tomorrow.
> 
> Maybe you're missing a fixup:
> 
> https://lkml.kernel.org/r/20230810192128.1855570-1-axelrasmus...@google.com
> 
> When the src address is not page aligned, UFFDIO_COPY in mm-unstable would
> erroneously fail.

You got it, many thanks David: I had assumed that my next-20230808 tree
would be up-to-date enough, but it wasn't.  Reproduced now.

Hugh 


Re: [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

2023-08-15 Thread Hugh Dickins
On Mon, 14 Aug 2023, Jann Horn wrote:
> On Wed, Jul 12, 2023 at 6:42 AM Hugh Dickins  wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> 
> We can still have a racing userfaultfd operation at the "/* step 4:
> remove page table */" point that installs a new PTE before the page
> table is removed.
> 
> To reproduce, patch a delay into the kernel like this:
> 
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9a6e0d507759..27cc8dfbf3a7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -1617,6 +1618,11 @@ int collapse_pte_mapped_thp(struct mm_struct
> *mm, unsigned long addr,
> }
> 
> /* step 4: remove page table */
> +   if (strcmp(current->comm, "DELAYME") == 0) {
> +   pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
> +   mdelay(5000);
> +   pr_warn("%s: END DELAY INJECTION\n", __func__);
> +   }
> 
> /* Huge page lock is still held, so page table must remain empty */
> pml = pmd_lock(mm, pmd);
> 
> 
> And then run the attached reproducer against mm/mm-everything. You
> should get this in dmesg:
> 
> [  206.578096] BUG: Bad rss-counter state mm:0942ebea
> type:MM_ANONPAGES val:1

Thanks a lot, Jann. I haven't thought about it at all yet; and just
tried to reproduce, but haven't yet got the "BUG: Bad rss-counter":
just see "Invalid argument" on the UFFDIO_COPY ioctl.
Will investigate tomorrow.

Hugh

[PATCH v3 10/13 fix2] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix2

2023-08-05 Thread Hugh Dickins
Use ptep_clear() instead of pte_clear(): when CONFIG_PAGE_TABLE_CHECK=y,
ptep_clear() adds some accounting, missing which would cause a BUG later.

Signed-off-by: Hugh Dickins 
Reported-by: Qi Zheng 
Closes: 
https://lore.kernel.org/linux-mm/0df84f9f-e9b0-80b1-4c9e-95abc1a73...@bytedance.com/
---
 mm/khugepaged.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index bb76a5d454de..78fc1a24a1cc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1603,7 +1603,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
 * TLB flush can be left until pmdp_collapse_flush() does it.
 * PTE dirty? Shmem page is already dirty; file is read-only.
 */
-   pte_clear(mm, addr, pte);
+   ptep_clear(mm, addr, pte);
page_remove_rmap(page, vma, false);
nr_ptes++;
}
-- 
2.35.3



Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

2023-08-05 Thread Hugh Dickins
On Thu, 3 Aug 2023, Qi Zheng wrote:
> On 2023/7/12 12:42, Hugh Dickins wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
...
> > @@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm,
> > unsigned long addr,
> >   
> > if (pte_none(ptent))
> > continue;
> > -   page = vm_normal_page(vma, addr, ptent);
> > -   if (WARN_ON_ONCE(page && is_zone_device_page(page)))
> > +   /*
> > +* We dropped ptl after the first scan, to do the
> > mmu_notifier:
> > +* page lock stops more PTEs of the hpage being faulted in,
> > but
> > +* does not stop write faults COWing anon copies from existing
> > +* PTEs; and does not stop those being swapped out or
> > migrated.
> > +*/
> > +   if (!pte_present(ptent)) {
> > +   result = SCAN_PTE_NON_PRESENT;
> > goto abort;
> > +   }
> > +   page = vm_normal_page(vma, addr, ptent);
> > +   if (hpage + i != page)
> > +   goto abort;
> > +
> > +   /*
> > +* Must clear entry, or a racing truncate may re-remove it.
> > +* TLB flush can be left until pmdp_collapse_flush() does it.
> > +* PTE dirty? Shmem page is already dirty; file is read-only.
> > +*/
> > +   pte_clear(mm, addr, pte);
> 
> This is not non-present PTE entry, so we should call ptep_clear() to let
> page_table_check track the PTE clearing operation, right? Otherwise it
> may lead to false positives?

You are right: thanks a lot for catching that: fix patch follows.

Hugh


Re: [PATCH mm-unstable v7 00/31] Split ptdesc from struct page

2023-07-24 Thread Hugh Dickins
On Mon, 24 Jul 2023, Vishal Moola (Oracle) wrote:

> The MM subsystem is trying to shrink struct page. This patchset
> introduces a memory descriptor for page table tracking - struct ptdesc.
> 
> This patchset introduces ptdesc, splits ptdesc from struct page, and
> converts many callers of page table constructor/destructors to use ptdescs.
> 
> Ptdesc is a foundation to further standardize page tables, and eventually
> allow for dynamic allocation of page tables independent of struct page.
> However, the use of pages for page table tracking is quite deeply
> ingrained and varied across architectures, so there is still a lot of
> work to be done before that can happen.

Others may differ, but it remains the case that I see no point to this
patchset, until the minimal descriptor that replaces struct page is
working, and struct page then becomes just overhead.  Until that time,
let architectures continue to use struct page as they do - whyever not?

Hugh

> 
> This is rebased on mm-unstable.
> 
> v7:
>   Drop s390 gmap ptdesc conversions - gmap is unnecessary complication
> that can be dealt with later
>   Be more thorough with ptdesc struct sanity checks and comments
>   Rebase onto mm-unstable
> 
> Vishal Moola (Oracle) (31):
>   mm: Add PAGE_TYPE_OP folio functions
>   pgtable: Create struct ptdesc
>   mm: add utility functions for ptdesc
>   mm: Convert pmd_pgtable_page() callers to use pmd_ptdesc()
>   mm: Convert ptlock_alloc() to use ptdescs
>   mm: Convert ptlock_ptr() to use ptdescs
>   mm: Convert pmd_ptlock_init() to use ptdescs
>   mm: Convert ptlock_init() to use ptdescs
>   mm: Convert pmd_ptlock_free() to use ptdescs
>   mm: Convert ptlock_free() to use ptdescs
>   mm: Create ptdesc equivalents for pgtable_{pte,pmd}_page_{ctor,dtor}
>   powerpc: Convert various functions to use ptdescs
>   x86: Convert various functions to use ptdescs
>   s390: Convert various pgalloc functions to use ptdescs
>   mm: Remove page table members from struct page
>   pgalloc: Convert various functions to use ptdescs
>   arm: Convert various functions to use ptdescs
>   arm64: Convert various functions to use ptdescs
>   csky: Convert __pte_free_tlb() to use ptdescs
>   hexagon: Convert __pte_free_tlb() to use ptdescs
>   loongarch: Convert various functions to use ptdescs
>   m68k: Convert various functions to use ptdescs
>   mips: Convert various functions to use ptdescs
>   nios2: Convert __pte_free_tlb() to use ptdescs
>   openrisc: Convert __pte_free_tlb() to use ptdescs
>   riscv: Convert alloc_{pmd, pte}_late() to use ptdescs
>   sh: Convert pte_free_tlb() to use ptdescs
>   sparc64: Convert various functions to use ptdescs
>   sparc: Convert pgtable_pte_page_{ctor, dtor}() to ptdesc equivalents
>   um: Convert {pmd, pte}_free_tlb() to use ptdescs
>   mm: Remove pgtable_{pmd, pte}_page_{ctor, dtor}() wrappers
> 
>  Documentation/mm/split_page_table_lock.rst|  12 +-
>  .../zh_CN/mm/split_page_table_lock.rst|  14 +-
>  arch/arm/include/asm/tlb.h|  12 +-
>  arch/arm/mm/mmu.c |   7 +-
>  arch/arm64/include/asm/tlb.h  |  14 +-
>  arch/arm64/mm/mmu.c   |   7 +-
>  arch/csky/include/asm/pgalloc.h   |   4 +-
>  arch/hexagon/include/asm/pgalloc.h|   8 +-
>  arch/loongarch/include/asm/pgalloc.h  |  27 ++--
>  arch/loongarch/mm/pgtable.c   |   7 +-
>  arch/m68k/include/asm/mcf_pgalloc.h   |  47 +++---
>  arch/m68k/include/asm/sun3_pgalloc.h  |   8 +-
>  arch/m68k/mm/motorola.c   |   4 +-
>  arch/mips/include/asm/pgalloc.h   |  32 ++--
>  arch/mips/mm/pgtable.c|   8 +-
>  arch/nios2/include/asm/pgalloc.h  |   8 +-
>  arch/openrisc/include/asm/pgalloc.h   |   8 +-
>  arch/powerpc/mm/book3s64/mmu_context.c|  10 +-
>  arch/powerpc/mm/book3s64/pgtable.c|  32 ++--
>  arch/powerpc/mm/pgtable-frag.c|  56 +++
>  arch/riscv/include/asm/pgalloc.h  |   8 +-
>  arch/riscv/mm/init.c  |  16 +-
>  arch/s390/include/asm/pgalloc.h   |   4 +-
>  arch/s390/include/asm/tlb.h   |   4 +-
>  arch/s390/mm/pgalloc.c| 128 +++
>  arch/sh/include/asm/pgalloc.h |   9 +-
>  arch/sparc/mm/init_64.c   |  17 +-
>  arch/sparc/mm/srmmu.c |   5 +-
>  arch/um/include/asm/pgalloc.h |  18 +--
>  arch/x86/mm/pgtable.c |  47 +++---
>  arch/x86/xen/mmu_pv.c |   2 +-
>  include/asm-generic/pgalloc.h |  88 +-
>  include/asm-generic/tlb.h |  11 ++
>  include/linux/mm.h| 151 +-
>  include/linux/mm_types.h  |  18 ---
>  include/linux/page-flags.h

[PATCH v3 11/13 fix] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps(): fix

2023-07-23 Thread Hugh Dickins
Though not yet detected by syzbot, this commit was making the same
mistake with mmap_locked as the previous commit: fix that.

Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1c773db26e88..41913730db4c 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2380,19 +2380,17 @@ static unsigned int khugepaged_scan_mm_slot(unsigned 
int pages, int *result,
mmap_locked = false;
*result = hpage_collapse_scan_file(mm,
khugepaged_scan.address, file, pgoff, 
cc);
+   fput(file);
if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
mmap_read_lock(mm);
-   mmap_locked = true;
-   if (hpage_collapse_test_exit(mm)) {
-   fput(file);
+   if (hpage_collapse_test_exit(mm))
goto breakouterloop;
-   }
*result = collapse_pte_mapped_thp(mm,
khugepaged_scan.address, false);
if (*result == SCAN_PMD_MAPPED)
*result = SCAN_SUCCEED;
+   mmap_read_unlock(mm);
}
-   fput(file);
} else {
*result = hpage_collapse_scan_pmd(mm, vma,
khugepaged_scan.address, &mmap_locked, 
cc);
-- 
2.35.3



[PATCH v3 10/13 fix] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix

2023-07-23 Thread Hugh Dickins
madvise_collapse() setting "mmap_locked = true" after calling
collapse_pte_mapped_thp() looked good but was wrong.  If the loop then
moves on to the next extent, mmap_locked assures it that "vma" has been
revalidated under mmap_lock, which was not the case: and led to UAFs,
crashes in __fput() or task_work_run(), even collapse_file()'s
VM_BUG_ON(start & (HPAGE_PMD_NR - 1)) - all detected by syzbot.

(collapse_pte_mapped_thp() does validate the vma that it works on:
but it's not passed in as an argument, collapse_pte_mapped_thp() finds
the vma for mm and addr by itself - which may by this time have changed
from the vma saved in madvise_collapse().)

Reported-by: syzbot+fe7b1487405295d29...@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/f9de430600ae0...@google.com/
Reported-by: syzbot+173cc8cfdfbbef6dd...@syzkaller.appspotmail.com
Closes: 
https://lore.kernel.org/linux-mm/e4b0f0060123c...@google.com/
Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6bad69c0e4bd..1c773db26e88 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2747,7 +2747,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct 
vm_area_struct **prev,
BUG_ON(*prev);
mmap_read_lock(mm);
result = collapse_pte_mapped_thp(mm, addr, true);
-   mmap_locked = true;
+   mmap_read_unlock(mm);
goto handle_result;
/* Whitelisted set of results where continuing OK */
case SCAN_PMD_NULL:
-- 
2.35.3



[PATCH v3 07/13 fix] s390: add pte_free_defer() for pgtables sharing page: fix

2023-07-23 Thread Hugh Dickins
Claudio finds warning on mm_has_pgste() more useful than on mm_alloc_pgste().

Signed-off-by: Hugh Dickins 
---
 arch/s390/mm/pgalloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 760b4ace475e..d7374add7820 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -459,7 +459,7 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 * page_table_free() does not do the pgste gmap_unlink() which
 * page_table_free_rcu() does: warn us if pgste ever reaches here.
 */
-   WARN_ON_ONCE(mm_alloc_pgste(mm));
+   WARN_ON_ONCE(mm_has_pgste(mm));
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-- 
2.35.3



[PATCH v3 04/13 fix] powerpc: assert_pte_locked() use pte_offset_map_nolock(): fix

2023-07-23 Thread Hugh Dickins
Aneesh points out that assert_pte_locked() still needs the pmd_none()
check, to stop crashing in khugepaged: restore that comment and check.

Andrew, when merging with original commit, please edit its 1st para to:

Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in assert_pte_locked().  BUG if pte_offset_map_nolock() fails.

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/mm/pgtable.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 16b061af86d7..a3dcdb2d5b4b 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -323,6 +323,14 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long 
addr)
pud = pud_offset(p4d, addr);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, addr);
+   /*
+* khugepaged to collapse normal pages to hugepage, first set
+* pmd to none to force page fault/gup to take mmap_lock. After
+* pmd is set to none, we do a pte_clear which does this assertion
+* so if we find pmd none, return.
+*/
+   if (pmd_none(*pmd))
+   return;
pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
BUG_ON(!pte);
assert_spin_locked(ptl);
-- 
2.35.3



Re: [PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()

2023-07-18 Thread Hugh Dickins
On Tue, 18 Jul 2023, Aneesh Kumar K.V wrote:
> Hugh Dickins  writes:
> 
> > Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
> > in assert_pte_locked().  BUG if pte_offset_map_nolock() fails: this is
> > stricter than the previous implementation, which skipped when pmd_none()
> > (with a comment on khugepaged collapse transitions): but wouldn't we want
> > to know, if an assert_pte_locked() caller can be racing such transitions?
> >
> 
> The reason we had that pmd_none check there was to handle khugpaged. In
> case of khugepaged we do pmdp_collapse_flush and then do a ptep_clear.
> ppc64 had the assert_pte_locked check inside that ptep_clear.
> 
> _pmd = pmdp_collapse_flush(vma, address, pmd);
> ..
> ptep_clear()
> -> assert_pte_locked()
> ---> pmd_none
> -> BUG
> 
> 
> The problem is how assert_pte_locked() verify whether we are holding
> ptl. It does that by walking the page table again and in this specific
> case by the time we call the function we already had cleared pmd .

Aneesh, please clarify, I've spent hours on this.

From all your use of past tense ("had"), I thought you were Acking my
patch; but now, after looking again at v3.11 source and today's,
I think you are NAKing my patch in its present form.

You are pointing out that anon THP's __collapse_huge_page_copy_succeeded()
uses ptep_clear() at a point after pmdp_collapse_flush() already cleared
*pmd, so my patch now leads that one use of assert_pte_locked() to BUG.
Is that your point?

I can easily restore that khugepaged comment (which had appeared to me
out of date at the time, but now looks still relevant) and pmd_none(*pmd)
check: but please clarify.

Thanks,
Hugh

> >
> > This mod might cause new crashes: which either expose my ignorance, or
> > indicate issues to be fixed, or limit the usage of assert_pte_locked().
> >
> > Signed-off-by: Hugh Dickins 
> > ---
> >  arch/powerpc/mm/pgtable.c | 16 ++--
> >  1 file changed, 6 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> > index cb2dcdb18f8e..16b061af86d7 100644
> > --- a/arch/powerpc/mm/pgtable.c
> > +++ b/arch/powerpc/mm/pgtable.c
> > @@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned 
> > long addr)
> > p4d_t *p4d;
> > pud_t *pud;
> > pmd_t *pmd;
> > +   pte_t *pte;
> > +   spinlock_t *ptl;
> >  
> > if (mm == &init_mm)
> > return;
> > @@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned 
> > long addr)
> > pud = pud_offset(p4d, addr);
> > BUG_ON(pud_none(*pud));
> > pmd = pmd_offset(pud, addr);
> > -   /*
> > -* khugepaged to collapse normal pages to hugepage, first set
> > -* pmd to none to force page fault/gup to take mmap_lock. After
> > -* pmd is set to none, we do a pte_clear which does this assertion
> > -* so if we find pmd none, return.
> > -*/
> > -   if (pmd_none(*pmd))
> > -   return;
> > -   BUG_ON(!pmd_present(*pmd));
> > -   assert_spin_locked(pte_lockptr(mm, pmd));
> > +   pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
> > +   BUG_ON(!pte);
> > +   assert_spin_locked(ptl);
> > +   pte_unmap(pte);
> >  }
> >  #endif /* CONFIG_DEBUG_VM */
> >  
> > -- 
> > 2.35.3


[PATCH mm 12/13] mm: delete mmap_write_trylock() and vma_try_start_write()

2023-07-11 Thread Hugh Dickins
mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins 
---
This is the version which applies to mm-unstable or linux-next.

 include/linux/mm.h| 17 -
 include/linux/mmap_lock.h | 10 --
 2 files changed, 27 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -692,21 +692,6 @@ static inline void vma_start_write(struc
up_write(&vma->vm_lock->lock);
 }
 
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
-   int mm_lock_seq;
-
-   if (__is_vma_write_locked(vma, &mm_lock_seq))
-   return true;
-
-   if (!down_write_trylock(&vma->vm_lock->lock))
-   return false;
-
-   vma->vm_lock_seq = mm_lock_seq;
-   up_write(&vma->vm_lock->lock);
-   return true;
-}
-
 static inline void vma_assert_locked(struct vm_area_struct *vma)
 {
int mm_lock_seq;
@@ -758,8 +743,6 @@ static inline bool vma_start_read(struct
{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-   { return true; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 bool detached) {}
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killab
return ret;
 }
 
-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
-   bool ret;
-
-   __mmap_lock_trace_start_locking(mm, true);
-   ret = down_write_trylock(&mm->mmap_lock) != 0;
-   __mmap_lock_trace_acquire_returned(mm, true, ret);
-   return ret;
-}
-
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
__mmap_lock_trace_released(mm, true);


[PATCH v3 13/13] mm/pgtable: notes on pte_offset_map[_lock]()

2023-07-11 Thread Hugh Dickins
Add a block of comments on pte_offset_map_lock(), pte_offset_map() and
pte_offset_map_nolock() to mm/pgtable-generic.c, to help explain them.

Signed-off-by: Hugh Dickins 
---
 mm/pgtable-generic.c | 44 
 1 file changed, 44 insertions(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index fa9d4d084291..4fcd959dcc4d 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -315,6 +315,50 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t 
*pmd,
return pte;
 }
 
+/*
+ * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
+ * __pte_offset_map_lock() below, is usually called with the pmd pointer for
+ * addr, reached by walking down the mm's pgd, p4d, pud for addr: either while
+ * holding mmap_lock or vma lock for read or for write; or in truncate or rmap
+ * context, while holding file's i_mmap_lock or anon_vma lock for read (or for
+ * write). In a few cases, it may be used with pmd pointing to a pmd_t already
+ * copied to or constructed on the stack.
+ *
+ * When successful, it returns the pte pointer for addr, with its page table
+ * kmapped if necessary (when CONFIG_HIGHPTE), and locked against concurrent
+ * modification by software, with a pointer to that spinlock in ptlp (in some
+ * configs mm->page_table_lock, in SPLIT_PTLOCK configs a spinlock in table's
+ * struct page).  pte_unmap_unlock(pte, ptl) to unlock and unmap afterwards.
+ *
+ * But it is unsuccessful, returning NULL with *ptlp unchanged, if there is no
+ * page table at *pmd: if, for example, the page table has just been removed,
+ * or replaced by the huge pmd of a THP.  (When successful, *pmd is rechecked
+ * after acquiring the ptlock, and retried internally if it changed: so that a
+ * page table can be safely removed or replaced by THP while holding its lock.)
+ *
+ * pte_offset_map(pmd, addr), and its internal helper __pte_offset_map() above,
+ * just returns the pte pointer for addr, its page table kmapped if necessary;
+ * or NULL if there is no page table at *pmd.  It does not attempt to lock the
+ * page table, so cannot normally be used when the page table is to be updated,
+ * or when entries read must be stable.  But it does take rcu_read_lock(): so
+ * that even when page table is racily removed, it remains a valid though empty
+ * and disconnected table.  Until pte_unmap(pte) unmaps and rcu_read_unlock()s
+ * afterwards.
+ *
+ * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
+ * but when successful, it also outputs a pointer to the spinlock in ptlp - as
+ * pte_offset_map_lock() does, but in this case without locking it.  This helps
+ * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
+ * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
+ * pointer for the page table that it returns.  In principle, the caller should
+ * recheck *pmd once the lock is taken; in practice, no callsite needs that -
+ * either the mmap_lock for write, or pte_same() check on contents, is enough.
+ *
+ * Note that free_pgtables(), used after unmapping detached vmas, or when
+ * exiting the whole mm, does not take page table lock before freeing a page
+ * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
+ * pte_offset_map() and co once the vma is detached from mm or mm_users is 
zero.
+ */
 pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 unsigned long addr, spinlock_t **ptlp)
 {
-- 
2.35.3
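
A quick illustration of the calling convention those comments describe
(a minimal sketch with a made-up helper name, not code from the series):

	static bool my_pte_is_present(struct mm_struct *mm, pmd_t *pmd,
				      unsigned long addr)
	{
		spinlock_t *ptl;
		pte_t *pte;
		bool ret;

		pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
		if (!pte)	/* no page table here: removed, or a huge pmd */
			return false;
		ret = pte_present(ptep_get(pte));
		pte_unmap_unlock(pte, ptl);
		return ret;
	}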



[PATCH v3 12/13] mm: delete mmap_write_trylock() and vma_try_start_write()

2023-07-11 Thread Hugh Dickins
mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins 
---
 include/linux/mm.h| 17 -
 include/linux/mmap_lock.h | 10 --
 2 files changed, 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2dd73e4f3d8e..b7b45be616ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -692,21 +692,6 @@ static inline void vma_start_write(struct vm_area_struct 
*vma)
up_write(&vma->vm_lock->lock);
 }
 
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
-   int mm_lock_seq;
-
-   if (__is_vma_write_locked(vma, &mm_lock_seq))
-   return true;
-
-   if (!down_write_trylock(&vma->vm_lock->lock))
-   return false;
-
-   vma->vm_lock_seq = mm_lock_seq;
-   up_write(&vma->vm_lock->lock);
-   return true;
-}
-
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
int mm_lock_seq;
@@ -731,8 +716,6 @@ static inline bool vma_start_read(struct vm_area_struct 
*vma)
{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-   { return true; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 bool detached) {}
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index aab8f1b28d26..d1191f02c7fa 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killable(struct 
mm_struct *mm)
return ret;
 }
 
-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
-   bool ret;
-
-   __mmap_lock_trace_start_locking(mm, true);
-   ret = down_write_trylock(&mm->mmap_lock) != 0;
-   __mmap_lock_trace_acquire_returned(mm, true, ret);
-   return ret;
-}
-
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
__mmap_lock_trace_released(mm, true);
-- 
2.35.3



[PATCH v3 11/13] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()

2023-07-11 Thread Hugh Dickins
Now that retract_page_tables() can retract page tables reliably, without
depending on trylocks, delete all the apparatus for khugepaged to try
again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
per-mm memory which was set aside for that in the khugepaged_mm_slot.

But one part of that is worth keeping: when hpage_collapse_scan_file()
found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot
to be tried for retraction later - catching, for example, page tables
where a reversible mprotect() of a portion had required splitting the
pmd, but now it can be recollapsed.  Call collapse_pte_mapped_thp()
directly in this case (why was it deferred before?  I assume an issue
with needing mmap_lock for write, but now it's only needed for read).

Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 125 +++---
 1 file changed, 16 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 46986eb4eebb..7c7aaddbe130 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,8 +92,6 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, 
MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __read_mostly;
 
-#define MAX_PTE_MAPPED_THP 8
-
 struct collapse_control {
bool is_khugepaged;
 
@@ -107,15 +105,9 @@ struct collapse_control {
 /**
  * struct khugepaged_mm_slot - khugepaged information per mm that is being 
scanned
  * @slot: hash lookup from mm to mm_slot
- * @nr_pte_mapped_thp: number of pte mapped THP
- * @pte_mapped_thp: address array corresponding pte mapped THP
  */
 struct khugepaged_mm_slot {
struct mm_slot slot;
-
-   /* pte-mapped THP in this mm */
-   int nr_pte_mapped_thp;
-   unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
 };
 
 /**
@@ -1439,50 +1431,6 @@ static void collect_mm_slot(struct khugepaged_mm_slot 
*mm_slot)
 }
 
 #ifdef CONFIG_SHMEM
-/*
- * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
- * khugepaged should try to collapse the page table.
- *
- * Note that following race exists:
- * (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
- * emptying the A's ->pte_mapped_thp[] array.
- * (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
- * retract_page_tables() finds a VMA in mm_struct A mapping the same extent
- * (at virtual address X) and adds an entry (for X) into mm_struct A's
- * ->pte-mapped_thp[] array.
- * (3) khugepaged calls khugepaged_collapse_scan_file() for mm_struct A at X,
- * sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry
- * (for X) into mm_struct A's ->pte-mapped_thp[] array.
- * Thus, it's possible the same address is added multiple times for the same
- * mm_struct.  Should this happen, we'll simply attempt
- * collapse_pte_mapped_thp() multiple times for the same address, under the 
same
- * exclusive mmap_lock, and assuming the first call is successful, subsequent
- * attempts will return quickly (without grabbing any additional locks) when
- * a huge pmd is found in find_pmd_or_thp_or_none().  Since this is a cheap
- * check, and since this is a rare occurrence, the cost of preventing this
- * "multiple-add" is thought to be more expensive than just handling it, should
- * it occur.
- */
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
- unsigned long addr)
-{
-   struct khugepaged_mm_slot *mm_slot;
-   struct mm_slot *slot;
-   bool ret = false;
-
-   VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-
-   spin_lock(&khugepaged_mm_lock);
-   slot = mm_slot_lookup(mm_slots_hash, mm);
-   mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
-   if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) 
{
-   mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
-   ret = true;
-   }
-   spin_unlock(&khugepaged_mm_lock);
-   return ret;
-}
-
 /* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct page *hpage)
@@ -1706,29 +1654,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
return result;
 }
 
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot 
*mm_slot)
-{
-   struct mm_slot *slot = &mm_slot->slot;
-   struct mm_struct *mm = slot->mm;
-   int i;
-
-   if (likely(mm_slot->nr_pte_mapped_thp == 0))
-   return;
-
-   if (!mmap_write_trylock(mm))
-   return;
-
-   if (unlikely(hpage_collapse_test_exit(mm)))
-   goto out;
-
-   for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
-   collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
-
-out:
-   mm_slot->nr_pte_mapped_thp = 0;
-   mma

[PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

2023-07-11 Thread Hugh Dickins
Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
It does need mmap_read_lock(), but it does not need mmap_write_lock(),
nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

Follow the pattern in retract_page_tables(); and using pte_free_defer()
removes most of the need for tlb_remove_table_sync_one() here; but call
pmdp_get_lockless_sync() to use it in the PAE case.

First check the VMA, in case page tables are being torn down: from JannH.
Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
acquired and the page looks suitable: from then on its state is stable.

However, collapse_pte_mapped_thp() was doing something others don't:
freeing a page table still containing "valid" entries.  i_mmap lock did
stop a racing truncate from double-freeing those pages, but we prefer
collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB
flush can wait until the pmdp_collapse_flush() which follows, but the
mmu_notifier_invalidate_range_start() has to be done earlier.

Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
for khugepaged to keep on repeatedly invalidating a range which is then
found unsuitable e.g. contains COWs.  "step 2", which does the clearing,
must then be more careful (after dropping ptl to do mmu_notifier), with
abort prepared to correct the accounting like "step 3".  But with those
entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
safe by the huge page lock, which stops new PTEs from being faulted in.

Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 172 ++
 1 file changed, 77 insertions(+), 95 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3bb05147961b..46986eb4eebb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1483,7 +1483,7 @@ static bool khugepaged_add_pte_mapped_thp(struct 
mm_struct *mm,
return ret;
 }
 
-/* hpage must be locked, and mmap_lock must be held in write */
+/* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct page *hpage)
 {
@@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, 
unsigned long addr,
};
 
VM_BUG_ON(!PageTransHuge(hpage));
-   mmap_assert_write_locked(vma->vm_mm);
+   mmap_assert_locked(vma->vm_mm);
 
if (do_set_pmd(&vmf, hpage))
return SCAN_FAIL;
@@ -1504,48 +1504,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, 
unsigned long addr,
return SCAN_SUCCEED;
 }
 
-/*
- * A note about locking:
- * Trying to take the page table spinlocks would be useless here because those
- * are only used to synchronize:
- *
- *  - modifying terminal entries (ones that point to a data page, not to 
another
- *page table)
- *  - installing *new* non-terminal entries
- *
- * Instead, we need roughly the same kind of protection as free_pgtables() or
- * mm_take_all_locks() (but only for a single VMA):
- * The mmap lock together with this VMA's rmap locks covers all paths towards
- * the page table entries we're messing with here, except for hardware page
- * table walks and lockless_pages_from_mm().
- */
-static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct 
*vma,
- unsigned long addr, pmd_t *pmdp)
-{
-   pmd_t pmd;
-   struct mmu_notifier_range range;
-
-   mmap_assert_write_locked(mm);
-   if (vma->vm_file)
-   
lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
-   /*
-* All anon_vmas attached to the VMA have the same root and are
-* therefore locked by the same lock.
-*/
-   if (vma->anon_vma)
-   lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
-
-   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
-   addr + HPAGE_PMD_SIZE);
-   mmu_notifier_invalidate_range_start(&range);
-   pmd = pmdp_collapse_flush(vma, addr, pmdp);
-   tlb_remove_table_sync_one();
-   mmu_notifier_invalidate_range_end(&range);
-   mm_dec_nr_ptes(mm);
-   page_table_check_pte_clear_range(mm, addr, pmd);
-   pte_free(mm, pmd_pgtable(pmd));
-}
-
 /**
  * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
  * address haddr.
@@ -1561,26 +1519,29 @@ static void collapse_and_free_pmd(struct mm_struct *mm, 
struct vm_area_struct *v
 int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
bool install_pmd)
 {
+   struct mmu_notifier_range range;
+   bool notified = false;
unsigned long haddr = addr & HPAGE_PMD_MASK;
struct vm_area_struct *vma = vma_lookup(mm, haddr);
struct page

[PATCH v3 09/13] mm/khugepaged: retract_page_tables() without mmap or vma lock

2023-07-11 Thread Hugh Dickins
Simplify shmem and file THP collapse's retract_page_tables(), and relax
its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
target_mm, leave that part of the work to madvise_collapse() calling
collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
result code to arrange for that.  That spares retract_page_tables() four
arguments; and since it will be successful in retracting all of the page
tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
THPs.  retract_page_tables() just needs to use those same spinlocks to
exclude it briefly, while transitioning pmd from page table to none: so
restore its use of pmd_lock() inside of which pte lock is nested.

Users of pte_offset_map_lock() etc all now allow for them to fail:
so retract_page_tables() now has no use for mmap_write_trylock() or
vma_try_start_write().  In common with rmap and page_vma_mapped_walk(),
it does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table,
until they unlock and rcu_read_unlock(): so the page table cannot be
freed immediately, but rather by the recently added pte_free_defer().

Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
when PAE, and pmdp_collapse_flush() did not already do so: to make sure
that the start,pmdp_get_lockless(),end sequence in __pte_offset_map()
cannot pick up a pmd entry with mismatched pmd_low and pmd_high.

retract_page_tables() can be enhanced to replace_page_tables(), which
inserts the final huge pmd without mmap lock: going through an invalid
state instead of pmd_none() followed by fault.  But that enhancement
does raise some more questions: leave it until a later release.

Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 184 --
 1 file changed, 75 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 78c8d5d8b628..3bb05147961b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1615,9 +1615,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
break;
case SCAN_PMD_NONE:
/*
-* In MADV_COLLAPSE path, possible race with khugepaged where
-* all pte entries have been removed and pmd cleared.  If so,
-* skip all the pte checks and just update the pmd mapping.
+* All pte entries have been removed and pmd cleared.
+* Skip all the pte checks and just update the pmd mapping.
 */
goto maybe_install_pmd;
default:
@@ -1748,123 +1747,88 @@ static void khugepaged_collapse_pte_mapped_thps(struct 
khugepaged_mm_slot *mm_sl
mmap_write_unlock(mm);
 }
 
-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
-  struct mm_struct *target_mm,
-  unsigned long target_addr, struct page *hpage,
-  struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
struct vm_area_struct *vma;
-   int target_result = SCAN_FAIL;
 
-   i_mmap_lock_write(mapping);
+   i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-   int result = SCAN_FAIL;
-   struct mm_struct *mm = NULL;
-   unsigned long addr = 0;
-   pmd_t *pmd;
-   bool is_target = false;
+   struct mmu_notifier_range range;
+   struct mm_struct *mm;
+   unsigned long addr;
+   pmd_t *pmd, pgt_pmd;
+   spinlock_t *pml;
+   spinlock_t *ptl;
+   bool skipped_uffd = false;
 
/*
 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-* got written to. These VMAs are likely not worth investing
-* mmap_write_lock(mm) as PMD-mapping is likely to be split
-* later.
-*
-* Note that vma->anon_vma check is racy: it can be set up after
-* the check but before we took mmap_lock by the fault path.
-* But page lock would prevent establishing any new ptes of the
-* page, so we are safe.
-*
-* An alternative would be drop the check, but check that page
-* table is clear before calling pmdp_collapse_flush() under
-* ptl. It has higher chance to recover THP for the VMA, but
-* has higher cost too. It would also

[PATCH v3 08/13] mm/pgtable: add pte_free_defer() for pgtable as page

2023-07-11 Thread Hugh Dickins
Add the generic pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This version
suits all those architectures which use an unfragmented page for one page
table (none of whose pte_free()s use the mm arg which was passed to it).

Signed-off-by: Hugh Dickins 
---
 include/linux/mm_types.h |  4 
 include/linux/pgtable.h  |  2 ++
 mm/pgtable-generic.c | 20 
 3 files changed, 26 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de10fc797c8e..17a7868f00bd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -144,6 +144,10 @@ struct page {
struct {/* Page table pages */
unsigned long _pt_pad_1;/* compound_head */
pgtable_t pmd_huge_pte; /* protected by page->ptl */
+   /*
+* A PTE page table page might be freed by use of
+* rcu_head: which overlays those two fields above.
+*/
unsigned long _pt_pad_2;/* mapping */
union {
struct mm_struct *pt_mm; /* x86 pgds only */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 7f2db400f653..9fa34be65159 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -112,6 +112,8 @@ static inline void pte_unmap(pte_t *pte)
 }
 #endif
 
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index b9a0c2137cc1..fa9d4d084291 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -230,6 +231,25 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, 
unsigned long address,
return pmd;
 }
 #endif
+
+/* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
+#ifndef pte_free_defer
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+
+   page = container_of(head, struct page, rcu_head);
+   pte_free(NULL /* mm not passed and not used */, (pgtable_t)page);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+   struct page *page;
+
+   page = pgtable;
+   call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* pte_free_defer */
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
-- 
2.35.3



[PATCH v3 07/13] s390: add pte_free_defer() for pgtables sharing page

2023-07-11 Thread Hugh Dickins
Add s390-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is never on the free list if pte_free_defer() was used on either
half (marked by PageActive).  And for simplicity, delay calling RCU until
both halves are freed.

Not adding back unallocated fragments to the list in pte_free_defer()
can result in wasting some amount of memory for pagetables, depending
on how long the allocated fragment will stay in use. In practice, this
effect is expected to be insignificant, and not justify a far more
complex approach, which might allow to add the fragments back later
in __tlb_remove_table(), where we might not have a stable mm any more.
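
A sketch of the entry point itself, assembled from the pieces quoted
elsewhere in this thread (the pgste WARN was added during review); the
pgtable_list and _refcount management which it relies on is in the diff
below.

void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
{
	struct page *page;

	page = virt_to_page(pgtable);
	SetPageActive(page);
	page_table_free(mm, (unsigned long *)pgtable);
	/*
	 * page_table_free() does not do the pgste gmap_unlink() which
	 * page_table_free_rcu() does: warn us if pgste ever reaches here.
	 */
	WARN_ON_ONCE(mm_alloc_pgste(mm));
}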

Signed-off-by: Hugh Dickins 
Reviewed-by: Gerald Schaefer 
---
 arch/s390/include/asm/pgalloc.h |  4 ++
 arch/s390/mm/pgalloc.c  | 80 +--
 2 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..760b4ace475e 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
  * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
  * while the PP bits are never used, nor such a page is added to or removed
  * from mm_context_t::pgtable_list.
+ *
+ * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
+ * and prevents both 2K fragments from being reused. pte_free_defer() has to
+ * guarantee that its pgtable cannot be reused before the RCU grace period
+ * has elapsed (which page_table_free_rcu() does not actually guarantee).
+ * But for simplicity, because page->rcu_head overlays page->lru, and because
+ * the RCU callback might not be called before the mm_context_t has been freed,
+ * pte_free_defer() in this implementation prevents both fragments from being
+ * reused, and delays making the call to RCU until both fragments are freed.
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
@@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
table += PTRS_PER_PTE;
atomic_xor_bits(&page->_refcount,
0x01U << (bit + 24));
-   list_del(&page->lru);
+   list_del_init(&page->lru);
}
}
spin_unlock_bh(&mm->context.lock);
@@ -281,6 +290,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
table = (unsigned long *) page_to_virt(page);
if (mm_alloc_pgste(mm)) {
/* Return 4K page table with PGSTEs */
+   INIT_LIST_HEAD(&page->lru);
atomic_xor_bits(&page->_refcount, 0x03U << 24);
memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
@@ -300,7 +310,9 @@ static void page_table_release_check(struct page *page, 
void *table,
 {
char msg[128];
 
-   if (!IS_ENABLED(CONFIG_DEBUG_VM) || !mask)
+   if (!IS_ENABLED(CONFIG_DEBUG_VM))
+   return;
+   if (!mask && list_empty(&page->lru))
return;
snprintf(msg, sizeof(msg),
 "Invalid pgtable %p release half 0x%02x mask 0x%02x",
@@ -308,6 +320,15 @@ static void page_table_release_check(struct page *page, 
void *table,
dump_page(page, msg);
 }
 
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+
+   page = container_of(head, struct page, rcu_head);
+   pgtable_pte_page_dtor(page);
+   __free_page(

[PATCH v3 06/13] sparc: add pte_free_defer() for pte_t *pgtable_t

2023-07-11 Thread Hugh Dickins
Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.
So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pte_t *pgtable_t and struct page *.

Signed-off-by: Hugh Dickins 
---
 arch/sparc/include/asm/pgalloc_64.h |  4 
 arch/sparc/mm/init_64.c | 16 
 2 files changed, 20 insertions(+)

diff --git a/arch/sparc/include/asm/pgalloc_64.h 
b/arch/sparc/include/asm/pgalloc_64.h
index 7b5561d17ab1..caa7632be4c2 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 void pte_free(struct mm_struct *mm, pgtable_t ptepage);
 
+/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 #define pmd_populate_kernel(MM, PMD, PTE)  pmd_set(MM, PMD, PTE)
 #define pmd_populate(MM, PMD, PTE) pmd_set(MM, PMD, PTE)
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..0d7fd793924c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+
+   page = container_of(head, struct page, rcu_head);
+   __pte_free((pgtable_t)page_address(page));
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+   struct page *page;
+
+   page = virt_to_page(pgtable);
+   call_rcu(&page->rcu_head, pte_free_now);
+}
+
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
  pmd_t *pmd)
 {
-- 
2.35.3



[PATCH v3 05/13] powerpc: add pte_free_defer() for pgtables sharing page

2023-07-11 Thread Hugh Dickins
Add powerpc-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time.  But powerpc never reuses a fragment
once it has been freed: so mark the page Active in pte_free_defer(),
before calling pte_fragment_free() directly; and there call_rcu() to
pte_free_now() when last fragment is freed and the page is PageActive.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Hugh Dickins 
---
 arch/powerpc/include/asm/pgalloc.h |  4 
 arch/powerpc/mm/pgtable-frag.c | 29 ++---
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h 
b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t 
ptepage)
pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c 
*/
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..0c6b68130025 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -106,6 +106,15 @@ pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel)
return __alloc_for_ptecache(mm, kernel);
 }
 
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+
+   page = container_of(head, struct page, rcu_head);
+   pgtable_pte_page_dtor(page);
+   __free_page(page);
+}
+
 void pte_fragment_free(unsigned long *table, int kernel)
 {
struct page *page = virt_to_page(table);
@@ -115,8 +124,22 @@ void pte_fragment_free(unsigned long *table, int kernel)
 
BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
if (atomic_dec_and_test(&page->pt_frag_refcount)) {
-   if (!kernel)
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+   if (kernel)
+   __free_page(page);
+   else if (TestClearPageActive(page))
+   call_rcu(&page->rcu_head, pte_free_now);
+   else
+   pte_free_now(&page->rcu_head);
}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+   struct page *page;
+
+   page = virt_to_page(pgtable);
+   SetPageActive(page);
+   pte_fragment_free((unsigned long *)pgtable, 0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.35.3



[PATCH v3 04/13] powerpc: assert_pte_locked() use pte_offset_map_nolock()

2023-07-11 Thread Hugh Dickins
Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in assert_pte_locked().  BUG if pte_offset_map_nolock() fails: this is
stricter than the previous implementation, which skipped when pmd_none()
(with a comment on khugepaged collapse transitions): but wouldn't we want
to know, if an assert_pte_locked() caller can be racing such transitions?

This mod might cause new crashes: which either expose my ignorance, or
indicate issues to be fixed, or limit the usage of assert_pte_locked().

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/mm/pgtable.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..16b061af86d7 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long 
addr)
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
+   pte_t *pte;
+   spinlock_t *ptl;
 
if (mm == &init_mm)
return;
@@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned 
long addr)
pud = pud_offset(p4d, addr);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, addr);
-   /*
-* khugepaged to collapse normal pages to hugepage, first set
-* pmd to none to force page fault/gup to take mmap_lock. After
-* pmd is set to none, we do a pte_clear which does this assertion
-* so if we find pmd none, return.
-*/
-   if (pmd_none(*pmd))
-   return;
-   BUG_ON(!pmd_present(*pmd));
-   assert_spin_locked(pte_lockptr(mm, pmd));
+   pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+   BUG_ON(!pte);
+   assert_spin_locked(ptl);
+   pte_unmap(pte);
 }
 #endif /* CONFIG_DEBUG_VM */
 
-- 
2.35.3



[PATCH v3 03/13] arm: adjust_pte() use pte_offset_map_nolock()

2023-07-11 Thread Hugh Dickins
Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in adjust_pte(): because it gives the not-locked ptl for precisely that
pte, which the caller can then safely lock; whereas pte_lockptr() is not
so tightly coupled, because it dereferences the pmd pointer again.

Signed-off-by: Hugh Dickins 
---
 arch/arm/mm/fault-armv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index ca5302b0b7ee..7cb125497976 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,11 +117,10 @@ static int adjust_pte(struct vm_area_struct *vma, 
unsigned long address,
 * must use the nested version.  This also means we need to
 * open-code the spin-locking.
 */
-   pte = pte_offset_map(pmd, address);
+   pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
if (!pte)
return 0;
 
-   ptl = pte_lockptr(vma->vm_mm, pmd);
do_pte_lock(ptl);
 
ret = do_adjust_pte(vma, address, pfn, pte);
-- 
2.35.3



[PATCH v3 02/13] mm/pgtable: add PAE safety to __pte_offset_map()

2023-07-11 Thread Hugh Dickins
There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
a pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.

Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests.  It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.

Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(),
used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
at the right moment on those configs which do not already send it.

CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that
Will Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.
It is not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.

Limit the IRQ disablement to CONFIG_HIGHPTE?  Perhaps, but would need a
little more work, to retry if pmd_low good for page table, but pmd_high
non-zero from THP (and that might be making x86-specific assumptions).

Signed-off-by: Hugh Dickins 
---
 include/linux/pgtable.h |  4 
 mm/pgtable-generic.c| 29 +
 2 files changed, 33 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5134edcec668..7f2db400f653 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -390,6 +390,7 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
return pmd;
 }
 #define pmdp_get_lockless pmdp_get_lockless
+#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
 #endif /* CONFIG_PGTABLE_LEVELS > 2 */
 #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
 
@@ -408,6 +409,9 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 {
return pmdp_get(pmdp);
 }
+static inline void pmdp_get_lockless_sync(void)
+{
+}
 #endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 400e5a045848..b9a0c2137cc1 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -232,12 +232,41 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, 
unsigned long address,
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
+   (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
+/*
+ * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
+ * the barriers in pmdp_get_lockless() cannot guarantee that the value in
+ * pmd_high actually belongs with the value in pmd_low; but holding interrupts
+ * off blocks the TLB flush between present updates, which guarantees that a
+ * successful __pte_offset_map() points to a page from matched halves.
+ */
+static unsigned long pmdp_get_lockless_start(void)
+{
+   unsigned long irqflags;
+
+   local_irq_save(irqflags);
+   return irqflags;
+}
+static void pmdp_get_lockless_end(unsigned long irqflags)
+{
+   local_irq_restore(irqflags);
+}
+#else
+static unsigned long pmdp_get_lockless_start(void) { return 0; }
+static void pmdp_get_lockless_end(unsigned long irqflags) { }
+#endif
+
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
+   unsigned long irqflags;
pmd_t pmdval;
 
rcu_read_lock();
+   irqflags = pmdp_get_lockless_start();
pmdval = pmdp_get_lockless(pmd);
+   pmdp_get_lockless_end(irqflags);
+
if (pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
-- 
2.35.3



[PATCH v3 01/13] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s

2023-07-11 Thread Hugh Dickins
Before putting them to use (several commits later), add rcu_read_lock()
to pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
separate commit, since it risks exposing imbalances: prior commits have
fixed all the known imbalances, but we may find some have been missed.
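
For illustration only: a minimal sketch of the caller-side discipline,
assuming a caller which can simply fail when the pmd has changed; every
successful pte_offset_map() must be balanced by pte_unmap(), since the
map now takes the rcu_read_lock() which the unmap releases.

	pte_t *pte, entry;

	pte = pte_offset_map(pmd, addr);
	if (!pte)
		return 0;	/* pmd was none or unstable: retry or give up */
	entry = ptep_get(pte);
	/* ... inspect entry ... */
	pte_unmap(pte);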

Signed-off-by: Hugh Dickins 
---
 include/linux/pgtable.h | 4 ++--
 mm/pgtable-generic.c| 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5063b482e34f..5134edcec668 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -99,7 +99,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned 
long address)
((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
 #define pte_unmap(pte) do {\
kunmap_local((pte));\
-   /* rcu_read_unlock() to be added later */   \
+   rcu_read_unlock();  \
 } while (0)
 #else
 static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
@@ -108,7 +108,7 @@ static inline pte_t *__pte_map(pmd_t *pmd, unsigned long 
address)
 }
 static inline void pte_unmap(pte_t *pte)
 {
-   /* rcu_read_unlock() to be added later */
+   rcu_read_unlock();
 }
 #endif
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4d454953046f..400e5a045848 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, 
pmd_t *pmdvalp)
 {
pmd_t pmdval;
 
-   /* rcu_read_lock() to be added later */
+   rcu_read_lock();
pmdval = pmdp_get_lockless(pmd);
if (pmdvalp)
*pmdvalp = pmdval;
@@ -250,7 +250,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, 
pmd_t *pmdvalp)
}
return __pte_map(&pmdval, addr);
 nomap:
-   /* rcu_read_unlock() to be added later */
+   rcu_read_unlock();
return NULL;
 }
 
-- 
2.35.3



[PATCH v3 00/13] mm: free retracted page table by RCU

2023-07-11 Thread Hugh Dickins
Here is v3 of the series of patches to mm (and a few architectures), based
on v6.5-rc1 which includes the preceding two series (thank you!): in which
khugepaged takes advantage of pte_offset_map[_lock]() allowing for pmd
transitions.  Differences from v1 and v2 are noted patch by patch below.

This replaces the v2 "mm: free retracted page table by RCU"
https://lore.kernel.org/linux-mm/54cb04f-3762-987f-8294-91dafd8eb...@google.com/
series of 12 posted on 2023-06-20.

What is it all about?  Some mmap_lock avoidance i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs:
the usefulness of MADV_COLLAPSE on shmem is being limited by that
mmap_write_lock it currently requires.

Likely to be relied upon later in other contexts e.g. freeing of
empty page tables (but that's not work I'm doing).  mmap_write_lock
avoidance when collapsing to anon THPs?  Perhaps, but again that's not
work I've done: a quick attempt was not as easy as the shmem/file case.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

Based on v6.5-rc1; and almost good on current mm-unstable or current
linux-next - just one patch conflicts, the 12/13: I'll reply to that
one with its mm-unstable or linux-next equivalent (vma_assert_locked()
has been added next to where vma_try_start_write() is being removed).

01/13 mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
  v3: same as v1
02/13 mm/pgtable: add PAE safety to __pte_offset_map()
  v3: same as v2
  v2: rename to pmdp_get_lockless_start/end() per Matthew;
  so use inlines without _irq_save(flags) macro oddity;
  add pmdp_get_lockless_sync() for use later in 09/13.
03/13 arm: adjust_pte() use pte_offset_map_nolock()
  v3: same as v1
04/13 powerpc: assert_pte_locked() use pte_offset_map_nolock()
  v3: same as v1
05/13 powerpc: add pte_free_defer() for pgtables sharing page
  v3: much simpler version, following suggestion by Jason
  v2: fix rcu_head usage to cope with concurrent deferrals;
  add para to commit message explaining rcu_head issue.
06/13 sparc: add pte_free_defer() for pte_t *pgtable_t
  v3: same as v2
  v2: use page_address() instead of less common page_to_virt();
  add para to commit message explaining simple conversion;
  changed title since sparc64 pgtables do not share page.
07/13 s390: add pte_free_defer() for pgtables sharing page
  v3: much simpler version, following suggestion by Gerald
  v2: complete rewrite, integrated with s390's existing pgtable
  management; temporarily using a global mm_pgtable_list_lock,
  to be restored to per-mm spinlock in a later followup patch.
08/13 mm/pgtable: add pte_free_defer() for pgtable as page
  v3: same as v2
  v2: add comment on rcu_head to "Page table pages", per JannH
09/13 mm/khugepaged: retract_page_tables() without mmap or vma lock
  v3: same as v2
  v2: repeat checks under ptl because UFFD, per PeterX and JannH;
  bring back mmu_notifier calls for PMD, per JannH and Jason;
  pmdp_get_lockless_sync() to issue missing interrupt if PAE.
10/13 mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  v3: updated to using ptent instead of *pte
  v2: first check VMA, in case page tables torn down, per JannH;
  pmdp_get_lockless_sync() to issue missing interrupt if PAE;
  moved mmu_notifier after step 1, reworked final goto labels.
11/13 mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
  v3: rediffed
  v2: same as v1
12/13 mm: delete mmap_write_trylock() and vma_try_start_write()
  v3: rediffed (different diff needed for mm-unstable or linux-next)
  v2: same as v1
13/13 mm/pgtable: notes on pte_offset_map[_lock]()
  v3: new: JannH asked for more helpful comment, this is my attempt;
  could be moved to be the first in the series.

 arch/arm/mm/fault-armv.c|   3 +-
 arch/powerpc/include/asm/pgalloc.h  |   4 +
 arch/powerpc/mm/pgtable-frag.c  |  29 +-
 arch/powerpc/mm/pgtable.c   |  16 +-
 arch/s390/include/asm/pgalloc.h |   4 +
 arch/s390/mm/pgalloc.c  |  80 -
 arch/sparc/include/asm/pgalloc_64.h |   4 +
 arch/sparc/mm/init_64.c |  16 +
 include/linux/mm.h  |  17 --
 include/linux/mm_types.h|   4 +
 include/linux/mmap_lock.h   |  10 -
 include/linux/pgtable.h |  10 +-
 mm/khugepaged.c | 481 +++---
 mm/pgtable-generic.c|  97 +-
 14 files changed, 404 insertions(+), 371 deletions(-)

Hugh


[PATCH] mm: lock newly mapped VMA with corrected ordering

2023-07-08 Thread Hugh Dickins
Lockdep is certainly right to complain about
(&vma->vm_lock->lock){++++}-{3:3}, at: vma_start_write+0x2d/0x3f
   but task is already holding lock:
(&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db
Invert those to the usual ordering.

Fixes: 33313a747e81 ("mm: lock newly mapped VMA which can be modified after it 
becomes visible")
Cc: sta...@vger.kernel.org
Signed-off-by: Hugh Dickins 
---
 mm/mmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 84c71431a527..3eda23c9ebe7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2809,11 +2809,11 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
if (vma_iter_prealloc(&vmi))
goto close_and_free_vma;
 
+   /* Lock the VMA since it is modified after insertion into VMA tree */
+   vma_start_write(vma);
if (vma->vm_file)
i_mmap_lock_write(vma->vm_file->f_mapping);
 
-   /* Lock the VMA since it is modified after insertion into VMA tree */
-   vma_start_write(vma);
vma_iter_store(&vmi, vma);
mm->map_count++;
if (vma->vm_file) {
-- 
2.35.3


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-06 Thread Hugh Dickins
On Thu, 6 Jul 2023, Gerald Schaefer wrote:
> 
> Since none of my remarks on the comments seem valid or strictly necessary
> any more, and I also could not find functional issues, I think you can add
> this patch as new version for 07/12. And I can now give you this:
> 
> Reviewed-by: Gerald Schaefer 

Great, thanks a lot Gerald.
The one change I'm making to it is then this merged in:

--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -455,6 +455,11 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t 
pgtable)
page = virt_to_page(pgtable);
SetPageActive(page);
page_table_free(mm, (unsigned long *)pgtable);
+   /*
+* page_table_free() does not do the pgste gmap_unlink() which
+* page_table_free_rcu() does: warn us if pgste ever reaches here.
+*/
+   WARN_ON_ONCE(mm_alloc_pgste(mm));
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 


Re: [PATCH] powerpc/mm/book3s64/hash/4k: Add pmd_same callback for 4K page size

2023-07-05 Thread Hugh Dickins
On Thu, 6 Jul 2023, Aneesh Kumar K.V wrote:

> With commit 0d940a9b270b ("mm/pgtable: allow pte_offset_map[_lock]() to
> fail") the kernel is now using pmd_same to compare pmd values that are
> pointing to a level 4 page table page. Move the functions out of #ifdef
> CONFIG_TRANSPARENT_HUGEPAGE and add a variant that can work with both 4K
> and 64K page size.
> 
> kernel BUG at arch/powerpc/include/asm/book3s/64/hash-4k.h:141!
> Oops: Exception in kernel mode, sig: 5 [#1]
> LE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> .
> NIP [c048aee0] __pte_offset_map_lock+0xf0/0x164
> LR [c048ae78] __pte_offset_map_lock+0x88/0x164
> Call Trace:
>  0xc0003f09a340 (unreliable)
>  __handle_mm_fault+0x1340/0x1980
>  handle_mm_fault+0xbc/0x380
>  __get_user_pages+0x320/0x550
>  get_user_pages_remote+0x13c/0x520
>  get_arg_page+0x80/0x1d0
>  copy_string_kernel+0xc8/0x250
>  kernel_execve+0x11c/0x270
>  run_init_process+0xe4/0x10c
>  kernel_init+0xbc/0x1a0
>  ret_from_kernel_user_thread+0x14/0x1c
> 
> Cc: Hugh Dickins 
> Reported-by: Michael Ellerman 
> Signed-off-by: Aneesh Kumar K.V 

Acked-by: Hugh Dickins 

Thanks for rescuing us so quickly!

> ---
>  arch/powerpc/include/asm/book3s/64/hash-4k.h  | 6 --
>  arch/powerpc/include/asm/book3s/64/hash-64k.h | 5 -
>  arch/powerpc/include/asm/book3s/64/hash.h | 5 +
>  3 files changed, 5 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
> b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index b6ac4f86c87b..6472b08fa1b0 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -136,12 +136,6 @@ static inline int hash__pmd_trans_huge(pmd_t pmd)
>   return 0;
>  }
>  
> -static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b)
> -{
> - BUG();
> - return 0;
> -}
> -
>  static inline pmd_t hash__pmd_mkhuge(pmd_t pmd)
>  {
>   BUG();
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
> b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> index 338e62fbea0b..0bf6fd0bf42a 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> @@ -263,11 +263,6 @@ static inline int hash__pmd_trans_huge(pmd_t pmd)
> (_PAGE_PTE | H_PAGE_THP_HUGE));
>  }
>  
> -static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b)
> -{
> - return (((pmd_raw(pmd_a) ^ pmd_raw(pmd_b)) & 
> ~cpu_to_be64(_PAGE_HPTEFLAGS)) == 0);
> -}
> -
>  static inline pmd_t hash__pmd_mkhuge(pmd_t pmd)
>  {
>   return __pmd(pmd_val(pmd) | (_PAGE_PTE | H_PAGE_THP_HUGE));
> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
> b/arch/powerpc/include/asm/book3s/64/hash.h
> index 17e7a778c856..d4a19e6547ac 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> @@ -132,6 +132,11 @@ static inline int get_region_id(unsigned long ea)
>   return region_id;
>  }
>  
> +static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b)
> +{
> + return (((pmd_raw(pmd_a) ^ pmd_raw(pmd_b)) & 
> ~cpu_to_be64(_PAGE_HPTEFLAGS)) == 0);
> +}
> +
>  #define  hash__pmd_bad(pmd)  (pmd_val(pmd) & H_PMD_BAD_BITS)
>  #define  hash__pud_bad(pud)  (pud_val(pud) & H_PUD_BAD_BITS)
>  static inline int hash__p4d_bad(p4d_t p4d)
> -- 
> 2.41.0
> 
> 


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-05 Thread Hugh Dickins
On Wed, 5 Jul 2023, Gerald Schaefer wrote:
> On Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
> Hugh Dickins  wrote:
> > On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> > > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> > > Hugh Dickins  wrote:  
> > > > On Thu, 29 Jun 2023, Hugh Dickins wrote:  
> > ...
> > > > --- a/arch/s390/mm/pgalloc.c
> > > > +++ b/arch/s390/mm/pgalloc.c
> > > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > > >   * logic described above. Both AA bits are set to 1 to denote a 
> > > > 4KB-pgtable
> > > >   * while the PP bits are never used, nor such a page is added to or 
> > > > removed
> > > >   * from mm_context_t::pgtable_list.
> > > > + *
> > > > + * pte_free_defer() overrides those rules: it takes the page off 
> > > > pgtable_list,
> > > > + * and prevents both 2K fragments from being reused. pte_free_defer() 
> > > > has to
> > > > + * guarantee that its pgtable cannot be reused before the RCU grace 
> > > > period
> > > > + * has elapsed (which page_table_free_rcu() does not actually 
> > > > guarantee).  
> > > 
> > > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> > > allow reuse before grace period elapsed. And I hope that it does so, by
> > > setting the PP bits, which would be noticed in page_table_alloc(), in
> > > case the page would be seen there.
> > > 
> > > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> > > end of the list, and so they could be seen in page_table_alloc(), but they
> > > should not be reused before grace period elapsed and __tlb_remove_table()
> > > cleared the PP bits, as far as I understand.
> > > 
> > > So what exactly do you mean with "which page_table_free_rcu() does not 
> > > actually
> > > guarantee"?  
> > 
> > I'll answer without locating and re-reading what Jason explained earlier,
> > perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
> > he may have explained it better.  And without working out again all the
> > MMU_GATHER #defines, and which of them do and do not apply to s390 here.
> > 
> > The detail that sticks in my mind is the fallback in tlb_remove_table()
> 
> Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(),
> but that is rather a generic issue, and not s390-specific.

Yes.

> I thought you
> meant some s390-oddity here, of which we have a lot, unfortunately...
> Of course, we call tlb_remove_table() from our page_table_free_rcu(), so
> I guess you could say that page_table_free_rcu() cannot guarantee what
> tlb_remove_table() cannot guarantee.
> 
> Maybe change to "which page_table_free_rcu() does not actually guarantee,
> by calling tlb_remove_table()", to make it clear that this is not a problem
> of page_table_free_rcu() itself.

Okay - I'll rephrase slightly to avoid being sued by s390's lawyers :-)

> 
> > in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
> > batch the tables for freeing by RCU, and resorts instead to an immediate 
> > TLB flush (I think: that again involves chasing definitions) followed by
> > tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
> > and is commented: 
> > /*
> >  * This isn't an RCU grace period and hence the page-tables cannot be
> >  * assumed to be actually RCU-freed.
> >  *
> >  * It is however sufficient for software page-table walkers that rely on
> >  * IRQ disabling.
> >  */
> > 
> > Whether that's good for your PP pages or not, I've given no thought:
> > I've just taken it on trust that what s390 has working today is good.
> 
> Yes, we should be fine with that, current code can be trusted :-)

Glad to hear it :-)  Yes, I think it's not actually relying on the "rcu"
implied by the function name.

> 
> > 
> > If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
> > then I would not have written "(which page_table_free_rcu() does not
> > actually guarantee)".  But it cannot use call_rcu() because it does
> > not have an rcu_head to work with - it's in some generic code, and
> > there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.
> > 
> > And Jason would have much preferred us to address the issue from that
> > angle; but not only would doing so destroy my sanity, I'd also destroy
> > 20 architectures TLB-flushing, unbuilt and unteste

Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-05 Thread Hugh Dickins
On Wed, 5 Jul 2023, Alexander Gordeev wrote:
> On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> 
> Hi Hugh,
> 
> ...
> 
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +   struct page *page;
> 
> If I got your and Claudio conversation right, you were going to add
> here WARN_ON_ONCE() in case of mm_alloc_pgste(mm)?

Well, Claudio approved, so I would have put it in, if we had stuck with
that version which had "if (mm_alloc_pgste(mm)) {" in pte_free_defer();
but once that went away, it became somewhat irrelevant... to me anyway.

But I don't mind adding it here, in the v3 I'll post when -rc1 is out,
if it might help you guys - there is some point, since pte_free_defer()
is a route which can usefully check for such a case, without confusion
from harmless traffic from immediate frees of just-in-case allocations.

But don't expect it to catch all such cases (if they exist): another
category of s390 page_table_free()s comes from the PageAnon
zap_deposited_table() in zap_huge_pmd(): those tables might or might
not have been exposed to userspace at some time in the past.

I'll add the WARN_ON_ONCE in pte_free_defer() (after checking that
WARN_ON_ONCE is the one we want - I get confused by all the different
flavours of WARN, and have to check the header file each time to be
sure of the syntax and semantics): but be aware that it won't be
checking all potential cases.

Hugh

> 
> > +   page = virt_to_page(pgtable);
> > +   SetPageActive(page);
> > +   page_table_free(mm, (unsigned long *)pgtable);
> > +}
> > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > +
> >  /*
> >   * Base infrastructure required to generate basic asces, region, segment,
> >   * and page tables that do not make use of enhanced features like EDAT1.
> 
> Thanks!


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-04 Thread Hugh Dickins
On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> Hugh Dickins  wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > > 
> > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > > extent that I've not even tried to verify it; but I think I do get the
> > > point now, that we need further info than just PPHHAA to know whether
> > > the page is on the list or not.  But I think that if we move where the
> > > call_rcu() is done, then the page can stay on or off the list by same
> > > rules as before (but need to check HH bits along with PP when deciding
> > > whether to allocate, and whether to list_add_tail() when freeing).  
> > 
> > No, not quite the same rules as before: I came to realize that using
> > list_add_tail() for the HH pages would be liable to put a page on the
> > list which forever blocked reuse of PP list_add_tail() pages after it
> > (could be solved by a list_move() somewhere, but we have agreed to
> > prefer simplicity).
> > 
> > I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> > I've dropped most of the pte_free_*() helpers, and list_del_init() is
> > an easier way of dealing with those "is it on the list" questions.
> > I expect that we shall be close to reaching agreement on...
> 
> This looks really nice, almost too good and easy to be true. I did not
> find any obvious flaw, just some comments below. It also survived LTP
> without any visible havoc, so I guess this approach is the best so far.

Phew! I'm of course glad to hear this: thanks for your efforts on it.

...
> > --- a/arch/s390/mm/pgalloc.c
> > +++ b/arch/s390/mm/pgalloc.c
> > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> >   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> >   * while the PP bits are never used, nor such a page is added to or removed
> >   * from mm_context_t::pgtable_list.
> > + *
> > + * pte_free_defer() overrides those rules: it takes the page off 
> > pgtable_list,
> > + * and prevents both 2K fragments from being reused. pte_free_defer() has 
> > to
> > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > + * has elapsed (which page_table_free_rcu() does not actually guarantee).
> 
> Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> allow reuse before grace period elapsed. And I hope that it does so, by
> setting the PP bits, which would be noticed in page_table_alloc(), in
> case the page would be seen there.
> 
> Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> end of the list, and so they could be seen in page_table_alloc(), but they
> should not be reused before grace period elapsed and __tlb_remove_table()
> cleared the PP bits, as far as I understand.
> 
> So what exactly do you mean with "which page_table_free_rcu() does not 
> actually
> guarantee"?

I'll answer without locating and re-reading what Jason explained earlier,
perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
he may have explained it better.  And without working out again all the
MMU_GATHER #defines, and which of them do and do not apply to s390 here.

The detail that sticks in my mind is the fallback in tlb_remove_table()
in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
batch the tables for freeing by RCU, and resorts instead to an immediate 
TLB flush (I think: that again involves chasing definitions) followed by
tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
and is commented: 
/*
 * This isn't an RCU grace period and hence the page-tables cannot be
 * assumed to be actually RCU-freed.
 *
 * It is however sufficient for software page-table walkers that rely on
 * IRQ disabling.
 */

Whether that's good for your PP pages or not, I've given no thought:
I've just taken it on trust that what s390 has working today is good.

If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
then I would not have written "(which page_table_free_rcu() does not
actually guarantee)".  But it cannot use call_rcu() because it does
not have an rcu_head to work with - it's in some generic code, and
there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.

And Jason would have much preferred us to address the issue from that
angle; but not only would doing so destroy my sanity, I'd also destroy
20 architectures TLB-flushing, unbuilt and untested, in the attempt.

...
> > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned 
> > long *table)
> >  */
> > mask = at

Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-04 Thread Hugh Dickins
On Tue, 4 Jul 2023, Alexander Gordeev wrote:
> On Sat, Jul 01, 2023 at 09:32:38PM -0700, Hugh Dickins wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> 
> Hi Hugh,
> 
> ...
> > No, not quite the same rules as before: I came to realize that using
> > list_add_tail() for the HH pages would be liable to put a page on the
> > list which forever blocked reuse of PP list_add_tail() pages after it
> > (could be solved by a list_move() somewhere, but we have agreed to
> > prefer simplicity).
> 
> Just to make things more clear for me: do I understand correctly that this
> was an attempt to add HH fragments to pgtable_list from pte_free_defer()?

Yes, from page_table_free() called from pte_free_defer(): I had claimed
they could be put on the list (or not) without needing to consider their
HH-ness, apart from wanting to list_add_tail() rather than list_add() them.

But then realized that this category of list_add_tail() pages would block
access to the others.

But I think I was mistaken then to say "could be solved by a list_move()
somewhere"; because "somewhere" would have had to be __tlb_remove_table()
when it removes PP-bits, which would bring us back to the issues of
getting a spinlock from an mm which might already be freed.

Hugh


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-07-01 Thread Hugh Dickins
On Thu, 29 Jun 2023, Hugh Dickins wrote:
> 
> I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> extent that I've not even tried to verify it; but I think I do get the
> point now, that we need further info than just PPHHAA to know whether
> the page is on the list or not.  But I think that if we move where the
> call_rcu() is done, then the page can stay on or off the list by same
> rules as before (but need to check HH bits along with PP when deciding
> whether to allocate, and whether to list_add_tail() when freeing).

No, not quite the same rules as before: I came to realize that using
list_add_tail() for the HH pages would be liable to put a page on the
list which forever blocked reuse of PP list_add_tail() pages after it
(could be solved by a list_move() somewhere, but we have agreed to
prefer simplicity).

I've dropped the HH bits, I'm using PageActive like we did on powerpc,
I've dropped most of the pte_free_*() helpers, and list_del_init() is
an easier way of dealing with those "is it on the list" questions.
I expect that we shall be close to reaching agreement on...

[PATCH v? 07/12] s390: add pte_free_defer() for pgtables sharing page

Add s390-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is never on the free list if pte_free_defer() was used on either
half (marked by PageActive).  And for simplicity, delay calling RCU until
both halves are freed.

Not adding back unallocated fragments to the list in pte_free_defer()
can result in wasting some amount of memory for pagetables, depending
on how long the allocated fragment will stay in use. In practice, this
effect is expected to be insignificant, and not justify a far more
complex approach, which might allow to add the fragments back later
in __tlb_remove_table(), where we might not have a stable mm any more.

Signed-off-by: Hugh Dickins 
---
 arch/s390/include/asm/pgalloc.h |  4 ++
 arch/s390/mm/pgalloc.c  | 75 +++--
 2 files changed, 67 insertions(+), 12 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..fd0c4312da16 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
  * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
  * while the PP bits are never used, nor such a page is added to or removed
  * from mm_context_t::pgtable_list.
+ *
+ * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
+ * and prevents both 2K fragments from being reused. pte_free_defer() has to
+ * guarantee that its pgtable cannot be reused before the RCU grace period
+ * has elapsed (which page_table_free_rcu() does not actually guarantee).
+ * But for simplicity, because page->rcu_head overlays page->lru, and because
+ * the RCU callback might not be called before the mm_context_t has been freed,
+ * pte_free_defer() in this implementation prevents both fragments from being
+ * reused, and delays making the call to RCU until both fragments are freed.
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
@@ -261,7 +270,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
table += PTRS_PER_PTE;
atomic_xor_bits(&page->_refcount,
0x01U << (bit + 24));
-   list_del(&page->lru);
+   list_del_init(&page->lru);
}
}
spin_unlock_bh(&mm->context.lock);
@@ -281,6 +290,7 

Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-06-30 Thread Hugh Dickins
On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
> On Fri, 30 Jun 2023 08:28:54 -0700 (PDT)
> Hugh Dickins  wrote:
> > On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
> > > On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> > > Hugh Dickins  wrote:
> > > 
> > > [...]
> > >   
> > > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > > +{
> > > > +   unsigned int bit, mask;
> > > > +   struct page *page;
> > > > +
> > > > +   page = virt_to_page(pgtable);
> > > > +   if (mm_alloc_pgste(mm)) {
> > > > +   call_rcu(&page->rcu_head, pte_free_pgste);
> > > 
> > > so is this now going to be used to free page tables
> > > instead of page_table_free_rcu?  
> > 
> > No.
> > 
> > All pte_free_defer() is being used for (in this series; and any future
> > use beyond this series will have to undertake its own evaluations) is
> > for the case of removing an empty page table, which used to map a group
> > of PTE mappings of a file, in order to make way for one PMD mapping of
> > the huge page which those scattered pages have now been gathered into.
> > 
> > You're worried by that mm_alloc_pgste() block: it's something I didn't
> 
> actually no, but thanks for bringing it up :D
> 
> > have at all in my first draft, then I thought that perhaps the pgste
> > case might be able to come this way, so it seemed stupid to leave out
> > the handling for it.
> > 
> > I hope that you're implying that should be dead code here?  Perhaps,
> > that the pgste case corresponds to the case in s390 where THPs are
> > absolutely forbidden?  That would be good news for us.
> > 
> > Gerald, in his version of this block, added a comment asking:
> > /*
> >  * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
> >  * page_table_free_rcu()?
> >  * If yes -> need addr parameter here, like in pte_free_tlb().
> >  */
> > Do you have the answer to that?  Neither of us could work it out.
> 
> this is the thing I'm worried about; removing a page table that was
> used to map a guest will leave dangling pointers in the gmap that will
> cause memory corruption (I actually ran into that problem myself for
> another patchseries).
> 
> gmap_unlink() is needed to clean up the pointers before they become
> dangling (and also potentially do some TLB purging as needed)

That's something I would have expected to be handled already via
mmu_notifiers, rather than buried inside the page table freeing.

If s390 is the only architecture to go that way, and could instead do
it via mmu_notifiers, then I think that will be more easily supported
in the long term.

But I'm writing from a position of very great ignorance: advising
KVM on s390 is many dimensions away from what I'm capable of.

> 
> the point here is: we need that only for page_table_free_rcu(); all
> other users of page_table_free() cannot act on guest page tables

I might be wrong, but I think that most users of page_table_free()
are merely freeing a page table which had to be allocated up front,
but was then found unnecessary (maybe a racing task already inserted
one): page tables which were never exposed to actual use.

> (because we don't allow THP for KVM guests). and that is why
> page_table_free() does not do gmap_unlink() currently.

But THP collapse does (or did before this series) use it to free a
page table which had been exposed to use.  The fact that s390 does
not allow THP for KVM guests makes page_table_free(), and this new
pte_free_defer(), safe for that; but it feels dangerously coincidental.

It's easy to imagine a future change being made, which would stumble
over this issue.  I have imagined that pte_free_defer() will be useful
in future, in the freeing of empty page tables: but s390 may pose a
problem there - though perhaps no more of a problem than additionally
needing to pass a virtual address down the stack.

> 
> > 
> > > 
> > > or will it be used instead of page_table_free?  
> > 
> > Not always; but yes, this case of removing a page table used
> > page_table_free() before; but now, with the lighter locking, needs
> > to keep the page table valid until the RCU grace period expires.
> 
> so if I understand correctly your code will, sometimes, under some
> circumstances, replace what page_table_free() does, but it will never
> replace page_table_free_rcu()?
> 
> because in that case there would be no issues 

Yes, thanks for confirming: we have no issue here at present, but may
do if use of pte_free_defer() is extended to other contexts in future.

Would it be appropriate to add a WAR

Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-06-30 Thread Hugh Dickins
On Fri, 30 Jun 2023, Claudio Imbrenda wrote:
> On Tue, 20 Jun 2023 00:51:19 -0700 (PDT)
> Hugh Dickins  wrote:
> 
> [...]
> 
> > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > +{
> > +   unsigned int bit, mask;
> > +   struct page *page;
> > +
> > +   page = virt_to_page(pgtable);
> > +   if (mm_alloc_pgste(mm)) {
> > +   call_rcu(&page->rcu_head, pte_free_pgste);
> 
> so is this now going to be used to free page tables
> instead of page_table_free_rcu?

No.

All pte_free_defer() is being used for (in this series; and any future
use beyond this series will have to undertake its own evaluations) is
for the case of removing an empty page table, which used to map a group
of PTE mappings of a file, in order to make way for one PMD mapping of
the huge page which those scattered pages have now been gathered into.

You're worried by that mm_alloc_pgste() block: it's something I didn't
have at all in my first draft, then I thought that perhaps the pgste
case might be able to come this way, so it seemed stupid to leave out
the handling for it.

I hope that you're implying that should be dead code here?  Perhaps,
that the pgste case corresponds to the case in s390 where THPs are
absolutely forbidden?  That would be good news for us.

Gerald, in his version of this block, added a comment asking:
/*
 * TODO: Do we need gmap_unlink(mm, pgtable, addr), like in
 * page_table_free_rcu()?
 * If yes -> need addr parameter here, like in pte_free_tlb().
 */
Do you have the answer to that?  Neither of us could work it out.

> 
> or will it be used instead of page_table_free?

Not always; but yes, this case of removing a page table used
page_table_free() before; but now, with the lighter locking, needs
to keep the page table valid until the RCU grace period expires.

> 
> this is actually quite important for KVM on s390

None of us are wanting to break KVM on s390: your guidance appreciated!

Thanks,
Hugh


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-06-30 Thread Hugh Dickins
On Thu, 29 Jun 2023, Gerald Schaefer wrote:
> On Thu, 29 Jun 2023 12:22:24 -0300
> Jason Gunthorpe  wrote:
> > On Wed, Jun 28, 2023 at 10:08:08PM -0700, Hugh Dickins wrote:
> > > On Wed, 28 Jun 2023, Gerald Schaefer wrote:  
> > > > 
> > > > As discussed in the other thread, we would rather go with less 
> > > > complexity,
> > > > possibly switching to an approach w/o the list and fragment re-use in 
> > > > the
> > > > future. For now, as a first step in that direction, we can try with not
> > > > adding fragments back only for pte_free_defer(). Here is an adjusted
> > > > version of your patch, copying most of your pte_free_defer() logic and
> > > > also description, tested with LTP and all three of your patch series 
> > > > applied:  
> > > 
> > > Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
> > > patch (posted with fewer Cc's to the s390 list last week), and switching
> > > to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
> > > patch.
> > > 
> > > But I didn't get deep enough into it today to confirm it - and 
> > > disappointed
> > > that you've found it necessary to play with pt_frag_refcount in addition 
> > > to
> > > _refcount and HH bits.  No real problem with that, but my instinct says it
> > > should be simpler.  
> 
> Yes, I also found it a bit awkward, but it seemed "good and simple enough",
> to have something to go forward with, while my instinct was in line with 
> yours.
> 
> > 
> > Is there any reason it should be any different at all from what PPC is
> > doing?
> > 
> > I still think the right thing to do here is make the PPC code common
> > (with Hugh's proposed RCU modification) and just use it in both
> > arches
> 
> With the current approach, we would not add back fragments _only_ for
> the new pte_free_defer() path, while keeping our cleverness for the other
> paths. Not having a good overview of the negative impact wrt potential
> memory waste, I would rather take small steps, if possible.
> 
> If we later switch to never adding back fragments, of course we should
> try to be in line with PPC implementation.

I find myself half-agreeing with everyone.

I agree with Gerald that s390 should keep close to what it is already
doing (except for adding pte_free_defer()): that changing its strategy
and implementation to be much more like powerpc, is a job for some other
occasion (and would depend on gathering data about how well each does).

But I agree with Jason that the powerpc solution we ended up with cut
out a lot of unnecessary complication: it shifts the RCU delay from
when pte_free_defer() is called, to when the shared page comes to be
freed; which may be a lot later, and might not be welcome in a common
path, but is quite okay for the uncommon pte_free_defer().

And I agree with Alexander that pte_free_lower() and pte_free_upper()
are better names than pte_free_now0() and pte_free_now1(): I was going
to make that change, except all those functions disappear if we follow
Jason's advice and switch the call_rcu() to when freeing the page.

(Lower and upper seem unambiguous to me: Gerald, does your confusion
come just from the way they are shown the wrong way round in the PP AA
diagram?  I corrected that in my patch, but you reverted it in yours.)

I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
extent that I've not even tried to verify it; but I think I do get the
point now, that we need further info than just PPHHAA to know whether
the page is on the list or not.  But I think that if we move where the
call_rcu() is done, then the page can stay on or off the list by same
rules as before (but need to check HH bits along with PP when deciding
whether to allocate, and whether to list_add_tail() when freeing).

So, starting from Gerald's but cutting it down, I was working on the
patch which follows those ideas.  But have run out of puff for tonight,
and would just waste all our time (again) if I sent anything out now.

Hugh


Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-06-28 Thread Hugh Dickins
On Wed, 28 Jun 2023, Gerald Schaefer wrote:
> 
> As discussed in the other thread, we would rather go with less complexity,
> possibly switching to an approach w/o the list and fragment re-use in the
> future. For now, as a first step in that direction, we can try with not
> adding fragments back only for pte_free_defer(). Here is an adjusted
> version of your patch, copying most of your pte_free_defer() logic and
> also description, tested with LTP and all three of your patch series applied:

Thanks, Gerald: I don't mind abandoning my 13/12 SLAB_TYPESAFE_BY_RCU
patch (posted with fewer Cc's to the s390 list last week), and switching
to your simpler who-cares-if-we-sometimes-don't-make-maximal-use-of-page
patch.

But I didn't get deep enough into it today to confirm it - and disappointed
that you've found it necessary to play with pt_frag_refcount in addition to
_refcount and HH bits.  No real problem with that, but my instinct says it
should be simpler.

Tomorrow...
Hugh


Re: [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-06-27 Thread Hugh Dickins
On Tue, 27 Jun 2023, Jason Gunthorpe wrote:
> On Wed, Jun 21, 2023 at 07:36:11PM -0700, Hugh Dickins wrote:
> > [PATCH v3 05/12] powerpc: add pte_free_defer() for pgtables sharing page
...
> Yes, this makes sense to me, very simple..
> 
> I always forget these details but atomic_dec_and_test() is a release?
> So the SetPageActive is guaranteed to be visible in another thread
> that reaches 0?

Yes, that's my understanding - so the TestClearPageActive adds more
to the guarantee than is actually needed.

"release": you speak the modern language, whereas I haven't advanced
from olden-days barriers: atomic_dec_and_test() meets atomic_t.txt's
 - RMW operations that have a return value are fully ordered;
which a quick skim of present-day memory-barriers.txt suggests would
imply both ACQUIRE and RELEASE.  Please correct me if I'm mistaken!
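
To illustrate that point outside the kernel, here is a minimal userspace
model using C11 atomics (names such as put_fragment() and the "active"
flags are stand-ins, not kernel API): the fully ordered fetch_sub plays
the role of atomic_dec_and_test(), and the thread which takes the count
to zero is guaranteed to see the flag set beforehand by the other thread.

/* ordering-model.c: cc -std=c11 -pthread ordering-model.c */
#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>
#include <stdio.h>

static atomic_int count = 2;	/* stands in for the fragment refcount */
static atomic_int active[2];	/* stands in for SetPageActive()       */

static void *put_fragment(void *arg)
{
	int id = *(int *)arg;

	/* "SetPageActive()" before the decrement ... */
	atomic_store_explicit(&active[id], 1, memory_order_relaxed);

	/*
	 * ... then a fully ordered RMW, like atomic_dec_and_test(): the
	 * thread that takes the count to zero must observe the stores
	 * made above by *both* threads.
	 */
	if (atomic_fetch_sub(&count, 1) == 1) {
		assert(atomic_load_explicit(&active[0], memory_order_relaxed));
		assert(atomic_load_explicit(&active[1], memory_order_relaxed));
		printf("last put saw both flags set\n");
	}
	return NULL;
}

int main(void)
{
	pthread_t t[2];
	int id[2] = { 0, 1 };

	for (int i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, put_fragment, &id[i]);
	for (int i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	return 0;
}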

Hugh


Re: [PATCH v6 00/33] Split ptdesc from struct page

2023-06-27 Thread Hugh Dickins
On Tue, 27 Jun 2023, Matthew Wilcox wrote:
> On Mon, Jun 26, 2023 at 09:44:08PM -0700, Hugh Dickins wrote:
> > On Mon, 26 Jun 2023, Vishal Moola (Oracle) wrote:
> > 
> > > The MM subsystem is trying to shrink struct page. This patchset
> > > introduces a memory descriptor for page table tracking - struct ptdesc.
> > ...
> > >  39 files changed, 686 insertions(+), 455 deletions(-)
> > 
> > I don't see the point of this patchset: to me it is just obfuscation of
> > the present-day tight relationship between page table and struct page.
> > 
> > Matthew already explained:
> > 
> > > The intent is to get ptdescs to be dynamically allocated at some point
> > > in the ~2-3 years out future when we have finished the folio project ...
> > 
> > So in a kindly mood, I'd say that this patchset is ahead of its time.
> > But I can certainly adapt to it, if everyone else sees some point to it.
> 
> If you think this patchset is ahead of its time, we can certainly put
> it on hold.  We're certainly prepared to redo it to be merged after your
> current patch series.

Thank you, but I can adapt.  That was not my point:
I'm claiming this patchset is ~2-3 years ahead of its time.

> 
> I think you can see the advantage of the destination, so I don't think
> you're against that.

Maybe - I have some scepticism, but I'll be happy for that to be dissolved.

> Are you opposed to the sequencing of the work to
> get us there?  I'd be happy to discuss another way to do it.

Yes, I'm opposed to churn for no benefit.

> 
> For example, we could dynamically allocate ptdescs right now.  We'd get
> the benefit of having an arbitrary amount of space in the ptdesc,
> although not the benefit of a smaller memmap until everything else is
> also dynamically allocated.

That sounded much better, at first: churn serving good purpose.  But now
I suspect you're offering to dynamically allocate a ptdesc, in addition
to the struct page of the page table(s) itself, which will be wasted:
more memory consumption to no advantage.  If that's so, no thanks.

Hugh


Re: [PATCH v6 00/33] Split ptdesc from struct page

2023-06-27 Thread Hugh Dickins
On Tue, 27 Jun 2023, David Hildenbrand wrote:
> On 27.06.23 06:44, Hugh Dickins wrote:
> > On Mon, 26 Jun 2023, Vishal Moola (Oracle) wrote:
> > 
> >> The MM subsystem is trying to shrink struct page. This patchset
> >> introduces a memory descriptor for page table tracking - struct ptdesc.
> > ...
> >>   39 files changed, 686 insertions(+), 455 deletions(-)
> > 
> > I don't see the point of this patchset: to me it is just obfuscation of
> > the present-day tight relationship between page table and struct page.
> > 
> > Matthew already explained:
> > 
> >> The intent is to get ptdescs to be dynamically allocated at some point
> >> in the ~2-3 years out future when we have finished the folio project ...
> > 
> > So in a kindly mood, I'd say that this patchset is ahead of its time.
> > But I can certainly adapt to it, if everyone else sees some point to it.
> 
> I share your thoughts, that code churn which will help eventually in the far,
> far future (not wanting to sound too pessimistic, but it's not going to be
> there tomorrow ;) ).
> 
> However, if it's just the same as the other conversions we already did (e.g.,
> struct slab), then I guess there is no reason to stop now -- the obfuscation
> already happened.
> 
> ... or is there a difference regarding this conversion and the previous ones?

I was aware of the struct slab thing, didn't see much point there myself
either; but it was welcomed by Vlastimil, and barely affected outside of
slab allocators, so I had no reason to object.

You think that if a little unnecessary churn (a *lot* of churn if you
include folios, which did save some repeated calls to compound_head())
has already occurred, that's a good precedent for allowing more and more?
My opinion happens to differ on that.

Hugh


Re: [PATCH v6 00/33] Split ptdesc from struct page

2023-06-26 Thread Hugh Dickins
On Mon, 26 Jun 2023, Vishal Moola (Oracle) wrote:

> The MM subsystem is trying to shrink struct page. This patchset
> introduces a memory descriptor for page table tracking - struct ptdesc.
...
>  39 files changed, 686 insertions(+), 455 deletions(-)

I don't see the point of this patchset: to me it is just obfuscation of
the present-day tight relationship between page table and struct page.

Matthew already explained:

> The intent is to get ptdescs to be dynamically allocated at some point
> in the ~2-3 years out future when we have finished the folio project ...

So in a kindly mood, I'd say that this patchset is ahead of its time.
But I can certainly adapt to it, if everyone else sees some point to it.

Hugh


Re: [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-06-21 Thread Hugh Dickins
On Tue, 20 Jun 2023, Jason Gunthorpe wrote:
> On Tue, Jun 20, 2023 at 12:54:25PM -0700, Hugh Dickins wrote:
> > On Tue, 20 Jun 2023, Jason Gunthorpe wrote:
> > > On Tue, Jun 20, 2023 at 12:47:54AM -0700, Hugh Dickins wrote:
> > > > Add powerpc-specific pte_free_defer(), to call pte_free() via 
> > > > call_rcu().
> > > > pte_free_defer() will be called inside khugepaged's 
> > > > retract_page_tables()
> > > > loop, where allocating extra memory cannot be relied upon.  This 
> > > > precedes
> > > > the generic version to avoid build breakage from incompatible pgtable_t.
> > > > 
> > > > This is awkward because the struct page contains only one rcu_head, but
> > > > that page may be shared between PTE_FRAG_NR pagetables, each wanting to
> > > > use the rcu_head at the same time: account concurrent deferrals with a
> > > > heightened refcount, only the first making use of the rcu_head, but
> > > > re-deferring if more deferrals arrived during its grace period.
> > > 
> > > You didn't answer my question why we can't just move the rcu to the
> > > actual free page?
> > 
> > I thought that I had answered it, perhaps not to your satisfaction:
> > 
> > https://lore.kernel.org/linux-mm/9130acb-193-6fdd-f8df-75766e663...@google.com/
> > 
> > My conclusion then was:
> > Not very good reasons: good enough, or can you supply a better patch?
> 
> Oh, I guess I didn't read that email as answering the question..
> 
> I was saying to make pte_fragment_free() unconditionally do the
> RCU. It is the only thing that uses the page->rcu_head, and it means
> PPC would double RCU the final free on the TLB path, but that is
> probably OK for now. This means pte_free_defer() won't do anything
> special on PPC as PPC will always RCU free these things, this address
> the defer concern too, I think. Overall it is easier to reason about.
> 
> I looked at fixing the TLB stuff to avoid the double rcu but quickly
> got scared that ppc was using a kmem_cache to allocate other page
> table sizes so there is not a reliable struct page to get a rcu_head
> from. This looks like the main challenge for ppc... We'd have to teach
> the tlb code to not do its own RCU stuff for table levels that the
> arch is already RCU freeing - and that won't get us to full RCU
> freeing on PPC.

Sorry for being so dense all along: yes, your way is unquestionably
much better than mine.  I guess I must have been obsessive about
keeping pte_free_defer()+pte_free_now() "on the outside", as they
were on x86, and never perceived how much easier it is with a small
tweak inside pte_fragment_free(); and never reconsidered it since.

But I'm not so keen on the double-RCU, extending this call_rcu() to
all the normal cases, while still leaving the TLB batching in place:
here is the replacement patch I'd prefer us to go forward with now.

Many thanks!

[PATCH v3 05/12] powerpc: add pte_free_defer() for pgtables sharing page

Add powerpc-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time.  But powerpc never reuses a fragment
once it has been freed: so mark the page Active in pte_free_defer(),
before calling pte_fragment_free() directly; and there call_rcu() to
pte_free_now() when last fragment is freed and the page is PageActive.

Suggested-by: Jason Gunthorpe 
Signed-off-by: Hugh Dickins 
---
 arch/powerpc/include/asm/pgalloc.h |  4 
 arch/powerpc/mm/pgtable-frag.c | 29 ++---
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h 
b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t 
ptepage)
pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c 
*/
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..0c6b68130025 100644
--- a
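
The diff is cut short here in the archive.  As a rough userspace model of
what the commit message above describes - not the actual kernel code - the
"deferred" flag below stands in for PageActive, the refcount stands in for
the page's _refcount, and the toy pte_free_now(), pte_fragment_free() and
pte_free_defer() are merely named after their kernel counterparts:

/* frag-defer-model.c: cc -std=c11 frag-defer-model.c */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define PTE_FRAG_NR 4

struct frag_page {
	atomic_int refcount;	/* fragments still in use            */
	atomic_bool deferred;	/* models PageActive as "RCU needed" */
};

/* Models call_rcu(&page->rcu_head, pte_free_now): freed at once here,
 * since this toy program has no concurrent lockless walkers to protect.
 */
static void pte_free_now(struct frag_page *page)
{
	printf("page freed (the kernel would first wait a grace period)\n");
	free(page);
}

/* Models pte_fragment_free(): only the last fragment frees the page,
 * and only then does the "deferred" marker matter.
 */
static void pte_fragment_free(struct frag_page *page)
{
	if (atomic_fetch_sub(&page->refcount, 1) == 1) {
		if (atomic_load(&page->deferred))
			pte_free_now(page);	/* kernel: call_rcu()    */
		else
			free(page);		/* kernel: __free_page() */
	}
}

/* Models pte_free_defer(): mark the page, then free the fragment as
 * usual; since powerpc never reuses a freed fragment, the marker cannot
 * be inherited by some new user of the same slot.
 */
static void pte_free_defer(struct frag_page *page)
{
	atomic_store(&page->deferred, true);
	pte_fragment_free(page);
}

int main(void)
{
	struct frag_page *page = malloc(sizeof(*page));

	atomic_init(&page->refcount, PTE_FRAG_NR);
	atomic_init(&page->deferred, false);
	for (int i = 0; i < PTE_FRAG_NR - 1; i++)
		pte_fragment_free(page);	/* ordinary frees            */
	pte_free_defer(page);			/* last one, from khugepaged */
	return 0;
}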

Re: [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-06-20 Thread Hugh Dickins
On Tue, 20 Jun 2023, Jason Gunthorpe wrote:
> On Tue, Jun 20, 2023 at 12:47:54AM -0700, Hugh Dickins wrote:
> > Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
> > pte_free_defer() will be called inside khugepaged's retract_page_tables()
> > loop, where allocating extra memory cannot be relied upon.  This precedes
> > the generic version to avoid build breakage from incompatible pgtable_t.
> > 
> > This is awkward because the struct page contains only one rcu_head, but
> > that page may be shared between PTE_FRAG_NR pagetables, each wanting to
> > use the rcu_head at the same time: account concurrent deferrals with a
> > heightened refcount, only the first making use of the rcu_head, but
> > re-deferring if more deferrals arrived during its grace period.
> 
> You didn't answer my question why we can't just move the rcu to the
> actual free page?

I thought that I had answered it, perhaps not to your satisfaction:

https://lore.kernel.org/linux-mm/9130acb-193-6fdd-f8df-75766e663...@google.com/

My conclusion then was:
Not very good reasons: good enough, or can you supply a better patch?

Hugh

> 
> Since PPC doesn't recycle the frags, we don't need to carefully RCU
> free each frag, we just need to RCU free the entire page when it
> becomes eventually free?
> 
> Jason


[PATCH mm 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

2023-06-20 Thread Hugh Dickins
Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
It does need mmap_read_lock(), but it does not need mmap_write_lock(),
nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

Follow the pattern in retract_page_tables(); and using pte_free_defer()
removes most of the need for tlb_remove_table_sync_one() here; but call
pmdp_get_lockless_sync() to use it in the PAE case.

First check the VMA, in case page tables are being torn down: from JannH.
Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
acquired and the page looks suitable: from then on its state is stable.

However, collapse_pte_mapped_thp() was doing something others don't:
freeing a page table still containing "valid" entries.  i_mmap lock did
stop a racing truncate from double-freeing those pages, but we prefer
collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB
flush can wait until the pmdp_collapse_flush() which follows, but the
mmu_notifier_invalidate_range_start() has to be done earlier.

Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
for khugepaged to keep on repeatedly invalidating a range which is then
found unsuitable e.g. contains COWs.  "step 2", which does the clearing,
must then be more careful (after dropping ptl to do mmu_notifier), with
abort prepared to correct the accounting like "step 3".  But with those
entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
safe by the huge page lock, which stops new PTEs from being faulted in.

Signed-off-by: Hugh Dickins 
---
This is the version which applies to mm-everything or linux-next.

 mm/khugepaged.c | 174 ++--
 1 file changed, 78 insertions(+), 96 deletions(-)

--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1483,7 +1483,7 @@ static bool khugepaged_add_pte_mapped_th
return ret;
 }
 
-/* hpage must be locked, and mmap_lock must be held in write */
+/* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct page *hpage)
 {
@@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_s
};
 
VM_BUG_ON(!PageTransHuge(hpage));
-   mmap_assert_write_locked(vma->vm_mm);
+   mmap_assert_locked(vma->vm_mm);
 
	if (do_set_pmd(&vmf, hpage))
return SCAN_FAIL;
@@ -1504,48 +1504,6 @@ static int set_huge_pmd(struct vm_area_s
return SCAN_SUCCEED;
 }
 
-/*
- * A note about locking:
- * Trying to take the page table spinlocks would be useless here because those
- * are only used to synchronize:
- *
- *  - modifying terminal entries (ones that point to a data page, not to 
another
- *page table)
- *  - installing *new* non-terminal entries
- *
- * Instead, we need roughly the same kind of protection as free_pgtables() or
- * mm_take_all_locks() (but only for a single VMA):
- * The mmap lock together with this VMA's rmap locks covers all paths towards
- * the page table entries we're messing with here, except for hardware page
- * table walks and lockless_pages_from_mm().
- */
-static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct 
*vma,
- unsigned long addr, pmd_t *pmdp)
-{
-   pmd_t pmd;
-   struct mmu_notifier_range range;
-
-   mmap_assert_write_locked(mm);
-   if (vma->vm_file)
-   lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
-   /*
-* All anon_vmas attached to the VMA have the same root and are
-* therefore locked by the same lock.
-*/
-   if (vma->anon_vma)
-   lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
-
-   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
-   addr + HPAGE_PMD_SIZE);
-   mmu_notifier_invalidate_range_start(&range);
-   pmd = pmdp_collapse_flush(vma, addr, pmdp);
-   tlb_remove_table_sync_one();
-   mmu_notifier_invalidate_range_end(&range);
-   mm_dec_nr_ptes(mm);
-   page_table_check_pte_clear_range(mm, addr, pmd);
-   pte_free(mm, pmd_pgtable(pmd));
-}
-
 /**
  * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
  * address haddr.
@@ -1561,26 +1519,29 @@ static void collapse_and_free_pmd(struct
 int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
bool install_pmd)
 {
+   struct mmu_notifier_range range;
+   bool notified = false;
unsigned long haddr = addr & HPAGE_PMD_MASK;
struct vm_area_struct *vma = vma_lookup(mm, haddr);
struct page *hpage;
pte_t *start_pte, *pte;
-   pmd_t *pmd;
-   spinlock_t *ptl;
-   int count = 0, result = SCAN_FAIL;
+   pmd_t *pmd,

[PATCH v2 12/12] mm: delete mmap_write_trylock() and vma_try_start_write()

2023-06-20 Thread Hugh Dickins
mmap_write_trylock() and vma_try_start_write() were added just for
khugepaged, but now it has no use for them: delete.

Signed-off-by: Hugh Dickins 
---
 include/linux/mm.h| 17 -
 include/linux/mmap_lock.h | 10 --
 2 files changed, 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3c2e56980853..9b24f8fbf899 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -690,21 +690,6 @@ static inline void vma_start_write(struct vm_area_struct 
*vma)
	up_write(&vma->vm_lock->lock);
 }
 
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-{
-   int mm_lock_seq;
-
-   if (__is_vma_write_locked(vma, &mm_lock_seq))
-   return true;
-
-   if (!down_write_trylock(&vma->vm_lock->lock))
-   return false;
-
-   vma->vm_lock_seq = mm_lock_seq;
-   up_write(&vma->vm_lock->lock);
-   return true;
-}
-
 static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 {
int mm_lock_seq;
@@ -730,8 +715,6 @@ static inline bool vma_start_read(struct vm_area_struct 
*vma)
{ return false; }
 static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline bool vma_try_start_write(struct vm_area_struct *vma)
-   { return true; }
 static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 bool detached) {}
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index aab8f1b28d26..d1191f02c7fa 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -112,16 +112,6 @@ static inline int mmap_write_lock_killable(struct 
mm_struct *mm)
return ret;
 }
 
-static inline bool mmap_write_trylock(struct mm_struct *mm)
-{
-   bool ret;
-
-   __mmap_lock_trace_start_locking(mm, true);
-   ret = down_write_trylock(&mm->mmap_lock) != 0;
-   __mmap_lock_trace_acquire_returned(mm, true, ret);
-   return ret;
-}
-
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
__mmap_lock_trace_released(mm, true);
-- 
2.35.3



[PATCH v2 11/12] mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()

2023-06-20 Thread Hugh Dickins
Now that retract_page_tables() can retract page tables reliably, without
depending on trylocks, delete all the apparatus for khugepaged to try
again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
per-mm memory which was set aside for that in the khugepaged_mm_slot.

But one part of that is worth keeping: when hpage_collapse_scan_file()
found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot
to be tried for retraction later - catching, for example, page tables
where a reversible mprotect() of a portion had required splitting the
pmd, but now it can be recollapsed.  Call collapse_pte_mapped_thp()
directly in this case (why was it deferred before?  I assume an issue
with needing mmap_lock for write, but now it's only needed for read).

Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 125 +++-
 1 file changed, 16 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 060ac8789a1e..06c659e6a89e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -92,8 +92,6 @@ static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, 
MM_SLOTS_HASH_BITS);
 
 static struct kmem_cache *mm_slot_cache __read_mostly;
 
-#define MAX_PTE_MAPPED_THP 8
-
 struct collapse_control {
bool is_khugepaged;
 
@@ -107,15 +105,9 @@ struct collapse_control {
 /**
  * struct khugepaged_mm_slot - khugepaged information per mm that is being 
scanned
  * @slot: hash lookup from mm to mm_slot
- * @nr_pte_mapped_thp: number of pte mapped THP
- * @pte_mapped_thp: address array corresponding pte mapped THP
  */
 struct khugepaged_mm_slot {
struct mm_slot slot;
-
-   /* pte-mapped THP in this mm */
-   int nr_pte_mapped_thp;
-   unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
 };
 
 /**
@@ -1441,50 +1433,6 @@ static void collect_mm_slot(struct khugepaged_mm_slot 
*mm_slot)
 }
 
 #ifdef CONFIG_SHMEM
-/*
- * Notify khugepaged that given addr of the mm is pte-mapped THP. Then
- * khugepaged should try to collapse the page table.
- *
- * Note that following race exists:
- * (1) khugepaged calls khugepaged_collapse_pte_mapped_thps() for mm_struct A,
- * emptying the A's ->pte_mapped_thp[] array.
- * (2) MADV_COLLAPSE collapses some file extent with target mm_struct B, and
- * retract_page_tables() finds a VMA in mm_struct A mapping the same extent
- * (at virtual address X) and adds an entry (for X) into mm_struct A's
- * ->pte-mapped_thp[] array.
- * (3) khugepaged calls khugepaged_collapse_scan_file() for mm_struct A at X,
- * sees a pte-mapped THP (SCAN_PTE_MAPPED_HUGEPAGE) and adds an entry
- * (for X) into mm_struct A's ->pte-mapped_thp[] array.
- * Thus, it's possible the same address is added multiple times for the same
- * mm_struct.  Should this happen, we'll simply attempt
- * collapse_pte_mapped_thp() multiple times for the same address, under the 
same
- * exclusive mmap_lock, and assuming the first call is successful, subsequent
- * attempts will return quickly (without grabbing any additional locks) when
- * a huge pmd is found in find_pmd_or_thp_or_none().  Since this is a cheap
- * check, and since this is a rare occurrence, the cost of preventing this
- * "multiple-add" is thought to be more expensive than just handling it, should
- * it occur.
- */
-static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
- unsigned long addr)
-{
-   struct khugepaged_mm_slot *mm_slot;
-   struct mm_slot *slot;
-   bool ret = false;
-
-   VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
-
-   spin_lock(&khugepaged_mm_lock);
-   slot = mm_slot_lookup(mm_slots_hash, mm);
-   mm_slot = mm_slot_entry(slot, struct khugepaged_mm_slot, slot);
-   if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) 
{
-   mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr;
-   ret = true;
-   }
-   spin_unlock(&khugepaged_mm_lock);
-   return ret;
-}
-
 /* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct page *hpage)
@@ -1706,29 +1654,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
return result;
 }
 
-static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot 
*mm_slot)
-{
-   struct mm_slot *slot = &mm_slot->slot;
-   struct mm_struct *mm = slot->mm;
-   int i;
-
-   if (likely(mm_slot->nr_pte_mapped_thp == 0))
-   return;
-
-   if (!mmap_write_trylock(mm))
-   return;
-
-   if (unlikely(hpage_collapse_test_exit(mm)))
-   goto out;
-
-   for (i = 0; i < mm_slot->nr_pte_mapped_thp; i++)
-   collapse_pte_mapped_thp(mm, mm_slot->pte_mapped_thp[i], false);
-
-out:
-   mm_slot->nr_pte_mapped_thp = 0;
-   mma

[PATCH v2 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

2023-06-20 Thread Hugh Dickins
Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
It does need mmap_read_lock(), but it does not need mmap_write_lock(),
nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

Follow the pattern in retract_page_tables(); and using pte_free_defer()
removes most of the need for tlb_remove_table_sync_one() here; but call
pmdp_get_lockless_sync() to use it in the PAE case.

First check the VMA, in case page tables are being torn down: from JannH.
Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
acquired and the page looks suitable: from then on its state is stable.

However, collapse_pte_mapped_thp() was doing something others don't:
freeing a page table still containing "valid" entries.  i_mmap lock did
stop a racing truncate from double-freeing those pages, but we prefer
collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB
flush can wait until the pmdp_collapse_flush() which follows, but the
mmu_notifier_invalidate_range_start() has to be done earlier.

Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
for khugepaged to keep on repeatedly invalidating a range which is then
found unsuitable e.g. contains COWs.  "step 2", which does the clearing,
must then be more careful (after dropping ptl to do mmu_notifier), with
abort prepared to correct the accounting like "step 3".  But with those
entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
safe by the huge page lock, which stops new PTEs from being faulted in.

Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 172 ++--
 1 file changed, 77 insertions(+), 95 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f7a0f7673127..060ac8789a1e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1485,7 +1485,7 @@ static bool khugepaged_add_pte_mapped_thp(struct 
mm_struct *mm,
return ret;
 }
 
-/* hpage must be locked, and mmap_lock must be held in write */
+/* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct page *hpage)
 {
@@ -1497,7 +1497,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, 
unsigned long addr,
};
 
VM_BUG_ON(!PageTransHuge(hpage));
-   mmap_assert_write_locked(vma->vm_mm);
+   mmap_assert_locked(vma->vm_mm);
 
	if (do_set_pmd(&vmf, hpage))
return SCAN_FAIL;
@@ -1506,48 +1506,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, 
unsigned long addr,
return SCAN_SUCCEED;
 }
 
-/*
- * A note about locking:
- * Trying to take the page table spinlocks would be useless here because those
- * are only used to synchronize:
- *
- *  - modifying terminal entries (ones that point to a data page, not to 
another
- *page table)
- *  - installing *new* non-terminal entries
- *
- * Instead, we need roughly the same kind of protection as free_pgtables() or
- * mm_take_all_locks() (but only for a single VMA):
- * The mmap lock together with this VMA's rmap locks covers all paths towards
- * the page table entries we're messing with here, except for hardware page
- * table walks and lockless_pages_from_mm().
- */
-static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct 
*vma,
- unsigned long addr, pmd_t *pmdp)
-{
-   pmd_t pmd;
-   struct mmu_notifier_range range;
-
-   mmap_assert_write_locked(mm);
-   if (vma->vm_file)
-   lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
-   /*
-* All anon_vmas attached to the VMA have the same root and are
-* therefore locked by the same lock.
-*/
-   if (vma->anon_vma)
-   lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
-
-   mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
-   addr + HPAGE_PMD_SIZE);
-   mmu_notifier_invalidate_range_start(&range);
-   pmd = pmdp_collapse_flush(vma, addr, pmdp);
-   tlb_remove_table_sync_one();
-   mmu_notifier_invalidate_range_end(&range);
-   mm_dec_nr_ptes(mm);
-   page_table_check_pte_clear_range(mm, addr, pmd);
-   pte_free(mm, pmd_pgtable(pmd));
-}
-
 /**
  * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
  * address haddr.
@@ -1563,26 +1521,29 @@ static void collapse_and_free_pmd(struct mm_struct *mm, 
struct vm_area_struct *v
 int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
bool install_pmd)
 {
+   struct mmu_notifier_range range;
+   bool notified = false;
unsigned long haddr = addr & HPAGE_PMD_MASK;
struct vm_area_struct *vma = vma_lookup(mm, haddr);
struct page

[PATCH v2 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

2023-06-20 Thread Hugh Dickins
Simplify shmem and file THP collapse's retract_page_tables(), and relax
its locking: to improve its success rate and to lessen impact on others.

Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
target_mm, leave that part of the work to madvise_collapse() calling
collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s
result code to arrange for that.  That spares retract_page_tables() four
arguments; and since it will be successful in retracting all of the page
tables expected of it, no need to track and return a result code itself.

It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
THPs.  retract_page_tables() just needs to use those same spinlocks to
exclude it briefly, while transitioning pmd from page table to none: so
restore its use of pmd_lock() inside of which pte lock is nested.

Users of pte_offset_map_lock() etc all now allow for them to fail:
so retract_page_tables() now has no use for mmap_write_trylock() or
vma_try_start_write().  In common with rmap and page_vma_mapped_walk(),
it does not even need the mmap_read_lock().

But those users do expect the page table to remain a good page table,
until they unlock and rcu_read_unlock(): so the page table cannot be
freed immediately, but rather by the recently added pte_free_defer().

Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
when PAE, and pmdp_collapse_flush() did not already do so: to make sure
that the start,pmdp_get_lockless(),end sequence in __pte_offset_map()
cannot pick up a pmd entry with mismatched pmd_low and pmd_high.

retract_page_tables() can be enhanced to replace_page_tables(), which
inserts the final huge pmd without mmap lock: going through an invalid
state instead of pmd_none() followed by fault.  But that enhancement
does raise some more questions: leave it until a later release.

Signed-off-by: Hugh Dickins 
---
 mm/khugepaged.c | 184 
 1 file changed, 75 insertions(+), 109 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 1083f0e38a07..f7a0f7673127 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1617,9 +1617,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
unsigned long addr,
break;
case SCAN_PMD_NONE:
/*
-* In MADV_COLLAPSE path, possible race with khugepaged where
-* all pte entries have been removed and pmd cleared.  If so,
-* skip all the pte checks and just update the pmd mapping.
+* All pte entries have been removed and pmd cleared.
+* Skip all the pte checks and just update the pmd mapping.
 */
goto maybe_install_pmd;
default:
@@ -1748,123 +1747,88 @@ static void khugepaged_collapse_pte_mapped_thps(struct 
khugepaged_mm_slot *mm_sl
mmap_write_unlock(mm);
 }
 
-static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
-  struct mm_struct *target_mm,
-  unsigned long target_addr, struct page *hpage,
-  struct collapse_control *cc)
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 {
struct vm_area_struct *vma;
-   int target_result = SCAN_FAIL;
 
-   i_mmap_lock_write(mapping);
+   i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
-   int result = SCAN_FAIL;
-   struct mm_struct *mm = NULL;
-   unsigned long addr = 0;
-   pmd_t *pmd;
-   bool is_target = false;
+   struct mmu_notifier_range range;
+   struct mm_struct *mm;
+   unsigned long addr;
+   pmd_t *pmd, pgt_pmd;
+   spinlock_t *pml;
+   spinlock_t *ptl;
+   bool skipped_uffd = false;
 
/*
 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
-* got written to. These VMAs are likely not worth investing
-* mmap_write_lock(mm) as PMD-mapping is likely to be split
-* later.
-*
-* Note that vma->anon_vma check is racy: it can be set up after
-* the check but before we took mmap_lock by the fault path.
-* But page lock would prevent establishing any new ptes of the
-* page, so we are safe.
-*
-* An alternative would be drop the check, but check that page
-* table is clear before calling pmdp_collapse_flush() under
-* ptl. It has higher chance to recover THP for the VMA, but
-* has higher cost too. It would also probabl

[PATCH v2 08/12] mm/pgtable: add pte_free_defer() for pgtable as page

2023-06-20 Thread Hugh Dickins
Add the generic pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This version
suits all those architectures which use an unfragmented page for one page
table (none of whose pte_free()s use the mm arg which was passed to it).

Signed-off-by: Hugh Dickins 
---
 include/linux/mm_types.h |  4 
 include/linux/pgtable.h  |  2 ++
 mm/pgtable-generic.c | 20 
 3 files changed, 26 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1667a1bdb8a8..09335fa28c41 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -144,6 +144,10 @@ struct page {
struct {/* Page table pages */
unsigned long _pt_pad_1;/* compound_head */
pgtable_t pmd_huge_pte; /* protected by page->ptl */
+   /*
+* A PTE page table page might be freed by use of
+* rcu_head: which overlays those two fields above.
+*/
unsigned long _pt_pad_2;/* mapping */
union {
struct mm_struct *pt_mm; /* x86 pgd, s390 */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 525f1782b466..d18d3e963967 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -112,6 +112,8 @@ static inline void pte_unmap(pte_t *pte)
 }
 #endif
 
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 5e85a625ab30..ab3741064bb8 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -230,6 +231,25 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, 
unsigned long address,
return pmd;
 }
 #endif
+
+/* arch define pte_free_defer in asm/pgalloc.h for its own implementation */
+#ifndef pte_free_defer
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+
+   page = container_of(head, struct page, rcu_head);
+   pte_free(NULL /* mm not passed and not used */, (pgtable_t)page);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+   struct page *page;
+
+   page = pgtable;
+   call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* pte_free_defer */
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
-- 
2.35.3



[PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

2023-06-20 Thread Hugh Dickins
Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This version is more complicated than others: because s390 fits two 2K
page tables into one 4K page (so page->rcu_head must be shared between
both halves), and already uses page->lru (which page->rcu_head overlays)
to list any free halves; with clever management by page->_refcount bits.

Build upon the existing management, adjusted to follow a new rule: that
a page is not linked to mm_context_t::pgtable_list while either half is
pending free, by either tlb_remove_table() or pte_free_defer(); but is
afterwards either relinked to the list (if other half is allocated), or
freed (if other half is free): by __tlb_remove_table() in both cases.

This rule ensures that page->lru is no longer in use while page->rcu_head
may be needed for use by pte_free_defer().  And a fortuitous byproduct of
following this rule is that page_table_free() no longer needs its curious
two-step manipulation of _refcount - read commit c2c224932fd0 ("s390/mm:
fix 2KB pgtable release race") for what to think of there.  But it does
not solve the problem that two halves may need rcu_head at the same time.

For that, add HHead bits between s390's AAllocated and PPending bits in
the upper byte of page->_refcount: then the second pte_free_defer() can
see that rcu_head is already in use, and the RCU callee pte_free_half()
can see that it needs to make a further call_rcu() for that other half.

page_table_alloc() set the page->pt_mm field, so __tlb_remove_table()
knows where to link the freed half while its other half is allocated.
But linking to the list needs mm->context.lock: and although AA bit set
guarantees that pt_mm must still be valid, it does not guarantee that mm
is still valid an instant later: so acquiring mm->context.lock would not
be safe.  For now, use a static global mm_pgtable_list_lock instead:
then a soon-to-follow commit will split it per-mm as before (probably by
using a SLAB_TYPESAFE_BY_RCU structure for the list head and its lock);
and update the commentary on the pgtable_list.

Signed-off-by: Hugh Dickins 
---
 arch/s390/include/asm/pgalloc.h |   4 +
 arch/s390/mm/pgalloc.c  | 205 +++-
 include/linux/mm_types.h|   2 +-
 3 files changed, 154 insertions(+), 57 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..89a9d5ef94f8 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -143,6 +143,10 @@ static inline void pmd_populate(struct mm_struct *mm,
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
 
+/* arch use pte_free_defer() implementation in arch/s390/mm/pgalloc.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 void vmem_map_init(void);
 void *vmem_crst_alloc(unsigned long val);
 pte_t *vmem_pte_alloc(void);
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..11983a3ff95a 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -159,6 +159,11 @@ void page_table_free_pgste(struct page *page)
 
 #endif /* CONFIG_PGSTE */
 
+/*
+ * Temporarily use a global spinlock instead of mm->context.lock.
+ * This will be replaced by a per-mm spinlock in a followup commit.
+ */
+static DEFINE_SPINLOCK(mm_pgtable_list_lock);
 /*
  * A 2KB-pgtable is either upper or lower half of a normal page.
  * The second half of the page may be unused or used as another
@@ -172,7 +177,7 @@ void page_table_free_pgste(struct page *page)
  * When a parent page gets fully allocated it contains 2KB-pgtables in both
  * upper and lower halves and is removed from mm_context_t::pgtable_list.
  *
- * When 2KB-pgtable is freed from to fully allocated parent page that
+ * When 2KB-pgtable is freed from the fully allocated parent page that
  * page turns partially allocated and added to mm_context_t::pgtable_list.
  *
  * If 2KB-pgtable is freed from the partially allocated parent page that
@@ -182,16 +187,24 @@ void page_table_free_pgste(struct page *page)
  * As follows from the above, no unallocated or fully allocated parent
  * pages are contained in mm_context_t::pgtable_list.
  *
+ * NOTE NOTE NOTE: The commentary above and below has not yet been updated:
+ * the new rule is that a page is not linked to mm_context_t::pgtable_list
+ * while either half is pending free by any method; but afterwards is
+ * either relinked to it, or freed, by __tlb_remove_table().  This allows
+ * pte_free_defer() to use the page->rcu_head (which overlays page->lru).
+ *
  * The upper byte (bits 24-31) of t
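
The diff is cut short here in the archive.  The HHead idea from the commit
message can be sketched as a userspace model - with large assumptions: a
separate hh word rather than bits folded into _refcount, no AA/PP bits, no
pgtable_list handling, and a one-slot queue standing in for the single
rcu_head and call_rcu(); the function names are only echoes of the real
s390 code:

/* hh-model.c: cc -std=c11 hh-model.c */
#include <stdatomic.h>
#include <stdio.h>

/* One 4K page holds two 2K pagetable halves, but only one rcu_head. */
struct half_page {
	atomic_uint hh;		/* models the HH pending-free bits */
};

static struct half_page page;
static struct half_page *rcu_pending;	/* models the single rcu_head */

static void queue_rcu(struct half_page *p)	/* models call_rcu() */
{
	rcu_pending = p;
}

/* Models pte_free_defer() for one half: set this half's H bit; only the
 * caller who found no bit already set may use the shared rcu_head.
 */
static void defer_half(struct half_page *p, int half)
{
	if (atomic_fetch_or(&p->hh, 1u << half) == 0)
		queue_rcu(p);
	/* else: rcu_head already in use; the callee will requeue for us */
}

/* Models pte_free_half(): free one pending half, and if the other half
 * went pending meanwhile, make a further call_rcu() for it.
 */
static void pte_free_half(struct half_page *p)
{
	unsigned int mask = atomic_load(&p->hh);
	int half = (mask & 1) ? 0 : 1;

	printf("grace period over: freeing half %d\n", half);

	if (atomic_fetch_and(&p->hh, ~(1u << half)) & ~(1u << half))
		queue_rcu(p);
}

/* Models RCU making progress: a requeued callback would really have to
 * wait out a further grace period before running.
 */
static void run_grace_periods(void)
{
	while (rcu_pending) {
		struct half_page *p = rcu_pending;

		rcu_pending = NULL;
		pte_free_half(p);
	}
}

int main(void)
{
	defer_half(&page, 0);	/* first defer gets to use the rcu_head  */
	defer_half(&page, 1);	/* second finds it busy, only sets a bit */
	run_grace_periods();	/* frees half 0, requeues, frees half 1  */
	return 0;
}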

[PATCH v2 06/12] sparc: add pte_free_defer() for pte_t *pgtable_t

2023-06-20 Thread Hugh Dickins
Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.
So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pte_t *pgtable_t and struct page *.

Signed-off-by: Hugh Dickins 
---
 arch/sparc/include/asm/pgalloc_64.h |  4 
 arch/sparc/mm/init_64.c | 16 
 2 files changed, 20 insertions(+)

diff --git a/arch/sparc/include/asm/pgalloc_64.h 
b/arch/sparc/include/asm/pgalloc_64.h
index 7b5561d17ab1..caa7632be4c2 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 void pte_free(struct mm_struct *mm, pgtable_t ptepage);
 
+/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 #define pmd_populate_kernel(MM, PMD, PTE)  pmd_set(MM, PMD, PTE)
 #define pmd_populate(MM, PMD, PTE) pmd_set(MM, PMD, PTE)
 
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..0d7fd793924c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+
+   page = container_of(head, struct page, rcu_head);
+   __pte_free((pgtable_t)page_address(page));
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+   struct page *page;
+
+   page = virt_to_page(pgtable);
+   call_rcu(&page->rcu_head, pte_free_now);
+}
+
 void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
  pmd_t *pmd)
 {
-- 
2.35.3



[PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-06-20 Thread Hugh Dickins
Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time: account concurrent deferrals with a
heightened refcount, only the first making use of the rcu_head, but
re-deferring if more deferrals arrived during its grace period.

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/include/asm/pgalloc.h |  4 +++
 arch/powerpc/mm/pgtable-frag.c | 51 ++
 2 files changed, 55 insertions(+)

diff --git a/arch/powerpc/include/asm/pgalloc.h 
b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t 
ptepage)
pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c 
*/
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..e4f58c5fc2ac 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
__free_page(page);
}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define PTE_FREE_DEFERRED 0x1 /* beyond any PTE_FRAG_NR */
+
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+   int refcount;
+
+   page = container_of(head, struct page, rcu_head);
+   refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
+   &page->pt_frag_refcount);
+   if (refcount < PTE_FREE_DEFERRED) {
+   pte_fragment_free((unsigned long *)page_address(page), 0);
+   return;
+   }
+   /*
+* One page may be shared between PTE_FRAG_NR pagetables.
+* At least one more call to pte_free_defer() came in while we
+* were already deferring, so the free must be deferred again;
+* but just for one grace period, however many calls came in.
+*/
+   while (refcount >= PTE_FREE_DEFERRED + PTE_FREE_DEFERRED) {
+   refcount = atomic_sub_return(PTE_FREE_DEFERRED,
+   &page->pt_frag_refcount);
+   }
+   /* Remove that refcount of 1 left for fragment freeing above */
+   atomic_dec(&page->pt_frag_refcount);
+   call_rcu(&page->rcu_head, pte_free_now);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+   struct page *page;
+
+   page = virt_to_page(pgtable);
+   /*
+* One page may be shared between PTE_FRAG_NR pagetables: only queue
+* it once for freeing, but note whenever the free must be deferred.
+*
+* (This would be much simpler if the struct page had an rcu_head for
+* each fragment, or if we could allocate a separate array for that.)
+*
+* Convert our refcount of 1 to a refcount of PTE_FREE_DEFERRED, and
+* proceed to call_rcu() only when the rcu_head is not already in use.
+*/
+   if (atomic_add_return(PTE_FREE_DEFERRED - 1, &page->pt_frag_refcount) <
+ PTE_FREE_DEFERRED + PTE_FREE_DEFERRED)
+   call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.35.3



[PATCH v2 04/12] powerpc: assert_pte_locked() use pte_offset_map_nolock()

2023-06-20 Thread Hugh Dickins
Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in assert_pte_locked().  BUG if pte_offset_map_nolock() fails: this is
stricter than the previous implementation, which skipped when pmd_none()
(with a comment on khugepaged collapse transitions): but wouldn't we want
to know, if an assert_pte_locked() caller can be racing such transitions?

This mod might cause new crashes: which either expose my ignorance, or
indicate issues to be fixed, or limit the usage of assert_pte_locked().

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/mm/pgtable.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cb2dcdb18f8e..16b061af86d7 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -311,6 +311,8 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long 
addr)
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
+   pte_t *pte;
+   spinlock_t *ptl;
 
	if (mm == &init_mm)
return;
@@ -321,16 +323,10 @@ void assert_pte_locked(struct mm_struct *mm, unsigned 
long addr)
pud = pud_offset(p4d, addr);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, addr);
-   /*
-* khugepaged to collapse normal pages to hugepage, first set
-* pmd to none to force page fault/gup to take mmap_lock. After
-* pmd is set to none, we do a pte_clear which does this assertion
-* so if we find pmd none, return.
-*/
-   if (pmd_none(*pmd))
-   return;
-   BUG_ON(!pmd_present(*pmd));
-   assert_spin_locked(pte_lockptr(mm, pmd));
+   pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+   BUG_ON(!pte);
+   assert_spin_locked(ptl);
+   pte_unmap(pte);
 }
 #endif /* CONFIG_DEBUG_VM */
 
-- 
2.35.3



[PATCH v2 03/12] arm: adjust_pte() use pte_offset_map_nolock()

2023-06-20 Thread Hugh Dickins
Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
in adjust_pte(): because it gives the not-locked ptl for precisely that
pte, which the caller can then safely lock; whereas pte_lockptr() is not
so tightly coupled, because it dereferences the pmd pointer again.

Signed-off-by: Hugh Dickins 
---
 arch/arm/mm/fault-armv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index ca5302b0b7ee..7cb125497976 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,11 +117,10 @@ static int adjust_pte(struct vm_area_struct *vma, 
unsigned long address,
 * must use the nested version.  This also means we need to
 * open-code the spin-locking.
 */
-   pte = pte_offset_map(pmd, address);
+   pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
if (!pte)
return 0;
 
-   ptl = pte_lockptr(vma->vm_mm, pmd);
do_pte_lock(ptl);
 
ret = do_adjust_pte(vma, address, pfn, pte);
-- 
2.35.3



[PATCH v2 02/12] mm/pgtable: add PAE safety to __pte_offset_map()

2023-06-20 Thread Hugh Dickins
There is a faint risk that __pte_offset_map(), on a 32-bit architecture
with a 64-bit pmd_t e.g. x86-32 with CONFIG_X86_PAE=y, would succeed on
a pmdval assembled from a pmd_low and a pmd_high which never belonged
together: their combination not pointing to a page table at all, perhaps
not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.

Guard against that (on such configs) by local_irq_save() blocking TLB
flush between present updates, as linux/pgtable.h suggests.  It's only
needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
__pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
lock, would just send it back to __pte_offset_map() again.

Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(),
used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
at the right moment on those configs which do not already send it.

CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86.
It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that
Will Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.
It is not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.

Limit the IRQ disablement to CONFIG_HIGHPTE?  Perhaps, but would need a
little more work, to retry if pmd_low good for page table, but pmd_high
non-zero from THP (and that might be making x86-specific assumptions).
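
For anyone unfamiliar with the underlying hazard, a userspace model of how
two 32-bit loads can assemble a 64-bit value that never existed: a writer
flips a "pmd" between two self-consistent values while a reader loads the
halves separately, as a 32-bit machine must.  This only demonstrates basic
tearing; it does not model pmdp_get_lockless()'s retry loop, nor the
interrupt/TLB-flush interplay the patch actually relies on, and every name
in it is a stand-in:

/* torn-pmd.c: cc -std=c11 -pthread torn-pmd.c */
#include <stdatomic.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint32_t pmd_low, pmd_high;
static atomic_bool stop;

#define PMD_A 0x11111111u
#define PMD_B 0x22222222u

static void *writer(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		/* Two valid states: both halves A, or both halves B. */
		atomic_store(&pmd_low,  PMD_A);
		atomic_store(&pmd_high, PMD_A);
		atomic_store(&pmd_low,  PMD_B);
		atomic_store(&pmd_high, PMD_B);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	long torn = 0;

	pthread_create(&t, NULL, writer, NULL);
	for (long i = 0; i < 10000000; i++) {
		uint32_t lo = atomic_load(&pmd_low);	/* two separate */
		uint32_t hi = atomic_load(&pmd_high);	/* 32-bit loads */

		if (lo != hi)		/* halves from two different pmds */
			torn++;
	}
	atomic_store(&stop, true);
	pthread_join(t, NULL);
	printf("torn combinations observed: %ld\n", torn);
	return 0;
}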

Signed-off-by: Hugh Dickins 
---
 include/linux/pgtable.h |  4 
 mm/pgtable-generic.c| 29 +
 2 files changed, 33 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8b0fc7fdc46f..525f1782b466 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -390,6 +390,7 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
return pmd;
 }
 #define pmdp_get_lockless pmdp_get_lockless
+#define pmdp_get_lockless_sync() tlb_remove_table_sync_one()
 #endif /* CONFIG_PGTABLE_LEVELS > 2 */
 #endif /* CONFIG_GUP_GET_PXX_LOW_HIGH */
 
@@ -408,6 +409,9 @@ static inline pmd_t pmdp_get_lockless(pmd_t *pmdp)
 {
return pmdp_get(pmdp);
 }
+static inline void pmdp_get_lockless_sync(void)
+{
+}
 #endif
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 674671835631..5e85a625ab30 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -232,12 +232,41 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, 
unsigned long address,
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
+   (defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
+/*
+ * See the comment above ptep_get_lockless() in include/linux/pgtable.h:
+ * the barriers in pmdp_get_lockless() cannot guarantee that the value in
+ * pmd_high actually belongs with the value in pmd_low; but holding interrupts
+ * off blocks the TLB flush between present updates, which guarantees that a
+ * successful __pte_offset_map() points to a page from matched halves.
+ */
+static unsigned long pmdp_get_lockless_start(void)
+{
+   unsigned long irqflags;
+
+   local_irq_save(irqflags);
+   return irqflags;
+}
+static void pmdp_get_lockless_end(unsigned long irqflags)
+{
+   local_irq_restore(irqflags);
+}
+#else
+static unsigned long pmdp_get_lockless_start(void) { return 0; }
+static void pmdp_get_lockless_end(unsigned long irqflags) { }
+#endif
+
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 {
+   unsigned long irqflags;
pmd_t pmdval;
 
rcu_read_lock();
+   irqflags = pmdp_get_lockless_start();
pmdval = pmdp_get_lockless(pmd);
+   pmdp_get_lockless_end(irqflags);
+
if (pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
-- 
2.35.3



[PATCH v2 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s

2023-06-20 Thread Hugh Dickins
Before putting them to use (several commits later), add rcu_read_lock()
to pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
separate commit, since it risks exposing imbalances: prior commits have
fixed all the known imbalances, but we may find some have been missed.
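
As an illustration (a sketch of the expected usage pattern, not code that
this patch adds), a typical caller balances the pair like so:

	pte_t *pte = pte_offset_map(pmd, addr);	/* takes rcu_read_lock() */

	if (pte) {
		/* ... examine page table entries under RCU ... */
		pte_unmap(pte);			/* does rcu_read_unlock() */
	}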

Signed-off-by: Hugh Dickins 
---
 include/linux/pgtable.h | 4 ++--
 mm/pgtable-generic.c| 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a1326e61d7ee..8b0fc7fdc46f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -99,7 +99,7 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned 
long address)
((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
 #define pte_unmap(pte) do {\
kunmap_local((pte));\
-   /* rcu_read_unlock() to be added later */   \
+   rcu_read_unlock();  \
 } while (0)
 #else
 static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
@@ -108,7 +108,7 @@ static inline pte_t *__pte_map(pmd_t *pmd, unsigned long 
address)
 }
 static inline void pte_unmap(pte_t *pte)
 {
-   /* rcu_read_unlock() to be added later */
+   rcu_read_unlock();
 }
 #endif
 
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c7ab18a5fb77..674671835631 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, 
pmd_t *pmdvalp)
 {
pmd_t pmdval;
 
-   /* rcu_read_lock() to be added later */
+   rcu_read_lock();
pmdval = pmdp_get_lockless(pmd);
if (pmdvalp)
*pmdvalp = pmdval;
@@ -250,7 +250,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, 
pmd_t *pmdvalp)
}
	return __pte_map(&pmdval, addr);
 nomap:
-   /* rcu_read_unlock() to be added later */
+   rcu_read_unlock();
return NULL;
 }
 
-- 
2.35.3



[PATCH v2 00/12] mm: free retracted page table by RCU

2023-06-20 Thread Hugh Dickins
Here is v2 of the third series of patches to mm (and a few architectures), based
on v6.4-rc5 with the preceding two series applied: in which khugepaged
takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.
Differences from v1 are noted patch by patch below.

This follows on from the v2 "arch: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/a4963be9-7aa6-350-66d0-2ba843e1a...@google.com/
series of 23 posted on 2023-06-08 (and now in mm-stable - thank you),
and the v2 "mm: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/c1c9a74a-bc5b-15ea-e5d2-8ec34bc9...@google.com/
series of 32 posted on 2023-06-08 (and now in mm-stable - thank you),
and replaces the v1 "mm: free retracted page table by RCU"
https://lore.kernel.org/linux-mm/35e983f5-7ed3-b310-d949-9ae8b130c...@google.com/
series of 12 posted on 2023-05-28 (which was bad on powerpc and s390).

The first two series were "independent": neither depending for build or
correctness on the other, but both series had to be in before this third
series is added to make the effective changes; and it would probably be
best to hold this series back until the following release, since it might
now reveal missed imbalances which the first series hoped to fix.

What is it all about?  Some mmap_lock avoidance i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs:
the usefulness of MADV_COLLAPSE on shmem is being limited by that
mmap_write_lock it currently requires.

Likely to be relied upon later in other contexts e.g. freeing of
empty page tables (but that's not work I'm doing).  mmap_write_lock
avoidance when collapsing to anon THPs?  Perhaps, but again that's not
work I've done: a quick attempt was not as easy as the shmem/file case.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

Based on the preceding two series over v6.4-rc5, or any v6.4-rc; and
almost good on current mm-everything or current linux-next - just one
patch conflicts, the 10/12: I'll reply to that one with its
mm-everything or linux-next equivalent (ptent replacing *pte).

01/12 mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s
  v2: same as v1
02/12 mm/pgtable: add PAE safety to __pte_offset_map()
  v2: rename to pmdp_get_lockless_start/end() per Matthew;
  so use inlines without _irq_save(flags) macro oddity;
  add pmdp_get_lockless_sync() for use later in 09/12.
03/12 arm: adjust_pte() use pte_offset_map_nolock()
  v2: same as v1
04/12 powerpc: assert_pte_locked() use pte_offset_map_nolock()
  v2: same as v1
05/12 powerpc: add pte_free_defer() for pgtables sharing page
  v2: fix rcu_head usage to cope with concurrent deferrals;
  add para to commit message explaining rcu_head issue.
06/12 sparc: add pte_free_defer() for pte_t *pgtable_t
  v2: use page_address() instead of less common page_to_virt();
  add para to commit message explaining simple conversion;
  changed title since sparc64 pgtables do not share page.
07/12 s390: add pte_free_defer() for pgtables sharing page
  v2: complete rewrite, integrated with s390's existing pgtable
  management; temporarily using a global mm_pgtable_list_lock,
  to be restored to per-mm spinlock in a later followup patch.
08/12 mm/pgtable: add pte_free_defer() for pgtable as page
  v2: add comment on rcu_head to "Page table pages", per JannH
09/12 mm/khugepaged: retract_page_tables() without mmap or vma lock
  v2: repeat checks under ptl because UFFD, per PeterX and JannH;
  bring back mmu_notifier calls for PMD, per JannH and Jason;
  pmdp_get_lockless_sync() to issue missing interrupt if PAE.
10/12 mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
  v2: first check VMA, in case page tables torn down, per JannH;
  pmdp_get_lockless_sync() to issue missing interrupt if PAE;
  moved mmu_notifier after step 1, reworked final goto labels.
11/12 mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps()
  v2: same as v1
12/12 mm: delete mmap_write_trylock() and vma_try_start_write()
  v2: same as v1

 arch/arm/mm/fault-armv.c|   3 +-
 arch/powerpc/include/asm/pgalloc.h  |   4 +
 arch/powerpc/mm/pgtable-frag.c  |  51 
 arch/powerpc/mm/pgtable.c   |  16 +-
 arch/s390/include/asm/pgalloc.h |   4 +
 arch/s390/mm/pgalloc.c  | 205 +
 arch/sparc/include/asm/pgalloc_64.h |   4 +
 arch/sparc/mm/init_64.c |  16 +
 include/linux/mm.h  |  17 --
 include/linux/mm_types.h|   6 +-
 include/linux/mmap_lock.h   |  10 -
 include/linux/pgtable.h |  10 +-
 mm/khugepaged.c | 481 +++---
 mm/pgtable-generic.c|  53 +++-
 14 files changed, 467 insertions(+), 413 deletions(-)


[PATCH v2 07/23 replacement] mips: add pte_unmap() to balance pte_offset_map()

2023-06-15 Thread Hugh Dickins
To keep balance in future, __update_tlb() must remember to pte_unmap() after
pte_offset_map().  This is an odd case, since the caller has already done
pte_offset_map_lock(), then mips forgets the address and recalculates it;
but my two naive attempts to clean that up did more harm than good.

Tested-by: Nathan Chancellor 
Signed-off-by: Hugh Dickins 
---
Andrew, please replace my mips patch, and its build warning fix patch,
in mm-unstable by this less ambitious but working replacement - thanks.

 arch/mips/mm/tlb-r4k.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/mips/mm/tlb-r4k.c b/arch/mips/mm/tlb-r4k.c
index 1b939abbe4ca..93c2d695588a 100644
--- a/arch/mips/mm/tlb-r4k.c
+++ b/arch/mips/mm/tlb-r4k.c
@@ -297,7 +297,7 @@ void __update_tlb(struct vm_area_struct * vma, unsigned 
long address, pte_t pte)
p4d_t *p4dp;
pud_t *pudp;
pmd_t *pmdp;
-   pte_t *ptep;
+   pte_t *ptep, *ptemap = NULL;
int idx, pid;
 
/*
@@ -344,7 +344,12 @@ void __update_tlb(struct vm_area_struct * vma, unsigned 
long address, pte_t pte)
} else
 #endif
{
-   ptep = pte_offset_map(pmdp, address);
+   ptemap = ptep = pte_offset_map(pmdp, address);
+   /*
+* update_mmu_cache() is called between pte_offset_map_lock()
+* and pte_unmap_unlock(), so we can assume that ptep is not
+* NULL here: and what should be done below if it were NULL?
+*/
 
 #if defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)
 #ifdef CONFIG_XPA
@@ -373,6 +378,9 @@ void __update_tlb(struct vm_area_struct * vma, unsigned 
long address, pte_t pte)
tlbw_use_hazard();
htw_start();
flush_micro_tlb_vm(vma);
+
+   if (ptemap)
+   pte_unmap(ptemap);
local_irq_restore(flags);
 }
 
-- 
2.35.3



Re: [PATCH v2 07/23] mips: update_mmu_cache() can replace __update_tlb()

2023-06-15 Thread Hugh Dickins
On Thu, 15 Jun 2023, Nathan Chancellor wrote:
> On Wed, Jun 14, 2023 at 10:43:30PM -0700, Hugh Dickins wrote:
> > 
> > I do hope that you find the first fixes the breakage; but if not, then
> 
> I hate to be the bearer of bad news but the first patch did not fix the
> breakage, I see the same issue.

Boo!

> 
> > I even more fervently hope that the second will, despite my hating it.
> > Touch wood for the first, fingers crossed for the second, thanks,
> 
> Thankfully, the second one does. Thanks for the quick and thoughtful
> responses!

Hurrah!

Thanks a lot, Nathan.  I'll set aside my disappointment and curiosity,
clearly I'm not going to have much of a future as a MIPS programmer.

I must take a break, then rush Andrew the second patch, well, not
exactly that second patch, since most of that is revert: I'll just
send the few lines of replacement patch (with a new Subject line, as
update_mmu_cache() goes back to being separate from __update_tlb()).

Unless you object, I'll include a Tested-by: you.  I realize that
your testing is limited to seeing it running; but that's true of
most of the testing at this stage - it gets to be more interesting
when the patch that adds the rcu_read_lock() and rcu_read_unlock()
is added on top later.

Thanks again,
Hugh


Re: [PATCH v4 04/34] pgtable: Create struct ptdesc

2023-06-15 Thread Hugh Dickins
On Mon, 12 Jun 2023, Vishal Moola (Oracle) wrote:

> Currently, page table information is stored within struct page. As part
> of simplifying struct page, create struct ptdesc for page table
> information.
> 
> Signed-off-by: Vishal Moola (Oracle) 

Vishal, as I think you have already guessed, your ptdesc series and
my pte_free_defer() "mm: free retracted page table by RCU" series are
on a collision course.

Probably just trivial collisions in most architectures, which either
of us can easily adjust to the other; powerpc likely to be more awkward,
but fairly easily resolved; s390 quite a problem.

I've so far been unable to post a v2 of my series (and powerpc and s390
were stupidly wrong in the v1), because a good s390 patch is not yet
decided - Gerald Schaefer and I are currently working on that, on the
s390 list (I took off most Ccs until we are settled and I can post v2).

As you have no doubt found yourself, s390 has sophisticated handling of
free half-pages already, and I need to add rcu_head usage in there too:
it's tricky to squeeze it all in, and ptdesc does not appear to help us
in any way (though mostly it's just changing some field names, okay).

If ptdesc were actually allowing a flexible structure which architectures
could add into, that would (in some future) be nice; but of course at
present it's still fitting it all into one struct page, and mandating
new restrictions which just make an architecture's job harder.

Some notes on problematic fields below FYI.

> ---
>  include/linux/pgtable.h | 51 +
>  1 file changed, 51 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index c5a51481bbb9..330de96ebfd6 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -975,6 +975,57 @@ static inline void ptep_modify_prot_commit(struct 
> vm_area_struct *vma,
>  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>  #endif /* CONFIG_MMU */
>  
> +
> +/**
> + * struct ptdesc - Memory descriptor for page tables.
> + * @__page_flags: Same as page flags. Unused for page tables.
> + * @pt_list: List of used page tables. Used for s390 and x86.
> + * @_pt_pad_1: Padding that aliases with page's compound head.
> + * @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs.
> + * @_pt_s390_gaddr: Aliases with page's mapping. Used for s390 gmap only.
> + * @pt_mm: Used for x86 pgds.
> + * @pt_frag_refcount: For fragmented page table tracking. Powerpc and s390 
> only.
> + * @ptl: Lock for the page table.
> + *
> + * This struct overlays struct page for now. Do not modify without a good
> + * understanding of the issues.
> + */
> +struct ptdesc {
> + unsigned long __page_flags;
> +
> + union {
> + struct list_head pt_list;

I shall be needing struct rcu_head rcu_head (or pt_rcu_head or whatever,
if you prefer) in this union too.  Sharing the lru or pt_list with rcu_head
is what's difficult to get right and efficient on s390 - and if ptdesc gave
us an independent rcu_head for each page table, that would be a blessing!
but sadly not, it still has to squeeze into a struct page.
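
For illustration only, a sketch (one possible shape, certainly not part of
this patch) of that union with such a field added:

	union {
		struct list_head pt_list;
		struct rcu_head pt_rcu_head;	/* for RCU-deferred freeing */
		struct {
			unsigned long _pt_pad_1;
			pgtable_t pmd_huge_pte;
		};
	};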

> + struct {
> + unsigned long _pt_pad_1;
> + pgtable_t pmd_huge_pte;
> + };
> + };
> + unsigned long _pt_s390_gaddr;
> +
> + union {
> + struct mm_struct *pt_mm;
> + atomic_t pt_frag_refcount;

Whether s390 will want pt_mm is not yet decided: I want to use it,
Gerald prefers to go without it; but if we do end up using it,
then pt_frag_refcount is a luxury we would have to give up.

s390 does very well already with its _refcount tricks, and I'd expect
powerpc's simpler but more wasteful implementation to work as well
with _refcount too - I know that a few years back, powerpc did misuse
_refcount (it did not allow for speculative accesses, thought it had
sole ownership of that field); but s390 copes well with that, and I
expect powerpc can do so too, without the luxury of pt_frag_refcount.

But I've no desire to undo powerpc's use of pt_frag_refcount:
just warning that we may want to undo any use of it in s390.

I thought I had more issues to mention, probably Gerald will
remind me of a whole new unexplored dimension! gmap perhaps.

Hugh

> + };
> +
> +#if ALLOC_SPLIT_PTLOCKS
> + spinlock_t *ptl;
> +#else
> + spinlock_t ptl;
> +#endif
> +};
> +
> +#define TABLE_MATCH(pg, pt)  \
> + static_assert(offsetof(struct page, pg) == offsetof(struct ptdesc, pt))
> +TABLE_MATCH(flags, __page_flags);
> +TABLE_MATCH(compound_head, pt_list);
> +TABLE_MATCH(compound_head, _pt_pad_1);
> +TABLE_MATCH(pmd_huge_pte, pmd_huge_pte);
> +TABLE_MATCH(mapping, _pt_s390_gaddr);
> +TABLE_MATCH(pt_mm, pt_mm);
> +TABLE_MATCH(ptl, ptl);
> +#undef TABLE_MATCH
> +static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
> +
>  /*
>   * No-op macros that just return the current protection value. Defined here

Re: [PATCH v2 07/23] mips: update_mmu_cache() can replace __update_tlb()

2023-06-14 Thread Hugh Dickins
On Wed, 14 Jun 2023, Hugh Dickins wrote:
> On Wed, 14 Jun 2023, Nathan Chancellor wrote:
> > 
> > I just bisected a crash while powering down a MIPS machine in QEMU to
> > this change as commit 8044511d3893 ("mips: update_mmu_cache() can
> > replace __update_tlb()") in linux-next.
> 
> Thank you, Nathan, that's very helpful indeed.  This patch certainly knew
> that it wanted testing, and I'm glad to hear that it is now seeing some.
> 
> While powering down?  The messages below look like it was just coming up,
> but no doubt that's because you were bisecting (or because I'm unfamiliar
> with what messages to expect there).  It's probably irrelevant information,
> but I wonder whether the (V)machine worked well enough for a while before
> you first powered down and spotted the problem, or whether it's never got
> much further than trying to run init (busybox)?  I'm trying to get a feel
> for whether the problem occurs under common or uncommon conditions.
> 
> > Unfortunately, I can still
> > reproduce it with the existing fix you have for this change on the
> > mailing list, which is present in next-20230614.
> 
> Right, that later fix was only for a build warning, nothing functional
> (or at least I hoped that it wasn't making any functional difference).
> 
> Thanks a lot for the detailed instructions below: unfortunately, those
> would draw me into a realm of testing I've never needed to enter before,
> so a lot of time spent on setup and learning.  Usually, I just stare at
> the source.
> 
> What this probably says is that I should revert most my cleanup there,
> and keep as close to the existing code as possible.  But some change is
> needed, and I may need to understand (or have a good guess at) what was
> going wrong, to decide what kind of retreat will be successful.
> 
> Back to the source for a while: I hope I'll find examples in nearby MIPS
> kernel source (and git history), which will hint at the right way forward.
> Then send you a patch against next-20230614 to try, when I'm reasonably
> confident that it's enough to satisfy my purpose, but likely not to waste
> your time.

I'm going to take advantage of your good nature by attaching
two alternative patches, either to go on top of next-20230614.

mips1.patch,
 arch/mips/mm/tlb-r4k.c |   12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

is by far my favourite.  I couldn't see anything wrong with what's
already there for mips, but it seems possible that (though I didn't
find it) somewhere calls update_mmu_cache_pmd() on a page table.  So
mips1.patch restores the pmd_huge() check, and cleans up further by
removing the silly pgdp, p4dp, pudp, pmdp stuff: the pointer has now
been passed in by the caller, why walk the tree again?  I should have
done it this way before.

But if that doesn't work, then I'm afraid it will have to be
mips2.patch,
 arch/mips/include/asm/pgtable.h |   15 ---
 arch/mips/mm/tlb-r3k.c  |5 ++---
 arch/mips/mm/tlb-r4k.c  |   27 ++-
 3 files changed, 32 insertions(+), 15 deletions(-)

which reverts all of the original patch and its build warning fix,
and does a pte_unmap() to balance the silly pte_offset_map() there;
with an apologetic comment for this being about the only place in
the tree where I have no idea what to do if ptep were NULL.

I do hope that you find the first fixes the breakage; but if not, then
I even more fervently hope that the second will, despite my hating it.
Touch wood for the first, fingers crossed for the second, thanks,

Hugh

--- a/arch/mips/mm/tlb-r4k.c
+++ b/arch/mips/mm/tlb-r4k.c
@@ -293,12 +293,6 @@ void local_flush_tlb_one(unsigned long page)
 void update_mmu_cache(struct vm_area_struct *vma,
 		  unsigned long address, pte_t *ptep)
 {
-#ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
-	pgd_t *pgdp;
-	p4d_t *p4dp;
-	pud_t *pudp;
-	pmd_t *pmdp;
-#endif
 	unsigned long flags;
 	int idx, pid;
 
@@ -323,12 +317,8 @@ void update_mmu_cache(struct vm_area_struct *vma,
 	tlb_probe_hazard();
 	idx = read_c0_index();
 #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
-	pgdp = pgd_offset(vma->vm_mm, address);
-	p4dp = p4d_offset(pgdp, address);
-	pudp = pud_offset(p4dp, address);
-	pmdp = pmd_offset(pudp, address);
 	/* this could be a huge page  */
-	if (ptep == (pte_t *)pmdp) {
+	if (pmd_huge(*(pmd_t *)ptep)) {
 		unsigned long lo;
 		write_c0_pagemask(PM_HUGE_MASK);
 		lo = pte_to_entrylo(pte_val(*ptep));
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -565,8 +565,15 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 }
 #endif
 
-extern void update_mmu_cache(struct vm_area_struct *vma,
-	unsigned long address, pte_t *ptep);
+extern void __update_tlb(struct vm_area_struct *vma, unsigned long address,
+	pte_t pte);
+
+static inline void update_mmu_cache(struct vm_area

Re: [PATCH v2 07/23] mips: update_mmu_cache() can replace __update_tlb()

2023-06-14 Thread Hugh Dickins
On Wed, 14 Jun 2023, Nathan Chancellor wrote:

> Hi Hugh,
> 
> On Thu, Jun 08, 2023 at 12:17:24PM -0700, Hugh Dickins wrote:
> > Don't make update_mmu_cache() a wrapper around __update_tlb(): call it
> > directly, and use the ptep (or pmdp) provided by the caller, instead of
> > re-calling pte_offset_map() - which would raise a question of whether a
> > pte_unmap() is needed to balance it.
> > 
> > Check whether the "ptep" provided by the caller is actually the pmdp,
> > instead of testing pmd_huge(): or test pmd_huge() too and warn if it
> > disagrees?  This is "hazardous" territory: needs review and testing.
> > 
> > Signed-off-by: Hugh Dickins 
> > ---
> >  arch/mips/include/asm/pgtable.h | 15 +++
> >  arch/mips/mm/tlb-r3k.c  |  5 +++--
> >  arch/mips/mm/tlb-r4k.c  |  9 +++--
> >  3 files changed, 9 insertions(+), 20 deletions(-)
> > 
> > diff --git a/arch/mips/include/asm/pgtable.h 
> > b/arch/mips/include/asm/pgtable.h
> > index 574fa14ac8b2..9175dfab08d5 100644
> > --- a/arch/mips/include/asm/pgtable.h
> > +++ b/arch/mips/include/asm/pgtable.h
> > @@ -565,15 +565,8 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
> >  }
> >  #endif
> >  
> > -extern void __update_tlb(struct vm_area_struct *vma, unsigned long address,
> > -   pte_t pte);
> > -
> > -static inline void update_mmu_cache(struct vm_area_struct *vma,
> > -   unsigned long address, pte_t *ptep)
> > -{
> > -   pte_t pte = *ptep;
> > -   __update_tlb(vma, address, pte);
> > -}
> > +extern void update_mmu_cache(struct vm_area_struct *vma,
> > +   unsigned long address, pte_t *ptep);
> >  
> >  #define__HAVE_ARCH_UPDATE_MMU_TLB
> >  #define update_mmu_tlb update_mmu_cache
> > @@ -581,9 +574,7 @@ static inline void update_mmu_cache(struct 
> > vm_area_struct *vma,
> >  static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
> > unsigned long address, pmd_t *pmdp)
> >  {
> > -   pte_t pte = *(pte_t *)pmdp;
> > -
> > -   __update_tlb(vma, address, pte);
> > +   update_mmu_cache(vma, address, (pte_t *)pmdp);
> >  }
> >  
> >  /*
> > diff --git a/arch/mips/mm/tlb-r3k.c b/arch/mips/mm/tlb-r3k.c
> > index 53dfa2b9316b..e5722cd8dd6d 100644
> > --- a/arch/mips/mm/tlb-r3k.c
> > +++ b/arch/mips/mm/tlb-r3k.c
> > @@ -176,7 +176,8 @@ void local_flush_tlb_page(struct vm_area_struct *vma, 
> > unsigned long page)
> > }
> >  }
> >  
> > -void __update_tlb(struct vm_area_struct *vma, unsigned long address, pte_t 
> > pte)
> > +void update_mmu_cache(struct vm_area_struct *vma,
> > + unsigned long address, pte_t *ptep)
> >  {
> > unsigned long asid_mask = cpu_asid_mask(&current_cpu_data);
> > unsigned long flags;
> > @@ -203,7 +204,7 @@ void __update_tlb(struct vm_area_struct *vma, unsigned 
> > long address, pte_t pte)
> > BARRIER;
> > tlb_probe();
> > idx = read_c0_index();
> > -   write_c0_entrylo0(pte_val(pte));
> > +   write_c0_entrylo0(pte_val(*ptep));
> > write_c0_entryhi(address | pid);
> > if (idx < 0) {  /* BARRIER */
> > tlb_write_random();
> > diff --git a/arch/mips/mm/tlb-r4k.c b/arch/mips/mm/tlb-r4k.c
> > index 1b939abbe4ca..c96725d17cab 100644
> > --- a/arch/mips/mm/tlb-r4k.c
> > +++ b/arch/mips/mm/tlb-r4k.c
> > @@ -290,14 +290,14 @@ void local_flush_tlb_one(unsigned long page)
> >   * updates the TLB with the new pte(s), and another which also checks
> >   * for the R4k "end of page" hardware bug and does the needy.
> >   */
> > -void __update_tlb(struct vm_area_struct * vma, unsigned long address, 
> > pte_t pte)
> > +void update_mmu_cache(struct vm_area_struct *vma,
> > + unsigned long address, pte_t *ptep)
> >  {
> > unsigned long flags;
> > pgd_t *pgdp;
> > p4d_t *p4dp;
> > pud_t *pudp;
> > pmd_t *pmdp;
> > -   pte_t *ptep;
> > int idx, pid;
> >  
> > /*
> > @@ -326,10 +326,9 @@ void __update_tlb(struct vm_area_struct * vma, 
> > unsigned long address, pte_t pte)
> > idx = read_c0_index();
> >  #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> > /* this could be a huge page  */
> > -   if (pmd_huge(*pmdp)) {
> > +   if (ptep == (pte_t *)pmdp) {
> > unsigned long lo;
> > write_c0_pagemask(PM_HUGE_MASK);
> > -   ptep = (pte_t *)pmdp;
&

[PATCH v2 07/23 fix] mips: update_mmu_cache() can replace __update_tlb(): fix

2023-06-09 Thread Hugh Dickins
I expect this to fix the
arch/mips/mm/tlb-r4k.c:300:16: warning: variable 'pmdp' set but not used
reported by the kernel test robot; but I am uncomfortable rearranging
lines in this tlb_probe_hazard() area, and would be glad for review and
testing by someone familiar with mips - thanks in advance!

Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202306091304.cnvispk0-...@intel.com/
Signed-off-by: Hugh Dickins 
---
 arch/mips/mm/tlb-r4k.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/mips/mm/tlb-r4k.c b/arch/mips/mm/tlb-r4k.c
index c96725d17cab..80fc90d8d2f1 100644
--- a/arch/mips/mm/tlb-r4k.c
+++ b/arch/mips/mm/tlb-r4k.c
@@ -293,11 +293,13 @@ void local_flush_tlb_one(unsigned long page)
 void update_mmu_cache(struct vm_area_struct *vma,
  unsigned long address, pte_t *ptep)
 {
-   unsigned long flags;
+#ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
pgd_t *pgdp;
p4d_t *p4dp;
pud_t *pudp;
pmd_t *pmdp;
+#endif
+   unsigned long flags;
int idx, pid;
 
/*
@@ -316,15 +318,15 @@ void update_mmu_cache(struct vm_area_struct *vma,
pid = read_c0_entryhi() & cpu_asid_mask(&current_cpu_data);
write_c0_entryhi(address | pid);
}
-   pgdp = pgd_offset(vma->vm_mm, address);
mtc0_tlbw_hazard();
tlb_probe();
tlb_probe_hazard();
+   idx = read_c0_index();
+#ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
+   pgdp = pgd_offset(vma->vm_mm, address);
p4dp = p4d_offset(pgdp, address);
pudp = pud_offset(p4dp, address);
pmdp = pmd_offset(pudp, address);
-   idx = read_c0_index();
-#ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
/* this could be a huge page  */
if (ptep == (pte_t *)pmdp) {
unsigned long lo;
-- 
2.35.3



[PATCH v2 23/23] xtensa: add pte_unmap() to balance pte_offset_map()

2023-06-08 Thread Hugh Dickins
To keep balance in future, remember to pte_unmap() after a successful
pte_offset_map().  And act as if get_pte_for_vaddr() really needs a map
there, to read the pteval before "unmapping", to be sure page table is
not removed.

Signed-off-by: Hugh Dickins 
---
 arch/xtensa/mm/tlb.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
index 27a477dae232..0a11fc5f185b 100644
--- a/arch/xtensa/mm/tlb.c
+++ b/arch/xtensa/mm/tlb.c
@@ -179,6 +179,7 @@ static unsigned get_pte_for_vaddr(unsigned vaddr)
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+   unsigned int pteval;
 
if (!mm)
mm = task->active_mm;
@@ -197,7 +198,9 @@ static unsigned get_pte_for_vaddr(unsigned vaddr)
pte = pte_offset_map(pmd, vaddr);
if (!pte)
return 0;
-   return pte_val(*pte);
+   pteval = pte_val(*pte);
+   pte_unmap(pte);
+   return pteval;
 }
 
 enum {
-- 
2.35.3



[PATCH v2 22/23] x86: sme_populate_pgd() use pte_offset_kernel()

2023-06-08 Thread Hugh Dickins
sme_populate_pgd() is an __init function for sme_encrypt_kernel():
it should use pte_offset_kernel() instead of pte_offset_map(), to avoid
the question of whether a pte_unmap() will be needed to balance.

Signed-off-by: Hugh Dickins 
---
 arch/x86/mm/mem_encrypt_identity.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/mem_encrypt_identity.c 
b/arch/x86/mm/mem_encrypt_identity.c
index c6efcf559d88..a1ab542bdfd6 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -188,7 +188,7 @@ static void __init sme_populate_pgd(struct 
sme_populate_pgd_data *ppd)
if (pmd_large(*pmd))
return;
 
-   pte = pte_offset_map(pmd, ppd->vaddr);
+   pte = pte_offset_kernel(pmd, ppd->vaddr);
if (pte_none(*pte))
set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
 }
-- 
2.35.3



[PATCH v2 21/23] x86: Allow get_locked_pte() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
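
The shape which "handle appropriately" takes across this series is roughly
(an illustrative sketch, not lifted verbatim from any one patch):

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return;		/* or goto again, to recheck *pmd and retry */
	/* ... operate on the mapped, locked page table ... */
	pte_unmap_unlock(pte, ptl);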

Signed-off-by: Hugh Dickins 
---
 arch/x86/kernel/ldt.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 525876e7b9f4..adc67f98819a 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -367,8 +367,10 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct 
ldt_struct *ldt)
 
va = (unsigned long)ldt_slot_va(ldt->slot) + offset;
ptep = get_locked_pte(mm, va, &ptl);
-   pte_clear(mm, va, ptep);
-   pte_unmap_unlock(ptep, ptl);
+   if (!WARN_ON_ONCE(!ptep)) {
+   pte_clear(mm, va, ptep);
+   pte_unmap_unlock(ptep, ptl);
+   }
}
 
va = (unsigned long)ldt_slot_va(ldt->slot);
-- 
2.35.3



[PATCH v2 20/23] sparc: iounit and iommu use pte_offset_kernel()

2023-06-08 Thread Hugh Dickins
iounit_alloc() and sbus_iommu_alloc() are working from pmd_off_k(),
so should use pte_offset_kernel() instead of pte_offset_map(), to avoid
the question of whether a pte_unmap() will be needed to balance.

Signed-off-by: Hugh Dickins 
---
 arch/sparc/mm/io-unit.c | 2 +-
 arch/sparc/mm/iommu.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/mm/io-unit.c b/arch/sparc/mm/io-unit.c
index bf3e6d2fe5d9..133dd42570d6 100644
--- a/arch/sparc/mm/io-unit.c
+++ b/arch/sparc/mm/io-unit.c
@@ -244,7 +244,7 @@ static void *iounit_alloc(struct device *dev, size_t len,
long i;
 
pmdp = pmd_off_k(addr);
-   ptep = pte_offset_map(pmdp, addr);
+   ptep = pte_offset_kernel(pmdp, addr);
 
set_pte(ptep, mk_pte(virt_to_page(page), dvma_prot));
 
diff --git a/arch/sparc/mm/iommu.c b/arch/sparc/mm/iommu.c
index 9e3f6933ca13..3a6caef68348 100644
--- a/arch/sparc/mm/iommu.c
+++ b/arch/sparc/mm/iommu.c
@@ -358,7 +358,7 @@ static void *sbus_iommu_alloc(struct device *dev, size_t 
len,
__flush_page_to_ram(page);
 
pmdp = pmd_off_k(addr);
-   ptep = pte_offset_map(pmdp, addr);
+   ptep = pte_offset_kernel(pmdp, addr);
 
set_pte(ptep, mk_pte(virt_to_page(page), dvma_prot));
}
-- 
2.35.3



[PATCH v2 19/23] sparc: allow pte_offset_map() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins 
---
 arch/sparc/kernel/signal32.c | 2 ++
 arch/sparc/mm/fault_64.c | 3 +++
 arch/sparc/mm/tlb.c  | 2 ++
 3 files changed, 7 insertions(+)

diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..ca450c7bc53f 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -328,6 +328,8 @@ static void flush_signal_insns(unsigned long address)
goto out_irqs_on;
 
ptep = pte_offset_map(pmdp, address);
+   if (!ptep)
+   goto out_irqs_on;
pte = *ptep;
if (!pte_present(pte))
goto out_unmap;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index d91305de694c..d8a407fbe350 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -99,6 +99,7 @@ static unsigned int get_user_insn(unsigned long tpc)
local_irq_disable();
 
pmdp = pmd_offset(pudp, tpc);
+again:
if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
goto out_irq_enable;
 
@@ -115,6 +116,8 @@ static unsigned int get_user_insn(unsigned long tpc)
 #endif
{
ptep = pte_offset_map(pmdp, tpc);
+   if (!ptep)
+   goto again;
pte = *ptep;
if (pte_present(pte)) {
pa  = (pte_pfn(pte) << PAGE_SHIFT);
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 9a725547578e..7ecf8556947a 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -149,6 +149,8 @@ static void tlb_batch_pmd_scan(struct mm_struct *mm, 
unsigned long vaddr,
pte_t *pte;
 
pte = pte_offset_map(&pmd, vaddr);
+   if (!pte)
+   return;
end = vaddr + HPAGE_SIZE;
while (vaddr < end) {
if (pte_val(*pte) & _PAGE_VALID) {
-- 
2.35.3



[PATCH v2 18/23] sparc/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().
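
For reference, the recently added helpers amount to something like this
sketch (see include/linux/hugetlb.h in the earlier series for the
authoritative definitions):

	#define pte_offset_huge(pmd, address)	pte_offset_kernel(pmd, address)
	#define pte_alloc_huge(mm, pmd, address) \
		(pte_alloc(mm, pmd) ? NULL : pte_offset_huge(pmd, address))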

Signed-off-by: Hugh Dickins 
---
 arch/sparc/mm/hugetlbpage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index d8e0e3c7038d..d7018823206c 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -298,7 +298,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
return NULL;
if (sz >= PMD_SIZE)
return (pte_t *)pmd;
-   return pte_alloc_map(mm, pmd, addr);
+   return pte_alloc_huge(mm, pmd, addr);
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
@@ -325,7 +325,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
return NULL;
if (is_hugetlb_pmd(*pmd))
return (pte_t *)pmd;
-   return pte_offset_map(pmd, addr);
+   return pte_offset_huge(pmd, addr);
 }
 
 void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
-- 
2.35.3



[PATCH v2 17/23] sh/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins 
---
 arch/sh/mm/hugetlbpage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 999ab5916e69..6cb0ad73dbb9 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -38,7 +38,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (pud) {
pmd = pmd_alloc(mm, pud, addr);
if (pmd)
-   pte = pte_alloc_map(mm, pmd, addr);
+   pte = pte_alloc_huge(mm, pmd, addr);
}
}
}
@@ -63,7 +63,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
if (pud) {
pmd = pmd_offset(pud, addr);
if (pmd)
-   pte = pte_offset_map(pmd, addr);
+   pte = pte_offset_huge(pmd, addr);
}
}
}
-- 
2.35.3



[PATCH v2 16/23] s390: gmap use pte_unmap_unlock() not spin_unlock()

2023-06-08 Thread Hugh Dickins
pte_alloc_map_lock() expects to be followed by pte_unmap_unlock(): to
keep balance in future, pass ptep as well as ptl to gmap_pte_op_end(),
and use pte_unmap_unlock() instead of direct spin_unlock() (even though
ptep ends up unused inside the macro).

Signed-off-by: Hugh Dickins 
Acked-by: Alexander Gordeev 
---
 arch/s390/mm/gmap.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 3a2a31a15ea8..f4b6fc746fce 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -895,12 +895,12 @@ static int gmap_pte_op_fixup(struct gmap *gmap, unsigned 
long gaddr,
 
 /**
  * gmap_pte_op_end - release the page table lock
- * @ptl: pointer to the spinlock pointer
+ * @ptep: pointer to the locked pte
+ * @ptl: pointer to the page table spinlock
  */
-static void gmap_pte_op_end(spinlock_t *ptl)
+static void gmap_pte_op_end(pte_t *ptep, spinlock_t *ptl)
 {
-   if (ptl)
-   spin_unlock(ptl);
+   pte_unmap_unlock(ptep, ptl);
 }
 
 /**
@@ -1011,7 +1011,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned 
long gaddr,
 {
int rc;
pte_t *ptep;
-   spinlock_t *ptl = NULL;
+   spinlock_t *ptl;
unsigned long pbits = 0;
 
if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
@@ -1025,7 +1025,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned 
long gaddr,
pbits |= (bits & GMAP_NOTIFY_SHADOW) ? PGSTE_VSIE_BIT : 0;
/* Protect and unlock. */
rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);
-   gmap_pte_op_end(ptl);
+   gmap_pte_op_end(ptep, ptl);
return rc;
 }
 
@@ -1154,7 +1154,7 @@ int gmap_read_table(struct gmap *gmap, unsigned long 
gaddr, unsigned long *val)
/* Do *NOT* clear the _PAGE_INVALID bit! */
rc = 0;
}
-   gmap_pte_op_end(ptl);
+   gmap_pte_op_end(ptep, ptl);
}
if (!rc)
break;
@@ -1248,7 +1248,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned 
long raddr,
if (!rc)
gmap_insert_rmap(sg, vmaddr, rmap);
spin_unlock(&sg->guest_table_lock);
-   gmap_pte_op_end(ptl);
+   gmap_pte_op_end(ptep, ptl);
}
radix_tree_preload_end();
if (rc) {
@@ -2156,7 +2156,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long 
saddr, pte_t pte)
tptep = (pte_t *) gmap_table_walk(sg, saddr, 0);
if (!tptep) {
spin_unlock(&sg->guest_table_lock);
-   gmap_pte_op_end(ptl);
+   gmap_pte_op_end(sptep, ptl);
radix_tree_preload_end();
break;
}
@@ -2167,7 +2167,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long 
saddr, pte_t pte)
rmap = NULL;
rc = 0;
}
-   gmap_pte_op_end(ptl);
+   gmap_pte_op_end(sptep, ptl);
spin_unlock(&sg->guest_table_lock);
}
radix_tree_preload_end();
@@ -2495,7 +2495,7 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned 
long bitmap[4],
continue;
if (ptep_test_and_clear_uc(gmap->mm, vmaddr, ptep))
set_bit(i, bitmap);
-   spin_unlock(ptl);
+   pte_unmap_unlock(ptep, ptl);
}
}
gmap_pmd_op_end(gmap, pmdp);
-- 
2.35.3



[PATCH v2 15/23] s390: allow pte_offset_map_lock() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Add comment on mm's contract with s390 above __zap_zero_pages(),
and fix old comment there: must be called after THP was disabled.

Signed-off-by: Hugh Dickins 
---
 arch/s390/kernel/uv.c  |  2 ++
 arch/s390/mm/gmap.c|  9 -
 arch/s390/mm/pgtable.c | 12 +---
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index cb2ee06df286..3c62d1b218b1 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -294,6 +294,8 @@ int gmap_make_secure(struct gmap *gmap, unsigned long 
gaddr, void *uvcb)
 
rc = -ENXIO;
ptep = get_locked_pte(gmap->mm, uaddr, &ptl);
+   if (!ptep)
+   goto out;
if (pte_present(*ptep) && !(pte_val(*ptep) & _PAGE_INVALID) && 
pte_write(*ptep)) {
page = pte_page(*ptep);
rc = -EAGAIN;
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index dc90d1eb0d55..3a2a31a15ea8 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2537,7 +2537,12 @@ static inline void thp_split_mm(struct mm_struct *mm)
  * Remove all empty zero pages from the mapping for lazy refaulting
  * - This must be called after mm->context.has_pgste is set, to avoid
  *   future creation of zero pages
- * - This must be called after THP was enabled
+ * - This must be called after THP was disabled.
+ *
+ * mm contracts with s390, that even if mm were to remove a page table,
+ * racing with the loop below and so causing pte_offset_map_lock() to fail,
+ * it will never insert a page table containing empty zero pages once
+ * mm_forbids_zeropage(mm) i.e. mm->context.has_pgste is set.
  */
 static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
   unsigned long end, struct mm_walk *walk)
@@ -2549,6 +2554,8 @@ static int __zap_zero_pages(pmd_t *pmd, unsigned long 
start,
spinlock_t *ptl;
 
ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+   if (!ptep)
+   break;
if (is_zero_pfn(pte_pfn(*ptep)))
ptep_xchg_direct(walk->mm, addr, ptep, 
__pte(_PAGE_INVALID));
pte_unmap_unlock(ptep, ptl);
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6effb24de6d9..3bd2ab2a9a34 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -829,7 +829,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned 
long addr,
default:
return -EFAULT;
}
-
+again:
ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
@@ -850,6 +850,8 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned 
long addr,
spin_unlock(ptl);
 
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+   if (!ptep)
+   goto again;
new = old = pgste_get_lock(ptep);
pgste_val(new) &= ~(PGSTE_GR_BIT | PGSTE_GC_BIT |
PGSTE_ACC_BITS | PGSTE_FP_BIT);
@@ -938,7 +940,7 @@ int reset_guest_reference_bit(struct mm_struct *mm, 
unsigned long addr)
default:
return -EFAULT;
}
-
+again:
ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
@@ -955,6 +957,8 @@ int reset_guest_reference_bit(struct mm_struct *mm, 
unsigned long addr)
spin_unlock(ptl);
 
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+   if (!ptep)
+   goto again;
new = old = pgste_get_lock(ptep);
/* Reset guest reference bit only */
pgste_val(new) &= ~PGSTE_GR_BIT;
@@ -1000,7 +1004,7 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned 
long addr,
default:
return -EFAULT;
}
-
+again:
ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
@@ -1017,6 +1021,8 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned 
long addr,
spin_unlock(ptl);
 
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+   if (!ptep)
+   goto again;
pgste = pgste_get_lock(ptep);
*key = (pgste_val(pgste) & (PGSTE_ACC_BITS | PGSTE_FP_BIT)) >> 56;
paddr = pte_val(*ptep) & PAGE_MASK;
-- 
2.35.3



[PATCH v2 14/23] riscv/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins 
Reviewed-by: Alexandre Ghiti 
Acked-by: Palmer Dabbelt 
---
 arch/riscv/mm/hugetlbpage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index e0ef56dc57b9..542883b3b49b 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -67,7 +67,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 
for_each_napot_order(order) {
if (napot_cont_size(order) == sz) {
-   pte = pte_alloc_map(mm, pmd, addr & 
napot_cont_mask(order));
+   pte = pte_alloc_huge(mm, pmd, addr & 
napot_cont_mask(order));
break;
}
}
@@ -114,7 +114,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
 
for_each_napot_order(order) {
if (napot_cont_size(order) == sz) {
-   pte = pte_offset_kernel(pmd, addr & 
napot_cont_mask(order));
+   pte = pte_offset_huge(pmd, addr & 
napot_cont_mask(order));
break;
}
}
-- 
2.35.3



[PATCH v2 13/23] powerpc/hugetlb: pte_alloc_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead.  huge_pte_offset() is using __find_linux_pte(), which is using
pte_offset_kernel() - don't rename that to _huge, it's more complicated.

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/mm/hugetlbpage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index b900933507da..f7c683b672c1 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -183,7 +183,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
return NULL;
 
if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
-   return pte_alloc_map(mm, (pmd_t *)hpdp, addr);
+   return pte_alloc_huge(mm, (pmd_t *)hpdp, addr);
 
BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
 
-- 
2.35.3



[PATCH v2 12/23] powerpc: allow pte_offset_map[_lock]() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
Balance successful pte_offset_map() with pte_unmap() where omitted.

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/mm/book3s64/hash_tlb.c | 4 
 arch/powerpc/mm/book3s64/subpage_prot.c | 2 ++
 arch/powerpc/xmon/xmon.c| 5 -
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c 
b/arch/powerpc/mm/book3s64/hash_tlb.c
index a64ea0a7ef96..21fcad97ae80 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -239,12 +239,16 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, 
pmd_t *pmd, unsigned long
local_irq_save(flags);
arch_enter_lazy_mmu_mode();
start_pte = pte_offset_map(pmd, addr);
+   if (!start_pte)
+   goto out;
for (pte = start_pte; pte < start_pte + PTRS_PER_PTE; pte++) {
unsigned long pteval = pte_val(*pte);
if (pteval & H_PAGE_HASHPTE)
hpte_need_flush(mm, addr, pte, pteval, 0);
addr += PAGE_SIZE;
}
+   pte_unmap(start_pte);
+out:
arch_leave_lazy_mmu_mode();
local_irq_restore(flags);
 }
diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c 
b/arch/powerpc/mm/book3s64/subpage_prot.c
index b75a9fb99599..0dc85556dec5 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -71,6 +71,8 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned 
long addr,
if (pmd_none(*pmd))
return;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+   if (!pte)
+   return;
arch_enter_lazy_mmu_mode();
for (; npages > 0; --npages) {
pte_update(mm, addr, pte, 0, 0, 0);
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 70c4c59a1a8f..fae747cc57d2 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -3376,12 +3376,15 @@ static void show_pte(unsigned long addr)
printf("pmdp @ 0x%px = 0x%016lx\n", pmdp, pmd_val(*pmdp));
 
ptep = pte_offset_map(pmdp, addr);
-   if (pte_none(*ptep)) {
+   if (!ptep || pte_none(*ptep)) {
+   if (ptep)
+   pte_unmap(ptep);
printf("no valid PTE\n");
return;
}
 
format_pte(ptep, pte_val(*ptep));
+   pte_unmap(ptep);
 
sync();
__delay(200);
-- 
2.35.3



[PATCH v2 11/23] powerpc: kvmppc_unmap_free_pmd() pte_offset_kernel()

2023-06-08 Thread Hugh Dickins
kvmppc_unmap_free_pmd() use pte_offset_kernel(), like everywhere else
in book3s_64_mmu_radix.c: instead of pte_offset_map(), which will come
to need a pte_unmap() to balance it.

But note that this is a more complex case than most: see those -EAGAINs
in kvmppc_create_pte(), which is coping with kvmppc races between page
table and huge entry, of the kind which we are expecting to address
in pte_offset_map() - this might want to be revisited in future.

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 461307b89c3a..572707858d65 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -509,7 +509,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t 
*pmd, bool full,
} else {
pte_t *pte;
 
-   pte = pte_offset_map(p, 0);
+   pte = pte_offset_kernel(p, 0);
kvmppc_unmap_free_pte(kvm, pte, full, lpid);
pmd_clear(p);
}
-- 
2.35.3



[PATCH v2 10/23] parisc/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins 
---
 arch/parisc/mm/hugetlbpage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index d1d3990b83f6..a8a1a7c1e16e 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -66,7 +66,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (pud) {
pmd = pmd_alloc(mm, pud, addr);
if (pmd)
-   pte = pte_alloc_map(mm, pmd, addr);
+   pte = pte_alloc_huge(mm, pmd, addr);
}
return pte;
 }
@@ -90,7 +90,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
if (!pud_none(*pud)) {
pmd = pmd_offset(pud, addr);
if (!pmd_none(*pmd))
-   pte = pte_offset_map(pmd, addr);
+   pte = pte_offset_huge(pmd, addr);
}
}
}
-- 
2.35.3



[PATCH v2 09/23] parisc: unmap_uncached_pte() use pte_offset_kernel()

2023-06-08 Thread Hugh Dickins
unmap_uncached_pte() is working from pgd_offset_k(vaddr), so it should
use pte_offset_kernel() instead of pte_offset_map(), to avoid the
question of whether a pte_unmap() will be needed to balance.

Signed-off-by: Hugh Dickins 
---
 arch/parisc/kernel/pci-dma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index 71ed5391f29d..415f12d5bab3 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -164,7 +164,7 @@ static inline void unmap_uncached_pte(pmd_t * pmd, unsigned 
long vaddr,
pmd_clear(pmd);
return;
}
-   pte = pte_offset_map(pmd, vaddr);
+   pte = pte_offset_kernel(pmd, vaddr);
vaddr &= ~PMD_MASK;
end = vaddr + size;
if (end > PMD_SIZE)
-- 
2.35.3



[PATCH v2 08/23] parisc: add pte_unmap() to balance get_ptep()

2023-06-08 Thread Hugh Dickins
To keep balance in future, remember to pte_unmap() after a successful
get_ptep().  And act as if flush_cache_pages() really needs a map there,
to read the pfn before "unmapping", to be sure page table is not removed.

Signed-off-by: Hugh Dickins 
---
 arch/parisc/kernel/cache.c | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index ca4a302d4365..501160250bb7 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -426,10 +426,15 @@ void flush_dcache_page(struct page *page)
offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
addr = mpnt->vm_start + offset;
if (parisc_requires_coherency()) {
+   bool needs_flush = false;
pte_t *ptep;
 
ptep = get_ptep(mpnt->vm_mm, addr);
-   if (ptep && pte_needs_flush(*ptep))
+   if (ptep) {
+   needs_flush = pte_needs_flush(*ptep);
+   pte_unmap(ptep);
+   }
+   if (needs_flush)
flush_user_cache_page(mpnt, addr);
} else {
/*
@@ -561,14 +566,20 @@ EXPORT_SYMBOL(flush_kernel_dcache_page_addr);
 static void flush_cache_page_if_present(struct vm_area_struct *vma,
unsigned long vmaddr, unsigned long pfn)
 {
-   pte_t *ptep = get_ptep(vma->vm_mm, vmaddr);
+   bool needs_flush = false;
+   pte_t *ptep;
 
/*
 * The pte check is racy and sometimes the flush will trigger
 * a non-access TLB miss. Hopefully, the page has already been
 * flushed.
 */
-   if (ptep && pte_needs_flush(*ptep))
+   ptep = get_ptep(vma->vm_mm, vmaddr);
+   if (ptep) {
+   needs_flush = pte_needs_flush(*ptep);
+   pte_unmap(ptep);
+   }
+   if (needs_flush)
flush_cache_page(vma, vmaddr, pfn);
 }
 
@@ -635,17 +646,22 @@ static void flush_cache_pages(struct vm_area_struct *vma, 
unsigned long start, u
pte_t *ptep;
 
for (addr = start; addr < end; addr += PAGE_SIZE) {
+   bool needs_flush = false;
/*
 * The vma can contain pages that aren't present. Although
 * the pte search is expensive, we need the pte to find the
 * page pfn and to check whether the page should be flushed.
 */
ptep = get_ptep(vma->vm_mm, addr);
-   if (ptep && pte_needs_flush(*ptep)) {
+   if (ptep) {
+   needs_flush = pte_needs_flush(*ptep);
+   pfn = pte_pfn(*ptep);
+   pte_unmap(ptep);
+   }
+   if (needs_flush) {
if (parisc_requires_coherency()) {
flush_user_cache_page(vma, addr);
} else {
-   pfn = pte_pfn(*ptep);
if (WARN_ON(!pfn_valid(pfn)))
return;
__flush_cache_page(vma, addr, PFN_PHYS(pfn));
-- 
2.35.3



[PATCH v2 07/23] mips: update_mmu_cache() can replace __update_tlb()

2023-06-08 Thread Hugh Dickins
Don't make update_mmu_cache() a wrapper around __update_tlb(): call it
directly, and use the ptep (or pmdp) provided by the caller, instead of
re-calling pte_offset_map() - which would raise a question of whether a
pte_unmap() is needed to balance it.

Check whether the "ptep" provided by the caller is actually the pmdp,
instead of testing pmd_huge(): or test pmd_huge() too and warn if it
disagrees?  This is "hazardous" territory: needs review and testing.

Signed-off-by: Hugh Dickins 
---
 arch/mips/include/asm/pgtable.h | 15 +++
 arch/mips/mm/tlb-r3k.c  |  5 +++--
 arch/mips/mm/tlb-r4k.c  |  9 +++--
 3 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 574fa14ac8b2..9175dfab08d5 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -565,15 +565,8 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 }
 #endif
 
-extern void __update_tlb(struct vm_area_struct *vma, unsigned long address,
-   pte_t pte);
-
-static inline void update_mmu_cache(struct vm_area_struct *vma,
-   unsigned long address, pte_t *ptep)
-{
-   pte_t pte = *ptep;
-   __update_tlb(vma, address, pte);
-}
+extern void update_mmu_cache(struct vm_area_struct *vma,
+   unsigned long address, pte_t *ptep);
 
 #define__HAVE_ARCH_UPDATE_MMU_TLB
 #define update_mmu_tlb update_mmu_cache
@@ -581,9 +574,7 @@ static inline void update_mmu_cache(struct vm_area_struct 
*vma,
 static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp)
 {
-   pte_t pte = *(pte_t *)pmdp;
-
-   __update_tlb(vma, address, pte);
+   update_mmu_cache(vma, address, (pte_t *)pmdp);
 }
 
 /*
diff --git a/arch/mips/mm/tlb-r3k.c b/arch/mips/mm/tlb-r3k.c
index 53dfa2b9316b..e5722cd8dd6d 100644
--- a/arch/mips/mm/tlb-r3k.c
+++ b/arch/mips/mm/tlb-r3k.c
@@ -176,7 +176,8 @@ void local_flush_tlb_page(struct vm_area_struct *vma, 
unsigned long page)
}
 }
 
-void __update_tlb(struct vm_area_struct *vma, unsigned long address, pte_t pte)
+void update_mmu_cache(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep)
 {
unsigned long asid_mask = cpu_asid_mask(&current_cpu_data);
unsigned long flags;
@@ -203,7 +204,7 @@ void __update_tlb(struct vm_area_struct *vma, unsigned long 
address, pte_t pte)
BARRIER;
tlb_probe();
idx = read_c0_index();
-   write_c0_entrylo0(pte_val(pte));
+   write_c0_entrylo0(pte_val(*ptep));
write_c0_entryhi(address | pid);
if (idx < 0) {  /* BARRIER */
tlb_write_random();
diff --git a/arch/mips/mm/tlb-r4k.c b/arch/mips/mm/tlb-r4k.c
index 1b939abbe4ca..c96725d17cab 100644
--- a/arch/mips/mm/tlb-r4k.c
+++ b/arch/mips/mm/tlb-r4k.c
@@ -290,14 +290,14 @@ void local_flush_tlb_one(unsigned long page)
  * updates the TLB with the new pte(s), and another which also checks
  * for the R4k "end of page" hardware bug and does the needy.
  */
-void __update_tlb(struct vm_area_struct * vma, unsigned long address, pte_t 
pte)
+void update_mmu_cache(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep)
 {
unsigned long flags;
pgd_t *pgdp;
p4d_t *p4dp;
pud_t *pudp;
pmd_t *pmdp;
-   pte_t *ptep;
int idx, pid;
 
/*
@@ -326,10 +326,9 @@ void __update_tlb(struct vm_area_struct * vma, unsigned 
long address, pte_t pte)
idx = read_c0_index();
 #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
/* this could be a huge page  */
-   if (pmd_huge(*pmdp)) {
+   if (ptep == (pte_t *)pmdp) {
unsigned long lo;
write_c0_pagemask(PM_HUGE_MASK);
-   ptep = (pte_t *)pmdp;
lo = pte_to_entrylo(pte_val(*ptep));
write_c0_entrylo0(lo);
write_c0_entrylo1(lo + (HPAGE_SIZE >> 7));
@@ -344,8 +343,6 @@ void __update_tlb(struct vm_area_struct * vma, unsigned 
long address, pte_t pte)
} else
 #endif
{
-   ptep = pte_offset_map(pmdp, address);
-
 #if defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)
 #ifdef CONFIG_XPA
write_c0_entrylo0(pte_to_entrylo(ptep->pte_high));
-- 
2.35.3



[PATCH v2 06/23] microblaze: allow pte_offset_map() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins 
---
 arch/microblaze/kernel/signal.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/microblaze/kernel/signal.c b/arch/microblaze/kernel/signal.c
index c3aebec71c0c..c78a0ff48066 100644
--- a/arch/microblaze/kernel/signal.c
+++ b/arch/microblaze/kernel/signal.c
@@ -194,7 +194,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t 
*set,
 
preempt_disable();
ptep = pte_offset_map(pmdp, address);
-   if (pte_present(*ptep)) {
+   if (ptep && pte_present(*ptep)) {
address = (unsigned long) page_address(pte_page(*ptep));
/* MS: I need add offset in page */
address += ((unsigned long)frame->tramp) & ~PAGE_MASK;
@@ -203,7 +203,8 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t 
*set,
invalidate_icache_range(address, address + 8);
flush_dcache_range(address, address + 8);
}
-   pte_unmap(ptep);
+   if (ptep)
+   pte_unmap(ptep);
preempt_enable();
if (err)
return -EFAULT;
-- 
2.35.3



[PATCH v2 05/23] m68k: allow pte_offset_map[_lock]() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Restructure cf_tlb_miss() with a pte_unmap() (previously omitted)
at label out, followed by one local_irq_restore() for all.

Signed-off-by: Hugh Dickins 
---
 arch/m68k/include/asm/mmu_context.h |  6 ++--
 arch/m68k/kernel/sys_m68k.c |  2 ++
 arch/m68k/mm/mcfmmu.c   | 52 -
 3 files changed, 27 insertions(+), 33 deletions(-)

diff --git a/arch/m68k/include/asm/mmu_context.h 
b/arch/m68k/include/asm/mmu_context.h
index 8ed6ac14d99f..141bbdfad960 100644
--- a/arch/m68k/include/asm/mmu_context.h
+++ b/arch/m68k/include/asm/mmu_context.h
@@ -99,7 +99,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte;
+   pte_t *pte = NULL;
unsigned long mmuar;
 
local_irq_save(flags);
@@ -139,7 +139,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
 
pte = (mmuar >= PAGE_OFFSET) ? pte_offset_kernel(pmd, mmuar)
 : pte_offset_map(pmd, mmuar);
-   if (pte_none(*pte) || !pte_present(*pte))
+   if (!pte || pte_none(*pte) || !pte_present(*pte))
goto bug;
 
set_pte(pte, pte_mkyoung(*pte));
@@ -161,6 +161,8 @@ static inline void load_ksp_mmu(struct task_struct *task)
 bug:
pr_info("ksp load failed: mm=0x%p ksp=0x08%lx\n", mm, mmuar);
 end:
+   if (pte && mmuar < PAGE_OFFSET)
+   pte_unmap(pte);
local_irq_restore(flags);
 }
 
diff --git a/arch/m68k/kernel/sys_m68k.c b/arch/m68k/kernel/sys_m68k.c
index bd0274c7592e..c586034d2a7a 100644
--- a/arch/m68k/kernel/sys_m68k.c
+++ b/arch/m68k/kernel/sys_m68k.c
@@ -488,6 +488,8 @@ sys_atomic_cmpxchg_32(unsigned long newval, int oldval, int d3, int d4, int d5,
if (!pmd_present(*pmd))
goto bad_access;
pte = pte_offset_map_lock(mm, pmd, (unsigned long)mem, &ptl);
+   if (!pte)
+   goto bad_access;
if (!pte_present(*pte) || !pte_dirty(*pte)
|| !pte_write(*pte)) {
pte_unmap_unlock(pte, ptl);
diff --git a/arch/m68k/mm/mcfmmu.c b/arch/m68k/mm/mcfmmu.c
index 70aa0979e027..42f45abea37a 100644
--- a/arch/m68k/mm/mcfmmu.c
+++ b/arch/m68k/mm/mcfmmu.c
@@ -91,7 +91,8 @@ int cf_tlb_miss(struct pt_regs *regs, int write, int dtlb, int extension_word)
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
-   pte_t *pte;
+   pte_t *pte = NULL;
+   int ret = -1;
int asid;
 
local_irq_save(flags);
@@ -100,47 +101,33 @@ int cf_tlb_miss(struct pt_regs *regs, int write, int dtlb, int extension_word)
regs->pc + (extension_word * sizeof(long));
 
mm = (!user_mode(regs) && KMAPAREA(mmuar)) ? &init_mm : current->mm;
-   if (!mm) {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (!mm)
+   goto out;
 
pgd = pgd_offset(mm, mmuar);
-   if (pgd_none(*pgd))  {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (pgd_none(*pgd))
+   goto out;
 
p4d = p4d_offset(pgd, mmuar);
-   if (p4d_none(*p4d)) {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (p4d_none(*p4d))
+   goto out;
 
pud = pud_offset(p4d, mmuar);
-   if (pud_none(*pud)) {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (pud_none(*pud))
+   goto out;
 
pmd = pmd_offset(pud, mmuar);
-   if (pmd_none(*pmd)) {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (pmd_none(*pmd))
+   goto out;
 
pte = (KMAPAREA(mmuar)) ? pte_offset_kernel(pmd, mmuar)
: pte_offset_map(pmd, mmuar);
-   if (pte_none(*pte) || !pte_present(*pte)) {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (!pte || pte_none(*pte) || !pte_present(*pte))
+   goto out;
 
if (write) {
-   if (!pte_write(*pte)) {
-   local_irq_restore(flags);
-   return -1;
-   }
+   if (!pte_write(*pte))
+   goto out;
set_pte(pte, pte_mkdirty(*pte));
}
 
@@ -161,9 +148,12 @@ int cf_tlb_miss(struct pt_regs *regs, int write, int dtlb, int extension_word)
mmu_write(MMUOR, MMUOR_ACC | MMUOR_UAA);
else
mmu_write(MMUOR, MMUOR_ITLB | MMUOR_ACC | MMUOR_UAA);
-
+   ret = 0;
+out:
+   if (pte && !KMAPAREA(mmuar))
+   pte_unmap(pte);
local_irq_restore(flags);
-   r

[PATCH v2 04/23] ia64/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins 
---
 arch/ia64/mm/hugetlbpage.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index 78a02e026164..adc49f2d22e8 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -41,7 +41,7 @@ huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
if (pud) {
pmd = pmd_alloc(mm, pud, taddr);
if (pmd)
-   pte = pte_alloc_map(mm, pmd, taddr);
+   pte = pte_alloc_huge(mm, pmd, taddr);
}
return pte;
 }
@@ -64,7 +64,7 @@ huge_pte_offset (struct mm_struct *mm, unsigned long addr, unsigned long sz)
if (pud_present(*pud)) {
pmd = pmd_offset(pud, taddr);
if (pmd_present(*pmd))
-   pte = pte_offset_map(pmd, taddr);
+   pte = pte_offset_huge(pmd, taddr);
}
}
}
-- 
2.35.3



[PATCH v2 03/23] arm64/hugetlb: pte_alloc_huge() pte_offset_huge()

2023-06-08 Thread Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins 
Acked-by: Catalin Marinas 
---
 arch/arm64/mm/hugetlbpage.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 95364e8bdc19..21716c940682 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -307,14 +307,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
return NULL;
 
WARN_ON(addr & (sz - 1));
-   /*
-* Note that if this code were ever ported to the
-* 32-bit arm platform then it will cause trouble in
-* the case where CONFIG_HIGHPTE is set, since there
-* will be no pte_unmap() to correspond with this
-* pte_alloc_map().
-*/
-   ptep = pte_alloc_map(mm, pmdp, addr);
+   ptep = pte_alloc_huge(mm, pmdp, addr);
} else if (sz == PMD_SIZE) {
if (want_pmd_share(vma, addr) && pud_none(READ_ONCE(*pudp)))
ptep = huge_pmd_share(mm, vma, addr, pudp);
@@ -366,7 +359,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
return (pte_t *)pmdp;
 
if (sz == CONT_PTE_SIZE)
-   return pte_offset_kernel(pmdp, (addr & CONT_PTE_MASK));
+   return pte_offset_huge(pmdp, (addr & CONT_PTE_MASK));
 
return NULL;
 }
-- 
2.35.3



[PATCH v2 02/23] arm64: allow pte_offset_map() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins 
Acked-by: Catalin Marinas 
---
 arch/arm64/mm/fault.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index cb21ccd7940d..f3aaba853547 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -177,6 +177,9 @@ static void show_pte(unsigned long addr)
break;
 
ptep = pte_offset_map(pmdp, addr);
+   if (!ptep)
+   break;
+
pte = READ_ONCE(*ptep);
pr_cont(", pte=%016llx", pte_val(pte));
pte_unmap(ptep);
-- 
2.35.3



[PATCH v2 01/23] arm: allow pte_offset_map[_lock]() to fail

2023-06-08 Thread Hugh Dickins
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins 
---
 arch/arm/lib/uaccess_with_memcpy.c | 3 +++
 arch/arm/mm/fault-armv.c   | 5 -
 arch/arm/mm/fault.c| 3 +++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/arm/lib/uaccess_with_memcpy.c b/arch/arm/lib/uaccess_with_memcpy.c
index e4c2677cc1e9..2f6163f05e93 100644
--- a/arch/arm/lib/uaccess_with_memcpy.c
+++ b/arch/arm/lib/uaccess_with_memcpy.c
@@ -74,6 +74,9 @@ pin_page_for_write(const void __user *_addr, pte_t **ptep, spinlock_t **ptlp)
return 0;
 
pte = pte_offset_map_lock(current->mm, pmd, addr, &ptl);
+   if (unlikely(!pte))
+   return 0;
+
if (unlikely(!pte_present(*pte) || !pte_young(*pte) ||
!pte_write(*pte) || !pte_dirty(*pte))) {
pte_unmap_unlock(pte, ptl);
diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 0e49154454a6..ca5302b0b7ee 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -117,8 +117,11 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 * must use the nested version.  This also means we need to
 * open-code the spin-locking.
 */
-   ptl = pte_lockptr(vma->vm_mm, pmd);
pte = pte_offset_map(pmd, address);
+   if (!pte)
+   return 0;
+
+   ptl = pte_lockptr(vma->vm_mm, pmd);
do_pte_lock(ptl);
 
ret = do_adjust_pte(vma, address, pfn, pte);
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 2418f1efabd8..83598649a094 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -85,6 +85,9 @@ void show_pte(const char *lvl, struct mm_struct *mm, unsigned long addr)
break;
 
pte = pte_offset_map(pmd, addr);
+   if (!pte)
+   break;
+
pr_cont(", *pte=%08llx", (long long)pte_val(*pte));
 #ifndef CONFIG_ARM_LPAE
pr_cont(", *ppte=%08llx",
-- 
2.35.3



[PATCH v2 00/23] arch: allow pte_offset_map[_lock]() to fail

2023-06-08 Thread Hugh Dickins
Here is v2 series of patches to various architectures, based on v6.4-rc5:
preparing for v2 of changes following in mm, affecting pte_offset_map()
and pte_offset_map_lock().  There are very few differences from v1:
noted patch by patch below.

v1 was "arch: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/77a5d8c-406b-7068-4f17-23b7ac53b...@google.com/
series of 23 posted on 2023-05-09,
followed by "mm: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/68a97fbe-5c1e-7ac6-72c-7b9c6290b...@google.com/
series of 31 posted on 2023-05-21,
followed by  "mm: free retracted page table by RCU"
https://lore.kernel.org/linux-mm/35e983f5-7ed3-b310-d949-9ae8b130c...@google.com/
series of 12 posted on 2023-05-28.

The first two series are "independent": neither depends
for build or correctness on the other, and the arch patches can either be
merged separately via arch trees, or be picked up by akpm; but both series
must be in before the third series is added to make the effective changes
(and that adds just a little more in arm, powerpc, s390 and sparc).

What is it all about?  Some mmap_lock avoidance i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs;
but likely to be relied upon later in other contexts e.g. freeing of
empty page tables (but that's not work I'm doing).  mmap_write_lock
avoidance when collapsing to anon THPs?  Perhaps, but again that's not
work I've done: a quick attempt was not as easy as the shmem/file case.

I would much prefer not to have to make these small but wide-ranging
changes for such a niche case; but failed to find another way, and
have heard that shmem MADV_COLLAPSE's usefulness is being limited by
that mmap_write_lock it currently requires.

These changes (though of course not these exact patches, and not all
of these architectures!) have been in Google's data centre kernel for
three years now: we do rely upon them.

What are the per-arch changes about?  Generally, two things.

One: the current mmap locking may not be enough to guard against that
tricky transition between pmd entry pointing to page table, and empty
pmd entry, and pmd entry pointing to huge page: pte_offset_map() will
have to validate the pmd entry for itself, returning NULL if no page
table is there.  What to do about that varies: often the nearby error
handling indicates just to skip it; but in some cases a "goto again"
looks appropriate (and if that risks an infinite loop, then there
must have been an oops, or pfn 0 mistaken for page table, before).
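
As a minimal sketch of that call-site pattern (a made-up caller, not taken
from any of the patches below; the right response to NULL varies per site):

static int example_walk_pte(struct mm_struct *mm, pmd_t *pmd,
                            unsigned long addr)
{
        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return -EAGAIN; /* no page table (any more): skip or retry */

        if (pte_present(*pte)) {
                /* ... operate on the entry while ptl is held ... */
        }
        pte_unmap_unlock(pte, ptl);
        return 0;
}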

Deeper study of each site might show that 90% of them here in arch
code could only fail if there's corruption e.g. a transition to THP
would be surprising on an arch without HAVE_ARCH_TRANSPARENT_HUGEPAGE.
But given the likely extension to freeing empty page tables, I have
not limited this set of changes to THP; and it has been easier, and
sets a better example, if each site is given appropriate handling.

Two: pte_offset_map() will need to do an rcu_read_lock(), with the
corresponding rcu_read_unlock() in pte_unmap().  But most architectures
never supported CONFIG_HIGHPTE, so some don't always call pte_unmap()
after pte_offset_map(), or have used userspace pte_offset_map() where
pte_offset_kernel() is more correct.  No problem in the current tree,
but a problem once an rcu_read_unlock() will be needed to keep balance.
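
Conceptually (a sketch of the pairing only, not the actual mm implementation
which the later series adds), pte_offset_map() and pte_unmap() are expected
to end up looking roughly like:

pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
{
        pmd_t pmdval;

        rcu_read_lock();
        pmdval = pmdp_get_lockless(pmd);        /* pmd may change under us */
        if (!pmd_present(pmdval) || pmd_leaf(pmdval) || pmd_bad(pmdval)) {
                rcu_read_unlock();
                return NULL;    /* none, huge or bad: no page table here */
        }
        return pte_offset_kernel(pmd, addr);    /* ignoring CONFIG_HIGHPTE */
}

void pte_unmap(pte_t *pte)
{
        rcu_read_unlock();
}

which is why a pte_offset_map() that succeeds must always reach the
matching pte_unmap().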

A common special case of that comes in arch/*/mm/hugetlbpage.c, if
the architecture supports hugetlb pages down at the lowest PTE level.
huge_pte_alloc() uses pte_alloc_map(), but generic hugetlb code does
no corresponding pte_unmap(); similarly for huge_pte_offset().
Thanks to Mike Kravetz and Andrew Morton, v6.4-rc1 already provides
pte_alloc_huge() and pte_offset_huge() to help fix up those cases.
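
For reference, those helpers amount to roughly the following (paraphrased
from <linux/hugetlb.h> - check the tree for the exact definitions); they
deliberately avoid the map/unmap pairing:

static inline pte_t *pte_offset_huge(pmd_t *pmd, unsigned long addr)
{
        return pte_offset_kernel(pmd, addr);
}

static inline pte_t *pte_alloc_huge(struct mm_struct *mm, pmd_t *pmd,
                                    unsigned long addr)
{
        return pte_alloc(mm, pmd) ? NULL : pte_offset_huge(pmd, addr);
}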

This posting is based on v6.4-rc5, but good for any v6.4-rc,
current mm-everything and linux-next.

01/23 arm: allow pte_offset_map[_lock]() to fail
  v2: same as v1
02/23 arm64: allow pte_offset_map() to fail
  v2: add ack from Catalin
03/23 arm64/hugetlb: pte_alloc_huge() pte_offset_huge()
  v2: add ack from Catalin
04/23 ia64/hugetlb: pte_alloc_huge() pte_offset_huge()
  v2: same as v1
05/23 m68k: allow pte_offset_map[_lock]() to fail
  v2: same as v1
06/23 microblaze: allow pte_offset_map() to fail
  v2: same as v1
07/23 mips: update_mmu_cache() can replace __update_tlb()
  v2: same as v1
08/23 parisc: add pte_unmap() to balance get_ptep()
  v2: typo fix from Helge; stronger commit message
09/23 parisc: unmap_uncached_pte() use pte_offset_kernel()
  v2: same as v1
10/23 parisc/hugetlb: pte_alloc_huge() pte_offset_huge()
  v2: same as v1
11/23 powerpc: kvmppc_unmap_free_pmd() pte_offset_kernel()
  v2: same as v1
12/23 powerpc: allow pte_offset_map[_lock]() to fail
  v2: same as v1
13/23 powerpc/hugetlb: pte_alloc_huge()
  v2: same as v1
14/23 riscv/hugetlb: pte_alloc_huge() pte_offset_huge()
  v2: add review from 

Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-07 Thread Hugh Dickins
On Tue, 6 Jun 2023, Gerald Schaefer wrote:
> On Mon, 5 Jun 2023 22:11:52 -0700 (PDT)
> Hugh Dickins  wrote:
> > On Thu, 1 Jun 2023 15:57:51 +0200
> > Gerald Schaefer  wrote:
> > > 
> > > Yes, we have 2 pagetables in one 4K page, which could result in same
> > > rcu_head reuse. It might be possible to use the cleverness from our
> > > page_table_free() function, e.g. to only do the call_rcu() once, for
> > > the case where both 2K pagetable fragments become unused, similar to
> > > how we decide when to actually call __free_page().
> > > 
> > > However, it might be much worse, and page->rcu_head from a pagetable
> > > page cannot be used at all for s390, because we also use page->lru
> > > to keep our list of free 2K pagetable fragments. I always get confused
> > > by struct page unions, so not completely sure, but it seems to me that
> > > page->rcu_head would overlay with page->lru, right?  
> > 
> > Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
> > I'm wrong) I think that s390 could use exactly the same technique for
> > its list of free 2K pagetable fragments as it uses for its list of THP
> > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > the first two longs of the page table itself for threading the list.
> 
> Nice idea, I think that could actually work, since we only need the empty
> 2K halves on the list. So it should be possible to store the list_head
> inside those.

Jason quickly pointed out the flaw in my thinking there.

> 
> > 
> > And while it could use third and fourth longs instead, I don't see any
> > need for that: a deposited pagetable has been allocated, so would not
> > be on the list of free fragments.
> 
> Correct, that should not interfere.
> 
> > 
> > Below is one of the grossest patches I've ever posted: gross because
> > it's a rushed attempt to see whether that is viable, while it would take
> > me longer to understand all the s390 cleverness there (even though the
> > PP AA commentary above page_table_alloc() is excellent).
> 
> Sounds fair, this is also one of the grossest code we have, which is also
> why Alexander added the comment. I guess we could need even more comments
> inside the code, as it still confuses me more than it should.
> 
> Considering that, you did remarkably well. Your patch seems to work fine,
> at least it survived some LTP mm tests. I will also add it to our CI runs,
> to give it some more testing. Will report tomorrow when it broke something.
> See also below for some patch comments.

Many thanks for your effort on this patch.  I don't expect the testing
of it to catch Jason's point, that I'm corrupting the page table while
it's on its way through RCU to being freed, but he's right nonetheless.

I'll integrate your fixes below into what I have here, but probably
just archive it as something to refer to later in case it might play
a part; but probably it will not - sorry for wasting your time.

> 
> > 
> > I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
> > And cmma_init_nodat()? Ah, that's __init so I guess disjoint.
> 
> cmma_init_nodat() should be disjoint, not only because it is __init,
> but also because it explicitly skips pagetable pages, so it should
> never touch page->lru of those.
> 
> Not very familiar with the gmap code, it does look disjoint, and we should
> also use complete 4K pages for pagetables instead of 2K fragments there,
> but Christian or Claudio should also have a look.
> 
> > 
> > Gerald, s390 folk: would it be possible for you to give this
> > a try, suggest corrections and improvements, and then I can make it
> > a separate patch of the series; and work on avoiding concurrent use
> > of the rcu_head by pagetable fragment buddies (ideally fit in with
> > the scheme already there, maybe DD bits to go along with the PP AA).
> 
> It feels like it could be possible to not only avoid the double
> rcu_head, but also avoid passing over the mm via page->pt_mm.
> I.e. have pte_free_defer(), which has the mm, do all the checks and
> list updates that page_table_free() does, for which we need the mm.
> Then just skip the pgtable_pte_page_dtor() + __free_page() at the end,
> and do call_rcu(pte_free_now) instead. The pte_free_now() could then
> just do _dtor/__free_page similar to the generic version.

I'm not sure: I missed your suggestion there when I first skimmed
through, and today have spent more time getting deeper into how it's
done at present.  I am now feeling more confident of a way forward,
a nicely integrated way forward, than I was yesterday.
Thou

Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-07 Thread Hugh Dickins
On Tue, 6 Jun 2023, Jason Gunthorpe wrote:
> On Mon, Jun 05, 2023 at 10:11:52PM -0700, Hugh Dickins wrote:
> 
> > "deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
> > the first two longs of the page table itself for threading the list.
> 
> It is not RCU anymore if it writes to the page table itself before the
> grace period, so this change seems to break the RCU behavior of
> page_table_free_rcu().. The rcu sync is inside tlb_remove_table()
> called after the stores.

Yes indeed, thanks for pointing that out.

> 
> Maybe something like an xarray on the mm to hold the frags?

I think we can manage without that:
I'll say slightly more in reply to Gerald.

Hugh


Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-06-06 Thread Hugh Dickins
On Tue, 6 Jun 2023, Jason Gunthorpe wrote:
> On Tue, Jun 06, 2023 at 03:03:31PM -0400, Peter Xu wrote:
> > On Tue, Jun 06, 2023 at 03:23:30PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 05, 2023 at 08:40:01PM -0700, Hugh Dickins wrote:
> > > 
> > > > diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> > > > index 20652daa1d7e..e4f58c5fc2ac 100644
> > > > --- a/arch/powerpc/mm/pgtable-frag.c
> > > > +++ b/arch/powerpc/mm/pgtable-frag.c
> > > > @@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
> > > > __free_page(page);
> > > > }
> > > >  }
> > > > +
> > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > +#define PTE_FREE_DEFERRED 0x1 /* beyond any PTE_FRAG_NR */
> > > > +
> > > > +static void pte_free_now(struct rcu_head *head)
> > > > +{
> > > > +   struct page *page;
> > > > +   int refcount;
> > > > +
> > > > +   page = container_of(head, struct page, rcu_head);
> > > > +   refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
> > > > +    &page->pt_frag_refcount);
> > > > +   if (refcount < PTE_FREE_DEFERRED) {
> > > > +   pte_fragment_free((unsigned long *)page_address(page), 0);
> > > > +   return;
> > > > +   }
> > > 
> > > From what I can tell power doesn't recycle the sub fragment into any
> > > kind of free list. It just waits for the last fragment to be unused
> > > and then frees the whole page.

Yes, it's relatively simple in that way: not as sophisticated as s390.

> > > 
> > > So why not simply go into pte_fragment_free() and do the call_rcu directly:
> > > 
> > >   BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
> > >   if (atomic_dec_and_test(&page->pt_frag_refcount)) {
> > >   if (!kernel)
> > >   pgtable_pte_page_dtor(page);
> > >   call_rcu(&page->rcu_head, free_page_rcu)
> > 
> > We need to be careful on the lock being freed in pgtable_pte_page_dtor(),
> > in Hugh's series IIUC we need the spinlock being there for the rcu section
> > alongside the page itself.  So even if to do so we'll need to also rcu call 
> > pgtable_pte_page_dtor() when needed.

Thanks, Peter, yes that's right.

> 
> Er yes, I botched that, the dtor and the free_page should be in a the
> rcu callback function

But it was just a botched detail, and won't have answered Jason's doubt.

I had three (or perhaps it amounts to two) reasons for doing it this way:
none of which may seem good enough reasons to you.  Certainly I'd agree
that the way it's done seems... arcane.

One, as I've indicated before, I don't actually dare to go all
the way into RCU freeing of all page tables for powerpc (or any other):
I should think it's a good idea that everyone wants in the end, but I'm
limited by my time and competence - and dread of losing my way in the
mmu_gather TLB #ifdef maze.  It's work for someone else not me.

(pte_free_defer() do as you suggest, without changing pte_fragment_free()
itself?  No, that doesn't work out when defer does, say, the decrement of
pt_frag_refcount from 2 to 1, then pte_fragment_free() does the decrement
from 1 to 0: page freed without deferral.)

Two, this was the code I'd worked out before, and was used in production,
so I had confidence in it - it was just my mistake that I'd forgotten the
single rcu_head issue, and thought I could avoid it in the initial posting.
powerpc has changed around since then, but apparently not in any way that
affects this.  And it's too easy to agree in review that something can be
simpler, without bringing back to mind why the complications are there.

Three (just an explanation of why the old code was like this), powerpc
relies on THP's page table deposit+withdraw protocol, even for shmem/
file THPs.  I've skirted that issue in this series, by sticking with
retract_page_tables(), not attempting to insert huge pmd immediately.
But if huge pmd is inserted to replace ptetable pmd, then ptetable must
be deposited: pte_free_defer() as written protects the deposited ptetable
from then being freed without deferral (rather like in the example above).

But does not protect it from being withdrawn and reused within that
grace period.  Jann has grave doubts whether that can ever be allowed
(or perhaps I should grant him certainty, and examples that it cannot).
I did convince myself, back in the day, that it was safe here: but I'll
have to put in a lot more thought to re-justify it now, and on the way
may instead be completely persuaded by Jann.

Not very good reasons: good enough, or can you supply a better patch?

Thanks,
Hugh


Re: [PATCH 00/12] mm: free retracted page table by RCU

2023-06-06 Thread Hugh Dickins
On Fri, 2 Jun 2023, Jann Horn wrote:
> On Fri, Jun 2, 2023 at 6:37 AM Hugh Dickins  wrote:
> 
> > The most obvious vital thing (in the split ptlock case) is that it
> > remains a struct page with a usable ptl spinlock embedded in it.
> >
> > The question becomes more urgent when/if extending to replacing the
> > pagetable pmd by huge pmd in one go, without any mmap_lock: powerpc
> > wants to deposit the page table for later use even in the shmem/file
> > case (and all arches in the anon case): I did work out the details once
> > before, but I'm not sure whether I would still agree with myself; and was
> > glad to leave replacement out of this series, to revisit some time later.
> >
> > >
> > > So in particular, in handle_pte_fault() we can reach the "if
> > > (unlikely(!pte_same(*vmf->pte, entry)))" with vmf->pte pointing to a
> > > detached zeroed page table, but we're okay with that because in that
> > > case we know that !pte_none(vmf->orig_pte)&&pte_none(*vmf->pte) ,
> > > which implies !pte_same(*vmf->pte, entry) , which means we'll bail
> > > out?
> >
> > There is no current (even at end of series) circumstance in which we
> > could be pointing to a detached page table there; but yes, I want to
> > allow for that, and yes I agree with your analysis.
> 
> Hmm, what am I missing here?

I spent quite a while trying to reconstruct what I had been thinking,
what meaning of "detached" or "there" I had in mind when I asserted so
confidently "There is no current (even at end of series) circumstance
in which we could be pointing to a detached page table there".

But had to give up and get on with more useful work.
Of course you are right, and that is what this series is about.

Hugh

> 
> static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> {
>   pte_t entry;
> 
>   if (unlikely(pmd_none(*vmf->pmd))) {
> [not executed]
>   } else {
> /*
>  * A regular pmd is established and it can't morph into a huge
>  * pmd by anon khugepaged, since that takes mmap_lock in write
>  * mode; but shmem or file collapse to THP could still morph
>  * it into a huge pmd: just retry later if so.
>  */
> vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>  vmf->address, &vmf->ptl);
> if (unlikely(!vmf->pte))
>   [not executed]
> // this reads a present readonly PTE
> vmf->orig_pte = ptep_get_lockless(vmf->pte);
> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
> 
> if (pte_none(vmf->orig_pte)) {
>   [not executed]
> }
>   }
> 
>   [at this point, a concurrent THP collapse operation detaches the page table]
>   // vmf->pte now points into a detached page table
> 
>   if (!vmf->pte)
> [not executed]
> 
>   if (!pte_present(vmf->orig_pte))
> [not executed]
> 
>   if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> [not executed]
> 
>   spin_lock(vmf->ptl);
>   entry = vmf->orig_pte;
>   // vmf->pte still points into a detached page table
>   if (unlikely(!pte_same(*vmf->pte, entry))) {
> update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
> goto unlock;
>   }
>   [...]
> }

Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

2023-06-06 Thread Hugh Dickins
On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:25 AM Hugh Dickins  wrote:
> > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
...
> > +* Note that vma->anon_vma check is racy: it can be set after
> > +* the check, but page locks (with XA_RETRY_ENTRYs in holes)
> > +* prevented establishing new ptes of the page. So we are safe
> > +* to remove page table below, without even checking it's empty.
> 
> This "we are safe to remove page table below, without even checking
> it's empty" assumes that the only way to create new anonymous PTEs is
> to use existing file PTEs, right? What about private shmem VMAs that
> are registered with userfaultfd as VM_UFFD_MISSING? I think for those,
> the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without
> looking at the mapping and its pages (except for checking that the
> insertion point is before end-of-file), protected only by mmap_lock
> (shared) and pte_offset_map_lock().

Right, from your comments and Peter's, thank you both, I can see that
userfaultfd breaks the usual assumptions here: so I'm putting an
if (unlikely(vma->anon_vma || userfaultfd_wp(vma)))
check in once we've got the ptlock; with a comment above it to point
the blame at uffd, though I gave up on describing all the detail.
And deleted this earlier "we are safe" paragraph.
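
Roughly, the shape being described is (a sketch only, not the actual patch;
names as in the hunks quoted below):

        pml = pmd_lock(mm, pmd);
        ptl = pte_lockptr(mm, pmd);
        if (ptl != pml)
                spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
        /*
         * uffd (or a racing anon fault) may have installed ptes since the
         * earlier unlocked checks: re-check now that the pte lock is held,
         * and skip this vma if so, leaving its page table in place.
         */
        if (unlikely(vma->anon_vma || userfaultfd_wp(vma)))
                goto unlock;
        pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);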

You did suggest, in another mail, that perhaps there should be a scan
checking all pte_none() when we get the ptlock.  I wasn't keen on yet
another debug scan for bugs and didn't add that, thinking I was going
to add a patch on the end to do so in page_table_check_pte_clear_range().

But when I came to write that patch, found that I'd been misled by its
name: it's about checking or adjusting some accounting, not really a
suitable place to check for pte_none() at all; so just scrapped it.

...
> > -   collapse_and_free_pmd(mm, vma, addr, pmd);
> 
> The old code called collapse_and_free_pmd(), which involves MMU
> notifier invocation...

...
> > +   pml = pmd_lock(mm, pmd);
> > +   ptl = pte_lockptr(mm, pmd);
> > +   if (ptl != pml)
> > +   spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> > +   pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> 
> ... while the new code only does pmdp_collapse_flush(), which clears
> the pmd entry and does a TLB flush, but AFAICS doesn't use MMU
> notifiers. My understanding is that that's problematic - maybe (?) it
> is sort of okay with regards to classic MMU notifier users like KVM,
> but it's probably wrong for IOMMUv2 users, where an IOMMU directly
> consumes the normal page tables?

Right, I intentionally left out the MMU notifier invocation, knowing
that we have already done an MMU notifier invocation when unmapping
any PTEs which were mapped: it was necessary for collapse_and_free_pmd()
in the collapse_pte_mapped_thp() case, but there was no notifier in this
case for many years, and I was glad to be rid of it.

However, I now see that you were adding it intentionally even for this
case in your f268f6cf875f; and from later comments in this thread, it
looks like there is still uncertainty about whether it is needed here,
but safer to assume that it is needed: I'll add it back.

> 
> (FWIW, last I looked, there also seemed to be some other issues with
> MMU notifier usage wrt IOMMUv2, see the thread
> <https://lore.kernel.org/linux-mm/yzbaf9hw1%2frek...@nvidia.com/>.)

Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-05 Thread Hugh Dickins
On Sun, 28 May 2023, Hugh Dickins wrote:

> Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
> 
> This version is more complicated than others: because page_table_free()
> needs to know which fragment is being freed, and which mm to link it to.
> 
> page_table_free()'s fragment handling is clever, but I could too easily
> break it: what's done here in pte_free_defer() and pte_free_now() might
> be better integrated with page_table_free()'s cleverness, but not by me!
> 
> By the time that page_table_free() gets called via RCU, it's conceivable
> that mm would already have been freed: so mmgrab() in pte_free_defer()
> and mmdrop() in pte_free_now().  No, that is not a good context to call
> mmdrop() from, so make mmdrop_async() public and use that.

But Matthew Wilcox quickly pointed out that sharing one page->rcu_head
between multiple page tables is tricky: something I knew but had lost
sight of.  So the powerpc and s390 patches were broken: powerpc fairly
easily fixed, but s390 more painful.

In https://lore.kernel.org/linux-s390/20230601155751.7c949ca4@thinkpad-T15/
On Thu, 1 Jun 2023 15:57:51 +0200
Gerald Schaefer  wrote:
> 
> Yes, we have 2 pagetables in one 4K page, which could result in same
> rcu_head reuse. It might be possible to use the cleverness from our
> page_table_free() function, e.g. to only do the call_rcu() once, for
> the case where both 2K pagetable fragments become unused, similar to
> how we decide when to actually call __free_page().
> 
> However, it might be much worse, and page->rcu_head from a pagetable
> page cannot be used at all for s390, because we also use page->lru
> to keep our list of free 2K pagetable fragments. I always get confused
> by struct page unions, so not completely sure, but it seems to me that
> page->rcu_head would overlay with page->lru, right?

Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
I'm wrong) I think that s390 could use exactly the same technique for
its list of free 2K pagetable fragments as it uses for its list of THP
"deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
the first two longs of the page table itself for threading the list.

And while it could use third and fourth longs instead, I don't see any
need for that: a deposited pagetable has been allocated, so would not
be on the list of free fragments.

Below is one of the grossest patches I've ever posted: gross because
it's a rushed attempt to see whether that is viable, while it would take
me longer to understand all the s390 cleverness there (even though the
PP AA commentary above page_table_alloc() is excellent).

I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
And cmma_init_nodat()? Ah, that's __init so I guess disjoint.

Gerald, s390 folk: would it be possible for you to give this
a try, suggest corrections and improvements, and then I can make it
a separate patch of the series; and work on avoiding concurrent use
of the rcu_head by pagetable fragment buddies (ideally fit in with
the scheme already there, maybe DD bits to go along with the PP AA).

Why am I even asking you to move away from page->lru: why don't I
thread s390's pte_free_defer() pagetables like THP's deposit does?
I cannot, because the deferred pagetables have to remain accessible
as valid pagetables, until the RCU grace period has elapsed - unless
all the list pointers would appear as pte_none(), which I doubt.

(That may limit our possibilities with the deposited pagetables in
future: I can imagine them too wanting to remain accessible as valid
pagetables.  But that's not needed by this series, and s390 only uses
deposit/withdraw for anon THP; and some are hoping that we might be
able to move away from deposit/withdraw altogther - though powerpc's
special use will make that more difficult.)

Thanks!
Hugh

--- 6.4-rc5/arch/s390/mm/pgalloc.c
+++ linux/arch/s390/mm/pgalloc.c
@@ -232,6 +232,7 @@ void page_table_free_pgste(struct page *
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
+   struct list_head *listed;
unsigned long *table;
struct page *page;
unsigned int mask, bit;
@@ -241,8 +242,8 @@ unsigned long *page_table_alloc(struct m
table = NULL;
spin_lock_bh(&mm->context.lock);
if (!list_empty(&mm->context.pgtable_list)) {
-   page = list_first_entry(&mm->context.pgtable_list,
-   struct page, lru);
+   listed = mm->context.pgtable_list.next;
+   page = virt_to_page(listed);
 

Re: [PATCH 06/12] sparc: add pte_free_defer() for pgtables sharing page

2023-06-05 Thread Hugh Dickins
On Sun, 28 May 2023, Hugh Dickins wrote:

> Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.
So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pte_t *pgtable_t and struct page *.
The patch should be fine as posted (except its title is misleading).

> 
> Signed-off-by: Hugh Dickins 
> ---
>  arch/sparc/include/asm/pgalloc_64.h |  4 
>  arch/sparc/mm/init_64.c | 16 
>  2 files changed, 20 insertions(+)
> 
> diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
> index 7b5561d17ab1..caa7632be4c2 100644
> --- a/arch/sparc/include/asm/pgalloc_64.h
> +++ b/arch/sparc/include/asm/pgalloc_64.h
> @@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
>  void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
>  void pte_free(struct mm_struct *mm, pgtable_t ptepage);
>  
> +/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
>  #define pmd_populate_kernel(MM, PMD, PTE)pmd_set(MM, PMD, PTE)
>  #define pmd_populate(MM, PMD, PTE)   pmd_set(MM, PMD, PTE)
>  
> diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
> index 04f9db0c3111..b7c6aa085ef6 100644
> --- a/arch/sparc/mm/init_64.c
> +++ b/arch/sparc/mm/init_64.c
> @@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
>  }
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pte_free_now(struct rcu_head *head)
> +{
> + struct page *page;
> +
> + page = container_of(head, struct page, rcu_head);
> + __pte_free((pgtable_t)page_to_virt(page));
> +}
> +
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + struct page *page;
> +
> + page = virt_to_page(pgtable);
> + call_rcu(&page->rcu_head, pte_free_now);
> +}
> +
>  void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmd)
>  {
> -- 
> 2.35.3
> 
> 


Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-06-05 Thread Hugh Dickins
On Fri, 2 Jun 2023, Jason Gunthorpe wrote:
> On Mon, May 29, 2023 at 03:02:02PM +0100, Matthew Wilcox wrote:
> > On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > + struct page *page;
> > > +
> > > + page = virt_to_page(pgtable);
> > > + call_rcu(&page->rcu_head, pte_free_now);
> > > +}
> > 
> > This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> > tables sharing one 64kB page.  So if you have two page tables from the
> > same page being defer-freed simultaneously, you'll reuse the rcu_head
> > and I cannot imagine things go well from that point.
> > 
> > I have no idea how to solve this problem.
> 
> Maybe power and s390 should allocate a side structure, sort of a
> pre-memdesc thing to store enough extra data?
> 
> If we can get enough bytes then something like this would let a single
> rcu head be shared to manage the free bits.
> 
> struct 64k_page {
> u8 free_pages;
> u8 pending_rcu_free_pages;
> struct rcu_head head;
> }
> 
> free_sub_page(sub_id)
> if (atomic_fetch_or(1 << sub_id, &64k_page->pending_rcu_free_pages))
>  call_rcu(&64k_page->head)
> 
> rcu_func()
>64k_page->free_pages |= atomic_xchg(0, &64k_page->pending_rcu_free_pages)
> 
>if (64k_pages->free_pages == all_ones)
>   free_pgea(64k_page);

Or simply allocate as many rcu_heads as page tables.

I have not thought through your suggestion above, because I'm against
asking s390, or any other architecture, to degrade its page table
implementation by demanding more memory, just for the sake of my patch
series.  In a future memdesc world it might turn out to be reasonable,
but not for this (if I can possibly avoid it).

Below is what I believe to be the correct powerpc patch (built but not
retested).  sparc I thought was going to be an equal problem, but turns
out not: I'll comment on 06/12.  And let's move s390 discussion to 07/12.

[PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time: account concurrent deferrals with a
heightened refcount, only the first making use of the rcu_head, but
re-deferring if more deferrals arrived during its grace period.

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/include/asm/pgalloc.h |  4 +++
 arch/powerpc/mm/pgtable-frag.c | 51 ++
 2 files changed, 55 insertions(+)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..e4f58c5fc2ac 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
__free_page(page);
}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define PTE_FREE_DEFERRED 0x1 /* beyond any PTE_FRAG_NR */
+
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+   int refcount;
+
+   page = container_of(head, struct page, rcu_head);
+   refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
+    &page->pt_frag_refcount);
+   if (refcount < PTE_FREE_DEFERRED) {
+   pte_fragment_free((unsigned long *)page_address(page), 0);
+   return;
+   }
+   /*
+* One page may be shared between PTE_FRAG_NR pagetables.
+* At least one more call to pte_free_defer() came in while we
+* were already deferring, so the free must be deferred again;
+* but just for one grace period, however many calls came in.
+*/
+   while (refcount >= PTE_FREE_DEFE
