Re: [PATCH 2/2] mm/selftests: Don't prefault in gup_longterm tests

2024-04-29 Thread David Hildenbrand

On 29.04.24 15:10, Peter Xu wrote:

On Mon, Apr 29, 2024 at 09:28:15AM +0200, David Hildenbrand wrote:

On 28.04.24 21:01, Peter Xu wrote:

Prefaulting, especially with RW, makes the GUP test too easy and may not
reach the core of the test.

For example, R/O longterm pins will just hit pte_write()==true in all
cases, so the unsharing logic is never tested.

This patch removes the prefault.  This exercises more code paths, at least
covering the unshare case for R/O longterm pins: the first R/O GUP attempt
will fault the page in R/O, and the second will then go through the
unshare path, checking whether an unshare is needed.

Cc: David Hildenbrand 
Signed-off-by: Peter Xu 
---
   tools/testing/selftests/mm/gup_longterm.c | 12 +---
   1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/mm/gup_longterm.c 
b/tools/testing/selftests/mm/gup_longterm.c
index ad168d35b23b..488e32186246 100644
--- a/tools/testing/selftests/mm/gup_longterm.c
+++ b/tools/testing/selftests/mm/gup_longterm.c
@@ -119,10 +119,16 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
}
/*
-* Fault in the page writable such that GUP-fast can eventually pin
-* it immediately.
+* Explicitly avoid pre-faulting in the page, this can help testing
+* more code paths.
+*
+* Take the example of an upcoming R/O pin test: if we RW-prefault the
+* page, such a pin will directly skip R/O unsharing and the longterm
+* pin will almost always succeed.  When not prefaulted, a R/O
+* longterm pin will first fault in a R/O page, then the 2nd round
+* will go via the unshare check.  Otherwise those paths aren't
+* covered.
 */

This will mean that GUP-fast never succeeds, which removes quite some testing
coverage for most other tests here.

Note that the main motivation of this test was to test gup_fast_folio_allowed(),
where we had issues with GUP-fast during development.


Ah, I didn't notice that, as I thought that only whitelists the memfd ones.



Would the following also get the job done?

diff --git a/tools/testing/selftests/mm/gup_longterm.c 
b/tools/testing/selftests/mm/gup_longterm.c
index ad168d35b23b7..e917a7c58d571 100644
--- a/tools/testing/selftests/mm/gup_longterm.c
+++ b/tools/testing/selftests/mm/gup_longterm.c
@@ -92,7 +92,7 @@ static void do_test(int fd, size_t size, enum test_type type, 
bool shared)
  {
__fsword_t fs_type = get_fs_type(fd);
bool should_work;
-   char *mem;
+   char tmp, *mem;
int ret;
if (ftruncate(fd, size)) {
@@ -119,10 +119,19 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
}
/*
-* Fault in the page writable such that GUP-fast can eventually pin
-* it immediately.
+* Fault in the page such that GUP-fast might be able to pin it
+* immediately. To cover more cases, don't fault in pages writable when
+* R/O pinning.
 */
-   memset(mem, 0, size);
+   switch (type) {
+   case TEST_TYPE_RO:
+   case TEST_TYPE_RO_FAST:
+   tmp = *mem;
+   asm volatile("" : "+r" (tmp));
+   break;
+   default:
+   memset(mem, 0, size);
+   };
switch (type) {
case TEST_TYPE_RO:


Yes this could work too.

There's no rush for the test patch here.  David, how about you prepare a
better, verified patch and post it separately, making sure to cover all
the things we used to cover plus the unshare?  IIUC the unshare path used to
go unexercised because pte_write() always returns true after a write prefault.

Then we let patch 1 go through first, and drop this one?


Whatever you prefer!

--
Cheers,

David / dhildenb
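For reference, the fault-in used for the R/O cases in the diff above boils
down to the following standalone sketch (the helper name is made up; this is
an illustration of the idiom, not part of the selftest):

#include <sys/mman.h>

/*
 * Illustration only: fault a page in via a read access, without letting
 * the compiler optimize the access away.  The empty asm acts as a
 * compiler barrier that keeps the load, so the first touch is a read and
 * the resulting PTE is not writable.
 */
static void fault_in_readable(const char *mem)
{
	char tmp = *mem;

	asm volatile("" : "+r" (tmp));
}

int main(void)
{
	char *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem == MAP_FAILED)
		return 1;
	fault_in_readable(mem);	/* page is now mapped, but not via a write */
	return 0;
}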



Re: [PATCH 2/2] mm/selftests: Don't prefault in gup_longterm tests

2024-04-29 Thread David Hildenbrand

On 28.04.24 21:01, Peter Xu wrote:

Prefaulting, especially with RW, makes the GUP test too easy and may not
reach the core of the test.

For example, R/O longterm pins will just hit pte_write()==true in all
cases, so the unsharing logic is never tested.

This patch removes the prefault.  This exercises more code paths, at least
covering the unshare case for R/O longterm pins: the first R/O GUP attempt
will fault the page in R/O, and the second will then go through the
unshare path, checking whether an unshare is needed.

Cc: David Hildenbrand 
Signed-off-by: Peter Xu 
---
  tools/testing/selftests/mm/gup_longterm.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/mm/gup_longterm.c 
b/tools/testing/selftests/mm/gup_longterm.c
index ad168d35b23b..488e32186246 100644
--- a/tools/testing/selftests/mm/gup_longterm.c
+++ b/tools/testing/selftests/mm/gup_longterm.c
@@ -119,10 +119,16 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
}
  
  	/*

-* Fault in the page writable such that GUP-fast can eventually pin
-* it immediately.
+* Explicitly avoid pre-faulting in the page, this can help testing
+* more code paths.
+*
+* Take the example of an upcoming R/O pin test: if we RW-prefault the
+* page, such a pin will directly skip R/O unsharing and the longterm
+* pin will almost always succeed.  When not prefaulted, a R/O
+* longterm pin will first fault in a R/O page, then the 2nd round
+* will go via the unshare check.  Otherwise those paths aren't
+* covered.
 */

This will mean that GUP-fast never succeeds, which removes quite some testing
coverage for most other tests here.

Note that the main motivation of this test was to test gup_fast_folio_allowed(),
where we had issues with GUP-fast during development.

Would the following also get the job done?

diff --git a/tools/testing/selftests/mm/gup_longterm.c 
b/tools/testing/selftests/mm/gup_longterm.c
index ad168d35b23b7..e917a7c58d571 100644
--- a/tools/testing/selftests/mm/gup_longterm.c
+++ b/tools/testing/selftests/mm/gup_longterm.c
@@ -92,7 +92,7 @@ static void do_test(int fd, size_t size, enum test_type type, 
bool shared)
 {
__fsword_t fs_type = get_fs_type(fd);
bool should_work;
-   char *mem;
+   char tmp, *mem;
int ret;
 
 	if (ftruncate(fd, size)) {

@@ -119,10 +119,19 @@ static void do_test(int fd, size_t size, enum test_type 
type, bool shared)
}
 
 	/*

-* Fault in the page writable such that GUP-fast can eventually pin
-* it immediately.
+* Fault in the page such that GUP-fast might be able to pin it
+* immediately. To cover more cases, don't fault in pages writable when
+* R/O pinning.
 */
-   memset(mem, 0, size);
+   switch (type) {
+   case TEST_TYPE_RO:
+   case TEST_TYPE_RO_FAST:
+   tmp = *mem;
+   asm volatile("" : "+r" (tmp));
+   break;
+   default:
+   memset(mem, 0, size);
+   };
 
 	switch (type) {

case TEST_TYPE_RO:
--
2.44.0


--
Cheers,

David / dhildenb



Re: [PATCH 1/2] mm/gup: Fix hugepd handling in hugetlb rework

2024-04-29 Thread David Hildenbrand

On 28.04.24 21:01, Peter Xu wrote:

Commit a12083d721d7 added hugepd handling for gup-slow, reusing gup-fast
functions.  follow_hugepd() correctly took the vma pointer in, however
didn't pass it over into the lower functions, which was overlooked.

The issue is that gup_fast_hugepte() uses the vma pointer to make the correct
decision on whether an unshare is needed for a FOLL_PIN|FOLL_LONGTERM.  Now,
without the vma pointer, it will constantly return "true" (needs an unshare)
for page cache pages, even though in the SHARED case it would be wrong to
unshare.

The other problem is that, even if an unshare is needed, it now returns 0
rather than -EMLINK, which will not trigger a follow-up FAULT_FLAG_UNSHARE
fault.  That will also need to be fixed when the unshare is wanted.

The gup_longterm test didn't expose this issue in the past because it didn't
yet test R/O unshare in this case; another separate patch will enable that in
future tests.

Fix it by passing the vma correctly to the bottom, rename gup_fast_hugepte()
back to gup_hugepte() as it is shared between the fast/slow paths, and also
allow -EMLINK to be returned properly by gup_hugepte() even though gup-fast
will treat it the same as zero.

Reported-by: David Hildenbrand 
Fixes: a12083d721d7 ("mm/gup: handle hugepd for follow_page()")
Signed-off-by: Peter Xu 
---


LGTM

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions

2024-04-27 Thread David Hildenbrand

On 26.04.24 23:58, Peter Xu wrote:

On Fri, Apr 26, 2024 at 11:33:08PM +0200, David Hildenbrand wrote:

I raised this topic in the past, and IMHO we either (a) never should have
added COW support; or (b) added COW support by using ordinary anonymous
memory (hey, partial mappings of hugetlb pages! ;) ).

After all, COW is an optimization to speed up fork and defer copying. It
relies on memory overcommit, but that doesn't really apply to hugetlb, so we
fake it ...


Good summary.



One easy ABI break I had in mind was to simply *not* allow COW-sharing of
anon hugetlb folios; for example, simply don't copy the page into the child.
Chances are there are not really a lot of child processes that would fail
... but likely we would break *something*. So there is no easy way out :(


Right, not easy.  The thing is, this is one spot out of many such
specialties, and it may or may not be worthwhile to dedicate time to it
while nobody has a problem with it yet.  It might be easier to start with
v2, even though there it's also hard to nail everything properly - the
challenge can come from different angles.

Thanks for sharing, that's helpful.  I'll go ahead with the Power fix on
hugepd and put this aside.


Yes, hopefully we already do have a test case for that. When writing 
gup_longterm.c I was focusing more on memfd vs. ordinary file systems 
("filesystem type") than on how it's mapped into the page tables.




I hope that before the end of this year whatever I fix here can go away, by
removing hugepd completely from Linux.  For now that may or may not go as
smoothly, so we'd better still fix it.


Crossing fingers, I'm annoyed whenever I stumble over it :)

--
Cheers,

David / dhildenb



Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions

2024-04-26 Thread David Hildenbrand



Hmm, so when I enable 2M hugetlb I found ./cow is even failing on x86.

# ./cow  | grep -B1 "not ok"
# [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
not ok 161 No leak from parent into child
--
# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with 
hugetlb (2048 kB)
not ok 215 No leak from parent into child
--
# [RUN] vmsplice() before fork(), unmap in parent after fork() ... with 
hugetlb (2048 kB)
not ok 269 No leak from child into parent
--
# [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
not ok 323 No leak from child into parent

And it looks like it was always failing.. perhaps since the start?  We


Yes!

commit 7dad331be7816103eba8c12caeb88fbd3599c0b9
Author: David Hildenbrand 
Date:   Tue Sep 27 13:01:17 2022 +0200

 selftests/vm: anon_cow: hugetlb tests
 Let's run all existing test cases with all hugetlb sizes we're able to
 detect.
 Note that some tests cases still fail. This will, for example, be fixed
 once vmsplice properly uses FOLL_PIN instead of FOLL_GET for pinning.
 With 2 MiB and 1 GiB hugetlb on x86_64, the expected failures are:
  # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
  not ok 23 No leak from parent into child
  # [RUN] vmsplice() + unmap in child ... with hugetlb (1048576 kB)
  not ok 24 No leak from parent into child
  # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with 
hugetlb (2048 kB)
  not ok 35 No leak from child into parent
  # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with 
hugetlb (1048576 kB)
  not ok 36 No leak from child into parent
  # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 
kB)
  not ok 47 No leak from child into parent
  # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb 
(1048576 kB)
  not ok 48 No leak from child into parent

As it keeps confusing people (until somebody cares enough to fix vmsplice), I 
already
thought about just disabling the test and adding a comment why it happens and
why nobody cares.


I think we should, and when doing so maybe add a rich comment in
hugetlb_wp() too explaining everything?


Likely yes. Let me think of something.






didn't do the same on hugetlb vs. normal anon in that regard for the
vmsplice() fix.

I drafted a patch to allow refcount>1 detection in the same way, and then all
tests pass for me, as below.

David, I'd like to double check with you before I post anything: was it
your intention to do so when working on the R/O pinning, or not?


Here the "if it were easy, it would already have been done" principle
certainly applies. :)

The issue is the following: hugetlb pages are scarce resources that cannot 
usually
be overcommitted. For ordinary memory, we don't care if we COW in some corner 
case
because there is an unexpected reference. You temporarily consume an additional 
page
that gets freed as soon as the unexpected reference is dropped.

For hugetlb, it is problematic. Assume you have reserved a single 1 GiB hugetlb 
page
and your process uses that in a MAP_PRIVATE mapping. Then it calls fork() and 
the
child quits immediately.

If you decide to COW, you would need a second hugetlb page, which we don't 
have, so
you have to crash the program.

And in hugetlb it's extremely easy to not get folio_ref_count() == 1:

hugetlb_fault() will do a folio_get(folio) before calling hugetlb_wp()!

... so you essentially always copy.
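
The scenario is easy to sketch from userspace (illustration only; it assumes
exactly one 2 MiB hugetlb page has been reserved, e.g. via nr_hugepages=1):

#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	size_t size = 2UL << 20;	/* one 2 MiB hugetlb page */
	char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (mem == MAP_FAILED)
		return 1;
	memset(mem, 0xff, size);	/* the single reserved page is now in use */

	if (fork() == 0) {
		/*
		 * Parent and child now COW-share the hugetlb page.  If a
		 * write on either side had to copy, a second hugetlb page
		 * would be required -- which was never reserved.
		 */
		_exit(0);
	}
	wait(NULL);
	mem[0] = 1;	/* child is gone: the page can be reused, no copy */
	return 0;
}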


Hmm, yes, there's one extra refcount. I think this is all fine; we can simply
take all of them into account when making a CoW decision.  However, crashing
a userspace program can be a problem for sure.


Right, and a simple reference from page migration or some other PFN 
walker would be sufficient for that.


I did not dare to be responsible for that, even though races are rare :)

The vmsplice leak is not worth that: hugetlb with MAP_PRIVATE to 
COW-share data between processes with different privilege levels is not 
really common.
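
Roughly, the vmsplice scenario that the failing "No leak" cases above check
looks like the following simplified sketch (illustration only, not the actual
cow.c code; for hugetlb the mapping would additionally use MAP_HUGETLB):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int fds[2];
	char *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct iovec iov = { .iov_base = mem, .iov_len = 4096 };
	char buf;

	if (mem == MAP_FAILED || pipe(fds))
		return 1;
	memset(mem, 0xaa, 4096);	/* old content */

	if (fork() == 0) {
		/* child: grab a reference via vmsplice(), drop the mapping */
		vmsplice(fds[1], &iov, 1, 0);
		munmap(mem, 4096);
		_exit(0);
	}
	wait(NULL);

	mem[0] = 0xff;			/* parent writes new data */
	read(fds[0], &buf, 1);		/* pipe must still see the old 0xaa */

	/* reading 0xff would mean the parent's write leaked into the child */
	return buf == (char)0xaa ? 0 : 1;
}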







At that point I walked away from that, letting vmsplice() be fixed at some 
point. Dave
Howells was close at some point IIRC ...

I had some ideas about retrying until the other reference is gone (which cannot
be a longterm GUP pin), but as vmsplice essentially does without
FOLL_PIN|FOLL_LONGTERM, it's quite hopeless to resolve that as long as vmsplice
holds longterm references the wrong way.

---

One could argue that fork() with hugetlb and MAP_PRIVATE is stupid and fragile: 
assume
your child MM is torn down deferred, and will unmap the hugetlb page deferred. 
Or assume
you access the page concurrently with fork(). You'd have to COW and crash the 
program.
BUT, there is a horribly ugly hack in the hugetlb COW code where you *steal*
the page from the child program and crash your child. I'm not making that up,
it's horrible.


I didn't notice that code before; do

Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions

2024-04-26 Thread David Hildenbrand

On 26.04.24 18:12, Peter Xu wrote:

On Fri, Apr 26, 2024 at 09:44:58AM -0400, Peter Xu wrote:

On Fri, Apr 26, 2024 at 09:17:47AM +0200, David Hildenbrand wrote:

On 02.04.24 14:55, David Hildenbrand wrote:

Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename
all relevant internal functions to start with "gup_fast", to make it
clearer that this is not ordinary GUP. The current mixture of
"lockless", "gup" and "gup_fast" is confusing.

Further, avoid the term "huge" when talking about a "leaf" -- for
example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the
"hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that
stays.

What remains is the "external" interface:
* get_user_pages_fast_only()
* get_user_pages_fast()
* pin_user_pages_fast()

The high-level internal functions for GUP-fast (+slow fallback) are now:
* internal_get_user_pages_fast() -> gup_fast_fallback()
* lockless_pages_from_mm() -> gup_fast()

The basic GUP-fast walker functions:
* gup_pgd_range() -> gup_fast_pgd_range()
* gup_p4d_range() -> gup_fast_p4d_range()
* gup_pud_range() -> gup_fast_pud_range()
* gup_pmd_range() -> gup_fast_pmd_range()
* gup_pte_range() -> gup_fast_pte_range()
* gup_huge_pgd()  -> gup_fast_pgd_leaf()
* gup_huge_pud()  -> gup_fast_pud_leaf()
* gup_huge_pmd()  -> gup_fast_pmd_leaf()

The weird hugepd stuff:
* gup_huge_pd() -> gup_fast_hugepd()
* gup_hugepte() -> gup_fast_hugepte()


I just realized that we end up calling these from follow_hugepd() as well.
And something seems to be off, because gup_fast_hugepd() won't have the VMA
even in the slow-GUP case to pass it to gup_must_unshare().

So these are GUP-fast functions and the terminology seems correct. But the
usage from follow_hugepd() is questionable,

commit a12083d721d703f985f4403d6b333cc449f838f6
Author: Peter Xu 
Date:   Wed Mar 27 11:23:31 2024 -0400

 mm/gup: handle hugepd for follow_page()


states "With previous refactors on fast-gup gup_huge_pd(), most of the code
can be leveraged", which doesn't look quite true just staring the the
gup_must_unshare() call where we don't pass the VMA. Also,
"unlikely(pte_val(pte) != pte_val(ptep_get(ptep)" doesn't make any sense for
slow GUP ...


Yes, it's not needed; it just didn't look worthwhile to put another helper on
top just for this.  I mentioned this in the commit message here:

   There's something not needed for follow_page(); for example, gup_hugepte()
   tries to detect a pgtable entry change, which will never happen with slow
   gup (which has the pgtable lock held), but it's not a problem to check.



@Peter, any insights?


However, I think we should pass the vma in for sure; I guess I overlooked that,
and it didn't show up in my tests either, as I probably missed ./cow.

I'll prepare a separate patch on top of this series and the gup-fast rename
patches (I saw this one just reached mm-stable), and I'll see whether I can
test it too if I can find a Power system fast enough.  I'll probably drop
the "fast" in the hugepd function names too.




For the missing VMA parameter, the cow.c test might not trigger it. We never 
need the VMA to make
a pinning decision for anonymous memory. We'll trigger an unsharing fault, get 
an exclusive anonymous page
and can continue.

We need the VMA in gup_must_unshare(), when long-term pinning a file hugetlb 
page. I *think*
the gup_longterm.c selftest should trigger that, especially:

# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd 
hugetlb (2048 kB)
...
# [RUN] R/O longterm GUP-fast pin in MAP_SHARED file mapping ... with memfd 
hugetlb (1048576 kB)


We need a MAP_SHARED page where the PTE is R/O that we want to long-term pin 
R/O.
I don't remember off the top of my head if the test here might have a
R/W-mapped folio. If so, we could extend it to cover that.
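
Such a mapping can be sketched like this (illustration only; assumes reserved
hugetlb pages and glibc's memfd_create()). Mapping it PROT_READ guarantees a
R/O PTE; whether a PROT_READ|PROT_WRITE shared hugetlb mapping also ends up
with a R/O PTE after a read fault is exactly the open question above. The
longterm pinning itself is left out; the selftest does that through the
gup_test debugfs interface.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t size = 2UL << 20;	/* one 2 MiB hugetlb page */
	int fd = memfd_create("memfd-hugetlb", MFD_CLOEXEC | MFD_HUGETLB);
	char *mem;
	char tmp;

	if (fd < 0 || ftruncate(fd, size))
		return 1;

	mem = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED)
		return 1;

	/*
	 * Read access only: the shared hugetlb page is faulted in and,
	 * because the mapping is not writable, the PTE is R/O.  This is
	 * the kind of MAP_SHARED, R/O-mapped page where a R/O longterm
	 * pin must *not* request unsharing -- the case gup_must_unshare()
	 * needs the vma for, as discussed above.
	 */
	tmp = *mem;
	asm volatile("" : "+r" (tmp));
	return 0;
}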


Hmm, so when I enable 2M hugetlb I found ./cow is even failing on x86.

   # ./cow  | grep -B1 "not ok"
   # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
   not ok 161 No leak from parent into child
   --
   # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with 
hugetlb (2048 kB)
   not ok 215 No leak from parent into child
   --
   # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with 
hugetlb (2048 kB)
   not ok 269 No leak from child into parent
   --
   # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
   not ok 323 No leak from child into parent

And it looks like it was always failing.. perhaps since the start?  We


Yes!

commit 7dad331be7816103eba8c12caeb88fbd3599c0b9
Author: David Hildenbrand 
Date:   Tue Sep 27 13:01:17 2022 +0200

selftests/vm: anon_cow: hugetlb tests

Let's run all existing test cases with all hugetlb sizes we're able to


Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions

2024-04-26 Thread David Hildenbrand

On 02.04.24 14:55, David Hildenbrand wrote:

Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename
all relevant internal functions to start with "gup_fast", to make it
clearer that this is not ordinary GUP. The current mixture of
"lockless", "gup" and "gup_fast" is confusing.

Further, avoid the term "huge" when talking about a "leaf" -- for
example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the
"hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that
stays.

What remains is the "external" interface:
* get_user_pages_fast_only()
* get_user_pages_fast()
* pin_user_pages_fast()

The high-level internal functions for GUP-fast (+slow fallback) are now:
* internal_get_user_pages_fast() -> gup_fast_fallback()
* lockless_pages_from_mm() -> gup_fast()

The basic GUP-fast walker functions:
* gup_pgd_range() -> gup_fast_pgd_range()
* gup_p4d_range() -> gup_fast_p4d_range()
* gup_pud_range() -> gup_fast_pud_range()
* gup_pmd_range() -> gup_fast_pmd_range()
* gup_pte_range() -> gup_fast_pte_range()
* gup_huge_pgd()  -> gup_fast_pgd_leaf()
* gup_huge_pud()  -> gup_fast_pud_leaf()
* gup_huge_pmd()  -> gup_fast_pmd_leaf()

The weird hugepd stuff:
* gup_huge_pd() -> gup_fast_hugepd()
* gup_hugepte() -> gup_fast_hugepte()


I just realized that we end up calling these from follow_hugepd() as 
well. And something seems to be off, because gup_fast_hugepd() won't 
have the VMA even in the slow-GUP case to pass it to gup_must_unshare().


So these are GUP-fast functions and the terminology seems correct. But 
the usage from follow_hugepd() is questionable,


commit a12083d721d703f985f4403d6b333cc449f838f6
Author: Peter Xu 
Date:   Wed Mar 27 11:23:31 2024 -0400

mm/gup: handle hugepd for follow_page()


states "With previous refactors on fast-gup gup_huge_pd(), most of the 
code can be leveraged", which doesn't look quite true just staring the 
the gup_must_unshare() call where we don't pass the VMA. Also, 
"unlikely(pte_val(pte) != pte_val(ptep_get(ptep)" doesn't make any sense 
for slow GUP ...


@Peter, any insights?

--
Cheers,

David / dhildenb



Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-12 Thread David Hildenbrand

On 11.04.24 18:55, Paolo Bonzini wrote:

On Mon, Apr 8, 2024 at 3:56 PM Peter Xu  wrote:

Paolo,

I may miss a bunch of details here (as I still remember some change_pte
patches previously on the list..), however not sure whether we considered
enable it?  Asked because I remember Andrea used to have a custom tree
maintaining that part:

https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968


The patch enables it only for KSM, so it would still require a bunch
of cleanups, for example I also would still use set_pte_at() in all
the places that are not KSM. This would at least fix the issue with
the poor documentation of where to use set_pte_at_notify() vs
set_pte_at().

With regard to the implementation, I like the idea of disabling the
invalidation on the MMU notifier side, but I would rather have
MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
overloading the event field.


Maybe it can't be enabled for some reason that I overlooked in the current
tree, or we just decided to not to?


I have just learnt about the patch, nobody had ever mentioned it even
though it's almost 2 years old... It's a lot of code though and no one


I assume Andrea used it on his tree where he also has a version of 
"randprotect" (even included in that commit subject) to mitigate a KSM 
security issue that was reported by some security researchers [1] a 
while ago. From what I recall, the industry did not end up caring about 
that security issue that much.


IIUC, with "randprotect" we get a lot more R/O protection even when not 
de-duplicating a page -- thus the name. Likely, the reporter mentioned 
in the commit is a researcher who played with Andrea's fix for the 
security issue. But I'm just speculating at this point :)



has ever reported an issue for over 10 years, so I think it's easiest
to just rip the code out.


Yes. Can always be readded in a possibly cleaner fashion (like you note 
above), when deemed necessary and we are willing to support it.


[1] https://gruss.cc/files/remote_dedup.pdf

--
Cheers,

David / dhildenb



Re: [PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at()

2024-04-08 Thread David Hildenbrand

On 05.04.24 13:58, Paolo Bonzini wrote:

With the demise of the .change_pte() MMU notifier callback, there is no
notification happening in set_pte_at_notify().  It is a synonym of
set_pte_at() and can be replaced with it.

Signed-off-by: Paolo Bonzini 
---


A real joy seeing that gone

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH 3/4] mmu_notifier: remove the .change_pte() callback

2024-04-08 Thread David Hildenbrand

On 05.04.24 13:58, Paolo Bonzini wrote:

The scope of set_pte_at_notify() has reduced more and more through the
years.  Initially, it was meant for when the change to the PTE was
not bracketed by mmu_notifier_invalidate_range_{start,end}().  However,
that has not been so for over ten years.  During all this period
the only implementation of .change_pte() was KVM and it
had no actual functionality, because it was called after
mmu_notifier_invalidate_range_start() zapped the secondary PTE.

Now that this (nonfunctional) user of the .change_pte() callback is
gone, the whole callback can be removed.  For now, leave in place
set_pte_at_notify() even though it is just a synonym for set_pte_at().

Signed-off-by: Paolo Bonzini 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-04-02 Thread David Hildenbrand

On 02.04.24 19:57, Peter Xu wrote:

On Tue, Apr 02, 2024 at 06:39:31PM +0200, David Hildenbrand wrote:

On 02.04.24 18:20, Peter Xu wrote:

On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:

On 02.04.24 16:48, Ryan Roberts wrote:

Hi Peter,


Hey, Ryan,

Thanks for the report!



On 27/03/2024 15:23, pet...@redhat.com wrote:

From: Peter Xu 

Now follow_page() is ready to handle hugetlb pages in whatever form, and
over all architectures.  Switch to the generic code path.

Time to retire hugetlb_follow_page_mask(), following the previous
retirement of follow_hugetlb_page() in 4849807114b8.

There may be a slight difference of how the loops run when processing slow
GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
loop of __get_user_pages() will resolve one pgtable entry with the patch
applied, rather than relying on the size of hugetlb hstate, the latter may
cover multiple entries in one loop.

A quick performance test on an aarch64 VM on an M1 chip shows a 15% degradation
over a tight loop of slow gup after the path switch.  That shouldn't be a
problem because slow-gup should not be a hot path for GUP in general: when the
page is commonly present, fast-gup will already succeed, while when the page is
indeed missing and requires a follow-up page fault, the slow-gup degradation
will probably be buried in the fault paths anyway.  It also explains why slow
gup for THP used to be very slow before 57edfcfd3419 ("mm/gup: accelerate thp
gup even for "pages != NULL"") landed, the latter not being part of a
performance analysis but a side benefit.  If the performance becomes a concern,
we can consider handling CONT_PTE in follow_page().

Before that is justified to be necessary, keep everything clean and simple.

Reviewed-by: Jason Gunthorpe 
Signed-off-by: Peter Xu 


Afraid I'm seeing an oops when running gup_longterm test on arm64 with current 
mm-unstable. Git bisect blames this patch. The oops reproduces for me every 
time on 2 different machines:


[9.340416] kernel BUG at mm/gup.c:778!
[9.340746] Internal error: Oops - BUG: f2000800 [#1] PREEMPT SMP
[9.341199] Modules linked in:
[9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 
6.9.0-rc2-00210-g910ff1a347e4 #11
[9.342232] Hardware name: linux,dummy-virt (DT)
[9.342647] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[9.343195] pc : follow_page_mask+0x4d4/0x880
[9.343580] lr : follow_page_mask+0x4d4/0x880
[9.344018] sp : 8000898b3aa0
[9.344345] x29: 8000898b3aa0 x28: fdffc53973e8 x27: 3c0005d08000
[9.345028] x26: 00014e5cfd08 x25: d3513a40c000 x24: fdffc5d08000
[9.345682] x23: c1ffc000 x22: 00080101 x21: 8000898b3ba8
[9.346337] x20: f420 x19: 00014e52d508 x18: 0010
[9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
[9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
[9.348371] x11: 65645f656e6f7a5f x10: d3513b31d6e0 x9 : d3513852f090
[9.349062] x8 : efff x7 : d3513b31d6e0 x6 : 
[9.349753] x5 : 00017ff98cc8 x4 : 0fff x3 : 
[9.350397] x2 :  x1 : 000190e8b480 x0 : 0052
[9.351097] Call trace:
[9.351312]  follow_page_mask+0x4d4/0x880
[9.351700]  __get_user_pages+0xf4/0x3e8
[9.352089]  __gup_longterm_locked+0x204/0xa70
[9.352516]  pin_user_pages+0x88/0xc0
[9.352873]  gup_test_ioctl+0x860/0xc40
[9.353249]  __arm64_sys_ioctl+0xb0/0x100
[9.353648]  invoke_syscall+0x50/0x128
[9.354022]  el0_svc_common.constprop.0+0x48/0xf8
[9.354488]  do_el0_svc+0x28/0x40
[9.354822]  el0_svc+0x34/0xe0
[9.355128]  el0t_64_sync_handler+0x13c/0x158
[9.355489]  el0t_64_sync+0x190/0x198
[9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d421)
[9.356280] ---[ end trace  ]---
[9.356651] note: gup_longterm[1159] exited with irqs disabled
[9.357141] note: gup_longterm[1159] exited with preempt_count 2
[9.358033] [ cut here ]
[9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 
ct_kernel_exit.constprop.0+0x108/0x120
[9.360157] Modules linked in:
[9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G  D
6.9.0-rc2-00210-g910ff1a347e4 #11
[9.361626] Hardware name: linux,dummy-virt (DT)
[9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
[9.363306] lr : ct_idle_enter+0x10/0x20
[9.363845] sp : 8000801abdc0
[9.364222] x29: 8000801abdc0 x28:  x27: 
[9.364961] x26:  x25: 00014149d780 x24: 
[9.365557] x23:  x22: d3513b299d48 x21: d3513a785730
[9.366239] x20: 

Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-04-02 Thread David Hildenbrand

On 02.04.24 18:00, Matthew Wilcox wrote:

On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:

The oops trigger is at mm/gup.c:778:
VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);

So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm guessing 
you're trying to iterate 2M into a cont-pmd folio and ending up with an 
unexpected tail page?


I assume we find the expected tail page, it's just that the check

VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);

Doesn't make sense with hugetlb folios. We might have a tail page mapped in
a cont-pmd entry. As soon as we call follow_huge_pmd() on "not the first
cont-pmd entry", we trigger this check.

Likely this sanity check must also allow for hugetlb folios. Or we should
just remove it completely.
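
One possible relaxation, purely illustrative (not necessarily what should or
did get merged), would be to also tolerate hugetlb folios, along the lines of:

	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page) &&
		       !folio_test_hugetlb(page_folio(page)), page);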

In the past, we wanted to make sure that we never get tail pages of THP from
PMD entries, because something would currently be broken (we don't support
THP > PMD).


That was a practical limitation on my part.  We have various parts of
the MM which assume that pmd_page() returns a head page and until we
get all of those fixed, adding support for folios larger than PMD_SIZE
was only going to cause trouble for no significant wins.

I agree with you we should get rid of this assertion entirely.  We should
fix all the places which assume that pmd_page() returns a head page,
but that may take some time.

As an example, filemap_map_pmd() has:

if (pmd_none(*vmf->pmd) && folio_test_pmd_mappable(folio)) {
 struct page *page = folio_file_page(folio, start);
 vm_fault_t ret = do_set_pmd(vmf, page);

and then do_set_pmd() has:

	if (page != &folio->page || folio_order(folio) != HPAGE_PMD_ORDER)
 return ret;

so we'd simply refuse to use a PMD to map a folio larger than PMD_SIZE.
There's a lot of work to be done to make this work generally (not to
mention figuring out how to handle mapcount for such folios ;-).


Yes :)

--
Cheers,

David / dhildenb



Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-04-02 Thread David Hildenbrand

On 02.04.24 18:20, Peter Xu wrote:

On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:

On 02.04.24 16:48, Ryan Roberts wrote:

Hi Peter,


Hey, Ryan,

Thanks for the report!



On 27/03/2024 15:23, pet...@redhat.com wrote:

From: Peter Xu 

Now follow_page() is ready to handle hugetlb pages in whatever form, and
over all architectures.  Switch to the generic code path.

Time to retire hugetlb_follow_page_mask(), following the previous
retirement of follow_hugetlb_page() in 4849807114b8.

There may be a slight difference of how the loops run when processing slow
GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
loop of __get_user_pages() will resolve one pgtable entry with the patch
applied, rather than relying on the size of hugetlb hstate, the latter may
cover multiple entries in one loop.

A quick performance test on an aarch64 VM on an M1 chip shows a 15% degradation
over a tight loop of slow gup after the path switch.  That shouldn't be a
problem because slow-gup should not be a hot path for GUP in general: when the
page is commonly present, fast-gup will already succeed, while when the page is
indeed missing and requires a follow-up page fault, the slow-gup degradation
will probably be buried in the fault paths anyway.  It also explains why slow
gup for THP used to be very slow before 57edfcfd3419 ("mm/gup: accelerate thp
gup even for "pages != NULL"") landed, the latter not being part of a
performance analysis but a side benefit.  If the performance becomes a concern,
we can consider handling CONT_PTE in follow_page().

Before that is justified to be necessary, keep everything clean and simple.

Reviewed-by: Jason Gunthorpe 
Signed-off-by: Peter Xu 


Afraid I'm seeing an oops when running gup_longterm test on arm64 with current 
mm-unstable. Git bisect blames this patch. The oops reproduces for me every 
time on 2 different machines:


[9.340416] kernel BUG at mm/gup.c:778!
[9.340746] Internal error: Oops - BUG: f2000800 [#1] PREEMPT SMP
[9.341199] Modules linked in:
[9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 
6.9.0-rc2-00210-g910ff1a347e4 #11
[9.342232] Hardware name: linux,dummy-virt (DT)
[9.342647] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[9.343195] pc : follow_page_mask+0x4d4/0x880
[9.343580] lr : follow_page_mask+0x4d4/0x880
[9.344018] sp : 8000898b3aa0
[9.344345] x29: 8000898b3aa0 x28: fdffc53973e8 x27: 3c0005d08000
[9.345028] x26: 00014e5cfd08 x25: d3513a40c000 x24: fdffc5d08000
[9.345682] x23: c1ffc000 x22: 00080101 x21: 8000898b3ba8
[9.346337] x20: f420 x19: 00014e52d508 x18: 0010
[9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
[9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
[9.348371] x11: 65645f656e6f7a5f x10: d3513b31d6e0 x9 : d3513852f090
[9.349062] x8 : efff x7 : d3513b31d6e0 x6 : 
[9.349753] x5 : 00017ff98cc8 x4 : 0fff x3 : 
[9.350397] x2 :  x1 : 000190e8b480 x0 : 0052
[9.351097] Call trace:
[9.351312]  follow_page_mask+0x4d4/0x880
[9.351700]  __get_user_pages+0xf4/0x3e8
[9.352089]  __gup_longterm_locked+0x204/0xa70
[9.352516]  pin_user_pages+0x88/0xc0
[9.352873]  gup_test_ioctl+0x860/0xc40
[9.353249]  __arm64_sys_ioctl+0xb0/0x100
[9.353648]  invoke_syscall+0x50/0x128
[9.354022]  el0_svc_common.constprop.0+0x48/0xf8
[9.354488]  do_el0_svc+0x28/0x40
[9.354822]  el0_svc+0x34/0xe0
[9.355128]  el0t_64_sync_handler+0x13c/0x158
[9.355489]  el0t_64_sync+0x190/0x198
[9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d421)
[9.356280] ---[ end trace  ]---
[9.356651] note: gup_longterm[1159] exited with irqs disabled
[9.357141] note: gup_longterm[1159] exited with preempt_count 2
[9.358033] [ cut here ]
[9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 
ct_kernel_exit.constprop.0+0x108/0x120
[9.360157] Modules linked in:
[9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G  D
6.9.0-rc2-00210-g910ff1a347e4 #11
[9.361626] Hardware name: linux,dummy-virt (DT)
[9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
[9.363306] lr : ct_idle_enter+0x10/0x20
[9.363845] sp : 8000801abdc0
[9.364222] x29: 8000801abdc0 x28:  x27: 
[9.364961] x26:  x25: 00014149d780 x24: 
[9.365557] x23:  x22: d3513b299d48 x21: d3513a785730
[9.366239] x20: d3513b299c28 x19: 00017ffa7da0 x18: f5ff
[9.366869] x17:  x16: 

Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-04-02 Thread David Hildenbrand

On 02.04.24 16:48, Ryan Roberts wrote:

Hi Peter,

On 27/03/2024 15:23, pet...@redhat.com wrote:

From: Peter Xu 

Now follow_page() is ready to handle hugetlb pages in whatever form, and
over all architectures.  Switch to the generic code path.

Time to retire hugetlb_follow_page_mask(), following the previous
retirement of follow_hugetlb_page() in 4849807114b8.

There may be a slight difference of how the loops run when processing slow
GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
loop of __get_user_pages() will resolve one pgtable entry with the patch
applied, rather than relying on the size of hugetlb hstate, the latter may
cover multiple entries in one loop.

A quick performance test on an aarch64 VM on an M1 chip shows a 15% degradation
over a tight loop of slow gup after the path switch.  That shouldn't be a
problem because slow-gup should not be a hot path for GUP in general: when the
page is commonly present, fast-gup will already succeed, while when the page is
indeed missing and requires a follow-up page fault, the slow-gup degradation
will probably be buried in the fault paths anyway.  It also explains why slow
gup for THP used to be very slow before 57edfcfd3419 ("mm/gup: accelerate thp
gup even for "pages != NULL"") landed, the latter not being part of a
performance analysis but a side benefit.  If the performance becomes a concern,
we can consider handling CONT_PTE in follow_page().

Before that is justified to be necessary, keep everything clean and simple.

Reviewed-by: Jason Gunthorpe 
Signed-off-by: Peter Xu 


Afraid I'm seeing an oops when running gup_longterm test on arm64 with current 
mm-unstable. Git bisect blames this patch. The oops reproduces for me every 
time on 2 different machines:


[9.340416] kernel BUG at mm/gup.c:778!
[9.340746] Internal error: Oops - BUG: f2000800 [#1] PREEMPT SMP
[9.341199] Modules linked in:
[9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 
6.9.0-rc2-00210-g910ff1a347e4 #11
[9.342232] Hardware name: linux,dummy-virt (DT)
[9.342647] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[9.343195] pc : follow_page_mask+0x4d4/0x880
[9.343580] lr : follow_page_mask+0x4d4/0x880
[9.344018] sp : 8000898b3aa0
[9.344345] x29: 8000898b3aa0 x28: fdffc53973e8 x27: 3c0005d08000
[9.345028] x26: 00014e5cfd08 x25: d3513a40c000 x24: fdffc5d08000
[9.345682] x23: c1ffc000 x22: 00080101 x21: 8000898b3ba8
[9.346337] x20: f420 x19: 00014e52d508 x18: 0010
[9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 6170286461654865
[9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 6761705f65636976
[9.348371] x11: 65645f656e6f7a5f x10: d3513b31d6e0 x9 : d3513852f090
[9.349062] x8 : efff x7 : d3513b31d6e0 x6 : 
[9.349753] x5 : 00017ff98cc8 x4 : 0fff x3 : 
[9.350397] x2 :  x1 : 000190e8b480 x0 : 0052
[9.351097] Call trace:
[9.351312]  follow_page_mask+0x4d4/0x880
[9.351700]  __get_user_pages+0xf4/0x3e8
[9.352089]  __gup_longterm_locked+0x204/0xa70
[9.352516]  pin_user_pages+0x88/0xc0
[9.352873]  gup_test_ioctl+0x860/0xc40
[9.353249]  __arm64_sys_ioctl+0xb0/0x100
[9.353648]  invoke_syscall+0x50/0x128
[9.354022]  el0_svc_common.constprop.0+0x48/0xf8
[9.354488]  do_el0_svc+0x28/0x40
[9.354822]  el0_svc+0x34/0xe0
[9.355128]  el0t_64_sync_handler+0x13c/0x158
[9.355489]  el0t_64_sync+0x190/0x198
[9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d421)
[9.356280] ---[ end trace  ]---
[9.356651] note: gup_longterm[1159] exited with irqs disabled
[9.357141] note: gup_longterm[1159] exited with preempt_count 2
[9.358033] [ cut here ]
[9.358800] WARNING: CPU: 1 PID: 0 at kernel/context_tracking.c:128 
ct_kernel_exit.constprop.0+0x108/0x120
[9.360157] Modules linked in:
[9.360541] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G  D
6.9.0-rc2-00210-g910ff1a347e4 #11
[9.361626] Hardware name: linux,dummy-virt (DT)
[9.362087] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[9.362758] pc : ct_kernel_exit.constprop.0+0x108/0x120
[9.363306] lr : ct_idle_enter+0x10/0x20
[9.363845] sp : 8000801abdc0
[9.364222] x29: 8000801abdc0 x28:  x27: 
[9.364961] x26:  x25: 00014149d780 x24: 
[9.365557] x23:  x22: d3513b299d48 x21: d3513a785730
[9.366239] x20: d3513b299c28 x19: 00017ffa7da0 x18: f5ff
[9.366869] x17:  x16: 1fffe0002a21a8c1 x15: 
[9.367524] x14:  x13:  x12: 0002
[9.368207] x11: 0001 x10: 

[PATCH v1 3/3] mm: use "GUP-fast" instead "fast GUP" in remaining comments

2024-04-02 Thread David Hildenbrand
Let's fixup the remaining comments to consistently call that thing
"GUP-fast". With this change, we consistently call it "GUP-fast".

Reviewed-by: Mike Rapoport (IBM) 
Signed-off-by: David Hildenbrand 
---
 mm/filemap.c| 2 +-
 mm/khugepaged.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 387b394754fa..c668e11cd6ef 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1810,7 +1810,7 @@ EXPORT_SYMBOL(page_cache_prev_miss);
  * C. Return the page to the page allocator
  *
  * This means that any page may have its reference count temporarily
- * increased by a speculative page cache (or fast GUP) lookup as it can
+ * increased by a speculative page cache (or GUP-fast) lookup as it can
  * be allocated by another user before the RCU grace period expires.
  * Because the refcount temporarily acquired here may end up being the
  * last refcount on the page, any page allocation must be freeable by
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 38830174608f..6972fa05132e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1169,7 +1169,7 @@ static int collapse_huge_page(struct mm_struct *mm, 
unsigned long address,
 * huge and small TLB entries for the same virtual address to
 * avoid the risk of CPU bugs in that area.
 *
-* Parallel fast GUP is fine since fast GUP will back off when
+* Parallel GUP-fast is fine since GUP-fast will back off when
 * it detects PMD is changed.
 */
_pmd = pmdp_collapse_flush(vma, address, pmd);
-- 
2.44.0



[PATCH v1 2/3] mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST

2024-04-02 Thread David Hildenbrand
Nowadays, we call it "GUP-fast", the external interface includes
functions like "get_user_pages_fast()", and we renamed all internal
functions to reflect that as well.

Let's make the config option reflect that.

Reviewed-by: Mike Rapoport (IBM) 
Signed-off-by: David Hildenbrand 
---
 arch/arm/Kconfig   |  2 +-
 arch/arm64/Kconfig |  2 +-
 arch/loongarch/Kconfig |  2 +-
 arch/mips/Kconfig  |  2 +-
 arch/powerpc/Kconfig   |  2 +-
 arch/riscv/Kconfig |  2 +-
 arch/s390/Kconfig  |  2 +-
 arch/sh/Kconfig|  2 +-
 arch/x86/Kconfig   |  2 +-
 include/linux/rmap.h   |  8 
 kernel/events/core.c   |  4 ++--
 mm/Kconfig |  2 +-
 mm/gup.c   | 10 +-
 mm/internal.h  |  2 +-
 14 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index b14aed3a17ab..817918f6635a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -99,7 +99,7 @@ config ARM
select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
select HAVE_EFFICIENT_UNALIGNED_ACCESS if (CPU_V6 || CPU_V6K || CPU_V7) 
&& MMU
select HAVE_EXIT_THREAD
-   select HAVE_FAST_GUP if ARM_LPAE
+   select HAVE_GUP_FAST if ARM_LPAE
select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
select HAVE_FUNCTION_ERROR_INJECTION
select HAVE_FUNCTION_GRAPH_TRACER
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b11c98b3e84..de076a191e9f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -205,7 +205,7 @@ config ARM64
select HAVE_SAMPLE_FTRACE_DIRECT
select HAVE_SAMPLE_FTRACE_DIRECT_MULTI
select HAVE_EFFICIENT_UNALIGNED_ACCESS
-   select HAVE_FAST_GUP
+   select HAVE_GUP_FAST
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_ERROR_INJECTION
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index a5f300ec6f28..cd642eefd9e5 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -119,7 +119,7 @@ config LOONGARCH
select HAVE_EBPF_JIT
select HAVE_EFFICIENT_UNALIGNED_ACCESS if !ARCH_STRICT_ALIGN
select HAVE_EXIT_THREAD
-   select HAVE_FAST_GUP
+   select HAVE_GUP_FAST
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_FUNCTION_ERROR_INJECTION
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 516dc7022bd7..f1aa1bf11166 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -68,7 +68,7 @@ config MIPS
select HAVE_DYNAMIC_FTRACE
select HAVE_EBPF_JIT if !CPU_MICROMIPS
select HAVE_EXIT_THREAD
-   select HAVE_FAST_GUP
+   select HAVE_GUP_FAST
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACER
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..e42cc8cd415f 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -236,7 +236,7 @@ config PPC
select HAVE_DYNAMIC_FTRACE_WITH_REGSif 
ARCH_USING_PATCHABLE_FUNCTION_ENTRY || MPROFILE_KERNEL || PPC32
select HAVE_EBPF_JIT
select HAVE_EFFICIENT_UNALIGNED_ACCESS
-   select HAVE_FAST_GUP
+   select HAVE_GUP_FAST
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_FUNCTION_DESCRIPTORSif PPC64_ELF_ABI_V1
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index be09c8836d56..3ee60ddef93e 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,7 +132,7 @@ config RISCV
select HAVE_FUNCTION_GRAPH_RETVAL if HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACER if !XIP_KERNEL && !PREEMPTION
select HAVE_EBPF_JIT if MMU
-   select HAVE_FAST_GUP if MMU
+   select HAVE_GUP_FAST if MMU
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_FUNCTION_ERROR_INJECTION
select HAVE_GCC_PLUGINS
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 8f01ada6845e..d9aed4c93ee6 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -174,7 +174,7 @@ config S390
select HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_EBPF_JIT if HAVE_MARCH_Z196_FEATURES
select HAVE_EFFICIENT_UNALIGNED_ACCESS
-   select HAVE_FAST_GUP
+   select HAVE_GUP_FAST
select HAVE_FENTRY
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_ARG_ACCESS_API
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 2ad3e29f0ebe..7292542f75e8 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -38,7 +38,7 @@ config SUPERH
select HAVE_DEBUG_BUGVERBOSE
select HAVE_DEBUG_KMEMLEAK
select HAVE_DYNAMIC_FTRACE
-   select HAVE_FAST_GUP if MMU
+   select HAVE_GUP_FAST if MMU
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACER
select HAVE_FTRACE_MCOUNT_RECORD
diff --git a/arch/x86/Kconfig b

[PATCH v1 0/3] mm/gup: consistently call it GUP-fast

2024-04-02 Thread David Hildenbrand
Some cleanups around function names, comments and the config option of
"GUP-fast" -- GUP without "lock" safety belts on.

With this cleanup it's easy to judge which functions are GUP-fast specific.
We now consistently call it "GUP-fast", avoiding mixing it with "fast GUP",
"lockless", or simply "gup" (which I always considered confusing in the
ode).

So the magic now happens in functions that contain "gup_fast", whereby
gup_fast() is the entry point into that magic. Comments consistently
reference either "GUP-fast" or "gup_fast()".

Based on mm-unstable from today. I won't CC arch maintainers, but only
arch mailing lists, to reduce noise.

Tested on x86_64, cross compiled on a bunch of archs.

RFC -> v1:
* Rebased on latest mm/mm-unstable
* "mm/gup: consistently name GUP-fast functions"
 -> "internal_get_user_pages_fast()" -> "gup_fast_fallback()"
 -> "undo_dev_pagemap()" -> "gup_fast_undo_dev_pagemap()"
 -> Fixup a bunch more comments
* "mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST"
 -> Take care of RISCV

Cc: Andrew Morton 
Cc: Mike Rapoport (IBM) 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Cc: Peter Xu 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Cc: loonga...@lists.linux.dev
Cc: linux-m...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-perf-us...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux-ri...@lists.infradead.org
Cc: x...@kernel.org

David Hildenbrand (3):
  mm/gup: consistently name GUP-fast functions
  mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST
  mm: use "GUP-fast" instead "fast GUP" in remaining comments

 arch/arm/Kconfig   |   2 +-
 arch/arm64/Kconfig |   2 +-
 arch/loongarch/Kconfig |   2 +-
 arch/mips/Kconfig  |   2 +-
 arch/powerpc/Kconfig   |   2 +-
 arch/riscv/Kconfig |   2 +-
 arch/s390/Kconfig  |   2 +-
 arch/sh/Kconfig|   2 +-
 arch/x86/Kconfig   |   2 +-
 include/linux/rmap.h   |   8 +-
 kernel/events/core.c   |   4 +-
 mm/Kconfig |   2 +-
 mm/filemap.c   |   2 +-
 mm/gup.c   | 215 +
 mm/internal.h  |   2 +-
 mm/khugepaged.c|   2 +-
 16 files changed, 127 insertions(+), 126 deletions(-)

-- 
2.44.0



[PATCH v1 1/3] mm/gup: consistently name GUP-fast functions

2024-04-02 Thread David Hildenbrand
Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename
all relevant internal functions to start with "gup_fast", to make it
clearer that this is not ordinary GUP. The current mixture of
"lockless", "gup" and "gup_fast" is confusing.

Further, avoid the term "huge" when talking about a "leaf" -- for
example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the
"hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that
stays.

What remains is the "external" interface:
* get_user_pages_fast_only()
* get_user_pages_fast()
* pin_user_pages_fast()

The high-level internal functions for GUP-fast (+slow fallback) are now:
* internal_get_user_pages_fast() -> gup_fast_fallback()
* lockless_pages_from_mm() -> gup_fast()

The basic GUP-fast walker functions:
* gup_pgd_range() -> gup_fast_pgd_range()
* gup_p4d_range() -> gup_fast_p4d_range()
* gup_pud_range() -> gup_fast_pud_range()
* gup_pmd_range() -> gup_fast_pmd_range()
* gup_pte_range() -> gup_fast_pte_range()
* gup_huge_pgd()  -> gup_fast_pgd_leaf()
* gup_huge_pud()  -> gup_fast_pud_leaf()
* gup_huge_pmd()  -> gup_fast_pmd_leaf()

The weird hugepd stuff:
* gup_huge_pd() -> gup_fast_hugepd()
* gup_hugepte() -> gup_fast_hugepte()

The weird devmap stuff:
* __gup_device_huge_pud() -> gup_fast_devmap_pud_leaf()
* __gup_device_huge_pmd   -> gup_fast_devmap_pmd_leaf()
* __gup_device_huge() -> gup_fast_devmap_leaf()
* undo_dev_pagemap()  -> gup_fast_undo_dev_pagemap()

Helper functions:
* unpin_user_pages_lockless() -> gup_fast_unpin_user_pages()
* gup_fast_folio_allowed() is already properly named
* gup_fast_permitted() is already properly named

With "gup_fast()", we now even have a function that is referred to in
comment in mm/mmu_gather.c.

Reviewed-by: Jason Gunthorpe 
Reviewed-by: Mike Rapoport (IBM) 
Signed-off-by: David Hildenbrand 
---
 mm/gup.c | 205 ---
 1 file changed, 103 insertions(+), 102 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 95bd9d4b7cfb..f1ac2c5a7f6d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -440,7 +440,7 @@ void unpin_user_page_range_dirty_lock(struct page *page, 
unsigned long npages,
 }
 EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
 
-static void unpin_user_pages_lockless(struct page **pages, unsigned long 
npages)
+static void gup_fast_unpin_user_pages(struct page **pages, unsigned long 
npages)
 {
unsigned long i;
struct folio *folio;
@@ -525,9 +525,9 @@ static unsigned long hugepte_addr_end(unsigned long addr, 
unsigned long end,
return (__boundary - 1 < end - 1) ? __boundary : end;
 }
 
-static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
-  unsigned long end, unsigned int flags,
-  struct page **pages, int *nr)
+static int gup_fast_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
+   unsigned long end, unsigned int flags, struct page **pages,
+   int *nr)
 {
unsigned long pte_end;
struct page *page;
@@ -577,7 +577,7 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, 
unsigned long addr,
  * of the other folios. See writable_file_mapping_allowed() and
  * gup_fast_folio_allowed() for more information.
  */
-static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
+static int gup_fast_hugepd(hugepd_t hugepd, unsigned long addr,
unsigned int pdshift, unsigned long end, unsigned int flags,
struct page **pages, int *nr)
 {
@@ -588,7 +588,7 @@ static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
ptep = hugepte_offset(hugepd, addr, pdshift);
do {
next = hugepte_addr_end(addr, end, sz);
-   if (!gup_hugepte(ptep, sz, addr, end, flags, pages, nr))
+   if (!gup_fast_hugepte(ptep, sz, addr, end, flags, pages, nr))
return 0;
} while (ptep++, addr = next, addr != end);
 
@@ -613,8 +613,8 @@ static struct page *follow_hugepd(struct vm_area_struct 
*vma, hugepd_t hugepd,
h = hstate_vma(vma);
ptep = hugepte_offset(hugepd, addr, pdshift);
ptl = huge_pte_lock(h, vma->vm_mm, ptep);
-   ret = gup_huge_pd(hugepd, addr, pdshift, addr + PAGE_SIZE,
- flags, &page, &nr);
+   ret = gup_fast_hugepd(hugepd, addr, pdshift, addr + PAGE_SIZE,
+ flags, &page, &nr);
spin_unlock(ptl);
 
if (ret) {
@@ -626,7 +626,7 @@ static struct page *follow_hugepd(struct vm_area_struct 
*vma, hugepd_t hugepd,
return NULL;
 }
 #else /* CONFIG_ARCH_HAS_HUGEPD */
-static inline int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
+static inline int gup_fast_hugepd(hugepd_t hugepd, unsigned long addr,
unsigned int pdshi

Re: [PATCH v4 06/13] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2024-03-28 Thread David Hildenbrand

On 27.03.24 16:23, pet...@redhat.com wrote:

From: Peter Xu 

Hugepd format for GUP is only used in PowerPC with hugetlbfs.  There are
some kernel usage of hugepd (can refer to hugepd_populate_kernel() for
PPC_8XX), however those pages are not candidates for GUP.

Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
file-backed mappings") added a check to fail gup-fast if there's potential
risk of violating GUP over writeback file systems.  That should never apply
to hugepd.  Considering that hugepd is an old format (and even
software-only), there's no plan to extend hugepd into other file typed
memories that is prone to the same issue.

Drop that check, not only because it'll never be true for hugepd per any
known plan, but also it paves way for reusing the function outside
fast-gup.

To make sure we'll still remember this issue just in case hugepd will be
extended to support non-hugetlbfs memories, add a rich comment above
gup_huge_pd(), explaining the issue with proper references.

Cc: Christoph Hellwig 
Cc: Lorenzo Stoakes 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Peter Xu 
---


@Andrew, you properly adjusted the code to remove the 
gup_fast_folio_allowed() call instead of the folio_fast_pin_allowed() 
call, but


(1) the commit subject
(2) comment for gup_huge_pd()

Still mention folio_fast_pin_allowed().

The patch "mm/gup: handle hugepd for follow_page()" then moves that 
(outdated) comment.


--
Cheers,

David / dhildenb



Re: [PATCH RFC 0/3] mm/gup: consistently call it GUP-fast

2024-03-28 Thread David Hildenbrand

On 28.03.24 08:15, Mike Rapoport wrote:

On Thu, Mar 28, 2024 at 07:09:13AM +0100, Arnd Bergmann wrote:

On Thu, Mar 28, 2024, at 06:51, Vineet Gupta wrote:

On 3/27/24 09:22, Arnd Bergmann wrote:

On Wed, Mar 27, 2024, at 16:39, David Hildenbrand wrote:

On 27.03.24 16:21, Peter Xu wrote:

On Wed, Mar 27, 2024 at 02:05:35PM +0100, David Hildenbrand wrote:

I'm not sure what config you tried there; as I've been doing some build tests
recently, I found that turning off CONFIG_SAMPLES + CONFIG_GCC_PLUGINS can
avoid a lot of issues - I think it's due to libc missing.  But maybe that's
not the case there.

CCing Arnd; I use some of his compiler chains, others from Fedora directly. For
example, for alpha and arc the Fedora gcc is "13.2.1".
But there is other stuff like (arc):

./arch/arc/include/asm/mmu-arcv2.h: In function 'mmu_setup_asid':
./arch/arc/include/asm/mmu-arcv2.h:82:9: error: implicit declaration of function 'write_aux_reg' [-Werror=implicit-function-declaration]
 82 | write_aux_reg(ARC_REG_PID, asid | MMU_ENABLE);
| ^

Seems to be missing an #include of soc/arc/aux.h, but I can't
tell when this first broke without bisecting.


Weird I don't see this one but I only have gcc 12 handy ATM.

     gcc version 12.2.1 20230306 (ARC HS GNU/Linux glibc toolchain -
build 1360)

I even tried W=1 (which according to scripts/Makefile.extrawarn) should
include -Werror=implicit-function-declaration but don't see this still.

Tomorrow I'll try building a gcc 13.2.1 for ARC.


David reported them with the toolchains I built at
https://mirrors.edge.kernel.org/pub/tools/crosstool/
I'm fairly sure the problem is specific to the .config
and tree, not the toolchain though.


This happens with defconfig and both gcc 12.2.0 and gcc 13.2.0 from your
crosstools. I also see these on the current Linus' tree:

arc/kernel/ptrace.c:342:16: warning: no previous prototype for 
'syscall_trace_enter' [-Wmissing-prototypes]
arch/arc/kernel/kprobes.c:193:15: warning: no previous prototype for 
'arc_kprobe_handler' [-Wmissing-prototypes]

This fixed the warning about write_aux_reg for me, probably Vineet would
want this include somewhere else...

diff --git a/arch/arc/include/asm/mmu-arcv2.h b/arch/arc/include/asm/mmu-arcv2.h
index ed9036d4ede3..0fca342d7b79 100644
--- a/arch/arc/include/asm/mmu-arcv2.h
+++ b/arch/arc/include/asm/mmu-arcv2.h
@@ -69,6 +69,8 @@
  
  #ifndef __ASSEMBLY__
  
+#include <soc/arc/aux.h>
+
  struct mm_struct;
  extern int pae40_exist_but_not_enab(void);



Here are all err+warn I see with my configs on Linus' tree from today (not 
mm-unstable).
Most of them are warnings due to missing prototypes or missing "clone3".

Parisc64 seems to be a bit more broken. Maybe nobody cares about parisc64 
anymore? Or
it's a toolchain issue, don't know.

xtensa is also broken, but "invalid register" smells like a toolchain issue to 
me.


Maybe all known/expected, just posting it if anybody cares. I can share my full 
build script
on request.



[INFO] Compiling alpha
[INFO] 0 errors
[INFO] 102 warnings
[PASS]

$ cat alpha_log  | grep warn
:1519:2: warning: #warning syscall clone3 not implemented [-Wcpp]
arch/alpha/lib/checksum.c:45:9: warning: no previous prototype for 
'csum_tcpudp_magic' [-Wmissing-prototypes]
arch/alpha/lib/checksum.c:54:8: warning: no previous prototype for 
'csum_tcpudp_nofold' [-Wmissing-prototypes]
arch/alpha/lib/checksum.c:145:9: warning: no previous prototype for 
'ip_fast_csum' [-Wmissing-prototypes]
arch/alpha/lib/checksum.c:163:8: warning: no previous prototype for 
'csum_partial' [-Wmissing-prototypes]
arch/alpha/lib/checksum.c:180:9: warning: no previous prototype for 
'ip_compute_csum' [-Wmissing-prototypes]
arch/alpha/kernel/traps.c:211:1: warning: no previous prototype for 
'do_entArith' [-Wmissing-prototypes]
arch/alpha/kernel/traps.c:233:1: warning: no previous prototype for 'do_entIF' 
[-Wmissing-prototypes]
arch/alpha/kernel/traps.c:400:1: warning: no previous prototype for 'do_entDbg' 
[-Wmissing-prototypes]
arch/alpha/kernel/traps.c:436:1: warning: no previous prototype for 'do_entUna' 
[-Wmissing-prototypes]
arch/alpha/kernel/traps.c:721:1: warning: no previous prototype for 
'do_entUnaUser' [-Wmissing-prototypes]
arch/alpha/mm/init.c:261:1: warning: no previous prototype for 
'srm_paging_stop' [-Wmissing-prototypes]
arch/alpha/lib/fpreg.c:20:1: warning: no previous prototype for 
'alpha_read_fp_reg' [-Wmissing-prototypes]
[...]

[INFO] Compiling arc
[INFO] 0 errors
[INFO] 2 warnings
[PASS]

$ cat arc_log  | grep warn
arch/arc/kernel/ptrace.c:342:16: warning: no previous prototype for 
'syscall_trace_enter' [-Wmissing-prototypes]
arch/arc/kernel/kprobes.c:193:15: warning: no previous prototype for 
'arc_kprobe_handler' [-Wmissing-prototypes]


[INFO] Compiling hexagon
[INFO] 0 errors
[INFO] 1 warnings
[PASS]

 $ cat hexagon_log  | grep warn
:1519:2: warning: syscall clone3 not implemented [-W#warnings]
 1519 | #warning syscall clone3 not i

Re: [PATCH RFC 0/3] mm/gup: consistently call it GUP-fast

2024-03-27 Thread David Hildenbrand

On 27.03.24 16:46, Ryan Roberts wrote:


Some of them look like mm-unstable issue, For example, arm64 fails with

   CC  arch/arm64/mm/extable.o
In file included from ./include/linux/hugetlb.h:828,
  from security/commoncap.c:19:
./arch/arm64/include/asm/hugetlb.h:25:34: error: redefinition of
'arch_clear_hugetlb_flags'
    25 | #define arch_clear_hugetlb_flags arch_clear_hugetlb_flags
   |  ^~~~
./include/linux/hugetlb.h:840:20: note: in expansion of macro
'arch_clear_hugetlb_flags'
   840 | static inline void arch_clear_hugetlb_flags(struct folio *folio) { }
   |    ^~~~
./arch/arm64/include/asm/hugetlb.h:21:20: note: previous definition of
'arch_clear_hugetlb_flags' with t
ype 'void(struct folio *)'
    21 | static inline void arch_clear_hugetlb_flags(struct folio *folio)
   |    ^~~~
In file included from ./include/linux/hugetlb.h:828,
  from mm/filemap.c:37:
./arch/arm64/include/asm/hugetlb.h:25:34: error: redefinition of
'arch_clear_hugetlb_flags'
    25 | #define arch_clear_hugetlb_flags arch_clear_hugetlb_flags
   |  ^~~~
./include/linux/hugetlb.h:840:20: note: in expansion of macro
'arch_clear_hugetlb_flags'
   840 | static inline void arch_clear_hugetlb_flags(struct folio *folio) { }
   |    ^~~~
./arch/arm64/include/asm/hugetlb.h:21:20: note: previous definition of
'arch_clear_hugetlb_flags' with type 'void(struct folio *)'
    21 | static inline void arch_clear_hugetlb_flags(struct folio *folio)


see: https://lore.kernel.org/linux-mm/zgqvnkgdldkwh...@casper.infradead.org/



Yes, besides the other failures I see (odd targets), I was expecting 
that someone else noticed that already :) thanks!


--
Cheers,

David / dhildenb



Re: [PATCH RFC 0/3] mm/gup: consistently call it GUP-fast

2024-03-27 Thread David Hildenbrand

On 27.03.24 16:21, Peter Xu wrote:

On Wed, Mar 27, 2024 at 02:05:35PM +0100, David Hildenbrand wrote:

Some cleanups around function names, comments and the config option of
"GUP-fast" -- GUP without "lock" safety belts on.

With this cleanup it's easy to judge which functions are GUP-fast specific.
We now consistently call it "GUP-fast", avoiding mixing it with "fast GUP",
"lockless", or simply "gup" (which I always considered confusing in the
code).

So the magic now happens in functions that contain "gup_fast", whereby
gup_fast() is the entry point into that magic. Comments consistently
reference either "GUP-fast" or "gup_fast()".

Based on mm-unstable from today. I won't CC arch maintainers, but only
arch mailing lists, to reduce noise.

Tested on x86_64, cross compiled on a bunch of archs, whereby some of them
don't properly even compile on mm-unstable anymore in my usual setup
(alpha, arc, parisc64, sh) ... maybe the cross compilers are outdated,
but there are no new ones around. Hm.


I'm not sure what config you tried there; as I am doing some build tests
recently, I found turning off CONFIG_SAMPLES + CONFIG_GCC_PLUGINS could
avoid a lot of issues, I think it's due to libc missing.  But maybe not the
case there.


CCing Arnd; I use some of his compiler chains, others from Fedora directly. For
example for alpha and arc, the Fedora gcc is "13.2.1".


I compile quite some targets, usually with defconfig. From my compile script:

# COMPILER NAME ARCH CROSS_COMPILE CONFIG(if different from defconfig)

compile_gcc "alpha" "alpha" "alpha-linux-gnu-"
compile_gcc "arc" "arc" "arc-linux-gnu-"
compile_gcc "arm" "arm" "arm-linux-gnu-" "axm55xx_defconfig"
compile_gcc "arm-nommu" "arm" "arm-linux-gnu-" "imxrt_defconfig"
compile_gcc "arm64" "arm64" "aarch64-linux-gnu-"
compile_gcc "csky" "csky" 
"../cross/gcc-13.2.0-nolibc/csky-linux/bin/csky-linux-"
compile_gcc "loongarch" "loongarch" 
"../cross/gcc-13.2.0-nolibc/loongarch64-linux/bin/loongarch64-linux-"
compile_gcc "m68k-nommu" "m68k" "m68k-linux-gnu-" "amcore_defconfig"
compile_gcc "m68k-sun3" "m68k" "m68k-linux-gnu-" "sun3_defconfig"
compile_gcc "m68k-coldfire" "m68k" "m68k-linux-gnu-" "m5475evb_defconfig"
compile_gcc "m68k-virt" "m68k" "m68k-linux-gnu-" "virt_defconfig"
compile_gcc "microblaze" "microblaze" "microblaze-linux-gnu-"
compile_gcc "mips64" "mips" "mips64-linux-gnu-" "bigsur_defconfig"
compile_gcc "mips32-xpa" "mips" "mips64-linux-gnu-" "maltaup_xpa_defconfig"
compile_gcc "mips32-alchemy" "mips" "mips64-linux-gnu-" "gpr_defconfig"
compile_gcc "mips32" "mips" "mips64-linux-gnu-"
compile_gcc "nios2" "nios2" "nios2-linux-gnu-" "3c120_defconfig"
compile_gcc "openrisc" "openrisc" "../cross/gcc-13.2.0-nolibc/or1k-linux/bin/or1k-linux-" 
"virt_defconfig"
compile_gcc "parisc32" "parisc" "hppa-linux-gnu-" "generic-32bit_defconfig"
compile_gcc "parisc64" "parisc" "hppa64-linux-gnu-" "generic-64bit_defconfig"
compile_gcc "riscv32" "riscv" "riscv64-linux-gnu-" "32-bit.config"
compile_gcc "riscv64" "riscv" "riscv64-linux-gnu-" "64-bit.config"
compile_gcc "riscv64-nommu" "riscv" "riscv64-linux-gnu-" "nommu_virt_defconfig"
compile_gcc "s390x" "s390" "s390x-linux-gnu-"
compile_gcc "sh" "sh" "../cross/gcc-13.2.0-nolibc/sh4-linux/bin/sh4-linux-"
compile_gcc "sparc32" "sparc" "../cross/gcc-13.2.0-nolibc/sparc-linux/bin/sparc-linux-" 
"sparc32_defconfig"
compile_gcc "sparc64" "sparc" 
"../cross/gcc-13.2.0-nolibc/sparc64-linux/bin/sparc64-linux-" "sparc64_defconfig"
compile_gcc "uml64" "um" "" "x86_64_defconfig"
compile_gcc "x86" "x86" "" "i386_defconfig"
compile_gcc "x86-pae" "x86" "" "i386_defconfig"
compile_gcc "x86_64" "x86" ""
compile_gcc "xtensa" "xtensa" "../cross/gcc-13.2.0-nolibc/xtensa-linux/bin/xtensa-linux-"

Re: [PATCH RFC 1/3] mm/gup: consistently name GUP-fast functions

2024-03-27 Thread David Hildenbrand

On 27.03.24 14:52, Jason Gunthorpe wrote:

On Wed, Mar 27, 2024 at 02:05:36PM +0100, David Hildenbrand wrote:

Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename
all relevant internal functions to start with "gup_fast", to make it
clearer that this is not ordinary GUP. The current mixture of
"lockless", "gup" and "gup_fast" is confusing.

Further, avoid the term "huge" when talking about a "leaf" -- for
example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the
"hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that
stays.

What remains is the "external" interface:
* get_user_pages_fast_only()
* get_user_pages_fast()
* pin_user_pages_fast()

And the "internal" interface that handles GUP-fast + fallback:
* internal_get_user_pages_fast()


This would like a better name too. How about gup_fast_fallback() ?


Yes, I was not able to come up with something I liked. But I do like
your proposal, so I'll do that!

[...]



I think it is a great idea, it always takes a moment to figure out if
a function is part of the fast callchain or not..

(even better would be to shift the fast stuff into its own file, but I
expect that is too much)


Yes, one step at a time :)



Reviewed-by: Jason Gunthorpe 


Thanks Jason!

--
Cheers,

David / dhildenb



[PATCH RFC 3/3] mm: use "GUP-fast" instead "fast GUP" in remaining comments

2024-03-27 Thread David Hildenbrand
Let's fix up the remaining comments so that we consistently call that thing
"GUP-fast" everywhere.

Signed-off-by: David Hildenbrand 
---
 mm/filemap.c| 2 +-
 mm/khugepaged.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 387b394754fa..c668e11cd6ef 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1810,7 +1810,7 @@ EXPORT_SYMBOL(page_cache_prev_miss);
  * C. Return the page to the page allocator
  *
  * This means that any page may have its reference count temporarily
- * increased by a speculative page cache (or fast GUP) lookup as it can
+ * increased by a speculative page cache (or GUP-fast) lookup as it can
  * be allocated by another user before the RCU grace period expires.
  * Because the refcount temporarily acquired here may end up being the
  * last refcount on the page, any page allocation must be freeable by
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 38830174608f..6972fa05132e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1169,7 +1169,7 @@ static int collapse_huge_page(struct mm_struct *mm, 
unsigned long address,
 * huge and small TLB entries for the same virtual address to
 * avoid the risk of CPU bugs in that area.
 *
-* Parallel fast GUP is fine since fast GUP will back off when
+* Parallel GUP-fast is fine since GUP-fast will back off when
 * it detects PMD is changed.
 */
_pmd = pmdp_collapse_flush(vma, address, pmd);
-- 
2.43.2



[PATCH RFC 2/3] mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST

2024-03-27 Thread David Hildenbrand
Nowadays, we call it "GUP-fast", the external interface includes
functions like "get_user_pages_fast()", and we renamed all internal
functions to reflect that as well.

Let's make the config option reflect that.

Signed-off-by: David Hildenbrand 
---
 arch/arm/Kconfig   | 2 +-
 arch/arm64/Kconfig | 2 +-
 arch/loongarch/Kconfig | 2 +-
 arch/mips/Kconfig  | 2 +-
 arch/powerpc/Kconfig   | 2 +-
 arch/s390/Kconfig  | 2 +-
 arch/sh/Kconfig| 2 +-
 arch/x86/Kconfig   | 2 +-
 include/linux/rmap.h   | 8 
 kernel/events/core.c   | 4 ++--
 mm/Kconfig | 2 +-
 mm/gup.c   | 6 +++---
 mm/internal.h  | 2 +-
 13 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index b14aed3a17ab..817918f6635a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -99,7 +99,7 @@ config ARM
select HAVE_DYNAMIC_FTRACE_WITH_REGS if HAVE_DYNAMIC_FTRACE
select HAVE_EFFICIENT_UNALIGNED_ACCESS if (CPU_V6 || CPU_V6K || CPU_V7) 
&& MMU
select HAVE_EXIT_THREAD
-   select HAVE_FAST_GUP if ARM_LPAE
+   select HAVE_GUP_FAST if ARM_LPAE
select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
select HAVE_FUNCTION_ERROR_INJECTION
select HAVE_FUNCTION_GRAPH_TRACER
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b11c98b3e84..de076a191e9f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -205,7 +205,7 @@ config ARM64
select HAVE_SAMPLE_FTRACE_DIRECT
select HAVE_SAMPLE_FTRACE_DIRECT_MULTI
select HAVE_EFFICIENT_UNALIGNED_ACCESS
-   select HAVE_FAST_GUP
+   select HAVE_GUP_FAST
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_ERROR_INJECTION
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index a5f300ec6f28..cd642eefd9e5 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -119,7 +119,7 @@ config LOONGARCH
select HAVE_EBPF_JIT
select HAVE_EFFICIENT_UNALIGNED_ACCESS if !ARCH_STRICT_ALIGN
select HAVE_EXIT_THREAD
-   select HAVE_FAST_GUP
+   select HAVE_GUP_FAST
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_FUNCTION_ERROR_INJECTION
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 06ef440d16ce..10f7c6d88163 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -68,7 +68,7 @@ config MIPS
select HAVE_DYNAMIC_FTRACE
select HAVE_EBPF_JIT if !CPU_MICROMIPS
select HAVE_EXIT_THREAD
-   select HAVE_FAST_GUP
+   select HAVE_GUP_FAST
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACER
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..e42cc8cd415f 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -236,7 +236,7 @@ config PPC
select HAVE_DYNAMIC_FTRACE_WITH_REGSif 
ARCH_USING_PATCHABLE_FUNCTION_ENTRY || MPROFILE_KERNEL || PPC32
select HAVE_EBPF_JIT
select HAVE_EFFICIENT_UNALIGNED_ACCESS
-   select HAVE_FAST_GUP
+   select HAVE_GUP_FAST
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_ARG_ACCESS_API
select HAVE_FUNCTION_DESCRIPTORSif PPC64_ELF_ABI_V1
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 8f01ada6845e..d9aed4c93ee6 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -174,7 +174,7 @@ config S390
select HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_EBPF_JIT if HAVE_MARCH_Z196_FEATURES
select HAVE_EFFICIENT_UNALIGNED_ACCESS
-   select HAVE_FAST_GUP
+   select HAVE_GUP_FAST
select HAVE_FENTRY
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_ARG_ACCESS_API
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 2ad3e29f0ebe..7292542f75e8 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -38,7 +38,7 @@ config SUPERH
select HAVE_DEBUG_BUGVERBOSE
select HAVE_DEBUG_KMEMLEAK
select HAVE_DYNAMIC_FTRACE
-   select HAVE_FAST_GUP if MMU
+   select HAVE_GUP_FAST if MMU
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACER
select HAVE_FTRACE_MCOUNT_RECORD
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 39886bab943a..f82171292cf3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -221,7 +221,7 @@ config X86
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select HAVE_EISA
select HAVE_EXIT_THREAD
-   select HAVE_FAST_GUP
+   select HAVE_GUP_FAST
select HAVE_FENTRY  if X86_64 || DYNAMIC_FTRACE
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_GRAPH_RETVAL   if HAVE_FUNCTION_GRAPH_TRACER
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b7944a833668..9bf9324214fc 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.

[PATCH RFC 1/3] mm/gup: consistently name GUP-fast functions

2024-03-27 Thread David Hildenbrand
Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename
all relevant internal functions to start with "gup_fast", to make it
clearer that this is not ordinary GUP. The current mixture of
"lockless", "gup" and "gup_fast" is confusing.

Further, avoid the term "huge" when talking about a "leaf" -- for
example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the
"hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that
stays.

What remains is the "external" interface:
* get_user_pages_fast_only()
* get_user_pages_fast()
* pin_user_pages_fast()

And the "internal" interface that handles GUP-fast + fallback:
* internal_get_user_pages_fast()

The high-level internal function for GUP-fast is now:
* gup_fast()

The basic GUP-fast walker functions:
* gup_pgd_range() -> gup_fast_pgd_range()
* gup_p4d_range() -> gup_fast_p4d_range()
* gup_pud_range() -> gup_fast_pud_range()
* gup_pmd_range() -> gup_fast_pmd_range()
* gup_pte_range() -> gup_fast_pte_range()
* gup_huge_pgd()  -> gup_fast_pgd_leaf()
* gup_huge_pud()  -> gup_fast_pud_leaf()
* gup_huge_pmd()  -> gup_fast_pmd_leaf()

The weird hugepd stuff:
* gup_huge_pd() -> gup_fast_hugepd()
* gup_hugepte() -> gup_fast_hugepte()

The weird devmap stuff:
* __gup_device_huge_pud() -> gup_fast_devmap_pud_leaf()
* __gup_device_huge_pmd   -> gup_fast_devmap_pmd_leaf()
* __gup_device_huge() -> gup_fast_devmap_leaf()

Helper functions:
* unpin_user_pages_lockless() -> gup_fast_unpin_user_pages()
* gup_fast_folio_allowed() is already properly named
* gup_fast_permitted() is already properly named

With "gup_fast()", we now even have a function that is referred to in
a comment in mm/mmu_gather.c.

Signed-off-by: David Hildenbrand 
---
 mm/gup.c | 164 ---
 1 file changed, 84 insertions(+), 80 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 03b74b148e30..c293aff30c5d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -440,7 +440,7 @@ void unpin_user_page_range_dirty_lock(struct page *page, 
unsigned long npages,
 }
 EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
 
-static void unpin_user_pages_lockless(struct page **pages, unsigned long 
npages)
+static void gup_fast_unpin_user_pages(struct page **pages, unsigned long 
npages)
 {
unsigned long i;
struct folio *folio;
@@ -2431,7 +2431,7 @@ long get_user_pages_unlocked(unsigned long start, 
unsigned long nr_pages,
 EXPORT_SYMBOL(get_user_pages_unlocked);
 
 /*
- * Fast GUP
+ * GUP-fast
  *
  * get_user_pages_fast attempts to pin user pages by walking the page
  * tables directly and avoids taking locks. Thus the walker needs to be
@@ -2445,7 +2445,7 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  *
  * Another way to achieve this is to batch up page table containing pages
  * belonging to more than one mm_user, then rcu_sched a callback to free those
- * pages. Disabling interrupts will allow the fast_gup walker to both block
+ * pages. Disabling interrupts will allow the gup_fast() walker to both block
  * the rcu_sched callback, and an IPI that we broadcast for splitting THPs
  * (which is a relatively rare event). The code below adopts this strategy.
  *
@@ -2589,9 +2589,9 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int 
nr_start,
  * also check pmd here to make sure pmd doesn't change (corresponds to
  * pmdp_collapse_flush() in the THP collapse code path).
  */
-static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
-unsigned long end, unsigned int flags,
-struct page **pages, int *nr)
+static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+   unsigned long end, unsigned int flags, struct page **pages,
+   int *nr)
 {
struct dev_pagemap *pgmap = NULL;
int nr_start = *nr, ret = 0;
@@ -2688,20 +2688,19 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, 
unsigned long addr,
  *
  * For a futex to be placed on a THP tail page, get_futex_key requires a
  * get_user_pages_fast_only implementation that can pin pages. Thus it's still
- * useful to have gup_huge_pmd even if we can't operate on ptes.
+ * useful to have gup_fast_pmd_leaf even if we can't operate on ptes.
  */
-static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
-unsigned long end, unsigned int flags,
-struct page **pages, int *nr)
+static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+   unsigned long end, unsigned int flags, struct page **pages,
+   int *nr)
 {
return 0;
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
 
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static int __gup

[PATCH RFC 0/3] mm/gup: consistently call it GUP-fast

2024-03-27 Thread David Hildenbrand
Some cleanups around function names, comments and the config option of
"GUP-fast" -- GUP without "lock" safety belts on.

With this cleanup it's easy to judge which functions are GUP-fast specific.
We now consistently call it "GUP-fast", avoiding mixing it with "fast GUP",
"lockless", or simply "gup" (which I always considered confusing in the
code).

So the magic now happens in functions that contain "gup_fast", whereby
gup_fast() is the entry point into that magic. Comments consistently
reference either "GUP-fast" or "gup_fast()".

Based on mm-unstable from today. I won't CC arch maintainers, but only
arch mailing lists, to reduce noise.

Tested on x86_64, cross compiled on a bunch of archs, whereby some of them
don't properly even compile on mm-unstable anymore in my usual setup
(alpha, arc, parisc64, sh) ... maybe the cross compilers are outdated,
but there are no new ones around. Hm.

Cc: Andrew Morton 
Cc: Mike Rapoport (IBM) 
Cc: Jason Gunthorpe 
Cc: John Hubbard 
Cc: Peter Xu 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Cc: loonga...@lists.linux.dev
Cc: linux-m...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-perf-us...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: x...@kernel.org

David Hildenbrand (3):
  mm/gup: consistently name GUP-fast functions
  mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST
  mm: use "GUP-fast" instead "fast GUP" in remaining comments

 arch/arm/Kconfig   |   2 +-
 arch/arm64/Kconfig |   2 +-
 arch/loongarch/Kconfig |   2 +-
 arch/mips/Kconfig  |   2 +-
 arch/powerpc/Kconfig   |   2 +-
 arch/s390/Kconfig  |   2 +-
 arch/sh/Kconfig|   2 +-
 arch/x86/Kconfig   |   2 +-
 include/linux/rmap.h   |   8 +-
 kernel/events/core.c   |   4 +-
 mm/Kconfig |   2 +-
 mm/filemap.c   |   2 +-
 mm/gup.c   | 170 +
 mm/internal.h  |   2 +-
 mm/khugepaged.c|   2 +-
 15 files changed, 105 insertions(+), 101 deletions(-)

-- 
2.43.2



Re: [PATCH 2/4] mm: pgalloc: support address-conditional pmd allocation

2024-02-21 Thread David Hildenbrand

On 21.02.24 08:13, Christophe Leroy wrote:



On 20/02/2024 21:32, Maxwell Bland wrote:


While other descriptors (e.g. pud) allow allocations conditional on
which virtual address is allocated, pmd descriptor allocations do not.
However, adding support for this is straightforward and is beneficial to
future kernel development targeting the PMD memory granularity.

As many architectures already implement pmd_populate_kernel in an
address-generic manner, it is necessary to roll out support
incrementally. For this purpose a preprocessor flag,


Is it really worth it? Only 48 call sites need to be updated. Updating
them directly would avoid that preprocessor flag and avoid introducing
pmd_populate_kernel_at() in the core kernel.


+1, let's avoid that if possible.

--
Cheers,

David / dhildenb



Re: [PATCH v6 06/18] mm: Tidy up pte_next_pfn() definition

2024-02-15 Thread David Hildenbrand

On 15.02.24 11:31, Ryan Roberts wrote:

Now that the all architecture overrides of pte_next_pfn() have been
replaced with pte_advance_pfn(), we can simplify the definition of the
generic pte_next_pfn() macro so that it is unconditionally defined.

Signed-off-by: Ryan Roberts 
---
  include/linux/pgtable.h | 2 --
  1 file changed, 2 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b7ac8358f2aa..bc005d84f764 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,7 +212,6 @@ static inline int pmd_dirty(pmd_t pmd)
  #define arch_flush_lazy_mmu_mode()do {} while (0)
  #endif
  
-#ifndef pte_next_pfn

  #ifndef pte_advance_pfn
  static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
  {
@@ -221,7 +220,6 @@ static inline pte_t pte_advance_pfn(pte_t pte, unsigned 
long nr)
  #endif
  
  #define pte_next_pfn(pte) pte_advance_pfn(pte, 1)

-#endif
  
  #ifndef set_ptes

  /**


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v6 05/18] x86/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread David Hildenbrand

On 15.02.24 11:31, Ryan Roberts wrote:

Core-mm needs to be able to advance the pfn by an arbitrary amount, so
override the new pte_advance_pfn() API to do so.

Signed-off-by: Ryan Roberts 
---
  arch/x86/include/asm/pgtable.h | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b50b2ef63672..69ed0ea0641b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -955,13 +955,13 @@ static inline int pte_same(pte_t a, pte_t b)
return a.pte == b.pte;
  }
  
-static inline pte_t pte_next_pfn(pte_t pte)

+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
  {
if (__pte_needs_invert(pte_val(pte)))
-   return __pte(pte_val(pte) - (1UL << PFN_PTE_SHIFT));
-   return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+   return __pte(pte_val(pte) - (nr << PFN_PTE_SHIFT));
+   return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
  }
-#define pte_next_pfn   pte_next_pfn
+#define pte_advance_pfnpte_advance_pfn
  
  static inline int pte_present(pte_t a)

  {


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v6 04/18] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread David Hildenbrand

On 15.02.24 11:31, Ryan Roberts wrote:

Core-mm needs to be able to advance the pfn by an arbitrary amount, so
override the new pte_advance_pfn() API to do so.

Signed-off-by: Ryan Roberts 
---
  arch/arm64/include/asm/pgtable.h | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 52d0b0a763f1..b6d3e9e0a946 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -351,10 +351,10 @@ static inline pgprot_t pte_pgprot(pte_t pte)
return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
  }
  
-#define pte_next_pfn pte_next_pfn

-static inline pte_t pte_next_pfn(pte_t pte)
+#define pte_advance_pfn pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
  {
-   return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
+   return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
  }
  
  static inline void set_ptes(struct mm_struct *mm,

@@ -370,7 +370,7 @@ static inline void set_ptes(struct mm_struct *mm,
if (--nr == 0)
break;
ptep++;
-   pte = pte_next_pfn(pte);
+   pte = pte_advance_pfn(pte, 1);



Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v6 03/18] mm: Introduce pte_advance_pfn() and use for pte_next_pfn()

2024-02-15 Thread David Hildenbrand

On 15.02.24 11:31, Ryan Roberts wrote:

The goal is to be able to advance a PTE by an arbitrary number of PFNs.
So introduce a new API that takes a nr param. Define the default
implementation here and allow for architectures to override.
pte_next_pfn() becomes a wrapper around pte_advance_pfn().

Follow up commits will convert each overriding architecture's
pte_next_pfn() to pte_advance_pfn().
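
As a rough user-space illustration of the generic implementation shown in the
diff below (toy PTE layout with the PFN above bit 12 and made-up names; real
architectures differ, see the x86/arm64 overrides later in this series):

#include <stdio.h>
#include <stdint.h>

#define TOY_PFN_SHIFT 12	/* bits below this hold permission flags in this toy */

static uint64_t toy_pte_advance_pfn(uint64_t pte, unsigned long nr)
{
	/* Mirrors the generic helper: bump the PFN field, keep the flag bits. */
	return pte + ((uint64_t)nr << TOY_PFN_SHIFT);
}

int main(void)
{
	uint64_t pte = (0x1234ULL << TOY_PFN_SHIFT) | 0x3;	/* pfn 0x1234, two flag bits */

	pte = toy_pte_advance_pfn(pte, 4);
	/* Prints "pfn 0x1238 flags 0x3". */
	printf("pfn 0x%llx flags 0x%llx\n",
	       (unsigned long long)(pte >> TOY_PFN_SHIFT),
	       (unsigned long long)(pte & ((1ULL << TOY_PFN_SHIFT) - 1)));
	return 0;
}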

Signed-off-by: Ryan Roberts 
---
  include/linux/pgtable.h | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 231370e1b80f..b7ac8358f2aa 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,14 +212,17 @@ static inline int pmd_dirty(pmd_t pmd)
  #define arch_flush_lazy_mmu_mode()do {} while (0)
  #endif
  
-

  #ifndef pte_next_pfn
-static inline pte_t pte_next_pfn(pte_t pte)
+#ifndef pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
  {
-   return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+   return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
  }
  #endif
  
+#define pte_next_pfn(pte) pte_advance_pfn(pte, 1)

+#endif
+
  #ifndef set_ptes
  /**
   * set_ptes - Map consecutive pages to a contiguous range of addresses.


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v3 12/15] mm/memory: pass PTE to copy_present_pte()

2024-02-14 Thread David Hildenbrand

On 29.01.24 13:46, David Hildenbrand wrote:

We already read it, let's just forward it.

This patch is based on work by Ryan Roberts.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
  mm/memory.c | 7 +++
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index a3bdb25f4c8d..41b24da5be38 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -959,10 +959,9 @@ static inline void __copy_present_pte(struct 
vm_area_struct *dst_vma,
   */
  static inline int
  copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct 
*src_vma,
-pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
-struct folio **prealloc)
+pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
+int *rss, struct folio **prealloc)
  {
-   pte_t pte = ptep_get(src_pte);
struct page *page;
struct folio *folio;
  
@@ -1103,7 +1102,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,

}
/* copy_present_pte() will clear `*prealloc' if consumed */
ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
-  addr, rss, );
+  ptent, addr, rss, );
/*
 * If we need a pre-allocated page for this pte, drop the
 * locks, allocate, and try again.


The following fixup for that device-exclusive thingy goes on top (fixing an
hmm selftest I just discovered to be broken).


From 8f9e44f25087dc71890b8d9bd680375691232e85 Mon Sep 17 00:00:00 2001
From: David Hildenbrand 
Date: Wed, 14 Feb 2024 23:09:29 +0100
Subject: [PATCH] fixup: mm/memory: pass PTE to copy_present_pte()

For device-exclusive nonswp entries (is_device_exclusive_entry()),
copy_nonpresent_pte() can turn the PTEs into actual present PTEs while
holding the page table lock.

We have to re-read the PTE after that operation, such that we won't be
working on the stale non-present PTE value while assuming it is present.

This fixes the hmm "exclusive_cow" selftest.

 ./run_vmtests.sh -t hmm
 # #  RUN   hmm.hmm_device_private.exclusive_cow ...
 # #OK  hmm.hmm_device_private.exclusive_cow
 # ok 23 hmm.hmm_device_private.exclusive_cow

Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 3b8e56eb08a3..29a75f38df7c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1208,6 +1208,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct 
vm_area_struct *src_vma,
progress += 8;
continue;
}
+   ptent = ptep_get(src_pte);
+   VM_WARN_ON_ONCE(!pte_present(ptent));
 
 			/*

 * Device exclusive entry restored, continue by copying
--
2.43.0


--
Cheers,

David / dhildenb



[PATCH v3 02/10] mm/memory: handle !page case in zap_present_pte() separately

2024-02-14 Thread David Hildenbrand
We don't need up-to-date accessed/dirty bits, so in theory we could
replace ptep_get_and_clear_full() with an optimized ptep_clear_full()
function. Let's rely on the provided pte.

Further, there is no scenario where we would have to insert uffd-wp
markers when zapping something that is not a normal page (i.e., zeropage).
Add a sanity check to make sure this remains true.

should_zap_folio() no longer has to handle NULL pointers. This change
replaces 2/3 "!page/!folio" checks by a single "!page" one.

Note that arch_check_zapped_pte() on x86-64 checks the HW-dirty bit to
detect shadow stack entries. But for shadow stack entries, the HW dirty
bit (in combination with non-writable PTEs) is set by software. So for the
arch_check_zapped_pte() check, we don't have to sync against HW setting
the HW dirty bit concurrently, it is always set.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5b0dc33133a6..4da6923709b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1497,10 +1497,6 @@ static inline bool should_zap_folio(struct zap_details 
*details,
if (should_zap_cows(details))
return true;
 
-   /* E.g. the caller passes NULL for the case of a zero folio */
-   if (!folio)
-   return true;
-
/* Otherwise we should only zap non-anon folios */
return !folio_test_anon(folio);
 }
@@ -1538,24 +1534,28 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
int *rss, bool *force_flush, bool *force_break)
 {
struct mm_struct *mm = tlb->mm;
-   struct folio *folio = NULL;
bool delay_rmap = false;
+   struct folio *folio;
struct page *page;
 
page = vm_normal_page(vma, addr, ptent);
-   if (page)
-   folio = page_folio(page);
+   if (!page) {
+   /* We don't need up-to-date accessed/dirty bits. */
+   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   arch_check_zapped_pte(vma, ptent);
+   tlb_remove_tlb_entry(tlb, pte, addr);
+   VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+   ksm_might_unmap_zero_page(mm, ptent);
+   return;
+   }
 
+   folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
return;
ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
arch_check_zapped_pte(vma, ptent);
tlb_remove_tlb_entry(tlb, pte, addr);
zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
-   if (unlikely(!page)) {
-   ksm_might_unmap_zero_page(mm, ptent);
-   return;
-   }
 
if (!folio_test_anon(folio)) {
if (pte_dirty(ptent)) {
-- 
2.43.0



[PATCH v3 10/10] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-02-14 Thread David Hildenbrand
Similar to how we optimized fork(), let's implement PTE batching when
consecutive (present) PTEs map consecutive pages of the same large
folio.

Most infrastructure we need for batching (mmu gather, rmap) is already
there. We only have to add get_and_clear_full_ptes() and
clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to
process a PTE range.
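
For illustration only, here is a minimal user-space sketch (fake PTE bits and
made-up names, not the kernel helpers added in the diff below) of the idea
behind get_and_clear_full_ptes(): clear a whole batch while folding every
entry's dirty/young bits into the value handed back to the caller:

#include <stdio.h>
#include <stdint.h>

#define FAKE_PTE_DIRTY (1u << 0)
#define FAKE_PTE_YOUNG (1u << 1)

/*
 * Clear nr fake PTEs and return the first one with the dirty/young bits
 * of all cleared entries merged in.
 */
static uint32_t toy_get_and_clear_ptes(uint32_t *ptep, unsigned int nr)
{
	uint32_t pte = ptep[0], tmp;
	unsigned int i;

	ptep[0] = 0;
	for (i = 1; i < nr; i++) {
		tmp = ptep[i];
		ptep[i] = 0;
		if (tmp & FAKE_PTE_DIRTY)
			pte |= FAKE_PTE_DIRTY;
		if (tmp & FAKE_PTE_YOUNG)
			pte |= FAKE_PTE_YOUNG;
	}
	return pte;
}

int main(void)
{
	uint32_t ptes[4] = { 0, FAKE_PTE_YOUNG, FAKE_PTE_DIRTY, 0 };
	uint32_t summary = toy_get_and_clear_ptes(ptes, 4);

	/* Prints "dirty=1 young=1": one dirty/young entry taints the batch. */
	printf("dirty=%d young=%d\n",
	       !!(summary & FAKE_PTE_DIRTY), !!(summary & FAKE_PTE_YOUNG));
	return 0;
}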

We won't bother sanity-checking the mapcount of all subpages, but only
check the mapcount of the first subpage we process. If there is a real
problem hiding somewhere, we can trigger it simply by using small
folios, or when we zap single pages of a large folio. Ideally, we had
that check in rmap code (including for delayed rmap), but then we cannot
print the PTE. Let's keep it simple for now. If we ever have a cheap
folio_mapcount(), we might just want to check for underflows there.

To keep small folios as fast as possible, force inlining of a specialized
variant using __always_inline with nr=1.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 include/linux/pgtable.h | 70 +++
 mm/memory.c | 92 +
 2 files changed, 136 insertions(+), 26 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index aab227e12493..49ab1f73b5c2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -580,6 +580,76 @@ static inline pte_t ptep_get_and_clear_full(struct 
mm_struct *mm,
 }
 #endif
 
+#ifndef get_and_clear_full_ptes
+/**
+ * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
+ *  the same folio, collecting dirty/accessed bits.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
+ * returned PTE.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+   pte_t pte, tmp_pte;
+
+   pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   while (--nr) {
+   ptep++;
+   addr += PAGE_SIZE;
+   tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   if (pte_dirty(tmp_pte))
+   pte = pte_mkdirty(pte);
+   if (pte_young(tmp_pte))
+   pte = pte_mkyoung(pte);
+   }
+   return pte;
+}
+#endif
+
+#ifndef clear_full_ptes
+/**
+ * clear_full_ptes - Clear present PTEs that map consecutive pages of the same
+ *  folio.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr, int full)
+{
+   for (;;) {
+   ptep_get_and_clear_full(mm, addr, ptep, full);
+   if (--nr == 0)
+   break;
+   ptep++;
+   addr += PAGE_SIZE;
+   }
+}
+#endif
 
 /*
  * If two threads concurrently fault at the same page, the thread that
diff --git a/mm/memory.c b/mm/memory.c
index a3efc4da258a..3b8e56eb08a3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1515,7 +1515,7 @@ static inline bool zap_drop_file_uffd_wp(struct 
zap_details *details)
  */
 static inline void
 zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
- unsigned long addr, pte_t *pte,
+ unsigned long addr, pte_t *pte, int nr,
  struct zap_details *details, pte_t pteval)
 {
/* Zap on anonymous always means dropping everything */
@@ -1525,20 +1525,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
*vma,
if (zap_drop_file_uffd_wp(details))
return;
 
-   pte_install_uffd_wp_if_needed(vma, addr, pte, pteval

[PATCH v3 09/10] mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing

2024-02-14 Thread David Hildenbrand
In tlb_batch_pages_flush(), we can end up freeing up to 512 pages or
now up to 256 folio fragments that span more than one page, before we
conditionally reschedule.

It's a pain that we have to handle cond_resched() in
tlb_batch_pages_flush() manually and cannot simply handle it in
release_pages() -- release_pages() can be called from atomic context.
Well, in a perfect world we wouldn't have to make our code more
complicated at all.

With page poisoning and init_on_free, we might now run into soft lockups
when we free a lot of rather large folio fragments, because page freeing
time then depends on the actual memory size we are freeing instead of on
the number of folios that are involved.

In the absolute (unlikely) worst case, on arm64 with 64k we will be able
to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
GiB does sound like it might take a while. But instead of ignoring this
unlikely case, let's just handle it.

So, let's teach tlb_batch_pages_flush() that there are some
configurations where page freeing is horribly slow, and let's reschedule
more frequently -- similar to what we did before we had large folio
fragments in there. Avoid yet another loop over all encoded pages in the
common case by handling that separately.

Note that with page poisoning/zeroing, we might now end up freeing only a
single folio fragment at a time that might exceed the old 512 pages limit:
but if we cannot even free a single MAX_ORDER page on a system without
running into soft lockups, something else is already completely bogus.
Freeing a PMD-mapped THP would similarly cause trouble.

In theory, we might even free 511 order-0 pages + a single MAX_ORDER page,
effectively having to zero out 8703 pages on arm64 with 64k, translating to
~544 MiB of memory: however, if 512 MiB doesn't result in soft lockups,
544 MiB is unlikely to result in soft lockups, so we won't care about
that for the time being.
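
For reference, a tiny user-space check of the arithmetic above (assuming 64k
base pages and a 512 MiB MAX_ORDER block, i.e. 8192 base pages):

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 64 * 1024;		/* 64k base pages */
	unsigned long max_order_pages = 8192;		/* 512 MiB / 64k */
	unsigned long pages = 511 + max_order_pages;	/* order-0 pages + one MAX_ORDER page */

	/* Prints "8703 pages -> 544 MiB" (rounded up). */
	printf("%lu pages -> %lu MiB\n", pages,
	       (pages * page_size + (1UL << 20) - 1) >> 20);
	return 0;
}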

In the future, we might want to detect if handling cond_resched() is
required at all, and just not do any of that with full preemption enabled.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 mm/mmu_gather.c | 58 -
 1 file changed, 43 insertions(+), 15 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index d175c0f1e2c8..99b3e9408aa0 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -91,18 +91,21 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct 
vm_area_struct *vma)
 }
 #endif
 
-static void tlb_batch_pages_flush(struct mmu_gather *tlb)
-{
-   struct mmu_gather_batch *batch;
+/*
+ * We might end up freeing a lot of pages. Reschedule on a regular
+ * basis to avoid soft lockups in configurations without full
+ * preemption enabled. The magic number of 512 folios seems to work.
+ */
+#define MAX_NR_FOLIOS_PER_FREE 512
 
-   for (batch = >local; batch && batch->nr; batch = batch->next) {
-   struct encoded_page **pages = batch->encoded_pages;
+static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
+{
+   struct encoded_page **pages = batch->encoded_pages;
+   unsigned int nr, nr_pages;
 
-   while (batch->nr) {
-   /*
-* limit free batch count when PAGE_SIZE > 4K
-*/
-   unsigned int nr = min(512U, batch->nr);
+   while (batch->nr) {
+   if (!page_poisoning_enabled_static() && !want_init_on_free()) {
+   nr = min(MAX_NR_FOLIOS_PER_FREE, batch->nr);
 
/*
 * Make sure we cover page + nr_pages, and don't leave
@@ -111,14 +114,39 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
if (unlikely(encoded_page_flags(pages[nr - 1]) &
 ENCODED_PAGE_BIT_NR_PAGES_NEXT))
nr++;
+   } else {
+   /*
+* With page poisoning and init_on_free, the time it
+* takes to free memory grows proportionally with the
+* actual memory size. Therefore, limit based on the
+* actual memory size and not the number of involved
+* folios.
+*/
+   for (nr = 0, nr_pages = 0;
+nr < batch->nr && nr_pages < 
MAX_NR_FOLIOS_PER_FREE;
+nr++) {
+   if (unlikely(encoded_page_flags(pages[nr]) &
+ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+   nr_pages += 
encoded_nr_pages(pages[++nr]);
+   else
+   nr_pages++;
+   }
+   

[PATCH v3 08/10] mm/mmu_gather: add __tlb_remove_folio_pages()

2024-02-14 Thread David Hildenbrand
Add __tlb_remove_folio_pages(), which will remove multiple consecutive
pages that belong to the same large folio, instead of only a single
page. We'll be using this function when optimizing unmapping/zapping of
large folios that are mapped by PTEs.

We're using the remaining spare bit in an encoded_page to indicate that
the next encoded page in an array actually contains a shifted "nr_pages".
Teach swap/freeing code about putting multiple folio references, and
delayed rmap handling to remove page ranges of a folio.
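
As a rough user-space illustration of that encoding scheme (names and details,
e.g. how the count is shifted, differ from the kernel's encoded_page helpers):
tag the pointer's spare low bit, and let the following array slot carry the
page count instead of a pointer:

#include <stdio.h>
#include <stdint.h>

#define TOY_BIT_NR_NEXT 1ul	/* "the next array slot holds nr_pages" */

static uintptr_t toy_encode_ptr(void *p, unsigned long flags)
{
	/* Pointers to suitably aligned objects leave the low bit free. */
	return (uintptr_t)p | flags;
}

static void *toy_decode_ptr(uintptr_t enc)
{
	return (void *)(enc & ~(uintptr_t)TOY_BIT_NR_NEXT);
}

int main(void)
{
	static int fake_pages[8];	/* stand-in for a range of struct pages */
	uintptr_t batch[2];

	/* Queue "4 consecutive pages starting at &fake_pages[0]". */
	batch[0] = toy_encode_ptr(&fake_pages[0], TOY_BIT_NR_NEXT);
	batch[1] = 4;			/* consumed together with batch[0] */

	if (batch[0] & TOY_BIT_NR_NEXT)
		printf("%p spans %lu pages\n",
		       toy_decode_ptr(batch[0]), (unsigned long)batch[1]);
	return 0;
}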

This extension still allows for gathering almost as many small folios as
we used to (-1, because we have to prepare for a possibly bigger next
entry), while also allowing us to gather consecutive pages that belong to
the same large folio.

Note that we don't pass the folio pointer, because it is not required for
now. Further, we don't support page_size != PAGE_SIZE, it won't be
required for simple PTE batching.

We have to provide a separate s390 implementation, but it's fairly
straightforward.

Another, more invasive and likely more expensive, approach would be to
use folio+range or a PFN range instead of page+nr_pages. But, we should
do that consistently for the whole mmu_gather. For now, let's keep it
simple and add "nr_pages" only.

Note that it is now possible to gather significantly more pages: in the
past, we were able to gather ~10000 pages; now we can also gather ~5000
folio fragments that span multiple pages. A folio fragment on x86-64 can
span up to 512 pages (2 MiB THP) and on arm64 with 64k in theory 8192 pages
(512 MiB THP). Gathering more memory is not considered something we should
worry about, especially because these are already corner cases.

While we can gather more total memory, we won't free more folio
fragments. As long as page freeing time primarily only depends on the
number of involved folios, there is no effective change for !preempt
configurations. However, we'll adjust tlb_batch_pages_flush() separately to
handle corner cases where page freeing time grows proportionally with the
actual memory size.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 arch/s390/include/asm/tlb.h | 17 +++
 include/asm-generic/tlb.h   |  8 +
 include/linux/mm_types.h| 20 
 mm/mmu_gather.c | 61 +++--
 mm/swap.c   | 12 ++--
 mm/swap_state.c | 15 +++--
 6 files changed, 119 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 48df896d5b79..e95b2c8081eb 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
 static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, bool delay_rmap, int page_size);
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+   struct page *page, unsigned int nr_pages, bool delay_rmap);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather 
*tlb,
return false;
 }
 
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+   struct page *page, unsigned int nr_pages, bool delay_rmap)
+{
+   struct encoded_page *encoded_pages[] = {
+   encode_page(page, ENCODED_PAGE_BIT_NR_PAGES_NEXT),
+   encode_nr_pages(nr_pages),
+   };
+
+   VM_WARN_ON_ONCE(delay_rmap);
+   VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
+
+   free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
+   return false;
+}
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
__tlb_flush_mm_lazy(tlb->mm);
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 95d60a4f468a..bd00dd238b79 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -69,6 +69,7 @@
  *
  *  - tlb_remove_page() / __tlb_remove_page()
  *  - tlb_remove_page_size() / __tlb_remove_page_size()
+ *  - __tlb_remove_folio_pages()
  *
  *__tlb_remove_page_size() is the basic primitive that queues a page for
  *freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
@@ -78,6 +79,11 @@
  *tlb_remove_page() and tlb_remove_page_size() imply the call to
  *tlb_flush_mmu() when required and has no return value.
  *
+ *__tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
+ *instead of removing a single page, remove the given number of consecutive
+ *pages that are all part of the same (large) folio: just like calling
+ *__tlb_remove_page() on each page individually.
+ *
  *  - tlb_change_page_size()
  *
  *call before __tlb_remove_page*() to set the current page-size; implies a
@@ -262,6 +268,8 @@ struct mmu_gather

[PATCH v3 07/10] mm/mmu_gather: add tlb_remove_tlb_entries()

2024-02-14 Thread David Hildenbrand
Let's add a helper that lets us batch-process multiple consecutive PTEs.

Note that the loop will get optimized out on all architectures except on
powerpc. We have to add an early define of __tlb_remove_tlb_entry() on
ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries() a
macro).
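
As a usage sketch (hypothetical caller and made-up variable names, not code
taken verbatim from this series), a path that just unmapped nr consecutive
PTEs of one folio can replace the per-PTE loop

	/* Before: remember each PTE individually. */
	for (i = 0; i < nr; i++)
		tlb_remove_tlb_entry(tlb, ptep + i, addr + i * PAGE_SIZE);

with a single call that records the whole PTE range at once:

	tlb_remove_tlb_entries(tlb, ptep, nr, addr);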

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/include/asm/tlb.h |  2 ++
 include/asm-generic/tlb.h  | 20 
 2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index b3de6102a907..1ca7d4c4b90d 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -19,6 +19,8 @@
 
 #include 
 
+static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
+ unsigned long address);
 #define __tlb_remove_tlb_entry __tlb_remove_tlb_entry
 
 #define tlb_flush tlb_flush
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2eb7b0d4f5d2..95d60a4f468a 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -608,6 +608,26 @@ static inline void tlb_flush_p4d_range(struct mmu_gather 
*tlb,
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
+/**
+ * tlb_remove_tlb_entries - remember unmapping of multiple consecutive ptes for
+ * later tlb invalidation.
+ *
+ * Similar to tlb_remove_tlb_entry(), but remember unmapping of multiple
+ * consecutive ptes instead of only a single one.
+ */
+static inline void tlb_remove_tlb_entries(struct mmu_gather *tlb,
+   pte_t *ptep, unsigned int nr, unsigned long address)
+{
+   tlb_flush_pte_range(tlb, address, PAGE_SIZE * nr);
+   for (;;) {
+   __tlb_remove_tlb_entry(tlb, ptep, address);
+   if (--nr == 0)
+   break;
+   ptep++;
+   address += PAGE_SIZE;
+   }
+}
+
 #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)   \
do {\
unsigned long _sz = huge_page_size(h);  \
-- 
2.43.0



[PATCH v3 06/10] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP

2024-02-14 Thread David Hildenbrand
Nowadays, encoded pages are only used in mmu_gather handling. Let's
update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP. While at
it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.

If encoded page pointers would ever be used in other context again, we'd
likely want to change the defines to reflect their context (e.g.,
ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP). For now, let's keep it simple.

This is a preparation for using the remaining spare bit to indicate that
the next item in an array of encoded pages is a "nr_pages" argument and
not an encoded page.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 include/linux/mm_types.h | 17 +++--
 mm/mmu_gather.c  |  5 +++--
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8b611e13153e..1b89eec0d6df 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -210,8 +210,8 @@ struct page {
  *
  * An 'encoded_page' pointer is a pointer to a regular 'struct page', but
  * with the low bits of the pointer indicating extra context-dependent
- * information. Not super-common, but happens in mmu_gather and mlock
- * handling, and this acts as a type system check on that use.
+ * information. Only used in mmu_gather handling, and this acts as a type
+ * system check on that use.
  *
  * We only really have two guaranteed bits in general, although you could
  * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
@@ -220,21 +220,26 @@ struct page {
  * Use the supplied helper functions to endcode/decode the pointer and bits.
  */
 struct encoded_page;
-#define ENCODE_PAGE_BITS 3ul
+
+#define ENCODED_PAGE_BITS  3ul
+
+/* Perform rmap removal after we have flushed the TLB. */
+#define ENCODED_PAGE_BIT_DELAY_RMAP1ul
+
 static __always_inline struct encoded_page *encode_page(struct page *page, 
unsigned long flags)
 {
-   BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
+   BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
return (struct encoded_page *)(flags | (unsigned long)page);
 }
 
 static inline unsigned long encoded_page_flags(struct encoded_page *page)
 {
-   return ENCODE_PAGE_BITS & (unsigned long)page;
+   return ENCODED_PAGE_BITS & (unsigned long)page;
 }
 
 static inline struct page *encoded_page_ptr(struct encoded_page *page)
 {
-   return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+   return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
 }
 
 /*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index ac733d81b112..6540c99c6758 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -53,7 +53,7 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch 
*batch, struct vm_area_
for (int i = 0; i < batch->nr; i++) {
struct encoded_page *enc = batch->encoded_pages[i];
 
-   if (encoded_page_flags(enc)) {
+   if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
struct page *page = encoded_page_ptr(enc);
folio_remove_rmap_pte(page_folio(page), page, vma);
}
@@ -119,6 +119,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
bool delay_rmap, int page_size)
 {
+   int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
struct mmu_gather_batch *batch;
 
VM_BUG_ON(!tlb->end);
@@ -132,7 +133,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct 
page *page,
 * Add the page and check if we are full. If so
 * force a flush.
 */
-   batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
+   batch->encoded_pages[batch->nr++] = encode_page(page, flags);
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
-- 
2.43.0



[PATCH v3 05/10] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()

2024-02-14 Thread David Hildenbrand
We have two bits available in the encoded page pointer to store
additional information. Currently, we use one bit to request delay of the
rmap removal until after a TLB flush.

We want to make use of the remaining bit internally for batching of
multiple pages of the same folio, specifying that the next encoded page
pointer in an array is actually "nr_pages". So pass page + delay_rmap flag
instead of an encoded page, to handle the encoding internally.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 arch/s390/include/asm/tlb.h | 13 ++---
 include/asm-generic/tlb.h   | 12 ++--
 mm/mmu_gather.c |  7 ---
 3 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index d1455a601adc..48df896d5b79 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -25,8 +25,7 @@
 void __tlb_remove_table(void *_table);
 static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
- int page_size);
+   struct page *page, bool delay_rmap, int page_size);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -42,14 +41,14 @@ static inline bool __tlb_remove_page_size(struct mmu_gather 
*tlb,
  * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
  * has already been freed, so just do free_page_and_swap_cache.
  *
- * s390 doesn't delay rmap removal, so there is nothing encoded in
- * the page pointer.
+ * s390 doesn't delay rmap removal.
  */
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
- int page_size)
+   struct page *page, bool delay_rmap, int page_size)
 {
-   free_page_and_swap_cache(encoded_page_ptr(page));
+   VM_WARN_ON_ONCE(delay_rmap);
+
+   free_page_and_swap_cache(page);
return false;
 }
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 129a3a759976..2eb7b0d4f5d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -260,9 +260,8 @@ struct mmu_gather_batch {
  */
 #define MAX_GATHER_BATCH_COUNT (10000UL/MAX_GATHER_BATCH)
 
-extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
-  struct encoded_page *page,
-  int page_size);
+extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+   bool delay_rmap, int page_size);
 
 #ifdef CONFIG_SMP
 /*
@@ -462,13 +461,14 @@ static inline void tlb_flush_mmu_tlbonly(struct 
mmu_gather *tlb)
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, int page_size)
 {
-   if (__tlb_remove_page_size(tlb, encode_page(page, 0), page_size))
+   if (__tlb_remove_page_size(tlb, page, false, page_size))
tlb_flush_mmu(tlb);
 }
 
-static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct 
page *page, unsigned int flags)
+static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb,
+   struct page *page, bool delay_rmap)
 {
-   return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);
+   return __tlb_remove_page_size(tlb, page, delay_rmap, PAGE_SIZE);
 }
 
 /* tlb_remove_page
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 604ddf08affe..ac733d81b112 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -116,7 +116,8 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
tlb->local.next = NULL;
 }
 
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, 
int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+   bool delay_rmap, int page_size)
 {
struct mmu_gather_batch *batch;
 
@@ -131,13 +132,13 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, 
struct encoded_page *page, i
 * Add the page and check if we are full. If so
 * force a flush.
 */
-   batch->encoded_pages[batch->nr++] = page;
+   batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
batch = tlb->active;
}
-   VM_BUG_ON_PAGE(batch->nr > batch->max, encoded_page_ptr(page));
+   VM_BUG_ON_PAGE(batch->nr > batch->max, page);
 
return false;
 }
-- 
2.43.0



[PATCH v3 04/10] mm/memory: factor out zapping folio pte into zap_present_folio_pte()

2024-02-14 Thread David Hildenbrand
Let's prepare for further changes by factoring it out into a separate
function.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 53 -
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7a3ebb6e5909..a3efc4da258a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1528,30 +1528,14 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
*vma,
pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
 }
 
-static inline void zap_present_pte(struct mmu_gather *tlb,
-   struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
-   unsigned long addr, struct zap_details *details,
-   int *rss, bool *force_flush, bool *force_break)
+static inline void zap_present_folio_pte(struct mmu_gather *tlb,
+   struct vm_area_struct *vma, struct folio *folio,
+   struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
+   struct zap_details *details, int *rss, bool *force_flush,
+   bool *force_break)
 {
struct mm_struct *mm = tlb->mm;
bool delay_rmap = false;
-   struct folio *folio;
-   struct page *page;
-
-   page = vm_normal_page(vma, addr, ptent);
-   if (!page) {
-   /* We don't need up-to-date accessed/dirty bits. */
-   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
-   arch_check_zapped_pte(vma, ptent);
-   tlb_remove_tlb_entry(tlb, pte, addr);
-   VM_WARN_ON_ONCE(userfaultfd_wp(vma));
-   ksm_might_unmap_zero_page(mm, ptent);
-   return;
-   }
-
-   folio = page_folio(page);
-   if (unlikely(!should_zap_folio(details, folio)))
-   return;
 
if (!folio_test_anon(folio)) {
ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
@@ -1586,6 +1570,33 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
}
 }
 
+static inline void zap_present_pte(struct mmu_gather *tlb,
+   struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+   unsigned long addr, struct zap_details *details,
+   int *rss, bool *force_flush, bool *force_break)
+{
+   struct mm_struct *mm = tlb->mm;
+   struct folio *folio;
+   struct page *page;
+
+   page = vm_normal_page(vma, addr, ptent);
+   if (!page) {
+   /* We don't need up-to-date accessed/dirty bits. */
+   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   arch_check_zapped_pte(vma, ptent);
+   tlb_remove_tlb_entry(tlb, pte, addr);
+   VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+   ksm_might_unmap_zero_page(mm, ptent);
+   return;
+   }
+
+   folio = page_folio(page);
+   if (unlikely(!should_zap_folio(details, folio)))
+   return;
+   zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
+ rss, force_flush, force_break);
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
-- 
2.43.0



[PATCH v3 03/10] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()

2024-02-14 Thread David Hildenbrand
We don't need up-to-date accessed-dirty information for anon folios and can
simply work with the ptent we already have. Also, we know the RSS counter
we want to update.

We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
zap_install_uffd_wp_if_needed() after updating the folio and RSS.

While at it, only call zap_install_uffd_wp_if_needed() if there is even
any chance that pte_install_uffd_wp_if_needed() would do *something*.
That is, just don't bother if uffd-wp does not apply.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4da6923709b2..7a3ebb6e5909 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
return;
-   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
-   arch_check_zapped_pte(vma, ptent);
-   tlb_remove_tlb_entry(tlb, pte, addr);
-   zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
 
if (!folio_test_anon(folio)) {
+   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
if (pte_dirty(ptent)) {
folio_mark_dirty(folio);
if (tlb_delay_rmap(tlb)) {
@@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
}
if (pte_young(ptent) && likely(vma_has_recency(vma)))
folio_mark_accessed(folio);
+   rss[mm_counter(folio)]--;
+   } else {
+   /* We don't need up-to-date accessed/dirty bits. */
+   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   rss[MM_ANONPAGES]--;
}
-   rss[mm_counter(folio)]--;
+   arch_check_zapped_pte(vma, ptent);
+   tlb_remove_tlb_entry(tlb, pte, addr);
+   if (unlikely(userfaultfd_pte_wp(vma, ptent)))
+   zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+
if (!delay_rmap) {
folio_remove_rmap_pte(folio, page, vma);
if (unlikely(page_mapcount(page) < 0))
-- 
2.43.0



[PATCH v3 01/10] mm/memory: factor out zapping of present pte into zap_present_pte()

2024-02-14 Thread David Hildenbrand
Let's prepare for further changes by factoring out processing of present
PTEs.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 94 ++---
 1 file changed, 53 insertions(+), 41 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7c3ca41a7610..5b0dc33133a6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
*vma,
pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
 }
 
+static inline void zap_present_pte(struct mmu_gather *tlb,
+   struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+   unsigned long addr, struct zap_details *details,
+   int *rss, bool *force_flush, bool *force_break)
+{
+   struct mm_struct *mm = tlb->mm;
+   struct folio *folio = NULL;
+   bool delay_rmap = false;
+   struct page *page;
+
+   page = vm_normal_page(vma, addr, ptent);
+   if (page)
+   folio = page_folio(page);
+
+   if (unlikely(!should_zap_folio(details, folio)))
+   return;
+   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   arch_check_zapped_pte(vma, ptent);
+   tlb_remove_tlb_entry(tlb, pte, addr);
+   zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+   if (unlikely(!page)) {
+   ksm_might_unmap_zero_page(mm, ptent);
+   return;
+   }
+
+   if (!folio_test_anon(folio)) {
+   if (pte_dirty(ptent)) {
+   folio_mark_dirty(folio);
+   if (tlb_delay_rmap(tlb)) {
+   delay_rmap = true;
+   *force_flush = true;
+   }
+   }
+   if (pte_young(ptent) && likely(vma_has_recency(vma)))
+   folio_mark_accessed(folio);
+   }
+   rss[mm_counter(folio)]--;
+   if (!delay_rmap) {
+   folio_remove_rmap_pte(folio, page, vma);
+   if (unlikely(page_mapcount(page) < 0))
+   print_bad_pte(vma, addr, ptent, page);
+   }
+   if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+   *force_flush = true;
+   *force_break = true;
+   }
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
struct zap_details *details)
 {
+   bool force_flush = false, force_break = false;
struct mm_struct *mm = tlb->mm;
-   int force_flush = 0;
int rss[NR_MM_COUNTERS];
spinlock_t *ptl;
pte_t *start_pte;
@@ -1555,7 +1603,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
arch_enter_lazy_mmu_mode();
do {
pte_t ptent = ptep_get(pte);
-   struct folio *folio = NULL;
+   struct folio *folio;
struct page *page;
 
if (pte_none(ptent))
@@ -1565,45 +1613,9 @@ static unsigned long zap_pte_range(struct mmu_gather 
*tlb,
break;
 
if (pte_present(ptent)) {
-   unsigned int delay_rmap;
-
-   page = vm_normal_page(vma, addr, ptent);
-   if (page)
-   folio = page_folio(page);
-
-   if (unlikely(!should_zap_folio(details, folio)))
-   continue;
-   ptent = ptep_get_and_clear_full(mm, addr, pte,
-   tlb->fullmm);
-   arch_check_zapped_pte(vma, ptent);
-   tlb_remove_tlb_entry(tlb, pte, addr);
-   zap_install_uffd_wp_if_needed(vma, addr, pte, details,
- ptent);
-   if (unlikely(!page)) {
-   ksm_might_unmap_zero_page(mm, ptent);
-   continue;
-   }
-
-   delay_rmap = 0;
-   if (!folio_test_anon(folio)) {
-   if (pte_dirty(ptent)) {
-   folio_mark_dirty(folio);
-   if (tlb_delay_rmap(tlb)) {
-   delay_rmap = 1;
-   force_flush = 1;
-   }
-   }
-   if (pte_young(ptent) && 
likely(vma_has_recency(vma)))
-   folio_mark_accessed(folio);
-   }
-   rss[mm_counter(folio)]--;
-   

[PATCH v3 00/10] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-02-14 Thread David Hildenbrand
Cc: "Naveen N. Rao" 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Alexander Gordeev 
Cc: Christian Borntraeger 
Cc: Sven Schnelle 
Cc: Arnd Bergmann 
Cc: linux-a...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org

David Hildenbrand (10):
  mm/memory: factor out zapping of present pte into zap_present_pte()
  mm/memory: handle !page case in zap_present_pte() separately
  mm/memory: further separate anon and pagecache folio handling in
zap_present_pte()
  mm/memory: factor out zapping folio pte into zap_present_folio_pte()
  mm/mmu_gather: pass "delay_rmap" instead of encoded page to
__tlb_remove_page_size()
  mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP
  mm/mmu_gather: add tlb_remove_tlb_entries()
  mm/mmu_gather: add __tlb_remove_folio_pages()
  mm/mmu_gather: improve cond_resched() handling with large folios and
expensive page freeing
  mm/memory: optimize unmap/zap with PTE-mapped THP

 arch/powerpc/include/asm/tlb.h |   2 +
 arch/s390/include/asm/tlb.h|  30 --
 include/asm-generic/tlb.h  |  40 ++--
 include/linux/mm_types.h   |  37 ++--
 include/linux/pgtable.h|  70 ++
 mm/memory.c| 169 +++--
 mm/mmu_gather.c| 111 ++
 mm/swap.c  |  12 ++-
 mm/swap_state.c|  15 ++-
 9 files changed, 393 insertions(+), 93 deletions(-)


base-commit: 7e56cf9a7f108e8129d75cea0dabc9488fb4defa
-- 
2.43.0



Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread David Hildenbrand

On 13.02.24 15:02, Ryan Roberts wrote:

On 13/02/2024 13:45, David Hildenbrand wrote:

On 13.02.24 14:33, Ard Biesheuvel wrote:

On Tue, 13 Feb 2024 at 14:21, Ryan Roberts  wrote:


On 13/02/2024 13:13, David Hildenbrand wrote:

On 13.02.24 14:06, Ryan Roberts wrote:

On 13/02/2024 12:19, David Hildenbrand wrote:

On 13.02.24 13:06, Ryan Roberts wrote:

On 12/02/2024 20:38, Ryan Roberts wrote:

[...]


+static inline bool mm_is_user(struct mm_struct *mm)
+{
+    /*
+ * Don't attempt to apply the contig bit to kernel mappings,
because
+ * dynamically adding/removing the contig bit can cause page
faults.
+ * These racing faults are ok for user space, since they get
serialized
+ * on the PTL. But kernel mappings can't tolerate faults.
+ */
+    return mm != &init_mm;
+}


We also have the efi_mm as a non-user mm, though I don't think we
manipulate
that while it is live, and I'm not sure if that needs any special
handling.


Well we never need this function in the hot (order-0 folio) path, so I
think I
could add a check for efi_mm here with performance implication. It's
probably
safest to explicitly exclude it? What do you think?


Oops: This should have read "I think I could add a check for efi_mm here
*without* performance implication"


It turns out that efi_mm is only defined when CONFIG_EFI is enabled I
can do
this:

return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);

Is that acceptable? This is my preference, but nothing else outside of efi
references this symbol currently.

Or perhaps I can convince myself that it's safe to treat efi_mm like
userspace.
There are a couple of things that need to be guaranteed for it to be safe:

  - The PFNs of present ptes either need to have an associated struct
page or
    need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
    pte_mkdevmap())

  - Live mappings must either be static (no changes that could cause
fold/unfold
    while live) or the system must be able to tolerate a temporary fault

Mark suggests efi_mm is not manipulated while live, so that meets the
latter
requirement, but I'm not sure about the former?


I've gone through all the efi code, and conclude that, as Mark suggests, the
mappings are indeed static. And additionally, the ptes are populated
using only
the _private_ ptep API, so there is no issue here. As just discussed with
Mark,
my preference is to not make any changes to code, and just add a comment
describing why efi_mm is safe.

Details:

* Registered with ptdump
    * ptep_get_lockless()
* efi_create_mapping -> create_pgd_mapping … -> init_pte:
    * __ptep_get()
    * __set_pte()
* efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
set_permissions
    * __ptep_get()
    * __set_pte()


Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
"official" APIs.


We could, but that would lead to the same linkage issue, which I'm trying to
avoid in the first place:

VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);

This creates new source code dependencies, which I would rather avoid if
possible.


Just a thought, you could have an is_efi_mm() function that abstracts all that.

diff --git a/include/linux/efi.h b/include/linux/efi.h
index c74f47711f0b..152f5fa66a2a 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -692,6 +692,15 @@ extern struct efi {

   extern struct mm_struct efi_mm;

+static inline bool is_efi_mm(struct mm_struct *mm)
+{
+#ifdef CONFIG_EFI
+   return mm == &efi_mm;
+#else
+   return false;
+#endif
+}
+
   static inline int
   efi_guidcmp (efi_guid_t left, efi_guid_t right)
   {




That would definitely work, but in that case, I might as well just check for it
in mm_is_user() (and personally I would change the name to mm_is_efi()):


static inline bool mm_is_user(struct mm_struct *mm)
{
  return mm != &init_mm && !mm_is_efi(mm);
}

Any objections?



Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
declaration is visible to the compiler, and any references should
disappear before the linker could notice that efi_mm does not exist.



Sure, as long as the linker is happy why not. I'll let Ryan mess with that :)


I'm not sure if you are suggesting dropping the mm_is_efi() helper and just use
IS_ENABLED(CONFIG_EFI) in mm_is_user() to guard efi_mm, or if you are suggesting
using IS_ENABLED(CONFIG_EFI) in mm_is_efi() instead of the ifdefery?

The former was what I did initially; it works great, but I didn't like that I
was introducing a new code dependency between efi and arm64 (nothing else outside
of efi references efi_mm).

So then I concluded that it is safe to not worry about efi_mm (thanks for your
confirmation). But then David wanted a VM_WARN check, which reintroduces the
code dependency. So he suggested the mm_is_efi() helper to hide that... This is
all starting to feel circular...


I think Ard meant that insid

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread David Hildenbrand

On 13.02.24 14:33, Ard Biesheuvel wrote:

On Tue, 13 Feb 2024 at 14:21, Ryan Roberts  wrote:


On 13/02/2024 13:13, David Hildenbrand wrote:

On 13.02.24 14:06, Ryan Roberts wrote:

On 13/02/2024 12:19, David Hildenbrand wrote:

On 13.02.24 13:06, Ryan Roberts wrote:

On 12/02/2024 20:38, Ryan Roberts wrote:

[...]


+static inline bool mm_is_user(struct mm_struct *mm)
+{
+/*
+ * Don't attempt to apply the contig bit to kernel mappings, because
+ * dynamically adding/removing the contig bit can cause page faults.
+ * These racing faults are ok for user space, since they get
serialized
+ * on the PTL. But kernel mappings can't tolerate faults.
+ */
+return mm != &init_mm;
+}


We also have the efi_mm as a non-user mm, though I don't think we
manipulate
that while it is live, and I'm not sure if that needs any special handling.


Well we never need this function in the hot (order-0 folio) path, so I
think I
could add a check for efi_mm here with performance implication. It's
probably
safest to explicitly exclude it? What do you think?


Oops: This should have read "I think I could add a check for efi_mm here
*without* performance implication"


It turns out that efi_mm is only defined when CONFIG_EFI is enabled I can do
this:

return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);

Is that acceptable? This is my preference, but nothing else outside of efi
references this symbol currently.

Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
There are a couple of things that need to be guaranteed for it to be safe:

 - The PFNs of present ptes either need to have an associated struct
page or
   need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
   pte_mkdevmap())

 - Live mappings must either be static (no changes that could cause
fold/unfold
   while live) or the system must be able to tolerate a temporary fault

Mark suggests efi_mm is not manipulated while live, so that meets the latter
requirement, but I'm not sure about the former?


I've gone through all the efi code, and conclude that, as Mark suggests, the
mappings are indeed static. And additionally, the ptes are populated using only
the _private_ ptep API, so there is no issue here. As just discussed with Mark,
my preference is to not make any changes to code, and just add a comment
describing why efi_mm is safe.

Details:

* Registered with ptdump
   * ptep_get_lockless()
* efi_create_mapping -> create_pgd_mapping … -> init_pte:
   * __ptep_get()
   * __set_pte()
* efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
set_permissions
   * __ptep_get()
   * __set_pte()


Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
"official" APIs.


We could, but that would lead to the same linkage issue, which I'm trying to
avoid in the first place:

VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);

This creates new source code dependencies, which I would rather avoid if
possible.


Just a thought, you could have an is_efi_mm() function that abstracts all that.

diff --git a/include/linux/efi.h b/include/linux/efi.h
index c74f47711f0b..152f5fa66a2a 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -692,6 +692,15 @@ extern struct efi {

  extern struct mm_struct efi_mm;

+static inline bool is_efi_mm(struct mm_struct *mm)
+{
+#ifdef CONFIG_EFI
+   return mm == &efi_mm;
+#else
+   return false;
+#endif
+}
+
  static inline int
  efi_guidcmp (efi_guid_t left, efi_guid_t right)
  {




That would definitely work, but in that case, I might as well just check for it
in mm_is_user() (and personally I would change the name to mm_is_efi()):


static inline bool mm_is_user(struct mm_struct *mm)
{
 return mm != &init_mm && !mm_is_efi(mm);
}

Any objections?



Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
declaration is visible to the compiler, and any references should
disappear before the linker could notice that efi_mm does not exist.



Sure, as long as the linker is happy why not. I'll let Ryan mess with 
that :)



In any case, feel free to add

Acked-by: Ard Biesheuvel 


Thanks for the review.

--
Cheers,

David / dhildenb



Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread David Hildenbrand

On 13.02.24 14:20, Ryan Roberts wrote:

On 13/02/2024 13:13, David Hildenbrand wrote:

On 13.02.24 14:06, Ryan Roberts wrote:

On 13/02/2024 12:19, David Hildenbrand wrote:

On 13.02.24 13:06, Ryan Roberts wrote:

On 12/02/2024 20:38, Ryan Roberts wrote:

[...]


+static inline bool mm_is_user(struct mm_struct *mm)
+{
+    /*
+ * Don't attempt to apply the contig bit to kernel mappings, because
+ * dynamically adding/removing the contig bit can cause page faults.
+ * These racing faults are ok for user space, since they get
serialized
+ * on the PTL. But kernel mappings can't tolerate faults.
+ */
+    return mm != &init_mm;
+}


We also have the efi_mm as a non-user mm, though I don't think we
manipulate
that while it is live, and I'm not sure if that needs any special handling.


Well we never need this function in the hot (order-0 folio) path, so I
think I
could add a check for efi_mm here with performance implication. It's
probably
safest to explicitly exclude it? What do you think?


Oops: This should have read "I think I could add a check for efi_mm here
*without* performance implication"


It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
this:

return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);

Is that acceptable? This is my preference, but nothing else outside of efi
references this symbol currently.

Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
There are a couple of things that need to be guaranteed for it to be safe:

     - The PFNs of present ptes either need to have an associated struct
page or
   need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
   pte_mkdevmap())

     - Live mappings must either be static (no changes that could cause
fold/unfold
   while live) or the system must be able to tolerate a temporary fault

Mark suggests efi_mm is not manipulated while live, so that meets the latter
requirement, but I'm not sure about the former?


I've gone through all the efi code, and conclude that, as Mark suggests, the
mappings are indeed static. And additionally, the ptes are populated using only
the _private_ ptep API, so there is no issue here. As just discussed with Mark,
my preference is to not make any changes to code, and just add a comment
describing why efi_mm is safe.

Details:

* Registered with ptdump
   * ptep_get_lockless()
* efi_create_mapping -> create_pgd_mapping … -> init_pte:
   * __ptep_get()
   * __set_pte()
* efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
set_permissions
   * __ptep_get()
   * __set_pte()


Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
"official" APIs.


We could, but that would lead to the same linkage issue, which I'm trying to
avoid in the first place:

VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);

This creates new source code dependencies, which I would rather avoid if
possible.


Just a thought, you could have an is_efi_mm() function that abstracts all that.

diff --git a/include/linux/efi.h b/include/linux/efi.h
index c74f47711f0b..152f5fa66a2a 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -692,6 +692,15 @@ extern struct efi {
  
  extern struct mm_struct efi_mm;
  
+static inline bool is_efi_mm(struct mm_struct *mm)

+{
+#ifdef CONFIG_EFI
+   return mm == &efi_mm;
+#else
+   return false;
+#endif
+}
+
  static inline int
  efi_guidcmp (efi_guid_t left, efi_guid_t right)
  {




That would definitely work, but in that case, I might as well just check for it
in mm_is_user() (and personally I would change the name to mm_is_efi()):


static inline bool mm_is_user(struct mm_struct *mm)
{
return mm != &init_mm && !mm_is_efi(mm);
}

Any objections?



Nope :) Maybe slap in an "unlikely()", because efi_mm *is* unlikely to 
show up.
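
Putting the pieces together, a minimal sketch of the helper pair being
discussed (an illustration only; it assumes the mm_is_efi() name and relies
on the compiler discarding the dead efi_mm reference when CONFIG_EFI is not
set):

	static inline bool mm_is_efi(struct mm_struct *mm)
	{
		return IS_ENABLED(CONFIG_EFI) && mm == &efi_mm;
	}

	static inline bool mm_is_user(struct mm_struct *mm)
	{
		/* Never apply the contig bit to kernel or EFI runtime mappings. */
		if (unlikely(mm_is_efi(mm)))
			return false;
		return mm != &init_mm;
	}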


--
Cheers,

David / dhildenb



Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread David Hildenbrand

On 13.02.24 14:06, Ryan Roberts wrote:

On 13/02/2024 12:19, David Hildenbrand wrote:

On 13.02.24 13:06, Ryan Roberts wrote:

On 12/02/2024 20:38, Ryan Roberts wrote:

[...]


+static inline bool mm_is_user(struct mm_struct *mm)
+{
+    /*
+ * Don't attempt to apply the contig bit to kernel mappings, because
+ * dynamically adding/removing the contig bit can cause page faults.
+ * These racing faults are ok for user space, since they get serialized
+ * on the PTL. But kernel mappings can't tolerate faults.
+ */
+    return mm != &init_mm;
+}


We also have the efi_mm as a non-user mm, though I don't think we manipulate
that while it is live, and I'm not sure if that needs any special handling.


Well we never need this function in the hot (order-0 folio) path, so I think I
could add a check for efi_mm here with performance implication. It's probably
safest to explicitly exclude it? What do you think?


Oops: This should have read "I think I could add a check for efi_mm here
*without* performance implication"


It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
this:

return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);

Is that acceptable? This is my preference, but nothing else outside of efi
references this symbol currently.

Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
There are a couple of things that need to be guaranteed for it to be safe:

    - The PFNs of present ptes either need to have an associated struct page or
  need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
  pte_mkdevmap())

    - Live mappings must either be static (no changes that could cause
fold/unfold
  while live) or the system must be able to tolerate a temporary fault

Mark suggests efi_mm is not manipulated while live, so that meets the latter
requirement, but I'm not sure about the former?


I've gone through all the efi code, and conclude that, as Mark suggests, the
mappings are indeed static. And additionally, the ptes are populated using only
the _private_ ptep API, so there is no issue here. As just discussed with Mark,
my preference is to not make any changes to code, and just add a comment
describing why efi_mm is safe.

Details:

* Registered with ptdump
  * ptep_get_lockless()
* efi_create_mapping -> create_pgd_mapping … -> init_pte:
  * __ptep_get()
  * __set_pte()
* efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
set_permissions
  * __ptep_get()
  * __set_pte()


Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
"official" APIs.


We could, but that would lead to the same linkage issue, which I'm trying to
avoid in the first place:

VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);

This creates new source code dependencies, which I would rather avoid if 
possible.


Just a thought, you could have an is_efi_mm() function that abstracts all that.

diff --git a/include/linux/efi.h b/include/linux/efi.h
index c74f47711f0b..152f5fa66a2a 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -692,6 +692,15 @@ extern struct efi {
 
 extern struct mm_struct efi_mm;
 
+static inline bool is_efi_mm(struct mm_struct *mm)

+{
+#ifdef CONFIG_EFI
+   return mm == &efi_mm;
+#else
+   return false;
+#endif
+}
+
 static inline int
 efi_guidcmp (efi_guid_t left, efi_guid_t right)
 {


--
Cheers,

David / dhildenb



Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread David Hildenbrand

On 13.02.24 13:06, Ryan Roberts wrote:

On 12/02/2024 20:38, Ryan Roberts wrote:

[...]


+static inline bool mm_is_user(struct mm_struct *mm)
+{
+   /*
+* Don't attempt to apply the contig bit to kernel mappings, because
+* dynamically adding/removing the contig bit can cause page faults.
+* These racing faults are ok for user space, since they get serialized
+* on the PTL. But kernel mappings can't tolerate faults.
+*/
+   return mm != &init_mm;
+}


We also have the efi_mm as a non-user mm, though I don't think we manipulate
that while it is live, and I'm not sure if that needs any special handling.


Well we never need this function in the hot (order-0 folio) path, so I think I
could add a check for efi_mm here with performance implication. It's probably
safest to explicitly exclude it? What do you think?


Oops: This should have read "I think I could add a check for efi_mm here
*without* performance implication"


It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do 
this:

return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);

Is that acceptable? This is my preference, but nothing else outside of efi
references this symbol currently.

Or perhaps I can convince myself that it's safe to treat efi_mm like userspace.
There are a couple of things that need to be guaranteed for it to be safe:

   - The PFNs of present ptes either need to have an associated struct page or
 need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
 pte_mkdevmap())

   - Live mappings must either be static (no changes that could cause 
fold/unfold
 while live) or the system must be able to tolerate a temporary fault

Mark suggests efi_mm is not manipulated while live, so that meets the latter
requirement, but I'm not sure about the former?


I've gone through all the efi code, and conclude that, as Mark suggests, the
mappings are indeed static. And additionally, the ptes are populated using only
the _private_ ptep API, so there is no issue here. As just discussed with Mark,
my preference is to not make any changes to code, and just add a comment
describing why efi_mm is safe.

Details:

* Registered with ptdump
 * ptep_get_lockless()
* efi_create_mapping -> create_pgd_mapping … -> init_pte:
 * __ptep_get()
 * __set_pte()
* efi_memattr_apply_permissions -> efi_set_mapping_permissions … -> 
set_permissions
 * __ptep_get()
 * __set_pte()


Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via 
the "official" APIs.


--
Cheers,

David / dhildenb



Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread David Hildenbrand

On 12.02.24 21:38, Ryan Roberts wrote:

[...]


+static inline bool mm_is_user(struct mm_struct *mm)
+{
+   /*
+* Don't attempt to apply the contig bit to kernel mappings, because
+* dynamically adding/removing the contig bit can cause page faults.
+* These racing faults are ok for user space, since they get serialized
+* on the PTL. But kernel mappings can't tolerate faults.
+*/
+   return mm != &init_mm;
+}


We also have the efi_mm as a non-user mm, though I don't think we manipulate
that while it is live, and I'm not sure if that needs any special handling.


Well we never need this function in the hot (order-0 folio) path, so I think I
could add a check for efi_mm here with performance implication. It's probably
safest to explicitly exclude it? What do you think?


Oops: This should have read "I think I could add a check for efi_mm here
*without* performance implication"


It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do 
this:

return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);


Please use all the lines you need ;)

if (IS_ENABLED(CONFIG_EFI) && unlikely(mm == &efi_mm))
return false;
return mm != &init_mm;



Is that acceptable? This is my preference, but nothing else outside of efi
references this symbol currently.


We could also mark MMs in some way to be special.

return mm->is_user;

Then it's easy to extend.

--
Cheers,

David / dhildenb



Re: [PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()

2024-02-13 Thread David Hildenbrand

On 12.02.24 22:34, Ryan Roberts wrote:

On 12/02/2024 14:29, David Hildenbrand wrote:

On 12.02.24 15:10, Ryan Roberts wrote:

On 12/02/2024 12:14, David Hildenbrand wrote:

On 02.02.24 09:07, Ryan Roberts wrote:

The goal is to be able to advance a PTE by an arbitrary number of PFNs.
So introduce a new API that takes a nr param.

We are going to remove pte_next_pfn() and replace it with
pte_advance_pfn(). As a first step, implement pte_next_pfn() as a
wrapper around pte_advance_pfn() so that we can incrementally switch the
architectures over. Once all arches are moved over, we will change all
the core-mm callers to call pte_advance_pfn() directly and remove the
wrapper.

Signed-off-by: Ryan Roberts 
---
    include/linux/pgtable.h | 8 +++-
    1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5e7eaf8f2b97..815d92dcb96b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -214,9 +214,15 @@ static inline int pmd_dirty(pmd_t pmd)
        #ifndef pte_next_pfn
+#ifndef pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
+{
+    return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
+}
+#endif
    static inline pte_t pte_next_pfn(pte_t pte)
    {
-    return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+    return pte_advance_pfn(pte, 1);
    }
    #endif



I do wonder if we simply want to leave pte_next_pfn() around? Especially patch
#4, #6 don't really benefit from the change? So are the other set_ptes()
implementations.

That is, only convert all pte_next_pfn()->pte_advance_pfn(), and leave a
pte_next_pfn() macro in place.

Any downsides to that?


The downside is just having multiple functions that effectively do the same
thing. Personally I think it's cleaner and easier to understand the code with
just one generic function which we pass 1 to it where we only want to advance by
1. In the end, there are only a couple of places where pte_advance_pfn(1) is
used, so doesn't really seem valuable to me to maintain a specialization.


Well, not really functions, just a macro. Like we have set_pte_at() translating
to set_ptes().

Arguably, we have more callers of set_pte_at().

"Easier to understand", I don't know. :)



Unless you feel strongly that we need to keep pte_next_pfn() then I'd prefer to
leave it as I've done in this series.


Well, it makes your patch set shorter and there is less code churn.

So personally, I'd just leave pte_next_pfn() in there. But whatever you prefer,
not the end of the world.


I thought about this a bit more and remembered that I'm the apprentice so I've
changed it as you suggested.


Oh, I say stupid things all the time. Please push back if you disagree. :)

[shrinking a patch set if possible and reasonable is often a good idea]

--
Cheers,

David / dhildenb



Re: [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()

2024-02-12 Thread David Hildenbrand

On 12.02.24 16:47, Ryan Roberts wrote:

On 12/02/2024 13:43, David Hildenbrand wrote:

On 02.02.24 09:07, Ryan Roberts wrote:

Some architectures (e.g. arm64) can tell from looking at a pte, if some
follow-on ptes also map contiguous physical memory with the same pgprot.
(for arm64, these are contpte mappings).

Take advantage of this knowledge to optimize folio_pte_batch() so that
it can skip these ptes when scanning to create a batch. By default, if
an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
the changes are optimized out and the behaviour is as before.

arm64 will opt-in to providing this hint in the next patch, which will
greatly reduce the cost of ptep_get() when scanning a range of contptes.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
   include/linux/pgtable.h | 18 ++
   mm/memory.c | 20 +---
   2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 50f32cccbd92..cba31f177d27 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
   #define arch_flush_lazy_mmu_mode()    do {} while (0)
   #endif
   +#ifndef pte_batch_hint
+/**
+ * pte_batch_hint - Number of pages that can be added to batch without 
scanning.
+ * @ptep: Page table pointer for the entry.
+ * @pte: Page table entry.
+ *
+ * Some architectures know that a set of contiguous ptes all map the same
+ * contiguous memory with the same permissions. In this case, it can provide a
+ * hint to aid pte batching without the core code needing to scan every pte.


I think we might want to document here the expectation regarding
dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
always when batching, because the dirty bit may target any pte part of the
cont-pte group either way.

Maybe something like:

"
An architecture implementation may only ignore the PTE accessed and dirty bits.
Further, it may only ignore the dirty bit if that bit is already not
maintained with precision per PTE inside the hinted batch, and ptep_get()
would already have to collect it from various PTEs.
"


I'm proposing to simplify this to:

"
An architecture implementation may ignore the PTE accessed state. Further, the
dirty state must apply atomically to all the PTEs described by the hint.
"

Which I think more accurately describes the requirement. Shout if you disagree.


I'm not 100% sure if the "must apply atomically" is clear without all of 
the cont-pte details and ptep_get(). But I fail to describe it in a 
better way.


It's all better compared to what we had before, so LGTM :)

--
Cheers,

David / dhildenb



Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-12 Thread David Hildenbrand

On 12.02.24 16:34, Ryan Roberts wrote:

On 12/02/2024 15:26, David Hildenbrand wrote:

On 12.02.24 15:45, Ryan Roberts wrote:

On 12/02/2024 13:54, David Hildenbrand wrote:

If so, I wonder if we could instead do that comparison modulo the access/dirty
bits,


I think that would work - but will need to think a bit more on it.


and leave ptep_get_lockless() only reading a single entry?


I think we will need to do something a bit less fragile. ptep_get() does
collect
the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
we will likely want to rename the function and make its documentation explicit
that it does not return those bits.

ptep_get_lockless_noyoungdirty()? yuk... Any ideas?

Of course if I could convince you the current implementation is safe, I
might be
able to sidestep this optimization until a later date?


As discussed (and pointed out above), there might be quite some callsites where
we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
used nowadays.

One way to approach that I had in mind was having an explicit interface:

ptep_get()
ptep_get_uptodate()
ptep_get_lockless()
ptep_get_lockless_uptodate()


Yes, I like the direction of this. I guess we anticipate that call sites
requiring the "_uptodate" variant will be the minority so it makes sense to use
the current names for the "_not_uptodate" variants? But to do a slow migration,
it might be better/safer to have the weaker variant use the new name - that
would allow us to downgrade one at a time?


Yes, I was primarily struggling with names. Likely it makes sense to either have
two completely new function names, or use the new name only for the "faster but
less precise" variant.





Especially the last one might not be needed.

I've done a scan through the code and agree with Mark's original conclusions.
Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
access/dirty info. So I think I could migrate everything to the weaker variant
fairly easily.



Further, "uptodate" might not be the best choice because of PageUptodate() and
friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.


Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
"_nosync"?


I could live with

ptep_get_sync()
ptep_get_nosync()

with proper documentation :)


but could you live with:

ptep_get()
ptep_get_nosync()
ptep_get_lockless_nosync()

?

So leave the "slower, more precise" version with the existing name.


Sure.

--
Cheers,

David / dhildenb



Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-12 Thread David Hildenbrand

On 12.02.24 15:45, Ryan Roberts wrote:

On 12/02/2024 13:54, David Hildenbrand wrote:

If so, I wonder if we could instead do that comparison modulo the access/dirty
bits,


I think that would work - but will need to think a bit more on it.


and leave ptep_get_lockless() only reading a single entry?


I think we will need to do something a bit less fragile. ptep_get() does collect
the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
we will likely want to rename the function and make its documentation explicit
that it does not return those bits.

ptep_get_lockless_noyoungdirty()? yuk... Any ideas?

Of course if I could convince you the current implementation is safe, I might be
able to sidestep this optimization until a later date?


As discussed (and pointed out above), there might be quite some callsites where
we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
used nowadays.

One way to approach that I had in mind was having an explicit interface:

ptep_get()
ptep_get_uptodate()
ptep_get_lockless()
ptep_get_lockless_uptodate()


Yes, I like the direction of this. I guess we anticipate that call sites
requiring the "_uptodate" variant will be the minority so it makes sense to use
the current names for the "_not_uptodate" variants? But to do a slow migration,
it might be better/safer to have the weaker variant use the new name - that
would allow us to downgrade one at a time?


Yes, I was primarily struggling with names. Likely it makes sense to 
either have two completely new function names, or use the new name only 
for the "faster but less precise" variant.






Especially the last one might not be needed.

I've done a scan through the code and agree with Mark's original conclusions.
Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
access/dirty info. So I think I could migrate everything to the weaker variant
fairly easily.



Further, "uptodate" might not be the best choice because of PageUptodate() and
friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.


Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
"_nosync"?


I could live with

ptep_get_sync()
ptep_get_nosync()

with proper documentation :)

I don't think we use "_sync" / "_nosync" in the context of pte 
operations yet.


Well, there seems to be "__arm_v7s_pte_sync" in iommu code, but at least
in core code nothing jumped at me.


--
Cheers,

David / dhildenb



Re: [PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()

2024-02-12 Thread David Hildenbrand

On 12.02.24 15:10, Ryan Roberts wrote:

On 12/02/2024 12:14, David Hildenbrand wrote:

On 02.02.24 09:07, Ryan Roberts wrote:

The goal is to be able to advance a PTE by an arbitrary number of PFNs.
So introduce a new API that takes a nr param.

We are going to remove pte_next_pfn() and replace it with
pte_advance_pfn(). As a first step, implement pte_next_pfn() as a
wrapper around pte_advance_pfn() so that we can incrementally switch the
architectures over. Once all arches are moved over, we will change all
the core-mm callers to call pte_advance_pfn() directly and remove the
wrapper.

Signed-off-by: Ryan Roberts 
---
   include/linux/pgtable.h | 8 +++-
   1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5e7eaf8f2b97..815d92dcb96b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -214,9 +214,15 @@ static inline int pmd_dirty(pmd_t pmd)
       #ifndef pte_next_pfn
+#ifndef pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
+{
+    return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
+}
+#endif
   static inline pte_t pte_next_pfn(pte_t pte)
   {
-    return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+    return pte_advance_pfn(pte, 1);
   }
   #endif
   


I do wonder if we simply want to leave pte_next_pfn() around? Especially patch
#4, #6 don't really benefit from the change? So are the other set_ptes()
implementations.

That is, only convert all pte_next_pfn()->pte_advance_pfn(), and leave a
pte_next_pfn() macro in place.

Any downsides to that?


The downside is just having multiple functions that effectively do the same
thing. Personally I think it's cleaner and easier to understand the code with
just one generic function which we pass 1 to it where we only want to advance by
1. In the end, there are only a couple of places where pte_advance_pfn(1) is
used, so doesn't really seem valuable to me to maintain a specialization.


Well, not really functions, just a macro. Like we have set_pte_at() 
translating to set_ptes().


Arguably, we have more callers of set_pte_at().

"Easier to understand", I don't know. :)



Unless you feel strongly that we need to keep pte_next_pfn() then I'd prefer to
leave it as I've done in this series.


Well, it makes your patch set shorter and there is less code churn.

So personally, I'd just leave pte_next_pfn() in there. But whatever you 
prefer, not the end of the world.
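
For illustration, keeping the wrapper around could be as small as a macro on
top of the new helper (a sketch built from the generic definitions in the
patch, not a concrete proposal):

	#ifndef pte_next_pfn
	#define pte_next_pfn(pte)	pte_advance_pfn(pte, 1)
	#endif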


--
Cheers,

David / dhildenb



Re: [PATCH] mm/hugetlb: Move page order check inside hugetlb_cma_reserve()

2024-02-12 Thread David Hildenbrand

On 09.02.24 06:42, Anshuman Khandual wrote:

All platforms could benefit from page order check against MAX_PAGE_ORDER
before allocating a CMA area for gigantic hugetlb pages. Let's move this
check from individual platforms to generic hugetlb.

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
This applies on v6.8-rc3
  
  arch/arm64/mm/hugetlbpage.c   | 7 ---

  arch/powerpc/mm/hugetlbpage.c | 4 +---
  mm/hugetlb.c  | 7 +++
  3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 8116ac599f80..6720ec8d50e7 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -45,13 +45,6 @@ void __init arm64_hugetlb_cma_reserve(void)
else
order = CONT_PMD_SHIFT - PAGE_SHIFT;
  
-	/*

-* HugeTLB CMA reservation is required for gigantic
-* huge pages which could not be allocated via the
-* page allocator. Just warn if there is any change
-* breaking this assumption.
-*/
-   WARN_ON(order <= MAX_PAGE_ORDER);
hugetlb_cma_reserve(order);
  }
  #endif /* CONFIG_CMA */
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 0a540b37aab6..16557d008eef 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -614,8 +614,6 @@ void __init gigantic_hugetlb_cma_reserve(void)
 */
order = mmu_psize_to_shift(MMU_PAGE_16G) - PAGE_SHIFT;
  
-	if (order) {

-   VM_WARN_ON(order <= MAX_PAGE_ORDER);
+   if (order)
hugetlb_cma_reserve(order);
-   }
  }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cf9c9b2906ea..345b3524df35 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7699,6 +7699,13 @@ void __init hugetlb_cma_reserve(int order)
bool node_specific_cma_alloc = false;
int nid;
  
+	/*

+* HugeTLB CMA reservation is required for gigantic
+* huge pages which could not be allocated via the
+* page allocator. Just warn if there is any change
+* breaking this assumption.
+*/
+   VM_WARN_ON(order <= MAX_PAGE_ORDER);
cma_reserve_called = true;
  
  	if (!hugetlb_cma_size)


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-12 Thread David Hildenbrand

If so, I wonder if we could instead do that comparison modulo the access/dirty
bits,


I think that would work - but will need to think a bit more on it.


and leave ptep_get_lockless() only reading a single entry?


I think we will need to do something a bit less fragile. ptep_get() does collect
the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
we will likely want to rename the function and make its documentation explicit
that it does not return those bits.

ptep_get_lockless_noyoungdirty()? yuk... Any ideas?

Of course if I could convince you the current implementation is safe, I might be
able to sidestep this optimization until a later date?


As discussed (and pointed out above), there might be quite some
callsites where we don't really care about uptodate accessed/dirty bits 
-- where ptep_get() is used nowadays.


One way to approach that I had in mind was having an explicit interface:

ptep_get()
ptep_get_uptodate()
ptep_get_lockless()
ptep_get_lockless_uptodate()

Especially the last one might not be needed.

Further, "uptodate" might not be the best choice because of
PageUptodate() and friends. But it's better than 
"youngdirty"/"noyoungdirty" IMHO.


Of course, any such changes require care and are better done one step at
a time, separately.


--
Cheers,

David / dhildenb



Re: [PATCH v5 23/25] arm64/mm: Implement pte_batch_hint()

2024-02-12 Thread David Hildenbrand

On 02.02.24 09:07, Ryan Roberts wrote:

When core code iterates over a range of ptes and calls ptep_get() for
each of them, if the range happens to cover contpte mappings, the number
of pte reads becomes amplified by a factor of the number of PTEs in a
contpte block. This is because for each call to ptep_get(), the
implementation must read all of the ptes in the contpte block to which
it belongs to gather the access and dirty bits.

This causes a hotspot for fork(), as well as operations that unmap
memory such as munmap(), exit and madvise(MADV_DONTNEED). Fortunately we
can fix this by implementing pte_batch_hint() which allows their
iterators to skip getting the contpte tail ptes when gathering the batch
of ptes to operate on. This results in the number of PTE reads returning
to 1 per pte.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
  arch/arm64/include/asm/pgtable.h | 9 +
  1 file changed, 9 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index ad04adb7b87f..353ea67b5d75 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1220,6 +1220,15 @@ static inline void contpte_try_unfold(struct mm_struct 
*mm, unsigned long addr,
__contpte_try_unfold(mm, addr, ptep, pte);
  }
  
+#define pte_batch_hint pte_batch_hint

+static inline unsigned int pte_batch_hint(pte_t *ptep, pte_t pte)
+{
+   if (!pte_valid_cont(pte))
+   return 1;
+
+   return CONT_PTES - (((unsigned long)ptep >> 3) & (CONT_PTES - 1));
+}
+
  /*
   * The below functions constitute the public API that arm64 presents to the
   * core-mm to manipulate PTE entries within their page tables (or at least 
this



Reviewed-by: David Hildenbrand 
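
As a worked example of the arithmetic in pte_batch_hint() above (a sketch
only, assuming sizeof(pte_t) == 8 and CONT_PTES == 16; the real values depend
on the configured page size):

	/*
	 * ptep >> 3 turns the pointer into a pte index, and masking with
	 * CONT_PTES - 1 gives the offset of this entry inside its contpte
	 * block:
	 *
	 *   offset = ((unsigned long)ptep >> 3) & (CONT_PTES - 1);   e.g. 5
	 *   hint   = CONT_PTES - offset;                             16 - 5 = 11
	 *
	 * i.e. 11 entries (this one plus the next 10) can be batched before
	 * crossing into the next contpte block, without reading any of the
	 * intervening ptes.
	 */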

--
Cheers,

David / dhildenb



Re: [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()

2024-02-12 Thread David Hildenbrand

On 02.02.24 09:07, Ryan Roberts wrote:

Some architectures (e.g. arm64) can tell from looking at a pte, if some
follow-on ptes also map contiguous physical memory with the same pgprot.
(for arm64, these are contpte mappings).

Take advantage of this knowledge to optimize folio_pte_batch() so that
it can skip these ptes when scanning to create a batch. By default, if
an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
the changes are optimized out and the behaviour is as before.

arm64 will opt-in to providing this hint in the next patch, which will
greatly reduce the cost of ptep_get() when scanning a range of contptes.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
  include/linux/pgtable.h | 18 ++
  mm/memory.c | 20 +---
  2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 50f32cccbd92..cba31f177d27 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
  #define arch_flush_lazy_mmu_mode()do {} while (0)
  #endif
  
+#ifndef pte_batch_hint

+/**
+ * pte_batch_hint - Number of pages that can be added to batch without 
scanning.
+ * @ptep: Page table pointer for the entry.
+ * @pte: Page table entry.
+ *
+ * Some architectures know that a set of contiguous ptes all map the same
+ * contiguous memory with the same permissions. In this case, it can provide a
+ * hint to aid pte batching without the core code needing to scan every pte.


I think we might want to document here the expectation regarding
dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
always when batching, because the dirty bit may target any pte part of the
cont-pte group either way.

Maybe something like:

"
An architecture implementation may only ignore the PTE accessed and dirty bits.
Further, it may only ignore the dirty bit if that bit is already not
maintained with precision per PTE inside the hinted batch, and ptep_get()
would already have to collect it from various PTEs.
"

I think there are some more details to it, but I'm hoping something along
the lines above is sufficient.



+
  #ifndef pte_advance_pfn
  static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
  {
diff --git a/mm/memory.c b/mm/memory.c
index 65fbe4f886c1..902665b27702 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -988,16 +988,21 @@ static inline int folio_pte_batch(struct folio *folio, 
unsigned long addr,
  {
unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
const pte_t *end_ptep = start_ptep + max_nr;
-   pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1), 
flags);
-   pte_t *ptep = start_ptep + 1;
+   pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
+   pte_t *ptep = start_ptep;
bool writable;
+   int nr;
  
  	if (any_writable)

*any_writable = false;
  
  	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
  
-	while (ptep != end_ptep) {

+   nr = pte_batch_hint(ptep, pte);
+   expected_pte = pte_advance_pfn(expected_pte, nr);
+   ptep += nr;
+


*Maybe* it's easier to get when initializing expected_pte+ptep only once.

Like:

[...]
pte_t expected_pte, *ptep;
[...]

nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
ptep = start_ptep + nr;


+   while (ptep < end_ptep) {
pte = ptep_get(ptep);
if (any_writable)
writable = !!pte_write(pte);
@@ -1011,17 +1016,18 @@ static inline int folio_pte_batch(struct folio *folio, 
unsigned long addr,
 * corner cases the next PFN might fall into a different
 * folio.
 */
-   if (pte_pfn(pte) == folio_end_pfn)
+   if (pte_pfn(pte) >= folio_end_pfn)
break;
  
  		if (any_writable)

*any_writable |= writable;
  
-		expected_pte = pte_advance_pfn(expected_pte, 1);

-   ptep++;
+   nr = pte_batch_hint(ptep, pte);
+   expected_pte = pte_advance_pfn(expected_pte, nr);
+   ptep += nr;
}
  
-	return ptep - start_ptep;

+   return min(ptep - start_ptep, max_nr);
  }


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v5 18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-12 Thread David Hildenbrand

On 12.02.24 14:05, Ryan Roberts wrote:

On 12/02/2024 12:44, David Hildenbrand wrote:

On 02.02.24 09:07, Ryan Roberts wrote:

Split __flush_tlb_range() into __flush_tlb_range_nosync() +
__flush_tlb_range(), in the same way as the existing flush_tlb_page()
arrangement. This allows calling __flush_tlb_range_nosync() to elide the
trailing DSB. Forthcoming "contpte" code will take advantage of this
when clearing the young bit from a contiguous range of ptes.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
   arch/arm64/include/asm/tlbflush.h | 13 +++--
   1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h
b/arch/arm64/include/asm/tlbflush.h
index 79e932a1bdf8..50a765917327 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -422,7 +422,7 @@ do {    \
   #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
   __flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false,
kvm_lpa2_is_enabled());
   -static inline void __flush_tlb_range(struct vm_area_struct *vma,
+static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
    unsigned long start, unsigned long end,
    unsigned long stride, bool last_level,
    int tlb_level)
@@ -456,10 +456,19 @@ static inline void __flush_tlb_range(struct
vm_area_struct *vma,
   __flush_tlb_range_op(vae1is, start, pages, stride, asid,
    tlb_level, true, lpa2_is_enabled());
   -    dsb(ish);
   mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
   }
   +static inline void __flush_tlb_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end,
+ unsigned long stride, bool last_level,
+ int tlb_level)
+{
+    __flush_tlb_range_nosync(vma, start, end, stride,
+ last_level, tlb_level);
+    dsb(ish);
+}
+
   static inline void flush_tlb_range(struct vm_area_struct *vma,
  unsigned long start, unsigned long end)
   {


You're now calling dsb() after mmu_notifier_arch_invalidate_secondary_tlbs().


In flush_tlb_mm(), we have the order

 dsb(ish);
 mmu_notifier_arch_invalidate_secondary_tlbs()

In flush_tlb_page(), we have the effective order:

 mmu_notifier_arch_invalidate_secondary_tlbs()
 dsb(ish);

In flush_tlb_range(), we used to have the order:

 dsb(ish);
 mmu_notifier_arch_invalidate_secondary_tlbs();


So I *suspect* having that DSB before
mmu_notifier_arch_invalidate_secondary_tlbs() is fine. Hopefully, nothing in
there relies on that placement.


Will spotted this against v3. My argument was that I was following the existing
pattern in flush_tlb_page(). Apparently that is not correct and needs changing,
but the conclusion was to leave my change as is for now, since it is consistent
and change them at a later date together.


Good, I think you should add a few words to the patch description 
("ordering might be incorrect, but is in-line with __flush_tlb_page()"; 
will be resolved separately).
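
For context, a purely illustrative sketch (not taken from the contpte series;
the function name, stride and tlb_level below are assumptions) of the kind of
caller the split enables -- clearing the young bit across a contiguous block
while eliding the trailing DSB, since racy access-flag updates tolerate a late
TLB invalidation:

static int example_clear_young_contig(struct vm_area_struct *vma,
				      unsigned long addr, pte_t *ptep,
				      unsigned int nr_ptes)
{
	unsigned long end = addr + nr_ptes * PAGE_SIZE;
	unsigned long a = addr;
	int young = 0;
	unsigned int i;

	for (i = 0; i < nr_ptes; i++, ptep++, a += PAGE_SIZE)
		young |= ptep_test_and_clear_young(vma, a, ptep);

	if (young)
		/* Queue the TLBIs, but skip the per-call DSB entirely. */
		__flush_tlb_range_nosync(vma, addr, end, PAGE_SIZE, true, 3);

	return young;
}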


--
Cheers,

David / dhildenb



Re: [PATCH v5 18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-12 Thread David Hildenbrand

On 02.02.24 09:07, Ryan Roberts wrote:

Split __flush_tlb_range() into __flush_tlb_range_nosync() +
__flush_tlb_range(), in the same way as the existing flush_tlb_page()
arrangement. This allows calling __flush_tlb_range_nosync() to elide the
trailing DSB. Forthcoming "contpte" code will take advantage of this
when clearing the young bit from a contiguous range of ptes.

Tested-by: John Hubbard 
Signed-off-by: Ryan Roberts 
---
  arch/arm64/include/asm/tlbflush.h | 13 +++--
  1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h 
b/arch/arm64/include/asm/tlbflush.h
index 79e932a1bdf8..50a765917327 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -422,7 +422,7 @@ do {
\
  #define __flush_s2_tlb_range_op(op, start, pages, stride, tlb_level) \
__flush_tlb_range_op(op, start, pages, stride, 0, tlb_level, false, 
kvm_lpa2_is_enabled());
  
-static inline void __flush_tlb_range(struct vm_area_struct *vma,

+static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
 unsigned long start, unsigned long end,
 unsigned long stride, bool last_level,
 int tlb_level)
@@ -456,10 +456,19 @@ static inline void __flush_tlb_range(struct 
vm_area_struct *vma,
__flush_tlb_range_op(vae1is, start, pages, stride, asid,
 tlb_level, true, lpa2_is_enabled());
  
-	dsb(ish);

mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
  }
  
+static inline void __flush_tlb_range(struct vm_area_struct *vma,

+unsigned long start, unsigned long end,
+unsigned long stride, bool last_level,
+int tlb_level)
+{
+   __flush_tlb_range_nosync(vma, start, end, stride,
+last_level, tlb_level);
+   dsb(ish);
+}
+
  static inline void flush_tlb_range(struct vm_area_struct *vma,
   unsigned long start, unsigned long end)
  {


You're now calling dsb() after 
mmu_notifier_arch_invalidate_secondary_tlbs().



In flush_tlb_mm(), we have the order

dsb(ish);   
mmu_notifier_arch_invalidate_secondary_tlbs()

In flush_tlb_page(), we have the effective order:

mmu_notifier_arch_invalidate_secondary_tlbs()
dsb(ish);

In flush_tlb_range(), we used to have the order:

dsb(ish);
mmu_notifier_arch_invalidate_secondary_tlbs();


So I *suspect* having that DSB before 
mmu_notifier_arch_invalidate_secondary_tlbs() is fine. Hopefully, 
nothing in there relies on that placement.


Maybe worth spelling out in the patch description

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()

2024-02-12 Thread David Hildenbrand

On 02.02.24 09:07, Ryan Roberts wrote:

The goal is to be able to advance a PTE by an arbitrary number of PFNs.
So introduce a new API that takes a nr param.

We are going to remove pte_next_pfn() and replace it with
pte_advance_pfn(). As a first step, implement pte_next_pfn() as a
wrapper around pte_advance_pfn() so that we can incrementally switch the
architectures over. Once all arches are moved over, we will change all
the core-mm callers to call pte_advance_pfn() directly and remove the
wrapper.

Signed-off-by: Ryan Roberts 
---
  include/linux/pgtable.h | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5e7eaf8f2b97..815d92dcb96b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -214,9 +214,15 @@ static inline int pmd_dirty(pmd_t pmd)
  
  
  #ifndef pte_next_pfn

+#ifndef pte_advance_pfn
+static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
+{
+   return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
+}
+#endif
  static inline pte_t pte_next_pfn(pte_t pte)
  {
-   return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+   return pte_advance_pfn(pte, 1);
  }
  #endif
  


I do wonder if we simply want to leave pte_next_pfn() around? Especially 
patches #4 and #6 don't really benefit from the change, and neither do the 
other set_ptes() implementations.


That is, only convert all pte_next_pfn()->pte_advance_pfn(), and leave a
pte_next_pfn() macro in place.

Any downsides to that? This patch here would become:

#ifndef pte_advance_pfn
static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
{
return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
}
#endif

#ifndef pte_next_pfn
#define pte_next_pfn(pte) pte_advance_pfn(pte, 1)
#endif

As you convert the three arches, make them define pte_advance_pfn and 
undefine pte_next_pfn. In the end, you can drop the #ifndef around 
pte_next_pfn here.


--
Cheers,

David / dhildenb



Re: [PATCH v5 01/25] mm: Clarify the spec for set_ptes()

2024-02-12 Thread David Hildenbrand

On 02.02.24 09:07, Ryan Roberts wrote:

set_ptes() spec implies that it can only be used to set a present pte
because it interprets the PFN field to increment it. However,
set_pte_at() has been implemented on top of set_ptes() since set_ptes()
was introduced, and set_pte_at() allows setting a pte to a not-present
state. So clarify the spec to state that when nr==1, new state of pte
may be present or not present. When nr>1, new state of all ptes must be
present.

While we are at it, tighten the spec to set requirements around the
initial state of ptes; when nr==1 it may be either present or
not-present. But when nr>1 all ptes must initially be not-present. All
set_ptes() callsites already conform to this requirement. Stating it
explicitly is useful because it allows for a simplification to the
upcoming arm64 contpte implementation.

Signed-off-by: Ryan Roberts 
---
  include/linux/pgtable.h | 4 
  1 file changed, 4 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f0feae7f89fb..5e7eaf8f2b97 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -229,6 +229,10 @@ static inline pte_t pte_next_pfn(pte_t pte)
   * @pte: Page table entry for the first page.
   * @nr: Number of pages to map.
   *
+ * When nr==1, initial state of pte may be present or not present, and new 
state
+ * may be present or not present. When nr>1, initial state of all ptes must be
+ * not present, and new state must be present.
+ *
   * May be overridden by the architecture, or the architecture can define
   * set_pte() and PFN_PTE_SHIFT.
   *
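
As a quick reminder of why the nr==1 case must keep allowing non-present
state: the generic set_pte_at() wrapper is built on top of set_ptes() with
nr == 1, and callers do use it to install swap and other non-present entries.
Roughly:

/* Rough sketch of the generic wrapper in <linux/pgtable.h>. */
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)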


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb



Re: [PATCH v2 09/10] mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing

2024-02-12 Thread David Hildenbrand

On 12.02.24 12:21, Ryan Roberts wrote:

On 12/02/2024 11:05, David Hildenbrand wrote:

On 12.02.24 11:56, David Hildenbrand wrote:

On 12.02.24 11:32, Ryan Roberts wrote:

On 12/02/2024 10:11, David Hildenbrand wrote:

Hi Ryan,


-static void tlb_batch_pages_flush(struct mmu_gather *tlb)
+static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
     {
-    struct mmu_gather_batch *batch;
-
-    for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-    struct encoded_page **pages = batch->encoded_pages;
+    struct encoded_page **pages = batch->encoded_pages;
+    unsigned int nr, nr_pages;
     +    /*
+ * We might end up freeing a lot of pages. Reschedule on a regular
+ * basis to avoid soft lockups in configurations without full
+ * preemption enabled. The magic number of 512 folios seems to work.
+ */
+    if (!page_poisoning_enabled_static() && !want_init_on_free()) {


Is the performance win really worth 2 separate implementations keyed off this?
It seems a bit fragile, in case any other operations get added to free
which are
proportional to size in future. Why not just always do the conservative
version?


I really don't want to iterate over all entries on the "sane" common case. We
already do that two times:

a) free_pages_and_swap_cache()

b) release_pages()

Only the latter really is required, and I'm planning on removing the one in (a)
to move it into (b) as well.

So I keep it separate to keep any unnecessary overhead to the setups that are
already terribly slow.

No need to iterate a page full of entries if it can be easily avoided.
Especially, no need to degrade the common order-0 case.


Yeah, I understand all that. But given this is all coming from an array, (so
easy to prefetch?) and will presumably all fit in the cache for the common case,
at least, so it's hot for (a) and (b), does separating this out really make a
measurable performance difference? If yes then absolutely this optimization
makes sense. But if not, I think it's a bit questionable.


I primarily added it because

(a) we learned that each cycle counts during mmap() just like it does
during fork().

(b) Linus was similarly concerned about optimizing out another batching
walk in c47454823bd4 ("mm: mmu_gather: allow more than one batch of
delayed rmaps"):

"it needs to walk that array of pages while still holding the page table
lock, and our mmu_gather infrastructure allows for batching quite a lot
of pages.  We may have thousands on pages queued up for freeing, and we
wanted to walk only the last batch if we then added a dirty page to the
queue."

So if it matters enough for reducing the time we hold the page table
lock, it surely adds "some" overhead in general.




You're the boss though, so if your experience tells you this is necessary, then
I'm ok with that.


I did not do any measurements myself, I just did that intuitively as
above. After all, it's all pretty straight forward (keeping the existing
logic, we need a new one either way) and not that much code.

So unless there are strong opinions, I'd just leave the common case as
it was, and the odd case be special.


I think we can just reduce the code duplication easily:

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index d175c0f1e2c8..99b3e9408aa0 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -91,18 +91,21 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct
vm_area_struct *vma)
  }
  #endif
  
-static void tlb_batch_pages_flush(struct mmu_gather *tlb)

-{
-    struct mmu_gather_batch *batch;
+/*
+ * We might end up freeing a lot of pages. Reschedule on a regular
+ * basis to avoid soft lockups in configurations without full
+ * preemption enabled. The magic number of 512 folios seems to work.
+ */
+#define MAX_NR_FOLIOS_PER_FREE    512
  
-    for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {

-    struct encoded_page **pages = batch->encoded_pages;
+static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
+{
+    struct encoded_page **pages = batch->encoded_pages;
+    unsigned int nr, nr_pages;
  
-    while (batch->nr) {

-    /*
- * limit free batch count when PAGE_SIZE > 4K
- */
-    unsigned int nr = min(512U, batch->nr);
+    while (batch->nr) {
+    if (!page_poisoning_enabled_static() && !want_init_on_free()) {
+    nr = min(MAX_NR_FOLIOS_PER_FREE, batch->nr);
  
  /*

   * Make sure we cover page + nr_pages, and don't leave
@@ -111,14 +114,39 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
  if (unlikely(encoded_page_flags(pages[nr - 1]) &
   ENCODED_PAGE_BIT_NR_PAGES_NEXT))
  nr++;
+    } else {
+    /*
+ * With page poisoning and init_on_free, the time it
+   

Re: [PATCH v2 09/10] mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing

2024-02-12 Thread David Hildenbrand

On 12.02.24 11:56, David Hildenbrand wrote:

On 12.02.24 11:32, Ryan Roberts wrote:

On 12/02/2024 10:11, David Hildenbrand wrote:

Hi Ryan,


-static void tlb_batch_pages_flush(struct mmu_gather *tlb)
+static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
    {
-    struct mmu_gather_batch *batch;
-
-    for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-    struct encoded_page **pages = batch->encoded_pages;
+    struct encoded_page **pages = batch->encoded_pages;
+    unsigned int nr, nr_pages;
    +    /*
+ * We might end up freeing a lot of pages. Reschedule on a regular
+ * basis to avoid soft lockups in configurations without full
+ * preemption enabled. The magic number of 512 folios seems to work.
+ */
+    if (!page_poisoning_enabled_static() && !want_init_on_free()) {


Is the performance win really worth 2 separate implementations keyed off this?
It seems a bit fragile, in case any other operations get added to free which are
proportional to size in future. Why not just always do the conservative version?


I really don't want to iterate over all entries on the "sane" common case. We
already do that two times:

a) free_pages_and_swap_cache()

b) release_pages()

Only the latter really is required, and I'm planning on removing the one in (a)
to move it into (b) as well.

So I keep it separate to keep any unnecessary overhead to the setups that are
already terribly slow.

No need to iterate a page full of entries if it can be easily avoided.
Especially, no need to degrade the common order-0 case.


Yeah, I understand all that. But given this is all coming from an array, (so
easy to prefetch?) and will presumably all fit in the cache for the common case,
at least, so it's hot for (a) and (b), does separating this out really make a
measurable performance difference? If yes then absolutely this optimization
makes sense. But if not, I think it's a bit questionable.


I primarily added it because

(a) we learned that each cycle counts during mmap() just like it does
during fork().

(b) Linus was similarly concerned about optimizing out another batching
walk in c47454823bd4 ("mm: mmu_gather: allow more than one batch of
delayed rmaps"):

"it needs to walk that array of pages while still holding the page table
lock, and our mmu_gather infrastructure allows for batching quite a lot
of pages.  We may have thousands on pages queued up for freeing, and we
wanted to walk only the last batch if we then added a dirty page to the
queue."

So if it matters enough for reducing the time we hold the page table
lock, it surely adds "some" overhead in general.




You're the boss though, so if your experience tells you this is necessary, then
I'm ok with that.


I did not do any measurements myself, I just did that intuitively as
above. After all, it's all pretty straight forward (keeping the existing
logic, we need a new one either way) and not that much code.

So unless there are strong opinions, I'd just leave the common case as
it was, and the odd case be special.


I think we can just reduce the code duplication easily:

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index d175c0f1e2c8..99b3e9408aa0 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -91,18 +91,21 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct 
vm_area_struct *vma)
 }
 #endif
 
-static void tlb_batch_pages_flush(struct mmu_gather *tlb)

-{
-   struct mmu_gather_batch *batch;
+/*
+ * We might end up freeing a lot of pages. Reschedule on a regular
+ * basis to avoid soft lockups in configurations without full
+ * preemption enabled. The magic number of 512 folios seems to work.
+ */
+#define MAX_NR_FOLIOS_PER_FREE 512
 
-	for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {

-   struct encoded_page **pages = batch->encoded_pages;
+static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
+{
+   struct encoded_page **pages = batch->encoded_pages;
+   unsigned int nr, nr_pages;
 
-		while (batch->nr) {

-   /*
-* limit free batch count when PAGE_SIZE > 4K
-*/
-   unsigned int nr = min(512U, batch->nr);
+   while (batch->nr) {
+   if (!page_poisoning_enabled_static() && !want_init_on_free()) {
+   nr = min(MAX_NR_FOLIOS_PER_FREE, batch->nr);
 
 			/*

 * Make sure we cover page + nr_pages, and don't leave
@@ -111,14 +114,39 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
if (unlikely(encoded_page_flags(pages[nr - 1]) &
 ENCODED_PAGE_BIT_NR_PAGES_NEXT))
nr++;
+   } else {
+   /*
+   

Re: [PATCH v2 09/10] mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing

2024-02-12 Thread David Hildenbrand

On 12.02.24 11:32, Ryan Roberts wrote:

On 12/02/2024 10:11, David Hildenbrand wrote:

Hi Ryan,


-static void tlb_batch_pages_flush(struct mmu_gather *tlb)
+static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
   {
-    struct mmu_gather_batch *batch;
-
-    for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-    struct encoded_page **pages = batch->encoded_pages;
+    struct encoded_page **pages = batch->encoded_pages;
+    unsigned int nr, nr_pages;
   +    /*
+ * We might end up freeing a lot of pages. Reschedule on a regular
+ * basis to avoid soft lockups in configurations without full
+ * preemption enabled. The magic number of 512 folios seems to work.
+ */
+    if (!page_poisoning_enabled_static() && !want_init_on_free()) {


Is the performance win really worth 2 separate implementations keyed off this?
It seems a bit fragile, in case any other operations get added to free which are
proportional to size in future. Why not just always do the conservative version?


I really don't want to iterate over all entries on the "sane" common case. We
already do that two times:

a) free_pages_and_swap_cache()

b) release_pages()

Only the latter really is required, and I'm planning on removing the one in (a)
to move it into (b) as well.

So I keep it separate to keep any unnecessary overhead to the setups that are
already terribly slow.

No need to iterate a page full of entries if it can be easily avoided.
Especially, no need to degrade the common order-0 case.


Yeah, I understand all that. But given this is all coming from an array, (so
easy to prefetch?) and will presumably all fit in the cache for the common case,
at least, so it's hot for (a) and (b), does separating this out really make a
measurable performance difference? If yes then absolutely this optimization
makes sense. But if not, I think it's a bit questionable.


I primarily added it because

(a) we learned that each cycle counts during mmap() just like it does 
during fork().


(b) Linus was similarly concerned about optimizing out another batching 
walk in c47454823bd4 ("mm: mmu_gather: allow more than one batch of 
delayed rmaps"):


"it needs to walk that array of pages while still holding the page table 
lock, and our mmu_gather infrastructure allows for batching quite a lot 
of pages.  We may have thousands on pages queued up for freeing, and we 
wanted to walk only the last batch if we then added a dirty page to the 
queue."


So if it matters enough for reducing the time we hold the page table 
lock, it surely adds "some" overhead in general.





You're the boss though, so if your experience tells you this is necessary, then
I'm ok with that.


I did not do any measurements myself, I just did that intuitively as 
above. After all, it's all pretty straight forward (keeping the existing 
logic, we need a new one either way) and not that much code.


So unless there are strong opinions, I'd just leave the common case as 
it was, and the odd case be special.




By the way, Matthew had an RFC a while back that was doing some clever things
with batches further down the call chain (I think, from memory). Might be worth
taking a look at that if you are planning a follow up change to (a).



Do you have a pointer?






   while (batch->nr) {
-    /*
- * limit free batch count when PAGE_SIZE > 4K
- */
-    unsigned int nr = min(512U, batch->nr);
+    nr = min(512, batch->nr);


If any entries are for more than 1 page, nr_pages will also be encoded in the
batch, so effectively this could be limiting to 256 actual folios (half of 512).


Right, in the patch description I state "256 folio fragments". It's up to 512
folios (order-0).


Is it worth checking for ENCODED_PAGE_BIT_NR_PAGES_NEXT and limiting 
accordingly?


At least with 4k page size, we never have more than 510 (IIRC) entries per batch
page. So any such optimization would only matter for large page sizes, which I
don't think is worth it.


Yep; agreed.



Which exact optimization do you have in mind and would it really make a 
difference?


No I don't think it would make any difference, performance-wise. I'm just
pointing out that in pathalogical cases you could end up with half the number of
pages being freed at a time.


Yes, I'll extend the patch description!







nit: You're using 512 magic number in 2 places now; perhaps make a macro?


I played 3 times with macro names (including just using something "intuitive"
like MAX_ORDER_NR_PAGES) but returned to just using 512.

That cond_resched() handling is just absolutely disgusting, one way or the 
other.

Do you have a good idea for a macro name?


MAX_NR_FOLIOS_PER_BATCH?
MAX_NR_FOLIOS_PER_FREE?

I don't think the name has to be perfect, because its private to the c file; but
it ensures the 2 usages rema

Re: [PATCH v2 09/10] mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing

2024-02-12 Thread David Hildenbrand

Hi Ryan,


-static void tlb_batch_pages_flush(struct mmu_gather *tlb)
+static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
  {
-   struct mmu_gather_batch *batch;
-
-   for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-   struct encoded_page **pages = batch->encoded_pages;
+   struct encoded_page **pages = batch->encoded_pages;
+   unsigned int nr, nr_pages;
  
+	/*

+* We might end up freeing a lot of pages. Reschedule on a regular
+* basis to avoid soft lockups in configurations without full
+* preemption enabled. The magic number of 512 folios seems to work.
+*/
+   if (!page_poisoning_enabled_static() && !want_init_on_free()) {


Is the performance win really worth 2 separate implementations keyed off this?
It seems a bit fragile, in case any other operations get added to free which are
proportional to size in future. Why not just always do the conservative version?


I really don't want to iterate over all entries on the "sane" common 
case. We already do that two times:


a) free_pages_and_swap_cache()

b) release_pages()

Only the latter really is required, and I'm planning on removing the one 
in (a) to move it into (b) as well.


So I keep it separate to keep any unnecessary overhead to the setups 
that are already terribly slow.


No need to iterate a page full of entries if it can be easily avoided. 
Especially, no need to degrade the common order-0 case.





while (batch->nr) {
-   /*
-* limit free batch count when PAGE_SIZE > 4K
-*/
-   unsigned int nr = min(512U, batch->nr);
+   nr = min(512, batch->nr);


If any entries are for more than 1 page, nr_pages will also be encoded in the
batch, so effectively this could be limiting to 256 actual folios (half of 512).


Right, in the patch description I state "256 folio fragments". It's up 
to 512 folios (order-0).



Is it worth checking for ENCODED_PAGE_BIT_NR_PAGES_NEXT and limiting 
accordingly?


At least with 4k page size, we never have more than 510 (IIRC) entries 
per batch page. So any such optimization would only matter for large 
page sizes, which I don't think is worth it.
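
For reference, the 510 figure falls out of the batch layout: one batch
occupies a single page minus a small header. A sketch of that arithmetic,
assuming the usual struct mmu_gather_batch layout (next pointer plus two
unsigned ints) and 64-bit pointers:

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;
	unsigned long header = 8 + 4 + 4;	/* next, nr, max (assumed layout) */
	unsigned long entries = (page_size - header) / 8;

	printf("%lu entries per batch page\n", entries);	/* prints 510 */
	return 0;
}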


Which exact optimization do you have in mind and would it really make a 
difference?




nit: You're using 512 magic number in 2 places now; perhaps make a macro?


I played 3 times with macro names (including just using something 
"intuitive" like MAX_ORDER_NR_PAGES) but returned to just using 512.


That cond_resched() handling is just absolutely disgusting, one way or 
the other.


Do you have a good idea for a macro name?



  
  			/*

 * Make sure we cover page + nr_pages, and don't leave
@@ -119,6 +120,37 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
cond_resched();
}
}
+
+   /*
+* With page poisoning and init_on_free, the time it takes to free
+* memory grows proportionally with the actual memory size. Therefore,
+* limit based on the actual memory size and not the number of involved
+* folios.
+*/
+   while (batch->nr) {
+   for (nr = 0, nr_pages = 0;
+nr < batch->nr && nr_pages < 512; nr++) {
+   if (unlikely(encoded_page_flags(pages[nr]) &
+ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+   nr_pages += encoded_nr_pages(pages[++nr]);
+   else
+   nr_pages++;
+   }


I guess worst case here is freeing (511 + 8192) * 64K pages = ~544M. That's up
from the old limit of 512 * 64K = 32M, and 511 pages bigger than your statement
in the commit log. Are you comfortable with this? I guess the only alternative
is to start splitting a batch which would be really messy. I agree your approach
is preferable if 544M is acceptable.


Right, I have in the description:

"if we cannot even free a single MAX_ORDER page on a system without 
running into soft lockups, something else is already completely bogus.".


That would be 8192 pages on arm64. Anybody freeing a PMD-mapped THP 
would be in trouble already and should just reconsider life choices 
running such a machine.


We could have 511 more pages, yes. If 8192 don't trigger a soft-lockup, 
I am confident that 511 more pages won't make a difference.
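
A quick, purely illustrative check of those numbers, assuming 64 KiB base
pages (so one 512 MiB folio fragment is 8192 pages):

#include <stdio.h>

int main(void)
{
	unsigned long page = 64UL << 10;			/* 64 KiB */
	unsigned long old_max = (512UL * page) >> 20;		/* old cap: 512 pages */
	unsigned long new_max = ((511UL + 8192) * page) >> 20;

	/* prints "old: 32 MiB, worst case now: ~543 MiB" */
	printf("old: %lu MiB, worst case now: ~%lu MiB\n", old_max, new_max);
	return 0;
}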


But, if that ever is a problem, we can butcher this code as much as we 
want, because performance with poisoning/zeroing is already down the drain.


As you say, splitting even further is messy, so I rather avoid that 
unless really required.


--
Cheers,

David / dhildenb



Re: [PATCH v2 08/10] mm/mmu_gather: add __tlb_remove_folio_pages()

2024-02-12 Thread David Hildenbrand

On 12.02.24 09:51, Ryan Roberts wrote:

On 09/02/2024 22:15, David Hildenbrand wrote:

Add __tlb_remove_folio_pages(), which will remove multiple consecutive
pages that belong to the same large folio, instead of only a single
page. We'll be using this function when optimizing unmapping/zapping of
large folios that are mapped by PTEs.

We're using the remaining spare bit in an encoded_page to indicate that
the next encoded page in an array contains actually shifted "nr_pages".
Teach swap/freeing code about putting multiple folio references, and
delayed rmap handling to remove page ranges of a folio.

This extension allows for still gathering almost as many small folios
as we used to (-1, because we have to prepare for a possibly bigger next
entry), but still allows for gathering consecutive pages that belong to the
same large folio.

Note that we don't pass the folio pointer, because it is not required for
now. Further, we don't support page_size != PAGE_SIZE, it won't be
required for simple PTE batching.

We have to provide a separate s390 implementation, but it's fairly
straight forward.

Another, more invasive and likely more expensive, approach would be to
use folio+range or a PFN range instead of page+nr_pages. But, we should
do that consistently for the whole mmu_gather. For now, let's keep it
simple and add "nr_pages" only.

Note that it is now possible to gather significantly more pages: In the
past, we were able to gather ~10000 pages, now we can
also gather ~5000 folio fragments that span multiple pages. A folio
fragment on x86-64 can be up to 512 pages (2 MiB THP) and on arm64 with
64k in theory 8192 pages (512 MiB THP). Gathering more memory is not
considered something we should worry about, especially because these are
already corner cases.

While we can gather more total memory, we won't free more folio
fragments. As long as page freeing time primarily only depends on the
number of involved folios, there is no effective change for !preempt
configurations. However, we'll adjust tlb_batch_pages_flush() separately to
handle corner cases where page freeing time grows proportionally with the
actual memory size.

Signed-off-by: David Hildenbrand 
---
  arch/s390/include/asm/tlb.h | 17 +++
  include/asm-generic/tlb.h   |  8 +
  include/linux/mm_types.h| 20 
  mm/mmu_gather.c | 61 +++--
  mm/swap.c   | 12 ++--
  mm/swap_state.c | 15 +++--
  6 files changed, 119 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 48df896d5b79..e95b2c8081eb 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
  static inline void tlb_flush(struct mmu_gather *tlb);
  static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, bool delay_rmap, int page_size);
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+   struct page *page, unsigned int nr_pages, bool delay_rmap);
  
  #define tlb_flush tlb_flush

  #define pte_free_tlb pte_free_tlb
@@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather 
*tlb,
return false;
  }
  
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,

+   struct page *page, unsigned int nr_pages, bool delay_rmap)
+{
+   struct encoded_page *encoded_pages[] = {
+   encode_page(page, ENCODED_PAGE_BIT_NR_PAGES_NEXT),
+   encode_nr_pages(nr_pages),
+   };
+
+   VM_WARN_ON_ONCE(delay_rmap);
+   VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
+
+   free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
+   return false;
+}
+
  static inline void tlb_flush(struct mmu_gather *tlb)
  {
__tlb_flush_mm_lazy(tlb->mm);
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 95d60a4f468a..bd00dd238b79 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -69,6 +69,7 @@
   *
   *  - tlb_remove_page() / __tlb_remove_page()
   *  - tlb_remove_page_size() / __tlb_remove_page_size()
+ *  - __tlb_remove_folio_pages()
   *
   *__tlb_remove_page_size() is the basic primitive that queues a page for
   *freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
@@ -78,6 +79,11 @@
   *tlb_remove_page() and tlb_remove_page_size() imply the call to
   *tlb_flush_mmu() when required and has no return value.
   *
+ *__tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
+ *instead of removing a single page, remove the given number of consecutive
+ *pages that are all part of the same (large) folio: just like calling
+ *__tlb_remove_page() on each page individually.
+ *
   *  - tlb_change_page_size()
   *
   *call before __tlb_remove_page*() 

Re: [PATCH v3 01/15] arm64/mm: Make set_ptes() robust when OAs cross 48-bit boundary

2024-02-09 Thread David Hildenbrand

On 08.02.24 07:10, Mike Rapoport wrote:

On Mon, Jan 29, 2024 at 01:46:35PM +0100, David Hildenbrand wrote:

From: Ryan Roberts 

Since the high bits [51:48] of an OA are not stored contiguously in the
PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE
to the pte to get the pte with the next pfn. This works until the pfn
crosses the 48-bit boundary, at which point we overflow into the upper
attributes.

Of course one could argue (and Matthew Wilcox has :) that we will never
see a folio cross this boundary because we only allow naturally aligned
power-of-2 allocation, so this would require a half-petabyte folio. So
its only a theoretical bug. But its better that the code is robust
regardless.

I've implemented pte_next_pfn() as part of the fix, which is an opt-in
core-mm interface. So that is now available to the core-mm, which will
be needed shortly to support forthcoming fork()-batching optimizations.

Link: https://lkml.kernel.org/r/20240125173534.1659317-1-ryan.robe...@arm.com
Fixes: 4a169d61c2ed ("arm64: implement the new page table range API")
Closes: 
https://lore.kernel.org/linux-mm/fdaeb9a5-d890-499a-92c8-d171df43a...@arm.com/
Signed-off-by: Ryan Roberts 
Reviewed-by: Catalin Marinas 
Reviewed-by: David Hildenbrand 
Signed-off-by: David Hildenbrand 


Reviewed-by: Mike Rapoport (IBM) 


Thanks for the review Mike, appreciated!

--
Cheers,

David / dhildenb



Re: [PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings

2024-02-09 Thread David Hildenbrand

1) Convert READ_ONCE() -> ptep_get()
2) Convert set_pte_at() -> set_ptes()
3) All the "New layer" renames and addition of the trivial wrappers


Yep that makes sense. I'll start prepping that today. I'll hold off reposting
until I have your comments on 19-25. I'm also hoping that David will repost the
zap series today so that it can get into mm-unstable by mid-next week. Then I'll
repost on top of that, hopefully by end of next week, folding in all your
comments. This should give plenty of time to soak in linux-next.


Just sent out v2. Will review this series (early) next week.

Have a great weekend!

--
Cheers,

David / dhildenb



[PATCH v2 09/10] mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing

2024-02-09 Thread David Hildenbrand
It's a pain that we have to handle cond_resched() in
tlb_batch_pages_flush() manually and cannot simply handle it in
release_pages() -- release_pages() can be called from atomic context.
Well, in a perfect world we wouldn't have to make our code more complicated at all.

With page poisoning and init_on_free, we might now run into soft lockups
when we free a lot of rather large folio fragments, because page freeing
time then depends on the actual memory size we are freeing instead of on
the number of folios that are involved.

In the absolute (unlikely) worst case, on arm64 with 64k we will be able
to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
GiB does sound like it might take a while. But instead of ignoring this
unlikely case, let's just handle it.

So, let's teach tlb_batch_pages_flush() that there are some
configurations where page freeing is horribly slow, and let's reschedule
more frequently -- similarly like we did for now before we had large folio
fragments in there. Note that we might end up freeing only a single folio
fragment at a time that might exceed the old 512 pages limit: but if we
cannot even free a single MAX_ORDER page on a system without running into
soft lockups, something else is already completely bogus.

In the future, we might want to detect if handling cond_resched() is
required at all, and just not do any of that with full preemption enabled.

Signed-off-by: David Hildenbrand 
---
 mm/mmu_gather.c | 50 -
 1 file changed, 41 insertions(+), 9 deletions(-)

diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index d175c0f1e2c8..2774044b5790 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -91,18 +91,19 @@ void tlb_flush_rmaps(struct mmu_gather *tlb, struct 
vm_area_struct *vma)
 }
 #endif
 
-static void tlb_batch_pages_flush(struct mmu_gather *tlb)
+static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch)
 {
-   struct mmu_gather_batch *batch;
-
-   for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
-   struct encoded_page **pages = batch->encoded_pages;
+   struct encoded_page **pages = batch->encoded_pages;
+   unsigned int nr, nr_pages;
 
+   /*
+* We might end up freeing a lot of pages. Reschedule on a regular
+* basis to avoid soft lockups in configurations without full
+* preemption enabled. The magic number of 512 folios seems to work.
+*/
+   if (!page_poisoning_enabled_static() && !want_init_on_free()) {
while (batch->nr) {
-   /*
-* limit free batch count when PAGE_SIZE > 4K
-*/
-   unsigned int nr = min(512U, batch->nr);
+   nr = min(512, batch->nr);
 
/*
 * Make sure we cover page + nr_pages, and don't leave
@@ -119,6 +120,37 @@ static void tlb_batch_pages_flush(struct mmu_gather *tlb)
cond_resched();
}
}
+
+   /*
+* With page poisoning and init_on_free, the time it takes to free
+* memory grows proportionally with the actual memory size. Therefore,
+* limit based on the actual memory size and not the number of involved
+* folios.
+*/
+   while (batch->nr) {
+   for (nr = 0, nr_pages = 0;
+nr < batch->nr && nr_pages < 512; nr++) {
+   if (unlikely(encoded_page_flags(pages[nr]) &
+ENCODED_PAGE_BIT_NR_PAGES_NEXT))
+   nr_pages += encoded_nr_pages(pages[++nr]);
+   else
+   nr_pages++;
+   }
+
+   free_pages_and_swap_cache(pages, nr);
+   pages += nr;
+   batch->nr -= nr;
+
+   cond_resched();
+   }
+}
+
+static void tlb_batch_pages_flush(struct mmu_gather *tlb)
+{
+   struct mmu_gather_batch *batch;
+
+   for (batch = &tlb->local; batch && batch->nr; batch = batch->next)
+   __tlb_batch_free_encoded_pages(batch);
tlb->active = >local;
 }
 
-- 
2.43.0



[PATCH v2 10/10] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-02-09 Thread David Hildenbrand
Similar to how we optimized fork(), let's implement PTE batching when
consecutive (present) PTEs map consecutive pages of the same large
folio.

Most infrastructure we need for batching (mmu gather, rmap) is already
there. We only have to add get_and_clear_full_ptes() and
clear_full_ptes(). Similarly, extend zap_install_uffd_wp_if_needed() to
process a PTE range.

We won't bother sanity-checking the mapcount of all subpages, but only
check the mapcount of the first subpage we process. If there is a real
problem hiding somewhere, we can trigger it simply by using small
folios, or when we zap single pages of a large folio. Ideally, we had
that check in rmap code (including for delayed rmap), but then we cannot
print the PTE. Let's keep it simple for now. If we ever have a cheap
folio_mapcount(), we might just want to check for underflows there.

To keep small folios as fast as possible force inlining of a specialized
variant using __always_inline with nr=1.

Signed-off-by: David Hildenbrand 
---
 include/linux/pgtable.h | 70 +++
 mm/memory.c | 92 +
 2 files changed, 136 insertions(+), 26 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index aab227e12493..49ab1f73b5c2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -580,6 +580,76 @@ static inline pte_t ptep_get_and_clear_full(struct 
mm_struct *mm,
 }
 #endif
 
+#ifndef get_and_clear_full_ptes
+/**
+ * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
+ *  the same folio, collecting dirty/accessed bits.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
+ * returned PTE.
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+   pte_t pte, tmp_pte;
+
+   pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   while (--nr) {
+   ptep++;
+   addr += PAGE_SIZE;
+   tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   if (pte_dirty(tmp_pte))
+   pte = pte_mkdirty(pte);
+   if (pte_young(tmp_pte))
+   pte = pte_mkyoung(pte);
+   }
+   return pte;
+}
+#endif
+
+#ifndef clear_full_ptes
+/**
+ * clear_full_ptes - Clear present PTEs that map consecutive pages of the same
+ *  folio.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
+   pte_t *ptep, unsigned int nr, int full)
+{
+   for (;;) {
+   ptep_get_and_clear_full(mm, addr, ptep, full);
+   if (--nr == 0)
+   break;
+   ptep++;
+   addr += PAGE_SIZE;
+   }
+}
+#endif
 
 /*
  * If two threads concurrently fault at the same page, the thread that
diff --git a/mm/memory.c b/mm/memory.c
index a3efc4da258a..3b8e56eb08a3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1515,7 +1515,7 @@ static inline bool zap_drop_file_uffd_wp(struct 
zap_details *details)
  */
 static inline void
 zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
- unsigned long addr, pte_t *pte,
+ unsigned long addr, pte_t *pte, int nr,
  struct zap_details *details, pte_t pteval)
 {
/* Zap on anonymous always means dropping everything */
@@ -1525,20 +1525,27 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
*vma,
if (zap_drop_file_uffd_wp(details))
return;
 
-   pte_install_uffd_wp_if_needed(vma, addr, pte, pteval

[PATCH v2 08/10] mm/mmu_gather: add __tlb_remove_folio_pages()

2024-02-09 Thread David Hildenbrand
Add __tlb_remove_folio_pages(), which will remove multiple consecutive
pages that belong to the same large folio, instead of only a single
page. We'll be using this function when optimizing unmapping/zapping of
large folios that are mapped by PTEs.

We're using the remaining spare bit in an encoded_page to indicate that
the next encoded page in an array contains actually shifted "nr_pages".
Teach swap/freeing code about putting multiple folio references, and
delayed rmap handling to remove page ranges of a folio.

This extension allows for still gathering almost as many small folios
as we used to (-1, because we have to prepare for a possibly bigger next
entry), but still allows for gathering consecutive pages that belong to the
same large folio.

Note that we don't pass the folio pointer, because it is not required for
now. Further, we don't support page_size != PAGE_SIZE, it won't be
required for simple PTE batching.

We have to provide a separate s390 implementation, but it's fairly
straight forward.

Another, more invasive and likely more expensive, approach would be to
use folio+range or a PFN range instead of page+nr_pages. But, we should
do that consistently for the whole mmu_gather. For now, let's keep it
simple and add "nr_pages" only.

Note that it is now possible to gather significantly more pages: In the
past, we were able to gather ~10000 pages, now we can
also gather ~5000 folio fragments that span multiple pages. A folio
fragment on x86-64 can be up to 512 pages (2 MiB THP) and on arm64 with
64k in theory 8192 pages (512 MiB THP). Gathering more memory is not
considered something we should worry about, especially because these are
already corner cases.

While we can gather more total memory, we won't free more folio
fragments. As long as page freeing time primarily only depends on the
number of involved folios, there is no effective change for !preempt
configurations. However, we'll adjust tlb_batch_pages_flush() separately to
handle corner cases where page freeing time grows proportionally with the
actual memory size.

Signed-off-by: David Hildenbrand 
---
 arch/s390/include/asm/tlb.h | 17 +++
 include/asm-generic/tlb.h   |  8 +
 include/linux/mm_types.h| 20 
 mm/mmu_gather.c | 61 +++--
 mm/swap.c   | 12 ++--
 mm/swap_state.c | 15 +++--
 6 files changed, 119 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 48df896d5b79..e95b2c8081eb 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
 static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, bool delay_rmap, int page_size);
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+   struct page *page, unsigned int nr_pages, bool delay_rmap);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather 
*tlb,
return false;
 }
 
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+   struct page *page, unsigned int nr_pages, bool delay_rmap)
+{
+   struct encoded_page *encoded_pages[] = {
+   encode_page(page, ENCODED_PAGE_BIT_NR_PAGES_NEXT),
+   encode_nr_pages(nr_pages),
+   };
+
+   VM_WARN_ON_ONCE(delay_rmap);
+   VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
+
+   free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
+   return false;
+}
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
__tlb_flush_mm_lazy(tlb->mm);
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 95d60a4f468a..bd00dd238b79 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -69,6 +69,7 @@
  *
  *  - tlb_remove_page() / __tlb_remove_page()
  *  - tlb_remove_page_size() / __tlb_remove_page_size()
+ *  - __tlb_remove_folio_pages()
  *
  *__tlb_remove_page_size() is the basic primitive that queues a page for
  *freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
@@ -78,6 +79,11 @@
  *tlb_remove_page() and tlb_remove_page_size() imply the call to
  *tlb_flush_mmu() when required and has no return value.
  *
+ *__tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
+ *instead of removing a single page, remove the given number of consecutive
+ *pages that are all part of the same (large) folio: just like calling
+ *__tlb_remove_page() on each page individually.
+ *
  *  - tlb_change_page_size()
  *
  *call before __tlb_remove_page*() to set the current page-size; implies a
@@ -262,6 +268,8 @@ struct mmu_gather_batch {
 
 extern bool __tlb_remove_page

[PATCH v2 07/10] mm/mmu_gather: add tlb_remove_tlb_entries()

2024-02-09 Thread David Hildenbrand
Let's add a helper that lets us batch-process multiple consecutive PTEs.

Note that the loop will get optimized out on all architectures except on
powerpc. We have to add an early define of __tlb_remove_tlb_entry() on
ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries() a
macro).

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 arch/powerpc/include/asm/tlb.h |  2 ++
 include/asm-generic/tlb.h  | 20 
 2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index b3de6102a907..1ca7d4c4b90d 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -19,6 +19,8 @@
 
 #include 
 
+static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
+ unsigned long address);
 #define __tlb_remove_tlb_entry __tlb_remove_tlb_entry
 
 #define tlb_flush tlb_flush
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2eb7b0d4f5d2..95d60a4f468a 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -608,6 +608,26 @@ static inline void tlb_flush_p4d_range(struct mmu_gather 
*tlb,
__tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)
 
+/**
+ * tlb_remove_tlb_entries - remember unmapping of multiple consecutive ptes for
+ * later tlb invalidation.
+ *
+ * Similar to tlb_remove_tlb_entry(), but remember unmapping of multiple
+ * consecutive ptes instead of only a single one.
+ */
+static inline void tlb_remove_tlb_entries(struct mmu_gather *tlb,
+   pte_t *ptep, unsigned int nr, unsigned long address)
+{
+   tlb_flush_pte_range(tlb, address, PAGE_SIZE * nr);
+   for (;;) {
+   __tlb_remove_tlb_entry(tlb, ptep, address);
+   if (--nr == 0)
+   break;
+   ptep++;
+   address += PAGE_SIZE;
+   }
+}
+
 #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)   \
do {\
unsigned long _sz = huge_page_size(h);  \
-- 
2.43.0



[PATCH v2 06/10] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP

2024-02-09 Thread David Hildenbrand
Nowadays, encoded pages are only used in mmu_gather handling. Let's
update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP. While at
it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.

If encoded page pointers would ever be used in other context again, we'd
likely want to change the defines to reflect their context (e.g.,
ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP). For now, let's keep it simple.

This is a preparation for using the remaining spare bit to indicate that
the next item in an array of encoded pages is a "nr_pages" argument and
not an encoded page.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 include/linux/mm_types.h | 17 +++--
 mm/mmu_gather.c  |  5 +++--
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8b611e13153e..1b89eec0d6df 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -210,8 +210,8 @@ struct page {
  *
  * An 'encoded_page' pointer is a pointer to a regular 'struct page', but
  * with the low bits of the pointer indicating extra context-dependent
- * information. Not super-common, but happens in mmu_gather and mlock
- * handling, and this acts as a type system check on that use.
+ * information. Only used in mmu_gather handling, and this acts as a type
+ * system check on that use.
  *
  * We only really have two guaranteed bits in general, although you could
  * play with 'struct page' alignment (see CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
@@ -220,21 +220,26 @@ struct page {
  * Use the supplied helper functions to endcode/decode the pointer and bits.
  */
 struct encoded_page;
-#define ENCODE_PAGE_BITS 3ul
+
+#define ENCODED_PAGE_BITS  3ul
+
+/* Perform rmap removal after we have flushed the TLB. */
+#define ENCODED_PAGE_BIT_DELAY_RMAP1ul
+
 static __always_inline struct encoded_page *encode_page(struct page *page, 
unsigned long flags)
 {
-   BUILD_BUG_ON(flags > ENCODE_PAGE_BITS);
+   BUILD_BUG_ON(flags > ENCODED_PAGE_BITS);
return (struct encoded_page *)(flags | (unsigned long)page);
 }
 
 static inline unsigned long encoded_page_flags(struct encoded_page *page)
 {
-   return ENCODE_PAGE_BITS & (unsigned long)page;
+   return ENCODED_PAGE_BITS & (unsigned long)page;
 }
 
 static inline struct page *encoded_page_ptr(struct encoded_page *page)
 {
-   return (struct page *)(~ENCODE_PAGE_BITS & (unsigned long)page);
+   return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
 }
 
 /*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index ac733d81b112..6540c99c6758 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -53,7 +53,7 @@ static void tlb_flush_rmap_batch(struct mmu_gather_batch 
*batch, struct vm_area_
for (int i = 0; i < batch->nr; i++) {
struct encoded_page *enc = batch->encoded_pages[i];
 
-   if (encoded_page_flags(enc)) {
+   if (encoded_page_flags(enc) & ENCODED_PAGE_BIT_DELAY_RMAP) {
struct page *page = encoded_page_ptr(enc);
folio_remove_rmap_pte(page_folio(page), page, vma);
}
@@ -119,6 +119,7 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
 bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
bool delay_rmap, int page_size)
 {
+   int flags = delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0;
struct mmu_gather_batch *batch;
 
VM_BUG_ON(!tlb->end);
@@ -132,7 +133,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct 
page *page,
 * Add the page and check if we are full. If so
 * force a flush.
 */
-   batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
+   batch->encoded_pages[batch->nr++] = encode_page(page, flags);
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
-- 
2.43.0



[PATCH v2 05/10] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()

2024-02-09 Thread David Hildenbrand
We have two bits available in the encoded page pointer to store
additional information. Currently, we use one bit to request delay of the
rmap removal until after a TLB flush.

We want to make use of the remaining bit internally for batching of
multiple pages of the same folio, specifying that the next encoded page
pointer in an array is actually "nr_pages". So pass page + delay_rmap flag
instead of an encoded page, to handle the encoding internally.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 arch/s390/include/asm/tlb.h | 13 ++---
 include/asm-generic/tlb.h   | 12 ++--
 mm/mmu_gather.c |  7 ---
 3 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index d1455a601adc..48df896d5b79 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -25,8 +25,7 @@
 void __tlb_remove_table(void *_table);
 static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
- int page_size);
+   struct page *page, bool delay_rmap, int page_size);
 
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
@@ -42,14 +41,14 @@ static inline bool __tlb_remove_page_size(struct mmu_gather 
*tlb,
  * tlb_ptep_clear_flush. In both flush modes the tlb for a page cache page
  * has already been freed, so just do free_page_and_swap_cache.
  *
- * s390 doesn't delay rmap removal, so there is nothing encoded in
- * the page pointer.
+ * s390 doesn't delay rmap removal.
  */
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
- struct encoded_page *page,
- int page_size)
+   struct page *page, bool delay_rmap, int page_size)
 {
-   free_page_and_swap_cache(encoded_page_ptr(page));
+   VM_WARN_ON_ONCE(delay_rmap);
+
+   free_page_and_swap_cache(page);
return false;
 }
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 129a3a759976..2eb7b0d4f5d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -260,9 +260,8 @@ struct mmu_gather_batch {
  */
 #define MAX_GATHER_BATCH_COUNT (10000UL/MAX_GATHER_BATCH)
 
-extern bool __tlb_remove_page_size(struct mmu_gather *tlb,
-  struct encoded_page *page,
-  int page_size);
+extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+   bool delay_rmap, int page_size);
 
 #ifdef CONFIG_SMP
 /*
@@ -462,13 +461,14 @@ static inline void tlb_flush_mmu_tlbonly(struct 
mmu_gather *tlb)
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, int page_size)
 {
-   if (__tlb_remove_page_size(tlb, encode_page(page, 0), page_size))
+   if (__tlb_remove_page_size(tlb, page, false, page_size))
tlb_flush_mmu(tlb);
 }
 
-static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb, struct 
page *page, unsigned int flags)
+static __always_inline bool __tlb_remove_page(struct mmu_gather *tlb,
+   struct page *page, bool delay_rmap)
 {
-   return __tlb_remove_page_size(tlb, encode_page(page, flags), PAGE_SIZE);
+   return __tlb_remove_page_size(tlb, page, delay_rmap, PAGE_SIZE);
 }
 
 /* tlb_remove_page
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 604ddf08affe..ac733d81b112 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -116,7 +116,8 @@ static void tlb_batch_list_free(struct mmu_gather *tlb)
tlb->local.next = NULL;
 }
 
-bool __tlb_remove_page_size(struct mmu_gather *tlb, struct encoded_page *page, 
int page_size)
+bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
+   bool delay_rmap, int page_size)
 {
struct mmu_gather_batch *batch;
 
@@ -131,13 +132,13 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, 
struct encoded_page *page, i
 * Add the page and check if we are full. If so
 * force a flush.
 */
-   batch->encoded_pages[batch->nr++] = page;
+   batch->encoded_pages[batch->nr++] = encode_page(page, delay_rmap);
if (batch->nr == batch->max) {
if (!tlb_next_batch(tlb))
return true;
batch = tlb->active;
}
-   VM_BUG_ON_PAGE(batch->nr > batch->max, encoded_page_ptr(page));
+   VM_BUG_ON_PAGE(batch->nr > batch->max, page);
 
return false;
 }
-- 
2.43.0



[PATCH v2 04/10] mm/memory: factor out zapping folio pte into zap_present_folio_pte()

2024-02-09 Thread David Hildenbrand
Let's prepare for further changes by factoring it out into a separate
function.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 53 -
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7a3ebb6e5909..a3efc4da258a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1528,30 +1528,14 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
*vma,
pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
 }
 
-static inline void zap_present_pte(struct mmu_gather *tlb,
-   struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
-   unsigned long addr, struct zap_details *details,
-   int *rss, bool *force_flush, bool *force_break)
+static inline void zap_present_folio_pte(struct mmu_gather *tlb,
+   struct vm_area_struct *vma, struct folio *folio,
+   struct page *page, pte_t *pte, pte_t ptent, unsigned long addr,
+   struct zap_details *details, int *rss, bool *force_flush,
+   bool *force_break)
 {
struct mm_struct *mm = tlb->mm;
bool delay_rmap = false;
-   struct folio *folio;
-   struct page *page;
-
-   page = vm_normal_page(vma, addr, ptent);
-   if (!page) {
-   /* We don't need up-to-date accessed/dirty bits. */
-   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
-   arch_check_zapped_pte(vma, ptent);
-   tlb_remove_tlb_entry(tlb, pte, addr);
-   VM_WARN_ON_ONCE(userfaultfd_wp(vma));
-   ksm_might_unmap_zero_page(mm, ptent);
-   return;
-   }
-
-   folio = page_folio(page);
-   if (unlikely(!should_zap_folio(details, folio)))
-   return;
 
if (!folio_test_anon(folio)) {
ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
@@ -1586,6 +1570,33 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
}
 }
 
+static inline void zap_present_pte(struct mmu_gather *tlb,
+   struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+   unsigned long addr, struct zap_details *details,
+   int *rss, bool *force_flush, bool *force_break)
+{
+   struct mm_struct *mm = tlb->mm;
+   struct folio *folio;
+   struct page *page;
+
+   page = vm_normal_page(vma, addr, ptent);
+   if (!page) {
+   /* We don't need up-to-date accessed/dirty bits. */
+   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   arch_check_zapped_pte(vma, ptent);
+   tlb_remove_tlb_entry(tlb, pte, addr);
+   VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+   ksm_might_unmap_zero_page(mm, ptent);
+   return;
+   }
+
+   folio = page_folio(page);
+   if (unlikely(!should_zap_folio(details, folio)))
+   return;
+   zap_present_folio_pte(tlb, vma, folio, page, pte, ptent, addr, details,
+ rss, force_flush, force_break);
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
-- 
2.43.0



[PATCH v2 03/10] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()

2024-02-09 Thread David Hildenbrand
We don't need up-to-date accessed-dirty information for anon folios and can
simply work with the ptent we already have. Also, we know the RSS counter
we want to update.

We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
zap_install_uffd_wp_if_needed() after updating the folio and RSS.

While at it, only call zap_install_uffd_wp_if_needed() if there is even
any chance that pte_install_uffd_wp_if_needed() would do *something*.
That is, just don't bother if uffd-wp does not apply.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4da6923709b2..7a3ebb6e5909 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
return;
-   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
-   arch_check_zapped_pte(vma, ptent);
-   tlb_remove_tlb_entry(tlb, pte, addr);
-   zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
 
if (!folio_test_anon(folio)) {
+   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
if (pte_dirty(ptent)) {
folio_mark_dirty(folio);
if (tlb_delay_rmap(tlb)) {
@@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
}
if (pte_young(ptent) && likely(vma_has_recency(vma)))
folio_mark_accessed(folio);
+   rss[mm_counter(folio)]--;
+   } else {
+   /* We don't need up-to-date accessed/dirty bits. */
+   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   rss[MM_ANONPAGES]--;
}
-   rss[mm_counter(folio)]--;
+   arch_check_zapped_pte(vma, ptent);
+   tlb_remove_tlb_entry(tlb, pte, addr);
+   if (unlikely(userfaultfd_pte_wp(vma, ptent)))
+   zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+
if (!delay_rmap) {
folio_remove_rmap_pte(folio, page, vma);
if (unlikely(page_mapcount(page) < 0))
-- 
2.43.0



[PATCH v2 02/10] mm/memory: handle !page case in zap_present_pte() separately

2024-02-09 Thread David Hildenbrand
We don't need uptodate accessed/dirty bits, so in theory we could
replace ptep_get_and_clear_full() by an optimized ptep_clear_full()
function. Let's rely on the provided pte.

Further, there is no scenario where we would have to insert uffd-wp
markers when zapping something that is not a normal page (i.e., zeropage).
Add a sanity check to make sure this remains true.

should_zap_folio() no longer has to handle NULL pointers. This change
replaces 2/3 "!page/!folio" checks by a single "!page" one.

Note that arch_check_zapped_pte() on x86-64 checks the HW-dirty bit to
detect shadow stack entries. But for shadow stack entries, the HW dirty
bit (in combination with non-writable PTEs) is set by software. So for the
arch_check_zapped_pte() check, we don't have to sync against HW setting
the HW dirty bit concurrently, it is always set.

Reviewed-by: Ryan Roberts 
Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5b0dc33133a6..4da6923709b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1497,10 +1497,6 @@ static inline bool should_zap_folio(struct zap_details 
*details,
if (should_zap_cows(details))
return true;
 
-   /* E.g. the caller passes NULL for the case of a zero folio */
-   if (!folio)
-   return true;
-
/* Otherwise we should only zap non-anon folios */
return !folio_test_anon(folio);
 }
@@ -1538,24 +1534,28 @@ static inline void zap_present_pte(struct mmu_gather 
*tlb,
int *rss, bool *force_flush, bool *force_break)
 {
struct mm_struct *mm = tlb->mm;
-   struct folio *folio = NULL;
bool delay_rmap = false;
+   struct folio *folio;
struct page *page;
 
page = vm_normal_page(vma, addr, ptent);
-   if (page)
-   folio = page_folio(page);
+   if (!page) {
+   /* We don't need up-to-date accessed/dirty bits. */
+   ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   arch_check_zapped_pte(vma, ptent);
+   tlb_remove_tlb_entry(tlb, pte, addr);
+   VM_WARN_ON_ONCE(userfaultfd_wp(vma));
+   ksm_might_unmap_zero_page(mm, ptent);
+   return;
+   }
 
+   folio = page_folio(page);
if (unlikely(!should_zap_folio(details, folio)))
return;
ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
arch_check_zapped_pte(vma, ptent);
tlb_remove_tlb_entry(tlb, pte, addr);
zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
-   if (unlikely(!page)) {
-   ksm_might_unmap_zero_page(mm, ptent);
-   return;
-   }
 
if (!folio_test_anon(folio)) {
if (pte_dirty(ptent)) {
-- 
2.43.0



[PATCH v2 01/10] mm/memory: factor out zapping of present pte into zap_present_pte()

2024-02-09 Thread David Hildenbrand
Let's prepare for further changes by factoring out processing of present
PTEs.

Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 94 ++---
 1 file changed, 53 insertions(+), 41 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7c3ca41a7610..5b0dc33133a6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct 
*vma,
pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
 }
 
+static inline void zap_present_pte(struct mmu_gather *tlb,
+   struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+   unsigned long addr, struct zap_details *details,
+   int *rss, bool *force_flush, bool *force_break)
+{
+   struct mm_struct *mm = tlb->mm;
+   struct folio *folio = NULL;
+   bool delay_rmap = false;
+   struct page *page;
+
+   page = vm_normal_page(vma, addr, ptent);
+   if (page)
+   folio = page_folio(page);
+
+   if (unlikely(!should_zap_folio(details, folio)))
+   return;
+   ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+   arch_check_zapped_pte(vma, ptent);
+   tlb_remove_tlb_entry(tlb, pte, addr);
+   zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+   if (unlikely(!page)) {
+   ksm_might_unmap_zero_page(mm, ptent);
+   return;
+   }
+
+   if (!folio_test_anon(folio)) {
+   if (pte_dirty(ptent)) {
+   folio_mark_dirty(folio);
+   if (tlb_delay_rmap(tlb)) {
+   delay_rmap = true;
+   *force_flush = true;
+   }
+   }
+   if (pte_young(ptent) && likely(vma_has_recency(vma)))
+   folio_mark_accessed(folio);
+   }
+   rss[mm_counter(folio)]--;
+   if (!delay_rmap) {
+   folio_remove_rmap_pte(folio, page, vma);
+   if (unlikely(page_mapcount(page) < 0))
+   print_bad_pte(vma, addr, ptent, page);
+   }
+   if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+   *force_flush = true;
+   *force_break = true;
+   }
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
struct zap_details *details)
 {
+   bool force_flush = false, force_break = false;
struct mm_struct *mm = tlb->mm;
-   int force_flush = 0;
int rss[NR_MM_COUNTERS];
spinlock_t *ptl;
pte_t *start_pte;
@@ -1555,7 +1603,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
arch_enter_lazy_mmu_mode();
do {
pte_t ptent = ptep_get(pte);
-   struct folio *folio = NULL;
+   struct folio *folio;
struct page *page;
 
if (pte_none(ptent))
@@ -1565,45 +1613,9 @@ static unsigned long zap_pte_range(struct mmu_gather 
*tlb,
break;
 
if (pte_present(ptent)) {
-   unsigned int delay_rmap;
-
-   page = vm_normal_page(vma, addr, ptent);
-   if (page)
-   folio = page_folio(page);
-
-   if (unlikely(!should_zap_folio(details, folio)))
-   continue;
-   ptent = ptep_get_and_clear_full(mm, addr, pte,
-   tlb->fullmm);
-   arch_check_zapped_pte(vma, ptent);
-   tlb_remove_tlb_entry(tlb, pte, addr);
-   zap_install_uffd_wp_if_needed(vma, addr, pte, details,
- ptent);
-   if (unlikely(!page)) {
-   ksm_might_unmap_zero_page(mm, ptent);
-   continue;
-   }
-
-   delay_rmap = 0;
-   if (!folio_test_anon(folio)) {
-   if (pte_dirty(ptent)) {
-   folio_mark_dirty(folio);
-   if (tlb_delay_rmap(tlb)) {
-   delay_rmap = 1;
-   force_flush = 1;
-   }
-   }
-   if (pte_young(ptent) && 
likely(vma_has_recency(vma)))
-   folio_mark_accessed(folio);
-   }
-   rss[mm_counter(folio)]--;
-   

[PATCH v2 00/10] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-02-09 Thread David Hildenbrand
This series is based on [1]. Similar to what we did with fork(), let's
implement PTE batching during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.
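
As a rough illustration of the batch-detection step (a standalone userspace
analogue, not the kernel code: "PTEs" are reduced here to an array of page
frame numbers plus a folio id, and the additional PTE-bit compatibility checks
are omitted):

#include <stddef.h>
#include <stdio.h>

struct fake_pte {
	unsigned long pfn;	/* page frame number the entry maps */
	unsigned long folio;	/* id of the folio that page belongs to */
};

/* How many consecutive entries, starting at i, can be processed as one batch? */
static size_t batch_len(const struct fake_pte *pte, size_t i, size_t n)
{
	size_t nr = 1;

	while (i + nr < n &&
	       pte[i + nr].folio == pte[i].folio &&
	       pte[i + nr].pfn == pte[i].pfn + nr)
		nr++;
	return nr;
}

int main(void)
{
	struct fake_pte ptes[] = {
		{ 100, 1 }, { 101, 1 }, { 102, 1 },	/* 3 pages of folio 1 */
		{ 500, 2 },				/* order-0 folio */
		{ 200, 3 }, { 201, 3 },			/* 2 pages of folio 3 */
	};
	size_t n = sizeof(ptes) / sizeof(ptes[0]);

	for (size_t i = 0; i < n; ) {
		size_t nr = batch_len(ptes, i, n);

		printf("batch at index %zu: %zu page(s)\n", i, nr);
		i += nr;
	}
	return 0;
}

Each detected batch then gets a single refcount adjustment, a single rmap call
and one combined mmu_gather entry, which is where the savings below come from.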

Ryan was previously working on this in the context of cont-pte for
arm64, in its latest iteration [2] with a focus on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-pages batches as well, and makes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.

While this series should -- similar to [1] -- be beneficial for adding
cont-pte support on arm64[2], it's one of the requirements for maintaining
a total mapcount[3] for large folios with minimal added overhead and
further changes[4] that build up on top of the total mapcount.

Independent of all that, this series results in a speedup during munmap()
and similar unmapping (process teardown, MADV_DONTNEED on larger ranges)
with PTE-mapped THP, which is the default with THPs that are smaller than
a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by
PTE-mapped folios of the same size (stddev < 1%) results in the following
runtimes for munmap() in seconds (shorter is better):

Folio Size | mm-unstable |  New | Change
-----------------------------------------
  4KiB |0.058110 | 0.057715 |   - 1%
 16KiB |0.044198 | 0.035469 |   -20%
 32KiB |0.034216 | 0.023522 |   -31%
 64KiB |0.029207 | 0.018434 |   -37%
128KiB |0.026579 | 0.014026 |   -47%
256KiB |0.025130 | 0.011756 |   -53%
512KiB |0.024292 | 0.010703 |   -56%
   1024KiB |0.023812 | 0.010294 |   -57%
   2048KiB |0.023785 | 0.009910 |   -58%

CCing especially s390x folks, because they have tlb freeing hooks that need
adjustment. Only tested on x86-64 for now, will have to do some more
stress testing. Compile-tested on most other architectures. The PPC
change is negligible and makes my cross-compiler happy.

[1] https://lkml.kernel.org/r/20240129124649.189745-1-da...@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.robe...@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-da...@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-da...@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.robe...@arm.com

---

The performance numbers are from v1. I did a quick benchmark run of v2
and nothing significantly changed -- because nothing in the code
significantly changed. Sending this out ASAP, so Ryan can make progress
with cont-pte.

v1 -> v2:
* "mm/memory: factor out zapping of present pte into zap_present_pte()"
 -> Initialize "struct folio *folio" to NULL
* "mm/memory: handle !page case in zap_present_pte() separately"
 -> Extend description regarding arch_check_zapped_pte()
* "mm/mmu_gather: add __tlb_remove_folio_pages()"
 -> ENCODED_PAGE_BIT_NR_PAGES_NEXT
 -> Extend patch description regarding "batching more"
* "mm/mmu_gather: improve cond_resched() handling with large folios and
   expensive page freeing"
 -> Handle the (so far) theoretical case of possible soft lockups when
we zero/poison memory when freeing pages. Try to keep old behavior in
that corner case to be safe.
* "mm/memory: optimize unmap/zap with PTE-mapped THP"
 -> Clarify description of new ptep clearing functions regarding "present
PTEs"
 -> Extend patch description regarding relaxed mapcount sanity checks
 -> Improve zap_present_ptes() description
* Pick up RB's

Cc: Andrew Morton 
Cc: Matthew Wilcox (Oracle) 
Cc: Ryan Roberts 
Cc: Catalin Marinas 
Cc: Yin Fengwei 
Cc: Michal Hocko 
Cc: Will Deacon 
Cc: "Aneesh Kumar K.V" 
Cc: Nick Piggin 
Cc: Peter Zijlstra 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: "Naveen N. Rao" 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Alexander Gordeev 
Cc: Christian Borntraeger 
Cc: Sven Schnelle 
Cc: Arnd Bergmann 
Cc: linux-a...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org

David Hildenbrand (10):
  mm/memory: factor out zapping of present pte into zap_present_pte()
  mm/memory: handle !page case in zap_present_pte() s

Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

dontneed should hopefully/likely see a speedup.


Yes, but that's almost exactly the same path as munmap, so I'm not sure it really
adds much for this particular series.


Right, that's why I'm not including these measurements. dontneed vs. 
munmap is more about measuring the overhead of VMA modifications + page 
table reclaim.



Anyway, on Altra at least, I'm seeing no
regressions, so:

Tested-by: Ryan Roberts 



Thanks!

--
Cheers,

David / dhildenb



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 16:02, Ryan Roberts wrote:

On 31/01/2024 14:29, David Hildenbrand wrote:

Note that regarding NUMA effects, I mean when some memory access within the same
socket is faster/slower even with only a single node. On AMD EPYC that's
possible, depending on which core you are running and on which memory controller
the memory you want to access is located. If both are in different quadrants
IIUC, the access latency will be different.


I've configured the NUMA to only bring the RAM and CPUs for a single socket
online, so I shouldn't be seeing any of these effects. Anyway, I've been using
the Altra as a secondary because it's so much slower than the M2. Let me move
over to it and see if everything looks more straightforward there.


Better use a system where people will actually run Linux production workloads
on, even if it is slower :)

[...]



I'll continue to mess around with it until the end of the day. But if I'm not
making any headway, then I'll change tack; I'll just measure the performance of
my contpte changes using your fork/zap stuff as the baseline and post based on
that.


You should likely not focus on M2 results. Just pick a representative bare metal
machine where you get consistent, explainable results.

Nothing in the code is fine-tuned for a particular architecture so far, only
order-0 handling is kept separate.

BTW: I see the exact same speedups for dontneed that I see for munmap. For
example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm
curious why you see a speedup for munmap but not for dontneed.


Ugh... ok, coming up.


Hopefully you were just staring at the wrong numbers (e.g., only with fork
patches). Because both (munmap/pte-dontneed) are using the exact same code path.



Ahh... I'm doing pte-dontneed, which is the only option in your original
benchmark - it does MADV_DONTNEED one page at a time. It looks like your new
benchmark has an additional "dontneed" option that does it in one shot. Which
option are you running? Assuming the latter, I think that explains it.


I temporarily removed that option and then re-added it. Guess you got a 
wrong snapshot of the benchmark :D


pte-dontneed not observing any change is great (no batching possible).

dontneed should hopefully/likely see a speedup.

Great!

--
Cheers,

David / dhildenb



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

Note that regarding NUMA effects, I mean when some memory access within the same
socket is faster/slower even with only a single node. On AMD EPYC that's
possible, depending on which core you are running and on which memory controller
the memory you want to access is located. If both are in different quadrants
IIUC, the access latency will be different.


I've configured the NUMA to only bring the RAM and CPUs for a single socket
online, so I shouldn't be seeing any of these effects. Anyway, I've been using
the Altra as a secondary because it's so much slower than the M2. Let me move
over to it and see if everything looks more straightforward there.


Better use a system where people will actually run Linux production 
workloads on, even if it is slower :)


[...]



I'll continue to mess around with it until the end of the day. But if I'm not
making any headway, then I'll change tack; I'll just measure the performance of
my contpte changes using your fork/zap stuff as the baseline and post based on
that.


You should likely not focus on M2 results. Just pick a representative bare metal
machine where you get consistent, explainable results.

Nothing in the code is fine-tuned for a particular architecture so far, only
order-0 handling is kept separate.

BTW: I see the exact same speedups for dontneed that I see for munmap. For
example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. So I'm
curious why you see a speedup for munmap but not for dontneed.


Ugh... ok, coming up.


Hopefully you were just staring at the wrong numbers (e.g., only with 
fork patches). Because both (munmap/pte-dontneed) are using the exact 
same code path.


--
Cheers,

David / dhildenb



Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 15:08, Michal Hocko wrote:

On Wed 31-01-24 10:26:13, Ryan Roberts wrote:

IIRC there is an option to zero memory when it is freed back to the buddy? So
that could be a place where time is proportional to size rather than
proportional to folio count? But I think that option is intended for debug only?
So perhaps not a problem in practice?


init_on_free is considered a security/hardening feature more than a
debugging one. It will surely add an overhead and I guess this is
something people who use it know about. The batch size limit is a latency
reduction feature for !PREEMPT kernels but by no means should it be
considered a low-latency guarantee feature. A lot has changed since
the limit was introduced and the current latency numbers will surely be
different than back then. As long as soft lockups do not trigger again
this should be acceptable IMHO.


It could now be zeroing out ~512 MiB. That shouldn't take double-digit 
seconds unless we are running in a very problematic environment 
(over-committed VM). But then, we might have different problems already.


I'll do some sanity checks with an extremely large processes (as much as 
I can fit on my machines), with a !CONFIG_PREEMPT kernel and 
init_on_free, to see if anything pops up.


Thanks Michal!

--
Cheers,

David / dhildenb



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

Nope: looks the same. I've taken my test harness out of the picture and done
everything manually from the ground up, with the old tests and the new. Headline
is that I see similar numbers from both.


I took me a while to get really reproducible numbers on Intel. Most importantly:
* Set a fixed CPU frequency, disabling any boost and avoiding any
   thermal throttling.
* Pin the test to CPUs and set a nice level.


I'm already pinning the test to cpu 0. But for M2, at least, I'm running in a VM
on top of macos, and I don't have a mechanism to pin the QEMU threads to the
physical CPUs. Anyway, I don't think these are problems because for a given
kernel build I can accurately repro numbers.


Oh, you do have a layer of virtualization in there. I *suspect* that 
might amplify some odd things regarding code layout, caching effects, etc.


I guess especially the fork() benchmark is too sensible (fast) for 
things like that, so I would just focus on bare metal results where you 
can control the environment completely.


Note that regarding NUMA effects, I mean when some memory access within 
the same socket is faster/slower even with only a single node. On AMD 
EPYC that's possible, depending on which core you are running and on 
which memory controller the memory you want to access is located. If 
both are in different quadrants IIUC, the access latency will be different.



But yes: I was observing something similar on AMD EPYC, where you get
consecutive pages from the buddy, but once you allocate from the PCP it might no
longer be consecutive.


   - test is 5-10% slower when output is printed to terminal vs when redirected 
to
     file. I've always effectively been redirecting. Not sure if this overhead
     could start to dominate the regression and that's why you don't see it?


That's weird, because we don't print while measuring? Anyhow, 5/10% variance on
some system is not the end of the world.


I imagine it's cache effects? More work to do to print the output could be
evicting some code that's in the benchmark path?


Maybe. Do you also see these oddities on the bare metal system?







I'm inclined to run this test for the last N kernel releases and if the number
moves around significantly, conclude that these tests don't really matter.
Otherwise its an exercise in randomly refactoring code until it works well, but
that's just overfitting to the compiler and hw. What do you think?


Personally, I wouldn't lose sleep if you see weird, unexplainable behavior on
some system (not even architecture!). Trying to optimize for that would indeed
be random refactorings.

But I would not be so fast to say that "these tests don't really matter" and
then go wild and degrade them as much as you want. There are use cases that care
about fork performance especially with order-0 pages -- such as Redis.


Indeed. But also remember that my fork baseline time is ~2.5ms, and I think you
said yours was 14ms :)


Yes, no idea why M2 is that fast (BTW, which page size? 4k or 16k? ) :)



I'll continue to mess around with it until the end of the day. But if I'm not
making any headway, then I'll change tack; I'll just measure the performance of
my contpte changes using your fork/zap stuff as the baseline and post based on 
that.


You should likely not focus on M2 results. Just pick a representative 
bare metal machine where you get consistent, explainable results.


Nothing in the code is fine-tuned for a particular architecture so far, 
only order-0 handling is kept separate.


BTW: I see the exact same speedups for dontneed that I see for munmap. 
For example, for order-9, it goes from 0.023412s -> 0.009785, so -58%. 
So I'm curious why you see a speedup for munmap but not for dontneed.


--
Cheers,

David / dhildenb



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand


I'm also surprised about the dontneed vs. munmap numbers.


You mean the ones for Altra that I posted? (I didn't post any for M2). The altra
numbers look ok to me; dontneed has no change, and munmap has no change for
order-0 and is massively improved for order-9.



I would expect that dontneed would similarly benefit -- same code path. 
But I focused on munmap measurements for now, I'll try finding time to 
confirm that it's the same on Intel.


--
Cheers,

David / dhildenb



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 13:37, Ryan Roberts wrote:

On 31/01/2024 11:49, Ryan Roberts wrote:

On 31/01/2024 11:28, David Hildenbrand wrote:

On 31.01.24 12:16, Ryan Roberts wrote:

On 31/01/2024 11:06, David Hildenbrand wrote:

On 31.01.24 11:43, Ryan Roberts wrote:

On 29/01/2024 12:46, David Hildenbrand wrote:

Now that the rmap overhaul[1] is upstream that provides a clean interface
for rmap batching, let's implement PTE batching during fork when processing
PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but it's a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes
for fork() (shorter is better):

Folio Size | v6.8-rc1 |  New | Change
------------------------------------------
     4KiB | 0.014328 | 0.014035 |   - 2%
    16KiB | 0.014263 | 0.01196  |   -16%
    32KiB | 0.014334 | 0.01094  |   -24%
    64KiB | 0.014046 | 0.010444 |   -26%
   128KiB | 0.014011 | 0.010063 |   -28%
   256KiB | 0.013993 | 0.009938 |   -29%
   512KiB | 0.013983 | 0.00985  |   -30%
  1024KiB | 0.013986 | 0.00982  |   -30%
  2048KiB | 0.014305 | 0.010076 |   -30%


Just a heads up that I'm seeing some strange results on Apple M2. Fork for
order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty
sure I
didn't see this problem with version 1; although that was on a different
baseline and I've thrown the numbers away so will rerun and try to debug this.


Numbers for v1 of the series, both on top of 6.8-rc1 and rebased to the same
mm-unstable base as v3 of the series (first 2 rows are from what I just posted
for context):

| kernel |   mean_rel |   std_rel |
|:---|---:|--:|
| mm-unstabe (base)  |   0.0% |  1.1% |
| mm-unstable + v3   |  16.7% |  0.8% |
| mm-unstable + v1   |  -2.5% |  1.7% |
| v6.8-rc1 + v1  |  -6.6% |  1.1% |

So all looks good with v1. And seems to suggest mm-unstable has regressed by ~4%
vs v6.8-rc1. Is this really a useful benchmark? Does the raw performance of
fork() syscall really matter? Evidence suggests it's moving all over the place -
breath on the code and it changes - not a great place to be when using the test
for gating purposes!

Still with the old tests - I'll move to the new ones now.






So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
fork() for order-0 was consistently effectively unchanged. Do you observe that
on other ARM systems as well?


Nope; running the exact same kernel binary and user space on Altra, I see
sensible numbers;

fork order-0: -1.3%
fork order-9: -7.6%
dontneed order-0: -0.5%
dontneed order-9: 0.1%
munmap order-0: 0.0%
munmap order-9: -67.9%

So I guess some pipelining issue that causes the M2 to stall more?


With effectively one added folio_test_large(), it could only be a code layout
problem? Or the compiler does something stupid, but you say that you run the
exact same kernel binary, so that doesn't make sense.


Yup, same binary. We know this code is very sensitive - 1 cycle makes a big
difference. So could easily be code layout, branch prediction, etc...



I'm also surprised about the dontneed vs. munmap numbers.


You mean the ones for Altra that I posted? (I didn't post any for M2). The altra
numbers look ok to me; dontneed has no change, and munmap has no change for
order-0 and is massively improved for order-9.

  Doesn't make any sense

(again, there was this VMA merging problem but it would still allow for batching
within a single VMA that spans exactly one large folio).

What are you using as baseline? Really just mm-unstable vs. mm-unstable+patches?


yes. except for "v6.8-rc1 + v1" above.



Let's see if the new test changes the numbers you measure.


Nope: looks the same. I've taken my test harness out of the picture and done
everything manually from the ground up, with the old tests and the 

Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 12:16, Ryan Roberts wrote:

On 31/01/2024 11:06, David Hildenbrand wrote:

On 31.01.24 11:43, Ryan Roberts wrote:

On 29/01/2024 12:46, David Hildenbrand wrote:

Now that the rmap overhaul[1] is upstream that provides a clean interface
for rmap batching, let's implement PTE batching during fork when processing
PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but it's a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes
for fork() (shorter is better):

Folio Size | v6.8-rc1 |  New | Change
------------------------------------------
    4KiB | 0.014328 | 0.014035 |   - 2%
   16KiB | 0.014263 | 0.01196  |   -16%
   32KiB | 0.014334 | 0.01094  |   -24%
   64KiB | 0.014046 | 0.010444 |   -26%
  128KiB | 0.014011 | 0.010063 |   -28%
  256KiB | 0.013993 | 0.009938 |   -29%
  512KiB | 0.013983 | 0.00985  |   -30%
     1024KiB | 0.013986 | 0.00982  |   -30%
     2048KiB | 0.014305 | 0.010076 |   -30%


Just a heads up that I'm seeing some strange results on Apple M2. Fork for
order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty sure I
didn't see this problem with version 1; although that was on a different
baseline and I've thrown the numbers away so will rerun and try to debug this.



So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe this.
fork() for order-0 was consistently effectively unchanged. Do you observe that
on other ARM systems as well?


Nope; running the exact same kernel binary and user space on Altra, I see
sensible numbers;

fork order-0: -1.3%
fork order-9: -7.6%
dontneed order-0: -0.5%
dontneed order-9: 0.1%
munmap order-0: 0.0%
munmap order-9: -67.9%

So I guess some pipelining issue that causes the M2 to stall more?


With effectively one added folio_test_large(), it could only be a code 
layout problem? Or the compiler does something stupid, but you say that 
you run the exact same kernel binary, so that doesn't make sense.


I'm also surprised about the dontneed vs. munmap numbers. Doesn't make 
any sense (again, there was this VMA merging problem but it would still 
allow for batching within a single VMA that spans exactly one large folio).


What are you using as baseline? Really just mm-unstable vs. 
mm-unstable+patches?


Let's see if the new test changes the numbers you measure.

--
Cheers,

David / dhildenb



Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

-    folio_remove_rmap_pte(folio, page, vma);
+    folio_remove_rmap_ptes(folio, page, nr, vma);
+
+    /* Only sanity-check the first page in a batch. */
   if (unlikely(page_mapcount(page) < 0))
   print_bad_pte(vma, addr, ptent, page);


Is there a case for either removing this all together or moving it into
folio_remove_rmap_ptes()? It seems odd to only check some pages.



I really wanted to avoid another nasty loop here.

In my thinking, for 4k folios, or when zapping subpages of large folios, we
still perform the exact same checks. Only when batching we don't. So if there is
some problem, there are ways to get it triggered. And these problems are barely
ever seen.

folio_remove_rmap_ptes() feels like the better place -- especially because the
delayed-rmap handling is effectively unchecked. But in there, we cannot
"print_bad_pte()".

[background: if we had a total mapcount -- iow cheap folio_mapcount(), I'd check
here that the total mapcount does not underflow, instead of checking 
per-subpage]


All good points... perhaps extend the comment to describe how this could be
solved in future with cheap total_mapcount()? Or in the commit log if you 
prefer?


I'll add more meat to the cover letter, thanks!

--
Cheers,

David / dhildenb



Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 11:43, Ryan Roberts wrote:

On 29/01/2024 12:46, David Hildenbrand wrote:

Now that the rmap overhaul[1] is upstream that provides a clean interface
for rmap batching, let's implement PTE batching during fork when processing
PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but it's a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build up on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
of the same size (stddev < 1%) results in the following runtimes
for fork() (shorter is better):

Folio Size | v6.8-rc1 |  New | Change
------------------------------------------
   4KiB | 0.014328 | 0.014035 |   - 2%
  16KiB | 0.014263 | 0.01196  |   -16%
  32KiB | 0.014334 | 0.01094  |   -24%
  64KiB | 0.014046 | 0.010444 |   -26%
 128KiB | 0.014011 | 0.010063 |   -28%
 256KiB | 0.013993 | 0.009938 |   -29%
 512KiB | 0.013983 | 0.00985  |   -30%
1024KiB | 0.013986 | 0.00982  |   -30%
2048KiB | 0.014305 | 0.010076 |   -30%


Just a heads up that I'm seeing some strange results on Apple M2. Fork for
order-0 is seemingly costing ~17% more. I'm using GCC 13.2 and was pretty sure I
didn't see this problem with version 1; although that was on a different
baseline and I've thrown the numbers away so will rerun and try to debug this.



So far, on my x86 tests (Intel, AMD EPYC), I was not able to observe 
this. fork() for order-0 was consistently effectively unchanged. Do you 
observe that on other ARM systems as well?




| kernel  |   mean_rel |   std_rel |
|:---|---:|--:|
| mm-unstable |   0.0% |  1.1% |
| patch 1 |  -2.3% |  1.3% |
| patch 10|  -2.9% |  2.7% |
| patch 11|  13.5% |  0.5% |
| patch 12|  15.2% |  1.2% |
| patch 13|  18.2% |  0.7% |
| patch 14|  20.5% |  1.0% |
| patch 15|  17.1% |  1.6% |
| patch 15|  16.7% |  0.8% |

fork for order-9 is looking good (-20%), and for the zap series, munmap is
looking good, but dontneed is looking poor for both order-0 and 9. But one thing
at a time... let's concentrate on fork order-0 first.


munmap and dontneed end up calling the exact same call paths. So a big 
performance difference is rather surprising and might indicate something 
else.


(I think I told you that I was running into some kind of VMA merging 
problem where one would suddenly get, with my benchmark, 1 VMA per page. 
The new benchmark below works around that, but I am not sure if that was 
fixed in the meantime)


VMA merging can of course explain a big difference in fork and munmap 
vs. dontneed times, especially when comparing different code base where 
that VMA merging behavior was different.




Note that I'm still using the "old" benchmark code. Could you resend me the link
to the new code? Although I don't think there should be any effect for order-0
anyway, if I understood your changes correctly?


This is the combined one (small and large PTEs):

https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?inline=false

--
Cheers,

David / dhildenb



Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 03:20, Yin Fengwei wrote:

On 1/29/24 22:32, David Hildenbrand wrote:

This series is based on [1] and must be applied on top of it.
Similar to what we did with fork(), let's implement PTE batching
during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.

Ryan was previously working on this in the context of cont-pte for
arm64, in its latest iteration [2] with a focus on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-pages batches as well, and makes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.


One possible scenario:
If all the folios are 2M folios, then one full batch could hold 510M of memory.
Is that too much, given that one full batch could previously hold only
(2M - 4096 * 2) of memory?


Good point, we do have CONFIG_INIT_ON_FREE_DEFAULT_ON. I don't remember 
if init_on_free or init_on_alloc was used in production systems. In 
tlb_batch_pages_flush(), there is a cond_resched() to limit the number 
of entries we process.


So if that is actually problematic, we'd run into a soft-lockup and need 
another cond_resched() [I have some faint recollection that people are 
working on removing cond_resched() completely].


One could do some counting in free_pages_and_swap_cache() (where we 
iterate all entries already) and insert cond_resched+release_pages() for 
every (e.g., 512) pages.
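
Hypothetical shape of that idea as a standalone userspace sketch (the kernel
version would sit in free_pages_and_swap_cache() and use cond_resched();
everything below, including the threshold, is illustrative only):

#include <sched.h>
#include <stdio.h>

#define RESCHED_EVERY	512	/* arbitrary threshold, as suggested above */

static void free_one_page(unsigned long pfn)
{
	(void)pfn;	/* stand-in for the per-page freeing (and optional zeroing) */
}

static void free_pages_throttled(const unsigned long *pfns, unsigned long nr)
{
	unsigned long since_resched = 0;

	for (unsigned long i = 0; i < nr; i++) {
		free_one_page(pfns[i]);
		if (++since_resched >= RESCHED_EVERY) {
			since_resched = 0;
			sched_yield();	/* kernel: cond_resched() */
		}
	}
}

int main(void)
{
	unsigned long pfns[2048];

	for (unsigned long i = 0; i < 2048; i++)
		pfns[i] = i;
	free_pages_throttled(pfns, sizeof(pfns) / sizeof(pfns[0]));
	printf("freed %zu pages\n", sizeof(pfns) / sizeof(pfns[0]));
	return 0;
}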


--
Cheers,

David / dhildenb



Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 03:30, Yin Fengwei wrote:



On 1/29/24 22:32, David Hildenbrand wrote:

+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+   pte_t pte, tmp_pte;
+
+   pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   while (--nr) {
+   ptep++;
+   addr += PAGE_SIZE;
+   tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   if (pte_dirty(tmp_pte))
+   pte = pte_mkdirty(pte);
+   if (pte_young(tmp_pte))
+   pte = pte_mkyoung(pte);

I am wondering whether it's worthwhile to move the pte_mkdirty() and pte_mkyoung()
out of the loop and just do them one time if needed. The worst case is that they
are called nr - 1 times. Or is it just too micro?


I also thought about just indicating "any_accessed" or "any_dirty" using 
flags to the caller, to avoid the PTE modifications completely. Felt a 
bit micro-optimized.
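
Purely for illustration, such a flag-returning variant could look roughly like
the sketch below (hypothetical, not a proposed patch; the function name and the
any_dirty/any_young parameters are made up, only the helpers it calls exist):

static inline pte_t get_and_clear_full_ptes_flags(struct mm_struct *mm,
		unsigned long addr, pte_t *ptep, unsigned int nr, int full,
		bool *any_dirty, bool *any_young)
{
	pte_t pte, tmp_pte;

	/* Collect dirty/accessed as flags instead of folding them into "pte". */
	pte = ptep_get_and_clear_full(mm, addr, ptep, full);
	*any_dirty = pte_dirty(pte);
	*any_young = pte_young(pte);
	while (--nr) {
		ptep++;
		addr += PAGE_SIZE;
		tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
		*any_dirty |= pte_dirty(tmp_pte);
		*any_young |= pte_young(tmp_pte);
	}
	return pte;
}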


Regarding your proposal: I thought about that as well, but my assumption 
was that dirty+young are "cheap" to be set.


On x86, pte_mkyoung() is setting _PAGE_ACCESSED.
pte_mkdirty() is setting _PAGE_DIRTY | _PAGE_SOFT_DIRTY, but it also has 
to handle the saveddirty handling, using some bit trickery.


So at least for pte_mkyoung() there would be no real benefit as far as I 
can see (might be even worse). For pte_mkdirty() there might be a small 
benefit.


Is it going to be measurable? Likely not.

Am I missing something?

Thanks!

--
Cheers,

David / dhildenb



Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread David Hildenbrand




+
+#ifndef clear_full_ptes
+/**
+ * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.


I know its implied from "pages of the same folio" (and even more so for the
above variant due to mention of access/dirty), but I wonder if its useful to
explicitly state that "all ptes being cleared are present at the time of the 
call"?


"Clear PTEs" -> "Clear present PTEs" ?

That should make it clearer.

[...]


if (!delay_rmap) {
-   folio_remove_rmap_pte(folio, page, vma);
+   folio_remove_rmap_ptes(folio, page, nr, vma);
+
+   /* Only sanity-check the first page in a batch. */
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);


Is there a case for either removing this all together or moving it into
folio_remove_rmap_ptes()? It seems odd to only check some pages.



I really wanted to avoid another nasty loop here.

In my thinking, for 4k folios, or when zapping subpages of large folios, 
we still perform the exact same checks. Only when batching we don't. So 
if there is some problem, there are ways to get it triggered. And these 
problems are barely ever seen.


folio_remove_rmap_ptes() feels like the better place -- especially 
because the delayed-rmap handling is effectively unchecked. But in 
there, we cannot "print_bad_pte()".


[background: if we had a total mapcount -- iow cheap folio_mapcount(), 
I'd check here that the total mapcount does not underflow, instead of 
checking per-subpage]





}
-   if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
+   if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
*force_flush = true;
*force_break = true;
}
  }
  
-static inline void zap_present_pte(struct mmu_gather *tlb,

+/*
+ * Zap or skip one present PTE, trying to batch-process subsequent PTEs that 
map


Zap or skip *at least* one... ?


Ack

--
Cheers,

David / dhildenb



Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread David Hildenbrand

On 31.01.24 03:20, Yin Fengwei wrote:

On 1/29/24 22:32, David Hildenbrand wrote:

This series is based on [1] and must be applied on top of it.
Similar to what we did with fork(), let's implement PTE batching
during unmap/zap when processing PTE-mapped THPs.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
entry removal once per batch.

Ryan was previously working on this in the context of cont-pte for
arm64, in its latest iteration [2] with a focus on arm64 with cont-pte only.
This series implements the optimization for all architectures, independent
of such PTE bits, teaches MMU gather/TLB code to be fully aware of such
large-folio-pages batches as well, and makes use of our new rmap batching
function when removing the rmap.

To achieve that, we have to enlighten MMU gather / page freeing code
(i.e., everything that consumes encoded_page) to process unmapping
of consecutive pages that all belong to the same large folio. I'm being
very careful to not degrade order-0 performance, and it looks like I
managed to achieve that.




Let's CC Linus and Michal to make sure I'm not daydreaming.

Relevant patch:
  https://lkml.kernel.org/r/20240129143221.263763-8-da...@redhat.com

Context: I'm adjusting MMU gather code to support batching of 
consecutive pages that belong to the same large folio, when 
unmapping/zapping PTEs.


For small folios, there is no (relevant) change.

Imagine we have a PTE-mapped THP (2M folio -> 512 pages) and zap all 512 
PTEs: Instead of adding 512 individual encoded_page entries, we add a 
combined entry that expresses "page+nr_pages". That allows for "easily" 
adding various other per-folio batching (refcount, rmap, swap freeing).


The implication is that we can now batch effectively more pages with 
large folios, exceeding the old 10000-page limit. The number of involved 
*folios* does not increase, though.



One possible scenario:
If all the folios are 2M folios, then one full batch could hold 510M of memory.
Is that too much, given that one full batch could previously hold only
(2M - 4096 * 2) of memory?


Excellent point, I think there are three parts to it:

(1) Batch pages / folio fragments per batch page

Before this change (and with 4k folios) we have exactly one page (4k) 
per encoded_page entry in the batch. Now, we can have (with 2M folios), 
512 pages for every two encoded_page entries (page+nr_pages) in a batch 
page. So an average ~256 pages per encoded_page entry.
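
Back-of-the-envelope for those capacities (standalone calculation; it assumes
x86-64 with 4 KiB pages, 8-byte pointers and a 16-byte struct mmu_gather_batch
header, i.e. ~510 encoded_page slots per batch page):

#include <stdio.h>

int main(void)
{
	const long page_size = 4096;			/* x86-64 base page */
	const long batch_hdr = 16;			/* next + nr + max (assumed) */
	const long slots = (page_size - batch_hdr) / 8;	/* encoded_page pointers */

	/* Old worst case: one order-0 page (4 KiB) per slot. */
	printf("order-0:  %ld slots     -> %ld KiB per batch page\n",
	       slots, slots * 4);
	/* New worst case: one 2 MiB folio per two slots (page + nr_pages). */
	printf("2M folio: %ld fragments -> %ld MiB per batch page\n",
	       slots / 2, (slots / 2) * 2);
	return 0;
}

This prints ~2040 KiB (the "2M - 4096 * 2" from the question above) for the old
worst case and ~510 MiB for the new one, matching the numbers being discussed.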


So one batch page can now store in the worst case ~256 times the number 
of pages, but the number of folio fragments ("pages+nr_pages") would not 
increase.


The time it takes to perform the actual page freeing of a batch will not 
be 256 times higher -- the time is expected to be much closer to the old 
time (i.e., not freeing more folios).


(2) Delayed rmap handling

We limit batching early (see tlb_next_batch()) when we have delayed rmap 
pending. Reason being, that we don't want to check for many entries if 
they require delayed rmap handling, while still holding the page table 
lock (see tlb_flush_rmaps()), because we have to remove the rmap before 
dropping the PTL.


Note that we perform the check whether we need delayed rmap handling per 
page+nr_pages entry, not per page. So we won't perform more such checks.


Once we set tlb->delayed_rmap (because we add one entry that requires 
it), we already force a flush before dropping the PT lock. So once we 
get a single delayed rmap entry in there, we will not batch more than we 
could have in the same page table: so not more than 512 entries (x86-64) 
in the worst case. So it will still be bounded, and not significantly 
more than what we had before.


So regarding delayed rmap handling I think this should be fine.

(3) Total batched pages

MAX_GATHER_BATCH_COUNT effectively limits the number of pages we 
allocate (full batches), and thereby limits the number of pages we were 
able to batch.


The old limit was ~10000 pages, now we could batch ~5000 folio fragments 
(page+nr_pages), resulting in the "times 256" increase in the worst 
case on x86-64 as you point out.


This 10000-page limit was introduced in 53a59fc67f97 ("mm: limit 
mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT") where we 
wanted to handle soft-lockups.


As the number of effective folios we are freeing does not increase, I 
*think* this should be fine.



If any of that is a problem, we would have to keep track of the total 
number of pages in our batch, and stop as soon as we hit our 10000-page limit 
-- independent of page vs. folio fragment. Something I would like to 
avoid if possible.


--
Cheers,

David / dhildenb



Re: [PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages()

2024-01-30 Thread David Hildenbrand

On 30.01.24 10:21, Ryan Roberts wrote:

On 29/01/2024 14:32, David Hildenbrand wrote:

Add __tlb_remove_folio_pages(), which will remove multiple consecutive
pages that belong to the same large folio, instead of only a single
page. We'll be using this function when optimizing unmapping/zapping of
large folios that are mapped by PTEs.

We're using the remaining spare bit in an encoded_page to indicate that
the next encoded page in an array actually contains a shifted "nr_pages".
Teach swap/freeing code about putting multiple folio references, and
delayed rmap handling to remove page ranges of a folio.

This extension allows for still gathering almost as many small folios
as we used to (-1, because we have to prepare for a possibly bigger next
entry), but still allows for gathering consecutive pages that belong to the
same large folio.

Note that we don't pass the folio pointer, because it is not required for
now. Further, we don't support page_size != PAGE_SIZE, it won't be
required for simple PTE batching.

We have to provide a separate s390 implementation, but it's fairly
straightforward.

Another, more invasive and likely more expensive, approach would be to
use folio+range or a PFN range instead of page+nr_pages. But, we should
do that consistently for the whole mmu_gather. For now, let's keep it
simple and add "nr_pages" only.
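
To make the array protocol concrete, here is a minimal standalone decode loop
(illustrative only; the bit value and helpers are made up, and the real
consumers are the swap/freeing paths touched further down):

#include <stdint.h>
#include <stdio.h>

#define BIT_NR_PAGES	0x2ul	/* "the next array entry holds a shifted nr_pages" */
#define FLAG_MASK	0x3ul

static uintptr_t enc_page(uintptr_t fake_page_addr, unsigned long flags)
{
	return fake_page_addr | flags;
}

static uintptr_t enc_nr(unsigned long nr)
{
	return nr << 2;		/* shifted so it cannot be mistaken for flags */
}

int main(void)
{
	/* one 3-page folio fragment followed by two single pages */
	uintptr_t batch[] = {
		enc_page(0x1000, BIT_NR_PAGES), enc_nr(3),
		enc_page(0x8000, 0),
		enc_page(0x9000, 0),
	};
	size_t n = sizeof(batch) / sizeof(batch[0]);

	for (size_t i = 0; i < n; i++) {
		uintptr_t page = batch[i] & ~FLAG_MASK;
		unsigned long nr = 1;

		if (batch[i] & BIT_NR_PAGES)
			nr = batch[++i] >> 2;	/* consume the companion entry */
		printf("free %lu page(s) starting at %#lx\n", nr,
		       (unsigned long)page);
	}
	return 0;
}

This also shows where the "-1" above comes from: the producer has to leave room
for a possible companion "nr_pages" entry before it knows whether the next page
starts a large-folio run.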

Signed-off-by: David Hildenbrand 
---
  arch/s390/include/asm/tlb.h | 17 +++
  include/asm-generic/tlb.h   |  8 +
  include/linux/mm_types.h| 20 
  mm/mmu_gather.c | 61 +++--
  mm/swap.c   | 12 ++--
  mm/swap_state.c | 12 ++--
  6 files changed, 116 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 48df896d5b79..abfd2bf29e9e 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -26,6 +26,8 @@ void __tlb_remove_table(void *_table);
  static inline void tlb_flush(struct mmu_gather *tlb);
  static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
struct page *page, bool delay_rmap, int page_size);
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,
+   struct page *page, unsigned int nr_pages, bool delay_rmap);
  
  #define tlb_flush tlb_flush

  #define pte_free_tlb pte_free_tlb
@@ -52,6 +54,21 @@ static inline bool __tlb_remove_page_size(struct mmu_gather 
*tlb,
return false;
  }
  
+static inline bool __tlb_remove_folio_pages(struct mmu_gather *tlb,

+   struct page *page, unsigned int nr_pages, bool delay_rmap)
+{
+   struct encoded_page *encoded_pages[] = {
+   encode_page(page, ENCODED_PAGE_BIT_NR_PAGES),
+   encode_nr_pages(nr_pages),
+   };
+
+   VM_WARN_ON_ONCE(delay_rmap);
+   VM_WARN_ON_ONCE(page_folio(page) != page_folio(page + nr_pages - 1));
+
+   free_pages_and_swap_cache(encoded_pages, ARRAY_SIZE(encoded_pages));
+   return false;
+}
+
  static inline void tlb_flush(struct mmu_gather *tlb)
  {
__tlb_flush_mm_lazy(tlb->mm);
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2eb7b0d4f5d2..428c3f93addc 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -69,6 +69,7 @@
   *
   *  - tlb_remove_page() / __tlb_remove_page()
   *  - tlb_remove_page_size() / __tlb_remove_page_size()
+ *  - __tlb_remove_folio_pages()
   *
   *__tlb_remove_page_size() is the basic primitive that queues a page for
   *freeing. __tlb_remove_page() assumes PAGE_SIZE. Both will return a
@@ -78,6 +79,11 @@
   *tlb_remove_page() and tlb_remove_page_size() imply the call to
   *tlb_flush_mmu() when required and has no return value.
   *
+ *__tlb_remove_folio_pages() is similar to __tlb_remove_page(), however,
+ *instead of removing a single page, remove the given number of consecutive
+ *pages that are all part of the same (large) folio: just like calling
+ *__tlb_remove_page() on each page individually.
+ *
   *  - tlb_change_page_size()
   *
   *call before __tlb_remove_page*() to set the current page-size; implies a
@@ -262,6 +268,8 @@ struct mmu_gather_batch {
  
  extern bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
		bool delay_rmap, int page_size);
+bool __tlb_remove_folio_pages(struct mmu_gather *tlb, struct page *page,
+   unsigned int nr_pages, bool delay_rmap);
  
  #ifdef CONFIG_SMP

  /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1b89eec0d6df..198662b7a39a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -226,6 +226,15 @@ struct encoded_page;
  /* Perform rmap removal after we have flushed the TLB. */
  #define ENCODED_PAGE_BIT_DELAY_RMAP   1ul
  
+/*
+ * The next item in an encoded_page array is the "nr_pages" argument, specifying
+ * the number o

Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-30 Thread David Hildenbrand

Re-reading the docs myself:


+#ifndef get_and_clear_full_ptes
+/**
+ * get_and_clear_full_ptes - Clear PTEs that map consecutive pages of the same
+ *  folio, collecting dirty/accessed bits.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into
+ * returned PTE.


"into the"


+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock.  The PTEs map consecutive
+ * pages that belong to the same folio.  The PTEs are all in the same PMD.
+ */
+static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
+   unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+   pte_t pte, tmp_pte;
+
+   pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   while (--nr) {
+   ptep++;
+   addr += PAGE_SIZE;
+   tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
+   if (pte_dirty(tmp_pte))
+   pte = pte_mkdirty(pte);
+   if (pte_young(tmp_pte))
+   pte = pte_mkyoung(pte);
+   }
+   return pte;
+}
+#endif
+
+#ifndef clear_full_ptes
+/**
+ * clear_full_ptes - Clear PTEs that map consecutive pages of the same folio.
+ * @mm: Address space the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.


Something went missing:

May be overridden by the architecture; otherwise, implemented as a 
simple loop over ptep_get_and_clear_full().
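
For reference, the generic fallback being described is presumably just the
same loop as in get_and_clear_full_ptes() above, minus the dirty/accessed
merging. Roughly (a sketch, not quoted from the patch):

	static inline void clear_full_ptes(struct mm_struct *mm, unsigned long addr,
			pte_t *ptep, unsigned int nr, int full)
	{
		for (;;) {
			/* Clear one entry; we don't care about the returned PTE. */
			ptep_get_and_clear_full(mm, addr, ptep, full);
			if (--nr == 0)
				break;
			ptep++;
			addr += PAGE_SIZE;
		}
	}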



--
Cheers,

David / dhildenb



Re: [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()

2024-01-30 Thread David Hildenbrand

On 30.01.24 09:46, Ryan Roberts wrote:

On 30/01/2024 08:41, David Hildenbrand wrote:

On 30.01.24 09:13, Ryan Roberts wrote:

On 29/01/2024 14:32, David Hildenbrand wrote:

Let's prepare for further changes by factoring out processing of present
PTEs.

Signed-off-by: David Hildenbrand 
---
   mm/memory.c | 92 ++---
   1 file changed, 52 insertions(+), 40 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b05fd28dbce1..50a6c79c78fc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
   pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
   }
   +static inline void zap_present_pte(struct mmu_gather *tlb,
+    struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+    unsigned long addr, struct zap_details *details,
+    int *rss, bool *force_flush, bool *force_break)
+{
+    struct mm_struct *mm = tlb->mm;
+    bool delay_rmap = false;
+    struct folio *folio;


You need to init this to NULL, otherwise it's a random value when calling
should_zap_folio() if vm_normal_page() returns NULL.


Right, and we can stop setting it to NULL in the original function. Patch #2
changes these checks, which is why it's only a problem in this patch.


Yeah I only noticed that after sending out this reply and moving to the next
patch. Still worth fixing this intermediate state I think.


Absolutely, I didn't do patch-by-patch compilation yet (I suspect the
compiler would complain).


--
Cheers,

David / dhildenb



Re: [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()

2024-01-30 Thread David Hildenbrand

On 30.01.24 09:45, Ryan Roberts wrote:

On 30/01/2024 08:37, David Hildenbrand wrote:

On 30.01.24 09:31, Ryan Roberts wrote:

On 29/01/2024 14:32, David Hildenbrand wrote:

We don't need up-to-date accessed-dirty information for anon folios and can
simply work with the ptent we already have. Also, we know the RSS counter
we want to update.

We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
zap_install_uffd_wp_if_needed() after updating the folio and RSS.

While at it, only call zap_install_uffd_wp_if_needed() if there is even
any chance that pte_install_uffd_wp_if_needed() would do *something*.
That is, just don't bother if uffd-wp does not apply.

Signed-off-by: David Hildenbrand 
---
   mm/memory.c | 16 +++-
   1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 69502cdc0a7d..20bc13ab8db2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1552,12 +1552,9 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
   folio = page_folio(page);
   if (unlikely(!should_zap_folio(details, folio)))
   return;
-    ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
-    arch_check_zapped_pte(vma, ptent);
-    tlb_remove_tlb_entry(tlb, pte, addr);
-    zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
     if (!folio_test_anon(folio)) {
+    ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
   if (pte_dirty(ptent)) {
   folio_mark_dirty(folio);
   if (tlb_delay_rmap(tlb)) {
@@ -1567,8 +1564,17 @@ static inline void zap_present_pte(struct mmu_gather *tlb,
   }
   if (pte_young(ptent) && likely(vma_has_recency(vma)))
   folio_mark_accessed(folio);
+    rss[mm_counter(folio)]--;
+    } else {
+    /* We don't need up-to-date accessed/dirty bits. */
+    ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
+    rss[MM_ANONPAGES]--;
   }
-    rss[mm_counter(folio)]--;
+    arch_check_zapped_pte(vma, ptent);


Isn't the x86 (only) implementation of this relying on the dirty bit? So doesn't
that imply you still need get_and_clear for anon? (And in hindsight I think that
logic would apply to the previous patch too?)


x86 uses the encoding !writable && dirty to indicate special shadow stacks. That
is, the hw dirty bit is set by software (to create that combination), not by
hardware.

So you don't have to sync against any hw changes of the hw dirty bit. What you
had in the original PTE you read is sufficient.
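
For reference, the x86 check boils down to something along these lines (a
sketch from memory, not quoted from this series; the pte_is_shstk_encoding()
helper name is made up and the real helpers may differ):

	/* Shadow-stack PTEs are the only intended Write=0 + Dirty=1 encoding. */
	static inline bool pte_is_shstk_encoding(pte_t pte)
	{
		return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
	}

	void arch_check_zapped_pte(struct vm_area_struct *vma, pte_t pte)
	{
		/* Outside shadow-stack VMAs, software never creates that combination. */
		VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
				pte_is_shstk_encoding(pte));
	}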



Right, got it. In that case:


Thanks a lot for paying that much attention during your reviews! Highly 
appreciated!




Reviewed-by: Ryan Roberts 




--
Cheers,

David / dhildenb



Re: [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()

2024-01-30 Thread David Hildenbrand

On 30.01.24 09:13, Ryan Roberts wrote:

On 29/01/2024 14:32, David Hildenbrand wrote:

Let's prepare for further changes by factoring out processing of present
PTEs.

Signed-off-by: David Hildenbrand 
---
  mm/memory.c | 92 ++---
  1 file changed, 52 insertions(+), 40 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b05fd28dbce1..50a6c79c78fc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1532,13 +1532,61 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
  }
  
+static inline void zap_present_pte(struct mmu_gather *tlb,
+   struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+   unsigned long addr, struct zap_details *details,
+   int *rss, bool *force_flush, bool *force_break)
+{
+   struct mm_struct *mm = tlb->mm;
+   bool delay_rmap = false;
+   struct folio *folio;


You need to init this to NULL, otherwise it's a random value when calling
should_zap_folio() if vm_normal_page() returns NULL.


Right, and we can stop setting it to NULL in the original function. 
Patch #2 changes these checks, which is why it's only a problem in this 
patch.


Will fix, thanks!
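
Presumably nothing more than initializing the pointer for this intermediate
state; a sketch of the one-liner, not a posted fix:

	struct mm_struct *mm = tlb->mm;
	bool delay_rmap = false;
	struct folio *folio = NULL;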

--
Cheers,

David / dhildenb


