Re: [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)

2024-05-23 Thread Peter Xu
On Fri, May 17, 2024 at 08:59:54PM +0200, Christophe Leroy wrote:
> This is the continuation of the RFC v1 series "Reimplement huge pages
> without hugepd on powerpc 8xx". It now gets rid of hugepd completely
> after also handling e500 and book3s/64.
> 
> Unlike most architectures, powerpc 8xx HW requires a two-level
> pagetable topology for all page sizes. So a leaf PMD-contig approach
> is not feasible as such.
> 
> Possible sizes are 4k, 16k, 512k and 8M.
> 
> The first level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD
> entries must point to a single-entry level-2 page table. Until now that
> was done using hugepd. This series changes it to use standard page
> tables where the entry is replicated 1024 times in each of the two page
> tables referred to by the two associated PMD entries for that 8M page.
> 
> At the moment each helper has to check whether the hugepage ptep is a
> PTE or a PMD in order to know whether it is an 8M page or a smaller
> size. I hope this can be handled by core-mm in the future.
> 
> For e500 and book3s/64 there are fewer constraints because they are not
> tied to the HW-assisted tablewalk like the 8xx is, so it is easier to
> use leaf PMDs (and PUDs).
> 
> On e500 the supported page sizes are 4M, 16M, 64M, 256M and 1G, all at
> PMD level on e500/32 and a mix of PMD and PUD on e500/64. The page size
> is encoded with 4 available bits in the PTE entries. On e500/32 the PGD
> entry size is increased to 64 bits in order to allow leaf PMD entries,
> because PTEs are 64 bits on e500.
> 
> On book3s/64 only the hash-4k mode is concerned. It supports 16M pages
> as cont-PMD and 16G pages as cont-PUD. In the other modes (radix-4k,
> radix-64k and hash-64k) the sizes match the PMD and PUD sizes, so those
> are just leaf entries.
> 
> Christophe Leroy (20):
>   mm: Provide pagesize to pmd_populate()
>   mm: Provide page size to pte_alloc_huge()
>   mm: Provide pmd to pte_leaf_size()
>   mm: Provide mm_struct and address to huge_ptep_get()
>   powerpc/mm: Allow hugepages without hugepd
>   powerpc/8xx: Fix size given to set_huge_pte_at()
>   powerpc/8xx: Rework support for 8M pages using contiguous PTE entries
>   powerpc/8xx: Simplify struct mmu_psize_def
>   powerpc/mm: Remove _PAGE_PSIZE
>   powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries
>   powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups
>   powerpc/64e: Remove unneeded #ifdef CONFIG_PPC_E500
>   powerpc/64e: Clean up impossible setups
>   powerpc/e500: Remove enc field from struct mmu_psize_def
>   powerpc/85xx: Switch to 64 bits PGD
>   powerpc/e500: Encode hugepage size in PTE bits
>   powerpc/e500: Use contiguous PMD instead of hugepd
>   powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
>   powerpc/mm: Remove hugepd leftovers
>   mm: Remove CONFIG_ARCH_HAS_HUGEPD
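
To make the 8M replication scheme described above concrete, here is a rough
sketch of what installing one such page could look like with standard page
tables (the helper name and the direct PTE stores are hypothetical
illustrations, not code taken from the series):

/*
 * Sketch only: an 8M page on 8xx spans two consecutive PMD entries
 * (4M each).  Each PMD entry points to a normal level-2 table whose
 * 1024 slots are all filled with the same huge PTE.
 */
#define PTRS_PER_PTE_8XX        1024    /* 4M / 4k slots per level-2 table */

static void set_8m_hugepte(pmd_t *pmdp, pte_t pte)
{
        pte_t *ptep;
        int i, t;

        for (t = 0; t < 2; t++) {
                ptep = (pte_t *)pmd_page_vaddr(*(pmdp + t));
                for (i = 0; i < PTRS_PER_PTE_8XX; i++)
                        ptep[i] = pte;  /* replicate the entry 1024 times */
        }
}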

Great to see this series, thanks again Christophe.

I asked for help during the LSF/MM hugetlb unification session, but
unfortunately I don't think any Power people were around.  I'd like to ask
Power developers again here on the list: it would be very much appreciated
if you could have a look at this series.

It is a direct dependency of the hugetlb refactoring that we'll be working
on, and that refactoring looks like something the community as a whole
would like to see in the near future.

We don't want to add more Power-only CONFIG_ARCH_HAS_HUGEPD checks for
hugetlb in any new code.

Oscar has offered to help with that hugetlb project and will start with the
page_walk API refactoring.  I guess the simplest way forward is to work on
top of Christophe's series.  Proper review of this series will definitely
make it clearer what we should do next.

Thanks,

-- 
Peter Xu



Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors

2024-05-23 Thread Peter Xu
On Thu, May 23, 2024 at 05:08:29AM +0200, Oscar Salvador wrote:
> On Wed, May 22, 2024 at 05:46:09PM -0400, Peter Xu wrote:
> > > Now, ProcessB still has the page mapped, so upon re-accessing it,
> > > it will trigger a new MCE event. memory-failure code will see that this
> > 
> > The question is why accessing that hwpoison entry from ProcB triggers an
> > MCE.  Logically that's a swap entry and it should generate a page fault
> > rather than an MCE.  Then in the pgfault handler we don't need that
> > encoded pfn, as we have vmf->address.
> 
> It would be a swap entry if we reach try_to_unmap_one() without trouble.
> Then we have the code that converts it:
> 
>  ...
>  if (PageHWPoison(p))
>  pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
>set_{huge_}pte_at
>  ...
> 
> But maybe we could only do that for ProcA, while ProcB failed to do that,
> which means that for ProcA it is a hwpoisoned swap entry, but ProcB still
> has this page mapped as usual, so if ProcB re-accesses it, that will not
> trigger a fault (because the page is still mapped in its pagetables).

But in that case "whether to encode the pfn in the hwpoison swap entry"
doesn't matter either, as it's not yet converted to a swap entry, so the
pfn is still there.
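
For reference, a stripped-down sketch of the two sides being discussed,
using the existing swapops.h helpers (a simplification, not the actual
kernel code paths):

/* Unmap side (try_to_unmap_one() style): the swap entry carries the pfn. */
static pte_t poison_mapping(struct page *page)
{
        swp_entry_t entry = make_hwpoison_entry(page);  /* encodes page_to_pfn() */

        return swp_entry_to_pte(entry);
}

/* Fault side: the encoded pfn is not needed, vmf->address gives the VA. */
static vm_fault_t handle_poisoned_pte(struct vm_fault *vmf, pte_t pte)
{
        swp_entry_t entry = pte_to_swp_entry(pte);

        if (is_hwpoison_entry(entry)) {
                pr_debug("hwpoison fault at VA %lx\n", vmf->address);
                return VM_FAULT_HWPOISON;
        }
        return 0;
}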

Thanks,

-- 
Peter Xu



Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors

2024-05-22 Thread Peter Xu
On Wed, May 15, 2024 at 10:18:31PM +0200, Borislav Petkov wrote:
> So if I were to design this, I'd do it this way:
> 
> 0. guest gets hw poison injected
> 
> 1. it runs memory_failure() and it kills the processes using the page.
> 
> 2. page is marked poisoned on the host so no other guest gets it.
> 
> That's it. No second accesses whatsoever. At least this is how it works
> on baremetal.
> 
> This hw poisoning emulation is just silly and unnecessary.

We (QEMU) haven't consumed this yet, but I think it makes sense to have
such an emulation, as it's slightly different from a real hwpoison.

I think the important bit missing in this picture is migration: the VM can
migrate from one host to another, carrying that poisoned PFN along.

Let's assume we have two hosts: src and dst.  Currently VM runs on src
host.

Before migration, there is a real PFN that is bad, with an MCE injected.
When it is accessed by either a guest vcpu or the host cpu / hypervisor,
the VM gets killed.  So far this is the same as for any process that has a
bad page.

However, it's possible the VM gets migrated _before_ that bad PFN is
accessed.  In this case the VM is still legal to run; the hypervisor will
not migrate the bad PFN's data, knowing that the data is invalid.  What it
does instead is tell dst "this guest PFN is bad; if the guest accesses it,
crash it".  What the dst host then needs is a way to describe "this guest
PFN is bad".  The easiest way is to describe "this VA of the process is
bad"; there will be no real page backing that VA anyway, and also no real
poisoned pages.  We want to poison a VA only, and that's why an emulation
is needed.  Besides that, we want to get exactly what we would get for a
real hwpoison, e.g. a SIGBUS with the address encoded, so that KVM works
naturally with it just like with a real MCE.
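
As an illustration of the emulation argued for here, a minimal userspace
sketch of what the dst hypervisor could do with UFFDIO_POISON, assuming a
kernel and headers with UFFDIO_POISON support and a userfaultfd already
registered over the guest RAM region (error handling trimmed):

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <stdio.h>

/* Mark one guest page as (emulated) poisoned on the dst host. */
static int poison_guest_page(int uffd, unsigned long bad_va, unsigned long psize)
{
        struct uffdio_poison args = {
                .range = { .start = bad_va, .len = psize },
                .mode = 0,
        };

        if (ioctl(uffd, UFFDIO_POISON, &args)) {
                perror("UFFDIO_POISON");
                return -1;
        }

        /* Any later access to bad_va now raises SIGBUS, like a real hwpoison. */
        return 0;
}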

The other thing we could do is inject the poison on the VA together with
the page backing it, but that would turn a PFN on the dst host into a
really bad PFN that can no longer be used by the dst OS, so it's less
optimal.

Thanks,

-- 
Peter Xu



Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors

2024-05-22 Thread Peter Xu
On Wed, May 15, 2024 at 12:21:51PM +0200, Oscar Salvador wrote:
> On Tue, May 14, 2024 at 03:34:24PM -0600, Peter Xu wrote:
> > The question is whether we can't.
> > 
> > Now we reserved a swp entry just for hwpoison and it makes sense only
> > because we cached the poisoned pfn inside.  My long standing question is
> > why do we ever need that pfn after all.  If we don't need the pfn, we
> > simply need a bit in the pgtable entry saying that it's poisoned, if
> > accessed we should kill the process using sigbus.
> > 
> > I used to comment on this before, the only path that uses that pfn is
> > check_hwpoisoned_entry(), which was introduced in:
> > 
> > commit a3f5d80ea401ac857f2910e28b15f35b2cf902f4
> > Author: Naoya Horiguchi 
> > Date:   Mon Jun 28 19:43:14 2021 -0700
> > 
> > mm,hwpoison: send SIGBUS with error virutal address
> > 
> > Now an action required MCE in already hwpoisoned address surely sends a
> > SIGBUS to current process, but the SIGBUS doesn't convey error virtual
> > address.  That's not optimal for hwpoison-aware applications.
> > 
> > To fix the issue, make memory_failure() call kill_accessing_process(),
> > that does pagetable walk to find the error virtual address.  It could 
> > find
> > multiple virtual addresses for the same error page, and it seems hard to
> > tell which virtual address is correct one.  But that's rare and sending
> > incorrect virtual address could be better than no address.  So let's
> > report the first found virtual address for now.
> > 
> > So this time I read more on this and Naoya explained why - it's only used
> > so far to dump the VA of the poisoned entry.
> 
> Well, not really dumped, but we just pass the VA down the chain to the
> signal handler.
> 
> But on the question whether we need the pfn encoded, I am not sure
> actually.
> check_hwpoisoned_entry() checks whether the pfn where the pte sits is
> the same as the hwpoisoned one, so we know if the process has to be
> killed.
> 
> Now, could we get away with using pte_page() instead of pte_pfn() in there, 
> and
> passing the hwpoisoned page instead ot the pfn?
> I am trying to think of the implications, then we should not need the
> encoded pfn?

I sincerely don't know; we need help from someone who knows hwpoison
better, maybe.

It's at least not needed in fault paths, afaiu.

> 
> > However what confused me is, if an entry is poisoned already logically we
> > dump that message in the fault handler not memory_failure(), which is:
> > 
> > MCE: Killing uffd-unit-tests:650 due to hardware memory corruption fault at 
> > 7f3589d7e000
> > 
> > So perhaps we're trying to also dump that when the MCEs (points to the same
> > pfn) are only generated concurrently?  I donno much on hwpoison so I cannot
> > tell, there's also implication where it's only triggered if
> > MF_ACTION_REQUIRED.  But I think it means hwpoison may work without pfn
> > encoded, but I don't know the implication to lose that dmesg line.
> 
> Not necessarily concurrently, but simply if for any reason the page could
> not have been unmapped.
> Let us say you have ProcessA and ProcessB mapping the same page.
> We get an MCE for that page but we could not unmap it, at least not
> from all processes (maybe just from ProcessA).
> 
> memory-failure code will mark it as hwpoisoned; now ProcessA faults it in
> and gets killed because we see that the page is hwpoisoned in the fault
> path, so we still send VM_FAULT_HWPOISON all the way down till you see
> the:
> 
> "MCE: Killing uffd-unit-tests:650 due to hardware memory corruption
> fault at 7f3589d7e000" from e.g: arch/x86/mm/fault.c:do_sigbus()
> 
> Now, ProcessB still has the page mapped, so upon re-accessing it,
> it will trigger a new MCE event. memory-failure code will see that this

The question is why accessing that hwpoison entry from ProcB triggers an
MCE.  Logically that's a swap entry and it should generate a page fault
rather than an MCE.  Then in the pgfault handler we don't need that encoded
pfn, as we have vmf->address.

> page has already been marked as hwpoisoned, but since we failed to
> unmap it (otherwise no one should be re-accessing it), it goes: "ok,
> let's just kill everybody who has this page mapped".
> 
> 
> > We used to not dump error for swapin error.  Note that here what I am
> > saying is not that Axel is doing things wrong, but it's just that logically
> > swapin error (as pte marker) can also be with !QUIET, so my final point is
> > we may want to avoid having the assumption that "pte marker should always

Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors

2024-05-14 Thread Peter Xu
On Tue, May 14, 2024 at 10:26:49PM +0200, Oscar Salvador wrote:
> On Fri, May 10, 2024 at 03:29:48PM -0400, Peter Xu wrote:
> > IMHO we shouldn't mention that detail, but only state the effect which is
> > to not report the event to syslog.
> > 
> > There's no hard rule that a pte marker can't reflect a real page poison in
> > the future even MCE.  Actually I still remember most places don't care
> > about the pfn in the hwpoison swap entry so maybe we can even do it? But
> > that's another story regardless..
> 
> But we should not use pte markers for real hwpoisons events (aka MCE), right?

The question is whether we can't.

Now we have reserved a swp entry type just for hwpoison, and it only makes
sense because we cache the poisoned pfn inside it.  My long-standing
question is why we ever need that pfn at all.  If we don't need the pfn, we
simply need a bit in the pgtable entry saying that it's poisoned; if it is
accessed, we should kill the process with SIGBUS.
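
To make the contrast concrete, a minimal sketch of the "bit only" encoding
using the existing pte marker machinery (simplified; the pfn-carrying
variant is the hwpoison swap entry discussed above):

/* No pfn encoded: the entry only says "SIGBUS on access". */
static pte_t make_poison_marker(void)
{
        return make_pte_marker(PTE_MARKER_POISONED);
}

static bool pte_is_poison_marker(pte_t pte)
{
        swp_entry_t entry;

        if (!is_swap_pte(pte))
                return false;

        entry = pte_to_swp_entry(pte);
        return is_pte_marker_entry(entry) &&
               (pte_marker_get(entry) & PTE_MARKER_POISONED);
}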

I have commented on this before; the only path that uses that pfn is
check_hwpoisoned_entry(), which was introduced in:

commit a3f5d80ea401ac857f2910e28b15f35b2cf902f4
Author: Naoya Horiguchi 
Date:   Mon Jun 28 19:43:14 2021 -0700

mm,hwpoison: send SIGBUS with error virutal address

Now an action required MCE in already hwpoisoned address surely sends a
SIGBUS to current process, but the SIGBUS doesn't convey error virtual
address.  That's not optimal for hwpoison-aware applications.

To fix the issue, make memory_failure() call kill_accessing_process(),
that does pagetable walk to find the error virtual address.  It could find
multiple virtual addresses for the same error page, and it seems hard to
tell which virtual address is correct one.  But that's rare and sending
incorrect virtual address could be better than no address.  So let's
report the first found virtual address for now.

So this time I read more on this and Naoya explained why - it's only used
so far to dump the VA of the poisoned entry.

However, what confused me is that if an entry is already poisoned, we
logically dump that message in the fault handler, not in memory_failure(),
i.e.:

MCE: Killing uffd-unit-tests:650 due to hardware memory corruption fault at 7f3589d7e000

So perhaps we're also trying to dump it when MCEs (pointing to the same
pfn) are generated concurrently?  I don't know much about hwpoison so I
cannot tell; there's also the implication that it's only triggered with
MF_ACTION_REQUIRED.  But I think it means hwpoison may work without the pfn
encoded; I just don't know the implications of losing that dmesg line.

> I mean, we do have the means to mark a page as hwpoisoned when a real
> MCE gets triggered, why would we want a pte marker to also reflect that?
> Or is that something for userfaultd realm?

No, it's not in the userfaultfd realm; it's just that the pte marker should
be a generic concept, so it can logically be used outside userfaultfd.
That's also why it's used for swapin errors, where we don't use anything
but a bit to reflect "this page is bad".

> 
> > And also, not reporting swapin errors is, IMHO, only because the arch code
> > says "MCE" in the error logs, which may not apply here.  Logically speaking
> > a swapin error should also be reported so the admin knows better why a proc
> > was killed.  Now it can still confuse the admin if it really happens, iiuc.
> 
> I am bit confused by this.
> It seems we create poisoned pte markers on swap errors (e.g:
> unuse_pte()), which get passed down the chain with VM_FAULT_HWPOISON,
> which end up in sigbus (I guess?).
> 
> This all seems very subtle to me.
> 
> First of all, why not passing VM_FAULT_SIGBUS if that is what will end
> up happening?
> I mean, at the moment that is not possible because we conflate swapin
> errors and uffd poison in the same type of marker, so we do not have any
> means to differentiate between the two of them.
> 
> Would it make sense to create yet another pte marker type to split that
> up? Because when I look at VM_FAULT_HWPOISON, I get reminded of MCE
> stuff, and that does not hold here.

We used not to dump an error for swapin errors.  Note that I am not saying
Axel is doing things wrong here; it's just that logically a swapin error
(as a pte marker) can also be !QUIET, so my final point is that we may want
to avoid the assumption that "a pte marker should always be QUIET".  I want
to make it clear that a pte marker can be used in any form, so by itself it
shouldn't imply anything.

Thanks,

-- 
Peter Xu



Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors

2024-05-10 Thread Peter Xu
On Fri, May 10, 2024 at 11:29:26AM -0700, Axel Rasmussen wrote:
> For real MCEs, various architectures print log messages when poisoned
> memory is accessed (which results in a SIGBUS). These messages can be
> important for users to understand the issue.
> 
> On the other hand, we have two other cases: swapin errors and simulated
> poisons via UFFDIO_POISON. These cases also result in SIGBUS, but they
> aren't "real" hardware memory poisoning events, so we want to avoid
> logging MCE error messages to dmesg for these events. This avoids
> spamming the kernel log, and possibly drowning out real events with
> these other cases.
> 
> To identify this situation, add a new VM_FAULT_HWPOISON_SILENT flag.
> This is expected to be set *in addition to* one of the existing
> VM_FAULT_HWPOISON or VM_FAULT_HWPOISON_LARGE flags (which are mutually
> exclusive).
> 
> Reviewed-by: John Hubbard 
> Signed-off-by: Axel Rasmussen 

Acked-by: Peter Xu 

One nitpick below.

> ---
>  arch/parisc/mm/fault.c   |  7 +--
>  arch/powerpc/mm/fault.c  |  6 --
>  arch/x86/mm/fault.c  |  6 --
>  include/linux/mm_types.h | 34 --
>  mm/hugetlb.c |  3 ++-
>  mm/memory.c  |  2 +-
>  6 files changed, 36 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
> index c39de84e98b0..6c5e8d6498bf 100644
> --- a/arch/parisc/mm/fault.c
> +++ b/arch/parisc/mm/fault.c
> @@ -400,9 +400,12 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
>  #ifdef CONFIG_MEMORY_FAILURE
>   if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
>   unsigned int lsb = 0;
> - printk(KERN_ERR
> +
> + if (!(fault & VM_FAULT_HWPOISON_SILENT)) {
> + pr_err(
>   "MCE: Killing %s:%d due to hardware memory corruption fault at %08lx\n",
> - tsk->comm, tsk->pid, address);
> + tsk->comm, tsk->pid, address);
> + }
>   /*
>* Either small page or large page may be poisoned.
>* In other words, VM_FAULT_HWPOISON_LARGE and
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 215690452495..c43bb6193a80 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -147,8 +147,10 @@ static int do_sigbus(struct pt_regs *regs, unsigned long address,
>   if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
>   unsigned int lsb = 0; /* shutup gcc */
>  
> - pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> - current->comm, current->pid, address);
> + if (!(fault & VM_FAULT_HWPOISON_SILENT)) {
> + pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> + current->comm, current->pid, address);
> + }
>  
>   if (fault & VM_FAULT_HWPOISON_LARGE)
>   lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 67b18adc75dd..9ae5cc6bd933 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -964,9 +964,11 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
>   struct task_struct *tsk = current;
>   unsigned lsb = 0;
>  
> - pr_err(
> + if (!(fault & VM_FAULT_HWPOISON_SILENT)) {
> + pr_err(
>   "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> - tsk->comm, tsk->pid, address);
> + tsk->comm, tsk->pid, address);
> + }
>   if (fault & VM_FAULT_HWPOISON_LARGE)
>   lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
>   if (fault & VM_FAULT_HWPOISON)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 24323c7d0bd4..7663a2725341 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1224,6 +1224,10 @@ typedef __bitwise unsigned int vm_fault_t;
>   * @VM_FAULT_HWPOISON_LARGE: Hit poisoned large page. Index encoded
>   *   in upper bits
>   * @VM_FAULT_SIGSEGV:segmentation fault
> + * @VM_FAULT_HWPOISON_SILENT Hit a poisoned pte marker, which should not be
> + *   logged t

Re: [PATCH 1/1] arch/fault: don't print logs for simulated poison errors

2024-05-09 Thread Peter Xu
pr_err_ratelimited(
>   "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> - tsk->comm, tsk->pid, address);
> + tsk->comm, tsk->pid, address);
> + }
>   if (fault & VM_FAULT_HWPOISON_LARGE)
>   lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
>   if (fault & VM_FAULT_HWPOISON)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5240bd7bca33..7f8fc3efc5b2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1226,6 +1226,9 @@ typedef __bitwise unsigned int vm_fault_t;
>   * @VM_FAULT_HWPOISON_LARGE: Hit poisoned large page. Index encoded
>   *   in upper bits
>   * @VM_FAULT_SIGSEGV:segmentation fault
> + * @VM_FAULT_HWPOISON_SIM: Hit poisoned PTE marker; this indicates a
> + *   simulated poison (e.g. via userfaultfd's
> + *  UFFDIO_POISON), not a "real" hw error.
>   * @VM_FAULT_NOPAGE: ->fault installed the pte, not return page
>   * @VM_FAULT_LOCKED: ->fault locked the returned page
>   * @VM_FAULT_RETRY:  ->fault blocked, must retry
> @@ -1245,6 +1248,7 @@ enum vm_fault_reason {
>   VM_FAULT_HWPOISON   = (__force vm_fault_t)0x10,
>   VM_FAULT_HWPOISON_LARGE = (__force vm_fault_t)0x20,
>   VM_FAULT_SIGSEGV= (__force vm_fault_t)0x40,
> + VM_FAULT_HWPOISON_SIM   = (__force vm_fault_t)0x80,
>   VM_FAULT_NOPAGE = (__force vm_fault_t)0x000100,
>   VM_FAULT_LOCKED = (__force vm_fault_t)0x000200,
>   VM_FAULT_RETRY  = (__force vm_fault_t)0x000400,
> @@ -1270,6 +1274,7 @@ enum vm_fault_reason {
>   { VM_FAULT_HWPOISON,"HWPOISON" },   \
>   { VM_FAULT_HWPOISON_LARGE,  "HWPOISON_LARGE" }, \
>   { VM_FAULT_SIGSEGV, "SIGSEGV" },\
> + { VM_FAULT_HWPOISON_SIM,"HWPOISON_SIM" },   \
>   { VM_FAULT_NOPAGE,  "NOPAGE" }, \
>   { VM_FAULT_LOCKED,  "LOCKED" }, \
>   { VM_FAULT_RETRY,   "RETRY" },  \
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 65456230cc71..2b4e0173e806 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6485,7 +6485,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   pte_marker_get(pte_to_swp_entry(entry));
>  
>   if (marker & PTE_MARKER_POISONED) {
> - ret = VM_FAULT_HWPOISON_LARGE |
> + ret = VM_FAULT_HWPOISON_SIM |
> +   VM_FAULT_HWPOISON_LARGE |
> VM_FAULT_SET_HINDEX(hstate_index(h));
>   goto out_mutex;
>   }
> diff --git a/mm/memory.c b/mm/memory.c
> index d2155ced45f8..29a833b996ae 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3910,7 +3910,7 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
>  
>   /* Higher priority than uffd-wp when data corrupted */
>   if (marker & PTE_MARKER_POISONED)
> - return VM_FAULT_HWPOISON;
> + return VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_SIM;
>  
>   if (pte_marker_entry_uffd_wp(entry))
>   return pte_marker_handle_uffd_wp(vmf);
> -- 
> 2.45.0.118.g7fe29c98d7-goog
> 

-- 
Peter Xu



[PATCH v2] mm/gup: Fix hugepd handling in hugetlb rework

2024-04-30 Thread Peter Xu
Commit a12083d721d7 added hugepd handling for gup-slow, reusing gup-fast
functions.  follow_hugepd() correctly took the vma pointer in, however
didn't pass it over into the lower functions, which was overlooked.

The issue is that gup_fast_hugepte() uses the vma pointer to make the
correct decision on whether an unshare is needed for FOLL_PIN|FOLL_LONGTERM.
Without the vma pointer it will constantly return "true" (needs an unshare)
for a page cache page, even though in the SHARED case it is wrong to
unshare.

The other problem is that, even if an unshare is needed, it now returns 0
rather than -EMLINK, which will not trigger a follow-up FAULT_FLAG_UNSHARE
fault.  That will need to be fixed too when an unshare is wanted.

The gup_longterm test didn't expose this issue in the past because it
didn't yet test R/O unshare in this case; a separate patch will enable that
in future tests.

Fix it by passing the vma correctly down to the bottom, rename
gup_fast_hugepte() back to gup_hugepte() as it is shared between the
fast/slow paths, and also allow -EMLINK to be returned properly by
gup_hugepte() even though gup-fast treats it the same as zero.

Reported-by: David Hildenbrand 
Fixes: a12083d721d7 ("mm/gup: handle hugepd for follow_page()")
Reviewed-by: David Hildenbrand 
Signed-off-by: Peter Xu 
---

v1: https://lore.kernel.org/r/20240428190151.201002-1-pet...@redhat.com

This is v2; it drops the 2nd test patch as a better one can come later.
This patch alone is kept untouched, with David's R-b added.  It should
apply to both mm-stable and mm-unstable.  The target commit to be fixed has
just been moved into mm-stable, so there is no need to cc stable.
---
 mm/gup.c | 64 ++--
 1 file changed, 39 insertions(+), 25 deletions(-)
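
For context on the -EMLINK convention above, a rough sketch of how the
gup-slow caller is expected to react to it, simplified from the
__get_user_pages() / faultin_page() flow (not part of this diff):

static struct page *gup_slow_one_page(struct vm_area_struct *vma,
                                      unsigned long addr, unsigned int *flags,
                                      int *locked)
{
        struct follow_page_context ctx = { NULL };
        struct page *page;

        page = follow_page_mask(vma, addr, *flags, &ctx);
        if (IS_ERR(page) && PTR_ERR(page) == -EMLINK) {
                /* Unshare first: fault with FAULT_FLAG_UNSHARE, then retry. */
                if (faultin_page(vma, addr, flags, true, locked))
                        return ERR_PTR(-EFAULT);
                return NULL;    /* caller retries the lookup */
        }
        return page;
}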

diff --git a/mm/gup.c b/mm/gup.c
index 2f7baf96f655..ca0f5cedce9b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -525,9 +525,17 @@ static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
return (__boundary - 1 < end - 1) ? __boundary : end;
 }
 
-static int gup_fast_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
-   unsigned long end, unsigned int flags, struct page **pages,
-   int *nr)
+/*
+ * Returns 1 if succeeded, 0 if failed, -EMLINK if unshare needed.
+ *
+ * NOTE: for the same entry, gup-fast and gup-slow can return different
+ * results (0 v.s. -EMLINK) depending on whether vma is available.  This is
+ * the expected behavior, where we simply want gup-fast to fallback to
+ * gup-slow to take the vma reference first.
+ */
+static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz,
+  unsigned long addr, unsigned long end, unsigned int flags,
+  struct page **pages, int *nr)
 {
unsigned long pte_end;
struct page *page;
@@ -559,9 +567,9 @@ static int gup_fast_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 0;
}
 
-   if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
+   if (!pte_write(pte) && gup_must_unshare(vma, flags, &folio->page)) {
gup_put_folio(folio, refs, flags);
-   return 0;
+   return -EMLINK;
}
 
*nr += refs;
@@ -577,19 +585,22 @@ static int gup_fast_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
  * of the other folios. See writable_file_mapping_allowed() and
  * gup_fast_folio_allowed() for more information.
  */
-static int gup_fast_hugepd(hugepd_t hugepd, unsigned long addr,
-   unsigned int pdshift, unsigned long end, unsigned int flags,
-   struct page **pages, int *nr)
+static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+ unsigned long addr, unsigned int pdshift,
+ unsigned long end, unsigned int flags,
+ struct page **pages, int *nr)
 {
pte_t *ptep;
unsigned long sz = 1UL << hugepd_shift(hugepd);
unsigned long next;
+   int ret;
 
ptep = hugepte_offset(hugepd, addr, pdshift);
do {
next = hugepte_addr_end(addr, end, sz);
-   if (!gup_fast_hugepte(ptep, sz, addr, end, flags, pages, nr))
-   return 0;
+   ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr);
+   if (ret != 1)
+   return ret;
} while (ptep++, addr = next, addr != end);
 
return 1;
@@ -613,22 +624,25 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
h = hstate_vma(vma);
ptep = hugepte_offset(hugepd, addr, pdshift);
ptl = huge_pte_lock(h, vma->vm_mm, ptep);
-   ret = gup_fast_hugepd(hugepd, addr, pdshift, addr + PAGE_SIZE,
- flags, &page, &nr);
+   ret = gup_hugepd(vma, hugepd, addr, pdshift, addr + PAGE_SIZE,
+ flags, &page, &nr);

Re: [PATCH 2/2] mm/selftests: Don't prefault in gup_longterm tests

2024-04-29 Thread Peter Xu
On Mon, Apr 29, 2024 at 03:26:22PM +0200, David Hildenbrand wrote:
> > The test patch here doesn't need to rush. David, how about you prepare a
> > better and verified patch and post it separately, making sure to cover all
> > the things we used to cover plus the unshare?  IIUC it used to be not
> > touched because of pte_write() always returns true with a write prefault.
> > 
> > Then we let patch 1 go through first, and drop this one?
> 
> Whatever you prefer!

Thanks!

Andrew, would you consider taking patch 1 but ignoring this patch 2?  Or do
you prefer that I resend?

-- 
Peter Xu



Re: [PATCH 2/2] mm/selftests: Don't prefault in gup_longterm tests

2024-04-29 Thread Peter Xu
On Mon, Apr 29, 2024 at 09:28:15AM +0200, David Hildenbrand wrote:
> On 28.04.24 21:01, Peter Xu wrote:
> > Prefault, especially with RW, makes the GUP test too easy, and may not yet
> > reach the core of the test.
> > 
> > For example, R/O longterm pins will just hit, pte_write()==true for
> > whatever cases, the unsharing logic won't be ever tested.
> > 
> > This patch remove the prefault.  This tortures more code paths at least to
> > cover the unshare care for R/O longterm pins, in which case the first R/O
> > GUP attempt will fault in the page R/O first, then the 2nd will go through
> > the unshare path, checking whether an unshare is needed.
> > 
> > Cc: David Hildenbrand 
> > Signed-off-by: Peter Xu 
> > ---
> >   tools/testing/selftests/mm/gup_longterm.c | 12 +---
> >   1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/mm/gup_longterm.c b/tools/testing/selftests/mm/gup_longterm.c
> > index ad168d35b23b..488e32186246 100644
> > --- a/tools/testing/selftests/mm/gup_longterm.c
> > +++ b/tools/testing/selftests/mm/gup_longterm.c
> > @@ -119,10 +119,16 @@ static void do_test(int fd, size_t size, enum test_type type, bool shared)
> > }
> > /*
> > -* Fault in the page writable such that GUP-fast can eventually pin
> > -* it immediately.
> > +* Explicitly avoid pre-faulting in the page, this can help testing
> > +* more code paths.
> > +*
> > +* Take example of an upcoming R/O pin test, if we RW prefault the
> > +* page, such pin will directly skip R/O unsharing and the longterm
> > +* pin will success mostly always.  When not prefaulted, R/O
> > +* longterm pin will first fault in a RO page, then the 2nd round
> > +* it'll go via the unshare check.  Otherwise those paths aren't
> > +* covered.
> >  */
> This will mean that GUP-fast never succeeds, which removes quite some testing
> coverage for most other tests here.
> 
> Note that the main motivation of this test was to test 
> gup_fast_folio_allowed(),
> where we had issues with GUP-fast during development.

Ah, I didn't notice that; I thought that just whitelisted the memfd ones.

> 
> Would the following also get the job done?
> 
> diff --git a/tools/testing/selftests/mm/gup_longterm.c b/tools/testing/selftests/mm/gup_longterm.c
> index ad168d35b23b7..e917a7c58d571 100644
> --- a/tools/testing/selftests/mm/gup_longterm.c
> +++ b/tools/testing/selftests/mm/gup_longterm.c
> @@ -92,7 +92,7 @@ static void do_test(int fd, size_t size, enum test_type type, bool shared)
>  {
>   __fsword_t fs_type = get_fs_type(fd);
>   bool should_work;
> - char *mem;
> + char tmp, *mem;
>   int ret;
>   if (ftruncate(fd, size)) {
> @@ -119,10 +119,19 @@ static void do_test(int fd, size_t size, enum test_type type, bool shared)
>   }
>   /*
> -  * Fault in the page writable such that GUP-fast can eventually pin
> -  * it immediately.
> +  * Fault in the page such that GUP-fast might be able to pin it
> +  * immediately. To cover more cases, don't fault in pages writable when
> +  * R/O pinning.
>*/
> - memset(mem, 0, size);
> + switch (type) {
> + case TEST_TYPE_RO:
> + case TEST_TYPE_RO_FAST:
> + tmp = *mem;
> + asm volatile("" : "+r" (tmp));
> + break;
> + default:
> + memset(mem, 0, size);
> + };
>   switch (type) {
>   case TEST_TYPE_RO:

Yes this could work too.

The test patch here isn't urgent.  David, how about you prepare a better,
verified patch and post it separately, making sure to cover all the things
we used to cover plus the unshare?  IIUC it used to be untouched because
pte_write() always returns true after a write prefault.

Then we let patch 1 go through first, and drop this one?

Thanks,

-- 
Peter Xu



[PATCH 1/2] mm/gup: Fix hugepd handling in hugetlb rework

2024-04-28 Thread Peter Xu
Commit a12083d721d7 added hugepd handling for gup-slow, reusing gup-fast
functions.  follow_hugepd() correctly took the vma pointer in, however
didn't pass it over into the lower functions, which was overlooked.

The issue is that gup_fast_hugepte() uses the vma pointer to make the
correct decision on whether an unshare is needed for FOLL_PIN|FOLL_LONGTERM.
Without the vma pointer it will constantly return "true" (needs an unshare)
for a page cache page, even though in the SHARED case it is wrong to
unshare.

The other problem is that, even if an unshare is needed, it now returns 0
rather than -EMLINK, which will not trigger a follow-up FAULT_FLAG_UNSHARE
fault.  That will need to be fixed too when an unshare is wanted.

The gup_longterm test didn't expose this issue in the past because it
didn't yet test R/O unshare in this case; a separate patch will enable that
in future tests.

Fix it by passing the vma correctly down to the bottom, rename
gup_fast_hugepte() back to gup_hugepte() as it is shared between the
fast/slow paths, and also allow -EMLINK to be returned properly by
gup_hugepte() even though gup-fast treats it the same as zero.

Reported-by: David Hildenbrand 
Fixes: a12083d721d7 ("mm/gup: handle hugepd for follow_page()")
Signed-off-by: Peter Xu 
---

Note: the target commit to be fixed has just been moved into mm-stable, so
there is no need to cc stable.
---
 mm/gup.c | 64 ++--
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 2f7baf96f655..ca0f5cedce9b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -525,9 +525,17 @@ static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
return (__boundary - 1 < end - 1) ? __boundary : end;
 }
 
-static int gup_fast_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
-   unsigned long end, unsigned int flags, struct page **pages,
-   int *nr)
+/*
+ * Returns 1 if succeeded, 0 if failed, -EMLINK if unshare needed.
+ *
+ * NOTE: for the same entry, gup-fast and gup-slow can return different
+ * results (0 v.s. -EMLINK) depending on whether vma is available.  This is
+ * the expected behavior, where we simply want gup-fast to fallback to
+ * gup-slow to take the vma reference first.
+ */
+static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz,
+  unsigned long addr, unsigned long end, unsigned int flags,
+  struct page **pages, int *nr)
 {
unsigned long pte_end;
struct page *page;
@@ -559,9 +567,9 @@ static int gup_fast_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 0;
}
 
-   if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
+   if (!pte_write(pte) && gup_must_unshare(vma, flags, &folio->page)) {
gup_put_folio(folio, refs, flags);
-   return 0;
+   return -EMLINK;
}
 
*nr += refs;
@@ -577,19 +585,22 @@ static int gup_fast_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
  * of the other folios. See writable_file_mapping_allowed() and
  * gup_fast_folio_allowed() for more information.
  */
-static int gup_fast_hugepd(hugepd_t hugepd, unsigned long addr,
-   unsigned int pdshift, unsigned long end, unsigned int flags,
-   struct page **pages, int *nr)
+static int gup_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+ unsigned long addr, unsigned int pdshift,
+ unsigned long end, unsigned int flags,
+ struct page **pages, int *nr)
 {
pte_t *ptep;
unsigned long sz = 1UL << hugepd_shift(hugepd);
unsigned long next;
+   int ret;
 
ptep = hugepte_offset(hugepd, addr, pdshift);
do {
next = hugepte_addr_end(addr, end, sz);
-   if (!gup_fast_hugepte(ptep, sz, addr, end, flags, pages, nr))
-   return 0;
+   ret = gup_hugepte(vma, ptep, sz, addr, end, flags, pages, nr);
+   if (ret != 1)
+   return ret;
} while (ptep++, addr = next, addr != end);
 
return 1;
@@ -613,22 +624,25 @@ static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
h = hstate_vma(vma);
ptep = hugepte_offset(hugepd, addr, pdshift);
ptl = huge_pte_lock(h, vma->vm_mm, ptep);
-   ret = gup_fast_hugepd(hugepd, addr, pdshift, addr + PAGE_SIZE,
- flags, &page, &nr);
+   ret = gup_hugepd(vma, hugepd, addr, pdshift, addr + PAGE_SIZE,
+flags, &page, &nr);
spin_unlock(ptl);
 
-   if (ret) {
+   if (ret == 1) {
+   /* GUP succeeded */
WARN_ON_ONCE(nr != 1);
ctx->page_mask = (1U << huge_page_order(h)) - 1;
return page;

[PATCH 0/2] mm/gup: Fix hugepd for longterm R/O pin on Power

2024-04-28 Thread Peter Xu
This series should apply to both mm-stable and mm-unstable.  I am not sure
whether it even needs to go into mm-stable directly, but perhaps that's not
urgently needed either; anyway, it'll apply to either tree.  It also means
cc'ing stable is not needed even though I have the Fixes tag attached.

Patch 1 fixes the bug in mm-stable; patch 2 enhances the gup_longterm test
to be able to discover such issues.

In general, the previous hugetlb rework [1] on gup-slow introduced an issue
with R/O longterm pins.  Nobody has hit it yet in either a real report or a
test case, probably because our test cases don't yet cover it (not before
patch 2), and it's also a pretty rare path: it only happens with longterm
R/O pins on Power, on a page cache page installed read-only as a hugepd.

Please read each of the patch for details.

I retested "./run_vmtests.sh -t gup_test -a" on a Power8 system with a
Power8 VM, with 16MB hugepd entries installed.  Note that I tested exactly
the same matrix before, but patch 2 changes the gup_longterm test, so a
slightly different test was actually carried out, and the new test
(gup_longterm.c, when applying patch 2 only) will hang mm-stable on
Andrew's tree with that 16MB huge page.

Thanks,

[1] https://lore.kernel.org/r/20240327152332.950956-1-pet...@redhat.com

Peter Xu (2):
  mm/gup: Fix hugepd handling in hugetlb rework
  mm/selftests: Don't prefault in gup_longterm tests

 mm/gup.c  | 64 ++-
 tools/testing/selftests/mm/gup_longterm.c | 12 +++--
 2 files changed, 48 insertions(+), 28 deletions(-)

-- 
2.44.0



[PATCH 2/2] mm/selftests: Don't prefault in gup_longterm tests

2024-04-28 Thread Peter Xu
Prefaulting, especially with RW, makes the GUP test too easy and may not
reach the core of the test.

For example, R/O longterm pins will just hit pte_write()==true in all
cases, so the unsharing logic is never tested.

This patch removes the prefault.  This exercises more code paths, at least
covering the unshare case for R/O longterm pins, where the first R/O GUP
attempt will fault the page in R/O first, and the 2nd will go through the
unshare path, checking whether an unshare is needed.

Cc: David Hildenbrand 
Signed-off-by: Peter Xu 
---
 tools/testing/selftests/mm/gup_longterm.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/mm/gup_longterm.c b/tools/testing/selftests/mm/gup_longterm.c
index ad168d35b23b..488e32186246 100644
--- a/tools/testing/selftests/mm/gup_longterm.c
+++ b/tools/testing/selftests/mm/gup_longterm.c
@@ -119,10 +119,16 @@ static void do_test(int fd, size_t size, enum test_type type, bool shared)
}
 
/*
-* Fault in the page writable such that GUP-fast can eventually pin
-* it immediately.
+* Explicitly avoid pre-faulting in the page, this can help testing
+* more code paths.
+*
+* Take example of an upcoming R/O pin test, if we RW prefault the
+* page, such pin will directly skip R/O unsharing and the longterm
+* pin will success mostly always.  When not prefaulted, R/O
+* longterm pin will first fault in a RO page, then the 2nd round
+* it'll go via the unshare check.  Otherwise those paths aren't
+* covered.
 */
-   memset(mem, 0, size);
 
switch (type) {
case TEST_TYPE_RO:
-- 
2.44.0



Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions

2024-04-26 Thread Peter Xu
On Fri, Apr 26, 2024 at 11:33:08PM +0200, David Hildenbrand wrote:
> I raised this topic in the past, and IMHO we either (a) never should have
> added COW support; or (b) added COW support by using ordinary anonymous
> memory (hey, partial mappings of hugetlb pages! ;) ).
> 
> After all, COW is an optimization to speed up fork and defer copying. It
> relies on memory overcommit, but that doesn't really apply to hugetlb, so we
> fake it ...

Good summary.

> 
> One easy ABI break I had in mind was to simply *not* allow COW-sharing of
> anon hugetlb folios; for example, simply don't copy the page into the child.
> Chances are there are not really a lot of child processes that would fail
> ... but likely we would break *something*. So there is no easy way out :(

Right, not easy.  The thing is, this is one spot out of many of the
specialties, and it may or may not be worthwhile to dedicate time to it
while nobody has yet reported a problem with it.  It might be easier to
start with a v2, even though it's also hard to nail everything properly
there - the challenge can come from different angles.

Thanks for sharing, that's helpful.  I'll go ahead with the Power hugepd
fix and put this aside.

I hope that before the end of this year whatever I fix can go away, by
removing hugepd completely from Linux.  For now that may or may not be
smooth, so we'd better still fix it.

-- 
Peter Xu



Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions

2024-04-26 Thread Peter Xu
On Fri, Apr 26, 2024 at 07:28:31PM +0200, David Hildenbrand wrote:
> On 26.04.24 18:12, Peter Xu wrote:
> > On Fri, Apr 26, 2024 at 09:44:58AM -0400, Peter Xu wrote:
> > > On Fri, Apr 26, 2024 at 09:17:47AM +0200, David Hildenbrand wrote:
> > > > On 02.04.24 14:55, David Hildenbrand wrote:
> > > > > Let's consistently call the "fast-only" part of GUP "GUP-fast" and 
> > > > > rename
> > > > > all relevant internal functions to start with "gup_fast", to make it
> > > > > clearer that this is not ordinary GUP. The current mixture of
> > > > > "lockless", "gup" and "gup_fast" is confusing.
> > > > > 
> > > > > Further, avoid the term "huge" when talking about a "leaf" -- for
> > > > > example, we nowadays check pmd_leaf() because pmd_huge() is gone. For 
> > > > > the
> > > > > "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that
> > > > > stays.
> > > > > 
> > > > > What remains is the "external" interface:
> > > > > * get_user_pages_fast_only()
> > > > > * get_user_pages_fast()
> > > > > * pin_user_pages_fast()
> > > > > 
> > > > > The high-level internal functions for GUP-fast (+slow fallback) are 
> > > > > now:
> > > > > * internal_get_user_pages_fast() -> gup_fast_fallback()
> > > > > * lockless_pages_from_mm() -> gup_fast()
> > > > > 
> > > > > The basic GUP-fast walker functions:
> > > > > * gup_pgd_range() -> gup_fast_pgd_range()
> > > > > * gup_p4d_range() -> gup_fast_p4d_range()
> > > > > * gup_pud_range() -> gup_fast_pud_range()
> > > > > * gup_pmd_range() -> gup_fast_pmd_range()
> > > > > * gup_pte_range() -> gup_fast_pte_range()
> > > > > * gup_huge_pgd()  -> gup_fast_pgd_leaf()
> > > > > * gup_huge_pud()  -> gup_fast_pud_leaf()
> > > > > * gup_huge_pmd()  -> gup_fast_pmd_leaf()
> > > > > 
> > > > > The weird hugepd stuff:
> > > > > * gup_huge_pd() -> gup_fast_hugepd()
> > > > > * gup_hugepte() -> gup_fast_hugepte()
> > > > 
> > > > I just realized that we end up calling these from follow_hugepd() as 
> > > > well.
> > > > And something seems to be off, because gup_fast_hugepd() won't have the 
> > > > VMA
> > > > even in the slow-GUP case to pass it to gup_must_unshare().
> > > > 
> > > > So these are GUP-fast functions and the terminology seem correct. But 
> > > > the
> > > > usage from follow_hugepd() is questionable,
> > > > 
> > > > commit a12083d721d703f985f4403d6b333cc449f838f6
> > > > Author: Peter Xu 
> > > > Date:   Wed Mar 27 11:23:31 2024 -0400
> > > > 
> > > >  mm/gup: handle hugepd for follow_page()
> > > > 
> > > > 
> > > > states "With previous refactors on fast-gup gup_huge_pd(), most of the 
> > > > code
> > > > can be leveraged", which doesn't look quite true just staring the the
> > > > gup_must_unshare() call where we don't pass the VMA. Also,
> > > > "unlikely(pte_val(pte) != pte_val(ptep_get(ptep)" doesn't make any 
> > > > sense for
> > > > slow GUP ...
> > > 
> > > Yes it's not needed, just doesn't look worthwhile to put another helper on
> > > top just for this.  I mentioned this in the commit message here:
> > > 
> > >There's something not needed for follow page, for example, 
> > > gup_hugepte()
> > >tries to detect pgtable entry change which will never happen with slow
> > >gup (which has the pgtable lock held), but that's not a problem to 
> > > check.
> > > 
> > > > 
> > > > @Peter, any insights?
> > > 
> > > However I think we should pass vma in for sure, I guess I overlooked that,
> > > and it didn't expose in my tests too as I probably missed ./cow.
> > > 
> > > I'll prepare a separate patch on top of this series and the gup-fast 
> > > rename
> > > patches (I saw this one just reached mm-stable), and I'll see whether I 
> > > can
> > > test it too if I can find a Power system fast enough.  I'll probably drop
> > > 

Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions

2024-04-26 Thread Peter Xu
On Fri, Apr 26, 2024 at 09:44:58AM -0400, Peter Xu wrote:
> On Fri, Apr 26, 2024 at 09:17:47AM +0200, David Hildenbrand wrote:
> > On 02.04.24 14:55, David Hildenbrand wrote:
> > > Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename
> > > all relevant internal functions to start with "gup_fast", to make it
> > > clearer that this is not ordinary GUP. The current mixture of
> > > "lockless", "gup" and "gup_fast" is confusing.
> > > 
> > > Further, avoid the term "huge" when talking about a "leaf" -- for
> > > example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the
> > > "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that
> > > stays.
> > > 
> > > What remains is the "external" interface:
> > > * get_user_pages_fast_only()
> > > * get_user_pages_fast()
> > > * pin_user_pages_fast()
> > > 
> > > The high-level internal functions for GUP-fast (+slow fallback) are now:
> > > * internal_get_user_pages_fast() -> gup_fast_fallback()
> > > * lockless_pages_from_mm() -> gup_fast()
> > > 
> > > The basic GUP-fast walker functions:
> > > * gup_pgd_range() -> gup_fast_pgd_range()
> > > * gup_p4d_range() -> gup_fast_p4d_range()
> > > * gup_pud_range() -> gup_fast_pud_range()
> > > * gup_pmd_range() -> gup_fast_pmd_range()
> > > * gup_pte_range() -> gup_fast_pte_range()
> > > * gup_huge_pgd()  -> gup_fast_pgd_leaf()
> > > * gup_huge_pud()  -> gup_fast_pud_leaf()
> > > * gup_huge_pmd()  -> gup_fast_pmd_leaf()
> > > 
> > > The weird hugepd stuff:
> > > * gup_huge_pd() -> gup_fast_hugepd()
> > > * gup_hugepte() -> gup_fast_hugepte()
> > 
> > I just realized that we end up calling these from follow_hugepd() as well.
> > And something seems to be off, because gup_fast_hugepd() won't have the VMA
> > even in the slow-GUP case to pass it to gup_must_unshare().
> > 
> > So these are GUP-fast functions and the terminology seem correct. But the
> > usage from follow_hugepd() is questionable,
> > 
> > commit a12083d721d703f985f4403d6b333cc449f838f6
> > Author: Peter Xu 
> > Date:   Wed Mar 27 11:23:31 2024 -0400
> > 
> > mm/gup: handle hugepd for follow_page()
> > 
> > 
> > states "With previous refactors on fast-gup gup_huge_pd(), most of the code
> > can be leveraged", which doesn't look quite true just staring the the
> > gup_must_unshare() call where we don't pass the VMA. Also,
> > "unlikely(pte_val(pte) != pte_val(ptep_get(ptep)" doesn't make any sense for
> > slow GUP ...
> 
> Yes it's not needed, just doesn't look worthwhile to put another helper on
> top just for this.  I mentioned this in the commit message here:
> 
>   There's something not needed for follow page, for example, gup_hugepte()
>   tries to detect pgtable entry change which will never happen with slow
>   gup (which has the pgtable lock held), but that's not a problem to check.
> 
> > 
> > @Peter, any insights?
> 
> However I think we should pass vma in for sure, I guess I overlooked that,
> and it didn't expose in my tests too as I probably missed ./cow.
> 
> I'll prepare a separate patch on top of this series and the gup-fast rename
> patches (I saw this one just reached mm-stable), and I'll see whether I can
> test it too if I can find a Power system fast enough.  I'll probably drop
> the "fast" in the hugepd function names too.

Hmm, so when I enable 2M hugetlb I find that ./cow is even failing on x86.

  # ./cow  | grep -B1 "not ok"
  # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
  not ok 161 No leak from parent into child
  --
  # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB)
  not ok 215 No leak from parent into child
  --
  # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB)
  not ok 269 No leak from child into parent
  --
  # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
  not ok 323 No leak from child into parent

And it looks like it was always failing, perhaps since the start?  We
didn't treat hugetlb the same as normal anon in that regard in the
vmsplice() fix.

I drafted a patch to allow the same refcount>1 detection, and then all the
tests pass for me, as below.

David, I'd like to double check with you before I post anything: is that
your intention to do so when working on the R/O pinnin

Re: [PATCH v1 1/3] mm/gup: consistently name GUP-fast functions

2024-04-26 Thread Peter Xu
On Fri, Apr 26, 2024 at 09:17:47AM +0200, David Hildenbrand wrote:
> On 02.04.24 14:55, David Hildenbrand wrote:
> > Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename
> > all relevant internal functions to start with "gup_fast", to make it
> > clearer that this is not ordinary GUP. The current mixture of
> > "lockless", "gup" and "gup_fast" is confusing.
> > 
> > Further, avoid the term "huge" when talking about a "leaf" -- for
> > example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the
> > "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that
> > stays.
> > 
> > What remains is the "external" interface:
> > * get_user_pages_fast_only()
> > * get_user_pages_fast()
> > * pin_user_pages_fast()
> > 
> > The high-level internal functions for GUP-fast (+slow fallback) are now:
> > * internal_get_user_pages_fast() -> gup_fast_fallback()
> > * lockless_pages_from_mm() -> gup_fast()
> > 
> > The basic GUP-fast walker functions:
> > * gup_pgd_range() -> gup_fast_pgd_range()
> > * gup_p4d_range() -> gup_fast_p4d_range()
> > * gup_pud_range() -> gup_fast_pud_range()
> > * gup_pmd_range() -> gup_fast_pmd_range()
> > * gup_pte_range() -> gup_fast_pte_range()
> > * gup_huge_pgd()  -> gup_fast_pgd_leaf()
> > * gup_huge_pud()  -> gup_fast_pud_leaf()
> > * gup_huge_pmd()  -> gup_fast_pmd_leaf()
> > 
> > The weird hugepd stuff:
> > * gup_huge_pd() -> gup_fast_hugepd()
> > * gup_hugepte() -> gup_fast_hugepte()
> 
> I just realized that we end up calling these from follow_hugepd() as well.
> And something seems to be off, because gup_fast_hugepd() won't have the VMA
> even in the slow-GUP case to pass it to gup_must_unshare().
> 
> So these are GUP-fast functions and the terminology seem correct. But the
> usage from follow_hugepd() is questionable,
> 
> commit a12083d721d703f985f4403d6b333cc449f838f6
> Author: Peter Xu 
> Date:   Wed Mar 27 11:23:31 2024 -0400
> 
> mm/gup: handle hugepd for follow_page()
> 
> 
> states "With previous refactors on fast-gup gup_huge_pd(), most of the code
> can be leveraged", which doesn't look quite true just staring at the
> gup_must_unshare() call where we don't pass the VMA. Also,
> "unlikely(pte_val(pte) != pte_val(ptep_get(ptep)" doesn't make any sense for
> slow GUP ...

Yes, it's not needed; it just doesn't look worthwhile to put another helper
on top only for this.  I mentioned this in the commit message:

  There's something not needed for follow page, for example, gup_hugepte()
  tries to detect pgtable entry change which will never happen with slow
  gup (which has the pgtable lock held), but that's not a problem to check.

> 
> @Peter, any insights?

However, I think we should pass the vma in for sure; I guess I overlooked
that, and it didn't show up in my tests either, as I probably missed ./cow.

I'll prepare a separate patch on top of this series and the gup-fast rename
patches (I saw this one just reached mm-stable), and I'll see whether I can
test it too if I can find a Power system fast enough.  I'll probably drop
the "fast" from the hugepd function names too.

Thanks,

-- 
Peter Xu



Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

2024-04-16 Thread Peter Xu
On Tue, Apr 16, 2024 at 10:58:33AM +, Christophe Leroy wrote:
> 
> 
> Le 15/04/2024 à 21:12, Christophe Leroy a écrit :
> > 
> > 
> > Le 12/04/2024 à 16:30, Peter Xu a écrit :
> >> On Fri, Apr 12, 2024 at 02:08:03PM +, Christophe Leroy wrote:
> >>>
> >>>
> >>> Le 11/04/2024 à 18:15, Peter Xu a écrit :
> >>>> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> >>>>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> >>>>>> This series reimplements hugepages with hugepd on powerpc 8xx.
> >>>>>>
> >>>>>> Unlike most architectures, powerpc 8xx HW requires a two-level
> >>>>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
> >>>>>> is not feasible as such.
> >>>>>>
> >>>>>> Possible sizes are 4k, 16k, 512k and 8M.
> >>>>>>
> >>>>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD 
> >>>>>> entries
> >>>>>> must point to a single entry level-2 page table. Until now that was
> >>>>>> done using hugepd. This series changes it to use standard page tables
> >>>>>> where the entry is replicated 1024 times on each of the two 
> >>>>>> pagetables
> >>>>>> refered by the two associated PMD entries for that 8M page.
> >>>>>>
> >>>>>> At the moment it has to look into each helper to know if the
> >>>>>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> >>>>>> a lower size. I hope this can me handled by core-mm in the future.
> >>>>>>
> >>>>>> There are probably several ways to implement stuff, so feedback is
> >>>>>> very welcome.
> >>>>>
> >>>>> I thought it looks pretty good!
> >>>>
> >>>> I second it.
> >>>>
> >>>> I saw the discussions in patch 1.  Christophe, I suppose you're 
> >>>> exploring
> >>>> the big hammer over hugepd, and perhaps went already with the 32bit pmd
> >>>> solution for nohash/32bit challenge you mentioned?
> >>>>
> >>>> I'm trying to position my next step; it seems like at least I should 
> >>>> not
> >>>> adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD 
> >>>> checks,
> >>>> or you're going to have an RFC soon then I can base on top?
> >>>
> >>> Depends on what you expect by "soon".
> >>>
> >>> I sure won't be able to send any RFC before end of April.
> >>>
> >>> Should be possible to have something during May.
> >>
> >> That's good enough, thanks.  I'll see what is the best I can do.
> >>
> >> Then do you think I can leave p4d/pgd leaves alone?  Please check the 
> >> other
> >> email where I'm not sure whether pgd leaves ever existed for any of
> >> PowerPC.  That's so far what I plan to do, on teaching pgtable walkers
> >> recognize pud and lower for all leaves.  Then if Power can switch from
> >> hugepd to this it should just work.
> > 
> > Well, if I understand correctly, something with no PMD will include 
> >  and will therefore only have  pmd 
> > entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. 
> > pgd_populate(), p4d_populate(), pud_populate() are all "do { } while 
> > (0)" and only pmd_populate exists. So only pmd_leaf() will exist in that 
> > case.
> > 
> > And therefore including  means  you 
> > have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be 
> > checked.
> > 
> > 
> >>
> >> Even if pgd exists (then something I overlooked..), I'm wondering whether
> >> we can push that downwards to be either pud/pmd (and looks like we all
> >> agree p4d is never used on Power).  That may involve some pgtable
> >> operations moving from pgd level to lower, e.g. my pure imagination would
> >> look like starting with:
> > 
> > Yes I think there is no doubt that p4d is never used:
> > 
> > arch/powerpc/include/asm/book3s/32/pgtable.h:#include 
> > 
> > arch/powerpc/include/asm/book3s/64/pgtable.h:#include 
> > 
> > arch/powerpc/include/asm/nohash/32/pgtable.h:#include 
> > 
> > arch/pow

Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

2024-04-12 Thread Peter Xu
On Fri, Apr 12, 2024 at 02:08:03PM +, Christophe Leroy wrote:
> 
> 
> Le 11/04/2024 à 18:15, Peter Xu a écrit :
> > On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> >> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> >>> This series reimplements hugepages without hugepd on powerpc 8xx.
> >>>
> >>> Unlike most architectures, powerpc 8xx HW requires a two-level
> >>> pagetable topology for all page sizes. So a leaf PMD-contig approach
> >>> is not feasible as such.
> >>>
> >>> Possible sizes are 4k, 16k, 512k and 8M.
> >>>
> >>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> >>> must point to a single entry level-2 page table. Until now that was
> >>> done using hugepd. This series changes it to use standard page tables
> >>> where the entry is replicated 1024 times on each of the two pagetables
> >>> referred by the two associated PMD entries for that 8M page.
> >>>
> >>> At the moment it has to look into each helper to know if the
> >>> hugepage ptep is a PTE or a PMD in order to know whether it is an 8M page
> >>> or a lower size. I hope this can be handled by core-mm in the future.
> >>>
> >>> There are probably several ways to implement stuff, so feedback is
> >>> very welcome.
> >>
> >> I thought it looks pretty good!
> > 
> > I second it.
> > 
> > I saw the discussions in patch 1.  Christophe, I suppose you're exploring
> > the big hammer over hugepd, and perhaps went already with the 32bit pmd
> > solution for nohash/32bit challenge you mentioned?
> > 
> > I'm trying to position my next step; it seems like at least I should not
> > add any more hugepd code; should I then go with ARCH_HAS_HUGEPD checks,
> > or are you going to have an RFC soon that I can base on top of?
> 
> Depends on what you expect by "soon".
> 
> I sure won't be able to send any RFC before end of April.
> 
> Should be possible to have something during May.

That's good enough, thanks.  I'll see what is the best I can do.

Then do you think I can leave p4d/pgd leaves alone?  Please check the other
email where I'm not sure whether pgd leaves ever existed for any of the
PowerPC variants.  That's so far what I plan to do: teach the pgtable walkers
to recognize pud and lower for all leaves.  Then if Power can switch from
hugepd to this it should just work.

Even if pgd exists (then something I overlooked..), I'm wondering whether
we can push that downwards to be either pud/pmd (and looks like we all
agree p4d is never used on Power).  That may involve some pgtable
operations moving from the pgd level to lower ones; e.g., purely from my
imagination, it could look like going from:

#define PTE_INDEX_SIZE  PTE_SHIFT
#define PMD_INDEX_SIZE  0
#define PUD_INDEX_SIZE  0
#define PGD_INDEX_SIZE  (32 - PGDIR_SHIFT)

To:

#define PTE_INDEX_SIZE  PTE_SHIFT
#define PMD_INDEX_SIZE  (32 - PMD_SHIFT)
#define PUD_INDEX_SIZE  0
#define PGD_INDEX_SIZE  0

And the rest will need care too.  I hope moving downward is easier
(e.g. the walker should always exist for lower levels but not always for
higher levels), but I actually have little idea whether there are any
other implications, so please bear with me on stupid mistakes.
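
To make the "pud and lower" part concrete, here is a rough sketch of what a
generic walker would end up checking.  This is purely illustrative: the
helpers are the generic Linux ones, the function itself is made up and the
usual none/bad checks are omitted for brevity.

/* Assumes the usual <linux/mm.h>/<linux/pgtable.h> context. */
static unsigned int leaf_shift(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *pgd = pgd_offset(mm, addr);
	p4d_t *p4d = p4d_offset(pgd, addr);	/* never a leaf on Power */
	pud_t *pud = pud_offset(p4d, addr);
	pmd_t *pmd;

	if (pud_leaf(*pud))			/* e.g. a huge pud mapping */
		return PUD_SHIFT;

	pmd = pmd_offset(pud, addr);
	if (pmd_leaf(*pmd))			/* e.g. a huge pmd mapping */
		return PMD_SHIFT;

	return PAGE_SHIFT;			/* keep walking at pte level */
}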

I just hope pgd leaves don't exist already, then I think it'll be simpler.

Thanks,

-- 
Peter Xu



Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-11 Thread Peter Xu
On Thu, Apr 11, 2024 at 06:55:44PM +0200, Paolo Bonzini wrote:
> On Mon, Apr 8, 2024 at 3:56 PM Peter Xu  wrote:
> > Paolo,
> >
> > I may miss a bunch of details here (as I still remember some change_pte
> > patches previously on the list..), however not sure whether we considered
> > enable it?  Asked because I remember Andrea used to have a custom tree
> > maintaining that part:
> >
> > https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968
> 
> The patch enables it only for KSM, so it would still require a bunch
> of cleanups, for example I also would still use set_pte_at() in all
> the places that are not KSM. This would at least fix the issue with
> the poor documentation of where to use set_pte_at_notify() vs
> set_pte_at().
> 
> With regard to the implementation, I like the idea of disabling the
> invalidation on the MMU notifier side, but I would rather have
> MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
> overloading the event field.
> 
> > Maybe it can't be enabled for some reason that I overlooked in the current
> > tree, or we just decided to not to?
> 
> I have just learnt about the patch, nobody had ever mentioned it even
> though it's almost 2 years old... It's a lot of code though and no one
> has ever reported an issue for over 10 years, so I think it's easiest
> to just rip the code out.

Right, it was pretty old and I have no idea if that was discussed or
published before..  It would be better to have discussed this earlier.

As long as we make the decision with that in mind, it looks fine to me to go
either way, and I also agree either way is better than keeping the status
quo.

I also had Andrea copied when I replied, so I guess he should be
aware of this and can chime in anytime.

Thanks!

-- 
Peter Xu



Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

2024-04-11 Thread Peter Xu
On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> > This series reimplements hugepages without hugepd on powerpc 8xx.
> > 
> > Unlike most architectures, powerpc 8xx HW requires a two-level
> > pagetable topology for all page sizes. So a leaf PMD-contig approach
> > is not feasible as such.
> > 
> > Possible sizes are 4k, 16k, 512k and 8M.
> > 
> > First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> > must point to a single entry level-2 page table. Until now that was
> > done using hugepd. This series changes it to use standard page tables
> > where the entry is replicated 1024 times on each of the two pagetables
> > referred by the two associated PMD entries for that 8M page.
> > 
> > At the moment it has to look into each helper to know if the
> > hugepage ptep is a PTE or a PMD in order to know whether it is an 8M page
> > or a lower size. I hope this can be handled by core-mm in the future.
> > 
> > There are probably several ways to implement stuff, so feedback is
> > very welcome.
> 
> I thought it looks pretty good!

I second it.

I saw the discussions in patch 1.  Christophe, I suppose you're exploring
the big hammer over hugepd, and perhaps already went with the 32bit pmd
solution for the nohash/32bit challenge you mentioned?

I'm trying to position my next step; it seems like at least I should not
add any more hugepd code; should I then go with ARCH_HAS_HUGEPD checks,
or are you going to have an RFC soon that I can base on top of?

Thanks,

-- 
Peter Xu



Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-04-10 Thread Peter Xu
On Wed, Apr 10, 2024 at 04:30:41PM +, Christophe Leroy wrote:
> 
> 
> Le 10/04/2024 à 17:28, Peter Xu a écrit :
> > On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote:
> >> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
> >>> In short, hugetlb mappings shouldn't be special comparing to other huge 
> >>> pXd
> >>> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
> >>> not all.  I need to look at all the walkers and there can be some tricky
> >>> ones, but I believe that applies in general.  It's actually similar to 
> >>> what
> >>> I did with slow gup here.
> >>
> >> I think that is the big question, I also haven't done the research to
> >> know the answer.
> >>
> >> At this point focusing on moving what is reasonable to the pXX_* API
> >> makes sense to me. Then reviewing what remains and making some
> >> decision.
> >>
> >>> Like this series, for cont-pXd we'll need multiple walks comparing to
> >>> before (when with hugetlb_entry()), but for that part I'll provide some
> >>> performance tests too, and we also have a fallback plan, which is to 
> >>> detect
> >>> cont-pXd existence, which will also work for large folios.
> >>
> >> I think we can optimize this pretty easy.
> >>   
> >>>> I think if you do the easy places for pXX conversion you will have a
> >>>> good idea about what is needed for the hard places.
> >>>
> >>> Here IMHO we don't need to understand "what is the size of this hugetlb
> >>> vma"
> >>
> >> Yeh, I never really understood why hugetlb was linked to the VMA.. The
> >> page table is self describing, obviously.
> > 
> > Attaching to vma still makes sense to me, where we should definitely avoid
> > a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are
> > allocated, managed, ...  totally differently.
> > 
> > And since hugetlb is designed as file-based (which also makes sense to me,
> > at least for now), it's also natural that it's vma-attached.
> > 
> >>
> >>> or "which level of pgtable does this hugetlb vma pages locate",
> >>
> >> Ditto
> >>
> >>> because we may not need that, e.g., when we only want to collect some 
> >>> smaps
> >>> statistics.  "whether it's hugetlb" may matter, though. E.g. in the mm
> >>> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
> >>> hugetlb_entry removed), we may need extra check later to put things into
> >>> the right bucket, but for the walker itself it doesn't necessarily need
> >>> hugetlb_entry().
> >>
> >> Right, places may still need to know it is part of a huge VMA because we
> >> have special stuff linked to that.
> >>
> >>>> But then again we come back to power and its big list of page sizes
> >>>> and variety :( Looks like some there have huge sizes at the pgd level
> >>>> at least.
> >>>
> >>> Yeah this is something I want to be super clear, because I may miss
> >>> something: we don't have real pgd pages, right?  Powerpc doesn't even
> >>> define p4d_leaf(), AFAICT.
> >>
> >> AFAICT it is because it hides it all in hugepd.
> > 
> > IMHO one thing we can benefit from such hugepd rework is, if we can squash
> > all the hugepds like what Christophe does, then we push it one more layer
> > down, and we have a good chance all things should just work.
> > 
> > So again my Power brain is close to zero, but now I'm referring to what
> > Christophe shared in the other thread:
> > 
> > https://github.com/linuxppc/wiki/wiki/Huge-pages
> > 
> > Together with:
> > 
> > https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.le...@csgroup.eu
> > 
> > Where it has:
> > 
> > --- a/arch/powerpc/platforms/Kconfig.cputype
> > +++ b/arch/powerpc/platforms/Kconfig.cputype
> > @@ -98,6 +98,7 @@ config PPC_BOOK3S_64
> >  select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> >  select ARCH_ENABLE_SPLIT_PMD_PTLOCK
> >  select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> > +   select ARCH_HAS_HUGEPD if HUGETLB_PAGE
> >  select ARCH_SUPPORTS_HUGETLBFS
> >  select ARCH_SUPPORTS_NUMA_BALANCING
> >

Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-04-10 Thread Peter Xu
On Tue, Apr 09, 2024 at 08:43:55PM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
> > In short, hugetlb mappings shouldn't be special comparing to other huge pXd
> > and large folio (cont-pXd) mappings for most of the walkers in my mind, if
> > not all.  I need to look at all the walkers and there can be some tricky
> > ones, but I believe that applies in general.  It's actually similar to what
> > I did with slow gup here.
> 
> I think that is the big question, I also haven't done the research to
> know the answer.
> 
> At this point focusing on moving what is reasonable to the pXX_* API
> makes sense to me. Then reviewing what remains and making some
> decision.
> 
> > Like this series, for cont-pXd we'll need multiple walks comparing to
> > before (when with hugetlb_entry()), but for that part I'll provide some
> > performance tests too, and we also have a fallback plan, which is to detect
> > cont-pXd existence, which will also work for large folios.
> 
> I think we can optimize this pretty easy.
>  
> > > I think if you do the easy places for pXX conversion you will have a
> > > good idea about what is needed for the hard places.
> > 
> > Here IMHO we don't need to understand "what is the size of this hugetlb
> > vma"
> 
> Yeh, I never really understood why hugetlb was linked to the VMA.. The
> page table is self describing, obviously.

Attaching to vma still makes sense to me, where we should definitely avoid
a mixture of hugetlb and !hugetlb pages in a single vma - hugetlb pages are
allocated, managed, ...  totally differently.

And since hugetlb is designed as file-based (which also makes sense to me,
at least for now), it's also natural that it's vma-attached.

> 
> > or "which level of pgtable does this hugetlb vma pages locate",
> 
> Ditto
> 
> > because we may not need that, e.g., when we only want to collect some smaps
> > statistics.  "whether it's hugetlb" may matter, though. E.g. in the mm
> > walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
> > hugetlb_entry removed), we may need extra check later to put things into
> > the right bucket, but for the walker itself it doesn't necessarily need
> > hugetlb_entry().
> 
> Right, places may still need to know it is part of a huge VMA because we
> have special stuff linked to that.
> 
> > > But then again we come back to power and its big list of page sizes
> > > and variety :( Looks like some there have huge sizes at the pgd level
> > > at least.
> > 
> > Yeah this is something I want to be super clear, because I may miss
> > something: we don't have real pgd pages, right?  Powerpc doesn't even
> > define p4d_leaf(), AFAICT.
> 
> AFAICT it is because it hides it all in hugepd.

IMHO one thing we can benefit from with such a hugepd rework is: if we can
squash all the hugepds like Christophe does, then we push it one more layer
down, and there's a good chance all things will just work.

So again my Power brain is close to zero, but now I'm referring to what
Christophe shared in the other thread:

https://github.com/linuxppc/wiki/wiki/Huge-pages

Together with:

https://lore.kernel.org/r/288f26f487648d21fd9590e40b390934eaa5d24a.1711377230.git.christophe.le...@csgroup.eu

Where it has:

--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -98,6 +98,7 @@ config PPC_BOOK3S_64
select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
select ARCH_ENABLE_SPLIT_PMD_PTLOCK
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
+   select ARCH_HAS_HUGEPD if HUGETLB_PAGE
select ARCH_SUPPORTS_HUGETLBFS
select ARCH_SUPPORTS_NUMA_BALANCING
select HAVE_MOVE_PMD
@@ -290,6 +291,7 @@ config PPC_BOOK3S
 config PPC_E500
select FSL_EMB_PERFMON
bool
+   select ARCH_HAS_HUGEPD if HUGETLB_PAGE
select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
select PPC_SMP_MUXED_IPI
select PPC_DOORBELL

So I think it means we have three PowerPC systems that support hugepd
right now: besides the 8xx (where Christophe is trying to drop that
support), we still have book3s_64 and e500.

Let's check one by one:

  - book3s_64

- hash

  - 64K: p4d is not used, largest pgsize pgd 16G @pud level.  It
means after squashing it'll be a bunch of cont-pmd, all good.

  - 4K: p4d also not used, largest pgsize pgd 128G, after squashed
it'll be cont-pud. all good.

- radix

  - 64K: largest 1G @pud, then cont-pmd after squashed. all good.

  - 4K: largest 1G @pud, then cont-pmd, all good.

  - e500 & 8xx

- both of them use 2

Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-08 Thread Peter Xu
On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> The .change_pte() MMU notifier callback was intended as an
> optimization. The original point of it was that KSM could tell KVM to flip
> its secondary PTE to a new location without having to first zap it. At
> the time there was also an .invalidate_page() callback; both of them were
> *not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
> and .invalidate_page() also doubled as a fallback implementation of
> .change_pte().
> 
> Later on, however, both callbacks were changed to occur within an
> invalidate_range_start/end() block.
> 
> In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
> set_pte_at_notify with invalidate_range_start and invalidate_range_end",
> 2012-10-09) did so to remove the fallback from .invalidate_page() to
> .change_pte() and allow sleepable .invalidate_page() hooks.
> 
> This however made KVM's usage of the .change_pte() callback completely
> moot, because KVM unmaps the sPTEs during .invalidate_range_start()
> and therefore .change_pte() has no hope of finding a sPTE to change.
> Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
> well as all the architecture specific implementations.

Paolo,

I may miss a bunch of details here (as I still remember some change_pte
patches previously on the list..), however I'm not sure whether we considered
enabling it?  I'm asking because I remember Andrea used to have a custom tree
maintaining that part:

https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968

Maybe it can't be enabled for some reason that I overlooked in the current
tree, or we just decided not to?

Thanks,

-- 
Peter Xu



Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-04-05 Thread Peter Xu
On Fri, Apr 05, 2024 at 03:16:33PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 04, 2024 at 05:48:03PM -0400, Peter Xu wrote:
> > On Tue, Mar 26, 2024 at 11:02:52AM -0300, Jason Gunthorpe wrote:
> > > The more I look at this the more I think we need to get to Matthew's
> > > idea of having some kind of generic page table API that is not tightly
> > > tied to level. Replacing the hugetlb trick of 'everything is a PTE'
> > > with 5 special cases in every place seems just horrible.
> > > 
> > >struct mm_walk_ops {
> > >int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk 
> > > *walk);
> > >}
> > > 
> > > And many cases really want something like:
> > >struct mm_walk_state state;
> > > 
> > >if (!mm_walk_seek_leaf(state, mm, address))
> > >   goto no_present
> > >if (mm_walk_is_write(state)) ..
> > > 
> > > And detailed walking:
> > >for_each_pt_leaf(state, mm, address) {
> > >if (mm_walk_is_write(state)) ..
> > >}
> > > 
> > > Replacing it with a mm_walk_state that retains the level or otherwise
> > > to allow decoding any entry composes a lot better. Forced Loop
> > > unrolling can get back to the current code gen in a lot of places.
> > > 
> > > It also makes the power stuff a bit nicer as the mm_walk_state could
> > > automatically retain back pointers to the higher levels in the state
> > > struct too...
> > > 
> > > The puzzle is how to do it and still get reasonable efficient codegen,
> > > many operations are going to end up switching on some state->level to
> > > know how to decode the entry.
> > 
> > These discussions are definitely constructive, thanks Jason.  Very helpful.
> > 
> > I thought about this last week but got interrupted.  It does make sense to
> > me; it looks pretty generic and it is flexible enough as a top design.  At
> > least that's what I thought.
> 
> Yeah, exactly..
> 
> > However now when I rethink about it, and look more into the code when I got
> > the chance, it turns out this will be a major rewrite of mostly every
> > walkers..  
> 
> Indeed, it is why it may not be reasonable.
> 
> > Consider that what we (or.. I) want to teach the pXd layers are two things
> > right now: (1) hugetlb mappings (2) MMIO (PFN) mappings.  That mostly
> > shares the generic concept when working on the mm walkers no matter which
> > way to go, just different treatment on different type of mem.  (2) is on
> > top of current code and new stuff, while (1) is a refactoring to drop
> > hugetlb_entry() hook point as the goal.
> 
> Right, I view this as a two pronged attack
> 
> One one front you teach the generic pXX_* macros to process huge pages
> and push that around to the performance walkers like GUP
> 
> On another front you want to replace use of the hugepte with the new
> walkers.
> 
> The challenge with the hugepte code is that it is all structured to
> assume one API that works at all levels and that may be a hard fit to
> replace with pXX_* functions.

That's the core of the problem, or otherwise I feel like I might be doing
something else already.  I had a feeling that even if it's currently efficient
for hugetlb, we'll drop that sooner or later.

The issue is that at least the hugetlb pgtable format is exactly the same as
the rest; as large folios grow, we'll reach the point where we complain more
than before about hugetlb doing its smart things on its own.

> 
> The places that are easy to switch from hugetlb to pXX_* may as well
> do so.
> 
> Other places maybe need a hugetlb replacement that has a similar
> abstraction level of pointing to any page table level.

IMHO this depends.

Per my current knowledge, hugetlb is only special in three forms:

- huge mapping (well, this isn't that special..)
- cont pte/pmd/...
- hugepd

The most fancy one is actually hugepd.. but I plan to put that temporarily
aside - I haven't looked at Christophe's series yet; I think we're going to
solve orthogonal issues, but his work will definitely help me reach mine, and
I want to think things through on my end first to figure out a plan.  If
hugepd has a chance to be generalized, the worst case is that I'll leverage
CONFIG_ARCH_HAS_HUGEPD and only keep hugetlb_entry() for those archs until
hugepd becomes some cont-pXX variant.  Then, when I put HAS_HUGEPD aside, I
don't think it's very complicated, perhaps?

In short, hugetlb mappings shouldn't be special compared to other huge pXd
and large folio (cont-pXd) mappings for most of the walkers in my mind, if
not all.  I need to look at all the walkers 

Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-04-04 Thread Peter Xu
On Tue, Mar 26, 2024 at 11:02:52AM -0300, Jason Gunthorpe wrote:
> The more I look at this the more I think we need to get to Matthew's
> idea of having some kind of generic page table API that is not tightly
> tied to level. Replacing the hugetlb trick of 'everything is a PTE'
> with 5 special cases in every place seems just horrible.
> 
>struct mm_walk_ops {
>int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
>}
> 
> And many cases really want something like:
>struct mm_walk_state state;
> 
>if (!mm_walk_seek_leaf(state, mm, address))
>   goto no_present
>if (mm_walk_is_write(state)) ..
> 
> And detailed walking:
>for_each_pt_leaf(state, mm, address) {
>if (mm_walk_is_write(state)) ..
>}
> 
> Replacing it with a mm_walk_state that retains the level or otherwise
> to allow decoding any entry composes a lot better. Forced Loop
> unrolling can get back to the current code gen in a lot of places.
> 
> It also makes the power stuff a bit nicer as the mm_walk_state could
> automatically retain back pointers to the higher levels in the state
> struct too...
> 
> The puzzle is how to do it and still get reasonable efficient codegen,
> many operations are going to end up switching on some state->level to
> know how to decode the entry.

These discussions are definitely constructive, thanks Jason.  Very helpful.

I thought about this last week but got interrupted.  It does make sense to
me; it looks pretty generic and it is flexible enough as a top design.  At
least that's what I thought.

However, now that I rethink it, and having looked more into the code when I
got the chance, it turns out this will be a major rewrite of almost every
walker..  It doesn't mean that this is a bad idea, but then I'll need to
compare it with the other approach, because there can be a huge difference in
when we can get that code ready, I think. :)

Consider that what we (or.. I) want to teach the pXd layers is two things
right now: (1) hugetlb mappings, (2) MMIO (PFN) mappings.  Those mostly
share the generic concept when working on the mm walkers no matter which
way we go, just with different treatment for different types of memory.  (2)
is on top of current code and new stuff, while (1) is a refactoring whose
goal is to drop the hugetlb_entry() hook point.

Taking the simplest mm walker (smaps) as an example, I think most of the code
is ready thanks to THP's existence, and also things like vm_normal_page[_pmd]()
should already work for pfnmaps; the pud layer is missing but that should
be trivial.  It means we may have a chance to drop hugetlb_entry() without a
huge overhaul yet.
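
For instance, a pud-level hook for that walker could look roughly like the
sketch below.  This is only illustrative: the mm_walk_ops.pud_entry() hook and
the pud helpers used here exist, but the function name and the simplified
accounting are made up for the example.

static int smaps_pud_entry_sketch(pud_t *pud, unsigned long addr,
				  unsigned long end, struct mm_walk *walk)
{
	unsigned long *huge_mapped = walk->private;
	pud_t val = pudp_get(pud);

	/* One pud-sized leaf mapping; thp, hugetlb or pfnmap alike. */
	if (pud_present(val) && pud_leaf(val))
		*huge_mapped += end - addr;

	return 0;
}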

Now the important question I'm asking myself is: do we really need huge p4d
or even bigger?  It's 512GB on x86, and we said "No 512 GiB pages yet"
(commit fe1e8c3e963) since 2017 - that is 7 years without changing this
fact.  Meanwhile, on non-x86, p4d_leaf() is never defined.  Then it's also
interesting to see how much code is "ready" to handle p4d entries (by
looking at p4d_leaf() calls; much easier to see with the removal of the
rest of the huge APIs..) even if none existed.

So, could we be over-engineering if we go the generic route now?
Consider that we already have most of the pmd/pud entries around in the mm
walker ops.  So far it sounds better to leave it for later, until it is
further justified to be useful.  That won't block it if it is ever justified
to be needed; I'd say it can also be seen as a step forward if I can manage
to remove hugetlb_entry() first.

Comments welcomed (before I start to work on anything..).

Thanks,

-- 
Peter Xu



Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback

2024-04-04 Thread Peter Xu
On Thu, Apr 04, 2024 at 08:24:04AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 03, 2024 at 02:25:20PM -0400, Peter Xu wrote:
> 
> > > I'd say the BUILD_BUG has done it's job and found an issue, fix it by
> > > not defining pud_leaf? I don't see any calls to pud_leaf in loongarch
> > > at least
> > 
> > Yes, that sounds better too to me, however it means we may also risk other
> > archs that can fail another defconfig build.. and I worry I bring trouble
> > to multiple such cases.  Fundamentally it's indeed my patch that broke
> > those builds, so I still sent the change and leave that for arch developers
> > to decide the best for the archs.
> 
> But your change causes silent data corruption if the code path is
> run.. I think we are overall better to wade through the compile time
> bugs from linux-next. Honestly if there were alot then I'd think there
> would be more complaints already.
> 
> Maybe it should just be a seperate step from this series.

Right, IMHO that'll be better done separately, as I think we'd better
consolidate the code.

One thing I don't worry about is the warning causing anything real to fail; I
don't yet expect any arch that will not define pud_pfn when it needs
it.. so all of these build errors may not bring real benefits as of now.  But
I agree with you that we'd better have it.  I'll take it as a todo and
I'll try to add it back after all these fallouts.  With my cross-build
chains now it shouldn't be hard, it'll just take some time to revisit each arch.

Thanks,

-- 
Peter Xu



Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback

2024-04-03 Thread Peter Xu
On Wed, Apr 03, 2024 at 09:08:41AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 02, 2024 at 07:35:45PM -0400, Peter Xu wrote:
> > On Tue, Apr 02, 2024 at 07:53:20PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:
> > > 
> > > > I actually tested this without hitting the issue (even though I didn't
> > > > mention it in the cover letter..).  I re-kicked the build test, it turns
> > > > out my "make alldefconfig" on loongarch will generate a config with both
> > > > HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
> > > > THP=y (which I assume was the one above build used).  I didn't further
> > > > check how "make alldefconfig" generated the config; a bit surprising 
> > > > that
> > > > it didn't fetch from there.
> > > 
> > > I suspect it is weird compiler variations.. Maybe something is not
> > > being inlined.
> > > 
> > > > (and it also surprises me that this BUILD_BUG can trigger.. I used to 
> > > > try
> > > >  triggering it elsewhere but failed..)
> > > 
> > > As the pud_leaf() == FALSE should result in the BUILD_BUG never being
> > > called and the optimizer removing it.
> > 
> > Good point, for some reason loongarch defined pud_leaf() without defining
> > pud_pfn(), which does look strange.
> > 
> > #define pud_leaf(pud)   ((pud_val(pud) & _PAGE_HUGE) != 0)
> > 
> > But I noticed at least MIPS also does it..  Logically I think one arch
> > should define either both or neither.
> 
> Wow, this is definately an arch issue. You can't define pud_leaf() and
> not have a pud_pfn(). It makes no sense at all..
> 
> I'd say the BUILD_BUG has done it's job and found an issue, fix it by
> not defining pud_leaf? I don't see any calls to pud_leaf in loongarch
> at least

Yes, that sounds better to me too; however, it means we may also risk other
archs failing another defconfig build.. and I worry about bringing trouble
to multiple such cases.  Fundamentally it's indeed my patch that broke
those builds, so I still sent the change and left it for arch developers
to decide what's best for their archs.

I think, if wanted, we can add that BUILD_BUG() back when we're sure no arch
will break with it.  Such changes from the archs can still be proposed
alongside the removal of the BUILD_BUG() (and I'd guess some other arch would
soon start to notice such a build issue if it existed.. so it still more or
less has a similar effect as a reminder..).

Thanks,

-- 
Peter Xu



Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback

2024-04-02 Thread Peter Xu
On Tue, Apr 02, 2024 at 07:53:20PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 02, 2024 at 06:43:56PM -0400, Peter Xu wrote:
> 
> > I actually tested this without hitting the issue (even though I didn't
> > mention it in the cover letter..).  I re-kicked the build test, it turns
> > out my "make alldefconfig" on loongarch will generate a config with both
> > HUGETLB=n && THP=n, while arch/loongarch/configs/loongson3_defconfig has
> > THP=y (which I assume was the one above build used).  I didn't further
> > check how "make alldefconfig" generated the config; a bit surprising that
> > it didn't fetch from there.
> 
> I suspect it is weird compiler variations.. Maybe something is not
> being inlined.
> 
> > (and it also surprises me that this BUILD_BUG can trigger.. I used to try
> >  triggering it elsewhere but failed..)
> 
> As the pud_leaf() == FALSE should result in the BUILD_BUG never being
> called and the optimizer removing it.

Good point, for some reason loongarch defined pud_leaf() without defining
pud_pfn(), which does look strange.

#define pud_leaf(pud)   ((pud_val(pud) & _PAGE_HUGE) != 0)

But I noticed at least MIPS also does it..  Logically I think one arch
should define either both or neither.

> 
> Perhaps the issue is that the pud_leaf() is too far from the pud_pfn?

My understanding is that follow_pud_mask() should be completely optimized and
follow_huge_pud() dropped from the compiler output if pud_leaf() == false.
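
To illustrate that expectation with stand-in macros (not the real arch code,
and the real helpers take a pud_t; this only shows the dead-code elimination
that keeps the BUILD_BUG() fallback harmless when optimizations are on):

#include <linux/build_bug.h>

#define demo_pud_leaf(pud)	(0)			/* arch with no pud leaves */
#define demo_pud_pfn(pud)	({ BUILD_BUG(); 0UL; })	/* the generic fallback */

static unsigned long demo_follow_pud(unsigned long pud)
{
	if (demo_pud_leaf(pud))			/* constant false: branch is removed */
		return demo_pud_pfn(pud);	/* BUILD_BUG() compiles away with it */
	return 0;
}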

Thanks,

-- 
Peter Xu



Re: [PATCH v4 05/13] mm/arch: Provide pud_pfn() fallback

2024-04-02 Thread Peter Xu
On Tue, Apr 02, 2024 at 12:05:49PM -0700, Nathan Chancellor wrote:
> Hi Peter (and LoongArch folks),
> 
> On Wed, Mar 27, 2024 at 11:23:24AM -0400, pet...@redhat.com wrote:
> > From: Peter Xu 
> > 
> > The comment in the code explains the reasons.  We took a different approach
> > comparing to pmd_pfn() by providing a fallback function.
> > 
> > Another option is to provide some lower level config options (compare to
> > HUGETLB_PAGE or THP) to identify which layer an arch can support for such
> > huge mappings.  However that can be an overkill.
> > 
> > Cc: Mike Rapoport (IBM) 
> > Cc: Matthew Wilcox 
> > Reviewed-by: Jason Gunthorpe 
> > Signed-off-by: Peter Xu 
> > ---
> >  arch/riscv/include/asm/pgtable.h|  1 +
> >  arch/s390/include/asm/pgtable.h |  1 +
> >  arch/sparc/include/asm/pgtable_64.h |  1 +
> >  arch/x86/include/asm/pgtable.h  |  1 +
> >  include/linux/pgtable.h | 10 ++
> >  5 files changed, 14 insertions(+)
> > 
> > diff --git a/arch/riscv/include/asm/pgtable.h 
> > b/arch/riscv/include/asm/pgtable.h
> > index 20242402fc11..0ca28cc8e3fa 100644
> > --- a/arch/riscv/include/asm/pgtable.h
> > +++ b/arch/riscv/include/asm/pgtable.h
> > @@ -646,6 +646,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
> >  
> >  #define __pud_to_phys(pud)  (__page_val_to_pfn(pud_val(pud)) << PAGE_SHIFT)
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> > return ((__pud_to_phys(pud) & PUD_MASK) >> PAGE_SHIFT);
> > diff --git a/arch/s390/include/asm/pgtable.h 
> > b/arch/s390/include/asm/pgtable.h
> > index 1a71cb19c089..6cbbe473f680 100644
> > --- a/arch/s390/include/asm/pgtable.h
> > +++ b/arch/s390/include/asm/pgtable.h
> > @@ -1414,6 +1414,7 @@ static inline unsigned long pud_deref(pud_t pud)
> > return (unsigned long)__va(pud_val(pud) & origin_mask);
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> > return __pa(pud_deref(pud)) >> PAGE_SHIFT;
> > diff --git a/arch/sparc/include/asm/pgtable_64.h 
> > b/arch/sparc/include/asm/pgtable_64.h
> > index 4d1bafaba942..26efc9bb644a 100644
> > --- a/arch/sparc/include/asm/pgtable_64.h
> > +++ b/arch/sparc/include/asm/pgtable_64.h
> > @@ -875,6 +875,7 @@ static inline bool pud_leaf(pud_t pud)
> > return pte_val(pte) & _PAGE_PMD_HUGE;
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> > pte_t pte = __pte(pud_val(pud));
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index cefc7a84f7a4..273f7557218c 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -234,6 +234,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
> > return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
> >  }
> >  
> > +#define pud_pfn pud_pfn
> >  static inline unsigned long pud_pfn(pud_t pud)
> >  {
> > phys_addr_t pfn = pud_val(pud);
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 600e17d03659..75fe309a4e10 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1817,6 +1817,16 @@ typedef unsigned int pgtbl_mod_mask;
> >  #define pte_leaf_size(x) PAGE_SIZE
> >  #endif
> >  
> > +/*
> > + * We always define pmd_pfn for all archs as it's used in lots of generic
> > + * code.  Now it happens too for pud_pfn (and can happen for larger
> > + * mappings too in the future; we're not there yet).  Instead of defining
> > + * it for all archs (like pmd_pfn), provide a fallback.
> > + */
> > +#ifndef pud_pfn
> > +#define pud_pfn(x) ({ BUILD_BUG(); 0; })
> > +#endif
> > +
> >  /*
> >   * Some architectures have MMUs that are configurable or selectable at boot
> >   * time. These lead to variable PTRS_PER_x. For statically allocated 
> > arrays it
> > -- 
> > 2.44.0
> > 
> 
> This BUILD_BUG() triggers for LoongArch with their defconfig, so it
> seems like they need to provide an implementation of pud_pfn()?
> 
>   In function 'follow_huge_pud',
>   inlined from 'follow_pud_mask' at mm/gup.c:1075:10,
>   inlined from 'follow_p4d_mask' at mm/gup.c:1105:9,
>   inlined from 'follow_page_mask' at mm/gup.c:1151:10:
>   include/linux/compiler_types.h:460:45: error: call to 
> '__compiletime_assert_382' declared with att

Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-04-02 Thread Peter Xu
On Tue, Apr 02, 2024 at 05:46:57PM +0100, Ryan Roberts wrote:
> I'll leave you to do the testing on this, if that's ok.

Definitely.  I'll test and send formal patches.

> 
> Just to make my config explicit, I have this kernel command line, which 
> reserves
> hugetlbs of all sizes for the tests:
> 
> "transparent_hugepage=madvise earlycon root=/dev/vda2 secretmem.enable
> hugepagesz=1G hugepages=0:2,1:2 hugepagesz=32M hugepages=0:2,1:2
> default_hugepagesz=2M hugepages=0:64,1:64 hugepagesz=64K hugepages=0:2,1:2"

This helps, thanks.

-- 
Peter Xu



Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-04-02 Thread Peter Xu
On Tue, Apr 02, 2024 at 06:39:31PM +0200, David Hildenbrand wrote:
> On 02.04.24 18:20, Peter Xu wrote:
> > On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
> > > On 02.04.24 16:48, Ryan Roberts wrote:
> > > > Hi Peter,
> > 
> > Hey, Ryan,
> > 
> > Thanks for the report!
> > 
> > > > 
> > > > On 27/03/2024 15:23, pet...@redhat.com wrote:
> > > > > From: Peter Xu 
> > > > > 
> > > > > Now follow_page() is ready to handle hugetlb pages in whatever form, 
> > > > > and
> > > > > over all architectures.  Switch to the generic code path.
> > > > > 
> > > > > Time to retire hugetlb_follow_page_mask(), following the previous
> > > > > retirement of follow_hugetlb_page() in 4849807114b8.
> > > > > 
> > > > > There may be a slight difference of how the loops run when processing 
> > > > > slow
> > > > > GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: 
> > > > > each
> > > > > loop of __get_user_pages() will resolve one pgtable entry with the 
> > > > > patch
> > > > > applied, rather than relying on the size of hugetlb hstate, the 
> > > > > latter may
> > > > > cover multiple entries in one loop.
> > > > > 
> > > > > A quick performance test on an aarch64 VM on M1 chip shows 15% 
> > > > > degrade over
> > > > > a tight loop of slow gup after the path switched.  That shouldn't be a
> > > > > problem because slow-gup should not be a hot path for GUP in general: 
> > > > > when
> > > > > page is commonly present, fast-gup will already succeed, while when 
> > > > > the
> > > > > page is indeed missing and requires a follow-up page fault, the slow
> > > > > gup
> > > > > degrade will probably be buried in the fault paths anyway.  It also 
> > > > > explains
> > > > > why slow gup for THP used to be very slow before 57edfcfd3419 
> > > > > ("mm/gup:
> > > > > accelerate thp gup even for "pages != NULL"") lands, the latter not 
> > > > > part of
> > > > > a performance analysis but a side benefit.  If the performance will 
> > > > > be a
> > > > > concern, we can consider handling CONT_PTE in follow_page().
> > > > > 
> > > > > Before that is justified to be necessary, keep everything clean and 
> > > > > simple.
> > > > > 
> > > > > Reviewed-by: Jason Gunthorpe 
> > > > > Signed-off-by: Peter Xu 
> > > > 
> > > > Afraid I'm seeing an oops when running gup_longterm test on arm64 with 
> > > > current mm-unstable. Git bisect blames this patch. The oops reproduces 
> > > > for me every time on 2 different machines:
> > > > 
> > > > 
> > > > [9.340416] kernel BUG at mm/gup.c:778!
> > > > [9.340746] Internal error: Oops - BUG: f2000800 [#1] 
> > > > PREEMPT SMP
> > > > [9.341199] Modules linked in:
> > > > [9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 
> > > > 6.9.0-rc2-00210-g910ff1a347e4 #11
> > > > [9.342232] Hardware name: linux,dummy-virt (DT)
> > > > [9.342647] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS 
> > > > BTYPE=--)
> > > > [9.343195] pc : follow_page_mask+0x4d4/0x880
> > > > [9.343580] lr : follow_page_mask+0x4d4/0x880
> > > > [9.344018] sp : 8000898b3aa0
> > > > [9.344345] x29: 8000898b3aa0 x28: fdffc53973e8 x27: 
> > > > 3c0005d08000
> > > > [9.345028] x26: 00014e5cfd08 x25: d3513a40c000 x24: 
> > > > fdffc5d08000
> > > > [9.345682] x23: c1ffc000 x22: 00080101 x21: 
> > > > 8000898b3ba8
> > > > [9.346337] x20: f420 x19: 00014e52d508 x18: 
> > > > 0010
> > > > [9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 
> > > > 6170286461654865
> > > > [9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 
> > > > 6761705f65636976
> > > > [9.348371] x11: 65645f656e6f7a5f x10: d3513b31d6e0 x9 : 
> > > > d3513852f090
> > > > [9.349062] x8 : efff x7 : d

Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-04-02 Thread Peter Xu
]  el0_svc+0x34/0xe0
> [9.763699]  el0t_64_sync_handler+0x13c/0x158
> [9.764139]  el0t_64_sync+0x190/0x198
> [9.764465] Code: aa1803e0 d000d8e1 911d6021 97fff4c9 (d421) 
> [9.765053] ---[ end trace  ]---
> [9.765520] note: gup_longterm[1155] exited with irqs disabled
> [9.766146] note: gup_longterm[1155] exited with preempt_count 2
> [9.767366] [ cut here ]
> [9.768062] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:128 
> ct_kernel_exit.constprop.0+0x108/0x120
> [9.769146] Modules linked in:
> [9.769429] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G  D
> 6.9.0-rc2-00210-g910ff1a347e4-dirty #12
> [9.770338] Hardware name: linux,dummy-virt (DT)
> [9.770837] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [9.771615] pc : ct_kernel_exit.constprop.0+0x108/0x120
> [9.772150] lr : ct_idle_enter+0x10/0x20
> [9.772539] sp : 8000801b3dc0
> [9.772913] x29: 8000801b3dc0 x28:  x27: 
> 
> [9.773769] x26:  x25: 00014149e900 x24: 
> 
> [9.774526] x23:  x22: aaac08e99d48 x21: 
> aaac08385730
> [9.775255] x20: aaac08e99c28 x19: 00017ffc8da0 x18: 
> f5ff
> [9.775924] x17:  x16: 1fffe0002a57c9e1 x15: 
> 0001
> [9.776619] x14:  x13:  x12: 
> aaac07a06968
> [9.777246] x11: 00ae44c42eec x10: 0ad0 x9 : 
> aaac06189230
> [9.777942] x8 : 00014149f430 x7 : 02c9acb509db422c x6 : 
> 1015a9f0
> [9.778635] x5 : 4002 x4 : 77c46000 x3 : 
> 8000801b3dc0
> [9.779671] x2 : aaac08382da0 x1 : 4000 x0 : 
> aaac08382da0
> [9.780703] Call trace:
> [9.781150]  ct_kernel_exit.constprop.0+0x108/0x120
> [9.781949]  ct_idle_enter+0x10/0x20
> [9.782246]  default_idle_call+0x3c/0x160
> [9.782624]  do_idle+0x21c/0x280
> [9.782945]  cpu_startup_entry+0x3c/0x50
> [9.783268]  secondary_start_kernel+0x140/0x168
> [9.783818]  __secondary_switched+0xb8/0xc0
> [9.784163] ---[ end trace  ]---
> 
> 
> Which is caused by this:
> 
> static __always_inline int PageAnonExclusive(const struct page *page)
> {
>   VM_BUG_ON_PGFLAGS(!PageAnon(page), page);
>   VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); <<<<
>   return test_bit(PG_anon_exclusive, &PF_ANY(page, 1)->flags);
> }
> 
> Which is called from can_follow_write_pmd(), called just after the assert I 
> just commented out.
> 
> 
> It's triggered by this test:
> 
> # [RUN] R/W longterm GUP pin in MAP_PRIVATE file mapping ... with memfd 
> hugetlb (32768 kB)
> 
> Which is the first MAP_PRIVATE test for cont-pmd mapped hugetlb. (All 
> MAP_SHARED tests are passing).
> 
> 
> Looks like can_follow_write_pmd() returns early for VM_SHARED mappings.
> 
> I don't think we only keep the PAE flag in the head page for hugetlb pages? 
> So we can't just remove this assert?
> 
> I tried just commenting it out and get assert further down follow_huge_pmd():
> 
> VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
>   !PageAnonExclusive(page), page);

I just replied in another email; we can try the two patches I attached, or
we can wait until I do some tests (but I'll be mostly unavailable this
afternoon).

Thanks,

-- 
Peter Xu



Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-04-02 Thread Peter Xu
On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote:
> On 02.04.24 16:48, Ryan Roberts wrote:
> > Hi Peter,

Hey, Ryan,

Thanks for the report!

> > 
> > On 27/03/2024 15:23, pet...@redhat.com wrote:
> > > From: Peter Xu 
> > > 
> > > Now follow_page() is ready to handle hugetlb pages in whatever form, and
> > > over all architectures.  Switch to the generic code path.
> > > 
> > > Time to retire hugetlb_follow_page_mask(), following the previous
> > > retirement of follow_hugetlb_page() in 4849807114b8.
> > > 
> > > There may be a slight difference of how the loops run when processing slow
> > > GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
> > > loop of __get_user_pages() will resolve one pgtable entry with the patch
> > > applied, rather than relying on the size of hugetlb hstate, the latter may
> > > cover multiple entries in one loop.
> > > 
> > > A quick performance test on an aarch64 VM on M1 chip shows 15% degrade 
> > > over
> > > a tight loop of slow gup after the path switched.  That shouldn't be a
> > > problem because slow-gup should not be a hot path for GUP in general: when
> > > page is commonly present, fast-gup will already succeed, while when the
> > > page is indeed missing and requires a follow-up page fault, the slow gup
> > > degrade will probably be buried in the fault paths anyway.  It also explains
> > > why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> > > accelerate thp gup even for "pages != NULL"") lands, the latter not part 
> > > of
> > > a performance analysis but a side benefit.  If the performance will be a
> > > concern, we can consider handling CONT_PTE in follow_page().
> > > 
> > > Before that is justified to be necessary, keep everything clean and 
> > > simple.
> > > 
> > > Reviewed-by: Jason Gunthorpe 
> > > Signed-off-by: Peter Xu 
> > 
> > Afraid I'm seeing an oops when running gup_longterm test on arm64 with 
> > current mm-unstable. Git bisect blames this patch. The oops reproduces for 
> > me every time on 2 different machines:
> > 
> > 
> > [9.340416] kernel BUG at mm/gup.c:778!
> > [9.340746] Internal error: Oops - BUG: f2000800 [#1] PREEMPT SMP
> > [9.341199] Modules linked in:
> > [9.341481] CPU: 1 PID: 1159 Comm: gup_longterm Not tainted 
> > 6.9.0-rc2-00210-g910ff1a347e4 #11
> > [9.342232] Hardware name: linux,dummy-virt (DT)
> > [9.342647] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS 
> > BTYPE=--)
> > [9.343195] pc : follow_page_mask+0x4d4/0x880
> > [9.343580] lr : follow_page_mask+0x4d4/0x880
> > [9.344018] sp : 8000898b3aa0
> > [9.344345] x29: 8000898b3aa0 x28: fdffc53973e8 x27: 
> > 3c0005d08000
> > [9.345028] x26: 00014e5cfd08 x25: d3513a40c000 x24: 
> > fdffc5d08000
> > [9.345682] x23: c1ffc000 x22: 00080101 x21: 
> > 8000898b3ba8
> > [9.346337] x20: f420 x19: 00014e52d508 x18: 
> > 0010
> > [9.347005] x17: 5f656e6f7a5f7369 x16: 2120262620296567 x15: 
> > 6170286461654865
> > [9.347713] x14: 6761502128454741 x13: 2929656761702865 x12: 
> > 6761705f65636976
> > [9.348371] x11: 65645f656e6f7a5f x10: d3513b31d6e0 x9 : 
> > d3513852f090
> > [9.349062] x8 : efff x7 : d3513b31d6e0 x6 : 
> > 
> > [9.349753] x5 : 00017ff98cc8 x4 : 0fff x3 : 
> > 
> > [9.350397] x2 :  x1 : 000190e8b480 x0 : 
> > 0052
> > [9.351097] Call trace:
> > [9.351312]  follow_page_mask+0x4d4/0x880
> > [9.351700]  __get_user_pages+0xf4/0x3e8
> > [9.352089]  __gup_longterm_locked+0x204/0xa70
> > [9.352516]  pin_user_pages+0x88/0xc0
> > [9.352873]  gup_test_ioctl+0x860/0xc40
> > [9.353249]  __arm64_sys_ioctl+0xb0/0x100
> > [9.353648]  invoke_syscall+0x50/0x128
> > [9.354022]  el0_svc_common.constprop.0+0x48/0xf8
> > [9.354488]  do_el0_svc+0x28/0x40
> > [9.354822]  el0_svc+0x34/0xe0
> > [9.355128]  el0t_64_sync_handler+0x13c/0x158
> > [9.355489]  el0t_64_sync+0x190/0x198
> > [9.355793] Code: aa1803e0 d000d8e1 91220021 97fff560 (d421)
> > [9.356280] ---[ end trace  ]---
> > [9.356651] note: gup_longterm[1159] exited with irqs disabled
>

Re: [PATCH RFC 0/3] mm/gup: consistently call it GUP-fast

2024-03-27 Thread Peter Xu
On Wed, Mar 27, 2024 at 02:05:35PM +0100, David Hildenbrand wrote:
> Some cleanups around function names, comments and the config option of
> "GUP-fast" -- GUP without "lock" safety belts on.
> 
> With this cleanup it's easy to judge which functions are GUP-fast specific.
> We now consistently call it "GUP-fast", avoiding mixing it with "fast GUP",
> "lockless", or simply "gup" (which I always considered confusing in the
> code).
> 
> So the magic now happens in functions that contain "gup_fast", whereby
> gup_fast() is the entry point into that magic. Comments consistently
> reference either "GUP-fast" or "gup_fast()".
> 
> Based on mm-unstable from today. I won't CC arch maintainers, but only
> arch mailing lists, to reduce noise.
> 
> Tested on x86_64, cross compiled on a bunch of archs, whereby some of them
> don't properly even compile on mm-unstable anymore in my usual setup
> (alpha, arc, parisc64, sh) ... maybe the cross compilers are outdated,
> but there are no new ones around. Hm.

I'm not sure what config you tried there; as I have been doing some build
tests recently, I found that turning off CONFIG_SAMPLES + CONFIG_GCC_PLUGINS
could avoid a lot of issues; I think it's due to a missing libc.  But maybe
that's not the case there.

The series makes sense to me; the naming is confusing.  Btw, thanks for
posting this as an RFC.  This definitely has a conflict with the other gup
series that I had; I'll post v4 of that shortly.

-- 
Peter Xu



Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-03-25 Thread Peter Xu
On Fri, Mar 22, 2024 at 01:10:00PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2024 at 06:07:50PM -0400, pet...@redhat.com wrote:
> > From: Peter Xu 
> > 
> > v3:
> > - Rebased to latest mm-unstable (a824831a082f, of March 21st)
> > - Dropped patch to introduce pmd_thp_or_huge(), replace such uses (and also
> >   pXd_huge() users) with pXd_leaf() [Jason]
> > - Add a comment for CONFIG_PGTABLE_HAS_HUGE_LEAVES [Jason]
> > - Use IS_ENABLED() in follow_huge_pud() [Jason]
> > - Remove redundant none pud check in follow_pud_mask() [Jason]
> > 
> > rfc: https://lore.kernel.org/r/20231116012908.392077-1-pet...@redhat.com
> > v1:  https://lore.kernel.org/r/20231219075538.414708-1-pet...@redhat.com
> > v2:  https://lore.kernel.org/r/20240103091423.400294-1-pet...@redhat.com
> > 
> > The series removes the hugetlb slow gup path after a previous refactor work
> > [1], so that slow gup now uses the exact same path to process all kinds of
> > memory including hugetlb.
> > 
> > For the long term, we may want to remove most, if not all, call sites of
> > huge_pte_offset().  It'll be ideal if that API can be completely dropped
> > from arch hugetlb API.  This series is one small step towards merging
> > hugetlb specific codes into generic mm paths.  From that POV, this series
> > removes one reference to huge_pte_offset() out of many others.
> 
> This remark would be a little easier to understand if you refer to
> hugetlb_walk() not huge_pte_offset() - the latter is only used to
> implement hugetlb_walk() and isn't removed by this series, while a
> single hugetlb_walk() was removed.

Right.  Here huge_pte_offset() is the arch API that I hope we can remove;
hugetlb_walk() is simply the wrapper.
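
For reference, the wrapper relationship is roughly the following (simplified;
the real hugetlb_walk() in <linux/hugetlb.h> also carries locking assertions):

static inline pte_t *hugetlb_walk(struct vm_area_struct *vma,
				  unsigned long addr, unsigned long sz)
{
	/* lockdep assertions on the hugetlb vma lock / i_mmap lock live here */
	return huge_pte_offset(vma->vm_mm, addr, sz);
}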

> 
> Regardless, I think the point is spot on, the direction should be to
> make the page table reading generic with minimal/no interaction with
> hugetlbfs in the generic code.

Yes, and I also like your term for them, "pgtable readers".  It's a
better way to describe the difference in that regard between
huge_pte_offset() vs. huge_pte_alloc().  That's exactly my goal: we
should start with the "readers".

The writers might change semantics when merged, and can be more
challenging; I'll need to look into the details of each one, like the page
fault handlers.  Such work may need to be analyzed case by case, and this GUP
part is definitely the low-hanging fruit compared to the rest.

> 
> After this series I would suggest doing the pagewalk.c stuff as it is
> very parallel to GUP slow (indeed it would be amazing to figure out a
> way to make GUP slow and pagewalk.c use the same code without a
> performance cost)

Yes.  I hope there shouldn't be much of a perf degradation; I can do some more
measurements too when getting there.  And btw, IIUC the major challenge of
pagewalk.c is not the removal of walk_hugetlb_range() alone - that may not
be that hard if that's the sole purpose.  The better way to go is to remove
mm_walk_ops.hugetlb_entry() altogether, which will cause a change in all
callers; that's "the challenge".. pretty much labor work, not a major
technical challenge it seems.  Not sure if that's good news or bad..
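
To sketch the caller-visible part of that change (the ops layout below mirrors
the existing smaps walker; the pud handler and the exact split are
hypothetical, shown only to give an idea of the shape of the churn):

#include <linux/pagewalk.h>

/* Existing smaps callbacks (they live in fs/proc/task_mmu.c)... */
int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
		    struct mm_walk *walk);
int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
			unsigned long addr, unsigned long end,
			struct mm_walk *walk);
/* ...plus a hypothetical pud-leaf handler. */
int smaps_pud_range(pud_t *pud, unsigned long addr, unsigned long end,
		    struct mm_walk *walk);

/* Roughly today: hugetlb goes through its own hook. */
static const struct mm_walk_ops smaps_walk_ops_today = {
	.pmd_entry	= smaps_pte_range,
	.hugetlb_entry	= smaps_hugetlb_range,
};

/* The direction discussed: no hugetlb_entry(), leaves handled per level. */
static const struct mm_walk_ops smaps_walk_ops_unified = {
	.pud_entry	= smaps_pud_range,
	.pmd_entry	= smaps_pte_range,	/* thp and pmd-level hugetlb alike */
};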

One thing I'll soon look into is to allow huge mappings for PFNMAP; there's
one request from VFIO side for MMIO. Dropping mm_walk_ops.hugetlb_entry()
seems to be good for all such purposes; well, I may need to bite the bullet
here.. for either of the purposes to move on.

> 
> Some of the other core mm callers don't look too bad either, getting
> to the point where hugetlb_walk() is internal to hugetlb.c would be a
> nice step that looks reasonable size.

Agree.

> 
> > One goal of such a route is that we can reconsider merging hugetlb features
> > like High Granularity Mapping (HGM).  It was not accepted in the past
> > because it may add lots of hugetlb specific codes and make the mm code even
> > harder to maintain.  With a merged codeset, features like HGM can hopefully
> > share some code with THP, legacy (PMD+) or modern (continuous PTEs).
> 
> Yeah, if all the special hugetlb stuff is using generic arch code and
> generic page walkers (maybe with that special vma locking hook) it is
> much easier to swallow than adding yet another special class of code
> to all the page walkers.
> 
> > To make it work, the generic slow gup code will need to at least understand
> > hugepd, which is already done like so in fast-gup.  Due to the specialty of
> > hugepd to be software-only solution (no hardware recognizes the hugepd
> > format, so it's purely artificial structures), there's chance we can merge
> > some or all hugepd formats with cont_pte in the future.  That question is
> > yet unsettled from Power s

Re: [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-03-22 Thread Peter Xu
On Fri, Mar 22, 2024 at 08:45:59PM -0400, Peter Xu wrote:
> On Fri, Mar 22, 2024 at 01:48:18PM -0700, Andrew Morton wrote:
> > On Thu, 21 Mar 2024 18:08:02 -0400 pet...@redhat.com wrote:
> > 
> > > From: Peter Xu 
> > > 
> > > Now follow_page() is ready to handle hugetlb pages in whatever form, and
> > > over all architectures.  Switch to the generic code path.
> > > 
> > > Time to retire hugetlb_follow_page_mask(), following the previous
> > > retirement of follow_hugetlb_page() in 4849807114b8.
> > > 
> > > There may be a slight difference of how the loops run when processing slow
> > > GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
> > > loop of __get_user_pages() will resolve one pgtable entry with the patch
> > > applied, rather than relying on the size of hugetlb hstate, the latter may
> > > cover multiple entries in one loop.
> > > 
> > > A quick performance test on an aarch64 VM on M1 chip shows 15% degrade 
> > > over
> > > a tight loop of slow gup after the path switched.  That shouldn't be a
> > > problem because slow-gup should not be a hot path for GUP in general: when
> > > page is commonly present, fast-gup will already succeed, while when the
> > > page is indeed missing and requires a follow-up page fault, the slow gup
> > > degrade will probably be buried in the fault paths anyway.  It also explains
> > > why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> > > accelerate thp gup even for "pages != NULL"") lands, the latter not part 
> > > of
> > > a performance analysis but a side benefit.  If the performance will be a
> > > concern, we can consider handling CONT_PTE in follow_page().
> > > 
> > > Before that is justified to be necessary, keep everything clean and 
> > > simple.
> > > 
> > 
> > mm/gup.c:33:21: warning: 'follow_hugepd' declared 'static' but never 
> > defined [-Wunused-function]
> >33 | static struct page *follow_hugepd(struct vm_area_struct *vma, 
> > hugepd_t hugepd,
> >   | ^
> > 
> > --- a/mm/gup.c~mm-gup-handle-hugepd-for-follow_page-fix
> > +++ a/mm/gup.c
> > @@ -30,10 +30,12 @@ struct follow_page_context {
> > unsigned int page_mask;
> >  };
> >  
> > +#ifdef CONFIG_HAVE_FAST_GUP
> >  static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t 
> > hugepd,
> >   unsigned long addr, unsigned int pdshift,
> >   unsigned int flags,
> >   struct follow_page_context *ctx);
> > +#endif
> >  
> >  static inline void sanity_check_pinned_pages(struct page **pages,
> >  unsigned long npages)
> > _
> > 
> > 
> > This looks inelegant.
> > 
> > That's two build issues so far.  Please be more expansive in the
> > Kconfig variations when testing.  Especially when mucking with pgtable
> > macros.
> 
> Andrew,
> 
> Apologies for that, and also for the slightly late response.  Yeah, it's
> time I set up my own tooling for serious build tests; I'll at least start
> covering a few error-prone configs/archs with that.
> 
> I was trying to rely on the build bot in many previous such cases, as it
> was quite useful for catching build issues without my investing in test
> setups, but for some reason it seems to have retired and stopped working a
> while ago.  Maybe I shouldn't have relied on it at all.
> 
> For this specific issue, I'm not sure CONFIG_HAVE_FAST_GUP is the proper
> guard, as follow_hugepd() is used in slow gup, not fast.  So maybe we can
> put that under CONFIG_MMU below that code (and I think we can drop "static"
> too, as I don't think it buys us anything).  My version of the fixup is
> attached at the end

the static is useful; below patch did pass on m68k but won't on
x86.. ignore that please.

> of this email, and I verified it with an m68k build.
> 
> I do plan to post a small fixup series to fix these issues (so far it may
> contain 1 formal patch to touch up vmstat_item_print_in_thp, and 2 fixups
> where I'll mark the subject with "fixup!" properly).  Either you can pick
> up below or you can wait for my small patchset, should be there either
> today or tomorrow.

I changed plans here too; I found more users of HPAGE_PMD_NR assuming it's
defined even if !CONFIG_MMU.  That's weird, as !CONFIG_MMU doesn't even
define PMD_SHIFT...  To fix this I decided to use the old trick of using

Re: [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-03-22 Thread Peter Xu
On Fri, Mar 22, 2024 at 01:48:18PM -0700, Andrew Morton wrote:
> On Thu, 21 Mar 2024 18:08:02 -0400 pet...@redhat.com wrote:
> 
> > From: Peter Xu 
> > 
> > Now follow_page() is ready to handle hugetlb pages in whatever form, and
> > over all architectures.  Switch to the generic code path.
> > 
> > Time to retire hugetlb_follow_page_mask(), following the previous
> > retirement of follow_hugetlb_page() in 4849807114b8.
> > 
> > There may be a slight difference of how the loops run when processing slow
> > GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
> > loop of __get_user_pages() will resolve one pgtable entry with the patch
> > applied, rather than relying on the size of hugetlb hstate, the latter may
> > cover multiple entries in one loop.
> > 
> > A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> > a tight loop of slow gup after the path switched.  That shouldn't be a
> > problem because slow-gup should not be a hot path for GUP in general: when
> > page is commonly present, fast-gup will already succeed, while when the
> > page is indeed missing and require a follow up page fault, the slow gup
> > degrade will probably buried in the fault paths anyway.  It also explains
> > why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> > accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> > a performance analysis but a side benefit.  If the performance will be a
> > concern, we can consider handle CONT_PTE in follow_page().
> > 
> > Before that is justified to be necessary, keep everything clean and simple.
> > 
> 
> mm/gup.c:33:21: warning: 'follow_hugepd' declared 'static' but never defined 
> [-Wunused-function]
>33 | static struct page *follow_hugepd(struct vm_area_struct *vma, 
> hugepd_t hugepd,
>   | ^
> 
> --- a/mm/gup.c~mm-gup-handle-hugepd-for-follow_page-fix
> +++ a/mm/gup.c
> @@ -30,10 +30,12 @@ struct follow_page_context {
>   unsigned int page_mask;
>  };
>  
> +#ifdef CONFIG_HAVE_FAST_GUP
>  static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t 
> hugepd,
> unsigned long addr, unsigned int pdshift,
> unsigned int flags,
> struct follow_page_context *ctx);
> +#endif
>  
>  static inline void sanity_check_pinned_pages(struct page **pages,
>unsigned long npages)
> _
> 
> 
> This looks inelegant.
> 
> That's two build issues so far.  Please be more expansive in the
> Kconfig variations when testing.  Especially when mucking with pgtable
> macros.

Andrew,

Apologies for that, and also for the slightly late response.  Yeah, it's
time I set up my own tooling for serious build tests; I'll at least start
covering a few error-prone configs/archs with that.

I was trying to rely on the build bot in many previous such cases, as it
was quite useful for catching build issues without my investing in test
setups, but for some reason it seems to have retired and stopped working a
while ago.  Maybe I shouldn't have relied on it at all.

For this specific issue, I'm not sure CONFIG_HAVE_FAST_GUP is the proper
guard, as follow_hugepd() is used in slow gup, not fast.  So maybe we can
put that under CONFIG_MMU below that code (and I think we can drop "static"
too, as I don't think it buys us anything).  My version of the fixup is
attached at the end of this email, and I verified it with an m68k build.

I do plan to post a small fixup series to fix these issues (so far it may
contain 1 formal patch to touch up vmstat_item_print_in_thp, and 2 fixups
where I'll mark the subject with "fixup!" properly).  Either you can pick
up below or you can wait for my small patchset, should be there either
today or tomorrow.

Thanks,

===8<===
diff --git a/mm/gup.c b/mm/gup.c
index 4cd349390477..a2ed8203495a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -30,11 +30,6 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
-static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
- unsigned long addr, unsigned int pdshift,
- unsigned int flags,
- struct follow_page_context *ctx);
-
 static inline void sanity_check_pinned_pages(struct page **pages,
 unsigned long npages)
 {
@@ -505,6 +500,12 @@ static inline void mm_set_has_pinned_flag(unsigned long 
*mm_flags)
 }
 
 #ifdef CONFIG_MMU
+
+struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+ 

Re: [PATCH v3 03/12] mm: Make HPAGE_PXD_* macros even if !THP

2024-03-22 Thread Peter Xu
On Fri, Mar 22, 2024 at 10:14:56AM -0700, SeongJae Park wrote:
> Hi Peter,

Hi, SeongJae,

> 
> On Thu, 21 Mar 2024 18:07:53 -0400 pet...@redhat.com wrote:
> 
> > From: Peter Xu 
> > 
> > These macros can be helpful when we plan to merge hugetlb code into generic
> > code.  Move them out and define them even if !THP.
> > 
> > We actually already defined HPAGE_PMD_NR for other reasons even if !THP.
> > Reorganize these macros.
> > 
> > Reviewed-by: Christoph Hellwig 
> > Reviewed-by: Jason Gunthorpe 
> > Signed-off-by: Peter Xu 
> > ---
> >  include/linux/huge_mm.h | 17 ++---
> >  1 file changed, 6 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index de0c89105076..3bcdfc7e5d57 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -64,9 +64,6 @@ ssize_t single_hugepage_flag_show(struct kobject *kobj,
> >   enum transparent_hugepage_flag flag);
> >  extern struct kobj_attribute shmem_enabled_attr;
> >  
> > -#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> > -#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> > -
> >  /*
> >   * Mask of all large folio orders supported for anonymous THP; all orders 
> > up to
> >   * and including PMD_ORDER, except order-0 (which is not "huge") and 
> > order-1
> > @@ -87,14 +84,19 @@ extern struct kobj_attribute shmem_enabled_attr;
> >  #define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, 
> > enforce_sysfs, order) \
> > (!!thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, 
> > BIT(order)))
> >  
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  #define HPAGE_PMD_SHIFT PMD_SHIFT
> >  #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
> >  #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
> > +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> > +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> >  
> >  #define HPAGE_PUD_SHIFT PUD_SHIFT
> >  #define HPAGE_PUD_SIZE ((1UL) << HPAGE_PUD_SHIFT)
> >  #define HPAGE_PUD_MASK (~(HPAGE_PUD_SIZE - 1))
> > +#define HPAGE_PUD_ORDER (HPAGE_PUD_SHIFT-PAGE_SHIFT)
> > +#define HPAGE_PUD_NR (1<<HPAGE_PUD_ORDER)
> 
> I just found latest mm-unstable fails one of my build configurations[1] with
> below error.  'git bisect' says this is the first patch set started the
> failure.  I haven't looked in deep, but just reporting first.
> 
> In file included from .../include/linux/mm.h:1115,
>  from .../mm/vmstat.c:14:
> .../mm/vmstat.c: In function 'zoneinfo_show_print':
> .../include/linux/huge_mm.h:87:25: error: 'PMD_SHIFT' undeclared (first 
> use in this function); did you mean 'PUD_SHIFT'?
>87 | #define HPAGE_PMD_SHIFT PMD_SHIFT
>   | ^
> 
> [1] https://github.com/awslabs/damon-tests/blob/next/corr/tests/build_m68k.sh

Apologies for the issue.  This is caused by !CONFIG_MMU, I think.

I thought this would be fairly easy to fix by putting these macros under
CONFIG_PGTABLE_HAS_HUGE_LEAVES, however when doing so I found another issue
that violates this rule: mm/vmstat.c references HPAGE_PMD_NR even though
vmstat_item_print_in_thp() only meant to guard it with
CONFIG_TRANSPARENT_HUGEPAGE.

/home/peterx/git/linux/mm/vmstat.c: In function 'zoneinfo_show_print':
/home/peterx/git/linux/mm/vmstat.c:1689:42: error: 'HPAGE_PMD_NR' undeclared 
(first use in this function)
 1689 | pages /= HPAGE_PMD_NR;
  |  ^~~~
/home/peterx/git/linux/mm/vmstat.c:1689:42: note: each undeclared identifier is 
reported only once for each function it appears in
  CC  drivers/tty/tty_port.o
/home/peterx/git/linux/mm/vmstat.c: In function 'vmstat_start':
/home/peterx/git/linux/mm/vmstat.c:1822:33: error: 'HPAGE_PMD_NR' undeclared 
(first use in this function)
 1822 | v[i] /= HPAGE_PMD_NR;
  | ^~~~
make[4]: *** [/home/peterx/git/linux/scripts/Makefile.build:243: mm/vmstat.o] 
Error 1

static __always_inline bool vmstat_item_print_in_thp(enum node_stat_item item)
{
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
return false;
...
}

I think the problem is that vmstat_item_print_in_thp() uses IS_ENABLED(),
which won't stop the compiler from parsing the "if" body, so it'll still
need the HPAGE_PMD_NR macro to be defined.

It means I may need to further change vmstat_item_print_in_thp() to make
this work cleanly, by properly switching to an #ifdef.
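
For the record, the difference is roughly as below (a standalone sketch,
not kernel code; CONFIG_THP and the value 512 are made-up stand-ins for
CONFIG_TRANSPARENT_HUGEPAGE and HPAGE_PMD_NR).  It builds both with and
without -DCONFIG_THP, which the IS_ENABLED() form cannot do here:

/* Standalone sketch, not kernel code. */
#include <stdio.h>
#include <stdbool.h>

#ifdef CONFIG_THP
#define HPAGE_PMD_NR 512
#endif

static bool print_in_thp(void)
{
#ifdef CONFIG_THP
	/* Only parsed when the macro exists, so this always builds. */
	return HPAGE_PMD_NR > 1;
#else
	/*
	 * An IS_ENABLED()-style "if (0) { ... HPAGE_PMD_NR ... }" would
	 * still be parsed here and fail to build without the macro.
	 */
	return false;
#endif
}

int main(void)
{
	printf("print_in_thp() = %d\n", print_in_thp());
	return 0;
}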

For that I'll need to post a formal patch and add people to review.  I'll
keep y

Re: [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-03-22 Thread Peter Xu
Jason,

On Fri, Mar 22, 2024 at 10:30:12AM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2024 at 06:08:02PM -0400, pet...@redhat.com wrote:
> 
> > A quick performance test on an aarch64 VM on M1 chip shows 15% degrade over
> > a tight loop of slow gup after the path switched.  That shouldn't be a
> > problem because slow-gup should not be a hot path for GUP in general: when
> > page is commonly present, fast-gup will already succeed, while when the
> > page is indeed missing and require a follow up page fault, the slow gup
> > degrade will probably buried in the fault paths anyway.  It also explains
> > why slow gup for THP used to be very slow before 57edfcfd3419 ("mm/gup:
> > accelerate thp gup even for "pages != NULL"") lands, the latter not part of
> > a performance analysis but a side benefit.  If the performance will be a
> > concern, we can consider handle CONT_PTE in follow_page().
> 
> I think this is probably fine for the moment, at least for this
> series, as CONT_PTE is still very new.
> 
> But it will need to be optimized. "slow" GUP is the only GUP that is
> used by FOLL_LONGTERM and it still needs to be optimized because you
> can't assume a FOLL_LONGTERM user will be hitting the really slow
> fault path. There are enough important cases where it is just reading
> already populted page tables, and these days, often with large folios.

Ah, I thought FOLL_LONGTERM should work in most cases for fast-gup,
especially for hugetlb, but maybe I missed something?  I do see that devmap
skips fast-gup for LONGTERM, and we also have that writeback issue, but
none of those that I can find applies to hugetlb.  This might indeed be a
problem if we have hugetlb cont_pte pages that constantly fall back to slow
gup.

OTOH, I also agree with you that such batching would be nice to have for
slow-gup; devmap or many fs file mappings (excluding shmem/hugetlb) could
at least benefit from it per the above.  But then that's a more generic
issue to solve; IOW, we still don't do that for !hugetlb cont_pte large
folios, before or after this series.

> 
> Reviewed-by: Jason Gunthorpe 

Thanks!

-- 
Peter Xu



Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-20 Thread Peter Xu
On Wed, Mar 20, 2024 at 05:40:39PM +, Christophe Leroy wrote:
> 
> 
> Le 20/03/2024 à 17:09, Peter Xu a écrit :
> > On Wed, Mar 20, 2024 at 06:16:43AM +, Christophe Leroy wrote:
> >> At the first place that was to get a close fit between hardware
> >> pagetable topology and linux pagetable topology. But obviously we
> >> already stepped back for 512k pages, so let's go one more step aside and
> >> do similar with 8M pages.
> >>
> >> I'll give it a try and see how it goes.
> > 
> > So you're talking about 8M only for 8xx, am I right?
> 
> Yes I am.
> 
> > 
> > There seem to be other PowerPC systems use hugepd.  Is it possible that we
> > convert all hugepd into cont_pte form?
> 
> Indeed.
> 
> Seems like we have hugepd for book3s/64 and for nohash.
> 
> For book3s I don't know, may Aneesh can answer.
> 
> For nohash I think it should be possible because TLB misses are handled 
> by software. Even the e6500 which has a hardware tablewalk falls back on 
> software walk when it is a hugepage IIUC.

It'll be great if I can get some answers here, so that I know the path for
hugepd in general.  I don't want to add any new code into core mm for
something destined to fade away soon.

One option for me is to check a macro for hugepd existence, so that all the
new code only works when hugepd is not supported on the arch.  However,
that'll start to make some PowerPC systems special (which I still tried
hard to avoid, if that wasn't proven in the past..), and meanwhile we'll
also need to keep some generic-mm paths (that I could otherwise remove
along with the new code) only for these hugepd systems.  But it's still
okay to me; it'll just be a matter of when to drop that code, sooner or
later.

Thanks,

-- 
Peter Xu



Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-20 Thread Peter Xu
On Wed, Mar 20, 2024 at 06:16:43AM +, Christophe Leroy wrote:
> At the first place that was to get a close fit between hardware 
> pagetable topology and linux pagetable topology. But obviously we 
> already stepped back for 512k pages, so let's go one more step aside and 
> do similar with 8M pages.
> 
> I'll give it a try and see how it goes.

So you're talking about 8M only for 8xx, am I right?

There seem to be other PowerPC systems use hugepd.  Is it possible that we
convert all hugepd into cont_pte form?

Thanks,

-- 
Peter Xu



Re: [PATCH v2 05/14] mm/sparc: Change pXd_huge() behavior to exclude swap entries

2024-03-19 Thread Peter Xu
On Tue, Mar 19, 2024 at 12:25:39PM +0800, Muchun Song wrote:
> > @@ -409,14 +409,12 @@ pte_t huge_ptep_get_and_clear(struct mm_struct *mm, 
> > unsigned long addr,
> > 
> > int pmd_huge(pmd_t pmd)
> > {
> > -   return !pmd_none(pmd) &&
> > -   (pmd_val(pmd) & (_PAGE_VALID|_PAGE_PMD_HUGE)) != _PAGE_VALID;
> > +   return pmd_leaf(pmd);;
> 
> There is a redundant semicolon in the end.

Will touch it up, thanks.  PS: This will be dropped as a whole in patch 12.

-- 
Peter Xu



Re: [PATCH 12/13] mm/treewide: Remove pXd_huge()

2024-03-14 Thread Peter Xu
On Thu, Mar 14, 2024 at 08:56:59AM +, Christophe Leroy wrote:
> 
> 
> Le 13/03/2024 à 22:47, pet...@redhat.com a écrit :
> > From: Peter Xu 
> > 
> > This API is not used anymore, drop it for the whole tree.
> > 
> > Signed-off-by: Peter Xu 
> > ---
> >   arch/arm/mm/Makefile  |  1 -
> >   arch/arm/mm/hugetlbpage.c | 29 ---
> >   arch/arm64/mm/hugetlbpage.c   | 10 ---
> >   arch/loongarch/mm/hugetlbpage.c   | 10 ---
> >   arch/mips/include/asm/pgtable-32.h|  2 +-
> >   arch/mips/include/asm/pgtable-64.h|  2 +-
> >   arch/mips/mm/hugetlbpage.c| 10 ---
> >   arch/parisc/mm/hugetlbpage.c  | 11 ---
> >   .../include/asm/book3s/64/pgtable-4k.h| 10 ---
> >   .../include/asm/book3s/64/pgtable-64k.h   | 25 
> >   arch/powerpc/include/asm/nohash/pgtable.h | 10 ---
> >   arch/riscv/mm/hugetlbpage.c   | 10 ---
> >   arch/s390/mm/hugetlbpage.c| 10 ---
> >   arch/sh/mm/hugetlbpage.c  | 10 ---
> >   arch/sparc/mm/hugetlbpage.c   | 10 ---
> >   arch/x86/mm/hugetlbpage.c | 16 --
> >   include/linux/hugetlb.h   | 24 ---
> >   17 files changed, 2 insertions(+), 198 deletions(-)
> >   delete mode 100644 arch/arm/mm/hugetlbpage.c
> > 
> 
> > diff --git a/arch/mips/include/asm/pgtable-32.h 
> > b/arch/mips/include/asm/pgtable-32.h
> > index 0e196650f4f4..92b7591aac2a 100644
> > --- a/arch/mips/include/asm/pgtable-32.h
> > +++ b/arch/mips/include/asm/pgtable-32.h
> > @@ -129,7 +129,7 @@ static inline int pmd_none(pmd_t pmd)
> >   static inline int pmd_bad(pmd_t pmd)
> >   {
> >   #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> > -   /* pmd_huge(pmd) but inline */
> > +   /* pmd_leaf(pmd) but inline */
> 
> Shouldn't this comment have been changed in patch 11 ?

IMHO it's fine to be here, as this is the patch to finally drop _huge().
Patch 11 only converts the callers to use _leaf()s.  So this comment is
still valid until this patch, because this patch removes that definition.

> 
> > if (unlikely(pmd_val(pmd) & _PAGE_HUGE))
> 
> Unlike pmd_huge() which is an outline function, pmd_leaf() is a macro so 
> it could be used here instead of open coping.

I worry it will break things, as pmd_leaf() can sometimes be defined after
the arch *pgtable.h headers.  So I avoided touching anything except what
I'm confident about.  I had a feeling it was open-coded for a reason
similar to the old _huge().

> 
> > return 0;
> >   #endif
> > diff --git a/arch/mips/include/asm/pgtable-64.h 
> > b/arch/mips/include/asm/pgtable-64.h
> > index 20ca48c1b606..7c28510b3768 100644
> > --- a/arch/mips/include/asm/pgtable-64.h
> > +++ b/arch/mips/include/asm/pgtable-64.h
> > @@ -245,7 +245,7 @@ static inline int pmd_none(pmd_t pmd)
> >   static inline int pmd_bad(pmd_t pmd)
> >   {
> >   #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
> > -   /* pmd_huge(pmd) but inline */
> > +   /* pmd_leaf(pmd) but inline */
> 
> Same
> 
> > if (unlikely(pmd_val(pmd) & _PAGE_HUGE))
> 
> Same
> 
> > return 0;
> >   #endif
> 
> > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h 
> > b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> > index 2fce3498b000..579a7153857f 100644
> > --- a/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/pgtable-64k.h
> > @@ -4,31 +4,6 @@
> >   
> >   #ifndef __ASSEMBLY__
> >   #ifdef CONFIG_HUGETLB_PAGE
> > -/*
> > - * We have PGD_INDEX_SIZ = 12 and PTE_INDEX_SIZE = 8, so that we can have
> > - * 16GB hugepage pte in PGD and 16MB hugepage pte at PMD;
> > - *
> > - * Defined in such a way that we can optimize away code block at build time
> > - * if CONFIG_HUGETLB_PAGE=n.
> > - *
> > - * returns true for pmd migration entries, THP, devmap, hugetlb
> > - * But compile time dependent on CONFIG_HUGETLB_PAGE
> > - */
> 
> Should we keep this comment somewhere for documentation ?

The 2nd/3rd paragraphs are definitely obsolete, so should be dropped.

OTOH, I'm not sure how much that will help if e.g. I move it over to
pmd_leaf(): a check against cpu_to_be64(_PAGE_PTE) is about as simple as an
implementation can get, and to me it explains itself even without a comment..

I also don't fully digest why that 1st paragraph discusses PGD entrie

Re: [PATCH 11/13] mm/treewide: Replace pXd_huge() with pXd_leaf()

2024-03-14 Thread Peter Xu
On Thu, Mar 14, 2024 at 08:50:20AM +, Christophe Leroy wrote:
> 
> 
> Le 13/03/2024 à 22:47, pet...@redhat.com a écrit :
> > From: Peter Xu 
> > 
> > Now after we're sure all pXd_huge() definitions are the same as pXd_leaf(),
> > reuse it.  Luckily, pXd_huge() isn't widely used.
> > 
> > Signed-off-by: Peter Xu 
> > ---
> >   arch/arm/include/asm/pgtable-3level.h | 2 +-
> >   arch/arm64/include/asm/pgtable.h  | 2 +-
> >   arch/arm64/mm/hugetlbpage.c   | 4 ++--
> >   arch/loongarch/mm/hugetlbpage.c   | 2 +-
> >   arch/mips/mm/tlb-r4k.c| 2 +-
> >   arch/powerpc/mm/pgtable_64.c  | 6 +++---
> >   arch/x86/mm/pgtable.c | 4 ++--
> >   mm/gup.c  | 4 ++--
> >   mm/hmm.c  | 2 +-
> >   mm/memory.c   | 2 +-
> >   10 files changed, 15 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/arm/include/asm/pgtable-3level.h 
> > b/arch/arm/include/asm/pgtable-3level.h
> > index e7aecbef75c9..9e3c44f0aea2 100644
> > --- a/arch/arm/include/asm/pgtable-3level.h
> > +++ b/arch/arm/include/asm/pgtable-3level.h
> > @@ -190,7 +190,7 @@ static inline pte_t pte_mkspecial(pte_t pte)
> >   #define pmd_dirty(pmd)(pmd_isset((pmd), L_PMD_SECT_DIRTY))
> >   
> >   #define pmd_hugewillfault(pmd)(!pmd_young(pmd) || !pmd_write(pmd))
> > -#define pmd_thp_or_huge(pmd)   (pmd_huge(pmd) || pmd_trans_huge(pmd))
> > +#define pmd_thp_or_huge(pmd)   (pmd_leaf(pmd) || pmd_trans_huge(pmd))
> 
> Previous patch said pmd_trans_huge() implies pmd_leaf().

Ah, here I remember I kept this arm definition because I think we should
add a patch to drop pmd_thp_or_huge() completely.  If you don't mind, I can
add one more patch instead of doing it here.  Then I keep this patch purely
as a replacement patch without further arch cleanups.

> 
> Or is that only for GUP ?

I think it should apply to all.

> 
> >   
> >   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >   #define pmd_trans_huge(pmd)   (pmd_val(pmd) && !pmd_table(pmd))
> 
> 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index c95b9ec5d95f..93aebd9cc130 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -429,7 +429,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long 
> > start, unsigned long end,
> > return hmm_vma_walk_hole(start, end, -1, walk);
> > }
> >   
> > -   if (pud_huge(pud) && pud_devmap(pud)) {
> > +   if (pud_leaf(pud) && pud_devmap(pud)) {
> 
> Didn't previous patch say devmap implies leaf ? Or is it only for GUP ?

This is an extra safety check that I didn't remove.  Devmap used separate
bits even though I'm not clear on why.  It should still imply a leaf though.

Thanks,

> 
> > unsigned long i, npages, pfn;
> > unsigned int required_fault;
> > unsigned long *hmm_pfns;
> 
> 

-- 
Peter Xu



Re: [PATCH 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-14 Thread Peter Xu
On Thu, Mar 14, 2024 at 08:45:34AM +, Christophe Leroy wrote:
> 
> 
> Le 13/03/2024 à 22:47, pet...@redhat.com a écrit :
> > From: Peter Xu 
> > 
> > PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
> > constantly returns 0 for hash MMUs.  As Michael Ellerman pointed out [1],
> > it is safe to check _PAGE_PTE on hash MMUs, as the bit will never be set so
> > it will keep returning false.
> > 
> > As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to create
> > such huge mappings for 4K hash MMUs.  Meanwhile, the major powerpc hugetlb
> > pgtable walker __find_linux_pte() already used pXd_leaf() to check hugetlb
> > mappings.
> > 
> > The goal should be that we will have one API pXd_leaf() to detect all kinds
> > of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather than
> > pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
> 
> All kinds of huge mappings ?
> 
> pXd_leaf() will detect only leaf mappings (like pXd_huge() ). There are 
> also huge mappings through hugepd. On powerpc 8xx we have 8M huge pages 
> and 512k huge pages. A PGD entry covers 4M so pgd_leaf() won't report 
> those huge pages.

Ah yes, I should always mention this is in the context of leaf huge pages
only.  Do the examples you provided all fall into the hugepd category?  If
so I can reword the commit message as:

As a reference, __p[mu]d_mkhuge() will trigger a BUG_ON trying to
create such huge mappings for 4K hash MMUs.  Meanwhile, the major
powerpc hugetlb pgtable walker __find_linux_pte() already used
pXd_leaf() to check leaf hugetlb mappings.

The goal should be that we will have one API pXd_leaf() to detect
all kinds of huge mappings except hugepd.  AFAICT we need to use
the pXd_leaf() impl (rather than pXd_huge() ones) to make sure
ie. THPs on hash MMU will also return true.

Does this look good to you?

Thanks,

-- 
Peter Xu



Re: [PATCH RFC 00/13] mm/treewide: Remove pXd_huge() API

2024-03-12 Thread Peter Xu
Hi, Christophe,

On Mon, Mar 11, 2024 at 09:58:47AM +, Christophe Leroy wrote:
> Hi Peter, and nice job you are doing in cleaning up things around _huge 
> stuff.

Thanks.  I appreciate your help along the way on Power.

> 
> One thing that might be worth looking at also at some point is the mess 
> around pmd_clear_huge() and pud_clear_huge().
> 
> I tried to clean things up with commit c742199a014d ("mm/pgtable: add 
> stubs for {pmd/pub}_{set/clear}_huge") but it was reverted because of 
> arm64 by commit d8a719059b9d ("Revert "mm/pgtable: add stubs for 
> {pmd/pub}_{set/clear}_huge"")
> 
> So now powerpc/8xx has to implement pmd_clear_huge() and 
> pud_clear_huge() allthough 8xx page hierarchy only has 2 levels.

Those are so far off my radar, as my focus right now is still on the
hugetlbfs-relevant side of things, while kernel mappings are not yet
directly involved in hugetlbfs, even though they're still huge mappings.

It's a pity that it broke arm and got reverted, as that looks like a
worthwhile cleanup if it's ever possible.  I tend to agree with you that
for 3lvl we should define pgd_huge*() instead of pud_huge*(), so it looks
like the only way to provide such a treewide clean API is to properly
define those APIs for aarch64, with different pud helpers for the 3- and
4-level cases.  But I confess I don't think I've fully digested all the
bits.

Thanks,

-- 
Peter Xu



Re: [PATCH RFC 01/13] mm/hmm: Process pud swap entry without pud_huge()

2024-03-07 Thread Peter Xu
On Thu, Mar 07, 2024 at 02:12:33PM -0400, Jason Gunthorpe wrote:
> On Wed, Mar 06, 2024 at 06:41:35PM +0800, pet...@redhat.com wrote:
> > From: Peter Xu 
> > 
> > Swap pud entries do not always return true for pud_huge() for all archs.
> > x86 and sparc (so far) allow it, but all the rest do not accept a swap
> > entry to be reported as pud_huge().  So it's not safe to check swap entries
> > within pud_huge().  Check swap entries before pud_huge(), so it should be
> > always safe.
> > 
> > This is the only place in the kernel that (IMHO, wrongly) relies on
> > pud_huge() to return true on pud swap entries.  The plan is to cleanup
> > pXd_huge() to only report non-swap mappings for all archs.
> > 
> > Cc: Alistair Popple 
> > Signed-off-by: Peter Xu 
> > ---
> >  mm/hmm.c | 7 +--
> >  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> Reviewed-by: Jason Gunthorpe 
> 
> > @@ -424,7 +424,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long 
> > start, unsigned long end,
> > walk->action = ACTION_CONTINUE;
> >  
> > pud = READ_ONCE(*pudp);
> > -   if (pud_none(pud)) {
> > +   if (pud_none(pud) || !pud_present(pud)) {
> 
> Isn't this a tautology? pud_none always implies !present() ?

Hmm yes, I think so; afaict it should be "all = none + swap + present".  I
still remember missing that once before; it's not always obvious when
preparing such patches. :( I'll simplify this here and also in patch 3.
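
Presumably it'll end up as simple as below (untested sketch on top of the
hunk quoted above):

 	pud = READ_ONCE(*pudp);
-	if (pud_none(pud) || !pud_present(pud)) {
+	if (!pud_present(pud)) {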

Thanks,

-- 
Peter Xu



Re: [PATCH RFC 09/13] mm/powerpc: Redefine pXd_huge() with pXd_leaf()

2024-03-06 Thread Peter Xu
On Wed, Mar 06, 2024 at 11:56:56PM +1100, Michael Ellerman wrote:
> pet...@redhat.com writes:
> > From: Peter Xu 
> >
> > PowerPC book3s 4K mostly has the same definition on both, except pXd_huge()
> > constantly returns 0 for hash MMUs.  AFAICT that is fine to be removed,
> > because pXd_huge() reflects a hugetlb entry, while it's own hugetlb pgtable
> > lookup function (__find_linux_pte() shared by all powerpc code) already use
> > pXd_leaf() irrelevant of the MMU type.  It means pXd_leaf() should work all
> > fine with hash MMU pgtables or something could already went wrong.
> 
> Yes I think that's correct.
> 
> 4K Hash MMU doesn't support any hugepage size at PMD or PUD level (the
> geometry is wrong), so pmd/pud_huge() were written with that in mind,
> ie. they are hard coded to return false.
> 
> But it should be OK to use pmd/pud_leaf(), they will actually look for
> _PAGE_PTE, but it should never be set for 4K Hash.
> 
> See eg. arch/powerpc/include/asm/book3s/64/hash-4k.h:
> 
> static inline pmd_t hash__pmd_mkhuge(pmd_t pmd)
> {
>   BUG();
>   return pmd;
> }

Good to get confirmation on this, thanks, Michael.  These explanations also
look better than what I wrote, I'll amend the commit message.

> 
> > The goal should be that we will have one API pXd_leaf() to detect all kinds
> > of huge mappings.  AFAICT we need to use the pXd_leaf() impl (rather than
> > pXd_huge() ones) to make sure ie. THPs on hash MMU will also return true.
> >
> > This helps to simplify a follow up patch to drop pXd_huge() treewide.
> >
> > Cc: Michael Ellerman 
> > Cc: Nicholas Piggin 
> > Cc: Christophe Leroy 
> > Cc: "Aneesh Kumar K.V" 
> > Cc: "Naveen N. Rao" 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Signed-off-by: Peter Xu 
> > ---
> >  arch/powerpc/include/asm/book3s/64/pgtable-4k.h | 14 ++
> >  1 file changed, 2 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h 
> > b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
> > index 48f21820afe2..92545981bb49 100644
> > --- a/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/pgtable-4k.h
> > @@ -8,22 +8,12 @@
> >  #ifdef CONFIG_HUGETLB_PAGE
> >  static inline int pmd_huge(pmd_t pmd)
> >  {
> > -   /*
> > -* leaf pte for huge page
> > -*/
> > -   if (radix_enabled())
> > -   return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> > -   return 0;
> > +   return pmd_leaf(pmd);
> >  }
> >  
> >  static inline int pud_huge(pud_t pud)
> >  {
> > -   /*
> > -* leaf pte for huge page
> > -*/
> > -   if (radix_enabled())
> > -   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> > -   return 0;
> > +   return pud_leaf(pud);
> >  }
> 
> This doesn't actually compile though.
> 
>   arch/powerpc/include/asm/book3s/64/pgtable-4k.h:11:16: error: implicit 
> declaration of function ‘pmd_leaf’; did you mean ‘pgd_clear’? 
> [-Werror=implicit-function-declaration]
> 
> etc.
> 
> To make it compile we'd need to relocate the pmd/pud_leaf() definitions:
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index df66dce8306f..fd7180fded75 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -262,6 +262,18 @@ extern unsigned long __kernel_io_end;
> 
>  extern struct page *vmemmap;
>  extern unsigned long pci_io_base;
> +
> +#define pmd_leaf pmd_leaf
> +static inline bool pmd_leaf(pmd_t pmd)
> +{
> +   return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> +}
> +
> +#define pud_leaf pud_leaf
> +static inline bool pud_leaf(pud_t pud)
> +{
> +   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> +}
>  #endif /* __ASSEMBLY__ */
> 
>  #include 
> @@ -1436,20 +1448,5 @@ static inline bool is_pte_rw_upgrade(unsigned long 
> old_val, unsigned long new_va
> return false;
>  }
> 
> -/*
> - * Like pmd_huge(), but works regardless of config options
> - */
> -#define pmd_leaf pmd_leaf
> -static inline bool pmd_leaf(pmd_t pmd)
> -{
> -   return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
> -}
> -
> -#define pud_leaf pud_leaf
> -static inline bool pud_leaf(pud_t pud)
> -{
> -   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
> -}
> -
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */

Thanks for the help, I'll fi

Re: [PATCH v2 4/7] mm/x86: Drop two unnecessary pud_leaf() definitions

2024-03-04 Thread Peter Xu
On Mon, Mar 04, 2024 at 09:03:34AM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 29, 2024 at 04:42:55PM +0800, pet...@redhat.com wrote:
> > From: Peter Xu 
> > 
> > pud_leaf() has a fallback macro defined in include/linux/pgtable.h already.
> > Drop the extra two for x86.
> > 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: Borislav Petkov 
> > Cc: Dave Hansen 
> > Cc: x...@kernel.org
> > Signed-off-by: Peter Xu 
> > ---
> >  arch/x86/include/asm/pgtable.h  | 1 -
> >  include/asm-generic/pgtable-nopmd.h | 1 -
> >  2 files changed, 2 deletions(-)
> 
> Reviewed-by: Jason Gunthorpe 
> 
> > @@ -31,7 +31,6 @@ static inline int pud_none(pud_t pud) { 
> > return 0; }
> >  static inline int pud_bad(pud_t pud)   { return 0; }
> >  static inline int pud_present(pud_t pud)   { return 1; }
> >  static inline int pud_user(pud_t pud)  { return 0; }
> > -static inline int pud_leaf(pud_t pud)  { return 0; }
> 
> It would be nice to have a final patch making the signatures
> consistent on all the arch inlines, it should return bool not int.

Makes sense, will do, thanks.

-- 
Peter Xu



Re: [PATCH 5/5] mm/treewide: Drop pXd_large()

2024-02-28 Thread Peter Xu
On Thu, Feb 29, 2024 at 01:17:36PM +0800, kernel test robot wrote:
> >> arch/x86/include/asm/pgtable.h:1099:19: error: redefinition of 'pud_leaf'
> 1099 | static inline int pud_leaf(pud_t pud)
>  |   ^
>include/asm-generic/pgtable-nopmd.h:34:19: note: previous definition is 
> here
>   34 | static inline int pud_leaf(pud_t pud)   { return 0; }
>  |   ^

This is CONFIG_PGTABLE_LEVELS=2.  IIUC patch 5 didn't do anything wrong,
but renaming pud_large() exposed this conflict; in the past it was a silent
conflict between the old pud_leaf() macro and the pud_leaf() definition,
where the macro could silently overwrite the function.

IIUC such pud_leaf() is not needed as we have a global fallback.  I'll add
a pre-requisite patch to remove such pXd_leaf() definitions.

-- 
Peter Xu



Re: [PATCH 0/5] mm/treewide: Replace pXd_large() with pXd_leaf()

2024-02-28 Thread Peter Xu
On Wed, Feb 28, 2024 at 09:50:52AM +, Christophe Leroy wrote:
> Le 28/02/2024 à 09:53, pet...@redhat.com a écrit :
> > From: Peter Xu 
> > 
> > [based on latest akpm/mm-unstable, commit 1274e7646240]
> > 
> > These two APIs are mostly always the same.  It's confusing to have both of
> > them.  Merge them into one.  Here I used pXd_leaf() only because pXd_leaf()
> > is a global API which is always defined, while pXd_large() is not.
> > 
> > We have yet one more API that is similar which is pXd_huge(), but that's
> > even trickier, so let's do it step by step.
> > 
> > Some cautions are needed on either x86 or ppc: x86 is currently the only
> > user of p4d_large(), while ppc used to define pXd_large() only with THP,
> > while it is not the case for pXd_leaf().  For the rest archs, afaict
> > they're 100% identical.
> 
> Maybe would also be a good opportunity to replace pmd_is_leaf() by 
> pmd_leaf() and the same for pud_is_leaf()

Sure; while I wait for some more comments, I can add one on top when I
repost.  Thanks,

-- 
Peter Xu



Re: [PATCH v2 03/13] mm: Provide generic pmd_thp_or_huge()

2024-02-22 Thread Peter Xu
On Wed, Feb 21, 2024 at 08:57:53AM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 21, 2024 at 05:37:37PM +0800, Peter Xu wrote:
> > On Mon, Jan 15, 2024 at 01:55:51PM -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 03, 2024 at 05:14:13PM +0800, pet...@redhat.com wrote:
> > > > From: Peter Xu 
> > > > 
> > > > ARM defines pmd_thp_or_huge(), detecting either a THP or a huge PMD.  It
> > > > can be a helpful helper if we want to merge more THP and hugetlb code
> > > > paths.  Make it a generic default implementation, only exist when
> > > > CONFIG_MMU.  Arch can overwrite it by defining its own version.
> > > > 
> > > > For example, ARM's pgtable-2level.h defines it to always return false.
> > > > 
> > > > Keep the macro declared with all config, it should be optimized to a 
> > > > false
> > > > anyway if !THP && !HUGETLB.
> > > > 
> > > > Signed-off-by: Peter Xu 
> > > > ---
> > > >  include/linux/pgtable.h | 4 
> > > >  mm/gup.c| 3 +--
> > > >  2 files changed, 5 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > > index 466cf477551a..2b42e95a4e3a 100644
> > > > --- a/include/linux/pgtable.h
> > > > +++ b/include/linux/pgtable.h
> > > > @@ -1362,6 +1362,10 @@ static inline int pmd_write(pmd_t pmd)
> > > >  #endif /* pmd_write */
> > > >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > > >  
> > > > +#ifndef pmd_thp_or_huge
> > > > +#define pmd_thp_or_huge(pmd)   (pmd_huge(pmd) || pmd_trans_huge(pmd))
> > > > +#endif
> > > 
> > > Why not just use pmd_leaf() ?
> > > 
> > > This GUP case seems to me exactly like what pmd_leaf() should really
> > > do and be used for..
> > 
> > I think I mostly agree with you, and these APIs are indeed confusing.  IMHO
> > the challenge is about the risk of breaking others on small changes in the
> > details where evil resides.
> 
> These APIs are super confusing, which is why I brought it up.. Adding
> even more subtly different variations is not helping.
> 
> I think pmd_leaf means the entry is present and refers to a physical
> page not another radix level.
> 
> > > eg x86 does:
> > > 
> > > #define pmd_leaf  pmd_large
> > > static inline int pmd_large(pmd_t pte)
> > >   return pmd_flags(pte) & _PAGE_PSE;
> > > 
> > > static inline int pmd_trans_huge(pmd_t pmd)
> > >   return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
> > > 
> > > int pmd_huge(pmd_t pmd)
> > > return !pmd_none(pmd) &&
> > > (pmd_val(pmd) & (_PAGE_PRESENT|_PAGE_PSE)) != 
> > > _PAGE_PRESENT;
> > 
> > For example, here I don't think it's strictly pmd_leaf()? As pmd_huge()
> > will return true if PRESENT=0 && PSE=0 (as long as none pte ruled out
> > first), while pmd_leaf() will return false; I think that came from
> > cbef8478bee5. 
> 
> Yikes, but do you even want to handle non-present entries in GUP
> world? Isn't everything gated by !present in the first place?

I am just as confused, indeed.

> 
> > Besides that, there're also other cases where it's not clear of such direct
> > replacement, not until further investigated.  E.g., arm-3level has:
> > 
> > #define pmd_leaf(pmd)   pmd_sect(pmd)
> > #define pmd_sect(pmd)   ((pmd_val(pmd) & PMD_TYPE_MASK) == \
> >  PMD_TYPE_SECT)
> > #define PMD_TYPE_SECT   (_AT(pmdval_t, 1) << 0)
> > 
> > While pmd_huge() there relies on PMD_TABLE_BIT ()
> 
> I looked at tht, it looked OK.. 
> 
> #define PMD_TYPE_MASK   (_AT(pmdval_t, 3) << 0)
> #define PMD_TABLE_BIT   (_AT(pmdval_t, 1) << 1)
> 
> It is the same stuff, just a little confusingly written

True, my eyes decided to skip all the shifts. :-( Ok then, let me see
whether I can take a stab at the pXd_huge() mess.

Thanks,

-- 
Peter Xu



Re: [PATCH v2 06/13] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2024-02-21 Thread Peter Xu
On Mon, Jan 15, 2024 at 02:37:48PM -0400, Jason Gunthorpe wrote:
> > Drop that check, not only because it'll never be true for hugepd per any
> > known plan, but also it paves way for reusing the function outside
> > fast-gup.
> 
> I didn't see any other caller of this function in this series? When
> does this re-use happen??

It's reused in patch 12 ("mm/gup: Handle hugepd for follow_page()").

Thanks,

-- 
Peter Xu



Re: [PATCH v2 10/13] mm/gup: Handle huge pud for follow_pud_mask()

2024-02-21 Thread Peter Xu
On Mon, Jan 15, 2024 at 02:49:00PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 05:14:20PM +0800, pet...@redhat.com wrote:
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 63845b3ec44f..760406180222 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -525,6 +525,70 @@ static struct page *no_page_table(struct 
> > vm_area_struct *vma,
> > return NULL;
> >  }
> >  
> > +#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
> > +static struct page *follow_huge_pud(struct vm_area_struct *vma,
> > +   unsigned long addr, pud_t *pudp,
> > +   int flags, struct follow_page_context *ctx)
> > +{
> > +   struct mm_struct *mm = vma->vm_mm;
> > +   struct page *page;
> > +   pud_t pud = *pudp;
> > +   unsigned long pfn = pud_pfn(pud);
> > +   int ret;
> > +
> > +   assert_spin_locked(pud_lockptr(mm, pudp));
> > +
> > +   if ((flags & FOLL_WRITE) && !pud_write(pud))
> > +   return NULL;
> > +
> > +   if (!pud_present(pud))
> > +   return NULL;
> > +
> > +   pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
> > +
> > +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> > +   if (pud_devmap(pud)) {
> 
> Can this use IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) ?

Sure.

> 
> > +   /*
> > +* device mapped pages can only be returned if the caller
> > +* will manage the page reference count.
> > +*
> > +* At least one of FOLL_GET | FOLL_PIN must be set, so
> > +* assert that here:
> > +*/
> > +   if (!(flags & (FOLL_GET | FOLL_PIN)))
> > +   return ERR_PTR(-EEXIST);
> > +
> > +   if (flags & FOLL_TOUCH)
> > +   touch_pud(vma, addr, pudp, flags & FOLL_WRITE);
> > +
> > +   ctx->pgmap = get_dev_pagemap(pfn, ctx->pgmap);
> > +   if (!ctx->pgmap)
> > +   return ERR_PTR(-EFAULT);
> > +   }
> > +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> > +   page = pfn_to_page(pfn);
> > +
> > +   if (!pud_devmap(pud) && !pud_write(pud) &&
> > +   gup_must_unshare(vma, flags, page))
> > +   return ERR_PTR(-EMLINK);
> > +
> > +   ret = try_grab_page(page, flags);
> > +   if (ret)
> > +   page = ERR_PTR(ret);
> > +   else
> > +   ctx->page_mask = HPAGE_PUD_NR - 1;
> > +
> > +   return page;
> > +}
> > +#else  /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
> > +static struct page *follow_huge_pud(struct vm_area_struct *vma,
> > +   unsigned long addr, pud_t *pudp,
> > +   int flags, struct follow_page_context *ctx)
> > +{
> > +   return NULL;
> > +}
> > +#endif /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
> > +
> >  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long 
> > address,
> > pte_t *pte, unsigned int flags)
> >  {
> > @@ -760,11 +824,11 @@ static struct page *follow_pud_mask(struct 
> > vm_area_struct *vma,
> >  
> > pudp = pud_offset(p4dp, address);
> > pud = READ_ONCE(*pudp);
> > -   if (pud_none(pud))
> > +   if (pud_none(pud) || !pud_present(pud))
> > return no_page_table(vma, flags, address);
> 
> Isn't 'pud_none() || !pud_present()' redundent? A none pud is
> non-present, by definition?

Hmm yes, seems redundant.  Let me drop it.

> 
> > -   if (pud_devmap(pud)) {
> > +   if (pud_huge(pud)) {
> > ptl = pud_lock(mm, pudp);
> > -   page = follow_devmap_pud(vma, address, pudp, flags, 
> > >pgmap);
> > +   page = follow_huge_pud(vma, address, pudp, flags, ctx);
> > spin_unlock(ptl);
> > if (page)
> > return page;
> 
> Otherwise it looks OK to me
> 
> Reviewed-by: Jason Gunthorpe 

Thanks!

-- 
Peter Xu



Re: [PATCH v2 03/13] mm: Provide generic pmd_thp_or_huge()

2024-02-21 Thread Peter Xu
On Mon, Jan 15, 2024 at 01:55:51PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 05:14:13PM +0800, pet...@redhat.com wrote:
> > From: Peter Xu 
> > 
> > ARM defines pmd_thp_or_huge(), detecting either a THP or a huge PMD.  It
> > can be a helpful helper if we want to merge more THP and hugetlb code
> > paths.  Make it a generic default implementation, only exist when
> > CONFIG_MMU.  Arch can overwrite it by defining its own version.
> > 
> > For example, ARM's pgtable-2level.h defines it to always return false.
> > 
> > Keep the macro declared with all config, it should be optimized to a false
> > anyway if !THP && !HUGETLB.
> > 
> > Signed-off-by: Peter Xu 
> > ---
> >  include/linux/pgtable.h | 4 
> >  mm/gup.c| 3 +--
> >  2 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 466cf477551a..2b42e95a4e3a 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1362,6 +1362,10 @@ static inline int pmd_write(pmd_t pmd)
> >  #endif /* pmd_write */
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >  
> > +#ifndef pmd_thp_or_huge
> > +#define pmd_thp_or_huge(pmd)   (pmd_huge(pmd) || pmd_trans_huge(pmd))
> > +#endif
> 
> Why not just use pmd_leaf() ?
> 
> This GUP case seems to me exactly like what pmd_leaf() should really
> do and be used for..

I think I mostly agree with you, and these APIs are indeed confusing.  IMHO
the challenge is about the risk of breaking others on small changes in the
details where evil resides.

> 
> eg x86 does:
> 
> #define pmd_leaf  pmd_large
> static inline int pmd_large(pmd_t pte)
>   return pmd_flags(pte) & _PAGE_PSE;
> 
> static inline int pmd_trans_huge(pmd_t pmd)
>   return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
> 
> int pmd_huge(pmd_t pmd)
> return !pmd_none(pmd) &&
> (pmd_val(pmd) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;

For example, here I don't think it's strictly pmd_leaf()? As pmd_huge()
will return true if PRESENT=0 && PSE=0 (as long as none pte ruled out
first), while pmd_leaf() will return false; I think that came from
cbef8478bee5.  I'm not sure whether that is the best solution, e.g., from a
1st glance it seems better to me to process swap entries separately
(including both migration and poisoned entries)..
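
To illustrate the difference, here is a standalone model (not kernel code;
the entry values are made up, and only the PRESENT/PSE logic mirrors the
x86 helpers quoted above, with _PAGE_PRESENT at bit 0 and _PAGE_PSE at
bit 7):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define _PAGE_PRESENT	(1ULL << 0)
#define _PAGE_PSE	(1ULL << 7)

static bool pmd_none(uint64_t v)  { return v == 0; }
static bool pmd_leaf(uint64_t v)  { return v & _PAGE_PSE; }
static bool pmd_huge(uint64_t v)
{
	return !pmd_none(v) && (v & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT;
}

int main(void)
{
	uint64_t huge_map   = _PAGE_PRESENT | _PAGE_PSE; /* present huge mapping */
	uint64_t swap_entry = 0x1234ULL << 9;            /* !PRESENT, !PSE, !none */

	printf("huge map:   pmd_huge=%d pmd_leaf=%d\n",
	       pmd_huge(huge_map), pmd_leaf(huge_map));
	printf("swap entry: pmd_huge=%d pmd_leaf=%d\n",
	       pmd_huge(swap_entry), pmd_leaf(swap_entry));
	return 0;
}

The second line prints pmd_huge=1 pmd_leaf=0, which is exactly the
disagreement described above for non-present (e.g. migration) entries.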

Sparc has similar things there, which in that case I'm not sure whether a
direct replace is always safe.

Besides that, there're also other cases where it's not clear of such direct
replacement, not until further investigated.  E.g., arm-3level has:

#define pmd_leaf(pmd)   pmd_sect(pmd)
#define pmd_sect(pmd)   ((pmd_val(pmd) & PMD_TYPE_MASK) == \
 PMD_TYPE_SECT)
#define PMD_TYPE_SECT   (_AT(pmdval_t, 1) << 0)

While pmd_huge() there relies on PMD_TABLE_BIT ()

int pmd_huge(pmd_t pmd)
{
return pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT);
}

#define PMD_TABLE_BIT   (_AT(pmdval_t, 1) << 1)

These are just the trivial details that I wanted to avoid touching in this
series, so as to resolve the hugetlb issue separately from the others.

The new pmd_thp_or_huge() is not ideal, but it easily isolates all these
trivial details / evils out of the picture, so that we can tackle them one
by one.  It is strictly an OR of huge||thp, so it's hopefully safe and
won't break anything yet in that regard.

> 
> I spot checked a couple arches and it looks like it holds up.
> 
> Further, it looks to me like this site in GUP is the only core code
> caller..
> 
> So, I'd suggest a small series to go arch by arch and convert the arch
> to use pmd_huge() == pmd_leaf(). Then retire pmd_huge() as a public
> API.
> 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index df83182ec72d..eebae70d2465 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -3004,8 +3004,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, 
> > unsigned long addr, unsigned lo
> > if (!pmd_present(pmd))
> > return 0;
> >  
> > -   if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
> > -pmd_devmap(pmd))) {
> > +   if (unlikely(pmd_thp_or_huge(pmd) || pmd_devmap(pmd))) {
> > /* See gup_pte_range() */
> >         if (pmd_protnone(pmd))
> > return 0;
> 
> And the devmap thing here doesn't make any sense either. The arch
> should ensure that pmd_devmap() implies pmd_leaf(). Since devmap is a
> purely SW construct it almost certainly does already anyhow.

Yep, but only if pmd_leaf() is safe to put here.  A devmap pmd should
indeed always imply pmd_leaf().

Thanks,

-- 
Peter Xu



Re: [PATCH v2 01/13] mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES

2024-01-22 Thread Peter Xu
On Mon, Jan 15, 2024 at 01:37:37PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 05:14:11PM +0800, pet...@redhat.com wrote:
> > From: Peter Xu 
> > 
> > Introduce a config option that will be selected as long as huge leaves are
> > involved in pgtable (thp or hugetlbfs).  It would be useful to mark any
> > code with this new config that can process either hugetlb or thp pages in
> > any level that is higher than pte level.
> > 
> > Signed-off-by: Peter Xu 
> > ---
> >  mm/Kconfig | 3 +++
> >  1 file changed, 3 insertions(+)
> 
> So you mean anything that supports page table entries > PAGE_SIZE ?

Yes.

> 
> Makes sense to me, though maybe add a comment in the kconfig?

Sure I'll add some.

> 
> Reviewed-by: Jason Gunthorpe 

Thanks for your reviews and also positive comments in previous versions,
Jason.  I appreciate that.

I've just been pretty occupied with other tasks recently, so I haven't had
time to revisit this series or the other comments yet.  I'll do so and
reply to the comments / discussions together afterwards.

-- 
Peter Xu



Re: [PATCH v2 00/13] mm/gup: Unify hugetlb, part 2

2024-01-07 Thread Peter Xu
Hi, Christophe,

On Wed, Jan 03, 2024 at 11:14:54AM +, Christophe Leroy wrote:
> > Test Done
> > =
> > 
> > This v1 went through the normal GUP smoke tests over different memory
> > types on archs (using VM instances): x86_64, aarch64, ppc64le.  For
> > aarch64, tested over 64KB cont_pte huge pages.  For ppc64le, tested over
> > 16MB hugepd entries (Power8 hash MMU on 4K base page size).
> > 
> 
> Can you tell how you test ?
> 
> I'm willing to test this series on powerpc 8xx (PPC32).

My apologies, for some reason I totally overlooked this email..

I only tested using run_vmtests.sh, with:

  $ bash ./run_vmtests.sh -t gup_test -a

It should cover most of the GUP tests using the gup_test program.  I think
the ones that matter here are "-H" over either "-U/-b".

For ppc8xx, even though the kernel mapping uses hugepd, I don't expect
anything to change before/after this series, because the code that I
touched (slow gup only) only affects user pages, so it shouldn't change
anything for kernel mappings.  That said, please feel free to smoke-test
whatever type of kernel hugepd mappings; I trust you're the expert on how
to trigger those paths.

Since I have your attention: when working on this series I talked to David
Gibson and learned that hugepd is actually a pure software idea.  IIUC it
means there's no PPC hardware that really understands the hugepd format at
all; it's only a "this is a huge page" hint for Linux.

Considering that it _seems_ to play a similar role to cont_pXX here: do you
think there is any chance hugepd could be implemented similarly to
cont_pXX, or somehow share the code?

For example, if hugepd is recognized only by the Linux kernel itself, maybe
some special pgtable hint could be attached to the cont_* entries, showing
whether it's a "real cont_*" entry or a "hugepd" entry?  IIUC that can be
quite flexible, because if hugepd only works for the hash MMU, no hardware
will even walk that radix table.  But I may be overlooking important things
here.

It'll definitely be great if hugepd can be merged into some existing form
like a generic pgtable (IMHO cont_* is such a case: to software it's the
same as having no cont_* entries, while hardware can accelerate with TLB
hits over larger ranges).  But I may be asking a very silly question here
too, as I may be overlooking very important things.

Thanks,

-- 
Peter Xu



Re: [PATCH 05/13] mm: Introduce vma_pgtable_walk_{begin|end}()

2024-01-01 Thread Peter Xu
On Mon, Dec 25, 2023 at 02:34:48PM +0800, Muchun Song wrote:
> Reviewed-by: Muchun Song 

You're using the old email address here.  Do you want me to also use the
linux.dev one that you suggested I use?

-- 
Peter Xu



Re: [PATCH 03/13] mm: Provide generic pmd_thp_or_huge()

2024-01-01 Thread Peter Xu
On Mon, Dec 25, 2023 at 02:29:53PM +0800, Muchun Song wrote:
> > @@ -1355,6 +1355,10 @@ static inline int pmd_write(pmd_t pmd)
> >   #endif /* pmd_write */
> >   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > +#ifndef pmd_thp_or_huge
> 
> I think it may be the time to rename to pmd_thp_or_hugetlb,
> the "huge" is really confusing. thp is not huge? Actually,
> it is huge. It is better to make it more specific from now on, like
> "hugetlb".

The rename will need to touch ARM code, which I wanted to avoid, see:

arch/arm64/include/asm/pgtable.h:#define pmd_thp_or_huge(pmd)   (pmd_huge(pmd) 
|| pmd_trans_huge(pmd))

So far this series only touches generic code.  Would you mind if I keep
this patch as-is, and leave the renaming for later?

> 
> BTW, please cc me via the new email (muchun.s...@linux.dev) next edition.

Sure.  Thanks for taking a look.

-- 
Peter Xu



Re: [PATCH 00/13] mm/gup: Unify hugetlb, part 2

2023-12-21 Thread Peter Xu
Copy Muchun, which I forgot since the start, sorry.

-- 
Peter Xu



Re: [PATCH 09/13] mm/gup: Cache *pudp in follow_pud_mask()

2023-12-19 Thread Peter Xu
On Tue, Dec 19, 2023 at 11:28:54AM -0500, James Houghton wrote:
> On Tue, Dec 19, 2023 at 2:57 AM  wrote:
> >
> > From: Peter Xu 
> >
> > Introduce "pud_t pud" in the function, so the code won't dereference *pudp
> > multiple time.  Not only because that looks less straightforward, but also
> > because if the dereference really happened, it's not clear whether there
> > can be race to see different *pudp values if it's being modified at the
> > same time.
> >
> > Signed-off-by: Peter Xu 
> > ---
> >  mm/gup.c | 17 +
> >  1 file changed, 9 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 6c0d82fa8cc7..97e87b7a15c3 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -753,26 +753,27 @@ static struct page *follow_pud_mask(struct 
> > vm_area_struct *vma,
> > unsigned int flags,
> > struct follow_page_context *ctx)
> >  {
> > -   pud_t *pud;
> > +   pud_t *pudp, pud;
> > spinlock_t *ptl;
> > struct page *page;
> > struct mm_struct *mm = vma->vm_mm;
> >
> > -   pud = pud_offset(p4dp, address);
> > -   if (pud_none(*pud))
> > +   pudp = pud_offset(p4dp, address);
> > +   pud = *pudp;
> 
> I think you might want a READ_ONCE() on this so that the compiler
> doesn't actually read the pud multiple times.

Makes sense.  I probably only did the "split" part which Christoph
requested, without thinking further than that. :)
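
So presumably the fix on top is just (untested sketch against the hunk
above):

-	pud = *pudp;
+	pud = READ_ONCE(*pudp);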

> 
> > +   if (pud_none(pud))
> > return no_page_table(vma, flags, address);
> > -   if (pud_devmap(*pud)) {
> > -   ptl = pud_lock(mm, pud);
> > -   page = follow_devmap_pud(vma, address, pud, flags, 
> > >pgmap);
> > +   if (pud_devmap(pud)) {
> > +   ptl = pud_lock(mm, pudp);
> > +   page = follow_devmap_pud(vma, address, pudp, flags, 
> > >pgmap);
> > spin_unlock(ptl);
> > if (page)
> > return page;
> > return no_page_table(vma, flags, address);
> > }
> > -   if (unlikely(pud_bad(*pud)))
> > +   if (unlikely(pud_bad(pud)))
> > return no_page_table(vma, flags, address);
> 
> Not your change, but reading this, it's not clear to me that
> `pud_present(*pudp)` (and non-leaf) would necessarily be true at this
> point -- like, I would prefer to see `!pud_present(pud)` instead of
> `pud_bad()`. Thank you for adding that in the next patch. :)

I think the assumption here is that the entry is expected to be a directory
entry when we reach this point, and for a valid directory entry
pud_present() should always return true (as a side note, pud_present() may
not mean "PRESENT bit set"; see m68k's implementation for example).

Yeah, I added that in the next patch; my intention was to check
!pud_present() for all cases without the need to take the pgtable lock,
though.

> 
> Feel free to add:
> 
> Acked-by: James Houghton 

Thanks,

-- 
Peter Xu



Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-12-04 Thread Peter Xu
On Mon, Dec 04, 2023 at 11:11:26AM +, Ryan Roberts wrote:
> To be honest, while I understand pte_cont() and friends, I don't understand
> their relevance (or at least potential future relevance) to GUP?

GUP in general could be smarter and recognize when a pte/pmd is a cont_pte,
fetching the whole range at once if the caller asked for it.  Right now it
loops over each pte/pmd.

Fast-gup is better as it at least doesn't take the pgtable lock; for
cont_pte it looks inside gup_pte_range(), which is good enough, but it'll
still do folio checks for each sub-pte, even though the 2nd+ folio checks
should be mostly the same (ignoring races where the folio changes while the
cont_pte chunk is being processed).

Slow-gup (which is what this series is about so far) doesn't do that
either: for each whole cont_pte entry it'll loop N times, frequently taking
and releasing the pgtable lock.  A smarter slow-gup could fundamentally set
up follow_page_context.page_mask when it sees a cont_pte.  There might be a
challenge on whether holding the head page's refcount would stabilize the
whole folio, but that may be another question to ask.
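
As a rough illustration of the page_mask idea, here is a standalone model
(not the gup code itself; 4K base pages and a 16-entry cont_pte span are
assumptions for the example):

#include <stdio.h>

#define PAGE_SHIFT	12
#define CONT_PTES	16

int main(void)
{
	unsigned long addr = 0x5000UL;		/* page index 5 inside an aligned span */
	unsigned long page_mask = CONT_PTES - 1;
	/* Roughly the arithmetic slow gup could apply against
	 * ctx->page_mask: pages left in the span starting at addr. */
	unsigned long increm = 1 + (~(addr >> PAGE_SHIFT) & page_mask);

	printf("can consume %lu pages in one iteration\n", increm);
	return 0;
}

That prints 11 here, i.e. the remaining 11 pages of the 16-page span can be
consumed in a single iteration instead of 11 separate lock/unlock cycles.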

I think I also overlooked that PPC_8XX has cont_pte support too, so we
actually have three users, not counting potential future archs adding
support to get the same TLB benefit.

Thanks,

-- 
Peter Xu



Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-11-30 Thread Peter Xu
On Fri, Nov 24, 2023 at 11:07:51AM -0500, Peter Xu wrote:
> On Fri, Nov 24, 2023 at 09:06:01AM +, Ryan Roberts wrote:
> > I don't have any micro-benchmarks for GUP though, if that's your question. 
> > Is
> > there an easy-to-use test I can run to get some numbers? I'd be happy to 
> > try it out.
> 
> Thanks Ryan.  Then nothing needs to be tested if gup is not yet touched
> from your side, afaict.  I'll see whether I can provide some rough numbers
> instead in the next post (I'll probably only be able to test it in a VM,
> though, but hopefully that should still reflect mostly the truth).

An update: I finished a round of 64K cont_pte tests; in the slow gup micro
benchmark I see a ~15% perf degradation with this patchset applied, on a VM
on top of an Apple M1.

Frankly that's even less than I expected, considering not only how slow gup
on THP used to be, but also the fact that this is a tight loop over slow
gup, which in normal cases shouldn't happen: "present" ptes normally go
to fast-gup, while !present ones go into a fault following it.  I assume
that's why nobody cared about slow gup for THP before.  I think adding
cont_pte support shouldn't be very hard, but that will include making the
cont_pte idea global just for arm64 and riscv Svnapot.

The current plan is that I'll only add that performance number into my
commit message, as I don't expect any real workload to regress with it.
Maybe a global cont_pte api will be needed at some point, but I don't yet
feel strongly about it for this use case.

Please feel free to raise any concerns otherwise.

Thanks,

-- 
Peter Xu



Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-11-24 Thread Peter Xu
Hi, Christophe, Michael, Aneesh,

[I'll reply altogether here]

On Fri, Nov 24, 2023 at 07:03:11AM +, Christophe Leroy wrote:
> I added that code with commit e17eae2b8399 ("mm: pagewalk: fix walk for 
> hugepage tables") because I was getting crazy displays when dumping 
> /sys/kernel/debug/pagetables
> 
> Huge pages can be used for many things.
> 
> On powerpc 8xx, there are 4 possible page sizes: 4k, 16k, 512k and 8M.
> Each PGD entry addresses a 4M area, so hugepd is used for anything using 
> 8M pages. We could have used regular page tables instead, but it is not 
> worth allocating a 4k table when the HW will only read the first entry.
> 
> At the time being, linear memory mapping is performed with 8M pages, so 
> ptdump_walk_pgd() will walk into huge page directories.
> 
> Also, huge pages can be used in vmalloc() and in vmap(). At the time 
> being we support 512k pages there on the 8xx. 8M pages will be supported 
> once vmalloc() and vmap() support hugepd, as explained in commit 
> a6a8f7c4aa7e ("powerpc/8xx: add support for huge pages on VMAP and VMALLOC")
> 
> So yes as a conclusion hugepd is used outside hugetlbfs, hope it 
> clarifies things.

Yes it does, thanks a lot for all of your replies.

So I think this is what I missed: on Freescale ppc 8xx there's a special
hugepd_populate_kernel() defined to install kernel pgtables for hugepd.
Obviously I didn't check further than hugepd_populate() when I first
looked, and stopped at the first instance of hugepd_populate() definition
on the 64 bits ppc.

For this specific patch: I suppose the change is still all fine to reuse
the fast-gup function, because it is still true when there's a VMA present
(GUP applies only to user mappings, nothing like KASAN should ever pop up).
So AFAIU it's still true that hugepd is only used for hugetlb pages in this
case even for Freescale 8xx, and nothing should yet explode.  So maybe I
can still keep the code changes.

However, the comments at least definitely need fixing (I'm going to add
some, which hch requested and I agree with); they are not yet in the patch I
posted here, but I'll refine them locally.

For the whole work: the purpose of it is to start merging hugetlb pgtable
processing with generic mm.  That is my take from previous LSF/MM discussions
in the community on how we should move forward with hugetlb in the future,
to avoid code duplication against generic mm.  Hugetlb is kind of blocked
on adding new (especially large) features in general because of such
complexity.  This is all about that, but only a small step towards it.

I see that there seems to be a trend of making hugepd more general.
Christophe's fix for the pgtable dump is exactly what I would also look for
if we keep going.  I hope that's the right way to go.

I'll also need to think more about how this will affect my plan; currently
it seems all fine: I won't ever try to change any kernel-mapping specific
code.  I suppose any hugetlbfs based test should still cover all the code I
will touch for hugepd.  Then it should just work for kernel mappings on
Freescale; it'll be great if e.g. Christophe can help me double check that
once the series stabilizes in a few versions.  If any of you have any hints
on testing it'll be more than welcome, either specific test cases or hints;
currently I'm still at the phase of looking for a usable ppc system - QEMU
tcg ppc64 emulation on x86 is slow enough that I've already given up on it.

Considering hugepd's specialty in ppc and the possibility that I'll break
it, there's yet another option, which is to only apply the new logic to
archs with !ARCH_HAS_HUGEPD.  It'll make my life easier, but it also means
that even if my attempt works out, anything new will by default rule ppc
out.  And we'll have a bunch of "#ifdef ARCH_HAS_HUGEPD" in generic code,
which is not preferred either.  For gup it might be relatively easy compared
to the rest.  I'm still hesitating on the long-term plan.

Please let me know if you have any thoughts on any of above.

Thanks!

-- 
Peter Xu



Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-11-24 Thread Peter Xu
On Fri, Nov 24, 2023 at 09:06:01AM +, Ryan Roberts wrote:
> I don't have any micro-benchmarks for GUP though, if that's your question. Is
> there an easy-to-use test I can run to get some numbers? I'd be happy to try 
> it out.

Thanks Ryan.  Then nothing needs to be tested if gup is not yet touched
from your side, afaict.  I'll see whether I can provide some rough numbers
instead in the next post (I'll probably only be able to test it in a VM,
though, but hopefully that should still reflect mostly the truth).

-- 
Peter Xu



Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-11-23 Thread Peter Xu
On Thu, Nov 23, 2023 at 07:11:19PM +, Ryan Roberts wrote:
> Hi,
> 
> I'm not sure I've 100% understood the crossover between this series and my 
> work
> to support arm64's contpte mappings generally for anonymous and file-backed 
> memory.

No worries, there's no conflict.  If you worked on that it would only be
something nice on top.  Also, I'm curious whether you have performance
numbers, because I'm going to do some tests for hugetlb cont_ptes (which is
only the current plan), and if you have those they'll be a great baseline
for me, because it should be similar in your case even though the goal is
slightly different.

> 
> My approach is to transparently use contpte mappings when core-mm request pte
> mappings that meet the requirements; and its all based around intercepting the
> normal (non-hugetlb) helpers (e.g. set_ptes(), ptep_get() and friends). There 
> is
> no semantic change to the core-mm. See [1]. It relies on 1) the page cache 
> using
> large folios and 2) my "small-sized THP" series which starts using arbitrary
> sized large folios for anonymous memory [2].
> 
> If I've understood this conversation correctly there is an object called 
> hugepd,
> which today is only supported by powerpc, but which could allow the core-mm to
> control the mapping granularity? I can see some value in exposing that control
> to core-mm in the (very) long term.

For me it's needed immediately, because hugetlb_follow_page_mask() will be
gone after the last patch.

> 
> [1] https://lore.kernel.org/all/20231115163018.1303287-1-ryan.robe...@arm.com/
> [2] 
> https://lore.kernel.org/linux-mm/20231115132734.931023-1-ryan.robe...@arm.com/

AFAICT you haven't yet worked on gup then, from a glance at the above
series.

It's a matter of whether one follow_page_mask() call can fetch more than
one page* for a cont_pte entry on aarch64 for a large non-hugetlb folio
(and if this series lands, it'll be the same for hugetlb and non-hugetlb).
Right now the code can only fetch one page, I think.

Thanks,

-- 
Peter Xu



Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-11-23 Thread Peter Xu
On Thu, Nov 23, 2023 at 06:22:33PM +, Christophe Leroy wrote:
> > For fast-gup I think the hugepd code is in use, however for walk_page_*
> > apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific
> > handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit
> > a hugepd can be dead code to me (but note that this "dead code" is good
> > stuff to me, if one would like to merge hugetlb instead into generic mm).
> 
> Not sure what you mean here. What do you mean by "dead code" ?
> A hugepage directory can be plugged at any page level, from PGD to PMD.
> So the following bit in walk_pgd_range() is valid and not dead:
> 
>   if (is_hugepd(__hugepd(pgd_val(*pgd
>   err = walk_hugepd_range((hugepd_t *)pgd, addr, next, 
> walk, PGDIR_SHIFT);

IMHO it boils down to the question of whether hugepd is only used in
hugetlbfs.  I think I already mentioned that above, but I can be more
explicit; what I see is that higher up the stack, in __walk_page_range():

if (is_vm_hugetlb_page(vma)) {
if (ops->hugetlb_entry)
err = walk_hugetlb_range(start, end, walk);
} else
err = walk_pgd_range(start, end, walk);

To me it means that as long as the vma is hugetlb, it'll not trigger any
code in walk_pgd_range(), only walk_hugetlb_range().  Do you perhaps mean
hugepd is used outside hugetlbfs?

Thanks,

-- 
Peter Xu



Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-11-23 Thread Peter Xu
On Thu, Nov 23, 2023 at 03:47:49PM +, Matthew Wilcox wrote:
> It looks like ARM (in the person of Ryan) are going to add support for
> something equivalent to hugepd.

If it's about arm's cont_pte, then it looks ideal, because this series
doesn't yet touch cont_pte, assuming it'll just work.  From that aspect, his
work may help mine, and there's no immediate conflict either.

There can be a slight performance difference for arm's cont_pte which I
need to measure for hugetlb already, but I didn't worry much about that;
quoting my commit message in the last patch:

There may be a slight difference of how the loops run when processing
GUP over a large hugetlb range on either ARM64 (e.g. CONT_PMD) or RISCV
(mostly its Svnapot extension on 64K huge pages): each loop of
__get_user_pages() will resolve one pgtable entry with the patch
applied, rather than relying on the size of hugetlb hstate, the latter
may cover multiple entries in one loop.

However, the performance difference should hopefully not be a major
concern, considering that GUP just yet got 57edfcfd3419 ("mm/gup:
accelerate thp gup even for "pages != NULL""), and that's not part of a
performance analysis but a side dish.  If the performance will be a
concern, we can consider handle CONT_PTE in follow_page(), for example.

So IMHO it can be slightly different compared to e.g. page faults, because
each fault is still pretty slow as a whole if there is one fault for each
small pte (of a large folio / cont_pte), while the loop in GUP is relatively
tight and short compared to a fault.  I'd boldly guess there is more
low-hanging fruit out there for large folios outside the GUP area.

In any case, it'll be interesting to know whether Ryan has worked on
cont_pte support for gup on large folios, and whether there are any
performance numbers to share.  It's definitely good news to me, because it
means Ryan's work can then also benefit hugetlb if this series gets merged;
I just don't know how much difference there will be.

Thanks,

-- 
Peter Xu



Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-11-23 Thread Peter Xu
On Wed, Nov 22, 2023 at 11:21:50PM -0800, Christoph Hellwig wrote:
> That alone sounds like a good reason to not bother.  So unless more
> qualified people have a different opinion I'm fine with this patch
> as long as you leave a comment in place, and amend the commit message
> with some of the very useful information from your mail.

Will do, thanks.

This is what I will squash into the same patch in the new version, as the
current plan:

+/*
+ * NOTE: currently hugepd is only used in hugetlbfs file systems on Power,
+ * which does not have issue with folio writeback against GUP updates.
+ * When hugepd will be extended to support non-hugetlbfs or even anonymous
+ * memory, we need to do extra check as what we do with most of the other
+ * folios. See writable_file_mapping_allowed() and folio_fast_pin_allowed()
+ * for more information.
+ */
static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
unsigned int pdshift, unsigned long end, unsigned int flags,
struct page **pages, int *nr)

-- 
Peter Xu



Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-11-22 Thread Peter Xu
On Wed, Nov 22, 2023 at 12:00:24AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 21, 2023 at 10:59:35AM -0500, Peter Xu wrote:
> > > What prevents us from ever using hugepd with file mappings?  I think
> > > it would naturally fit in with how large folios for the pagecache work.
> > > 
> > > So keeping this check and generalizing it seems like the better idea to
> > > me.
> > 
> > But then it means we're still keeping that dead code for fast-gup even if
> > we know that fact..  Or do we have a plan to add that support very soon, so
> > this code will be destined to add back?
> 
> The question wasn't meant rhetorical - we support arbitrary power of two
> sized folios for the pagecache, what prevents us from using hugepd with
> them right now?

Ah, didn't catch that point previously.  Hugepd is just not used outside
hugetlb right now, afaiu.

For example, __hugepte_alloc() (the only caller of hugepd_populate())
should be the function that allocates a hugepd (ppc only), and it's only
called from huge_pte_alloc(), which is part of the current arch-specific
hugetlb api.

And generic mm paths don't normally have hugepd handling, afaics.  For
example, page_vma_mapped_walk() doesn't handle hugepd at all unless in the
hugetlb-specific path.

There're actually (only) two generic mm paths that can handle hugepd,
namely:

  - fast-gup
  - walk_page_*() apis (aka, __walk_page_range())

For fast-gup I think the hugepd code is in use; however for the walk_page_*
apis the hugepd code shouldn't be reached iiuc, as we have the
hugetlb-specific handling (walk_hugetlb_range()), so anything within
walk_pgd_range() that hits a hugepd looks like dead code to me (but note
that this "dead code" is good stuff to me, if one would like to instead
merge hugetlb into generic mm).

This series tries to add slow gup to that list too, as the 3rd one to
support it.  I plan to look more into this area (e.g., __walk_page_range()
can be another good candidate soon).  I'm not sure whether we should teach
the whole of mm to understand hugepd yet, but slow gup and
__walk_page_range() do look like good candidates for removing the
hugetlb-specific code paths already - slow gup has roughly balanced
added/removed LOCs (which is what this series does), and
__walk_page_range() can remove some code logically, with no harm that I can
yet see.

Indeed, the above is based only on my reading of the code, so I'll be more
than happy to be corrected, as early as possible.

> 
> > The other option is I can always add a comment above gup_huge_pd()
> > explaining this special bit, so that when someone is adding hugepd support
> > to file large folios we'll hopefully not forget it?  But then that
> > generalization work will only happen when the code will be needed.
> 
> If dropping the check is the right thing for now (and I think the ppc
> maintainers and willy as the large folio guy might have a more useful
> opinions than I do), leaving a comment in would be very useful.

Willy is in the loop, and I just noticed I didn't really copy the ppc list
even though I planned to..  I am adding the list
(linuxppc-dev@lists.ozlabs.org) to this reply.  I'll remember to do so
whenever there's a new version.

The other reason I feel hugepd may or may not be further developed for new
features like large folios is that I saw Power9 started to shift to radix
pgtables, and afaics hugepd is only supported with hash tables
(hugepd_ok()).  But again, I confess I know nothing about Power at all.

Thanks,

-- 
Peter Xu



Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-11-21 Thread Peter Xu
On Mon, Nov 20, 2023 at 12:26:24AM -0800, Christoph Hellwig wrote:
> On Wed, Nov 15, 2023 at 08:29:02PM -0500, Peter Xu wrote:
> > Hugepd format is only used in PowerPC with hugetlbfs.  In commit
> > a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
> > file-backed mappings"), we added a check to fail gup-fast if there's
> > potential risk of violating GUP over writeback file systems.  That should
> > never apply to hugepd.
> >
> > Drop that check, not only because it'll never be true for hugepd, but also
> > it paves way for reusing the function outside fast-gup.
> 
> What prevents us from ever using hugepd with file mappings?  I think
> it would naturally fit in with how large folios for the pagecache work.
> 
> So keeping this check and generalizing it seems like the better idea to
> me.

But then it means we're still keeping that dead code for fast-gup even
though we know that fact..  Or do we have a plan to add that support very
soon, so this code is destined to be added back?

The other option is that I can always add a comment above gup_huge_pd()
explaining this special bit, so that when someone adds hugepd support for
file large folios we'll hopefully not forget it?  But then that
generalization work will only happen when the code is needed.

Thanks,

-- 
Peter Xu



[PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-11-15 Thread Peter Xu
Hugepd format is only used in PowerPC with hugetlbfs.  In commit
a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
file-backed mappings"), we added a check to fail gup-fast if there's
potential risk of violating GUP over writeback file systems.  That should
never apply to hugepd.

Drop that check, not only because it'll never be true for hugepd, but also
because it paves the way for reusing the function outside fast-gup.

Cc: Lorenzo Stoakes 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Peter Xu 
---
 mm/gup.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 0e00204761d2..424d45e1afb3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2816,11 +2816,6 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, 
unsigned long addr,
return 0;
}
 
-   if (!folio_fast_pin_allowed(folio, flags)) {
-   gup_put_folio(folio, refs, flags);
-   return 0;
-   }
-
if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
gup_put_folio(folio, refs, flags);
return 0;
-- 
2.41.0



Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

2023-08-22 Thread Peter Xu
ds to work on
> private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> but assume that it does.
> 
> Now trylock pmd lock without dropping ptlock (suggested by jannh): if
> that fails, drop and retake ptlock around taking pmd lock, and just in
> the uffd private case, go back to recheck and empty the page table.
> 
> Reported-by: Jann Horn 
> Closes: 
> https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5fw-kyd5rg5xo_rdbm0ltt...@mail.gmail.com/
> Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with 
> mmap_read_lock()")
> Signed-off-by: Hugh Dickins 
> ---
>  mm/khugepaged.c | 39 +--
>  1 file changed, 29 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 40d43eccdee8..ad1c571772fe 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1476,7 +1476,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
> unsigned long addr,
>   struct page *hpage;
>   pte_t *start_pte, *pte;
>   pmd_t *pmd, pgt_pmd;
> - spinlock_t *pml, *ptl;
> + spinlock_t *pml = NULL, *ptl;
>   int nr_ptes = 0, result = SCAN_FAIL;
>   int i;
>  
> @@ -1572,9 +1572,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
> unsigned long addr,
>   haddr, haddr + HPAGE_PMD_SIZE);
>   mmu_notifier_invalidate_range_start();
>   notified = true;
> - start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> - if (!start_pte) /* mmap_lock + page lock should prevent this */
> - goto abort;
> + spin_lock(ptl);

.. here will the ptl always be valid?

That comes from the previous round of pte_offset_map_lock(), and I assume
that after this whole "thp collapse without write lock" work landed, it has
the same lifecycle as the *pte pointer, so it can become invalid right after
the rcu read lock is released; the mmap read lock isn't strong enough to
protect the ptl anymore.

Maybe it's all fine because the thp collapse path is the only path that
releases the pte pgtable page without the write mmap lock (and so releases
the ptl too when doing so), and we at least still hold the page lock, so the
worst case is that another concurrent "thp collapse" will still serialize
with this one on the huge page lock.  But that doesn't look as solid as
fetching the ptl again from another pte_offset_map_nolock().  So I'm still
just raising the question.  It's possible I just missed something.
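
For example, something like this (a rough, untested sketch only, reusing
the names from the patch above):

	/* Refetch the pte and its lock, rather than trusting the cached ptl */
	start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
	if (!start_pte)		/* mmap_lock + page lock should prevent this */
		goto abort;
	spin_lock(ptl);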

> +recheck:
> + start_pte = pte_offset_map(pmd, haddr);
> + VM_BUG_ON(!start_pte);  /* mmap_lock + page lock should prevent this */
>  
>   /* step 2: clear page table and adjust rmap */
>   for (i = 0, addr = haddr, pte = start_pte;
> @@ -1608,20 +1609,36 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
> unsigned long addr,
>   nr_ptes++;
>   }
>  
> - pte_unmap_unlock(start_pte, ptl);
> + pte_unmap(start_pte);
>  
>   /* step 3: set proper refcount and mm_counters. */
>   if (nr_ptes) {
>   page_ref_sub(hpage, nr_ptes);
>   add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
> + nr_ptes = 0;
>   }
>  
> - /* step 4: remove page table */
> + /* step 4: remove empty page table */
> + if (!pml) {
> + pml = pmd_lockptr(mm, pmd);
> + if (pml != ptl && !spin_trylock(pml)) {
> + spin_unlock(ptl);
> + spin_lock(pml);
> + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> + /*
> +  * pmd_lock covers a wider range than ptl, and (if split from mm's
> +  * page_table_lock) ptl nests inside pml. The less time we hold pml,
> +  * the better; but userfaultfd's mfill_atomic_pte() on a private VMA
> +  * inserts a valid as-if-COWed PTE without even looking up page cache.
> +  * So page lock of hpage does not protect from it, so we must not drop
> +  * ptl before pgt_pmd is removed, so uffd private needs rechecking.
> +  */
> + if (userfaultfd_armed(vma) &&
> + !(vma->vm_flags & VM_SHARED))
> + goto recheck;
> + }
> + }
>  
> - /* Huge page lock is still held, so page table must remain empty */
> - pml = pmd_lock(mm, pmd);
> - if (ptl != pml)
> - spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>   pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
>   pmdp_get_lockless_sync();
>   if (ptl != pml)
> @@ -1648,6 +1665,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, 
> unsigned long addr,
>   }
>   if (start_pte)
>   pte_unmap_unlock(start_pte, ptl);
> + if (pml && pml != ptl)
> + spin_unlock(pml);
>   if (notified)
>   mmu_notifier_invalidate_range_end();
>  drop_hpage:
> -- 
> 2.35.3


-- 
Peter Xu



Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

2023-08-21 Thread Peter Xu
On Mon, Aug 21, 2023 at 12:51:20PM -0700, Hugh Dickins wrote:
> Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> thought it had emptied: page lock on the huge page is enough to protect
> against WP faults (which find the PTE has been cleared), but not enough
> to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
> 
> retract_page_tables() protects against this by checking !vma->anon_vma;
> but we know that MADV_COLLAPSE needs to be able to work on private shmem
> mappings, even those with an anon_vma prepared for another part of the
> mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> mappings which are userfaultfd_armed().  Whether it needs to work on
> private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> but assume that it does.
> 
> Just for this case, take the pmd_lock() two steps earlier: not because
> it gives any protection against this case itself, but because ptlock
> nests inside it, and it's the dropping of ptlock which let the bug in.
> In other cases, continue to minimize the pmd_lock() hold time.
> 
> Reported-by: Jann Horn 
> Closes: 
> https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5fw-kyd5rg5xo_rdbm0ltt...@mail.gmail.com/
> Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with 
> mmap_read_lock()")
> Signed-off-by: Hugh Dickins 

The locking is indeed slightly complicated.. but I didn't spot anything
wrong.

Acked-by: Peter Xu 

Thanks,

-- 
Peter Xu



Re: [PATCH v6 03/33] pgtable: Create struct ptdesc

2023-06-27 Thread Peter Xu
On Mon, Jun 26, 2023 at 08:14:01PM -0700, Vishal Moola (Oracle) wrote:
> Currently, page table information is stored within struct page. As part
> of simplifying struct page, create struct ptdesc for page table
> information.
> 
> Signed-off-by: Vishal Moola (Oracle) 
> Acked-by: Mike Rapoport (IBM) 
> ---
>  include/linux/pgtable.h | 68 +
>  1 file changed, 68 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 5063b482e34f..d46cb709ce08 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -987,6 +987,74 @@ static inline void ptep_modify_prot_commit(struct 
> vm_area_struct *vma,
>  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>  #endif /* CONFIG_MMU */
>  
> +
> +/**
> + * struct ptdesc -Memory descriptor for page tables.
> + * @__page_flags: Same as page flags. Unused for page tables.
> + * @pt_rcu_head:  For freeing page table pages.
> + * @pt_list:  List of used page tables. Used for s390 and x86.
> + * @_pt_pad_1:Padding that aliases with page's compound head.
> + * @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs.
> + * @_pt_s390_gaddr:   Aliases with page's mapping. Used for s390 gmap only.

Should some arch-specific bits (and a few others) always be under #ifdefs,
so that they don't appear on other archs?

> + * @pt_mm:Used for x86 pgds.
> + * @pt_frag_refcount: For fragmented page table tracking. Powerpc and s390 
> only.
> + * @ptl:  Lock for the page table.
> + * @__page_type:  Same as page->page_type. Unused for page tables.
> + * @_refcount:Same as page refcount. Used for s390 page tables.
> + * @pt_memcg_data:Memcg data. Tracked for page tables here.
> + *
> + * This struct overlays struct page for now. Do not modify without a good
> + * understanding of the issues.
> + */
> +struct ptdesc {
> + unsigned long __page_flags;
> +
> + union {
> + struct rcu_head pt_rcu_head;
> + struct list_head pt_list;
> + struct {
> + unsigned long _pt_pad_1;
> + pgtable_t pmd_huge_pte;
> + };
> + };
> + unsigned long _pt_s390_gaddr;
> +
> + union {
> + struct mm_struct *pt_mm;
> + atomic_t pt_frag_refcount;
> + };
> +
> + union {
> + unsigned long _pt_pad_2;
> +#if ALLOC_SPLIT_PTLOCKS
> + spinlock_t *ptl;
> +#else
> +         spinlock_t ptl;
> +#endif
> + };
> + unsigned int __page_type;
> + atomic_t _refcount;
> +#ifdef CONFIG_MEMCG
> + unsigned long pt_memcg_data;
> +#endif
> +};

-- 
Peter Xu



Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-06-06 Thread Peter Xu
On Tue, Jun 06, 2023 at 03:23:30PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 05, 2023 at 08:40:01PM -0700, Hugh Dickins wrote:
> 
> > diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
> > index 20652daa1d7e..e4f58c5fc2ac 100644
> > --- a/arch/powerpc/mm/pgtable-frag.c
> > +++ b/arch/powerpc/mm/pgtable-frag.c
> > @@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int 
> > kernel)
> > __free_page(page);
> > }
> >  }
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#define PTE_FREE_DEFERRED 0x1 /* beyond any PTE_FRAG_NR */
> > +
> > +static void pte_free_now(struct rcu_head *head)
> > +{
> > +   struct page *page;
> > +   int refcount;
> > +
> > +   page = container_of(head, struct page, rcu_head);
> > +   refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
> > +    &page->pt_frag_refcount);
> > +   if (refcount < PTE_FREE_DEFERRED) {
> > +   pte_fragment_free((unsigned long *)page_address(page), 0);
> > +   return;
> > +   }
> 
> From what I can tell power doesn't recycle the sub fragment into any
> kind of free list. It just waits for the last fragment to be unused
> and then frees the whole page.
> 
> So why not simply go into pte_fragment_free() and do the call_rcu directly:
> 
>   BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
>   if (atomic_dec_and_test(&page->pt_frag_refcount)) {
>   if (!kernel)
>   pgtable_pte_page_dtor(page);
>   call_rcu(&page->rcu_head, free_page_rcu)

We need to be careful about the lock being freed in pgtable_pte_page_dtor():
in Hugh's series IIUC we need the spinlock to stay around for the rcu
section alongside the page itself.  So even if we did it that way, we'd
also need to rcu-call pgtable_pte_page_dtor() when needed.
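
Roughly something like this, as a sketch of that direction (assuming a user
pgtable so the dtor applies; untested):

	static void free_page_rcu(struct rcu_head *head)
	{
		struct page *page = container_of(head, struct page, rcu_head);

		/* Keep the split ptlock alive until the grace period ends */
		pgtable_pte_page_dtor(page);
		__free_page(page);
	}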

-- 
Peter Xu



Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

2023-05-31 Thread Peter Xu
 lock is still held, stopping new 
> > mappings
> > +* of page which might then get replaced by pte markers: 
> > only
> > +* existing markers need to be protected here.  (We could 
> > check
> > +* after getting ptl below, but this comment distracting 
> > there!)
> > +*/
> > +   if (userfaultfd_wp(vma))
> > +   continue;
> > +
> > +   /* Huge page lock is still held, so page table must be 
> > empty */
> > +   pml = pmd_lock(mm, pmd);
> > +   ptl = pte_lockptr(mm, pmd);
> > +   if (ptl != pml)
> > +   spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> > +   pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> 
> ... while the new code only does pmdp_collapse_flush(), which clears
> the pmd entry and does a TLB flush, but AFAICS doesn't use MMU
> notifiers. My understanding is that that's problematic - maybe (?) it
> is sort of okay with regards to classic MMU notifier users like KVM,
> but it's probably wrong for IOMMUv2 users, where an IOMMU directly
> consumes the normal page tables?

The iommuv2 wasn't "consuming" the pgtables?  IIUC it relies on that to
make sure no secondary (and illegal) tlb entry exists in the iommu tlbs.

For this case, if the pgtable _must_ be empty when reaching here (we'd
better make sure of it..), maybe we're good?  Because we should have just
invalidated once when unmapping all the pages in the thp range, so no stale
tlb entry should be generated anyway for either the cpu or the iommu
hardware.

However OTOH, maybe it'll also be safer to just have the mmu notifiers like
before (e.g., no idea whether anything can cache invalid tlb translations
from the empty pgtable)?  That doesn't seem to defeat the purpose of the
patchset, as notifiers shouldn't fail.

> 
> (FWIW, last I looked, there also seemed to be some other issues with
> MMU notifier usage wrt IOMMUv2, see the thread
> <https://lore.kernel.org/linux-mm/yzbaf9hw1%2frek...@nvidia.com/>.)
> 
> 
> > +   if (ptl != pml)
> > +   spin_unlock(ptl);
> > +   spin_unlock(pml);
> > +
> > +   mm_dec_nr_ptes(mm);
> > +   page_table_check_pte_clear_range(mm, addr, pgt_pmd);
> > +   pte_free_defer(mm, pmd_pgtable(pgt_pmd));
> > }
> > -   i_mmap_unlock_write(mapping);
> > -   return target_result;
> > +   i_mmap_unlock_read(mapping);
> >  }
> >
> >  /**
> > @@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, 
> > unsigned long addr,
> >
> > /*
> >  * Remove pte page tables, so we can re-fault the page as huge.
> > +* If MADV_COLLAPSE, adjust result to call 
> > collapse_pte_mapped_thp().
> >  */
> > -   result = retract_page_tables(mapping, start, mm, addr, hpage,
> > -cc);
> > +   retract_page_tables(mapping, start);
> > +   if (cc && !cc->is_khugepaged)
> > +   result = SCAN_PTE_MAPPED_HUGEPAGE;
> > unlock_page(hpage);
> >
> > /*
> > --
> > 2.35.3
> >
> 

-- 
Peter Xu



Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

2023-05-31 Thread Peter Xu
On Tue, May 30, 2023 at 05:38:25PM -0700, Hugh Dickins wrote:
> Thanks for looking, Peter: I was well aware of you dropping several hints
> that you wanted to see what's intended before passing judgment on earlier
> series, and I preferred to get on with showing this series, than go into
> detail in responses to you there - thanks for your patience :)

Not a problem at all here!

> 
> On Mon, 29 May 2023, Peter Xu wrote:
> > On Sun, May 28, 2023 at 11:25:15PM -0700, Hugh Dickins wrote:
> ...
> > > @@ -1748,123 +1747,73 @@ static void 
> > > khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_sl
> > >   mmap_write_unlock(mm);
> > >  }
> > >  
> > > -static int retract_page_tables(struct address_space *mapping, pgoff_t 
> > > pgoff,
> > > -struct mm_struct *target_mm,
> > > -unsigned long target_addr, struct page *hpage,
> > > -struct collapse_control *cc)
> > > +static void retract_page_tables(struct address_space *mapping, pgoff_t 
> > > pgoff)
> > >  {
> > >   struct vm_area_struct *vma;
> > > - int target_result = SCAN_FAIL;
> > >  
> > > - i_mmap_lock_write(mapping);
> > > + i_mmap_lock_read(mapping);
> > >   vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > - int result = SCAN_FAIL;
> > > - struct mm_struct *mm = NULL;
> > > - unsigned long addr = 0;
> > > - pmd_t *pmd;
> > > - bool is_target = false;
> > > + struct mm_struct *mm;
> > > + unsigned long addr;
> > > + pmd_t *pmd, pgt_pmd;
> > > + spinlock_t *pml;
> > > + spinlock_t *ptl;
> > >  
> > >   /*
> > >* Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> > > -  * got written to. These VMAs are likely not worth investing
> > > -  * mmap_write_lock(mm) as PMD-mapping is likely to be split
> > > -  * later.
> > > +  * got written to. These VMAs are likely not worth removing
> > > +  * page tables from, as PMD-mapping is likely to be split later.
> > >*
> > > -  * Note that vma->anon_vma check is racy: it can be set up after
> > > -  * the check but before we took mmap_lock by the fault path.
> > > -  * But page lock would prevent establishing any new ptes of the
> > > -  * page, so we are safe.
> > > -  *
> > > -  * An alternative would be drop the check, but check that page
> > > -  * table is clear before calling pmdp_collapse_flush() under
> > > -  * ptl. It has higher chance to recover THP for the VMA, but
> > > -  * has higher cost too. It would also probably require locking
> > > -  * the anon_vma.
> > > +  * Note that vma->anon_vma check is racy: it can be set after
> > > +  * the check, but page locks (with XA_RETRY_ENTRYs in holes)
> > > +  * prevented establishing new ptes of the page. So we are safe
> > > +  * to remove page table below, without even checking it's empty.
> > >*/
> > > - if (READ_ONCE(vma->anon_vma)) {
> > > - result = SCAN_PAGE_ANON;
> > > - goto next;
> > > - }
> > > + if (READ_ONCE(vma->anon_vma))
> > > + continue;
> > 
> > Not directly related to current patch, but I just realized there seems to
> > have similar issue as what ab0c3f1251b4 wanted to fix.
> > 
> > IIUC any shmem vma that used to have uprobe/bp installed will have anon_vma
> > set here, then does it mean that any vma used to get debugged will never be
> > able to merge into a thp (with either madvise or khugepaged)?
> > 
> > I think it'll only make a difference when the page cache is not huge yet
> > when bp was uninstalled, but then it becomes a thp candidate somehow.  Even
> > if so, I think the anon_vma should still be there.
> > 
> > Did I miss something, or maybe that's not even a problem?
> 
> Finding vma->anon_vma set would discourage retract_page_tables() from
> doing its business with that previously uprobed area; but it does not stop
> collapse_pte_mapped_thp() (which uprobes unregister calls directly) from
> dealing with it,

This one is true to me.

> and MADV_COLLAPSE works on anon_vma'ed areas too.  It's just a heuristic
> 

Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

2023-05-29 Thread Peter Xu
, addr, pmd, hpage);
> - else
> - result = SCAN_SUCCEED;
> -
> -unlock_next:
> - mmap_write_unlock(mm);
> - goto next;
> - }
> - /*
> -  * Calling context will handle target mm/addr. Otherwise, let
> -  * khugepaged try again later.
> -  */
> - if (!is_target) {
> - khugepaged_add_pte_mapped_thp(mm, addr);
> + vma->vm_end < addr + HPAGE_PMD_SIZE)
>   continue;
> - }
> -next:
> - if (is_target)
> - target_result = result;
> +
> + mm = vma->vm_mm;
> + if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> + continue;
> +
> + if (hpage_collapse_test_exit(mm))
> + continue;
> + /*
> +  * When a vma is registered with uffd-wp, we cannot recycle
> +  * the page table because there may be pte markers installed.
> +  * Other vmas can still have the same file mapped hugely, but
> +  * skip this one: it will always be mapped in small page size
> +  * for uffd-wp registered ranges.
> +  *
> +  * What if VM_UFFD_WP is set a moment after this check?  No
> +  * problem, huge page lock is still held, stopping new mappings
> +  * of page which might then get replaced by pte markers: only
> +  * existing markers need to be protected here.  (We could check
> +  * after getting ptl below, but this comment distracting there!)
> +  */
> + if (userfaultfd_wp(vma))
> + continue;

IIUC here with the new code we only hold (1) the hpage lock, and (2) the
i_mmap_lock for read.  Then could it be possible that right after this
check finds !UFFD_WP, someone quickly (1) registers uffd-wp on this vma,
then (2) uses UFFDIO_WRITEPROTECT to install some pte markers, before the
pgtable locks below are taken?

The thing is, installation of pte markers may not need either of those
locks, iiuc..

Would taking the mmap read lock help in this case?

Thanks,

> +
> + /* Huge page lock is still held, so page table must be empty */
> + pml = pmd_lock(mm, pmd);
> + ptl = pte_lockptr(mm, pmd);
> + if (ptl != pml)
> + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> + pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> + if (ptl != pml)
> + spin_unlock(ptl);
> + spin_unlock(pml);
> +
> + mm_dec_nr_ptes(mm);
> + page_table_check_pte_clear_range(mm, addr, pgt_pmd);
> + pte_free_defer(mm, pmd_pgtable(pgt_pmd));
>   }
> - i_mmap_unlock_write(mapping);
> - return target_result;
> + i_mmap_unlock_read(mapping);
>  }
>  
>  /**
> @@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, 
> unsigned long addr,
>  
>   /*
>* Remove pte page tables, so we can re-fault the page as huge.
> +  * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
>*/
> - result = retract_page_tables(mapping, start, mm, addr, hpage,
> -  cc);
> + retract_page_tables(mapping, start);
> + if (cc && !cc->is_khugepaged)
> + result = SCAN_PTE_MAPPED_HUGEPAGE;
>   unlock_page(hpage);
>  
>   /*
> -- 
> 2.35.3
> 

-- 
Peter Xu



Re: [PATCH] mm: remove zap_page_range and create zap_vma_pages

2023-01-04 Thread Peter Xu
On Tue, Jan 03, 2023 at 04:27:32PM -0800, Mike Kravetz wrote:
> zap_page_range was originally designed to unmap pages within an address
> range that could span multiple vmas.  While working on [1], it was
> discovered that all callers of zap_page_range pass a range entirely within
> a single vma.  In addition, the mmu notification call within zap_page
> range does not correctly handle ranges that span multiple vmas.  When
> crossing a vma boundary, a new mmu_notifier_range_init/end call pair
> with the new vma should be made.
> 
> Instead of fixing zap_page_range, do the following:
> - Create a new routine zap_vma_pages() that will remove all pages within
>   the passed vma.  Most users of zap_page_range pass the entire vma and
>   can use this new routine.
> - For callers of zap_page_range not passing the entire vma, instead call
>   zap_page_range_single().
> - Remove zap_page_range.
> 
> [1] 
> https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.krav...@oracle.com/
> Suggested-by: Peter Xu 
> Signed-off-by: Mike Kravetz 

Acked-by: Peter Xu 

-- 
Peter Xu



Re: [RFC PATCH] mm: remove zap_page_range and change callers to use zap_vma_page_range

2022-12-20 Thread Peter Xu
On Fri, Dec 16, 2022 at 11:20:12AM -0800, Mike Kravetz wrote:
> zap_page_range was originally designed to unmap pages within an address
> range that could span multiple vmas.  While working on [1], it was
> discovered that all callers of zap_page_range pass a range entirely within
> a single vma.  In addition, the mmu notification call within zap_page
> range does not correctly handle ranges that span multiple vmas as calls
> should be vma specific.
> 
> Instead of fixing zap_page_range, change all callers to use the new
> routine zap_vma_page_range.  zap_vma_page_range is just a wrapper around
> zap_page_range_single passing in NULL zap details.  The name is also
> more in line with other exported routines that operate within a vma.
> We can then remove zap_page_range.
> 
> Also, change madvise_dontneed_single_vma to use this new routine.
> 
> [1] 
> https://lore.kernel.org/linux-mm/20221114235507.294320-2-mike.krav...@oracle.com/
> Suggested-by: Peter Xu 
> Signed-off-by: Mike Kravetz 

Acked-by: Peter Xu 

Thanks!

-- 
Peter Xu



Re: [PATCH v4] hugetlb: simplify hugetlb handling in follow_page_mask

2022-10-30 Thread Peter Xu
On Fri, Oct 28, 2022 at 11:11:08AM -0700, Mike Kravetz wrote:
> +struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> + unsigned long address, unsigned int flags)
> +{
> + struct hstate *h = hstate_vma(vma);
> + struct mm_struct *mm = vma->vm_mm;
> + unsigned long haddr = address & huge_page_mask(h);
> + struct page *page = NULL;
> + spinlock_t *ptl;
> + pte_t *pte, entry;
> +
> + /*
> +  * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> +  * follow_hugetlb_page().
> +  */
> + if (WARN_ON_ONCE(flags & FOLL_PIN))
> + return NULL;
> +
> +retry:
> + pte = huge_pte_offset(mm, haddr, huge_page_size(h));
> + if (!pte)
> + return NULL;
> +
> + ptl = huge_pte_lock(h, mm, pte);
> + entry = huge_ptep_get(pte);
> + if (pte_present(entry)) {
> + page = pte_page(entry) +
> + ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
> + /*
> +  * Note that page may be a sub-page, and with vmemmap
> +  * optimizations the page struct may be read only.
> +  * try_grab_page() will increase the ref count on the
> +  * head page, so this will be OK.
> +  *
> +  * try_grab_page() should always succeed here, because we hold
> +  * the ptl lock and have verified pte_present().
> +  */
> + if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
> + page = NULL;
> + goto out;
> + }
> + } else {
> + if (is_hugetlb_entry_migration(entry)) {
> + spin_unlock(ptl);
> + hugetlb_vma_unlock_read(vma);

Just noticed this when I pulled the latest mm-unstable: this line seems to
be a leftover from v3 that is not needed anymore?

> + __migration_entry_wait_huge(pte, ptl);
> + goto retry;
> + }
> + /*
> +  * hwpoisoned entry is treated as no_page_table in
> +  * follow_page_mask().
> +  */
> + }
> +out:
> + spin_unlock(ptl);
> + return page;
> +}

-- 
Peter Xu



Re: [PATCH v4] hugetlb: simplify hugetlb handling in follow_page_mask

2022-10-28 Thread Peter Xu
On Fri, Oct 28, 2022 at 11:11:08AM -0700, Mike Kravetz wrote:
> v4 -Remove vma (pmd sharing) locking as this can be called with
>   FOLL_NOWAIT. Peter

Thanks, Mike.  For the gup safety on pmd unshare, I'll prepare a few small
patches and post hopefully early next week (will be off-work on Mon & Tue,
but maybe I'll still try).

-- 
Peter Xu



Re: [PATCH v3] hugetlb: simplify hugetlb handling in follow_page_mask

2022-10-28 Thread Peter Xu
On Fri, Oct 28, 2022 at 08:27:57AM -0700, Mike Kravetz wrote:
> On 10/27/22 15:34, Peter Xu wrote:
> > On Wed, Oct 26, 2022 at 05:34:04PM -0700, Mike Kravetz wrote:
> > > On 10/26/22 17:59, Peter Xu wrote:
> > 
> > If we want to use the vma read lock to protect here as the slow gup path,
> > then please check again with below [1] - I think we'll also need to protect
> > it with fast-gup (probably with trylock only, because fast-gup cannot
> > sleep) or it'll encounter the same race, iiuc.
> > 
> > Actually, instead of using vma lock, I really think this is another problem
> > and needs standalone fixing.  The problem is we allows huge_pte_offset() to
> > walk the process pgtable without any protection, while pmd unsharing can
> > drop a page anytime.  huge_pte_offset() is always facing use-after-free
> > when walking the PUD page.
> > 
> > We may want RCU lock to protect the pgtable pages from getting away when
> > huge_pte_offset() is walking it, it'll be safe then because pgtable pages
> > are released in RCU fashion only (e.g. in above example, process [2] will
> > munmap() and release the last ref to the "used to be shared" pmd and the
> > PUD that maps the shared pmds will be released only after a RCU grace
> > period), and afaict that's also what's protecting fast-gup from accessing
> > freed pgtable pages.
> > 
> > If with all huge_pte_offset() callers becoming RCU-safe, then IIUC we can
> > drop the vma lock in all GUP code, aka, in hugetlb_follow_page_mask() here,
> > because both slow and fast gup should be safe too in the same manner.
> > 
> > Thanks,
> > 
> > > > IIUC it's also the same as fast-gup - afaiu we don't take the read vma 
> > > > lock
> > > > in fast-gup too but I also think it's safe.  But I hope I didn't miss
> > > > something.
> > 
> > [1]
> 
> Thanks Peter!  I think the best thing would be to eliminate the vma_lock
> calls in this patch.  The code it is replacing/simplifying does not do any
> locking, so no real regression.

Agreed.

> 
> I think a scheme like you describe above is going to require some more
> thought/work.  It might be better as a follow on patch.

So the above is only a thought, but if you think it's not too far off and
worth trying, I can see what I can get out of it with some upcoming patches.

It shouldn't need a lot of change, but basically means auditing all
huge_pte_offset() callers to make sure they're safe regarding walking the
PUD.  I'm attaching an initial patch that just starts to comment on the
usage of huge_pte_offset(), because that'll be the gist of the upcoming
patchset (if there will be one); further comments welcome too.  Thanks.

-- 
Peter Xu
>From 186c56026ce8ccc555d3c7ebc0e7e8fd76e9e5c9 Mon Sep 17 00:00:00 2001
From: Peter Xu 
Date: Thu, 27 Oct 2022 17:41:11 -0400
Subject: [PATCH] mm/hugetlb: Comment huge_pte_offset() for its locking
 requirements
Content-type: text/plain

huge_pte_offset() is potentially a pgtable walker, looking up pte_t* for a
hugetlb address.

Normally, it's always safe to walk the pgtable as long as we're with the
mmap lock held for either read or write, because that guarantees the
pgtable pages will always be valid during the process.

But it's not true for hugetlbfs: hugetlbfs has the pmd sharing feature, it
means that even with mmap lock held, the PUD pgtable page can still go away
from under us if pmd unsharing is possible during the walk.

It's not always the case, e.g.:

  (1) If the mapping is private we're not prone to pmd sharing or
  unsharing, so it's okay.

  (2) If we're with the hugetlb vma lock held for either read/write, it's
  okay too because pmd unshare cannot happen at all.

Document all these explicitly for huge_pte_offset(), because it's really
not that obvious.  This also tells all the callers on what it needs to
guarantee huge_pte_offset() thread-safety.

Signed-off-by: Peter Xu 
---
 arch/arm64/mm/hugetlbpage.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 35e9a468d13e..90f084643718 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -329,6 +329,35 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct 
vm_area_struct *vma,
return ptep;
 }
 
+/*
+ * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
+ * Returns the pte_t* if found, or NULL if the address is not mapped.
+ *
+ * NOTE: since this function will walk all the pgtable pages (including not
+ * only normal pgtable page, but also PUD that can be unshared concurrently
+ * for VM_SHARED), the caller of this function should be responsible of its
+ * thread safety.  One can follow this rule:
+ *
+ *  (1) For private mappings: pmd u

Re: [PATCH v3] hugetlb: simplify hugetlb handling in follow_page_mask

2022-10-27 Thread Peter Xu
On Wed, Oct 26, 2022 at 05:34:04PM -0700, Mike Kravetz wrote:
> On 10/26/22 17:59, Peter Xu wrote:
> > Hi, Mike,
> > 
> > On Sun, Sep 18, 2022 at 07:13:48PM -0700, Mike Kravetz wrote:
> > > +struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > > + unsigned long address, unsigned int flags)
> > > +{
> > > + struct hstate *h = hstate_vma(vma);
> > > + struct mm_struct *mm = vma->vm_mm;
> > > + unsigned long haddr = address & huge_page_mask(h);
> > > + struct page *page = NULL;
> > > + spinlock_t *ptl;
> > > + pte_t *pte, entry;
> > > +
> > > + /*
> > > +  * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> > > +  * follow_hugetlb_page().
> > > +  */
> > > + if (WARN_ON_ONCE(flags & FOLL_PIN))
> > > + return NULL;
> > > +
> > > +retry:
> > > + /*
> > > +  * vma lock prevents racing with another thread doing a pmd unshare.
> > > +  * This keeps pte as returned by huge_pte_offset valid.
> > > +  */
> > > + hugetlb_vma_lock_read(vma);
> > 
> > I'm not sure whether it's okay to take a rwsem here, as the code can be
> > called by e.g. FOLL_NOWAIT?
> 
> I think you are right.  This is possible even thought not called this
> way today,
> 
> > I'm wondering whether it's fine to just drop this anyway, just always walk
> > it lockless.  IIUC gup callers should be safe here because the worst case
> > is the caller will fetch a wrong page, but then it should be invalidated
> > very soon with mmu notifiers.  One thing worth mention is that pmd unshare
> > should never free a pgtable page.
> 
> You are correct in that pmd unshare will not directly free a pgtable page.
> However, I think a 'very worst case' race could be caused by two threads(1,2)
> in the same process A, and another process B.  Processes A and B share a PMD.
> - Process A thread 1 gets a *ptep via huge_pte_offset and gets scheduled out.
> - Process A thread 2 calls mprotect to change protection and unshares
>   the PMD shared with process B.
> - Process B then unmaps the PMD shared with process A and the PMD page
>   gets deleted.

[2]

> - The *ptep in Process A thread 1 then points into a freed page.
> This is VERY unlikely, but I do think it is possible and is the reason I
> may be overcautious about protecting against races with pmd unshare.

Yes, this is possible; I just realized that huge_pte_offset() is in fact a
pgtable walker too.  Thanks for pointing that out.

If we want to use the vma read lock for protection here, as in the slow gup
path, then please check [1] below again - I think we'll also need to
protect fast-gup with it (probably with trylock only, because fast-gup
cannot sleep) or it'll hit the same race, iiuc.

Actually, instead of using the vma lock, I really think this is another
problem that needs standalone fixing.  The problem is that we allow
huge_pte_offset() to walk the process pgtable without any protection, while
pmd unsharing can drop a page at any time.  huge_pte_offset() is always
facing a potential use-after-free when walking the PUD page.

We may want the RCU lock to protect the pgtable pages from going away while
huge_pte_offset() is walking them; it'll be safe then because pgtable pages
are released in RCU fashion only (e.g. in the above example, process [2]
will munmap() and release the last ref to the "used to be shared" pmd, and
the PUD that maps the shared pmds will be released only after an RCU grace
period), and afaict that's also what protects fast-gup from accessing freed
pgtable pages.

With all huge_pte_offset() callers becoming RCU-safe, IIUC we can then drop
the vma lock in all GUP code, aka in hugetlb_follow_page_mask() here,
because both slow and fast gup should be safe in the same manner.
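
As a very rough illustration of the idea only (no such patch exists yet,
and whatever is done with the returned pte would still need its own
validation afterwards):

	rcu_read_lock();
	/* The PUD page cannot be freed while we are walking it */
	pte = huge_pte_offset(mm, haddr, huge_page_size(h));
	rcu_read_unlock();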

Thanks,

> > IIUC it's also the same as fast-gup - afaiu we don't take the read vma lock
> > in fast-gup too but I also think it's safe.  But I hope I didn't miss
> > something.

[1]

-- 
Peter Xu



Re: [PATCH v3] hugetlb: simplify hugetlb handling in follow_page_mask

2022-10-26 Thread Peter Xu
Hi, Mike,

On Sun, Sep 18, 2022 at 07:13:48PM -0700, Mike Kravetz wrote:
> +struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> + unsigned long address, unsigned int flags)
> +{
> + struct hstate *h = hstate_vma(vma);
> + struct mm_struct *mm = vma->vm_mm;
> + unsigned long haddr = address & huge_page_mask(h);
> + struct page *page = NULL;
> + spinlock_t *ptl;
> + pte_t *pte, entry;
> +
> + /*
> +  * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> +  * follow_hugetlb_page().
> +  */
> + if (WARN_ON_ONCE(flags & FOLL_PIN))
> + return NULL;
> +
> +retry:
> + /*
> +  * vma lock prevents racing with another thread doing a pmd unshare.
> +  * This keeps pte as returned by huge_pte_offset valid.
> +  */
> + hugetlb_vma_lock_read(vma);

I'm not sure whether it's okay to take an rwsem here, as the code can be
called with e.g. FOLL_NOWAIT?

I'm wondering whether it's fine to just drop this anyway and always walk it
locklessly.  IIUC gup callers should be safe here because the worst case is
that the caller fetches a wrong page, but then it should be invalidated
very soon via mmu notifiers.  One thing worth mentioning is that pmd unshare
should never free a pgtable page.

IIUC it's also the same as fast-gup - afaiu we don't take the vma read lock
in fast-gup either, but I also think it's safe.  I hope I didn't miss
something.

-- 
Peter Xu



Re: [v2 PATCH 2/2] powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush

2022-09-07 Thread Peter Xu
On Wed, Sep 07, 2022 at 11:01:44AM -0700, Yang Shi wrote:
> The IPI broadcast is used to serialize against fast-GUP, but fast-GUP
> will move to use RCU instead of disabling local interrupts in fast-GUP.
> Using an IPI is the old-styled way of serializing against fast-GUP
> although it still works as expected now.
> 
> And fast-GUP now fixed the potential race with THP collapse by checking
> whether PMD is changed or not.  So IPI broadcast in radix pmd collapse
> flush is not necessary anymore.  But it is still needed for hash TLB.
> 
> Suggested-by: Aneesh Kumar K.V 
> Signed-off-by: Yang Shi 

Acked-by: Peter Xu 

-- 
Peter Xu



Re: [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock

2022-09-06 Thread Peter Xu
On Tue, Sep 06, 2022 at 01:08:10PM -0700, Suren Baghdasaryan wrote:
> On Tue, Sep 6, 2022 at 12:39 PM Peter Xu  wrote:
> >
> > On Thu, Sep 01, 2022 at 10:35:07AM -0700, Suren Baghdasaryan wrote:
> > > Due to the possibility of do_swap_page dropping mmap_lock, abort fault
> > > handling under VMA lock and retry holding mmap_lock. This can be handled
> > > more gracefully in the future.
> > >
> > > Signed-off-by: Suren Baghdasaryan 
> > > ---
> > >  mm/memory.c | 5 +
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 9ac9944e8c62..29d2f49f922a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3738,6 +3738,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >   vm_fault_t ret = 0;
> > >   void *shadow = NULL;
> > >
> > > + if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> > > + ret = VM_FAULT_RETRY;
> > > + goto out;
> > > + }
> > > +
> >
> > May want to fail early similarly for handle_userfault() too for similar
> > reason.  Thanks,
> 
> I wasn't aware of a similar issue there. Will have a closer look. Thanks!

Sure.

Just in case this helps - handle_userfault() will both assert the lock at
entry (mmap_assert_locked) and, in most cases, release the read lock along
the way while waiting for page fault resolution.

And userfaultfd should work on anonymous memory for either missing mode or
write-protect mode.
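
Just as an illustration of the "fail early" idea (a sketch only, not a
posted patch; mirroring the do_swap_page() hunk above):

	/* Hypothetically, at the top of handle_userfault(): */
	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
		return VM_FAULT_RETRY;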

Thanks,

-- 
Peter Xu



Re: [RFC PATCH RESEND 19/28] mm: disallow do_swap_page to handle page faults under VMA lock

2022-09-06 Thread Peter Xu
On Thu, Sep 01, 2022 at 10:35:07AM -0700, Suren Baghdasaryan wrote:
> Due to the possibility of do_swap_page dropping mmap_lock, abort fault
> handling under VMA lock and retry holding mmap_lock. This can be handled
> more gracefully in the future.
> 
> Signed-off-by: Suren Baghdasaryan 
> ---
>  mm/memory.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 9ac9944e8c62..29d2f49f922a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3738,6 +3738,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   vm_fault_t ret = 0;
>   void *shadow = NULL;
>  
> + if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> + ret = VM_FAULT_RETRY;
> + goto out;
> + }
> +

May want to fail early similarly for handle_userfault() too for similar
reason.  Thanks,

-- 
Peter Xu



Re: [PATCH v4 2/4] mm/migrate_device.c: Add missing flush_cache_page()

2022-09-02 Thread Peter Xu
On Fri, Sep 02, 2022 at 10:35:52AM +1000, Alistair Popple wrote:
> Currently we only call flush_cache_page() for the anon_exclusive case,
> however in both cases we clear the pte so should flush the cache.
> 
> Signed-off-by: Alistair Popple 
> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while 
> collecting pages")
> Cc: sta...@vger.kernel.org

This is the patch that starts to collide with David's.

David's patch has also unified both paths with ptep_get_and_clear(), but
this patch itself also looks correct to me.

It'll probably just end up with no diff after the rebase, though.. I'm not
sure what the final ordering will be, but anyway I think this patch stands
on its own too..

Acked-by: Peter Xu 

Thanks for being tolerant of my nitpicking,

> 
> ---
> 
> New for v4
> ---
>  mm/migrate_device.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 6a5ef9f..4cc849c 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -193,9 +193,9 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   bool anon_exclusive;
>   pte_t swp_pte;
>  
> + flush_cache_page(vma, addr, pte_pfn(*ptep));
>   anon_exclusive = PageAnon(page) && 
> PageAnonExclusive(page);
>   if (anon_exclusive) {
> - flush_cache_page(vma, addr, pte_pfn(*ptep));
>   ptep_clear_flush(vma, addr, ptep);
>  
>   if (page_try_share_anon_rmap(page)) {
> -- 
> git-series 0.9.1
> 

-- 
Peter Xu



Re: [PATCH v4 1/4] mm/migrate_device.c: Flush TLB while holding PTL

2022-09-02 Thread Peter Xu
On Fri, Sep 02, 2022 at 10:35:51AM +1000, Alistair Popple wrote:
> When clearing a PTE the TLB should be flushed whilst still holding the
> PTL to avoid a potential race with madvise/munmap/etc. For example
> consider the following sequence:
> 
>   CPU0                           CPU1
>   ----                           ----
>
>   migrate_vma_collect_pmd()
>   pte_unmap_unlock()
>                                  madvise(MADV_DONTNEED)
>                                  -> zap_pte_range()
>                                  pte_offset_map_lock()
>                                  [ PTE not present, TLB not flushed ]
>                                  pte_unmap_unlock()
>                                  [ page is still accessible via stale TLB ]
>   flush_tlb_range()
> 
> In this case the page may still be accessed via the stale TLB entry
> after madvise returns. Fix this by flushing the TLB while holding the
> PTL.
> 
> Signed-off-by: Alistair Popple 
> Reported-by: Nadav Amit 
> Reviewed-by: "Huang, Ying" 
> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while 
> collecting pages")
> Cc: sta...@vger.kernel.org

Acked-by: Peter Xu 
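
For readers following along, the fix boils down to roughly the below in
migrate_vma_collect_pmd() (only a sketch; I'm assuming the existing
"unmapped" counter that the pte loop keeps for cleared present ptes):

	/*
	 * Flush while the PTL is still held, so a concurrent
	 * zap_pte_range()/madvise() cannot finish while another CPU still
	 * has a stale (possibly writable) TLB entry for a cleared pte.
	 */
	if (unmapped)
		flush_tlb_range(walk->vma, start, end);

	arch_leave_lazy_mmu_mode();
	pte_unmap_unlock(ptep - 1, ptl);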

-- 
Peter Xu



Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-26 Thread Peter Xu
On Fri, Aug 26, 2022 at 06:46:02PM +0200, David Hildenbrand wrote:
> On 26.08.22 17:55, Peter Xu wrote:
> > On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
> >>> To me anon exclusive only shows this mm exclusively owns this page. I
> >>> didn't quickly figure out why that requires different handling on tlb
> >>> flushs.  Did I perhaps miss something?
> >>
> >> GUP-fast is the magic bit, we have to make sure that we won't see new
> >> GUP pins, thus the TLB flush.
> >>
> >> include/linux/mm.h:gup_must_unshare() contains documentation.
> > 
> > Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
> > guarantees that no other process/thread will see this pte anymore
> > afterwards?
> 
> You could have a GUP-fast thread that just looked up the PTE and is
> going to pin the page afterwards, after the ptep_get_and_clear()
> returned. You'll have to wait until that thread finished.

IIUC the early tlb flush won't protect concurrent fast-gup from happening,
but I think it's safe because fast-gup will check pte after pinning, so
either:

  (1) fast-gup runs before ptep_get_and_clear(), then
  page_try_share_anon_rmap() will fail properly, or,

  (2) fast-gup runs during or after ptep_get_and_clear(), then fast-gup
  will see that either the pte is none or changed, then it'll fail the
  fast-gup itself.
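
Case (2) above, as seen from the gup side, is roughly this recheck
(simplified from gup_pte_range(), from memory so possibly not exact):

	/*
	 * The pte is re-read after the pin; a pte cleared by
	 * ptep_get_and_clear() makes fast-gup undo the pin and bail out.
	 */
	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		gup_put_folio(folio, 1, flags);
		goto pte_unmap;
	}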

> 
> Another user that relies on this interaction between GUP-fast and TLB
> flushing is for example mm/ksm.c:write_protect_page()
> 
> There is a comment in there explaining the interaction a bit more detailed.
> 
> Maybe we'll be able to handle this differently in the future (maybe once
> this turns out to be an actual performance problem). Unfortunately,
> mm->write_protect_seq isn't easily usable because we'd need have to make
> sure we're the exclusive writer.
> 
> 
> For now, it's not too complicated. For PTEs:
> * try_to_migrate_one() already uses ptep_clear_flush().
> * try_to_unmap_one() already conditionally used ptep_clear_flush().
> * migrate_vma_collect_pmd() was the one case that didn't use it already
>  (and I wonder why it's different than try_to_migrate_one()).

I'm not sure whether I fully get the point, but one major difference here
is that all the rest handle a single page, so a tlb flush alongside the pte
clear sounds reasonable.  Even so, try_to_unmap_one() was modified to use
tlb batching, but then I see that anon exclusive made that batching
conditional.  I also have a question there on whether we can keep using the
tlb batching even with anon exclusive pages.
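
For context, the conditional batching I mean is this part of
try_to_unmap_one() (paraphrased from memory, so possibly not exact):

	anon_exclusive = folio_test_anon(folio) &&
			 PageAnonExclusive(subpage);

	flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
	if (should_defer_flush(mm, flags) && !anon_exclusive) {
		/* Batched flush: clear the pte, defer the TLB flush. */
		pteval = ptep_get_and_clear(mm, address, pvmw.pte);
		set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
	} else {
		/* Anon exclusive: flush the TLB right away. */
		pteval = ptep_clear_flush(vma, address, pvmw.pte);
	}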

In general, I still don't see how a stale tlb could affect anon exclusive
pages racing with fast-gup, because the only side effect of a stale tlb is
an unwanted page update iiuc; the thing is that fast-gup doesn't even use
the tlb, afaict..

Thanks,

-- 
Peter Xu



Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-26 Thread Peter Xu
On Fri, Aug 26, 2022 at 04:47:22PM +0200, David Hildenbrand wrote:
> > To me anon exclusive only shows this mm exclusively owns this page. I
> > didn't quickly figure out why that requires different handling on tlb
> > flushs.  Did I perhaps miss something?
> 
> GUP-fast is the magic bit, we have to make sure that we won't see new
> GUP pins, thus the TLB flush.
> 
> include/linux/mm.h:gup_must_unshare() contains documentation.

Hmm.. Shouldn't ptep_get_and_clear() (e.g., xchg() on x86_64) already
guarantee that no other process/thread will see this pte anymore
afterwards?

-- 
Peter Xu



Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-26 Thread Peter Xu
On Fri, Aug 26, 2022 at 11:02:58AM +1000, Alistair Popple wrote:
> 
> Peter Xu  writes:
> 
> > On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
> >>
> >> Peter Xu  writes:
> >>
> >> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
> >> >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
> >> >> installs migration entries directly if it can lock the migrating page.
> >> >> When removing a dirty pte the dirty bit is supposed to be carried over
> >> >> to the underlying page to prevent it being lost.
> >> >>
> >> >> Currently migrate_vma_*() can only be used for private anonymous
> >> >> mappings. That means loss of the dirty bit usually doesn't result in
> >> >> data loss because these pages are typically not file-backed. However
> >> >> pages may be backed by swap storage which can result in data loss if an
> >> >> attempt is made to migrate a dirty page that doesn't yet have the
> >> >> PageDirty flag set.
> >> >>
> >> >> In this case migration will fail due to unexpected references but the
> >> >> dirty pte bit will be lost. If the page is subsequently reclaimed data
> >> >> won't be written back to swap storage as it is considered uptodate,
> >> >> resulting in data loss if the page is subsequently accessed.
> >> >>
> >> >> Prevent this by copying the dirty bit to the page when removing the pte
> >> >> to match what try_to_migrate_one() does.
> >> >>
> >> >> Signed-off-by: Alistair Popple 
> >> >> Acked-by: Peter Xu 
> >> >> Reported-by: Huang Ying 
> >> >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma 
> >> >> while collecting pages")
> >> >> Cc: sta...@vger.kernel.org
> >> >>
> >> >> ---
> >> >>
> >> >> Changes for v3:
> >> >>
> >> >>  - Defer TLB flushing
> >> >>  - Split a TLB flushing fix into a separate change.
> >> >>
> >> >> Changes for v2:
> >> >>
> >> >>  - Fixed up Reported-by tag.
> >> >>  - Added Peter's Acked-by.
> >> >>  - Atomically read and clear the pte to prevent the dirty bit getting
> >> >>set after reading it.
> >> >>  - Added fixes tag
> >> >> ---
> >> >>  mm/migrate_device.c |  9 +++--
> >> >>  1 file changed, 7 insertions(+), 2 deletions(-)
> >> >>
> >> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> >> index 6a5ef9f..51d9afa 100644
> >> >> --- a/mm/migrate_device.c
> >> >> +++ b/mm/migrate_device.c
> >> >> @@ -7,6 +7,7 @@
> >> >>  #include 
> >> >>  #include 
> >> >>  #include 
> >> >> +#include 
> >> >>  #include 
> >> >>  #include 
> >> >>  #include 
> >> >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> >> anon_exclusive = PageAnon(page) && 
> >> >> PageAnonExclusive(page);
> >> >> if (anon_exclusive) {
> >> >> flush_cache_page(vma, addr, 
> >> >> pte_pfn(*ptep));
> >> >> -   ptep_clear_flush(vma, addr, ptep);
> >> >> +   pte = ptep_clear_flush(vma, addr, ptep);
> >> >>
> >> >> if (page_try_share_anon_rmap(page)) {
> >> >> set_pte_at(mm, addr, ptep, pte);
> >> >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> >> goto next;
> >> >> }
> >> >> } else {
> >> >> -   ptep_get_and_clear(mm, addr, ptep);
> >> >> +   pte = ptep_get_and_clear(mm, addr, 
> >> >> ptep);
> >> >> }
> >> >
> >> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() 
> >> > are
> >> > moved

Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-25 Thread Peter Xu
On Fri, Aug 26, 2022 at 08:09:28AM +1000, Alistair Popple wrote:
> > I looked at some of the callers, it seems not all of them are ready to
> > handle that (__kvmppc_svm_page_out() or svm_migrate_vma_to_vram()).  Is it
> > safe?  Do the callers need to always properly handle that (unless the
> > migration is only a best-effort, but it seems not always the case).
> 
> Migration is always best effort. Callers need to be prepared to handle
> failure of a particular page to migrate, but I could believe not all of
> them are.

Ok, I see that the ppc list is in the loop; hopefully this issue is known
there, since afaict ppc will sigbus when migrate_vma_setup() fails, otoh
the svm code just dumps some device error (and I didn't check further up
the stack from there).

> 
> > Besides, since I read the old code of prepare(), I saw this comment:
> >
> > -   if (!(migrate->src[i] & MIGRATE_PFN_LOCKED)) {
> > -   /*
> > -* Because we are migrating several pages there can be
> > -* a deadlock between 2 concurrent migration where each
> > -* are waiting on each other page lock.
> > -*
> > -* Make migrate_vma() a best effort thing and backoff
> > -* for any page we can not lock right away.
> > -*/
> > -   if (!trylock_page(page)) {
> > -   migrate->src[i] = 0;
> > -   migrate->cpages--;
> > -   put_page(page);
> > -   continue;
> > -   }
> > -   remap = false;
> > -   migrate->src[i] |= MIGRATE_PFN_LOCKED;
> > -   }
> >
> > I'm a bit curious whether that deadlock mentioned in the comment is
> > observed in reality?
> >
> > If the page was scanned in the same address space, logically the lock order
> > should be guaranteed (if both page A, both threads should lock in order).
> > I think the order can be changed if explicitly did so (e.g. fork() plus
> > mremap() for anonymous here) but I just want to make sure I get the whole
> > point of it.
> 
> You seem to have the point of it. The trylock_page() is to avoid
> deadlock, and failure is always an option for migration. Drivers can
> always retry if they really need the page to migrate, although success
> is never guaranteed. For example the page might be pinned (or have
> swap-cache allocated to it, but I'm hoping to at least get that fixed).

If so I'd suggest even more straightforward documentation, either at this
trylock() or on the APIs (e.g. for migrate_vma_setup()).  This behavior is
IMHO hidden deep and many people may not realize it.  I'll comment on the
comment update patch.
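
Something along these lines is the kind of wording I have in mind (only a
sketch, not actual proposed text):

/*
 * Proposed note for migrate_vma_setup():
 *
 * Migration is best-effort only.  Source pages are collected with
 * trylock_page(), so any individual page can silently fail to migrate
 * (for instance when two concurrent migrations race on the page lock,
 * or when the page is pinned).  Callers must check src[i] / cpages and
 * be prepared to retry or fall back when a page was skipped.
 */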

Thanks.

-- 
Peter Xu



Re: [PATCH v3 2/3] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-25 Thread Peter Xu
On Fri, Aug 26, 2022 at 08:21:44AM +1000, Alistair Popple wrote:
> 
> Peter Xu  writes:
> 
> > On Wed, Aug 24, 2022 at 01:03:38PM +1000, Alistair Popple wrote:
> >> migrate_vma_setup() has a fast path in migrate_vma_collect_pmd() that
> >> installs migration entries directly if it can lock the migrating page.
> >> When removing a dirty pte the dirty bit is supposed to be carried over
> >> to the underlying page to prevent it being lost.
> >>
> >> Currently migrate_vma_*() can only be used for private anonymous
> >> mappings. That means loss of the dirty bit usually doesn't result in
> >> data loss because these pages are typically not file-backed. However
> >> pages may be backed by swap storage which can result in data loss if an
> >> attempt is made to migrate a dirty page that doesn't yet have the
> >> PageDirty flag set.
> >>
> >> In this case migration will fail due to unexpected references but the
> >> dirty pte bit will be lost. If the page is subsequently reclaimed data
> >> won't be written back to swap storage as it is considered uptodate,
> >> resulting in data loss if the page is subsequently accessed.
> >>
> >> Prevent this by copying the dirty bit to the page when removing the pte
> >> to match what try_to_migrate_one() does.
> >>
> >> Signed-off-by: Alistair Popple 
> >> Acked-by: Peter Xu 
> >> Reported-by: Huang Ying 
> >> Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while 
> >> collecting pages")
> >> Cc: sta...@vger.kernel.org
> >>
> >> ---
> >>
> >> Changes for v3:
> >>
> >>  - Defer TLB flushing
> >>  - Split a TLB flushing fix into a separate change.
> >>
> >> Changes for v2:
> >>
> >>  - Fixed up Reported-by tag.
> >>  - Added Peter's Acked-by.
> >>  - Atomically read and clear the pte to prevent the dirty bit getting
> >>set after reading it.
> >>  - Added fixes tag
> >> ---
> >>  mm/migrate_device.c |  9 +++--
> >>  1 file changed, 7 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> index 6a5ef9f..51d9afa 100644
> >> --- a/mm/migrate_device.c
> >> +++ b/mm/migrate_device.c
> >> @@ -7,6 +7,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>  #include 
> >>  #include 
> >>  #include 
> >> @@ -196,7 +197,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >>anon_exclusive = PageAnon(page) && 
> >> PageAnonExclusive(page);
> >>if (anon_exclusive) {
> >>flush_cache_page(vma, addr, pte_pfn(*ptep));
> >> -  ptep_clear_flush(vma, addr, ptep);
> >> +  pte = ptep_clear_flush(vma, addr, ptep);
> >>
> >>if (page_try_share_anon_rmap(page)) {
> >>set_pte_at(mm, addr, ptep, pte);
> >> @@ -206,11 +207,15 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >>goto next;
> >>}
> >>} else {
> >> -  ptep_get_and_clear(mm, addr, ptep);
> >> +  pte = ptep_get_and_clear(mm, addr, ptep);
> >>}
> >
> > I remember that in v2 both flush_cache_page() and ptep_get_and_clear() are
> > moved above the condition check so they're called unconditionally.  Could
> > you explain the rational on why it's changed back (since I think v2 was the
> > correct approach)?
> 
> Mainly because I agree with your original comments, that it would be
> better to keep the batching of TLB flushing if possible. After the
> discussion I don't think there is any issues with HW pte dirty bits
> here. There are already other cases where HW needs to get that right
> anyway (eg. zap_pte_range).

Yes, tlb batching was kept, thanks for doing it that way.  Though if we
only apply patch 1 we'll have both ptep_clear_flush() and the batched
flush, which seems redundant.

> 
> > The other question is if we want to split the patch, would it be better to
> > move the tlb changes to patch 1, and leave the dirty bit fix in patch 2?
> 
> Isn't that already the case? Patch 1 moves the TLB flush before the PTL
> as suggested, patch 2 atomically copies the dirty bit without changing
> any TLB flushing.

IMHO it's cleaner to have patch 1 fix the batched flush: replace
ptep_clear_flush() with ptep_get_and_clear() and update the pte properly.

No strong opinions on the layout, but I still think we should drop the
redundant ptep_clear_flush() above, and meanwhile add flush_cache_page()
properly for the !exclusive case too.
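
Concretely, I mean something like the below shape for patch 1 (only a
sketch of the idea, untested):

		/*
		 * Sketch: flush the cache for both cases, clear the pte
		 * without an immediate TLB flush, and rely on the batched
		 * flush done while still holding the PTL.
		 */
		flush_cache_page(vma, addr, pte_pfn(*ptep));
		pte = ptep_get_and_clear(mm, addr, ptep);

		if (anon_exclusive) {
			if (page_try_share_anon_rmap(page)) {
				/* Back off and skip this page, as today. */
				set_pte_at(mm, addr, ptep, pte);
				unlock_page(page);
				put_page(page);
				mpfn = 0;
				goto next;
			}
		}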

Thanks,

-- 
Peter Xu



Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-25 Thread Peter Xu
On Thu, Aug 25, 2022 at 11:24:03AM +1000, Alistair Popple wrote:
> By the way it's still an optimisation because in most cases we can avoid
> calling try_to_migrate() and walking the rmap altogether if we install
> the migration entries here. But I agree the comment is misleading.

There's one follow-up question I forgot to ask on the trylock thing.  I
figured I should ask out loud while we're at it.

Since migrate_vma_setup() only ever uses trylock (even before the prepare()
code was dropped), does that mean it can randomly fail?

I looked at some of the callers, and it seems not all of them are ready to
handle that (__kvmppc_svm_page_out() or svm_migrate_vma_to_vram()).  Is it
safe?  Do the callers always need to handle that properly (unless the
migration is only best-effort, but that doesn't seem to always be the case)?

Besides, since I read the old code of prepare(), I saw this comment:

-   if (!(migrate->src[i] & MIGRATE_PFN_LOCKED)) {
-   /*
-* Because we are migrating several pages there can be
-* a deadlock between 2 concurrent migration where each
-* are waiting on each other page lock.
-*
-* Make migrate_vma() a best effort thing and backoff
-* for any page we can not lock right away.
-*/
-   if (!trylock_page(page)) {
-   migrate->src[i] = 0;
-   migrate->cpages--;
-   put_page(page);
-   continue;
-   }
-   remap = false;
-   migrate->src[i] |= MIGRATE_PFN_LOCKED;
-   }

I'm a bit curious whether the deadlock mentioned in the comment has been
observed in reality.

If the pages are scanned in the same address space, logically the lock
order should be guaranteed (if both migrations hit page A, both threads
should take the page locks in the same order).  I think the order can
change if it's done explicitly (e.g. fork() plus mremap() for anonymous
memory here), but I just want to make sure I get the whole point of it.

Thanks,

-- 
Peter Xu



Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-25 Thread Peter Xu
On Thu, Aug 25, 2022 at 10:42:41AM +1000, Alistair Popple wrote:
> 
> Peter Xu  writes:
> 
> > On Wed, Aug 24, 2022 at 04:25:44PM -0400, Peter Xu wrote:
> >> On Wed, Aug 24, 2022 at 11:56:25AM +1000, Alistair Popple wrote:
> >> > >> Still I don't know whether there'll be any side effect of having 
> >> > >> stall tlbs
> >> > >> in !present ptes because I'm not familiar enough with the private dev 
> >> > >> swap
> >> > >> migration code.  But I think having them will be safe, even if 
> >> > >> redundant.
> >> >
> >> > What side-effect were you thinking of? I don't see any issue with not
> >> > TLB flushing stale device-private TLBs prior to the migration because
> >> > they're not accessible anyway and shouldn't be in any TLB.
> >>
> >> Sorry to be misleading, I never meant we must add them.  As I said it's
> >> just that I don't know the code well so I don't know whether it's safe to
> >> not have it.
> >>
> >> IIUC it's about whether having stall system-ram stall tlb in other
> >> processor would matter or not here.  E.g. some none pte that this code
> >> collected (boosted both "cpages" and "npages" for a none pte) could have
> >> stall tlb in other cores that makes the page writable there.
> >
> > For this one, let me give a more detailed example.
> 
> Thanks, I would have been completely lost about what you were talking
> about without this :-)
> 
> > It's about whether below could happen:
> >
> >   thread 1                thread 2                 thread 3
> >   ========                ========                 ========
> >                                                    write to page P (data=P1)
> >                                                    (cached TLB writable)
> >   zap_pte_range()
> >     pgtable lock
> >     clear pte for page P
> >     pgtable unlock
> >     ...
> >                           migrate_vma_collect
> >                             pte none, npages++, cpages++
> >                             allocate device page
> >                             copy data (with P1)
> >                             map pte as device swap
> >                                                    write to page P again
> >                                                    (data updated from P1->P2)
> >   flush tlb
> >
> > Then at last from processor side P should have data P2 but actually from
> > device memory it's P1. Data corrupt.
> 
> In the above scenario migrate_vma_collect_pmd() will observe pte_none.
> This will mark the src_pfn[] array as needing a new zero page which will
> be installed by migrate_vma_pages()->migrate_vma_insert_page().
> 
> So there is no data to be copied hence there can't be any data
> corruption. Remember these are private anonymous pages, so any
> zap_pte_range() indicates the data is no longer needed (eg.
> MADV_DONTNEED).

My bad for providing an invalid example. :)

So if the trylock in the function is the only way to migrate this page,
then I agree a stale tlb is fine.

> 
> >>
> >> When I said I'm not familiar with the code, it's majorly about one thing I
> >> never figured out myself, in that migrate_vma_collect_pmd() has this
> >> optimization to trylock on the page, collect if it succeeded:
> >>
> >>   /*
> >>* Optimize for the common case where page is only mapped once
> >>* in one process. If we can lock the page, then we can safely
> >>* set up a special migration page table entry now.
> >>*/
> >>if (trylock_page(page)) {
> >>   ...
> >>} else {
> >>   put_page(page);
> >>   mpfn = 0;
> >>}
> >>
> >> But it's kind of against a pure "optimization" in that if trylock failed,
> >> we'll clear the mpfn so the src[i] will be zero at last.  Then will we
> >> directly give up on this page, or will we try to lock_page() again
> >> somewhere?
> 
> That comment is out dated. We used to try locking the page again but
> that was removed by ab09243aa95a ("mm/migrate.c: remove
> MIGRATE_PFN_LOCKED"). See
> https://lkml.kernel.org/r/20211025041608.289017-1-apop...@nvidia.com
> 
> Will post a clean-up for it.

That'll help, thanks.

-- 
Peter Xu



Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-24 Thread Peter Xu
On Wed, Aug 24, 2022 at 04:25:44PM -0400, Peter Xu wrote:
> On Wed, Aug 24, 2022 at 11:56:25AM +1000, Alistair Popple wrote:
> > >> Still I don't know whether there'll be any side effect of having stall 
> > >> tlbs
> > >> in !present ptes because I'm not familiar enough with the private dev 
> > >> swap
> > >> migration code.  But I think having them will be safe, even if redundant.
> > 
> > What side-effect were you thinking of? I don't see any issue with not
> > TLB flushing stale device-private TLBs prior to the migration because
> > they're not accessible anyway and shouldn't be in any TLB.
> 
> Sorry to be misleading, I never meant we must add them.  As I said it's
> just that I don't know the code well so I don't know whether it's safe to
> not have it.
> 
> IIUC it's about whether having stall system-ram stall tlb in other
> processor would matter or not here.  E.g. some none pte that this code
> collected (boosted both "cpages" and "npages" for a none pte) could have
> stall tlb in other cores that makes the page writable there.

For this one, let me give a more detailed example.

It's about whether below could happen:

   thread 1                thread 2                 thread 3
   ========                ========                 ========
                                                    write to page P (data=P1)
                                                    (cached TLB writable)
   zap_pte_range()
     pgtable lock
     clear pte for page P
     pgtable unlock
     ...
                           migrate_vma_collect
                             pte none, npages++, cpages++
                             allocate device page
                             copy data (with P1)
                             map pte as device swap
                                                    write to page P again
                                                    (data updated from P1->P2)
   flush tlb

Then at the end, from the processor side P should have data P2, but from
the device memory it's actually P1.  Data corruption.
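
The window exists because zap_pte_range() batches the flush via mmu_gather,
roughly like this (simplified sketch of the existing code):

		/*
		 * The pte is cleared under the PTL, but the TLB invalidation
		 * is only queued here and performed later, after the PTL has
		 * been dropped.
		 */
		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
		tlb_remove_tlb_entry(tlb, pte, addr);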

> 
> When I said I'm not familiar with the code, it's majorly about one thing I
> never figured out myself, in that migrate_vma_collect_pmd() has this
> optimization to trylock on the page, collect if it succeeded:
> 
>   /*
>* Optimize for the common case where page is only mapped once
>* in one process. If we can lock the page, then we can safely
>* set up a special migration page table entry now.
>*/
>if (trylock_page(page)) {
>   ...
>} else {
>   put_page(page);
>   mpfn = 0;
>}
> 
> But it's kind of against a pure "optimization" in that if trylock failed,
> we'll clear the mpfn so the src[i] will be zero at last.  Then will we
> directly give up on this page, or will we try to lock_page() again
> somewhere?
> 
> The future unmap op is also based on this "cpages", not "npages":
> 
>   if (args->cpages)
>   migrate_vma_unmap(args);
> 
> So I never figured out how this code really works.  It'll be great if you
> could shed some light to it.
> 
> Thanks,
> 
> -- 
> Peter Xu

-- 
Peter Xu


